# (APPENDIX) Appendix {-}
# Proofs
## Proofs of results in Chapter \@ref(FPSI)
### Proof of Theorem \@ref(thm:uppsampnoise) {#proofcheb}
In order to use Theorem \@ref(thm:cheb) to study the behavior of $\hat{\Delta^Y_{WW}}$, we have to prove that it is unbiased and to compute $\var{\hat{\Delta^Y_{WW}}}$.
Let's first prove that the $WW$ estimator is an unbiased estimator of $TT$:
```{lemma,unbiasww,name="Unbiasedness of $\\hat{\\Delta^Y_{WW}}$"}
Under Assumptions \@ref(def:noselb), \@ref(def:fullrank) and \@ref(def:iid),
\begin{align*}
\esp{\hat{\Delta^Y_{WW}}}& = \Delta^Y_{TT}.
\end{align*}
```
```{proof}
In order to prove Lemma \@ref(lem:unbiasww), we are going to use a trick.
We are going to compute the expectation of the $WW$ estimator conditional on a given treatment allocation.
Because the resulting conditional expectation does not depend on the treatment allocation, the unconditional expectation follows directly.
This trick simplifies derivations a lot and is really natural: think first of all the samples with the same treatment allocation, then average your results over all possible treatment allocations.
\begin{align*}
\esp{\hat{\Delta^Y_{WW}}} & = \esp{\esp{\hat{\Delta^Y_{WW}}|\mathbf{D}}}\\
& = \esp{\esp{\frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N Y_iD_i-\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N Y_i(1-D_i)|\mathbf{D}}}\\
& = \esp{\esp{\frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N Y_iD_i|\mathbf{D}}-\esp{\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N Y_i(1-D_i)|\mathbf{D}}}\\
& = \esp{\frac{1}{\sum_{i=1}^N D_i}\esp{\sum_{i=1}^N Y_iD_i|\mathbf{D}}-\frac{1}{\sum_{i=1}^N (1-D_i)}\esp{\sum_{i=1}^N Y_i(1-D_i)|\mathbf{D}}}\\
& = \esp{\frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N \esp{Y_iD_i|\mathbf{D}}-\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N \esp{Y_i(1-D_i)|\mathbf{D}}}\\
& = \esp{\frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N \esp{Y_iD_i|D_i}-\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N \esp{Y_i(1-D_i)|D_i}}\\
& = \esp{\frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N D_i\esp{Y_i|D_i=1}-\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N(1-D_i)\esp{Y_i|D_i=0}}\\
& = \esp{\frac{\sum_{i=1}^N D_i}{\sum_{i=1}^N D_i}\esp{Y_i|D_i=1}-\frac{\sum_{i=1}^N(1-D_i)}{\sum_{i=1}^N (1-D_i)}\esp{Y_i|D_i=0}}\\
& = \esp{\esp{Y_i|D_i=1}-\esp{Y_i|D_i=0}}\\
& = \esp{Y_i|D_i=1}-\esp{Y_i|D_i=0} \\
& = \Delta^Y_{TT}.
\end{align*}
The first equality uses the Law of Iterated Expectations (LIE).
The second equality uses the definition of $\hat{\Delta^Y_{WW}}$.
The third and fifth equalities use the linearity of conditional expectations.
The fourth equality uses the fact that, conditional on $\mathbf{D}$, the number of treated and untreated is a constant.
The sixth equality uses Assumption \@ref(def:iid).
The seventh equality uses the fact that $\esp{Y_iD_i|D_i}=D_i\esp{Y_i|D_i=1}$ and $\esp{Y_i(1-D_i)|D_i}=(1-D_i)\esp{Y_i|D_i=0}$.
The eighth to tenth equalities use the fact that $\esp{Y_i|D_i=1}$ and $\esp{Y_i|D_i=0}$ are constants.
The last equality uses Assumption \@ref(def:noselb).
```
Let's now compute the variance of the $WW$ estimator:
```{lemma,varww,name="Variance of $\\hat{\\Delta^Y_{WW}}$"}
Under Assumptions \@ref(def:noselb), \@ref(def:fullrank) and \@ref(def:iid),
\begin{align*}
\var{{\hat{\Delta^Y_{WW}}}} & = \frac{1-(1-\Pr(D_i=1))^N}{N\Pr(D_i=1)}\var{Y_i^1|D_i=1}+\frac{1-\Pr(D_i=1)^N}{N(1-\Pr(D_i=1))}\var{Y_i^0|D_i=0}.
\end{align*}
```
```{proof}
Same trick as before, but now using the Law of Total Variance (LTV):
\begin{align*}
\var{{\hat{\Delta^Y_{WW}}}} & = \esp{\var{\hat{\Delta^Y_{WW}}|\mathbf{D}}}+\var{\esp{\hat{\Delta^Y_{WW}}|\mathbf{D}}}\\
& = \esp{\var{\frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N Y_iD_i-\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N Y_i(1-D_i)|\mathbf{D}}} \\
& = \esp{\var{\frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N Y_iD_i|\mathbf{D}}}+\esp{\var{\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N Y_i(1-D_i)|\mathbf{D}}}\\
& \phantom{=}-2\esp{\cov{\frac{1}{\sum_{i=1}^N D_i}\sum_{i=1}^N Y_iD_i,\frac{1}{\sum_{i=1}^N (1-D_i)}\sum_{i=1}^N Y_i(1-D_i)|\mathbf{D}}} \\
& = \esp{\frac{1}{(\sum_{i=1}^N D_i)^2}\var{\sum_{i=1}^N Y_iD_i|\mathbf{D}}}+\esp{\frac{1}{(\sum_{i=1}^N (1-D_i))^2}\var{\sum_{i=1}^N Y_i(1-D_i)|\mathbf{D}}} \\
& = \esp{\frac{1}{(\sum_{i=1}^N D_i)^2}\var{\sum_{i=1}^N Y_iD_i|D_i}}+\esp{\frac{1}{(\sum_{i=1}^N (1-D_i))^2}\var{\sum_{i=1}^N Y_i(1-D_i)|D_i}} \\
& = \esp{\frac{1}{(\sum_{i=1}^N D_i)^2}\sum_{i=1}^ND_i\var{Y_i|D_i=1}}+\esp{\frac{1}{(\sum_{i=1}^N (1-D_i))^2}\sum_{i=1}^N(1-D_i)\var{Y_i|D_i=0}} \\
& = \var{Y_i|D_i=1}\esp{\frac{1}{\sum_{i=1}^N D_i}}+\var{Y_i|D_i=0}\esp{\frac{1}{\sum_{i=1}^N (1-D_i)}} \\
& = \frac{1-(1-\Pr(D_i=1))^N}{N\Pr(D_i=1)}\var{Y_i^1|D_i=1}+\frac{1-\Pr(D_i=1)^N}{N(1-\Pr(D_i=1))}\var{Y_i^0|D_i=0}.
\end{align*}
The first equality stems from the LTV.
The second equality stems from the definition of the $WW$ estimator and from the fact that, as shown in the proof of Lemma \@ref(lem:unbiasww), $\esp{\hat{\Delta^Y_{WW}}|\mathbf{D}}=\Delta^Y_{TT}$ is a constant, so that its variance is zero.
The third equality stems from the formula for the variance of a difference of random variables.
The fourth equality stems from Assumption \@ref(def:iid), which implies that the covariance between the treated and untreated sample means is zero, and from the formula for the variance of a random variable multiplied by a constant.
The fifth and sixth equalities stem from Assumption \@ref(def:iid) and from the fact that $\var{Y_iD_i|D_i}=D_i\var{Y_i|D_i=1}$ and $\var{Y_i(1-D_i)|D_i}=(1-D_i)\var{Y_i|D_i=0}$.
The seventh equality stems from $\var{Y_i|D_i=1}$ and $\var{Y_i|D_i=0}$ being constants.
The last equality stems from the formula for the expectation of the inverse of a sum of Bernoulli random variables when at least one of them takes value one, which holds under Assumption \@ref(def:fullrank).
```
Using Theorem \@ref(thm:cheb), we have:
\begin{align*}
2\epsilon & \leq 2\sqrt{\frac{1}{N(1-\delta)}\left(\frac{1-(1-\Pr(D_i=1))^N}{\Pr(D_i=1)}\var{Y_i^1|D_i=1}+\frac{1-\Pr(D_i=1)^N}{(1-\Pr(D_i=1))}\var{Y_i^0|D_i=0}\right)}\\
& \leq 2\sqrt{\frac{1}{N(1-\delta)}\left(\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{(1-\Pr(D_i=1))}\right)},
\end{align*}
where the second inequality stems from the fact that $\frac{(1-\Pr(D_i=1))^N}{\Pr(D_i=1)}\var{Y_i^1|D_i=1}+\frac{\Pr(D_i=1)^N}{(1-\Pr(D_i=1))}\var{Y_i^0|D_i=0}\geq0$.
This proves the result.
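As a quick sanity check on Lemmas \@ref(lem:unbiasww) and \@ref(lem:varww), here is a minimal Monte Carlo sketch (an illustration, not part of the proof; the sample size, treatment probability and outcome model are arbitrary choices):
```{r}
# Monte Carlo sanity check of the unbiasedness and variance of the WW estimator
# (N, p and the outcome model below are arbitrary illustrative choices)
set.seed(1234)
N <- 100       # sample size
p <- 0.5       # Pr(D_i=1)
n.sim <- 10^4  # number of simulated samples
ww <- replicate(n.sim, {
  D <- rbinom(N, 1, p)              # random treatment allocation
  y0 <- rnorm(N)                    # potential outcome without treatment
  y1 <- y0 + 0.2                    # constant treatment effect: TT = 0.2
  Y <- y1 * D + y0 * (1 - D)        # observed outcome
  mean(Y[D == 1]) - mean(Y[D == 0]) # WW estimator
})
mean(ww) # close to 0.2: unbiasedness
var(ww)  # roughly matches the formula (here Var(Y^1|D=1)=Var(Y^0|D=0)=1):
(1 - (1 - p)^N) / (N * p) + (1 - p^N) / (N * (1 - p))
```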
### Proof of Theorem \@ref(thm:asympnoiseWW) {#proofCLT}
Before proving Theorem \@ref(thm:asympnoiseWW), let me state a very useful result: $\hat{WW}$ can be computed using OLS:
```{lemma,WWOLS,name="WW is OLS"}
Under Assumption \@ref(def:fullrank), the OLS coefficient $\beta$ in the following regression:
\begin{align*}
Y_i & = \alpha + \beta D_i + U_i
\end{align*}
is the WW estimator:
\begin{align*}
\hat{\beta}_{OLS} & = \frac{\frac{1}{N}\sum_{i=1}^N\left(Y_i-\frac{1}{N}\sum_{i=1}^NY_i\right)\left(D_i-\frac{1}{N}\sum_{i=1}^ND_i\right)}{\frac{1}{N}\sum_{i=1}^N\left(D_i-\frac{1}{N}\sum_{i=1}^ND_i\right)^2} \\
& = \hat{\Delta^Y_{WW}}.
\end{align*}
```
```{proof}
In matrix notation, we have:
\begin{align*}
\underbrace{\left(\begin{array}{c} Y_1 \\ \vdots \\ Y_N \end{array}\right)}_{Y} & =
\underbrace{\left(\begin{array}{cc} 1 & D_1\\ \vdots & \vdots\\ 1 & D_N\end{array}\right)}_{X}
\underbrace{\left(\begin{array}{c} \alpha \\ \beta \end{array}\right)}_{\Theta}+
\underbrace{\left(\begin{array}{c} U_1 \\ \vdots \\ U_N \end{array}\right)}_{U}
\end{align*}
The OLS estimator is:
\begin{align*}
\hat{\Theta}_{OLS} & = (X'X)^{-1}X'Y
\end{align*}
Under the Full Rank Assumption, $X'X$ is invertible and we have:
\begin{align*}
(X'X)^{-1} & = \left(\begin{array}{cc} N & \sum_{i=1}^ND_i \\ \sum_{i=1}^ND_i & \sum_{i=1}^ND_i^2 \end{array}\right)^{-1} \\
& = \frac{1}{N\sum_{i=1}^ND_i^2-\left(\sum_{i=1}^ND_i\right)^2}\left(\begin{array}{cc} \sum_{i=1}^ND_i^2 & -\sum_{i=1}^ND_i \\ -\sum_{i=1}^ND_i & N \end{array}\right)
\end{align*}
For simplicity, I omit the summation index:
\begin{align*}
\hat{\Theta}_{OLS} & = \frac{1}{N\sum D_i^2-\left(\sum D_i\right)^2}
\left(\begin{array}{cc} \sum D_i^2 & -\sum D_i \\ -\sum D_i & N \end{array}\right)
\left(\begin{array}{c} \sum Y_i \\ \sum Y_iD_i \end{array}\right) \\
& = \frac{1}{N\sum D_i^2-\left(\sum D_i\right)^2}
\left(\begin{array}{c} \sum D_i^2\sum Y_i-\sum D_i\sum Y_iD_i \\
-\sum D_i\sum Y_i+ N\sum Y_iD_i \end{array}\right)
\end{align*}
Using $D_i^2=D_i$, we have:
\begin{align*}
\hat{\Theta}_{OLS} & = \left(\begin{array}{c}
\frac{\left(\sum D_i\right)\left(\sum Y_i-\sum Y_iD_i\right)}{\left(\sum D_i\right)\left(N-\sum D_i\right)} \\
\frac{N\sum Y_iD_i-\sum D_i\sum Y_i}{N\sum D_i-\left(\sum D_i\right)^2}
\end{array}\right)
= \left(\begin{array}{c}
\frac{\sum (Y_iD_i+Y_i(1-D_i))-\sum Y_iD_i}{\sum(1-D_i)} \\
\frac{N^2}{N^2}\frac{\frac{1}{N}\sum Y_iD_i-\frac{1}{N}\sum D_i\frac{1}{N}\sum Y_i+\frac{1}{N}\sum D_i\frac{1}{N}\sum Y_i-\frac{1}{N}\sum D_i\frac{1}{N}\sum Y_i}{\frac{1}{N}\sum D_i-2\left(\frac{1}{N}\sum D_i\right)^2+\left(\frac{1}{N}\sum D_i\right)^2}
\end{array}\right) \\
& = \left(\begin{array}{c}
\frac{\sum Y_i(1-D_i)}{\sum(1-D_i)} \\
\frac{\frac{1}{N}\sum \left(Y_iD_i-D_i\frac{1}{N}\sum Y_i-Y_i\frac{1}{N}\sum D_i+\frac{1}{N}\sum D_i\frac{1}{N}\sum Y_i\right)}{\frac{1}{N}\sum\left(D_i-2D_i\frac{1}{N}\sum D_i+\left(\frac{1}{N}\sum D_i\right)^2\right)}
\end{array}\right)
= \left(\begin{array}{c}
\frac{\sum Y_i(1-D_i)}{\sum(1-D_i)} \\
\frac{\frac{1}{N}\sum\left(Y_i-\frac{1}{N}\sum Y_i\right)\left(D_i-\frac{1}{N}\sum D_i\right)}{\frac{1}{N}\sum \left(D_i-\frac{1}{N}\sum D_i\right)^2}
\end{array}\right),
\end{align*}
which proves the first part of the lemma.
Now for the second part of the lemma:
\begin{align*}
\hat{\beta}_{OLS} & = \frac{\sum Y_iD_i-\frac{1}{N}\sum D_i\sum Y_i}{\sum D_i\left(1-\frac{1}{N}\sum D_i\right)}
= \frac{\sum Y_iD_i-\frac{1}{N}\sum D_i\sum\left(Y_iD_i+(1-D_i)Y_i\right)}{\sum D_i\left(1-\frac{1}{N}\sum D_i\right)}\\
& = \frac{\sum Y_iD_i\left(1-\frac{1}{N}\sum D_i\right)-\frac{1}{N}\sum D_i\sum(1-D_i)Y_i}{\sum D_i\left(1-\frac{1}{N}\sum D_i\right)}\\
& = \frac{\sum Y_iD_i}{\sum D_i}-\frac{\frac{1}{N}\sum(1-D_i)Y_i}{\left(1-\frac{1}{N}\sum D_i\right)}\\
& = \frac{\sum Y_iD_i}{\sum D_i}-\frac{\frac{1}{N}\sum(1-D_i)Y_i}{\frac{1}{N}\sum\left(1-D_i\right)}\\
& = \frac{\sum Y_iD_i}{\sum D_i}-\frac{\sum(1-D_i)Y_i}{\sum\left(1-D_i\right)}\\
& = \hat{\Delta^Y_{WW}},
\end{align*}
which proves the result.
```
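Lemma \@ref(lem:WWOLS) is also easy to check numerically; here is a minimal sketch with arbitrary simulated data:
```{r}
# Sketch: the OLS coefficient on a treatment dummy equals With/Without
# (the data generating process is an arbitrary illustration)
set.seed(1234)
N <- 1000
D <- rbinom(N, 1, 0.5)
Y <- 1 + 0.2 * D + rnorm(N)
coef(lm(Y ~ D))[["D"]]            # OLS slope
mean(Y[D == 1]) - mean(Y[D == 0]) # WW estimator: identical value
```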
Now, let me state the most important lemma behind the result in Theorem \@ref(thm:asympnoiseWW):
```{lemma,asympOLS,name="Asymptotic Distribution of the OLS Estimator"}
Under Assumptions \@ref(def:noselb), \@ref(def:fullrank), \@ref(def:iid) and \@ref(def:finitevar), we have:
\begin{align*}
\sqrt{N}(\hat{\Theta}_{OLS}-\Theta) & \stackrel{d}{\rightarrow}
\mathcal{N}\left(\begin{array}{c} 0\\ 0\end{array},
\sigma_{XX}^{-1}\mathbf{V_{xu}}\sigma_{XX}^{-1}\right),
\end{align*}
with
\begin{align*}
\sigma_{XX}^{-1}& = \left(\begin{array}{cc} \frac{\Pr(D_i=1)}{\Pr(D_i=1)(1-\Pr(D_i=1))} & -\frac{\Pr(D_i=1)}{\Pr(D_i=1)(1-\Pr(D_i=1))}\\
-\frac{\Pr(D_i=1)}{\Pr(D_i=1)(1-\Pr(D_i=1))} & \frac{1}{\Pr(D_i=1)(1-\Pr(D_i=1))}
\end{array}\right)\\
\mathbf{V_{xu}}&= \esp{U_i^2\left(\begin{array}{cc} 1 & D_i\\ D_i & D_i\end{array}\right)}
\end{align*}
```
```{proof}
\begin{align*}
\sqrt{N}(\hat{\Theta}_{OLS}-\Theta) & = \sqrt{N}((X'X)^{-1}X'Y-\Theta) \\
& = \sqrt{N}((X'X)^{-1}X'(X\Theta+U)-\Theta) \\
& = \sqrt{N}((X'X)^{-1}X'X\Theta+(X'X)^{-1}X'U-\Theta) \\
& = \sqrt{N}(X'X)^{-1}X'U \\
& = N(X'X)^{-1}\frac{\sqrt{N}}{N}X'U
\end{align*}
Using Slutsky's Theorem, we can study both terms separately.
Slutsky's Theorem states that if $Y_N\stackrel{d}{\rightarrow}y$ and $\text{plim}(X_N)=x$, then:
1. $X_N+Y_N\stackrel{d}{\rightarrow}x+y$
2. $X_NY_N\stackrel{d}{\rightarrow}xy$
3. $\frac{Y_N}{X_N}\stackrel{d}{\rightarrow}\frac{y}{x}$ if $x\neq0$
Using this theorem, we have:
\begin{align*}
\sqrt{N}(\hat{\Theta}_{OLS}-\Theta) & \stackrel{d}{\rightarrow} \sigma_{XX}^{-1}xu,
\end{align*}
where $\sigma_{XX}^{-1}$ is a matrix of constants and $xu$ is a random variable.
Let's begin with $\frac{\sqrt{N}}{N}X'U\stackrel{d}{\rightarrow}xu$:
\begin{align*}
\frac{\sqrt{N}}{N}X'U & = \sqrt{N}\left(\begin{array}{c} \frac{1}{N}\sum_{i=1}^{N}U_i\\ \frac{1}{N}\sum_{i=1}^{N}D_iU_i\end{array}\right)
\end{align*}
In order to determine the asymptotic distribution of $\frac{\sqrt{N}}{N}X'U$, we are going to use the vector version of the CLT:
If the pairs $(X_i,Y_i)$ are i.i.d. with finite first and second moments, we have:
\begin{align*}
\sqrt{N}
\left(
\begin{array}{c}
\frac{1}{N}\sum_{i=1}^NX_i-\esp{X_i}\\
\frac{1}{N}\sum_{i=1}^NY_i-\esp{Y_i}
\end{array}
\right)
&
\stackrel{d}{\rightarrow}
\mathcal{N}
\left(
\begin{array}{c}
0\\
0
\end{array},
\mathbf{V}
\right),
\end{align*}
where $\mathbf{V}$ is the population covariance matrix of $X_i$ and $Y_i$.
We know that, under Assumption \@ref(def:noselb), both random variables have mean zero:
\begin{align*}
\esp{U_i}& = \esp{U_i|D_i=1}\Pr(D_i=1)+\esp{U_i|D_i=0}\Pr(D_i=0)=0 \\
\esp{U_iD_i}& = \esp{U_i|D_i=1}\Pr(D_i=1)=0
\end{align*}
Their covariance matrix $\mathbf{V_{xu}}$ can be computed as follows:
\begin{align*}
\mathbf{V_{xu}} & = \esp{\left(\begin{array}{c} U_i\\ U_iD_i\end{array}\right)\left(\begin{array}{cc} U_i& U_iD_i\end{array}\right)}
- \esp{\left(\begin{array}{c} U_i\\ U_iD_i\end{array}\right)}\esp{\left(\begin{array}{cc} U_i& U_iD_i\end{array}\right)}\\
& = \esp{\left(\begin{array}{cc} U_i^2 & U_i^2D_i\\ U_i^2D_i & U_i^2D_i^2\end{array}\right)}
= \esp{U_i^2\left(\begin{array}{cc} 1 & D_i\\ D_i & D_i^2\end{array}\right)}
= \esp{U_i^2\left(\begin{array}{cc} 1 & D_i\\ D_i & D_i\end{array}\right)}
\end{align*}
Using the Vector CLT, we have that $\frac{\sqrt{N}}{N}X'U\stackrel{d}{\rightarrow}\mathcal{N}\left(\begin{array}{c} 0\\ 0\end{array},\mathbf{V_{xu}}\right)$.
Let's now show that $\plims N(X'X)^{-1}=\sigma_{XX}^{-1}$:
\begin{align*}
N(X'X)^{-1} & = \frac{N}{N\sum_{i=1}^ND_i-\left(\sum_{i=1}^ND_i\right)^2}
\left(\begin{array}{cc} \sum_{i=1}^ND_i & -\sum_{i=1}^ND_i \\ -\sum_{i=1}^ND_i & N \end{array}\right) \\
& = \frac{1}{N}\frac{1}{\frac{1}{N}\sum_{i=1}^ND_i-\left(\frac{1}{N}\sum_{i=1}^ND_i\right)^2}
\left(\begin{array}{cc} \sum_{i=1}^ND_i & -\sum_{i=1}^ND_i \\ -\sum_{i=1}^ND_i & N \end{array}\right)\\
& = \frac{1}{\frac{1}{N}\sum_{i=1}^ND_i-\left(\frac{1}{N}\sum_{i=1}^ND_i\right)^2}
\left(\begin{array}{cc} \frac{1}{N}\sum_{i=1}^ND_i & -\frac{1}{N}\sum_{i=1}^ND_i \\ -\frac{1}{N}\sum_{i=1}^ND_i & 1 \end{array}\right)\\
\plims N(X'X)^{-1} & = \frac{1}{\plims\frac{1}{N}\sum_{i=1}^ND_i-\left(\plims\frac{1}{N}\sum_{i=1}^ND_i\right)^2}
\left(\begin{array}{cc} \plims\frac{1}{N}\sum_{i=1}^ND_i & -\plims\frac{1}{N}\sum_{i=1}^ND_i \\ -\plims\frac{1}{N}\sum_{i=1}^ND_i & 1 \end{array}\right)\\
& = \frac{1}{\Pr(D_i=1)-\Pr(D_i=1)^2}
\left(\begin{array}{cc} \Pr(D_i=1) & -\Pr(D_i=1) \\ -\Pr(D_i=1) & 1 \end{array}\right)\\
& = \sigma_{XX}^{-1}
\end{align*}
The fourth equality uses Slutsky's Theorem.
The fifth equality uses the Law of Large Numbers (LLN): if $Y_i$ are i.i.d. variables with finite first and second moments, $\plim{N}\frac{1}{N}\sum_{i=1}^NY_i = \esp{Y_i}$.
In order to complete the proof, we have to use the Delta Method Theorem.
This theorem states that:
\begin{gather*}
\sqrt{N}\left(\begin{array}{c} \bar{X}_N-\esp{X_i}\\ \bar{Y}_N-\esp{Y_i}\end{array},\right) \stackrel{d}{\rightarrow}\mathcal{N}\left(\begin{array}{c} 0\\ 0\end{array},\mathbf{V}\right) \\
\Rightarrow \sqrt{N}(g(\bar{X}_N,\bar{Y}_N)-g(\esp{X_i},\esp{Y_i})) \stackrel{d}{\rightarrow}\mathcal{N}(0,G'\mathbf{V}G)
\end{gather*}
where $G(u)=\partder{g(u)}{u}$ and $G=G(\esp{X_i},\esp{Y_i})$.
In our case, $g(xu)=\sigma_{XX}^{-1}xu$, so $G(xu)=\sigma_{XX}^{-1}$.
The result follows from that and from the symmetry of $\sigma_{XX}^{-1}$.
```
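The probability limit of $N(X'X)^{-1}$ can also be checked numerically; a minimal sketch, with an arbitrary value for $\Pr(D_i=1)$:
```{r}
# Sketch: N(X'X)^{-1} approaches sigma_XX^{-1} as N grows
set.seed(1234)
N <- 10^6
p <- 0.5               # Pr(D_i=1), arbitrary
D <- rbinom(N, 1, p)
X <- cbind(1, D)
N * solve(t(X) %*% X)  # empirical version
matrix(c(p, -p, -p, 1), nrow = 2) / (p * (1 - p)) # sigma_XX^{-1}
```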
A last lemma uses the previous result to derive the asymptotic distribution of $\hat{WW}$:
```{lemma,asymWW,name="Asymptotic Distribution of $\\hat{WW}$"}
Under Assumptions \@ref(def:noselb), \@ref(def:fullrank), \@ref(def:iid) and \@ref(def:finitevar), we have:
\begin{align*}
\sqrt{N}(\hat{\Delta^Y_{WW}}-\Delta^Y_{TT}) & \stackrel{d}{\rightarrow}
\mathcal{N}\left(0,\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}\right).
\end{align*}
```
```{proof}
In order to derive the asymptotic distribution of WW, I first use Lemma \@ref(lem:WWOLS), which implies that the asymptotic distribution of WW is the same as that of $\hat{\beta}_{OLS}$.
Now, from Lemma \@ref(lem:asympOLS), we know that $\sqrt{N}(\hat{\beta}_{OLS}-\beta)\stackrel{d}{\rightarrow}\mathcal{N}(0,\sigma^2_{\beta})$, where $\sigma^2_{\beta}$ is the lower diagonal term of $\sigma_{XX}^{-1}\mathbf{V_{xu}}\sigma_{XX}^{-1}$.
Using the convention $p=\Pr(D_i=1)$, we have:
\begin{align*}
\sigma_{XX}^{-1}\mathbf{V_{xu}}\sigma_{XX}^{-1}
& = \left(\begin{array}{cc}
\frac{p}{p(1-p)} & -\frac{p}{p(1-p)}\\
-\frac{p}{p(1-p)} & \frac{1}{p(1-p)}
\end{array}\right)
\esp{U_i^2\left(\begin{array}{cc} 1 & D_i\\ D_i & D_i\end{array}\right)}
\left(\begin{array}{cc}
\frac{p}{p(1-p)} & -\frac{p}{p(1-p)}\\
-\frac{p}{p(1-p)} & \frac{1}{p(1-p)}
\end{array}\right)\\
& = \frac{1}{(p(1-p))^2}
\left(\begin{array}{cc}
p\esp{U_i^2}-p\esp{U_i^2D_i} & p\esp{U_i^2D_i}-p\esp{U_i^2D_i}\\
-p\esp{U_i^2}+\esp{U_i^2D_i} & -p\esp{U_i^2D_i}+\esp{U_i^2D_i}
\end{array}\right)
\left(\begin{array}{cc}
p & -p\\
-p & 1
\end{array}\right)\\
& = \frac{1}{(p(1-p))^2}
\left(\begin{array}{cc}
p^2(\esp{U_i^2}-\esp{U_i^2D_i}) & p^2(\esp{U_i^2D_i}-\esp{U_i^2})\\
p^2(\esp{U_i^2D_i}-\esp{U_i^2}) & p^2\esp{U_i^2}+(1-2p)\esp{U_i^2D_i}
\end{array}\right)
\end{align*}
The final result comes from the fact that:
\begin{align*}
\esp{U_i^2} & = \esp{U_i^2|D_i=1}p + (1-p)\esp{U_i^2|D_i=0}\\
& = p\var{Y_i^1|D_i=1}+(1-p)\var{Y_i^0|D_i=0} \\
\esp{U_i^2D_i} & = \esp{U_i^2|D_i=1}p \\
& = p\var{Y_i^1|D_i=1}.
\end{align*}
As a consequence:
\begin{align*}
\sigma^2_{\beta} &= \frac{1}{(p(1-p))^2}\left(\var{Y_i^1|D_i=1}p(p^2-2p+1) + p^2(1-p)\var{Y_i^0|D_i=0}\right) \\
&= \frac{1}{(p(1-p))^2}\left(\var{Y_i^1|D_i=1}p(1-p)^2 + p^2(1-p)\var{Y_i^0|D_i=0}\right)\\
& = \frac{\var{Y_i^1|D_i=1}}{p}+\frac{\var{Y_i^0|D_i=0}}{1-p}.
\end{align*}
```
Using the previous lemma, we can now approximate the confidence level of $\hat{WW}$:
\begin{align*}
\Pr&(|\hat{\Delta^Y_{WW}}-\Delta^Y_{TT}|\leq\epsilon) = \Pr(-\epsilon\leq\hat{\Delta^Y_{WW}}-\Delta^Y_{TT}\leq\epsilon) \\
& = \Pr\left(-\frac{\epsilon}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\leq\frac{\hat{\Delta^Y_{WW}}-\Delta^Y_{TT}}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\leq\frac{\epsilon}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\right)\\
& \approx \Phi\left(\frac{\epsilon}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\right)-
\Phi\left(-\frac{\epsilon}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\right)\\
& = \Phi\left(\frac{\epsilon}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\right)- 1 + \Phi\left(\frac{\epsilon}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\right)\\
& = 2\Phi\left(\frac{\epsilon}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\right)-1.
\end{align*}
As a consequence,
\begin{align*}
\delta & \approx 2\Phi\left(\frac{\epsilon}{\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}}\right)-1.
\end{align*}
Hence the result.
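In practice, this approximation can be inverted to compute sampling noise:
\begin{align*}
2\epsilon & \approx 2\Phi^{-1}\left(\frac{\delta+1}{2}\right)\frac{1}{\sqrt{N}}\sqrt{\frac{\var{Y_i^1|D_i=1}}{\Pr(D_i=1)}+\frac{\var{Y_i^0|D_i=0}}{1-\Pr(D_i=1)}}.
\end{align*}
A minimal sketch in R, where $\delta$, $N$, $\Pr(D_i=1)$ and the two conditional variances are assumed known for the sake of illustration:
```{r}
# Sketch: CLT-based sampling noise (2*epsilon) of the WW estimator
# (all values below are illustrative assumptions)
delta <- 0.95 # confidence level
N <- 1000     # sample size
p <- 0.5      # Pr(D_i=1)
V1 <- 1       # Var(Y^1|D=1), assumed known here
V0 <- 1       # Var(Y^0|D=0), assumed known here
2 * qnorm((delta + 1) / 2) * sqrt((V1 / p + V0 / (1 - p)) / N)
```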
## Proofs of results in Chapter \@ref(RCT)
### Proof of Theorem \@ref(thm:IdentLATE) {#proofIdentLATE}
In order to prove the theorem, it is going to be very helpful to prove the following lemma:
```{lemma,UnconfTypes,name='Unconfounded Types'}
Under Assumptions \@ref(def:RandEncouragValid) and \@ref(def:IndepEncourag), the types $T_i$ are independent of the allocation of the treatment:
\begin{align*}
(Y_i^{1,1},Y_i^{0,1},Y_i^{0,0},Y_i^{1,0},T_i)\Ind R_i|E_i=1.
\end{align*}
```
```{proof}
Lemma 4.2 in [Dawid (1979)](https://www.jstor.org/stable/2984718) shows that if $X\Ind Y|Z$ and $U$ is a function of $X$ then $U\Ind Y|Z$.
The fact that $T_i$ is a function of $(D_i^1,D^0_i)$ proves the result.
```
The four sets defined by $T_i$ are a partition of the sample space.
As a consequence, we have (omitting the conditioning on $E_i=1$ all along for simplicity):
\begin{align*}
\esp{Y_i|R_i=1} & = \esp{Y_i|T_i=a,R_i=1}\Pr(T_i=a|R_i=1)\\
& \phantom{=}+ \esp{Y_i|T_i=c,R_i=1}\Pr(T_i=c|R_i=1) \\
& \phantom{=} + \esp{Y_i|T_i=d,R_i=1}\Pr(T_i=d|R_i=1)\\
& \phantom{=} + \esp{Y_i|T_i=n,R_i=1}\Pr(T_i=n|R_i=1)\\
\esp{Y_i|R_i=0} & = \esp{Y_i|T_i=a,R_i=0}\Pr(T_i=a|R_i=0)\\
& \phantom{=} + \esp{Y_i|T_i=c,R_i=0}\Pr(T_i=c|R_i=0) \\
& \phantom{=} + \esp{Y_i|T_i=d,R_i=0}\Pr(T_i=d|R_i=0)\\
& \phantom{=}+ \esp{Y_i|T_i=n,R_i=0}\Pr(T_i=n|R_i=0).
\end{align*}
Let's look at all these terms in turn:
\begin{align*}
\esp{Y_i|T_i=a,R_i=1} & = \esp{Y_i^{1,1}D_iR_i+Y_i^{1,0}D_i(1-R_i)+Y_i^{0,1}(1-D_i)R_i+Y_i^{0,0}(1-D_i)(1-R_i)|T_i=a,R_i=1} \\
& = \esp{Y_i^{1,1}(D^1_iR_i+D_i^0(1-R_i))R_i+Y_i^{0,1}(1-(D^1_iR_i+D_i^0(1-R_i)))R_i|T_i=a,R_i=1} \\
& = \esp{Y_i^{1,1}D^1_iR_i^2+Y_i^{0,1}(1-D^1_iR_i)R_i|D_i^1=D_i^0=1,R_i=1} \\
& = \esp{Y_i^{1,1}|T_i=a,R_i=1} \\
& = \esp{Y_i^{1,1}|T_i=a}, \\
\end{align*}
where the first equality uses Assumption \@ref(def:RandEncouragValid), the second equality uses the fact that $R_i=1$ in the conditional expectation and Assumption \@ref(def:RandEncouragValid), the third equality uses the fact that $R_i=1$ and that $T_i=a \Leftrightarrow D_i^1=D_i^0=1$, the fourth equality uses the fact that $D_i^1=R_i=1$ and the last equality uses Lemma \@ref(lem:UnconfTypes).
Using a similar reasoning, we have:
\begin{align*}
\esp{Y_i|T_i=c,R_i=1} & = \esp{Y_i^{1,1}|T_i=c} \\
\esp{Y_i|T_i=d,R_i=1} & = \esp{Y_i^{0,1}|T_i=d} \\
\esp{Y_i|T_i=n,R_i=1} & = \esp{Y_i^{0,1}|T_i=n} \\
\esp{Y_i|T_i=a,R_i=0} & = \esp{Y_i^{1,0}|T_i=a} \\
\esp{Y_i|T_i=c,R_i=0} & = \esp{Y_i^{0,0}|T_i=c} \\
\esp{Y_i|T_i=d,R_i=0} & = \esp{Y_i^{1,0}|T_i=d} \\
\esp{Y_i|T_i=n,R_i=0} & = \esp{Y_i^{0,0}|T_i=n}.
\end{align*}
Also, Lemma \@ref(lem:UnconfTypes) implies that $\Pr(T_i=a|R_i)=\Pr(T_i=a)$, and the same is true for all other types.
As a consequence, we have:
\begin{align*}
\esp{Y_i|R_i=1} & = \esp{Y_i^{1,1}|T_i=a}\Pr(T_i=a)\\
& \phantom{=} + \esp{Y_i^{1,1}|T_i=c}\Pr(T_i=c) \\
& \phantom{=} + \esp{Y_i^{0,1}|T_i=d}\Pr(T_i=d)\\
& \phantom{=} + \esp{Y_i^{0,1}|T_i=n}\Pr(T_i=n)\\
\esp{Y_i|R_i=0} & = \esp{Y_i^{1,0}|T_i=a}\Pr(T_i=a)\\
& \phantom{=} + \esp{Y_i^{0,0}|T_i=c}\Pr(T_i=c) \\
& \phantom{=} + \esp{Y_i^{1,0}|T_i=d}\Pr(T_i=d)\\
& \phantom{=} + \esp{Y_i^{0,0}|T_i=n}\Pr(T_i=n).
\end{align*}
And thus:
\begin{align*}
\esp{Y_i|R_i=1}-\esp{Y_i|R_i=0} & = (\esp{Y_i^{1,1}|T_i=a}-\esp{Y_i^{1,0}|T_i=a})\Pr(T_i=a)\\
& \phantom{=}+ (\esp{Y_i^{1,1}|T_i=c}-\esp{Y_i^{0,0}|T_i=c})\Pr(T_i=c) \\
& \phantom{=} - (\esp{Y_i^{1,0}|T_i=d}-\esp{Y_i^{0,1}|T_i=d})\Pr(T_i=d)\\
& \phantom{=} + (\esp{Y_i^{0,1}|T_i=n}-\esp{Y_i^{0,0}|T_i=n})\Pr(T_i=n).
\end{align*}
Using Assumption \@ref(def:ExclRestr), we have:
\begin{align*}
\esp{Y_i|R_i=1}-\esp{Y_i|R_i=0} & = (\esp{Y_i^{1}|T_i=a}-\esp{Y_i^{1}|T_i=a})\Pr(T_i=a)\\
& \phantom{=}+ (\esp{Y_i^{1}|T_i=c}-\esp{Y_i^{0}|T_i=c})\Pr(T_i=c) \\
& \phantom{=} - (\esp{Y_i^{1}|T_i=d}-\esp{Y_i^{0}|T_i=d})\Pr(T_i=d)\\
& \phantom{=} + (\esp{Y_i^{0}|T_i=n}-\esp{Y_i^{0}|T_i=n})\Pr(T_i=n)\\
& = \esp{Y_i^{1}-Y_i^{0}|T_i=c}\Pr(T_i=c) \\
& \phantom{=} - \esp{Y_i^{1}-Y_i^{0}|T_i=d}\Pr(T_i=d).
\end{align*}
Under Assumption \@ref(def:Mono), we have:
\begin{align*}
\esp{Y_i|R_i=1}-\esp{Y_i|R_i=0} & = \esp{Y_i^{1}-Y_i^{0}|T_i=c}\Pr(T_i=c)\\
& = \Delta^Y_{LATE}\Pr(T_i=c).
\end{align*}
We also have:
\begin{align*}
\Pr(D_i=1|R_i=1) & = \Pr(D^1_i=1|R_i=1)\\
& = \Pr(D^1_i=1\cap (D_i^0=1\cup D_i^0=0) |R_i=1)\\
& = \Pr(D^1_i=1\cap D_i^0=1\cup D^1_i=1\cap D_i^0=0 |R_i=1)\\
& = \Pr(D^1_i=D_i^0=1\cup D^1_i-D_i^0=1 |R_i=1)\\
& = \Pr(T_i=a\cup T_i=c |R_i=1)\\
& = \Pr(T_i=a|R_i=1)+\Pr(T_i=c|R_i=1)\\
& = \Pr(T_i=a)+\Pr(T_i=c),
\end{align*}
where the first equality follows from Assumption \@ref(def:RandEncouragValid) and the fact that $D_i=R_iD_i^1+(1-R_i)D_i^0$, so that $D_i|R_i=1=D_i^1$.
The second equality follows from the fact that $\left\{ D_i^0=1,D_i^0=0\right\}$ is a partition of the sample space.
The third equality follows from the distributivity of intersection over union and the fourth equality from the fact that $D_i^1$ and $D_i^0$ can only take values zero and one.
The fifth equality follows from the definition of $T_i$.
The sixth equality follows from the rule of addition in probability and the fact that $T_i=a$ and $T_i=c$ are disjoint.
The final equality follows from Lemma \@ref(lem:UnconfTypes).
Using a similar reasoning, we have:
\begin{align*}
\Pr(D_i=1|R_i=0) & = \Pr(T_i=a)+ \Pr(T_i=d).
\end{align*}
As a consequence, under Assumption \@ref(def:Mono), we have:
\begin{align*}
\Pr(D_i=1|R_i=1)-\Pr(D_i=1|R_i=0) & = \Pr(T_i=c).
\end{align*}
Using Assumption \@ref(def:Fstage) proves the result.
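Before moving on, here is a minimal Monte Carlo sketch of the theorem in a simulated encouragement design (the type shares and treatment effects are arbitrary illustrative choices):
```{r}
# Sketch: the Wald ratio recovers the LATE in an encouragement design
set.seed(1234)
N <- 10^6
R <- rbinom(N, 1, 0.5)                   # randomized encouragement
type <- sample(c("a", "c", "n"), N, replace = TRUE,
               prob = c(0.2, 0.5, 0.3))  # no defiers: monotonicity holds
D <- ifelse(type == "a", 1, ifelse(type == "c", R, 0))
y0 <- rnorm(N)
y1 <- y0 + ifelse(type == "c", 0.3, 0.1) # LATE = 0.3 among compliers
Y <- y1 * D + y0 * (1 - D)               # exclusion restriction holds
(mean(Y[R == 1]) - mean(Y[R == 0])) /
  (mean(D[R == 1]) - mean(D[R == 0]))    # close to the LATE of 0.3
```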
### Proof of Theorem \@ref(thm:WaldIV) {#proofWaldIV}
In matrix notation, we have:
\begin{align*}
\underbrace{\left(\begin{array}{c} Y_1 \\ \vdots \\ Y_N \end{array}\right)}_{Y} & =
\underbrace{\left(\begin{array}{cc} 1 & D_1\\ \vdots & \vdots\\ 1 & D_N\end{array}\right)}_{X}
\underbrace{\left(\begin{array}{c} \alpha \\ \beta \end{array}\right)}_{\Theta}+
\underbrace{\left(\begin{array}{c} U_1 \\ \vdots \\ U_N \end{array}\right)}_{U}
\end{align*}
and
\begin{align*}
\left(\begin{array}{c} D_1 \\ \vdots \\ D_N \end{array}\right) & =
\underbrace{\left(\begin{array}{cc} 1 & R_1\\ \vdots & \vdots\\ 1 & R_N\end{array}\right)}_{R}
\left(\begin{array}{c} \gamma \\ \tau \end{array}\right)+
\left(\begin{array}{c} V_1 \\ \vdots \\ V_N \end{array}\right)
\end{align*}
The IV estimator is:
\begin{align*}
\hat{\Theta}_{IV} & = (R'X)^{-1}R'Y
\end{align*}
If there is at least one observation with $R_i=1$ and $D_i=1$, $R'X$ is invertible (its determinant is nonzero) and we have (omitting the summation index for simplicity):
\begin{align*}
(R'X)^{-1} & = \left(\begin{array}{cc} N & \sum D_i \\ \sum R_i & \sum D_iR_i \end{array}\right)^{-1} \\
& = \frac{1}{N\sum D_iR_i-\sum D_i\sum R_i}\left(\begin{array}{cc} \sum D_iR_i & -\sum D_i \\ -\sum R_i & N \end{array}\right)
\end{align*}
Since:
\begin{align*}
R'Y & = \left(\begin{array}{c} \sum Y_i \\ \sum Y_iR_i \end{array}\right),
\end{align*}
we have:
\begin{align*}
\hat{\Theta}_{IV} & = \left(
\begin{array}{c}
\frac{\sum Y_i\sum D_iR_i-\sum D_i\sum Y_iR_i}{N\sum D_iR_i-\sum D_i\sum R_i}\\
\frac{N\sum Y_iR_i-\sum R_i\sum Y_i}{N\sum D_iR_i-\sum D_i\sum R_i}
\end{array}
\right)
\end{align*}
As a consequence, $\hat{\beta}_{IV}$ is equal to the ratio of two OLS coefficients: the coefficient from a regression of $Y_i$ on $R_i$ and a constant, divided by the coefficient from a regression of $D_i$ on the same regressors (see the proof of Lemma \@ref(lem:WWOLS) in Section \@ref(proofCLT), just after "Using $D_i^2=D_i$").
We can use Lemma \@ref(lem:WWOLS) stating that the OLS estimator is the WW estimator to prove the result.
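A quick numerical check of Theorem \@ref(thm:WaldIV), computing the IV estimator by matrix algebra and the Wald ratio on the same simulated data (the data generating process is an arbitrary illustration; the matrix of instruments is named `Z` to avoid clashing with the vector `R`):
```{r}
# Sketch: beta_IV equals the Wald ratio
set.seed(1234)
N <- 10^4
R <- rbinom(N, 1, 0.5)            # randomized instrument
D <- rbinom(N, 1, 0.2 + 0.5 * R)  # first stage
Y <- 0.3 * D + rnorm(N)
X <- cbind(1, D)
Z <- cbind(1, R)                  # matrix of instruments (R in the text)
drop(solve(t(Z) %*% X) %*% t(Z) %*% Y)[2]  # beta_IV
(mean(Y[R == 1]) - mean(Y[R == 0])) /
  (mean(D[R == 1]) - mean(D[R == 0]))      # Wald ratio: identical value
```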
### Proof of Theorem \@ref(thm:asymWald) {#ProofAsymWald}
In order to derive the asymptotic distribution of the Wald estimator, I first use Theorem \@ref(thm:WaldIV) which implies that the asymptotic distribution of Wald is the same as that of $\hat{\beta}_{IV}$.
Now, I'm going to derive the asymptotic distribution of the IV estimator.
```{lemma,asympIV,name='Asymptotic Distribution of the IV Estimator'}
Under Independence and Validity of the Instrument, Exclusion Restriction and Full Rank, we have:
\begin{align*}
\sqrt{N}(\hat{\Theta}_{IV}-\Theta) & \stackrel{d}{\rightarrow}
\mathcal{N}\left(\begin{array}{c} 0\\ 0\end{array},
(\sigma_{RX}^{-1})'\mathbf{V_{ru}}\sigma_{RX}^{-1}\right),
\end{align*}
with
\begin{align*}
\sigma_{RX}^{-1}& = \frac{\left(\begin{array}{cc} \esp{D_iR_i} & -\Pr(D_i=1)\\
-\Pr(R_i=1) & 1
\end{array}\right)}{(\Pr(D_i=1|R_i=1)-\Pr(D_i=1|R_i=0))\Pr(R_i=1)(1-\Pr(R_i=1))}
\\
\mathbf{V_{ru}}&= \esp{U_i^2\left(\begin{array}{cc} 1 & R_i\\ R_i & R_i\end{array}\right)}
\end{align*}
```
```{proof}
\begin{align*}
\sqrt{N}(\hat{\Theta}_{IV}-\Theta) & = \sqrt{N}((R'X)^{-1}R'Y-\Theta) \\
& = \sqrt{N}((R'X)^{-1}R'(X\Theta+U)-\Theta) \\
& = \sqrt{N}((R'X)^{-1}R'X\Theta+(R'X)^{-1}R'U-\Theta) \\
& = \sqrt{N}(R'X)^{-1}R'U \\
& = N(R'X)^{-1}\frac{\sqrt{N}}{N}R'U
\end{align*}
Using Slutsky's Theorem, we have:
\begin{align*}
\sqrt{N}(\hat{\Theta}_{IV}-\Theta) & \stackrel{d}{\rightarrow} \sigma_{RX}^{-1}ru,
\end{align*}
where $\sigma_{RX}^{-1}$ is a matrix of constants and $ru$ is a random variable.
Let's first show that $\plims N(R'X)^{-1}=\sigma_{RX}^{-1}$:
\begin{align*}
N(R'X)^{-1} & = \frac{N}{N\sum D_iR_i-\sum D_i\sum R_i}\left(\begin{array}{cc} \sum D_iR_i & -\sum D_i \\ -\sum R_i & N \end{array}\right) \\
& = \frac{1}{\frac{\sum D_iR_i}{N}-\frac{\sum D_i}{N}\frac{\sum R_i}{N}}
\left(\begin{array}{cc}
\frac{\sum D_iR_i}{N} & -\frac{\sum D_i}{N} \\ -\frac{\sum R_i}{N} & 1
\end{array}
\right)
\end{align*}
$\frac{\sum D_iR_i}{N}-\frac{\sum D_i}{N}\frac{\sum R_i}{N}$ is equal to the numerator of the OLS coefficient of a regression of $D_i$ on $R_i$ and a constant (see the proof of Lemma \@ref(lem:WWOLS)).
As a consequence, it can be written as the With/Without estimator multiplied by the denominator of the OLS estimator, which is simply the variance of $R_i$.
Let's turn to $\frac{\sqrt{N}}{N}R'U\stackrel{d}{\rightarrow}ru$:
\begin{align*}
\frac{\sqrt{N}}{N}R'U & = \sqrt{N}\left(\begin{array}{c} \frac{1}{N}\sum_{i=1}^{N}U_i\\ \frac{1}{N}\sum_{i=1}^{N}R_iU_i\end{array}\right)
\end{align*}
We know that, under Validity of Randomization, both random variables have mean zero:
\begin{align*}
\esp{U_i}& = \esp{U_i|R_i=1}\Pr(R_i=1)+\esp{U_i|R_i=0}\Pr(R_i=0)=0 \\
\esp{U_iR_i}& = \esp{U_i|R_i=1}\Pr(R_i=1)=0
\end{align*}
Their covariance matrix $\mathbf{V_{ru}}$ can be computed as follows:
\begin{align*}
\mathbf{V_{ru}} & = \esp{\left(\begin{array}{c} U_i\\ U_iR_i\end{array}\right)\left(\begin{array}{cc} U_i& U_iR_i\end{array}\right)}
- \esp{\left(\begin{array}{c} U_i\\ U_iR_i\end{array}\right)}\esp{\left(\begin{array}{cc} U_i& U_iR_i\end{array}\right)}\\
& = \esp{\left(\begin{array}{cc} U_i^2 & U_i^2R_i\\ U_i^2R_i & U_i^2R_i^2\end{array}\right)}
= \esp{U_i^2\left(\begin{array}{cc} 1 & R_i\\ R_i & R_i^2\end{array}\right)}
= \esp{U_i^2\left(\begin{array}{cc} 1 & R_i\\ R_i & R_i\end{array}\right)}
\end{align*}
Using the Vector CLT, we have that $\frac{\sqrt{N}}{N}R'U\stackrel{d}{\rightarrow}\mathcal{N}\left(\begin{array}{c} 0\\ 0\end{array},\mathbf{V_{ru}}\right)$.
Using Slutsky's theorem and the LLN gives the result.
```
From Lemma \@ref(lem:asympIV), we know that $\sqrt{N}(\hat{\beta}_{IV}-\beta)\stackrel{d}{\rightarrow}\mathcal{N}(0,\sigma^2_{\beta})$, where $\sigma^2_{\beta}$ is the lower diagonal term of $(\sigma_{RX}^{-1})'\mathbf{V_{ru}}\sigma_{RX}^{-1}$.
Using the convention $p^R=\Pr(R_i=1)$, $p^D=\Pr(D_i=1)$, $p^D_1=\Pr(D_i=1|R_i=1)$, $p^D_0=\Pr(D_i=1|R_i=0)$ and $p^{DR}=\esp{D_iR_i}$, we have:
\begin{align*}
(&\sigma_{RX}^{-1})'\mathbf{V_{ru}}\sigma_{RX}^{-1} \\
& = \frac{1}{((p^D_1-p^D_0)p^R(1-p^R))^2}
\left(\begin{array}{cc}
p^{DR} & -p^R\\
-p^D & 1
\end{array}\right)
\esp{U_i^2\left(\begin{array}{cc} 1 & R_i\\ R_i & R_i\end{array}\right)}
\left(\begin{array}{cc}
p^{DR} & -p^D\\
-p^R & 1
\end{array}\right)\\
& = \frac{1}{((p^D_1-p^D_0)p^R(1-p^R))^2}
\left(\begin{array}{cc}
p^{DR}\esp{U_i^2}-p^R\esp{U_i^2R_i} & \esp{U_i^2R_i}(p^{DR}-p^R)\\
\esp{U_i^2R_i}-p^D\esp{U_i^2} & \esp{U_i^2R_i}(1-p^D)
\end{array}\right)
\left(\begin{array}{cc}
p^{DR} & -p^D\\
-p^R & 1
\end{array}\right)\\
& = \frac{\left(\begin{array}{cc}
p^{DR}(p^{DR}\esp{U_i^2}-p^R\esp{U_i^2R_i})- p^R\esp{U_i^2R_i}(p^{DR}-p^R)
& \esp{U_i^2R_i}(p^{DR}-p^R)-p^{D}(p^{DR}\esp{U_i^2}-p^R\esp{U_i^2R_i})\\
p^{DR}(\esp{U_i^2R_i}-p^D\esp{U_i^2})-p^R\esp{U_i^2R_i}(1-p^D)
& \esp{U_i^2R_i}(1-p^D) - p^{D}(\esp{U_i^2R_i}-p^D\esp{U_i^2})
\end{array}\right)}{((p^D_1-p^D_0)p^R(1-p^R))^2}
\end{align*}
As a consequence:
\begin{align*}
\sigma^2_{\beta} & = \frac{\esp{U_i^2R_i}(1-p^D) - p^{D}(\esp{U_i^2R_i}-p^D\esp{U_i^2})}{((p^D_1-p^D_0)p^R(1-p^R))^2} \\
& = \frac{(p^D)^2\esp{U_i^2}+(1-2p^D)\esp{U_i^2R_i}}{((p^D_1-p^D_0)p^R(1-p^R))^2}\\
& = \frac{(p^D)^2(\esp{U_i^2|R_i=1}p^R+\esp{U_i^2|R_i=0}(1-p^R))+(1-2p^D)\esp{U_i^2|R_i=1}p^R}{((p^D_1-p^D_0)p^R(1-p^R))^2}\\
& = \frac{(p^D)^2\esp{U_i^2|R_i=0}(1-p^R)+(1-2p^D+(p^D)^2)\esp{U_i^2|R_i=1}p^R}{((p^D_1-p^D_0)p^R(1-p^R))^2}\\
& = \frac{(p^D)^2\esp{U_i^2|R_i=0}(1-p^R)+(1-p^D)^2\esp{U_i^2|R_i=1}p^R}{((p^D_1-p^D_0)p^R(1-p^R))^2}\\
& = \frac{1}{(p^D_1-p^D_0)^2}\left[\left(\frac{p^D}{p^R}\right)^2\frac{\esp{U_i^2|R_i=0}}{1-p^R}+\left(\frac{1-p^D}{1-p^R}\right)^2\frac{\esp{U_i^2|R_i=1}}{p^R}\right].
\end{align*}
Note that, under monotonicity, $p^C=p^D_1-p^D_0$ and:
\begin{align*}
\esp{U_i^2|R_i=1} & = p^{AT}\var{Y_i^1|T_i=AT}+p^C\var{Y_i^1|T_i=C}+p^{NT}\var{Y_i^0|T_i=NT} \\
\esp{U_i^2|R_i=0} & = p^{AT}\var{Y_i^1|T_i=AT}+p^C\var{Y_i^0|T_i=C}+p^{NT}\var{Y_i^0|T_i=NT}.
\end{align*}
The final result comes from the fact that:
\begin{align*}
\frac{1}{(p^C)^2} & \left[\left(\frac{p^D}{p^R}\right)^2\frac{1}{1-p^R}+\left(\frac{1-p^D}{1-p^R}\right)^2\frac{1}{p^R}\right]\\
& = \frac{(p^D)^2(1-p^R)+(1-p^D)^2p^R}{(p^Cp^R(1-p^R))^2} \\
& = \frac{(p^D)^2-(p^D)^2p^R+p^R-2p^Dp^R+(p^D)^2p^R}{(p^Cp^R(1-p^R))^2} \\
& = \frac{(p^D)^2+p^R-2p^Dp^R}{(p^Cp^R(1-p^R))^2} \\
& = \frac{(p^D-p^R)^2+p^R-(p^R)^2}{(p^Cp^R(1-p^R))^2} \\
& = \frac{(p^D-p^R)^2+p^R(1-p^R)}{(p^Cp^R(1-p^R))^2} \\
& = \frac{(p^{AT}+p^Cp^R-p^R)^2+p^R(1-p^R)}{(p^Cp^R(1-p^R))^2} \\
& = \frac{(p^{AT}+(1-p^{AT}-p^{NT})p^R-p^R)^2+p^R(1-p^R)}{(p^Cp^R(1-p^R))^2} \\
& = \frac{(p^{AT}+p^R-p^{AT}p^R-p^{NT}p^R-p^R)^2+p^R(1-p^R)}{(p^Cp^R(1-p^R))^2} \\
& = \frac{(p^{AT}(1-p^R)-p^{NT}p^R)^2+p^R(1-p^R)}{(p^Cp^R(1-p^R))^2},
\end{align*}
where the sixth equality uses the fact that $p^D=p^{AT}+p^Cp^R$ and the seventh equality uses the fact that $p^C+p^{AT}+p^{NT}=1$.