diff --git a/_quarto.yml b/_quarto.yml
index dd3b714..c6e58ae 100644
--- a/_quarto.yml
+++ b/_quarto.yml
@@ -8,6 +8,8 @@ website:
     left:
       - href: index.qmd
         text: Home
+      - href: intuition.qmd
+        text: Build Intuition
       - href: data.qmd
         text: Simulate Data
       - href: outcomemodeling.qmd
diff --git a/assets/simplematching.png b/assets/simplematching.png
new file mode 100644
index 0000000..bdbd50f
Binary files /dev/null and b/assets/simplematching.png differ
diff --git a/assets/simpleoutcomemodeling.png b/assets/simpleoutcomemodeling.png
new file mode 100644
index 0000000..64b3ffa
Binary files /dev/null and b/assets/simpleoutcomemodeling.png differ
diff --git a/assets/simplesetting.png b/assets/simplesetting.png
new file mode 100644
index 0000000..eb9953e
Binary files /dev/null and b/assets/simplesetting.png differ
diff --git a/assets/simpleweighting.png b/assets/simpleweighting.png
new file mode 100644
index 0000000..110d6b1
Binary files /dev/null and b/assets/simpleweighting.png differ
diff --git a/data.qmd b/data.qmd
index 1b2dc55..8258234 100644
--- a/data.qmd
+++ b/data.qmd
@@ -1,5 +1,5 @@
 ---
-title: "Generate Data"
+title: "Simulate Data"
 ---
 
-The code below will generate a dataset of $n = 100$ observations. Each observation contains several observed variables:
+The code below will generate a dataset of $n = 500$ observations. Each observation contains several observed variables:
@@ -15,13 +15,24 @@ Each observation also contains outcomes that we know only because the data are s
 * `Y0` The potential outcome under control
 * `Y1` The potential outcome under treatment
 
-To run this code, you will need the `dplyr` package. If you don't have it, first run the line `install.packages("dplyr")` in your R console.
+To run this code, you will need the `dplyr` package. If you don't have it, first run the line `install.packages("dplyr")` in your R console. Then, add this line to your R script to load the package.
 
-```{r}
+```{r, message = F, warning = F}
 library(dplyr)
-
-n <- 100
-data <- tibble(L1 = rnorm(n),
-               L2 = rnorm(n)) |>
+```
+
+If you want your simulation to match our numbers exactly, add a line to set your seed.
+
+```{r}
+set.seed(90095)
+```
+
+```{r}
+n <- 500
+data <- tibble(
+  L1 = rnorm(n),
+  L2 = rnorm(n)
+) |>
   # Generate potential outcomes as functions of L
   mutate(Y0 = rnorm(n(), mean = L1 + L2, sd = 1),
          Y1 = rnorm(n(), mean = Y0 + 1, sd = 1)) |>
diff --git a/docs/assets/simplematching.png b/docs/assets/simplematching.png
new file mode 100644
index 0000000..bdbd50f
Binary files /dev/null and b/docs/assets/simplematching.png differ
diff --git a/docs/assets/simpleoutcomemodeling.png b/docs/assets/simpleoutcomemodeling.png
new file mode 100644
index 0000000..64b3ffa
Binary files /dev/null and b/docs/assets/simpleoutcomemodeling.png differ
diff --git a/docs/assets/simplesetting.png b/docs/assets/simplesetting.png
new file mode 100644
index 0000000..eb9953e
Binary files /dev/null and b/docs/assets/simplesetting.png differ
diff --git a/docs/assets/simpleweighting.png b/docs/assets/simpleweighting.png
new file mode 100644
index 0000000..110d6b1
Binary files /dev/null and b/docs/assets/simpleweighting.png differ
diff --git a/docs/data.html b/docs/data.html
index 03f1efc..03ec6ff 100644
--- a/docs/data.html
+++ b/docs/data.html
@@ -7,7 +7,7 @@
-causalestimators - Generate Data
+causalestimators - Simulate Data
Building Intuition
Before diving into a more complex simulated setting, this page builds intuition for causal estimators in a simple setting with only six observations. At confounder value 1, two of the three units are treated. At confounder value 2, one of the three units is treated.

[Figure: assets/simplesetting.png]

We assume causal identification given the confounder. The causal problem is to use the observed data to learn about the missing values.
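One way to see the missing-data structure is to lay out the six units as a table in which each unit reveals only the potential outcome under its observed treatment. The values below are a hypothetical sketch consistent with the setting described above (confounder `X`, treatment `A`); the exact outcome numbers are illustrative assumptions, not taken from the figure.

```r
library(dplyr)

# Hypothetical six units: at X = 1, two of three are treated;
# at X = 2, one of three is treated. Outcomes are illustrative.
toy <- tibble(
  X = c(1, 1, 1, 2, 2, 2),
  A = c(1, 1, 0, 1, 0, 0),
  Y = X + A
)

# Each unit reveals one potential outcome; the other is missing
observed <- toy |>
  mutate(
    Y1 = if_else(A == 1, Y, NA_real_),
    Y0 = if_else(A == 0, Y, NA_real_)
  )
observed
```

Every estimator on this page is a different strategy for filling in those `NA` cells.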


Outcome modeling


One strategy is to estimate an outcome model for the conditional mean of the observed outcomes.


\[E(Y\mid A, X) = \alpha + \beta X + \gamma A\]


In these data, we would estimate \(\hat\alpha = 0\), \(\hat\beta = 1\), and \(\hat\gamma = 1\). We could then predict the counterfactual outcomes.
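A sketch of this step in R, using hypothetical values for the six units (chosen as an illustrative assumption so that the fit is exact, with alpha-hat = 0, beta-hat = 1, and gamma-hat = 1):

```r
library(dplyr)

# Hypothetical six units matching the setting: at X = 1, two of
# three treated; at X = 2, one of three treated. Y = X + A makes
# the outcome model fit exactly.
toy <- tibble(
  X = c(1, 1, 1, 2, 2, 2),
  A = c(1, 1, 0, 1, 0, 0),
  Y = X + A
)

# Estimate the outcome model E(Y | A, X) = alpha + beta * X + gamma * A
fit <- lm(Y ~ X + A, data = toy)
coef(fit)

# Predict each unit's outcome under control and under treatment
predicted <- toy |>
  mutate(
    Y0_hat = predict(fit, newdata = tibble(X = X, A = 0)),
    Y1_hat = predict(fit, newdata = tibble(X = X, A = 1))
  )
predicted
```

For observed cells the prediction reproduces the data; for the missing cells it imputes the counterfactual from the fitted model.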

[Figure: assets/simpleoutcomemodeling.png]

Inverse probability weighting


Another strategy is to consider the treated units as a sample of all units, drawn with unequal probabilities across confounder values, and likewise for the control units. Just as in survey sampling we would weight by the inverse probability of sample inclusion, here we weight each unit by the inverse probability of the treatment it actually received.
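In the six-unit setting described above, the probability of treatment is 2/3 at confounder value 1 and 1/3 at confounder value 2. A minimal sketch, again with hypothetical outcome values:

```r
library(dplyr)

# Hypothetical six units matching the setting described above
toy <- tibble(
  X = c(1, 1, 1, 2, 2, 2),
  A = c(1, 1, 0, 1, 0, 0),
  Y = X + A
)

ipw <- toy |>
  # Propensity score: share treated within each confounder value
  group_by(X) |>
  mutate(p_treat = mean(A)) |>
  ungroup() |>
  # Inverse probability of the treatment actually received
  mutate(w = if_else(A == 1, 1 / p_treat, 1 / (1 - p_treat)))

# Weighted difference in means estimates the average treatment effect
ate_hat <- with(ipw, weighted.mean(Y[A == 1], w[A == 1]) -
                     weighted.mean(Y[A == 0], w[A == 0]))
ate_hat  # 1 in this toy data, where the treatment effect is constant
```

The weights make the treated and control groups each representative of all six units, removing the confounding by X.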

[Figure: assets/simpleweighting.png]

Matching


A third strategy is to find, for each unit, a matched case that received the other treatment condition. We then impute the missing outcome value with the observed outcome of the matched case.

[Figure: assets/simplematching.png]

Matching can be conceptualized as a special case of outcome modeling, where the outcome model is a nearest neighbor estimator.
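A bare-bones sketch of this idea on hypothetical values for the six units: match within confounder values, imputing each unit's missing potential outcome from the units with the same X but the other treatment.

```r
library(dplyr)

# Hypothetical six units matching the setting described above
toy <- tibble(
  X = c(1, 1, 1, 2, 2, 2),
  A = c(1, 1, 0, 1, 0, 0),
  Y = X + A
)

# Within each confounder value, the mean outcome of units with the
# other treatment serves as the imputed counterfactual
imputed <- toy |>
  group_by(X) |>
  mutate(
    Y1_hat = if_else(A == 1, Y, mean(Y[A == 1])),
    Y0_hat = if_else(A == 0, Y, mean(Y[A == 0]))
  ) |>
  ungroup()

ate_hat <- mean(imputed$Y1_hat - imputed$Y0_hat)
ate_hat  # 1 in this toy data
```

Here the "nearest neighbor" is exact because both confounder values contain treated and control units; with continuous confounders, matching instead selects the closest available case.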


What to do next


Now that you have a conceptual idea of these strategies, move on to the next pages to practice them with simulated data.

\ No newline at end of file
diff --git a/docs/matching.html b/docs/matching.html
index f2ac3aa..74b3940 100644
--- a/docs/matching.html
+++ b/docs/matching.html
@@ -120,6 +120,10 @@
  • Aggregate by a weighted mean or outcome model
  • There are many methods for matching. The code below walks through the particular case of propensity score matching.

    -

    The code below assumes you have generated data as on the data page.

    -
    -
    -
    
    -Attaching package: 'dplyr'
    -
    -
    -
    The following objects are masked from 'package:stats':
    -
    -    filter, lag
    -
    -
    -
    The following objects are masked from 'package:base':
    -
    -    intersect, setdiff, setequal, union
    -
    -
    +

    The code below assumes you have generated data as on the data page.

    1) Target population

    While the target population is relevant to all causal estimands and estimators, it is especially apparent when matching. One might choose

    @@ -262,50 +250,50 @@

    4) Aggregate

    Code illustration

    The MatchIt package is one way to implement various matching strategies. You can install it with install.packages("MatchIt") in your R console.

    -
    library(MatchIt)
    +
    library(MatchIt)

    The code below uses MatchIt to conduct nearest-neighbor 1:1 propensity score matching.

    -
    matched <- matchit(
    -  A ~ L1 + L2,
    -  data = data, 
    -  distance = "glm",
    -  method = "nearest"
    -)
    +
    matched <- matchit(
    +  A ~ L1 + L2,
    +  data = data, 
    +  distance = "glm",
    +  method = "nearest"
    +)

    The code below appends the matching weights to the data. Units with match_weight == 1 are matched, while those with match_weight == 0 are unmatched.

    -
    # Append matching weights to the data
    -with_weights <- data |>
    -  mutate(match_weight = matched$weights) |>
    -  select(A, L1, L2, Y, match_weight)
    +
    # Append matching weights to the data
    +with_weights <- data |>
    +  mutate(match_weight = matched$weights) |>
    +  select(A, L1, L2, Y, match_weight)
    -
    # A tibble: 100 × 5
    -       A      L1       L2      Y match_weight
    -   <int>   <dbl>    <dbl>  <dbl>        <dbl>
    - 1     0  0.621   0.182    0.168            0
    - 2     0 -1.27   -0.930   -3.31             0
    - 3     1  3.13    0.506    3.11             1
    - 4     0  0.0818  2.70     0.707            1
    - 5     0 -0.596   1.33     1.84             0
    - 6     0 -2.51    1.43     0.378            0
    - 7     0 -0.452   2.29     0.884            0
    - 8     0  1.17   -0.00888  3.30             0
    - 9     0  0.155   1.35    -1.31             0
    -10     0  1.13   -0.511    0.310            0
    -# ℹ 90 more rows
    +
    # A tibble: 500 × 5
    +       A       L1      L2       Y match_weight
    +   <int>    <dbl>   <dbl>   <dbl>        <dbl>
    + 1     0  0.00304  1.03    0.677             1
    + 2     0 -2.35    -1.66   -4.09              0
    + 3     0  0.104   -0.912   0.0659            0
    + 4     0 -0.522    0.439   0.390             0
    + 5     0 -1.18    -0.815  -2.14              0
    + 6     0  0.477   -0.0314  0.396             0
    + 7     0 -0.0607  -0.462  -1.96              0
    + 8     0  0.987    0.426   2.27              1
    + 9     0 -0.122   -0.564  -0.0581            0
    +10     0 -1.34    -0.618  -2.73              0
    +# ℹ 490 more rows

    The code below estimates the ATT by OLS regression on the matched set.

    -
    model <- lm(
    -  Y ~ A + L1 + L2,
    -  data = with_weights,
    -  weights = match_weight
    -)
    -summary(model)
    +
    model <- lm(
    +  Y ~ A + L1 + L2,
    +  data = with_weights,
    +  weights = match_weight
    +)
    +summary(model)
    
     Call:
    @@ -313,20 +301,20 @@ 

    Code illustration

     Weighted Residuals:
         Min     1Q Median     3Q    Max
    --2.755  0.000  0.000  0.000  2.033
    +-4.150  0.000  0.000  0.000  3.297
     
     Coefficients:
    -            Estimate Std. Error t value Pr(>|t|)
    -(Intercept)   0.3283     0.4589   0.715  0.47987
    -A             1.2807     0.4599   2.785  0.00919 **
    -L1            0.8728     0.3027   2.883  0.00721 **
    -L2            0.8105     0.2366   3.425  0.00180 **
    +            Estimate Std. Error t value Pr(>|t|)
    +(Intercept)   0.2674     0.1641   1.630 0.104923
    +A             0.6716     0.1964   3.419 0.000779 ***
    +L1            0.8176     0.1144   7.143 2.25e-11 ***
    +L2            0.9689     0.1119   8.656 2.86e-15 ***
     ---
     Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
     
    -Residual standard error: 1.307 on 30 degrees of freedom
    -Multiple R-squared:  0.4954, Adjusted R-squared:  0.445
    -F-statistic: 9.819 on 3 and 30 DF,  p-value: 0.0001136
    +Residual standard error: 1.311 on 178 degrees of freedom
    +Multiple R-squared:  0.4097, Adjusted R-squared:  0.3998
    +F-statistic: 41.18 on 3 and 178 DF,  p-value: < 2.2e-16

    The coefficient on the treatment A is an estimate of the ATT.

    diff --git a/docs/outcomemodeling.html b/docs/outcomemodeling.html index 882de20..8a448cd 100644 --- a/docs/outcomemodeling.html +++ b/docs/outcomemodeling.html @@ -7,7 +7,7 @@ -causalestimators - Generate Data +causalestimators - Outcome Modeling