Explain how you can use difference-in-differences to achieve causal identification in a research paper
Here, we use g = 1, ..., G to index cross-sectional units and t =
1, ..., T to index time periods. In difference-in-differences (DID)
studies, g often refers to
geographical areas such as states, counties or census tracts,
although it could also refer to distinct groups such as those
separated by age. Most of the time, t represents years, quarters, or
months. In most applications, researchers are concerned
with outcomes in two alternative treatment regimes: the treatment
condition and the control
condition. To make the idea concrete, let Dgt = 1 if unit g is
exposed to treatment in period t,
and Dgt = 0 if unit g is exposed to the control condition in period
t. In public health applications, the set of treatments might
consist, for example, of two alternative approaches to the
regulation of syringe exchange programs that are adopted in
different states in different years .
Research on the causal effects of the treatment condition revolves
around the outcomes that
would prevail in each unit and time period under the alternative
levels of treatment. One way to
make this idea more tangible is to define potential outcomes that
describe the same unit under different (hypothetical) treatment
situations. To that end, let Y(1)gt represent an outcome of
interest for unit g in period t under a hypothetical scenario in
which the treatment was active in g at t; Y(0)gt is the outcome of
the same unit and time under the alternative scenario in which the
control condition was active in g at t. The treatment effect for
this specific unit and time period is Y(1)gt − Y(0)gt, which
is simply the difference in the value of the outcome variable for
the same unit across the two hypothetical situations. The notation
makes this difference look easy to compute, but applied researchers
cannot observe the same unit under two different scenarios, as one
might in a lab experiment; in practice, each unit is exposed to only
one treatment condition in a specific time period, and we observe
only the corresponding outcome. Specifically, for a given unit and
time, we observe Ygt =
Y(0)gt + [Y(1)gt − Y(0)gt]Dgt. The notation so far describes the
counterfactual inference problem that arises in every causal
inference study. In a typical study, researchers have access to
data on Ygt and Dgt, and they aim to combine the data with research
design assumptions to learn about the average value of Y(1)gt −
Y(0)gt in a study population. The DID design is a
quasi-experimental alternative to the well-understood and
straightforward randomized controlled trial (RCT) design, seen for
example in the health insurance context in the RAND Health Insurance
Experiment in the 1970s and more recently in the Oregon Health
Insurance Experiment. RCT and DID share some characteristics: both
involve a well-defined study population and set of treatment
conditions, where it is easy to distinguish between a treatment
group and a control group and between pretreatment and
post-treatment time periods. The most important distinction is that
treatment conditions are randomly assigned across units in an RCT
but not in a DID design. Under random assignment, treatment
exposure is statistically independent of any (measured or
unmeasured) factor that might also affect outcomes. In a DID
design, researchers
cannot rely on random assignment to avoid bias from unmeasured
confounders and instead impose assumptions that restrict the scope
of the possible confounders. Specifically, DID designs assume that
confounders varying across the groups are time invariant, and
time-varying confounders are group invariant. Researchers refer to
these twin claims as a common trend assumption. In the next two
sections, we describe the DID design further and explain how the
key assumptions of the design lead to a statistical modeling
framework in which treatment effects are easy to estimate.
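The role of the two restrictions can be illustrated with a small numerical sketch in Python. Everything in it is an assumption made for illustration: the numbers are invented, the names group_effect, period_effect and true_effect are hypothetical, and the treatment effect is held constant. Potential outcomes are built from a group confounder that does not vary over time and a time confounder that does not vary across groups, and the observed outcome follows Ygt = Y(0)gt + [Y(1)gt − Y(0)gt]Dgt.

```python
# All numbers are made up for illustration; none come from the source.
group_effect = {"treated": 5.0, "control": 2.0}  # differs by group, constant over time
period_effect = {1: 0.0, 2: 1.5}                 # differs by period, same for both groups
true_effect = 3.0                                # Y(1)gt - Y(0)gt, assumed constant

def y0(g, t):
    """Potential outcome under the control condition."""
    return 10.0 + group_effect[g] + period_effect[t]

def y1(g, t):
    """Potential outcome under the treatment condition."""
    return y0(g, t) + true_effect

def d(g, t):
    """Treatment indicator: only the treated group in period 2 is exposed."""
    return 1 if (g == "treated" and t == 2) else 0

def observed(g, t):
    """Observed outcome: Ygt = Y(0)gt + [Y(1)gt - Y(0)gt] * Dgt."""
    return y0(g, t) + (y1(g, t) - y0(g, t)) * d(g, t)

# Comparing groups after treatment mixes in the group confounder.
cross_section = observed("treated", 2) - observed("control", 2)
# Comparing the treated group before and after mixes in the time confounder.
before_after = observed("treated", 2) - observed("treated", 1)
# Subtracting the control group's change removes both confounders.
did = before_after - (observed("control", 2) - observed("control", 1))

print(cross_section, before_after, did)
# 6.0 4.5 3.0
```

Under these assumptions, only the double difference returns the true effect of 3.0. If the time confounder differed between the groups, the common trend assumption would fail and the double difference would be biased as well.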
We start with the simple two-group two-period DID model and then
examine a more general
design that allows for multiple groups and time periods.