"The Apple iPod iTunes Anti-Trust Litigation"
Filing
737
Administrative Motion to File Under Seal Portions of Plaintiffs' Daubert Motion to Exclude Certain Opinion Testimony of Kevin M. Murphy and Robert H. Topel and Exhibits 1-10 Pursuant to Civil L.R. 7-11 and 79-5 filed by Somtai Troy Charoensak, Mariana Rosen, Melanie Tucker. (Attachments: # 1 Declaration of Bonny E. Sweeney in support thereof, # 2 Proposed Order regarding Plaintiffs' Administrative Motion to File Under Seal, # 3 Redacted Version of Plaintiffs' Notice of Motion and Daubert Motion to Exclude Certain Opinion Testimony of Kevin M. Murphy and Robert H. Topel, # 4 Unredacted Version of Plaintiffs' Notice of Motion and Daubert Motion to Exclude Certain Opinion Testimony of Kevin M. Murphy and Robert H. Topel, # 5 Declaration of Bonny E. Sweeny in Support of Plaintiffs' Daubert Motion to Exclude Certain Opinion Testimony of Kevin M. Murphy and Robert H. Topel, # 6 Exhibit Redacted Version of Exhibits 1-10, # 7 Exhibit Unredacted Version of Exhibits 1-4, # 8 Exhibit Unredacted Version of Exhibit 5, # 9 Exhibit Unredacted Version of Exhibits 6-10, # 10 Exhibit 11-14, # 11 Proposed Order Granting Plaintiffs' Daubert Motion to Exclude Certain Opinion Testimony of Kevin M. Murphy and Robert H. Topel)(Sweeney, Bonny) (Filed on 12/20/2013)
EXHIBIT 11
EXHIBIT 12
EXHIBIT 13
econstor
www.econstor.eu
The Open Access Publication Server of the ZBW – Leibniz Information Centre for Economics
Cameron, A. Colin; Miller, Douglas L.
Working Paper
Robust inference with clustered data
Working Papers, University of California, Department of Economics, No. 10,7
Provided in Cooperation with:
University of California, Davis, Department of Economics
Suggested Citation: Cameron, A. Colin; Miller, Douglas L. (2010) : Robust inference with
clustered data, Working Papers, University of California, Department of Economics, No. 10,7
This Version is available at:
http://hdl.handle.net/10419/58373
Terms of use:
The ZBW grants you, the user, the non-exclusive right to use
the selected work free of charge, territorially unrestricted and
within the time limit of the term of the property rights according
to the terms specified at
→ http://www.econstor.eu/dspace/Nutzungsbedingungen
By the first use of the selected work the user agrees and
declares to comply with these terms of use.
Working Paper Series
Robust Inference with Clustered Data
A. Colin Cameron
Douglas L. Miller
April 06, 2010
Paper # 10-7
In this paper we survey methods to control for regression model error that is correlated
within groups or clusters, but is uncorrelated across groups or clusters. Then failure to
control for the clustering can lead to understatement of standard errors and
overstatement of statistical significance, as emphasized most notably in empirical
studies by Moulton (1990) and Bertrand, Duflo and Mullainathan (2004). We
emphasize OLS estimation with statistical inference based on minimal assumptions
regarding the error correlation process. Complications we consider include
cluster-specific fixed effects, few clusters, multi-way clustering, more efficient feasible
GLS estimation, and adaptation to nonlinear and instrumental variables estimators.
Department of Economics
One Shields Avenue
Davis, CA 95616
(530)752-0741
http://www.econ.ucdavis.edu/working_search.cfm
Robust Inference with Clustered Data
A. Colin Cameron and Douglas L. Miller
Department of Economics, University of California - Davis.
This version: Feb 10, 2010
Abstract
In this paper we survey methods to control for regression model error that is correlated within groups or clusters, but is uncorrelated across groups or clusters. Then failure to control for the clustering can lead to understatement of standard errors and overstatement of statistical significance, as emphasized most notably in empirical studies by Moulton (1990) and Bertrand, Duflo and Mullainathan (2004). We emphasize OLS estimation with statistical inference based on minimal assumptions regarding the error correlation process. Complications we consider include cluster-specific fixed effects, few clusters, multi-way clustering, more efficient feasible GLS estimation, and adaptation to nonlinear and instrumental variables estimators.
Keywords: Cluster robust, random effects, fixed effects, differences in differences, cluster bootstrap, few clusters, multi-way clusters.
JEL Classification: C12, C21, C23.
This paper is prepared for A. Ullah and D. E. Giles eds., Handbook of Empirical
Economics and Finance, forthcoming 2009.
Contents
1 Introduction
2 Clustering and its consequences
2.1 Clustered errors
2.2 Equicorrelated errors
2.3 Panel Data
3 Cluster-robust inference for OLS
3.1 Cluster-robust inference
3.2 Specifying the clusters
3.3 Cluster-specific fixed effects
3.4 Many observations per cluster
3.5 Survey design with clustering and stratification
4 Inference with few clusters
4.1 Finite-sample adjusted standard errors
4.2 Finite-sample Wald tests
4.3 T-distribution for inference
4.4 Cluster bootstrap with asymptotic refinement
4.5 Few treated groups
5 Multi-way clustering
5.1 Multi-way cluster-robust inference
5.2 Spatial correlation
6 Feasible GLS
6.1 FGLS and cluster-robust inference
6.2 Efficiency gains of feasible GLS
6.3 Random effects model
6.4 Hierarchical linear models
6.5 Serially correlated errors models for panel data
7 Nonlinear and instrumental variables estimators
7.1 Population-averaged models
7.2 Cluster-specific effects models
7.3 Instrumental variables
7.4 GMM
8 Empirical Example
9 Conclusion
10 References
1 Introduction
In this survey we consider regression analysis when observations are grouped in clusters, with
independence across clusters but correlation within clusters. We consider this in settings
where estimators retain their consistency, but statistical inference based on the usual cross-section assumption of independent observations is no longer appropriate.
Statistical inference must control for clustering, as failure to do so can lead to massively
under-estimated standard errors and consequent over-rejection using standard hypothesis
tests. Moulton (1986, 1990) demonstrated that this problem arises in a much wider range
of settings than had been appreciated by microeconometricians. More recently Bertrand,
Duflo and Mullainathan (2004) and Kezdi (2004) emphasized that with state-year panel or
repeated cross-section data, clustering can be present even after including state and year
effects, and valid inference requires controlling for clustering within state. Wooldridge (2003,
2006) provides surveys.
A common solution is to use "cluster-robust" standard errors that rely on weak assumptions (errors are independent but not identically distributed across clusters, and can have quite general patterns of within-cluster correlation and heteroskedasticity), provided the number of clusters is large. This correction generalizes that of White (1980) for independent heteroskedastic errors. Additionally, more efficient estimation may be possible using
alternative estimators, such as feasible GLS, that explicitly model the error correlation.
The loss of estimator precision due to clustering is presented in section 2, while cluster-robust inference is presented in section 3. The complications of inference given only a few
clusters, and inference when there is clustering in more than one direction, are considered in
sections 4 and 5. Section 6 presents more efficient feasible GLS estimation when structure
is placed on the within-cluster error correlation. In section 7 we consider adaptation to
nonlinear and instrumental variables estimators. An empirical example in section 8 illustrates
many of the methods discussed in this survey.
2 Clustering and its consequences
Clustering leads to less efficient estimation than if data are independent, and default OLS
standard errors need to be adjusted.
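2.1 Clustered errors
The linear model with (one-way) clustering is
$$y_{ig} = \mathbf{x}_{ig}'\boldsymbol{\beta} + u_{ig}, \tag{1}$$
where $i$ denotes the $i$th of $N$ individuals in the sample, $g$ denotes the $g$th of $G$ clusters, $E[u_{ig} \mid \mathbf{x}_{ig}] = 0$, and error independence across clusters is assumed so that for $i \neq j$
$$E[u_{ig} u_{jg'} \mid \mathbf{x}_{ig}, \mathbf{x}_{jg'}] = 0, \text{ unless } g = g'. \tag{2}$$
Errors for individuals belonging to the same group may be correlated, with quite general heteroskedasticity and correlation. Grouping observations by cluster, the model can be written as $\mathbf{y}_g = \mathbf{X}_g\boldsymbol{\beta} + \mathbf{u}_g$, where $\mathbf{y}_g$ and $\mathbf{u}_g$ are $N_g \times 1$ vectors, $\mathbf{X}_g$ is an $N_g \times K$ matrix, and there are $N_g$ observations in cluster $g$. Further stacking over clusters yields $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}$, where $\mathbf{y}$ and $\mathbf{u}$ are $N \times 1$ vectors, $\mathbf{X}$ is an $N \times K$ matrix, and $N = \sum_g N_g$. The OLS estimator is $\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$. Given error independence across clusters, this estimator has asymptotic variance matrix
$$V[\hat{\boldsymbol{\beta}}] = (E[\mathbf{X}'\mathbf{X}])^{-1} \left( \sum_{g=1}^{G} E[\mathbf{X}_g'\mathbf{u}_g\mathbf{u}_g'\mathbf{X}_g] \right) (E[\mathbf{X}'\mathbf{X}])^{-1}, \tag{3}$$
rather than the default OLS variance $\sigma_u^2 (E[\mathbf{X}'\mathbf{X}])^{-1}$, where $\sigma_u^2 = V[u_{ig}]$.
2.2 Equicorrelated errors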
One way that within-cluster correlation can arise is in the random effects model where the
error $u_{ig} = \alpha_g + \varepsilon_{ig}$, where $\alpha_g$ is a cluster-specific error or common shock that is i.i.d. $(0, \sigma_\alpha^2)$, and $\varepsilon_{ig}$ is an idiosyncratic error that is i.i.d. $(0, \sigma_\varepsilon^2)$. Then $\mathrm{Var}[u_{ig}] = \sigma_\alpha^2 + \sigma_\varepsilon^2$
and $\mathrm{Cov}[u_{ig}, u_{jg}] = \sigma_\alpha^2$ for $i \neq j$. It follows that the intraclass correlation of the error is
$\rho_u = \mathrm{Cor}[u_{ig}, u_{jg}] = \sigma_\alpha^2 / (\sigma_\alpha^2 + \sigma_\varepsilon^2)$. The correlation is constant across all pairs of errors in
a given cluster. This correlation pattern is suitable when observations can be viewed as
exchangeable, with ordering not mattering. Leading examples are individuals or households
within a village or other geographic unit (such as state), individuals within a household, and
students within a school.
If the primary source of clustering is due to such equicorrelated group-level common
shocks, a useful approximation is that for the $j$th regressor the default OLS variance estimate
based on $s^2 (\mathbf{X}'\mathbf{X})^{-1}$, where $s$ is the standard error of the regression, should be inflated by
$$\tau_j \simeq 1 + \rho_{x_j} \rho_u (\bar{N}_g - 1), \tag{4}$$
where $\rho_{x_j}$ is a measure of the within-cluster correlation of $x_j$, $\rho_u$ is the within-cluster error
correlation, and $\bar{N}_g$ is the average cluster size. This result for equicorrelated errors is exact
if clusters are of equal size; see Kloek (1981) for the special case $\rho_{x_j} = 1$, and Scott and
Holt (1982) and Greenwald (1983) for the general result. The efficiency loss, relative to
independent observations, is increasing in the within-cluster correlation of both the error
and the regressor and in the number of observations in each cluster.
To understand the loss of estimator precision given clustering, consider the sample mean
when observations are correlated. In this case the entire sample is viewed as a single cluster.
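Then
$$V[\bar{y}] = N^{-2} \left\{ \sum_{i=1}^{N} V[u_i] + \sum_{i} \sum_{j \neq i} \mathrm{Cov}[u_i, u_j] \right\}. \tag{5}$$
Given equicorrelated errors with $\mathrm{Cov}[y_{ig}, y_{jg}] = \rho\sigma^2$ for $i \neq j$, $V[\bar{y}] = N^{-2}\{N\sigma^2 + N(N-1)\rho\sigma^2\} = N^{-1}\sigma^2\{1 + \rho(N-1)\}$ compared to $N^{-1}\sigma^2$ in the i.i.d. case. At the extreme
$V[\bar{y}] = \sigma^2$ as $\rho \to 1$ and there is no benefit at all to increasing the sample size beyond $N = 1$.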
Similar results are obtained when we generalize to several clusters of equal size (balanced
clusters) with regressors that are invariant within cluster, so $y_{ig} = \mathbf{x}_g'\boldsymbol{\beta} + u_{ig}$ where $i$ denotes
the $i$th of $N$ individuals in the sample and $g$ denotes the $g$th of $G$ clusters, and there are
$N_* = N/G$ observations in each cluster. Then OLS estimation of $y_{ig}$ on $\mathbf{x}_g$ is equivalent to
OLS estimation in the model $\bar{y}_g = \mathbf{x}_g'\boldsymbol{\beta} + \bar{u}_g$, where $\bar{y}_g$ and $\bar{u}_g$ are the within-cluster averages
of the dependent variable and error. If $\bar{u}_g$ is independent and homoskedastic with variance
$\sigma_{\bar{u}_g}^2$ then $V[\hat{\boldsymbol{\beta}}] = \sigma_{\bar{u}_g}^2 \left( \sum_{g=1}^{G} \mathbf{x}_g\mathbf{x}_g' \right)^{-1}$, where the formula for $\sigma_{\bar{u}_g}^2$ varies with the within-cluster
correlation of $u_{ig}$. For equicorrelated errors $\sigma_{\bar{u}_g}^2 = N_*^{-1}[1 + \rho_u(N_* - 1)]\sigma_u^2$ compared to $N_*^{-1}\sigma_u^2$
with independent errors, so the true variance of the OLS estimator is $(1 + \rho_u(N_* - 1))$ times
the default, as given in (4) with $\rho_{x_j} = 1$.
In an influential paper Moulton (1990) pointed out that in many settings the adjustment
factor $\tau_j$ can be large even if $\rho_u$ is small. He considered a log earnings regression using
March CPS data ($N = 18{,}946$), regressors aggregated at the state level ($G = 49$), and
errors correlated within state ($\hat{\rho}_u = 0.032$). The average group size was $18{,}946/49 = 387$, and
$\rho_{x_j} = 1$ for a state-level regressor, so $\tau_j \simeq 1 + 1 \times 0.032 \times 386 = 13.3$. The weak correlation
of errors within state was still enough to lead to cluster-corrected standard errors being
$\sqrt{13.3} = 3.7$ times larger than the (incorrect) default standard errors, and in this example
many researchers would not appreciate the need to make this correction.
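The arithmetic of this example can be checked with a short sketch of formula (4). The function name `moulton_factor` is an illustrative choice, not something from the paper, and the inputs are the rounded figures quoted in the text.

```python
import math


def moulton_factor(rho_x: float, rho_u: float, avg_cluster_size: float) -> float:
    """Approximate variance inflation factor from equation (4)."""
    return 1.0 + rho_x * rho_u * (avg_cluster_size - 1.0)


# Rounded values quoted in the text for the Moulton (1990) example.
tau = moulton_factor(rho_x=1.0, rho_u=0.032, avg_cluster_size=387)
se_ratio = math.sqrt(tau)  # standard errors scale with the square root
print(f"variance inflation: {tau:.2f}, standard error ratio: {se_ratio:.2f}")
```

Up to rounding, this agrees with the factors of 13.3 and 3.7 quoted in the text.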
2.3 Panel Data
A second way that clustering can arise is in panel data. We assume that observations are
independent across individuals in the panel, but the observations for any given individual
are correlated over time. Then each individual is viewed as a cluster. The usual notation
is to denote the data as $y_{it}$ where $i$ denotes the individual and $t$ the time period. But in
our framework (1) the data are denoted $y_{ig}$ where $i$ is the within-cluster subscript (for panel
data the time period) and $g$ is the cluster unit (for panel data the individual).
The assumption of equicorrelated errors is unlikely to be suitable for panel data. Instead
we expect that the within-cluster (individual) correlation decreases as the time separation
increases.
For example, we might consider an AR(1) model with $u_{it} = \rho u_{i,t-1} + \varepsilon_{it}$, where $0 < \rho < 1$
and $\varepsilon_{it}$ is i.i.d. $(0, \sigma_\varepsilon^2)$. In terms of the notation in (1), $u_{ig} = \rho u_{i-1,g} + \varepsilon_{ig}$. Then the
within-cluster error correlation $\mathrm{Cor}[u_{ig}, u_{jg}] = \rho^{|i-j|}$, and the consequences of clustering are
less extreme than in the case of equicorrelated errors.
To see this, consider the variance of the sample mean $\bar{y}$ when $\mathrm{Cov}[y_i, y_j] = \rho^{|i-j|}\sigma^2$.
Then (5) yields $V[\bar{y}] = N^{-1}\left[1 + 2N^{-1}\sum_{s=1}^{N-1}(N-s)\rho^s\right]\sigma_u^2$, since there are $2(N-s)$ ordered pairs of observations that are $s$ periods apart. For example, if $\rho = 0.5$ and $N = 10$, then $V[\bar{y}] = 0.260\sigma^2$ compared to $0.55\sigma^2$ for equicorrelation, using $V[\bar{y}] = N^{-1}\sigma^2\{1 + \rho(N-1)\}$, and $0.1\sigma^2$ when there is no correlation ($\rho = 0.0$). More generally with several
clusters of equal size and regressors invariant within cluster, OLS estimation of $y_{ig}$ on $\mathbf{x}_g$ is
equivalent to OLS estimation of $\bar{y}_g$ on $\mathbf{x}_g$, see section 2.2, and with an AR(1) error $V[\hat{\boldsymbol{\beta}}] = N_*^{-1}\left[1 + 2N_*^{-1}\sum_{s=1}^{N_*-1}(N_*-s)\rho^s\right]\sigma_u^2\left(\sum_g \mathbf{x}_g\mathbf{x}_g'\right)^{-1}$, less than $N_*^{-1}[1 + \rho_u(N_* - 1)]\sigma_u^2\left(\sum_g \mathbf{x}_g\mathbf{x}_g'\right)^{-1}$ with
an equicorrelated error.
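The three variances compared in this example can be reproduced directly from (5) by averaging the entries of the $N \times N$ error correlation matrix. The following is a minimal sketch; the helper name `var_of_mean` is hypothetical, and variances are expressed in units of $\sigma^2$.

```python
def var_of_mean(corr, n: int) -> float:
    """V[ybar] / sigma^2 from (5): sum of all n*n correlations divided by n^2."""
    total = sum(corr(i, j) for i in range(n) for j in range(n))
    return total / n**2


rho, n = 0.5, 10

# AR(1): correlation decays geometrically with the time separation |i - j|.
ar1 = var_of_mean(lambda i, j: rho ** abs(i - j), n)
# Equicorrelation: every distinct pair has the same correlation rho.
equi = var_of_mean(lambda i, j: 1.0 if i == j else rho, n)
# Independence: no correlation between distinct observations.
iid = var_of_mean(lambda i, j: 1.0 if i == j else 0.0, n)

print(round(ar1, 3), round(equi, 3), round(iid, 3))
```

Working from the double sum rather than a closed form makes it easy to try other correlation patterns by swapping the `corr` function.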
For panel data in practice, while within-cluster correlations for errors are not constant,
they do not dampen as quickly as those for an AR(1) model. The variance inflation formula
(4) can still provide a reasonable guide in panels that are short and have high within-cluster
serial correlations of the regressor and of the error.
3 Cluster-robust inference for OLS
The most common approach in applied econometrics is to continue with OLS, and then
obtain standard errors that correct for within-cluster correlation.
3.1 Cluster-robust inference
Cluster-robust estimates for the variance matrix of an estimate are sandwich estimates that
are cluster adaptations of methods proposed originally for independent observations by White
(1980) for OLS with heteroskedastic errors, and by Huber (1967) and White (1982) for the
maximum likelihood estimator.
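The cluster-robust estimate of the variance matrix of the OLS estimator, defined in (3),
is the sandwich estimate
$$\hat{V}[\hat{\boldsymbol{\beta}}] = (\mathbf{X}'\mathbf{X})^{-1} \hat{\mathbf{B}} (\mathbf{X}'\mathbf{X})^{-1}, \tag{6}$$
where
$$\hat{\mathbf{B}} = \sum_{g=1}^{G} \mathbf{X}_g'\hat{\mathbf{u}}_g\hat{\mathbf{u}}_g'\mathbf{X}_g, \tag{7}$$
and $\hat{\mathbf{u}}_g = \mathbf{y}_g - \mathbf{X}_g\hat{\boldsymbol{\beta}}$. This provides a consistent estimate of the variance matrix if $G^{-1}\sum_{g=1}^{G} \mathbf{X}_g'\hat{\mathbf{u}}_g\hat{\mathbf{u}}_g'\mathbf{X}_g - G^{-1}\sum_{g=1}^{G} E[\mathbf{X}_g'\mathbf{u}_g\mathbf{u}_g'\mathbf{X}_g] \xrightarrow{p} \mathbf{0}$ as $G \to \infty$.
The estimate of White (1980) for independent heteroskedastic errors is the special case
of (7) where each cluster has only one observation (so $G = N$ and $N_g = 1$ for all $g$). It relies
on the same intuition that $G^{-1}\sum_{g=1}^{G} E[\mathbf{X}_g'\mathbf{u}_g\mathbf{u}_g'\mathbf{X}_g]$ is a finite-dimensional ($K \times K$) matrix
of averages that can be consistently estimated as $G \to \infty$.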
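As an illustration of (6) and (7), the following sketch specialises the sandwich estimate to a single regressor without an intercept ($K = 1$), so that every matrix reduces to a scalar. This simplification, the function names, and the six data points are assumptions made for the example; no finite-sample correction is applied.

```python
from collections import defaultdict


def ols_slope(x, y):
    """OLS slope for one regressor, no intercept: b = sum(x*y) / sum(x^2)."""
    sxx = sum(xi * xi for xi in x)
    return sum(xi * yi for xi, yi in zip(x, y)) / sxx


def cluster_robust_var(x, y, cluster):
    """Sandwich estimate (6)-(7) specialised to K = 1."""
    b = ols_slope(x, y)
    sxx = sum(xi * xi for xi in x)
    score = defaultdict(float)  # X_g' u_hat_g, a scalar when K = 1
    for xi, yi, g in zip(x, y, cluster):
        score[g] += xi * (yi - b * xi)
    b_hat = sum(s * s for s in score.values())  # middle term B_hat in (7)
    return b_hat / sxx**2


def white_var(x, y):
    """Heteroskedasticity-robust estimate of White (1980) for K = 1."""
    b = ols_slope(x, y)
    sxx = sum(xi * xi for xi in x)
    return sum((xi * (yi - b * xi)) ** 2 for xi, yi in zip(x, y)) / sxx**2


# Made-up data: six observations in three clusters.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.9, 5.3, 5.8]
groups = ["a", "a", "b", "b", "c", "c"]

v_cluster = cluster_robust_var(x, y, groups)
# One observation per cluster (G = N) reproduces the White (1980) estimate.
v_singleton = cluster_robust_var(x, y, list(range(len(x))))
v_white = white_var(x, y)
```

The only difference between the clustered and the White estimate is that scores are summed within each cluster before being squared, which is what allows arbitrary within-cluster correlation.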
White (1984, p.134-142) presented formal theorems that justify use of (7) for OLS with a
multivariate dependent variable, a result directly applicable to balanced clusters. Liang and
Zeger (1986) proposed this method for estimation for a range of models much wider than
OLS; see sections 6 and 7 of their paper for a range of extensions to (7). Arellano (1987)
considered the fixed effects estimator in linear panel models, and Rogers (1993) popularized
this method in applied econometrics by incorporating it in Stata. Note that (7) does not
require specification of a model for $E[\mathbf{u}_g\mathbf{u}_g']$.
Finite-sample modifications of (7) are typically used, since without modification the
cluster-robust standard errors are biased downwards. Stata uses $\sqrt{c}\,\hat{\mathbf{u}}_g$ in (7) rather than $\hat{\mathbf{u}}_g$,
with
$$c = \frac{G}{G-1}\,\frac{N-1}{N-K} \simeq \frac{G}{G-1}. \tag{8}$$
Some other packages such as SAS use $c = G/(G-1)$. This simpler correction is also used
by Stata for extensions to nonlinear models. Cameron, Gelbach, and Miller (2008) review
various finite-sample corrections that have been proposed in the literature, for both standard
errors and for inference using resultant Wald statistics; see also section 6.
The rank of $\hat{V}[\hat{\boldsymbol{\beta}}]$ in (7) can be shown to be at most $G$, so at most $G$ restrictions on the
parameters can be tested if cluster-robust standard errors are used. In particular, in models
with cluster-specific effects it may not be possible to perform a test of overall significance of
the regression, even though it is possible to perform tests on smaller subsets of the regressors.
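The size of the finite-sample factor in (8) can be gauged with a small sketch. The function names and the design (50 clusters, 5,000 observations, 10 regressors) are hypothetical and chosen only for illustration.

```python
def correction_full(G: int, N: int, K: int) -> float:
    """Finite-sample factor c in (8): G/(G-1) * (N-1)/(N-K)."""
    return G / (G - 1) * (N - 1) / (N - K)


def correction_simple(G: int) -> float:
    """The simpler factor c = G/(G-1)."""
    return G / (G - 1)


# Hypothetical design: 50 clusters, 5,000 observations, 10 regressors.
c_full = correction_full(G=50, N=5000, K=10)
c_simple = correction_simple(G=50)
se_scale = c_full ** 0.5  # residuals, and hence standard errors, scale by sqrt(c)
print(f"c = {c_full:.4f}, G/(G-1) = {c_simple:.4f}, sqrt(c) = {se_scale:.4f}")
```

When $K$ is small relative to $N$ the two factors are nearly identical, which is the approximation stated in (8).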
3.2 Specifying the clusters
It is not always obvious how to define the clusters.
As already noted in section 2.2, Moulton (1986, 1990) pointed out that for statistical inference
on an aggregate-level regressor it may be necessary to cluster at that level. For example, with
individual cross-sectional data and a regressor defined at the state level one should cluster at
the state level if regression model errors are even very mildly correlated at the state level. In
other cases the key regressor may be correlated within group, though not perfectly so, such
as individuals within household. Other reasons for clustering include discrete regressors and
a clustered sample design.
In some applications there can be nested levels of clustering. For example, for a household-based survey there may be error correlation for individuals within the same household, and
for individuals in the same state. In that case cluster-robust standard errors are computed
at the most aggregated level of clustering, in this example at the state level. Pepper (2002)
provides a detailed example.
Bertrand, Duflo and Mullainathan (2004) noted that with panel data or repeated cross-section data, and regressors clustered at the state level, many researchers either failed to
account for clustering or mistakenly clustered at the state-year level rather than the state
level. Let $y_{ist}$ denote the value of the dependent variable for the $i$th individual in the $s$th
state in the $t$th year, and let $x_{st}$ denote a state-level policy variable that in practice will be
quite highly correlated over time in a given state. The authors considered the difference-in-differences (DiD) model $y_{ist} = \gamma_s + \delta_t + \beta x_{st} + \mathbf{z}_{ist}'\boldsymbol{\gamma} + u_{ist}$, though their result is relevant even
for OLS regression of $y_{ist}$ on $x_{st}$ alone. The same point applies if data were more simply
observed at only the state-year level (i.e. $y_{st}$ rather than $y_{ist}$).
In general DiD models using state-year data will have high within-cluster correlation of
the key policy regressor. Furthermore there may be relatively few clusters, a complication
considered in section 4.
3.3 Cluster-specific fixed effects
A standard estimation method for clustered data is to additionally incorporate cluster-specific fixed effects as regressors, estimating the model
$$y_{ig} = \alpha_g + \mathbf{x}_{ig}'\boldsymbol{\beta} + u_{ig}. \tag{9}$$
This is similar to the equicorrelated error model, except that $\alpha_g$ is treated as a (nuisance)
parameter to be estimated. Given $N_g$ finite and $G \to \infty$ the parameters $\alpha_g$, $g = 1, \ldots, G$,
cannot be consistently estimated. The parameters $\boldsymbol{\beta}$ can still be consistently estimated, with
the important caveat that the coefficients of cluster-invariant regressors ($\mathbf{x}_g$ rather than $\mathbf{x}_{ig}$)
are not identified. (In microeconometrics applications, fixed effects are typically included to
enable consistent estimation of a cluster-varying regressor while controlling for a limited form
of endogeneity: the regressor $\mathbf{x}_{ig}$ may be correlated with the cluster-invariant component
$\alpha_g$ of the error term $\alpha_g + u_{ig}$.)
Initial applications obtained default standard errors that assume $u_{ig}$ in (9) is i.i.d. $(0, \sigma_u^2)$,
assuming that cluster-specific fixed effects are sufficient to mop up any within-cluster error
correlation. More recently it has become more common to control for possible within-cluster
correlation of $u_{ig}$ by using (7), as suggested by Arellano (1987). Kezdi (2004) demonstrated
that cluster-robust estimates can perform well in typical-sized panels, despite the need to
first estimate the fixed effects, even when $N_g$ is large relative to $G$.
It is well-known that there are several alternative ways to obtain the OLS estimator of
$\boldsymbol{\beta}$ in (9). Less well-known is that these different ways can lead to different cluster-robust
estimates of $V[\hat{\boldsymbol{\beta}}]$. We thank Arindrajit Dube and Jason Lindo for bringing this issue to our
attention.
The two main estimation methods we consider are the least squares dummy variables
(LSDV) estimator, which obtains the OLS estimator from regression of $y_{ig}$ on $\mathbf{x}_{ig}$ and a set
of dummy variables for each cluster, and the mean-differenced estimator, which is the OLS
estimator from regression of $(y_{ig} - \bar{y}_g)$ on $(\mathbf{x}_{ig} - \bar{\mathbf{x}}_g)$.
These two methods lead to the same cluster-robust standard errors if we apply formula
(7) to the respective regressions, or if we multiply this estimate by $G/(G-1)$. Differences
arise, however, if we multiply by the small-sample correction $c$ given in (8). Let $K$ denote the
number of regressors including the intercept. Then the LSDV model views the total set of
regressors to be $G$ cluster dummies and $(K-1)$ other regressors, while the mean-differenced
model considers there to be only $(K-1)$ regressors (this model is estimated without an
intercept). Then

Model                     Finite-sample adjustment                        Balanced case
LSDV                      $c = \frac{G}{G-1}\frac{N-1}{N-G-(K-1)}$        $c \simeq \frac{G}{G-1}\frac{N_*}{N_*-1}$
Mean-differenced model    $c = \frac{G}{G-1}\frac{N-1}{N-(K-1)}$          $c \simeq \frac{G}{G-1}$

In the balanced case $N = N_* G$, leading to the approximation given above if additionally $K$
is small relative to $N$.
The difference can be very large for small $N_*$. Thus if $N_* = 2$ (or $N_* = 3$) then the
cluster-robust variance matrix obtained using LSDV is essentially 2 times (or $3/2$ times)
that obtained from estimating the mean-differenced model, and it is the mean-differenced
model that gives the correct finite-sample correction.
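The ratio of the two adjustments can be computed directly from the formulas in the table. This sketch assumes balanced clusters; the function names and the design ($G = 1000$, $K = 3$) are hypothetical.

```python
def c_lsdv(G: int, N: int, K: int) -> float:
    """Adjustment when the G cluster dummies are counted as regressors."""
    return G / (G - 1) * (N - 1) / (N - G - (K - 1))


def c_mean_diff(G: int, N: int, K: int) -> float:
    """Adjustment for the mean-differenced model with K - 1 regressors."""
    return G / (G - 1) * (N - 1) / (N - (K - 1))


G, K = 1000, 3  # hypothetical design with many small balanced clusters
ratios = {}
for n_star in (2, 3, 10):  # observations per cluster
    N = n_star * G
    ratios[n_star] = c_lsdv(G, N, K) / c_mean_diff(G, N, K)
    print(n_star, round(ratios[n_star], 3))
```

The ratio is close to $N_*/(N_* - 1)$, so it is near 2 and $3/2$ for clusters of size two and three and shrinks towards one as clusters grow.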
Note that if instead the error $u_{ig}$ is assumed to be i.i.d. $(0, \sigma_u^2)$, so that default standard
errors are used, then it is well-known that the appropriate small-sample correction is $(N-1)/(N-G-(K-1))$, i.e. we use $s^2(\mathbf{X}'\mathbf{X})^{-1}$ where $s^2 = (N-G-(K-1))^{-1}\sum_{ig}\hat{u}_{ig}^2$. In that
case LSDV does give the correct adjustment, and estimation of the mean-differenced model
will give the wrong finite-sample correction.
An alternative variance estimator after estimation of (9) is a heteroskedastic-robust estimator, which permits the error uig in (9) to be heteroskedastic but uncorrelated across both
i and g. Stock and Watson (2008) show that applying the method of White (1980) after
mean-differenced estimation of (9) leads, surprisingly, to inconsistent estimates of $V[\hat{\boldsymbol{\beta}}]$ if
the number of observations $N_g$ in each cluster is small (though it is correct if $N_g = 2$). The
bias comes from estimating the cluster-specific means rather than being able to use the true
cluster-means. They derive a bias-corrected formula for heteroskedastic-robust standard errors. Alternatively, and more simply, the cluster-robust estimator gives a consistent estimate
of $V[\hat{\boldsymbol{\beta}}]$ even if the errors are only heteroskedastic, though this estimator is more variable
than the bias-corrected estimator proposed by Stock and Watson.
3.4 Many observations per cluster
The preceding analysis assumes the number of observations within each cluster is fixed, while
the number of clusters goes to infinity.
This assumption may not be appropriate for clustering in long panels, where the number
of time periods goes to infinity. Hansen (2007a) derived asymptotic results for the standard
one-way cluster-robust variance matrix estimator for panel data under various assumptions.
We consider a balanced panel of $N$ individuals over $T$ periods, so there are $NT$ observations
in $N$ clusters with $T$ observations per cluster. When $N \to \infty$ with $T$ fixed (a short panel),
as we have assumed above, the rate of convergence for the OLS estimator $\hat{\boldsymbol{\beta}}$ is $\sqrt{N}$. When
both $N \to \infty$ and $T \to \infty$ (a long panel with $N \to \infty$), the rate of convergence of $\hat{\boldsymbol{\beta}}$ is
$\sqrt{N}$ if there is no mixing (his Theorem 2) and $\sqrt{NT}$ if there is mixing (his Theorem 3). By
mixing we mean that the correlation becomes damped as observations become further apart
in time.
As illustrated in section 2.3, if the within-cluster correlation of the error diminishes
as errors are further apart in time, then the data has greater informational content. This
is reflected in the rate of convergence increasing from $\sqrt{N}$ (determined by the number of
cross-sections) to $\sqrt{NT}$ (determined by the total size of the panel). The latter rate is the
rate we expect if errors were independent within cluster.
While the rates of convergence differ in the two cases, Hansen (2007a) obtains the same
asymptotic variance for the OLS estimator, so (7) remains valid.
3.5 Survey design with clustering and stratification
Clustering routinely arises in complex survey data. Rather than randomly draw individuals
from the population, the survey may be restricted to a randomly-selected subset of primary sampling units (such as a geographic area) followed by selection of people within that
geographic area. A common approach in microeconometrics is to control for the resultant
clustering by computing cluster-robust standard errors that control for clustering at the level
of the primary sampling unit, or at a more aggregated level such as state.
The survey methods literature uses methods to control for clustering that predate the
references in this paper. The loss of estimator precision due to clustering is called the design
effect: "The design effect or Deff is the ratio of the actual variance of a sample to the variance
of a simple random sample of the same number of elements" (Kish (1965), p.258). Kish
and Frankel (1974) give the variance inflation formula (4) assuming equicorrelated errors in
the non-regression case of estimation of the mean. Pfeffermann and Nathan (1981) consider
the more general regression case.
The survey methods literature additionally controls for another feature of survey data:
stratification. More precise statistical inference is possible after stratification. For the linear
regression model, survey methods that do so are well-established and are incorporated in
specialized software as well as in some broad-based packages such as Stata.
Bhattacharya (2005) provides a comprehensive treatment in a GMM framework. He
finds that accounting for stratification tends to reduce estimated standard errors, and that
this effect can be meaningfully large. In his empirical examples, the stratification effect is
largest when estimating (unconditional) means and Lorenz shares, and much smaller when
estimating conditional means via regression.
The current common approach of microeconometrics studies is to ignore the (beneficial)
effects of stratification. In so doing there will be some over-estimation of estimator standard
errors.
4 Inference with few clusters
Cluster-robust inference asymptotics are based on $G \to \infty$. Often, however, cluster-robust
inference is desired but there are only a few clusters. For example, clustering may be at the
regional level but there are few regions (e.g. Canada has only ten provinces). Several
different finite-sample adjustments have been proposed for this situation.
4.1 Finite-sample adjusted standard errors
Finite-sample adjustments replace $\hat{\mathbf{u}}_g$ in (7) with a modified residual $\tilde{\mathbf{u}}_g$. The simplest is
$\tilde{\mathbf{u}}_g = \sqrt{G/(G-1)}\,\hat{\mathbf{u}}_g$, or the modification of this given in (8). Kauermann and Carroll (2001)
and Bell and McCaffrey (2002) use $\tilde{\mathbf{u}}_g = [\mathbf{I}_{N_g} - \mathbf{H}_{gg}]^{-1/2}\hat{\mathbf{u}}_g$, where $\mathbf{H}_{gg} = \mathbf{X}_g(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}_g'$. This
transformed residual leads to $E[\hat{V}[\hat{\boldsymbol{\beta}}]] = V[\hat{\boldsymbol{\beta}}]$ in the special case that $\boldsymbol{\Omega}_g = E[\mathbf{u}_g\mathbf{u}_g'] = \sigma^2\mathbf{I}$.
Bell and McCaffrey (2002) also consider use of $\tilde{\mathbf{u}}_g^{+} = \sqrt{G/(G-1)}\,[\mathbf{I}_{N_g} - \mathbf{H}_{gg}]^{-1}\hat{\mathbf{u}}_g$, which
can be shown to equal the (clustered) jackknife estimate of the variance of the OLS estimator.
These adjustments are analogs of the HC2 and HC3 measures of MacKinnon and White
(1985) proposed for heteroskedastic-robust standard errors in the nonclustered case.
Angrist and Lavy (2002) found that using $\tilde{\mathbf{u}}_g^{+}$ rather than $\tilde{\mathbf{u}}_g$ increased cluster-robust
standard errors by 10 to 50 percent in an application with $G = 30$ to $40$.
Kauermann and Carroll (2001), Bell and McCaffrey (2002), Mancl and DeRouen (2001),
and McCaffrey, Bell and Botts (2001) also consider the case where $\boldsymbol{\Omega}_g \neq \sigma^2\mathbf{I}$ is of known
functional form, and present extension to generalized linear models.
4.2 Finite-sample Wald tests
For a two-sided test of $H_0: \beta_j = \beta_j^0$ against $H_a: \beta_j \neq \beta_j^0$, where $\beta_j$ is a scalar component of
$\boldsymbol{\beta}$, the standard procedure is to use Wald test statistic $w = (\hat{\beta}_j - \beta_j^0)/s_{\hat{\beta}_j}$, where $s_{\hat{\beta}_j}$ is the
square root of the appropriate diagonal entry in $\hat{V}[\hat{\boldsymbol{\beta}}]$. This "t" test statistic is asymptotically
normal under $H_0$ as $G \to \infty$, and we reject $H_0$ at significance level 0.05 if $|w| > 1.960$.
With few clusters, however, the asymptotic normal distribution can provide a poor approximation, even if an unbiased variance matrix estimator is used in calculating $s_{\hat{\beta}_j}$. The
situation is a little unusual. In a pure time series or pure cross-section setting with few
observations, say $N = 10$, $\beta_j$ is likely to be very imprecisely estimated so that statistical inference is not worth pursuing. By contrast, in a clustered setting we may have $N$ sufficiently
large that $\beta_j$ is reasonably precisely estimated, but $G$ is so small that the asymptotic normal
approximation is a very poor one.
We present two possible approaches: basing inference on the T distribution with degrees of
freedom determined by the cluster, and using a cluster bootstrap with asymptotic refinement.
Note that feasible GLS based on a correctly specified model of the clustering, see section 6,
will not suffer from this problem.
4.3 T-distribution for inference
The simplest small-sample correction for the Wald statistic is to use a T distribution, rather than the standard normal. As we outline below, in some cases the $T_{G-L}$ distribution might be used, where L is the number of regressors that are invariant within cluster. Some packages for some commands do use the T distribution. For example, Stata uses $G-1$ degrees of freedom for t-tests and F tests based on cluster-robust standard errors.
Such adjustments can make quite a difference. For example, with G = 10 for a two-sided test at level 0.05 the critical value for $T_9$ is 2.262 rather than 1.960, and if w = 1.960 the p-value based on $T_9$ is 0.082 rather than 0.05. In Monte Carlo simulations by Cameron, Gelbach, and Miller (2008) this technique works reasonably well. At the minimum one should use the T distribution with $G-1$ degrees of freedom, say, rather than the standard normal.
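The G = 10 comparison above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the T tail probability is obtained by Simpson's rule on the density rather than by a library routine.

```python
import math

def t_pdf(x, df):
    """Density of the Student T distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_two_sided_pvalue(w, df, steps=2000):
    """Two-sided p-value 2*P(T_df > |w|), integrating the density on [0, |w|]."""
    a = abs(w)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_mass = 2 * total * h / 3  # P(-|w| < T < |w|)
    return 1 - central_mass

def normal_two_sided_pvalue(w):
    """Two-sided p-value under the standard normal."""
    return math.erfc(abs(w) / math.sqrt(2))

G = 10
p_normal = normal_two_sided_pvalue(1.960)
p_t = t_two_sided_pvalue(1.960, G - 1)
print(f"normal: {p_normal:.3f}, T with G-1 df: {p_t:.3f}")
```

The same function evaluated at 2.262 with 9 degrees of freedom should return a p-value very close to 0.05, which is the critical value quoted in the text.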
Donald and Lang (2007) provide a rationale for using the $T_{G-L}$ distribution. If clusters are balanced and all regressors are invariant within cluster then the OLS estimator in the model $y_{ig} = x_g'\beta + u_{ig}$ is equivalent to OLS estimation in the grouped model $\bar{y}_g = x_g'\beta + \bar{u}_g$. If $\bar{u}_g$ is i.i.d. normally distributed then the Wald statistic is $T_{G-L}$ distributed, where $\hat{V}[\hat{\beta}] = s^2(X'X)^{-1}$ and $s^2 = (G-K)^{-1}\sum_g \hat{\bar{u}}_g^2$. Note that $\bar{u}_g$ is i.i.d. normal in the random effects model if the error components are i.i.d. normal.
Donald and Lang (2007) extend this approach to additionally include regressors $z_{ig}$ that vary within clusters, and allow for unbalanced clusters. They assume a random effects model with normal i.i.d. errors. Then feasible GLS estimation of $\beta$ in the model
$$y_{ig} = x_g'\beta + z_{ig}'\gamma + \alpha_g + \varepsilon_{ig} \tag{10}$$
is equivalent to the following two-step procedure. First do OLS estimation in the model $y_{ig} = \delta_g + z_{ig}'\gamma + \varepsilon_{ig}$, where $\delta_g$ is treated as a cluster-specific fixed effect. Then do FGLS of $\bar{y}_g - \bar{z}_g'\hat{\gamma}$ on $x_g$. Donald and Lang (2007) give various conditions under which the resulting Wald statistic based on $\hat{\beta}_j$ is $T_{G-L}$ distributed. These conditions require that if $z_{ig}$ is a regressor then $\bar{z}_g$ in the limit is constant over g, unless $N_g \to \infty$. Usually L = 2, as the only regressors that do not vary within clusters are an intercept and a scalar regressor $x_g$.
Wooldridge (2006) presents an expansive exposition of the Donald and Lang approach. Additionally, Wooldridge proposes an alternative approach based on minimum distance estimation. He assumes that $\delta_g$ in $y_{ig} = \delta_g + z_{ig}'\gamma + \varepsilon_{ig}$ can be adequately explained by $x_g$ and at the second step uses minimum chi-square methods to estimate $\beta$ in $\hat{\delta}_g = \alpha + x_g'\beta$. This provides estimates of $\beta$ that are asymptotically normal as $N_g \to \infty$ (rather than $G \to \infty$). Wooldridge argues that this leads to less conservative statistical inference. The $\chi^2$ statistic from the minimum distance method can be used as a test of the assumption that the $\delta_g$ do not depend in part on cluster-specific random effects. If this test fails, the researcher can then use the Donald and Lang approach, and use a T distribution for inference.
An alternate approach for correct inference with few clusters is presented by Ibragimov and Müller (2010). Their method is best suited for settings where model identification, and central limit theorems, can be applied separately to observations in each cluster. They propose separate estimation of the key parameter within each group. Each group's estimate is then a draw from a normal distribution with mean around the truth, though perhaps with separate variance for each group. The separate estimates are averaged, divided by the sample standard deviation of these estimates, and the test statistic is compared against critical values from a T distribution. This approach has the strength of offering correct inference even with few clusters. A limitation is that it requires identification using only within-group variation, so that the group estimates are independent of one another. For example, if state-year data $y_{st}$ are used and the state is the cluster unit, then the regressors cannot use any regressor $z_t$ such as a time dummy that varies over time but not states.
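As an illustration of the group-by-group approach just described, the sketch below forms a test statistic from a list of separately estimated group coefficients. It assumes the usual one-sample t-statistic, that is, the mean divided by the sample standard deviation and scaled by $\sqrt{G}$, compared with a T distribution with $G-1$ degrees of freedom; the scaling and the degrees of freedom are assumptions of this sketch, as is the function name.

```python
import math
import statistics

def group_estimate_t_stat(estimates, null_value=0.0):
    """t-statistic built from G separately estimated group coefficients.

    Returns the statistic and the degrees of freedom (G - 1) assumed
    for the reference T distribution.
    """
    G = len(estimates)
    mean = statistics.fmean(estimates)
    sd = statistics.stdev(estimates)  # sample standard deviation, G - 1 denominator
    t_stat = math.sqrt(G) * (mean - null_value) / sd
    return t_stat, G - 1

# Hypothetical estimates of the same parameter from three clusters
t_stat, df = group_estimate_t_stat([1.0, 2.0, 3.0])
print(t_stat, df)
```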
4.4 Cluster bootstrap with asymptotic refinement
A cluster bootstrap with asymptotic refinement can lead to improved finite-sample inference.
For inference based on $G \to \infty$, a two-sided Wald test of nominal size $\alpha$ can be shown to have true size $\alpha + O(G^{-1})$ when the usual asymptotic normal approximation is used. If instead an appropriate bootstrap with asymptotic refinement is used, the true size is $\alpha + O(G^{-3/2})$. This is closer to the desired $\alpha$ for large G, and hopefully also for small G. For a one-sided test or a nonsymmetric two-sided test the rates are instead, respectively, $\alpha + O(G^{-1/2})$ and $\alpha + O(G^{-1})$.
Such asymptotic refinement can be achieved by bootstrapping a statistic that is asymptotically pivotal, meaning the asymptotic distribution does not depend on any unknown parameters. For this reason the Wald t-statistic w is bootstrapped, rather than the estimator $\hat{\beta}_j$ whose distribution depends on $V[\hat{\beta}_j]$ which needs to be estimated. The pairs cluster bootstrap procedure does B iterations where at the bth iteration: (1) form G clusters $\{(y_1^*, X_1^*), \ldots, (y_G^*, X_G^*)\}$ by resampling with replacement G times from the original sample of clusters; (2) do OLS estimation with this resample and calculate the Wald test statistic $w_b^* = (\hat{\beta}_{j,b}^* - \hat{\beta}_j)/s_{\hat{\beta}_{j,b}^*}$ where $s_{\hat{\beta}_{j,b}^*}$ is the cluster-robust standard error of $\hat{\beta}_{j,b}^*$, and $\hat{\beta}_j$ is the OLS estimate of $\beta_j$ from the original sample. Then reject $H_0$ at level $\alpha$ if and only if the original sample Wald statistic w is such that $w < w^*_{[\alpha/2]}$ or $w > w^*_{[1-\alpha/2]}$ where $w^*_{[q]}$ denotes the qth quantile of $w_1^*, \ldots, w_B^*$.
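The two bootstrap steps and the rejection rule can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names, the default B = 399 and the seed are hypothetical choices, and the cluster-robust standard error used here omits any finite-sample correction.

```python
import numpy as np

def ols_cluster(X, y, cluster):
    """OLS coefficients and one-way cluster-robust standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        s = X[cluster == g].T @ resid[cluster == g]
        meat += np.outer(s, s)
    V = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(V))

def pairs_cluster_bootstrap_t(X, y, cluster, j, beta0=0.0, B=399, alpha=0.05, seed=0):
    """Percentile-t test of H0: beta_j = beta0 using a pairs cluster bootstrap."""
    rng = np.random.default_rng(seed)
    beta, se = ols_cluster(X, y, cluster)
    w = (beta[j] - beta0) / se[j]
    ids = np.unique(cluster)
    rows = {g: np.flatnonzero(cluster == g) for g in ids}
    w_star = np.empty(B)
    for b in range(B):
        # (1) resample G whole clusters with replacement
        draw = rng.choice(ids, size=len(ids), replace=True)
        idx = np.concatenate([rows[g] for g in draw])
        # a cluster drawn twice is treated as two distinct clusters
        new_id = np.concatenate([np.full(len(rows[g]), k) for k, g in enumerate(draw)])
        # (2) re-estimate and centre the statistic on the original estimate
        beta_b, se_b = ols_cluster(X[idx], y[idx], new_id)
        w_star[b] = (beta_b[j] - beta[j]) / se_b[j]
    lo, hi = np.quantile(w_star, [alpha / 2, 1 - alpha / 2])
    return w, (lo, hi), bool(w < lo or w > hi)
```

Resampling entire clusters, rather than individual observations, is what preserves the within-cluster dependence in each bootstrap sample.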
Cameron, Gelbach, and Miller (2008) provide an extensive discussion of this and related
bootstraps. If there are regressors which contain few values (such as dummy variables),
and if there are few clusters, then it is better to use an alternative design-based bootstrap
that additionally conditions on the regressors, such as a cluster Wild bootstrap. Even then
bootstrap methods, unlike the method of Donald and Lang, will not be appropriate when
there are very few groups, such as G = 2.
4.5 Few treated groups
Even when G is sufficiently large, problems arise if most of the variation in the regressor is concentrated in just a few clusters. This occurs if the key regressor is a cluster-specific binary treatment dummy and there are few treated groups.
Conley and Taber (2010) examine a differences-in-differences (DiD) model in which there are few treated groups and an increasing number of control groups. If there are group-time random effects, then the DiD estimator is inconsistent because the treated groups' random effects are not averaged away. If the random effects are normally distributed, then the model of Donald and Lang (2007) applies and inference can use a T distribution based on the number of treated groups. If the group-time shocks are not random, then the T distribution may be a poor approximation. Conley and Taber (2010) then propose a novel method that uses the distribution of the untreated groups to perform inference on the treatment parameter.
5 Multi-way clustering
Regression model errors can be clustered in more than one way. For example, they might be correlated across time within a state, and across states within a time period. When the groups are nested (for example, households within states), one clusters on the more aggregate group; see section 3.2. But when they are non-nested, traditional cluster inference can only deal with one of the dimensions.
In some applications it is possible to include sufficient regressors to eliminate error correlation in all but one dimension, and then do cluster-robust inference for that remaining dimension. A leading example is that in a state-year panel of individuals (with dependent variable $y_{ist}$) there may be clustering both within years and within states. If the within-year clustering is due to shocks that are the same across all individuals in a given year, then including year fixed effects as regressors will absorb within-year clustering and inference then need only control for clustering on state.
When this is not possible, the one-way cluster robust variance can be extended to multi-way clustering.
5.1 Multi-way cluster-robust inference
The cluster-robust estimate of $V[\hat{\beta}]$ defined in (6)-(7) can be generalized to clustering in multiple dimensions. Regular one-way clustering is based on the assumption that $E[u_i u_j | x_i, x_j] = 0$, unless observations i and j are in the same cluster. Then (7) sets $\hat{B} = \sum_{i=1}^{N}\sum_{j=1}^{N} x_i x_j' \hat{u}_i \hat{u}_j \mathbf{1}[i, j \text{ in same cluster}]$, where $\hat{u}_i = y_i - x_i'\hat{\beta}$ and the indicator function $\mathbf{1}[A]$ equals 1 if event A occurs and 0 otherwise. In multi-way clustering, the key assumption is that $E[u_i u_j | x_i, x_j] = 0$, unless observations i and j share any cluster dimension. Then the multi-way cluster robust estimate of $V[\hat{\beta}]$ replaces (7) with $\hat{B} = \sum_{i=1}^{N}\sum_{j=1}^{N} x_i x_j' \hat{u}_i \hat{u}_j \mathbf{1}[i, j \text{ share any cluster}]$.
For two-way clustering this robust variance estimator is easy to implement given software that computes the usual one-way cluster-robust estimate. We obtain three different cluster-robust "variance" matrices for the estimator by one-way clustering in, respectively, the first dimension, the second dimension, and by the intersection of the first and second dimensions. Then add the first two variance matrices and, to account for double-counting, subtract the third. Thus
$$\hat{V}_{\text{two-way}}[\hat{\beta}] = \hat{V}_1[\hat{\beta}] + \hat{V}_2[\hat{\beta}] - \hat{V}_{1\cap 2}[\hat{\beta}], \tag{11}$$
where the three component variance estimates are computed using (6)-(7) for the three different ways of clustering. Similar methods for additional dimensions, such as three-way clustering, are detailed in Cameron, Gelbach, and Miller (2010).
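The add-two-subtract-one recipe in (11) can be sketched as below. The function names are hypothetical, no finite-sample correction is applied, and the intersection clusters are formed from the unique pairs of the two cluster identifiers.

```python
import numpy as np

def one_way_vcov(X, resid, cluster):
    """One-way cluster-robust sandwich variance for OLS, given residuals."""
    XtX_inv = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        s = X[cluster == g].T @ resid[cluster == g]
        meat += np.outer(s, s)
    return XtX_inv @ meat @ XtX_inv

def two_way_vcov(X, y, c1, c2):
    """Two-way cluster-robust variance: V_1 + V_2 - V_(1 and 2)."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    # intersection clusters: one label per unique (c1, c2) pair
    _, c12 = np.unique(np.column_stack([c1, c2]), axis=0, return_inverse=True)
    c12 = np.ravel(c12)
    V = (one_way_vcov(X, resid, c1)
         + one_way_vcov(X, resid, c2)
         - one_way_vcov(X, resid, c12))
    return beta, V
```

Because the indicator for "share any cluster" equals the sum of the two one-way indicators minus the indicator for sharing both, this matrix coincides with the double-sum form of $\hat{B}$ given in section 5.1. In small samples the result is not guaranteed to be positive semi-definite.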
This method relies on asymptotics in the number of clusters of the dimension with the fewest clusters. This method is thus most appropriate when each dimension has many clusters. Theory for two-way cluster robust estimates of the variance matrix is presented in Cameron, Gelbach, and Miller (2006, 2010), Miglioretti and Heagerty (2006), and Thompson (2006). Early empirical applications that independently proposed this method include Acemoglu and Pischke (2003), and Fafchamps and Gubert (2007).
5.2 Spatial correlation
The multi-way robust clustering estimator is closely related to the field of time-series and spatial heteroskedasticity and autocorrelation variance estimation.
In general $\hat{B}$ in (7) has the form $\sum_i \sum_j w(i,j)\, x_i x_j' \hat{u}_i \hat{u}_j$. For multi-way clustering the weight $w(i,j) = 1$ for observations that share a cluster, and $w(i,j) = 0$ otherwise. In White and Domowitz (1984), the weight $w(i,j) = 1$ for observations "close" in time to one another, and $w(i,j) = 0$ for other observations. Conley (1999) considers the case where observations have spatial locations, and has weights $w(i,j)$ decaying to 0 as the distance between observations grows.
A distinguishing feature between these papers and multi-way clustering is that White and
Domowitz (1984) and Conley (1999) use mixing conditions (to ensure decay of dependence) as
observations grow apart in time or distance. These conditions are not applicable to clustering
due to common shocks. Instead the multi-way robust estimator relies on independence of
observations that do not share any clusters in common.
There are several variations to the cluster-robust and spatial or time-series HAC estimators, some of which can be thought of as hybrids of these concepts.
The spatial estimator of Driscoll and Kraay (1998) treats each time period as a cluster, additionally allows observations in different time periods to be correlated for a finite time difference, and assumes $T \to \infty$. The Driscoll-Kraay estimator can be thought of as using weight $w(i,j) = 1 - D(i,j)/(D_{max} + 1)$, where $D(i,j)$ is the time distance between observations i and j, and $D_{max}$ is the maximum time separation allowed to have correlation.
An estimator proposed by Thompson (2006) allows for across-cluster (in his example firm) correlation for observations close in time in addition to within-cluster correlation at any time separation. The Thompson estimator can be thought of as using $w(i,j) = \mathbf{1}[i, j \text{ share a firm, or } D(i,j) \leq D_{max}]$. It seems that other variations are likely possible.
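The weighting schemes discussed in this section differ only in the function $w(i,j)$. The sketch below writes three of them as small Python functions. The function names and argument layout are hypothetical, and setting the Driscoll-Kraay weight to zero beyond $D_{max}$ is an assumption of this sketch, made so that the weight is never negative.

```python
def multiway_weight(clusters_i, clusters_j):
    """1 if observations i and j share a cluster in any dimension, else 0.

    clusters_i and clusters_j are tuples of cluster labels, one per dimension.
    """
    return 1.0 if any(a == b for a, b in zip(clusters_i, clusters_j)) else 0.0

def driscoll_kraay_weight(t_i, t_j, d_max):
    """Weight 1 - D/(d_max + 1), truncated at zero for D > d_max (assumed)."""
    d = abs(t_i - t_j)
    return max(0.0, 1.0 - d / (d_max + 1))

def thompson_weight(firm_i, t_i, firm_j, t_j, d_max):
    """1 if the observations share a firm or are within d_max periods, else 0."""
    return 1.0 if (firm_i == firm_j or abs(t_i - t_j) <= d_max) else 0.0

# Two observations on different firms, two periods apart, with d_max = 2
print(driscoll_kraay_weight(1, 3, 2), thompson_weight("A", 1, "B", 3, 2))
```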
Foote (2007) contrasts the two-way cluster-robust and these other variance matrix estimators in the context of a macroeconomics example. Petersen (2009) contrasts various methods for panel data on financial firms, where there is concern about both within-firm correlation (over time) and across-firm correlation due to common shocks.
6 Feasible GLS
When clustering is present and a correct model for the error correlation is specified, the feasible GLS estimator is more efficient than OLS. Furthermore, in many situations one can obtain a cluster-robust version of the standard errors for the FGLS estimator, to guard against misspecification of the model for the error correlation. Many applied studies nonetheless use the OLS estimator, despite the potential expense of efficiency loss in estimation.
6.1 FGLS and cluster-robust inference
Suppose we specify a model for $\Omega_g = E[u_g u_g' | X_g]$, such as within-cluster equicorrelation. Then the GLS estimator is $(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$, where $\Omega = \mathrm{Diag}[\Omega_g]$. Given a consistent estimate $\hat{\Omega}$ of $\Omega$, the feasible GLS estimator of $\beta$ is
$$\hat{\beta}_{FGLS} = \left(\sum_{g=1}^{G} X_g'\hat{\Omega}_g^{-1}X_g\right)^{-1}\sum_{g=1}^{G} X_g'\hat{\Omega}_g^{-1}y_g. \tag{12}$$
The default estimate of the variance matrix of the FGLS estimator, $(X'\hat{\Omega}^{-1}X)^{-1}$, is correct under the restrictive assumption that $E[u_g u_g' | X_g] = \Omega_g$.
The cluster-robust estimate of the asymptotic variance matrix of the FGLS estimator is
$$\hat{V}[\hat{\beta}_{FGLS}] = \left(X'\hat{\Omega}^{-1}X\right)^{-1}\left(\sum_{g=1}^{G} X_g'\hat{\Omega}_g^{-1}\hat{u}_g\hat{u}_g'\hat{\Omega}_g^{-1}X_g\right)\left(X'\hat{\Omega}^{-1}X\right)^{-1}, \tag{13}$$
where $\hat{u}_g = y_g - X_g\hat{\beta}_{FGLS}$. This estimator requires that $u_g$ and $u_h$ are uncorrelated, for $g \neq h$, but permits $E[u_g u_g' | X_g] \neq \Omega_g$. In that case the FGLS estimator is no longer guaranteed to be more efficient than the OLS estimator, but it would be a poor choice of model for $\Omega_g$ that led to FGLS being less efficient.
Not all econometrics packages compute this cluster-robust estimate. In that case one can use a pairs cluster bootstrap (without asymptotic refinement). Specifically B times form G clusters $\{(y_1^*, X_1^*), \ldots, (y_G^*, X_G^*)\}$ by resampling with replacement G times from the original sample of clusters, each time compute the FGLS estimator, and then compute the variance of the B FGLS estimates $\hat{\beta}_1, \ldots, \hat{\beta}_B$ as $\hat{V}_{boot}[\hat{\beta}] = (B-1)^{-1}\sum_{b=1}^{B}(\hat{\beta}_b - \bar{\hat{\beta}})(\hat{\beta}_b - \bar{\hat{\beta}})'$, where $\bar{\hat{\beta}}$ is the average of the B estimates. Care is needed, however, if the model includes cluster-specific fixed effects; see, for example, Cameron and Trivedi (2009, p.421).
6.2 Efficiency gains of feasible GLS
Given a correct model for the within-cluster correlation of the error, such as equicorrelation, the feasible GLS estimator is more efficient than OLS. The efficiency gains of FGLS need not necessarily be great. For example, if the within-cluster correlation of all regressors is unity (so $x_{ig} = x_g$) and $\bar{u}_g$ defined in section 2.3 is homoskedastic, then FGLS is equivalent to OLS so there is no gain to FGLS.
For equicorrelated errors and general X, Scott and Holt (1982) provide an upper bound to the maximum proportionate efficiency loss of OLS compared to the variance of the FGLS estimator of
$$1\Big/\left[1 + \frac{4(1-\rho_u)[1 + (N_{max}-1)\rho_u]}{(N_{max}\,\rho_u)^2}\right], \qquad N_{max} = \max\{N_1, \ldots, N_G\}.$$
This upper bound is increasing in the error correlation $\rho_u$ and the maximum cluster size $N_{max}$. For low $\rho_u$ the maximal efficiency gain can be low. For example, Scott and Holt (1982) note that for $\rho_u = 0.05$ and $N_{max} = 20$ there is at most a 12% efficiency loss of OLS compared to FGLS. But for $\rho_u = 0.2$ and $N_{max} = 50$ the efficiency loss could be as much as 74%, though this depends on the nature of X.
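The two numerical examples quoted from Scott and Holt (1982) can be reproduced directly from the bound. A minimal sketch, with a hypothetical function name:

```python
def scott_holt_bound(rho_u, n_max):
    """Upper bound on the proportionate efficiency loss of OLS relative to FGLS
    under equicorrelated errors, as a function of the error correlation rho_u
    and the maximum cluster size n_max."""
    ratio = 4 * (1 - rho_u) * (1 + (n_max - 1) * rho_u) / (n_max * rho_u) ** 2
    return 1 / (1 + ratio)

for rho_u, n_max in [(0.05, 20), (0.2, 50)]:
    print(rho_u, n_max, round(scott_holt_bound(rho_u, n_max), 2))
```

Evaluating the function on a grid of correlations and cluster sizes shows how quickly the bound rises with either argument.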
6.3 Random effects model
The one-way random effects (RE) model is given by (1) with $u_{ig} = \alpha_g + \varepsilon_{ig}$, where $\alpha_g$ and $\varepsilon_{ig}$ are i.i.d. error components; see section 2.2. Some algebra shows that the FGLS estimator in (12) can be computed by OLS estimation of $(y_{ig} - \hat{\theta}_g\bar{y}_g)$ on $(x_{ig} - \hat{\theta}_g\bar{x}_g)$ where $\hat{\theta}_g = 1 - \hat{\sigma}_\varepsilon\big/\sqrt{\hat{\sigma}_\varepsilon^2 + N_g\hat{\sigma}_\alpha^2}$. Applying the cluster-robust variance matrix formula (7) for OLS in this transformed model yields (13) for the FGLS estimator.
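The quasi-demeaning transformation can be illustrated for a single cluster as follows. The function names are hypothetical, and the variance components are taken as given rather than estimated.

```python
import math

def re_theta(sigma2_eps, sigma2_alpha, n_g):
    """Quasi-demeaning weight 1 - sigma_eps / sqrt(sigma_eps^2 + n_g * sigma_alpha^2)."""
    return 1 - math.sqrt(sigma2_eps) / math.sqrt(sigma2_eps + n_g * sigma2_alpha)

def quasi_demean(values, theta):
    """Subtract theta times the cluster mean from each observation in one cluster."""
    mean = sum(values) / len(values)
    return [v - theta * mean for v in values]

theta = re_theta(sigma2_eps=1.0, sigma2_alpha=1.0, n_g=3)
print(theta, quasi_demean([1.0, 2.0, 3.0], theta))
```

With no cluster effect the weight is zero and the transformed regression is pooled OLS; as the cluster size or the cluster-effect variance grows the weight approaches one, which corresponds to full within-cluster demeaning.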
The RE model can be extended to multi-way clustering, though FGLS estimation is then more complicated. In the two-way case, $y_{igh} = x_{igh}'\beta + \alpha_g + \delta_h + \varepsilon_{igh}$. For example, Moulton (1986) considered clustering due to grouping of regressors (schooling, age and weeks worked) in a log earnings regression. In his model he allowed for a common random shock for each year of schooling, for each year of age, and for each number of weeks worked. Davis (2002) modelled film attendance data clustered by film, theater and time. Cameron and Golotvina (2005) modelled trade between country-pairs. These multi-way papers compute the variance matrix assuming $\Omega$ is correctly specified.
6.4 Hierarchical linear models
The one-way random effects model can be viewed as permitting the intercept to vary randomly across clusters. The hierarchical linear model (HLM) additionally permits the slope coefficients to vary. Specifically
$$y_{ig} = x_{ig}'\beta_g + u_{ig}, \tag{14}$$
where the first component of $x_{ig}$ is an intercept. A concrete example is to consider data on students within schools. Then $y_{ig}$ is an outcome measure such as test score for the ith student in the gth school. In a two-level model the kth component of $\beta_g$ is modelled as $\beta_{kg} = w_{kg}'\gamma_k + v_{kg}$, where $w_{kg}$ is a vector of school characteristics. Then stacking over all K components of $\beta$ we have
$$\beta_g = W_g\gamma + v_g, \tag{15}$$
where $W_g = \mathrm{Diag}[w_{kg}]$ and usually the first component of $w_{kg}$ is an intercept.
The random effects model is the special case $\beta_g = (\beta_{1g}, \beta_{2g})$ where $\beta_{1g} = 1 \times \gamma_1 + v_{1g}$ and $\beta_{kg} = \gamma_k + 0$ for $k > 1$, so $v_{1g}$ is the random effects model's $\alpha_g$. The HLM model additionally allows for random slopes $\beta_{2g}$ that may or may not vary with level-two observables $w_{kg}$. Further levels are possible, such as schools nested in school districts.
The HLM model can be re-expressed as a mixed linear model, since substituting (15) into (14) yields
$$y_{ig} = (x_{ig}'W_g)\gamma + x_{ig}'v_g + u_{ig}. \tag{16}$$
The goal is to estimate the regression parameter $\gamma$ and the variances and covariances of the errors $u_{ig}$ and $v_g$. Estimation is by maximum likelihood assuming the errors $v_g$ and $u_{ig}$ are normally distributed. Note that the pooled OLS estimator of $\gamma$ is consistent but is less efficient.
HLM programs assume that (15) correctly specifies the within-cluster correlation. One can instead robustify the standard errors by using formulae analogous to (13), or by the cluster bootstrap.
6.5 Serially correlated errors models for panel data
If $N_g$ is small, the clusters are balanced, and it is assumed that $\Omega_g$ is the same for all g, say $\Omega_g = \Omega$, then the FGLS estimator in (12) can be used without need to specify a model for $\Omega$. Instead we can let $\hat{\Omega}$ have ijth entry $G^{-1}\sum_{g=1}^{G}\hat{u}_{ig}\hat{u}_{jg}$, where $\hat{u}_{ig}$ are the residuals from initial OLS estimation.
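For balanced clusters this unrestricted estimate of $\Omega$, and the FGLS estimator (12) built on it, can be sketched as follows. The function names are hypothetical, the data are assumed to be sorted cluster by cluster with T observations per cluster, and the sketch requires G to be at least T so that the estimated matrix can be inverted.

```python
import numpy as np

def unrestricted_omega(resid, G, T):
    """T x T matrix whose (i, j) entry is the average over clusters of u_ig * u_jg."""
    U = np.asarray(resid, dtype=float).reshape(G, T)  # row g holds the residuals of cluster g
    return U.T @ U / G

def fgls_common_omega(X, y, G, T):
    """FGLS with a common, unrestricted within-cluster covariance matrix."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    omega_inv = np.linalg.inv(unrestricted_omega(y - X @ beta_ols, G, T))
    K = X.shape[1]
    A, b = np.zeros((K, K)), np.zeros(K)
    for g in range(G):
        Xg, yg = X[g * T:(g + 1) * T], y[g * T:(g + 1) * T]
        A += Xg.T @ omega_inv @ Xg
        b += Xg.T @ omega_inv @ yg
    return np.linalg.solve(A, b)
```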
This procedure was proposed for short panels by Kiefer (1980). It is appropriate in this
context under the assumption that variances and autocovariances of the errors are constant
across individuals. While this assumption is restrictive, it is less restrictive than, for example,
the AR(1) error assumption given in section 2.3.
In practice two complications can arise with panel data. First, there are $T(T-1)/2$ off-diagonal elements to estimate and this number can be large relative to the number of observations $NT$. Second, if an individual-specific fixed effects panel model is estimated, then the fixed effects lead to an incidental parameters bias in estimating the off-diagonal covariances. This is the case for differences-in-differences models, yet FGLS estimation is desirable as it is more efficient than OLS. Hausman and Kuersteiner (2008) present fixes for both complications, including adjustment to Wald test critical values by using a higher-order Edgeworth expansion that takes account of the uncertainty in estimating the within-state covariance of the errors.
A more commonly-used model specifies an AR(p) model for the errors. This has the advantage over the preceding method of having many fewer parameters to estimate in $\Omega$, though it is a more restrictive model. Of course, one can robustify using (13). If fixed effects are present, however, then there is again a bias (of order $N_g^{-1}$) in estimation of the AR(p) coefficients due to the presence of fixed effects. Hansen (2007b) obtains bias-corrected estimates of the AR(p) coefficients and uses these in FGLS estimation.
Other models for the errors have also been proposed. For example, if clusters are large, we can allow correlation parameters to vary across clusters.
7 Nonlinear and instrumental variables estimators
Relatively few econometrics papers consider extension of the complications discussed in this paper to nonlinear models; a notable exception is Wooldridge (2006).
7.1 Population-averaged models
The simplest approach to clustering in nonlinear models is to estimate the same model as would be estimated in the absence of clustering, but then base inference on cluster-robust standard errors that control for any clustering. This approach requires the assumption that the estimator remains consistent in the presence of clustering.
For commonly-used estimators that rely on correct specification of the conditional mean, such as logit, probit and Poisson, one continues to assume that $E[y_{ig} | x_{ig}]$ is correctly specified. The model is estimated ignoring any clustering, but then sandwich standard errors that control for clustering are computed. This pooled approach is called a population-averaged approach because rather than introduce a cluster effect $\alpha_g$ and model $E[y_{ig} | x_{ig}, \alpha_g]$, see section 7.2, we directly model $E[y_{ig} | x_{ig}] = E_{\alpha_g}[E[y_{ig} | x_{ig}, \alpha_g]]$ so that $\alpha_g$ has been averaged out.
This essentially extends pooled OLS to, for example, pooled probit. Efficiency gains analogous to feasible GLS are possible for nonlinear models if one additionally specifies a reasonable model for the within-cluster correlation.
The generalized estimating equations (GEE) approach, due to Liang and Zeger (1986), introduces within-cluster correlation into the class of generalized linear models (GLM). A conditional mean function is specified, with $E[y_{ig} | x_{ig}] = m(x_{ig}'\beta)$, so that for the gth cluster
$$E[y_g | X_g] = m_g(\beta), \tag{17}$$
where $m_g(\beta) = [m(x_{1g}'\beta), \ldots, m(x_{N_g g}'\beta)]'$ and $X_g = [x_{1g}, \ldots, x_{N_g g}]'$. A model for the variances and covariances is also specified. First, given the variance model $V[y_{ig} | x_{ig}] = \phi h(m(x_{ig}'\beta))$ where $\phi$ is an additional scale parameter to estimate, we form $H_g(\beta) = \mathrm{Diag}[\phi h(m(x_{ig}'\beta))]$, a diagonal matrix with the variances as entries. Second, a correlation matrix $R(\alpha)$ is specified with ijth entry $\mathrm{Cor}[y_{ig}, y_{jg} | X_g]$, where $\alpha$ are additional parameters to estimate. Then the within-cluster covariance matrix is
$$\Omega_g = V[y_g | X_g] = H_g(\beta)^{1/2}R(\alpha)H_g(\beta)^{1/2}. \tag{18}$$
$R(\alpha) = I$ if there is no within-cluster correlation, and $R(\alpha) = R(\rho)$ has diagonal entries 1 and off-diagonal entries $\rho$ in the case of equicorrelation. The resulting GEE estimator $\hat{\beta}_{GEE}$ solves
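$$\sum_{g=1}^{G}\frac{\partial m_g'(\beta)}{\partial\beta}\hat{\Omega}_g^{-1}(y_g - m_g(\beta)) = 0, \tag{19}$$
where $\hat{\Omega}_g$ equals $\Omega_g$ in (18) with $R(\alpha)$ replaced by $R(\hat{\alpha})$ where $\hat{\alpha}$ is consistent for $\alpha$. The cluster-robust estimate of the asymptotic variance matrix of the GEE estimator is
$$\hat{V}[\hat{\beta}_{GEE}] = \left(\hat{D}'\hat{\Omega}^{-1}\hat{D}\right)^{-1}\left(\sum_{g=1}^{G}\hat{D}_g'\hat{\Omega}_g^{-1}\hat{u}_g\hat{u}_g'\hat{\Omega}_g^{-1}\hat{D}_g\right)\left(\hat{D}'\hat{\Omega}^{-1}\hat{D}\right)^{-1}, \tag{20}$$
where $\hat{D}_g = \partial m_g'(\beta)/\partial\beta\,|_{\hat{\beta}}$, $\hat{D} = [\hat{D}_1, \ldots, \hat{D}_G]'$, $\hat{u}_g = y_g - m_g(\hat{\beta})$, and now $\hat{\Omega}_g = H_g(\hat{\beta})^{1/2}R(\hat{\alpha})H_g(\hat{\beta})^{1/2}$.
The asymptotic theory requires that $G \to \infty$.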
The result (20) is a direct analog of the cluster-robust estimate of the variance matrix for FGLS. Consistency of the GEE estimator requires that (17) holds, i.e. correct specification of the conditional mean (even in the presence of clustering). The variance matrix defined in (18) permits heteroskedasticity and correlation. It is called a "working" variance matrix as subsequent inference based on (20) is robust to misspecification of (18). If (18) is assumed to be correctly specified then the asymptotic variance matrix is more simply $(\hat{D}'\hat{\Omega}^{-1}\hat{D})^{-1}$.
For likelihood-based models outside the GLM class, a common procedure is to perform ML estimation under the assumption of independence over i and g, and then obtain cluster-robust standard errors that control for within-cluster correlation. Let $f(y_{ig} | x_{ig}, \theta)$ denote the density, $s_{ig}(\theta) = \partial \ln f(y_{ig} | x_{ig}, \theta)/\partial\theta$, and $s_g(\theta) = \sum_i s_{ig}(\theta)$. Then the MLE of $\theta$ solves $\sum_g \sum_i s_{ig}(\theta) = \sum_g s_g(\theta) = 0$. A cluster-robust estimate of the variance matrix is
$$\hat{V}[\hat{\theta}_{ML}] = \left(\sum_g \partial s_g(\theta)/\partial\theta'\,|_{\hat{\theta}}\right)^{-1}\left(\sum_g s_g(\hat{\theta})s_g(\hat{\theta})'\right)\left(\sum_g \partial s_g(\theta)/\partial\theta'\,|_{\hat{\theta}}\right)^{-1}. \tag{21}$$
This method generally requires that $f(y_{ig} | x_{ig}, \theta)$ is correctly specified even in the presence of clustering.
In the case of a (mis)specified density that is in the linear exponential family, as in GLM estimation, the MLE retains its consistency under the weaker assumption that the conditional mean $E[y_{ig} | x_{ig}, \theta]$ is correctly specified. In that case the GEE estimator defined in (19) additionally permits incorporation of a model for the correlation induced by the clustering.
7.2 Cluster-specific effects models
An alternative approach to controlling for clustering is to introduce a group-specific effect. For conditional mean models the population-averaged assumption that $E[y_{ig} | x_{ig}] = m(x_{ig}'\beta)$ is replaced by
$$E[y_{ig} | x_{ig}, \alpha_g] = g(x_{ig}'\beta + \alpha_g), \tag{22}$$
where $\alpha_g$ is not observed. The presence of $\alpha_g$ will induce correlation between $y_{ig}$ and $y_{jg}$, $i \neq j$. Similarly, for parametric models the density specified for a single observation is $f(y_{ig} | x_{ig}, \beta, \alpha_g)$ rather than the population-averaged $f(y_{ig} | x_{ig}, \theta)$.
In a fixed effects model the $\alpha_g$ are parameters to be estimated. If asymptotics are that $N_g$ is fixed while $G \to \infty$ then there is an incidental parameters problem, as there are G parameters $\alpha_1, \ldots, \alpha_G$ to estimate and $G \to \infty$. In general this contaminates estimation of $\beta$ so that $\hat{\beta}$ is inconsistent. Notable exceptions where it is still possible to consistently estimate $\beta$ are the linear regression model, the logit model, the Poisson model, and a nonlinear regression model with additive error (so (22) is replaced by $E[y_{ig} | x_{ig}, \alpha_g] = g(x_{ig}'\beta) + \alpha_g$). For these models, aside from the logit, one can additionally compute cluster-robust standard errors after fixed effects estimation.
We focus on the more commonly-used random effects model that specifies $\alpha_g$ to have density $h(\alpha_g | \eta)$ and consider estimation of likelihood-based models. Conditional on $\alpha_g$, the joint density for the gth cluster is $f(y_{1g}, \ldots, y_{N_g g} | x_{1g}, \ldots, x_{N_g g}, \beta, \alpha_g) = \prod_{i=1}^{N_g} f(y_{ig} | x_{ig}, \beta, \alpha_g)$. We then integrate out $\alpha_g$ to obtain the likelihood function
$$L(\beta, \eta | y, X) = \prod_{g=1}^{G}\left\{\int \prod_{i=1}^{N_g} f(y_{ig} | x_{ig}, \beta, \alpha_g)\,dh(\alpha_g | \eta)\right\}. \tag{23}$$
In some special nonlinear models, such as a Poisson model with $\alpha_g$ being gamma distributed, it is possible to obtain a closed-form solution for the integral. More generally this is not the case, but numerical methods work well as (23) is just a one-dimensional integral. The usual assumption is that $\alpha_g$ is distributed as $N[0, \sigma_\alpha^2]$. The MLE is very fragile and failure of any assumption in a nonlinear model leads to inconsistent estimation of $\beta$.
The population-averaged and random effects models differ for nonlinear models, so that $\beta$ is not comparable across the models. But the resulting average marginal effects, that integrate out $\alpha_g$ in the case of a random effects model, may be similar. A leading example is the probit model. Then $E[y_{ig} | x_{ig}, \alpha_g] = \Phi(x_{ig}'\beta + \alpha_g)$, where $\Phi(\cdot)$ is the standard normal c.d.f. Letting $f(\alpha_g)$ denote the $N[0, \sigma_\alpha^2]$ density for $\alpha_g$, we obtain $E[y_{ig} | x_{ig}] = \int \Phi(x_{ig}'\beta + \alpha_g)f(\alpha_g)\,d\alpha_g = \Phi(x_{ig}'\beta/\sqrt{1 + \sigma_\alpha^2})$; see Wooldridge (2002, p.470). This differs from $E[y_{ig} | x_{ig}] = \Phi(x_{ig}'\beta)$ for the pooled or population-averaged probit model. The difference is the scale factor $\sqrt{1 + \sigma_\alpha^2}$. However, the marginal effects are similarly rescaled, since $\partial \Pr[y_{ig} = 1 | x_{ig}]/\partial x_{ig} = \phi(x_{ig}'\beta/\sqrt{1 + \sigma_\alpha^2}) \times \beta/\sqrt{1 + \sigma_\alpha^2}$, so in this case PA probit and random effects probit will yield similar estimates of the average marginal effects; see Wooldridge (2002, 2006).
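The closed-form result for the probit case can be checked by integrating out the random effect numerically. The sketch below uses only the standard library; the function names, the integration width of eight standard deviations and the illustrative values of $x'\beta$ and $\sigma_\alpha$ are all hypothetical choices.

```python
import math

def Phi(z):
    """Standard normal c.d.f."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def normal_pdf(a, sigma):
    """Density of N[0, sigma^2] at a."""
    return math.exp(-0.5 * (a / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def averaged_probit_prob(xb, sigma_alpha, steps=4000, width=8.0):
    """Trapezoid-rule approximation to the integral of Phi(xb + a) f(a) da."""
    lo = -width * sigma_alpha
    h = 2 * width * sigma_alpha / steps
    total = 0.0
    for k in range(steps + 1):
        a = lo + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * Phi(xb + a) * normal_pdf(a, sigma_alpha)
    return total * h

xb, sigma_alpha = 0.7, 1.5
numeric = averaged_probit_prob(xb, sigma_alpha)
closed_form = Phi(xb / math.sqrt(1 + sigma_alpha ** 2))
print(numeric, closed_form)
```

The two printed values should agree to many decimal places, which is the attenuation by $\sqrt{1 + \sigma_\alpha^2}$ described above.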
7.3 Instrumental variables
The cluster-robust formula is easily adapted to instrumental variables estimation. It is assumed that there exist instruments $z_{ig}$ such that $u_{ig} = y_{ig} - x_{ig}'\beta$ satisfies $E[u_{ig} | z_{ig}] = 0$. If there is within-cluster correlation we assume that this condition still holds, but now $\mathrm{Cov}[u_{ig}, u_{jg} | z_{ig}, z_{jg}] \neq 0$.
Shore-Sheppard (1996) examines the impact of equicorrelated instruments and group-specific shocks to the errors. Her model is similar to that of Moulton, applied to an IV setting. She shows that IV estimation that does not model the correlation will understate the standard errors, and proposes either cluster-robust standard errors or FGLS.
Hoxby and Paserman (1998) examine the validity of overidentification (OID) tests with equicorrelated instruments. They show that not accounting for within-group correlation can lead to mistaken OID tests, and they give a cluster-robust OID test statistic. This is the GMM criterion function with a weighting matrix based on cluster summation.
A recent series of developments in applied econometrics deals with the complication of weak instruments that lead to poor finite-sample performance of inference based on asymptotic theory, even when sample sizes are quite large; see for example the survey by Andrews and Stock (2007), and Cameron and Trivedi (2005, 2009). The literature considers only the non-clustered case, but the problem is clearly relevant also for cluster-robust inference. Most papers consider only the case of i.i.d. errors. An exception is Chernozhukov and Hansen (2008) who suggest a method based on testing the significance of the instruments in the reduced form that is heteroskedastic-robust. Their tests are directly amenable to adjustments that allow for clustering; see Finlay and Magnusson (2009).
7.4 GMM
Finally we consider generalized method of moments (GMM) estimation.
Suppose that we combine moment conditions for the gth cluster, so $E[h_g(w_g, \theta)] = 0$ where $w_g$ denotes all variables in the cluster. Then the GMM estimator $\hat{\theta}_{GMM}$ with weighting matrix W minimizes $\left(\sum_g h_g\right)'W\left(\sum_g h_g\right)$, where $h_g = h_g(w_g, \theta)$. Using standard results in, for example, Cameron and Trivedi (2005, p.175) or Wooldridge (2002, p.423), the variance matrix estimate is
$$\hat{V}[\hat{\theta}_{GMM}] = \left(\hat{A}'W\hat{A}\right)^{-1}\hat{A}'W\hat{B}W\hat{A}\left(\hat{A}'W\hat{A}\right)^{-1},$$
where $\hat{A} = \sum_g \partial h_g/\partial\theta'\,|_{\hat{\theta}}$ and a cluster-robust variance matrix estimate uses $\hat{B} = \sum_g \hat{h}_g\hat{h}_g'$. This assumes independence across clusters and $G \to \infty$. Bhattacharya (2005) considers stratification in addition to clustering for the GMM estimator.
Again a key assumption is that the estimator remains consistent even in the presence of clustering. For GMM this means that we need to assume that the moment condition holds true even when there is within-cluster correlation. The reasonableness of this assumption will vary with the particular model and application at hand.
8 Empirical Example
To illustrate some empirical issues related to clustering, we present an application based on a simplified version of the model in Hersch (1998), who examined the relationship between wages and job injury rates. We thank Joni Hersch for sharing her data with us. Job injury rates are observed only at occupation levels and industry levels, inducing clustering at these levels. In this application we have individual-level data from the Current Population Survey on 5,960 male workers working in 362 occupations and 211 industries. For most of our analysis we focus on the occupation injury rate coefficient.
In column 1 of Table 1, we present results from linear regression of log wages on occupation and industry injury rates, potential experience and its square, years of schooling,
and indicator variables for union, nonwhite, and 3 regions. The first three rows show that
standard errors of the OLS estimate increase as we move from default (row 1) to White
heteroskedastic-robust (row 2) to cluster-robust with clustering on occupation (row 3). A
priori heteroskedastic-robust standard errors may be larger or smaller than the default. The
clustered standard errors are expected to be larger. Using formula (4) yields inflation factor
$\sqrt{1 + 1 \times 0.207 \times (5960/362 - 1)} = 2.05$, as the within-cluster correlation of model residuals
is 0.207, compared to an actual inflation of $0.516/0.188 = 2.74$.
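The arithmetic in this comparison can be checked in a few lines. The sketch below assumes that formula (4) has the form $\sqrt{1 + \rho_x \rho_u (\bar{N}_g - 1)}$, which is what the numbers quoted in the text imply; the function name `inflation_factor` is a label chosen for this example.

```python
import math


def inflation_factor(rho_x, rho_u, avg_cluster_size):
    """Approximate standard-error inflation from clustering (assumed form of formula (4))."""
    return math.sqrt(1.0 + rho_x * rho_u * (avg_cluster_size - 1.0))


# The occupation injury rate is constant within occupation, so rho_x = 1.
predicted = inflation_factor(rho_x=1.0, rho_u=0.207, avg_cluster_size=5960 / 362)
# Ratio of the cluster-robust to the default standard error (Table 1, column 1).
actual = 0.516 / 0.188

print(round(predicted, 2), round(actual, 2))  # 2.05 2.74
```

The gap between the two numbers shows that the formula is only an approximation here; it uses the average cluster size, while the 362 occupations differ in size.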
Column 2 of Table 1 illustrates analysis with few clusters, when analysis is restricted to
the 1,594 individuals who work in the ten most common occupations in the dataset. From
rows 1-3 the standard errors increase, due to fewer observations, and the variance inflation
factor is larger due to a larger average group size, as suggested by formula (4). Our concern
is that with $G = 10$ the usual asymptotic theory requires some adjustment. The Wald two-sided test statistic for a zero coefficient on occupation injury rate is $-2.751/0.994 = -2.77$.
Rows 4-6 of column 2 report the associated p-value computed in three ways. First, $p = 0.006$
using standard normal critical values (or the T with $N - K = 1584$ degrees of freedom).
Second, $p = 0.022$ using a T-distribution based on $G - 1 = 9$ degrees of freedom. Third,
when we perform a pairs cluster percentile-T bootstrap, the p-value increases to 0.110. These
changes illustrate the importance of adjusting for few clusters in conducting inference. The
large increase in p-value with the bootstrap may in part be because the first two p-values
are based on cluster-robust standard errors with finite-sample bias; see section 4.1. This may
also explain why the RE model standard errors in rows 8-10 of column 2 exceed the OLS
cluster-robust standard error in row 3 of column 2.
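The first two p-values can be reproduced from the reported coefficient and standard error alone. The sketch below uses only the Python standard library: the normal p-value comes from `math.erfc`, and the Student-t p-value from a Simpson's-rule integration of the t density. The helper names are hypothetical, and the bootstrap p-value in row 6 is not reproduced because it requires the underlying data.

```python
import math


def normal_two_sided_p(t):
    """Two-sided p-value under N(0, 1): 2 * (1 - Phi(|t|))."""
    return math.erfc(abs(t) / math.sqrt(2.0))


def student_t_pdf(x, df):
    """Density of the Student-t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2.0) / (math.sqrt(df * math.pi) * math.gamma(df / 2.0))
    return c * (1.0 + x * x / df) ** (-(df + 1) / 2.0)


def student_t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value under T(df), integrating the density on [0, |t|]
    with Simpson's rule (steps must be even)."""
    t = abs(t)
    h = t / steps
    total = student_t_pdf(0.0, df) + student_t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * student_t_pdf(i * h, df)
    central_mass = 2.0 * total * h / 3.0  # P(-|t| < T < |t|), by symmetry
    return 1.0 - central_mass


wald = 2.751 / 0.994                         # absolute value of the Wald statistic
p_normal = normal_two_sided_p(wald)          # row 4 of Table 1, column 2
p_t9 = student_t_two_sided_p(wald, df=9)     # row 5: T with G - 1 = 9 degrees of freedom

print(round(wald, 2), round(p_normal, 3), round(p_t9, 3))  # 2.77 0.006 0.022
```

Moving from normal to T(9) critical values changes only the reference distribution; the test statistic, and therefore any bias in the cluster-robust standard error, is unchanged.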
We next consider multi-way clustering. Since both occupation-level and industry-level
regressors are included we should compute two-way cluster-robust standard errors. Comparing row 7 of column 1 to row 3, the standard error of the occupation injury rate coefficient
changes little from 0.516 to 0.515. But there is a big impact for the coefficient of the industry
injury rate. In results not reported in the table, the standard error of the industry injury
rate coefficient increases from 0.563 when we cluster on only occupation to 1.015 when we
cluster on both occupation and industry.
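For reference, the two-way cluster-robust variance estimate of Cameron, Gelbach and Miller (2006) combines three one-way cluster-robust estimates. The display below is a sketch of that construction for this application; the subscripts "occ" and "ind" are labels chosen here rather than the paper's notation.

```latex
\widehat{V}_{\text{two-way}}[\hat{\beta}]
  = \widehat{V}_{\text{occ}}[\hat{\beta}]
  + \widehat{V}_{\text{ind}}[\hat{\beta}]
  - \widehat{V}_{\text{occ} \cap \text{ind}}[\hat{\beta}]
```

Here the first two terms cluster on occupation and on industry respectively, and the third clusters on their intersection (each occupation-industry pair), which removes the within-pair contributions that the first two terms would otherwise count twice.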
If the clustering within occupations is due to common occupation-specific shocks, then
a random effects (RE) model may provide more efficient parameter estimates. From row
8 of column 1 the default RE standard error is 0.357, but if we cluster on occupation this
increases to 0.536 (row 10). For these data there is apparently no gain compared to OLS
(see row 3).
Finally we consider a nonlinear example, probit regression with the same data and regressors, except the dependent variable is now a binary outcome equal to one if the hourly
wage exceeds twelve dollars. The results given in column 3 are qualitatively similar to those
in column 1. Cluster-robust standard errors are 2-3 times larger, and two-way cluster-robust
standard errors are slightly larger still. The parameters of the random effects probit model are rescalings
of those of the standard probit model, as explained in section 7.2. The rescaled coefficient
is $-5.119$, as $\hat{\alpha}_g$ has estimated variance 0.279. This is smaller in magnitude than the probit coefficient,
though this difference may just reflect noise in estimation.
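The rescaling can be checked against Table 1. The sketch below assumes the usual relationship for a random effects probit with a normally distributed cluster effect, in which the RE coefficient is divided by $\sqrt{1 + \sigma_\alpha^2}$ to make it comparable with the pooled probit coefficient. Section 7.2 is not reproduced here, so this form is an assumption, although it does recover the magnitude quoted in the text.

```python
import math

re_probit_coef = -5.789   # random effects probit coefficient (Table 1, column 3)
var_alpha = 0.279         # estimated variance of the cluster effect
pooled_probit_coef = -6.978

# Assumed rescaling: divide by sqrt(1 + variance of the random effect).
rescaled = re_probit_coef / math.sqrt(1.0 + var_alpha)

print(round(rescaled, 3))  # -5.119
```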
9 Conclusion
Cluster-robust inference is possible in a wide range of settings. The basic methods were
proposed in the 1980s, but are not yet fully incorporated into applied econometrics,
especially for estimators other than OLS. Useful references on cluster-robust inference for the
practitioner include the surveys by Wooldridge (2003, 2006), the texts by Wooldridge (2002)
and Cameron and Trivedi (2005) and, for implementation in Stata, Nichols and Schaffer
(2007) and Cameron and Trivedi (2009).
10 References
Acemoglu, D., and J.-S. Pischke (2003), "Minimum Wages and On-the-job Training," Research in Labor Economics, 22, 159-202.
Andrews, D.W.K., and J.H. Stock (2007), "Inference with Weak Instruments," in R. Blundell, W.K. Newey, and T. Persson, eds., Advances in Economics and Econometrics, Theory and Applications: Ninth World Congress of the Econometric Society, Vol. III, Ch. 3, Cambridge, Cambridge University Press.
Angrist, J.D., and V. Lavy (2002), "The Effect of High School Matriculation Awards: Evidence from Randomized Trials," NBER Working Paper No. 9389.
Arellano, M. (1987), "Computing Robust Standard Errors for Within-Group Estimators," Oxford Bulletin of Economics and Statistics, 49, 431-434.
Bell, R.M., and D.F. McCaffrey (2002), "Bias Reduction in Standard Errors for Linear Regression with Multi-Stage Samples," Survey Methodology, 169-179.
Bertrand, M., E. Duflo, and S. Mullainathan (2004), "How Much Should We Trust Differences-in-Differences Estimates?," Quarterly Journal of Economics, 119, 249-275.
Bhattacharya, D. (2005), "Asymptotic Inference from Multi-stage Samples," Journal of Econometrics, 126, 145-171.
Cameron, A.C., Gelbach, J.G., and D.L. Miller (2006), "Robust Inference with Multi-Way Clustering," NBER Technical Working Paper 0327.
Cameron, A.C., Gelbach, J.G., and D.L. Miller (2010), "Robust Inference with Multi-Way Clustering," Journal of Business and Economic Statistics, forthcoming.
Cameron, A.C., Gelbach, J.G., and D.L. Miller (2008), "Bootstrap-Based Improvements for Inference with Clustered Errors," Review of Economics and Statistics, 90, 414-427.
Cameron, A.C., and N. Golotvina (2005), "Estimation of Country-Pair Data Models Controlling for Clustered Errors: with International Trade Applications," U.C.-Davis Economics Department Working Paper No. 06-13.
Cameron, A.C., and P.K. Trivedi (2005), Microeconometrics: Methods and Applications, Cambridge, Cambridge University Press.
Cameron, A.C., and P.K. Trivedi (2009), Microeconometrics using Stata, College Station, TX, Stata Press.
Chernozhukov, V., and C. Hansen (2008), "The Reduced Form: A Simple Approach to Inference with Weak Instruments," Economics Letters, 100, 68-71.
Conley, T.G. (1999), "GMM Estimation with Cross Sectional Dependence," Journal of Econometrics, 92, 1-45.
Conley, T.G., and C. Taber (2010), "Inference with 'Difference in Differences' with a Small Number of Policy Changes," Review of Economics and Statistics, forthcoming.
Davis, P. (2002), "Estimating Multi-Way Error Components Models with Unbalanced Data Structures," Journal of Econometrics, 106, 67-95.
Donald, S.G., and K. Lang (2007), "Inference with Difference-in-Differences and Other Panel Data," The Review of Economics and Statistics, 89(2), 221-233.
Driscoll, J.C., and A.C. Kraay (1998), "Consistent Covariance Matrix Estimation with Spatially Dependent Panel Data," The Review of Economics and Statistics, 80(4), 549-560.
Fafchamps, M., and F. Gubert (2007), "The Formation of Risk Sharing Networks," Journal of Development Economics, 83, 326-350.
Finlay, K., and L.M. Magnusson (2009), "Implementing Weak Instrument Robust Tests for a General Class of Instrumental-Variables Models," Stata Journal, 9, 398-421.
Foote, C.L. (2007), "Space and Time in Macroeconomic Panel Data: Young Workers and State-Level Unemployment Revisited," Working Paper No. 07-10, Federal Reserve Bank of Boston.
Greenwald, B.C. (1983), "A General Analysis of Bias in the Estimated Standard Errors of Least Squares Coefficients," Journal of Econometrics, 22, 323-338.
Hansen, C. (2007a), "Asymptotic Properties of a Robust Variance Matrix Estimator for Panel Data when T is Large," Journal of Econometrics, 141, 597-620.
Hansen, C. (2007b), "Generalized Least Squares Inference in Panel and Multi-level Models with Serial Correlation and Fixed Effects," Journal of Econometrics, 140, 670-694.
Hausman, J., and G. Kuersteiner (2008), "Difference in Difference Meets Generalized Least Squares: Higher Order Properties of Hypotheses Tests," Journal of Econometrics, 144, 371-391.
Hersch, J. (1998), "Compensating Wage Differentials for Gender-Specific Job Injury Rates," American Economic Review, 88, 598-607.
Hoxby, C., and M.D. Paserman (1998), "Overidentification Tests with Group Data," NBER Technical Working Paper 0223.
Huber, P.J. (1967), "The Behavior of Maximum Likelihood Estimates under Nonstandard Conditions," in Proceedings of the Fifth Berkeley Symposium, J. Neyman (Ed.), 1, 221-233, Berkeley, CA, University of California Press.
Huber, P.J. (1981), Robust Statistics, New York, John Wiley.
Ibragimov, R., and U.K. Müller (2010), "T-Statistic Based Correlation and Heterogeneity Robust Inference," Journal of Business and Economic Statistics, forthcoming.
Kauermann, G., and R.J. Carroll (2001), "A Note on the Efficiency of Sandwich Covariance Matrix Estimation," Journal of the American Statistical Association, 96, 1387-1396.
Kezdi, G. (2004), "Robust Standard Error Estimation in Fixed-Effects Panel Models," Hungarian Statistical Review, Special Number 9, 95-116.
Kiefer, N.M. (1980), "Estimation of fixed effect models for time series of cross-sections with arbitrary intertemporal covariance," Journal of Econometrics, 14, 195-202.
Kish, L. (1965), Survey Sampling, New York, John Wiley.
Kish, L., and M.R. Frankel (1974), "Inference from Complex Surveys with Discussion," Journal of the Royal Statistical Society, Series B, 36, 1-37.
Kloek, T. (1981), "OLS Estimation in a Model where a Microvariable is Explained by Aggregates and Contemporaneous Disturbances are Equicorrelated," Econometrica, 49, 205-07.
Liang, K.-Y., and S.L. Zeger (1986), "Longitudinal Data Analysis Using Generalized Linear Models," Biometrika, 73, 13-22.
MacKinnon, J.G., and H. White (1985), "Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties," Journal of Econometrics, 29, 305-325.
Mancl, L.A., and T.A. DeRouen (2001), "A Covariance Estimator for GEE with Improved Finite-Sample Properties," Biometrics, 57, 126-134.
McCaffrey, D.F., Bell, R.M., and C.H. Botts (2001), "Generalizations of Bias Reduced Linearization," Proceedings of the Survey Research Methods Section, American Statistical Association.
Miglioretti, D.L., and P.J. Heagerty (2006), "Marginal Modeling of Nonnested Multilevel Data using Standard Software," American Journal of Epidemiology, 165(4), 453-463.
Moulton, B.R. (1986), "Random Group Effects and the Precision of Regression Estimates," Journal of Econometrics, 32, 385-397.
Moulton, B.R. (1990), "An Illustration of a Pitfall in Estimating the Effects of Aggregate Variables on Micro Units," Review of Economics and Statistics, 72, 334-38.
Nichols, A., and M.E. Schaffer (2007), "Clustered Standard Errors in Stata," United Kingdom Stata Users' Group Meetings, July 2007.
Pepper, J.V. (2002), "Robust Inferences from Random Clustered Samples: An Application using Data from the Panel Study of Income Dynamics," Economics Letters, 75, 341-5.
Petersen, M. (2009), "Estimating Standard Errors in Finance Panel Data Sets: Comparing Approaches," Review of Financial Studies, 22, 435-480.
Pfeffermann, D., and G. Nathan (1981), "Regression analysis of data from a cluster sample," Journal of the American Statistical Association, 76, 681-689.
Rogers, W.H. (1993), "Regression Standard Errors in Clustered Samples," Stata Technical Bulletin, 13, 19-23.
Scott, A.J., and D. Holt (1982), "The Effect of Two-Stage Sampling on Ordinary Least Squares Methods," Journal of the American Statistical Association, 77, 848-854.
Shore-Sheppard, L. (1996), "The Precision of Instrumental Variables Estimates with Grouped Data," Princeton University Industrial Relations Section Working Paper 374.
Stock, J.H., and M.W. Watson (2008), "Heteroskedasticity-robust Standard Errors for Fixed Effects Panel Data Regression," Econometrica, 76, 155-174.
Thompson, S. (2006), "Simple Formulas for Standard Errors that Cluster by Both Firm and Time," SSRN: http://ssrn.com/abstract=914002.
White, H. (1980), "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity," Econometrica, 48, 817-838.
White, H. (1982), "Maximum Likelihood Estimation of Misspecified Models," Econometrica, 50, 1-25.
White, H. (1984), Asymptotic Theory for Econometricians, San Diego, Academic Press.
White, H., and I. Domowitz (1984), "Nonlinear Regression with Dependent Observations," Econometrica, 52, 143-162.
Wooldridge, J.M. (2002), Econometric Analysis of Cross Section and Panel Data, Cambridge, MA, MIT Press.
Wooldridge, J.M. (2003), "Cluster-Sample Methods in Applied Econometrics," American Economic Review, 93, 133-138.
Wooldridge, J.M. (2006), "Cluster-Sample Methods in Applied Econometrics: An Extended Analysis," Department of Economics, Michigan State University.
Table 1 - Occupation injury rate and Log Wages
Impacts of varying ways of dealing with clustering

                                                                      (1)           (2)           (3)
                                                                      Main Sample   10 Largest    Main Sample
                                                                                    Occupations
                                                                      Linear        Linear        Probit
    OLS (or Probit) coefficient on Occupation Injury Rate             -2.158        -2.751        -6.978
 1  Default (iid) std. error                                           0.188         0.308         0.626
 2  White-robust std. error                                            0.243         0.320         1.008
 3  Cluster-robust std. error (Clustering on Occupation)               0.516         0.994         1.454
 4  P-value based on (3) and Standard Normal                                         0.006
 5  P-value based on (3) and T(10-1)                                                 0.022
 6  P-value based on Percentile-T Pairs Bootstrap (999 replications)                 0.110
 7  Two-way (Occupation and Industry) robust std. error                0.515                       1.516
    Random effects Coefficient on Occupation Injury Rate              -1.652        -2.669        -5.789
 8  Default std. error                                                 0.357         1.429         1.106
 9  White-robust std. error                                            0.579         2.058
10  Cluster-robust std. error (Clustering on Occupation)               0.536         2.148
    Number of observations (N)                                          5960          1594          5960
    Number of Clusters (G)                                               362            10           362
    Within-Cluster correlation of errors (rho)                         0.207         0.211

Notes: Coefficients and standard errors multiplied by 100. Regression covariates include Occupation
Injury rate, Industry Injury rate, Potential experience, Potential experience squared, Years of
schooling, and indicator variables for union, nonwhite, and three regions. Data from Current
Population Survey, as described in Hersch (1998). Std. errs. in rows 9 and 10 are from bootstraps with
400 replications. Probit outcome is wages >= $12/hour.
EXHIBIT 14
Case 1:12-cv-02826-DLC Document 265 Filed 05/31/13 Page 1 of 66
D5NHUSA1
UNITED STATES DISTRICT COURT
SOUTHERN DISTRICT OF NEW YORK
------------------------------x
UNITED STATES OF AMERICA,
Plaintiff,
v.                                  12 Civ. 2826 (DLC)
APPLE, INC., et al.,
Defendants.
------------------------------x
May 23, 2013
2:30 p.m.
Before:
HON. DENISE L. COTE,
District Judge
SOUTHERN DISTRICT REPORTERS, P.C.
(212) 805-0300
APPEARANCES

UNITED STATES DEPARTMENT OF JUSTICE
Attorneys for Plaintiff
BY: MARK W. RYAN
    DANIEL McCUAIG
    LAWRENCE BUTERMAN
    CARRIE SYME

OFFICE OF THE ATTORNEY GENERAL OF TEXAS
Attorneys for State of Texas and Liaison counsel
for plaintiff States
BY: ERIC LIPMAN
    GABRIEL R. GERVEY
    DAVID M. ASHTON

OFFICE OF THE ATTORNEY GENERAL OF CONNECTICUT
Attorneys for State of Connecticut and Liaison counsel
for plaintiff States
BY: W. JOSEPH NIELSEN
    GARY M. BECKER

OFFICE OF THE ATTORNEY GENERAL OF OHIO
Attorneys for State of Ohio
BY: EDWARD J. OLSZEWSKI

GIBSON, DUNN & CRUTCHER
Attorneys for Defendant Apple
BY: ORIN SNYDER
    LISA RUBIN
    DANIEL FLOYD
    DANIEL SWANSON
    CYNTHIA RICHMAN
-and-
O'MELVENEY & MYERS
BY: HOWARD HEISS
(In open court)
THE DEPUTY CLERK: United States of America v. Apple Inc. and others. Counsel for the government, please state your name for the record.
MR. RYAN: Mark Ryan for the United States, your Honor. Good afternoon.
THE DEPUTY CLERK: For the plaintiff.
THE COURT: Excuse me one second. Anyone else for the United States?
MR. BUTERMAN: Good afternoon, your Honor. Lawrence Buterman.
MR. MCCUAIG: Daniel McCuaig, your Honor.
MS. SYME: Carrie Syme, your Honor.
THE COURT: For the plaintiff States. For Texas.
MR. LIPMAN: Good afternoon, your Honor. Eric Lipman.
MR. GERVEY: Good afternoon, your Honor. Gabriel Gervey.
MR. ASHTON: David Ashton, your Honor.
THE COURT: For Connecticut.
MR. NIELSEN: Joe Nielsen, your Honor.
MR. BECKER: Gary Becker, your Honor.
THE COURT: Anyone else for the States?
MR. OLSZEWSKI: Yes. Edward Olszewski for Ohio, Attorney General's Office.
THE COURT: For Apple.
MR. SNYDER: Good afternoon. Orin Snyder for Apple.
MS. RUBIN: Good afternoon, your Honor. Lisa Rubin for Apple.
MR. SWANSON: Good afternoon, your Honor. Dan Swanson for Apple.
MS. RICHMAN: Good afternoon, your Honor. Cynthia Richman for Apple.
MR. HEISS: Good afternoon, your Honor. Howard Heiss for Apple.
MR. FLOYD: Good afternoon, your Honor. Daniel Floyd for Apple.
THE COURT: Is there anyone else who needs to place their appearance on the record?
MR. SNYDER: No, your Honor. Thank you.
THE COURT: Thank you, Mr. Snyder.
To assist our court reporter, and me, perhaps, I am going to ask you if you speak please to identify yourself briefly for the record before you speak.
Welcome, everyone. This is our final pretrial conference. We have a long agenda today to get ourselves ready for our trial which begins on June 3rd, as you know. I want to address the following topics, and you may have some additional issues as well. I want to talk about the schedule we will follow during the trial and the procedures generally that will
apply in this non-jury trial. I want to go through your witness list, make sure we understand who is actually going to be called to testify and clarify who is going to be in the courtroom and subject to cross-examination. I want to talk about time limits and whether those are appropriate here. We have motions in limine that I am prepared to address. I want to talk about the state law claims and the extent to which they will be litigated and under what standard. I want to talk about the depositions and the way we are going to approach deposition evidence that parties have offered as part of their pretrial order. I want to deal with objections to exhibits, including potentially authenticity objections. I want to talk about third-party redactions. We have gotten some submissions there and I want to make sure we know what procedure we are going to follow with respect to those.
So then, of course, I won't end this conference without -- and maybe I will just start this conference that way. I am working hard. My staff is working hard on this case. I am sure counsel is working hard on this case to be prepared for our June 3rd trial. So if for any reason this case settles and I can put down my pen and turn to something else, I would like a call, night or day, at the chamber's telephone number because it will affect how I spend my time. So thank you so much for that.
So let's talk about our schedule. We will begin at