"The Apple iPod iTunes Anti-Trust Litigation"

Filing 753

***ERRONEOUS ENTRY, PLEASE REFER TO DOCUMENT NO. 754 *** EXHIBITS re 752 Opposition/Response to Motion, filed byApple Inc.. (Attachments: # 1 Exhibit 2, # 2 Exhibit 3, # 3 Exhibit 4, # 4 Exhibit 5, # 5 Exhibit 6, # 6 Exhibit 7, # 7 Exhibit 8, # 8 Exhibit 9, # 9 Exhibit 11, # 10 Exhibit 12, # 11 Proposed Order)(Related document(s) 752 ) (Kiernan, David) (Filed on 1/14/2014) Modified on 1/14/2014 (jlmS, COURT STAFF).

Download PDF
Exhibit 7 A Practitioner's Guide to Cluster-Robust Inference A. Colin Cameron and Douglas L. Miller Department of Economics, University of California - Davis. This version (almost nal): October 15, 2013 Abstract We consider statistical inference for regression when data are grouped into clusters, with regression model errors independent across clusters but correlated within clusters. Examples include data on individuals with clustering on village or region or other category such as industry, and state-year di erences-in-di erences studies with clustering on state. In such settings default standard errors can greatly overstate estimator precision. Instead, if the number of clusters is large, statistical inference after OLS should be based on cluster-robust standard errors. We outline the basic method as well as many complications that can arise in practice. These include cluster-speci c xed e ects, few clusters, multi-way clustering, and estimators other than OLS. JEL Classi cation: C12, C21, C23, C81. Keywords: Cluster robust; cluster e ect; xed e ects, random e ects; hierarchical models; feasible GLS; di erences in di erences; panel data; cluster bootstrap; few clusters; multi-way clusters. In preparation for The Journal of Human Resources. We thank four referees and the journal editor for very helpful comments and for guidance, participants at the 2013 California Econometrics Conference and at a workshop sponsored by the U.K. Programme Evaluation for Policy Analysis, and the many people who have over time have sent us cluster-related puzzles (the solutions to some of which appear in this paper). Doug Miller acknowledges nancial support from the Center for Health and Wellbeing at the Woodrow Wilson School of Public Policy at Princeton University. Contents I Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . II Cluster-Robust Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IIA A Simple Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IIB Clustered Errors and Two Leading Examples . . . . . . . . . . . . . . . . IIB.1 Example 1: Individuals in Cluster . . . . . . . . . . . . . . . . . . IIB.2 Example 2: Di erences-in-Di erences (DiD) in a State-Year Panel IIC The Cluster-Robust Variance Matrix Estimate . . . . . . . . . . . . . . . IID Feasible GLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IIE Implementation for OLS and FGLS . . . . . . . . . . . . . . . . . . . . . IIF Cluster-Bootstrap Variance Matrix Estimate . . . . . . . . . . . . . . . . III Cluster-Speci c Fixed E ects . . . . . . . . . . . . . . . . . . . . . . . . . . . IIIA Do Fixed E ects Fully Control for Within-Cluster Correlation? . . . . . . IIIB Cluster-Robust Variance Matrix with Fixed E ects . . . . . . . . . . . . IIIC Feasible GLS with Fixed E ects . . . . . . . . . . . . . . . . . . . . . . . IIID Testing the Need for Fixed E ects . . . . . . . . . . . . . . . . . . . . . . IV What to Cluster Over? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IVA Factors Determining What to Cluster Over . . . . . . . . . . . . . . . . . IVB Clustering Due to Survey Design . . . . . . . . . . . . . . . . . . . . . . V Multi-way Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VA Multi-way Cluster-Robust Variance Matrix Estimate . . . . . . . . . . . VB Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VC Feasible GLS . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . VD Spatial Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VI Few Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VIA The Basic Problems with Few Clusters . . . . . . . . . . . . . . . . . . . VIB Solution 1: Bias-Corrected Cluster-Robust Variance Matrix . . . . . . . . VIC Solution 2: Cluster Bootstrap with Asymptotic Re nement . . . . . . . . VIC.1 Percentile-t Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . VIC.2 Wild Cluster Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . VIC.3 Bootstrap with Caution . . . . . . . . . . . . . . . . . . . . . . . VID Solution 3: Improved Critical Values using a T-distribution . . . . . . . . VID.1 G-L Degrees of Freedom . . . . . . . . . . . . . . . . . . . . . . . VID.2 Data-determined Degrees of Freedom . . . . . . . . . . . . . . . . VID.3 E ective Number of Clusters . . . . . . . . . . . . . . . . . . . . . VIE Special Cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VIE.1 Fixed Number of Clusters with Cluster Size Growing . . . . . . . VIE.2 Few Treated Groups . . . . . . . . . . . . . . . . . . . . . . . . . VIE.3 What if you have multi-way clustering and few clusters? . . . . . VIIExtensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VIIACluster-Robust F-tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . VIIBInstrumental Variables Estimators . . . . . . . . . . . . . . . . . . . . . . VIIB.1 IV and 2SLS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VIIB.2 Weak Instruments . . . . . . . . . . . . . . . . . . . . . . . . . . 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 6 6 7 8 9 10 11 13 14 15 16 17 19 19 20 20 22 23 23 25 26 27 28 29 30 31 31 32 34 35 35 35 37 37 37 38 39 39 39 40 40 41 VIIB.3 Linear GMM . . . . . . . . . . . . . . . . VIICNonlinear Models . . . . . . . . . . . . . . . . . . VIIC.1 Di erent Models for Clustering . . . . . . VIIC.2 Fixed E ects . . . . . . . . . . . . . . . . VIIDCluster-randomized Experiments . . . . . . . . . VIII Empirical Example . . . . . . . . . . . . . . . . . . . . VIIIA Individual-level Cross-section Data: One Sample . VIIIB Individual-level Cross-section Data: Monte Carlo . VIIIC State{Year Panel Data: One Sample . . . . . . . VIIID State{Year Panel Data: Monte Carlo . . . . . . . IX Concluding Thoughts . . . . . . . . . . . . . . . . . . . X References . . . . . . . . . . . . . . . . . . . . . . . . . 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 43 43 45 46 46 46 48 49 50 51 52 I. Introduction In an empiricist's day-to-day practice, most e ort is spent on getting unbiased or consistent point estimates. That is, a lot of attention focuses on the parameters (b). In this paper we focus on getting accurate statistical inference, a fundamental component of which is obtaining accurate standard errors (se, the estimated standard deviation of b). We begin with the basic reminder that empirical researchers should also really care about getting this part right. 
An asymptotic 95% con dence interval is b 1:96 se, and hypothesis testing b and se are critical is typically based on the Wald \t-statistic" w = (b 0 )=se. Both ingredients for statistical inference, and we should be paying as much attention to getting a good se as we do for obtaining b. In this paper, we consider statistical inference in regression models where observations can be grouped into clusters, with model errors uncorrelated across clusters but correlated within cluster. One leading example of \clustered errors" is individual-level cross-section data with clustering on geographical region, such as village or state. Then model errors for individuals in the same region may be correlated, while model errors for individuals in di erent regions are assumed to be uncorrelated. A second leading example is panel data. Then model errors in di erent time periods for a given individual (e.g., person or rm or region) may be correlated, while model errors for di erent individuals are assumed to be uncorrelated. Failure to control for within-cluster error correlation can lead to very misleadingly small standard errors, and consequent misleadingly narrow con dence intervals, large t-statistics and low p-values. It is not unusual to have applications where standard errors that control for within-cluster correlation are several times larger than default standard errors that ignore such correlation. As shown below, the need for such control increases not only with increase in the size of within-cluster error correlation, but the need also increases with the size of within-cluster correlation of regressors and with the number of observations within a cluster. A leading example, highlighted by Moulton (1986, 1990), is when interest lies in measuring the e ect of a policy variable, or other aggregated regressor, that takes the same value for all observations within a cluster. One way to control for clustered errors in a linear regression model is to additionally specify a model for the within-cluster error correlation, consistently estimate the parameters of this error correlation model, and then estimate the original model by feasible generalized least squares (FGLS) rather than ordinary least squares (OLS). Examples include random e ects estimators and, more generally, random coe cient and hierarchical models. If all goes well this provides valid statistical inference, as well as estimates of the parameters of the original regression model that are more e cient than OLS. However, these desirable properties hold only under the very strong assumption that the model for within-cluster error correlation is correctly speci ed. A more recent method to control for clustered errors is to estimate the regression model with limited or no control for within-cluster error correlation, and then post-estimation 4 obtain \cluster-robust" standard errors proposed by White (1984, p.134-142) for OLS with a multivariate dependent variable (directly applicable to balanced clusters); by Liang and Zeger (1986) for linear and nonlinear models; and by Arellano (1987) for the xed e ects estimator in linear panel models. These cluster-robust standard errors do not require speci cation of a model for within-cluster error correlation, but do require the additional assumption that the number of clusters, rather than just the number of observations, goes to in nity. 
Cluster-robust standard errors are now widely used, popularized in part by Rogers (1993) who incorporated the method in Stata, and by Bertrand, Du o and Mullainathan (2004) who pointed out that many di erences-in-di erences studies failed to control for clustered errors, and those that did often clustered at the wrong level. Cameron and Miller (2011) and Wooldridge (2003, 2006) provide surveys, and lengthy expositions are given in Angrist and Pischke (2009) and Wooldridge (2010). One goal of this paper is to provide the practitioner with the methods to implement cluster-robust inference. To this end we include in the paper reference to relevant Stata commands (for version 13), since Stata is the computer package most used in applied microeconometrics research. And we will post on our websites more expansive Stata code and datasets used in this paper. A second goal is presenting how to deal with complications such as determining when is there a need to cluster, incorporating xed e ects, and inference when there are few clusters. A third goal is to provide an exposition of the underlying econometrics theory as this can aid in understanding complications. In practice the most di cult complication to deal with can be \few" clusters, see Section VI. There is no clear-cut de nition of \few"; depending on the situation \few" may range from less than 20 to less than 50 clusters in the balanced case. We focus on OLS, for simplicity and because this is the most commonly-used estimation method in practice. Section II presents the basic results for OLS with clustered errors. In principle, implementation is straightforward as econometrics packages include cluster-robust as an option for the commonly-used estimators; in Stata it is the vce(cluster) option. The remainder of the survey concentrates on complications that often arise in practice. Section III addresses how the addition of xed e ects impacts cluster-robust inference. Section IV deals with the obvious complication that it is not always clear what to cluster over. Section V considers clustering when there is more than one way to do so and these ways are not nested in each other. Section VI considers how to adjust inference when there are just a few clusters as, without adjustment, test statistics based on the cluster-robust standard errors over-reject and con dence intervals are too narrow. Section VII presents extension to the full range of estimators { instrumental variables, nonlinear models such as logit and probit, and generalized method of moments. Section VIII presents both empirical examples and real-data based simulations. Concluding thoughts are given in Section IX. 5 II. Cluster-Robust Inference In this section we present the fundamentals of cluster-robust inference. For these basic results we assume that the model does not include cluster-speci c xed e ects, that it is clear how to form the clusters, and that there are many clusters. We relax these conditions in subsequent sections. Clustered errors have two main consequences: they (usually) reduce the precision of b, b and the standard estimator for the variance of b, V[b] , is (usually) biased downward from the true variance. Computing cluster-robust standard errors is a x for the latter issue. We illustrate these issues, initially in the context of a very simple model and then in the following subsection in a more typical model. IIA. A Simple Example For simplicity, we begin with OLS with a single regressor that is nonstochastic, and assume no intercept in the model. 
The results extend to multiple regression with stochastic regressors. Let yi = xi + ui , i = 1; :::; N , where xi is nonstochastic and E[ui ] = 0. The OLS P P P P = i xi ui = i x2 , so in general estimator b = i xi yi = i x2 can be re-expressed as b i i P 2 P )2 ] = V [ i xi ui ] = i xi V[b] = E[(b 2 : (1) P 2 P P If errors are uncorrelated over i, then V[ i xi ui ] = i xi V[ui ]. In the i V[xi ui ] = P 2 simplest case of homoskedastic errors, V[ui ] = and (1) simpli es to V[b] = 2 = i x2 . i If instead errors are heteroskedastic, then (1) becomes P Vhet [b] = i x2 E[u2 ] = i i P i 2 x2 i ; using V[ui ] = E[u2 ] since E[ui ] = 0. Implementation seemingly requires consistent estimates i of each of the N error variances E[u2 ]. In a very in uential paper, one that extends naturally i to the clustered setting, White (1980) noted that instead all that is needed is an estimate of P P b the scalar i x2 E[u2 ], and that one can simply use i x2 u2 , where ui = yi bxi is the OLS i bi i i residual, provided N ! 1. This leads to estimated variance P 2 2 P 2 b Vhet [b] = b i xi ui ] = i xi 2 : The resulting standard error for b is often called a robust standard error, though a better, more precise term, is heteroskedastic-robust standard error. What if errors are correlated over i? In the most general case where all errors are correlated with each other, so P P P P P V [ i xi ui ] = i j Cov[xi ui ; xj uj ] = i j xi xj E[ui uj ]; Vcor [b] = P P i j xi xj E[ui uj ] = 6 P i x2 i 2 : P P P 2 2 b The obvious extension of White (1980) is to use V[b] = bb i j x i x j ui u j ] = ( i xi ) , but P this equals zero since i xi ui = 0. Instead one needs to rst set a large fraction of the error b correlations E[ui uj ] to zero. For time series data with errors assumed to be correlated only up to, say, m periods apart as well as heteroskedastic, White's result can be extended to yield a heteroskedastic- and autocorrelation-consistent (HAC) variance estimate; see Newey and West (1987). In this paper we consider clustered errors, with E[ui uj ] = 0 unless observations i and j are in the same cluster (such as same region). Then P P P 2 2 Vclu [b] = ; (2) i j xi xj E[ui uj ]1[i; j in same cluster] = i xi where the indicator function 1[A] equals 1 if event A happens and equals 0 if event A does not happen. Provided the number of clusters goes to in nity, we can use the variance estimate P 2 2 P P b : (3) bb Vclu [b] = i xi i j xi xj ui uj 1[i; j in same cluster] = This estimate is called a cluster-robust estimate, though more precisely it is heteroskedasticb and cluster-robust. This estimate reduces to Vhet [b] in the special case that there is only one observation in each cluster. b b Typically Vclu [b] exceeds Vhet [b] due to the addition of terms when i 6= j. The amount of increase is larger (1) the more positively associated are the regressors across observations in the same cluster (via xi xj in (3)), (2) the more correlated are the errors (via E[ui uj ] in (2)), and (3) the more observations are in the same cluster (via 1[i; j in same cluster] in (3)). There are several take-away messages. First there can be great loss of e ciency in OLS estimation if errors are correlated within cluster rather than completely uncorrelated. Intuitively, if errors are positively correlated within cluster then an additional observation in the cluster no longer provides a completely independent piece of new information. 
Second, failure to control for this within-cluster error correlation can lead to using standard errors that are too small, with consequent overly-narrow con dence intervals, overly-large t-statistics, and over-rejection of true null hypotheses. Third, it is straightforward to obtain cluster-robust standard errors, though they do rely on the assumption that the number of clusters goes to in nity (see Section VI for the few clusters case). IIB. Clustered Errors and Two Leading Examples Let i denote the ith of N individuals in the sample, and g denote the g th of G clusters. Then for individual i in cluster g the linear model with (one-way) clustering is yig = x0ig + uig ; (4) where xig is a K 1 vector. As usual it is assumed that E[uig jxig ] = 0. The key assumption is that errors are uncorrelated across clusters, while errors for individuals belonging to the same cluster may be correlated. Thus E[uig ujg0 jxig ; xjg0 ] = 0, unless g = g 0 : 7 (5) IIB.1. Example 1: Individuals in Cluster Hersch (1998) uses cross-section individual-level data to estimate the impact of job injury risk on wages. Since there is no individual-level data on job injury rate, a more aggregated measure such as job injury risk in the individual's industry is used as a regressor. Then for individual i (with N = 5960) in industry g (with G = 211) yig = riskg + z0ig + uig : The regressor riskg is perfectly correlated within industry. The error term will be positively correlated within industry if the model systematically overpredicts (or underpredicts) wages in a given industry. In this case default OLS standard errors will be downwards biased. To measure the extent of this downwards bias, suppose errors are equicorrelated within cluster, so Cor[uig ; ujg ] = for all i 6= j. This pattern is suitable when observations can be viewed as exchangeable, with ordering not mattering. Common examples include the current one, individuals or households within a village or other geographic unit (such as state), individuals within a household, and students within a school. Then a useful approximation is that for the k th regressor the default OLS variance estimate based on s2 (X0 X) 1 , where s is the standard error of the regression, should be in ated by k '1+ xk u (Ng 1); (6) where xk is a measure of the within-cluster correlation of xigk , u is the within-cluster error correlation, and Ng is the average cluster size. The result (6) is exact if clusters are of equal size (\balanced" clusters) and xk = 1 for all regressors (Kloek, 1981); see Scott and Holt (1982) and Greenwald (1983) for the general result with a single regressor. This very important and insightful result is that the variance in ation factor is increasing in 1. the within-cluster correlation of the regressor 2. the within-cluster correlation of the error 3. the number of observations in each cluster. For clusters of unequal size replace (Ng 1) in (6) by ((V[Ng ]=Ng ) + Ng 1); see Moulton (1986, p.387). Note that there is no cluster problem if any one of the following occur: xk = 0 or u = 0 or Ng = 1. In an in uential paper, Moulton (1990) pointed out that in many settings the in ation factor k can be large even if u is small. He considered a log earnings regression using March CPS data (N = 18; 946), regressors aggregated at the state level (G = 49), and errors correlated within state (bu = 0:032). The average group size was 18; 946=49 = 387, 0:032 386 = 13:3. 
The weak xk = 1 for a state-level regressor, so (6) yields k ' 1 + 1 8 correlation of errors within state was still enough to lead to cluster-corrected standard errors p being 13:3 = 3:7 times larger than the (incorrect) default standard errors! In such examples of cross-section data with an aggregated regressor, the cluster-robust standard errors can be much larger despite low within-cluster error correlation because the regressor of interest is perfectly correlated within cluster and there may be many observations per cluster. IIB.2. Example 2: Di erences-in-Di erences (DiD) in a State-Year Panel Interest may lie in how wages respond to a binary policy variable dts that varies by state and over time. Then at time t in state s yts = dts + z0ts + s + t + uts ; where we assume independence over states, so the ordering of subscripts (t; s) corresponds to (i; g) in (4), and s and t are state and year e ects. The binary regressor dts equals one if the policy of interest is in e ect and equals 0 otherwise. The regressor dts is often highly serially correlated since, for example, dts will equal a string of zeroes followed by a string of ones for a state that switches from never having the policy in place to forever after having the policy in place. The error uts is correlated over time for a given state if the model systematically overpredicts (or underpredicts) wages in a given state. Again the default standard errors are likely to be downwards-biased. In the panel data case, the within-cluster (i.e., within-individual) error correlation decreases as the time separation increases, so errors are not equicorrelated. A better model for the errors is a time series model, such as an autoregressive error of order one that implies 0 that Cor[uts ; ut0 s ] = jt t j . The true variance of the OLS estimator will again be larger than the OLS default, although the consequences of clustering are less extreme than in the case of equicorrelated errors (see Cameron and Miller (2011, Section 2.3) for more detail). In such DiD examples with panel data, the cluster-robust standard errors can be much larger than the default because both the regressor of interest and the errors are highly correlated within cluster. Note also that this complication can exist even with the inclusion of xed e ects (see Section III). The same problems arise if we additionally have data on individuals, so that yits = dts + z0its + s + t + uits : In an in uential paper, Bertrand, Du o and Mullainathan (2004) demonstrated the importance of using cluster-robust standard errors in DiD settings. Furthermore, the clustering should be on state, assuming error independence across states. The clustering should not be on state-year pairs since, for example, the error for California in 2010 is likely to be correlated with the error for California in 2009. The issues raised here are relevant for any panel data application, not just DiD studies. The DiD panel example with binary policy regressor is often emphasized in the cluster-robust 9 literature because it is widely used and it has regressor that is highly serially correlated, even after mean-di erencing to control for xed e ects. This serial correlation leads to a potentially large di erence between cluster-robust and default standard errors. IIC. 
The Cluster-Robust Variance Matrix Estimate Stacking all observations in the g th cluster, the model (4) can be written as yg = Xg + ug ; g = 1; :::; G; where yg and ug are Ng 1 vectors, Xg is an Ng K matrix, and there are Ng observations in cluster g. Further stacking yg , Xg and ug over the G clusters then yields the model y = X + u: The OLS estimator is b = (X0 X) 1 X0 y = PG 0 g=1 Xg Xg In general, the variance matrix conditional on X is with V[ b ] = (X0 X) 1 B (X0 X) 1 PG g=1 1 ; B = X0 V[ujX]X: X0g yg : (7) (8) Given error independence across clusters, V[ujX] has a block-diagonal structure, and (8) simpli es to P Bclu = G X0g E ug u0g jXg Xg : (9) g=1 The matrix Bclu , the middle part of the \sandwich matrix" (7), corresponds to the numerator of (2). Bclu can be written as: Bclu = PG PNg PNg g=1 i=1 j=1 xig x0jg ! ig;jg ; where ! ig;jg = E[uig ujg jXg ] is the error covariance for the ig th and jg th observations. We can gain a few insights from inspecting this equation. The term B (and hence V[ b ]) will be bigger when: (1) regressors within cluster are correlated, (2) errors within cluster are correlated so ! ig;jg is non-zero, (3) Ng is large, and (4) the within-cluster regressor and error correlations are of the same sign (the usual situation). These conditions mirror the more precise Moulton result for the special case of equicorrelated errors given in equation (6). Both examples in Section II had high within-cluster correlation of the regressor, the DiD example had high within-cluster (serial) correlation of the error and the Moulton (1990) example had Ng large. 10 Implementation requires an estimate of Bclu given in (9). The cluster-robust estimate of the variance matrix (CRVE) of the OLS estimator is the sandwich estimate where b b Vclu [ b ] = (X0 X) 1 Bclu (X0 X) 1 ; P b b b Bclu = G X0g ug u0g Xg ; g=1 (10) (11) b and ug = yg Xg b is the vector of OLS residuals for the g th cluster. Formally (10)-(11) proP P p b b vides a consistent estimate of the variance matrix if G 1 g X0g ug u0g Xg G 1 g E[X0g ug u0g Xg ] ! 0 as G ! 1. Initial derivations of this estimator, by White (1984, p.134-142) for balanced clusters and by Liang and Zeger (1986) for unbalanced, assumed a nite number of observations per cluster. Hansen (2007a) showed that the CRVE can also be used if Ng ! 1, the case for long panels, in addition to G ! 1. Carter, Schnepel and Steigerwald (2013) consider unbalanced panels with either Ng xed or Ng ! 1. The sandwich formula for the CRVE extends to many estimators other than OLS; see Section VII. b b Algebraically, the estimator (10)-(11) equals (7) and (9) with E[ug u0g ] replaced with ug u0g . 0 b b What is striking about this is that for each cluster g, the Ng Ng matrix ug ug is bound to 0 be a very poor estimate of the Ng Ng matrix E[ug ug ] { there is no averaging going on to enable use of a Law of Large Numbers. The \magic" of the CRVE is that despite this, by averaging across all G clusters in (11), we are able to get a consistent variance estimate. This fact helps us to understand one of the limitations of this method in practice { the averaging b that makes V[ b ] accurate for V[ b ] is an average based on the number of clusters G. In applications with few clusters this can lead to problems that we discuss below in Section VI. b Finite-sample modi cations of (11) are typically used, to reduce downwards bias in Vclu [ b ] p b due to nite G. 
Stata uses cb g in (11) rather than ug , with u c= G G N 1N 1 G ' : K G 1 (12) In general c ' G=(G 1), though see Section IIIB for an important exception when xed e ects are directly estimated. Some other packages such as SAS use c = G=(G 1), a simpler correction that is also used by Stata for extensions to nonlinear models. Either choice of c usually lessens, but does not fully eliminate, the usual downwards bias in the CRVE. Other nite-cluster corrections are discussed in Section VI, but there is no clear best correction. IID. Feasible GLS If errors are correlated within cluster, then in general OLS is ine cient and feasible GLS may be more e cient. Suppose we specify a model for g = E[ug u0g jXg ] in (9), such as within-cluster equicorrelation. Then the GLS estimator is (X0 1 X) 1 X0 1 y, where = Diag[ g ]. Given a consistent estimate b of , the feasible GLS estimator of is 1P G 0 b 1 b FGLS = PG X0 b 1 Xg (13) g g g=1 g=1 Xg g yg : 11 The FGLS estimator is second-moment e cient, with variance matrix b Vdef [ b FGLS ] = (X0 b 1 X) 1 ; (14) under the strong assumption that the error variance is correctly speci ed. Remarkably, the cluster-robust method of the previous section can be extended to FGLS. Essentially OLS is the special case where g = 2 INg . The cluster-robust estimate of the asymptotic variance matrix of the FGLS estimator is b Vclu [ b FGLS ] = X0 b 1 1 X PG 0 b 1 b b0 b 1 g=1 Xg g ug ug g Xg X0 b 1 1 X ; (15) b where ug = yg Xg b FGLS . This estimator requires that ug and uh are uncorrelated when g 6= h, and that G ! 1, but permits E[ug u0g jXg ] 6= g . The approach of specifying a model for the error variances and then doing inference that guards against misspeci cation of this model is especially popular in the biostatistics literature that calls g a \working" variance matrix (see, for example, Liang and Zeger, 1986). There are many possible candidate models for g , depending on the type of data being analyzed. For individual-level data clustered by region, example 1 in Section IIB, a common starting point model is the random e ects (RE) model. The error is speci ed to have two components: uig = g + "ig ; (16) where g is a cluster-speci c error or common shock that is assumed to be independent and identically distributed (i.i.d.) (0; 2 ), and "ig is an idiosyncratic error that is assumed to be i.i.d. (0; 2 ). Then V[uig ] = 2 + 2 and Cov[uig ; ujg ] = 2 for i 6= j. It follows that the " " intraclass correlation of the error u = Cor[uig ; ujg ] = 2 =( 2 + 2 ), so this model implies " equicorrelated errors within cluster. Richer models that introduce heteroskedasticity include random coe cients models and hierarchical linear models. For panel data, example 2 in Section IIB, a range of time series models for uit may be used, including autoregressive and moving average error models. Analysis of within-cluster residual correlation patterns after OLS estimation can be helpful in selecting a model for g. Note that in all cases if cluster-speci c xed e ects are included as regressors and N g is small then bias-corrected FGLS should be used; see Section IIIC. The e ciency gains of FGLS need not be great. As an extreme example, with equicorrelated errors, balanced clusters, and all regressors invariant within cluster (xig = xg ) the FGLS estimator equals the OLS estimator - and so there is no e ciency gain to FGLS. 
With equicorrelated errors and general X, Scott and Holt (1982) provide an upper bound to the maximum proportionate e ciency loss of OLS, compared to the variance of the FGLS h i u )[1+(N estimator, of 1= 1 + 4(1 (Nmax max2 1) u , Nmax = maxfN1 ; :::; NG g. This upper bound is u) increasing in the error correlation u and the maximum cluster size Nmax . For low u the 12 maximal e ciency gain can be low. For example, Scott and Holt (1982) note that for u = :05 and Nmax = 20 there is at most a 12% e ciency loss of OLS compared to FGLS. With u = 0:2 and Nmax = 100 the e ciency loss could be as much as 86%, though this depends on the nature of X. There is no clear guide to when FGLS may lead to considerable improvement in e ciency, and the e ciency gains can be modest. However, especially in models without cluster-speci c xed e ects, implementation of FGLS and use of (15) to guard against misspeci cation of g is straightforward. And even modest e ciency gains can be bene cial. It is remarkable that current econometric practice with clustered errors ignores the potential e ciency gains of FGLS. IIE. Implementation for OLS and FGLS For regression software that provides a cluster-robust option, implementation of the CRVE for OLS simply requires de ning for each observation a cluster identi er variable that takes one of G distinct values according to the observation's cluster, and then passing this cluster identi er to the estimation command's cluster-robust option. For example, if the cluster identi er is id clu, then Stata OLS command regress y x becomes regress y x, vce(cluster id clu). Wald hypothesis tests and con dence intervals are then implemented in the usual way. In some cases, however, joint tests of several hypotheses and of overall statistical signi cance b may not be possible. The CRVE Vclu [ b ] is guaranteed to be positive semi-de nite, so the estimated variance of the individual components of b are necessarily nonnegative. But b Vclu [ b ] is not necessarily positive de nite, so it is possible that the variance matrix of linear b combinations of the components of b is singular. The rank of Vclu [ b ] equals the rank of b b b b X0G uG ] is a K G matrix, it B de ned in (11). Since B = C0 C, where C0 = [X01 u1 b and hence that of Vclu [ b ], is at most the rank of C. Since b follows that the rank of B, b b X01 u1 + + X0G uG = 0, the rank of C is at most the minimum of K and G 1. E ectively, b clu [ b ] equals min(K; G 1), though it can be less than this in some examples the rank of V such as perfect collinearity of regressors and cluster-speci c dummy regressors (see Section IIIB for the latter). A common setting is to have a richly speci ed model with thousands of observations in b far fewer clusters, leading to more regressors than clusters. Then Vclu [ b ] is rank-de cient, so it will not be possible to perform an overall F test of the joint statistical signi cance of all regressors. And in a log-wage regression with occupation dummies and clustering on state, we cannot test the joint statistical signi cance of the occupation dummies if there are more occupations than states. But it is still okay to perform statistical inference on individual regression coe cients, and to do joint tests on a limited number of restrictions (potentially as many as min(K; G 1)). Regression software usually also includes a panel data component. 
Panel commands may enable not only OLS with cluster-robust standard errors, but also FGLS for some models of 13 within-cluster error correlation with default (and possibly cluster-robust) standard errors. It is important to note that those panel data commands that do not explicitly use time series methods, an example is FGLS with equicorrelation of errors within-cluster, can be applied more generally to other forms of clustered data, such as individual-level data with clustering on geographic region. For example, in Stata rst give the command xtset id clu to let Stata know that the cluster-identi er is variable id clu. Then the Stata command xtreg y x, pa corr(ind) vce(robust) yields OLS estimates with cluster-robust standard errors. Note that for Stata xt commands, option vce(robust) is generally interpreted as meaning cluster-robust; this is always the case from version 12.1 on. The xt commands use standard normal critical values whereas command regress uses T (G 1) critical values; see Sections VI and VIIA for further discussion. For FGLS estimation the commands vary with the model for g . For equicorrelated errors, a starting point for example 1 in Section IIB, the FGLS estimator can be obtained using command xtreg y x, pa corr(exch) or command xtreg y x, re; slightly di erent estimates are obtained due to slightly di erent estimates of the equicorrelation. For FGLS for hierarchical models that are richer than a random e ects model, use Stata command mixed (version 13) or xtmixed (earlier versions). For FGLS with panel data and time variable time, rst give the command xtset id clu time to let Stata know both the cluster-identi er and time variable. A starting point for example 2 in Section IIB is an autoregressive error of order one, estimated using command xtreg y x, pa corr(ar 1). Stata permits a wide range of possible models for serially correlated errors. In all of these FGLS examples the reported standard errors are the default ones that assume correct speci cation of g . Better practice is to add option vce(robust) for xtreg commands, or option vce(cluster id clu) for mixed commands, as this yields standard errors that are based on the cluster-robust variance de ned in (15). IIF. Cluster-Bootstrap Variance Matrix Estimate Not all econometrics packages compute cluster-robust variance estimates, and even those that do may not do so for all estimators. In that case one can use a pairs cluster bootstrap that, like the CRVE, gives a consistent estimate of V[ b ] when errors are clustered. To implement this bootstrap, do the following steps B times: (1) form G clusters f(y1 ; X1 ); :::; (yG ; XG )g by resampling with replacement G times from the original sample of clusters, and (2) compute the estimate of , denoted b b in the bth bootstrap sample. Then, given the B estimates b 1 ; :::; b B , compute the variance of these b Vclu;boot [ b ] = 1 B 1 PB b=1 ( b b b )( b b b )0 ; P where b = B 1 B b b and B = 400 should be more than adequate in most settings. It b=1 is important that the resampling be done over entire clusters, rather than over individual 14 observations. Each bootstrap resample will have exactly G clusters, with some of the original clusters not appearing at all while others of the original clusters may be repeated in the resample two or more times. The terms \pairs" is used as (yg ; Xg ) are resampled as a pair. The term \nonparametric" is also used for this bootstrap. Some alternative bootstraps p b hold Xg xed while resampling. 
For nite clusters, if Vclu [ b ] uses cbg in (11) then for u b comparability Vclu;boot [ b ] should be multiplied by the constant c in (12). The pairs cluster bootstrap leads to cluster-robust variance matrix for OLS with rank K even if K > G. An alternative resampling method that can be used is the leave-one-cluster-out jackknife. Then, letting b g denote the estimator of when the g th cluster is deleted, G b Vclu;jack [ b ] = G 1 PG g=1 ( bg b )( b g b )0 ; P where b = G 1 G b g . This older method can be viewed as an approximation to the g=1 bootstrap that does not work as well for nonlinear estimators. It is used less often than the bootstrap, and has the same rank as the CRVE. Unlike a percentile-t cluster bootstrap, presented later, the pairs cluster bootstrap and cluster jackknife variance matrix estimates are no better asymptotically than the CRVE, so it is best and quickest to use the CRVE if it is available. But the CRVE is not always available, especially for estimators more complicated than OLS. In that case one can instead use the pairs cluster bootstrap, though see the end of Section VIC for potential pitfalls if there are few clusters, or even the cluster jackknife. In Stata the pairs cluster bootstrap for OLS without xed e ects can be implemented in several equivalent ways including: regress y x, vce(boot, cluster(id clu) reps(400) seed(10101)); xtreg y x, pa corr(ind) vce(boot, reps(400) seed(10101)); and bootstrap, cluster(id clu) reps(400) seed(10101) : regress y x. The last variant can be used for estimation commands and user-written programs that do not have a vce(boot) option. We recommend 400 bootstrap iterations for published results and for replicability one should always set the seed. For the jackknife the commands are instead, respectively, regress y x, vce(jack, cluster(id clu)); xtreg y x, pa corr(ind) vce(jack); and jackknife, cluster(id clu): regress y x. For Stata xt commands, options vce(boot) and vce(jack) are generally interpreted as meaning cluster bootstrap and cluster jackknife; always so from Stata 12.1 on. III. Cluster-Speci c Fixed E ects The cluster-speci c xed e ects (FE) model includes a separate intercept for each cluster, so yig = x0ig + = x0ig + g + uig PG h=1 15 g dhig (17) + uig ; where dhig , the hth of G dummy variables, equals one if the ig th observation is in cluster h and equals zero otherwise. There are several di erent ways to obtain the same cluster-speci c xed e ects estimator. The two most commonly-used are the following. The least squares dummy variable (LSDV) estimator directly estimates the second line of (17), with OLS regression of yig on xig and the P G dummy variables d1ig ; :::; dGig , in which case b g = yg x0g b where yg = Ng 1 G yig and i=1 P xg = Ng 1 G xig . The within estimator, also called the xed e ects estimator, estimates i=1 by OLS the within or mean-di erenced model (yig yg ) = (xig xg )0 + (uig ug ): (18) The main reason that empirical economists use the cluster-speci c FE estimator is that it controls for a limited form of endogeneity of regressors. Suppose in (17) that we view g + uig as the error, and the regressors xig are correlated with this error, but only with the cluster-invariant component, i.e., Cov[xig ; g ] 6= 0 while Cov[xig ; uig ] = 0. Then OLS and FGLS regression of yig on xig , as in Section II, leads to inconsistent estimation of . Mean-di erencing (17) leads to the within model (18) that has eliminated the problematic cluster-invariant error component g . 
The resulting FE estimator is consistent for if either G ! 1 or Ng ! 1. The cluster-robust variance matrix formula given in Section II carries over immediately to OLS estimation in the FE model, again assuming G ! 1. In this section we consider some practicalities. First, including xed e ects generally does not control for all the within-cluster correlation of the error and one should still use the CRVE. Second, when cluster sizes are small and degrees-of-freedom corrections are used the CRVE should be computed by within rather than LSDV estimation. Third, FGLS estimators need to be bias-corrected when cluster sizes are small. Fourth, tests of xed versus random e ect models should use a modi ed version of the Hausman test. IIIA. Do Fixed E ects Fully Control for Within-Cluster Correlation? While cluster-speci c e ects will control for part of the within-cluster correlation of the error, in general they will not completely control for within-cluster error correlation (not to mention heteroskedasticity). So the CRVE should still be used. There are several ways to make this important point. Suppose we have data on students in classrooms in schools. A natural model, a special case of a hierarchical model, is to suppose that there is both an unobserved school e ect and, on top of that, an unobserved classroom e ect. Letting i denote individual, s school, and c classroom, we have yisc = x0isc + s + c + "isc . A regression with school-level xed e ects (or random e ects) will control for within-school correlation, but not the additional within-classroom correlation. Suppose we have a short panel (T xed, N ! 1) of uncorrelated individuals and estimate yit = x0it + i + uit . Then the error uit may be correlated over time (i.e., within-cluster) due 16 to omitted factors that evolve progressively over time. Inoue and Solon (2006) provide a test for this serial correlation. Cameron and Trivedi (2005, p.710) present an FE individual-level panel data log-earnings regressed on log-hours example with cluster-robust standard errors four times the default. Serial correlation in the error may be due to omitting lagged y as a regressor. When yi;t 1 is included as an additional regressor in the FE model, the ArellanoBond estimator is used and even with yi;t 1 included still requires that we rst test whether the remaining error uit is serially correlated. Finally, suppose we have a single cross-section (or a single time series). This can be viewed as regression on a single cluster. Then in the model yi = + x0i + ui (or yt = + x0t + ut ), the intercept is the cluster-speci c xed e ect. There are many reasons for why the error ui (or ut ) may be correlated in this regression. IIIB. Cluster-Robust Variance Matrix with Fixed E ects Since inclusion of cluster-speci c xed e ects may not fully control for cluster correlation (and/or heteroskedasticity), default standard errors that assume uig to be i.i.d. may be invalid. So one should use cluster-robust standard errors. b Arellano (1987) showed that Vclu [ b ] de ned in (10)-(11) remains valid for the within estimator that controls for inclusion of G cluster-speci c xed e ects, provided G ! 1 and Ng is small. If instead one obtains the LSDV estimator, the CRVE formula gives the same CRVE for b as that for the within estimator, with the important proviso that the same degrees-of-freedom adjustment must be used { see below. 
The xed e ects estimates b g b obtained for the LSDV estimator are essentially based only on Ng observations, so V[b g ] is inconsistent for V[b g ], just as b g is inconsistent for g . Hansen (2007a, p.600) shows that this CRVE can also be used if additionally Ng ! 1, for both the case where within-cluster correlation is always present (e.g. for many individuals in each village) and for the case where within-cluster correlation eventually disappears (e.g. for panel data where time series correlation disappears for observations far apart in time). p p The rates of convergence are G in the rst case and GNg in the second case, but the same asymptotic variance matrix is obtained in either case. Kezdi (2004) analyzed the CRVE for a range of values of G and Ng . It is important to note that while LSDV and within estimation lead to identical estimates of , they can yield di erent standard errors due to di erent nite sample degrees-of-freedom correction. It is well known that if default standard errors are used, i.e. it is assumed that uig in (17) is i.i.d., then one can safely use standard errors after LSDV estimation as it correctly views the number of parameters as G + K rather than K. If instead the within estimator is used, however, manual OLS estimation of (18) will mistakenly view the number of parameters to equal K rather than G + K. (Built-in panel estimation commands for the within estimator, i.e. a xed e ects command, should remain okay to use, since they should be programmed to use G + K in calculating the standard errors.) 17 It is not well known that if cluster-robust standard errors are used, and cluster sizes are small, then inference should be based on the within estimator standard errors. We thank Arindrajit Dube and Jason Lindo for bringing this issue to our attention. Within and LSDV estimation lead to the same cluster-robust standard errors if we apply formula (11) to the respective regressions, or if we multiply this estimate by c = G=(G 1). Di erences arise, however, if we multiply by the small-sample correction c given in (12). Within estimation sets c = GG 1 N N 1 1) since there are only (K 1) regressors { the within model is estimated (K N 1 without an intercept. LSDV estimation uses c = GG 1 N G (K 1) since the G cluster dummies are also included as regressors. For balanced clusters with Ng = N and G large relative to K, c ' 1 for within estimation and c ' N =(N 1) for LSDV estimation. Suppose there are only two observations per cluster, due to only two individuals per household or two time periods in a panel setting, so Ng = N = 2. Then c ' 2=(2 1) = 2 for LSDV estimation, leading to CRVE that is twice that from within estimation. Within estimation leads to the correct nite-sample correction. In Stata the within command xtreg y x, fe vce(robust) gives the desired CRVE. The alternative LSDV commands regress y x i.id clu, vce(cluster id clu) and, equivalently, regress y x, absorb(id clu) vce(cluster id clu) use the wrong degrees-offreedom correction. If a CRVE is needed, then use xtreg. If there is reason to instead use regress i.id then the cluster-robust standard errors should be multiplied by the square root of [N (K 1)]=[N G (K 1)], especially if Ng is large and G is small. The inclusion of dummy variables does not lead to an increase in the rank of the CRVE. To see this, stack the dummy variable dhig for cluster g into the Ng 1 vector dhg . 
Then b b dh0g ug = 0, by the OLS normal equations, leading to the rank of Vclu [ b ] falling by one for each cluster-speci c e ect. If there are m regressors varying within cluster and G 1 b dummies then, even though there are m + G 1 parameters , the rank of Vclu [ b ] is only the minimum of m and G 1. And a test that 1 ; :::; G are jointly statistically signi cant is a test of G 1 restrictions (since the intercept or one of the xed e ects needs to be dropped). So even if the cluster-speci c xed e ects are consistently estimated (i.e., if Ng ! 1), it is not possible to perform this test if m < G 1, which is often the case. If cluster-speci c e ects are present then the pairs cluster bootstrap must be adapted to account for the following complication. Suppose cluster 3 appears twice in a bootstrap resample. Then if clusters in the bootstrap resample are identi ed from the original clusteridenti er, the two occurrences of cluster 3 will be incorrectly treated as one large cluster rather than two distinct clusters. In Stata, the bootstrap option idcluster ensures that distinct identi ers are used in each bootstrap resample. Examples are regress y x i.id clu, vce(boot, cluster(id clu) idcluster(newid) reps(400) seed(10101)) and, more simply, xtreg y x, vce(boot, reps(400) seed(10101)) , as in this latter case Stata automatically accounts for this complication. 18 IIIC. Feasible GLS with Fixed E ects When cluster-speci c xed e ects are present, more e cient FGLS estimation can become more complicated. In particular, if asymptotic theory relies on G ! 1 with Ng xed, the g cannot be consistently estimated. The within estimator of is nonetheless consistent, as g disappears in the mean-di erenced model. But the resulting residuals uig are contaminated, b b and b g , and these residuals will be used to form a FGLS since they depend on both estimator. This leads to bias in the FGLS estimator, so one needs to use bias-corrected FGLS unless Ng ! 1. The correction method varies with the model for g = V[ug ], and currently there are no Stata user-written commands to implement these methods. For panel data a commonly-used model speci es an AR(p) model for the errors uig in (17). If xed e ects are present, then there is a bias (of order Ng 1 ) in estimation of the AR(p) coe cients. Hansen (2007b) obtains bias-corrected estimates of the AR(p) coe cients and uses these in FGLS estimation. Hansen (2007b) in simulations shows considerable e ciency gains in bias-corrected FGLS compared to OLS. Brewer, Crossley, and Joyce (2013) consider a DiD model with individual-level U.S. panel data with N = 750,127, T = 30, and a placebo state-level law so clustering is on state with G = 50. They nd that bias-corrected FGLS for AR(2) errors, using the Hansen (2007b) correction, leads to higher power than FE estimation. In their example ignoring the bias correction does not change results much, perhaps because T = 30 is reasonably large. For balanced clusters with g the same for all g, say g = , and for Ng small, then the FGLS estimator in (13) can be used without need to specify a model for . Instead b have ij th entry G 1 PG uig ujg , where uig are the residuals from initial OLS b we can let g=1 b b estimation. These assumptions may be reasonable for a balanced panel. Two complications can arise. First, even without xed e ects there may be many o -diagonal elements to estimate and this number can be large relative to the number of observations. Second, the xed e ects lead to bias in estimating the o -diagonal covariances. 
Hausman and Kuersteiner (2008) present xes for both complications. IIID. Testing the Need for Fixed E ects FE estimation is accompanied by a loss of precision in estimation, as only within-cluster variation is used (recall we regress (yig yg ) on (xig xg )). Furthermore, the coe cient of a cluster-invariant regressor is not identi ed, since then xig xg = 0. Thus it is standard to test whether it is su cient to estimate by OLS or FGLS, without cluster-speci c xed e ects. The usual test is a Hausman test based on the di erence between the FE estimator b FE , and the RE estimator, b RE . Let 1 denote a subcomponent of , possibly just the coe cient of a single regressor of key interest; at most 1 contains the coe cients of all regressors that are not invariant within cluster or, in the case of panel data, are not aggregate time e ects 19 that take the same value for each individual. The chi-squared distributed test statistic is THaus = ( b 1;FE b 0b 1 b 1;RE ) V ( 1;FE b ;RE ); b where V is a consistent estimate of V[ b 1;FE b 1;RE ]. b Many studies use the standard form of the Hausman test. This forms obtain V under the strong assumption that b RE is fully e cient under the null hypothesis. This requires the unreasonably strong assumptions that i and "ig in (16) are i.i.d., requiring that neither i nor "ig is heteroskedastic and that "ig has no within-cluster correlation. As already noted, these assumptions are likely to fail and one should not use default standard errors. Instead a CRVE should be used. For similar reasons the standard form of the Hausman test should not be used. Wooldridge (2010, p.332) instead proposes implementing a cluster-robust version of the Hausman test by the following OLS regression 0 yig = x0ig + wg + uig ; P where wg denotes the subcomponent of xig that varies within cluster and wg = Ng 1 G wig . i=1 If H0 : = 0 is rejected using a Wald test based on a cluster-robust estimate of the variance matrix, then the xed e ects model is necessary. The Stata user-written command xtoverid, due to Scha er and Stillman (2010), implements this test. b An alternative is to use the pairs cluster bootstrap to obtain V, in each resample obtaining b b b b 1;FE and 1;RE , leading to B resample estimates of ( 1;FE 1;RE ). We are unaware of studies comparing these two cluster-robust versions of the Hausman test. IV. What to Cluster Over? It is not always clear what to cluster over - that is, how to de ne the clusters - and there may even be more than one way to cluster. Before providing some guidance, we note that it is possible for cluster-robust errors to actually be smaller than default standard errors. First, in some rare cases errors may be negatively correlated, most likely when Ng = 2, in which case (6) predicts a reduction in the standard error. Second, cluster-robust is also heteroskedastic-robust and White heteroskedasticrobust standard errors in practice are sometimes larger and sometimes smaller than the default. Third, if clustering has a modest e ect, so cluster-robust and default standard are similar in expectation, then cluster-robust may be smaller due to noise. In cases where the cluster-robust standard errors are smaller they are usually not much smaller than the default, whereas in other applications they can be much, much larger. IVA. Factors Determining What to Cluster Over There are two guiding principles that determine what to cluster over. 
20 First, given V[ b ] de ned in (7) and (9) whenever there is reason to believe that both the regressors and the errors might be correlated within cluster, we should think about clustering de ned in a broad enough way to account for that clustering. Going the other way, if we think that either the regressors or the errors are likely to be uncorrelated within a potential group, then there is no need to cluster within that group. b Second, Vclu [ b ] is an average of G terms that gets closer to V[ b ] only as G gets large. If we de ne very large clusters, so that there are very few clusters to average over in equation b (11), then the resulting Vclu [ b ] can be a very poor estimate of V[ b ]. This complication, and discussion of how few is \few", is the subject of Section VI. These two principles mirror the bias-variance trade-o that is common in many estimation problems { larger and fewer clusters have less bias but more variability. There is no general solution to this trade-o , and there is no formal test of the level at which to cluster. The consensus is to be conservative and avoid bias and use bigger and more aggregate clusters when possible, up to and including the point at which there is concern about having too few clusters. For example, suppose your dataset included individuals within counties within states, and you were considering whether to cluster at the county level or the state level. We have been inclined to recommend clustering at the state level. If there was within-state crosscounty correlation of the regressors and errors, then ignoring this correlation (for example, by clustering at the county level) would lead to incorrect inference. In practice researchers often cluster at progressively higher (i.e., broader) levels and stop clustering when there is relatively little change in the standard errors. This seems to be a reasonable approach. There are settings where one may not need to use cluster-robust standard errors. We outline several, though note that in all these cases it is always possible to still obtain clusterrobust standard errors and contrast them to default standard errors. If there is an appreciable di erence, then use cluster-robust standard errors. If a key regressor is randomly assigned within clusters, or is as good as randomly assigned, then the within-cluster correlation of the regressor is likely to be zero. Thus there is no need to cluster standard errors, even if the model's errors are clustered. In this setting, if there are additionally control variables of interest, and if these are not randomly assigned within cluster, then we may wish to cluster our standard errors for the sake of correct inference on the control variables. If the model includes cluster-speci c xed e ects, and we believe that within-cluster correlation of errors is solely driven by a common shock process, then we may not be worried about clustering. The xed e ects will absorb away the common shock, and the remaining errors will have zero within-cluster correlation. More generally, control variables may absorb systematic within-cluster correlation. For example, in a state-year panel setting, control variables may capture the state-speci c business cycle. However, as already noted in Section IIIA, the within-cluster correlation is usually not fully eliminated. And even if it is eliminated, the errors may still be heteroskedastic. 
There are settings where one may not need to use cluster-robust standard errors. We outline several, though note that in all these cases it is always possible to still obtain cluster-robust standard errors and contrast them to default standard errors. If there is an appreciable difference, then use cluster-robust standard errors.

If a key regressor is randomly assigned within clusters, or is as good as randomly assigned, then the within-cluster correlation of the regressor is likely to be zero. Thus there is no need to cluster standard errors, even if the model's errors are clustered. In this setting, if there are additionally control variables of interest, and if these are not randomly assigned within cluster, then we may wish to cluster our standard errors for the sake of correct inference on the control variables.

If the model includes cluster-specific fixed effects, and we believe that within-cluster correlation of errors is solely driven by a common shock process, then we may not be worried about clustering. The fixed effects will absorb away the common shock, and the remaining errors will have zero within-cluster correlation. More generally, control variables may absorb systematic within-cluster correlation. For example, in a state-year panel setting, control variables may capture the state-specific business cycle. However, as already noted in Section IIIA, the within-cluster correlation is usually not fully eliminated. And even if it is eliminated, the errors may still be heteroskedastic. Stock and Watson (2008) show that applying the usual White (1980) heteroskedastic-consistent variance matrix estimate to the within estimator leads, surprisingly, to inconsistent estimation of $V[\hat{\beta}]$ if $N_g$ is small (though it is correct if $N_g = 2$). They derive a bias-corrected formula for heteroskedastic-robust standard errors. Alternatively, and more simply, the CRVE is consistent for $V[\hat{\beta}]$ even if the errors are only heteroskedastic, though this estimator of $V[\hat{\beta}]$ is more variable.

Finally, as already noted in Section IID, we can always build a parametric model of the correlation structure of the errors and estimate by FGLS. If we believe that this parametric model captures the salient features of the error correlations, then default FGLS standard errors can be used.

IVB. Clustering Due to Survey Design

Clustering routinely arises due to the sampling methods used in complex surveys. Rather than randomly draw individuals from the entire population, costs are reduced by sampling only a randomly-selected subset of primary sampling units (such as a geographic area), followed by random selection, or stratified selection, of people within the chosen primary sampling units.

The survey methods literature uses methods to control for clustering that predate the cluster-robust approach of this paper. The loss of estimator precision due to clustered sampling is called the design effect: "The design effect or Deff is the ratio of the actual variance of a sample to the variance of a simple random sample of the same number of elements" (Kish (1965), p. 258). Kish and Frankel (1974) give the variance inflation formula (6), assuming equicorrelated errors, in the non-regression case of estimation of the mean. Pfeffermann and Nathan (1981) consider the more general regression case. The CRVE is called the linearization formula, and the common use of $G-1$ as the degrees of freedom used in hypothesis testing comes from the survey methods literature; see Shah, Holt and Folsom (1977), which predates the econometrics literature.

Applied economists routinely use data from complex surveys and control for clustering by using a cluster-robust variance matrix estimate. At the minimum one should cluster at the level of the primary sampling unit, though often there is reason to cluster at a broader level, such as clustering on state if regressors and errors are correlated within state.

The survey methods literature additionally controls for two other features of survey data, weighting and stratification. These methods are well established and are incorporated in specialized software, as well as in some broad-based packages such as Stata. Bhattacharya (2005) provides a comprehensive treatment in a GMM framework.

If sample weights are provided then it is common to perform weighted least squares. Then the CRVE for $\hat{\beta}_{WLS} = (X'WX)^{-1}X'Wy$ is that given in (15) with $\hat{\Omega}_g^{-1}$ replaced by $W_g$. The need to weight can be ignored if stratification is on only the exogenous regressors and we assume correct specification of the model, so that in our sample $E[y|X] = X\beta$. In that special case both weighted and unweighted estimators are consistent, and weighted OLS may actually be less efficient if, for example, model errors are i.i.d.; see, for example, Solon, Haider, and Wooldridge (2013). Another situation in which to use weighted least squares, unrelated to complex surveys, is when data for the $ig$-th observation is obtained by in turn averaging $N_{ig}$ observations and $N_{ig}$ varies.
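In Stata, the survey-design clustering and weighting can be handled as follows; a minimal sketch, where psu, stratum and samplewt are hypothetical variable names:

    * weighted least squares, clustering on the primary sampling unit
    regress y x [pweight = samplewt], vce(cluster psu)
    * or declare the survey design; svy additionally adjusts for stratification
    svyset psu [pweight = samplewt], strata(stratum)
    svy: regress y x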
Stratification of the sample can enable more precise statistical inference. These gains can be beneficial in the non-regression case, such as estimating the monthly national unemployment rate. The gains can become much smaller once regressors are included, since these can partially control for stratification; see, for example, the application in Bhattacharya (2005). Econometrics applications therefore usually do not adjust standard errors for stratification, leading to conservative inference due to some relatively small over-estimation of the standard errors.

V. Multi-way Clustering

The discussion to date has presumed that if there is more than one potential way to cluster, these ways are nested in each other, such as households within states. But when clusters are non-nested, traditional cluster inference can only deal with one of the dimensions. In some applications it is possible to include sufficient regressors to eliminate concern about error correlation in all but one dimension, and then do cluster-robust inference for that remaining dimension. A leading example is that in a state-year panel there may be clustering both within years and within states. If the within-year clustering is due to shocks that are the same across all observations in a given year, then including year fixed effects as regressors will absorb within-year clustering, and inference then need only control for clustering on state.

When this is not possible, the one-way cluster-robust variance can be extended to multi-way clustering. Before discussing this, we want to highlight one error that we find some practitioners make. This error is to cluster at the intersection of the two groupings. In the preceding example, some might be tempted to cluster at the state-year level. This is what Stata will do if you give it the command

    regress y x, vce(cluster id_styr)

where id_styr is a state-year identifier. This will be very inadequate, since it imposes the restriction that observations are independent if they are in the same state but in different years. Indeed if the data is aggregated at the state-year level, there is only one observation at the state-year level, so this is identical to using heteroskedastic-robust standard errors, i.e. not clustering at all. This point was highlighted by Bertrand, Duflo, and Mullainathan (2004), who advocated clustering on the state.

VA. Multi-way Cluster-Robust Variance Matrix Estimate

The cluster-robust estimate of $V[\hat{\beta}]$ defined in (10)-(11) can be generalized to clustering in multiple dimensions. In a change of notation, suppress the subscript for cluster and more simply denote the model for an individual observation as
\[
y_i = x_i'\beta + u_i. \qquad (19)
\]
Regular one-way clustering is based on the assumption that $E[u_iu_j|x_i,x_j] = 0$ unless observations $i$ and $j$ are in the same cluster. Then (11) sets $\hat{B} = \sum_{i=1}^N\sum_{j=1}^N x_ix_j'\hat{u}_i\hat{u}_j\,\mathbf{1}[i,j \text{ in same cluster}]$, where $\hat{u}_i = y_i - x_i'\hat{\beta}$. In multi-way clustering, the key assumption is that $E[u_iu_j|x_i,x_j] = 0$ unless observations $i$ and $j$ share any cluster dimension. Then the multi-way cluster-robust estimate of $V[\hat{\beta}]$ replaces (11) with
\[
\hat{B} = \sum_{i=1}^N\sum_{j=1}^N x_ix_j'\hat{u}_i\hat{u}_j\,\mathbf{1}[i,j \text{ share any cluster}]. \qquad (20)
\]
This method relies on asymptotics that are in the number of clusters of the dimension with the fewest number of clusters. This method is thus most appropriate when each dimension has many clusters.
Theory for two-way cluster-robust estimates of the variance matrix is presented in Cameron, Gelbach, and Miller (2006, 2011), Miglioretti and Heagerty (2006), and Thompson (2006, 2011). See also empirical panel data applications by Acemoglu and Pischke (2003), who clustered at individual level and at region-time level, and by Petersen (2009), who clustered at firm level and at year level. Cameron, Gelbach and Miller (2011) present the extension to multi-way clustering. Like one-way cluster-robust, the method can be applied to estimators other than OLS.

For two-way clustering this robust variance estimator is easy to implement given software that computes the usual one-way cluster-robust estimate. First obtain three different cluster-robust "variance" matrices for the estimator by one-way clustering in, respectively, the first dimension, the second dimension, and the intersection of the first and second dimensions. Then add the first two variance matrices and, to account for double-counting, subtract the third. Thus
\[
\hat{V}_{2\text{way}}[\hat{\beta}] = \hat{V}_1[\hat{\beta}] + \hat{V}_2[\hat{\beta}] - \hat{V}_{1\cap 2}[\hat{\beta}], \qquad (21)
\]
where the three component variance estimates are computed using (10)-(11) for the three different ways of clustering. We spell this out in a step-by-step fashion; a coded version of these steps is sketched below.

1. Identify your two ways of clustering. Make sure you have a variable that identifies each way of clustering. Also create a variable that identifies unique "group 1 by group 2" combinations. For example, suppose you have individual-level data spanning many U.S. states and many years, and you want to cluster on state and on year. You will need a variable for state (e.g. California), a variable for year (e.g. 1990), and a variable for state-by-year (California and 1990).

2. Estimate your model, clustering on "group 1". For example, regress y on x, clustering on state. Save the variance matrix as $\hat{V}_1$.

3. Estimate your model, clustering on "group 2". For example, regress y on x, clustering on year. Save the variance matrix as $\hat{V}_2$.

4. Estimate your model, clustering on "group 1 by group 2". For example, regress y on x, clustering on state-by-year. Save the variance matrix as $\hat{V}_{1\cap 2}$.

5. Create a new variance matrix $\hat{V}_{2\text{way}} = \hat{V}_1 + \hat{V}_2 - \hat{V}_{1\cap 2}$. This is your new two-way cluster-robust variance matrix for $\hat{\beta}$.

6. Standard errors are the square root of the diagonal elements of this matrix. If you are interested in only one coefficient, you can also just focus on saving the standard error for this coefficient in steps 2-4 above, and then create $se_{2\text{way}} = \sqrt{se_1^2 + se_2^2 - se_{1\cap 2}^2}$.

In taking these steps, you should watch out for some potential pitfalls. With perfectly multicollinear regressors, such as inclusion of dummy variables some of which are redundant, a statistical package may automatically drop one or more variables to ensure a nonsingular set of regressors. If the package happens to drop different sets of variables in steps 2, 3, and 4, then the resulting $\hat{V}$'s will not be comparable, and adding them together in step 5 will give a nonsense result. To prevent this issue, manually inspect the estimation results in steps 2, 3, and 4 to ensure that each step has the same set of regressors, the same number of observations, etc. The only things that should be different are the reported standard errors and the reported number of clusters.
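A minimal Stata sketch of these steps, assuming hypothetical variables y, x, state and year:

    * step 1: identifier for the intersection of the two groupings
    egen styr = group(state year)
    * steps 2-4: one-way cluster-robust variance matrices
    quietly regress y x, vce(cluster state)
    matrix V1 = e(V)
    quietly regress y x, vce(cluster year)
    matrix V2 = e(V)
    quietly regress y x, vce(cluster styr)
    matrix V12 = e(V)
    * step 5: combine; step 6: standard errors are roots of the diagonal
    matrix V2way = V1 + V2 - V12
    matrix list V2way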
VB. Implementation

Unlike the standard one-way cluster case, $\hat{V}_{2\text{way}}[\hat{\beta}]$ is not guaranteed to be positive semi-definite, so it is possible that it may compute negative variances. In some applications with fixed effects, $\hat{V}_{2\text{way}}[\hat{\beta}]$ may be non-positive-definite, but the subcomponent of $\hat{V}_{2\text{way}}[\hat{\beta}]$ associated with the regressors of interest may be positive-definite. This may lead to an error message, even though inference is appropriate for the parameters of interest. Our informal observation is that this issue is most likely to arise when clustering is done over the same groups as the fixed effects. Few clusters in one or more dimensions can also lead to $\hat{V}_{2\text{way}}[\hat{\beta}]$ being a non-positive-semi-definite matrix. Cameron, Gelbach and Miller (2011) present an eigendecomposition technique, used in the time series HAC literature, that zeroes out negative eigenvalues in $\hat{V}_{2\text{way}}[\hat{\beta}]$ to produce a positive semi-definite variance matrix estimate.

The Stata user-written command cmgreg, available at the authors' websites, implements multi-way clustering for the OLS estimator with, if needed, the negative-eigenvalue adjustment. The Stata add-on command ivreg2, due to Baum, Schaffer, and Stillman (2007), implements two-way clustering for OLS, IV and linear GMM estimation. Other researchers have also posted code, available by searching the web.

Cameron, Gelbach, and Miller (2011) apply the two-way method to data from Hersch (1998) that examines the relationship between individual wages and injury risk measured separately at the industry level and the occupation level. The log-wage for 5960 individuals is regressed on these two injury risk measures, with standard errors obtained by two-way clustering on 211 industries and 387 occupations. In that case two-way clustering leads to only a modest change in the standard error of the industry job risk coefficient compared to the standard error with one-way clustering on industry. Since industry job risk is perfectly correlated within industry, by result (6) we need to cluster on industry if there is any within-industry error correlation. By similar logic, the additional need to cluster on occupation depends on the within-occupation correlation of industry job risk, and this correlation need not be high. For the occupation job risk coefficient, the two-way and one-way cluster (on occupation) standard errors are similar. Despite the modest difference in this example, two-way clustering avoids the need to report standard errors for one coefficient clustering in one way and for the second coefficient clustering in the second way.

Cameron, Gelbach, and Miller (2011) also apply the two-way cluster-robust method to data on volume of trade between 98 countries with 3262 unique country pairs. In that case, two-way clustering on each of the countries in the country pair leads to standard errors that are 40% larger than one-way clustering or not clustering at all. Cameron and Miller (2012) study such dyadic data in further detail. They note that two-way clustering does not pick up all the potential correlations in the data. Instead, more general cluster-robust methods, including one proposed by Fafchamps and Gubert (2007), should be used.

VC. Feasible GLS

Similar to one-way clustering, FGLS is more efficient than OLS, provided an appropriate model for $\Omega = E[uu'|X]$ is specified and $\Omega$ is consistently estimated. The random effects model can be extended to multi-way clustering. For individual $i$ in clusters $g$ and $h$, the two-way random effects model specifies
\[
y_{igh} = x_{igh}'\beta + \alpha_g + \delta_h + \varepsilon_{igh},
\]
where the errors $\alpha_g$, $\delta_h$, and $\varepsilon_{igh}$ are each assumed to be i.i.d. distributed with mean 0.
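In Stata this two-way random effects model can be fit as a mixed model with crossed (non-nested) random intercepts; a minimal sketch, where g and h are hypothetical cluster identifiers:

    * crossed random intercepts for g and for h
    mixed y x || _all: R.g || _all: R.h
    * (in Stata 12 and earlier, use xtmixed with the same syntax)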
For example, Moulton (1986) considered clustering due to grouping of regressors (schooling, age and weeks worked) in a log earnings regression, and estimated a model with a common random shock for each year of schooling, for each year of age, and for each number of weeks worked. The two-way random effects model can be estimated using standard software for (nested) hierarchical linear models; see, for example, Cameron and Trivedi (2009, ch. 9.5.7) for Stata command xtmixed (command mixed from version 13 on). For estimation of a many-way random effects model, see Davis (2002), who modelled film attendance data clustered by film, theater and time.

The default standard errors after FGLS estimation require that $\Omega$ is correctly specified. FGLS estimation entails transforming the data in such a way that there is no obvious method for computing a variance matrix estimate that is robust to misspecification of $\Omega$ in the two-way or multi-way random effects model. Instead, if there is concern about misspecification of $\Omega$ then one needs to consider FGLS with richer models for $\Omega$ (see Rabe-Hesketh and Skrondal (2012) for richer hierarchical models in Stata) or do less efficient OLS with two-way cluster-robust standard errors.

VD. Spatial Correlation

Cluster-robust variance matrix estimates are closely related to spatial-robust variance matrix estimates. In general, for model (19), $\hat{B}$ in (20) has the form
\[
\hat{B} = \sum_{i=1}^N\sum_{j=1}^N w(i,j)\,x_ix_j'\hat{u}_i\hat{u}_j, \qquad (22)
\]
where $w(i,j)$ are weights. For cluster-robust inference these weights are either 1 (cluster in common) or 0 (no cluster in common). But the weights can also decay from one to zero, as in the case of the HAC variance matrix estimate for time series, where $w(i,j)$ decays to zero as $|i-j|$ increases.

For spatial data it is assumed that model errors become less correlated as the spatial distance between observations grows. For example, with state-level data the assumption that model errors are uncorrelated across states may be relaxed to allow correlation that decays to zero as the distance between states gets large. Conley (1999) provides conditions under which (10) and (22) provide a robust variance matrix estimate for the OLS estimator, where the weights $w(i,j)$ decay with the spatial distance. The estimate (which Conley also generalizes to GMM models) is often called a spatial-HAC estimate, rather than spatial-robust, as proofs use mixing conditions (to ensure decay of dependence) as observations grow apart in distance. These conditions are not applicable to clustering due to common shocks, which leads to the cluster-robust estimator with independence of observations across clusters.

Driscoll and Kraay (1998) consider panel data with $T$ time periods and $N$ individuals, with errors potentially correlated across individuals (and no spatial dampening), though this correlation across individuals disappears for observations that are more than $m$ time periods apart. Let $it$ denote the typical observation. The Driscoll-Kraay spatial correlation consistent (SCC) variance matrix estimate can be shown to use weight $w(it,js) = 1 - d(it,js)/(m+1)$ in (22), where the summation is now over $i$, $j$, $s$ and $t$, and $d(it,js) = |t-s|$ if $|t-s| \le m$ and $d(it,js) = m+1$ otherwise. This method requires that the number of time periods $T \to \infty$, so it is not suitable for short panels, while $N$ may be fixed or $N \to \infty$. The Stata add-on command xtscc, due to Hoechle (2007), implements this variance estimator.
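For example, a minimal sketch using the user-written xtscc command, assuming hypothetical panel identifiers id and year and a bandwidth of m = 2 periods:

    * Driscoll-Kraay standard errors for a panel regression
    xtset id year
    xtscc y x, lag(2)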
An estimator proposed by Thompson (2006) allows for across-cluster (in his example, firm) correlation for observations close in time, in addition to within-cluster correlation at any time separation. The Thompson estimator can be thought of as using $w(it,js) = \mathbf{1}[i,j \text{ share a firm, or } d(it,js) \le m]$. Foote (2007) contrasts the two-way cluster-robust and these other variance matrix estimators in the context of a macroeconomics example. Petersen (2009) contrasts various methods for panel data on financial firms, where there is concern about both within-firm correlation (over time) and across-firm correlation due to common shocks.

Barrios, Diamond, Imbens, and Kolesar (2012) consider state-year panel data on individuals in states over years with state-level treatment and an outcome (earnings) that is correlated spatially across states. This spatial correlation can be ignored if the state-level treatment is randomly assigned. But if the treatment is correlated over states (e.g. adjacent states may be more likely to have similar minimum wage laws), then one can no longer use standard errors clustered at the state level. Instead one should additionally allow for spatial correlation of errors across states. The authors additionally contrast traditional model-based inference with randomization inference.

In practice data can have cluster, spatial and time series aspects, leading to hybrids of cluster-robust, spatial-HAC and time-series HAC estimators. Furthermore, it may be possible to parameterize some of the error correlation. For example, for a time series AR(1) error it may be preferable to use $\widehat{E}[u_tu_s]$ based on an AR(1) model rather than $w(t,s)\hat{u}_t\hat{u}_s$. To date empirical practice has not commonly modeled these combined types of error correlations. This may become more common in the future.

VI. Few Clusters

We return to one-way clustering, and focus on the Wald "t-statistic"
\[
w = \frac{\hat{\beta} - \beta^0}{s_{\hat{\beta}}}, \qquad (23)
\]
where $\beta$ is one element in the parameter vector, and the standard error $s_{\hat{\beta}}$ is the square root of the appropriate diagonal entry in $\hat{V}_{\text{clu}}[\hat{\beta}]$. If $G \to \infty$ then $w \sim N[0,1]$ under $H_0\colon \beta = \beta^0$. For finite $G$ the distribution of $w$ is unknown, even with normal errors. It is common to use the $T$ distribution with $G-1$ degrees of freedom.

It is not unusual for the number of clusters $G$ to be quite small. Despite few clusters, $\hat{\beta}$ may still be a reasonably precise estimate of $\beta$ if there are many observations per cluster. But with small $G$ the asymptotics have not kicked in. Then $\hat{V}_{\text{clu}}[\hat{\beta}]$ can be downwards-biased.

One should at a minimum use $T(G-1)$ critical values and $\hat{V}_{\text{clu}}[\hat{\beta}]$ defined in (10)-(11) with residuals scaled by $\sqrt{c}$, where $c$ is defined in (12) or $c = G/(G-1)$. Most packages rescale the residuals: Stata uses the first choice of $c$ and SAS the second. The use of $T(G-1)$ critical values is less common. Stata uses this adjustment after command regress y x, vce(cluster), but the alternative command xtreg y x, vce(robust) instead uses standard normal critical values. The difference can be appreciable, as the example below illustrates.

Even with both of these adjustments, Wald tests generally over-reject. The extent of over-rejection depends on both how few clusters there are and the particular data and model used. In some cases the over-rejection is mild; in other cases a test with nominal size 0.05 may have true test size of 0.10. The next subsection outlines the basic problem and discusses how few is "few" clusters. The subsequent three subsections present three approaches to finite-cluster correction: bias-corrected variance, bootstrap with asymptotic refinement, and use of the $T$ distribution with adjusted degrees-of-freedom. The final subsection considers special cases.
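To see the size of the adjustment, a small Stata illustration with G = 10 clusters (these numbers match the comparison given in Section VID.1):

    * critical values and p-values: T(G-1) with G = 10 versus standard normal
    display invttail(9, .025)   // T(9) two-sided 5% critical value: 2.262
    display invnormal(.975)     // standard normal critical value: 1.960
    display 2*ttail(9, 1.96)    // p-value of w = 1.96 under T(9): about .082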
VIA. The Basic Problems with Few Clusters

There are two main problems with few clusters. First, OLS leads to "overfitting", with estimated residuals systematically too close to zero compared to the true error terms. This leads to a downwards-biased cluster-robust variance matrix estimate. The second problem is that, even with bias-correction, the use of fitted residuals to form the estimate $\hat{B}$ of $B$ leads to over-rejection (and confidence intervals that are too narrow) if the critical values are from the standard normal or even $T(G-1)$ distribution.

For the linear regression model with independent homoskedastic normally distributed errors these problems are easily controlled. An unbiased variance matrix is obtained by estimating the error variance $\sigma^2$ by $s^2 = \hat{u}'\hat{u}/(N-K)$ rather than $\hat{u}'\hat{u}/N$. This is the "fix" in the OLS setting for the first problem. The analogue to the second problem is that the $N[0,1]$ distribution is a poor approximation to the true distribution of the Wald statistic. In the i.i.d. case, the Wald statistic $w$ can be shown to be exactly $T(N-K)$ distributed. For non-normal homoskedastic errors the $T(N-K)$ is still felt to provide a good approximation, provided $N$ is not too small.

Both of these problems arise in the clustered setting, albeit with more complicated manifestations and fixes. For independent heteroskedastic normally distributed errors there are no exact results. MacKinnon and White (1985) consider several adjustments to the heteroskedastic-consistent variance estimate of White (1980), including one called HC2 that is unbiased in the special case that errors are homoskedastic. Unfortunately, if errors are actually heteroskedastic, as expected, the HC2 estimate is no longer unbiased for $V[\hat{\beta}]$: an unbiased estimator depends on the unknown pattern of heteroskedasticity and on the design matrix $X$. And there is no way to obtain an exact $T$ distribution result for $w$, even if errors are normally distributed. Other proposed solutions for testing and forming confidence intervals include using a $T$ distribution with data-determined degrees of freedom, and using a bootstrap with asymptotic refinement. In the following subsections we consider extensions of these various adjustments to the clustered case, where the problems can become even more pronounced.

Before proceeding we note that there is no specific point at which we need to worry about few clusters. Instead, "more is better". Current consensus appears to be that $G = 50$ is enough for state-year panel data. In particular, Bertrand, Duflo, and Mullainathan (2004, Table 8) find in their simulations that for a policy dummy variable with high within-cluster correlation, a Wald test based on the standard CRVE with critical value of 1.96 had rejection rates of .063, .058, .080, and .115 for number of states ($G$) equal to, respectively, 50, 20, 10 and 6. The simulations of Cameron, Gelbach and Miller (2008, Table 3), based on a quite different data generating process but again with standard CRVE and critical value of 1.96, had rejection rates of .068, .081, .118, and .208 for $G$ equal to, respectively, 30, 20, 10 and 5. In both cases the rejection rates would also exceed .05 if the critical value was from the $T(G-1)$ distribution. The preceding results are for balanced clusters.
Cameron, Gelbach and Miller (2008, Table 4, column 8) consider unbalanced clusters when $G = 10$. The rejection rate with unbalanced clusters, half of size $N_g = 10$ and half of size 50, is .183, appreciably worse than rejection rates of .126 and .115 for balanced clusters of sizes, respectively, 10 and 100. Recent papers by Carter, Schnepel, and Steigerwald (2013) and Imbens and Kolesar (2012) provide theory that also indicates that the effective number of clusters is reduced when $N_g$ varies across clusters; see also the simulations in MacKinnon and Webb (2013). Similar issues may also arise if the clusters are balanced, but estimation is by weighted OLS that places different weights on different clusters. Cheng and Hoekstra (2013) document that weighting can result in over-rejection in the panel DiD setting of Bertrand, Duflo, and Mullainathan (2004).

To repeat a key message, there is no clear-cut definition of "few". Depending on the situation, "few" may range from less than 20 to less than 50 clusters in the balanced case, and even more clusters in the unbalanced case.

VIB. Solution 1: Bias-Corrected Cluster-Robust Variance Matrix

A weakness of the standard CRVE with residual $\hat{u}_g$ is that it is biased for $V_{\text{clu}}[\hat{\beta}]$, since $E[\hat{u}_g\hat{u}_g'] \ne E[u_gu_g']$. The bias depends on the form of $\Omega_g$ but will usually be downwards. Several corrected residuals $\tilde{u}_g$ to use in place of $\hat{u}_g$ in (11) have been proposed. The simplest, already mentioned, is to use $\tilde{u}_g = \sqrt{G/(G-1)}\,\hat{u}_g$ or $\tilde{u}_g = \sqrt{c}\,\hat{u}_g$, where $c$ is defined in (12). One should at least use either of these corrections.

Bell and McCaffrey (2002) use
\[
\tilde{u}_g = [I_{N_g} - H_{gg}]^{-1/2}\hat{u}_g, \qquad (24)
\]
where $H_{gg} = X_g(X'X)^{-1}X_g'$. This transformed residual leads to an unbiased CRVE in the special case that $E[u_gu_g'] = \sigma^2 I$. This is a cluster generalization of the HC2 variance estimate of MacKinnon and White (1985), so we refer to it as the CR2VE.

Bell and McCaffrey (2002) also use
\[
\tilde{u}_g = \sqrt{\frac{G-1}{G}}\,[I_{N_g} - H_{gg}]^{-1}\hat{u}_g. \qquad (25)
\]
This transformed residual leads to a CRVE that can be shown to equal the delete-one-cluster jackknife estimate of the variance of the OLS estimator. This jackknife correction leads to an upwards-biased CRVE if in fact $E[u_gu_g'] = \sigma^2 I$. This is a cluster generalization of the HC3 variance estimate of MacKinnon and White (1985), so we refer to it as the CR3VE.

Angrist and Pischke (2009, Chapter 8) and Cameron, Gelbach and Miller (2008) provide a more extensive discussion and cite more of the relevant literature. This literature includes papers that propose corrections for the more general case that $E[u_gu_g'] \ne \sigma^2 I$ but has a known parameterization, such as an RE model, and extension to generalized linear models.

Angrist and Lavy (2002) apply the CR2VE correction (24) in an application with $G = 30$ to 40 and find that the correction increases cluster-robust standard errors by between 10 and 50 percent. Cameron, Gelbach and Miller (2008, Table 3) find that the CR3VE correction (25) has rejection rates of .062, .070, .092, and .138 for $G$ equal to, respectively, 30, 20, 10 and 5. These rejection rates are a considerable improvement on .068, .081, .118, and .208 for the standard CRVE, but there is still considerable over-rejection for very small $G$. The literature has gravitated to using the CR2VE adjustment rather than the CR3VE adjustment. This reduces but does not eliminate over-rejection when there are few clusters.
VIC. Solution 2: Cluster Bootstrap with Asymptotic Refinement

In Section IIF we introduced the bootstrap as it is usually used, to calculate standard errors that rely on regular asymptotic theory. Here we consider a different use of the bootstrap, one with asymptotic refinement that may lead to improved finite-sample inference.

We consider inference based on $G \to \infty$ (formally, $\sqrt{G}(\hat{\beta}-\beta)$ has a limit normal distribution). Then a two-sided Wald test of nominal size 0.05, for example, can be shown to have true size $0.05 + O(G^{-1})$ when the usual asymptotic normal approximation is used. For $G \to \infty$ this equals the desired 0.05, but for finite $G$ it differs from 0.05. If an appropriate bootstrap with asymptotic refinement is instead used, the true size is $0.05 + O(G^{-3/2})$. This is closer to the desired 0.05 for large $G$, as $G^{-3/2} < G^{-1}$. Hopefully this is also the case for small $G$, something that is established using appropriate Monte Carlo experiments. For a one-sided test or a nonsymmetric two-sided test the rates are instead, respectively, $0.05 + O(G^{-1/2})$ and $0.05 + O(G^{-1})$.

Asymptotic refinement can be achieved by bootstrapping a statistic that is asymptotically pivotal, meaning the asymptotic distribution does not depend on any unknown parameters. The estimator $\hat{\beta}$ is not asymptotically pivotal, as its distribution depends on $V[\hat{\beta}]$, which in turn depends on unknown variance parameters in $V[u|X]$. The Wald t-statistic defined in (23) is asymptotically pivotal, as its asymptotic distribution is $N[0,1]$, which does not depend on unknown parameters.

VIC.1. Percentile-t Bootstrap

A percentile-t bootstrap obtains $B$ draws, $w_1^*,\ldots,w_B^*$, from the distribution of the Wald t-statistic as follows. $B$ times do the following:

1. Obtain $G$ clusters $\{(y_1^*,X_1^*),\ldots,(y_G^*,X_G^*)\}$ by one of the cluster bootstrap methods detailed below.

2. Compute the OLS estimate $\hat{\beta}_b^*$ in this resample.

3. Calculate the Wald test statistic $w_b^* = (\hat{\beta}_b^* - \hat{\beta})/s_{\hat{\beta}_b^*}$, where $s_{\hat{\beta}_b^*}$ is the cluster-robust standard error of $\hat{\beta}_b^*$, and $\hat{\beta}$ is the OLS estimate of $\beta$ from the original sample.

Note that we center on $\hat{\beta}$ and not $\beta^0$, as the bootstrap views the sample as the population, i.e., $\beta = \hat{\beta}$, and the $B$ resamples are based on this "population." Note also that the standard error in step 3 needs to be cluster-robust. A good choice of $B$ is $B = 999$; this is more than the $B$ used for standard error estimation, as tests are in the tails of the distribution, and is such that $\alpha(B+1)$ is an integer for common choices of test size $\alpha$.

Let $w_{(1)}^*,\ldots,w_{(B)}^*$ denote the ordered values of $w_1^*,\ldots,w_B^*$. These ordered values trace out the density of the Wald t-statistic, taking the place of a standard normal or $T$ distribution. For example, the critical values for a 95% nonsymmetric confidence interval or a 5% nonsymmetric Wald test are the lower 2.5 percentile and upper 97.5 percentile of $w_1^*,\ldots,w_B^*$, rather than, for example, the standard normal values of $-1.96$ and $1.96$. The p-value for a symmetric test based on the original sample Wald statistic $w$ equals the proportion of times that $|w_b^*| > |w|$, $b = 1,\ldots,B$.

The simplest cluster resampling method is the pairs cluster resampling introduced in Section IIF. Then in step 1 above we form $G$ clusters $\{(y_1^*,X_1^*),\ldots,(y_G^*,X_G^*)\}$ by resampling with replacement $G$ times from the original sample of clusters; a coded sketch follows. This method has the advantage of being applicable to a wide range of estimators, not just OLS. However, for some types of data the pairs cluster bootstrap may not be applicable; see "Bootstrap with Caution" below.
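A minimal Stata sketch of the percentile-t pairs cluster bootstrap for the coefficient on a hypothetical regressor x, clustering on g (seed setting and error handling are elided):

    * original-sample estimate
    regress y x, vce(cluster g)
    scalar bhat = _b[x]
    * resample clusters with replacement, saving each w*_b
    tempfile wstars
    tempname pf
    postfile `pf' wstar using `wstars'
    forvalues b = 1/999 {
        preserve
        bsample, cluster(g) idcluster(gnew)   // draw G clusters, relabel ids
        quietly regress y x, vce(cluster gnew)
        post `pf' ((_b[x] - bhat)/_se[x])     // center on the original bhat
        restore
    }
    postclose `pf'

The ordered wstar values from the saved file then supply the bootstrap critical values, and the symmetric p-value is the proportion of the $|w_b^*|$ exceeding the original-sample $|w|$.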
Cameron, Gelbach, and Miller (2008) found that in Monte Carlos with few clusters the pairs cluster bootstrap did not eliminate test over-rejection. The authors proposed using an alternative percentile-t bootstrap, the wild cluster bootstrap, that holds the regressors fixed across bootstrap replications.

VIC.2. Wild Cluster Bootstrap

The wild cluster bootstrap resampling method is as follows. First, estimate the main model, imposing (forcing) the null hypothesis $H_0$ that you wish to test, to give the estimate $\tilde{\beta}_{H_0}$. For example, for a test of statistical significance of a regressor, OLS regress $y_{ig}$ on all components of $x_{ig}$ except the regressor with coefficient zero under the null hypothesis. Form the residual $\tilde{u}_{ig} = y_{ig} - x_{ig}'\tilde{\beta}_{H_0}$. Then obtain the $b$-th resample for step 1 above as follows:

1a. Randomly assign cluster $g$ the weight $d_g = -1$ with probability 0.5 and the weight $d_g = 1$ with probability 0.5. All observations in cluster $g$ get the same value of the weight.

1b. Generate new pseudo-residuals $u_{ig}^* = d_g\tilde{u}_{ig}$, and hence new outcome variables $y_{ig}^* = x_{ig}'\tilde{\beta}_{H_0} + u_{ig}^*$.

Then proceed with step 2 as before, regressing $y_{ig}^*$ on $x_{ig}$, and calculate $w_b^*$ as in step 3. The p-value for the test based on the original sample Wald statistic $w$ equals the proportion of times that $|w_b^*| > |w|$, $b = 1,\ldots,B$; a coded sketch appears below.

For the wild bootstrap, the values $w_1^*,\ldots,w_B^*$ cannot be used directly to form critical values for a confidence interval. Instead they can be used to provide a p-value for testing a hypothesis. To form a confidence interval, one needs to invert a sequence of tests, profiling over a range of candidate null hypotheses $H_0\colon \beta = \beta^0$. For each of these null hypotheses, obtain the p-value. The 95% confidence interval is the set of values of $\beta^0$ for which $p \ge 0.05$. This method is computationally intensive, but conceptually straightforward. As a practical matter, you may want to ensure that you have the same set of bootstrap draws across candidate hypotheses, so as not to introduce additional bootstrapping noise into the determination of where the cutoff is.

In principle it is possible to directly use a bootstrap for bias-reduction, such as to remove bias in standard errors. In practice this is not done, however, as any bias reduction comes at the expense of considerably greater variability. A conservative estimate of the standard error equals the width of a 95% confidence interval, obtained using asymptotic refinement, divided by $2 \times 1.96$.
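A minimal Stata sketch of the wild cluster bootstrap p-value for $H_0\colon \beta_x = 0$, with a hypothetical control z and cluster identifier g (seed setting is elided):

    * restricted model imposing H0 (drop x); restricted fit and residuals
    regress y z
    predict double xb0, xb
    predict double u0, residuals
    * original-sample cluster-robust t-statistic for x
    regress y x z, vce(cluster g)
    scalar w0 = _b[x]/_se[x]
    scalar exceed = 0
    forvalues b = 1/999 {
        * 1a: Rademacher weight, constant within cluster
        sort g
        by g: gen byte d = cond(runiform() < .5, 1, -1) if _n == 1
        by g: replace d = d[1]
        * 1b: pseudo-outcomes satisfying H0
        gen double ystar = xb0 + d*u0
        quietly regress ystar x z, vce(cluster g)
        scalar exceed = exceed + (abs(_b[x]/_se[x]) > abs(w0))
        drop d ystar
    }
    display "wild cluster bootstrap p-value = " exceed/999

Because $H_0$ is imposed on the pseudo-data, each resample t-statistic is centered at zero rather than at the original estimate.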
Note that for the wild cluster bootstrap the resamples $\{(y_1^*,X_1),\ldots,(y_G^*,X_G)\}$ have the same $X$ in each resample, whereas for pairs cluster both $y^*$ and $X^*$ vary across the $B$ resamples. The wild cluster bootstrap is an extension of the wild bootstrap proposed for heteroskedastic data. It works essentially because the two-point distribution for forming $u_g^*$ ensures that $E[u_g^*] = 0$ and $V[u_g^*] = \tilde{u}_g\tilde{u}_g'$. There are other two-point distributions that also do so, but Davidson and Flachaire (2008) show that in the heteroskedastic case it is best to use the weights $d_g \in \{-1,1\}$, called Rademacher weights.

The wild cluster bootstrap essentially replaces $y_g$ in each resample with one of two values, $y_g^* = X_g\tilde{\beta}_{H_0} + \tilde{u}_g$ or $y_g^* = X_g\tilde{\beta}_{H_0} - \tilde{u}_g$. Because this is done across $G$ clusters, there are at most $2^G$ possible combinations of the data, so there are at most $2^G$ unique values of $w_1^*,\ldots,w_B^*$. If there are very few clusters then there is no need to actually bootstrap, as we can simply enumerate, with separate estimation for each of the $2^G$ possible datasets. Webb (2013) expands on these limitations. He shows that there are actually only $2^{G-1}$ possible t-statistics in absolute value. Thus with $G = 5$ there are at most $2^4 = 16$ possible values of $w_1^*,\ldots,w_B^*$. So if the main test statistic is more extreme than any of these 16 values, for example, then all we know is that the p-value is smaller than $1/16 = 0.0625$. Full enumeration makes this discreteness clear. Bootstrapping without consideration of this issue can lead to inadvertently picking just one point from the interval of equally plausible p-values. As $G$ gets to be as large as 11 this concern is less of an issue, since $2^{10} = 1024$.

Webb (2013) proposes greatly reducing the discreteness of p-values with very low $G$ by instead using a six-point distribution for the weights $d_g$ in step 1b. In this proposed distribution the weights $d_g$ have a 1/6 chance of taking each value in $\{-\sqrt{1.5}, -1, -\sqrt{.5}, \sqrt{.5}, 1, \sqrt{1.5}\}$. In his simulations this method outperforms the two-point wild bootstrap for $G < 10$ and is definitely the method to use if $G < 10$.

MacKinnon and Webb (2013) address the issue of unbalanced clusters and find that, even with $G = 50$, tests based on the standard CRVE with $T(G-1)$ critical values can over-reject considerably if the clusters are unbalanced. By contrast, the two-point wild bootstrap with Rademacher weights is generally reliable.

VIC.3. Bootstrap with Caution

Regardless of the bootstrap method used, pairs cluster with or without asymptotic refinement or wild cluster bootstrap, when bootstrapping with few clusters an important step in the process is to examine the distribution of bootstrapped values. This should be done whether you are bootstrapping to obtain a standard error, or bootstrapping t-statistics with refinement to obtain a more accurate p-value. This examination can take the form of looking at four things: (1) basic summary statistics like mean and variance; (2) the sample size, to confirm that it is the same as the number of bootstrap replications (no missing values); (3) the largest and smallest handful of values of the distribution; and (4) a histogram of the bootstrapped values. We detail a few potential problems that this examination can diagnose.

First, if you are using a pairs cluster bootstrap and one cluster is an outlier in some sense, then the resulting histogram may have a big "mass" that sits separately from the rest of the bootstrap distribution; that is, there may be two distinct distributions, one for cases where that cluster is sampled and one for cases where it is not. If this is the case then you know that your results are sensitive to the inclusion of that cluster.

Second, if you are using a pairs cluster bootstrap with dummy right-hand-side variables, then in some samples you can get no or very limited variation in treatment. This can lead to zero or near-zero standard errors. For a percentile-t pairs cluster bootstrap, a zero or missing standard error will lead to missing values for $w^*$, since the standard error is zero or missing. If you naively use the remaining distribution, then there is no reason to expect that you will have valid inference. And if the bootstrapped standard errors are zero plus machine precision noise, rather than exactly zero, very large t-values may result. Then your bootstrap distribution of t-statistics will have really fat tails, and you will not reject anything, even false null hypotheses. No variation or very limited variation in treatment can also result in many of your $\hat{\beta}^*$'s being "perfect fit" $\hat{\beta}^*$'s with limited variability.
Then the bootstrap standard deviation of the $\hat{\beta}^*$'s will be too small, and if you use it as your estimated standard error you will over-reject. In this case we suggest using the wild cluster bootstrap.

Third, if your pairs cluster bootstrap samples draw nearly multicollinear samples, you can get huge $\hat{\beta}^*$'s. This can make a bootstrapped standard error seem very large. You need to look at what about the bootstrap samples "causes" the huge $\hat{\beta}^*$'s. If this is some pathological but common draw, then you may need to think about a different type of bootstrap, such as the wild cluster bootstrap, or give up on bootstrapping methods. For an extreme example, consider a DiD model, with first-order "control" fixed effects and an interaction term. Suppose that a bootstrap sample happens to have among its "treatment group" only observations where post = 1. Then the variables "treated" and "treated*post" are collinear, and an OLS package will drop one of these variables. If it drops the "post" variable, it will report a coefficient on "treated*post", but this coefficient will not be a proper interaction term; it will instead also include the level effect for the treated group.

Fourth, with less than ten clusters the wild cluster bootstrap needs to use the six-point version of Webb (2013).

Fifth, in general if you see missing values in your bootstrapped t-statistics, then you need to figure out why. Don't take your bootstrap results at face value until you know what's going on.

VID. Solution 3: Improved Critical Values using a T-distribution

The simplest small-sample correction for the Wald statistic is to use a $T$ distribution, rather than the standard normal, with degrees of freedom at most the number of clusters $G$. Recent research has proposed methods that lead to using degrees of freedom much less than $G$, especially if clusters are unbalanced.

VID.1. G-L Degrees of Freedom

Some packages, including Stata after command regress, use $G-1$ degrees of freedom for t-tests and F-tests based on cluster-robust standard errors. This choice emanates from the complex survey literature; see Bell and McCaffrey (2002), who note, however, that with normal errors this choice still tends to result in test over-rejection, so the degrees of freedom should be even less than this. Even the $T(G-1)$ can make quite a difference. For example, with $G = 10$ for a two-sided test at level 0.05, the critical value for $T(9)$ is 2.262 rather than 1.960, and if $w = 1.960$ the p-value based on $T(9)$ is 0.082 rather than 0.05. In Monte Carlo simulations by Cameron, Gelbach, and Miller (2008) this choice works reasonably well, and at a minimum one should use the $T(G-1)$ distribution rather than the standard normal.

For models that include $L$ regressors that are invariant within cluster, Donald and Lang (2007) provide a rationale for using the $T(G-L)$ distribution. If clusters are balanced and all regressors are invariant within cluster, then the OLS estimator in the model $y_{ig} = x_g'\beta + u_{ig}$ is equivalent to OLS estimation in the grouped model $\bar{y}_g = x_g'\beta + \bar{u}_g$. If $\bar{u}_g$ is i.i.d. normally distributed then the Wald statistic is $T(G-L)$ distributed, where $\hat{V}[\hat{\beta}] = s^2(X'X)^{-1}$ and $s^2 = (G-L)^{-1}\sum_g \hat{\bar{u}}_g^2$. Note that $\bar{u}_g$ is i.i.d. normal in the random effects model if the error components are i.i.d. normal. Usually if there is a time-invariant regressor there is only one, in addition to the intercept, in which case $L = 2$.
Donald and Lang extend this approach to inference on $\beta$ in a model that additionally includes regressors $z_{ig}$ that vary within clusters, and allow for unbalanced clusters, leading to $G-L$ degrees of freedom for the RE estimator. Wooldridge (2006) presents an expansive exposition of the Donald and Lang approach. He also proposes an alternative approach based on minimum distance estimation. See Wooldridge (2006) and, for a summary, Cameron and Miller (2011).

VID.2. Data-determined Degrees of Freedom

For testing the difference between two means of normally and independently distributed populations with different variances, the t-test is not exactly $T$ distributed; this is known as the Behrens-Fisher problem. Satterthwaite (1946) proposed an approximation that was extended to regression with clustered errors by Bell and McCaffrey (2002) and developed further by Imbens and Kolesar (2012).

The $T(N-K)$ distribution is the ratio of $N[0,1]$ to the independent $\sqrt{\chi^2(N-K)/(N-K)}$. For linear regression under i.i.d. normal errors, the derivation of the $T(N-K)$ distribution for the Wald t-statistic uses the result that $(N-K)(s_{\hat{\beta}}^2/\sigma_{\hat{\beta}}^2) \sim \chi^2(N-K)$, where $s_{\hat{\beta}}^2$ is the usual unbiased estimate of $\sigma_{\hat{\beta}}^2 = V[\hat{\beta}]$. This result no longer holds for non-i.i.d. errors, even if they are normally distributed. Instead, an approximation uses the $T(v^*)$ distribution, where $v^*$ is such that the first two moments of $v^*(s_{\hat{\beta}}^2/\sigma_{\hat{\beta}}^2)$ equal the first two moments ($v^*$ and $2v^*$) of the $\chi^2(v^*)$ distribution. Assuming $s_{\hat{\beta}}^2$ is unbiased for $\sigma_{\hat{\beta}}^2$, $E[v^*(s_{\hat{\beta}}^2/\sigma_{\hat{\beta}}^2)] = v^*$. And $V[v^*(s_{\hat{\beta}}^2/\sigma_{\hat{\beta}}^2)] = 2v^*$ if $v^* = 2[(\sigma_{\hat{\beta}}^2)^2/V[s_{\hat{\beta}}^2]]$. Thus the Wald t-statistic is treated as being $T(v^*)$ distributed, where $v^* = 2(\sigma_{\hat{\beta}}^2)^2/V[s_{\hat{\beta}}^2]$.

Assumptions are needed to obtain an expression for $V[s_{\hat{\beta}}^2]$. For clustered errors with $u \sim N[0,\Omega]$ and using the CRVE defined in Section IIC, or using CR2VE or CR3VE defined in Section VIB, Bell and McCaffrey (2002) show that the distribution of the Wald t-statistic defined in (23) can be approximated by the $T(v^*)$ distribution, where
\[
v^* = \frac{\left(\sum_{j=1}^G \lambda_j\right)^2}{\sum_{j=1}^G \lambda_j^2}, \qquad (26)
\]
and the $\lambda_j$ are the eigenvalues of the $G \times G$ matrix $G'\tilde{\Omega}G$, where $\tilde{\Omega}$ is consistent for $\Omega$, the $N \times G$ matrix $G$ has $g$-th column $(I_N - H)_g'A_gX_g(X'X)^{-1}e_k$, $(I_N - H)_g$ is the $N_g \times N$ submatrix for cluster $g$ of the $N \times N$ matrix $I_N - X(X'X)^{-1}X'$, $A_g = (I_{N_g} - H_{gg})^{-1/2}$ for CR2VE, and $e_k$ is a $K \times 1$ vector of zeroes aside from 1 in the $k$-th position if $\hat{\beta} = \hat{\beta}_k$. Note that $v^*$ needs to be calculated separately, and differs, for each regression coefficient. The method extends to tests on scalar linear combinations $c'\hat{\beta}$.

The justification relies on normal errors and knowledge of $\Omega = E[uu'|X]$. Bell and McCaffrey (2002) perform simulations with balanced clusters ($G = 20$ and $N_g = 10$) and equicorrelated errors within cluster. They calculate $v^*$ assuming $\Omega = \sigma^2 I$, even though errors are in fact clustered, and find that their method leads to Wald tests with true size closer to the nominal size than tests based on the conventional CRVE, CR2VE, and CR3VE.

Imbens and Kolesar (2012) additionally consider calculating $v^*$ where $\tilde{\Omega}$ is based on equicorrelated errors within cluster. They follow the Monte Carlo designs of Cameron, Gelbach and Miller (2008), with $G = 5$ and 10 and equicorrelated errors. They find that all finite-sample adjustments perform better than using the standard CRVE with $T(G-1)$ critical values. The best methods use the CR2VE and $T(v^*)$, with slight over-rejection with $v^*$ based on $\tilde{\Omega} = s^2I$ (Bell and McCaffrey) and slight under-rejection with $v^*$ based on $\tilde{\Omega}$ assuming equicorrelated errors (Imbens and Kolesar).
For $G = 5$ these methods outperform the two-point wild cluster bootstrap, as expected given the very low $G$ problem discussed in Section VIC. More surprisingly, these methods also outperform the wild cluster bootstrap when $G = 10$, perhaps because Imbens and Kolesar (2012) did not impose the null hypothesis in forming the residuals for this bootstrap.

VID.3. Effective Number of Clusters

Carter, Schnepel and Steigerwald (2013) propose a measure of the effective number of clusters. This measure is
\[
G^* = \frac{G}{1+\delta}, \qquad (27)
\]
where $\delta = G^{-1}\sum_{g=1}^G \{(\gamma_g - \bar{\gamma})^2/\bar{\gamma}^2\}$, $\gamma_g = e_k'(X'X)^{-1}X_g'\Omega_gX_g(X'X)^{-1}e_k$, $e_k$ is a $K \times 1$ vector of zeroes aside from 1 in the $k$-th position if $\hat{\beta} = \hat{\beta}_k$, and $\bar{\gamma} = G^{-1}\sum_{g=1}^G \gamma_g$. Note that $G^*$ varies with the regression coefficient considered, and the method extends to tests on scalar linear combinations $c'\hat{\beta}$.

The quantity $\delta$ measures cluster heterogeneity, which disappears if $\gamma_g = \gamma$ for all $g$. Given the formula for $\gamma_g$, cluster heterogeneity ($\delta \ne 0$) can arise for many reasons, including variation in $N_g$, variation in $X_g$ and variation in $\Omega_g$ across clusters. In simulations using standard normal critical values, Carter et al. (2013) find that test size distortion occurs for $G^* < 20$. In application they assume errors are perfectly correlated within cluster, so $\Omega_g = ll'$ where $l$ is an $N_g \times 1$ vector of ones. For data from the Tennessee STAR experiment they find $G^* = 192$ when $G = 318$. For the Hersch (1998) data of Section IIB, with very unbalanced clusters, they find, for the industry job risk coefficient and with clustering on industry, $G^* = 19$ when $G = 211$.

Carter et al. (2013) do not actually propose using critical values based on the $T(G^*)$ distribution. The key component in obtaining the formula for $v^*$ in the Bell and McCaffrey (2002) approach is determining $V[s_{\hat{\beta}}^2/\sigma_{\hat{\beta}}^2]$, which equals $E[(s_{\hat{\beta}}^2-\sigma_{\hat{\beta}}^2)^2/\sigma_{\hat{\beta}}^4]$ given $s_{\hat{\beta}}^2$ is unbiased for $\sigma_{\hat{\beta}}^2$. Carter et al. (2013) instead work with $E[(\tilde{s}_{\hat{\beta}}^2-\sigma_{\hat{\beta}}^2)^2/\sigma_{\hat{\beta}}^4]$, where $\tilde{s}_{\hat{\beta}}^2$, defined in their paper, is an approximation to $s_{\hat{\beta}}^2$ that is good for large $G$ (formally, $\tilde{s}_{\hat{\beta}}^2/\sigma_{\hat{\beta}}^2 \to s_{\hat{\beta}}^2/\sigma_{\hat{\beta}}^2$ as $G \to \infty$). Now $E[(\tilde{s}_{\hat{\beta}}^2-\sigma_{\hat{\beta}}^2)^2/\sigma_{\hat{\beta}}^4] = 2(1+\delta)/G$ (see Carter et al. (2013)), where $\delta$ is defined in (27). This suggests using the $T(G^*)$ distribution as an approximation, and that this approximation will improve as $G^*$ increases.

VIE. Special Cases

With few clusters, additional results can be obtained if there are many observations in each group. In DiD studies the few clusters problem arises if few groups are treated, even if $G$ is large. And the few clusters problem is more likely to arise if there is multi-way clustering.

VIE.1. Fixed Number of Clusters with Cluster Size Growing

The preceding adjustments to the degrees of freedom of the $T$ distribution are based on the assumption of normal errors. In some settings asymptotic results can be obtained when $G$ is small, provided $N_g \to \infty$.

Bester, Conley and Hansen (2011), building on Hansen (2007a), give conditions under which the t-test statistic based on (11) is $\sqrt{G/(G-1)}$ times $T(G-1)$ distributed. Then using $\tilde{u}_g = \sqrt{G/(G-1)}\,\hat{u}_g$ in (11) yields a $T(G-1)$ distributed statistic. In addition to assuming $G$ is fixed while $N_g \to \infty$, it is assumed that the within-group correlation satisfies a mixing condition (this does not happen in all data settings, although it does for time series and spatial correlation), and that homogeneity assumptions are satisfied, including equality of $\operatorname{plim} N_g^{-1}X_g'X_g$ for all $g$.

Let $\hat{\beta}_g$ denote the estimate of parameter $\beta$ in cluster $g$, $\bar{\hat{\beta}} = G^{-1}\sum_{g=1}^G \hat{\beta}_g$ denote the average of the $G$ estimates, and $s_{\hat{\beta}}^2 = (G-1)^{-1}\sum_{g=1}^G(\hat{\beta}_g - \bar{\hat{\beta}})^2$ denote their variance.
Suppose that the $\hat{\beta}_g$ are asymptotically normal as $N_g \to \infty$ with common mean $\beta$, and consider a test of $H_0\colon \beta = \beta^0$ based on $t = \sqrt{G}(\bar{\hat{\beta}} - \beta^0)/s_{\hat{\beta}}$. Then Ibragimov and Müller (2010) show that tests based on the $T(G-1)$ will be conservative tests (i.e., under-reject) for level $\alpha \le 0.083$. This approach permits correct inference even with extremely few clusters, assuming $N_g$ is large. However, the requirement that cluster estimates are asymptotically independent must be met. Thus the method is not directly applicable to a state-year differences-in-differences application when there are year fixed effects (or another regressor that varies over time but not states). In that case Ibragimov and Müller propose applying their method after aggregating subsets of states into groups in which some states are treated and some are not.

VIE.2. Few Treated Groups

Problems arise if most of the variation in the regressor is concentrated in just a few clusters, even when $G$ is sufficiently large. This occurs if the key regressor is a cluster-specific binary treatment dummy and there are few treated groups. Conley and Taber (2011) examine a differences-in-differences (DiD) model in which there are few treated groups and an increasing number of control groups. If there are group-time random effects, then the DiD model is inconsistent because the treated groups' random effects are not averaged away. If the random effects are normally distributed, then the model of Donald and Lang (2007) applies and inference can use a $T$ distribution based on the number of treated groups. If the group-time shocks are not random, then the $T$ distribution may be a poor approximation. Conley and Taber (2011) then propose a novel method that uses the distribution of the untreated groups to perform inference on the treatment parameter. Abadie, Diamond and Hainmueller (2010) propose synthetic control methods that provide a data-driven method to select the control group in a DiD study, and that provide inference under random permutations of assignment to treated and untreated groups. The methods are suitable for treatment that affects few observational units.

VIE.3. What if you have multi-way clustering and few clusters?

Sometimes we are worried about multi-way clustering, but one or both of the ways has few clusters. Currently we are not aware of an ideal approach to deal with this problem. One potential solution is to try to add sufficient control variables so as to minimize concerns about clustering in one of the ways, and then use a one-way few-clusters cluster-robust approach on the other way. Another potential solution is to model one of the ways of clustering in a parametric way, such as with a common shock or an autoregressive error correlation. Then you can construct a variance estimator that is a hybrid of the parametric model, and cluster-robust in the remaining dimension.

VII. Extensions

The preceding material has focused on the OLS (and FGLS) estimator and tests on a single coefficient. The basic results generalize to multiple hypothesis tests, instrumental variables (IV) estimation, nonlinear estimators and generalized method of moments (GMM). These extensions are incorporated in Stata, though Stata generally computes test p-values and confidence intervals using standard normal and chi-squared distributions, rather than $T$ and $F$ distributions. And for nonlinear models stronger assumptions are needed to ensure that the estimator of $\beta$ retains its consistency in the presence of clustering. We provide a brief overview.
VIIA. Cluster-Robust F-tests

Consider Wald joint tests of several restrictions on the regression parameters. Except in the special case of linear restrictions and OLS with i.i.d. normal errors, asymptotic theory yields only a chi-squared distributed statistic, say $W$, that is $\chi^2(h)$ distributed, where $h$ is the number of (linearly independent) restrictions.

Alternatively we can use the related F statistic, $F = W/h$. This yields the same p-value as the chi-squared test if we treat $F$ as being $F(h,\infty)$ distributed. In the cluster case, a finite-sample adjustment instead treats $F$ as being $F(h, G-1)$ distributed. This is analogous to using the $T(G-1)$ distribution, rather than $N[0,1]$, for a test on a single coefficient.

In Stata, the finite-sample adjustment of using the $T(G-1)$ for a t-test on a single coefficient, and using the $F(h, G-1)$ for an F-test, is only done after OLS regression with command regress. Otherwise Stata reports critical values and p-values based on the $N[0,1]$ and $\chi^2(h)$ distributions. Thus Stata does no finite-cluster correction for tests and confidence intervals following instrumental variables estimation commands, nonlinear model estimation commands, or even after command regress in the case of tests and confidence intervals using commands testnl and nlcom.

The discussion in Section VI was limited to inference after OLS regression, but it seems reasonable to believe that for other estimators one should also base inference on the $T(G-1)$ and $F(h, G-1)$ distributions, and even then tests may over-reject when there are few clusters. Some of the few-cluster methods of Section VI can be extended to tests of more than one restriction following OLS regression. The Wald test can be based on the bias-adjusted variance matrices CR2VE or CR3VE, rather than CRVE. For a bootstrap with asymptotic refinement of a Wald test of $H_0\colon R\beta = r$, in the $b$-th resample we compute $W_b^* = (R\hat{\beta}_b^* - R\hat{\beta})'[R\hat{V}_{\text{clu}}[\hat{\beta}_b^*]R']^{-1}(R\hat{\beta}_b^* - R\hat{\beta})$. Extension of the data-determined degrees-of-freedom method of Section VID.2 to tests of more than one restriction requires, at a minimum, extension of Theorem 4 of Bell and McCaffrey (2002) from the case that covers $\beta_k$, where $\beta_k$ is a single component of $\beta$, to $R\beta$. An alternative ad hoc approach would be to use the $F(h, v^*)$ distribution, where $v^*$ is an average (possibly weighted by estimator precision) of $v^*$ defined in (26) computed separately for each exclusion restriction.

For the estimators discussed in the remainder of Section VII, the rank of $\hat{V}_{\text{clu}}[\hat{\beta}]$ is again the minimum of $G-1$ and the number of parameters ($K$). This means that at most $G-1$ restrictions can be tested using a Wald test, in addition to the usual requirement that $h \le K$.
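For example, a joint test of two exclusion restrictions with a cluster-robust variance matrix in Stata, assuming hypothetical variables and clustering on state:

    * Wald test of H0: coefficients on x2 and x3 are both zero
    regress y x1 x2 x3, vce(cluster state)
    test x2 x3     // after regress with vce(cluster), reported as F(2, G-1)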
VIIB. Instrumental Variables Estimators

The cluster-robust variance matrix estimate for the OLS estimator extends naturally to the IV estimator, the two-stage least squares (2SLS) estimator and the linear GMM estimator. The following additional adjustments must be made when errors are clustered. First, a modified version of the Hausman test of endogeneity needs to be used. Second, the usual inference methods when instruments are weak need to be adjusted. Third, tests of overidentifying restrictions after GMM need to be based on an optimal weighting matrix that controls for cluster correlation of the errors.

VIIB.1. IV and 2SLS

In matrix notation, the OLS estimator in the model $y = X\beta + u$ is inconsistent if $E[u|X] \ne 0$. We assume the existence of a set of instruments $Z$ that satisfy $E[u|Z] = 0$ and satisfy other conditions, notably that $Z$ is of full rank with $\dim[Z] \ge \dim[X]$ and $\operatorname{Cor}[Z,X] \ne 0$.

For the clustered case the assumption that errors in different clusters are uncorrelated is now one of uncorrelated errors conditional on the instruments $Z$, rather than uncorrelated errors conditional on the regressors $X$. In the $g$-th cluster the matrix of instruments $Z_g$ is an $N_g \times M$ matrix, where $M \ge K$, and we assume that $E[u_g|Z_g] = 0$ and $\operatorname{Cov}[u_gu_h'|Z_g,Z_h] = 0$ for $g \ne h$.

In the just-identified case, with $Z$ and $X$ having the same dimension, the IV estimator is $\hat{\beta}_{IV} = (Z'X)^{-1}Z'y$, and the cluster-robust variance matrix estimate is
\[
\hat{V}_{\text{clu}}[\hat{\beta}_{IV}] = (Z'X)^{-1}\left(\sum_{g=1}^G Z_g'\hat{u}_g\hat{u}_g'Z_g\right)(X'Z)^{-1}, \qquad (28)
\]
where $\hat{u}_g = y_g - X_g\hat{\beta}_{IV}$ are residuals calculated using the consistent IV estimator. We again assume $G \to \infty$. As for OLS, the CRVE may be rank-deficient with rank the minimum of $K$ and $G-1$.

In the over-identified case with $Z$ having dimension greater than $X$, the 2SLS estimator is the special case of the linear GMM estimator in (29) below with $W = (Z'Z)^{-1}$, and the CRVE is that in (30) below with $W = (Z'Z)^{-1}$ and $\hat{u}_g$ the 2SLS residuals. In the just-identified case 2SLS is equivalent to IV.

A test for endogeneity of a regressor(s) can be conducted by comparing the OLS estimator to the 2SLS (or IV) estimator that controls for this endogeneity. The two estimators have the same probability limit given exogeneity and different probability limits given endogeneity. This is a classic setting for the Hausman test but, as in the Hausman test for fixed effects discussed in Section IIID, the standard version of the Hausman test cannot be used. Instead, partition $X = [X_1\ X_2]$, where $X_1$ is potentially endogenous and $X_2$ is exogenous, and let $\hat{v}_{1ig}$ denote the residuals from first-stage OLS regression of the endogenous regressors on instruments and exogenous regressors. Then estimate by OLS the model
\[
y_{ig} = x_{1ig}'\beta_1 + x_{2ig}'\beta_2 + \hat{v}_{1ig}'\gamma + u_{ig}.
\]
The regressors $x_1$ are considered endogenous if we reject $H_0\colon \gamma = 0$ using a Wald test based on a CRVE. In Stata this is implemented using command estat endogenous. (Alternatively, a pairs cluster bootstrap can be used to estimate the variance of $\hat{\beta}_{2SLS} - \hat{\beta}_{OLS}$.)
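A minimal Stata sketch of this control-function version of the test, for one potentially endogenous regressor x1 with instruments z1 and z2 and exogenous regressor x2 (hypothetical names):

    * first stage: endogenous regressor on instruments and exogenous regressors
    regress x1 z1 z2 x2, vce(cluster g)
    predict double v1hat, residuals
    * augmented regression: x1 is endogenous if the coefficient on v1hat is nonzero
    regress y x1 x2 v1hat, vce(cluster g)
    test v1hat
    * (see also estat endogenous after ivregress 2sls with vce(cluster g))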
VIIB.2. Weak Instruments

When endogenous regressor(s) are weakly correlated with instrument(s), where this correlation is conditional on the exogenous regressors in the model, there is great loss of precision, with the standard error for the coefficient of the endogenous regressor much higher after IV or 2SLS estimation than after OLS estimation. Additionally, asymptotic theory takes an unusually long time to kick in, so that even with large samples the IV estimator can still have considerable bias and the Wald statistic is still not close to normally distributed. See, for example, Bound, Jaeger, and Baker (1995), Andrews and Stock (2007), and textbook discussions in Cameron and Trivedi (2005, 2009).

For this second consequence, called the "weak instrument" problem, the econometrics literature has focused on providing theory and guidance in the case of homoskedastic or heteroskedastic errors, rather than within-cluster correlated errors. The problem may even be greater in the clustered case, as the asymptotics are then in G → ∞ rather than N → ∞, though we are unaware of evidence on this.

A standard diagnostic for detecting weak instruments, in the case of a single endogenous regressor, is to estimate by OLS the first-stage regression of the endogenous regressor on the remaining exogenous regressors and the additional instrument(s). Then form the F-statistic for the joint significance of the instruments; in the case of a just-identified model there is only one instrument to test, so the F-statistic is the square of the t-statistic. A popular rule of thumb, due to Staiger and Stock (1997), is that there may be no weak instrument problem if F > 10. With clustered errors, this F-statistic needs to be based on a cluster-robust variance matrix estimate. In settings where there is a great excess of instruments, however, this test will not be possible if the number of instruments (M − K) exceeds the number of clusters. Instead only subsets or linear combinations of the excess instruments can be tested, and the F > 10 criterion becomes a cruder guide.

Baum, Schaffer and Stillman (2007) provide a comprehensive discussion of various methods for IV, 2SLS, limited information maximum likelihood (LIML), k-class, continuous updating and GMM estimation in linear models, and present methods using their ivreg2 Stata command. They explicitly allow for within-cluster correlation and state which of the proposed methods for weak instruments diagnostics and tests developed for i.i.d. errors can be generalized to within-cluster correlated errors.

Chernozhukov and Hansen (2008) proposed a novel method that provides valid inference on the coefficient of endogenous regressor(s) under weak instruments with errors that are not restricted to being i.i.d. For simplicity suppose that there is one endogenous regressor, y_ig = βx_ig + u_ig, and that the first-stage model is x_ig = z_ig′π + v_ig. If there are additional exogenous regressors x₂, as is usually the case, the method still works if the variables y, x and z are defined after partialling out x₂. The authors show that a test of β = β* is equivalent to a test of γ = 0 in the model

    y_ig − β*x_ig = z_ig′γ + w_ig.

A 95% confidence interval for β can be constructed by running this regression for a range of values of β* and including in the interval for β only those values of β* for which we did not reject H0: γ = 0 when testing at 5%. This approach generalizes to more than one endogenous regressor. More importantly it does not require i.i.d. errors. Chernozhukov and Hansen proposed the method for HAC robust inference, but the method can also be applied to clustered errors. Then the test of γ = 0 should be based on a CRVE. Note that as with other F-tests, this can only be performed when the dimension of γ is less than G. Finlay and Magnusson (2009) provide this and other extensions, and provide a command ivtest for Stata. We speculate that if additionally there are few clusters, then some of the adjustments discussed in Section VI would help.
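A minimal sketch of the grid-inversion confidence interval just described, under clustering, might look as follows in Stata. We assume (as the text requires) that y, x, z1, z2 have already been partialled out with respect to any exogenous regressors; the names and the grid of trial values β* are illustrative.

    * Grid test of H0: beta = b for a range of trial values b; values not
    * rejected at 5% form the 95% confidence interval for beta.
    gen double ytilde = .
    foreach b of numlist -0.5(0.05)0.5 {
        quietly replace ytilde = y - (`b')*x
        quietly regress ytilde z1 z2, vce(cluster state)
        quietly test z1 z2              // F(h, G-1) is used after regress
        if r(p) >= 0.05 display "beta* = `b' lies in the 95% CI (p = " %5.3f r(p) ")"
    }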
VIIB.3. Linear GMM

For over-identified models the linear GMM estimator is more efficient than the 2SLS estimator if E[uu′|Z] ≠ σ²I. Then

    β̂_GMM = (X′ZWZ′X)^{−1} X′ZWZ′y,    (29)

where W is a full-rank M × M weighting matrix. The CRVE for GMM is

    V̂_clu[β̂_GMM] = (X′ZWZ′X)^{−1} X′ZW (Σ_{g=1}^G Z_g′ û_g û_g′ Z_g) WZ′X (X′ZWZ′X)^{−1},    (30)

where û_g are residuals calculated using the GMM estimator.

For clustered errors, the efficient two-step GMM estimator uses W = (Σ_{g=1}^G Z_g′ û_g û_g′ Z_g)^{−1}, where û_g are 2SLS residuals. Implementation of this estimator requires that the number of clusters exceeds the number of instruments, since otherwise Σ_{g=1}^G Z_g′ û_g û_g′ Z_g is not invertible. Here Z contains both the exogenous regressors in the structural equation and the additional instruments required to enable identification of the endogenous regressors. When this condition is not met, Baum, Schaffer and Stillman (2007) propose doing two-step GMM after first partialling out the instruments z from the dependent variable y, the endogenous variables in the initial model y_ig = x_ig′β + u_ig, and any additional instruments that are not also exogenous regressors in this model.

The over-identifying restrictions (OIR) test, also called a Hansen test or a Sargan test, is a limited test of instrument validity that can be used when there are more instruments than necessary. When errors are clustered the OIR test must be computed following the cluster version of two-step GMM estimation; see Hoxby and Paserman (1998).

Just as GLS is more efficient than OLS, specifying a model for Ω = E[uu′|Z] can lead to more efficient estimation than GMM. Given a model for Ω, and conditional moment condition E[u|Z] = 0, a more efficient estimator is based on the unconditional moment condition E[Z′Ω^{−1}u] = 0. Then we minimize (Z′Ω̂^{−1}u)′ (Z′Ω̂^{−1}Z)^{−1} (Z′Ω̂^{−1}u), where Ω̂ is consistent for Ω. Furthermore the CRVE can be robustified against misspecification of Ω, similar to the case of FGLS, though an OIR test is no longer possible if Ω is misspecified. In practice such FGLS-type improvements to GMM are seldom used, even in simpler settings than the clustered setting. An exception is Shore-Sheppard (1996), who considers the impact of equicorrelated instruments and group-specific shocks in a model similar to that of Moulton. One reason may be that this option is not provided in Stata command ivregress. In the special case of a random effects model for Ω, command xtivreg can be used along with a pairs cluster bootstrap used to guard against misspecification of Ω.
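A minimal sketch of cluster two-step GMM and the corresponding OIR test in Stata follows; variable names are again illustrative. Note that, per the requirement above, the number of clusters must exceed the number of instruments for the cluster weighting matrix to be invertible.

    * Two-step GMM with a cluster-robust weighting matrix; the reported
    * standard errors are then also cluster-robust by default.
    ivregress gmm y x2 (x1 = z1 z2), wmatrix(cluster state)
    * Hansen J test of the overidentifying restrictions, computed after
    * the cluster version of two-step GMM as the text requires.
    estat overid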
VIIC. Nonlinear Models

For nonlinear models there are several ways to handle clustering. We provide a brief summary; see Cameron and Miller (2011) for further details. For concreteness we focus on logit regression. Recall that in the cross-section case y_i takes value 0 or 1 and the logit model specifies that E[y_i|x_i] = Pr[y_i = 1|x_i] = Λ(x_i′β), where Λ(z) = e^z/(1 + e^z).

VIIC.1. Different Models for Clustering

The simplest approach is a pooled approach that assumes that clustering does not change the functional form for the conditional probability of a single observation. Thus, for the logit model, whatever the nature of clustering, it is assumed that

    E[y_ig|x_ig] = Pr[y_ig = 1|x_ig] = Λ(x_ig′β).    (31)

This is called a population-averaged approach, as Λ(x_ig′β) is obtained after averaging out any within-cluster correlation. Inference needs to control for within-cluster correlation, however, and more efficient estimation may be possible.

The generalized estimating equations (GEE) approach, due to Liang and Zeger (1986) and widely used in biostatistics, introduces within-cluster correlation into the class of generalized linear models (GLM), a class that includes the logit model. One possible model for within-cluster correlation is equicorrelation, with Cor[y_ig, y_jg|x_ig, x_jg] = ρ. The Stata command xtgee y x, family(binomial) link(logit) corr(exchangeable) estimates the population-averaged logit model and provides the CRVE assuming the equicorrelation model for within-cluster correlation is correctly specified. The option vce(robust) provides a CRVE that is robust to misspecification of the model for within-cluster correlation. Command xtgee includes a range of models for the within-error correlation. The method is a nonlinear analog of FGLS given in Section IID, and asymptotic theory requires G → ∞.

A further extension is nonlinear GMM. For example, with endogenous regressors and instruments z that satisfy E[y_ig − exp(x_ig′β)|z_ig] = 0, a nonlinear GMM estimator minimizes h(β)′Wh(β), where h(β) = Σ_g Σ_i z_ig (y_ig − exp(x_ig′β)). Other choices of h(β) that allow for intracluster correlation may lead to more efficient estimation, analogous to the linear GMM example discussed at the end of Section VIIB. Given a choice of h(β), the two-step nonlinear GMM estimator at the second step uses a weighting matrix W that is the inverse of a consistent estimator of the variance of h(β), and one can then use the minimized objective function for an over-identifying restrictions test.

Now suppose we consider a random effects logit model with normally distributed random effect, so

    Pr[y_ig = 1|α_g, x_ig] = Λ(α_g + x_ig′β),    (32)

where α_g ∼ N[0, σ_α²]. Conditional on α_g, the N_g observations in cluster g are independent with joint density

    f(y_1g, ..., y_{N_g g}|α_g, X_g) = Π_{i=1}^{N_g} Λ(α_g + x_ig′β)^{y_ig} [1 − Λ(α_g + x_ig′β)]^{1−y_ig}.

Since α_g is unknown it is integrated out, leading to joint density

    f(y_1g, ..., y_{N_g g}|X_g) = ∫ Π_{i=1}^{N_g} Λ(α_g + x_ig′β)^{y_ig} [1 − Λ(α_g + x_ig′β)]^{1−y_ig} h(α_g|σ_α²) dα_g,

where h(α_g|σ_α²) is the N[0, σ_α²] density. There is no closed-form solution for this integral, but it is only one-dimensional so numerical approximation (such as Gaussian quadrature) can be used. The consequent MLE can be obtained in Stata using the command xtlogit y x, re. Note that in this RE logit model (31) no longer holds, so β in (32) is scaled differently from β in (31). Furthermore β̂ in (32) is inconsistent if the distribution for α_g is misspecified, so there is no point in using option vce(robust) after command xtlogit, re.

It is important to realize that in nonlinear models such as logit the population-averaged and random effects approaches lead to quite different estimates of β that are not comparable, since β means different things in the different models. The resulting estimated average marginal effects may be similar, however, just as they are in logit and probit models.

With few clusters, Wald statistics are likely to over-reject as in the linear case, even if we scale the CRVEs given in this section by G/(G−1) as is typically done; see (12) for the linear case. McCaffrey, Bell, and Botts (2001) consider bias-correction of the CRVE in generalized linear models. Asymptotic refinement using a pairs cluster bootstrap as in Section VIC is possible. The wild bootstrap given in Section VID is no longer possible in a nonlinear model, aside from nonlinear least squares, since it requires additively separable errors. Instead one can use the score wild bootstrap proposed by Kline and Santos (2012) for nonlinear models, including maximum likelihood and GMM models. The idea in their paper is to estimate the model once, generate scores for all observations, and then perform a bootstrap in the wild-cluster style, perturbing the scores by bootstrap weights at each step. For each bootstrap replication the perturbed scores are used to build a test statistic, and the resulting distribution of this test statistic can be used for inference. They find that this method performs well in small samples, and can greatly ease computational burden because the nonlinear model need only be estimated once.
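Before turning to fixed effects, the following sketch collects the Stata commands for the logit estimators discussed in this subsection, so the pooled, GEE and random-effects fits can be compared side by side; the variable names (y, x, state) are illustrative.

    * Pooled (population-averaged) logit with a CRVE.
    logit y x, vce(cluster state)
    * GEE logit under equicorrelation, with a CRVE robust to
    * misspecification of the within-cluster correlation model.
    xtset state
    xtgee y x, family(binomial) link(logit) corr(exchangeable) vce(robust)
    * Random-effects logit estimated by Gaussian quadrature.
    xtlogit y x, re
    * Average marginal effects, which the text notes may be more
    * comparable across models than the coefficients themselves.
    margins, dydx(x)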
The conservative test of Ibragimov and Müller (2010) can be used if N_g → ∞.

VIIC.2. Fixed Effects

A cluster-specific fixed effects version of the logit model treats the unobserved parameter α_g in (32) as being correlated with the regressors x_ig. In that case both the population-averaged and random effects logit estimators are inconsistent for β. Instead we need a fixed effects logit estimator. In general there is an incidental parameters problem if asymptotics are that N_g is fixed while G → ∞, as there are only N_g observations for each α_g, and inconsistent estimation of α_g spills over to inconsistent estimation of β. Remarkably, for the logit model it is nonetheless possible to consistently estimate β. The logit fixed effects estimator is obtained in Stata using the command xtlogit y x, fe.

Note, however, that the marginal effect in model (32) is

    ∂Pr[y_ig = 1|α_g, x_ig]/∂x_ig,k = Λ(α_g + x_ig′β)(1 − Λ(α_g + x_ig′β)) β_k.

Unlike the linear FE model this depends on the unknown α_g. So the marginal effects cannot be computed, though the ratio of the marginal effects of the kth and lth regressors equals β_k/β_l, which can be consistently estimated.

The logit model is one of few nonlinear models for which fixed effects estimation is possible when N_g is small. The other models are the Poisson model, with E[y_ig|X_g, α_g] = exp(α_g + x_ig′β), and nonlinear models with E[y_ig|X_g, α_g] = α_g + m(x_ig′β), where m(·) is a specified function.

The natural approach to introduce cluster-specific effects in a nonlinear model is to include a full set of cluster dummies as additional regressors. This leads to inconsistent estimation of β in all models except the linear model (estimated by OLS) and the Poisson regression model, unless N_g → ∞. There is a growing literature on bias-corrected estimation in such cases; see, for example, Fernandez-Val (2009). This paper also cites several simulation studies that gauge the extent of bias of dummy variable estimators for moderate N_g, such as N_g = 20.

Yoon and Galvao (2013) consider fixed effects in panel quantile regression models with correlation within cluster and provide methods under the assumption that both the number of individuals and the number of time periods go to infinity.

VIID. Cluster-randomized Experiments

Increasingly researchers are gathering their own data, often in the form of field or laboratory experiments. When analyzing data from these experiments they will want to account for the clustered nature of the data. And so when designing these experiments, they should also account for clustering. Fitzsimons, Malde, Mesnard, and Vera-Hernandez (2012) use a wild cluster bootstrap in an experiment with 12 treated and 12 control clusters. Traditional guidance for computing power analyses and minimum detectable effects (see, e.g., Duflo, Glennerster and Kremer, 2007, pp. 3918-3922, and Hemming and Marsh, 2013) is based on assumptions of either independent errors or, in a clustered setting, a random effects common-shock model. Ideally one would account for more general forms of clustering in these calculations (the types of clustering that motivate cluster-robust variance estimation), but this can be difficult to do ex ante. If you have a data set that is similar to the one you will be analyzing later, then you can assign a placebo treatment and compute the ratio of cluster-robust standard errors to default standard errors. This can provide a sense of how to adjust the traditional measures used in design of experiments.
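A minimal sketch of this placebo-treatment design check in Stata follows; the dataset and variable names (lnwage, age, agesq, educ, state) are illustrative stand-ins for whatever comparable data one has available.

    * Assign a placebo treatment to a random half of the clusters.
    set seed 10101
    preserve
    keep state
    duplicates drop
    gen double u = runiform()
    sort u
    gen byte placebo = (_n <= _N/2)
    drop u
    tempfile assign
    save `assign'
    restore
    merge m:1 state using `assign', nogenerate
    * Ratio of cluster-robust to default standard errors for the placebo.
    regress lnwage placebo age agesq educ
    scalar se_default = _se[placebo]
    regress lnwage placebo age agesq educ, vce(cluster state)
    display "design-effect adjustment factor: " _se[placebo]/se_default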
VIII. Empirical Example

In this section we illustrate many of the issues and methods presented in this paper. The data and accompanying Stata code will be posted on our websites. The micro data are from the March CPS, downloaded from IPUMS-CPS (King et al., 2010). We use data covering individuals who worked 40 or more weeks during the prior year, and whose usual hours per week in that year was 30 or more. The hourly wage is constructed as annual earnings divided by annual hours (usual hours per week times number of weeks worked), deflated to real 1999 dollars, and observations with real wage in the range ($2, $100) are kept. Cross-section examples use individual-level data for 2012. Panel examples use data aggregated to the state-year level for 1977 to 2012. In both cases we estimate log-wage regressions and perform inference on a generated regressor that has zero coefficient.

VIIIA. Individual-level Cross-section Data: One Sample

In our first application we use data on 65,685 individuals from the year 2012. We randomly generate a dummy "policy" variable, equal to one for one-half of the states and zero for the other half. Log-wage is regressed on this policy variable and a set of individual-level controls (age, age squared, and education in years). Estimation is by OLS, using Stata command regress, and by FGLS controlling for state-level random effects, using command xtreg, re. The policy variable is often referred to as a "placebo" treatment, and should be statistically insignificant in 95% of tests performed at significance level 0.05.

Table 1 reports the estimated coefficient of the policy variable, along with standard errors computed in several different ways. The default standard errors for OLS are misleadingly small (se = 0.0042), leading to the dummy variable being very highly statistically significant (|t| = 0.0226/0.0042 = 5.42) even though it was randomly generated independently of log-wage. The White heteroskedastic-robust standard errors, from regress option vce(robust), are no better. (These standard errors should not be used if N_g is small, see Section IVB, but here N_g is large.) The cluster-robust standard error (se = 0.0217), from option vce(cluster state), is 5.5 times as large, however, leading to the more sensible result that the regressor is statistically insignificant (|t| = 1.04). In results not presented in Table 1, the cluster-robust standard errors of age, age squared and education were, respectively, 1.2, 1.2 and 2.3 times the default, so again ignoring clustering understates the standard errors. A pairs cluster bootstrap (without asymptotic refinement), from option vce(boot, cluster(state)), yields a very similar standard error, as expected.

For FGLS the cluster-robust standard error, computed using (15) or by a pairs cluster bootstrap, is fairly close to the default standard error computed using (14). This suggests that the random effects model captures the error correlation well. This example illustrates that clustering can make a big difference even when errors are only weakly correlated within cluster (the intraclass correlation of the residuals in this application is 0.018), if additionally the regressor is highly correlated within cluster (here perfectly correlated within cluster) and cluster sizes are large (ranging here from 519 to 5866). The cluster-robust OLS and FGLS standard errors are very similar (0.0217 in both cases), so in this example random effects FGLS led to no improvement in efficiency.
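For concreteness, the following sketch collects the estimator and standard-error choices reported in Table 1; the commands match those named in the text, while the variable names (lnwage, policy, age, agesq, educ, state) are illustrative stand-ins for those in our posted code.

    * OLS with default, White, cluster-robust, and bootstrap standard errors.
    regress lnwage policy age agesq educ
    regress lnwage policy age agesq educ, vce(robust)
    regress lnwage policy age agesq educ, vce(cluster state)
    regress lnwage policy age agesq educ, vce(boot, cluster(state) reps(399) seed(10101))
    * FGLS with state random effects, default and cluster-robust SEs.
    xtset state
    xtreg lnwage policy age agesq educ, re
    xtreg lnwage policy age agesq educ, re vce(cluster state)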
Note that formula (6) suggests that the cluster-robust standard errors are 4.9 times the default: √(1 + 1 × 0.018 × (65685/51 − 1)) = 4.9, close to the observed multiple of 5.5. Formula (6) may work especially well in this example, as taking the natural logarithm of wage leads to a model error that is close to homoskedastic, and equicorrelation is a good error model for individuals clustered in regions.

To illustrate the potential pitfalls of pairs cluster bootstrapping for standard errors with few clusters, discussed in Section VIC, we examine a modification with six randomly selected states broken into treated (AZ, LA, MD) and control (DE, PA, UT). For these states, we estimate a model similar to that in Table 1. Then β̂ = 0.0373 with default se = 0.0128. We then perform 1000 replications, resampling the six states with replacement, and perform a pairs cluster bootstrap in each replication. The bootstrap se = 0.0622 is not too different from the cluster-robust se = 0.0577. But several problems arise. First, 38 replications cannot be estimated, presumably due to no variation in treatment in the bootstrap samples. Second, a kernel density estimate of the bootstrapped β̂s reveals that their distribution is very multi-modal and has limited density near the middle of the distribution. Third, there are many outliers: the 5th percentile of bootstrap replicates is −0.1010 and the 95th percentile is 0.1032, a far wider spread than would be expected based on the CRVE. Considering these results, we would not feel comfortable using the pairs cluster bootstrap in this dataset with these few clusters.
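A minimal sketch of the diagnostics behind these statements follows, again with illustrative variable and file names. Failed replications are reported in the bootstrap output; the saved replicates allow the percentile and density checks described above.

    * Pairs cluster bootstrap of the policy coefficient, saving replicates.
    bootstrap beta=_b[policy], cluster(state) reps(1000) seed(10101) ///
        saving(bootreps, replace): regress lnwage policy age agesq educ
    * Inspect the bootstrap distribution (this replaces the data in memory).
    use bootreps, clear
    summarize beta                   // compare sd to the cluster-robust se
    centile beta, centile(5 95)      // check for extreme outliers
    kdensity beta                    // check for multi-modality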
VIIIB. Individual-level Cross-section Data: Monte Carlo

We next perform a Monte Carlo exercise based on the same regression, with 1000 replications. In each replication, we generate a dataset by sampling (with replacement) states and all their associated observations. For quicker computation of the Monte Carlo simulation, we draw states from a 3% sample of individuals within each state, so there are on average approximately 40 observations per cluster. We explore the effect of the number of clusters G by performing simulations with G varying over {6, 10, 20, 30, 50}. Given a sample of states, we assign a dummy "policy" variable to one-half of the states. We run OLS regressions of log-wage on the policy variable and the same controls as in Table 1. In these simulations we perform tests of the null hypothesis that the slope coefficient of the policy variable is zero.

Table 2 presents rejection rates, which with millions of replications should all be 0.05, since we are testing a true hypothesis at a nominal 5% level. Because only 1,000 simulations are performed, we expect that 95% of these simulations will yield estimated test size in the range (0.036, 0.064) if the true test size is 0.05.

We begin with a lengthy discussion of the last column, which shows results for G = 50. Rows 1-9 report Wald tests based on t = β̂/se, where se is computed in various ways, while rows 10-13 report results of bootstraps with an asymptotic refinement. Rows 1-3 are obtained by standard Stata commands: row 1 by regress, vce(robust); row 2 by xtreg, vce(robust) after xtset state; and row 3 by regress, vce(cluster state).

Ignoring clustering leads to great over-rejection. The problem is that standard errors that (correctly) control for clustering are 1.3 times larger: using formula (6) for a 3% sample yields √(1 + 1 × 0.018 × (0.03 × 65685/51 − 1)) = 1.30. So t = 1.96 using the heteroskedastic-robust standard error is really t = 1.96/1.3 = 1.51 and, using standard normal critical values, an apparent p = 0.05 is really p = 0.13.

Rows 2 and 3 both use the CRVE, but with different critical values. Using T(G−1) in row 3 leads to a rejection rate that is closer to 0.05 than does using N[0,1] in row 2. Using T(G−2) in row 4, suggested by the study of Donald and Lang (2007), leads to slight further improvement, but there is still over-rejection. Rows 5 and 6 use the residual bias adjustments CR2 and CR3 discussed in Section VIB, along with T(G−1) critical values. This leads to a further decrease in the rejection rates, towards the desired 0.05. Rows 7 and 8 combine the residual bias adjustment CR2 with the data-determined degrees of freedom of Section VID. For these data the Imbens and Kolesar (2013) measure v* in (26), denoted IK, and the Carter, Schnepel and Steigerwald (2013) measure G* in (27) both equal 17 on average when G = 50 (see rows 14 and 17). This leads to further improvement in the Monte Carlo rejection rate.

Row 9 uses bootstrap standard errors obtained by a pairs cluster bootstrap, implemented using command regress, vce(boot, cluster(state)). The rejection rate is essentially the same as that in row 3, as expected since this bootstrap has no asymptotic refinement.

Rows 10-13 implement the various percentile-t bootstraps with asymptotic refinement presented in Section VIC. These lead to mild over-rejection, with rejection rate of 0.06 rather than 0.05. Row 10 can be computed using the bootstrap prefix command, see our posted code, while rows 11-13 require additional coding. Only 399 bootstraps need be used here, as any consequent bootstrap simulation error averages out over the many Monte Carlo replications. But if these bootstraps were used just once, as in an empirical study, a percentile-t bootstrap should use at least 999 replications.

As we examine settings with fewer clusters, most methods lead to rejection rates even further away from the nominal test size of 0.05. The methods that appear to perform best are the CR3 residual correction with T(G−1) critical values and the CR2 residual correction with T(v*) critical values, where v* is the Imbens and Kolesar (2013) calculated degrees of freedom. Consider the case G = 6. The choice of degrees of freedom makes an enormous difference, as the critical value for a test at level 0.05 rises from 2.571 to 2.776 and 3.182 for, respectively, the T(5), T(4) and T(3) distributions, and from row 14 the IK degrees of freedom averages 3.3 across the simulations. The CSS degrees of freedom is larger as, from Section VID, it involves an approximation that only disappears as G becomes large. It appears that using CR2 with the Imbens and Kolesar degrees of freedom does best, although with 1000 Monte Carlo replications we cannot statistically distinguish between the results in rows 3-13.

Finally we compare the variability in cluster-robust standard errors to that for heteroskedastic-robust standard errors. We return to the full cross-section micro dataset and perform 1000 replications, resampling the 50 states with replacement. The standard deviation of the cluster-robust standard error was 12.3% of the mean cluster-robust standard error, while the standard deviation of the heteroskedastic-robust standard error was 4.5% of its mean. So although the CRVE is less biased than heteroskedastic-robust (or default), it is also more variable. But this variability is still relatively small, especially compared to the very large bias if clustering is not controlled for.
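A skeleton of this cluster-resampling Monte Carlo in Stata might look as follows; names are illustrative, and the deterministic even/odd split of resampled cluster ids is a simple stand-in for randomly assigning the placebo to half the states.

    * Monte Carlo rejection rate for a true null, resampling clusters.
    set seed 10101
    scalar rejections = 0
    forvalues r = 1/1000 {
        preserve
        quietly bsample, cluster(state) idcluster(ns)   // draw states with replacement
        quietly gen byte policy2 = mod(ns, 2)           // placebo for half the draws
        quietly regress lnwage policy2 age agesq educ, vce(cluster ns)
        quietly test policy2                            // F(1, G-1) after regress
        if r(p) < 0.05 scalar rejections = rejections + 1
        restore
    }
    display "rejection rate: " rejections/1000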
VIIIC. State-Year Panel Data: One Sample

We next turn to a panel difference-in-differences application motivated by Bertrand, Duflo, and Mullainathan (2004). The model estimated for 51 states from 1977 to 2012 is

    y_ts = α_s + δ_t + β d_ts + u_ts,    (33)

where y_ts is the average log-wage in year t and state s, α_s and δ_t are state and year dummies, and d_ts is a random "policy" variable that turns on and stays on for the last 18 years. Here G = 51, T = 36 and N = 1836. To speed up bootstraps, and to facilitate computation of the CR2 residual adjustment, we partial out the state and year fixed effects and regress (without intercept) û_y,ts on û_d,ts, where û_y,ts (or û_d,ts) is the residual from OLS regression of y_ts (or d_ts) on state and year fixed effects.

Table 3 presents results for the policy dummy regressor, which should have coefficient zero since it is randomly assigned. We begin with Model 1, OLS controlling for state and year fixed effects. Using default or White-robust standard errors (rows 1-2) leads to a standard error of 0.0042 that is much smaller than the cluster-robust standard error of 0.0153 (row 3), where clustering is on state. Similar standard errors are obtained using the CR2 correction and a bootstrap without asymptotic refinement (rows 4-6). Note that even after controlling for state fixed effects, the default standard errors were greatly under-estimated, with cluster-robust standard errors 3.6 times larger.

The second column of results for Model 1 gives p-values for tests of the null hypothesis that β = 0. Default and heteroskedastic-robust standard errors lead to erroneously large t-statistics (of 4.12 and 4.10), p = 0.000, and hence false rejections of the null hypothesis. Using standard errors that control for clustering (rows 3-6) leads to p ≈ 0.2, so that the null is not rejected. Note that here the IK and CSS degrees of freedom are calculated to be G−1, an artifact of having balanced clusters and a single regressor that is invariant within cluster. Rows 7-9 report p-values from several percentile-t bootstraps that again lead to non-rejection of H0: β = 0.

Model 2 drops state fixed effects from the model (33). Then the cluster-robust standard error (row 3) is 0.0276, compared to 0.0153 when state fixed effects are included (Model 1). So inclusion of state fixed effects does lead to more precise parameter estimates. But even with state fixed effects included (Model 1) there is still within-cluster correlation of the error, so one still needs to use cluster-robust standard errors. In fact in both Models 1 and 2 the cluster-robust standard errors are approximately 3.7 times the default.

Model 3 estimates the model with FGLS, allowing for an AR(1) process for the errors. Since there are 36 years of data, the bias correction of Hansen (2007b), see Section IIIC, will make little difference. For this column we control for state and year fixed effects, similar to OLS in Model 1. The cluster-robust standard errors (row 3) are much smaller for FGLS (0.0101) than for OLS (0.0153), indicating improved statistical precision. For FGLS there is some difference between default standard errors (0.0070) and cluster-robust standard errors (0.0101), suggesting that an AR(1) model is not the perfect model for the within-state error correlation over time.
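A minimal sketch of the fixed-effects partialling-out just described follows; the variable names (lnwage, policy, state, year) are illustrative. By the Frisch-Waugh theorem the residual regression reproduces the fixed-effects point estimate, with only a minor degrees-of-freedom difference in the standard errors.

    * Partial out state and year fixed effects from outcome and policy.
    quietly regress lnwage i.state i.year
    predict double uy, residuals
    quietly regress policy i.state i.year
    predict double ud, residuals
    * Residual-on-residual regression, without intercept, clustered on state.
    regress uy ud, noconstant vce(cluster state)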
VIIID. State-Year Panel Data: Monte Carlo

We next embed the panel data example in a Monte Carlo simulation for the OLS estimator, with 1000 replications. In each simulation, we draw a random set of G states (with replacement). When a state is drawn, we take all years of data for that state. We then assign our DiD "policy" variable to half the states. As for Table 2, we examine varying numbers of states, ranging from six to fifty. In these simulations we perform tests of the null hypothesis that the slope coefficient of the policy variable is zero. Again 95% of exercises such as this should yield an estimated test size in the range (0.036, 0.064).

We begin with the last column of Table 4, with G = 50 states. All tests aside from that based on default standard errors (row 1) have rejection rates that are not appreciably different from 0.05, once we allow for simulation error. As the number of clusters decreases, it becomes clear that one should use the T(G−1) or T(G−2) distribution for critical values, and even this leads to over-rejection with low G. The pairs cluster percentile-t bootstrap fails with few clusters, with a rejection rate of only 0.01 when G = 6. For low G, the wild cluster percentile-t bootstrap has similar results with either 2-point or 6-point weights, with some over-rejection.

IX. Concluding Thoughts

It is important to aim for correct statistical inference: many empirical applications feature the potential for errors to be correlated within clusters, and we need to make sure our inference accounts for this. Often this is straightforward to do using traditional cluster-robust variance estimators, but sometimes things can be tricky. The leading difficulties are (1) determining how to define the clusters, and (2) dealing with few clusters; but other complications can arise as well. When faced with these difficulties, there is no simple hard and fast rule regarding how to proceed. You need to think carefully about the potential for correlations in your residuals, and how that interacts with correlations in your covariates. In this essay we have aimed to present the current leading set of tools available to practitioners to deal with clustering issues.

X. References

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program." Journal of the American Statistical Association 105(490): 493-505.

Acemoglu, Daron, and Jörn-Steffen Pischke. 2003. "Minimum Wages and On-the-job Training." Research in Labor Economics 22: 159-202.

Andrews, Donald W. K., and James H. Stock. 2007. "Inference with Weak Instruments." In Advances in Economics and Econometrics, Theory and Applications: Ninth World Congress of the Econometric Society, Vol. III, ed. Richard Blundell, Whitney K. Newey, and T. Persson, Ch. 3. Cambridge: Cambridge University Press.

Angrist, Joshua D., and Victor Lavy. 2009. "The Effect of High School Matriculation Awards: Evidence from Randomized Trials." American Economic Review 99: 1384-1414.

Angrist, Joshua D., and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton: Princeton University Press.

Arellano, Manuel. 1987. "Computing Robust Standard Errors for Within-Group Estimators." Oxford Bulletin of Economics and Statistics 49: 431-434.

Barrios, Thomas, Rebecca Diamond, Guido W. Imbens, and Michal Kolesar. 2012. "Clustering, Spatial Correlations and Randomization Inference."
Journal of the American Statistical Association 107(498): 578-591.

Baum, Christopher F., Mark E. Schaffer, and Steven Stillman. 2007. "Enhanced Routines for Instrumental Variables/GMM Estimation and Testing." The Stata Journal 7(4): 465-506.

Bell, Robert M., and Daniel F. McCaffrey. 2002. "Bias Reduction in Standard Errors for Linear Regression with Multi-Stage Samples." Survey Methodology 28: 169-179.

Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. 2004. "How Much Should We Trust Differences-in-Differences Estimates?" Quarterly Journal of Economics 119: 249-275.

Bester, C. Alan, Timothy G. Conley, and Christian B. Hansen. 2011. "Inference with Dependent Data Using Cluster Covariance Estimators." Journal of Econometrics 165: 137-151.

Bhattacharya, Debopam. 2005. "Asymptotic Inference from Multi-stage Samples." Journal of Econometrics 126: 145-171.

Bound, John, David A. Jaeger, and Regina M. Baker. 1995. "Problems with Instrumental Variables Estimation when the Correlation Between the Instruments and the Endogenous Explanatory Variable is Weak." Journal of the American Statistical Association 90: 443-450.

Brewer, Mike, Thomas F. Crossley, and Robert Joyce. 2013. "Inference with Differences-in-Differences Revisited." Unpublished.

Cameron, A. Colin, Jonah G. Gelbach, and Douglas L. Miller. 2006. "Robust Inference with Multi-Way Clustering." NBER Technical Working Paper 0327.

Cameron, A. Colin, Jonah G. Gelbach, and Douglas L. Miller. 2008. "Bootstrap-Based Improvements for Inference with Clustered Errors." Review of Economics and Statistics 90: 414-427.

Cameron, A. Colin, Jonah G. Gelbach, and Douglas L. Miller. 2011. "Robust Inference with Multi-Way Clustering." Journal of Business and Economic Statistics 29(2): 238-249.

Cameron, A. Colin, and Douglas L. Miller. 2011. "Robust Inference with Clustered Data." In Handbook of Empirical Economics and Finance, ed. Aman Ullah and David E. Giles, 1-28. Boca Raton: CRC Press.

Cameron, A. Colin, and Douglas L. Miller. 2012. "Robust Inference with Dyadic Data: with Applications to Country-pair International Trade." University of California - Davis. Unpublished.

Cameron, A. Colin, and Pravin K. Trivedi. 2005. Microeconometrics: Methods and Applications. Cambridge: Cambridge University Press.

Cameron, A. Colin, and Pravin K. Trivedi. 2009. Microeconometrics Using Stata. College Station, TX: Stata Press.

Carter, Andrew V., Kevin T. Schnepel, and Douglas G. Steigerwald. 2013. "Asymptotic Behavior of a t Test Robust to Cluster Heterogeneity." University of California - Santa Barbara. Unpublished.

Cheng, Cheng, and Mark Hoekstra. 2013. "Pitfalls in Weighted Least Squares Estimation: A Practitioner's Guide." Texas A&M University. Unpublished.

Chernozhukov, Victor, and Christian Hansen. 2008. "The Reduced Form: A Simple Approach to Inference with Weak Instruments." Economics Letters 100: 68-71.

Conley, Timothy G. 1999. "GMM Estimation with Cross Sectional Dependence." Journal of Econometrics 92: 1-45.

Conley, Timothy G., and Christopher R. Taber. 2011. "Inference with 'Difference in Differences' with a Small Number of Policy Changes." Review of Economics and Statistics 93(1): 113-125.

Davidson, Russell, and Emmanuel Flachaire. 2008. "The Wild Bootstrap, Tamed at Last." Journal of Econometrics 146: 162-169.

Davis, Peter. 2002. "Estimating Multi-Way Error Components Models with Unbalanced Data Structures." Journal of Econometrics 106: 67-95.

Donald, Stephen G., and Kevin Lang. 2007. "Inference with Difference-in-Differences and Other Panel Data." Review of Economics and Statistics 89: 221-233.

Driscoll, John C., and Aart C. Kraay. 1998. "Consistent Covariance Matrix Estimation with Spatially Dependent Panel Data."
Review of Economics and Statistics 80: 549-560.

Duflo, Esther, Rachel Glennerster, and Michael Kremer. 2007. "Using Randomization in Development Economics Research: A Toolkit." In Handbook of Development Economics, Vol. 4, ed. Dani Rodrik and Mark Rosenzweig, 3895-3962. Amsterdam: North-Holland.

Fafchamps, Marcel, and Flore Gubert. 2007. "The Formation of Risk Sharing Networks." Journal of Development Economics 83: 326-350.

Fernandez-Val, Ivan. 2009. "Fixed Effects Estimation of Structural Parameters and Marginal Effects in Panel Probit Models." Journal of Econometrics 150: 70-85.

Finlay, Keith, and Leandro M. Magnusson. 2009. "Implementing Weak Instrument Robust Tests for a General Class of Instrumental-Variables Models." Stata Journal 9: 398-421.

Fitzsimons, Emla, Bansi Malde, Alice Mesnard, and Marcos Vera-Hernandez. 2012. "Household Responses to Information on Child Nutrition: Experimental Evidence from Malawi." IFS Working Paper W12/07.

Foote, Christopher L. 2007. "Space and Time in Macroeconomic Panel Data: Young Workers and State-Level Unemployment Revisited." Working Paper 07-10, Federal Reserve Bank of Boston.

Greenwald, Bruce C. 1983. "A General Analysis of Bias in the Estimated Standard Errors of Least Squares Coefficients." Journal of Econometrics 22: 323-338.

Hansen, Christian. 2007a. "Asymptotic Properties of a Robust Variance Matrix Estimator for Panel Data when T is Large." Journal of Econometrics 141: 597-620.

Hansen, Christian. 2007b. "Generalized Least Squares Inference in Panel and Multi-level Models with Serial Correlation and Fixed Effects." Journal of Econometrics 140: 670-694.

Hausman, Jerry, and Guido Kuersteiner. 2008. "Difference in Difference Meets Generalized Least Squares: Higher Order Properties of Hypotheses Tests." Journal of Econometrics 144: 371-391.

Hemming, Karla, and Jen Marsh. 2013. "A Menu-driven Facility for Sample-size Calculations in Cluster Randomized Controlled Trials." Stata Journal 13: 114-135.

Hersch, Joni. 1998. "Compensating Wage Differentials for Gender-Specific Job Injury Rates." American Economic Review 88: 598-607.

Hoechle, Daniel. 2007. "Robust Standard Errors for Panel Regressions with Cross-sectional Dependence." Stata Journal 7(3): 281-312.

Hoxby, Caroline, and M. Daniele Paserman. 1998. "Overidentification Tests with Group Data." NBER Technical Working Paper 0223.

Ibragimov, Rustam, and Ulrich K. Müller. 2010. "T-Statistic Based Correlation and Heterogeneity Robust Inference." Journal of Business and Economic Statistics 28(4): 453-468.

Imbens, Guido W., and Michal Kolesar. 2012. "Robust Standard Errors in Small Samples: Some Practical Advice." NBER Working Paper 18478.

Inoue, Atsushi, and Gary Solon. 2006. "A Portmanteau Test for Serially Correlated Errors in Fixed Effects Models." Econometric Theory 22: 835-851.

Kezdi, Gabor. 2004. "Robust Standard Error Estimation in Fixed-Effects Panel Models." Hungarian Statistical Review Special Number 9: 95-116.

King, Miriam, Steven Ruggles, J. Trent Alexander, Sarah Flood, Katie Genadek, Matthew B. Schroeder, Brandon Trampe, and Rebecca Vick. 2010. Integrated Public Use Microdata Series, Current Population Survey: Version 3.0 [Machine-readable database]. Minneapolis: University of Minnesota.

Kish, Leslie. 1965. Survey Sampling. New York: John Wiley.

Kish, Leslie, and Martin R. Frankel. 1974. "Inference from Complex Surveys with Discussion." Journal of the Royal Statistical Society B 36: 1-37.

Kline, Patrick, and Andres Santos. 2012. "A Score Based Approach to Wild Bootstrap Inference."
Journal of Econometric Methods 1(1): 23-41.

Kloek, T. 1981. "OLS Estimation in a Model where a Microvariable is Explained by Aggregates and Contemporaneous Disturbances are Equicorrelated." Econometrica 49: 205-07.

Liang, Kung-Yee, and Scott L. Zeger. 1986. "Longitudinal Data Analysis Using Generalized Linear Models." Biometrika 73: 13-22.

MacKinnon, James G., and Halbert White. 1985. "Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties." Journal of Econometrics 29: 305-325.

MacKinnon, James, and Matthew D. Webb. 2013. "Wild Bootstrap Inference for Wildly Different Cluster Sizes." Queen's Economics Department Working Paper No. 1314.

McCaffrey, Daniel F., Robert M. Bell, and Carsten H. Botts. 2001. "Generalizations of Bias Reduced Linearization." Proceedings of the Survey Research Methods Section, American Statistical Association.

Miglioretti, D. L., and P. J. Heagerty. 2006. "Marginal Modeling of Nonnested Multilevel Data using Standard Software." American Journal of Epidemiology 165: 453-463.

Moulton, Brent R. 1986. "Random Group Effects and the Precision of Regression Estimates." Journal of Econometrics 32: 385-397.

Moulton, Brent R. 1990. "An Illustration of a Pitfall in Estimating the Effects of Aggregate Variables on Micro Units." Review of Economics and Statistics 72: 334-38.

Newey, Whitney K., and Kenneth D. West. 1987. "A Simple, Positive Semi-Definite, Heteroscedasticity and Autocorrelation Consistent Covariance Matrix." Econometrica 55: 703-708.

Petersen, Mitchell A. 2009. "Estimating Standard Errors in Finance Panel Data Sets: Comparing Approaches." Review of Financial Studies 22: 435-480.

Pfeffermann, Daniel, and Gad Nathan. 1981. "Regression Analysis of Data from a Cluster Sample." Journal of the American Statistical Association 76: 681-689.

Rabe-Hesketh, Sophia, and Anders Skrondal. 2012. Multilevel and Longitudinal Modeling Using Stata, Volumes I and II, Third Edition. College Station, TX: Stata Press.

Rogers, William H. 1993. "Regression Standard Errors in Clustered Samples." Stata Technical Bulletin 13: 19-23.

Satterthwaite, F. E. 1946. "An Approximate Distribution of Estimates of Variance Components." Biometrics Bulletin 2(6): 110-114.

Schaffer, Mark E., and Steven Stillman. 2010. "xtoverid: Stata Module to Calculate Tests of Overidentifying Restrictions after xtreg, xtivreg, xtivreg2 and xthtaylor." http://ideas.repec.org/c/boc/bocode/s456779.html

Scott, A. J., and D. Holt. 1982. "The Effect of Two-Stage Sampling on Ordinary Least Squares Methods." Journal of the American Statistical Association 77: 848-854.

Shah, Babubhai V., M. M. Holt, and Ralph E. Folsom. 1977. "Inference About Regression Models from Sample Survey Data." Bulletin of the International Statistical Institute Proceedings of the 41st Session 47(3): 43-57.

Shore-Sheppard, L. 1996. "The Precision of Instrumental Variables Estimates with Grouped Data." Princeton University Industrial Relations Section Working Paper 374.

Solon, Gary, Steven J. Haider, and Jeffrey Wooldridge. 2013. "What Are We Weighting For?" NBER Working Paper 18859.

Staiger, Douglas, and James H. Stock. 1997. "Instrumental Variables Regression with Weak Instruments." Econometrica 65: 557-586.

Stock, James H., and Mark W. Watson. 2008. "Heteroskedasticity-robust Standard Errors for Fixed Effects Panel Data Regression." Econometrica 76: 155-174.

Thompson, Samuel. 2006. "Simple Formulas for Standard Errors that Cluster by Both Firm and Time." SSRN paper. http://ssrn.com/abstract=914002.

Thompson, Samuel. 2011.
"Simple Formulas for Standard Errors that Cluster by Both Firm and Time." Journal of Financial Economics 99(1): 1-10.

Webb, Matthew D. 2013. "Reworking Wild Bootstrap Based Inference for Clustered Errors." Queen's Economics Department Working Paper 1315.

White, Halbert. 1980. "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity." Econometrica 48: 817-838.

White, Halbert. 1984. Asymptotic Theory for Econometricians. San Diego: Academic Press.

Wooldridge, Jeffrey M. 2003. "Cluster-Sample Methods in Applied Econometrics." American Economic Review 93: 133-138.

Wooldridge, Jeffrey M. 2006. "Cluster-Sample Methods in Applied Econometrics: An Extended Analysis." Michigan State University. Unpublished.

Wooldridge, Jeffrey M. 2010. Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.

Yoon, Jungmo, and Antonio Galvao. 2013. "Robust Inference for Panel Quantile Regression Models with Individual Effects and Serial Correlation." Unpublished.

Table 1 - Impacts of clustering and estimator choices on estimated coefficients and standard errors
Cross-section individual data with randomly-assigned state-level variable

Estimation Method                        OLS           FGLS (Random Effects)
Slope coefficient                        -0.0226       -0.0011
Standard Errors:
  Default                                 0.0042        0.0199
  Heteroscedastic Robust                  0.0042        -
  Cluster Robust (cluster on State)       0.0217        0.0217
  Pairs cluster bootstrap                 0.0215        0.0227
Number observations                       65685         65685
Number clusters (states)                  51            51
Cluster size range                        519 to 5866   519 to 5866
Intraclass correlation                    0.018         -

Notes: March 2012 CPS data, from IPUMS download. Default standard errors for OLS assume errors are i.i.d.; default standard errors for FGLS assume the random effects model is correctly specified. 399 bootstrap replications. A fixed effects model is not possible, since the regressor is invariant within states.

Table 2 - Cross-section individual-level data: Monte Carlo rejection rates of true null hypothesis (slope = 0) with different numbers of clusters and different rejection methods
Nominal 5% rejection rates

                                                                 Numbers of Clusters
Estimation Method                                                6      10     20     30     50
Wald Tests
 1 White Robust, T(N-k) for critical value                       0.165  0.174  0.172  0.181  0.176
 2 Cluster on state, N(0,1) for critical value                   0.213  0.130  0.091  0.098  0.080
 3 Cluster on state, T(G-1) for critical value                   0.124  0.094  0.075  0.080  0.070
 4 Cluster on state, T(G-2) for critical value                   0.108  0.090  0.075  0.079  0.070
 5 Cluster on state, CR2 bias correction, T(G-1)                 0.089  0.075  0.066  0.071  0.065
 6 Cluster on state, CR3 bias correction, T(G-1)                 0.051  0.058  0.047  0.061  0.063
 7 Cluster on state, CR2 bias correction, IK degrees of freedom  0.060  0.056  0.045  0.056  0.055
 8 Cluster on state, CR2 bias correction, T(CSS eff. # clusters) 0.118  0.077  0.055  0.062  0.060
 9 Pairs cluster bootstrap for standard error, T(G-1)            0.090  0.063  0.066  0.070  0.072
Bootstrap Percentile-T methods
10 Pairs cluster bootstrap                                       0.019  0.037  0.043  0.069  0.057
11 Wild cluster bootstrap, Rademacher 2 point distribution       0.081  0.062  0.050  0.068  0.055
12 Wild cluster bootstrap, Webb 6 point distribution             0.087  0.063  0.058  0.064  0.055
13 Wild cluster bootstrap, Rademacher 2 pt, null not imposed     0.085  0.076  0.060  0.073  0.062
14 IK effective DOF (mean)                                       3.3    5.5    9.6    12.6   17.1
15 IK effective DOF (5th percentile)                             2.7    4.1    5.3    6.7    9.7
16 IK effective DOF (95th percentile)                            3.8    6.9    14.3   20.3   30.4
17 CSS effective # clusters (mean)                               4.6    6.6    10.2   12.9   17.1

Notes: Data drawn from March 2012 CPS data, 3% sample from IPUMS download (later version to use a larger data set). 1000 Monte Carlo replications (later version to have more reps). 399 bootstrap replications. "IK effective DOF" from Imbens and Kolesar (2013), and "CSS effective # clusters" from Carter, Schnepel and Steigerwald (2013); see Section VID.
Table 3 - State-year panel data with differences-in-differences estimation: Impacts of clustering and estimation choices on estimated coefficients and standard errors

Model:                                            (1) OLS-FE          (2) OLS, no state FE  (3) FGLS (AR(1) within each state)
                                                  SE      p-value     SE      p-value       SE      p-value
Slope coefficient                                 -0.0198             -0.0072               0.0037
Standard Errors
 1 Default standard errors, T(N-k)                0.0042  0.0000      0.0074  0.3328        0.0070  0.5967
 2 White Robust, T(N-k)                           0.0042  0.0000      0.0067  0.2845        -       -
 3 Cluster on state, T(G-1)                       0.0153  0.2016      0.0276  0.7959        0.0101  0.7142
 4 Cluster on state, CR2 bias correction, T(G-1)  0.0153  0.2017      0.0276  0.7958        -       -
 5 Cluster on state, CR2 bias correction, IK DOF  0.0153  0.2017      0.0276  0.7959        -       -
 6 Pairs cluster bootstrap SE, T(G-1)             0.0149  0.1879      0.0278  0.7977        0.0102  0.7172
Bootstrap Percentile-T methods (p-values)
 7 Pairs cluster bootstrap                                0.1598              0.7832                -
 8 Wild cluster bootstrap, Rademacher 2 point             0.7353              0.9590                -
 9 Wild cluster bootstrap, Webb 6 point                   0.7413              0.9590                -
Number observations                               1836                1836                  1836
Number clusters (states)                          51                  51                    51
10 Imbens-Kolesar effective DOF                   50                  50                    -
11 Carter-Schnepel-Steigerwald eff. # clusters    51                  51                    -

Table 4 - State-year panel data with differences-in-differences estimation: Monte Carlo rejection rates of true null hypothesis (slope = 0) with different numbers of clusters and different rejection methods
Nominal 5% rejection rates

                                                                 Numbers of Clusters
Estimation Method                                                6      10     20     30     50
Wald Tests
 1 Default standard errors, T(N-k) for critical value            0.604  0.576  0.589  0.573  0.611
 2 Cluster on state, N(0,1) for critical value                   0.136  0.104  0.054  0.042  0.055
 3 Cluster on state, T(G-1) for critical value                   0.077  0.063  0.042  0.034  0.049
 4 Cluster on state, T(G-2) for critical value                   0.062  0.056  0.041  0.034  0.049
 5 Pairs cluster bootstrap for standard error, T(G-1)            0.075  0.077  0.047  0.043  0.056
Bootstrap Percentile-T methods
 6 Pairs cluster bootstrap                                       0.010  0.018  0.048  0.036  0.061
 7 Wild cluster bootstrap, Rademacher 2 point distribution       0.080  0.065  0.043  0.037  0.053
 8 Wild cluster bootstrap, Webb 6 point distribution             0.083  0.060  0.045  0.035  0.058

Notes: Outcome is mean log wages from March CPS data, 1977-2012. Models include state and year fixed effects, and a "fake policy" dummy variable which turns on starting at 1986, for a random subset of half of the states. 1000 Monte Carlo replications (later version to have more reps). 399 bootstrap replications.