# "The Apple iPod iTunes Anti-Trust Litigation"

### Filing 754

Administrative Motion to File Under Seal Opposition to Plaintiffs' Daubert Motion 737 filed by Apple Inc. (Attachments: # 1 Declaration of Kiernan ISO Admin Motion to Seal, # 2 Exhibit 1 of Kiernan ISO Admin Motion to Seal, # 3 Exhibit 2 of Kiernan ISO Admin Motion to Seal, # 4 Proposed Order Granting Motion to Seal, # 5 Apple's Opp to Pls' Daubert Motion (Redacted), # 6 Apple's Opp to Pls' Daubert Motion, # 7 Declaration of Kiernan ISO Apple's Opp to Pls' Daubert Motion, # 8 Exhibit 1-4 (Redacted), # 9 Exhibit 5-12 (Redacted), # 10 Exhibit 1-2, 6, 9-11, # 11 Proposed Order Denying Plfs' Daubert Motion) (Kiernan, David) (Filed on 1/14/2014)

Exhibit 6
Roger Noll, Ph.D.
Confidential - Attorneys' Eyes Only
The Apple iPod iTunes Anti-Trust Litigation
Page 1

UNITED STATES DISTRICT COURT
NORTHERN DISTRICT OF CALIFORNIA
OAKLAND DIVISION

THE APPLE iPOD iTUNES                    Lead Case No. C 05-00037
ANTI-TRUST LITIGATION
____________________________
This Document Relates To:
ALL ACTIONS
____________________________

CONFIDENTIAL - ATTORNEYS' EYES ONLY

VIDEOTAPED DEPOSITION OF ROGER G. NOLL, PH.D.
Wednesday, December 18, 2013
Palo Alto, California

Reported by:
Darcy J. Brokaw
RPR, CRR, CSR No. 12584
Job No. 10008944

Page 2

CONFIDENTIAL - ATTORNEYS' EYES ONLY

Videotaped Deposition of ROGER G. NOLL, PH.D., taken on behalf of the Defendant, at 1755 Embarcadero Road, Palo Alto, California, beginning at 9:06 a.m. and ending at 11:54 p.m., on Wednesday, December 18, 2013, before Darcy J. Brokaw, CSR No. 12584.
Page 3

APPEARANCES

For the Plaintiffs and the deponent, Dr. Noll:
ROBBINS GELLER RUDMAN & DOWD, LLP
BY: ALEXANDRA S. BERNAY, ESQ.
BY: JENNIFER N. CARINGAL, ESQ.
655 West Broadway, Suite 1900
San Diego, California 92101
(619) 231-1058
xanb@rgrdlaw.com

For the Defendant, Apple Inc.:
JONES DAY
BY: DAVID KIERNAN, ESQ.
BY: AMIR AMIRI, ESQ.
BY: ROBERT MITTELSTAEDT, ESQ.
555 California Street, 26th Floor
San Francisco, California 94104
(415) 626-3939
dkiernan@jonesday.com

Also present:
Peter Hibdon, Videographer

Page 4

INDEX TO EXAMINATION
ROGER G. NOLL, PH.D.

EXAMINATION                                              PAGE
BY MR. KIERNAN                                              7
Page 21

MS. BERNAY: Objection. Argumentative.
BY MR. KIERNAN:
Q. It would be column A divided by 1 plus column B times column B; isn't that correct?
MS. BERNAY: Objection. Vague.
THE WITNESS: It would be explain -- explain it to me again.
BY MR. KIERNAN:
Q. It would be --
A. The actual overcharge would be -- so the -- what the percentage damages is is the percentage of the calculation you get from all the independent variables, which is an estimate of the transaction price you actually have. And the

Page 22

transaction price is -- has that overcharge of that amount. All right.
So I'm not sure I understand --
Q. I'm focusing on the formula that's in C.
A. Yes.
Q. And the formula in C is taking the percentage of the weighted average price. And my question is --
A. That is the existing price. It's not the but-for price.
Q. Right. And what I'm asking is: Isn't the correct formula to determine the price overcharge A divided by 1 plus column B times column B --
MS. BERNAY: Objection --
BY MR. KIERNAN:
Q. -- because column B reflects the change in percentage between --
A. Yes, you're right --
Q. -- the but-for price and --
(Reporter admonishes.)
THE WITNESS: Yes, the 2.3.8 is an approximation of what the -- what the exactly precise calculation would be, yes.
BY MR. KIERNAN:
Q. Okay. If the residual errors in the

Page 23

regression are correlated within a particular group and you don't do anything to correct for that, what would be the impact on the reported standard errors?
MS. BERNAY: Objection. Vague and ambiguous.
THE WITNESS: I didn't completely follow the question. Ask it again.
BY MR. KIERNAN:
Q. If the residual errors in the regression are correlated within a particular group and you don't do anything to correct for that, what would be the impact on the reported standard errors?
MS. BERNAY: Same objection.
THE WITNESS: It could be either way. It could make them higher or it could make them lower, depending on the nature of the correlation.
BY MR. KIERNAN:
Q. And why would it impact the reported standard errors?
A. Well, it's all built up in the -- in the nature of the assumptions one makes in doing a regression analysis, which is an independence of the standard errors. And if the standard errors -- if the -- if the random shock that is --
(Reporter inquires.)

Page 24

THE WITNESS: If the random shock that is in the regression equation does not satisfy the independence assumption, then the effect on the standard errors of the coefficients could be either to elevate them or to reduce them, depending on the nature of the violation of the independence assumption.
BY MR. KIERNAN:
Q. Okay. And are there standard statistical tests to test whether the residual errors are correlated within a particular group?
MS. BERNAY: Objection. Vague.
THE WITNESS: There are many such tests and many such corrections. But the effect is -- the existence of even statistically significant correlations is small unless those correlations are high. All right.
So the corrections for autocorrelation of residuals are not something that actually matters in the vast majority of cases because the -- it's almost never the case there's no correlation in residual errors, but it's almost never the case that making a correction for the auto- -- the correlation that does exist matters in terms of the regression.
It's also the case here that we're not
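The formula disputed at pages 21-22 can be made concrete. With hypothetical numbers (A as the observed average price and B as the estimated overcharge fraction; both values invented here for illustration, not taken from the litigation data), the version Dr. Noll calls an approximation applies B directly to the actual price, while the formula in the questioning backs out the but-for price first:

```python
A = 100.0   # observed (actual) average price -- hypothetical value
B = 0.073   # estimated overcharge fraction -- hypothetical value

# Approximation: apply the overcharge fraction directly to the actual price.
approx_overcharge = A * B                 # 7.30

# Formula from the questioning: A / (1 + B) * B, i.e. back out the
# but-for price, then apply the overcharge fraction to it.
but_for_price = A / (1 + B)
exact_overcharge = but_for_price * B      # about 6.80

print(approx_overcharge, exact_overcharge)
```

The two agree to first order when B is small, which is consistent with the testimony that the 2.3.8 figure is "an approximation of what the exactly precise calculation would be."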
Page 25

talking about a source of bias in the coefficients. We're talking about a source of bias in the estimated statistical significance, the --
BY MR. KIERNAN:
Q. The standard errors?
A. Yeah, the values of the -- the expected value of the regression coefficients is not affected.
Q. The coefficients aren't affected, but the calculations of the standard errors are affected?
A. Right, the calculations of the standard errors are affected, but the -- but the estimated effect of the independent variable is the same, the expected estimated effect.
Q. And if the residual errors are correlated within a particular group, the standard errors could either be overstated or understated?
A. Yes.
Q. Without a correction?
A. They could be. Although, again, the -- it's not -- it's not a dichotomous issue. They -- A, they may be affected, and B, the magnitude of the effect depends on the exact conditions.
Q. And to know the magnitude of effect, you'd have to test it, you'd have to run one of the

Page 26

standard statistical tests?
MS. BERNAY: Objection. Calls for speculation.
THE WITNESS: Well, actually, that's not what most -- what typically --
BY MR. KIERNAN:
Q. Can you just eyeball it?
A. -- happens.
(Reporter inquires.)
BY MR. KIERNAN:
Q. Can you just eyeball it?
MS. BERNAY: Objection. Vague.
THE WITNESS: Can I finish my first answer before I answer the next question?
BY MR. KIERNAN:
Q. Yes.
A. Okay. It is the case that if you plot the errors, you will know from experience if you actually have a problem that is causing the regression equation to be unreliable. But so "eyeball" is sort of a bizarre word.
What you actually do is you look at the actual scatter plot of points around the regression line and see if there is a clustering of observations above and below it. The problem with

Page 27

that -- that's a good way to see if there's positive error correlation, but it's not a good way to see if there's negative error correlation.
And the second point is that the nature of the error correlation may be that it's dependent on particular combinations of variables; and that one, the standard tests wouldn't even tell you that it exists.
Q. In this case, did you do anything to check whether the residual errors in your regression set forth in Exhibits 3A and 3B to Noll 10 are correlated with any particular group?
MS. BERNAY: Objection. Vague and ambiguous.
THE WITNESS: What do you mean by "group"?
BY MR. KIERNAN:
Q. Within any group.
A. What do you mean, "a group"? I don't understand what you mean by a group.
Q. We've been using group for the last ten minutes.
MS. BERNAY: Objection. Argumentative.
BY MR. KIERNAN:
Q. Same group that you've -- the same group that you've been referring to.

Page 28

A. I didn't refer to a group. I don't know what you're talking about. I know I fully intended --
Q. You used the term "cluster" --
(Reporter admonishes.)
BY MR. KIERNAN:
Q. You used the word cluster, within a cluster.
A. I don't agree that there are any clusters here.
MS. BERNAY: Objection.
BY MR. KIERNAN:
Q. That's not my question, Dr. Noll. I asked you, did you do anything to check whether the residual errors in your regressions set forth in Exhibit 3A and 3B are correlated within any cluster or group?
MS. BERNAY: Objection. Asked and answered.
THE WITNESS: I don't know what you mean by a group. And you used the word "or," and I don't believe there are any clusters. So how can I test for something when I don't -- I think it either doesn't exist or I don't understand what you're asking?
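The point conceded in this stretch of testimony, that within-group correlation of residuals can distort reported standard errors while leaving the expected coefficients unaffected, can be illustrated with simulated data (all numbers and group structure below are hypothetical stand-ins, not the litigation data). The sketch compares naive OLS standard errors, which assume independent errors, with cluster-robust sandwich standard errors when every observation in a group shares a common shock:

```python
import numpy as np

rng = np.random.default_rng(1)
G, m = 40, 25                          # 40 hypothetical groups, 25 observations each
groups = np.repeat(np.arange(G), m)

x = np.repeat(rng.normal(size=G), m)   # regressor constant within each group
u = np.repeat(rng.normal(size=G), m) + 0.3 * rng.normal(size=G * m)  # group-correlated errors
y = 1.0 + 0.5 * x + u

X = np.column_stack([np.ones(G * m), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
XtX_inv = np.linalg.inv(X.T @ X)

# Naive OLS standard errors: treat every residual as independent.
s2 = resid @ resid / (G * m - X.shape[1])
se_naive = np.sqrt(np.diag(s2 * XtX_inv))

# Cluster-robust (sandwich) standard errors: allow arbitrary
# correlation of residuals within a group.
meat = np.zeros((2, 2))
for g in range(G):
    Xg, ug = X[groups == g], resid[groups == g]
    sg = Xg.T @ ug
    meat += np.outer(sg, sg)
se_cluster = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(se_naive[1], se_cluster[1])
```

With positive within-group correlation and a group-level regressor, the naive standard errors are understated; as the testimony notes, negative correlation could push them the other way, so the direction depends on the nature of the correlation.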
Page 29

What is it you're asking? Can't you just give me an example of what you mean by a group, and then we won't have to discuss it?
BY MR. KIERNAN:
Q. So you don't understand the question?
A. I don't understand what you mean by a group, no. I don't know what you have in mind.
Q. And you don't know what I mean by cluster?
MS. BERNAY: Objection --
THE WITNESS: I know what you mean by a cluster, and there aren't any in this particular regression.
BY MR. KIERNAN:
Q. How do you know?
A. Because I know what cluster analysis is, and it doesn't apply to this regression because this isn't a sample.
Q. What did you do to determine if there were clusters? What statistical tests did you apply?
MS. BERNAY: Objection.
THE WITNESS: I looked at the definition of a cluster, and it doesn't apply to anything in this regression. I know -- I know what cluster analysis is, and it doesn't apply to this regression, notwithstanding what many of your

Page 30

experts have said. They're just not right.
BY MR. KIERNAN:
Q. Anything else other than looking at a definition?
MS. BERNAY: Objection. Argumentative.
THE WITNESS: I know -- the report, about a third of this report is about what cluster analysis is and what kinds of problems you apply to it and why this isn't a cluster sample problem. All right.
So, yes, there it is. I've cited articles in the professional literature of which I not only have read, but I actually know what they do. I have taught this stuff. So I know what I'm talking about. And there's references here. It's not that I just read a definition and decided that something didn't apply.
But I know, just from knowing what cluster analysis is, that it doesn't apply here.
BY MR. KIERNAN:
Q. You just know it when you see it?
MS. BERNAY: Objection. Argumentative, misstates his prior testimony.
Come on, David.
THE WITNESS: That's complete nonsense.

Page 31

There is --
BY MR. KIERNAN:
Q. I'm just trying to understand what you did other than reading some books to determine if there are clusters in the case.
MS. BERNAY: Objection. Argumentative.
THE WITNESS: There is no such thing as a test for whether you ought to use cluster analysis in a regression that doesn't satisfy the conditions for clustering.
BY MR. KIERNAN:
Q. Okay. That's what you teach your students?
MS. BERNAY: Objection. Argumentative.
THE WITNESS: Of course it is.
BY MR. KIERNAN:
Q. On page 34 of Noll 10 -- let me know when you get there.
A. I'm there.
Q. The first paragraph, the last third, you state that "Professors Murphy and Topel do not test whether the mean residual errors from this procedure are statistically significantly different from zero, which would have to be the case if the errors within a cluster are correlated."

Page 32

A. Yes.
Q. Did you perform that analysis?
A. No, because I don't believe there are clusters. The premise of that paragraph is if you assume a cluster analysis is appropriate, here's something you do. And they didn't do it. But I don't think you should even do that because it's not a cluster sample problem.
Q. If it turns out that within a group, within a cluster -- we can use the one defined by Professors Murphy and Topel -- the mean residual errors are statistically significantly different from zero, what would that tell you?
A. Nothing.
Q. Why not?
A. Because as I said before, you only get that far if you have a cluster sampling problem, and we don't have a cluster sampling problem. So there's no point in testing for cluster, the presence of clustering effects if you don't have a cluster to begin with.
This is a paragraph written on if there -- if it were a sample -- if the way I had done the analysis was to sample some transactions according to a subset of the models of iPods that were out
Page 33

there, so instead of having 100-odd iPod models, I only had 20, and within those 20, I had just drawn a sample of transactions instead of looking at the entire universe, then, in principle, there might be a clustering problem. But when you don't have a sample of either the models or the transactions, it's not a cluster problem.
So testing for cluster effects is a non sequitur. It's inappropriate, because you don't have cluster samples.
Q. Okay. And other than that basis that there's not a clustering problem because it's not a sample from a population, any other reason, any other basis for your opinion that there's not a clustering issue?
A. Only the fact it doesn't satisfy the conditions for doing cluster analysis?
Q. The one that you just described.
A. Yes. That's why it isn't a cluster problem, is because it's not a cluster sample. And cluster sampling is a procedure you use when you are sampling on both groups and people within a group.
If you have a population instead of a sample, there's no cluster issue, by definition.
Q. And so if the mean residual errors within

Page 34

certain particular groups in the transaction data at issue in this case are correlated, that is, they are statistically significantly different from zero, your opinion is it has no impact on the calculation of the standard errors in the case?
A. That's not what I said.
MS. BERNAY: Objection. Misstates his prior testimony.
BY MR. KIERNAN:
Q. What was wrong with -- what do you disagree with in the question I just asked?
MS. BERNAY: Objection. Vague.
THE WITNESS: First of all, if you look within a -- if you define the group as a particular model of an iPod, and you look at the errors in predicting that, and you find they're correlated, it may be -- it's perfectly explained if you took into account all the values of all the other independent variables.
So that test in and of itself doesn't prove anything. All right. The only way it proves something -- again, let's go back to the reasons cluster sampling can be a problem. And as stated in the report, there's three reasons why it can be a problem. One is a sample bias problem, and the

Page 35

other two are versions of omitted variable problems.
So the issue is, is there a sampling issue here? The answer is no.
Are there omitted variables? I'm not aware of any that would add statistical significance to the regression equation without being so highly multicollinear that they would destroy the coefficient estimates.
So there can't -- there isn't any -- none of the three reasons why you might have a problem exist. So I don't care what the test is, because it's testing for something that, in principle, can't exist as a problem in the regression.
BY MR. KIERNAN:
Q. So if you run a test on a particular group of transactions and the test shows that the mean residual errors are statistically significantly different from zero, your opinion is it has no impact on the calculation of the standard errors?
MS. BERNAY: Objection. Vague and ambiguous. Misstates prior testimony as well.
BY MR. KIERNAN:
Q. Let me put it differently. It does not overstate or understate the standard errors that you're calculating?

Page 36

MS. BERNAY: Same objection.
THE WITNESS: It may or may not. You haven't -- there's not enough information in your question to make a prediction about the effect on the calculation of the standard errors.
BY MR. KIERNAN:
Q. What additional information do you need?
A. You have to understand what is the source of what you're measuring. All right. You have to --
Q. The source of the observations?
A. No.
THE REPORTER: What's the question?
You guys are cutting each other off.
THE WITNESS: Yeah, he does do this, doesn't he?
The very first step is precisely what residual errors are you correlating, what actually is it. All right. And I don't know the answer to that.
All you're telling me is that within a model of iPods, the mean residual error isn't zero. That's all you're telling me. You're not telling me anything else about why it might be different from zero.
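The test debated in this passage, whether the mean residual error within a group is statistically significantly different from zero, can be sketched as follows. The data and the "model" grouping variable are simulated stand-ins chosen for illustration; with a correctly specified regression and an intercept, the residuals sum to zero overall, and each per-group t-statistic measures how far that group's mean residual sits from zero:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 600
model = rng.integers(0, 6, size=n)       # hypothetical grouping, e.g. a product-model index
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)   # well-specified model, independent errors

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta                     # residuals; overall mean is zero by construction

# For each group, a simple t-statistic for "mean residual differs from zero".
for g in range(6):
    r = resid[model == g]
    t = r.mean() / (r.std(ddof=1) / np.sqrt(len(r)))
    print(g, len(r), round(r.mean(), 3), round(t, 2))
```

As the testimony stresses, a significant t-statistic for one group by itself does not say why the group mean differs from zero; any subset with a positive mean residual forces the complement to have a negative one.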
Page 37

In fact, purely as a statistical matter, I would expect it not to be the case that they would all be zero, all right, just purely from random sample or random -- the random shocks in the regression.
So we have to know why. Before we can get to the question "is that going to affect the calculation of the standard errors of the regression coefficients," we have to understand why the residual errors don't sum to zero.
BY MR. KIERNAN:
Q. And did you explore any of the why the mean residual errors in your regression are statistically significantly different from zero for certain groups of transactions of iPods?
MS. BERNAY: Objection. Vague and ambiguous. Again, mischaracterizes the prior testimony.
THE WITNESS: First of all, you're assuming in the way the question is answered that I know which ones are statistically significantly different from zero, and I don't.
Secondly, all I did is examine the reasons given by Professors Murphy and Topel as to why these things were different from zero, and they're wrong.

Page 38

So I didn't go beyond that.
And they're small to begin with. Regardless of whether they're statistically significant, they're small anyway.
So there's certainly no proof that the answer in the regression equation about what damages are is in any way affected by anything in there that they discuss with regard to cluster analysis.
BY MR. KIERNAN:
Q. Dr. Noll, when you say "they're small to begin with. Regardless of whether they're statistically significant, they're small anyway," what are you referring to as they are small?
A. Well, there's a -- in the backup stuff to the reports, the residual errors, the mean residual errors by model are not big numbers. That's what I recall. I don't remember the precise thing because it was months ago.
But we did in fact examine what the basis was for their statements about the mean residual error, and there was no -- there was really nothing very important there.
Q. In your report, you discuss one technique for -- well, strike that.
You set forth a description of

Page 39

bootstrapping as a technique. And describe what "bootstrapping" is so I make sure I understand it.
A. Sample -- you have a small number of observations and -- this was actually invented by my college roommate.
You have a small number of observations, and the idea is if you just ran a single regression on the small sample that you have, the N wouldn't be large enough to be able to detect an effect, a causal effect of one variable on another.
So what you do is you draw a sample from -- a sample with replacement; that is to say, you pick an observation, pull it out, count that as an observation, and you put it back into the puddle of all the observations and you draw another one.
And you do that several times, run a regression. And then you do it all again and run another regression, and then you do it all again and run another regression. And then you use the distribution of the coefficients from those regressions as a way to estimate what the true coefficient is.
Q. Is that something you did in this case?
A. No. We don't have a small sample. We have a population.

Page 40

Q. What was the point of discussing that in your report?
A. The point of discussing it was the mischaracterization of what independence means, that the -- Professors Murphy and Topel mischaracterize independence as being the same observation. And in bootstrapping, you use the same observation over and over and over and over again, and it doesn't violate the independence assumption.
Q. What does the independence assumption refer to, in your words? You disagree with Professor Murphy and Professor Topel. Define for me what you're referring to as the independence assumption.
A. It's that the random component of the regression equation -- the distribution of that random component is unaffected by the observed values of any other component.
And the reason the independence assumption is satisfied in bootstrapping is that you're randomly drawing samples. So before the fact, what the next observation is going to be is independent of what the previous observation was.
Q. When you were referring to "random component," were you referring to the residual?
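The bootstrap procedure described on page 39, resample with replacement, refit the regression, and use the spread of the refitted coefficients, can be sketched on made-up data (the sample size, true slope, and resample count below are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30                                   # deliberately small sample
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)         # true slope 2.0, plus noise

slopes = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)     # draw n observations *with replacement*
    b = np.polyfit(x[idx], y[idx], 1)[0] # slope of a straight-line fit on the resample
    slopes.append(b)

slopes = np.array(slopes)
print(slopes.mean())   # bootstrap estimate of the slope
print(slopes.std())    # bootstrap standard error of the slope
```

Because each resample is drawn randomly, the next observation drawn is independent of the previous one, which is the sense in which the testimony says reusing the same observation does not violate the independence assumption.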
Page 41

A. The residual is an estimate. It's the unexplained variance. The independence assumption refers to the underlying distribution of the error term. And then the residual error in the regression equation is for the entire equation. By definition, a regression analysis has to have mean zero.
So the issue then about residual errors being correlated is you draw some subset of the observations and say is that sub- -- does that subset have correlations. And then if it does, that means the correlations of the residual errors in that group, if on average they're greater than zero, that means all the others have to be on average less than zero.
And then the issue is why does one subsample have positive residual errors and another have negative. And there's some potential explanations for that, one of which is the actual way you created the groups, because you may not have taken fully into account the effect, the actual effect that's already explained in the regression of some of the independent variables.
Q. And if I understand your answer, one way to test your independence assumption is to look at the distribution of the error terms in the

Page 42

regression?
A. It can be, but it isn't necessarily. You have to -- you have to have an underlying both economic theoretic and econometric theoretic reason to believe that the groupings you have make sense.
In other words, I can always construct a way to separate a sample into two groups so that one has positive residual errors on average and the other has negative, but that doesn't mean that there's a problem with the regression analysis, because I've constructed it to produce that, that result. And that's why I say you'd have to know what the reason for it is.
Just to take a very simple example, I could just take the ten observations where the model underestimates the true value by the maximum amount, all right, the worst possible observations in terms of underpredicting the dependent variable. Then I could call that a group, and I say, ah-hah, those are statistically significant positive residual errors.
But that doesn't mean there's anything wrong with the model. It doesn't mean there's anything funny going on with violation of independence. It just means that me, as the person

Page 43

dealing with the data, have constructed subsamples in a way to get groups that have residual errors that are statistically significantly different from zero. That doesn't tell me anything about the underlying quality of the regression, the standard errors or anything else. It just means that I've cherry-picked.
So that's why the answer to questions like you've been asking me always have to be "it depends." It depends on how the subsample was collected whether any test of whether the residual errors are positive or negative even makes sense to begin with.
Q. When dealing with -- strike that.
Are there other cases in which you have worked with an entire population of transactions in estimating a regression?
A. Yes.
Q. And in those cases, have you done anything to test the independence assumption that we've been discussing?
MS. BERNAY: Objection. Calls for speculation, vague and ambiguous.
THE WITNESS: I have -- the only -- first of all, the only circumstances in which that even is

Page 44

an interesting issue would be where you had very low explanatory power in the regression. All right. Then it's possible that you could have economically and econometrically meaningful subgroups that had positive or negative residual errors.
And so if you have extremely high R-squares, if your regression is doing a good job explaining the data, then it would not be a meaningful exercise to do that. And in most cases, I never do, because the R-square, like this one, is very high.
BY MR. KIERNAN:
Q. So if the R-square is very high and you're dealing with a --
A. Population.
Q. -- population subset, your opinion is there's no reason to test the independence assumption?
MS. BERNAY: Objection. Mischaracterizes the prior testimony.
THE WITNESS: Right. I have normally not attempted to test, but there are -- the only circumstances in which I would do that is if there was -- there was some really big outlier prediction errors and they were all the same thing. And you
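The "ten worst observations" example on page 42 is easy to reproduce: even with a correctly specified model and independent errors, a group chosen after looking at the residuals will show significantly positive residuals by construction. The data below are simulated solely to demonstrate that point:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)   # correctly specified model, independent errors

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta                     # residuals have overall mean zero

# "Group" = the ten largest underpredictions, chosen AFTER seeing the residuals.
worst = np.argsort(resid)[-10:]
group_mean = resid[worst].mean()
print(group_mean)                        # positive by construction, not evidence of a defect
```

Nothing is wrong with this model; the positive group mean reflects only how the subsample was collected, which is the "it depends" answer in the testimony.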
Page 45
·1·
·2·
·3·
·4·
·5·
·6·
·7·
·8·
·9·
10·
11·
12·
13·
14·
15·
16·
17·
18·
19·
20·
21·
22·
23·
24·
25·
·could get still get a high squared with a very small
·subset getting big prediction errors.
· · · · · ·(Reporter inquires.)
· · · · · ·THE WITNESS:· You can have a high
·R-squared in a regression and still have a group of
·predictions that were -- where the prediction error
·is large.· And then you would -- you would still
·want to address whether that group -- you had some
·omitted variable for that group or something.
· · · · · ·But again, that's not really likely to
·happen if you already have group identifiers.· See,
·again, the -- by definition, if you have group
·identifiers, the residual error within that group is
·going to be zero.· The mean residual error is going
·to be zero, because that's what regression analysis
·does.
· · · · · ·So that's why, for example, the most
·conventional solution to cluster problems is to use
·group identifiers, indicator variables, to get the
·mean of those residual errors for each group to
·zero.
·BY MR. KIERNAN:
· · · ·Q.· In this case, did you perform any
·statistical test to determine or to test your
·independence assumption?
·BY MR. KIERNAN:
· · · ·Q.· If you could turn to page 27.
· · · · · ·Let me know when you get there, Dr. Noll.
· · · · · ·MS. BERNAY:· 27?
· · · · · ·MR. KIERNAN:· Yeah, of Noll 10.
·BY MR. KIERNAN:
· · · ·Q.· In the second paragraph, you state that
·"Whereas new iPod owners in late 2006 became more
·locked in to iPods over time..."
· · · · · ·Do you see that sentence?
· · · ·A.· Yes.
· · · ·Q.· When you're referring to "new iPod
·owners in late 2006," who are you referring to?
· · · ·A.· People who had just bought or were about
·to buy an iPod.
· · · ·Q.· Okay.· And are you referring to consumers
·that purchase an iPod -- only those consumers who
·purchase an iPod with 7.0 implemented?
· · · ·A.· The issue of lock-in effect also depends
·on how locked in you are, of course.· But the --
·if -- and it also depends on whether you have bought
·digital downloads from the iTunes Store or not.
· · · · · ·So the degree of lock-in would be affected
·by 7.0, but it's not the only factor affecting
·lock-in.
· · · · · ·MS. BERNAY:· Objection.· Asked and
·answered.
· · · · · ·THE WITNESS:· I have -- I have not
·performed a test of the independence assumption as
·you've put it in that way, no.· It would be
·unnecessary, because there are no groups with
·outlying residual errors in the R-squared spot.· And
·by definition, the mean residual errors by group are
·going to be zero.
·BY MR. KIERNAN:
· · · ·Q.· And if statistical tests show that mean
·residual errors within groups are correlated, that
·does not affect your analysis or any of your
·opinions in any way?
· · · · · ·MS. BERNAY:· Objection.· Calls for
·speculation.
· · · · · ·THE WITNESS:· It might or it might not,
·depending on what the reason for finding that
·correlation was, that statistically significant
·correlation was.· It would purely depend on the way
·the test was performed and the way the groups were
·created and the way the residual errors were
·calculated.· All right.· That's what it would depend
·on.
·///
· · · · · ·MR. KIERNAN:· Okay.· Move to strike as
·nonresponsive.
·BY MR. KIERNAN:
· · · ·Q.· With respect to your first sentence, you
·state:· "Whereas new iPod owners in late 2006 became
·more locked in to iPods over time..."
· · · · · ·My question is:· The "new iPod owners in
·late 2006," does that refer to purchasers of iPods
·that only included 7.0?
· · · · · ·MS. BERNAY:· Objection.· Asked and
·answered.
· · · · · ·THE WITNESS:· My nonresponsive answer was
·in fact responsive.· It depends on other things.
·All right.· People who bought 7.0, obviously 7.0 --
·iPods with 7.0 in them contributed to a lock-in
·effect more than people whose iPods did not have
·7.0.
· · · · · ·But on the other hand, if people,
·regardless of whether they bought 7.0, bought music
·from the iTunes Store in a DRM-protected fashion,
·they would be experiencing lock-in as well.
·BY MR. KIERNAN:
· · · ·Q.· Okay.
· · · ·A.· So that, contrary to your assertion, it
·was completely responsive.· You just didn't
Page 117
·1· · · · · · · · · · ·REPORTER CERTIFICATION
·2
·3· · · · · · ·I, Darcy J. Brokaw, a Certified Shorthand
·4· ·Reporter, do hereby certify:
·5· · · · · · ·That prior to being examined, the witness in
·6· ·the foregoing proceedings was by me duly sworn to
·7· ·testify to the truth, the whole truth, and nothing but
·8· ·the truth;
·9· · · · · · ·That said proceedings were taken before me at
10· ·the time and place therein set forth and were taken down
11· ·by me in shorthand and thereafter transcribed into
12· ·typewriting under my direction and supervision;
13· · · · · · ·I further certify that I am neither counsel
14· ·for, nor related to, any party to said proceedings, nor
15· ·in any way interested in the outcome thereof.
16· · · · · · ·In witness whereof, I have hereunto subscribed
17· ·my name.
18
19· ·Dated:· December 19, 2013
20
21· ·_____________________________
· · ·Darcy J. Brokaw
22· ·CSR No. 12584, RPR, CRR
23
24
25
Exhibit 7
A Practitioner's Guide to Cluster-Robust Inference
A. Colin Cameron and Douglas L. Miller
Department of Economics, University of California - Davis.
This version (almost final): October 15, 2013
Abstract
We consider statistical inference for regression when data are grouped into clusters, with regression model errors independent across clusters but correlated within clusters. Examples include data on individuals with clustering on village or region or other category such as industry, and state-year differences-in-differences studies with clustering on state. In such settings default standard errors can greatly overstate estimator precision. Instead, if the number of clusters is large, statistical inference after OLS should be based on cluster-robust standard errors. We outline the basic method as well as many complications that can arise in practice. These include cluster-specific fixed effects, few clusters, multi-way clustering, and estimators other than OLS.
JEL Classification: C12, C21, C23, C81.
Keywords: Cluster robust; cluster effect; fixed effects; random effects; hierarchical models; feasible GLS; differences in differences; panel data; cluster bootstrap; few clusters; multi-way clusters.
In preparation for The Journal of Human Resources. We thank four referees and the journal editor for very helpful comments and for guidance, participants at the 2013 California Econometrics Conference and at a workshop sponsored by the U.K. Programme Evaluation for Policy Analysis, and the many people who have over time sent us cluster-related puzzles (the solutions to some of which appear in this paper). Doug Miller acknowledges financial support from the Center for Health and Wellbeing at the Woodrow Wilson School of Public Policy at Princeton University.
Contents
I Introduction . . . 4
II Cluster-Robust Inference . . . 6
   IIA A Simple Example . . . 6
   IIB Clustered Errors and Two Leading Examples . . . 7
      IIB.1 Example 1: Individuals in Cluster . . . 8
      IIB.2 Example 2: Differences-in-Differences (DiD) in a State-Year Panel . . . 9
   IIC The Cluster-Robust Variance Matrix Estimate . . . 10
   IID Feasible GLS . . . 11
   IIE Implementation for OLS and FGLS . . . 13
   IIF Cluster-Bootstrap Variance Matrix Estimate . . . 14
III Cluster-Specific Fixed Effects . . . 15
   IIIA Do Fixed Effects Fully Control for Within-Cluster Correlation? . . . 16
   IIIB Cluster-Robust Variance Matrix with Fixed Effects . . . 17
   IIIC Feasible GLS with Fixed Effects . . . 19
   IIID Testing the Need for Fixed Effects . . . 19
IV What to Cluster Over? . . . 20
   IVA Factors Determining What to Cluster Over . . . 20
   IVB Clustering Due to Survey Design . . . 22
V Multi-way Clustering . . . 23
   VA Multi-way Cluster-Robust Variance Matrix Estimate . . . 23
   VB Implementation . . . 25
   VC Feasible GLS . . . 26
   VD Spatial Correlation . . . 27
VI Few Clusters . . . 28
   VIA The Basic Problems with Few Clusters . . . 29
   VIB Solution 1: Bias-Corrected Cluster-Robust Variance Matrix . . . 30
   VIC Solution 2: Cluster Bootstrap with Asymptotic Refinement . . . 31
      VIC.1 Percentile-t Bootstrap . . . 31
      VIC.2 Wild Cluster Bootstrap . . . 32
      VIC.3 Bootstrap with Caution . . . 34
   VID Solution 3: Improved Critical Values using a T-distribution . . . 35
      VID.1 G-L Degrees of Freedom . . . 35
      VID.2 Data-determined Degrees of Freedom . . . 35
      VID.3 Effective Number of Clusters . . . 37
   VIE Special Cases . . . 37
      VIE.1 Fixed Number of Clusters with Cluster Size Growing . . . 37
      VIE.2 Few Treated Groups . . . 38
      VIE.3 What if you have multi-way clustering and few clusters? . . . 39
VII Extensions . . . 39
   VIIA Cluster-Robust F-tests . . . 39
   VIIB Instrumental Variables Estimators . . . 40
      VIIB.1 IV and 2SLS . . . 40
      VIIB.2 Weak Instruments . . . 41
      VIIB.3 Linear GMM . . . 42
   VIIC Nonlinear Models . . . 43
      VIIC.1 Different Models for Clustering . . . 43
      VIIC.2 Fixed Effects . . . 45
   VIID Cluster-randomized Experiments . . . 46
VIII Empirical Example . . . 46
   VIIIA Individual-level Cross-section Data: One Sample . . . 46
   VIIIB Individual-level Cross-section Data: Monte Carlo . . . 48
   VIIIC State-Year Panel Data: One Sample . . . 49
   VIIID State-Year Panel Data: Monte Carlo . . . 50
IX Concluding Thoughts . . . 51
X References . . . 52
I. Introduction
In an empiricist's day-to-day practice, most effort is spent on getting unbiased or consistent point estimates. That is, a lot of attention focuses on the parameters ($\hat\beta$). In this paper we focus on getting accurate statistical inference, a fundamental component of which is obtaining accurate standard errors ($\mathrm{se}$, the estimated standard deviation of $\hat\beta$). We begin with the basic reminder that empirical researchers should also really care about getting this part right. An asymptotic 95% confidence interval is $\hat\beta \pm 1.96 \times \mathrm{se}$, and hypothesis testing is typically based on the Wald "t-statistic" $w = (\hat\beta - \beta_0)/\mathrm{se}$. Both $\hat\beta$ and $\mathrm{se}$ are critical ingredients for statistical inference, and we should be paying as much attention to getting a good $\mathrm{se}$ as we do to obtaining $\hat\beta$.
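As a toy illustration of these two inference quantities, the sketch below computes the confidence interval and Wald statistic for purely hypothetical numbers (the estimate and standard error are invented for illustration, not taken from the paper):

```python
# Hypothetical estimate, standard error, and null value, for illustration only.
beta_hat, se, beta0 = 0.240, 0.068, 0.0

# Asymptotic 95% confidence interval: beta_hat +/- 1.96 * se.
ci = (beta_hat - 1.96 * se, beta_hat + 1.96 * se)

# Wald "t-statistic" w = (beta_hat - beta0) / se.
w = (beta_hat - beta0) / se

print(f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f}), w = {w:.2f}")
```

A too-small se narrows the interval and inflates w, which is exactly the failure mode the paper documents for clustered data.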
In this paper, we consider statistical inference in regression models where observations can be grouped into clusters, with model errors uncorrelated across clusters but correlated within cluster. One leading example of "clustered errors" is individual-level cross-section data with clustering on geographical region, such as village or state. Then model errors for individuals in the same region may be correlated, while model errors for individuals in different regions are assumed to be uncorrelated. A second leading example is panel data. Then model errors in different time periods for a given individual (e.g., person or firm or region) may be correlated, while model errors for different individuals are assumed to be uncorrelated.
Failure to control for within-cluster error correlation can lead to very misleadingly small standard errors, and consequent misleadingly narrow confidence intervals, large t-statistics and low p-values. It is not unusual to have applications where standard errors that control for within-cluster correlation are several times larger than default standard errors that ignore such correlation. As shown below, the need for such control increases not only with the size of the within-cluster error correlation, but also with the size of the within-cluster correlation of regressors and with the number of observations within a cluster. A leading example, highlighted by Moulton (1986, 1990), is when interest lies in measuring the effect of a policy variable, or other aggregated regressor, that takes the same value for all observations within a cluster.
One way to control for clustered errors in a linear regression model is to additionally specify a model for the within-cluster error correlation, consistently estimate the parameters of this error correlation model, and then estimate the original model by feasible generalized least squares (FGLS) rather than ordinary least squares (OLS). Examples include random effects estimators and, more generally, random coefficient and hierarchical models. If all goes well this provides valid statistical inference, as well as estimates of the parameters of the original regression model that are more efficient than OLS. However, these desirable properties hold only under the very strong assumption that the model for within-cluster error correlation is correctly specified.
A more recent method to control for clustered errors is to estimate the regression model with limited or no control for within-cluster error correlation, and then post-estimation obtain "cluster-robust" standard errors proposed by White (1984, p.134-142) for OLS with a multivariate dependent variable (directly applicable to balanced clusters); by Liang and Zeger (1986) for linear and nonlinear models; and by Arellano (1987) for the fixed effects estimator in linear panel models. These cluster-robust standard errors do not require specification of a model for within-cluster error correlation, but do require the additional assumption that the number of clusters, rather than just the number of observations, goes to infinity.
Cluster-robust standard errors are now widely used, popularized in part by Rogers (1993), who incorporated the method in Stata, and by Bertrand, Duflo and Mullainathan (2004), who pointed out that many differences-in-differences studies failed to control for clustered errors, and that those that did often clustered at the wrong level. Cameron and Miller (2011) and Wooldridge (2003, 2006) provide surveys, and lengthy expositions are given in Angrist and Pischke (2009) and Wooldridge (2010).
One goal of this paper is to provide the practitioner with the methods to implement cluster-robust inference. To this end we include in the paper reference to relevant Stata commands (for version 13), since Stata is the computer package most used in applied microeconometrics research. And we will post on our websites more expansive Stata code and datasets used in this paper. A second goal is presenting how to deal with complications such as determining when there is a need to cluster, incorporating fixed effects, and inference when there are few clusters. A third goal is to provide an exposition of the underlying econometric theory, as this can aid in understanding complications. In practice the most difficult complication to deal with can be "few" clusters; see Section VI. There is no clear-cut definition of "few"; depending on the situation "few" may range from less than 20 to less than 50 clusters in the balanced case.
We focus on OLS, for simplicity and because this is the most commonly-used estimation method in practice. Section II presents the basic results for OLS with clustered errors. In principle, implementation is straightforward as econometrics packages include cluster-robust as an option for the commonly-used estimators; in Stata it is the vce(cluster) option. The remainder of the survey concentrates on complications that often arise in practice. Section III addresses how the addition of fixed effects impacts cluster-robust inference. Section IV deals with the obvious complication that it is not always clear what to cluster over. Section V considers clustering when there is more than one way to do so and these ways are not nested in each other. Section VI considers how to adjust inference when there are just a few clusters as, without adjustment, test statistics based on the cluster-robust standard errors over-reject and confidence intervals are too narrow. Section VII presents extension to the full range of estimators: instrumental variables, nonlinear models such as logit and probit, and generalized method of moments. Section VIII presents both empirical examples and real-data based simulations. Concluding thoughts are given in Section IX.
II. Cluster-Robust Inference
In this section we present the fundamentals of cluster-robust inference. For these basic results we assume that the model does not include cluster-specific fixed effects, that it is clear how to form the clusters, and that there are many clusters. We relax these conditions in subsequent sections.
Clustered errors have two main consequences: they (usually) reduce the precision of $\hat\beta$, and the standard estimator for the variance of $\hat\beta$, $\widehat{\mathrm{V}}[\hat\beta]$, is (usually) biased downward from the true variance. Computing cluster-robust standard errors is a fix for the latter issue. We illustrate these issues, initially in the context of a very simple model and then in the following subsection in a more typical model.
IIA. A Simple Example
For simplicity, we begin with OLS with a single regressor that is nonstochastic, and assume no intercept in the model. The results extend to multiple regression with stochastic regressors.
Let $y_i = \beta x_i + u_i$, $i = 1, \ldots, N$, where $x_i$ is nonstochastic and $\mathrm{E}[u_i] = 0$. The OLS estimator $\hat\beta = \sum_i x_i y_i / \sum_i x_i^2$ can be re-expressed as $\hat\beta - \beta = \sum_i x_i u_i / \sum_i x_i^2$, so in general

$$\mathrm{V}[\hat\beta] = \mathrm{E}[(\hat\beta - \beta)^2] = \mathrm{V}\Big[\sum\nolimits_i x_i u_i\Big] \Big/ \Big(\sum\nolimits_i x_i^2\Big)^2. \qquad (1)$$

If errors are uncorrelated over $i$, then $\mathrm{V}[\sum_i x_i u_i] = \sum_i \mathrm{V}[x_i u_i] = \sum_i x_i^2 \mathrm{V}[u_i]$. In the simplest case of homoskedastic errors, $\mathrm{V}[u_i] = \sigma^2$ and (1) simplifies to $\mathrm{V}[\hat\beta] = \sigma^2 / \sum_i x_i^2$.
If instead errors are heteroskedastic, then (1) becomes

$$\mathrm{V}_{\mathrm{het}}[\hat\beta] = \Big(\sum\nolimits_i x_i^2 \mathrm{E}[u_i^2]\Big) \Big/ \Big(\sum\nolimits_i x_i^2\Big)^2,$$

using $\mathrm{V}[u_i] = \mathrm{E}[u_i^2]$ since $\mathrm{E}[u_i] = 0$. Implementation seemingly requires consistent estimates of each of the $N$ error variances $\mathrm{E}[u_i^2]$. In a very influential paper, one that extends naturally to the clustered setting, White (1980) noted that instead all that is needed is an estimate of the scalar $\sum_i x_i^2 \mathrm{E}[u_i^2]$, and that one can simply use $\sum_i x_i^2 \hat u_i^2$, where $\hat u_i = y_i - \hat\beta x_i$ is the OLS residual, provided $N \to \infty$. This leads to estimated variance

$$\widehat{\mathrm{V}}_{\mathrm{het}}[\hat\beta] = \Big[\sum\nolimits_i x_i^2 \hat u_i^2\Big] \Big/ \Big(\sum\nolimits_i x_i^2\Big)^2.$$

The resulting standard error for $\hat\beta$ is often called a robust standard error, though a better, more precise term is heteroskedastic-robust standard error.
What if errors are correlated over $i$? In the most general case where all errors are correlated with each other,

$$\mathrm{V}\Big[\sum\nolimits_i x_i u_i\Big] = \sum\nolimits_i \sum\nolimits_j \mathrm{Cov}[x_i u_i, x_j u_j] = \sum\nolimits_i \sum\nolimits_j x_i x_j \mathrm{E}[u_i u_j],$$

so

$$\mathrm{V}_{\mathrm{cor}}[\hat\beta] = \Big(\sum\nolimits_i \sum\nolimits_j x_i x_j \mathrm{E}[u_i u_j]\Big) \Big/ \Big(\sum\nolimits_i x_i^2\Big)^2.$$

The obvious extension of White (1980) is to use $\widehat{\mathrm{V}}[\hat\beta] = [\sum_i \sum_j x_i x_j \hat u_i \hat u_j] / (\sum_i x_i^2)^2$, but this equals zero since $\sum_i x_i \hat u_i = 0$. Instead one needs to first set a large fraction of the error correlations $\mathrm{E}[u_i u_j]$ to zero. For time series data with errors assumed to be correlated only up to, say, $m$ periods apart as well as heteroskedastic, White's result can be extended to yield a heteroskedastic- and autocorrelation-consistent (HAC) variance estimate; see Newey and West (1987).
In this paper we consider clustered errors, with $\mathrm{E}[u_i u_j] = 0$ unless observations $i$ and $j$ are in the same cluster (such as same region). Then

$$\mathrm{V}_{\mathrm{clu}}[\hat\beta] = \Big(\sum\nolimits_i \sum\nolimits_j x_i x_j \mathrm{E}[u_i u_j]\,\mathbf{1}[i, j \text{ in same cluster}]\Big) \Big/ \Big(\sum\nolimits_i x_i^2\Big)^2, \qquad (2)$$

where the indicator function $\mathbf{1}[A]$ equals 1 if event $A$ happens and equals 0 if event $A$ does not happen. Provided the number of clusters goes to infinity, we can use the variance estimate

$$\widehat{\mathrm{V}}_{\mathrm{clu}}[\hat\beta] = \Big[\sum\nolimits_i \sum\nolimits_j x_i x_j \hat u_i \hat u_j\,\mathbf{1}[i, j \text{ in same cluster}]\Big] \Big/ \Big(\sum\nolimits_i x_i^2\Big)^2. \qquad (3)$$

This estimate is called a cluster-robust estimate, though more precisely it is heteroskedastic- and cluster-robust. This estimate reduces to $\widehat{\mathrm{V}}_{\mathrm{het}}[\hat\beta]$ in the special case that there is only one observation in each cluster.
Typically $\widehat{\mathrm{V}}_{\mathrm{clu}}[\hat\beta]$ exceeds $\widehat{\mathrm{V}}_{\mathrm{het}}[\hat\beta]$ due to the addition of terms when $i \neq j$. The amount of increase is larger (1) the more positively associated are the regressors across observations in the same cluster (via $x_i x_j$ in (3)), (2) the more correlated are the errors (via $\mathrm{E}[u_i u_j]$ in (2)), and (3) the more observations are in the same cluster (via $\mathbf{1}[i, j \text{ in same cluster}]$ in (3)).
There are several take-away messages. First, there can be great loss of efficiency in OLS estimation if errors are correlated within cluster rather than completely uncorrelated. Intuitively, if errors are positively correlated within cluster then an additional observation in the cluster no longer provides a completely independent piece of new information. Second, failure to control for this within-cluster error correlation can lead to using standard errors that are too small, with consequent overly-narrow confidence intervals, overly-large t-statistics, and over-rejection of true null hypotheses. Third, it is straightforward to obtain cluster-robust standard errors, though they do rely on the assumption that the number of clusters goes to infinity (see Section VI for the few-clusters case).
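The contrast between the estimates of (1)-(3) can be sketched numerically. The simulation below is our own illustration (the data-generating process and parameter choices are not from the paper); it uses the identity that the within-cluster double sum in (3) collapses to a sum over clusters of the squared within-cluster scores:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: G clusters of m observations; both the regressor and
# the error have a cluster-level component, so they are positively
# correlated within cluster (the worst case for default standard errors).
G, m = 50, 20
cluster = np.repeat(np.arange(G), m)
x = np.repeat(rng.normal(size=G), m) + 0.5 * rng.normal(size=G * m)
u = np.repeat(rng.normal(size=G), m) + rng.normal(size=G * m)
y = 1.0 * x + u  # true beta = 1, no intercept, as in the simple model above

# OLS slope and residuals for the single-regressor model.
b = (x @ y) / (x @ x)
uhat = y - b * x
denom = (x @ x) ** 2

# Heteroskedastic-robust variance (White 1980): sum_i x_i^2 uhat_i^2 / (sum_i x_i^2)^2.
v_het = np.sum(x**2 * uhat**2) / denom

# Cluster-robust variance, equation (3): keep the i != j cross terms only
# within a cluster; each cluster's double sum equals (sum_{i in g} x_i uhat_i)^2.
v_clu = sum((x[cluster == g] @ uhat[cluster == g]) ** 2 for g in range(G)) / denom

print(f"het-robust se = {v_het**0.5:.4f}, cluster-robust se = {v_clu**0.5:.4f}")
```

With positive within-cluster correlation of both regressor and error, the cluster-robust standard error comes out several times larger, matching the take-away messages above.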
IIB. Clustered Errors and Two Leading Examples
Let $i$ denote the $i$th of $N$ individuals in the sample, and $g$ denote the $g$th of $G$ clusters. Then for individual $i$ in cluster $g$ the linear model with (one-way) clustering is

$$y_{ig} = \mathbf{x}_{ig}'\boldsymbol\beta + u_{ig}, \qquad (4)$$

where $\mathbf{x}_{ig}$ is a $K \times 1$ vector. As usual it is assumed that $\mathrm{E}[u_{ig} \mid \mathbf{x}_{ig}] = 0$. The key assumption is that errors are uncorrelated across clusters, while errors for individuals belonging to the same cluster may be correlated. Thus

$$\mathrm{E}[u_{ig} u_{jg'} \mid \mathbf{x}_{ig}, \mathbf{x}_{jg'}] = 0, \text{ unless } g = g'. \qquad (5)$$
IIB.1. Example 1: Individuals in Cluster
Hersch (1998) uses cross-section individual-level data to estimate the impact of job injury risk on wages. Since there is no individual-level data on job injury rate, a more aggregated measure such as job injury risk in the individual's industry is used as a regressor. Then for individual $i$ (with $N = 5960$) in industry $g$ (with $G = 211$)

$$y_{ig} = \gamma\,\mathrm{risk}_g + \mathbf{z}_{ig}'\boldsymbol\delta + u_{ig}.$$

The regressor $\mathrm{risk}_g$ is perfectly correlated within industry. The error term will be positively correlated within industry if the model systematically overpredicts (or underpredicts) wages in a given industry. In this case default OLS standard errors will be downwards biased.
To measure the extent of this downwards bias, suppose errors are equicorrelated within cluster, so $\mathrm{Cor}[u_{ig}, u_{jg}] = \rho_u$ for all $i \neq j$. This pattern is suitable when observations can be viewed as exchangeable, with ordering not mattering. Common examples include the current one, individuals or households within a village or other geographic unit (such as state), individuals within a household, and students within a school. Then a useful approximation is that for the $k$th regressor the default OLS variance estimate based on $s^2(\mathbf{X}'\mathbf{X})^{-1}$, where $s$ is the standard error of the regression, should be inflated by

$$\tau_k \simeq 1 + \rho_{x_k} \rho_u (\bar N_g - 1), \qquad (6)$$

where $\rho_{x_k}$ is a measure of the within-cluster correlation of $x_{igk}$, $\rho_u$ is the within-cluster error correlation, and $\bar N_g$ is the average cluster size. The result (6) is exact if clusters are of equal size ("balanced" clusters) and $\rho_{x_k} = 1$ for all regressors (Kloek, 1981); see Scott and Holt (1982) and Greenwald (1983) for the general result with a single regressor.
This very important and insightful result is that the variance inflation factor is increasing in
1. the within-cluster correlation of the regressor
2. the within-cluster correlation of the error
3. the number of observations in each cluster.
For clusters of unequal size, replace $(\bar N_g - 1)$ in (6) by $((\mathrm{V}[N_g]/\bar N_g) + \bar N_g - 1)$; see Moulton (1986, p.387). Note that there is no cluster problem if any one of the following occur: $\rho_{x_k} = 0$ or $\rho_u = 0$ or $\bar N_g = 1$.
In an influential paper, Moulton (1990) pointed out that in many settings the inflation factor $\tau_k$ can be large even if $\rho_u$ is small. He considered a log earnings regression using March CPS data ($N = 18{,}946$), regressors aggregated at the state level ($G = 49$), and errors correlated within state ($\hat\rho_u = 0.032$). The average group size was $18{,}946/49 = 387$, and $\rho_{x_k} = 1$ for a state-level regressor, so (6) yields $\tau_k \simeq 1 + 1 \times 0.032 \times 386 = 13.3$. The weak correlation of errors within state was still enough to lead to cluster-corrected standard errors being $\sqrt{13.3} = 3.7$ times larger than the (incorrect) default standard errors!
In such examples of cross-section data with an aggregated regressor, the cluster-robust
standard errors can be much larger despite low within-cluster error correlation because the
regressor of interest is perfectly correlated within cluster and there may be many observations
per cluster.
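Approximation (6) is easy to compute directly. The sketch below (the helper function name is our own) plugs in Moulton's CPS numbers:

```python
import math

def moulton_inflation(rho_x, rho_u, avg_cluster_size):
    """Variance inflation factor of equation (6): tau ~= 1 + rho_x*rho_u*(Nbar - 1)."""
    return 1.0 + rho_x * rho_u * (avg_cluster_size - 1)

# Moulton's (1990) CPS example: a state-level regressor (rho_x = 1),
# within-state error correlation 0.032, average group size 18,946/49 ~= 387.
tau = moulton_inflation(rho_x=1.0, rho_u=0.032, avg_cluster_size=387)
print(f"variance inflated by ~{tau:.1f}; standard errors by ~{math.sqrt(tau):.1f}")
```

Even with an error correlation as weak as 0.032, the large cluster size drives the variance inflation to roughly 13 and the standard-error inflation to roughly 3.7.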
IIB.2. Example 2: Differences-in-Differences (DiD) in a State-Year Panel
Interest may lie in how wages respond to a binary policy variable $d_{ts}$ that varies by state and over time. Then at time $t$ in state $s$

$$y_{ts} = \gamma d_{ts} + \mathbf{z}_{ts}'\boldsymbol\delta + \alpha_s + \theta_t + u_{ts},$$

where we assume independence over states, so the ordering of subscripts $(t, s)$ corresponds to $(i, g)$ in (4), and $\alpha_s$ and $\theta_t$ are state and year effects.
The binary regressor $d_{ts}$ equals one if the policy of interest is in effect and equals 0 otherwise. The regressor $d_{ts}$ is often highly serially correlated since, for example, $d_{ts}$ will equal a string of zeroes followed by a string of ones for a state that switches from never having the policy in place to forever after having the policy in place. The error $u_{ts}$ is correlated over time for a given state if the model systematically overpredicts (or underpredicts) wages in a given state. Again the default standard errors are likely to be downwards-biased.
In the panel data case, the within-cluster (i.e., within-individual) error correlation decreases as the time separation increases, so errors are not equicorrelated. A better model for the errors is a time series model, such as an autoregressive error of order one that implies $\mathrm{Cor}[u_{ts}, u_{t's}] = \rho^{|t - t'|}$. The true variance of the OLS estimator will again be larger than the OLS default, although the consequences of clustering are less extreme than in the case of equicorrelated errors (see Cameron and Miller (2011, Section 2.3) for more detail).
In such DiD examples with panel data, the cluster-robust standard errors can be much larger than the default because both the regressor of interest and the errors are highly correlated within cluster. Note also that this complication can exist even with the inclusion of fixed effects (see Section III).
The same problems arise if we additionally have data on individuals, so that

$$y_{its} = \gamma d_{ts} + \mathbf{z}_{its}'\boldsymbol\delta + \alpha_s + \theta_t + u_{its}.$$

In an influential paper, Bertrand, Duflo and Mullainathan (2004) demonstrated the importance of using cluster-robust standard errors in DiD settings. Furthermore, the clustering should be on state, assuming error independence across states. The clustering should not be on state-year pairs since, for example, the error for California in 2010 is likely to be correlated with the error for California in 2009.
The issues raised here are relevant for any panel data application, not just DiD studies. The DiD panel example with a binary policy regressor is often emphasized in the cluster-robust literature because it is widely used and it has a regressor that is highly serially correlated, even after mean-differencing to control for fixed effects. This serial correlation leads to a potentially large difference between cluster-robust and default standard errors.
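The AR(1) error structure described above can be checked by simulation. The sketch below is our own illustration (the value of $\rho$ and the panel dimensions are arbitrary); it verifies that the empirical autocorrelations of such state-year errors decay as $\rho^{|t-t'|}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# State-year errors following an AR(1) within each state:
# u_ts = rho * u_{t-1,s} + e_ts, which implies Cor[u_ts, u_t's] = rho**|t - t'|.
rho, T, S = 0.8, 50, 2000
u = np.empty((T, S))
u[0] = rng.normal(scale=1.0 / np.sqrt(1 - rho**2), size=S)  # stationary start
for t in range(1, T):
    u[t] = rho * u[t - 1] + rng.normal(size=S)

# Pair u[t, s] with u[t + lag, s] across all states and compare correlations.
for lag in (1, 2, 3):
    corr = np.corrcoef(u[:-lag].ravel(), u[lag:].ravel())[0, 1]
    print(f"lag {lag}: empirical corr = {corr:.3f}, rho**lag = {rho**lag:.3f}")
```

The decay with the lag is the key difference from the equicorrelated case of Example 1, and is why clustering on state (not state-year) is the appropriate choice here.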
IIC. The Cluster-Robust Variance Matrix Estimate
Stacking all observations in the $g$th cluster, the model (4) can be written as

$$\mathbf{y}_g = \mathbf{X}_g\boldsymbol\beta + \mathbf{u}_g, \qquad g = 1, \ldots, G,$$

where $\mathbf{y}_g$ and $\mathbf{u}_g$ are $N_g \times 1$ vectors, $\mathbf{X}_g$ is an $N_g \times K$ matrix, and there are $N_g$ observations in cluster $g$. Further stacking $\mathbf{y}_g$, $\mathbf{X}_g$ and $\mathbf{u}_g$ over the $G$ clusters then yields the model

$$\mathbf{y} = \mathbf{X}\boldsymbol\beta + \mathbf{u}.$$

The OLS estimator is

$$\hat{\boldsymbol\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \Big(\sum\nolimits_{g=1}^G \mathbf{X}_g'\mathbf{X}_g\Big)^{-1} \sum\nolimits_{g=1}^G \mathbf{X}_g'\mathbf{y}_g.$$

In general, the variance matrix conditional on $\mathbf{X}$ is

$$\mathrm{V}[\hat{\boldsymbol\beta}] = (\mathbf{X}'\mathbf{X})^{-1}\,\mathbf{B}\,(\mathbf{X}'\mathbf{X})^{-1}, \qquad (7)$$

with

$$\mathbf{B} = \mathbf{X}'\mathrm{V}[\mathbf{u} \mid \mathbf{X}]\mathbf{X}. \qquad (8)$$

Given error independence across clusters, $\mathrm{V}[\mathbf{u} \mid \mathbf{X}]$ has a block-diagonal structure, and (8) simplifies to

$$\mathbf{B}_{\mathrm{clu}} = \sum\nolimits_{g=1}^G \mathbf{X}_g'\,\mathrm{E}[\mathbf{u}_g\mathbf{u}_g' \mid \mathbf{X}_g]\,\mathbf{X}_g. \qquad (9)$$

The matrix $\mathbf{B}_{\mathrm{clu}}$, the middle part of the "sandwich matrix" (7), corresponds to the numerator of (2). $\mathbf{B}_{\mathrm{clu}}$ can be written as

$$\mathbf{B}_{\mathrm{clu}} = \sum\nolimits_{g=1}^G \sum\nolimits_{i=1}^{N_g} \sum\nolimits_{j=1}^{N_g} \mathbf{x}_{ig}\mathbf{x}_{jg}'\,\omega_{ig,jg},$$

where $\omega_{ig,jg} = \mathrm{E}[u_{ig}u_{jg} \mid \mathbf{X}_g]$ is the error covariance for the $ig$th and $jg$th observations. We can gain a few insights from inspecting this equation. The term $\mathbf{B}$ (and hence $\mathrm{V}[\hat{\boldsymbol\beta}]$) will be bigger when: (1) regressors within cluster are correlated, (2) errors within cluster are correlated so $\omega_{ig,jg}$ is non-zero, (3) $N_g$ is large, and (4) the within-cluster regressor and error correlations are of the same sign (the usual situation). These conditions mirror the more precise Moulton result for the special case of equicorrelated errors given in equation (6). Both examples in Section II had high within-cluster correlation of the regressor, the DiD example had high within-cluster (serial) correlation of the error, and the Moulton (1990) example had $N_g$ large.
Implementation requires an estimate of Bclu given in (9). The cluster-robust estimate of
the variance matrix (CRVE) of the OLS estimator is the sandwich estimate
where
b
b
Vclu [ b ] = (X0 X) 1 Bclu (X0 X) 1 ;
P
b
b b
Bclu = G X0g ug u0g Xg ;
g=1
(10)
(11)
and $\widehat{u}_g = y_g - X_g\widehat{\beta}$ is the vector of OLS residuals for the $g$th cluster. Formally (10)-(11) provides a consistent estimate of the variance matrix if $G^{-1}\sum_g X_g'\widehat{u}_g\widehat{u}_g'X_g - G^{-1}\sum_g \mathrm{E}[X_g'u_g u_g'X_g] \overset{p}{\to} 0$ as $G \to \infty$. Initial derivations of this estimator, by White (1984, p.134-142) for balanced clusters and by Liang and Zeger (1986) for unbalanced, assumed a finite number of observations per cluster. Hansen (2007a) showed that the CRVE can also be used if $N_g \to \infty$, the case for long panels, in addition to $G \to \infty$. Carter, Schnepel and Steigerwald (2013) consider unbalanced panels with either $N_g$ fixed or $N_g \to \infty$. The sandwich formula for the CRVE extends to many estimators other than OLS; see Section VII.
Algebraically, the estimator (10)-(11) equals (7) and (9) with $\mathrm{E}[u_g u_g']$ replaced with $\widehat{u}_g\widehat{u}_g'$. What is striking about this is that for each cluster $g$, the $N_g \times N_g$ matrix $\widehat{u}_g\widehat{u}_g'$ is bound to be a very poor estimate of the $N_g \times N_g$ matrix $\mathrm{E}[u_g u_g']$; there is no averaging going on to enable use of a Law of Large Numbers. The "magic" of the CRVE is that despite this, by averaging across all $G$ clusters in (11), we are able to get a consistent variance estimate. This fact helps us to understand one of the limitations of this method in practice: the averaging that makes $\widehat{V}_{\text{clu}}[\widehat{\beta}]$ accurate for $V[\widehat{\beta}]$ is an average based on the number of clusters $G$. In applications with few clusters this can lead to problems that we discuss below in Section VI.
Finite-sample modifications of (11) are typically used, to reduce downwards bias in $\widehat{V}_{\text{clu}}[\widehat{\beta}]$ due to finite $G$. Stata uses $\sqrt{c}\,\widehat{u}_g$ in (11) rather than $\widehat{u}_g$, with
$$c = \frac{G}{G-1}\,\frac{N-1}{N-K} \simeq \frac{G}{G-1}. \qquad (12)$$
In general $c \simeq G/(G-1)$, though see Section IIIB for an important exception when fixed effects are directly estimated. Some other packages such as SAS use $c = G/(G-1)$, a simpler correction that is also used by Stata for extensions to nonlinear models. Either choice of $c$ usually lessens, but does not fully eliminate, the usual downwards bias in the CRVE. Other finite-cluster corrections are discussed in Section VI, but there is no clear best correction.
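For readers working outside packages with a built-in cluster-robust option, the sandwich estimate (10)-(11) with correction (12) is only a few lines to compute directly. A minimal numpy sketch (the function and variable names are our own, not from any package):

```python
import numpy as np

def cluster_robust_variance(X, y, cluster_ids):
    """OLS point estimates plus the CRVE of (10)-(11), scaled by the
    finite-sample correction c = [G/(G-1)] * [(N-1)/(N-K)] of (12)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    cluster_ids = np.asarray(cluster_ids)
    N, K = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    u = y - X @ beta                      # OLS residuals
    B = np.zeros((K, K))
    for g in np.unique(cluster_ids):      # sum over the G clusters
        s = X[cluster_ids == g].T @ u[cluster_ids == g]   # X_g' u_g
        B += np.outer(s, s)               # X_g' u_g u_g' X_g
    G = len(np.unique(cluster_ids))
    c = (G / (G - 1)) * ((N - 1) / (N - K))
    return beta, c * XtX_inv @ B @ XtX_inv
```

Cluster-robust standard errors are then the square roots of the diagonal of the returned variance matrix.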
IID. Feasible GLS
If errors are correlated within cluster, then in general OLS is inefficient and feasible GLS may be more efficient.
Suppose we specify a model for $\Omega_g = \mathrm{E}[u_g u_g'|X_g]$ in (9), such as within-cluster equicorrelation. Then the GLS estimator is $(X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$, where $\Omega = \mathrm{Diag}[\Omega_g]$. Given a consistent estimate $\widehat{\Omega}$ of $\Omega$, the feasible GLS estimator of $\beta$ is
$$\widehat{\beta}_{\text{FGLS}} = \Big(\sum_{g=1}^{G} X_g'\widehat{\Omega}_g^{-1}X_g\Big)^{-1} \sum_{g=1}^{G} X_g'\widehat{\Omega}_g^{-1}y_g. \qquad (13)$$
The FGLS estimator is second-moment efficient, with variance matrix
$$\widehat{V}_{\text{def}}[\widehat{\beta}_{\text{FGLS}}] = (X'\widehat{\Omega}^{-1}X)^{-1}, \qquad (14)$$
under the strong assumption that the error variance $\Omega$ is correctly specified.
Remarkably, the cluster-robust method of the previous section can be extended to FGLS. Essentially OLS is the special case where $\Omega_g = \sigma^2 I_{N_g}$. The cluster-robust estimate of the asymptotic variance matrix of the FGLS estimator is
$$\widehat{V}_{\text{clu}}[\widehat{\beta}_{\text{FGLS}}] = \big(X'\widehat{\Omega}^{-1}X\big)^{-1} \Big(\sum_{g=1}^{G} X_g'\widehat{\Omega}_g^{-1}\widehat{u}_g\widehat{u}_g'\widehat{\Omega}_g^{-1}X_g\Big) \big(X'\widehat{\Omega}^{-1}X\big)^{-1}, \qquad (15)$$
where $\widehat{u}_g = y_g - X_g\widehat{\beta}_{\text{FGLS}}$. This estimator requires that $u_g$ and $u_h$ are uncorrelated when $g \neq h$, and that $G \to \infty$, but permits $\mathrm{E}[u_g u_g'|X_g] \neq \Omega_g$. The approach of specifying a model for the error variances and then doing inference that guards against misspecification of this model is especially popular in the biostatistics literature, which calls $\Omega_g$ a "working" variance matrix (see, for example, Liang and Zeger, 1986).
There are many possible candidate models for $\Omega_g$, depending on the type of data being
analyzed.
For individual-level data clustered by region, example 1 in Section IIB, a common starting point model is the random effects (RE) model. The error is specified to have two components:
$$u_{ig} = \alpha_g + \varepsilon_{ig}, \qquad (16)$$
where $\alpha_g$ is a cluster-specific error or common shock that is assumed to be independent and identically distributed (i.i.d.) $(0, \sigma_\alpha^2)$, and $\varepsilon_{ig}$ is an idiosyncratic error that is assumed to be i.i.d. $(0, \sigma_\varepsilon^2)$. Then $V[u_{ig}] = \sigma_\alpha^2 + \sigma_\varepsilon^2$ and $\mathrm{Cov}[u_{ig}, u_{jg}] = \sigma_\alpha^2$ for $i \neq j$. It follows that the intraclass correlation of the error is $\rho_u = \mathrm{Cor}[u_{ig}, u_{jg}] = \sigma_\alpha^2/(\sigma_\alpha^2 + \sigma_\varepsilon^2)$, so this model implies equicorrelated errors within cluster. Richer models that introduce heteroskedasticity include random coefficients models and hierarchical linear models.
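The within-cluster variance matrix implied by the RE model is easy to construct explicitly. A small sketch under model (16) (the variance-component values below are arbitrary):

```python
import numpy as np

def re_omega(sigma_a2, sigma_e2, n_g):
    """Within-cluster error variance matrix implied by (16):
    sigma_a^2 + sigma_e^2 on the diagonal, sigma_a^2 off it."""
    return sigma_a2 * np.ones((n_g, n_g)) + sigma_e2 * np.eye(n_g)

Omega = re_omega(1.0, 3.0, 4)
rho_u = 1.0 / (1.0 + 3.0)   # sigma_a^2 / (sigma_a^2 + sigma_e^2) = 0.25
# every off-diagonal correlation equals the intraclass correlation
assert np.isclose(Omega[0, 1] / Omega[0, 0], rho_u)
```

This equicorrelated matrix is exactly the $\Omega_g$ one would plug into the FGLS formula (13).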
For panel data, example 2 in Section IIB, a range of time series models for $u_{it}$ may be used, including autoregressive and moving average error models. Analysis of within-cluster residual correlation patterns after OLS estimation can be helpful in selecting a model for $\Omega_g$.
Note that in all cases if cluster-specific fixed effects are included as regressors and $N_g$ is small then bias-corrected FGLS should be used; see Section IIIC.
The efficiency gains of FGLS need not be great. As an extreme example, with equicorrelated errors, balanced clusters, and all regressors invariant within cluster ($x_{ig} = x_g$) the FGLS estimator equals the OLS estimator, so there is no efficiency gain to FGLS. With equicorrelated errors and general $X$, Scott and Holt (1982) provide an upper bound to the maximum proportionate efficiency loss of OLS, compared to the variance of the FGLS estimator, of
$$1 \Big/ \Big[1 + \frac{4(1-\rho_u)\,[1+(N_{\max}-1)\rho_u]}{(N_{\max}\,\rho_u)^2}\Big], \qquad N_{\max} = \max\{N_1, \ldots, N_G\}.$$
This upper bound is increasing in the error correlation $\rho_u$ and the maximum cluster size $N_{\max}$. For low $\rho_u$ the
maximal efficiency gain can be low. For example, Scott and Holt (1982) note that for $\rho_u = 0.05$ and $N_{\max} = 20$ there is at most a 12% efficiency loss of OLS compared to FGLS. With $\rho_u = 0.2$ and $N_{\max} = 100$ the efficiency loss could be as much as 86%, though this depends on the nature of $X$.
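The Scott and Holt (1982) bound is simple to evaluate numerically; the following sketch reproduces the two figures just quoted:

```python
def scott_holt_bound(rho_u, n_max):
    """Upper bound on the proportionate efficiency loss of OLS
    relative to FGLS under equicorrelated errors."""
    return 1.0 / (1.0 + 4.0 * (1.0 - rho_u) * (1.0 + (n_max - 1) * rho_u)
                  / (n_max * rho_u) ** 2)

print(round(scott_holt_bound(0.05, 20), 2))   # 0.12: at most 12% loss
print(round(scott_holt_bound(0.20, 100), 2))  # 0.86: up to 86% loss
```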
There is no clear guide to when FGLS may lead to considerable improvement in efficiency, and the efficiency gains can be modest. However, especially in models without cluster-specific fixed effects, implementation of FGLS and use of (15) to guard against misspecification of $\Omega_g$ is straightforward. And even modest efficiency gains can be beneficial. It is remarkable that current econometric practice with clustered errors ignores the potential efficiency gains of FGLS.
IIE. Implementation for OLS and FGLS
For regression software that provides a cluster-robust option, implementation of the CRVE for OLS simply requires defining for each observation a cluster identifier variable that takes one of G distinct values according to the observation's cluster, and then passing this cluster identifier to the estimation command's cluster-robust option. For example, if the cluster identifier is id_clu, then the Stata OLS command regress y x becomes regress y x, vce(cluster id_clu).
Wald hypothesis tests and confidence intervals are then implemented in the usual way. In some cases, however, joint tests of several hypotheses and of overall statistical significance may not be possible. The CRVE $\widehat{V}_{\text{clu}}[\widehat{\beta}]$ is guaranteed to be positive semi-definite, so the estimated variances of the individual components of $\widehat{\beta}$ are necessarily nonnegative. But $\widehat{V}_{\text{clu}}[\widehat{\beta}]$ is not necessarily positive definite, so it is possible that the variance matrix of linear combinations of the components of $\widehat{\beta}$ is singular. The rank of $\widehat{V}_{\text{clu}}[\widehat{\beta}]$ equals the rank of $\widehat{B}$ defined in (11). Since $\widehat{B} = \widehat{C}'\widehat{C}$, where $\widehat{C}' = [X_1'\widehat{u}_1 \ \cdots \ X_G'\widehat{u}_G]$ is a $K \times G$ matrix, it follows that the rank of $\widehat{B}$, and hence that of $\widehat{V}_{\text{clu}}[\widehat{\beta}]$, is at most the rank of $\widehat{C}$. Since $X_1'\widehat{u}_1 + \cdots + X_G'\widehat{u}_G = 0$, the rank of $\widehat{C}$ is at most the minimum of $K$ and $G-1$. Effectively, the rank of $\widehat{V}_{\text{clu}}[\widehat{\beta}]$ equals $\min(K, G-1)$, though it can be less than this in some examples such as perfect collinearity of regressors and cluster-specific dummy regressors (see Section IIIB for the latter).
A common setting is to have a richly specified model with thousands of observations in far fewer clusters, leading to more regressors than clusters. Then $\widehat{V}_{\text{clu}}[\widehat{\beta}]$ is rank-deficient, so it will not be possible to perform an overall F test of the joint statistical significance of all regressors. And in a log-wage regression with occupation dummies and clustering on state, we cannot test the joint statistical significance of the occupation dummies if there are more occupations than states. But it is still okay to perform statistical inference on individual regression coefficients, and to do joint tests on a limited number of restrictions (potentially as many as $\min(K, G-1)$).
Regression software usually also includes a panel data component. Panel commands may
enable not only OLS with cluster-robust standard errors, but also FGLS for some models of
within-cluster error correlation with default (and possibly cluster-robust) standard errors. It is important to note that those panel data commands that do not explicitly use time series methods (an example is FGLS with equicorrelation of errors within cluster) can be applied more generally to other forms of clustered data, such as individual-level data with clustering on geographic region.
For example, in Stata first give the command xtset id_clu to let Stata know that the cluster identifier is variable id_clu. Then the Stata command xtreg y x, pa corr(ind) vce(robust) yields OLS estimates with cluster-robust standard errors. Note that for Stata xt commands, option vce(robust) is generally interpreted as meaning cluster-robust; this is always the case from version 12.1 on. The xt commands use standard normal critical values whereas command regress uses T(G-1) critical values; see Sections VI and VIIA for further discussion.
For FGLS estimation the commands vary with the model for $\Omega_g$. For equicorrelated errors, a starting point for example 1 in Section IIB, the FGLS estimator can be obtained using command xtreg y x, pa corr(exch) or command xtreg y x, re; slightly different estimates are obtained due to slightly different estimates of the equicorrelation. For FGLS for hierarchical models that are richer than a random effects model, use Stata command mixed (version 13) or xtmixed (earlier versions). For FGLS with panel data and time variable time, first give the command xtset id_clu time to let Stata know both the cluster identifier and the time variable. A starting point for example 2 in Section IIB is an autoregressive error of order one, estimated using command xtreg y x, pa corr(ar 1). Stata permits a wide range of possible models for serially correlated errors.
In all of these FGLS examples the reported standard errors are the default ones that assume correct specification of $\Omega_g$. Better practice is to add option vce(robust) for xtreg commands, or option vce(cluster id_clu) for mixed commands, as this yields standard errors that are based on the cluster-robust variance defined in (15).
IIF. Cluster-Bootstrap Variance Matrix Estimate
Not all econometrics packages compute cluster-robust variance estimates, and even those that do may not do so for all estimators. In that case one can use a pairs cluster bootstrap that, like the CRVE, gives a consistent estimate of $V[\widehat{\beta}]$ when errors are clustered.
To implement this bootstrap, do the following steps B times: (1) form G clusters $\{(y_1, X_1), \ldots, (y_G, X_G)\}$ by resampling with replacement G times from the original sample of clusters, and (2) compute the estimate of $\beta$, denoted $\widehat{\beta}_b$, in the $b$th bootstrap sample. Then, given the B estimates $\widehat{\beta}_1, \ldots, \widehat{\beta}_B$, compute the variance of these:
$$\widehat{V}_{\text{clu,boot}}[\widehat{\beta}] = \frac{1}{B-1}\sum_{b=1}^{B}(\widehat{\beta}_b - \bar{\beta})(\widehat{\beta}_b - \bar{\beta})',$$
where $\bar{\beta} = B^{-1}\sum_{b=1}^{B}\widehat{\beta}_b$, and $B = 400$ should be more than adequate in most settings. It
is important that the resampling be done over entire clusters, rather than over individual
observations. Each bootstrap resample will have exactly G clusters, with some of the original clusters not appearing at all while others of the original clusters may be repeated in the resample two or more times. The term "pairs" is used as $(y_g, X_g)$ are resampled as a pair. The term "nonparametric" is also used for this bootstrap. Some alternative bootstraps hold $X_g$ fixed while resampling. For finite clusters, if $\widehat{V}_{\text{clu}}[\widehat{\beta}]$ uses $\sqrt{c}\,\widehat{u}_g$ in (11) then for comparability $\widehat{V}_{\text{clu,boot}}[\widehat{\beta}]$ should be multiplied by the constant $c$ in (12). The pairs cluster bootstrap leads to a cluster-robust variance matrix for OLS with rank K even if K > G.
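Outside of a package with a built-in option, the resampling scheme just described takes only a few lines. A minimal Python sketch (function and variable names are our own):

```python
import numpy as np

def pairs_cluster_bootstrap(X, y, cluster_ids, B=400, seed=10101):
    """Resample entire clusters with replacement, re-estimate OLS in
    each of the B resamples, and return the variance of the B estimates."""
    rng = np.random.default_rng(seed)
    members = [np.flatnonzero(cluster_ids == g)
               for g in np.unique(cluster_ids)]
    G = len(members)
    betas = []
    for _ in range(B):
        draw = rng.integers(0, G, size=G)   # G clusters, with replacement
        idx = np.concatenate([members[g] for g in draw])
        betas.append(np.linalg.lstsq(X[idx], y[idx], rcond=None)[0])
    betas = np.asarray(betas)
    return np.cov(betas, rowvar=False, ddof=1)   # 1/(B-1) divisor
```

Note that whole clusters, never individual observations, enter or leave each resample, which is exactly the requirement stressed above.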
An alternative resampling method that can be used is the leave-one-cluster-out jackknife. Then, letting $\widehat{\beta}_g$ denote the estimator of $\beta$ when the $g$th cluster is deleted,
$$\widehat{V}_{\text{clu,jack}}[\widehat{\beta}] = \frac{G-1}{G}\sum_{g=1}^{G}(\widehat{\beta}_g - \bar{\beta})(\widehat{\beta}_g - \bar{\beta})',$$
where $\bar{\beta} = G^{-1}\sum_{g=1}^{G}\widehat{\beta}_g$. This older method can be viewed as an approximation to the bootstrap that does not work as well for nonlinear estimators. It is used less often than the bootstrap, and has the same rank as the CRVE.
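A minimal Python sketch of this jackknife (names are our own, not from any package):

```python
import numpy as np

def cluster_jackknife(X, y, cluster_ids):
    """Leave-one-cluster-out jackknife variance for OLS: delete each
    cluster in turn, re-estimate, and scale the spread by (G-1)/G."""
    labels = np.unique(cluster_ids)
    G = len(labels)
    betas = np.asarray([
        np.linalg.lstsq(X[cluster_ids != g], y[cluster_ids != g],
                        rcond=None)[0]
        for g in labels])
    dev = betas - betas.mean(axis=0)
    return (G - 1) / G * dev.T @ dev
```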
Unlike a percentile-t cluster bootstrap, presented later, the pairs cluster bootstrap and cluster jackknife variance matrix estimates are no better asymptotically than the CRVE, so it is best and quickest to use the CRVE if it is available. But the CRVE is not always available, especially for estimators more complicated than OLS. In that case one can instead use the pairs cluster bootstrap (though see the end of Section VIC for potential pitfalls if there are few clusters) or even the cluster jackknife.
In Stata the pairs cluster bootstrap for OLS without fixed effects can be implemented in several equivalent ways including: regress y x, vce(boot, cluster(id_clu) reps(400) seed(10101)); xtreg y x, pa corr(ind) vce(boot, reps(400) seed(10101)); and bootstrap, cluster(id_clu) reps(400) seed(10101) : regress y x. The last variant can be used for estimation commands and user-written programs that do not have a vce(boot) option. We recommend 400 bootstrap iterations for published results, and for replicability one should always set the seed.
For the jackknife the commands are instead, respectively, regress y x, vce(jack, cluster(id_clu)); xtreg y x, pa corr(ind) vce(jack); and jackknife, cluster(id_clu): regress y x. For Stata xt commands, options vce(boot) and vce(jack) are generally interpreted as meaning cluster bootstrap and cluster jackknife; this is always the case from Stata 12.1 on.
III. Cluster-Specific Fixed Effects
The cluster-specific fixed effects (FE) model includes a separate intercept for each cluster, so
$$y_{ig} = x_{ig}'\beta + \alpha_g + u_{ig} = x_{ig}'\beta + \sum_{h=1}^{G}\alpha_h d_{hig} + u_{ig}, \qquad (17)$$
where $d_{hig}$, the $h$th of $G$ dummy variables, equals one if the $ig$th observation is in cluster $h$ and equals zero otherwise.
There are several different ways to obtain the same cluster-specific fixed effects estimator. The two most commonly used are the following. The least squares dummy variable (LSDV) estimator directly estimates the second line of (17), with OLS regression of $y_{ig}$ on $x_{ig}$ and the G dummy variables $d_{1ig}, \ldots, d_{Gig}$, in which case $\widehat{\alpha}_g = \bar{y}_g - \bar{x}_g'\widehat{\beta}$ where $\bar{y}_g = N_g^{-1}\sum_{i=1}^{N_g} y_{ig}$ and $\bar{x}_g = N_g^{-1}\sum_{i=1}^{N_g} x_{ig}$. The within estimator, also called the fixed effects estimator, estimates $\beta$ by OLS in the within or mean-differenced model
$$(y_{ig} - \bar{y}_g) = (x_{ig} - \bar{x}_g)'\beta + (u_{ig} - \bar{u}_g). \qquad (18)$$
The main reason that empirical economists use the cluster-specific FE estimator is that it controls for a limited form of endogeneity of regressors. Suppose in (17) that we view $\alpha_g + u_{ig}$ as the error, and the regressors $x_{ig}$ are correlated with this error, but only with the cluster-invariant component, i.e., $\mathrm{Cov}[x_{ig}, \alpha_g] \neq 0$ while $\mathrm{Cov}[x_{ig}, u_{ig}] = 0$. Then OLS and FGLS regression of $y_{ig}$ on $x_{ig}$, as in Section II, leads to inconsistent estimation of $\beta$. Mean-differencing (17) leads to the within model (18) that has eliminated the problematic cluster-invariant error component $\alpha_g$. The resulting FE estimator is consistent for $\beta$ if either $G \to \infty$ or $N_g \to \infty$.
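The equivalence of the LSDV and within estimators of $\beta$ is easy to verify numerically. A small simulated sketch (all names and data-generating values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
G, Ng = 10, 5
g_id = np.repeat(np.arange(G), Ng)
x = rng.normal(size=G * Ng)
alpha = rng.normal(size=G)                    # cluster fixed effects
y = 2.0 * x + alpha[g_id] + rng.normal(size=G * Ng)

# LSDV: regress y on x and the full set of G cluster dummies.
D = (g_id[:, None] == np.arange(G)).astype(float)
b_lsdv = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)[0][0]

# Within: demean y and x by cluster, then OLS on the demeaned data.
def demean(v):
    return v - (np.bincount(g_id, v) / np.bincount(g_id))[g_id]

b_within = demean(x) @ demean(y) / (demean(x) @ demean(x))
assert np.isclose(b_lsdv, b_within)   # identical point estimates
```

The two routes differ only in how finite-sample corrections to the CRVE are computed, a point taken up in Section IIIB.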
The cluster-robust variance matrix formula given in Section II carries over immediately to OLS estimation in the FE model, again assuming $G \to \infty$.
In this section we consider some practicalities. First, including fixed effects generally does not control for all the within-cluster correlation of the error and one should still use the CRVE. Second, when cluster sizes are small and degrees-of-freedom corrections are used, the CRVE should be computed by within rather than LSDV estimation. Third, FGLS estimators need to be bias-corrected when cluster sizes are small. Fourth, tests of fixed versus random effects models should use a modified version of the Hausman test.
IIIA. Do Fixed Effects Fully Control for Within-Cluster Correlation?
While cluster-specific effects will control for part of the within-cluster correlation of the error, in general they will not completely control for within-cluster error correlation (not to mention heteroskedasticity). So the CRVE should still be used. There are several ways to make this important point.
Suppose we have data on students in classrooms in schools. A natural model, a special case of a hierarchical model, is to suppose that there is both an unobserved school effect and, on top of that, an unobserved classroom effect. Letting $i$ denote individual, $s$ school, and $c$ classroom, we have $y_{isc} = x_{isc}'\beta + \alpha_s + \delta_c + \varepsilon_{isc}$. A regression with school-level fixed effects (or random effects) will control for within-school correlation, but not the additional within-classroom correlation.
Suppose we have a short panel ($T$ fixed, $N \to \infty$) of uncorrelated individuals and estimate $y_{it} = x_{it}'\beta + \alpha_i + u_{it}$. Then the error $u_{it}$ may be correlated over time (i.e., within-cluster) due to omitted factors that evolve progressively over time. Inoue and Solon (2006) provide a test for this serial correlation. Cameron and Trivedi (2005, p.710) present an FE individual-level panel data example, log-earnings regressed on log-hours, with cluster-robust standard errors four times the default. Serial correlation in the error may be due to omitting lagged $y$ as a regressor. When $y_{i,t-1}$ is included as an additional regressor in the FE model, the Arellano-Bond estimator is used, and even with $y_{i,t-1}$ included one should still first test whether the remaining error $u_{it}$ is serially correlated.
Finally, suppose we have a single cross-section (or a single time series). This can be viewed as regression on a single cluster. Then in the model $y_i = \alpha + x_i'\beta + u_i$ (or $y_t = \alpha + x_t'\beta + u_t$), the intercept $\alpha$ is the cluster-specific fixed effect. There are many reasons why the error $u_i$ (or $u_t$) may be correlated in this regression.
IIIB. Cluster-Robust Variance Matrix with Fixed Effects
Since inclusion of cluster-specific fixed effects may not fully control for cluster correlation (and/or heteroskedasticity), default standard errors that assume $u_{ig}$ to be i.i.d. may be invalid. So one should use cluster-robust standard errors.
Arellano (1987) showed that $\widehat{V}_{\text{clu}}[\widehat{\beta}]$ defined in (10)-(11) remains valid for the within estimator that controls for inclusion of $G$ cluster-specific fixed effects, provided $G \to \infty$ and $N_g$ is small. If instead one obtains the LSDV estimator, the CRVE formula gives the same CRVE for $\widehat{\beta}$ as that for the within estimator, with the important proviso that the same degrees-of-freedom adjustment must be used; see below. The fixed effects estimates $\widehat{\alpha}_g$ obtained for the LSDV estimator are essentially based only on $N_g$ observations, so $\widehat{V}[\widehat{\alpha}_g]$ is inconsistent for $V[\widehat{\alpha}_g]$, just as $\widehat{\alpha}_g$ is inconsistent for $\alpha_g$.
Hansen (2007a, p.600) shows that this CRVE can also be used if additionally $N_g \to \infty$, for both the case where within-cluster correlation is always present (e.g. for many individuals in each village) and for the case where within-cluster correlation eventually disappears (e.g. for panel data where time series correlation disappears for observations far apart in time). The rates of convergence are $\sqrt{G}$ in the first case and $\sqrt{GN_g}$ in the second case, but the same asymptotic variance matrix is obtained in either case. Kezdi (2004) analyzed the CRVE for a range of values of $G$ and $N_g$.
It is important to note that while LSDV and within estimation lead to identical estimates of $\beta$, they can yield different standard errors due to different finite-sample degrees-of-freedom corrections.
It is well known that if default standard errors are used, i.e. it is assumed that $u_{ig}$ in (17) is i.i.d., then one can safely use standard errors after LSDV estimation, as it correctly views the number of parameters as $G + K$ rather than $K$. If instead the within estimator is used, however, manual OLS estimation of (18) will mistakenly view the number of parameters to equal $K$ rather than $G + K$. (Built-in panel estimation commands for the within estimator, i.e. a fixed effects command, should remain okay to use, since they should be programmed to use $G + K$ in calculating the standard errors.)
It is not well known that if cluster-robust standard errors are used, and cluster sizes are small, then inference should be based on the within estimator standard errors. We thank Arindrajit Dube and Jason Lindo for bringing this issue to our attention. Within and LSDV estimation lead to the same cluster-robust standard errors if we apply formula (11) to the respective regressions, or if we multiply this estimate by $c = G/(G-1)$. Differences arise, however, if we multiply by the small-sample correction $c$ given in (12). Within estimation sets $c = \frac{G}{G-1}\,\frac{N-1}{N-(K-1)}$ since there are only $(K-1)$ regressors; the within model is estimated without an intercept. LSDV estimation uses $c = \frac{G}{G-1}\,\frac{N-1}{N-G-(K-1)}$ since the $G$ cluster dummies are also included as regressors. For balanced clusters with $N_g = N^*$ and $G$ large relative to $K$, $c \simeq 1$ for within estimation and $c \simeq N^*/(N^*-1)$ for LSDV estimation. Suppose there are only two observations per cluster, due to only two individuals per household or two time periods in a panel setting, so $N_g = N^* = 2$. Then $c \simeq 2/(2-1) = 2$ for LSDV estimation, leading to a CRVE that is twice that from within estimation. Within estimation leads to the correct finite-sample correction.
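The arithmetic of the two corrections can be checked directly; a small sketch (the values of G, N, and K below are arbitrary):

```python
def c_within(G, N, K):
    # correction (12) with (K-1) regressors: no intercept in (18)
    return G / (G - 1) * (N - 1) / (N - (K - 1))

def c_lsdv(G, N, K):
    # LSDV additionally estimates the G cluster dummies
    return G / (G - 1) * (N - 1) / (N - G - (K - 1))

# Two observations per cluster (N = 2G): the LSDV correction is about
# twice the within correction, so its CRVE is about twice as large.
G, K = 100, 2
ratio = c_lsdv(G, 2 * G, K) / c_within(G, 2 * G, K)
print(round(ratio, 2))   # 2.01
```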
In Stata the within command xtreg y x, fe vce(robust) gives the desired CRVE. The alternative LSDV commands regress y x i.id_clu, vce(cluster id_clu) and, equivalently, areg y x, absorb(id_clu) vce(cluster id_clu) use the wrong degrees-of-freedom correction. If a CRVE is needed, then use xtreg. If there is reason to instead use regress y x i.id_clu then the cluster-robust standard errors should be multiplied by the square root of $[N-(K-1)]/[N-G-(K-1)]$, especially if $N_g$ is small.
The inclusion of dummy variables does not lead to an increase in the rank of the CRVE. To see this, stack the dummy variable $d_{hig}$ for cluster $g$ into the $N_g \times 1$ vector $d_{hg}$. Then $d_{hg}'\widehat{u}_g = 0$, by the OLS normal equations, leading to the rank of $\widehat{V}_{\text{clu}}[\widehat{\beta}]$ falling by one for each cluster-specific effect. If there are $m$ regressors varying within cluster and $G-1$ dummies then, even though there are $m + G - 1$ parameters, the rank of $\widehat{V}_{\text{clu}}[\widehat{\beta}]$ is only the minimum of $m$ and $G-1$. And a test that $\alpha_1, \ldots, \alpha_G$ are jointly statistically significant is a test of $G-1$ restrictions (since the intercept or one of the fixed effects needs to be dropped). So even if the cluster-specific fixed effects are consistently estimated (i.e., if $N_g \to \infty$), it is not possible to perform this test if $m < G-1$, which is often the case.
If cluster-specific effects are present then the pairs cluster bootstrap must be adapted to account for the following complication. Suppose cluster 3 appears twice in a bootstrap resample. Then if clusters in the bootstrap resample are identified from the original cluster identifier, the two occurrences of cluster 3 will be incorrectly treated as one large cluster rather than two distinct clusters.
In Stata, the bootstrap option idcluster ensures that distinct identifiers are used in each bootstrap resample. Examples are regress y x i.id_clu, vce(boot, cluster(id_clu) idcluster(newid) reps(400) seed(10101)) and, more simply, xtreg y x, vce(boot, reps(400) seed(10101)), as in this latter case Stata automatically accounts for this complication.
IIIC. Feasible GLS with Fixed Effects
When cluster-specific fixed effects are present, more efficient FGLS estimation can become more complicated. In particular, if asymptotic theory relies on $G \to \infty$ with $N_g$ fixed, the $\alpha_g$ cannot be consistently estimated. The within estimator of $\beta$ is nonetheless consistent, as $\alpha_g$ disappears in the mean-differenced model. But the resulting residuals $\widehat{u}_{ig}$ are contaminated, since they depend on both $\widehat{\beta}$ and $\widehat{\alpha}_g$, and these residuals will be used to form a FGLS estimator. This leads to bias in the FGLS estimator, so one needs to use bias-corrected FGLS unless $N_g \to \infty$. The correction method varies with the model for $\Omega_g = V[u_g]$, and currently there are no Stata user-written commands to implement these methods.
For panel data a commonly-used model specifies an AR(p) model for the errors $u_{ig}$ in (17). If fixed effects are present, then there is a bias (of order $N_g^{-1}$) in estimation of the AR(p) coefficients. Hansen (2007b) obtains bias-corrected estimates of the AR(p) coefficients and uses these in FGLS estimation. Hansen (2007b) in simulations shows considerable efficiency gains in bias-corrected FGLS compared to OLS.
Brewer, Crossley, and Joyce (2013) consider a DiD model with individual-level U.S. panel data with N = 750,127, T = 30, and a placebo state-level law, so clustering is on state with G = 50. They find that bias-corrected FGLS for AR(2) errors, using the Hansen (2007b) correction, leads to higher power than FE estimation. In their example, ignoring the bias correction does not change results much, perhaps because T = 30 is reasonably large.
For balanced clusters with $\Omega_g$ the same for all $g$, say $\Omega_g = \Omega^*$, and for $N_g$ small, the FGLS estimator in (13) can be used without need to specify a model for $\Omega^*$. Instead we can let $\widehat{\Omega}^*$ have $ij$th entry $G^{-1}\sum_{g=1}^{G}\widehat{u}_{ig}\widehat{u}_{jg}$, where $\widehat{u}_{ig}$ are the residuals from initial OLS estimation. These assumptions may be reasonable for a balanced panel. Two complications can arise. First, even without fixed effects there may be many off-diagonal elements to estimate, and this number can be large relative to the number of observations. Second, the fixed effects lead to bias in estimating the off-diagonal covariances. Hausman and Kuersteiner (2008) present fixes for both complications.
IIID. Testing the Need for Fixed Effects
FE estimation is accompanied by a loss of precision in estimation, as only within-cluster variation is used (recall we regress $(y_{ig} - \bar{y}_g)$ on $(x_{ig} - \bar{x}_g)$). Furthermore, the coefficient of a cluster-invariant regressor is not identified, since then $x_{ig} - \bar{x}_g = 0$. Thus it is standard to test whether it is sufficient to estimate by OLS or FGLS, without cluster-specific fixed effects.
The usual test is a Hausman test based on the difference between the FE estimator, $\widehat{\beta}_{\text{FE}}$, and the RE estimator, $\widehat{\beta}_{\text{RE}}$. Let $\beta_1$ denote a subcomponent of $\beta$, possibly just the coefficient of a single regressor of key interest; at most $\beta_1$ contains the coefficients of all regressors that are not invariant within cluster or, in the case of panel data, are not aggregate time effects that take the same value for each individual. The chi-squared distributed test statistic is
$$T_{\text{Haus}} = (\widehat{\beta}_{1,\text{FE}} - \widehat{\beta}_{1,\text{RE}})'\,\widehat{V}^{-1}\,(\widehat{\beta}_{1,\text{FE}} - \widehat{\beta}_{1,\text{RE}}),$$
where $\widehat{V}$ is a consistent estimate of $V[\widehat{\beta}_{1,\text{FE}} - \widehat{\beta}_{1,\text{RE}}]$.
Many studies use the standard form of the Hausman test. This form obtains $\widehat{V}$ under the strong assumption that $\widehat{\beta}_{\text{RE}}$ is fully efficient under the null hypothesis. This requires the unreasonably strong assumptions that $\alpha_g$ and $\varepsilon_{ig}$ in (16) are i.i.d., requiring that neither $\alpha_g$ nor $\varepsilon_{ig}$ is heteroskedastic and that $\varepsilon_{ig}$ has no within-cluster correlation. As already noted, these assumptions are likely to fail and one should not use default standard errors. Instead a CRVE should be used. For similar reasons the standard form of the Hausman test should not be used.
Wooldridge (2010, p.332) instead proposes implementing a cluster-robust version of the Hausman test by the following OLS regression:
$$y_{ig} = x_{ig}'\beta + \bar{w}_g'\gamma + u_{ig},$$
where $w_{ig}$ denotes the subcomponent of $x_{ig}$ that varies within cluster and $\bar{w}_g = N_g^{-1}\sum_{i=1}^{N_g} w_{ig}$. If $H_0: \gamma = 0$ is rejected using a Wald test based on a cluster-robust estimate of the variance matrix, then the fixed effects model is necessary. The Stata user-written command xtoverid, due to Schaffer and Stillman (2010), implements this test.
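This regression-based test is simple to carry out in any environment. A simulated sketch (the data-generating values are arbitrary; in practice the Wald statistic would be compared to chi-squared critical values):

```python
import numpy as np

rng = np.random.default_rng(1)
G, Ng = 40, 5
N = G * Ng
g_id = np.repeat(np.arange(G), Ng)
w = rng.normal(size=N)                       # within-varying regressor
wbar = (np.bincount(g_id, w) / Ng)[g_id]     # its cluster means
y = 1.5 * w + rng.normal(size=N)             # true gamma = 0 here

# Auxiliary regression of y on (1, w, wbar); test gamma = 0 on wbar.
X = np.column_stack([np.ones(N), w, wbar])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
u = y - X @ beta

# Cluster-robust sandwich variance for the auxiliary regression.
B = np.zeros((3, 3))
for g in range(G):
    s = X[g_id == g].T @ u[g_id == g]
    B += np.outer(s, s)
V = XtX_inv @ B @ XtX_inv

wald = beta[2] ** 2 / V[2, 2]   # 1-df cluster-robust Wald statistic
```

With a single within-varying regressor the test has one restriction; rejection (a large Wald statistic) would argue for the fixed effects model.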
An alternative is to use the pairs cluster bootstrap to obtain $\widehat{V}$, in each resample obtaining $\widehat{\beta}_{1,\text{FE}}$ and $\widehat{\beta}_{1,\text{RE}}$, leading to $B$ resample estimates of $(\widehat{\beta}_{1,\text{FE}} - \widehat{\beta}_{1,\text{RE}})$. We are unaware of studies comparing these two cluster-robust versions of the Hausman test.
IV. What to Cluster Over?
It is not always clear what to cluster over, that is, how to define the clusters, and there may even be more than one way to cluster.
Before providing some guidance, we note that it is possible for cluster-robust standard errors to actually be smaller than default standard errors. First, in some rare cases errors may be negatively correlated, most likely when $N_g = 2$, in which case (6) predicts a reduction in the standard error. Second, cluster-robust standard errors are also heteroskedastic-robust, and White heteroskedastic-robust standard errors in practice are sometimes larger and sometimes smaller than the default. Third, if clustering has a modest effect, so cluster-robust and default standard errors are similar in expectation, then cluster-robust standard errors may be smaller due to noise. In cases where the cluster-robust standard errors are smaller they are usually not much smaller than the default, whereas in other applications they can be much, much larger.
IVA. Factors Determining What to Cluster Over
There are two guiding principles that determine what to cluster over.
First, given $V[\widehat{\beta}]$ defined in (7) and (9), whenever there is reason to believe that both the regressors and the errors might be correlated within cluster, we should think about clustering defined in a broad enough way to account for that clustering. Going the other way, if we think that either the regressors or the errors are likely to be uncorrelated within a potential group, then there is no need to cluster within that group.
Second, $\widehat{V}_{\text{clu}}[\widehat{\beta}]$ is an average of $G$ terms that gets closer to $V[\widehat{\beta}]$ only as $G$ gets large. If we define very large clusters, so that there are very few clusters to average over in equation (11), then the resulting $\widehat{V}_{\text{clu}}[\widehat{\beta}]$ can be a very poor estimate of $V[\widehat{\beta}]$. This complication, and discussion of how few is "few", is the subject of Section VI.
These two principles mirror the bias-variance trade-off that is common in many estimation problems: larger and fewer clusters have less bias but more variability. There is no general solution to this trade-off, and there is no formal test of the level at which to cluster. The consensus is to be conservative, avoid bias, and use bigger and more aggregate clusters when possible, up to and including the point at which there is concern about having too few clusters.
For example, suppose your dataset included individuals within counties within states, and you were considering whether to cluster at the county level or the state level. We have been inclined to recommend clustering at the state level. If there was within-state cross-county correlation of the regressors and errors, then ignoring this correlation (for example, by clustering at the county level) would lead to incorrect inference. In practice researchers often cluster at progressively higher (i.e., broader) levels and stop clustering when there is relatively little change in the standard errors. This seems to be a reasonable approach.
There are settings where one may not need to use cluster-robust standard errors. We outline several, though note that in all these cases it is always possible to still obtain cluster-robust standard errors and contrast them to default standard errors. If there is an appreciable difference, then use cluster-robust standard errors.
If a key regressor is randomly assigned within clusters, or is as good as randomly assigned,
then the within-cluster correlation of the regressor is likely to be zero. Thus there is no need
to cluster standard errors, even if the model's errors are clustered. In this setting, if there
are additionally control variables of interest, and if these are not randomly assigned within
cluster, then we may wish to cluster our standard errors for the sake of correct inference on
the control variables.
If the model includes cluster-specific fixed effects, and we believe that within-cluster correlation of errors is solely driven by a common shock process, then we may not be worried about clustering. The fixed effects will absorb away the common shock, and the remaining errors will have zero within-cluster correlation. More generally, control variables may absorb systematic within-cluster correlation. For example, in a state-year panel setting, control variables may capture the state-specific business cycle.
However, as already noted in Section IIIA, the within-cluster correlation is usually not fully eliminated. And even if it is eliminated, the errors may still be heteroskedastic. Stock and Watson (2008) show that applying the usual White (1980) heteroskedastic-consistent variance matrix estimate to the within estimator leads, surprisingly, to inconsistent estimation of $V[\widehat{\beta}]$ if $N_g$ is small (though it is correct if $N_g = 2$). They derive a bias-corrected formula for heteroskedastic-robust standard errors. Alternatively, and more simply, the CRVE is consistent for $V[\widehat{\beta}]$, even if the errors are only heteroskedastic, though this estimator of $V[\widehat{\beta}]$ is more variable.
Finally, as already noted in Section IID we can always build a parametric model of the
correlation structure of the errors and estimate by FGLS. If we believe that this parametric
model captures the salient features of the error correlations, then default FGLS standard
errors can be used.
IVB. Clustering Due to Survey Design
Clustering routinely arises due to the sampling methods used in complex surveys. Rather
than randomly draw individuals from the entire population, costs are reduced by sampling
only a randomly-selected subset of primary sampling units (such as a geographic area),
followed by random selection, or stratified selection, of people within the chosen primary
sampling units.
The survey methods literature uses methods to control for clustering that predate the cluster-robust approach of this paper. The loss of estimator precision due to clustered sampling is called the design effect: "The design effect or Deff is the ratio of the actual variance of a sample to the variance of a simple random sample of the same number of elements" (Kish (1965), p. 258). Kish and Frankel (1974) give the variance inflation formula (6), assuming equicorrelated errors in the non-regression case of estimation of the mean. Pfeffermann and Nathan (1981) consider the more general regression case. The CRVE is called the linearization formula, and the common use of $G - 1$ as the degrees of freedom used in hypothesis testing comes from the survey methods literature; see Shah, Holt and Folsom (1977), which predates the econometrics literature.
Applied economists routinely use data from complex surveys and control for clustering by using a cluster-robust variance matrix estimate. At a minimum one should cluster at
the level of the primary sampling unit, though often there is reason to cluster at a broader
level, such as clustering on state if regressors and errors are correlated within state.
The survey methods literature additionally controls for two other features of survey data: weighting and stratification. These methods are well-established and are incorporated in specialized software, as well as in some broad-based packages such as Stata. Bhattacharya (2005) provides a comprehensive treatment in a GMM framework.
If sample weights are provided then it is common to perform weighted least squares. Then the CRVE for $\hat\beta_{\mathrm{WLS}} = (X'WX)^{-1}X'Wy$ is that given in (15) with $\hat u_g$ replaced by $W_g \hat u_g$. The need to weight can be ignored if stratification is on only the exogenous regressors and we assume correct specification of the model, so that in our sample $\mathrm{E}[y|X] = X\beta$. In that special case both weighted and unweighted estimators are consistent, and weighted OLS may actually be less efficient if, for example, model errors are i.i.d.; see, for example, Solon, Haider, and Wooldridge (2013). Another situation in which to use weighted least squares, unrelated to complex surveys, is when data for the $ig$th observation is obtained by in turn averaging $N_{ig}$ observations and $N_{ig}$ varies.
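The WLS point estimate and its cluster-robust sandwich can be sketched as follows (a plain-numpy illustration; the weighted-score construction mirrors the formula above, but the finite-sample scaling factor is omitted, and the function name is ours):

```python
import numpy as np

def wls_crve(X, y, w, cluster):
    """WLS point estimate and a cluster-robust sandwich for it.
    w is a vector of observation-level (e.g. survey) weights."""
    XtWX_inv = np.linalg.inv(X.T @ (w[:, None] * X))   # (X' W X)^{-1}
    beta = XtWX_inv @ (X.T @ (w * y))                  # (X' W X)^{-1} X' W y
    uhat = y - X @ beta
    K = X.shape[1]
    B = np.zeros((K, K))
    for g in np.unique(cluster):
        m = cluster == g
        s = X[m].T @ (w[m] * uhat[m])                  # weighted score, cluster g
        B += np.outer(s, s)
    return beta, XtWX_inv @ B @ XtWX_inv
```

With all weights equal to one this collapses to ordinary OLS with the usual (unscaled) CRVE, which is a useful sanity check.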
Stratification of the sample can enable more precise statistical inference. These gains can be beneficial in the non-regression case, such as estimating the monthly national unemployment rate. The gains can become much smaller once regressors are included, since these can partially control for stratification; see, for example, the application in Bhattacharya (2005). Econometrics applications therefore usually do not adjust standard errors for stratification, leading to conservative inference due to some relatively small over-estimation of the standard errors.
V. Multi-way Clustering
The discussion to date has presumed that if there is more than one potential way to cluster,
these ways are nested in each other, such as households within states. But when clusters are
non-nested, traditional cluster inference can only deal with one of the dimensions.
In some applications it is possible to include sufficient regressors to eliminate concern about error correlation in all but one dimension, and then do cluster-robust inference for that remaining dimension. A leading example is that in a state-year panel there may be clustering both within years and within states. If the within-year clustering is due to shocks that are the same across all observations in a given year, then including year fixed effects as regressors will absorb within-year clustering, and inference then need only control for clustering on state.
When this is not possible, the one-way cluster-robust variance can be extended to multi-way clustering. Before discussing this, we want to highlight one error that we find some practitioners make. This error is to cluster at the intersection of the two groupings. In the preceding example, some might be tempted to cluster at the state-year level. This is what Stata will do if you give it the command regress y x, vce(cluster id_styr), where id_styr is a state-year identifier. This will be very inadequate, since it imposes the restriction that observations are independent if they are in the same state but in different years. Indeed if the data is aggregated at the state-year level, there is only one observation at the state-year level, so this is identical to using heteroskedastic-robust standard errors, i.e. not clustering at all. This point was highlighted by Bertrand, Duflo, and Mullainathan (2004), who advocated clustering on the state.
VA. Multi-way Cluster-Robust Variance Matrix Estimate
The cluster-robust estimate of $\mathrm{V}[\hat\beta]$ defined in (10)-(11) can be generalized to clustering in multiple dimensions. In a change of notation, suppress the subscript for cluster and more simply denote the model for an individual observation as
$$y_i = x_i'\beta + u_i. \qquad (19)$$
Regular one-way clustering is based on the assumption that $\mathrm{E}[u_i u_j | x_i, x_j] = 0$, unless observations $i$ and $j$ are in the same cluster. Then (11) sets $\hat B = \sum_{i=1}^N \sum_{j=1}^N x_i x_j' \hat u_i \hat u_j \, \mathbf{1}[i,j \text{ in same cluster}]$, where $\hat u_i = y_i - x_i'\hat\beta$. In multi-way clustering, the key assumption is that $\mathrm{E}[u_i u_j | x_i, x_j] = 0$, unless observations $i$ and $j$ share any cluster dimension. Then the multi-way cluster-robust estimate of $\mathrm{V}[\hat\beta]$ replaces (12) with
$$\hat B = \sum_{i=1}^N \sum_{j=1}^N x_i x_j' \hat u_i \hat u_j \, \mathbf{1}[i,j \text{ share any cluster}]. \qquad (20)$$
This method relies on asymptotics that are in the number of clusters of the dimension with
the fewest number of clusters. This method is thus most appropriate when each dimension
has many clusters.
Theory for two-way cluster-robust estimates of the variance matrix is presented in Cameron, Gelbach, and Miller (2006, 2011), Miglioretti and Heagerty (2006), and Thompson (2006, 2011). See also empirical panel data applications by Acemoglu and Pischke (2003), who clustered at the individual level and at the region-time level, and by Petersen (2009), who clustered at the firm level and at the year level. Cameron, Gelbach and Miller (2011) present the extension to multi-way clustering. Like the one-way cluster-robust method, the method can be applied to estimators other than OLS.
For two-way clustering this robust variance estimator is easy to implement given software that computes the usual one-way cluster-robust estimate. First obtain three different cluster-robust "variance" matrices for the estimator by one-way clustering in, respectively, the first dimension, the second dimension, and the intersection of the first and second dimensions. Then add the first two variance matrices and, to account for double-counting, subtract the third. Thus
$$\hat{\mathrm{V}}_{\mathrm{2way}}[\hat\beta] = \hat{\mathrm{V}}_1[\hat\beta] + \hat{\mathrm{V}}_2[\hat\beta] - \hat{\mathrm{V}}_{1\cap 2}[\hat\beta], \qquad (21)$$
where the three component variance estimates are computed using (10)-(11) for the three different ways of clustering.
We spell this out in a step-by-step fashion.
1. Identify your two ways of clustering. Make sure you have a variable that identifies each way of clustering. Also create a variable that identifies unique "group 1 by group 2" combinations. For example, suppose you have individual-level data spanning many U.S. states and many years, and you want to cluster on state and on year. You will need a variable for state (e.g. California), a variable for year (e.g. 1990), and a variable for state-by-year (California and 1990).
2. Estimate your model, clustering on "group 1". For example, regress y on x, clustering on state. Save the variance matrix as $\hat{\mathrm{V}}_1$.
3. Estimate your model, clustering on "group 2". For example, regress y on x, clustering on year. Save the variance matrix as $\hat{\mathrm{V}}_2$.
4. Estimate your model, clustering on "group 1 by group 2". For example, regress y on x, clustering on state-by-year. Save the variance matrix as $\hat{\mathrm{V}}_{1\cap 2}$.
5. Create a new variance matrix $\hat{\mathrm{V}}_{\mathrm{2way}} = \hat{\mathrm{V}}_1 + \hat{\mathrm{V}}_2 - \hat{\mathrm{V}}_{1\cap 2}$. This is your new two-way cluster-robust variance matrix for $\hat\beta$.
6. Standard errors are the square root of the diagonal elements of this matrix.
If you are interested in only one coefficient, you can also just focus on saving the standard error for this coefficient in steps 2-4 above, and then create $se_{\mathrm{2way}} = \sqrt{se_1^2 + se_2^2 - se_{1\cap 2}^2}$.
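The six steps above translate directly into a short numpy sketch. This is illustrative only: it uses the bare sandwich of (10)-(11) with no finite-sample scaling (so that the three components combine transparently), and the function names are ours.

```python
import numpy as np

def one_way_crve(X, uhat, cluster):
    """Unscaled one-way cluster-robust sandwich, (10)-(11)."""
    K = X.shape[1]
    XtX_inv = np.linalg.inv(X.T @ X)
    B = np.zeros((K, K))
    for g in np.unique(cluster):
        s = X[cluster == g].T @ uhat[cluster == g]
        B += np.outer(s, s)
    return XtX_inv @ B @ XtX_inv

def two_way_crve(X, y, c1, c2):
    """Steps 1-6: V_2way = V_1 + V_2 - V_{1 intersect 2}."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    uhat = y - X @ beta
    # step 1: identifier for the "group 1 by group 2" intersection
    _, c12 = np.unique(np.column_stack([c1, c2]), axis=0, return_inverse=True)
    c12 = c12.ravel()
    V1 = one_way_crve(X, uhat, c1)       # step 2: cluster on group 1
    V2 = one_way_crve(X, uhat, c2)       # step 3: cluster on group 2
    V12 = one_way_crve(X, uhat, c12)     # step 4: cluster on the intersection
    V = V1 + V2 - V12                    # step 5
    return beta, V, np.sqrt(np.diag(V))  # step 6: standard errors
```

Note that when the two groupings coincide, the formula collapses to ordinary one-way clustering, which is a convenient check on an implementation.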
In taking these steps, you should watch out for some potential pitfalls. With perfectly multicollinear regressors, such as inclusion of dummy variables some of which are redundant, a statistical package may automatically drop one or more variables to ensure a nonsingular set of regressors. If the package happens to drop different sets of variables in steps 2, 3, and 4, then the resulting $\hat{\mathrm{V}}$'s will not be comparable, and adding them together in step 5 will give a nonsense result. To prevent this issue, manually inspect the estimation results in steps 2, 3, and 4 to ensure that each step has the same set of regressors, the same number of observations, etc. The only things that should be different are the reported standard errors and the reported number of clusters.
VB. Implementation
Unlike the standard one-way cluster case, $\hat{\mathrm{V}}_{\mathrm{2way}}[\hat\beta]$ is not guaranteed to be positive semi-definite, so it is possible that it may compute negative variances. In some applications with fixed effects, $\hat{\mathrm{V}}[\hat\beta]$ may be non positive-definite, but the subcomponent of $\hat{\mathrm{V}}[\hat\beta]$ associated with the regressors of interest may be positive-definite. This may lead to an error message, even though inference is appropriate for the parameters of interest. Our informal observation is that this issue is most likely to arise when clustering is done over the same groups as the fixed effects. Few clusters in one or more dimensions can also lead to $\hat{\mathrm{V}}_{\mathrm{2way}}[\hat\beta]$ being a non-positive-semi-definite matrix. Cameron, Gelbach and Miller (2011) present an eigendecomposition technique used in the time series HAC literature that zeroes out negative eigenvalues in $\hat{\mathrm{V}}_{\mathrm{2way}}[\hat\beta]$ to produce a positive semi-definite variance matrix estimate.
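That eigenvalue adjustment is straightforward to code. The sketch below zeroes out negative eigenvalues of a symmetric "variance" matrix; it mirrors the technique as described here, not the authors' exact implementation.

```python
import numpy as np

def psd_fix(V):
    """Project a symmetric matrix onto the positive semi-definite cone
    by zeroing out its negative eigenvalues."""
    lam, U = np.linalg.eigh(V)           # V = U diag(lam) U'
    lam_plus = np.clip(lam, 0.0, None)   # replace negative eigenvalues by 0
    return U @ np.diag(lam_plus) @ U.T
```

Applied to a two-way estimate, e.g. `V2way = psd_fix(V1 + V2 - V12)`, every reported variance is then nonnegative.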
The Stata user-written command cmgreg, available at the authors' websites, implements multi-way clustering for the OLS estimator with, if needed, the negative-eigenvalue adjustment. The Stata add-on command ivreg2, due to Baum, Schaffer, and Stillman (2007), implements two-way clustering for OLS, IV and linear GMM estimation. Other researchers have also posted code, available from searching the web.
Cameron, Gelbach, and Miller (2011) apply the two-way method to data from Hersch (1998) that examines the relationship between individual wages and injury risk measured separately at the industry level and the occupation level. The log-wage for 5960 individuals is regressed on these two injury risk measures, with standard errors obtained by two-way clustering on 211 industries and 387 occupations. In that case two-way clustering leads to only a modest change in the standard error of the industry job risk coefficient compared to the standard error with one-way clustering on industry. Since industry job risk is perfectly correlated within industry, by result (6) we need to cluster on industry if there is any within-industry error correlation. By similar logic, the additional need to cluster on occupation depends on the within-occupation correlation of industry job risk, and this correlation need not be high. For the occupation job risk coefficient, the two-way and one-way cluster (on occupation) standard errors are similar. Despite the modest difference in this example, two-way clustering avoids the need to report standard errors for one coefficient clustering in one way and for the second coefficient clustering in the second way.
Cameron, Gelbach, and Miller (2011) also apply the two-way cluster-robust method to
data on volume of trade between 98 countries with 3262 unique country pairs. In that case,
two-way clustering on each of the countries in the country pair leads to standard errors that
are 40% larger than one-way clustering or not clustering at all. Cameron and Miller (2012)
study such dyadic data in further detail. They note that two-way clustering does not pick
up all the potential correlations in the data. Instead, more general cluster-robust methods,
including one proposed by Fafchamps and Gubert (2007), should be used.
VC. Feasible GLS
Similar to one-way clustering, FGLS is more efficient than OLS, provided an appropriate model for $\Omega = \mathrm{E}[uu'|X]$ is specified and $\Omega$ is consistently estimated.
The random effects model can be extended to multi-way clustering. For individual $i$ in clusters $g$ and $h$, the two-way random effects model specifies
$$y_{igh} = x_{igh}'\beta + \alpha_g + \delta_h + \varepsilon_{igh},$$
where the errors $\alpha_g$, $\delta_h$, and $\varepsilon_{igh}$ are each assumed to be i.i.d. distributed with mean 0. For example, Moulton (1986) considered clustering due to grouping of regressors (schooling, age and weeks worked) in a log earnings regression, and estimated a model with a common random shock for each year of schooling, for each year of age, and for each number of weeks worked.
The two-way random effects model can be estimated using standard software for (nested) hierarchical linear models; see, for example, Cameron and Trivedi (2009, ch. 9.5.7) for the Stata command xtmixed (command mixed from version 13 on). For estimation of a many-way random effects model, see Davis (2002), who modelled film attendance data clustered by film, theater and time.
The default standard errors after FGLS estimation require that $\Omega$ is correctly specified. FGLS estimation entails transforming the data in such a way that there is no obvious method for computing a variance matrix estimate that is robust to misspecification of $\Omega$ in the two-way or multi-way random effects model. Instead, if there is concern about misspecification of $\Omega$, then one needs to consider FGLS with richer models for $\Omega$ (see Rabe-Hesketh and Skrondal (2012) for richer hierarchical models in Stata), or do less efficient OLS with two-way cluster-robust standard errors.
VD. Spatial Correlation
Cluster-robust variance matrix estimates are closely related to spatial-robust variance matrix
estimates.
In general, for model (19), $\hat B$ in (20) has the form
$$\hat B = \sum_{i=1}^N \sum_{j=1}^N w(i,j)\, x_i x_j' \hat u_i \hat u_j, \qquad (22)$$
where $w(i,j)$ are weights. For cluster-robust inference these weights are either 1 (cluster in common) or 0 (no cluster in common). But the weights can also decay from one to zero, as in the case of the HAC variance matrix estimate for time series, where $w(i,j)$ decays to zero as $|i - j|$ increases.
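Formula (22) reduces to a single matrix product, which makes it easy to experiment with different weight functions. The sketch below (numpy, illustrative; the function name is ours) computes $\hat B$ for an arbitrary $N \times N$ weight matrix; with the 0/1 same-cluster indicator it reproduces the one-way clustered middle matrix.

```python
import numpy as np

def robust_B(X, uhat, w):
    """Middle matrix (22): B = sum_i sum_j w(i,j) x_i x_j' u_i u_j,
    where w is an N x N weight matrix."""
    s = X * uhat[:, None]    # row i is x_i * u_i
    return s.T @ (w @ s)     # sum over i,j of w_ij (x_i u_i)(x_j u_j)'
```

For cluster weights, `w = (cluster[:, None] == cluster[None, :]).astype(float)`; a HAC-style choice would instead fill `w` with values decaying in $|i-j|$ or in spatial distance.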
For spatial data it is assumed that model errors become less correlated as the spatial distance between observations grows. For example, with state-level data the assumption that model errors are uncorrelated across states may be relaxed to allow correlation that decays to zero as the distance between states gets large. Conley (1999) provides conditions under which (10) and (22) provide a robust variance matrix estimate for the OLS estimator, where the weights $w(i,j)$ decay with the spatial distance. The estimate (which Conley also generalizes to GMM models) is often called a spatial-HAC estimate, rather than spatial-robust, as proofs use mixing conditions (to ensure decay of dependence) as observations grow apart in distance. These conditions are not applicable to clustering due to common shocks, which leads to the cluster-robust estimator with independence of observations across clusters.
Driscoll and Kraay (1998) consider panel data with $T$ time periods and $N$ individuals, with errors potentially correlated across individuals (and no spatial dampening), though this correlation across individuals disappears for observations that are more than $m$ time periods apart. Let $it$ denote the typical observation. The Driscoll-Kraay spatial correlation consistent (SCC) variance matrix estimate can be shown to use weight $w(it, js) = 1 - d(it, js)/(m+1)$ in (22), where the summation is now over $i$, $j$, $s$ and $t$, and $d(it, js) = |t - s|$ if $|t - s| \le m$, while $w(it, js) = 0$ otherwise. This method requires that the number of time periods $T \to \infty$, so it is not suitable for short panels, while $N$ may be fixed or $N \to \infty$. The Stata add-on command xtscc, due to Hoechle (2007), implements this variance estimator.
An estimator proposed by Thompson (2006) allows for across-cluster (in his example, firm) correlation for observations close in time, in addition to within-cluster correlation at any time separation. The Thompson estimator can be thought of as using $w(it, js) = \mathbf{1}[i,j \text{ share a firm, or } d(it,js) \le m]$. Foote (2007) contrasts the two-way cluster-robust and these other variance matrix estimators in the context of a macroeconomics example. Petersen (2009) contrasts various methods for panel data on financial firms, where there is concern about both within-firm correlation (over time) and across-firm correlation due to common shocks.
Barrios, Diamond, Imbens, and Kolesar (2012) consider state-year panel data on individuals in states over years with state-level treatment and outcome (earnings) that is correlated spatially across states. This spatial correlation can be ignored if the state-level treatment is randomly assigned. But if the treatment is correlated over states (e.g. adjacent states may be more likely to have similar minimum wage laws), then one can no longer use standard errors clustered at the state level. Instead one should additionally allow for spatial correlation of errors across states. The authors additionally contrast traditional model-based inference with randomization inference.
In practice data can have cluster, spatial and time series aspects, leading to hybrids of cluster-robust, spatial-HAC and time-series HAC estimators. Furthermore, it may be possible to parameterize some of the error correlation. For example, for a time series AR(1) error it may be preferable to use $\hat{\mathrm{E}}[u_t u_s]$ based on an AR(1) model rather than $w(t,s)\,\hat u_t \hat u_s$. To date empirical practice has not commonly modeled these combined types of error correlations. This may become more common in the future.
VI. Few Clusters
We return to one-way clustering, and focus on the Wald "t-statistic"
$$w = \frac{\hat\beta - \beta_0}{s_{\hat\beta}}, \qquad (23)$$
where $\beta$ is one element in the parameter vector $\beta$, and the standard error $s_{\hat\beta}$ is the square root of the appropriate diagonal entry in $\hat{\mathrm{V}}_{\mathrm{clu}}[\hat\beta]$. If $G \to \infty$ then $w \sim N[0,1]$ under $H_0: \beta = \beta_0$. For finite $G$ the distribution of $w$ is unknown, even with normal errors. It is common to use the $T$ distribution with $G - 1$ degrees of freedom.
It is not unusual for the number of clusters $G$ to be quite small. Despite few clusters, $\hat\beta$ may still be a reasonably precise estimate of $\beta$ if there are many observations per cluster. But with small $G$ the asymptotics have not kicked in. Then $\hat{\mathrm{V}}_{\mathrm{clu}}[\hat\beta]$ can be downwards-biased.
One should at a minimum use $T(G-1)$ critical values and $\hat{\mathrm{V}}_{\mathrm{clu}}[\hat\beta]$ defined in (10)-(11) with residuals scaled by $\sqrt{c}$, where $c$ is defined in (12) or $c = G/(G-1)$. Most packages rescale the residuals: Stata uses the first choice of $c$ and SAS the second. The use of $T(G-1)$ critical values is less common. Stata uses this adjustment after command regress y x, vce(cluster). But the alternative command xtreg y x, vce(robust) instead uses standard normal critical values.
Even with both of these adjustments, Wald tests generally over-reject. The extent of over-rejection depends on both how few clusters there are and the particular data and model used. In some cases the over-rejection is mild; in other cases a test with nominal size 0.05 may have true test size of 0.10.
The next subsection outlines the basic problem and discusses how few is "few" clusters. The subsequent three subsections present three approaches to finite-cluster correction: bias-corrected variance, bootstrap with asymptotic refinement, and use of the $T$ distribution with adjusted degrees of freedom. The final subsection considers special cases.
VIA. The Basic Problems with Few Clusters
There are two main problems with few clusters. First, OLS leads to "overfitting", with estimated residuals systematically too close to zero compared to the true error terms. This leads to a downwards-biased cluster-robust variance matrix estimate. The second problem is that even with bias-correction, the use of fitted residuals to form the estimate $\hat B$ of $B$ leads to over-rejection (and confidence intervals that are too narrow) if the critical values are from the standard normal or even $T(G-1)$ distribution.
For the linear regression model with independent homoskedastic normally distributed errors these problems are easily controlled. An unbiased variance matrix is obtained by estimating the error variance $\sigma^2$ by $s^2 = \hat u'\hat u/(N-K)$ rather than $\hat u'\hat u/N$. This is the "fix" in the OLS setting for the first problem. The analogue to the second problem is that the $N[0,1]$ distribution is a poor approximation to the true distribution of the Wald statistic. In the i.i.d. case, the Wald statistic $w$ can be shown to be exactly $T(N-K)$ distributed. For nonnormal homoskedastic errors the $T(N-K)$ is still felt to provide a good approximation, provided $N$ is not too small. Both of these problems arise in the clustered setting, albeit with more complicated manifestations and fixes.
For independent heteroskedastic normally distributed errors there are no exact results. MacKinnon and White (1985) consider several adjustments to the heteroskedastic-consistent variance estimate of White (1980), including one called HC2 that is unbiased in the special case that errors are homoskedastic. Unfortunately if errors are actually heteroskedastic, as expected, the HC2 estimate is no longer unbiased for $\mathrm{V}[\hat\beta]$: an unbiased estimator depends on the unknown pattern of heteroskedasticity and on the design matrix $X$. And there is no way to obtain an exact $T$ distribution result for $w$, even if errors are normally distributed. Other proposed solutions for testing and forming confidence intervals include using a $T$ distribution with data-determined degrees of freedom, and using a bootstrap with asymptotic refinement.
In the following subsections we consider extensions of these various adjustments to the
clustered case, where the problems can become even more pronounced.
Before proceeding we note that there is no specific point at which we need to worry about few clusters. Instead, "more is better". Current consensus appears to be that $G = 50$ is enough for state-year panel data. In particular, Bertrand, Duflo, and Mullainathan (2004, Table 8) find in their simulations that for a policy dummy variable with high within-cluster correlation, a Wald test based on the standard CRVE with critical value of 1.96 had rejection rates of .063, .058, .080, and .115 for number of states ($G$) equal to, respectively, 50, 20, 10 and 6. The simulations of Cameron, Gelbach and Miller (2008, Table 3), based on a quite different data generating process but again with standard CRVE and critical value of 1.96, had rejection rates of .068, .081, .118, and .208 for $G$ equal to, respectively, 30, 20, 10 and 5. In both cases the rejection rates would also exceed .05 if the critical value was from the $T(G-1)$ distribution.
The preceding results are for balanced clusters. Cameron, Gelbach and Miller (2008, Table 4, column 8) consider unbalanced clusters when $G = 10$. The rejection rate with unbalanced clusters, half of size $N_g = 10$ and half of size 50, is .183, appreciably worse than rejection rates of .126 and .115 for balanced clusters of sizes, respectively, 10 and 100. Recent papers by Carter, Schnepel, and Steigerwald (2013) and Imbens and Kolesar (2012) provide theory that also indicates that the effective number of clusters is reduced when $N_g$ varies across clusters; see also the simulations in MacKinnon and Webb (2013). Similar issues may also arise if the clusters are balanced, but estimation is by weighted OLS that places different weights on different clusters. Cheng and Hoekstra (2013) document that weighting can result in over-rejection in the panel DiD setting of Bertrand, Duflo, and Mullainathan (2004).
To repeat a key message, there is no clear-cut definition of "few". Depending on the situation, "few" may range from less than 20 to less than 50 clusters in the balanced case, and even more clusters in the unbalanced case.
VIB. Solution 1: Bias-Corrected Cluster-Robust Variance Matrix
A weakness of the standard CRVE with residual $\hat u_g$ is that it is biased for $\mathrm{V}_{\mathrm{clu}}[\hat\beta]$, since $\mathrm{E}[\hat u_g \hat u_g'] \ne \mathrm{E}[u_g u_g']$. The bias depends on the form of $\Omega_g$ but will usually be downwards. Several corrected residuals $\tilde u_g$ to use in place of $\hat u_g$ in (11) have been proposed. The simplest, already mentioned, is to use $\tilde u_g = \sqrt{G/(G-1)}\,\hat u_g$ or $\tilde u_g = \sqrt{c}\,\hat u_g$, where $c$ is defined in (12). One should at least use either of these corrections.
Bell and McCaffrey (2002) use
$$\tilde u_g = [\mathrm{I}_{N_g} - \mathrm{H}_{gg}]^{-1/2}\,\hat u_g, \qquad (24)$$
where $\mathrm{H}_{gg} = X_g(X'X)^{-1}X_g'$. This transformed residual leads to unbiased CRVE in the special case that $\mathrm{E}[u_g u_g'] = \sigma^2 \mathrm{I}$. This is a cluster generalization of the HC2 variance estimate of MacKinnon and White (1985), so we refer to it as the CR2VE.
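The transformation in (24) is easy to compute cluster by cluster via an eigendecomposition of the symmetric matrix $\mathrm{I}_{N_g} - \mathrm{H}_{gg}$. The sketch below (numpy, illustrative; function name ours) returns the CR2 residuals; note it can break down if $\mathrm{I}_{N_g} - \mathrm{H}_{gg}$ is singular, e.g. with cluster-specific fixed effects.

```python
import numpy as np

def cr2_residuals(X, uhat, cluster):
    """Bell-McCaffrey CR2 transformed residuals (24):
    u~_g = (I - H_gg)^(-1/2) u^_g, with H_gg = X_g (X'X)^(-1) X_g'."""
    XtX_inv = np.linalg.inv(X.T @ X)
    utilde = np.empty_like(uhat)
    for g in np.unique(cluster):
        m = cluster == g
        Xg = X[m]
        Hgg = Xg @ XtX_inv @ Xg.T
        # symmetric inverse square root via eigendecomposition;
        # assumes I - H_gg has no (near-)zero eigenvalues
        lam, U = np.linalg.eigh(np.eye(m.sum()) - Hgg)
        inv_sqrt = U @ np.diag(1.0 / np.sqrt(lam)) @ U.T
        utilde[m] = inv_sqrt @ uhat[m]
    return utilde
```

When every cluster is a single observation this reduces to the HC2 residual $\hat u_i / \sqrt{1 - h_{ii}}$, which provides a simple check.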
Bell and McCaffrey (2002) also use
$$\tilde u_g = \sqrt{\frac{G-1}{G}}\,[\mathrm{I}_{N_g} - \mathrm{H}_{gg}]^{-1}\,\hat u_g. \qquad (25)$$
This transformed residual leads to a CRVE that can be shown to equal the delete-one-cluster jackknife estimate of the variance of the OLS estimator. This jackknife correction leads to upwards-biased CRVE if in fact $\mathrm{E}[u_g u_g'] = \sigma^2 \mathrm{I}$. This is a cluster generalization of the HC3 variance estimate of MacKinnon and White (1985), so we refer to it as the CR3VE.
Angrist and Pischke (2009, Chapter 8) and Cameron, Gelbach and Miller (2008) provide a more extensive discussion and cite more of the relevant literature. This literature includes papers that propose corrections for the more general case that $\mathrm{E}[u_g u_g'] \ne \sigma^2 \mathrm{I}$ but has a known parameterization, such as an RE model, and extension to generalized linear models.
Angrist and Lavy (2002) apply the CR2VE correction (24) in an application with $G = 30$ to 40 and find that the correction increases cluster-robust standard errors by between 10 and 50 percent. Cameron, Gelbach and Miller (2008, Table 3) find that the CR3VE correction (25) has rejection rates of .062, .070, .092, and .138 for $G$ equal to, respectively, 30, 20, 10 and 5. These rejection rates are a considerable improvement on .068, .081, .118, and .208 for the standard CRVE, but there is still considerable over-rejection for very small $G$.
The literature has gravitated to using the CR2VE adjustment rather than the CR3VE
adjustment. This reduces but does not eliminate over-rejection when there are few clusters.
VIC. Solution 2: Cluster Bootstrap with Asymptotic Re nement
In Section IIF we introduced the bootstrap as it is usually used, to calculate standard errors that rely on regular asymptotic theory. Here we consider a different use of the bootstrap, one with asymptotic refinement that may lead to improved finite-sample inference.
We consider inference based on $G \to \infty$ (formally, $\sqrt{G}(\hat\beta - \beta)$ has a limit normal distribution). Then a two-sided Wald test of nominal size 0.05, for example, can be shown to have true size $0.05 + O(G^{-1})$ when the usual asymptotic normal approximation is used. For $G \to \infty$ this equals the desired 0.05, but for finite $G$ it differs from 0.05. If an appropriate bootstrap with asymptotic refinement is instead used, the true size is $0.05 + O(G^{-3/2})$. This is closer to the desired 0.05 for large $G$, as $G^{-3/2} < G^{-1}$. Hopefully this is also the case for small $G$, something that is established using appropriate Monte Carlo experiments. For a one-sided test or a nonsymmetric two-sided test the rates are instead, respectively, $0.05 + O(G^{-1/2})$ and $0.05 + O(G^{-1})$.
Asymptotic refinement can be achieved by bootstrapping a statistic that is asymptotically pivotal, meaning the asymptotic distribution does not depend on any unknown parameters. The estimator $\hat\beta$ is not asymptotically pivotal, as its distribution depends on $\mathrm{V}[\hat\beta]$, which in turn depends on unknown variance parameters in $\mathrm{V}[u|X]$. The Wald t-statistic defined in (23) is asymptotically pivotal, as its asymptotic distribution is $N[0,1]$, which does not depend on unknown parameters.
VIC.1. Percentile-t Bootstrap
A percentile-t bootstrap obtains $B$ draws, $w_1^*, \ldots, w_B^*$, from the distribution of the Wald t-statistic as follows. $B$ times do the following:
1. Obtain $G$ clusters $\{(y_1^*, X_1^*), \ldots, (y_G^*, X_G^*)\}$ by one of the cluster bootstrap methods detailed below.
2. Compute the OLS estimate $\hat\beta_b^*$ in this resample.
3. Calculate the Wald test statistic $w_b^* = (\hat\beta_b^* - \hat\beta)/s_{\hat\beta_b^*}$, where $s_{\hat\beta_b^*}$ is the cluster-robust standard error of $\hat\beta_b^*$, and $\hat\beta$ is the OLS estimate of $\beta$ from the original sample.
Note that we center on $\hat\beta$ and not $\beta_0$, as the bootstrap views the sample as the population, i.e., $\beta = \hat\beta$, and the $B$ resamples are based on this "population." Note also that the standard
error in step 3 needs to be cluster-robust. A good choice of $B$ is $B = 999$; this is more than the $B$ needed for standard error estimation, as tests are in the tails of the distribution, and it is such that $\alpha(B+1)$ is an integer for common choices of test size $\alpha$.
Let $w_{(1)}^*, \ldots, w_{(B)}^*$ denote the ordered values of $w_1^*, \ldots, w_B^*$. These ordered values trace out the density of the Wald t-statistic, taking the place of a standard normal or $T$ distribution. For example, the critical values for a 95% nonsymmetric confidence interval or a 5% nonsymmetric Wald test are the lower 2.5 percentile and upper 97.5 percentile of $w_1^*, \ldots, w_B^*$, rather than, for example, the standard normal values of $-1.96$ and $1.96$. The p-value for a symmetric test based on the original sample Wald statistic $w$ equals the proportion of times that $|w_b^*| > |w|$, $b = 1, \ldots, B$.
The simplest cluster resampling method is the pairs cluster resampling introduced in Section IIF. Then in step 1 above we form $G$ clusters $\{(y_1^*, X_1^*), \ldots, (y_G^*, X_G^*)\}$ by resampling with replacement $G$ times from the original sample of clusters. This method has the advantage of being applicable to a wide range of estimators, not just OLS. However, for some types of data the pairs cluster bootstrap may not be applicable; see "Bootstrap with Caution" below.
Cameron, Gelbach, and Miller (2008) found that in Monte Carlos with few clusters the cluster bootstrap did not eliminate test over-rejection. The authors proposed using an alternative percentile-t bootstrap, the wild cluster bootstrap, that holds the regressors fixed across bootstrap replications.
VIC.2. Wild Cluster Bootstrap
The wild cluster bootstrap resampling method is as follows. First, estimate the main model,
imposing (forcing) the null hypothesis H0 that you wish to test, to give estimate β̃_H0. For
example, for a test of statistical significance of a regressor, OLS regress y_ig on all components
of x_ig except the regressor with coefficient zero under the null hypothesis. Form the residual
ũ_ig = y_ig − x'_ig β̃_H0. Then obtain the bth resample for step 1 above as follows:
1a. Randomly assign cluster g the weight d_g = −1 with probability 0.5 and the weight
d_g = 1 with probability 0.5. All observations in cluster g get the same value of the
weight.
1b. Generate new pseudo-residuals u*_ig = d_g × ũ_ig, and hence new outcome variables
y*_ig = x'_ig β̃_H0 + u*_ig.
Then proceed with step 2 as before, regressing y*_ig on x_ig, and calculate w*_b as in step
3. The p-value for the test based on the original sample Wald statistic w equals the
proportion of times that |w| > |w*_b|, b = 1, ..., B.
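The mechanics of steps 1a-1b and the p-value can be sketched compactly for the special case of one regressor, no intercept, and H0: β = 0, so the null-imposed fit is zero and the restricted residuals are simply y. The data and B below are hypothetical; this is an illustration of the algorithm, not the authors' implementation.

```python
import random

def crve_t(clusters):
    """OLS slope (no intercept) and its cluster-robust t-statistic for H0: beta = 0."""
    sxx = sum(x * x for cl in clusters for _, x in cl)
    sxy = sum(x * y for cl in clusters for y, x in cl)
    beta = sxy / sxx
    # CRVE: sum over clusters of squared within-cluster score sums, sandwiched
    v = sum(sum(x * (y - beta * x) for y, x in cl) ** 2 for cl in clusters) / sxx ** 2
    return beta / v ** 0.5

def wild_cluster_pvalue(clusters, B=999, seed=0):
    rng = random.Random(seed)
    w = crve_t(clusters)
    count = 0
    for _ in range(B):
        star = []
        for cl in clusters:
            d = 1.0 if rng.random() < 0.5 else -1.0   # Rademacher weight, one per cluster
            # y* = d_g * utilde; under H0: beta = 0 the restricted residual is y itself
            star.append([(d * y, x) for y, x in cl])
        if abs(crve_t(star)) >= abs(w):
            count += 1
    return count / B

# hypothetical null data: 8 clusters, y unrelated to x
rng = random.Random(7)
clusters = [[(rng.uniform(-1, 1), float(i + 1)) for i in range(4)] for _ in range(8)]
p = wild_cluster_pvalue(clusters, B=199, seed=11)
```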
For the wild bootstrap, the values w*_1, ..., w*_B cannot be used directly to form critical
values for a confidence interval. Instead they can be used to provide a p-value for testing a
hypothesis. To form a confidence interval, one needs to invert a sequence of tests, profiling
over a range of candidate null hypotheses H0: β = β0. For each of these null hypotheses,
obtain the p-value. The 95% confidence interval is the set of values of β0 for which p ≥
0.05. This method is computationally intensive, but conceptually straightforward. As a
practical matter, you may want to ensure that you have the same set of bootstrap draws
across candidate hypotheses, so as to not introduce additional bootstrapping noise into the
determination of where the cutoff is.
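The inversion loop itself is simple once a p-value function is in hand. In this toy Python sketch, a normal-approximation p-value for a sample mean stands in for the bootstrap p-value (a hypothetical stand-in, so the draws-reuse issue does not arise here); the data and grid are made up.

```python
from statistics import NormalDist, mean, stdev

def pvalue(sample, b0):
    """Two-sided p-value for H0: mean = b0 (a stand-in for the bootstrap p-value)."""
    n = len(sample)
    t = (mean(sample) - b0) / (stdev(sample) / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(t)))

sample = [9.8, 10.1, 10.3, 9.9, 10.2, 10.0, 9.7, 10.4]
grid = [i / 100 for i in range(900, 1101)]          # candidate b0 from 9.00 to 11.00
ci = [b0 for b0 in grid if pvalue(sample, b0) >= 0.05]   # the inverted-test interval
lo, hi = min(ci), max(ci)
```

The kept grid points form the 95% confidence interval; a finer grid sharpens its endpoints.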
In principle it is possible to directly use a bootstrap for bias reduction, such as to remove
bias in standard errors. In practice this is not done, however, as any bias reduction
comes at the expense of considerably greater variability. A conservative estimate of the
standard error equals the width of a 95% confidence interval, obtained using asymptotic
refinement, divided by 2 × 1.96.
Note that for the wild cluster bootstrap the resamples {(y*_1, X_1), ..., (y*_G, X_G)} have the
same X in each resample, whereas for pairs cluster both y* and X* vary across the B
resamples. The wild cluster bootstrap is an extension of the wild bootstrap proposed for
heteroskedastic data. It works essentially because the two-point distribution for forming u*_g
ensures that E[u*_g] = 0 and V[u*_g] = ũ_g ũ'_g. There are other two-point distributions that also
do so, but Davidson and Flachaire (2008) show that in the heteroskedastic case it is best to
use the weights d_g = {−1, 1}, called Rademacher weights.
The wild cluster bootstrap essentially replaces y_g in each resample with one of two values
y*_g = X_g β̃_H0 + ũ_g or y*_g = X_g β̃_H0 − ũ_g. Because this is done across G clusters, there are at most
2^G possible combinations of the data, so there are at most 2^G unique values of w*_1, ..., w*_B.
If there are very few clusters then there is no need to actually bootstrap as we can simply
enumerate, with separate estimation for each of the 2^G possible datasets.
Webb (2013) expands on these limitations. He shows that there are actually only 2^(G−1)
possible t-statistics in absolute value. Thus with G = 5 there are at most 2^4 = 16 possible
values of w*_1, ..., w*_B. So if the main test statistic is more extreme than any of these 16
values, for example, then all we know is that the p-value is smaller than 1/16 = 0.0625.
Full enumeration makes this discreteness clear. Bootstrapping without consideration of this
issue can lead to inadvertently picking just one point from the interval of equally plausible
p-values. As G gets to be as large as 11 this concern is less of an issue since 2^10 = 1024.
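Webb's counting argument is easy to verify by brute force. The sketch below enumerates all 2^G Rademacher sign vectors for a hypothetical vector of per-cluster score contributions (a schematic stand-in for the full t-statistic computation) and confirms that at most 2^(G−1) distinct absolute values survive, since d and −d give the same |t|.

```python
from itertools import product

# hypothetical per-cluster score contributions; the resample t-statistic is
# taken (schematically) as the sign-weighted sum of these contributions
scores = [1.3, -0.7, 2.1, 0.4, -1.5]          # G = 5 clusters
G = len(scores)

tvals = set()
for signs in product((-1, 1), repeat=G):       # all 2^G Rademacher assignments
    t = sum(d * s for d, s in zip(signs, scores))
    tvals.add(round(abs(t), 12))               # |t(d)| = |t(-d)|, so pairs collapse
```

With G = 5 the set tvals has at most 16 elements, which is the discreteness the text describes.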
Webb (2013) proposes greatly reducing the discreteness of p-values with very low G by
instead using a six-point distribution for the weights d_g in step 1b. In this proposed distribution the weights d_g have a 1/6 chance of each value in {−√1.5, −1, −√0.5, √0.5, 1, √1.5}.
In his simulations this method outperforms the two-point wild bootstrap for G < 10 and is
definitely the method to use if G < 10.
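Drawing the six-point weights is straightforward. The sketch below checks numerically that they have mean zero and variance one, the moment properties the wild bootstrap relies on; the draw count and seed are arbitrary.

```python
import random

# Webb (2013) six-point weights, each drawn with probability 1/6
WEBB = [-(1.5 ** 0.5), -1.0, -(0.5 ** 0.5), 0.5 ** 0.5, 1.0, 1.5 ** 0.5]

rng = random.Random(42)
draws = [rng.choice(WEBB) for _ in range(100_000)]

m = sum(draws) / len(draws)                  # near E[d] = 0
v = sum(d * d for d in draws) / len(draws)   # near E[d^2] = 1
```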
MacKinnon and Webb (2013) address the issue of unbalanced clusters and find that, even
with G = 50, tests based on the standard CRVE with T(G − 1) critical values can over-reject
considerably if the clusters are unbalanced. By contrast, the two-point wild bootstrap with
Rademacher weights is generally reliable.
VIC.3. Bootstrap with Caution
Regardless of the bootstrap method used, pairs cluster with or without asymptotic refinement
or wild cluster bootstrap, when bootstrapping with few clusters, an important step in the
process is to examine the distribution of bootstrapped values. This is something that should
be done whether you are bootstrapping to obtain a standard error, or bootstrapping t-statistics with refinement to obtain a more accurate p-value. This examination can take
the form of looking at four things: (1) basic summary statistics like mean and variance; (2)
the sample size to confirm that it is the same as the number of bootstrap replications (no
missing values); (3) the largest and smallest handful of values of the distribution; and (4) a
histogram of the bootstrapped values.
We detail a few potential problems that this examination can diagnose.
First, if you are using a pairs cluster bootstrap and one cluster is an outlier in some sense,
then the resulting histogram may have a big "mass" that sits separately from the rest of the
bootstrap distribution; that is, there may be two distinct distributions, one for cases where
that cluster is sampled and one for cases where it is not. If this is the case then you know
that your results are sensitive to the inclusion of that cluster.
Second, if you are using a pairs cluster bootstrap with dummy right-hand side variables,
then in some samples you can get no or very limited variation in treatment. This can lead
to zero or near-zero standard errors. For a percentile-t pairs cluster bootstrap, a zero or
missing standard error will lead to missing values for w*, since the standard error is zero or
missing. If you naively use the remaining distribution, then there is no reason to expect that
you will have valid inference. And if the bootstrapped standard errors are zero plus machine
precision noise, rather than exactly zero, very large t-values may result. Then your bootstrap
distribution of t-statistics will have really fat tails, and you will not reject anything, even
false null hypotheses. No variation or very limited variation in treatment can also result
in many of your β̂*'s being "perfect fit" β̂*'s with limited variability. Then the bootstrap
standard deviation of the β̂*'s will be too small, and if you use it as your estimated standard
error you will over-reject. In this case we suggest using the wild cluster bootstrap.
Third, if your pairs cluster bootstrap samples draw nearly multicollinear samples, you
can get huge β̂*'s. This can make a bootstrapped standard error seem very large. You need
to look at what about the bootstrap samples "causes" the huge β̂*'s. If this is some pathological but common draw, then you may need to think about a different type of bootstrap,
such as the wild cluster bootstrap, or give up on bootstrapping methods. For an extreme example, consider a DiD model, with first-order "control" fixed effects and an interaction term.
Suppose that a bootstrap sample happens to have among its "treatment group" only observations where "post = 1". Then the variables "treated" and "treated*post" are collinear,
and an OLS package will drop one of these variables. If it drops the "treated" variable, it will
report a coefficient on "treated*post", but this coefficient will not be a proper interaction
term; it will instead also include the level effect for the treated group.
Fourth, with less than ten clusters the wild cluster bootstrap needs to use the six-point
version of Webb (2013).
Fifth, in general if you see missing values in your bootstrapped t-statistics, then you need
to figure out why. Don't take your bootstrap results at face value until you know what's
going on.
VID. Solution 3: Improved Critical Values using a T-distribution
The simplest small-sample correction for the Wald statistic is to use a T distribution, rather
than the standard normal, with degrees of freedom at most the number of clusters G. Recent
research has proposed methods that lead to using degrees of freedom much less than G,
especially if clusters are unbalanced.
VID.1. G − L Degrees of Freedom
Some packages, including Stata after command regress, use G − 1 degrees of freedom for
t-tests and F-tests based on cluster-robust standard errors. This choice emanates from
the complex survey literature; see Bell and McCaffrey (2002) who note, however, that with
normal errors this choice still tends to result in test over-rejection so the degrees of freedom
should be even less than this.
Even the T(G − 1) can make quite a difference. For example with G = 10, for a two-sided
test at level 0.05 the critical value for T(9) is 2.262 rather than 1.960, and if w = 1.960 the
p-value based on T(9) is 0.082 rather than 0.05. In Monte Carlo simulations by Cameron,
Gelbach, and Miller (2008) this choice works reasonably well, and at a minimum one should
use the T(G − 1) distribution rather than the standard normal.
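These T(9) numbers can be reproduced with a short numerical integration of the Student's t density. This is a stdlib-only sketch; in practice one would call a statistics library's t distribution instead.

```python
import math

def t_tail(x, v, step=2e-3, upper=40.0):
    """P(T_v > x), by trapezoid integration of the Student's t density."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    total, u = 0.0, x
    while u < upper:
        f0 = c * (1 + u * u / v) ** (-(v + 1) / 2)
        f1 = c * (1 + (u + step) ** 2 / v) ** (-(v + 1) / 2)
        total += 0.5 * (f0 + f1) * step
        u += step
    return total

# two-sided p-value of w = 1.960 under T(9): about 0.082, not 0.05
p = 2 * t_tail(1.960, 9)

# bisect the tail probability to recover the 0.025 critical value (about 2.262)
lo, hi = 0.0, 10.0
for _ in range(40):
    mid = (lo + hi) / 2
    if t_tail(mid, 9) > 0.025:
        lo = mid
    else:
        hi = mid
crit = (lo + hi) / 2
```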
For models that include L regressors that are invariant within cluster, Donald and Lang
(2007) provide a rationale for using the T(G − L) distribution. If clusters are balanced and all
regressors are invariant within cluster then the OLS estimator in the model y_ig = x'_g β + u_ig
is equivalent to OLS estimation in the grouped model ȳ_g = x'_g β + ū_g. If ū_g is i.i.d. normally
distributed then the Wald statistic is T(G − L) distributed, where V̂[β̂] = s²(X'X)^{-1} and
s² = (G − L)^{-1} Σ_g û_g², with û_g the residuals from the grouped regression. Note that ū_g is
i.i.d. normal in the random effects model if the error
components are i.i.d. normal. Usually if there is a time-invariant regressor there is only one,
in addition to the intercept, in which case L = 2.
Donald and Lang extend this approach to inference on β in a model that additionally
includes regressors z_ig that vary within clusters, and allow for unbalanced clusters, leading
to T(G − L) inference for the RE estimator. Wooldridge (2006) presents an expansive exposition of the
Donald and Lang approach. He also proposes an alternative approach based on minimum
distance estimation. See Wooldridge (2006) and, for a summary, Cameron and Miller (2011).
VID.2. Data-determined Degrees of Freedom
For testing the difference between two means of normally and independently distributed
populations with different variances, the t test is not exactly T distributed; this is known
as the Behrens-Fisher problem. Satterthwaite (1946) proposed an approximation that was
extended to regression with clustered errors by Bell and McCaffrey (2002) and developed
further by Imbens and Kolesar (2012).
The T(N − K) distribution is the ratio of N[0, 1] to the square root of an independent
χ²(N − K)/(N − K). For linear regression under i.i.d. normal errors, the derivation of the
T(N − K) distribution for the Wald t-statistic uses the result that (N − K)(s²_β̂ / σ²_β̂) ~ χ²(N − K),
where s²_β̂ is the usual unbiased estimate of σ²_β̂ = V[β̂]. This result no longer holds for
non-i.i.d. errors, even if they are normally distributed. Instead, an approximation uses the
T(v*) distribution, where v* is such that the first two moments of v*(s²_β̂ / σ²_β̂) equal the first
two moments (v* and 2v*) of the χ²(v*) distribution. Assuming s²_β̂ is unbiased for σ²_β̂,
E[v*(s²_β̂ / σ²_β̂)] = v*. And V[v*(s²_β̂ / σ²_β̂)] = 2v* if v* = 2[(σ²_β̂)² / V[s²_β̂]].
Thus the Wald t-statistic is treated as being T(v*) distributed, where v* = 2(σ²_β̂)² / V[s²_β̂].
Assumptions are needed to obtain an expression for V[s²_β̂]. For clustered errors with u ~
N[0, Ω], and using the CRVE defined in Section IIC, or using CR2VE or CR3VE defined in
Section VIB, Bell and McCaffrey (2002) show that the distribution of the Wald t-statistic
defined in (23) can be approximated by the T(v*) distribution where

v* = (Σ_{j=1}^G λ_j)² / (Σ_{j=1}^G λ_j²),    (26)

and λ_j are the eigenvalues of the G × G matrix G'Ω̂G, where Ω̂ is consistent for Ω, the N × G
matrix G has gth column (I_N − H)'_g A_g X_g (X'X)^{-1} e_k, (I_N − H)_g is the N_g × N submatrix for
cluster g of the N × N matrix I_N − X(X'X)^{-1}X', A_g = (I_{N_g} − H_{gg})^{-1/2} for CR2VE, and e_k
is a K × 1 vector of zeroes aside from 1 in the kth position if β̂ = β̂_k. Note that v* needs to
be calculated separately, and differs, for each regression coefficient. The method extends to
tests on scalar linear combinations c'β̂.
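Given the eigenvalues λ_j, equation (26) is one line of code. The sketch below uses made-up eigenvalue lists to show that equal eigenvalues give v* = G, while one dominant eigenvalue pulls v* far below G.

```python
def bm_dof(eigvals):
    """Satterthwaite-style degrees of freedom v* = (sum lambda)^2 / (sum lambda^2),
    as in eq. (26), given the eigenvalues of G'(Omega-hat)G."""
    s1 = sum(eigvals)
    s2 = sum(lam * lam for lam in eigvals)
    return s1 * s1 / s2

equal = bm_dof([2.0] * 10)            # equal eigenvalues: v* = 10, the cluster count
skewed = bm_dof([10.0] + [1.0] * 9)   # one dominant eigenvalue: v* well below 10
```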
The justification relies on normal errors and knowledge of Ω = E[uu'|X]. Bell and McCaffrey (2002) perform simulations with balanced clusters (G = 20 and N_g = 10) and
equicorrelated errors within cluster. They calculate v* assuming Ω = σ²I, even though errors are in fact clustered, and find that their method leads to Wald tests with true size closer
to the nominal size than tests based on the conventional CRVE, CR2VE, and CR3VE.
Imbens and Kolesar (2012) additionally consider calculating v* where Ω̂ is based on
equicorrelated errors within cluster. They follow the Monte Carlo designs of Cameron,
Gelbach and Miller (2008), with G = 5 and 10 and equicorrelated errors. They find that
all finite-sample adjustments perform better than using the standard CRVE with T(G − 1)
critical values. The best methods use the CR2VE and T(v*), with slight over-rejection with
v* based on Ω̂ = s²I (Bell and McCaffrey) and slight under-rejection with v* based on Ω̂
assuming equicorrelated errors (Imbens and Kolesar). For G = 5 these methods outperform
the two-point wild cluster bootstrap, as expected given the very low G problem discussed in
Section VIC. More surprisingly these methods also outperform the wild cluster bootstrap when
G = 10, perhaps because Imbens and Kolesar (2012) did not impose the null hypothesis in
forming the residuals for this bootstrap.
VID.3. Effective Number of Clusters
Carter, Schnepel and Steigerwald (2013) propose a measure of the effective number of clusters. This measure is

G* = G / (1 + α),    (27)

where α = G^{-1} Σ_{g=1}^G {(γ_g − γ̄)² / γ̄²}, γ_g = e'_k (X'X)^{-1} X'_g Ω_g X_g (X'X)^{-1} e_k, e_k is a K × 1 vector
of zeroes aside from 1 in the kth position if β̂ = β̂_k, and γ̄ = G^{-1} Σ_{g=1}^G γ_g. Note that G* varies
with the regression coefficient considered, and the method extends to tests on scalar linear
combinations c'β̂.
The quantity α measures cluster heterogeneity, which disappears if γ_g = γ
for all g.
Given the formula for γ_g, cluster heterogeneity (α ≠ 0) can arise for many reasons, including
variation in N_g, variation in X_g and variation in Ω_g across clusters.
In simulations using standard normal critical values, Carter et al. (2013) find that test
size distortion occurs for G* < 20. In application they assume errors are perfectly correlated
within cluster, so Ω_g = ll' where l is an N_g × 1 vector of ones. For data from the Tennessee
STAR experiment they find G* = 192 when G = 318. For the Hersch (1998) data of Section
IIB, with very unbalanced clusters, they find for the industry job risk coefficient and with
clustering on industry that G* = 19 when G = 211.
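Equation (27) is likewise immediate once the γ_g are in hand. The γ_g values below are hypothetical; in practice they come from X and Ω_g as defined above.

```python
def effective_clusters(gammas):
    """G* = G / (1 + alpha) from eq. (27), given per-cluster gamma_g values."""
    G = len(gammas)
    gbar = sum(gammas) / G
    alpha = sum(((g - gbar) / gbar) ** 2 for g in gammas) / G
    return G / (1 + alpha)

balanced = effective_clusters([1.0] * 20)          # homogeneous clusters: G* = G = 20
unbalanced = effective_clusters([5.0] + [1.0] * 19)  # one dominant cluster: G* < G
```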
Carter et al. (2013) do not actually propose using critical values based on the T(G*)
distribution. The key component in obtaining the formula for v* in the Bell and McCaffrey
(2002) approach is determining V[s²_β̂ / σ²_β̂], which equals E[(s²_β̂ − σ²_β̂)² / σ⁴_β̂] given s²_β̂ is unbiased
for σ²_β̂. Carter et al. (2013) instead work with E[(s̃²_β̂ − σ²_β̂)² / σ⁴_β̂], where s̃²_β̂, defined in their
paper, is an approximation to s²_β̂ that is good for large G (formally s̃²_β̂ / σ²_β̂ → s²_β̂ / σ²_β̂ as
G → ∞). Now E[(s̃²_β̂ − σ²_β̂)² / σ⁴_β̂] = 2(1 + α)/G, see Carter et al. (2013), where α is defined
in (27). This suggests using the T(G*) distribution as an approximation, and that this
approximation will improve as G increases.
VIE. Special Cases
With few clusters, additional results can be obtained if there are many observations in each
group. In DiD studies the few clusters problem arises if few groups are treated, even if G is
large. And the few clusters problem is more likely to arise if there is multi-way clustering.
VIE.1. Fixed Number of Clusters with Cluster Size Growing
The preceding adjustments to the degrees of freedom of the T distribution are based on the
assumption of normal errors. In some settings asymptotic results can be obtained when G
is small, provided N_g → ∞.
Bester, Conley and Hansen (2011), building on Hansen (2007a), give conditions under
which the t-test statistic based on (11) is √(G/(G − 1)) times T(G − 1) distributed. Then using
ũ_g = √(G/(G − 1)) û_g in (11) yields a T(G − 1) distributed statistic. In addition to assuming
G is fixed while N_g → ∞, it is assumed that the within-group correlation satisfies a mixing
condition (this does not happen in all data settings, although it does for time series and
spatial correlation), and that homogeneity assumptions are satisfied, including equality of
plim N_g^{-1} X'_g X_g for all g.
Let β̂_g denote the estimate of parameter β in cluster g, β̄ = G^{-1} Σ_{g=1}^G β̂_g denote the
average of the G estimates, and s²_β̂ = (G − 1)^{-1} Σ_{g=1}^G (β̂_g − β̄)² denote their variance. Suppose
that the β̂_g are asymptotically normal as N_g → ∞ with common mean β, and consider a test
of H0: β = β0 based on t = √G (β̄ − β0)/s_β̂. Then Ibragimov and Müller (2010) show that
tests based on the T(G − 1) will be conservative tests (i.e., under-reject) for level α ≤ 0.083.
This approach permits correct inference even with extremely few clusters, assuming N_g is
large. However, the requirement that cluster estimates are asymptotically independent must
be met. Thus the method is not directly applicable to a state-year differences-in-differences
application when there are year fixed effects (or other regressors that vary over time but not
across states). In that case Ibragimov and Müller propose applying their method after aggregating
subsets of states into groups in which some states are treated and some are not.
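The Ibragimov and Müller statistic is simple to compute from per-cluster estimates. The per-cluster slopes below are hypothetical; in practice each would come from a separate within-cluster regression, and the resulting t is compared with T(G − 1) critical values.

```python
def im_tstat(cluster_estimates, beta0=0.0):
    """t = sqrt(G) * (betabar - beta0) / s, with s the standard deviation
    of the per-cluster estimates."""
    G = len(cluster_estimates)
    bbar = sum(cluster_estimates) / G
    s2 = sum((b - bbar) ** 2 for b in cluster_estimates) / (G - 1)
    return G ** 0.5 * (bbar - beta0) / s2 ** 0.5

betas = [1.9, 2.2, 2.0, 2.3, 1.8, 2.1]   # hypothetical per-cluster slope estimates
t = im_tstat(betas, beta0=0.0)            # compare with T(G - 1) = T(5) critical values
```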
VIE.2. Few Treated Groups
Problems arise if most of the variation in the regressor is concentrated in just a few clusters,
even when G is sufficiently large. This occurs if the key regressor is a cluster-specific binary
treatment dummy and there are few treated groups.
Conley and Taber (2011) examine a differences-in-differences (DiD) model in which there
are few treated groups and an increasing number of control groups. If there are group-time
random effects, then the DiD estimator is inconsistent because the treated groups' random effects
are not averaged away. If the random effects are normally distributed, then the model of
Donald and Lang (2007) applies and inference can use a T distribution based on the number
of treated groups. If the group-time shocks are not random, then the T distribution may be
a poor approximation. Conley and Taber (2011) then propose a novel method that uses the
distribution of the untreated groups to perform inference on the treatment parameter.
Abadie, Diamond and Hainmueller (2010) propose synthetic control methods that provide
a data-driven method to select the control group in a DiD study, and that provide inference
under random permutations of assignment to treated and untreated groups. The methods
are suitable for treatments that affect few observational units.
VIE.3. What if you have multi-way clustering and few clusters?
Sometimes we are worried about multi-way clustering, but one or both of the ways has few
clusters. Currently we are not aware of an ideal approach to deal with this problem. One
potential solution is to try to add sufficient control variables so as to minimize concerns about
clustering in one of the ways, and then use a one-way few-clusters cluster-robust approach
on the other way. Another potential solution is to model one of the ways of clustering in a
parametric way, such as with a common shock or an autoregressive error correlation. Then
you can construct a variance estimator that is a hybrid of the parametric model, and cluster
robust in the remaining dimension.
VII. Extensions
The preceding material has focused on the OLS (and FGLS) estimator and tests on a single
coefficient. The basic results generalize to multiple hypothesis tests, instrumental variables
(IV) estimation, nonlinear estimators and generalized method of moments (GMM).
These extensions are incorporated in Stata, though Stata generally computes test p-values
and confidence intervals using standard normal and chi-squared distributions, rather than T
and F distributions. And for nonlinear models stronger assumptions are needed to ensure
that the estimator of β retains its consistency in the presence of clustering. We provide a
brief overview.
VIIA. Cluster-Robust F-tests
Consider Wald joint tests of several restrictions on the regression parameters. Except in
the special case of linear restrictions and OLS with i.i.d. normal errors, asymptotic theory
yields only a chi-squared distributed statistic, say W, that is χ²(h) distributed, where h is
the number of (linearly independent) restrictions.
Alternatively we can use the related F statistic, F = W/h. This yields the same p-value
as the chi-squared test if we treat F as being F(h, ∞) distributed. In the cluster case, a
finite-sample adjustment instead treats F as being F(h, G − 1) distributed. This is analogous
to using the T(G − 1) distribution, rather than N[0, 1], for a test on a single coefficient.
In Stata, the finite-sample adjustment of using the T(G − 1) for a t-test on a single
coefficient, and using the F(h, G − 1) for an F-test, is only done after OLS regression with
command regress. Otherwise Stata reports critical values and p-values based on the N[0, 1]
and χ²(h) distributions.
Thus Stata does no finite-cluster correction for tests and confidence intervals following
instrumental variables estimation commands, nonlinear model estimation commands, or even
after command regress in the case of tests and confidence intervals using commands testnl
and nlcom. The discussion in Section VI was limited to inference after OLS regression, but
it seems reasonable to believe that for other estimators one should also base inference on the
T(G − 1) and F(h, G − 1) distributions, and even then tests may over-reject when there are
few clusters.
Some of the few-cluster methods of Section VI can be extended to tests of more than
one restriction following OLS regression. The Wald test can be based on the bias-adjusted
variance matrices CR2VE or CR3VE, rather than CRVE. For a bootstrap with asymptotic
refinement of a Wald test of H0: Rβ = r, in the bth resample we compute W*_b = (Rβ̂*_b −
Rβ̂)' [R V̂_clu[β̂*_b] R']^{-1} (Rβ̂*_b − Rβ̂). Extension of the data-determined degrees of freedom
method of Section VID.2 to tests of more than one restriction requires, at a minimum,
extension of Theorem 4 of Bell and McCaffrey (2002) from the case that covers β_k, where
β_k is a single component of β, to Rβ. An alternative ad hoc approach would be to use the
F(h, v*) distribution where v* is an average (possibly weighted by estimator precision) of v*
defined in (26) computed separately for each exclusion restriction.
For the estimators discussed in the remainder of Section VII, the rank of V̂_clu[β̂] is again
the minimum of G − 1 and the number of parameters (K). This means that at most G − 1
restrictions can be tested using a Wald test, in addition to the usual requirement that h ≤ K.
VIIB. Instrumental Variables Estimators
The cluster-robust variance matrix estimate for the OLS estimator extends naturally to the
IV estimator, the two-stage least squares (2SLS) estimator and the linear GMM estimator.
The following additional adjustments must be made when errors are clustered. First, a
modified version of the Hausman test of endogeneity needs to be used. Second, the usual
inference methods when instruments are weak need to be adjusted. Third, tests of over-identifying restrictions after GMM need to be based on an optimal weighting matrix that
controls for cluster correlation of the errors.
VIIB.1. IV and 2SLS
In matrix notation, the OLS estimator in the model y = Xβ + u is inconsistent if E[u|X] ≠ 0.
We assume existence of a set of instruments Z that satisfy E[u|Z] = 0 and satisfy other
conditions, notably Z is of full rank with dim[Z] ≥ dim[X] and Cor[Z, X] ≠ 0.
For the clustered case the assumption that errors in different clusters are uncorrelated is
now one of uncorrelated errors conditional on the instruments Z, rather than uncorrelated
errors conditional on the regressors X. In the gth cluster the matrix of instruments Z_g is an
N_g × M matrix, where M ≥ K, and we assume that E[u_g|Z_g] = 0 and Cov[u_g, u_h|Z_g, Z_h] = 0
for g ≠ h.
In the just-identified case, with Z and X having the same dimension, the IV estimator
is β̂_IV = (Z'X)^{-1} Z'y, and the cluster-robust variance matrix estimate is

V̂_clu[β̂_IV] = (Z'X)^{-1} [Σ_{g=1}^G Z'_g û_g û'_g Z_g] (X'Z)^{-1},    (28)

where û_g = y_g − X_g β̂_IV are residuals calculated using the consistent IV estimator. We again
assume G → ∞. As for OLS, the CRVE may be rank-deficient with rank the minimum of
K and G − 1.
In the over-identified case with Z having dimension greater than X, the 2SLS estimator
is the special case of the linear GMM estimator in (29) below with W = (Z'Z)^{-1}, and
the CRVE is that in (30) below with W = (Z'Z)^{-1} and û_g the 2SLS residuals. In the
just-identified case 2SLS is equivalent to IV.
A test for endogeneity of a regressor(s) can be conducted by comparing the OLS estimator
to the 2SLS (or IV) estimator that controls for this endogeneity. The two estimators have the
same probability limit given exogeneity and different probability limits given endogeneity.
This is a classic setting for the Hausman test but, as in the Hausman test for fixed effects
discussed in Section IIID, the standard version of the Hausman test cannot be used. Instead
partition X = [X1 X2], where X1 is potentially endogenous and X2 is exogenous, and let
v̂_1ig denote the residuals from first-stage OLS regression of the endogenous regressors on
instruments and exogenous regressors. Then estimate by OLS the model

y_ig = x'_1ig β1 + x'_2ig β2 + v̂'_1ig γ + u_ig.

The regressors x1 are considered endogenous if we reject H0: γ = 0 using a Wald test based
on a CRVE. In Stata this is implemented using command estat endogenous. (Alternatively
a pairs cluster bootstrap can be used to estimate the variance of β̂_2SLS − β̂_OLS.)
VIIB.2. Weak Instruments
When endogenous regressor(s) are weakly correlated with instrument(s), where this correlation is conditional on the exogenous regressors in the model, there is great loss of precision,
with the standard error for the coefficient of the endogenous regressor much higher after IV
or 2SLS estimation than after OLS estimation.
Additionally, asymptotic theory takes an unusually long time to kick in, so that even
with large samples the IV estimator can still have considerable bias and the Wald statistic is
still not close to normally distributed. See, for example, Bound, Jaeger, and Baker (1995),
Andrews and Stock (2007), and textbook discussions in Cameron and Trivedi (2005, 2009).
For this second consequence, called the "weak instrument" problem, the econometrics
literature has focused on providing theory and guidance in the case of homoskedastic or
heteroskedastic errors, rather than within-cluster correlated errors. The problem may even
be greater in the clustered case, as the asymptotics are then in G → ∞ rather than N → ∞,
though we are unaware of evidence on this.
A standard diagnostic for detecting weak instruments, in the case of a single endogenous
regressor, is to estimate by OLS the first-stage regression, of the endogenous regressor on the
remaining exogenous regressors and the additional instrument(s). Then form the F-statistic
for the joint significance of the instruments; in the case of a just-identified model there is
only one instrument to test so the F-statistic is the square of the t-statistic. A popular
rule-of-thumb, due to Staiger and Stock (1997), is that there may be no weak instrument
problem if F > 10. With clustered errors, this F-statistic needs to be based on a cluster-robust variance matrix estimate. In settings where there is a great excess of instruments,
however, this test will not be possible if the number of instruments (M − K) exceeds the
number of clusters. Instead only subsets or linear combinations of the excess instruments
can be tested, and the F > 10 criterion becomes a cruder guide.
Baum, Schaffer and Stillman (2007) provide a comprehensive discussion of various methods for IV, 2SLS, limited information maximum likelihood (LIML), k-class, continuous updating and GMM estimation in linear models, and present methods using their ivreg2 Stata
command. They explicitly allow for within-cluster correlation and state which of the proposed methods for weak instrument diagnostics and tests developed for i.i.d. errors can be
generalized to within-cluster correlated errors.
Chernozhukov and Hansen (2008) proposed a novel method that provides valid inference
on the coefficient of endogenous regressor(s) under weak instruments with errors that are
not restricted to being i.i.d. For simplicity suppose that there is one endogenous regressor,
y_ig = β x_ig + u_ig, and that the first-stage model is x_ig = z'_ig π + v_ig. If there are additional
exogenous regressors x2, as is usually the case, the method still works if the variables y, x and
z are defined after partialling out x2. The authors show that a test of β = β0 is equivalent
to a test of γ = 0 in the model y_ig − β0 x_ig = z'_ig γ + w_ig. A 95% confidence interval for β can
be constructed by running this regression for a range of values of β0 and including in the
interval for β only those values of β0 for which we did not reject H0: γ = 0 when testing at
5%. This approach generalizes to more than one endogenous regressor. More importantly
it does not require i.i.d. errors. Chernozhukov and Hansen proposed the method for HAC
robust inference, but the method can also be applied to clustered errors. Then the test
of γ = 0 should be based on a CRVE. Note that as with other F-tests, this can only be
performed when the dimension of γ is less than G. Finlay and Magnusson (2009) provide
this and other extensions, and provide a command ivtest for Stata. We speculate that if
additionally there are few clusters, then some of the adjustments discussed in Section VI
would help.
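The grid search behind this confidence interval is easy to sketch. In the toy Python example below, a homoskedastic OLS t-statistic stands in for the CRVE-based statistic the clustered case requires, and the just-identified data are made up; the loop tests γ = 0 in the regression of y − β0·x on z for each candidate β0.

```python
def ols_slope_t(y, z):
    """Bivariate OLS slope (with intercept) and its homoskedastic t-statistic.
    A CRVE-based t-statistic would replace this in the clustered case."""
    n = len(y)
    zbar = sum(z) / n
    ybar = sum(y) / n
    szz = sum((zi - zbar) ** 2 for zi in z)
    g = sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y)) / szz
    a = ybar - g * zbar
    ssr = sum((yi - a - g * zi) ** 2 for zi, yi in zip(z, y))
    se = (ssr / (n - 2) / szz) ** 0.5
    return g, g / se

# hypothetical just-identified design: x = z, y = x + small deterministic noise
z = [float(i) for i in range(1, 11)]
x = z[:]
e = [0.1 if i % 2 == 0 else -0.1 for i in range(10)]
y = [xi + ei for xi, ei in zip(x, e)]

# keep the candidate beta0 values for which gamma = 0 is not rejected at 5%
grid = [round(0.90 + 0.01 * i, 2) for i in range(21)]
ci = [b0 for b0 in grid
      if abs(ols_slope_t([yi - b0 * xi for xi, yi in zip(x, y)], z)[1]) < 1.96]
```

The retained values of β0 cluster around the true coefficient of 1.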
VIIB.3. Linear GMM
For over-identified models the linear GMM estimator is more efficient than the 2SLS estimator if E[uu'|Z] ≠ σ²I. Then

β̂_GMM = (X'ZWZ'X)^{-1} (X'ZWZ'y),    (29)

where W is a full rank M × M weighting matrix. The CRVE for GMM is

V̂_clu[β̂_GMM] = (X'ZWZ'X)^{-1} X'ZW [Σ_{g=1}^G Z'_g û_g û'_g Z_g] WZ'X (X'ZWZ'X)^{-1},    (30)

where û_g are residuals calculated using the GMM estimator.
For clustered errors, the efficient two-step GMM estimator uses W = (Σ_{g=1}^G Z'_g û_g û'_g Z_g)^{-1},
where û_g are 2SLS residuals. Implementation of this estimator requires that the number of
clusters exceeds the number of instruments, since otherwise Σ_{g=1}^G Z'_g û_g û'_g Z_g is not invertible.
Here Z contains both the exogenous regressors in the structural equation and the additional
instruments required to enable identification of the endogenous regressors. When this condition is not met, Baum, Schaffer and Stillman (2007) propose doing two-step GMM after
first partialling out the instruments z from the dependent variable y, the endogenous variables in the initial model y_ig = x'_ig β + u_ig, and any additional instruments that are not also
exogenous regressors in this model.
The over-identifying restrictions (OIR) test, also called a Hansen test or a Sargan test, is
a limited test of instrument validity that can be used when there are more instruments than
necessary. When errors are clustered the OIR test must be computed following the cluster
version of two-step GMM estimation; see Hoxby and Paserman (1998).
Just as GLS is more efficient than OLS, specifying a model for Ω = E[uu'|Z] can lead
to more efficient estimation than GMM. Given a model for Ω, and conditional moment
condition E[u|Z] = 0, a more efficient estimator is based on the unconditional moment
condition E[Z'Ω^{-1}u] = 0. Then we minimize (Z'Ω̂^{-1}u)'(Z'Ω̂^{-1}Z)^{-1}(Z'Ω̂^{-1}u), where Ω̂ is
consistent for Ω. Furthermore the CRVE can be robustified against misspecification of Ω,
similar to the case of FGLS, though an OIR test is no longer possible if Ω is misspecified. In
practice such FGLS-type improvements to GMM are seldom used, even in settings simpler
than the clustered setting. An exception is Shore-Sheppard (1996) who considers the impact
of equicorrelated instruments and group-specific shocks in a model similar to that of Moulton.
One reason may be that this option is not provided in Stata command ivregress. In the
special case of a random effects model for Ω, command xtivreg can be used along with a
pairs cluster bootstrap used to guard against misspecification of Ω.
VIIC. Nonlinear Models
For nonlinear models there are several ways to handle clustering. We provide a brief summary; see Cameron and Miller (2011) for further details.
For concreteness we focus on logit regression. Recall that in the cross-section case $y_i$ takes value 0 or 1 and the logit model specifies that $E[y_i|x_i] = \Pr[y_i = 1|x_i] = \Lambda(x_i'\beta)$, where $\Lambda(z) = e^z/(1+e^z)$.
VIIC.1. Different Models for Clustering
The simplest approach is a pooled approach that assumes that clustering does not change the functional form for the conditional probability of a single observation. Thus, for the logit model, whatever the nature of clustering, it is assumed that

$E[y_{ig}|x_{ig}] = \Pr[y_{ig} = 1|x_{ig}] = \Lambda(x_{ig}'\beta). \qquad (31)$

This is called a population-averaged approach, as $\Lambda(x_{ig}'\beta)$ is obtained after averaging out any within-cluster correlation. Inference needs to control for within-cluster correlation, however, and more efficient estimation may be possible.
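As a concrete illustration of the population-averaged approach, the sketch below fits a pooled logit by Newton-Raphson and forms the cluster-robust sandwich variance by summing score contributions within clusters. This is our own toy example with simulated data and hypothetical names, not the paper's code; in Stata the same is obtained with logit y x, vce(cluster id).

```python
import numpy as np

rng = np.random.default_rng(1)
G, n_g = 40, 25
n = G * n_g
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.2, 0.5])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)
cluster = np.repeat(np.arange(G), n_g)

# Newton-Raphson for the pooled logit MLE
beta = np.zeros(2)
for _ in range(30):
    p = 1 / (1 + np.exp(-X @ beta))
    H = (X * (p * (1 - p))[:, None]).T @ X        # information matrix
    beta = beta + np.linalg.solve(H, X.T @ (y - p))

# Cluster-robust sandwich A^{-1} B A^{-1}: sum score contributions by cluster
p = 1 / (1 + np.exp(-X @ beta))
A_inv = np.linalg.inv((X * (p * (1 - p))[:, None]).T @ X)
B = np.zeros((2, 2))
for g in range(G):
    s_g = X[cluster == g].T @ (y - p)[cluster == g]   # cluster g score
    B += np.outer(s_g, s_g)
se_cluster = np.sqrt(np.diag(A_inv @ B @ A_inv))
```

Here the simulated data happen to be independent, so the cluster-robust and heteroskedastic-robust standard errors would be close; with genuinely clustered outcomes the two diverge.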
The generalized estimating equations (GEE) approach, due to Liang and Zeger (1986) and widely used in biostatistics, introduces within-cluster correlation into the class of generalized linear models (GLM), a class that includes the logit model. One possible model for within-cluster correlation is equicorrelation, with $\mathrm{Cor}[y_{ig}, y_{jg}|x_{ig}, x_{jg}] = \rho$. The Stata command xtgee y x, family(binomial) link(logit) corr(exchangeable) estimates the population-averaged logit model and provides the CRVE assuming the equicorrelation model for within-cluster correlation is correctly specified. The option vce(robust) provides a CRVE that is robust to misspecification of the model for within-cluster correlation. Command xtgee includes a range of models for the within-error correlation. The method is a nonlinear analog of FGLS given in Section IID, and asymptotic theory requires $G \to \infty$.
A further extension is nonlinear GMM. For example, with endogenous regressors and instruments $z$ that satisfy $E[y_{ig} - \exp(x_{ig}'\beta)|z_{ig}] = 0$, a nonlinear GMM estimator minimizes $h(\beta)'Wh(\beta)$, where $h(\beta) = \sum_g \sum_i z_{ig}(y_{ig} - \exp(x_{ig}'\beta))$. Other choices of $h(\beta)$ that allow for intracluster correlation may lead to more efficient estimation, analogous to the linear GMM example discussed at the end of Section VIIB. Given a choice of $h(\beta)$, the two-step nonlinear GMM estimator at the second step uses weighting matrix $W$ that is the inverse of a consistent estimator of the variance of $h(\beta)$, and one can then use the minimized objective function for an overidentifying restrictions test.
Now suppose we consider a random effects logit model with normally distributed random effect, so

$\Pr[y_{ig} = 1|\alpha_g, x_{ig}] = \Lambda(\alpha_g + x_{ig}'\beta), \qquad (32)$

where $\alpha_g \sim N[0, \sigma_\alpha^2]$. If $\alpha_g$ is known, the $N_g$ observations in cluster $g$ are independent with joint density

$f(y_{1g}, \ldots, y_{N_g g}|\alpha_g, X_g) = \prod_{i=1}^{N_g} \Lambda(\alpha_g + x_{ig}'\beta)^{y_{ig}}[1 - \Lambda(\alpha_g + x_{ig}'\beta)]^{1-y_{ig}}.$

Since $\alpha_g$ is unknown it is integrated out, leading to joint density

$f(y_{1g}, \ldots, y_{N_g g}|X_g) = \int \prod_{i=1}^{N_g} \Lambda(\alpha_g + x_{ig}'\beta)^{y_{ig}}[1 - \Lambda(\alpha_g + x_{ig}'\beta)]^{1-y_{ig}} \, h(\alpha_g|\sigma_\alpha^2)\, d\alpha_g,$
where $h(\alpha_g|\sigma_\alpha^2)$ is the $N[0, \sigma_\alpha^2]$ density. There is no closed form solution for this integral, but it is only one-dimensional so numerical approximation (such as Gaussian quadrature) can be used. The consequent MLE can be obtained in Stata using the command xtlogit y x, re. Note that in this RE logit model (31) no longer holds, so $\beta$ in the model (32) is scaled differently from $\beta$ in (31). Furthermore $\hat{\beta}$ in (32) is inconsistent if the distribution for $\alpha_g$ is misspecified, so there is no point in using option vce(robust) after command xtlogit, re.
It is important to realize that in nonlinear models such as logit the population-averaged and random effects approaches lead to quite different estimates of $\beta$ that are not comparable, since $\beta$ means different things in the different models. The resulting estimated average marginal effects may be similar, however, just as they are in logit and probit models.
With few clusters, Wald statistics are likely to over-reject as in the linear case, even if we scale the CRVE's given in this section by $G/(G-1)$ as is typically done; see (12) for the linear case. McCaffrey, Bell, and Botts (2001) consider bias-correction of the CRVE in generalized linear models. Asymptotic refinement using a pairs cluster bootstrap as in Section VIC is possible. The wild bootstrap given in Section VID is no longer possible in a nonlinear model, aside from nonlinear least squares, since it requires additively separable errors. Instead one can use the score wild bootstrap proposed by Kline and Santos (2012) for nonlinear models, including maximum likelihood and GMM models. The idea in their paper is to estimate the model once, generate scores for all observations, and then perform a bootstrap in the wild-cluster style, perturbing the scores by bootstrap weights at each step. For each bootstrap replication the perturbed scores are used to build a test statistic, and the resulting distribution of this test statistic can be used for inference. They find that this method performs well in small samples, and can greatly ease computational burden because the nonlinear model need only be estimated once. The conservative test of Ibragimov and Müller (2010) can be used if $N_g \to \infty$.
VIIC.2. Fixed Effects
A cluster-specific fixed effects version of the logit model treats the unobserved parameter $\alpha_g$ in (32) as being correlated with the regressors $x_{ig}$. In that case both the population-averaged and random effects logit estimators are inconsistent for $\beta$.

Instead we need a fixed effects logit estimator. In general there is an incidental parameters problem if asymptotics are that $N_g$ is fixed while $G \to \infty$, as there are only $N_g$ observations for each $\alpha_g$, and inconsistent estimation of $\alpha_g$ spills over to inconsistent estimation of $\beta$. Remarkably, for the logit model it is nonetheless possible to consistently estimate $\beta$. The logit fixed effects estimator is obtained in Stata using the command xtlogit y x, fe. Note, however, that the marginal effect in model (32) is $\partial \Pr[y_{ig} = 1|\alpha_g, x_{ig}]/\partial x_{ig,k} = \Lambda(\alpha_g + x_{ig}'\beta)(1 - \Lambda(\alpha_g + x_{ig}'\beta))\beta_k$. Unlike the linear FE model this depends on the unknown $\alpha_g$. So the marginal effects cannot be computed, though the ratio of the marginal effects of the $k$th and $l$th regressors equals $\beta_k/\beta_l$, which can be consistently estimated.
The logit model is one of few nonlinear models for which fixed effects estimation is possible when $N_g$ is small. The other models are Poisson with $E[y_{ig}|X_g, \alpha_g] = \exp(\alpha_g + x_{ig}'\beta)$, and nonlinear models with $E[y_{ig}|X_g, \alpha_g] = \alpha_g + m(x_{ig}'\beta)$, where $m(\cdot)$ is a specified function.

The natural approach to introduce cluster-specific effects in a nonlinear model is to include a full set of cluster dummies as additional regressors. This leads to inconsistent estimation of $\beta$ in all models except the linear model (estimated by OLS) and the Poisson regression model, unless $N_g \to \infty$. There is a growing literature on bias-corrected estimation in such cases; see, for example, Fernandez-Val (2009). This paper also cites several simulation studies that gauge the extent of bias of dummy variable estimators for moderate $N_g$, such as $N_g = 20$.
Yoon and Galvao (2013) consider fixed effects in panel quantile regression models with correlation within cluster and provide methods under the assumption that both the number of individuals and the number of time periods go to infinity.
VIID. Cluster-randomized Experiments
Increasingly researchers are gathering their own data, often in the form of field or laboratory experiments. When analyzing data from these experiments they will want to account for the clustered nature of the data. And so when designing these experiments, they should also account for clustering. Fitzsimons, Malde, Mesnard, and Vera-Hernandez (2012) use a wild cluster bootstrap in an experiment with 12 treated and 12 control clusters.

Traditional guidance for computing power analyses and minimum detectable effects (see e.g. Duflo, Glennerster and Kremer, 2007, pp. 3918-3922, and Hemming and Marsh (2013)) is based on assumptions of either independent errors or, in a clustered setting, a random effects common-shock model. Ideally one would account for more general forms of clustering in these calculations (the types of clustering that motivate cluster-robust variance estimation), but this can be difficult to do ex ante. If you have a data set that is similar to the one you will be analyzing later, then you can assign a placebo treatment, and compute the ratio of cluster-robust standard errors to default standard errors. This can provide a sense of how to adjust the traditional measures used in design of experiments.
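The placebo check just described can be sketched as follows, on simulated data with a within-cluster common shock (all names and numbers are ours): assign a fake cluster-level treatment, then compare cluster-robust and default standard errors.

```python
import numpy as np

rng = np.random.default_rng(3)
G, n_g = 20, 50
cluster = np.repeat(np.arange(G), n_g)
# Errors with a common within-cluster shock, so clustering matters
u = 0.5 * rng.normal(size=G)[cluster] + rng.normal(size=G * n_g)

d = (cluster < G // 2).astype(float)      # placebo: "treat" half the clusters
d = d - d.mean()                          # demean so a no-intercept fit suffices
bhat = (d @ u) / (d @ d)
resid = u - d * bhat

# Default (iid) standard error vs cluster-robust standard error
se_default = np.sqrt(resid @ resid / (len(u) - 1)) / np.sqrt(d @ d)
s_g = np.array([(d * resid)[cluster == g].sum() for g in range(G)])
se_cluster = np.sqrt((s_g ** 2).sum()) / (d @ d)

ratio = se_cluster / se_default           # the design-effect-style multiplier
```

A ratio well above one signals that default-based power calculations would substantially overstate precision, and minimum detectable effects should be inflated accordingly.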
VIII. Empirical Example
In this section we illustrate many of the issues and methods presented in this paper. The
data and accompanying Stata code will be posted on our websites.
The micro data are from the March CPS, downloaded from IPUMS-CPS (King et al., 2010). We use data covering individuals who worked 40 or more weeks during the prior year, and whose usual hours per week in that year were 30 or more. The hourly wage is constructed
and whose usual hours per week in that year was 30 or more. The hourly wage is constructed
as annual earnings divided by annual hours (usual hours per week times number of weeks
worked), de ated to real 1999 dollars, and observations with real wage in the range ($2,
$100) are kept.
Cross-section examples use individual-level data for 2012. Panel examples use data aggregated to the state-year level for 1977 to 2012. In both cases we estimate log-wage regressions and perform inference on a generated regressor that has zero coefficient.
VIIIA. Individual-level Cross-section Data: One Sample
In our first application we use data on 65,685 individuals from the year 2012. We randomly generate a dummy "policy" variable, equal to one for one-half of the states and zero for the other half. Log-wage is regressed on this policy variable and a set of individual-level controls (age, age squared, and education in years). Estimation is by OLS, using Stata command regress, and by FGLS controlling for state-level random effects, using command xtreg, re. The policy variable is often referred to as a "placebo" treatment, and should be statistically insignificant in 95% of tests performed at significance level 0.05.
Table 1 reports the estimated coefficient of the policy variable, along with standard errors computed in several different ways. The default standard errors for OLS are misleadingly small (se = 0.0042), leading to the dummy variable being very highly statistically significant (t = 0.0226/0.0042 = 5.42) even though it was randomly generated independently of log-wage. The White heteroskedastic-robust standard errors, from regress option vce(robust), are no better. These standard errors should not be used if $N_g$ is small, see Section IVB, but here $N_g$ is large. The cluster-robust standard error (se = 0.0217), from option vce(cluster state), is 5.5 times as large, however, leading to the more sensible result that the regressor is statistically insignificant (t = 1.04). In results not presented in Table 1, the cluster-robust standard errors of age, age squared and education were, respectively, 1.2, 1.2 and 2.3 times the default, so again ignoring clustering understates the standard errors. A pairs cluster bootstrap (without asymptotic refinement), from option vce(boot, cluster(state)), yields a very similar standard error, as expected.
For FGLS the cluster-robust standard error, computed using (15) or by a pairs cluster bootstrap, is fairly close to the default standard error computed using (14). This suggests that the random effects model captures the error correlation well.

This example illustrates that clustering can make a big difference even when errors are only weakly correlated within cluster (the intraclass correlation of the residuals in this application is 0.018), if additionally the regressor is highly correlated within cluster (here perfectly correlated within cluster) and cluster sizes are large (ranging here from 519 to 5866). The cluster-robust OLS and FGLS standard errors are very similar (0.0217 in both cases), so in this example random effects FGLS led to no improvement in efficiency.
Note that formula (6) suggests that the cluster-robust standard errors are 4.9 times the default ($\sqrt{1 + 1 \times 0.018 \times (65685/51 - 1)} = 4.9$), close to the observed multiple of 5.5. Formula (6) may work especially well in this example as taking the natural logarithm of wage leads to model error that is close to homoskedastic, and equicorrelation is a good error model for individuals clustered in regions.
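The back-of-the-envelope check via formula (6) uses only numbers reported in the text; the sketch below simply redoes that arithmetic.

```python
import math

# Formula (6) inflation factor: sqrt(1 + rho_x * rho_u * (N_bar - 1)), with
# rho_x = 1 (policy constant within state), rho_u = 0.018, N_bar = 65685/51
rho_x, rho_u = 1.0, 0.018
n_bar = 65685 / 51
multiplier = math.sqrt(1 + rho_x * rho_u * (n_bar - 1))
print(round(multiplier, 1))   # 4.9, versus the observed multiple of 5.5
```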
To illustrate the potential pitfalls of pairs cluster bootstrapping for standard errors with few clusters, discussed in Section VIC, we examine a modification with six randomly selected states broken into treated (AZ, LA, MD) and control (DE, PA, UT). For these states, we estimate a model similar to that in Table 1. Then $\hat{\beta} = 0.0373$ with default se = 0.0128. We then perform 1000 replications, resampling the six states with replacement, and perform a pairs cluster bootstrap in each replication. The bootstrap se = 0.0622 is not too different from the cluster-robust se = 0.0577. But several problems arise. First, 38 replications cannot be estimated, presumably due to no variation in treatment in the bootstrap samples. Second, a kernel density estimate of the bootstrapped $\hat{\beta}$s reveals that their distribution is very multi-modal and has limited density near the middle of the distribution. And there are many outliers: the 5th percentile of bootstrap replicates is $-0.1010$, and the 95th percentile is 0.1032. These percentiles are much larger in magnitude than would be expected based on the CRVE. Considering these results, we would not feel comfortable using the pairs cluster bootstrap in this dataset with these few clusters.
VIIIB. Individual-level Cross-section Data: Monte Carlo
We next perform a Monte Carlo exercise based on the same regression, with 1000 replications. In each replication, we generate a dataset by sampling (with replacement) states and all their associated observations. For quicker computation of the Monte Carlo simulation, we draw states from a 3% sample of individuals within each state, so there are on average approximately 40 observations per cluster. We explore the effect of the number of clusters G by performing varying simulations with G in {6, 10, 20, 30, 50}. Given a sample of states, we assign a dummy "policy" variable to one-half of the states. We run OLS regressions of log-wage on the policy variable and the same controls as in Table 1.

In these simulations we perform tests of the null hypothesis that the slope coefficient of the policy variable is zero. Table 2 presents rejection rates that with millions of replications should all be 0.05, since we are testing a true hypothesis at a nominal 5% level. Because only 1,000 simulations are performed, we expect that 95% of these simulations will yield estimated test size in the range (0.036, 0.064) if the true test size is 0.05.
We begin with a lengthy discussion of the last column, which shows results for G = 50. Rows 1-9 report Wald tests based on $t = \hat{\beta}/se$ where se is computed in various ways, while rows 10-13 report results of bootstraps with an asymptotic refinement.

Rows 1-3 are obtained by standard Stata commands: row 1 by regress, vce(robust); row 2 by xtreg, vce(robust) after xtset state; and row 3 by regress, vce(cluster state). Ignoring clustering leads to great over-rejection. The problem is that standard errors that (correctly) control for clustering are 1.3 times larger: using formula (6) for a 3% sample yields $\sqrt{1 + 1 \times 0.018 \times (0.03 \times 65685/51 - 1)} = 1.30$. So t = 1.96 using the heteroskedastic-robust standard error is really t = 1.96/1.3 = 1.51 and, using standard normal critical values, an apparent p = 0.05 is really p = 0.13. Rows 2 and 3 both use the CRVE, but with different critical values. Using T(G-1) in row 3 leads to a rejection rate that is closer to 0.05 than does using N[0,1] in row 2. Using T(G-2) in row 4, suggested by the study of Donald and Lang (2007), leads to slight further improvement, but there is still over-rejection.
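The arithmetic behind this over-rejection claim can be verified with the standard normal CDF, computed here via the error function.

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

t_true = 1.96 / 1.3               # the "real" t once SEs are inflated 1.3x
p_true = 2 * (1 - norm_cdf(t_true))
print(round(t_true, 2), round(p_true, 2))   # 1.51 0.13
```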
Rows 5 and 6 use the residual bias adjustments CR2 and CR3 discussed in Section VIB, along with T(G-1) critical values. This leads to further decrease in the rejection rates, towards the desired 0.05.

Rows 7 and 8 combine the residual bias adjustment CR2 with the data-determined degrees-of-freedom of Section VID. For these data the Imbens and Kolesar (2013) measure v in (26), denoted IK, and the Carter, Schnepel and Steigerwald (2013) measure $G^*$ in (27) both equal 17 on average when G = 50 (see rows 14 and 17). This leads to further improvement in the Monte Carlo rejection rate.
Row 9 uses bootstrap standard errors obtained by a pairs cluster bootstrap, implemented using command regress, vce(boot, cluster(state)). The rejection rate is essentially the same as that in row 3, as expected since this bootstrap has no asymptotic refinement.

Rows 10-13 implement the various percentile-t bootstraps with asymptotic refinement presented in Section VIC. These lead to mild over-rejection, with rejection rate of 0.06 rather than 0.05. Row 10 can be computed using the bootstrap: command, see our posted code, while rows 11-13 require additional coding. Only 399 bootstraps need be used here, as any consequent bootstrap simulation error averages out over the many Monte Carlo replications. But if these bootstraps were used just once, as in an empirical study, a percentile-t bootstrap should use at least 999 replications.
As we examine settings with fewer clusters, most methods lead to rejection rates even further away from the nominal test size of 0.05. The methods that appear to perform best are the CR3 residual-correction with T(G-1) critical values and the CR2 residual-correction with T(v) critical values, where v is the Imbens and Kolesar (2013) calculated degrees of freedom.

Consider the case G = 6. The choice of degrees of freedom makes an enormous difference, as the critical value for a test at level 0.05 rises from 2.571 to 2.776 and 3.182 for, respectively, the T(5), T(4) and T(3) distributions, and from row 14 the IK degrees of freedom averages 3.3 across the simulations. The CSS degrees of freedom is larger as, from Section VID, it involves an approximation that only disappears as G becomes large.

It appears that using HC2 with the Imbens and Kolesar degrees of freedom does best; although with 1000 Monte Carlo replications, we cannot statistically distinguish between the results in rows 3-13.
Finally we compare the variability in cluster-robust standard errors to that for heteroskedastic-robust standard errors. We return to the full cross-section micro dataset and perform 1000 replications, resampling the 50 states with replacement. The standard deviation of the cluster-robust standard error was 12.3% of the mean cluster-robust standard error, while the standard deviation of the heteroskedastic-robust standard error was 4.5% of its mean. So although the CRVE is less biased than heteroskedastic-robust (or default), it is also more variable. But this variability is still relatively small, especially compared to the very large bias if clustering is not controlled for.
VIIIC. State-Year Panel Data: One Sample

We next turn to a panel difference-in-difference application motivated by Bertrand, Duflo, and Mullainathan (2004). The model estimated for 51 states from 1977 to 2012 is

$y_{ts} = \alpha_s + \delta_t + \beta d_{ts} + u_{ts}, \qquad (33)$

where $y_{ts}$ is the average log-wage in year $t$ and state $s$, $\alpha_s$ and $\delta_t$ are state and year dummies, and $d_{ts}$ is a random "policy" variable that turns on and stays on for the last 18 years. Here G = 51, T = 36 and N = 1836. To speed up bootstraps, and to facilitate computation of the CR2 residual adjustment, we partial out the state and year fixed effects and regress (without intercept) $\hat{u}_{y,ts}$ on $\hat{u}_{d,ts}$, where $\hat{u}_{y,ts}$ (or $\hat{u}_{d,ts}$) is the residual from OLS regression of $y_{ts}$ (or $d_{ts}$) on state and year fixed effects.
Table 3 presents results for the policy dummy regressor, which should have coefficient zero since it is randomly assigned.

We begin with model 1, OLS controlling for state and year fixed effects. Using default or White-robust standard errors (rows 1-2) leads to a standard error of 0.0042 that is much smaller than the cluster-robust standard error of 0.0153 (row 3), where clustering is on state. Similar standard errors are obtained using the CR2 correction and bootstrap without asymptotic refinement (rows 3-6). Note that even after controlling for state fixed effects, the default standard errors were greatly under-estimated, with cluster-robust standard errors 3.6 times larger.
The second column of results for model 1 gives p-values for tests of the null hypothesis that $\beta = 0$. Default and heteroskedastic-robust standard errors lead to erroneously large t-statistics (of 4.12 and 4.10), p = 0.000, and hence false rejections of the null hypothesis. Using standard errors that control for clustering (rows 3-6) leads to $p \approx 0.2$, so that the null is not rejected. Note that here the IK and CSS degrees of freedom are calculated to be G-1, an artifact of having balanced clusters and a single regressor that is invariant within cluster. Rows 7-9 report p-values from several percentile-t bootstraps that again lead to non-rejection of $H_0: \beta = 0$.
Model 2 drops state fixed effects from the model (33). Then the cluster-robust standard error (row 3) is 0.0276, compared to 0.0153 when state fixed effects are included (Model 1). So inclusion of state fixed effects does lead to more precise parameter estimates. But even with state fixed effects included (model 1) there is still within-cluster correlation of the error, so that one still needs to use cluster-robust standard errors. In fact in both Models 1 and 2 the cluster-robust standard errors are approximately 3.7 times the default.

Model 3 estimates the model with FGLS, allowing for an AR(1) process for the errors. Since there are 36 years of data, the bias correction of Hansen (2007b), see Section IIIC, will make little difference. For this column we control for state and year fixed effects, similar to OLS in Model 1. The cluster-robust standard errors (row 3) are much smaller for FGLS (0.0101) than for OLS (0.0153), indicating improved statistical precision. For FGLS there is some difference between default standard errors (0.0070) and cluster-robust standard errors (0.0101), suggesting that an AR(1) model is not the perfect model for the within-state error correlation over time.
VIIID. State-Year Panel Data: Monte Carlo

We next embed the panel data example in a Monte Carlo simulation for the OLS estimator, with 1000 replications. In each simulation, we draw a random set of G states (with replacement). When a state is drawn, we take all years of data for that state. We then assign our DiD "policy" variable to half the states. As for Table 2, we examine varying numbers of states, ranging from six to fifty. In these simulations we perform tests of the null hypothesis that the slope coefficient of the policy variable is zero. Again 95% of exercises such as this should yield an estimated test size in the range (0.036, 0.064).

We begin with the last column of the table, with G = 50 states. All tests aside from that based on default standard errors (row 1) have rejection rates that are not appreciably different from 0.05, once we allow for simulation error.

As the number of clusters decreases it becomes clear that one should use the T(G-1) or T(G-2) distribution for critical values, and even this leads to over-rejection with low G. The pairs cluster percentile-t bootstrap fails with few clusters, with a rejection rate of only 0.01 when G = 6. For low G, the wild cluster percentile-t bootstrap has similar results with either 2-point or 6-point weights, with some over-rejection.
IX. Concluding Thoughts
It is important to aim for correct statistical inference. Many empirical applications feature the potential for errors to be correlated within clusters, and we need to make sure our inference accounts for this. Often this is straightforward to do using traditional cluster-robust variance estimators, but sometimes things can be tricky. The leading difficulties are (1) determining how to define the clusters, and (2) dealing with few clusters; but other complications can arise as well. When faced with these difficulties, there is no simple hard and fast rule regarding how to proceed. You need to think carefully about the potential for correlations in your residuals, and how that interacts with correlations in your covariates. In this essay we have aimed to present the current leading set of tools available to practitioners to deal with clustering issues.
X. References
Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2010. "Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California's Tobacco Control Program." Journal of the American Statistical Association 105(490): 493-505.
Acemoglu, Daron, and Jörn-Steffen Pischke. 2003. "Minimum Wages and On-the-job Training." Research in Labor Economics 22: 159-202.
Andrews, Donald W. K. and James H. Stock. 2007. \Inference with Weak Instruments." In Advances in Economics and Econometrics, Theory and Applications: Ninth World
Congress of the Econometric Society, Vol. III, ed. Richard Blundell, Whitney K. Newey,
and T. Persson, Ch.3. Cambridge: Cambridge University Press.
Angrist, Joshua D., and Victor Lavy. 2002. "The Effect of High School Matriculation Awards: Evidence from Randomized Trials." American Economic Review 99: 1384-1414.
Angrist, Joshua D., and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An Empiricist's Companion. Princeton: Princeton University Press.
Arellano, Manuel. 1987. "Computing Robust Standard Errors for Within-Group Estimators." Oxford Bulletin of Economics and Statistics 49: 431-434.
Barrios, Thomas, Rebecca Diamond, Guido W. Imbens, and Michal Kolesar. 2012. "Clustering, Spatial Correlations and Randomization Inference." Journal of the American Statistical Association 107(498): 578-591.
Baum, Christopher F., Mark E. Schaffer, and Steven Stillman. 2007. "Enhanced Routines for Instrumental Variables/GMM Estimation and Testing." The Stata Journal 7(4): 465-506.
Bell, Robert M., and Daniel F. McCaffrey. 2002. "Bias Reduction in Standard Errors for Linear Regression with Multi-Stage Samples." Survey Methodology 28: 169-179.
Bertrand, Marianne, Esther Duflo, and Sendhil Mullainathan. 2004. "How Much Should We Trust Differences-in-Differences Estimates?" Quarterly Journal of Economics 119: 249-275.
Bester, C. Alan, Timothy G. Conley, and Christian B. Hansen. 2011. "Inference with Dependent Data Using Cluster Covariance Estimators." Journal of Econometrics 165: 137-151.
Bhattacharya, Debopam. 2005. \Asymptotic Inference from Multi-stage Samples." Journal of Econometrics 126: 145-171.
Bound, John, David A. Jaeger, and Regina M. Baker. 1995. \Problems with Instrumental
Variables Estimation when the Correlation Between the Instruments and the Endogenous
Explanatory Variable is Weak." Journal of the American Statistical Association 90: 443-450.
Brewer, Mike, Thomas F. Crossley, and Robert Joyce. 2013. "Inference with Differences-in-Differences Revisited." Unpublished.
Cameron, A. Colin, Jonah G. Gelbach, and Douglas L. Miller. 2006. \Robust Inference
with Multi-Way Clustering." NBER Technical Working Paper 0327.
———. 2008. "Bootstrap-Based Improvements for Inference with Clustered Errors." Review of Economics and Statistics 90: 414-427.
———. 2011. "Robust Inference with Multi-Way Clustering." Journal of Business and Economic Statistics 29(2): 238-249.
Cameron, A. Colin, and Douglas L. Miller. 2011. \Robust Inference with Clustered
Data." In Handbook of Empirical Economics and Finance. ed. Aman Ullah and David E.
Giles, 1-28. Boca Raton: CRC Press.
———. 2012. "Robust Inference with Dyadic Data: with Applications to Country-pair International Trade." University of California - Davis. Unpublished.
Cameron, A. Colin, and Pravin K. Trivedi. 2005. Microeconometrics: Methods and
Applications. Cambridge University Press.
———. 2009. Microeconometrics using Stata. College Station, TX: Stata Press.
Carter, Andrew V., Kevin T. Schnepel, and Douglas G. Steigerwald. 2013. \Asymptotic
Behavior of a t Test Robust to Cluster Heterogeneity." University of California - Santa
Barbara. Unpublished.
Cheng, Cheng, and Mark Hoekstra. 2013. \Pitfalls in Weighted Least Squares Estimation: A Practitioner's Guide." Texas A&M University. Unpublished.
Chernozhukov, Victor, and Christian Hansen. 2008. \The Reduced Form: A Simple
Approach to Inference with Weak Instruments." Economics Letters 100: 68-71.
Conley, Timothy G. 1999. \GMM Estimation with Cross Sectional Dependence." Journal
of Econometrics 92: 1-45.
Conley, Timothy G., and Christopher R. Taber. 2011. "Inference with `Difference in Differences' with a Small Number of Policy Changes." Review of Economics and Statistics 93(1): 113-125.
Davidson, Russell, and Emmanuel Flachaire. 2008. "The Wild Bootstrap, Tamed at Last." Journal of Econometrics 146: 162-169.
Davis, Peter. 2002. \Estimating Multi-Way Error Components Models with Unbalanced
Data Structures." Journal of Econometrics 106: 67-95.
Donald, Stephen G., and Kevin Lang. 2007. "Inference with Difference-in-Differences and Other Panel Data." Review of Economics and Statistics 89: 221-233.
Driscoll, John C., and Aart C. Kraay. 1998. \Consistent Covariance Matrix Estimation
with Spatially Dependent Panel Data." Review of Economics and Statistics 80: 549-560.
Duflo, Esther, Rachel Glennerster and Michael Kremer. 2007. "Using Randomization in Development Economics Research: A Toolkit." In Handbook of Development Economics, Vol. 4, ed. Dani Rodrik and Mark Rosenzweig, 3895-3962. Amsterdam: North-Holland.
Fafchamps, Marcel, and Flore Gubert. 2007. \The Formation of Risk Sharing Networks."
Journal of Development Economics 83: 326-350.
Fernandez-Val, Ivan. 2009. "Fixed Effects Estimation of Structural Parameters and Marginal Effects in Panel Probit Models." Journal of Econometrics 150: 70-85.
Finlay, Keith, and Leandro M. Magnusson. 2009. "Implementing Weak Instrument Robust Tests for a General Class of Instrumental-Variables Models." Stata Journal 9: 398-421.
Fitzsimons, Emla, Bansi Malde, Alice Mesnard, and Marcos Vera-Hernandez. 2012.
\Household Responses to Information on Child Nutrition: Experimental Evidence from
Malawi." IFS Working Paper W12/07.
Foote, Christopher L. 2007. \Space and Time in Macroeconomic Panel Data: Young
Workers and State-Level Unemployment Revisited." Working Paper 07-10, Federal Reserve
Bank of Boston.
Greenwald, Bruce C. 1983. "A General Analysis of Bias in the Estimated Standard Errors of Least Squares Coefficients." Journal of Econometrics 22: 323-338.
Hansen, Christian. 2007a. "Asymptotic Properties of a Robust Variance Matrix Estimator for Panel Data when T is Large." Journal of Econometrics 141: 597-620.
Hansen, Christian. 2007b. "Generalized Least Squares Inference in Panel and Multi-level Models with Serial Correlation and Fixed Effects." Journal of Econometrics 140: 670-694.
Hausman, Jerry, and Guido Kuersteiner. 2008. "Difference in Difference Meets Generalized Least Squares: Higher Order Properties of Hypotheses Tests." Journal of Econometrics 144: 371-391.
Hemming, Karla, and Jen Marsh. 2013. "A Menu-driven Facility for Sample-size Calculations in Cluster Randomized Controlled Trials." Stata Journal 13: 114-135.
Hersch, Joni. 1998. "Compensating Wage Differentials for Gender-Specific Job Injury Rates." American Economic Review 88: 598-607.
Hoechle, Daniel. 2007. "Robust Standard Errors for Panel Regressions with Cross-sectional Dependence." Stata Journal 7(3): 281-312.
Hoxby, Caroline, and M. Daniele Paserman. 1998. "Overidentification Tests with Group Data." NBER Technical Working Paper 0223.
Ibragimov, Rustam, and Ulrich K. Müller. 2010. "T-Statistic Based Correlation and Heterogeneity Robust Inference." Journal of Business and Economic Statistics 28(4): 453-468.
Imbens, Guido W., and Michal Kolesar. 2012. \Robust Standard Errors in Small Samples: Some Practical Advice." NBER Working Paper 18478.
Inoue, Atsushi, and Gary Solon. 2006. \A Portmanteau Test for Serially Correlated
Errors in Fixed E ects Models." Econometric Theory 22: 835-851.
Kezdi, Gabor. 2004. \Robust Standard Error Estimation in Fixed-E ects Panel Models."
Hungarian Statistical Review Special Number 9: 95-116.
King, Miriam, Steven Ruggles, J. Trent Alexander, Sarah Flood, Katie Genadek, Matthew
B. Schroeder, Brandon Trampe, and Rebecca Vick. 2010. Integrated Public Use Microdata
Series, Current Population Survey: Version 3.0. [Machine-readable database]. Minneapolis:
University of Minnesota.
Kish, Leslie. 1965. Survey Sampling. New York: John Wiley.
Kish, Leslie, and Martin R. Frankel. 1974. \Inference from Complex Surveys with
Discussion." Journal Royal Statistical Society B 36: 1-37.
Klein, Patrick, and Andres Santos. 2012. \A Score Based Approach to Wild Bootstrap
Inference." Journal of Econometric Methods:1(1): 23-41.
Kloek, T. 1981. "OLS Estimation in a Model where a Microvariable is Explained by Aggregates and Contemporaneous Disturbances are Equicorrelated." Econometrica 49: 205-207.
Liang, Kung-Yee, and Scott L. Zeger. 1986. "Longitudinal Data Analysis Using Generalized Linear Models." Biometrika 73: 13-22.
MacKinnon, James G., and Halbert White. 1985. "Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties." Journal of Econometrics 29: 305-325.
MacKinnon, James, and Matthew D. Webb. 2013. "Wild Bootstrap Inference for Wildly Different Cluster Sizes." Queen's Economics Department Working Paper No. 1314.
McCaffrey, Daniel F., Robert M. Bell, and Carsten H. Botts. 2001. "Generalizations of Bias Reduced Linearization." Proceedings of the Survey Research Methods Section, American Statistical Association.
Miglioretti, D. L., and P. J. Heagerty. 2006. "Marginal Modeling of Nonnested Multilevel Data using Standard Software." American Journal of Epidemiology 165: 453-463.
Moulton, Brent R. 1986. "Random Group Effects and the Precision of Regression Estimates." Journal of Econometrics 32: 385-397.
Moulton, Brent R. 1990. "An Illustration of a Pitfall in Estimating the Effects of Aggregate Variables on Micro Units." Review of Economics and Statistics 72: 334-338.
Newey, Whitney K., and Kenneth D. West. 1987. "A Simple, Positive Semi-Definite, Heteroscedasticity and Autocorrelation Consistent Covariance Matrix." Econometrica 55: 703-708.
Petersen, Mitchell A. 2009. "Estimating Standard Errors in Finance Panel Data Sets: Comparing Approaches." Review of Financial Studies 22: 435-480.
Pfeffermann, Daniel, and Gad Nathan. 1981. "Regression Analysis of Data from a Cluster Sample." Journal American Statistical Association 76: 681-689.
Rabe-Hesketh, Sophia, and Anders Skrondal. 2012. Multilevel and Longitudinal Modeling Using Stata, Volumes I and II, Third Edition. College Station, TX: Stata Press.
Rogers, William H. 1993. "Regression Standard Errors in Clustered Samples." Stata Technical Bulletin 13: 19-23.
Satterthwaite, F. E. 1946. "An Approximate Distribution of Estimates of Variance Components." Biometrics Bulletin 2(6): 110-114.
Schaffer, Mark E., and Steven Stillman. 2010. "xtoverid: Stata Module to Calculate Tests of Overidentifying Restrictions after xtreg, xtivreg, xtivreg2 and xthtaylor." http://ideas.repec.org/c/boc/bocode/s456779.html
Scott, A. J., and D. Holt. 1982. "The Effect of Two-Stage Sampling on Ordinary Least Squares Methods." Journal American Statistical Association 77: 848-854.
Shah, Babubhai V., M. M. Holt, and Ralph E. Folsom. 1977. "Inference About Regression Models from Sample Survey Data." Bulletin of the International Statistical Institute Proceedings of the 41st Session 47(3): 43-57.
Shore-Sheppard, L. 1996. "The Precision of Instrumental Variables Estimates with Grouped Data." Princeton University Industrial Relations Section Working Paper 374.
Solon, Gary, Steven J. Haider, and Jeffrey Wooldridge. 2013. "What Are We Weighting For?" NBER Working Paper 18859.
Staiger, Douglas, and James H. Stock. 1997. "Instrumental Variables Regression with Weak Instruments." Econometrica 65: 557-586.
Stock, James H., and Mark W. Watson. 2008. "Heteroskedasticity-robust Standard Errors for Fixed Effects Panel Data Regression." Econometrica 76: 155-174.
Thompson, Samuel. 2006. "Simple Formulas for Standard Errors that Cluster by Both Firm and Time." SSRN paper. http://ssrn.com/abstract=914002.
Thompson, Samuel. 2011. "Simple Formulas for Standard Errors that Cluster by Both Firm and Time." Journal of Financial Economics 99(1): 1-10.
Webb, Matthew D. 2013. "Reworking Wild Bootstrap Based Inference for Clustered Errors." Queen's Economics Department Working Paper 1315.
White, Halbert. 1980. "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity." Econometrica 48: 817-838.
White, Halbert. 1984. Asymptotic Theory for Econometricians. San Diego: Academic Press.
Wooldridge, Jeffrey M. 2003. "Cluster-Sample Methods in Applied Econometrics." American Economic Review 93: 133-138.
Wooldridge, Jeffrey M. 2006. "Cluster-Sample Methods in Applied Econometrics: An Extended Analysis." Michigan State University. Unpublished.
Wooldridge, Jeffrey M. 2010. Econometric Analysis of Cross Section and Panel Data. Cambridge, MA: MIT Press.
Yoon, Jungmo, and Antonio Galvao. 2013. "Robust Inference for Panel Quantile Regression Models with Individual Effects and Serial Correlation." Unpublished.
Table 1
Impacts of clustering and estimator choices on estimated coefficients and standard errors
Cross-section individual data with randomly-assigned state-level variable

Estimation Method                      OLS            FGLS (Random Effects)
Slope coefficient                      -0.0226        -0.0011
Standard Errors
  Default                              0.0042         0.0199
  Heteroscedastic Robust               0.0042         -
  Cluster Robust (cluster on State)    0.0217         0.0217
  Pairs cluster bootstrap              0.0215         0.0227
Number observations                    65685          65685
Number clusters (states)               51             51
Cluster size range                     519 to 5866    519 to 5866
Intraclass correlation                 0.018          -

Notes: March 2012 CPS data, from IPUMS download. Default standard errors for OLS assume errors are iid; default standard errors for FGLS assume the Random Effects model is correctly specified. 399 Bootstrap replications. A fixed effect model is not possible, since the regressor is invariant within states.
Table 2 - Cross-section individual level data
Monte Carlo rejection rates of true null hypothesis (slope = 0) with different number of clusters and different rejection methods
Nominal 5% rejection rates

                                                                           Numbers of Clusters
                                                                           6       10      20      30      50
Wald Tests
 1 White Robust, T(N-k) for critical value                                 0.165   0.174   0.172   0.181   0.176
 2 Cluster on state, N(0,1) for critical value                             0.213   0.130   0.091   0.098   0.080
 3 Cluster on state, T(G-1) for critical value                             0.124   0.094   0.075   0.080   0.070
 4 Cluster on state, T(G-2) for critical value                             0.108   0.090   0.075   0.079   0.070
 5 Cluster on state, CR2 bias correction, T(G-1) for critical value        0.089   0.075   0.066   0.071   0.065
 6 Cluster on state, CR3 bias correction, T(G-1) for critical value        0.051   0.058   0.047   0.061   0.063
 7 Cluster on state, CR2 bias correction, IK degrees of freedom            0.060   0.056   0.045   0.056   0.055
 8 Cluster on state, CR2 bias correction, T(CSS effective # clusters)      0.118   0.077   0.055   0.062   0.060
 9 Pairs cluster bootstrap for standard error, T(G-1) for critical value   0.090   0.063   0.066   0.070   0.072
Bootstrap Percentile-T methods
10 Pairs cluster bootstrap                                                 0.019   0.037   0.043   0.069   0.057
11 Wild cluster bootstrap, Rademacher 2 point distribution                 0.081   0.062   0.050   0.068   0.055
12 Wild cluster bootstrap, Webb 6 point distribution                       0.087   0.063   0.058   0.064   0.055
13 Wild cluster bootstrap, Rademacher 2 pt, do not impose null hypothesis  0.085   0.076   0.060   0.073   0.062
14 IK effective DOF (mean)                                                 3.3     5.5     9.6     12.6    17.1
15 IK effective DOF (5th percentile)                                       2.7     4.1     5.3     6.7     9.7
16 IK effective DOF (95th percentile)                                      3.8     6.9     14.3    20.3    30.4
17 CSS effective # clusters (mean)                                         4.6     6.6     10.2    12.9    17.1

Notes: Data drawn from March 2012 CPS data, 3% sample from IPUMS download (later version to use a larger data set). 1000 Monte Carlo replications (later version to have more reps). 399 Bootstrap replications. "IK effective DOF" from Imbens and Kolesar (2013), and "CSS effective # clusters" from Carter, Schnepel and Steigerwald (2013), see section x.x.
Table 3 - State-year panel data with differences-in-differences estimation
Impacts of clustering and estimation choices on estimated coefficients and standard errors

Model:                                                      OLS-FE            OLS, no (state)   FGLS (AR(1)
                                                                              fixed effects     within each state)
                                                            SE      p-value   SE      p-value   SE      p-value
Slope coefficient                                           -0.0198           -0.0072           0.0037
Standard Errors (p-values for test of statistical significance)
 1 Default standard errors, T(N-k) for critical value       0.0042  0.0000    0.0074  0.3328    0.0070  0.5967
 2 White Robust, T(N-k) for critical value                  0.0042  0.0000    0.0067  0.2845    -       -
 3 Cluster on state, T(G-1) for critical value              0.0153  0.2016    0.0276  0.7959    0.0101  0.7142
 4 Cluster on state, CR2 bias correction, T(G-1)            0.0153  0.2017    0.0276  0.7958    -       -
 5 Cluster on state, CR2 bias correction, IK DOF            0.0153  0.2017    0.0276  0.7959    -       -
 6 Pairs cluster bootstrap for standard error, T(G-1)       0.0149  0.1879    0.0278  0.7977    0.0102  0.7172
Bootstrap Percentile-T methods (p-values only)
 7 Pairs cluster bootstrap                                          0.1598            0.7832            -
 8 Wild cluster bootstrap, Rademacher 2 point distribution          0.7353            0.9590            -
 9 Wild cluster bootstrap, Webb 6 point distribution                0.7413            0.9590            -
10 Imbens-Kolesar effective DOF                             50                50                -
11 Carter-Schnepel-Steigerwald effective # clusters         51                51                -
Number observations                                         1836              1836              1836
Number clusters (states)                                    51                51                51
Table 4 - State-year panel data with differences-in-differences estimation
Monte Carlo rejection rates of true null hypothesis (slope = 0) with different # clusters and different rejection methods
Nominal 5% rejection rates

                                                                           Numbers of Clusters
                                                                           6       10      20      30      50
Wald Tests
 1 Default standard errors, T(N-k) for critical value                      0.604   0.576   0.589   0.573   0.611
 2 Cluster on state, N(0,1) for critical value                             0.136   0.104   0.054   0.042   0.055
 3 Cluster on state, T(G-1) for critical value                             0.077   0.063   0.042   0.034   0.049
 4 Cluster on state, T(G-2) for critical value                             0.062   0.056   0.041   0.034   0.049
 5 Pairs cluster bootstrap for standard error, T(G-1) for critical value   0.075   0.077   0.047   0.043   0.056
Bootstrap Percentile-T methods
 6 Pairs cluster bootstrap                                                 0.010   0.018   0.048   0.036   0.061
 7 Wild cluster bootstrap, Rademacher 2 point distribution                 0.080   0.065   0.043   0.037   0.053
 8 Wild cluster bootstrap, Webb 6 point distribution                       0.083   0.060   0.045   0.035   0.058

Notes: Outcome is mean log wages from March CPS data, 1977-2012. Models include state and year fixed effects, and a "fake policy" dummy variable which turns on starting at 1986, for a random subset of half of the states. 1000 Monte Carlo replications (later version to have more reps). 399 Bootstrap replications.
Exhibit 8
NBER WORKING PAPER SERIES
WHAT ARE WE WEIGHTING FOR?
Gary Solon
Steven J. Haider
Jeffrey Wooldridge
Working Paper 18859
http://www.nber.org/papers/w18859
NATIONAL BUREAU OF ECONOMIC RESEARCH
1050 Massachusetts Avenue
Cambridge, MA 02138
February 2013
We are grateful for helpful comments from David Autor, Raj Chetty, John DiNardo, Todd Elder, Mike
Elsby, Osborne Jackson, Fabian Lange, Jason Lindo, Jim Ziliak, and seminar participants at the University
of Kentucky. The views expressed herein are those of the authors and do not necessarily reflect the
views of the National Bureau of Economic Research.
NBER working papers are circulated for discussion and comment purposes. They have not been peer-reviewed or been subject to the review by the NBER Board of Directors that accompanies official
NBER publications.
© 2013 by Gary Solon, Steven J. Haider, and Jeffrey Wooldridge. All rights reserved. Short sections
of text, not to exceed two paragraphs, may be quoted without explicit permission provided that full
credit, including © notice, is given to the source.
What Are We Weighting For?
Gary Solon, Steven J. Haider, and Jeffrey Wooldridge
NBER Working Paper No. 18859
February 2013
JEL No. C1
ABSTRACT
The purpose of this paper is to help empirical economists think through when and how to weight the
data used in estimation. We start by distinguishing two purposes of estimation: to estimate population
descriptive statistics and to estimate causal effects. In the former type of research, weighting is called
for when it is needed to make the analysis sample representative of the target population. In the latter
type, the weighting issue is more nuanced. We discuss three distinct potential motives for weighting
when estimating causal effects: (1) to achieve precise estimates by correcting for heteroskedasticity,
(2) to achieve consistent estimates by correcting for endogenous sampling, and (3) to identify average
partial effects in the presence of unmodeled heterogeneity of effects. In each case, we find that the
motive sometimes does not apply in situations where practitioners often assume it does. We recommend
diagnostics for assessing the advisability of weighting, and we suggest methods for appropriate inference.
Gary Solon
Department of Economics
Michigan State University
East Lansing, MI 48824-1038
and NBER
solon@msu.edu
Steven J. Haider
Department of Economics
Michigan State University
101 Marshall Hall
East Lansing, MI 48824
haider@msu.edu
Jeffrey Wooldridge
Department of Economics
Michigan State University
wooldri1@msu.edu
What Are We Weighting For?
I. Introduction
At the beginning of their textbook’s section on weighted estimation of regression
models, Angrist and Pischke (2009, p. 91) acknowledge, “Few things are as confusing to
applied researchers as the role of sample weights. Even now, 20 years post-Ph.D., we
read the section of the Stata manual on weighting with some dismay.” After years of
discussing weighting issues with fellow economic researchers, we know that Angrist and
Pischke are in excellent company. In published research, top-notch empirical scholars
make conflicting choices about whether and how to weight, and often provide little or no
rationale for their choices. And in private discussions, we have found that accomplished
researchers sometimes own up to confusion or declare demonstrably faulty reasons for
their weighting choices.
Our purpose in writing this paper is to dispel confusion and dismay by clarifying
the issues surrounding weighting. Our central theme is that the confusion stems from a
lack of clarity about which among multiple potential motives for weighting pertains to
the research project at hand. Once one specifies the particular motive for weighting, it
becomes straightforward to consider whether the purpose for weighting really does apply,
to use appropriate diagnostics to check whether it does, and then to proceed with
appropriate estimation and inference methods. Hence the title of our paper: “What Are
We Weighting For?”
In the next section, we pose a prior question: “What Are We Trying to Estimate?”
In some projects, the purpose is to estimate descriptive statistics for a particular
population. In those cases, whether weighting is called for depends simply on whether
weighting is necessary to make the analysis sample representative of the target
population. But in many other projects, the purpose is to estimate causal effects. In those
cases, the weighting issue becomes more nuanced.
In Sections III, IV, and V, we successively discuss three distinct potential motives
for weighting when estimating causal effects: (1) to achieve more precise estimates by
correcting for heteroskedasticity, (2) to achieve consistent estimates by correcting for
endogenous sampling, and (3) to identify average partial effects in the presence of
heterogeneous effects. 1 In each case, after explaining the potential relevance of the
motive, we will note that the motive sometimes does not apply in situations where
practitioners often assume it does. We will recommend diagnostics for assessing the
advisability of weighting, and we will suggest methods for appropriate inference. In
Section VI, we will summarize our analysis and our recommendations for empirical
practice.
II. What Are We Trying to Estimate?
A. Descriptive Statistics for a Population
Sometimes the purpose of a research project is to estimate descriptive statistics of
interest for a population. Consider, for example, the 1967 poverty rate for the United
States, which was officially measured as 13 percent based on the Current Population
Survey (U.S. Bureau of the Census, 1968). But suppose that one sought to estimate that
rate on the basis of the reports of 1967 income in the first wave of the Panel Study of
Income Dynamics (PSID) in 1968. The complication is that the PSID began with a
sample that purposefully overrepresented low-income households by incorporating a
supplementary sample drawn from households that had reported low income to the
Survey of Economic Opportunity in 1967. As in other surveys that purposefully sample
with different probabilities from different parts of the population, the point of the
oversampling was to obtain more precise information on a subpopulation of particular
interest, in this case the low-income population. 2

Footnote 1: The use of propensity-score weighting to control for covariates when estimating treatment effects is discussed elsewhere in this symposium by Imbens ( ).
If one estimated the 1967 poverty rate for the United States population with the
poverty rate for the full PSID sample, without any weighting to adjust for the low-income
oversample, one would estimate the U.S. poverty rate at 26 percent.3 That, of course, is
an upward-biased estimate because the PSID, by design, overrepresents the poor. But
one might achieve unbiased and consistent estimation by using the PSID sample’s
weighted poverty rate, weighting by the inverse probabilities of selection. 4 A
visualization of how this works is that the PSID sample design views the U.S. population
through a funhouse mirror that exaggerates the low-income population. Weighted
estimation views the sample through a reverse funhouse mirror that undoes the original
exaggeration. It turns out that the PSID's weighted poverty rate is 12 percent, a more
reasonable estimate than the 26 percent figure.
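The reverse-funhouse-mirror logic can be sketched in a few lines of code. The population size, poverty rate, and selection probabilities below are invented for illustration; they are not the actual PSID design.

```python
import numpy as np

# Illustrative sketch of inverse-probability weighting for a population mean.
# All numbers here are made up; they are not the actual PSID design.
rng = np.random.default_rng(0)

pop_poor = rng.random(100_000) < 0.15          # true population: 15% poor

# Oversample the poor: selected at four times the base rate.
p_select = np.where(pop_poor, 0.20, 0.05)
sampled = rng.random(pop_poor.size) < p_select

poor = pop_poor[sampled].astype(float)
weights = 1.0 / p_select[sampled]              # inverse probabilities of selection

unweighted_rate = poor.mean()                  # biased upward by the oversample
weighted_rate = np.average(poor, weights=weights)  # close to the true 15%
```

Each sampled unit stands in for 1/p units of the population, so the weighted mean undoes the deliberate exaggeration of the low-income stratum.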
The poverty-rate example illustrates the simple case of estimating a population
mean on the basis of a sample that systematically fails to represent the target population,
but can be made to represent it by weighting. Much economic research, however, seeks
to estimate more complex population statistics.

Footnote 2: Similarly, the Current Population Survey oversamples in less populous states, and the first wave of the Health and Retirement Study oversampled blacks, Mexican-Americans, and residents of Florida.

Footnote 3: This calculation is based on approximating the official poverty line by dividing the PSID-reported "needs standard" by 1.25.

Footnote 4: For simplicity, we are overlooking complications from nonresponse, including mishaps in the PSID's implementation of the low-income oversample. For discussion of the latter and further references, see Shin and Solon (2011, footnote 11).
Suppose, for example, that the
population descriptive statistic one wishes to estimate is the 1967 earnings gap between
black and white men with the same years of schooling and potential work experience (age
minus years of schooling minus 5). A typical approach is to attempt to estimate the
population linear projection of log earnings on a dummy variable that equals 1 for blacks
along with controls for years of schooling and a quartic in potential experience. 5
Now suppose that one estimates that population regression by performing
ordinary least squares (OLS) estimation of the regression of log earnings on the race
dummy, years of schooling, and a quartic in potential experience for black and white male
household heads in the PSID sample. Doing so estimates the coefficient of the dummy
variable for blacks at -0.344. Because exp (-0.344) = 0.71, this estimate seems to imply
that, among male household heads with the same education and potential experience,
blacks tended to earn only 71 percent as much as whites.
As in the example of estimating the poverty rate, however, this estimate might be
distorted by the PSID’s oversampling of low-income households, which surely must lead
to an unrepresentative sample with respect to male household heads’ earnings. But again,
one can apply a reverse funhouse mirror by using weights. In particular, instead of
applying ordinary (i.e., equally weighted) least squares to the sample regression, one can
use weighted least squares (WLS), minimizing the sum of squared residuals weighted by
the inverse probabilities of selection.
Doing so leads to an estimated coefficient of
− 0.260 for the dummy variable for blacks, implying that, among male household heads
with the same education and potential experience, blacks tended to earn 77 percent as
much as whites. This is still a large shortfall, but not as large as implied by the OLS
estimate. A likely reason is that the particular way that the PSID overrepresented the
low-income population involved an especially concentrated oversampling of low-income
households in nonmetropolitan areas of the South. The unweighted PSID therefore may
understate typical income for blacks even more than for whites.

Footnote 5: Alternatively, one could control for race differences in covariates through propensity-score weighting. See Imbens ( ) for a general discussion and Elder, Goddeeris, and Haider (2011) for an application to the black/white difference in infant mortality rates.
What both our examples have in common is that they involve estimating
descriptive statistics for a population on the basis of sample data. If the sample is
representative of the target population (the most straightforward case being a simple
random sample drawn from that population), the population statistic is consistently
estimated by the analogous sample statistic.
If the sample is systematically
unrepresentative of the population in a known manner, the population statistic generally
is not consistently estimated by the analogous sample statistic, but it can be consistently
estimated by reweighting the sample statistic with the inverse probabilities of selection. 6
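The same reweighting logic carries over to regression coefficients. A minimal sketch, on simulated data with invented coefficients and selection rates (not the PSID), runs WLS by scaling each observation by the square root of its weight:

```python
import numpy as np

# WLS with inverse-probability-of-selection weights on simulated data.
# Coefficients and sampling rates are invented for illustration.
rng = np.random.default_rng(1)
n = 50_000

black = (rng.random(n) < 0.1).astype(float)
school = rng.integers(8, 17, n).astype(float)
log_earn = 1.0 + 0.08 * school - 0.26 * black + rng.normal(0.0, 0.5, n)

# Endogenous sampling: low earners are selected at a higher rate.
p_select = np.where(log_earn < np.median(log_earn), 0.30, 0.10)
keep = rng.random(n) < p_select

X = np.column_stack([np.ones(keep.sum()), school[keep], black[keep]])
y = log_earn[keep]
w = 1.0 / p_select[keep]                  # inverse probabilities of selection

# WLS = OLS on rows scaled by sqrt(weight).
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
# beta_wls[2] is close to the true -0.26 despite the endogenous sampling.
```

Because sampling depends on the outcome, unweighted least squares would be inconsistent here, while the inverse-probability weights restore population moments in expectation.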
This point is intuitive and not at all controversial. So why does the issue of
weighting provoke confusion and dismay among economic researchers? The answer,
which will occupy the rest of this paper, is that much economic research is directed not at
estimating population descriptive statistics, but at estimating causal effects.
B. Causal Effects
In the microeconometrics textbooks of both Wooldridge (2010) and Angrist and
Pischke (2009), the very first page describes the estimation of causal effects as the
principal goal of empirical microeconomists. According to Angrist and Pischke, “In the
beginning, we should ask, What is the causal relationship of interest? Although purely
descriptive research has an important role to play, we believe that the most interesting
research in social science is about questions of cause and effect, such as the effect of
class size on children's test scores. . . ." Similarly, the first sentences in the Wooldridge
textbook are, "The goal of most empirical studies in economics and other social sciences
is to determine whether a change in one variable, say w, causes a change in another
variable, say y. For example, does having another year of education cause an increase in
monthly salary? Does reducing class size cause an improvement in student performance?
Does lowering the business property tax rate cause an increase in city economic
activity?"

Footnote 6: For a general formal demonstration, see Wooldridge (1999).
In contrast to the case of estimating population descriptive statistics, when
economists perform estimation of causal effects, the question of whether to weight the
data is complex. There are several distinct reasons that we may (or, as we will stress,
sometimes may not) prefer to use weights in our estimation. We will take up these
distinct reasons separately in each of the next three sections.
III. Correcting for Heteroskedasticity
One motivation for weighting, taught for decades in undergraduate and graduate
econometrics classes, is to correct for heteroskedastic error terms and thereby achieve
more precise estimation of coefficients in linear or nonlinear regression models of causal
effects. A nice example of this motivation comes from the literature on the impact of
unilateral divorce laws on the divorce rate. During the 1970s, many states in the United
States adopted laws allowing unilateral divorce, instead of requiring mutual consent of
both spouses. Were these laws responsible for the rise in divorce rates that occurred
during that period? In two insightful and influential articles published in the American
Economic Review, Leora Friedberg (1998) and Justin Wolfers (2006) reported
differences-in-differences estimates of the impact of unilateral divorce laws on divorce
rates. In particular, using a panel of annual state divorce rates over time, they estimated
linear regressions of the divorce rate on dummy variables for unilateral divorce laws with
controls for state fixed effects and secular time trends. Following the practice of many
other top-notch empirical economists, 7 both Friedberg and Wolfers weighted by
state/year population in the estimation of their regression models. Friedberg justified the
weighting as a correction for population-size-related heteroskedasticity in the state/year
error terms.
Table 1 presents examples from a wide set of variations on the Friedberg/Wolfers
regressions reported in Lee and Solon (2011).
The regressions are estimated with
Wolfers’s 1956-1988 data on annual divorce rates by state. The main point of Wolfers’s
article was that the short-run and long-run effects of unilateral divorce may differ, so the
regressions in Table 1 follow Wolfers in representing unilateral divorce with a set of
dummy variables for whether unilateral divorce had been in place for up to 2 years, 3-4
years, 5-6 years, …, 13-14 years, or at least 15 years. The dependent variable is the
logarithm of the annual divorce rate by state, and the regressions include controls for
state fixed effects, year fixed effects, and state-specific linear time trends.
The table’s first column follows Friedberg and Wolfers in estimating by weighted
least squares with weighting by state/year population. The second column uses ordinary
least squares, which weights all observations equally. In both instances, to maintain
agnosticism about which weighting approach – if either – comes close to producing a
homoskedastic error term, Table 1 reports standard error estimates robust to
heteroskedasticity (as well as to serial correlation over time within the same state). 8

Footnote 7: Some other prominent examples of similarly weighted estimation are Card and Krueger (1992), Autor, Katz, and Krueger (1998), Levitt (1998), Donohue and Levitt (2001), Borjas (2003), and Dehejia and Lleras-Muney (2004).
Setting aside other interesting aspects of these results (for example, the absence in
this specification of any evidence for a positive effect of unilateral divorce on divorce
rates), notice this striking pattern:
Even though Friedberg’s expressed purpose in
weighting was to improve the precision of estimation, the robust standard error estimates
are smaller for OLS than for WLS. For the estimated effects over the first eight years
after adoption of unilateral divorce, the robust standard error estimates for OLS are only
about half those for WLS. Apparently, weighting by population made the estimates much
less precise! And as discussed by Dickens (1990), this is quite a common phenomenon.
To see what’s going on here, let’s start with the classic heteroskedasticity-based
argument for weighting when the dependent variable is a group average and the averages
for different groups are based on widely varying within-group sample sizes. Suppose the
model to be estimated is
(1)   y_i = X_i β + v_i

where y_i is a group-level average outcome observed for group i and the error term is
fully independent of the explanatory variables. The group-average error term v_i equals
(1/J_i) Σ_{j=1}^{J_i} v_ij, where v_ij is the micro-level error term for individual j in group i and J_i
denotes the number of individuals observed in group i. If v_ij is independently and
identically distributed with variance σ², then elementary statistics shows that the
variance of the group-average error term v_i is σ²/J_i. Thus, if J_i varies widely across
groups (e.g., if many more individuals are observed in California than in Wyoming), the
group-average error term v_i is highly heteroskedastic. Then, as taught in almost every
introductory econometrics course, OLS estimation of β in equation (1) is inefficient and
also leads to inconsistent standard error estimation if nothing is done to correct the
standard error estimates for heteroskedasticity. The WLS estimator that applies least
squares to the reweighted equation

(2)   √J_i y_i = √J_i X_i β + √J_i v_i

is the minimum-variance linear unbiased estimator and also generates consistent standard
error estimation.

Footnote 8: Lee and Solon (2011) show that, for both the OLS and WLS results, naïve standard error estimates that correct for neither heteroskedasticity nor serial correlation are far smaller than the robust estimates. This occurs mainly because the error term is highly serially correlated.
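The group-average WLS of equation (2) can be mimicked directly: multiply each group's row by the square root of its group size and run least squares on the transformed data. This is a simulation sketch with invented numbers, assuming the textbook case of iid individual-level errors.

```python
import numpy as np

# Group-average model of equation (1) with iid individual-level errors, so the
# group mean's error variance is sigma^2 / J_i. Equation (2)'s WLS scales each
# row by sqrt(J_i). All numbers are invented for illustration.
rng = np.random.default_rng(2)
G = 200
J = rng.integers(20, 2000, G)             # widely varying group sizes
x = rng.normal(size=G)
beta_true = 0.5

v = rng.normal(size=G) / np.sqrt(J)       # Var(v_i) = 1 / J_i
y = beta_true * x + v

X = np.column_stack([np.ones(G), x])
sw = np.sqrt(J)                           # weight J_i <=> scale rows by sqrt(J_i)
beta_wls, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
# Both are unbiased; under these iid assumptions WLS is the efficient one.
```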
This presumably is the line of thinking that led Friedberg and Wolfers to use WLS
to estimate their divorce-rate regressions. Compared to Wyoming, California offers
many more observations of the individual-level decision of whether or not to divorce, and
therefore it seems at first that weighting by state population should lead to more precise
coefficient estimation. And yet, for the specification shown in Table 1, it appears that
weighting by population harms the precision of estimation.
What is going on here is explained in Dickens’s (1990) excellent article subtitled
“Is It Ever Worth Weighting?” Dickens points out that, in many practical applications,
the assumption that the individual-level error terms vij are independent is wrong.
Instead, the individual-level error terms within a group are positively correlated with each
other because they have unobserved group-level factors in common. In current parlance,
9
the individual-level error terms are “clustered.” Dickens illustrates with the simple
example of an error components model for the individual-level error term:
(3)
vij = ci + u ij
where each of the error components, ci and u ij , is independently and identically
distributed (including independence of each other), with respective variances σ c2 and σ u2 .
In this scenario, the variance of the group-average error term vi is not σ 2 / J i , but
rather is
(4)
Var (vi ) = σ c2 + (σ u2 / J i ) .
If σ c2 is substantial and the sample size J i is sufficiently large in every group (e.g., a lot
of people live in Wyoming, even if not nearly as many as in California), the variance of
the group-average error term may be well approximated by σ c2 , which is homoskedastic.
In that case, OLS applied to equation (1) is nearly the best linear unbiased estimator. In
contrast, if one weights by
J i , as in equation (2), the reweighted error term has
variance J iσ c2 + σ u2 , which could be highly heteroskedastic.
This provides an explanation for why weighting by the within-group sample size
sometimes leads to less precise estimation than OLS.^9 On the other hand, if $\sigma_c^2$ is small
and the within-group sample size $J_i$ is highly variable and small in some groups,
weighting by the within-group sample size may indeed improve the precision of
estimation, sometimes by a great deal.

^9 An important related point is that, if one has access to the individual-level data on $y_{ij}$ and applies OLS to
the regression of $y_{ij}$ on $X_i$, this is numerically identical to the group-average WLS of equation (2), and
hence suffers from the same inefficiency associated with ignoring the clustered nature of the error term.
For more discussion of the mapping between individual-level and group-average regressions, see
Wooldridge (2003) and Donald and Lang (2007).
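The Monte Carlo sketch below illustrates Dickens's point under an assumed error-components structure (all parameter values are hypothetical, not drawn from any study in the text): with a substantial group-level variance component and large within-group samples, weighting by the within-group sample size as in equation (2) produces noisier slope estimates than unweighted OLS.

```python
import numpy as np

# Hypothetical illustration of equation (4): the group-average error has
# variance sigma_c^2 + sigma_u^2 / J_i.  All parameter values are assumed.
rng = np.random.default_rng(0)
G = 50                                   # number of groups (e.g., states)
J = rng.integers(50, 5000, size=G)       # within-group sample sizes, highly unequal
sig_c, sig_u = 1.0, 2.0                  # group-level and individual-level error SDs
x = rng.normal(size=G)                   # a group-level regressor
X = np.column_stack([np.ones(G), x])
w = np.sqrt(J)                           # weighting by sqrt(J_i), as in equation (2)

b_ols, b_wls = [], []
for _ in range(2000):
    v = sig_c * rng.normal(size=G) + (sig_u / np.sqrt(J)) * rng.normal(size=G)
    y = 1.0 + 0.5 * x + v                # true slope is 0.5
    b_ols.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
    b_wls.append(np.linalg.lstsq(X * w[:, None], y * w, rcond=None)[0][1])

print(np.std(b_ols), np.std(b_wls))      # here the unweighted estimator is more precise
```

Reversing the assumed magnitudes (small $\sigma_c^2$, small and variable $J_i$) flips the comparison, which is exactly why the text treats the choice as an empirical question.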
So what is a practitioner to do? Fortunately, as Dickens points out, it is easy to
approach the heteroskedasticity issue as an empirical question. One way to go is to start
with OLS estimation of equation (1), and then use the OLS residuals to perform the
standard heteroskedasticity diagnostics we teach in introductory econometrics. For
example, in this situation, the modified Breusch-Pagan test described in Wooldridge
(2013, pp. 276-8) comes down to just applying OLS to a simple regression of the squared
OLS residuals on the inverse within-group sample size $1/J_i$. The significance of the
t-ratio for the coefficient on $1/J_i$ indicates whether the OLS residuals display significant
evidence of heteroskedasticity. The test therefore provides some guidance for whether
weighted estimation seems necessary.
A remarkable feature of this test is that the estimated intercept consistently
estimates $\sigma_c^2$, and the estimated coefficient of $1/J_i$ consistently estimates $\sigma_u^2$. This
enables an approximation of the variance structure in equation (4), which then can be
used to devise a more refined weighting procedure that, unlike the simple weighting
scheme in equation (2), takes account of the group error component $c_i$.
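This diagnostic regression is only a few lines of code. The sketch below (with assumed, purely illustrative variance components) regresses squared OLS residuals on $1/J_i$ and recovers both components, averaging over replications to smooth simulation noise.

```python
import numpy as np

# Modified Breusch-Pagan diagnostic: regress squared OLS residuals on 1/J_i.
# The intercept estimates sigma_c^2 and the slope estimates sigma_u^2.
# All parameter values below are assumed for illustration.
rng = np.random.default_rng(1)
G = 200
J = rng.integers(20, 2000, size=G)       # within-group sample sizes
sig_c2, sig_u2 = 0.5, 4.0                # true variance components
x = rng.normal(size=G)
X = np.column_stack([np.ones(G), x])
Z = np.column_stack([np.ones(G), 1.0 / J])

reps = 500
a = np.zeros(2)
for _ in range(reps):
    v = np.sqrt(sig_c2) * rng.normal(size=G) + np.sqrt(sig_u2 / J) * rng.normal(size=G)
    y = 1.0 + 0.5 * x + v
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]     # step 1: OLS residuals
    a += np.linalg.lstsq(Z, resid**2, rcond=None)[0] / reps  # step 2: e^2 on 1/J_i

print(a)  # a[0] approximates sigma_c^2 = 0.5, a[1] approximates sigma_u^2 = 4.0
```

In practice one would also compute the t-ratio on the $1/J_i$ coefficient from a single sample, as the text describes, before deciding whether any weighting is warranted.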
So our first recommendation to practitioners in this situation is not to assume that
heteroskedasticity is (or is not) an issue, but rather to perform appropriate diagnostics
before deciding.
We wish to make two additional recommendations.
One is that,
regardless of whether one uses weighted or unweighted estimation, the inevitable
uncertainty about the true variance structure means that some heteroskedasticity may
remain in the error term. We therefore recommend reporting heteroskedasticity-robust
standard error estimates.
Finally, it often is good practice to report both weighted and unweighted
estimates. For one thing, as in our divorce example, a comparison of the robust standard
error estimates is instructive about which estimator is more precise. But there is an
additional consideration. Under exogenous sampling and correct specification of the
conditional mean of y in equation (1), both OLS and WLS are consistent for estimating
the regression coefficients. On the other hand, under either the endogenous sampling
discussed in the next section or model misspecification (an example of which is the
failure to model heterogeneous effects, to be discussed in Section V), OLS and WLS
generally have different probability limits. Therefore, as suggested by DuMouchel and
Duncan (1983), the contrast between OLS and WLS estimates can be used as a diagnostic
for model misspecification or endogenous sampling.^10
In truth, of course, the parametric models we use for estimating causal effects are
nearly always misspecified at least somewhat. Thus, the practical question is not whether
a chosen specification is exactly the true data-generating process, but rather whether it is
a good enough approximation to enable nearly unbiased and consistent estimation of the
causal effects of interest. When weighted and unweighted estimates contradict each
other, this may be a red flag that the specification is not a good enough approximation to
the true form of the conditional mean. For example, Lee and Solon (2011) find that,
when the dependent variable used in the divorce-rate regressions is specified not in logs,
but in levels (as in both the Friedberg and Wolfers studies), the OLS and WLS estimates
10
See Deaton (1997, p. 72) for a clear exposition of how to assess the statistical significance of the contrast
between OLS and WLS estimates of a linear regression model when, under the null hypothesis, OLS is
efficient. For a more general treatment, see Wooldridge (2001, pp. 463-4).
12
are dramatically different from each other. This in itself does not pinpoint exactly what is
wrong with the linear-in-levels model specification, but it is a valuable warning sign that
the issue of functional form specification warrants further attention.
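The warning sign is easy to reproduce in a simulation. In the hypothetical sketch below (all numbers invented), the true model has an interaction between the regressor and the survey weight, but the fitted model omits it; the OLS and WLS slopes then diverge sharply, which is exactly the kind of contrast DuMouchel and Duncan (1983) suggest examining.

```python
import numpy as np

# Contrast between OLS and WLS as a misspecification diagnostic.  The fitted
# model omits a real interaction, so the two estimators converge to different
# weighted averages of the heterogeneous effect.  Parameter values are assumed.
rng = np.random.default_rng(2)
n = 20_000
w = rng.uniform(0.2, 5.0, size=n)        # hypothetical survey weights
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + 0.2 * x * w + rng.normal(size=n)   # effect of x varies with w

X = np.column_stack([np.ones(n), x])     # misspecified: interaction omitted
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
sw = np.sqrt(w)
b_wls = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]

print(b_ols[1], b_wls[1])  # noticeably different slopes: a red flag
```

A formal test of the contrast, as in the Deaton reference in the footnote, would compare the difference of the two slopes to a robust estimate of its standard error.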
IV. Correcting for Endogenous Sampling
An altogether different motive for weighting in research on causal effects is to
achieve consistent estimation in the presence of endogenous sampling. A nice example
comes from the classic paper on choice-based sampling by Manski and Lerman (1977).
Suppose one is studying commuters’ choice of transit mode, such as the choice between
driving to work and taking the bus. One might be particularly interested in how certain
explanatory variables, such as bus fare and walking distance to and from bus stops, affect
the probability of choosing one mode versus the other. Given a random sample of
commuters, most empirical researchers would perform maximum likelihood estimation of
a probit or logit model for the binary choice between transit modes.
But suppose the sample is drawn not as a random sample of commuters, but as a
choice-based sample. As Manski and Lerman explain, “in studying choice of mode for
work trips, it is often less expensive to survey transit users at the station and auto users at
the parking lot than to interview commuters at their homes.” Manski and Lerman show
that, if the resulting sample overrepresents one mode and underrepresents the other
relative to the population distribution of choices, maximizing the conventional log
likelihood (which is an incorrect log likelihood because it fails to account for the
endogenous sampling) generally results in inconsistent parameter estimation.^11 And if
instead one maximizes the quasi-log likelihood that weights each observation's
contribution to the conventional log likelihood by its inverse probability of selection from
the commuter population (thus using a reverse funhouse mirror to make the sample
representative of the population), consistent estimation of the parameters is restored.

^11 They also note a quirky exception: In a logit model that includes mode-specific intercepts in the
associated random-utility model, the coefficients of the other explanatory variables are consistently
estimated. That is a peculiar feature of the logit specification, and it does not carry over to other
specifications such as the probit model. Furthermore, without a consistent estimate of the intercept, one
cannot obtain consistent estimates of the average partial effects, which are commonly reported in empirical
studies.

Another example is estimating the earnings return to an additional year of
schooling. Most labor economists would frame their analysis within a linear regression
of log earnings on years of schooling with controls for other variables such as years of
work experience. Although that regression model has been estimated countless times by
OLS, researchers cognizant of the endogeneity of years of schooling often have sought to
devise instrumental variables (IV) estimators of the regression. In any case, if the
regression were estimated with the full PSID without any correction for the oversampling
of the low-income population, this would lead to inconsistent estimation of the regression
parameters. The sampling would be endogenous because the sampling criterion, family
income, is related to the error term in the regression for log earnings. Again, however,
for an estimation strategy that would be consistent if applied to a representative sample,
suitably weighted estimation would achieve consistency. For example, if the schooling
variable somehow were exogenous so that OLS estimation with a representative sample
would be consistent, then applying WLS to the endogenously selected sample (weighting
each contribution to the sum of squares by its inverse probability of selection) also would
be consistent. This could be achieved by applying least squares to an equation that looks
like equation (2), but now with $J_i$ standing for the inverse probability of selection.
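A simulation in the spirit of this example (with made-up numbers, and with the regressor treated as exogenous so that OLS on a representative sample would be consistent) shows the pattern: unweighted OLS on the endogenously selected sample is biased, while weighting by the inverse probabilities of selection restores consistency.

```python
import numpy as np

# Endogenous sampling: units with low y are oversampled, loosely analogous to
# an oversample of low-income families.  Selection probabilities assumed known.
rng = np.random.default_rng(3)
N = 200_000
x = rng.normal(size=N)                       # exogenous regressor, by assumption
y = 1.0 + 0.5 * x + rng.normal(size=N)       # true slope is 0.5

p = np.where(y < 1.0, 0.8, 0.2)              # selection depends on the outcome
keep = rng.uniform(size=N) < p
xs, ys, ps = x[keep], y[keep], p[keep]
Xs = np.column_stack([np.ones(xs.size), xs])

b_unw = np.linalg.lstsq(Xs, ys, rcond=None)[0]
sw = np.sqrt(1.0 / ps)                       # weight by inverse selection probability
b_ipw = np.linalg.lstsq(Xs * sw[:, None], ys * sw, rcond=None)[0]

print(b_unw[1], b_ipw[1])  # unweighted slope is biased; IPW recovers about 0.5
```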
Similarly, if one were to perform IV estimation, one would need to weight the IV
orthogonality conditions by the inverse probabilities of selection.
These examples illustrate a more general point, analyzed formally in Wooldridge
(1999) for the entire class of M-estimation. In the presence of endogenous sampling,
estimation that ignores the endogenous sampling generally will be inconsistent. But if
instead one weights the criterion function to be minimized (a sum of squares, a sum of
absolute deviations, the negative of a log likelihood, a distance function for orthogonality
conditions, etc.) by the inverse probabilities of selection, the estimation becomes
consistent.
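For the Manski-Lerman transit example, the same weighted-criterion idea can be sketched with a logit log likelihood. In the hypothetical simulation below (invented parameter values), every bus rider is sampled but only 20 percent of drivers are; unweighted maximum likelihood then mislocates the intercept (though, consistent with the quirky logit exception noted above, the slope survives), while weighting each observation's log-likelihood contribution by its inverse selection probability recovers both parameters.

```python
import numpy as np

def logit_mle(X, y, w):
    """Newton-Raphson for a (weighted) logit: maximizes sum_i w_i * loglik_i."""
    b = np.zeros(X.shape[1])
    for _ in range(25):
        p = 1.0 / (1.0 + np.exp(-X @ b))
        grad = X.T @ (w * (y - p))
        hess = (X * (w * p * (1.0 - p))[:, None]).T @ X
        b = b + np.linalg.solve(hess, grad)
    return b

# Hypothetical commute-mode population: bus = 1, car = 0, true index is -1 + x.
rng = np.random.default_rng(4)
N = 200_000
x = rng.normal(size=N)
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-(x - 1.0)))).astype(float)

# Choice-based sample: every bus rider, but only 20% of drivers.
keep = (y == 1) | (rng.uniform(size=N) < 0.2)
Xs = np.column_stack([np.ones(keep.sum()), x[keep]])
ys = y[keep]

b_unw = logit_mle(Xs, ys, np.ones(ys.size))
w = np.where(ys == 1, 1.0, 5.0)              # inverse probabilities of selection
b_w = logit_mle(Xs, ys, w)
print(b_unw, b_w)  # unweighted intercept is shifted; weighted MLE is near (-1, 1)
```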
An important point stressed in Wooldridge (1999) is that, if the sampling
probabilities vary exogenously instead of endogenously, weighting might be unnecessary
for consistency and harmful for precision. In the case of a linear regression model that
correctly specifies the conditional mean, the sampling would be exogenous if the
sampling probabilities are independent of the error term in the regression equation. This
would be the case, for example, if the sampling probabilities vary only on the basis of
explanatory variables. More generally, the issue is whether the sampling is independent
of the dependent variable conditional on the explanatory variables.
For example, suppose one estimates a linear regression model with a sample that
overrepresents certain states (as in the Current Population Survey), but the model
includes state dummy variables among the explanatory variables. Then, if the model is
correctly specified (more about that soon), the error term is not related to the sampling
criterion, and weighting is unnecessary. If the error term obeys the ideal conditions, then
OLS estimation is optimal in the usual way.^12 Is there any cost to using WLS instead of
OLS when weighting is unnecessary for consistency? Yes, there can be an efficiency
cost. If the error term was homoskedastic prior to weighting, the weighting will induce
heteroskedasticity, with the usual consequence of imprecise estimation. More generally,
when the error term may have been heteroskedastic to begin with, the efficiency
comparison between weighting or not weighting by inverse probabilities of selection
becomes less clear. Again, as in Section III, we recommend using standard diagnostics
for heteroskedasticity as a guide in the search for an efficient estimator.
Of course, we need again to acknowledge that, in practice, one’s model is almost
never perfectly specified. At best, it is a good approximation to the data-generating
process. As a result, just as theorems in microeconomic theory based on unrealistically
strong assumptions provide only rough guidance about what is going on in the actual
economy, theorems from theoretical econometrics provide inexact (though valuable, in
our view) guidance about how to do empirical research. In that light, let’s reconsider the
example in the previous paragraph. If the sampling probability varies only across states
and the regression model that controls for state dummies is a good, though imperfect,
approximation to the true model for the conditional mean, then one might reasonably
hope that OLS estimation would come close to unbiased and consistent estimation of the
effects of the explanatory variables. The same goes for WLS estimation (which also
would fall short of perfect unbiasedness and consistency^13), but WLS might be less
precise.

^12 Another practical example is where the survey organization provides sampling weights to adjust for
differential non-response, including attrition from a panel survey. If the weights are based only on
observable characteristics that are controlled for in the regression model (perhaps gender, race, age,
location), it is not clear that there is an advantage to using such weights when estimating that model. For
more on this topic, see Wooldridge (2002) and Fitzgerald, Gottschalk, and Moffitt (1998).
In the end, what is our advice to practitioners? First, if the sampling rate varies
endogenously, estimation weighted by the inverse probabilities of selection is needed on
consistency grounds. Second, the weighted estimation should be accompanied by robust
estimation of standard errors. For example, in the case of a linear regression model, the
heteroskedasticity induced by the weighting calls for the use of White (1980)
heteroskedasticity-robust standard error estimates.^14 Finally, when the variation in the
sampling rate is exogenous, both weighted and unweighted estimation are consistent for
the parameters of a correctly specified model, but unweighted estimation may be more
precise. Even then, as in the previous section, we recommend reporting both the
weighted and unweighted estimates because the contrast serves as a useful joint test
against model misspecification and/or misunderstanding of the sampling process.

^13 In footnote 15 in the next section, we will mention special cases of model misspecification where,
although OLS may be inconsistent for estimating particular causal effects, certain weighted estimators do
achieve consistency.

^14 Wooldridge (1999) presents the appropriate "sandwich" estimator of the asymptotic variance-covariance
matrix for the general case of M-estimation under endogenous sampling. Wooldridge (2001) analyzes a
subtly different sort of sampling that he calls "standard stratified sampling." In standard stratified
sampling, the survey selects a deterministically set number of observations per stratum. In this case, the
asymptotic variance-covariance matrix is more complex, and is strictly smaller than the one analyzed in
Wooldridge (1999). Intuitively, sampling variability is reduced by not leaving the within-stratum sample
sizes to chance. In the example of a linear regression model, the White heteroskedasticity-robust standard
error estimates then become conservative in the sense that they are slightly upward-inconsistent estimates
of the true standard errors.

V. Identifying Average Partial Effects

To consider a third motivation for weighted estimation of causal effects, let's
return to the example of divorce-rate regressions. Recall that, when Lee and Solon
followed Friedberg and Wolfers in using the level, rather than the log, of the divorce rate
as the dependent variable, they found that the OLS and WLS estimates differed
dramatically from each other, with the WLS results showing more evidence of a positive
impact of unilateral divorce on divorce rates. One possible explanation is that, if the
impact of unilateral divorce is heterogeneous – i.e., if it interacts with other state
characteristics – then OLS and WLS estimates that do not explicitly account for those
interactions may identify different averages of the heterogeneous effects. For example, if
unilateral divorce tends to have larger effects in more populous states, then WLS
estimation that places greater weight on more populous states will tend to estimate larger
effects than OLS does. Indeed, Lee and Solon found that, when they redid WLS with
California omitted from the sample, the estimated effects of unilateral divorce came out
smaller and more similar to the OLS estimates, which gave the same weight to California
as to any other state.
This raises the question of whether one might want to weight in order to identify a
particular average of heterogeneous effects, such as the population average partial effect.
Indeed, we have the impression that many empirical practitioners believe that, by
performing WLS with weights designed to reflect population shares, they do achieve
consistent estimation of population average partial effects (e.g., the average impact of
unilateral divorce on divorce rates for the U.S. population). This belief may be based on
the fact, discussed above in Section II.A, that this WLS approach does consistently
estimate the population linear projection of the dependent variable on the explanatory
variables. That, however, is not the same thing as identifying the population average
partial effects.^15 For a previous demonstration of this point, see Deaton (1997, pp. 67-70).
Here, we illustrate with a simple cross-sectional example. Suppose the true model
for an individual-level outcome $y_i$ is

(5)  $y_i = \beta_1 + \beta_2 X_i + \beta_3 D_i + \beta_4 X_i D_i + v_i$

where D is a dummy variable indicating urban (rather than rural) location and the error
term v is fully independent of all the explanatory variables. Then the effect of X on y is
heterogeneous, with $\beta_2$ as the rural effect and $\beta_2 + \beta_4$ as the urban effect. The average
effect for the population is the population-weighted average of these two effects, which is
$\beta_2 + \beta_4 \pi$, where $\pi$ represents the urban share of the population.
Suppose that one fails to model the heterogeneity of effects and instead estimates
the regression of y on just X and D, with the interaction term omitted. And suppose that
one does so with data from a survey that oversampled in the urban sector, so that the
urban fraction of the sample is $p > \pi$. The OLS estimator of the coefficient of X does
identify a particular weighted average of the rural and urban effects, but no one would
expect that weighted average to be the same as the population average effect. After all,
the sample systematically overrepresents the urban sector. And, as we soon will show,
that is indeed one of the reasons that the probability limit of the OLS estimator differs
from the population average partial effect. But the math also will reveal a second reason.
In least squares estimation, observations with extreme values of the explanatory variables
have particularly large influence on the estimates. As a result, the weighted average of
the rural and urban effects identified by OLS depends not only on the sample shares of
the two sectors, but also on how the within-sector variance of X differs between the two
sectors.

^15 One exception in which it is the same thing is in a simple regression on one dummy regressor, that is, a
simple contrast between the means for two subpopulations. And this exception extends to the case of a
"fully saturated" regression on a set of category dummies, which is a contrast among means for multiple
subpopulations. Another case in which using suitably weighted estimators to identify population linear
projections identifies a population average causal effect is the "doubly robust" estimator of treatment
effects introduced by Robins, Rotnitzky, and Zhao (1994) and analyzed by Wooldridge (2007).
Now suppose that instead one estimates the regression of y on X and D by WLS
with weighting by the inverse probabilities of selection. By reweighting the sample to
get the sectoral shares in line with the population shares, WLS eliminates the first reason
that OLS fails to identify the population average partial effect, but it does not eliminate
the second. As a result, the WLS estimator and the OLS estimator identify different
weighted averages of the heterogeneous effects, and neither one identifies the population
average effect.
To be precise, let $\hat\beta_{2,OLS}$ denote the OLS estimator of the coefficient of X when
the interaction term is omitted, and let $\hat\beta_{2,WLS}$ denote the corresponding WLS estimator.
It is straightforward to show that the probability limit of the latter is what one would get
from the corresponding population linear projection:

(6)  $\mathrm{plim}\,\hat\beta_{2,WLS} = \beta_2 + \beta_4 \dfrac{\pi\sigma_1^2}{\pi\sigma_1^2 + (1-\pi)\sigma_0^2}$

where $\sigma_0^2$ and $\sigma_1^2$ respectively denote the within-sector variances of X for the rural and
urban sectors. In contrast, the probability limit of the OLS estimator is

(7)  $\mathrm{plim}\,\hat\beta_{2,OLS} = \beta_2 + \beta_4 \dfrac{p\sigma_1^2}{p\sigma_1^2 + (1-p)\sigma_0^2}$.
If the effect of X were homogeneous (i.e., if $\beta_4 = 0$), then both estimators would be
consistent for the homogeneous effect $\beta_2$. Which estimator is preferable would depend
on which is more precise, the question we already discussed in Section III's analysis of
heteroskedasticity.
The point of the present section, however, is to consider the heterogeneous-effects
case where $\beta_4 \neq 0$. In that case, equations (6) and (7) imply that the inconsistencies of
the two estimators with respect to the true population average partial effect $\beta_2 + \beta_4\pi$ are

(8)  $\mathrm{plim}\,\hat\beta_{2,WLS} - (\beta_2 + \beta_4\pi) = \beta_4 \left[ \dfrac{\pi\sigma_1^2}{\pi\sigma_1^2 + (1-\pi)\sigma_0^2} - \pi \right]$

and

(9)  $\mathrm{plim}\,\hat\beta_{2,OLS} - (\beta_2 + \beta_4\pi) = \beta_4 \left[ \dfrac{p\sigma_1^2}{p\sigma_1^2 + (1-p)\sigma_0^2} - \pi \right]$.

In the knife-edge special case where $\sigma_0^2 = \sigma_1^2$, WLS is consistent for the population
average effect and OLS is not. More generally, though, both estimators are inconsistent
for the population average effect (or any other average effect that researchers commonly
consider interesting). With either over- or undersampling of the urban sector ($p \neq \pi$),
WLS and OLS are inconsistent in different ways, and neither strictly dominates the other.
It is easy to concoct examples in which each is subject to smaller inconsistency than the
other.
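A simulation of model (5) with assumed parameter values (all invented for illustration) makes equations (6) and (7) concrete: with unequal within-sector variances and an oversampled urban sector, the OLS and WLS slopes each match their respective probability limits, and neither matches the population average partial effect $\beta_2 + \beta_4\pi$.

```python
import numpy as np

# Model (5) with the interaction omitted from the fitted regression.
# All parameter values are assumed for illustration.
rng = np.random.default_rng(5)
b1, b2, b3, b4 = 1.0, 1.0, 0.5, 2.0      # coefficients of equation (5)
pi_, p = 0.3, 0.7                        # urban population share vs. urban sample share
s0, s1 = 1.0, 2.0                        # within-sector SDs of X (sigma_0, sigma_1)

n = 200_000
D = (rng.uniform(size=n) < p).astype(float)       # oversampled urban sector
X = np.where(D == 1.0, s1, s0) * rng.normal(size=n)
y = b1 + b2 * X + b3 * D + b4 * X * D + rng.normal(size=n)

M = np.column_stack([np.ones(n), X, D])           # interaction omitted
b_ols = np.linalg.lstsq(M, y, rcond=None)[0][1]
w = np.sqrt(np.where(D == 1.0, pi_ / p, (1.0 - pi_) / (1.0 - p)))  # inverse-prob. weights
b_wls = np.linalg.lstsq(M * w[:, None], y * w, rcond=None)[0][1]

ape = b2 + b4 * pi_                               # population average partial effect
plim_wls = b2 + b4 * pi_ * s1**2 / (pi_ * s1**2 + (1 - pi_) * s0**2)   # equation (6)
plim_ols = b2 + b4 * p * s1**2 / (p * s1**2 + (1 - p) * s0**2)         # equation (7)
print(ape, (b_wls, plim_wls), (b_ols, plim_ols))
```

Setting `s0 = s1` in the sketch reproduces the knife-edge case in which WLS, but not OLS, is consistent for the average effect.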
Here are the lessons we draw from this example. First, we urge practitioners not
to fall prey to the fallacy that, in the presence of unmodeled heterogeneous effects,
weighting to reflect population shares generally identifies the population average partial
effect.
Second, we reiterate the usefulness of the contrast between weighted and
unweighted estimates.
We said before that the contrast can serve as a test for
misspecification, and the failure to model heterogeneous effects is one sort of
misspecification that can generate a significant contrast. Third, where heterogeneous
effects are salient, we urge researchers to study the heterogeneity, not just try to average
it out. Typically, the average partial effect is not the only quantity of interest, and
understanding the heterogeneity of effects is important.
For example, unless one
understands the heterogeneity, it is impossible to extrapolate from even a well-estimated
population average effect in one setting to what the average effect might be in a different
setting. In the simple example above, this recommendation just amounts to advising the
practitioner to include the interaction term instead of omitting it. We understand that, in
most empirical studies, studying the heterogeneity is more complex, but we still consider
it worthwhile.
VI. Summary and General Recommendations for Empirical Practice
In Section II, we distinguished between two types of empirical research: (1)
research directed at estimating population descriptive statistics and (2) research directed
at estimating causal effects. For the former, weighting is called for when it is needed to
make the analysis sample representative of the target population. For the latter, the
question of whether and how to weight is more nuanced.
In Sections III-V, we proceeded to discuss three distinct potential motives for
weighting when estimating causal effects: (1) to achieve more precise estimates by
correcting for heteroskedasticity, (2) to achieve consistent estimates by correcting for
endogenous sampling, and (3) to identify average partial effects in the presence of
unmodeled heterogeneity of effects. In our detailed discussion of each case, we have
noted instances in which weighting is not as good an idea as empirical researchers
sometimes think. Our overarching recommendation therefore is to take seriously the
question in our title: What are we weighting for? Be clear about the reason that you are
considering weighted estimation, think carefully about whether the reason really applies,
and double-check with appropriate diagnostics.
A couple of other recurring themes also bear repeating. In situations in which you
might be inclined to weight, it often is useful to report both weighted and unweighted
estimates and to discuss what the contrast implies for the interpretation of the results.
And, in many of the situations we have discussed, it is advisable to use robust standard
error estimates.
Table 1. Estimated Effects of Unilateral Divorce Laws

                             (1)                   (2)
Dependent variable:   Log of divorce rate   Log of divorce rate
Estimation method:           WLS                   OLS

First 2 years           -0.022 (0.063)        -0.017 (0.026)
Years 3-4               -0.049 (0.063)        -0.014 (0.031)
Years 5-6               -0.051 (0.064)        -0.022 (0.034)
Years 7-8               -0.033 (0.065)        -0.013 (0.039)
Years 9-10              -0.052 (0.067)        -0.030 (0.046)
Years 11-12             -0.051 (0.074)        -0.015 (0.052)
Years 13-14             -0.043 (0.077)        -0.005 (0.060)
Years 15+                0.006 (0.084)         0.026 (0.073)
Notes: These results are drawn from Lee and Solon (2011, Table 2). The divorce rate is
the number of divorces per 1,000 persons by state and year. The standard error estimates
in parentheses are robust to heteroskedasticity and serial correlation. Both regressions
include controls for state fixed effects, year fixed effects, and state-specific linear time
trends.
References
Angrist, Joshua D., and Jörn-Steffen Pischke. 2009. Mostly Harmless Econometrics: An
Empiricist’s Companion. Princeton, NJ: Princeton University Press.
Autor, David H., Lawrence F. Katz, and Alan B. Krueger. 1998. “Computing Inequality:
Have Computers Changed the Labor Market?” Quarterly Journal of Economics,
113(4): 1169-213.
Borjas, George J. 2003. “The Labor Demand Curve Is Downward Sloping:
Reexamining the Impact of Immigration on the Labor Market.” Quarterly
Journal of Economics, 118(4): 1335-74.
Card, David, and Alan B. Krueger. 1992. “Does School Quality Matter? Returns to
Education and the Characteristics of Public Schools in the United States.”
Journal of Political Economy, 100(1): 1-40.
Deaton, Angus. 1997. The Analysis of Household Surveys: A Microeconometric
Approach to Development Policy. Baltimore: The Johns Hopkins University
Press.
Dehejia, Rajeev, and Adriana Lleras-Muney. 2004. “Booms, Busts, and Babies’
Health.” Quarterly Journal of Economics, 119(3): 1091-130.
Dickens, William T. 1990. “Error Components in Grouped Data: Is It Ever Worth
Weighting?” Review of Economics and Statistics, 72(2): 328-33.
Donald, Stephen G., and Kevin Lang. 2007. “Inference with Difference-in-Differences
and Other Panel Data.” Review of Economics and Statistics, 89(2): 221-33.
Donohue, John J., III, and Steven D. Levitt. 2001. “The Impact of Legalized Abortion
on Crime.” Quarterly Journal of Economics, 116(2): 379-420.
DuMouchel, William H., and Greg J. Duncan. 1983. “Using Sample Survey Weights in
Multiple Regression Analyses of Stratified Samples.” Journal of the American
Statistical Association, 78(383): 535-43.
Elder, Todd E., John H. Goddeeris, and Steven J. Haider. 2011. “A Deadly Disparity: A
Unified Assessment of the Black-White Infant Mortality Gap.” B.E. Journal of
Economic Analysis and Policy (Contributions), 11(1).
Fitzgerald, John, Peter Gottschalk, and Robert Moffitt. 1998. “An Analysis of Sample
Attrition in Panel Data: The Michigan Panel Study of Income Dynamics.”
Journal of Human Resources, 33(2): 251-99.
Friedberg, Leora. 1998. “Did Unilateral Divorce Raise Divorce Rates? Evidence from
Panel Data.” American Economic Review, 88(3): 608-27.
Imbens, Guido W.….
Lee, Jin Young, and Gary Solon. 2011. “The Fragility of Estimated Effects of Unilateral
Divorce Laws on Divorce Rates.” B.E. Journal of Economic Analysis and Policy
(Contributions), 11(1).
Levitt, Steven D. 1998. "Juvenile Crime and Punishment." Journal of Political
Economy, 106(6): 1156-85.
Manski, Charles F., and Steven R. Lerman. 1977. “The Estimation of Choice
Probabilities from Choice Based Samples.” Econometrica, 45(8): 1977-88.
Robins, James M., Andrea Rotnitzky, and Lue Ping Zhao. 1994. “Estimation of
Regression Coefficients When Some Regressors Are Not Always Observed.”
Journal of the American Statistical Association, 89 (427): 846-66.
Shin, Donggyun, and Gary Solon. 2011. "Trends in Men's Earnings Volatility: What
Does the Panel Study of Income Dynamics Show?" Journal of Public Economics,
95(7-8): 973-82.
U.S. Bureau of the Census. 1968. Current Population Reports: Consumer Income, P-60
(55). Washington, DC: U.S. Government Printing Office.
White, Halbert. 1980. “A Heteroskedasticity-Consistent Covariance Matrix Estimator
and a Direct Test for Heteroskedasticity.” Econometrica, 48(4): 817-38.
Wolfers, Justin. 2006. “Did Unilateral Divorce Laws Raise Divorce Rates? A
Reconciliation and New Results.” American Economic Review, 96(5): 1802-20.
Wooldridge, Jeffrey M. 1999. “Asymptotic Properties of Weighted M-Estimators for
Variable Probability Samples.” Econometrica, 67(6): 1385-406.
Wooldridge, Jeffrey M. 2001. “Asymptotic Properties of Weighted M-Estimators for
Standard Stratified Samples.” Econometric Theory, 17(2): 451-70.
Wooldridge, Jeffrey M. 2002. "Inverse Probability Weighted M-Estimators for Sample
Selection, Attrition, and Stratification." Portuguese Economic Journal, 1(2): 117-39.
Wooldridge, Jeffrey M. 2003. “Cluster-Sample Methods in Applied Econometrics.”
American Economic Review, 93(2): 133-8.
Wooldridge, Jeffrey M. 2007. “Inverse Probability Weighted Estimation for General
Missing Data Problems.” Journal of Econometrics, 141 (2): 1281-1301.
Wooldridge, Jeffrey M. 2010. Econometric Analysis of Cross Section and Panel Data.
2nd ed. Cambridge, MA: MIT Press.
Wooldridge, Jeffrey M. 2013. Introductory Econometrics: A Modern Approach. 5th ed.
Mason, OH: South-Western.
Exhibit 9
Page 1

UNITED STATES DISTRICT COURT
NORTHERN DISTRICT OF CALIFORNIA
OAKLAND DIVISION

THE APPLE IPOD ITUNES ANTI-TRUST LITIGATION    No. C-05-0037 YGR

VIDEOTAPED DEPOSITION OF ROGER G. NOLL
San Francisco, California
Thursday, May 16, 2013
Volume 1

Reported by:
JENNIFER L. FURIA, RPR, CSR
CA License No. 8394
Job No. 1663538
PAGES 1 - 262

Sarnoff, A VERITEXT COMPANY
877-955-3855
on them?
A    Yeah, I think -- I think I do. Remember that at the time I'm doing this I'm base -- I'm -- my mindset is that I'm believing what's in Fruggia's declaration. That it's everything. All new products after September. First of all, he said October, then he changed it to September.
     So my mindset at the time is this is a distinction without meaning. I now realize it's not true, all right. That is to say, that what he said isn't actually correct, so.
     But what I asked them to do, I just don't remember. I -- I do remember believing that every new product that was brought on the market after September 2006 had 7.0 loaded on it, because that's what was in Fruggia's deposition.
MR. MITTELSTAEDT: Let's take a break.
THE VIDEOGRAPHER: This is the end of disc number one. The time is 10:17 a.m. And we are now going off the record.
(Recess.)
THE VIDEOGRAPHER: This is the beginning of disc number two. The time is 10:31 a.m. And we are now going back on the record.
BY MR. MITTELSTAEDT:
Page 50

Q    Okay. If you would open the report to Exhibit 16.1 again, please.
A    Okay.
Q    And let's go to -- yeah, let's go to actually 16.2. This is on the direct sales.
A    Okay.
Q    So this is the same approach as 16.1 except instead of for the resellers, it's for the direct sales?
A    That's correct.
Q    Where did the average price come from? How did -- how did you derive that?
A    It's just -- just derived from taking the average of all the transactions of that product during the -- during the damage period.
Q Okay. While we're still on 13.2 or if you'd reopen 13.2. I just want to walk down the variables. For Harmony when did you turn on -- what date is the Harmony variable turned on?
A The day -- the date that it's released, which is, you know, July -- sometime in July or August of 2004. I don't remember the exact date.
Q Okay. And then the next variable is Harmony blocked. What date is that turned on?
A That's the -- the date of 4.7, which is fall of 2000 -- wait a minute. No, wait a minute. When is it? I -- I can't remember. It's the date at which 4.7 is released and I can't remember. It's in the report. A few months later, a few months after the Harmony is released.
Q At an earlier deposition we talked about the relaunch of Harmony.
A Right.
Q Did you consider for this regression having a variable for the relaunch of Harmony which was early 2005?
A You know, I didn't -- I didn't bother, because by the time we get to this the 4.7 issue is out of the case, so I didn't bother.
If you -- if you stuck in the relaunch variable, what you'd get is a change in the coefficient on Harmony and on 4.7, but that -- that's not relevant for what we're interested in in the damages period.
Q Did you ever ask Econ, Inc. to run this with Harmony relaunch as a variable?
A No. It's pointless. It would not change the results we're interested in.
Q Well, would it change the coefficient on Harmony blocked?
A It would if -- if we cared about 4.7 it would. But we don't care about 4.7 anymore, the -- so, you know -- yes, if we were trying to estimate damages for 4.7 then it would be worthwhile to do that, but given that it's no longer in the case it's not worthwhile.
Q Okay. What would you expect to happen to the coefficient for 4.7 if you included a variable for the Harmony relaunch?
A I'd expect the coefficient for 4.7 to be larger and then Harmony relaunch to offset some of it, so I would expect that the net effect of 4.7 will be larger during the first period than during the second period.
Q By second period what do you mean?
A After the Harmony relaunch. In March of 2005 or whenever it was.
Q So you would expect the coefficient for 4.7 to be lower if you add a variable for the Harmony relaunch than if you didn't?
A Well, it depends on how you -- it would depend on how you specified it. The coefficient for 4.7, are you going to turn it on or turn it -- are you going to leave it on or turn it off with the Harmony relaunch.
If you keep 4.7 on during the entire period between its initial launch until the launch of 7.0, and then you add a Harmony relaunch, the coefficient for 4.7 is going to be the same during the whole period, but then you're going to subtract something off of that that would be the coefficient on Harmony, so the net effect of 4.7 would be less after Harmony was relaunched than before.
Q But why would you leave 4.7 variable on after Harmony is relaunched?
A Because 4.7 is still the operating system and you would test the hypothesis whether the Harmony relaunch completely offset it or not.
Q But why would you expect -- what would be the circumstance in which you would expect 4.7 to continue to have an impact on iPod prices after Harmony is relaunched and anybody who wants to can use it?
A Because the -- the issue that's going on in
Pages 62 to 65
Exhibit 11
Page 1

UNITED STATES DISTRICT COURT
NORTHERN DISTRICT OF CALIFORNIA
OAKLAND DIVISION

THE APPLE iPOD iTUNES          Lead Case No. C 05-00037
ANTI-TRUST LITIGATION
____________________________
This Document Relates To:
ALL ACTIONS
____________________________

CONFIDENTIAL - ATTORNEYS' EYES ONLY

VIDEOTAPED DEPOSITION OF JEFFREY M. WOOLDRIDGE, Ph.D.
Monday, January 6, 2014
San Diego, California

Page 2

Videotaped Deposition of JEFFREY M. WOOLDRIDGE, Ph.D., taken on behalf of the Defendant at 655 West Broadway, Suite 1900, San Diego, California, beginning at 10:29 a.m. and ending at 4:26 p.m., on Monday, January 6, 2014, before Debby M. Gladish, RPR, CLR, CCRR, CSR No. 9803, NCRA Realtime Systems Administrator.

Reported By:
Debby M. Gladish
RPR, CLR, CCRR, CSR No. 9803
NCRA Realtime Systems Administrator

Job No. 10009202
Page 3

APPEARANCES

For the Plaintiffs:
    ROBBINS GELLER RUDMAN & DOWD, LLP
    BY: BONNY SWEENEY, ESQ.
    BY: JENNIFER N. CARINGAL, ESQ.
    655 West Broadway, Suite 1900
    San Diego, California 92101
    (619) 231-1058
    bonnys@rgrdlaw.com

For the Defendant, Apple Inc.:
    JONES DAY
    BY: DAVID KIERNAN, ESQ.
    BY: AMIR AMIRI, ESQ.
    555 California Street, 26th Floor
    San Francisco, California 94104
    (415) 626-3939
    aamiri@jonesday.com

    APPLE
    BY: SCOTT B. MURRAY, ESQ. (TELEPHONIC APPEARANCE)
    1 Infinite Loop, MS 169-2NYJ
    Cupertino, California 95014
    (408) 783-8369
    scott_murray@apple.com

Also present:
    CHRISTOPHER TISA (VIDEO OPERATOR)
    APTUS COURT REPORTING
    600 West Broadway, Suite 300
    San Diego, California 92101
    (619) 546-9151
Page 4

INDEX
WITNESS                                       EXAMINATION
JEFFREY M. WOOLDRIDGE, Ph.D.
          BY MR. KIERNAN                                7

EXHIBITS
MARKED                                               PAGE
Exhibit 1   Declaration of Jeffrey M.                  17
            Wooldridge in Support of
            Plaintiff's Daubert Motion to
            Exclude Certain Opinion Testimony
            of Robert H. Topel and Kevin M.
            Murphy, 34 pages
Exhibit 2   Econometric Analysis of Cross             121
            Section and Panel Data, Second
            Edition, 34 pages
Exhibit 3   Journal of Econometrics article           135
            titled "Asymptotic properties of a
            robust variance matrix estimator
            for panel data when T is large,"
            by Christian B. Hansen, 24 pages
Exhibit 4   Document titled "Exhibit 3-A              140
            Reseller Sales Preferred Log
            Regression Results Outliers
            Excluded," 4 pages
Exhibit 5   Document titled                           144
            "Imbens/Wooldridge, Lecture Notes
            1, Summer '07, What's New in
            Econometrics," 42 pages

Page 5

Exhibit 6   Document titled                           145
            "Imbens/Wooldridge, Lecture Notes
            2, Summer '07, What's New in
            Economics?" 32 pages
Exhibit 7   Document titled "Did Unilateral           148
            Divorce Laws Raise Divorce Rates?
            A Reconciliation and New
            Results," Justin Wolfers, Working
            Paper 10014, 30 pages
Exhibit 8   Document titled "NBER Working             149
            Paper Series, What Are We
            Weighting For?" 29 pages
Page 6

SAN DIEGO, CALIFORNIA
MONDAY, JANUARY 6, 2014, 10:29 a.m.

THE VIDEOGRAPHER: Good morning. We are now on the record. The time is 10:29 a.m. Today's date is January 6, 2014.
My name is Christopher Tisa of Aptus Court Reporting. The court reporter is Debby Gladish with Aptus Court Reporting located at 600 West Broadway, Suite 300, San Diego, California 92101.
This begins the video-recorded deposition of Jeffrey M. Wooldridge, testifying in the matter of the Apple iPod iTunes Anti-Trust Litigation, pending in the United States District Court, Northern District of California, Oakland Division, Case Number C 05-00037 YGR, taken at 655 West Broadway, Suite 1900, San Diego, California 92101.
The video and audio recording will take place at all times during this deposition unless all counsel agree to go off the record. The beginning and end of each video recording will be announced.
Will counsel please identify yourselves and state whom you represent.
MR. KIERNAN: David Kiernan on behalf of Apple.
MR. AMIRI: Amir Amiri on behalf of Apple.
MS. SWEENEY: Bonny Sweeney on behalf of the plaintiffs.
MS. CARINGAL: Jennifer Caringal on behalf of the plaintiffs.
THE VIDEOGRAPHER: The court reporter may now swear in the deponent.

JEFFREY M. WOOLDRIDGE,
having been sworn, testified as follows:

THE VIDEOGRAPHER: You may proceed, Counsel.
MR. KIERNAN: Okay.

EXAMINATION
BY MR. KIERNAN:
Q. Good morning, Dr. Wooldridge.
A. Good morning.
Q. Could you state your full name for the record.
A. Jeffrey M. Wooldridge.
Q. Okay. Have you ever been deposed before?
A. No.
Q. Okay. Have you ever testified before?
A. No.
Q. Okay. Do you understand that your testimony today is under oath subject to penalty of perjury?
A. I do.
Q. Okay. Is there any reason that you cannot testify completely and truthfully today?
A. No.
Q. Any substance that you've taken that would impair your ability to testify completely and truthfully?
A. No.
Q. When were you first contacted to do work on this case?
A. December 5th I received an e-mail from Bonny Sweeney.
Q. December 5th, 2014 -- or 2013?
A. December 5th, 2013, yes.
Q. Okay. And what is your assignment in this case?
A. My assignment is to evaluate different claims about how the proper standard error should be computed in the Noll regression analysis --
Q. Okay.
A. -- and whether clustering is important or not or I should say whether it's valid or not.
Q. And when did you start work in this matter? When did you start to do the work after being first
·contacted on December 5th?
· · · A.· ·A week later, December 12th, 2013.
· · · Q.· ·And about how many hours have you put in on
·this case?
· · · A.· ·Up to writing the dec- -- submitting the
·declaration or after that as well?
· · · Q.· ·That's a good time.· Up through submitting
·your declaration.
· · · A.· ·Five to six hours.
· · · Q.· ·Okay.· And then, after submitting your
·declaration, how much time have you spent, if any?
· · · A.· ·Probably another ten hours.
· · · Q.· ·Putting aside conversations that you've had
·with counsel --
· · · A.· ·Yes.
· · · Q.· ·-- including Bonny or anyone else from Robbins
·Geller, have you discussed this case with anybody else?
· · · A.· ·No.
· · · Q.· ·Have you discussed the case with Dr. Noll?
· · · A.· ·No.
· · · Q.· ·Have you ever had a discussion with Dr. Noll
·at any time in your life?
· · · A.· ·No, I don't believe we've met.
· · · Q.· ·Okay.· Did you have any support staff or any
·other person who assisted you?
A. No.
Q. Okay. It was just you?
A. Just me.
Q. All right. And how are you being compensated for your work in this matter?
A. Hourly wage?
Q. Uh-huh.
A. $500 an hour.
Q. And how much have you been paid?
A. Nothing.
Q. Okay. And so when you submit an invoice it's going to be between 15 and 16 hours plus whatever work your deposition today time?
A. Yes.
Q. Have you submitted any invoices or any other bill that reflects the hours spent and the amount you are owed?
A. I haven't submitted invoices yet.
Q. Okay. Since submitting your declaration, what work have you done?
A. I've done some simulation work on -- on the properties of clustered standard errors.
Q. Okay. Anything else?
A. And -- and working out some formulas that can explain the simulation findings.
Q. Okay. Anything else?
A. No.
Q. Did you prepare for the deposition?
A. Yes.
Q. And, just briefly, describe what you did to prepare for --
A. I read --
Q. And I don't want to hear any conversations that you had with counsel. You can tell me if you met with counsel, but I don't want to hear what you guys talked about.
A. We -- we did meet over the phone. I read the various reports, the Murphy, Topel report, the Noll rebuttal report, and I reviewed my own declaration.
Q. Okay. Anything else?
A. Reviewed some of my old work on clustering, but . . .
Q. Like old pub- -- publications?
A. Yes, and my book.
Q. And -- and which book, the graduate book or the undergrad book?
A. My graduate book, which -- which is published with MIT Press.
Q. And with respect to the other clustering work, aside from the textbook, do you recall which -- what the
·titles were of those works?
· · · A.· ·The main -- the -- the one was the paper I
·published in the American Economic Review called the
·Cluster -- Cluster Sampling and Applied Econometrics.
· · · Q.· ·The 2003 paper?
· · · A.· ·Yes, uh-huh.
· · · Q.· ·With respect to the -- let me just see what
·that is.
· · · · · ·MR. KIERNAN:· Do I have to hit escape to go
·up?
· · · · · ·THE REPORTER:· Yes.
·BY MR. KIERNAN:
· · · Q.· ·With respect to the simulation work on the
·properties of clustered standard errors, was the
·simulation work done on the standard errors in -- from
·Noll's regress- -- rebuttal regressions?
· · · A.· ·No.· I set up a simplified framework so that
·the issues would be more transparent, showing what would
·happen if you took an independent sample and clustered
·after the fact based on some characteristics.· I did
·that after I wrote my declaration.
· · · Q.· ·And why did you do that work?
· · · A.· ·Because in my declaration I asserted things
·that seemed self-evident, but thought it would be useful
·to actually see the -- the simulation findings that
actually showed what I was claiming.
Q. And are you relying on those simulations for any of the opinions that you're giving in this matter?
A. Not necessarily. I guess -- I didn't rely on them in my declaration and so I'll be talking about my -- the opinions in my declaration, which haven't changed.
Q. Okay. So you're not relying on the simulations that you've done after submitting your declaration as a basis for any --
A. No.
Q. -- of the opinions in your report?
A. No, I'm not.
Q. Okay. In -- in the simulations, what was the dataset that you used?
A. Well, the simulation generates data based on some assumptions about what the population distribution is and then draws randomly from using a standard program, such as Stata, to draw random samples from the population.
Q. And the -- when you're referring to the population, what's the dataset for that?
A. When --
Q. That's part of the simulation program?
A. Yes. So when you define a population you
define a distribution such as a normal distribution or a quasi distribution or something like that and then you randomly sample from that -- from a random variable that has that distribution. It's a very common method used to evaluate any kind of estimator that somebody proposes in econometrics or statistics.
MR. KIERNAN: How do I get this going again?
THE REPORTER: Hit the pause button.
MR. KIERNAN: Pause break? Say again?
THE REPORTER: Use your mouse and --
MR. KIERNAN: Oh, I see it.
THE REPORTER: And hit --
MR. KIERNAN: Got it, got it. Thank you.
BY MR. KIERNAN:
Q. I notice in the declaration you note that you've done some consulting work, like you worked for CRA and --
A. Yes.
Q. Okay. Have you done any work, provided any opinions, in antitrust cases or any antitrust matters?
A. Yes.
Q. Okay.
A. Back with the Charles River Associates work I did some econometric work at the request of Frank Fisher on the Kodak Polaroid patent infringement case. And at one point I also did work -- there was a case where the NBA was suing, I believe, the -- the super station in Chicago, WGN, for showing Chicago Bulls games with Michael Jordan nationwide.
Q. Okay. Any others that you can think of?
A. I wish my memory were better. I -- I did do some more cases for Charles River Associates. There was a case having to do with airline reservation systems, I believe. And that's as much as I can remember.
Q. And have you ever been retained -- excuse me -- to estimate the impact of some conduct on the prices of consumer products?
A. No.
Q. Okay. Have you ever been retained to estimate damages resulting from alleged impact of conduct on consumer prices -- on consumer products?
A. No.
Q. You note in your declaration that you're currently providing consulting work to Industrial Economics, Inc. on a damage assessment.
A. Uh-huh.
Q. And describe that for me.
A. That's through the government, NOAA, for the Deepwater Horizon oil spill.
Q. And what is the work that you're -- the
·consulting work that you're doing in connection with
·that?
· · · A.· ·We're estimating damages from the oil spill
·based on consumer willingness to pay.
· · · Q.· ·And you said "we."· Are there other people
·involved?
· · · A.· ·Yes.
· · · Q.· ·Okay.
· · · A.· ·It's -- it's a large team.
· · · Q.· ·And what is your role?
· · · A.· ·My role is mainly as the econometrician to
·think about sampling issues and model estimation issues
·and how to compute standard errors.
Q. And in that matter have you proposed a model to estimate damages?
A. Yes.
Q. And is it a regression model or --
A. It's a bit --
Q. Strike that.
Why don't you describe the model. I'll start . . .
A. I'm not sure I'm at liberty to do that. I -- I don't know what the protocol is, but I -- I am --
MS. SWEENEY: Yeah, if it's --
THE WITNESS: It -- it --
MS. SWEENEY: If it's confidential --
THE WITNESS: It's confidential.
MS. SWEENEY: I'll just object to form.
THE WITNESS: Sorry. That is confidential information.
BY MR. KIERNAN:
Q. Okay. And -- and just so I have it -- the record clear, even the type of model that you are or not using to estimate damages in the case, your testimony is you cannot describe it for me because of a protective order --
A. Yes.
Q. -- in that matter?
A. Yes. That's correct.
Q. Other than that matter, have you been retained to estimate or consult in estimating damages?
A. No.
MR. KIERNAN: All right. Let me have his report.
Can mark that as -- why don't we mark it as Wooldridge 1 because I don't think we've been doing them sequentially.
(Exhibit 1 marked.)
BY MR. KIERNAN:
Q. Okay. I'm handing you what's been marked as
Wooldridge 1. Can you identify as -- Wooldridge 1 as the declaration -- as your declaration that you submitted in this case?
A. Yes, it is.
Q. Okay. And is it -- does Wooldridge 1 contain all the opinions that you're offering in this matter?
A. Yes.
Q. And contains all the bases for those opinions?
A. Yes.
Q. And does it list all the materials that you relied on?
A. Yes.
Q. Okay. And did you draft Wooldridge 1?
A. Yes.
Q. Did anyone else assist you with drafting it?
A. Counsel read through and made small editorial comments.
Q. Did you review any deposition transcripts?
A. Yes, I did. I read the Noll report, both the initial report and the rebuttal, and the Murphy and Topel reports.
Q. Okay.
MS. SWEENEY: I -- I --
MR. KIERNAN: I'm going to -- I'll clarify.
BY MR. KIERNAN:
Q. What I mean by "deposition transcript" --
A. Oh, deposition --
Q. -- is --
A. Oh, I'm sorry. Not --
Q. Like today we have a deposition and then --
A. I did not read any deposition transcripts.
Q. Okay.
A. I'm sorry. Yes.
Q. And did you review the -- any of the data -- the datasets that Dr. Noll used in his regressions?
A. No, I didn't see the datasets.
Q. Okay. Did you review the documents that Dr. Noll cites in his reports?
A. Um --
MS. SWEENEY: Objection. Overbroad.
THE WITNESS: Did I -- I was familiar with some of the econometrics works that he cited, but I did not review -- he has previous dec- -- declarations listed there. I did not review those or any of the other -- there was a long list, I believe, of documents, and I did not look at them.
BY MR. KIERNAN:
Q. Okay.
A. I had a limited amount of time.
Q. And do you recall that Dr. Murphy and
Dr. Topel, they also listed a number of documents that they considered?
A. Yes.
Q. And did you review any of those?
A. No, I did not.
Q. Did you review a supplemental report that was jointly signed by Drs. Murphy and Dr. Topel?
A. If it's the -- the recent one --
Q. Yes.
A. -- yes, I did.
Q. Did you review the regression equations used by Dr. Noll?
MS. SWEENEY: Objection. Vague and ambiguous.
THE WITNESS: I -- I looked at the equations and the reported standard errors, but I did not evaluate the equations for content.
BY MR. KIERNAN:
Q. Okay. And did you evaluate the -- well, strike that.
On page 2 of Wooldridge 1 of your declaration, at the bottom, you state, "I restrict my comments to issues associated with computing proper standard errors and do not discuss model specification."
Do you see that?
A. Yes.
· · · Q.· ·And what do you mean by "model specification"?
· · · A.· ·Well, every regression analysis has a
·dependent variable, which you're trying to explain, and
·a set of explanatory variables sometimes called
·independent variables.· And different people can have
·different opinions on what those variables should be.
·And I was not asked to evaluate that part of Dr. Noll's
·analysis and so I haven't formed an opinion.· I didn't
·look at the equations with an eye toward did I think
·this was proper or not.
· · · · · ·I was asked to do something fairly narrow,
·which was evaluate the clustering issue.· And that's
·what I spent my limited time on.
· · · Q.· ·Okay.· Okay.· So you're not offering an
·opinion on the model specification?
· · · A.· ·That's correct.
· · · Q.· ·And not offering an opinion on whether he
·included the correct explanatory variables or what you
·called the independent variables?
· · · A.· ·That's correct.
· · · Q.· ·Not offering an opinion on whether the
·regression suffered from omitted-variable bias?
· · · A.· ·Correct.
· · · Q.· ·And no opinion on whether Dr. Noll's
·regressions estimate or provide -- or produce a reliable
damages estimate?
A. I would need to study what he did in much more depth to -- to -- to comment on that. And I wasn't asked to do that and I -- I formed no opinion on that.
Q. Fair enough. Fair enough. Okay. So you have not formed an opinion on whether Dr. Noll's regressions produces reliable damages estimates?
A. That's correct.
Q. And no opinion on whether the conduct at issue in this litigation impacted iPod prices?
A. Correct.
Q. No opinion on the amount of damages?
A. That's correct.
Q. What is your understanding about -- of what this case is about?
A. Oh, well, there were certain versions of iPods that were installed with software that essentially blocked a competitor's software that allowed downloading music from competing sites other than the iTunes store. But I -- I have to say I focused my attention on the cluster sampling issue. I understood what the data structure was and what the basic question was, and I didn't think in-depth about what the actual antitrust issue is here.
Q. Okay. And what is the basis for your understanding of what the data structure used by Dr. Noll in his two regressions is?
A. He has transactions level data for shipments of various classes of iPods, along with information about the characteristics of the iPods and the prices at which the transactions occurred, in- -- including when they occurred.
Q. And you understand Dr. Noll has two regressions, one for resellers and then the other --
A. Yes. I -- I did take note of that, yes.
Q. Okay. And the data structure that you just described, is that true for both types of customers, sales to both types of customers or -- let me stop there.
A. The direct sales have that structure and, yeah, I -- I -- I didn't see their both transactions records in the direct sales. There's -- there's perhaps more than one unit sold.
Q. Okay. And I -- when you say shipments of iPods, what are you referring to?
A. Well, I call them transactions, I believe. But a shipment is -- for the -- the purposes of the data analysis what matters is that there's a transaction that happened for a certain kind of iPod on a certain day at a certain price and so it could have been one iPod or it could have been many iPods.
· · · Q.· ·Okay.· So ship- -- when you used the term
·"shipment" previously you're referring to transaction?
· · · A.· ·Yes.
Q. And in your declaration you state that Professor Noll is using the entire population of transactions. What do you mean by "the entire
Jeffrey Wooldridge, Ph.D.
Confidential - Attorneys' Eyes Only
The Apple iPod iTunes Anti-Trust Litigation
Page 25
population"?
A. Well, my understanding is that Apple provided every -- every transaction over this ten- or 11-year period and except for a couple that were dropped due to missing data issues and, I believe, some outlying observations, he has every transaction ever done.
Alternatively, Apple could have said, "Here's a 10 percent random sample of our transactions," and then it would have been a random sample from that population, but instead he has all the transactions.
Q. And all the transactions worldwide or just in the United States?
A. I didn't read it that -- that closely. Sorry. So if --
Q. Okay.
A. -- if it's just in the United States, then it's the pop- -- then that defines the population.
Q. Okay. And so your understanding is that the transactional data for -- used for both regressions contain virtually every iPod sold in the U.S. during the time period?
A. Yes.
Q. If you go to page 13 --
A. Yes.
Q. -- does paragraph 6 list the summary of your
· · · A.· ·Let me give you -- let me give you a different
·example.· Suppose we had the entire population of
·students in a state and we wanted to estimate the effect
·of some intervention on student test scores, if we had
·that entire population, we would just use a standard
·regression analysis.· If we wanted to do something like
·test whether there are peer effects, say, within the
·neighborhood or the school, and we included a variable
·that measured characteristics of students nearby the
·other -- the -- the student in question, then that
·could -- could create a clustering problem.
· · · Q.· ·Okay.
· · · A.· ·But my understanding is that Professor Noll
·did not do that.
· · · Q.· ·And using the data that's at issue in our
·case --
· · · A.· ·Uh-huh.
· · · Q.· ·-- and the products that are at issue that are
·being modeled, can you give me an example?· You -- you
·used the school example.· Can you give an example using
·the --
· · · A.· ·Frankly, I can't even --
· · · · · ·MS. SWEENEY:· Objection.
· · · · · ·You've got to pause for a moment so I can
·interject my objection.
·opinions in the case?
· · · A.· ·Uh-huh.
· · · Q.· ·Okay.· And with respect to the first,
·"Clustering is inappropriate where, as here, the
·regressions use the entire population of transactions,"
·is it your opinion that clustering is inappropriate
·whenever the entire population is used?
· · · A.· ·It -- no.· It -- it could be appropriate if,
·for example, you use information from other units, other
·transactions in the data as part of an explanatory
·variable in a transaction for a particular transaction.
·So -- but Professor Noll did not do that.
· · · · · ·Each transaction was its own separate unit and
·each provides independent information on prices at which
·these transactions occurred, given the -- given the
·characteristics of the -- the different iPods and the
·different time periods when they were purchased.
· · · Q.· ·Going back to the circumstance that you
·described where clustering could be appropriate when
·using the entire population of transactions, you say it
·could be appropriate if, for example, you use
·information from other units, other transactions, in a
·transaction for a particular transaction.
· · · A.· ·I better clarify that.
· · · Q.· ·Please.
· · · · · ·Objection to form.· Vague and ambiguous.
·Incomplete hypothetical.
·BY MR. KIERNAN:
· · · Q.· ·Okay.· Go ahead.
· · · A.· ·Frankly, I can't think of a reason you would
·do that.· It -- it makes no sense to me to say that a
·transaction that happened someplace else, some other
·time period, would somehow have an effect on the price
·of this particular transaction when -- and -- and the
·point is Professor Noll didn't do it, so there's nothing
·to -- to be concerned about here.
· · · Q.· ·So, in your opinion, there's no circumstance
·under which clustering could be a problem with respect
·to the data that Dr. Noll used to estimate his
·regressions?
· · · A.· ·That's correct.· Let's -- and let me -- let me
·expand on that a little bit.
· · · · · ·I mentioned that he has, essentially, the
·entire population of transactions.· He -- he could have
·or Apple could have given him a 10 percent random
·sample.· There're easy ways to generate a random sample
·from a large population like that.· And then he would,
·really, have had a random sample and the analysis would
·have clearly been not subject to a criticism of
·clustering because the -- the -- the observations would
Page 25..28
www.aptusCR.com
YVer1f
Jeffrey Wooldridge, Ph.D.
Confidential - Attorneys' Eyes Only
The Apple iPod iTunes Anti-Trust Litigation
Page 29
·have been -- been drawn to represent the population and
·independently.
· · · · · ·And so if you, then, say, well, what if he
·took an additional 10 percent of the sample, then he
·would have more data and that would be reflected in the
·standard errors falling because you're getting more
·data.· And, again, because it's a random sample, there's
·no reason to cluster and clustering, in fact, only could
·inflate the standard errors in an artificial way.· And
·by the time you get up to the population -- having the
·whole population is not a problem.· That's a -- that's a
·good thing.· You have more information.· You want more
·data to more precisely estimate the coefficients in the
·regression model.
· · · · · ·So that's why I assumed that a 10 percent
·random sample wasn't taken because it's better to use
·more data than -- than less data.
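The point that more data shrinks standard errors while leaving the estimate well behaved can be sketched numerically. This is an illustrative simulation only — the "population" of prices, its parameters, and the sample sizes below are invented, not drawn from the case data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "population" of transaction prices (invented numbers).
population = rng.normal(loc=200.0, scale=40.0, size=1_000_000)

def se_of_mean(x):
    # Usual standard error of the sample mean: std dev / sqrt(n).
    return x.std(ddof=1) / np.sqrt(len(x))

sample_10 = rng.choice(population, size=100_000, replace=False)  # 10% sample
sample_20 = rng.choice(population, size=200_000, replace=False)  # 20% sample

# Doubling the sample cuts the standard error by roughly sqrt(2).
print(se_of_mean(sample_10), se_of_mean(sample_20))
```

Drawing a second, larger random sample only tightens the estimate, which is the sense in which using all the transactions is "a good thing."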
· · · Q.· ·Right.· And so your opinion is that if
·you have the entire population of iPod transactions
·there's no circumstance under which clustering would be
·appropriate?
· · · A.· ·That's correct.
· · · Q.· ·And you state, "There can be no cluster
·sampling problem because there is no sampling."
· · · · · ·If there were sampling --
·collect a random sample of hourly wage workers and they
·compute the sample average.
·The sample average is the simplest example of
·an OLS estimate, an ordinary least squares estimator.
·It minimizes the sum of squared deviations.· So it is --
·it is an example, essentially, of what Professor Noll is
·doing.· Of course, regression analysis is a little more
·complicated, but -- but let's stick with that.
· · · · · ·Suppose that -- so the proper thing to do
·would be to collect a random sample and you can look at
·any introductory statistics book and it will show you
·the formula for the standard error, which is the
·standard deviation you estimate from your sample divided
·by the square root of the sample size.· But suppose that
·you -- along with the hourly wage you actually collected
·information on the person's occupation.· So some people
·are in the service industry, some people are
·construction workers, some people are computer
·programmers.
· · · · · ·Now, suppose that after you've computed the
·sample average, you then compute the residuals, which
·would be everybody's -- every person's hourly wage net
·of the total average across the entire sample.· So what
·will you find if you do that?· Well, if you compute the
·residuals within the -- the -- the now cluster of
· · · A.· ·If --
· · · Q.· ·-- are there circumstances under which
·clustering would be appropriate dealing with iPod
·transactions?
· · · A.· ·With iPod transactions?· I don't see how -·not -- not the way -- it -- these were sampled
·transaction by transaction and so there can't be a
·cluster sampling problem if that's the way the -- the
·data had been sampled.
· · · Q.· ·Is it your opinion that the resid- -- that the
·error terms in Professor Noll's regressions are
·independent?
· · · A.· ·They're not independent ex post after you
·choose the clusters, and it's very, very simple to see
·that.· The clustering of a -- either the entire
·population in this case or a random sample create --
·artificially creates a problem that isn't there.· So the
·idea is -- and -- and let's take a simple example of
·this.
· · · · · ·Suppose that we wanted to estimate -- and make
·this simple -- we want to estimate the average, let's
·say, hourly wage in the population of all hourly wage
·workers.· The way we would do that -- and, of course,
·that's a -- that's a big population in the United
·States.· And surveys do this, they go out and they
·workers in the service industry, those, very likely,
·will be negative on average because service workers
·earn -- or, let's say, fast-food workers earn a lower
·hourly wage than the overall average.
· · · · · ·If you go to the computer programmers, you're
·going to find that that residual is positive because on
·average they earn more than the total average in the
·population.· In fact, the difference is simply -- the
·average residual is just the average of the hourly wage
·for computer programmers minus the overall average.
· · · · · ·This is exactly what Murphy and Topel do.
·They, then, do this ex post clustering, and they -- they
·make a point of saying that the residuals are negative
·here, they're negative and, you know, bigger here,
·they're positive here and so on, when this is perfectly
·predictable by the ex post clustering, but it does not
·mean that you should compute the standard error by
·clustering the data; we already know how to compute the proper
·standard error and that's to use the simple formula.
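The wage example can be reproduced in a short simulation: draw an i.i.d. sample, compute residuals from the overall average, then group ex post by occupation. The occupations echo the testimony, but every number below is hypothetical and chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical hourly wages for three occupations (invented figures).
wages = np.concatenate([
    rng.normal(12, 2, 3000),   # fast-food / service workers
    rng.normal(25, 5, 3000),   # construction workers
    rng.normal(45, 8, 3000),   # computer programmers
])
occ = np.repeat(["service", "construction", "programmer"], 3000)

residuals = wages - wages.mean()  # residuals from the overall sample average

# Overall the residuals average to ~0, but the ex-post occupation
# clusters have mechanically nonzero average residuals:
for name in ["service", "construction", "programmer"]:
    print(name, residuals[occ == name].mean())
```

The nonzero within-group averages arise purely from grouping after the fact, exactly the pattern the answer says is "perfectly predictable."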
· · · Q.· ·So is it your opinion that using clustered
·standard errors overstates the true standard errors?
· · · A.· ·Yes.
· · · Q.· ·And your opinion is that they -- in using
·clustered robust errors in this case -- overstate the
·standard errors in Professor Noll's two regressions?
· · · A.· ·Yes, I do.
· · · Q.· ·Okay.· And what's the basis for that?
· · · A.· ·The -- the basis is that we know how to
·compute the proper standard errors, which is how
·Professor Noll does it, and the clustering only -- the
·clustering ex post induces correlation.· And so if
·you add the term that's at the end of the cluster, the
·cluster robust standard errors, it's positive on average
·for exactly the -- the reason I just explained, using
·the simple example of hourly wages, because you've
·clustered workers, say, by their occupation and you know
·that workers in certain occupations are going to be
·correlated with each other because they're in that
·occupation, so they have either lower than average
·wages, they might have average wages or higher than
·average wages, but those averages move together within
·each cluster.
· · · Q.· ·And are there procedures or any tests that one
·could perform to test your conclusion that the clustered
·errors overstate the true standard errors in this case?
· · · A.· ·There aren't tests because the tests are going
·to -- are going to give you the conclusion that I just
·said.· You don't need to test it because you know what's
·going to happen ahead of time, that if you cluster on
·the basis of some feature where the average
·touch, let's say, and suppose there's just a single
·before-after period where Harmony was blocked and when
·it wasn't.· So in this situation what would the right
·analysis be?· If you just wanted an estimate of the
·average damage, then you could simply run a regression
·of the -- the price or the log price on the before-after
·dummy and not even actually have to include the -- the
·kind of -- the kind of iPod if you -- if both were
·available, both before and after and basically simulate
·the data so that it -- it is a random sample from the
·population and then ask what happens if we cluster after
·the fact on the kind of iPod, whether it's a classic,
·mini or shuffle or whatever.· This is what Murphy and
·Topel do.
· · · Q.· ·That's your understanding of what they did?
· · · A.· ·They -- they clustered by time period --
· · · Q.· ·Okay.
· · · A.· ·-- and by family of iPod.
· · · Q.· ·And define for me what family.
· · · A.· ·Well, in this case there -- there would be no
·difference because -- I wanted to -- if I didn't say
·this -- to simplify things so that there's only one kind
·of nano -- nano, one kind of touch, one kind of shuffle,
·so that there's no differences in capacity or any other
·features like that.
·systematically changes by that feature you will find
·that there's been cluster correlation because you've
·induced it by this -- this clustering that wasn't
·necessary.
· · · Q.· ·You stated that there aren't tests because the
·tests are going to give you the conclusion I just said.
· · · A.· ·There -- there aren't --
· · · Q.· ·Sounds --
· · · A.· ·-- useful tests.· There aren't -- there aren't
·useful tests.
· · · Q.· ·So there are no tests?
· · · A.· ·That's correct.
· · · Q.· ·Okay.
· · · A.· ·There is theory and there are simulations.
· · · Q.· ·Are there any simulations that you could run
·to test your hypothesis that clustered robust errors
·overstate the true standard errors in Professor Noll's
·regressions?
· · · A.· ·Yes.
· · · Q.· ·And what are those?
· · · A.· ·You can do the exercise that I basically just
·laid out.· In fact, we should make this -- if we were to
·make this about iPods and simplify the setting you could
·do the following: Suppose that there are five different
·kinds of iPods, what, classic, mini, nano, shuffle,
· · · Q.· ·Okay.· And define for me what "family" means
·with respect to iPods.
· · · · · ·MS. SWEENEY:· Objection.· Vague and ambiguous.
·Are you asking him -- well, I -- I just don't understand
·the question.
·BY MR. KIERNAN:
· · · Q.· ·If you don't know -- do you know what "family"
·refers to with respect to iPods?
· · · A.· ·"Family," I believe, refers to not just the
·type, but also different characteristics of the -- of
·the iPod --
· · · Q.· ·And -- and --
· · · A.· ·-- capacity and features and so on.
· · · Q.· ·And how did Dr. Murphy and Dr. Topel cluster
·the standard errors?· What's the cluster?
· · · A.· ·They said they clustered by family and -- and
·quarter.
· · · Q.· ·Okay.· And how many clusters do they use?
· · · A.· ·I'm not exactly sure because I don't believe
·it was apparent from the report or I might have -- I --
·I believe it's a few hundred.
· · · Q.· ·Do you know how many observations per cluster?
· · · A.· ·That's another thing that I did not find, but
·if you take in the one case 2 million observations and
·if it were 400 clusters, that would be 5,000 per cluster
·on average.
· · · Q.· ·Are there any -- is an alternative simulation
·from the one you described -- well, strike that.
· · · · · ·So the simulation that you just proposed would
·be to simplify it by going to the model level rather
·than the family level?
· · · A.· ·Yes.
· · · Q.· ·Am I -- okay.
· · · A.· ·Uh-huh.
· · · Q.· ·Could you also run simulations using the
·family level?
· · · A.· ·You could, yes.
· · · Q.· ·Okay.
· · · A.· ·Uh-huh.
· · · Q.· ·And is one preferable to the other or would
·you run them both?
· · · A.· ·Well, if one had the time, you would want to
·run a simulation that reflects the particular
·application, yes.
· · · Q.· ·Okay.
· · · A.· ·But the -- the -- if -- if the data structure
·had been the simple one that I had proposed, then the
·only clustering that could have been done is by the
·class of iPod.· And this is the analogue of what Murphy
·and Topel did in their more complicated situation.· They
· · · Q.· ·And how would one work that out?
· · · A.· ·You start off with the presumption that --
·first -- first, again, starting under the assumption
·that you have a random sample and then you see what
·happens after you cluster on a feature which could be
·family by quarter of -- of observation and -- and show
·that the -- what the bias in the clustered standard
·error is relative to the correct one.
· · · · · ·The simulation -- it's important also to
·understand the -- the point of the simulation is that
·you can actually figure out what the proper standard
·error is --
· · · Q.· ·Right.
· · · A.· ·-- because you control the data and so you
·know which standard error is -- is the one that's close
·to the one you're trying to get.· And the standard error
·that's going to win convincingly is the usual standard
·error that does not cluster.
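The simulation described here — draw random samples, cluster ex post on a product feature, and compare the clustered standard error against the usual one and against the true sampling variability — might be sketched as follows. The five iPod "kinds," their prices, and the sample size are invented for illustration, and the target is the simplest OLS estimate, the sample mean:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: five iPod "kinds" with different average prices
# (numbers invented for illustration, not from the case data).
kind_means = np.array([250.0, 200.0, 150.0, 80.0, 300.0])

def one_draw(n=5000):
    kind = rng.integers(0, 5, size=n)              # sampled obs by obs
    price = kind_means[kind] + rng.normal(0, 20, size=n)
    ybar = price.mean()
    u = price - ybar
    usual_se = price.std(ddof=1) / np.sqrt(n)
    # Cluster-robust SE for the mean, clustering ex post on kind:
    cluster_sums = np.array([u[kind == g].sum() for g in range(5)])
    clustered_se = np.sqrt((cluster_sums ** 2).sum()) / n
    return ybar, usual_se, clustered_se

draws = np.array([one_draw() for _ in range(300)])
true_sd = draws[:, 0].std(ddof=1)        # actual sampling variability
print("true sampling SD:", true_sd)
print("avg usual SE:    ", draws[:, 1].mean())   # tracks the true SD
print("avg clustered SE:", draws[:, 2].mean())   # far larger
```

Because the data control the "truth," the simulation can check which standard error matches the real dispersion of the estimates across replications.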
· · · Q.· ·Well, isn't the point of the simulations to
·see which standard error is going to win?
· · · A.· ·That's correct.
· · · Q.· ·Okay.
· · · A.· ·Uh-huh.
· · · Q.· ·Does the OLS standard error --
· · · A.· ·I should -- I should bring something into
·defined more clusters, but that's because there's more
·time periods and more family -- the -- the family -- the
·number of families is larger in that case.
· · · Q.· ·Well, the simple structure that you described
·does not exist; correct?
· · · A.· ·Oh --
· · · Q.· ·That doesn't define the data structure?
· · · A.· ·It doesn't exist for this --
· · · Q.· ·In reality.
· · · A.· ·-- particular application --
· · · Q.· ·Correct.
· · · A.· ·-- but it defines lots of data structures
·that -- that have been used for intervention analysis,
·sure.
· · · Q.· ·Right.· But not in this case?
· · · A.· ·It has the features, though, because, for
·example, once you have several thousand observations per
·cluster, then the simpler setting at least helps you
·learn something about how clustering can give you very
·overstated standard errors.
· · · Q.· ·Other than the simulations, are there any
·other procedures that one could perform to verify or
·test your claim that clustered robust errors overstate
·the true standard errors in Dr. Noll's regressions?
· · · A.· ·One could work out the theory, uh-huh.
·this.· We already know that the original standard error
·is going to do well because in this situation there are
·many -- more than 2 million observations in your case,
·but it would work really well with 1,000 observations,
·for example, because there's a simple formula that has
·been derived from theory that has been known for a long
·time and so we know that that formula is going to work.
· · · · · ·The only issue is, is there any bias and what
·is the nature of the bias in doing the clustering when
·you don't have to?· And so the issue is, really, how
·much are you going to be wrong by doing the clustering
·and how -- more to the point, how conservative will the
·clustering be.
· · · Q.· ·And as you -- as one increases the number of
·observations using OLS standard errors, can the bias --
·would the bias tend to increase or decrease?
· · · A.· ·Oh, the bias will decrease.· In fact, the --
·as I said, in most applications, once you have 1,000
·or a couple thousand observations the standard error
·that you compute from OLS, even if they're the so-called
·heteroskedasticity robust standard errors do quite well
·in those cases.· That, actually, raises an interesting
·point about the clustering, is that once you've decided
·on the clusters, so in the Murphy and Topel case,
·they've taken a stand that the clustering should be at
·the family quarter level, which does raise the question,
·why not at the month family level or the week family
·level or the year family level?· So how they came up
·with that clustering, I'm not sure.· I don't think
·it's -- it's ever described.· But once you've chosen the
·clustering scheme, the clustered standard errors will
·never get smaller.· They depend only on the number of
·clusters you've chosen, not on the number of overall
·observations, which is peculiar because under standard
·reasoning we should think that information is
·accumulating in a random sample as we get more and more
·data and that is what happens.
· · · · · ·That's why you see the usual OLS standard
·errors heading to zero at the rate one over the square
·root of the sample size and the clustered standard errors
·will just stay constant, given the number of groups that
·you have.· So you're left in the odd situation that
·having lots of transactions data is viewed as being the
·same as having not very much transactions data.· And
·that's because you're inappropriately clustering the
·standard errors.
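The claim that clustered standard errors depend on the number of clusters, not the number of observations, can be checked in a small sketch. With the number of groups fixed at five (hypothetical groups and values, invented for illustration), the usual standard error falls like one over the square root of the sample size while the clustered one stays roughly constant:

```python
import numpy as np

rng = np.random.default_rng(3)

# Fixed G = 5 ex-post clusters with different (invented) group means.
group_means = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

def both_ses(n):
    g = rng.integers(0, 5, size=n)
    y = group_means[g] + rng.normal(0, 1, size=n)
    u = y - y.mean()
    usual = y.std(ddof=1) / np.sqrt(n)      # shrinks like 1/sqrt(n)
    sums = np.array([u[g == k].sum() for k in range(5)])
    clustered = np.sqrt((sums ** 2).sum()) / n   # roughly constant in n
    return usual, clustered

for n in (1_000, 10_000, 100_000):
    usual, clustered = both_ses(n)
    print(n, round(usual, 4), round(clustered, 4))
```

Growing the sample a hundredfold leaves the clustered figure essentially unchanged — the "odd situation" described in the answer, where lots of data is treated like very little.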
· · · Q.· ·And just so I understand your opinion, if
·you keep group size constant --
· · · A.· ·If you keep --
· · · Q.· ·It's the G --
·That's why you have to use a different kind of thought
·experiment, which is -- one of which I laid out in
·section 5 of my declaration that shows you that,
·essentially, with the whole population you can argue
·that the usual standard errors are the -- are the right
·ones to use and, if anything, they're actually
·conservative because when you have the whole population,
·there's a -- a population correction that always reduces
·the standard errors.
· · · · · ·So the -- as you get more and more data,
·again, if you fix the number of clusters using the
·entire population is operationally the same as getting
·more data in the perspective that I laid out in -- in
·section 5.
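The finite population correction mentioned here has a standard textbook form, multiplying the usual standard error by sqrt((N − n)/(N − 1)), which always shrinks it and reaches zero when the sample is the entire population. A minimal sketch, with an invented population (not the case data):

```python
import numpy as np

rng = np.random.default_rng(4)

N = 100_000
population = rng.normal(200, 40, size=N)   # hypothetical full population

def se_with_fpc(x, N):
    n = len(x)
    plain = x.std(ddof=1) / np.sqrt(n)
    fpc = np.sqrt((N - n) / (N - 1))       # finite population correction
    return plain, plain * fpc

sample = population[:10_000]               # 10% of the population
print(se_with_fpc(sample, N))              # correction shrinks the SE

full_plain, full_corrected = se_with_fpc(population, N)
print(full_corrected)                      # 0.0 with the whole population
```

This is the sense in which the usual standard errors are, if anything, conservative when the whole population is in hand.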
· · · Q.· ·And would you expect the OLS standard -- the
·bias of the OLS standard errors to increase or decrease
·as you increase the number of transactions per -- number
·of observations per group, keeping the group constant?
· · · · · ·MS. SWEENEY:· Objection.· Vague and ambiguous.
·Incomplete hypothetical.
· · · · · ·THE WITNESS:· Okay.· So let me -- let me say
·this again.· The data have been collected by random
·sampling and so the clusters that have been -- that --
·the usual OLS standard errors ignore the clustering and
·they properly ignore the clustering because this is an
· · · A.· ·-- the number of groups constant?
· · · Q.· ·Right.
· · · A.· ·Yes.
· · · Q.· ·Let's call it G.
· · · A.· ·Uh-huh.
· · · Q.· ·So G, using your example, equals ten.
· · · A.· ·Yes.
· · · Q.· ·Okay.· If you increase the number of
·observations per group --
· · · A.· ·Yes.
· · · Q.· ·-- per cluster --
· · · A.· ·Uh-huh.· And the data have -- have come from a
·random sample.
· · · Q.· ·Oh, okay.· Well, what if the data comes from
·the entire population?
· · · A.· ·Well --
· · · Q.· ·What's the impact of using clustered robust
·standard errors --
· · · A.· ·Well --
· · · Q.· ·-- as the number of observations per cluster
·increase?
· · · A.· ·Well, the traditional standard error, if
·you don't -- if you actually act as if you have -- the
·entire population is zero because you have no -- you
·have no sampling error in the -- in the estimation.
·ex post structure that you've imposed on the data after
·you've collected it.· So the clustering that you use has
·no effect on the traditional OLS standard errors as it
·should be.
· · · · · ·Again, let me give you an example.· I mention
·the -- the hourly wage occupation example.· Suppose that
·in addition to occupation you collected information on
·highest grade completed, so schooling.· Now, if you did
·exactly the same exercise, if you collect the data --
·and, remember, the goal here is to just estimate the
·average wage in the population, but you say, I have
·information on schooling, now I'm going to put people
·into clusters based on the highest grade they've
·completed, which might be five, ten categories, you're
·going to find exactly the same phenomenon.
· · · · · ·On average, people with lower education are
·going to have a lower hourly -- hourly wage, and so
·within that cluster you're going to find correlation.
·Same thing, people with high levels of education are
·going to have on average a higher hourly wage.· So now
·you've got occupation and you've got education and
·there're two different ways of clustering, so which is the
·right one?· You know the answer has to be neither is the
·right one because you've -- you've collected it via a
·random sample.· The goal is to estimate the population
·11:36 a.m.
· · · · · ·(Recess.)
· · · · · ·(Mr. Murray telephonically joins deposition.)
· · · · · ·THE VIDEOGRAPHER:· Okay.· We're back on the
·record 11:59 a.m.
· · · · · ·MS. SWEENEY:· Professor Wooldridge, go ahead
·and make those clarifications.
· · · · · ·THE WITNESS:· The first clarification was the
·amount of hours I spent on the case up to writing the
·declaration, that was -- when I said five or six hours,
·that was the actual time writing the declaration. I
·spent another six hours reading the background material,
·the Noll report and the Murphy and Topel report.
· · · · · ·MR. KIERNAN:· Okay.
· · · · · ·MS. SWEENEY:· Was there one other one?
· · · · · ·THE WITNESS:· The Noll rebuttal?
· · · · · ·MS. SWEENEY:· No, I'm sorry.· I thought that
·you were going to clarify two issues of testimony.
· · · · · ·THE WITNESS:· Oh, oops.
· · · · · ·MS. SWEENEY:· That's okay.
·BY MR. KIERNAN:
· · · Q.· ·All right.· Dr. Wooldridge, I was going back
·through my notes and I -- it wasn't entirely clear to
·me, under what circumstances could clustering standard
·errors be appropriate when you have all the -- the --
· · · · · ·MS. SWEENEY:· -- the deposition?
· · · · · ·MR. KIERNAN:· Scott Murray from Apple is on
·the phone.
· · · · · ·MS. SWEENEY:· Okay.
· · · · · ·MR. KIERNAN:· Hi, Scott.
· · · · · ·MR. MURRAY:· Hello.
·BY MR. KIERNAN:
· · · Q.· ·As you sit here today, can you think of any
·circumstances under which one would ex post group the
·observations of iPod transactions and then use
·information that's computed from other transactions as
·part of the regression model for a particular
·transaction?
· · · A.· ·I can't think of why you would do that because
·a hedonic price regression is about relating the price
·of a particular unit to characteristics of that unit.
·And, of course, prices will change over time as demand
·and supply conditions affect prices, but that's the --
·the nature of a before-after analysis where you want to
·account for or control for the characteristics of the
·particular units that are being transacted.
· · · Q.· ·Earlier you described two ways, two procedures,
·that could be used to test the hypothesis that
·clustering -- that the clustered standard errors by Drs.
·Murphy and Topel inflate the standard errors compared to
·you have the entire population of transactions.
· · · · · ·MS. SWEENEY:· Objection.· Asked and answered.
· · · · · ·THE WITNESS:· As I said before, there's only
·one case that I could think of and that's where ex post
·you -- you group the observations and then you use
·information that's computed from other transactions as
·part of the regression model for a particular
·transaction.
· · · · · ·This would be like taking a sample of students
·and then computing family income of some peers who live
·next to them and including that in a regression model,
·but there's nothing like that done in the analysis by
·Professor Noll.
·BY MR. KIERNAN:
· · · Q.· ·Is that something that could be done with the
·transactional data for iPods?
· · · · · ·MS. SWEENEY:· Objection.· Incomplete
·hypothetical.· Vague and ambiguous.
· · · · · ·THE WITNESS:· It could be done, but I'm not
·sure why anybody would do that.
·BY MR. KIERNAN:
· · · Q.· ·Are there any --
· · · · · ·MS. SWEENEY:· Hold it.· Excuse me.· Before we
·go on, did -- did anyone join the --
· · · · · ·MR. KIERNAN:· Oh, yes.
·the true precision of the estimates.
· · · A.· ·Yes.
· · · Q.· ·Are there any other procedures that you can
·think of?
· · · A.· ·No.· One -- one has basically two tools at --
·at one's disposal when trying to evaluate any kind of
·statistical procedure.· And since standard errors are a
·measure of precision of estimates, that measure is
·across different realizations or samples of data.· And
·so you can either do a theoretical calculation, which
·uses the tools of statistics to account for the fact
·that we're seeing different realizations of data or you
·can actually do a simulation which creates different
·samples or realizations of the data and study the
·problem that way.
· · · Q.· ·With respect to the theoret- -- you said
·"theoretical calculation"?
· · · A.· ·Yes.
· · · Q.· ·Okay.· Have you done a theoretical calculation
·that's set forth in your declaration that tests the
·conclusion or hypothesis that Drs. Murphy's and Topel's
·clustering standard errors vastly inflate the standard
·errors compared with the true precision of the
·estimates?
· · · A.· ·After writing my declaration, I did do a
·calculation like that, yes.
· · · Q.· ·Is it in your declaration?
· · · A.· ·No.· It happened after my declaration.· It was
·to support what I knew had to be true by thinking
·through the -- the different kinds of examples.
· · · Q.· ·And have you produced that calculation that
·you're referring to?
· · · A.· ·Produced it to?
· · · Q.· ·To the lawyers in this case.
· · · A.· ·No.
· · · Q.· ·And are you relying upon those calculations,
·the theoretical calculations, for the opinions set forth
·in your declaration?
· · · A.· ·No.· What I was relying on was the idea
·that -- and -- and, again, I have to admit, I scaled the
·problem down so I could think about it better and
·thinking about either a few occupational classes or a
·few classes of iPod and what would happen in that case
·if you clustered on a characteristic such as occupation
·or class of iPod after you collected the data, and it
·became clear that, of course, you would find in some of
·the clusters there -- the residuals have a below -- an
·average below zero, in some cases it would be above
·zero.· The weighted average of them has to even out, has
·to be zero, because you know that you're -- you're
· · · Q.· ·And what are some of the reasons why the
·prices would be different for different families?
· · · A.· ·Oh, well, of course different places can have
·different sales going on.· They can have -- this isn't
·my area of expertise.
· · · · · ·I -- as I said, I haven't even looked at the
·data, and I don't need to in order to understand that a family
·decides -- is presented with a price or a reseller is
·presented with a price, and they make a decision to buy
·at that price or not.
· · · · · ·The fact that those prices may be the same for
·several families is -- does not imply that there is a
·clustering problem.· One can think of many situations
·where -- where that's true.· I give an example in my
·declaration.
· · · Q.· ·Sure.· But could there be a clustering
·problem?
· · · · · ·MS. SWEENEY:· Objection.· Vague and ambiguous.
·Incomplete hypothetical.
·BY MR. KIERNAN:
· · · Q.· ·Well, you said -- so we can clarify it, you
·said the fact that they may be the same for several
·families does not imply that there is a clustering
·problem, one can think of many situations where that's
·true.· I gave an example in my declaration.
·computing the overall population average.
· · · Q.· ·And --
· · · A.· ·And so that --
· · · Q.· ·Go ahead.
· · · A.· ·-- perfectly explained what Murphy and --
·Murphy -- Professors Murphy and Topel were finding in
·their calculation of clustered standard errors and
·looking at the residuals.
· · · Q.· ·What perfectly explained what Professors
·Murphy and Topel were finding in the calculation?
· · · A.· ·Well, Professors Murphy and Topel report after
·they do the clustering based on the family by calendar
·-- by -- by quarter, that they -- to show that there was
·a -- a problem that needed to be addressed with
·clustering, they computed the residuals within each of
·these clusters and they used as their main piece of
·evidence that -- or one of the main pieces of evidence
·that these average residuals were different across the
·different clusters.
· · · · · ·And I said that that is perfectly explained by
·the fact that there -- the prices are going to be on
·average different for different families as well as
·different quarters and that in no way implied that the
·standard errors had to be computed with -- with
·clustering.
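The mechanical point made here -- residual averages differ across families and quarters even when errors are independent, while their overall weighted average is exactly zero -- can be reproduced with a small simulated example. This is purely an editor's illustration; all names and numbers are invented and nothing below is from the record:

```python
import numpy as np

# Illustrative simulation: errors are drawn independently, but a
# group-level component omitted from the regression makes the
# within-group residual averages differ from zero, while the
# overall average of the residuals is still exactly zero.
rng = np.random.default_rng(7)
n_groups, per_group = 50, 40
group = np.repeat(np.arange(n_groups), per_group)
# Average price level differs by hypothetical "family":
group_price_level = rng.normal(size=n_groups)[group]
y = 10.0 + group_price_level + rng.normal(size=group.size)

# Regress y on a constant only: residuals are deviations from the mean.
resid = y - y.mean()
# Per-group residual averages: above zero for some groups, below for others.
group_means = np.array([resid[group == g].mean() for g in range(n_groups)])
```

The per-group averages spread widely even though every draw is independent, which is the phenomenon the witness says is "perfectly explained" without any clustering adjustment.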
· · · · · ·Are there circumstances --
· · · A.· ·Oh --
· · · Q.· ·-- that you can think of where there is a
·clustering problem or could be?
· · · A.· ·No, not with the way the data have been
·collected.· So in -- in my declaration I use the example
·that also had prices that you would expect not to -- to
·vary much, especially within geographic units and within
·time and that would be looking at the prices of some
·standardized item at a fast-food restaurant.
· · · · · ·The fact that two people might go to the same
·fast-food chain and pay the same price does not mean
·that those two observations form a cluster.· They're
·independent draws based on a person's decision to buy or
·not at that particular price and, in fact, it's -- the
·fact that there isn't that much variation in the prices,
·so the -- the example I used was suppose you're trying
·to -- to decide whether prices are systematically
·different in poor neighborhoods and -- and what I call
·nonpoor neighborhoods, the fact that there may be little
·price variation makes it all the more impressive if
·you can actually find a difference across the two
·different kinds of neighborhoods.
· · · · · ·And, of course, the fact that there's little
·price variation means that the variance of the residuals
·will be small and that helps with the precision of the
·estimates.· And this is what you find in Professor
·Noll's calculation where the standard errors -- there's a
·small residual variance and there's a lot of
·observations and so he properly finds small standard
·errors in his regression analysis.
· · · Q.· ·Is there a point at which the standard errors
·are so low that would cause an econometrician like
·yourself to question whether they were accurately
·calculated?
· · · · · ·MS. SWEENEY:· Objection.· Incomplete
·hypothetical.
· · · · · ·THE WITNESS:· All I'm concerned about is given
·the particular application, the model, the estimation
·method, the way the data have been collected, has the
·appropriate method been used or not and with lots of --
·lots of observations and with little residual variance,
·there's no rule of thumb below which the standard errors
·would have to hit before you -- before you got
·suspicious.· So I would say, no, there isn't -- there
·isn't some sort of threshold.
· · · · · ·I would -- I -- I evaluate these on the -- on
·the -- on the merits of the modeling exercise and the --
·the estimator used and in this particular case on how
·the sample is -- is obtained or in this case the
·quote, harmless if it's not needed, but this is -- this
·is not true in the context of clustering after you've
·collected a random sample.· If you have collected a
·cluster sample and you have a large number of clusters
·and relatively small observations within a cluster,
·then -- then you can show, as the number of clusters get
·large, the standard errors will approach the right
·values, but that's assuming you've collected a cluster
·sample.
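The cluster-sample case the witness describes -- a large number of clusters with relatively few observations per cluster -- can be sketched as follows. This is an illustrative implementation of the standard cluster-robust (Liang-Zeger) formula on simulated data; the function and all inputs are hypothetical, not from the record or any expert's actual calculations:

```python
import numpy as np

def cluster_robust_se(X, y, cluster_ids):
    """OLS with cluster-robust (Liang-Zeger) standard errors.

    The asymptotics the witness describes require many clusters; the
    "meat" matrix sums score contributions within each cluster before
    forming the sandwich."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    meat = np.zeros((k, k))
    for g in np.unique(cluster_ids):
        idx = cluster_ids == g
        s = X[idx].T @ resid[idx]      # cluster score: X_g' u_g
        meat += np.outer(s, s)
    cov = XtX_inv @ meat @ XtX_inv     # sandwich covariance
    return beta, np.sqrt(np.diag(cov))
```

With many clusters and genuinely independent errors, these standard errors should be close to the plain heteroskedasticity-robust ones, which tracks the witness's point that clustering is only needed when a cluster sample was actually collected.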
· · · Q.· ·Can you cite to any authorities, textbooks or
·articles, that support your conclusion that clustered
·robust errors overstate the true standard errors when
·using the entire population of transactions?
· · · · · ·MS. SWEENEY:· Objection.· Asked and answered.
·BY MR. KIERNAN:
· · · Q.· ·I'd like the actual names of the authorities,
·textbooks or articles to the extent the question was
·confusing.
· · · A.· ·Oh, so I said that I -- I don't know of any.
·I've -- I've worked this out since submitting my
·declaration.
· · · Q.· ·You mention that at some point you talked
·to -- did you say two people?
· · · A.· ·Uh-huh.
· · · Q.· ·And who were they?
·population.
·BY MR. KIERNAN:
· · · Q.· ·Are there factors that could impact the
·reliability of the precision of the standard errors that
·Dr. Noll reports?
· · · A.· ·I can't think of any.· A standard error
·calculation is a fairly straightforward thing in most
·cases with standard econometric methods such as OLS once
·you understand how the data have been -- been obtained.
·I should add, he did make the standard errors robust to
·heteroskedasticity of unknown form, which means the
·variance can change in an arbitrary way across
·transaction and that is the appropriate thing to do.
· · · Q.· ·Are there any authorities, textbooks, public
·articles that support your conclusion that clustered
·robust errors overstate the true standard errors when
·using the entire population of transactions?
· · · A.· ·Actually, this is fairly recent material. I
·started thinking about this a couple of years ago when I
·had conversations with two -- two people that I've
·worked with and we let it go.· And since then I've been,
·after writing the declaration, thinking about the merits
·of this case, I've worked out a little bit of theory as
·well as the simulation.
· · · · · ·It is commonly thought that clustering is,
· · · A.· ·Not -- not for this case.
· · · Q.· ·Understood.
· · · A.· ·Alberto Abadie is an econometrician at Kennedy
·School of Harvard and Guido Imbens is an econometrician
·at the Stanford Graduate School of Business.· I've
·co-authored with Guido before and I actually do lectures
·with him.
· · · Q.· ·And you noted that the three of you let it go.
·What -- what did you mean by that?
· · · A.· ·Oh, it actually -- we didn't completely let it
·go.· We just all got busy and were working on different
·things.
· · · Q.· ·And did the three of you or any number of you
·author a working paper?
· · · A.· ·There's no working paper.
· · · Q.· ·Any drafts of a working paper?
· · · A.· ·No.
· · · Q.· ·Any working paper of a working paper?
· · · A.· ·No.
· · · Q.· ·And are you working with them now -- either
·one of them, now that you've picked this topic back up?
· · · A.· ·I believe we will pick the topic up, yes.
· · · Q.· ·Okay.· Have you talked to them about it?
· · · A.· ·No.· We have been actually talking about the
·other -- just to clarify -- the material on the finite
·population analysis that I mentioned in section 5, I
·should say when you're using the entire population.
· · · Q.· ·All right.· Describe for me, as precisely as
·you can, Apple's pricing strategy for iPods.
· · · · · ·MS. SWEENEY:· Objection.· Vague and ambiguous.
·Overbroad.
· · · · · ·THE WITNESS:· Actually, I don't know what
·their strategy is.
·BY MR. KIERNAN:
·BY MR. KIERNAN:
· · · Q.· ·Do you know how often Apple reviewed its
·pricing for particular iPod families?
· · · · · ·MS. SWEENEY:· Same objections.
· · · · · ·THE WITNESS:· No.
·BY MR. KIERNAN:
· · · Q.· ·If you could state it again because you
·guys --
· · · A.· ·No.
· · · Q.· ·-- talked over each other.
· · · A.· ·No.
· · · Q.· ·For a particular iPod family, how often did
·Apple change the price on the iPod?
· · · · · ·MS. SWEENEY:· Objection.· Vague and ambiguous
·as to time.· Overbroad.· Compound.
· · · · · ·THE WITNESS:· I'm not sure.
·BY MR. KIERNAN:
· · · Q.· ·Do you have any knowledge of how frequent
·Apple changed the price of an iPod family?
· · · A.· ·No.
· · · Q.· ·Have you examined how Dr. Noll's regressions
·controlled for Apple's pricing policies and its impact
·on Apple's prices for iPods?
· · · · · ·MS. SWEENEY:· Objection.· Vague and ambiguous.
· · · · · ·THE WITNESS:· I looked at his regressions and
·noted that there are various features of the families --
·the units themselves as you would expect in a hedonic
·price regress- -- regression.
·BY MR. KIERNAN:
· · · Q.· ·Okay.· And did you examine whether Dr. Noll
·included all of the explanatory -- all of the variables
· · · Q.· ·Okay.· Is that something you examined?
· · · A.· ·When I read the reports, I focused mainly on
·the econometric issue of clustering.· I -- I really
·didn't form an opinion or absorb the particulars of the
·pricing strategy.
· · · Q.· ·Do you know what factors Apple takes in
·account in setting the prices for a typical -- pardon
·me.· Strike that.
· · · · · ·Do you know what factors Apple takes into
·account in setting prices for particular types of iPods?
· · · · · ·MS. SWEENEY:· Objection.· Vague and ambiguous.
·Overbroad.· Compound.
· · · · · ·THE WITNESS:· Well, I assume -- I assume cost
·is involved and I assume that the -- the -- the demand
·for the various features is involved.
·BY MR. KIERNAN:
· · · Q.· ·Putting aside your assumptions, do you know
·what factors Apple actually took into account in setting
·the prices for any iPod that Dr. Noll examined in this
·case?
· · · A.· ·No.
· · · · · ·MS. SWEENEY:· Same objections.
· · · · · ·Give me a second to interject my objections.
· · · · · ·THE WITNESS:· Okay.
·BY MR. KIERNAN:
·that -- or strike that.
· · · · · ·Did you examine whether Dr. Noll controlled
·for all the factors that Apple considered when it set
·the price for an iPod family?
· · · A.· ·I didn't form an opinion about that.· I wasn't
·asked to consider the model specification.
· · · Q.· ·How did Apple set the price of iPods sold in
·Apple retail stores?
· · · A.· ·I don't know that.
· · · Q.· ·Okay.· How did Apple set the price of iPods
·sold on Apple online store?
· · · A.· ·Again, I don't know that.
· · · · · ·MS. SWEENEY:· And I'm going to belatedly
·object.· Vague and ambiguous.· Compound.
·BY MR. KIERNAN:
· · · Q.· ·For a particular iPod model, let's say, the
·iPod nano second-generation, 4 gigabyte, would the price
·that Apple listed for that iPod be the same at the
·retail store and the Apple online store?
· · · A.· ·I don't know.
· · · · · ·MS. SWEENEY:· Same objection.
· · · · · ·Sorry.· Go ahead.
· · · · · ·THE WITNESS:· I don't know.
·BY MR. KIERNAN:
· · · Q.· ·How did Apple set the price of iPods sold to
Jeffrey Wooldridge, Ph.D.
Confidential - Attorneys' Eyes Only
The Apple iPod iTunes Anti-Trust Litigation
· · · Q.· ·-- family has a different feature?
· · · A.· ·No.· So if you -- again, if you go back to the
·case of how the clustering was done, it's not true that
·you have to account for all of those features in order
·for the usual standard errors collected under random
·sampling to be valid.· So --
· · · Q.· ·Go ahead.
· · · A.· ·-- for the issue of computing the standard
·errors -- and that -- that's why I'm not commenting on
·model specification -- no, it doesn't matter that there
·are some features that may not have been accounted for
·or some interactions of features or something like that.
·That's a modeling question.· That's not a question about
·the standard errors.
· · · Q.· ·Well, isn't the --
· · · A.· ·So this --
· · · Q.· ·Go ahead.
· · · A.· ·So this is -- this is a common misperception.
·If you take -- again, let's just start with a -- a large
·population and you're going to take a large random
·sample and you're going to estimate two models.· You
·have Y and you have X1 and X2 and you regress Y on X1
·and you regress Y on X1 and X2.
· · · · · ·Now, whether you should include X2 or not is a
·modeling issue.· The standard error that you compute in
· · · A.· ·Yes.
· · · Q.· ·And is it your testimony that omitted-variable bias -- omitted-variable bias has no impact on
·the -- on reliably calculating the standard errors for a
·model?
· · · A.· ·That's correct.
· · · Q.· ·And just to make sure that you and I are on
·the same page, when you refer to "omitted-variable
·bias," what -- what are you referring to?· Define that
·for me.
· · · A.· ·Well, you would like to estimate the
·coefficient on X1, let's call it beta 1, controlling for
·the effects of X2 and if X2 is correlated with X1 and
·you leave it out of the regression, then, in general,
·the estimator of beta 1 will be biased.
· · · Q.· ·The coefficient on the X1 will be biased?
· · · A.· ·That's correct.· Yes.
· · · Q.· ·But that will have -- your testimony is that
·will have no impact on the calculation of the standard
·errors?
· · · A.· ·That's correct.· You'll get a valid standard
·error and confidence interval for the parameter that you
·are estimating.· It's actually easy to think this
·through.· You just -- you can always write any equation
·that you're estimating.· There's a population version of
·the usual way are valid even in the first regression
·where you've omitted X2.· The fact that you've omitted
·X2 does not affect the calculation of the standard
·errors.
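The distinction being drawn -- omitting a correlated regressor biases the coefficient, not the standard-error formula -- can be checked with a simulated example. This is an editor's illustration only; the variables and coefficients are invented:

```python
import numpy as np

# Illustrative simulation: omitting a regressor X2 that is
# correlated with X1 biases the coefficient on X1.
rng = np.random.default_rng(42)
n = 100_000
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)        # X2 correlated with X1
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Long regression: y on (1, x1, x2) recovers beta1 near 2.
X_long = np.column_stack([np.ones(n), x1, x2])
b_long = np.linalg.lstsq(X_long, y, rcond=None)[0]

# Short regression: y on (1, x1) instead estimates
# beta1 + beta2 * cov(x1, x2) / var(x1) = 2 + 3 * 0.5 = 3.5.
X_short = np.column_stack([np.ones(n), x1])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]
```

The short regression still has a well-defined population coefficient (here 3.5), and the usual standard-error machinery applies to estimating that coefficient, which is the witness's point.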
· · · Q.· ·And what does it affect?
· · · · · ·MS. SWEENEY:· Objection.
·BY MR. KIERNAN:
· · · Q.· ·What would it affect under that scenario?
· · · · · ·MS. SWEENEY:· Vague and ambiguous.
·Incomplete.
· · · · · ·THE WITNESS:· Well, it --
·BY MR. KIERNAN:
· · · Q.· ·I'm not going to use -- let me strike that.
· · · · · ·I want to use a hypothetical that you were
·just using and you said that omitting the variable would
·not affect the calculation of the standard errors.
·Would omitting the variable affect anything else in the
·model?
· · · · · ·MS. SWEENEY:· Objection.· Vague and ambiguous.
· · · · · ·THE WITNESS:· Well, sure, it could bias the
·coefficient -- the coefficient and the simple
·regression.
·BY MR. KIERNAN:
· · · Q.· ·And so -- and is that what you referred to in
·your book as omitted-variable bias?
·it.· And so you can write Y as a linear function of X1.
·It may not have the coefficient that you want, but you
·can always do that and you can always write the model
·for Y as a function of X1 and X2.· Once you have a
·random sample, the calculation of the standard errors is
·standard.· There's no adjustment that needs to be made
·because you might have omitted X2.
· · · · · ·MS. SWEENEY:· Did you want to break for lunch?
· · · · · ·MR. KIERNAN:· Let me see if I'm done.
· · · · · ·Yeah, why don't we do that.
· · · · · ·THE VIDEOGRAPHER:· This will be the end of DVD
·No. 1.· We're going off the record at 12:41 p.m.
· · · · · ·(Recess.)
· · · · · ·THE VIDEOGRAPHER:· This is the beginning of
·DVD No. 2.· We're going back on the record at 1:49 p.m.
·BY MR. KIERNAN:
· · · Q.· ·Okay.· Dr. Wooldridge, the error terms in Dr.
·Noll's regressions, what do they represent?
· · · A.· ·Factors that affect price that we don't
·observe.
· · · Q.· ·And what --
· · · A.· ·Actually, they -- they can just be viewed as
·the difference between Y and its expectation conditional
·on the variables that are included in the regression.
· · · Q.· ·And what would be some reasons why factors
·that affect price would not be observed?
· · · A.· ·Well, usually there are various factors that
·can --
· · · · · ·MR. MURRAY:· Scott here.
·BY MR. KIERNAN:
· · · Q.· ·Go ahead.· You said usually there are various
·factors that can . . .
· · · A.· ·Right.· So if -- for example, if there are
·systematic differences across family and calendar year,
·then those differences would be included in the error
·term.
· · · Q.· ·And would -- if there are omitted product
·attributes that impact price, would those be captured in
·the error term?
· · · A.· ·The -- well, let me -- let me answer that like
·this:· If -- it's not necessarily what is in the error
·term that is -- that's important.· It's basically if
·you're trying to learn about the coefficient on a
·particular variable the question is whether you've
·included enough of the other factors.
· · · · · ·So if there -- I mean, as I tried to explain
·before, the -- the nature of the error term, whether it
·includes omitted factors or not, does not affect the
·issue that I was asked to -- to evaluate, which is the
·clustering issue.
·estimate under the assumption that the variance of the
·error term doesn't depend on any of the factors you've
·included in your regression model.· And there's an
·adjustment that allows for that variance to be
·unrestricted, an unrestricted function of those factors
·that is a little more complicated than that.
· · · Q.· ·And -- and what is that?
· · · A.· ·It's the so-called Eicker Huber White
·Estimator, which sometimes is called a sandwich
·estimator because the way the formula appears where on
·the inside there's a more general matrix that's
·estimated that allows for the squared error term to be
·correlated with the Xs and that's where the robustness
·to heteroskedasticity comes from.
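The sandwich form described here can be written out directly. The following is a minimal sketch of the HC0 version of the Eicker-Huber-White estimator on simulated data -- an illustration by the editor, not anyone's actual calculation in this case:

```python
import numpy as np

def hc0_se(X, y):
    """Eicker-Huber-White (HC0 "sandwich") standard errors:
    inv(X'X) @ [X' diag(u^2) X] @ inv(X'X), where u are the OLS
    residuals.  The middle "meat" matrix lets the squared error
    vary with the regressors, which is where the robustness to
    heteroskedasticity comes from."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    u = y - X @ beta
    meat = X.T @ (X * (u ** 2)[:, None])   # X' diag(u^2) X
    cov = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))
```

Compared with the classical formula, only the middle matrix changes; under homoskedasticity the two estimators converge to the same values.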
· · · Q.· ·And in calculating the standard errors, is
·there an assumption that the error terms in the
·regression are independent of one another?
· · · A.· ·Yes.· Well, the -- the -- the foundation of my
·book, actually, in -- in the case of random sampling,
·they're always independent of one another.· So when you
·have a random sample there's no issue about whether
·they're independent or not because they automatically
·have to be.
· · · Q.· ·Right.· And what about the case when it's not
·a random sample?· Are you applying that assumption that
· · · Q.· ·Okay.· Could variables that are unobserved or
·not measured by the regression impact the calculation of
·the standard errors?
· · · A.· ·Not when -- not when it's the population or a
·random sample from the population.· I -- so I gave you
·that example where you had X1 and then you had X1 and
·then X2, and it is, I believe, common -- some- --
·somewhat commonly thought that the omission of X2 can
·somehow affect the calculation of the standard error for
·the coefficient on X1, but that's -- that's not true
·under the sampling scheme that we're talking about,
·random sampling or knowing the population.
· · · Q.· ·Okay.· In the -- let me back up a step or two.
·How is the standard error calculated?· As Dr. Noll --
·using Dr. Noll's regression, how did he calculate the
·standard errors?
· · · A.· ·So he -- you would take the residuals from the
·regression and from those residuals -- so the -- the
·basic calculation that you learn in your first
·econometrics course estimates the variance of the error
·term by using the sum of squared residuals from the
·regression and then dividing it by a degrees of freedom
·correction and then that gets multiplied by the
·so-called X prime X inverse matrix.· And that's the
·valid -- that's the valid calculation for the variance
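The textbook calculation the witness outlines -- the sum of squared residuals divided by a degrees-of-freedom correction, multiplied by the X-prime-X-inverse matrix -- looks like this in a minimal sketch (simulated data, illustrative only, not Dr. Noll's actual regression):

```python
import numpy as np

def classical_ols_se(X, y):
    """Classical OLS standard errors as described: estimate the error
    variance by SSR / (n - k), the degrees-of-freedom correction,
    then multiply by the (X'X)^{-1} matrix."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = (resid @ resid) / (n - k)     # SSR over (n - k)
    cov = sigma2 * XtX_inv
    return beta, np.sqrt(np.diag(cov))
```

With a small residual variance and many observations, the resulting standard errors are small, consistent with the earlier testimony about the precision of the estimates.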
·the --
· · · A.· ·Well, if -- for example, if the data were
·truly collected by a cluster sample so that you were
·sampling clusters rather than individual units, then
·there would be some -- then that usual calculation of
·the heteroskedasticity robust matrix would not be
·correct, but that is assuming that you have collected --
·you have cluster sampling.
· · · Q.· ·Are there any circumstances under which ex
·post clustering, as you've described it in your report
·and today, would be appropriate when calculating
·standard errors when dealing with an entire population
·of transactions?
· · · A.· ·It -- it depends on the regressors that were
·included -- the -- the factors that -- that are included
·in the model and as long as those factors are specific
·to the individual transaction, then, no.· And that's
·what Professor Noll did.
· · · · · ·I believe earlier I men- -- earlier I mentioned
·a -- a case where if you sampled students independently,
·but then after the fact looked for peer effects by
·looking at, you know, children who live near them and so
·on, then that sort of addition to the model would or
·possibly could induce cluster correlation within the
·errors.
· · · Q.· ·With respect to that example -- I'm glad you
·brought it up -- at what level would you cluster in that
·example that you just gave?
· · · A.· ·Um, you would define -- typically what you
·would do is you would have to define the notion of who
·are the potential peers for a particular student.· And
·so it might be something like the classroom or something
·like the, you know, school and then you would compute an
·average once you have defined what the peer group is and
·then you would include those and so you would cluster at
·that level.
· · · Q.· ·And what factors would an econometrician
·consider when deciding at what level -- the level of
·clustering, whether it was, in your example, the
·classroom or something else, some other level?
· · · A.· ·Well, ideally in the case where you've
·actually collected a cluster sample, that determines it
·for you because you know what the clusters actually are.
·And the other consideration is if you include
·explanatory variables that are created by defining
·clusters, then that would also define the level of
·clustering.
· · · Q.· ·And -- and going back to your example where
·the data was not collected by cluster sampling --
· · · A.· ·Uh-huh.
· · · A.· ·Yes.· But if there are -- if there are no
·so-called peer effects included in the equation, then
·there's no need to cluster.
· · · Q.· ·And -- one second.
· · · · · ·If under that scenario there was no peer
·effect, but the researcher did cluster by the -- by the
·peer clustering level, as you suggested --
· · · A.· ·Uh-huh.
· · · Q.· ·-- what impact, if any, would that have on the
·precision of the standard errors?
· · · A.· ·Well, it depends on the -- if you had a large
·number of clusters with few observations, then the
·effect might be fairly small.· But then you would see
·that the effect is fairly small.· That's the -- the
·proof, essentially, is in the pudding, is that since you
·know that the original standard errors are correct, if
·you do the clustering and it matters a lot, then you've
·either got an unusual sample or the cluster effects have
·a bias in them -- I'm sorry -- the clustered standard
·errors have a bias in them.
· · · Q.· ·And, then, knowing that -- sorry -- you said
·since you know that the original standard errors are
·correct.
· · · A.· ·So the situation was --
· · · Q.· ·Yeah.
· · · Q.· ·-- what factors would an econometrician
·consider in determining the level of clustering in that
·scenario?
· · · A.· ·Again, you'd have to first define what the
·peer group is and then that would determine the level of
·clustering.· So if --
· · · Q.· ·Based on what?· Based on what factors? I
·mean --
· · · A.· ·Well, in -- in --
· · · Q.· ·-- what would an econometrician consider?
· · · A.· ·-- in that particular example, it's -- it
·actually doesn't have anything to do with being an
·econometrician.· That's the sort of question that the
·person undertaking the empirical work has to decide,
·what -- what sort of children do I think affects a
·particular child's outcome?· So it would be up to you to
·specify ahead of time that it's, you know, the
·neighborhood or the classroom or something like that.
· · · Q.· ·Okay.· And my view of that may be different
·from some other --
· · · A.· ·That's correct.
· · · Q.· ·-- somebody else's.
· · · · · ·So it's a matter of judgment --
· · · A.· ·Uh-huh.
· · · Q.· ·-- of the person running the study?
· · · A.· ·-- we randomly sample from the population --
· · · Q.· ·Right.
· · · A.· ·-- and then I said where you might have to
·cluster is when you create a peer effect.· And then I --
·I -- maybe I misunderstood you.
· · · Q.· ·No, no.
· · · A.· ·You -- you said, but suppose there is no peer
·effect and you cluster anyway, well, that can only
·increase the standard errors on average.
· · · Q.· ·What I was referring to, Dr. Wooldridge, is I
·want to understand when you stated -- what you meant by
·the original standard errors are correct and you were
·referring to --
· · · A.· ·I mean the ones that were obtained via the
·Eicker Huber White heteroskedasticity robust formula,
·yes.
· · · Q.· ·By drawing a random sample?
· · · A.· ·Yes.
· · · Q.· ·I asked you are there circumstances under
·which ex post clustering, as you've described it in your
·report and today, would be appropriate when calculating
·standard errors when dealing with an entire population
·of transactions.· Do you recall that question?
· · · A.· ·I do.
· · · Q.· ·Okay.· Depends on what regressors were
·included, the factors that are included in the model, as
·long as those factors are specific to the individual
·transaction, then, no, and that's what Professor Noll
·did.
· · · · · ·What is an example when there are factors that
·are not included where ex -- that would make ex post
·clustering appropriate --
· · · A.· ·There --
· · · Q.· ·-- or something to be considered?
· · · A.· ·There -- there aren't any unless you actually
·create these clusters ex post.· So, in other words,
·you'd have to make a conscious decision that you wanted
·to -- to include information about other transactions
·directly in the equation for this particular
·transaction.
· · · · · ·So, again, maybe I'm not -- you would be the
·cause of the clustering problem because you decided to
·do that.· If you don't decide to do that, then there
·can't be a clustering problem.
· · · Q.· ·Yeah, I think you and I are talking past one
·another.
· · · · · ·My question is -- maybe I'll ask -- reask the
·question.
· · · · · ·Are there any circumstances where you have the
·entire population of the data in which ex post
·understood to be you sample clusters from the population
·and this notion that you would, essentially, take a
·random sample and then create clusters after you've,
·essentially, observed the random sample is, I think, a
·fairly recent phenomenon and turn -- it turns out to be
·incorrect.
· · · Q.· ·And what authority can you point to that
·supports your conclusion that ex post clustering is
·incorrect, as you just put it?
· · · A.· ·Well, I -- I like to think of myself as an
·authority on this, and this is an issue that has come up
·in this particular case, and it's come up in some other
·areas that I'm aware of.· That's why, in fact, I mention
·the two authors that I've been working with started
·talking about this problem a couple of years ago.· So I
·guess I and my coauthors are the authorities.
· · · Q.· ·And can -- have you published any peer review
·articles that support your conclusion that ex post
·clustering is inappropriate when dealing with an entire
·population?
· · · A.· ·Not on that specific topic, no.
· · · Q.· ·And are you aware of any peer-reviewed
·articles or other publications that support the
·conclusion that ex post clustering is inappropriate when
·dealing with the entire population?
·clustering, as you described it in your report and
·today, would be appropriate when calculating the
·standard errors?
· · · · · ·MS. SWEENEY:· Objection.· Asked and answered.
· · · · · ·THE WITNESS:· So, again, if you are using
·information from the transaction that has been drawn,
·the answer is, no.· Only if you've created a problem
·where you, essentially, use information from other
·observations can you create a clustering problem in that
·setting.
·BY MR. KIERNAN:
· · · Q.· ·Ex post clustering, is that a generally
·accepted term in econometrics?
· · · A.· ·There's something closely related called ex
·post stratification.
· · · Q.· ·And is it your testimony today that those are
·the same thing?
· · · A.· ·No, they're not the same thing.
· · · Q.· ·Okay.
· · · A.· ·But they're --
· · · Q.· ·Focusing on ex post clustering, is that a term
·that is used in the field of econometrics?
· · · A.· ·That's a good question.· I'm not sure I could
·point to a source for that, actually.· I basically --
·the idea is that for a long time cluster sampling was
· · · A.· ·The closest thing, which, again, the -- the
·whole population versus sampling is a bit of a red
·herring here because if -- if we had taken a -- a large
·random sample, then the conclusions would -- would be,
·essentially, the same.· The closest I can think of is my
·own work on talking about inappropriately clustering a
·stratified sample.
· · · Q.· ·And where's that work?
· · · A.· ·That's the American Economic Review paper, the
·2003 paper.
· · · Q.· ·And is it your testimony that that paper
·states that ex post clustering is inappropriate when
·dealing with an entire population of transactions?
· · · A.· ·No.· It's -- so let's be clear.· It's a -- a
·statement about how clustering, when the data had been
·collected from a stratified sampling, is inappropriate.
· · · Q.· ·And is the -- was the data in -- that's used
·by Professor Noll, was that collected from a stratified
·sampling?
· · · A.· ·No.· So a special case of stratified sampling
·is random sampling and so if he had thrown out, you
·know, 80 percent of the data and called that a random
·sample, then it would apply -- apply directly to that
·case.· But, as I said before, when you collect more
·data, that's only a good thing.· So the different -- the
·issue of the population versus the sample is irrelevant
·here for the clustering issue.
· · · Q.· ·So yours and Dr. Noll's opinions with respect
·to the population are irrelevant to the issues in the
·case of whether clustering is appropriate?
· · · · · ·MS. SWEENEY:· Objection.· Misstates his
·testimony.· Argumentative.
· · · · · ·THE WITNESS:· No, I think that -- that's
·not -- certainly not what I said.· The --
·BY MR. KIERNAN:
· · · Q.· ·How is it relevant then?
· · · A.· ·How is?
· · · Q.· ·How is -- how is the fact that, in your view,
·that Dr. Noll had the entire population of iPod
·transactions relevant to the opinions that you're
·offering in this case?
· · · · · ·MS. SWEENEY:· Objection.· Asked and answered.
· · · · · ·THE WITNESS:· I'm not sure how I can answer
·that differently.
·BY MR. KIERNAN:
· · · Q.· ·Well, you stated --
· · · A.· ·So -- so, again -- so --
·BY MR. KIERNAN:
· · · Q.· ·-- the issue --
· · · A.· ·Okay.· So let me --
· · · · · ·MS. SWEENEY:· Please don't -- David, don't
·interrupt.
· · · · · ·MR. KIERNAN:· He had stopped --
· · · · · ·THE WITNESS:· Yeah, and --
· · · · · ·MR. KIERNAN:· -- and I started.· So I didn't
·interrupt him.· He actually interrupted me.
· · · · · ·THE WITNESS:· That was my fault, yeah.
·BY MR. KIERNAN:
· · · Q.· ·But go ahead.
· · · A.· ·The -- again, if you take the millions of
·transactions that are in the population and you took a
·random sample from those, okay, the clustering on the
·basis of characteristics that you, you know, observe,
·like the family in the quarter, would be the incorrect
·thing to do.
· · · · · ·As you get more and more data, that doesn't
·change, and so whether you think of that as the entire
·population or a larger random sample, essentially you --
·you get -- you get the same answer.
· · · · · ·MS. SWEENEY:· The spotlight is on you.
· · · · · ·MR. KIERNAN:· But the documents are great.
·We're all laughing because I made a joke.
·BY MR. KIERNAN:
· · · Q.· ·Your opinion that there can be no cluster
·sampling problem because there is no sampling, what
·authority supports that opinion?
· · · A.· ·The argument that I just used for you, that --
· · · Q.· ·Okay.
· · · A.· ·-- the --
· · · Q.· ·Any peer-reviewed publication that supports your
·opinion that there can be no cluster sampling problem
·because there is no sampling?
· · · · · ·MS. SWEENEY:· And I'd just like to interject
·an objection and ask the witness, you can go ahead and
·finish your prior answer that Mr. Kiernan interrupted.
·BY MR. KIERNAN:
· · · Q.· ·And if I did interrupt you, I did not mean to.
· · · A.· ·So, again, the -- when you have a large
·population you could take a random sample from that, in
·which case there's no justification for the clustering.
· · · Q.· ·Yeah.· I understood your argument.· What I'm
·asking is what peer-reviewed authorities can you cite to
·me just --
· · · A.· ·For the population problem, I -- I can't.
· · · · · ·MS. SWEENEY:· And, again, I'm going to ask you
·to stop interrupting the witness.· You -- every time he
·starts to give an answer, you interrupt if you don't
·like it.· You're not entitled to do that.· So please
·don't interrupt the witness anymore.
·BY MR. KIERNAN:
· · · Q.· ·In your report you describe the -- section 5
·you referred to it a couple of times today -- to the
·unconfoundedness assumption.
· · · A.· ·Uh-huh.
· · · Q.· ·Do you recall that?· And define that for me.
· · · A.· ·That means that the assignment to the
·treatment in the control group doesn't depend on, in
·this case, what the price differential would be under
·two regimes.· So think of -- think of it before Harmony
·was blocked and after Harmony was blocked and actually
·just -- you don't have to bring time into it, just two
·states of the world, Harmony is blocked, Harmony is not
·blocked.· And then you see some units where that was
·true and some units where that wasn't true and the idea
·is that the intervention is independent of what the
·difference in prices would have been in the two states
·of the world.
· · · Q.· ·And what circumstances must exist for the
·unconfoundedness assumption to hold?
· · · A.· ·Well, there -- you could have an experiment
·where you have a random intervention or you can have a
·before-after where you have included enough factors so
·that it's -- the intervention is effectively random
·after you've included those factors.
· · · Q.· ·And have you examined whether or not with
Jeffrey Wooldridge, Ph.D.
Confidential - Attorneys' Eyes Only
The Apple iPod iTunes Anti-Trust Litigation
·respect to Professor Noll's two regressions the
·unconfoundedness assumption holds?
· · · A.· ·That's more -- that's a modeling question, so
·I did not think about that, yes.
· · · Q.· ·And so you're not offering an opinion on
·whether or not the unconfoundedness assumption applies
·with respect to Professor Noll's two regressions?
· · · A.· ·That's correct.
· · · Q.· ·If there were important variables that
·explained prices of iPods that were left out of the
·regression, could that cause the unconfoundedness
·assumption to be violated?
· · · A.· ·Yes, it could and -- but I don't have an
·opinion on whether that's the case here.
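The unconfoundedness condition described in the testimony above can be sketched in a short simulation. This is an illustrative sketch only, not part of the record or of either expert's analysis; all numbers (the 5.0 effect, the price distribution, the sample size) are hypothetical.

```python
# Illustrative sketch (hypothetical numbers, not from the case): under
# unconfoundedness -- here, which pricing regime a unit falls under is
# independent of what its outcome would be in either regime -- a simple
# difference in means recovers the assumed price differential.
import random
import statistics

random.seed(0)
TRUE_EFFECT = 5.0  # hypothetical price differential between the two regimes

treated, control = [], []
for _ in range(20000):
    base = random.gauss(100.0, 10.0)   # unit's baseline price
    if random.random() < 0.5:          # assignment independent of outcomes
        treated.append(base + TRUE_EFFECT)
    else:
        control.append(base)

estimate = statistics.fmean(treated) - statistics.fmean(control)
```

Because assignment is purely random here, the estimate lands near the assumed effect; if assignment instead depended on the potential price differential, the same difference in means would be biased.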
· · · Q.· ·When dealing with an entire population, are
·there methods available to econometricians to account
·for correlation of error terms when calculating standard
·errors?
· · · A.· ·Well, like I said, there's no need to do it
·when you're using only the information from each record
·in your econometric analysis.
· · · · · ·So the -- you know, the fact that you then
·decide that after you have this entire population that
·you're going to, essentially, arbitrarily define
·clusters and then conclude that based on your
·definitions you're seeing correlation within those
·clusters does not say that you should use cluster robust
·inference.
· · · Q.· ·In your textbook you note that "We should not
·expect good properties of the cluster robust inference
·with small groups and very large group sizes when
·cluster effects are left in the error term."
· · · · · ·Do you recall saying something along those
·lines?
· · · A.· ·Yes, uh-huh.
· · · Q.· ·What do you mean by that?
· · · · · ·MS. SWEENEY:· Can -- can -- can I interject
·for a moment?· Do you have a copy of that?
· · · · · ·MR. KIERNAN:· Well, he recalls it.
· · · · · ·MS. SWEENEY:· Yeah, but can you give the page
·cite?
· · · · · ·MR. KIERNAN:· I don't have the page cite.
· · · · · ·THE WITNESS:· Is that from the second edition
·of my book?
·BY MR. KIERNAN:
· · · Q.· ·Yes.
· · · A.· ·Means that if you -- if you use -- if you use
·cluster sampling and you choose only a relatively small
·number of clusters with large cluster sizes, there's no
·theory that says that those cluster robust standard
·errors have any desirable properties.
· · · Q.· ·Okay.· And when you state, "We should not
·expect good properties of the cluster robust standards,"
·what are the good properties that you're referring to?
· · · A.· ·You would want them to, essentially, be
·unbiased estimates or even consistent estimates of the
·actual sampling variances.
· · · Q.· ·Right.· So one would be unbiasedness?
· · · A.· ·Yes.· Well, that's -- so the sampling
·variances, the -- the usual OLS estimators are unbiased
·estimators of the -- the usual -- the usual variance
·estimators for the OLS -- that's a -- the sampling
·variances for the OLS estimators are unbiased and then
·we take the square roots of them to get the standard
·errors.· And you can't always -- the -- the cluster
·robust ones, they're never exactly unbiased, so you talk
·about approximations and you often talk about what
·happens as you get more and more data.
· · · · · ·And the theory on cluster sampling does not
·allow for the case where you have a small number of
·clusters and -- well, having a small number of clusters
·is a problem in general and certainly when you have a
·large number of observations per cluster none of the
·theory applies to that case.
· · · Q.· ·And what number of clusters is -- what
·threshold is the point at which the number of clusters
·is no longer a problem?
· · · A.· ·This is a -- an impossible question to answer.
·It is the question empirical people are most interested
·in.· So assuming that it's appropriate to cluster, it
·depends on lots of different characteristics of the
·problem.· For example, it depends on how big the cluster
·sizes are.· It depends on the distribution of the
·observables and unobservables in the population.
· · · · · ·So this is something that theory can't easily
·answer and that's why people do simulations to try to
·find when the -- the -- the theory of having a large
·number of clusters seems to work fairly well.
· · · Q.· ·When you referred to "observables" in your
·last answer, were you -- and "unobservables," what were
·you referring to?
· · · A.· ·The explanatory variables and -- as the
·observables and then the error term is the
·unobservables.
· · · Q.· ·And so that I understand your answer, it's
·that theory doesn't provide the answer and so there are
·econometricians that are applying empirical analysis to
·try to answer that question?
· · · A.· ·Simulation studies, yes.
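The cluster effect discussed above can be illustrated with a short simulation of the kind the witness mentions. This sketch is hypothetical and not from the record: observations share a cluster-level shock, and the naive i.i.d. standard error of the sample mean understates the cluster-robust one.

```python
# Minimal simulation sketch (assumed parameters, not testimony): when
# observations within a cluster share a common shock, the naive i.i.d.
# standard error of the mean is too small; the cluster-robust version
# accounts for the within-cluster correlation.
import random
import statistics

random.seed(1)
G, M = 50, 40  # 50 clusters, 40 observations each (hypothetical sizes)
clusters, data = [], []
for _ in range(G):
    shock = random.gauss(0.0, 2.0)  # shared cluster-level effect
    c = [shock + random.gauss(0.0, 1.0) for _ in range(M)]
    clusters.append(c)
    data.extend(c)

n = len(data)
mean = statistics.fmean(data)
naive_se = statistics.stdev(data) / n ** 0.5
# cluster-robust variance of the mean: squared sums of within-cluster
# residuals, summed over clusters, divided by n^2
crv = sum(sum(x - mean for x in c) ** 2 for c in clusters) / n ** 2
cluster_se = crv ** 0.5
```

With a strong shared shock, `cluster_se` comes out several times larger than `naive_se`; this is also why simulation studies, rather than theory alone, are used to judge how many clusters are "enough."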
· · · Q.· ·Is it simulation studies like what Hansen did
·and the quarter or the date of the transaction and so
·on.
·BY MR. KIERNAN:
· · · Q.· ·And with respect to the school example --
· · · A.· ·Uh-huh.
· · · Q.· ·-- if I got all the students within a state --
· · · A.· ·Uh-huh.
· · · Q.· ·-- and I got the big data file, wouldn't it be
·analogous to what you just described in that I would
·learn of the school and the district from that same
·dataset?
· · · · · ·MS. SWEENEY:· Objection to form.· Vague and
·ambiguous.· Incomplete.
· · · · · ·THE WITNESS:· Yes.· And, in fact, that's why I
·included that example in my declaration is because once
·you've gathered the information on the students, the
·fact that you also learned about their school and their
·school district does not mean you should then group them
·into clusters based on their school or their district.
·BY MR. KIERNAN:
· · · Q.· ·But that example in your report -- just to
·make sure I understand it -- aren't you, what you're
·referring to right now, is when you randomly sample the
·students at the first stage?
· · · A.· ·Yes.
· · · Q.· ·Yeah.· Okay.· Differ- -- I have a different
·hypothetical.
· · · A.· ·Okay.
· · · Q.· ·You pull all the students' information
·first --
· · · A.· ·Uh-huh.
· · · Q.· ·-- and then from there you learn about -- you
·learn from the dataset you have the entire population of
·students --
· · · A.· ·Uh-huh.
· · · Q.· ·-- and then you learn from the dataset, the
·schools and districts and so forth.· How is that
·different from what you describe on page 3?
· · · · · ·MS. SWEENEY:· Objection.· Vague and ambiguous.
·Incomplete.
· · · · · ·THE WITNESS:· As a practical matter I don't
·think it's different.
·BY MR. KIERNAN:
· · · Q.· ·Okay.· How does one determine -- or strike
·that.
· · · · · ·What methods does an econometrician apply to
·determine if something is ex post clustering?
· · · A.· ·Oh, well, you have to know -- there are -- so
·if you have a sample of data, then, of course, you know
·because after you've drawn the observations you can see
·what you're clustering on.· In the case of a whole
·population you know because the clustering is
·essentially arbitrarily defined.
· · · Q.· ·Any other factors that one would consider if
·ex post facto clustering occurred when a whole
·population is being used?· You named the clustering is
·arbitrarily defined.· Any other factors?
· · · A.· ·Well, in -- again, since the distinction
·between the population and the sample is really one of
·number of observations here, if you can determine it
·based on a random sampling thought experiment, then you
·can determine it in the population as well.
· · · · · ·So, in other words, if I took a random sample
·of the transactions and then clustered on the basis of
·family and quarter, then the same sort of clustering
·would be inappropriate with the entire population.
· · · Q.· ·And what factors does an econometrician
·consider in determining whether the clustering was, in
·your words, essentially arbitrary?
· · · A.· ·Well, because -- because the observations
·don't have a cluster structure.
· · · Q.· ·And how do you determine that?· How does --
·what are the factors that an econometrician considers to
·determine whether, in your words, there is -- they do
·not have a cluster structure?
· · · · · ·MS. SWEENEY:· Objection.· Asked and answered.
·Vague and ambiguous.
· · · · · ·THE WITNESS:· You have to have a population of
·clusters and then from that you know what the cluster
·structure is.
·BY MR. KIERNAN:
· · · Q.· ·And what -- what factors do you consider to
·determine whether you have a population of clusters?
· · · A.· ·Well, it, again, depends on -- the population
·is -- of clusters is defined once you have determined
·the sampling scheme.· So in other words, if you're
·sampling schools, clusters of schools.
· · · Q.· ·On page 6 of your declaration -- let me know
·when you get there.
· · · A.· ·Yes.
· · · Q.· ·It's roughly a quarter of the way down,
·"Clustering is a property of how the data are collected
·and has nothing to do with how much variation there is
·in the underlying population variable or variables."
· · · A.· ·Uh-huh.
· · · Q.· ·Do you see that?
· · · A.· ·Uh-huh.
· · · Q.· ·And what authority supports that statement?
· · · A.· ·Well, it's -- the -- I've given several
·examples.· It's -- the authority is basically that you
·can't really anticipate every time somebody is going to
·get something wrong like this, so I tried to explain
·through examples that if you -- if you have a variable
·that doesn't change very much in a population and you
·draw a random sample from it, that has -- how much
·variation there is in the population has nothing to do
·with whether you have to treat those observations as
·being from a cluster sample.
· · · · · ·MR. KIERNAN:· Okay.· Move to strike as
·nonresponsive.
· · · · · ·MS. SWEENEY:· And I disagree with that
·characterization.
·BY MR. KIERNAN:
· · · Q.· ·For the statement, "Clustering is a property
·of how the data are collected and has nothing to do with
·how much variation there is in the underlying population
·variable or variables," stated on page 6 of Wooldridge
·1, please cite for me authority that supports that
·proposition.
· · · A.· ·I can't give you a citation for that.
· · · Q.· ·And is it your testimony today that that
·statement is generally accepted in the field of
·econometrics?
· · · A.· ·Yes, I believe it would be.
· · · Q.· ·And can you cite to any authority, any peer-
·reviewed publications, that support your testimony that
·that is generally accepted in the field of econometrics?
· · · A.· ·I -- I'm sorry.· Could you repeat that?· That
·sounded like the same question to me.
· · · Q.· ·Slightly different.· And that is, can you cite
·to any authority, including a peer-reviewed publication,
·that supports your testimony that your statement on
·page 6 is generally accepted in the field of
·econometrics?
· · · A.· ·Well, here's what I can do:· I can point to
·the literally thousands, if not tens of thousands, of
·papers published in empirical economics that never
·discuss whether the variation in the dependent
·variable has any bearing on whether to cluster or not.
·So I would think it would show up somewhere if that were
·actually an issue.
· · · · · ·So many labor economists have done analyses
·with all kinds of response variables, including, as I
·said, variables such as do you have a job or not, and
·that has much less variation because it's a 0-1 variable
·than if you look at their annual earnings.· And the fact
·that one is much less variable than the other has
·nothing to do with whether you should cluster the data.
· · · Q.· ·And it's your testimony that labor
·econometricians, it's generally accepted, that they
·would not use clustering in the scenario that you just
·described?
· · · A.· ·If -- if you ran- -- if you had a random
·sample, I certainly hope not.· Again, you could only
·cluster if after you obtained your data you define some
·clusters to -- to cluster on and that would lead to an
·increased bias in your standard errors.
· · · Q.· ·Okay.· Name for me the ones that you can
·recall, as you sit here, the literally thousands, if not
·tens of thousands, of papers published in empirical
·economics that never discuss whether the variation in
·the underlying population variable or variables --
· · · A.· ·Well, I certainly --
· · · Q.· ·-- is relevant with respect to clustering?
· · · · · ·MS. SWEENEY:· Yeah, I'm going to object.
·That's sort of a ridiculous question.· He's not going to
·sit here and identify tens of thousands of articles for
·you.
·BY MR. KIERNAN:
· · · Q.· ·Could you list, as you sit here today, tens of
·thousands of articles?
· · · · · ·MS. SWEENEY:· Objection.· Improper --
· · · · · ·THE WITNESS:· As I sit here right now, no.
·BY MR. KIERNAN:
· · · Q.· ·Okay.· Name -- name five for me.
· · · A.· ·Well, look, I follow empirical work and --
· · · Q.· ·Just name five.
· · · · · ·MS. SWEENEY:· Objection.
· · · · · ·MR. KIERNAN:· Okay.
· · · · · ·MS. SWEENEY:· Asked and answered.
·Argumentative.
·BY MR. KIERNAN:
· · · Q.· ·Name one.
· · · · · ·MS. SWEENEY:· Harassing the witness.
·Stop now.
· · · · · ·THE WITNESS:· Angrist -- Angrist and Krueger,
·their paper on estimating the effects of schooling on
·wages.
·BY MR. KIERNAN:
· · · Q.· ·Okay.· Any others you can think of?
· · · A.· ·Melon's paper on evaluating a job training
·program, an AER paper.· The thing is that they don't
·discuss this issue because it doesn't come up.
· · · Q.· ·What issue doesn't come up?
· · · A.· ·The fact that there's -- how much variation
·there is in their data, whether that leads to a
·clustering problem or not.
· · · · · ·MR. KIERNAN:· Let's take a short break.
· · · · · ·MS. SWEENEY:· Are you almost done?
· · · · · ·THE VIDEOGRAPHER:· Going off the record at
·2:43 p.m.
· · · · · ·(Recess.)
· · · · · ·THE VIDEOGRAPHER:· Okay.· We're back on the
·record at 3:04 p.m.
·BY MR. KIERNAN:
· · · Q.· ·Dr. Wooldridge, do you agree with the
·statement, "It is probably a sensible rule to at least
·consider the data as being generated as a cluster sample
·whenever covariates at a level more aggregated than the
·individual units are included in an analysis"?
· · · A.· ·That sounds like something I wrote.
· · · Q.· ·And do you agree with it?
· · · A.· ·Actually, I would want to re-examine that
·statement in light of this sort of recent research that
·I've done.· It certainly is -- when you have variables
·that are defined at, like, a school district level and
·you have schools, you may or may not have to cluster.
·But then when you do it or you don't do it, you can see
·the answer, assuming that you have a large number of
·clusters with relatively few units per cluster.· But it
·may be too -- it may be too conservative to do that.
· · · Q.· ·And, as you sit here today, do you stand by
·your statement that "It's a sensible rule to at least
·consider the data as being generated as a cluster sample
·whenever covariates at a level more aggregated than the
·individual units are included in an analysis"?
· · · A.· ·If you -- I mean, ideally you would know the
·level at which the data were clustered, so that's --
·that's a case -- that's a conservative approach to the
·problem where you might not know how the data were
·generated.
· · · Q.· ·And so today do you stand by your statement in
·your book?
· · · A.· ·Actually, I would -- as I said, in light of
·these sort of new findings on clustering random samples,
·I would actually want to revisit that to see whether
·that's a -- a useful thing to do or not.· And it has to
·do with -- it's always going to have to do with the
·number of clusters that you have and the number of
·observations per cluster.
· · · Q.· ·And what would you want to examine in
·analyzing or considering whether or not to revise the
·statement in your book that we've been discussing?
· · · A.· ·Well, I would want to work out the theory for
·what happens when you assume the data are generated from
·a random sample, but you include covariates that are
·defined at a higher level.
· · · Q.· ·Anything else?
· · · A.· ·Simulation.
· · · Q.· ·In Professor Noll's analysis of the iPod
·transactional data, did he include covariates at a level
·more aggregated than the individual unit transactions?
· · · A.· ·Well, each -- each variable -- no, each
·transaction is defined by its characteristics.
· · · Q.· ·Okay.· So your understanding is that
·Professor Noll's analysis of the iPod transaction data,
·he did not include covariates at a level more aggregated
·than the individual transactions?· Is that your
·understanding?
· · · A.· ·No.· He included time effects and he included
·attributes of the products.
· · · Q.· ·Okay.· And is it your testimony that time
·effects --
· · · A.· ·Those are --
· · · Q.· ·-- and attributes of products are covariates
·at a level more aggregated than the individual unit
·transactions?
· · · A.· ·They're not more aggregated.· An example would
·be to say that because some people have the same level
·of education that -- that how -- is somehow a variable
·that's aggregated -- that's defined at a more aggregated
·level because you can put everybody into a class of
·education.
· · · Q.· ·Okay.· And either we're not communicating
·well -- well, let me -- I just want to ask this to make
·sure I have this clear.
· · · · · ·Did Professor Noll's analysis of iPod
·transactional data include covariates at a level more
·aggregated than individual iPod transactions?
· · · A.· ·Not the way I would define them, no.
· · · Q.· ·And when you say "define them," what is the
·"them" referring to?
· · · Q.· ·Did you examine Professor Noll's analysis to
·determine whether he included covariates at a level more
·aggregated than the individual units that are included
·in the analysis?
· · · A.· ·Did I examine the regressions?· I did look --
· · · Q.· ·For that purpose.
· · · A.· ·I did look at the variables that were included
·in the regression, yes.
· · · Q.· ·Do you agree that in some cases you can define
·the clusters to allow additional spatial correlation?
· · · A.· ·Spatial correlation?· Well, spatial
·correlation has to do when you have -- is usually a
·feature where you have large geographical units and you
·don't have random sampling.
· · · Q.· ·So, for example, if you think of sampling
·fourth grade classrooms and you're concerned about
·correlation in student performance not just within the
·class, but also within the school, then you could define
·the clusters to be the schools?
· · · A.· ·Not if you take a random sample of fourth
·graders from the population of fourth-grade classrooms.
·That would be an example of what I'm calling ex post
·clustering because you would then be, essentially,
·looking at the school that the classroom came from and
·making that your cluster when you already have a random
·sample of fourth-grade classrooms so you don't need to
·do anything further.
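The "ex post clustering" point above can be illustrated with a hypothetical sketch (not from the record, not either expert's analysis): with a genuine i.i.d. random sample, grouping the observations after the fact into arbitrary "clusters" is unnecessary, because there is no shared cluster shock for the robust formula to pick up.

```python
# Hypothetical sketch: draw an i.i.d. random sample, then impose arbitrary
# after-the-fact groups. The "cluster-robust" standard error of the mean
# comes out close to the usual i.i.d. one -- the ex post grouping adds
# nothing, since no within-group correlation exists by construction.
import random
import statistics

random.seed(2)
data = [random.gauss(0.0, 1.0) for _ in range(5000)]  # i.i.d. draws

n = len(data)
mean = statistics.fmean(data)
iid_se = statistics.stdev(data) / n ** 0.5

# ex post clustering: carve the sample into 100 arbitrary groups of 50
groups = [data[i:i + 50] for i in range(0, n, 50)]
crv = sum(sum(x - mean for x in g) ** 2 for g in groups) / n ** 2
expost_se = crv ** 0.5
ratio = expost_se / iid_se
```

Contrast this with the case of a true shared cluster shock, where the two standard errors diverge sharply; here the ratio hovers around one, differing only by sampling noise across the 100 arbitrary groups.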
· · · · · ·(Exhibit 2 marked.)
·BY MR. KIERNAN:
· · · Q.· ·Okay.· Let me hand you what's been marked as
·Wooldridge 2.
· · · · · ·And I'll represent to you, Dr. Wooldridge,
·that this is a copy of the section on Cluster
·Sampling, 20.3, from Chapter 20 of your textbook
·Econometric Analysis of Cross Section and Panel Data,
·Second Edition.
· · · · · ·Do you recognize the section on Cluster
·Sampling?
· · · A.· ·Yes, I do.
· · · Q.· ·Okay.· And if you turn to page 864 --
· · · A.· ·Okay.
· · · Q.· ·-- and this is under Section 20.3.1 --
· · · A.· ·Uh-huh.· Okay.
· · · Q.· ·-- and in this section you're discussing an
·example in which a random sample of fourth-grade
·classrooms is drawn in the state and the common factor
·affecting students in a given classroom is the
·characteristics of the teacher.
· · · A.· ·Uh-huh.
· · · Q.· ·Is that right?
· · · A.· ·I'm sorry.· Where are you seeing that?
· · · Q.· ·In this section where you're using the sample
·of fourth-grade classrooms, my under- --
· · · · · ·MS. SWEENEY:· Go ahead and take a second to
·read it.
· · · · · ·MR. KIERNAN:· Yeah.
· · · · · ·THE WITNESS:· Are you looking at the back?
·Where are you -- I'm sorry.
·BY MR. KIERNAN:
· · · Q.· ·I'm looking at Cluster Sampling.
· · · A.· ·This page.
· · · · · ·Yeah, I might rethink that now.
· · · Q.· ·Why is that?· Well, rethink what?
· · · A.· ·In other words, whether you actually have to
·cluster at the school level --
· · · Q.· ·So in --
· · · A.· ·-- after doing --
· · · Q.· ·-- your textbook -- oh, sorry.
· · · A.· ·After doing more recent analysis, yes.
· · · Q.· ·Okay.· In your textbook you propose clustering
·when calculating the --
· · · A.· ·So this is a conservative thing to do, yes.
·That doesn't mean that if you have good reason not to do
·it, that you -- that you should still do it.
· · · Q.· ·Where do you state it's a conservative --
·conservative thing to do and if you have a reason not to
·do it, you shouldn't do it?· Where is that in your
·textbook?
· · · A.· ·Well, we -- as I mentioned, this is a learning
·process, right?· So after examining these problems with
·this ex post clustering, I now know that it's a
·conservative thing to do.· So, yes, I would probably --
·I would rewrite this section a bit if I -- if there's a
·third edition coming.
· · · Q.· ·And did you make the same recommendation in
·the first edition of your textbook, Econometric Analysis
·of Cross Section Panel Data?
· · · A.· ·Actually, that's -- I can't remember.· It was
·a fairly extensive revision of the book.
· · · Q.· ·And you state, "After examining these problems
·with this ex post clustering," are you referring to in
·connection with this case?
· · · A.· ·No, just in general.· Just the theory that
·I've worked out and that, as I mentioned, my coauthors
·and I had been working on.
· · · · · ·So, in other words, when you realize that if
·you are randomly sampling any unit, whether it's an
·individual student or a fourth-grade classroom, it's
·actually -- it's certainly conservative to compute the
·cluster robust standard errors and it's not going to be
·very costly in a case like this if you don't have to
·because you have a large number of clusters with
·relatively small cluster sizes.· But if you do it in the
·case where you don't -- where your cluster sizes are
·very large and so there's no theory to tell you that
·those clusters standard errors are going to settle down
·to the usual ones, then I would be more careful here and
·argue that you -- you shouldn't necessarily do it
·because your inference could be much too conservative.
· · · Q.· ·Okay.
· · · A.· ·This is the -- the point of -- there's another
·section in here on stratified sampling.· And it's the
·same sort of argument that you can always do something
·that's very conservative, but you may not learn much and
·if you can do something else that is -- is actually
·providing the proper standard errors, then you should do
·that.
· · · Q.· ·And have you reached an ultimate conclusion of
·whether to revise the paragraphs in your Chapter 20.3.1?
· · · A.· ·Yes.· I probably -- I think I would revise
·them in light -- in fact, I would add a section on ex
·post clustering.
· · · Q.· ·And what -- what authorities or peer-reviewed
·papers would you cite in support of your new section in
·your textbook?
· · · A.· ·Well, a lot of this book is actually based on
·original research, so I probably wouldn't.· If I finish
·the work with my co-authors, then I would cite that.
· · · Q.· ·And, as of today, you have not completed that
·work?
· · · A.· ·The theory is -- is essentially finished --
· · · Q.· ·And --
·1·
·2·
·3·
·4·
·5·
·6·
·7·
·8·
·9·
10·
11·
12·
13·
14·
15·
16·
17·
18·
19·
20·
21·
22·
23·
24·
25·
· · · Q.· ·If you turn to page 868, you have an example
·20.3.· Just tell me when you get there.
· · · A.· ·Yes.
· · · Q.· ·And this is Cluster Correlation in Teacher
·Compensation.· Do you see that?
· · · A.· ·Uh-huh.
· · · Q.· ·And "The data set is in BENEFITS.RAW, includes
·average compensation, at the school level, for teachers
·in Michigan."
· · · · · ·Do you see that?
· · · A.· ·Yes.
· · · Q.· ·And do you understand that that data include
·the entire population?
· · · A.· ·Yes.
· · · Q.· ·Okay.· And then in your textbook you state,
·"We view this as a cluster sample of school districts,
·the schools within districts representing the individual
·units"; is that accurate?
· · · A.· ·It's an example, yes.
· · · Q.· ·Okay.· And this is an example of what you
·described on page 864?
· · · A.· ·Actually, this is -- yeah, this is just an
·example to, essentially, create a -- what could have
·been a cluster sample so that they can see what happens
·when you -- when you do cluster at a -- at a more
Page 126
·1·
·2·
·3·
·4·
·5·
·6·
·7·
·8·
·9·
10·
11·
12·
13·
14·
15·
16·
17·
18·
19·
20·
21·
22·
23·
24·
25·
· · · A.· ·-- and some -- and some simulation results.
· · · Q.· ·Have you completed the simulation result -·work?
· · · A.· ·Simulation work is -- it's hard to decide
·where to stop, but there's -- there's a fair amount of
·simulation work.
· · · Q.· ·Okay.· Have you completed the empirical work
·with respect to your theory on ex post clustering?
· · · A.· ·So by "empirical" do you mean with an actual
·dataset or do you mean the simulations?
· · · Q.· ·The simulations.· I know this morning you were
·describing the simulations as the empirical work like -· · · A.· ·Oh.
· · · Q.· ·-- what Hansen was doing.
· · · A.· ·Okay.· So just to be clear, when economists
·say "empirical work," they usually mean data that's been
·collected from the real world --
· · · Q.· ·Sure.
· · · A.· ·-- as opposed to generating.· So if you mean
·the simulations experiments, is it completed?· Well, you
·can always -- you can always vary parameters and see how
·things change when you vary parameters, but the
·simulations predict the theory quite well.
· · · Q.· ·And have the simulations been peer reviewed?
· · · A.· ·No.
·aggregate level, but -- and you can see that the
·standard errors do go up, so it's a conservative thing
·to do.
· · · Q.· ·And in example 20.3, the -- the data was not
·collected using cluster sampling; isn't that correct?
· · · A.· ·That's correct.
· · · Q.· ·You collected the entire population?
· · · A.· ·Actually, it's not the entire population, no.
·It's -- it's a subset of the districts from the state of
·Michigan.· It contains a lot of them, but not all of
·them.
· · · Q.· ·How many did you exclude?
· · · A.· ·500 and -- probably about -· · · Q.· ·Does eight come to mind?
· · · A.· ·Eight?· No, I think it's got to be more than
·that.· I think there are currently 500 and 55 -- 18 or
·20 or something like that, yeah.
· · · Q.· ·Okay.· So virtually all the school districts?
· · · A.· ·It -- very close, yes.· Where you have --
·yeah.· So G equals 537 when these are elementary
·schools, I believe.· So you have a large G with
·relatively few schools per -- so few observations per
·cluster.
· · · Q.· ·Now, are you familiar with --
· · · A.· ·And this, by the way, actually fits with the
·theory that I -- that the standard errors would go up by
·a fair amount.
· · · Q.· ·And are you -- other than theory, are you
·relying upon anything to support that the standard
·errors would go up by a large amount?· Any --
· · · A.· ·The simulations.
· · · Q.· ·Other than the ones that you -- that you've
·been working on recently, are there other simulations
·that you're relying upon for that statement?
· · · A.· ·I don't know of any simulations that ask the
·question what would happen if you simulated -- if
·you clustered a random sample based on characteristics
·that you draw along with the main variable, yes.
·Usually when -- when properties of the clustered
·standard errors are evaluated via simulation, they are
·actually cluster samples that have been drawn like in
·Chris Hansen's work, for example, or the Duflo, et
·al., paper in the QJE.
· · · Q.· ·You state in your report that clustered
·standard errors are not justified with, say, ten
·clusters and 200 observations per cluster.
· · · · · ·Do you recall that, on page 5 of your report?
· · · A.· ·Uh-huh.
· · · Q.· ·And I notice you don't have a citation there.
·Can you cite to me --
·large sample theory that's obtained from letting G get
·large is not going to be very relevant for that
·particular structure.
· · · Q.· ·So are you relying upon Hansen's paper for the
·statement that clustered standard errors are not
·justified with ten clusters in 200 observations?
· · · A.· ·He considers a similar configuration.· I don't
·know if it's exactly that configuration.· In fact, he
·may have fewer observations per cluster and shows that
·they don't work as the theory -- as -- as the large
·cluster theory says they should.
· · · Q.· ·When was the last time you reviewed Hansen's
·paper, 2007 paper?
· · · A.· ·Ah, it has been a little while.· Well,
·actually, no.· I -- I just looked at it the other day,
·but now I can't remember what the -- yes.· So I'd have
·to go back and look at that more carefully.
· · · Q.· ·Did you review it before submitting this
·report?· In connection with drafting this report, did
·you review it?
· · · A.· ·I reviewed -- I -- I reviewed the lecture
·notes that I've written that refer to his report, his
·paper.
· · · Q.· ·Did you review any of the simulations that
·were included in tables 1 through 4 in his -- in his
· · · A.· ·I should --
· · · Q.· ·-- the authority --
· · · A.· ·I should have put a citation in there.· I'd
·probably have to go look that up.· There's a paper by, I
·think it's Mitchell Petersen.· It's in a financial
·journal.· I'd have to go -- Journal of Financial, maybe
·it's Journal of Finance.
· · · Q.· ·And what is a -- the conclusion or opinion set
·forth in that paper with respect to this issue?
· · · A.· ·That clustering can be effective for computing
·standard errors when you have a large number of clusters
·and not too many units per cluster.· That's actually a
·paper that the setting is a bit different because it's
·-- it's panel data, so there is a time dimension, and so
·they're considering the case of clustering both in the
·cross-sectional dimension in some cases and the time
·series and others and then across both dimensions and
·others.
· · · · · ·I could probably -- I -- I'd have to think.
·I -- I mentioned this before, that there's no sort of
·absolute rule of thumb that you can use, but the theory
·certainly doesn't allow for that kind of configuration.
·If you think that thinking of G heading off to infinity
·with the number of observations per cluster fixed is a
·good thought experiment, it isn't because the -- the
·paper?
· · · A.· ·I thought I looked at them, yes.
· · · Q.· ·Okay.· And is it your testimony that Chris
·Hansen stated that the higher the ratio of observations
·per cluster to number of clusters the more poorly
·clustered standard errors --
· · · A.· ·I'm not sure he said that, no.
· · · Q.· ·Is it your testimony that his paper supports
·that conclusion in your report?
· · · A.· ·I'm not sure it supports that statement.· He
·does -- he does show that the performance deteriorates
·as for a fixed number of clusters you get more
·observations per cluster, yes.
· · · Q.· ·What deteriorates?
· · · A.· ·The --
· · · Q.· ·Performance of what?· Oh, sorry.
· · · A.· ·Sorry.· The performance of the standard
·errors.
· · · Q.· ·Under what -- under what calculation?· So you
·recall Hansen does OLS clustered and random.· Under
·which does it perform more poorly as you increase the
·number of observations --
· · · A.· ·Well --
· · · Q.· ·-- per cluster as you keep number --
· · · A.· ·All that's --
· · · Q.· ·-- number of clusters constant?
· · · A.· ·All that's relevant for this is the OLS versus
·clustering.
· · · Q.· ·Right.
· · · A.· ·Right.· And clustering -- well, no, but, see,
·in the -- in the OLS case, there -- if you're talking
·about computing the usual standard errors, he has built
·cluster correlation into his simulations -- that was my
·comment earlier -- whereas my simulation said, suppose
·we take a random sample and then we do clustering, he's
·built that into his analysis so that there is cluster
·correlation.· And so neither of them works very well
·when you have a small number of clusters and many
·observations per cluster.
· · · Q.· ·Your testimony is that's what Chris Hansen
·states in his paper --
· · · A.· ·That's what --
· · · Q.· ·-- that as you increase -- that as you
·increase the number of observations per cluster, keeping
·cluster -- the number of clusters constant that OLS
·performs more poorly and so does the clustered standard
·errors?
· · · A.· ·Yes.
· · · · · ·MS. SWEENEY:· And I -- I don't think the court
·reporter caught the witness's first answer because the
·of a specific form, but it still has the same effect
·that it induces correlation within a unit in the
·cluster.
· · · · · ·MR. KIERNAN:· Let's just do this.
· · · · · ·I will hand you what is Exhibit 3.· Is that
·right?
· · · · · ·(Exhibit 3 marked.)
·BY MR. KIERNAN:
· · · Q.· ·Can you identify Exhibit 3?
· · · A.· ·Yes.
· · · Q.· ·And what is this?
· · · A.· ·This is the -- this is Chris Hansen's 2007
·paper on clustering.
· · · Q.· ·And is this the paper that you're relying upon
·in your declaration, Wooldridge 1?
· · · A.· ·Somewhat, yes.
· · · Q.· ·Okay.· And if you turn to pages 612 to 615 --
· · · A.· ·Uh-huh.
· · · Q.· ·-- you'll see the simulations that you've been
·discussing.
· · · A.· ·Uh-huh.
· · · Q.· ·And if -- pardon me.
· · · · · ·Okay.· I just want to walk through this.· So
·if I look down at table one, N equals 10 and T equals
·10, is it your understanding that N represents a number
·examiner, again, talked over him.
· · · · · ·Do you want to repeat what you had said
·before, before David interrupted you, if you can
·remember it.
· · · · · ·THE WITNESS:· Oh, the -- I -- I think it was
·about -- talking about three different versions of the
·standard errors, did you say, OLS, clustered, and you
·said something about random --
·BY MR. KIERNAN:
· · · Q.· ·Random effects.
· · · A.· ·-- random effects.· So, yeah, that -- the
·issue here is the OLS versus the clustered standard
·errors and the OLS does poorly, but that's because
·cluster correlation has been built into the simulation.
· · · · · ·Again, it's a -- it's a different situation
·because he's dealing with panel data and so the cluster
·correlation is actually what we call serial correlation
·in the errors across time.
· · · Q.· ·Now, Dr. Wooldridge, you cite to Hansen --
·Hansen's paper as supporting your opinions in this case;
·right?
· · · A.· ·Yeah.· Uh-huh.· Yes.
· · · Q.· ·Okay.· Even though he's using panel data?
· · · A.· ·Yes, because the panel data -- the panel data
·introduces correlation in the cluster.· It introduces it
·of clusters?
· · · A.· ·Yes.
· · · Q.· ·And T represents number of observations per
·cluster?
· · · A.· ·Yes.
· · · Q.· ·Okay.· And then column four, that's the target
·standard error?
· · · A.· ·Yes.
· · · Q.· ·Okay.· And then the difference between two and
·four -- columns two and four is how well the estimator
·is computing the standard errors?
· · · A.· ·Yes.
· · · Q.· ·And, therefore, the difference shows you the
·bias of the estimated standard errors?
· · · A.· ·Yes.
· · · Q.· ·Okay.· And if we look down at row -- where it
·says B. Random Effects --
· · · A.· ·Right.
· · · Q.· ·-- and let's consider the case where the
·intergroup correlation is high, so it's 0.9.
· · · A.· ·Okay.· Uh-huh.
· · · Q.· ·Okay.· And if we look at OLS -- compare OLS to
·clustering, which performs better?
· · · A.· ·The clustering --
· · · · · ·MS. SWEENEY:· Objection.· Vague and ambiguous.
· · · · · ·I'm sorry.· Go ahead.
·BY MR. KIERNAN:
· · · Q.· ·Go ahead, Dr. Wooldridge.
· · · A.· ·The clustering performs better because the
·cluster effect is left in the error term for OLS, so
·this is data that had been generated with a cluster
·effect.
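The comparison being discussed can be illustrated with a small simulation. This is a hedged sketch, not Hansen's actual code: it generates data with a random-effects cluster component (G clusters, n observations per cluster, strong within-cluster error correlation rho, all parameter values assumed for illustration) and compares the usual OLS standard error, which ignores the cluster effect left in the error term, with the cluster-robust "sandwich" standard error.

```python
import numpy as np

def simulate_ses(G=10, n=50, rho=0.9, reps=300, seed=0):
    """Average usual-OLS and cluster-robust SEs for the slope, plus the
    true sampling SD of the slope estimate across replications."""
    rng = np.random.default_rng(seed)
    ols_ses, cr_ses, slopes = [], [], []
    for _ in range(reps):
        # Cluster effect c_g plus idiosyncratic noise: within-cluster
        # error correlation is rho, total error variance is 1.
        c = np.repeat(rng.normal(0.0, np.sqrt(rho), G), n)
        e = rng.normal(0.0, np.sqrt(1.0 - rho), G * n)
        x = np.repeat(rng.normal(0.0, 1.0, G), n)   # cluster-level regressor
        y = 1.0 + 0.5 * x + c + e
        X = np.column_stack([np.ones(G * n), x])
        XtX_inv = np.linalg.inv(X.T @ X)
        b = XtX_inv @ (X.T @ y)
        u = y - X @ b
        # Usual OLS standard error: s^2 (X'X)^{-1}, treating errors as iid.
        s2 = (u @ u) / (G * n - 2)
        ols_ses.append(np.sqrt(s2 * XtX_inv[1, 1]))
        # Cluster-robust sandwich: sum over clusters of (X_g'u_g)(X_g'u_g)'.
        meat = np.zeros((2, 2))
        for g in range(G):
            sl = slice(g * n, (g + 1) * n)
            Xu = X[sl].T @ u[sl]
            meat += np.outer(Xu, Xu)
        V = XtX_inv @ meat @ XtX_inv
        cr_ses.append(np.sqrt(V[1, 1]))
        slopes.append(b[1])
    return float(np.mean(ols_ses)), float(np.mean(cr_ses)), float(np.std(slopes))

avg_ols_se, avg_cr_se, true_sd = simulate_ses()
print(avg_ols_se, avg_cr_se, true_sd)
```

With a strong cluster effect built into the data-generating process, the usual OLS standard error badly understates the true sampling variability of the slope, while the cluster-robust standard error comes much closer, which is the pattern the testimony describes.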
· · · Q.· ·And then in table two Chris Hansen keeps the
·number of clusters constant at ten; correct?
· · · A.· ·Uh-huh.
· · · Q.· ·And he increases the observations to 50
·observations per cluster; correct?
· · · A.· ·Yes.
· · · · · ·MS. SWEENEY:· I'm sorry.· Where are you?
· · · · · ·MR. KIERNAN:· I'm on table two.
· · · · · ·MS. SWEENEY:· Any particular place in table
·two?
· · · · · ·MR. KIERNAN:· No.
· · · · · ·MS. SWEENEY:· No?· Okay.· Sorry.
· · · · · ·MR. KIERNAN:· That's all right.
·BY MR. KIERNAN:
· · · Q.· ·Okay.· And in looking at the same case where
·the intergroup correlation is high, so it's 0.9 under
·Random Effects, here this shows that when the number of
·observations per cluster increase, keeping the number of
· · · A.· ·The cluster.
· · · Q.· ·And then in table four, Dr. Hansen runs a
·simulation again, keeping the number of clusters
·constant at 50, but doubles the number of observations
·per cluster.
· · · · · ·Do you see that?
· · · A.· ·Yes.
· · · Q.· ·And what impact does that have on the
·performance on the OLS estimator?
· · · A.· ·It deteriorates.
· · · Q.· ·And what impact does that have on the cluster
·robust estimator?
· · · A.· ·So in this case it gets a little better.
· · · Q.· ·Roughly -- do you recall roughly how many
·variables Dr. Noll had in his regressions?
· · · A.· ·Well, it's two -- two pages' worth.· So I
·don't know.· Maybe -- maybe it's more than two pages'
·worth.· Must be 50 or 60, something like that.
· · · · · ·MR. KIERNAN:· Okay.· I'm going to hand you
·what is his rebuttal report.· Okay?· I'm going to mark
·this as Exhibit Wooldridge 4.
· · · · · ·MS. SWEENEY:· So which one are you marking as
·4?· Just the --
· · · · · ·MR. KIERNAN:· Just the -- yeah, these, but
·I've provided the --
·clusters constant, OLS performs even worse; isn't that
·correct?
· · · A.· ·The OLS standard error performs worse,
·correct.
· · · Q.· ·Okay.· And how about the cluster standard
·errors?
· · · A.· ·It -- they're performing a little better.
· · · Q.· ·Okay.
· · · A.· ·And that's -- but in this data-generating
·mechanism, the cluster is of a form where it's serial
·correlation that's dying out over time, not a -- not
·what you would get if you were using randomly sampled
·data and then clustering after you've randomly sampled.
· · · Q.· ·And then if you turn to page, look at Table
·three, now Dr. Hansen increases the number of clusters
·or group, from ten to 50.
· · · A.· ·Uh-huh.
· · · Q.· ·But uses ten observations per cluster.· Do you
·see that?
· · · A.· ·Yes.
· · · Q.· ·And then using the same case where the
·intergroup correlation is at 0.9 --
· · · A.· ·Uh-huh.
· · · Q.· ·-- which performs better, the OLS or the
·cluster?
· · · · · ·MS. SWEENEY:· Okay.· Thank you.
· · · · · ·MR. KIERNAN:· -- entire rebuttal at your
·request.
· · · · · ·MS. SWEENEY:· Okay.
· · · · · ·(Exhibit 4 marked.)
·BY MR. KIERNAN:
· · · Q.· ·And you can take a moment to see that
·Exhibit 4 is the -- Dr. Noll's reseller and direct
·consumer regressions, the reports in Exhibit 3 of his
·rebuttal report.
· · · · · ·Do you see that?
· · · A.· ·Yes.
· · · Q.· ·And, roughly, how many variables does Dr. Noll
·include in his regressions?
· · · A.· ·Let's look.· It's more like 100, roughly.
· · · · · ·MS. SWEENEY:· Do you want him to count them on
·this page?
· · · · · ·THE WITNESS:· Eighty, something like that.
·BY MR. KIERNAN:
· · · Q.· ·Okay.
· · · A.· ·Yeah.
· · · Q.· ·And then -- just a second.· I lost my copy.
· · · · · ·MR. KIERNAN:· Oh, there it is.· Thank you.
·BY MR. KIERNAN:
· · · Q.· ·And you'll notice that the vast majority are
·statistically significant at the one percent level.
· · · A.· ·Yes.
· · · Q.· ·Okay.· Is it -- is it unusual to find this
·level of significance over so many variables in a
·regression?
· · · · · ·MS. SWEENEY:· Objection.· Overbroad.· Vague
·and ambiguous.
· · · · · ·THE WITNESS:· It's unusual to have 2 million
·observations.
·BY MR. KIERNAN:
· · · Q.· ·And not my question.
· · · A.· ·And to have -- and to have -- so -- so is it
·unusual?· The -- often we don't have good explanatory
·variables for micro-type outcomes.· So if this were a
·wage equation and we only had, you know, a dozen
·characteristics of people to explain their wage, then I
·would expect much more residual variance.· But if we had
·two million observations, we can still get quite small
·standard errors and statistical significance.
· · · · · ·I mentioned some work earlier by Angrist and
·Krueger and Angrist has also done work with Bill Evans,
·using five percent census data and, yeah, you -- you get
·small standard errors when you have large datasets like
·that.
· · · Q.· ·And if you look at Exhibit 3B -- so we were
· · · Q.· ·Okay.· And if you look at -- if you look at
·Exhibit 3B and let's take the harmony2 variable, do you
·know what that refers to?
· · · A.· ·I believe there was a second version of the
·harmony software and that's what that indicates.
· · · Q.· ·Okay.
· · · A.· ·It's a 01 variable indicating when harmony2
·was released, I believe.
· · · Q.· ·And what would be the T statistic on the
·harmony2 variable in the regression represented in 3
·here?
· · · A.· ·You're asking me to do calculations that I'm
·not very good at, so maybe you have done it for me.
· · · Q.· ·How would you calculate the --
· · · A.· ·Oh, I would --
· · · Q.· ·-- T statistic?
· · · A.· ·Yeah, I would take the coefficient estimate
·and divide by the standard error.· Actually, I would get
·it from the Stata output probably because that would
·be -- yes.
· · · Q.· ·And if you had a T statistic, let's say, 25,
·what would that tell you?· How would you interpret that,
·a T statistic of 25?
· · · A.· ·It's a strong rejection that the coefficient
·is equal to zero.
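The calculation the witness describes is simple enough to sketch: a t statistic is the coefficient estimate divided by its standard error. The numbers below are hypothetical, not values from Dr. Noll's regressions.

```python
# Hypothetical coefficient estimate and standard error (assumed values,
# chosen only so that the ratio matches the t = 25 discussed above).
coef_hat = 0.50
se_hat = 0.02

# t statistic: coefficient estimate divided by its standard error.
t_stat = coef_hat / se_hat
print(t_stat)  # t = 25: far beyond any conventional critical value,
               # a strong rejection of a zero coefficient
```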
·looking at 3A -- 3B's the direct sales regression --
· · · A.· ·Uh-huh.
· · · Q.· ·-- and if you look at the regression output,
·Dr. Noll reports that every single coefficient is
·statistically significant at the one percent level.
· · · · · ·Do you see that?
· · · A.· ·Yes.
· · · Q.· ·Do you find it unusual or is it unusual that a
·regression with this many variables could -- or strike
·that.
· · · · · ·If you have a regression with close to 80
·variables, could it all be considered near perfect
·variables for explaining iPod prices?
· · · · · ·MS. SWEENEY:· Objection.· Vague and ambiguous.
· · · · · ·THE WITNESS:· Yes.· I'm not sure what you mean
·by "near perfect."· If you mean that statistically
·significant at a low significance level, then, again,
·I'm not surprised with this very large sample size.
· · · · · ·This is the difference between -- so without
·commenting on the coefficients, the difference between
·practical significance and statistical significance.· If
·you have -- even if you have really small coefficients
·with enough data you can drive the standard error to be
·close to zero and so, no, it's not surprising to me.
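The distinction between practical and statistical significance can be sketched numerically. In this illustration (all values assumed), the true effect is a practically negligible 0.01, but because the standard error of a sample mean shrinks roughly like 1/sqrt(N), the t statistic becomes large once the sample reaches millions of observations.

```python
import numpy as np

rng = np.random.default_rng(42)
tiny_effect = 0.01  # practically negligible true effect (assumed)

for n in (1_000, 100_000, 2_000_000):
    draws = rng.normal(tiny_effect, 1.0, n)   # noisy observations
    se = draws.std(ddof=1) / np.sqrt(n)       # SE of the sample mean
    t = draws.mean() / se
    print(f"N={n:>9,}  SE={se:.5f}  t={t:.1f}")
```

At N = 2,000,000 the standard error is driven close to zero and the t statistic is large even though the effect itself is tiny, which is the point about large samples made in the testimony.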
·BY MR. KIERNAN:
· · · Q.· ·And if the T statistic -- strike that.
· · · A.· ·These are much bigger than that.
· · · Q.· ·I know.
· · · A.· ·But that's, again, with 36, 37 million
·observations, it's rare that one has a dataset with that
·many observations.· And, as I said, it's -- it's also
·because these are such good predictors of price you have
·little residual variance to explain.
· · · Q.· ·And how do you know that they are such good
·predictors of price?· What are you basing that statement
·on?
· · · A.· ·The R-squared.
· · · Q.· ·Anything else?
· · · A.· ·No, because you could have statistical
·significance even at a very high level and not have
·necessarily a high adjusted R-squared.· There's nothing
·that says there has to be any particular relationship
·between those.
· · · · · ·As I mentioned in the Angrist and Krueger
·work, they have a fair amount of residual variance left
·over, but because they're using the 5 percent census,
·they still get large T statistics.
· · · · · ·(Exhibit 5 marked.)
·BY MR. KIERNAN:
· · · Q.· ·Let me hand you what's been marked as
Jeffrey Wooldridge, Ph.D.
Confidential - Attorneys' Eyes Only
The Apple iPod iTunes Anti-Trust Litigation
·Wooldridge 5.· Can you identify Wooldridge 5 for me?
· · · A.· ·This is a set of lecture notes for the basis
·of a set of lectures that Guido Imbens and I gave at the
·National Bureau of Economic Research in the summer of
·2007.
· · · · · ·(Exhibit 6 marked.)
·BY MR. KIERNAN:
· · · Q.· ·Okay.· I will hand you what's been marked as
·Wooldridge 6.
· · · · · ·Can you identify Wooldridge 6 for the record?
· · · A.· ·This is also a set of lecture notes.· This --
·I should have said, the first set is on estimating
·average treatment effects under unconfoundedness and
·these are lecture notes on linear panel data models for
·the series of lectures.
· · · Q.· ·In your report on page 5, if you look at the
·second full paragraph --
· · · A.· ·Uh-huh.
· · · Q.· ·-- about two-thirds of the way down you state,
·"The higher is the ratio" -- "The higher the ratio of
·observations per cluster to number of clusters, the more
·poorly clustered standard errors behave."
· · · · · ·Do you see that?
· · · A.· ·Yes.
· · · Q.· ·What support do you have for that statement?
·ratio."
· · · · · ·And at what point -- at what ratio does the
·clustered standard errors start performing more poorly?
· · · A.· ·You can't -- this is a question you can't
·really know because it depends so much on the specifics
·of the simulation.· So it would depend whether it's a
·panel dataset or a true cluster sample.· It would depend
·on whether you've applied clustering to a random sample.
·There are all these things that would come into play.
· · · Q.· ·Okay.· And in this paragraph, you're
·describing that if clustering were legitimate in this
·case --
· · · A.· ·Yes.
· · · Q.· ·-- and you're stating that it's not
·legitimate --
· · · A.· ·Right.
· · · Q.· ·-- but you're assuming here that even if it
·were legitimate, there's another problem which is the
·ratio of observations to clusters per -- to observations
·per cluster could be too high.
· · · A.· ·Uh-huh.
· · · Q.· ·Is that something that you've examined in this
·case?
· · · A.· ·Not in this specific case, no.
· · · Q.· ·So you're stating here it could be a
· · · A.· ·Well, I would have to go back and look at the
·papers that I've read that have done simulations on
·this.· They don't -- this is a general statement.· So
·you could -- you could find -- in other words, general
·patterns.· You could find specific simulations for given
·number of clusters and observations.· This might not be
·true, but it's -- generally -- it's a general statement
·about the patterns you would observe across lots of
·simulations as you get more and more observations per
·cluster.
· · · Q.· ·And, as you sit here today, can you identify
·any work that supports that statement?
· · · A.· ·Well, I -- as I -- any -- any published work?
·No.· I've done my own simulations that -- that show the
·standard errors become -- are -- are conservative when
·you -- when you get -- well, with -- with a fixed and
·relatively small number of clusters, yes.· In -- in the
·cases I have looked at very conservative, but that's
·a -- that's an issue of inappropriate clustering.
·There's a separate issue of how would the clustered
·standard errors behave if -- if you had a cluster sample
·where you just had a small number of clusters with many
·observations per cluster.
· · · Q.· ·And you state, "The higher is the rate -- "The
·higher is the ratio," that should read, "The higher the
·problem --
· · · A.· ·Uh-huh.
· · · Q.· ·-- but you haven't reached an opinion on that?
· · · A.· ·That's -- that's correct.
· · · · · ·MR. KIERNAN:· Let's go off the record, give me
·five minutes.
· · · · · ·MS. SWEENEY:· Okay.
· · · · · ·THE VIDEOGRAPHER:· Going off the record at
·3:53 p.m.
· · · · · ·(Recess.)
· · · · · ·THE VIDEOGRAPHER:· Back on the record at
·4:07 p.m.
· · · · · ·(Exhibit 7 marked.)
·BY MR. KIERNAN:
· · · Q.· ·I will be handing you what has been marked as
·Wooldridge 7.
· · · · · ·And Wooldridge 7 is the paper by Justin
·Wolfers, "Did Unilateral Divorce Laws Raise Divorce
·Rates?· A Reconciliation and New Results."
· · · · · ·Do you recognize this paper?
· · · A.· ·I know of this paper, yes.
· · · Q.· ·In fact, it was the subject of research in the
·paper that you did with --
· · · A.· ·Yup.
· · · Q.· ·-- Solon and Haider?
· · · · · ·THE WITNESS:· No.
·BY MR. KIERNAN:
· · · Q.· ·Okay.· And are you aware -- or do you have any
·understanding of the datasets that are used by
·econometricians in antitrust cases involving price
·fixing?
· · · · · ·MS. SWEENEY:· Objection.· Compound.
·Overbroad.· Foundation.
· · · · · ·MR. KIERNAN:· Didn't like that one.
· · · · · ·THE WITNESS:· No, I'm not -- I'm familiar with
·the dataset that Professor Noll analyzed, at least the
·description of it.
·BY MR. KIERNAN:
· · · Q.· ·In antitrust cases involving where there's
·allegations by plaintiffs that some conduct impacted
·price, is it unusual for the econometricians in such
·cases to have the entire population of data?
· · · · · ·MS. SWEENEY:· Objection.· Foundation.
·Compound.· Overbroad.· Vague and ambiguous.
· · · · · ·THE WITNESS:· I -- I don't know.
·BY MR. KIERNAN:
· · · Q.· ·Is it your opinion that in antitrust cases
·where the parties are attempting to estimate the impact
·of the challenged conduct on pricing when they're using
·the entire population of transactions that clustering
· · · A.· ·Yes.· It recommended clustering, but that's in
·the context where -- or it should -- there should have
·been qualifications that the clustering should be done
·when it's actually appropriate.· The clustering, based
·on essentially an arbitrary, you know, partitioning of
·the data after you've looked at it is not appropriate.
·So there should have been a qualifier in there, I
·believe.
· · · Q.· ·So you disagree with the proposal in the ABA
·guidelines?
· · · A.· ·I think it's overly broad, yes.· It doesn't
·discuss the issue at all of taking a random sample and
·then clustering on some characteristics.
· · · Q.· ·And is one possible reason for that is because
·in most antitrust cases parties are dealing with the
·entire population of transactions, the prices from the
·defendants?
· · · A.· ·I don't believe so.
· · · · · ·MS. SWEENEY:· Objection.
· · · · · ·THE WITNESS:· I'm sorry.
· · · · · ·MS. SWEENEY:· Foundation.· Vague and
·ambiguous.
· · · · · ·Sorry.
· · · · · ·THE WITNESS:· I don't believe so, no.· For the
·same reason that I have talked about over and over again
·due to common factors -- accounting for clustering due
·to common factors is never appropriate?
· · · · · ·MS. SWEENEY:· Objection.· Vague and ambiguous
·and compound.· Overbroad.
· · · · · ·THE WITNESS:· Again, the way I think about
·this is if you had a large population of data, then you
·could randomly sample and still have a large number of
·observations.
· · · · · ·The original -- the usual calculation of the
·standard errors without clustering would be appropriate
·and as you get more and more data you will find that the
·standard errors shrink to zero.· And that's what I think
·the appropriate thing that -- the appropriate finding
·is.
·BY MR. KIERNAN:
· · · Q.· ·Have you -- have you reviewed any text,
·publications on guidelines or recommendations for
·proving damages in an antitrust case?
· · · A.· ·I was sent Chapter 6 of the book Proving
·Antitrust Damages --
· · · Q.· ·And do you --
· · · A.· ·-- by the American Bar Association.
· · · Q.· ·And do you recall the recommendation in that
·text on whether to account for a clustering due to
·common factors?
·because you could if you wanted to take a large random
·sample and then you would know that this clustering ex
·post, as I've called it, is the inappropriate thing to
·do.
· · · · · ·MR. KIERNAN:· I'm not going to take two more
·minutes.
· · · · · ·Mark the transcript attorneys' eyes only per
·the protective order.
· · · · · ·Last chance.· Not going to ask anything?
· · · · · ·That's all I have.
· · · · · ·MS. SWEENEY:· We don't have anything.
· · · · · ·THE VIDEOGRAPHER:· Stipulations.
· · · · · ·THE REPORTER:· Handling of the original?· Who
·will handle the original transcript?
· · · · · ·MS. SWEENEY:· What have we been doing?
· · · · · ·THE REPORTER:· Do you want to go off the
·record?
· · · · · ·MR. KIERNAN:· Yeah, yeah.
· · · · · ·MS. SWEENEY:· Yeah, let's go off the record.
· · · · · ·MR. KIERNAN:· Yeah.· Let's go off the record
·because he doesn't need to hear this.
· · · · · ·THE VIDEOGRAPHER:· Okay.· This concludes the
·video portion of the deposition.· Two DVDs were made.
·We're going off the record at 4:26 p.m.
· · · · · ·(Deposition concluded at 4:26 p.m.)
· · · · · · · · DECLARATION UNDER PENALTY OF PERJURY

·Case Name: The Apple iPod iTunes Anti-Trust Litigation
·Date of Deposition: 01/06/2014
·Job No.: 10009202

· · · · · ·I, JEFFREY WOOLDRIDGE, PH.D., hereby certify
·under penalty of perjury under the laws of the State of
·________________ that the foregoing is true and correct.

· · · · · ·Executed this ______ day of __________________,
·2014, at ____________________.

· · · · · · · · · · · · ·_________________________________
· · · · · · · · · · · · · · · · ·JEFFREY WOOLDRIDGE, PH.D.


· · · · · ·I, the undersigned, a Certified Shorthand
·Reporter of the State of California, do hereby certify:

· · · · · ·That the foregoing proceedings were taken
·before me at the time and place herein set forth; that
·any witnesses in the foregoing proceedings, prior to
·testifying, were duly sworn; that a record of the
·proceedings was made by me using machine shorthand,
·which was thereafter transcribed under my direction;
·that the foregoing transcript is a true record of the
·testimony given.

· · · · · ·Further, that if the foregoing pertains to the
·original transcript of a deposition in a federal case,
·before completion of the proceedings, review of the
·transcript [ ] was [ ] was not requested.

· · · · · ·I further certify I am neither financially
·interested in the action nor a relative or employee of
·any attorney or party to this action.

· · · · · ·IN WITNESS WHEREOF, I have this date
·subscribed my name.

·Dated: January 10, 2014

· · · · · · · · ·_____________________________________
· · · · · · · · ·Debby M. Gladish
· · · · · · · · ·RPR, CLR, CCRR, CSR No. 9803
· · · · · · · · ·NCRA Realtime Systems Administrator
Page 163
Page 164
DEPOSITION ERRATA SHEET

Case Name: The Apple iPod iTunes Anti-Trust Litigation
Name of Witness: Jeffrey Wooldridge, Ph.D.
Date of Deposition: 01/06/2014
Job No.: 10009202

Reason Codes: 1. To clarify the record.
              2. To conform to the facts.
              3. To correct transcription errors.

Page _____ Line ______ Reason ______
From _______________________ to ________________________

Page _____ Line ______ Reason ______
From _______________________ to ________________________

Page _____ Line ______ Reason ______
From _______________________ to ________________________

Page _____ Line ______ Reason ______
From _______________________ to ________________________

Page _____ Line ______ Reason ______
From _______________________ to ________________________

Page _____ Line ______ Reason ______
From _______________________ to ________________________
Exhibit 12
ARTICLE IN PRESS
Journal of Econometrics 141 (2007) 597–620
www.elsevier.com/locate/jeconom
Asymptotic properties of a robust variance matrix
estimator for panel data when T is large
Christian B. Hansen
University of Chicago, Graduate School of Business, 5807 South Woodlawn Ave., Chicago, IL 60637, USA
Available online 20 November 2006
Abstract
I consider the asymptotic properties of a commonly advocated covariance matrix estimator for panel data. Under asymptotics where the cross-section dimension, n, grows large with the time dimension, T, fixed, the estimator is consistent while allowing essentially arbitrary correlation within each individual. However, many panel data sets have a non-negligible time dimension. I extend the usual analysis to cases where n and T go to infinity jointly and where $T \to \infty$ with n fixed. I provide conditions under which t and F statistics based on the covariance matrix estimator provide valid inference and illustrate the properties of the estimator in a simulation study.
© 2007 Elsevier B.V. All rights reserved.
JEL classiﬁcation: C12; C13; C23
Keywords: Panel; Heteroskedasticity; Autocorrelation; Robust; Covariance matrix
1. Introduction
The use of heteroskedasticity robust covariance matrix estimators, cf. White (1980), in
cross-sectional settings and of heteroskedasticity and autocorrelation consistent (HAC)
covariance matrix estimators, cf. Andrews (1991), in time series contexts is extremely
common in applied econometrics. The popularity of these robust covariance matrix
estimators is due to their consistency under weak functional form assumptions. In
particular, their use allows the researcher to form valid conﬁdence regions about a set of
parameters from a model of interest without specifying an exact process for the
disturbances in the model.
E-mail address: chansen1@chicagoGSB.edu.
0304-4076/$ - see front matter © 2007 Elsevier B.V. All rights reserved.
doi:10.1016/j.jeconom.2006.10.009
With the increasing availability of panel data, it is natural that the use of robust covariance matrix estimators for panel data settings that allow for arbitrary within-individual correlation is becoming more common. A recent paper by Bertrand et al.
(2004) illustrated the pitfalls of ignoring serial correlation in panel data, ﬁnding through a
simulation study that inference procedures which fail to account for within individual
serial correlation may be severely size distorted. As a potential resolution of this problem,
Bertrand et al. (2004) suggest the use of a robust covariance matrix estimator proposed by
Arellano (1987) and explored in Kezdi (2002) which allows arbitrary within individual
correlation and ﬁnd in a simulation study that tests based on this estimator of the
covariance parameters have correct size.
One drawback of the estimator of Arellano (1987), hereafter referred to as the
‘‘clustered’’ covariance matrix (CCM) estimator, is that its properties are only known in
conventional panel asymptotics as the cross-section dimension, n, increases with the time
dimension, T, ﬁxed. While many panel data sets are indeed characterized by large n and
relatively small T, this is not necessarily the case. For example, in many differences-in-differences and policy evaluation studies, the cross-section is composed of states and the time dimension of yearly or quarterly (or occasionally monthly) observations on each state for 20 or more years.
In this paper, I address this issue by exploring the theoretical properties of the CCM
estimator in asymptotics that allow n and T to go to inﬁnity jointly and in asymptotics
where T goes to inﬁnity with n ﬁxed. I ﬁnd that the CCM estimator, appropriately
normalized, is consistent without imposing any conditions on the rate of growth of T
relative to n even when the time series dependence between the observations within each
individual is left unrestricted. In this case, both the OLS estimator and the CCM estimator converge at only the $\sqrt{n}$-rate, essentially because the only information is coming from cross-sectional variation. If the time series process is restricted to be strongly mixing, I show that the OLS estimator is $\sqrt{nT}$-consistent but that, because high lags are not down-weighted, the robust covariance matrix estimator still converges at only the $\sqrt{n}$-rate. This behavior suggests, as indicated in the simulations found in Kezdi (2002), that it is the n dimension and not the size of n relative to T that matters for determining the properties of the CCM estimator.
It is interesting to note that the limiting behavior of $\hat\beta$ changes "discontinuously" as the amount of dependence is limited. In particular, the rate of convergence of $\hat\beta$ changes from $\sqrt{n}$ in the "no-mixing case" to $\sqrt{nT}$ when mixing is imposed. However, despite the difference in the limiting behavior of $\hat\beta$, there is no difference in the behavior of standard inference procedures based on the CCM estimator between the two cases. In particular, the same t and F statistics will be valid in either case (and in the $n \to \infty$ with T fixed case) without reference to the asymptotics or degree of dependence in the data.
I also derive the behavior of the CCM estimator as $T \to \infty$ with n fixed, where I find the
estimator is not consistent but does have a limiting distribution. This result corresponds to
asymptotic results for HAC estimators without truncation found in recent work by Kiefer
and Vogelsang (2002, 2005), Phillips et al. (2003), and Vogelsang (2003). While the limiting
distribution is not proportional to the true covariance matrix in general, it is proportional
to the covariance matrix in the important special case of iid data across individuals,¹

¹ Note that this still allows arbitrary correlation and heteroskedasticity within individuals, but restricts that the pattern is the same across individuals.
allowing construction of asymptotically pivotal statistics in this case. In fact, in this case,
the standard t-statistic is not asymptotically normal but converges in distribution to a
random variable which is exactly proportional to a $t_{n-1}$ distribution. This behavior suggests the use of the $t_{n-1}$ distribution for constructing confidence intervals and tests when the CCM
estimator is used as a general rule, as this will provide asymptotically correct critical values
under any asymptotic sequence.
I then explore the ﬁnite sample behavior of the CCM estimator and tests based upon it
through a short simulation study. The simulation results indicate that tests based on the
robust standard error estimates generally have approximately correct size in serially
correlated panel data even in small samples. However, the standard error estimates
themselves are considerably more variable than their counterparts based on simple
parametric models. The bias of the simple parametric estimators is also typically smaller in
the cases where the parametric model is correct, suggesting that these standard error
estimates are likely preferable when the researcher is conﬁdent in the form of the error
process. In the simulation, I also explore the behavior of an analog of White’s (1980) direct
test for heteroskedasticity proposed by Kezdi (2002).2 The results indicate the performance
of the test is fairly good for moderate n, though it is quite poor when n is small. This
simulation behavior suggests that this test may be useful for choosing between the use of
robust standard error estimates and standard errors estimated from a more parsimonious
model when n is reasonably large.
The remainder of this paper is organized as follows. In Section 2, I present the basic
framework and the estimator and test statistics that will be considered. The asymptotic
properties of these estimators are collected in Section 3, and Section 4 contains a discussion
of a Monte Carlo study assessing the ﬁnite sample performance of the estimators in simple
models. Section 5 concludes.
2. A heteroskedasticity–autocorrelation consistent covariance matrix estimator for panel
data
Consider a regression model defined by
$$y_{it} = x_{it}'\beta + \epsilon_{it}, \qquad (1)$$
where $i = 1, \ldots, n$ indexes individuals, $t = 1, \ldots, T$ indexes time, $x_{it}$ is a $k \times 1$ vector of observable covariates, and $\epsilon_{it}$ is an unobservable error component. Note that this formulation incorporates the standard fixed effects model as well as models which include other covariates that enter the model with individual specific coefficients, such as individual specific time trends, where these covariates have been partialed out. In these cases, the variables $x_{it}$, $y_{it}$, and $\epsilon_{it}$ should be interpreted as residuals from regressions of $x^*_{it}$, $y^*_{it}$, and $\epsilon^*_{it}$ on an auxiliary set of covariates $z^*_{it}$ from the underlying model $y^*_{it} = x^{*\prime}_{it}\beta + z^{*\prime}_{it}\gamma + \epsilon^*_{it}$. For example, in the fixed effects model, $Z^*$ is a matrix of dummy variables for each individual and $\gamma$ is a vector of individual specific fixed effects. In this case, $x_{it} = x^*_{it} - (1/T)\sum_{t=1}^{T} x^*_{it}$, and $y_{it}$ and $\epsilon_{it}$ are defined similarly. Alternatively, $x_{it}$, $y_{it}$, and $\epsilon_{it}$ could be interpreted as variables resulting from other transformations which
² Solon and Inoue (2004) offers a different testing procedure for detecting serial correlation in fixed effects panel models. See also Bhargava et al. (1982), Baltagi and Wu (1999), Wooldridge (2002, pp. 275, 282–283), and Drukker (2003).
remove the nuisance parameters from the equation, such as first-differencing to remove the fixed effects. In what follows, all properties are given in terms of the transformed variables for convenience. Alternatively, conditions could be imposed on the underlying variables and the properties derived as $T \to \infty$ as in Hansen (2006).³
Within each individual, the equations defined by (1) may be stacked and represented in matrix form as
$$y_i = x_i\beta + \epsilon_i, \qquad (2)$$
where $y_i$ is a $T \times 1$ vector of individual outcomes, $x_i$ is a $T \times k$ matrix of observed covariates, and $\epsilon_i$ is a $T \times 1$ vector of unobservables affecting the outcomes $y_i$ with $E[\epsilon_i\epsilon_i' \mid x_i] = \Omega_i$. The OLS estimator of $\beta$ from Eq. (2) may then be defined as $\hat\beta = (\sum_{i=1}^{n} x_i'x_i)^{-1}\sum_{i=1}^{n} x_i'y_i$. The properties of $\hat\beta$ as $n \to \infty$ with $T$ fixed are well known. In particular, under regularity conditions, $\sqrt{n}(\hat\beta - \beta)$ is asymptotically normal with covariance matrix $Q^{-1}WQ^{-1}$ where $Q = \lim_n (1/n)\sum_{i=1}^{n} E[x_i'x_i]$ and $W = \lim_n (1/n)\sum_{i=1}^{n} E[x_i'\Omega_i x_i]$.

The problem of robust covariance matrix estimation is then estimating $W$ without imposing a parametric structure on the $\Omega_i$. In this paper, I consider the estimator suggested by Arellano (1987), which may be defined as
$$\hat W = \frac{1}{nT}\sum_{i=1}^{n} x_i'\hat\epsilon_i\hat\epsilon_i'x_i, \qquad (3)$$
where $\hat\epsilon_i = y_i - x_i\hat\beta$ are OLS residuals from Eq. (2). This estimator is an appealing generalization of White's (1980) heteroskedasticity consistent covariance matrix estimator that allows for arbitrary intertemporal correlation patterns and heteroskedasticity across individuals.⁴ The estimator is also appealing in that, unlike HAC estimators for time series data, its implementation does not require the selection of a kernel or bandwidth parameter. The properties of $\hat W$ under conventional panel asymptotics where $n \to \infty$ with $T$ fixed are well-established. In the remainder of this paper, I extend this analysis by considering the properties of $\hat W$ under asymptotic sequences where $T \to \infty$ as well.
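The estimator in Eq. (3), and the sandwich variance estimator built from it, are straightforward to compute directly. The following is a minimal illustrative sketch in NumPy, not code from the paper: the balanced-panel data layout, the simulated random-effects data, and all names such as `ccm_avar` are assumptions made for the example.

```python
import numpy as np

def ccm_avar(X, y, n, T):
    """Clustered (Arellano 1987) covariance estimator for panel OLS.

    X : (n*T, k) regressors stacked individual by individual (balanced panel).
    y : (n*T,) outcomes.
    Returns (beta_hat, avar), where avar follows Eq. (4):
    (sum x_i'x_i)^-1 (sum x_i' e_i e_i' x_i) (sum x_i'x_i)^-1.
    """
    k = X.shape[1]
    XtX = X.T @ X                        # sum_i x_i'x_i
    beta = np.linalg.solve(XtX, X.T @ y)
    resid = y - X @ beta
    meat = np.zeros((k, k))
    for i in range(n):                   # accumulate x_i' e_i e_i' x_i per unit
        block = slice(i * T, (i + 1) * T)
        s = X[block].T @ resid[block]    # k-vector x_i' e_i
        meat += np.outer(s, s)
    bread = np.linalg.inv(XtX)
    return beta, bread @ meat @ bread

# Illustrative use on a simulated equicorrelated panel (assumed DGP):
rng = np.random.default_rng(0)
n, T = 200, 10
X = rng.standard_normal((n * T, 2))
u = np.repeat(rng.standard_normal(n), T)   # individual random effect
eps = u + rng.standard_normal(n * T)       # dependent within each individual
y = X @ np.array([1.0, -0.5]) + eps
beta, avar = ccm_avar(X, y, n, T)
se = np.sqrt(np.diag(avar))                # cluster-robust standard errors
```

Note that, as in the text, no kernel or bandwidth choice is needed: each individual's full residual outer product enters unweighted.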
The chief reason for interest in the CCM estimator is for performing inference about $\beta$. Suppose $\sqrt{d_{nT}}(\hat\beta - \beta) \to_d N(0, B)$ and define an estimator of the asymptotic variance of $\hat\beta$ as $(1/d_{nT})\hat B$ where $\hat B \to_p B$. The following estimator of the asymptotic variance of $\hat\beta$ based on $\hat W$ is used throughout the remainder of the paper:
$$\widehat{\mathrm{Avar}}(\hat\beta) = \Big(\sum_{i=1}^{n} x_i'x_i\Big)^{-1}(nT\hat W)\Big(\sum_{i=1}^{n} x_i'x_i\Big)^{-1} = \Big(\sum_{i=1}^{n} x_i'x_i\Big)^{-1}\Big(\sum_{i=1}^{n} x_i'\hat\epsilon_i\hat\epsilon_i'x_i\Big)\Big(\sum_{i=1}^{n} x_i'x_i\Big)^{-1}. \qquad (4)$$

³ This is especially relevant in Theorem 3, where the mixing conditions will not hold for the transformed variables if, for example, the transformation is to remove fixed effects by differencing out the individual means. Hansen (2006) provides conditions on the untransformed variables which will cover this case in a different but related context. This approach complicates the proof and notation and is not pursued here.
⁴ It does, however, ignore the possibility of cross-sectional correlation, and it will be assumed that there is no cross-sectional correlation for the remainder of the paper.
In addition, for testing the hypothesis $R\beta = r$ for a $q \times k$ matrix $R$ with rank $q$, the usual $t$ (for $R$ a $1 \times k$ vector) and Wald statistics can be defined as
$$t^* = \frac{\sqrt{nT}(R\hat\beta - r)}{\sqrt{R\hat Q^{-1}\hat W\hat Q^{-1}R'}} \qquad (5)$$
and
$$F^* = nT(R\hat\beta - r)'[R\hat Q^{-1}\hat W\hat Q^{-1}R']^{-1}(R\hat\beta - r), \qquad (6)$$
respectively, where $\hat W$ is defined above and $\hat Q = (1/nT)\sum_{i=1}^{n} x_i'x_i$. In Section 3, I verify that, despite differences in the limiting behavior of $\hat\beta$, $t^* \to_d N(0, 1)$, $F^* \to_d \chi^2_q$, and $\widehat{\mathrm{Avar}}(\hat\beta)$ is valid for estimating the asymptotic variance of $\hat\beta$ as $n \to \infty$ regardless of the behavior of $T$. I also consider the behavior of $t^*$ and $F^*$ as $T \to \infty$ with $n$ fixed. In this case, $\hat W$ is not consistent for $W$ but does have a limiting distribution; and when the data are iid across $i$,⁵ I show that $t^* \to_d (n/(n-1))^{1/2} t_{n-1}$ and that $F^*$ is asymptotically pivotal and so can be used to construct valid tests. This behavior suggests that inference using $(n/(n-1))\hat W$ and forming critical values using a $t_{n-1}$ distribution will be valid regardless of the asymptotic sequence considered.
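The practical recommendation here, scaling $\hat W$ by $n/(n-1)$ and comparing $t^*$ with $t_{n-1}$ rather than standard normal critical values, can be sketched as follows. This is an illustrative fragment, not the paper's code: the numeric inputs and the name `cluster_t_stat` are hypothetical, and the 5% two-sided critical value of $t_{39}$ is hard-coded approximately.

```python
import math

def cluster_t_stat(beta_j, se_j, n, null=0.0):
    """t-statistic for one coefficient using a cluster-robust SE.

    Scales the variance by n/(n-1), i.e. multiplies the SE by
    sqrt(n/(n-1)); the statistic is then compared with t_{n-1}
    rather than N(0, 1) critical values, as suggested above.
    """
    se_adj = se_j * math.sqrt(n / (n - 1))
    return (beta_j - null) / se_adj

# Hypothetical inputs: estimate 0.12, cluster-robust SE 0.05, n = 40 clusters.
t_stat = cluster_t_stat(0.12, 0.05, n=40)
crit = 2.023       # approximate 5% two-sided critical value of t_{39}
reject = abs(t_stat) > crit
```

With few clusters the $t_{n-1}$ critical value is noticeably larger than the normal one (2.02 vs. 1.96 at $n = 40$), so this convention is conservative exactly where the normal approximation is least reliable.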
It is worth noting that the estimator $\hat W$ has also been used extensively in multilevel models to account for the presence of correlation between individuals within cells; cf. Liang and Zeger (1986) and Bell and McCaffrey (2002). For example, in a schooling study, one might have data on individual outcomes where the individuals are grouped into classes. In this case, the cross-sectional unit of observation could be defined as the class, and arbitrary correlation between all individuals within each class could be allowed. In this case, one would expect the presence of a classroom specific random effect resulting in equicorrelation between all individuals within a class. While this would clearly violate the mixing assumptions imposed in obtaining the asymptotic behavior as $T \to \infty$ with $n$ fixed, it would not invalidate the use of $\hat W$ for inference about $\beta$ in cases where $n$ and $T$ go to infinity jointly.
In addition to being useful for performing inference about $\hat\beta$, $\hat W$ may also be used to test the specification of simple parametric models of the error process.⁶ Such a test may be useful for a number of reasons. If a parametric model is correct, the estimates of the variance of $\hat\beta$ based on this model will tend to behave better than the estimates obtained from $\hat W$. In particular, parametric estimates of the variance of $\hat\beta$ will often be considerably less variable and will typically converge faster than estimates made using $\hat W$; and if the parametric model is deemed to be adequate, this model may be used to perform FGLS estimation. The FGLS estimator is asymptotically more efficient than the OLS estimator, and simulation evidence in Hansen (2006) suggests that the efficiency gain to using FGLS over OLS in serially correlated panel data may be substantial.

⁵ Note that this still allows arbitrary correlation and heteroskedasticity within individuals but restricts that the pattern is the same across individuals.
⁶ The test considered is a straightforward generalization of the test proposed by White (1980) for heteroskedasticity and was suggested in the panel context by Kezdi (2002).
To define the specification test, called hereafter the heteroskedasticity–autocorrelation (HA) test, let $\hat W(\hat\theta) = (1/nT)\sum_{i=1}^{n} x_i'\Omega_i(\hat\theta)x_i$, where $\hat\theta$ are estimates of a finite set of parameters describing the disturbance process and $\Omega_i(\hat\theta)$ is the implied covariance matrix for individual $i$.⁷ Define a test statistic
$$S^* = (nT)\,[\mathrm{vec}(\hat W - \hat W(\hat\theta))]'\,\hat D^{-}\,\mathrm{vec}(\hat W - \hat W(\hat\theta)), \qquad (7)$$
where $\hat D$ is a positive semi-definite weighting matrix that estimates the variance of $\mathrm{vec}(\hat W - \hat W(\hat\theta))$ and $A^{-}$ is the generalized inverse of a matrix $A$.⁸ In the following section, it will be shown that $S^* \to_d \chi^2_{k(k+1)/2}$ for $\hat D$ defined below.

A natural choice for $\hat D$ is
$$\hat D = \frac{1}{nT}\sum_{i=1}^{n} [\mathrm{vec}(x_i'\hat\epsilon_i\hat\epsilon_i'x_i - x_i'\Omega_i(\hat\theta)x_i)][\mathrm{vec}(x_i'\hat\epsilon_i\hat\epsilon_i'x_i - x_i'\Omega_i(\hat\theta)x_i)]'. \qquad (8)$$

Under asymptotics where $\{n, T\} \to \infty$ jointly, another potential choice for $\hat D$ is an estimate of the asymptotic variance of $\hat W$:
$$\hat V = \frac{1}{nT}\sum_{i=1}^{n} [\mathrm{vec}(x_i'\hat\epsilon_i\hat\epsilon_i'x_i - \hat W)][\mathrm{vec}(x_i'\hat\epsilon_i\hat\epsilon_i'x_i - \hat W)]'. \qquad (9)$$

That $\hat V$ provides an estimator of the variance of $\mathrm{vec}(\hat W - \hat W(\hat\theta))$ follows from the fact that as $\{n, T\} \to \infty$, $\mathrm{vec}(\hat W)$ is $\sqrt{n}$-consistent while $\mathrm{vec}(\hat W(\hat\theta))$ will be $\sqrt{nT}$-consistent in many cases, so $\mathrm{vec}(\hat W(\hat\theta))$ may be taken as a constant relative to $\mathrm{vec}(\hat W)$. The difference in rates of convergence would arise, for example, in a fixed effects panel model where the errors follow an AR process with common AR coefficients across individuals. However, it is important to note that this will not always be the case. In particular, in random effects models, the estimator of the variance of the individual specific shock will converge at only a $\sqrt{n}$ rate, implying the same rate of convergence for both the robust and parametric estimators of the variance. In the following section, I outline the asymptotic properties of $\hat\beta$, $\hat W$, and $\hat V$, from which the behavior of $t^*$, $F^*$, and $S^*$ will follow. The properties of $\hat D$, though not discussed, will generally be the same as those of $\hat V$ under the different asymptotic sequences considered.
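As a concrete illustration, the HA statistic of Eq. (7) with the weighting matrix of Eq. (8) can be assembled as below. This is a schematic sketch under assumptions, not code from the paper: `Omega_param` is assumed to return the parametric $\Omega_i(\hat\theta)$ (spherical errors in the usage example), and NumPy's `pinv` stands in for the generalized inverse.

```python
import numpy as np

def ha_test(X, resid, Omega_param, n, T):
    """HA specification test S* of Eq. (7) with D-hat from Eq. (8).

    Compares the robust W-hat with the parametric W(theta-hat); under
    the null, S* is asymptotically chi-square with k(k+1)/2 degrees of
    freedom. Omega_param(i) must return the (T, T) implied covariance
    matrix for unit i.
    """
    diffs = []
    for i in range(n):
        xi = X[i * T:(i + 1) * T]
        ei = resid[i * T:(i + 1) * T]
        robust_i = xi.T @ np.outer(ei, ei) @ xi   # x_i' e_i e_i' x_i
        param_i = xi.T @ Omega_param(i) @ xi      # x_i' Omega_i(theta) x_i
        diffs.append((robust_i - param_i).ravel())
    diffs = np.array(diffs)
    g = diffs.mean(axis=0) / T                    # vec(W-hat - W(theta-hat))
    D = (diffs.T @ diffs) / (n * T)               # Eq. (8) weighting matrix
    return (n * T) * g @ np.linalg.pinv(D) @ g    # generalized inverse

# Illustrative check under the null (iid errors, assumed spherical model):
rng = np.random.default_rng(1)
n, T = 100, 5
X = rng.standard_normal((n * T, 2))
y = X @ np.array([0.5, 1.0]) + rng.standard_normal(n * T)
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
sigma2 = resid @ resid / (n * T - 2)
S = ha_test(X, resid, lambda i: sigma2 * np.eye(T), n, T)
```

Here $S$ would be compared with a $\chi^2_{k(k+1)/2}$ critical value; the redundant elements of the vectorized difference are handled by the generalized inverse, consistent with footnote 8.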
3. Asymptotic properties of the robust covariance matrix estimator
To develop the asymptotic inference results, I impose the following conditions.
⁷ Consistency and asymptotic normality of $\hat W(\hat\theta)$ will generally follow from consistency and asymptotic normality of $\hat\theta$. In particular, defining $W_i(\theta)$ as the derivative of $W$ with respect to $\theta_i$ and letting $\theta$ be a $p \times 1$ vector, a Taylor series expansion of $W(\hat\theta)$ yields $W(\hat\theta) = W(\theta) + \sum_{i=1}^{p} W_i(\bar\theta)(\hat\theta - \theta)$, where $\bar\theta$ is an intermediate value. As long as a uniform law of large numbers applies to $W_i(\theta)$, $W(\hat\theta) - W(\theta)$ will inherit the properties of $\hat\theta - \theta$. The problem is then reduced to finding an estimator of $\theta$ that is consistent and asymptotically normal with a mean zero asymptotic distribution. Finding such an estimator in fixed effects panel models with serial correlation and/or heteroskedasticity when $n \to \infty$ and $T/n \to \rho$ where $\rho < \infty$ is complicated, though there are estimators which exist. See, for example, Nickell (1981), MaCurdy (1982), Solon (1984), Lancaster (2002), Hahn and Kuersteiner (2002), Hahn and Newey (2004), and Hansen (2006).
⁸ The test could alternatively be defined by only considering the $k(k+1)/2$ unique elements of $\hat W - \hat W(\hat\theta)$ and using the inverse of the implied covariance matrix. This test will be equivalent to the test outlined above.
Assumption 1. $\{x_i, \epsilon_i\}$ are independent across $i$, and $E[\epsilon_i\epsilon_i' \mid x_i] = \Omega_i$.

Assumption 2. $Q_{nT} = E[\sum_{i=1}^{n} x_i'x_i/nT]$ is uniformly positive definite with constant limit $Q$, where limits are taken as $n \to \infty$ with $T$ fixed in Theorem 1, as $\{n, T\} \to \infty$ in Theorems 2 and 3, and as $T \to \infty$ with $n$ fixed in Theorem 4.

In addition, I impose either Assumption 3(a) or Assumption 3(b) depending on the context.

Assumption 3.
(a) $E[\epsilon_i \mid x_i] = 0$.
(b) $E[x_{it}\epsilon_{it}] = 0$.

Assumptions 1–3 are quite standard for panel data models. Assumption 1 imposes independence across individuals, ruling out cross-sectional correlation, but leaves the time series correlation unconstrained and allows general heterogeneity across individuals. Assumption 2 is a standard full rank condition, and the restriction that $Q_{nT}$ has a constant limit could be relaxed at the cost of more complicated notation. Assumption 3 imposes that one of two orthogonality conditions is satisfied. Assumption 3(b) imposes that $x_{it}$ and $\epsilon_{it}$ are uncorrelated and is weaker than the strict exogeneity imposed in Assumption 3(a). Assumption 3(a) is stronger than necessary, but it simplifies the proof of asymptotic normality of $\hat W$ and consistency of $\hat V$. In addition, Assumption 3(a) would typically be imposed in fixed effects models.⁹

The first theorem, which is stated here for completeness, collects the properties of $\hat\beta$ and $\hat W$ in asymptotics where $n \to \infty$ with $T$ fixed.
Theorem 1. Suppose the data are generated by model (1), that Assumptions 1 and 2 are satisfied, and that $n \to \infty$ with $T$ fixed.

(i) If Assumption 3(b) holds and $E|x_{ith}|^{4+\delta} < \Delta < \infty$ and $E|\epsilon_{it}|^{4+\delta} < \Delta < \infty$ for some $\delta > 0$, then
$$\sqrt{nT}(\hat\beta - \beta) \to_d Q^{-1} N\Big(0,\; W = \lim_n \frac{1}{nT}\sum_{i=1}^{n} E[x_i'\Omega_i x_i]\Big)$$
and
$$\hat W \to_p W.$$

(ii) In addition, if Assumption 3(a) holds and $E|x_{ith}|^{8+\delta} < \Delta < \infty$ and $E|\epsilon_{it}|^{8+\delta} < \Delta < \infty$ for some $\delta > 0$, then
$$\sqrt{nT}\,[\mathrm{vec}(\hat W - W)] \to_d N\Big(0,\; V = \lim_n \frac{1}{nT}\sum_{i=1}^{n} E[(\mathrm{vec}(x_i'\epsilon_i\epsilon_i'x_i - W))(\mathrm{vec}(x_i'\epsilon_i\epsilon_i'x_i - W))']\Big)$$
and
$$\hat V \to_p V.$$

⁹ Note that a balanced panel has also implicitly been assumed. All of the results with the exception of Corollary 4.1 could be extended to accommodate unbalanced panels at the cost of more complicated notation.
Remark 3.1. It follows from Theorem 1(i) that the asymptotic variance of $\hat\beta$ can be estimated using (4) since
$$\widehat{\mathrm{Avar}}(\hat\beta) = \Big(\sum_{i=1}^{n} x_i'x_i\Big)^{-1}\Big(\sum_{i=1}^{n} x_i'\hat\epsilon_i\hat\epsilon_i'x_i\Big)\Big(\sum_{i=1}^{n} x_i'x_i\Big)^{-1} = \frac{1}{nT}\Big(\frac{1}{nT}\sum_{i=1}^{n} x_i'x_i\Big)^{-1}\hat W\Big(\frac{1}{nT}\sum_{i=1}^{n} x_i'x_i\Big)^{-1} = \frac{1}{nT}\hat Q^{-1}\hat W\hat Q^{-1},$$
where $\hat Q^{-1}\hat W\hat Q^{-1} \to_p Q^{-1}WQ^{-1}$. It also follows immediately from the definitions of $t^*$ and $F^*$ in Eqs. (5) and (6) and Theorem 1(i) that, under the null hypothesis, $t^* \to_d N(0, 1)$ and $F^* \to_d \chi^2_q$. Similarly, using Theorem 1(ii) and assuming $\hat W(\hat\theta)$ has properties similar to those of $\hat W$, it will follow that the HA test statistic, $S^*$, formed using $\hat D$ defined above converges in distribution to a $\chi^2_{k(k+1)/2}$ under the null hypothesis.
Theorem 1 verifies that $\hat\beta$ and $\hat W$ are consistent and asymptotically normal as $n \to \infty$ with $T$ fixed without imposing any restrictions on the time series dimension. In the following results, I consider alternate asymptotic approximations under the assumption that both $n$ and $T$ are going to infinity.¹⁰ In these cases, consistency and asymptotic normality of suitably normalized versions of $\hat W$ are established under weak conditions. Theorem 2, given immediately below, covers the case where $n$ and $T$ are going to infinity and there is not weak dependence in the time series. In particular, the results of Theorem 2 are only interesting in the case where $W = \lim_{n,T}(1/nT^2)\sum_{i=1}^{n} E[x_i'\Omega_i x_i] > 0$. Perhaps the leading case where this behavior would occur is in a model where $\epsilon_{it}$ includes an individual specific random effect that is uncorrelated with $x_{it}$ and the estimated model does not include an individual specific effect. In this case, all observations for a given individual will be equicorrelated, and the condition given above will hold. Theorem 3, given following Theorem 2, covers the case where there is mixing in the time series.
Theorem 2. Suppose the data are generated by model (1), that Assumptions 1 and 2 are satisfied, and that $\{n, T\} \to \infty$ jointly.

(i) If Assumption 3(b) holds and $E|x_{ith}|^{4+\delta} < \Delta < \infty$ and $E|\epsilon_{it}|^{4+\delta} < \Delta < \infty$ for some $\delta > 0$, then
$$\sqrt{n}(\hat\beta - \beta) \to_d Q^{-1} N\Big(0,\; W = \lim_{n,T} \frac{1}{nT^2}\sum_{i=1}^{n} E[x_i'\Omega_i x_i]\Big)$$

¹⁰ One could also consider sequential limits in which one takes limits as $n$ or $T$ goes to infinity with the other dimension fixed and then lets the other dimension go to infinity. It could be shown that under the conditions of Theorem 2 and appropriate normalizations, sequential limits taken first with respect to either $n$ or $T$ would yield the same results as the joint limit. Similarly, under the conditions of Theorem 3, the sequential limits taken first with respect to either $n$ or $T$ would produce the same results as the joint limit.
and
$$\hat W/T \to_p W.$$

(ii) In addition, if Assumption 3(a) holds and $E|x_{ith}|^{8+\delta} < \Delta < \infty$ and $E|\epsilon_{it}|^{8+\delta} < \Delta < \infty$ for some $\delta > 0$, then
$$\sqrt{n}\,[\mathrm{vec}(\hat W/T - W)] \to_d N\Big(0,\; V = \lim_{n,T} \frac{1}{nT^4}\sum_{i=1}^{n} E[(\mathrm{vec}(x_i'\epsilon_i\epsilon_i'x_i - W))(\mathrm{vec}(x_i'\epsilon_i\epsilon_i'x_i - W))']\Big)$$
and
$$\hat V/T^3 \to_p V.$$
Remark 3.2. It is important to note that the results presented in Theorem 2 are not interesting in the setting where the $\{j, k\}$ element of $\Omega_i$ becomes small when $|j - k|$ is large, since in these circumstances $(1/nT^2)\sum_{i=1}^{n} E[x_i'\Omega_i x_i] \to 0$. Theorem 3 presents results which are relevant in this case.

Remark 3.3. Theorem 2 verifies consistency and asymptotic normality of both $\hat\beta$ and $\hat W$ while imposing essentially no constraints on the time series dependence in the data. The large cross-section effectively allows the time series dimension to be ignored even when $T$ is large. However, without constraints on the time series, $\hat\beta$ is $\sqrt{n}$-consistent, not $\sqrt{nT}$-consistent. Intuitively, the slower rate of convergence is due to the fact that there may be little information contained in the time series since it is allowed to be arbitrarily dependent.
Remark 3.4. The fact that $\hat\beta$ and $\hat W$ are not $\sqrt{nT}$-consistent will not affect practical implementation of inference about $\hat\beta$. In particular, the estimate of the asymptotic variance of $\hat\beta$ based on Eq. (4) is
$$\widehat{\mathrm{Avar}}(\hat\beta) = \Big(\sum_{i=1}^{n} x_i'x_i\Big)^{-1}\Big(\sum_{i=1}^{n} x_i'\hat\epsilon_i\hat\epsilon_i'x_i\Big)\Big(\sum_{i=1}^{n} x_i'x_i\Big)^{-1} = \frac{1}{n}\Big(\frac{1}{nT}\sum_{i=1}^{n} x_i'x_i\Big)^{-1}(\hat W/T)\Big(\frac{1}{nT}\sum_{i=1}^{n} x_i'x_i\Big)^{-1} = \frac{1}{n}\hat Q^{-1}(\hat W/T)\hat Q^{-1},$$
where $\hat Q^{-1}(\hat W/T)\hat Q^{-1} \to_p Q^{-1}WQ^{-1}$. The t-statistic defined in Eq. (5) may also be expressed as
$$t^* = \frac{\sqrt{nT}(R\hat\beta - r)}{\sqrt{R\hat Q^{-1}\hat W\hat Q^{-1}R'}} = \frac{\sqrt{n}(R\hat\beta - r)}{\sqrt{R\hat Q^{-1}(\hat W/T)\hat Q^{-1}R'}},$$
which converges in distribution to a $N(0, 1)$ random variable under the null hypothesis, $R\beta = r$, by Theorem 2(i). Similarly, it follows that $F^* \to_d \chi^2_q$ under the null. Finally, the HA test statistic, $S^*$, defined above also satisfies
$$S^* = (nT)\,[\mathrm{vec}(\hat W - \hat W(\hat\theta))]'\,\hat D^{-}\,\mathrm{vec}(\hat W - \hat W(\hat\theta)) = n\,[\mathrm{vec}(\hat W/T - \hat W(\hat\theta)/T)]'\,(\hat D/T^3)^{-}\,\mathrm{vec}(\hat W/T - \hat W(\hat\theta)/T),$$
which converges in distribution to a $\chi^2_{k(k+1)/2}$ under the conditions of the theorem and the additional assumption that $\hat W(\hat\theta)$ behaves similarly to $\hat V$.
The previous theorem establishes the properties of $\hat\beta$ and the robust variance matrix estimator as $n$ and $T$ go to infinity jointly without imposing restrictions on the time series dependence. While the result is interesting, there are many cases in which one might expect the time series dependence to diminish over time. In the following theorem, the properties of $\hat\beta$ and $\hat W$ are established under the assumption that the data are strong mixing in the time series dimension.
Theorem 3. Suppose the data are generated by model (1), that Assumptions 1 and 2 are satisfied, and that $\{n, T\} \to \infty$ jointly.

(i) If Assumption 3(b) is satisfied, $E|x_{ith}|^{r+\delta} < \Delta$ and $E|\epsilon_{it}|^{r+\delta} < \Delta$ for some $\delta > 0$, and $\{x_{it}, \epsilon_{it}\}$ is a strong mixing sequence in $t$ with $\alpha$ of size $-3r/(r-4)$ for $r > 4$, then
$$\sqrt{nT}(\hat\beta - \beta) \to_d Q^{-1} N\Big(0,\; W = \lim_{n,T} \frac{1}{nT}\sum_{i=1}^{n} E[x_i'\Omega_i x_i]\Big)$$
and
$$\hat W - W \to_p 0.$$

(ii) In addition, if Assumption 3(a) is satisfied, $E|x_{ith}|^{r+\delta} < \Delta$ and $E|\epsilon_{it}|^{r+\delta} < \Delta$ for some $\delta > 0$, and $\{x_{it}, \epsilon_{it}\}$ is a strong mixing sequence in $t$ with $\alpha$ of size $-7r/(r-8)$ for $r > 8$, then
$$\sqrt{n}\,[\mathrm{vec}(\hat W - W)] \to_d N\Big(0,\; V = \lim_{n,T} \frac{1}{nT^2}\sum_{i=1}^{n} E[(\mathrm{vec}(x_i'\epsilon_i\epsilon_i'x_i - W))(\mathrm{vec}(x_i'\epsilon_i\epsilon_i'x_i - W))']\Big)$$
and
$$\hat V/T \to_p V.$$
Remark 3.5. Theorem 3 verifies consistency and asymptotic normality of both $\hat\beta$ and $\hat W$ under fairly conventional conditions on the time series dependence of the variables. The added restriction on the time series dependence allows estimation of $\beta$ at the $\sqrt{nT}$-rate, which differs from the case above where $\hat\beta$ is only $\sqrt{n}$-consistent. Intuitively, the increase in the rate of convergence is due to the fact that under the mixing conditions, the time series is more informative than in the case analyzed in Theorem 2.

Remark 3.6. It follows immediately from the conclusions of Theorem 3 and the definitions of $\widehat{\mathrm{Avar}}(\hat\beta)$, $t^*$, and $F^*$ in Eqs. (4)–(6) that $\widehat{\mathrm{Avar}}(\hat\beta)$ is valid for estimating the asymptotic variance of $\hat\beta$ and that $t^* \to_d N(0, 1)$ and $F^* \to_d \chi^2_q$ under the null hypothesis. The HA test
statistic, $S^*$, also satisfies
$$S^* = (nT)\,[\mathrm{vec}(\hat W - \hat W(\hat\theta))]'\,\hat D^{-}\,\mathrm{vec}(\hat W - \hat W(\hat\theta)) = n\,[\mathrm{vec}(\hat W - \hat W(\hat\theta))]'\,(\hat D/T)^{-}\,\mathrm{vec}(\hat W - \hat W(\hat\theta)),$$
which converges in distribution to a $\chi^2_{k(k+1)/2}$ under the conditions of the theorem and the assumption that $\hat D$ behaves similarly to $\hat V$. In this case, $\hat V$ could also typically be used as the weighting matrix in forming $S^*$ since it will often be the case that $\hat W(\hat\theta)$ will be $\sqrt{nT}$-consistent while $\hat W$ is $\sqrt{n}$-consistent.
Theorems 1–3 establish that conventional estimators of the asymptotic variance of $\hat\beta$ and $t$ and $F$ statistics formed using $\hat W$ have their usual properties as long as $n \to \infty$, regardless of the behavior of $T$. In addition, the results indicate that it is essentially only the size of $n$ that matters for the asymptotic behavior of the estimators under these sequences. To complete the theoretical analysis, I present the asymptotic properties of $\hat W$ as $T \to \infty$ with $n$ fixed below. The results are interesting in providing a justification for a commonly used procedure and in unifying the results under the different asymptotics considered.
Theorem 4. Suppose the data are generated by model (1), that Assumptions 1, 2, and 3(b) are satisfied, and that T → ∞ with n fixed. If E|x_ith|^{r+δ} < Δ, E|ε_it|^{r+δ} < Δ, and {x_it, ε_it} is a strong mixing sequence in t with α of size −3r/(r − 4) for r > 4, then

$$\sqrt{nT}\,(\hat b - b) \xrightarrow{d} Q^{-1}N(0, W), \qquad x_i'x_i/nT - Q_i/n \xrightarrow{p} 0, \qquad x_i'\varepsilon_i/\sqrt{nT} \xrightarrow{d} N(0, W_i/n),$$
and

$$\hat W \xrightarrow{d} U = \frac{1}{n}\sum_{i=1}^n\Bigg[\Lambda_i B_i B_i'\Lambda_i - \Lambda_i B_i\Big(\sum_{j=1}^n B_j'\Lambda_j\Big)\Big(\sum_{j=1}^n Q_j\Big)^{-1}Q_i - Q_i\Big(\sum_{j=1}^n Q_j\Big)^{-1}\Big(\sum_{j=1}^n \Lambda_j B_j\Big)B_i'\Lambda_i + Q_i\Big(\sum_{j=1}^n Q_j\Big)^{-1}\Big(\sum_{j=1}^n \Lambda_j B_j\Big)\Big(\sum_{j=1}^n B_j'\Lambda_j\Big)\Big(\sum_{j=1}^n Q_j\Big)^{-1}Q_i\Bigg],$$

where W_i = lim_T (1/T)E[x_i′Ω_i x_i], W = lim_T (1/nT) Σ_i E[x_i′Ω_i x_i], B_i ~ N(0, I_k) is a k-dimensional normal vector with E[B_i B_j′] = 0 for i ≠ j, and Λ_i = W_i^{1/2}.
Remark 3.7. Theorem 4 verifies that Ŵ is not consistent but does have a limiting distribution as T → ∞ with n fixed. Unfortunately, the result here differs from results obtained in Phillips et al. (2003), Kiefer and Vogelsang (2002, 2005), and Vogelsang (2003), who consider HAC estimation in time series data without truncation, in that how to construct asymptotically pivotal statistics from U is not immediately obvious. However, in one important special case, U is proportional to the true covariance matrix, allowing construction of asymptotically pivotal tests.
Corollary 4.1. Suppose the conditions of Theorem 4 are satisfied and that Q_i = Q and W_i = W for all i. Then

$$\hat W \xrightarrow{d} U = \Lambda\,\frac{1}{n}\Big(\sum_{i=1}^n B_i B_i' - \frac{1}{n}\sum_{i=1}^n B_i \sum_{i=1}^n B_i'\Big)\Lambda$$
for B_i defined in Theorem 4 and Λ = W^{1/2}. Then, for testing the null hypothesis H₀: Rb = r against the alternative H₁: Rb ≠ r for a q × k matrix R with rank q, the limiting distributions of the conventional Wald (F*) and t-type (t*) tests under H₀ are

$$F^* = (nT)(R\hat b - r)'[R\hat Q^{-1}\hat W\hat Q^{-1}R']^{-1}(R\hat b - r) \xrightarrow{d} \tilde B_{q,n}'\Big[\frac{1}{n}\Big(\sum_i B_{q,i}B_{q,i}' - \tilde B_{q,n}\tilde B_{q,n}'\Big)\Big]^{-1}\tilde B_{q,n} = \frac{nq}{n-q}\,F_{q,n-q}, \quad (10)$$

and

$$t^* = \frac{\sqrt{nT}\,(R\hat b - r)}{\sqrt{R\hat Q^{-1}\hat W\hat Q^{-1}R'}} \xrightarrow{d} \frac{\tilde B_{1,n}}{\sqrt{(1/n)\big(\sum_i B_{1,i}^2 - \tilde B_{1,n}^2\big)}} = \sqrt{\frac{n}{n-1}}\,t_{n-1}, \quad (11)$$

where B_{q,i} ~ N(0, I_q), B̃_{q,n} = (1/√n) Σ_{i=1}^n B_{q,i}, t_{n−1} is a t distribution with n − 1 degrees of freedom, and F_{q,n−q} is an F distribution with q numerator and n − q denominator degrees of freedom.
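The limit in Eq. (11) is easy to simulate directly from standard normal draws. The following Python sketch (all names illustrative, not from the paper) checks that the variance of the simulated limit matches n/(n − 3), the variance of √(n/(n − 1)) t_{n−1}:

```python
import numpy as np

rng = np.random.default_rng(42)
n, reps = 10, 200_000
B = rng.normal(size=(reps, n))                 # B_{1,i} ~ N(0, 1)
B_tilde = B.sum(axis=1) / np.sqrt(n)           # (1/sqrt(n)) sum_i B_{1,i}
S2 = (B ** 2).sum(axis=1) - B_tilde ** 2       # chi-square_{n-1}, independent of B_tilde
U = B_tilde / np.sqrt(S2 / n)                  # the limit in Eq. (11)
sample_var = U.var()                           # should be near n/(n-3) = 10/7
```

Since S2 equals the centered sum of squares of the B_{1,i}, the ratio is exactly √(n/(n − 1)) times a t_{n−1} variable, which the sample variance confirms.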
Corollary 4.1 gives the limiting distribution of Ŵ as T → ∞ under the additional restriction that Q_i = Q and W_i = W for all i. These restrictions would be satisfied when the data vectors for each individual {x_i, y_i} are iid across i. While this is more restrictive than the condition imposed in Assumption 1, it still allows for quite general forms of conditional heteroskedasticity and does not impose any structure on the time series process within individuals.
The most interesting feature of the result in Corollary 4.1 is that, under the conditions imposed, the limiting distribution of Ŵ is proportional to the actual covariance matrix in the data. This allows construction of asymptotically pivotal statistics based on standard t and Wald tests as in Phillips et al. (2003), Kiefer and Vogelsang (2002, 2005), and Vogelsang (2003). This is particularly convenient in the panel case since the limiting distribution of the t-statistic is exactly √(n/(n − 1)) t_{n−1}, where t_{n−1} denotes the t distribution with n − 1 degrees of freedom.11 It is also interesting that EU = (1 − (1/n))W. This suggests that normalizing the estimator Ŵ by n/(n − 1) will result in an asymptotically unbiased estimator in asymptotics where T → ∞ with n fixed and will likely reduce the finite-sample bias under asymptotics where n → ∞. In addition, the t-statistic constructed based on the estimator defined by (n/(n − 1))Ŵ will be asymptotically distributed as a t_{n−1} for which critical values are readily available.12
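The moment claim EU = (1 − (1/n))W can be checked numerically. Below is a small sketch in which W (and hence Λ = W^{1/2}) is an arbitrary positive definite matrix chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, reps = 5, 2, 50_000
W = np.array([[1.0, 0.3], [0.3, 0.5]])          # arbitrary positive definite W
vals, vecs = np.linalg.eigh(W)
Lam = vecs @ np.diag(np.sqrt(vals)) @ vecs.T    # Lambda = W^{1/2}

acc = np.zeros((k, k))
for _ in range(reps):
    B = rng.normal(size=(n, k))                 # rows are B_i ~ N(0, I_k)
    inner = (B.T @ B - np.outer(B.sum(0), B.sum(0)) / n) / n
    acc += Lam @ inner @ Lam                    # one draw of U from Corollary 4.1
EU = acc / reps                                 # should approach (1 - 1/n) W
```

The inner matrix has expectation (1 − 1/n)I_k, so sandwiching by Λ yields (1 − 1/n)W, which the Monte Carlo average reproduces.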
The conclusions of Corollary 4.1 suggest a simple procedure for testing hypotheses regarding regression coefficients which will be valid under any of the asymptotics considered. Using (n/(n − 1))Ŵ and obtaining critical values from a t_{n−1} distribution will yield tests which are asymptotically valid regardless of the asymptotic sequence since the
11. If n = 1, Ŵ is identically equal to 0. In this case, it is easy to verify that U equals 0, though the results of Theorem 4 and Corollary 4.1 are obviously uninteresting in this case.
12. This is essentially the normalization used in Stata's cluster command, which normalizes Ŵ by [(nT − 1)/(nT − k)][n/(n − 1)], where the normalization is motivated as a finite-sample adjustment under the usual n → ∞, T fixed asymptotics; see Stata User's Guide Release 8, p. 275 (Stata Corporation, 2003).
t_{n−1} → N(0, 1) and n/(n − 1) → 1 as n → ∞. Thus, this approach will yield valid tests under any of the asymptotics considered in the presence of quite general heteroskedasticity and serial correlation.13
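As a concrete sketch of the suggested procedure (illustrative code, not the author's implementation), the following computes OLS with the (n/(n − 1))-normalized clustered variance estimator and compares the resulting t-statistic to critical values from a t_{n−1} distribution:

```python
import numpy as np
from scipy import stats

def clustered_tstat(y, X, clusters):
    """OLS with the Arellano (1987) CCM variance, normalized by n/(n-1);
    critical values come from a t distribution with n-1 degrees of freedom."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    u = y - X @ b
    bread = np.linalg.inv(X.T @ X)
    k = X.shape[1]
    ids = np.unique(clusters)
    n = len(ids)
    meat = np.zeros((k, k))
    for g in ids:                                  # sum of cluster-score outer products
        s = X[clusters == g].T @ u[clusters == g]
        meat += np.outer(s, s)
    V = (n / (n - 1)) * bread @ meat @ bread       # the suggested normalization
    se = np.sqrt(np.diag(V))
    crit = stats.t.ppf(0.975, df=n - 1)            # 5% two-sided t_{n-1} critical value
    return b, se, b / se, crit

# usage on a panel with within-cluster correlated errors (true slope = 0)
rng = np.random.default_rng(0)
n, T = 20, 30
x = rng.normal(size=(n, T))
eps = 0.5 * rng.normal(size=(n, 1)) + rng.normal(size=(n, T))
y = 2.0 + 0.0 * x + eps
Xmat = np.column_stack([np.ones(n * T), x.ravel()])
cl = np.repeat(np.arange(n), T)
b, se, tstats, crit = clustered_tstat(y.ravel(), Xmat, cl)
```

The key choices mirror the text: the cluster-score outer products form the "meat" of the sandwich, n/(n − 1) is the suggested normalization, and the degrees of freedom for the critical value depend only on the number of clusters n, not on T.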
In addition, it is important to note that in the cases where there is weak dependence in
the time series and T is large, more efﬁcient estimators of the covariance matrix which
make use of this information are available. In particular, standard time series HAC
estimators which downweight the correlation between observations that are far apart will
have faster rates of convergence than the CCM estimator.
Finally, it is worth noting that the maximum rank of Ŵ will generally be n − 1, which suggests that Ŵ will be rank deficient when k > n − 1. Since Ŵ is supposed to estimate a full rank matrix, it seems likely that inference based on Ŵ will perform poorly in these cases. Also, the above development ignores time effects, which will often be included in panel data models. Under T fixed, n → ∞ asymptotics, the time effects can be included in the covariate vector x_it and pose no additional complications. However, as T → ∞, they also need to be considered separately from x and partialed out with the individual fixed effects. This partialing out will generally result in the presence of an O(1/n) correlation between individuals. When n is large, this correlation should not matter, but in the fixed n, T → ∞ case, it will invalidate the results. The effect of the presence of time effects was explored in a simulation study with the same design as that reported in the following section where each model was estimated including a full set of time fixed effects. The results, which are not reported below but are available upon request, show that tests based on Ŵ are somewhat more size distorted than when no time effects are included for small n, but that this size distortion diminishes quickly as n increases.
4. Monte Carlo evidence
The asymptotic results presented above suggest that tests based on the robust standard error estimates should have good properties regardless of the relative sizes of n
and T. I report results from a simple simulation study used to assess the ﬁnite sample
effectiveness of the robust covariance matrix estimator and tests based upon it below.
Speciﬁcally, the simulation focuses on t-tests for regression coefﬁcients and the HA test
discussed above.
The Monte Carlo simulations are based on two different speciﬁcations: a ‘‘ﬁxed effect’’
speciﬁcation and a ‘‘random effects’’ speciﬁcation. The terminology refers to the fact that
in the ‘‘ﬁxed effect’’ speciﬁcation, the models will be estimated including individual speciﬁc
ﬁxed effects with the goal of focusing on the case where the underlying disturbances exhibit
weak dependence. In the ‘‘random effects’’ speciﬁcation individual speciﬁc effects are not
estimated and the goal is to examine the behavior of the CCM estimator and tests based
upon it in an equicorrelated model.
The fixed effect specification is

y_it = x_it′b + a_i + e_it,

where x_it is a scalar and a_i is an individual specific effect. The data generating process for the fixed effect specification allows for serial correlation in both x_it and e_it and
13. This argument also applies to testing multiple parameters using F*.
heteroskedasticity:

x_it = .5x_{i,t−1} + v_it,   v_it ~ N(0, .75),
e_it = ρe_{i,t−1} + √(a₀ + a₁x²_it) u_it,   u_it ~ N(0, 1 − ρ²),
a_i ~ N(0, .5).
Data are simulated using four different values of ρ, ρ ∈ {0, .3, .6, .9}, in both the homoskedastic (a₀ = 1, a₁ = 0) and heteroskedastic (a₀ = a₁ = .5) cases, resulting in a total of eight distinct parameter settings. The models are estimated including x_it and a full set of individual specific fixed effects.14
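The fixed-effect design can be simulated as follows (a minimal numpy sketch; drawing the initial conditions from the stationary marginals is an assumption, since the text does not specify start-up):

```python
import numpy as np

def simulate_fixed_effects(n, T, rho, a0, a1, beta=1.0, seed=0):
    """y_it = x_it * beta + alpha_i + e_it, with AR(1) x_it and e_it and
    innovation scale sqrt(a0 + a1 * x_it^2) (heteroskedastic if a1 > 0)."""
    rng = np.random.default_rng(seed)
    alpha = rng.normal(0.0, np.sqrt(0.5), size=n)
    x = np.empty((n, T))
    e = np.empty((n, T))
    # stationary marginal of x has variance .75 / (1 - .25) = 1 in this design
    x[:, 0] = rng.normal(size=n)
    e[:, 0] = np.sqrt(a0 + a1 * x[:, 0] ** 2) * rng.normal(size=n)
    for t in range(1, T):
        x[:, t] = 0.5 * x[:, t - 1] + rng.normal(0.0, np.sqrt(0.75), size=n)
        u = rng.normal(0.0, np.sqrt(1.0 - rho ** 2), size=n)
        e[:, t] = rho * e[:, t - 1] + np.sqrt(a0 + a1 * x[:, t] ** 2) * u
    y = beta * x + alpha[:, None] + e
    return y, x

# homoskedastic case (a0 = 1, a1 = 0) with rho = .6
y, x = simulate_fixed_effects(n=200, T=50, rho=0.6, a0=1.0, a1=0.0)
```

Note the innovation variances: v_it ~ N(0, .75) gives x_it unit stationary variance, and u_it ~ N(0, 1 − ρ²) keeps the homoskedastic e_it at unit variance for every ρ.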
The random effects specification is

y_it = x_it′b + ε_it,

where x_it is a normally distributed scalar with E[x²_it] = 1 and E[x_{it1}x_{it2}] = .8 for all t1 ≠ t2. ε_it contains an individual specific random component and a random error term:

ε_it = a_i + u_it,
a_i ~ N(0, ρ),
u_it ~ N(0, 1 − ρ).

Note that the random effects data generating process implies that E[ε_{it1}ε_{it2}] = ρ for t1 ≠ t2. Three values of ρ are employed for the random effects specification: .3, .6, and .9. The model is estimated by regressing y_it on x_it and a constant.
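A minimal sketch of this random-effects design (names illustrative); the equicorrelated regressor with E[x²_it] = 1 and E[x_{it1}x_{it2}] = .8 can be generated from a shared component:

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, rho, beta = 4000, 5, 0.6, 1.0
# regressor with E[x^2] = 1 and cross-period correlation .8 via a common component
x = np.sqrt(0.8) * rng.normal(size=(n, 1)) + np.sqrt(0.2) * rng.normal(size=(n, T))
a = rng.normal(0.0, np.sqrt(rho), size=(n, 1))        # individual random component
u = rng.normal(0.0, np.sqrt(1.0 - rho), size=(n, T))  # idiosyncratic error
e = a + u
y = beta * x + e

cross_moment = np.mean(e[:, 0] * e[:, 1])   # E[e_it1 e_it2], should be near rho
x_cross = np.mean(x[:, 0] * x[:, 1])        # should be near .8
```

The empirical cross-period moments confirm the equicorrelation structure the text describes.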
The ﬁxed effects model is commonly used in empirical work when panel data are
available. The random effects speciﬁcation is also widely used in the policy evaluation
literature. In many policy evaluation studies, the covariate of interest is a policy variable
that is highly correlated within aggregate cells, often with a correlation of one, which has
led to the dominance of the random effects estimator in this context. For example, a
researcher may desire to estimate the effect of classroom level policies on student-level
micro data containing observations from multiple classrooms. In this setting, T indexes the
number of students within each class, n indexes the number of classrooms, and ai is a
classroom speciﬁc random effect. The CCM estimator has been widely utilized in such
situations in order to consistently estimate standard errors.15
Simulation results for various values of the cross-sectional (n) and time (T) dimensions are reported. For each {n, T} combination, reported results for each of the 11 parameter settings (eight for the fixed effects specification and three for the random effects specification) are based on 1,000 simulation repetitions. Each simulation estimates three types of standard errors for b̂: unadjusted OLS standard errors, σ̂_OLS; CCM standard errors, σ̂_CLUS; and standard errors consistent with an AR(1) process, σ̂_AR(1).16 For the
14. Since a_i is uncorrelated with x_i, this model could be estimated using random effects. I chose to consider a different specification for the random effects estimates where the x_it were generated to more closely resemble covariates which appear in policy analysis studies.
15. This is, in fact, one of the original motivations for the development of the CCM estimator, cf. Liang and Zeger (1986).
16. σ̂_AR(1) imposes the parametric structure implied by an AR(1) process. The ρ parameter is estimated from the OLS residuals using the procedure described in Hansen (2006), which consistently estimates AR parameters in fixed effects panel models. The standard errors are then computed as (X′X)⁻¹X′Ω(ρ̂)X(X′X)⁻¹, where Ω(ρ̂) is the covariance matrix implied by an AR(1) process.
random effects specification, standard errors consistent with random effects, σ̂_RE, are substituted for σ̂_AR(1).17 σ̂_CLUS is consistent for all parameter settings. σ̂_OLS is consistent only in the iid case (the homoskedastic data generating process with ρ = 0). σ̂_AR(1) is consistent in all homoskedastic data generating processes, and σ̂_RE is consistent in all models for which it is reported. In all cases, the CCM estimator is computed using the normalization implied by T → ∞ with n fixed asymptotics; that is, the CCM estimator is computed as (n/(n − 1))Ŵ for Ŵ defined in Eq. (3).
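The parametric sandwich in footnote 16 can be sketched directly. In the code below, ρ and σ² are taken as given rather than estimated by the Hansen (2006) procedure, so this is an illustration of the variance formula only:

```python
import numpy as np
from scipy.linalg import toeplitz, block_diag

def ar1_sandwich_se(X, rho, sigma2, n, T):
    """Standard errors from (X'X)^{-1} X' Omega X (X'X)^{-1}, where Omega is
    block diagonal with T x T AR(1) blocks sigma2 * rho^|t-s|."""
    omega_i = sigma2 * toeplitz(rho ** np.arange(T))
    Omega = block_diag(*([omega_i] * n))
    bread = np.linalg.inv(X.T @ X)
    V = bread @ X.T @ Omega @ X @ bread
    return np.sqrt(np.diag(V))

rng = np.random.default_rng(5)
n, T = 30, 10
X = np.column_stack([np.ones(n * T), rng.normal(size=n * T)])
se_ar1 = ar1_sandwich_se(X, rho=0.5, sigma2=1.0, n=n, T=T)
```

With ρ = 0 the AR(1) blocks collapse to the identity and the formula reduces to the classical OLS variance, a useful sanity check on the construction.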
Tables 1–4 present the results of the Monte Carlo study, where each table corresponds to a different {n, T} combination.18 In each table, Panel A presents the fixed effects results for
the homoskedastic and heteroskedastic cases, while Panel B presents the random effects
results. Column (1) presents t-test rejection rates for 5% level tests based on OLS, CCM,
and AR(1) standard errors. The critical values for tests based on OLS and AR(1) errors
are taken from a t_{nT−n−1} distribution, and the critical values for tests based on clustered standard errors are taken from a t_{n−1} distribution. Columns (2) and (3) present the mean and standard deviation of the estimated standard errors, respectively. Column (4) presents the standard deviation of the b̂'s. The difference between columns (2) and (4) is therefore
the bias of the estimated standard errors. Finally, column (5) presents the rejection rates
for the HA test described above which tests the null hypothesis that both the CCM
estimator and the parametric estimator are consistent.
As expected, tests based on σ̂_OLS and σ̂_AR(1) perform well in the cases where the assumed model is consistent with the data across the full range of n and T combinations. The results are also consistent with the asymptotic theory, clearly illustrating the √(nT)-consistency of b̂ and Ŵ, with the bias of Ŵ and the variance of both b̂ and Ŵ decreasing as either n or T
increases. Of course, when the assumed parametric model is inconsistent with the data,
tests based on parametric standard errors suffer from size distortions and the standard
error estimates are biased. The RE tests have the correct size for moderate and large n, but
not for small n (i.e. n = 10); and as indicated by the asymptotic theory, the T dimension
has no apparent impact on the size of RE based tests or the overall performance of the RE
estimates.
Tests based on the CCM estimator have approximately correct size across all
combinations of n and T and all models of the disturbances considered in the ﬁxed effect
speciﬁcation. The estimator does, however, display a moderate bias in the small n case; it
seems likely that this bias does not translate into a large size distortion due to the fact that
the bias is small relative to the standard error of the estimator and the use of the t_{n−1}
distribution to obtain the critical values. While the clustered standard errors perform well
in terms of size of tests and reasonably well in terms of bias, the simulations reveal that a
potential weakness of the clustered estimator is a relatively high variance. The CCM
estimates have a substantially higher standard deviation than the other estimators and
this difference, in percentage terms, increases with T. This behavior is consistent with the
17. σ̂_RE is estimated in a manner analogous to σ̂_AR(1), where the covariance parameters are estimated in the usual manner from the OLS and within residuals.
18. Tables 1–4 correspond to {n, T} = {10, 10}, {10, 50}, {50, 10}, {50, 50}, respectively. Additional results for {n, T} = {10, 200}, {50, 20}, {50, 200}, {200, 10}, and {200, 50} are available from the author upon request. The results are consistent with the asymptotic theory, with the performance of the CCM estimator improving as either n or T increases in the fixed effects specification and as n increases in the random effects specification. In the random effects case, the performance does not appear to be greatly influenced by the size of T relative to n.
Table 1. N = 10, T = 10

| Data generating process | (1) t-test rejection rate | (2) Mean (s.e.) | (3) Std (s.e.) | (4) Std (b̂) | (5) HA test rejection rate |
|---|---|---|---|---|---|
| A. Fixed effects | | | | | |
| Homoskedastic, ρ = 0 | | | | | |
| OLS | 0.038 | 0.1180 | 0.0133 | 0.1152 | 0.152 |
| Cluster | 0.043 | 0.1149 | 0.0330 | 0.1152 | |
| AR1 | 0.041 | 0.1170 | 0.0141 | 0.1152 | 0.135 |
| Homoskedastic, ρ = .3 | | | | | |
| OLS | 0.082 | 0.1130 | 0.0136 | 0.1269 | 0.095 |
| Cluster | 0.054 | 0.1212 | 0.0357 | 0.1269 | |
| AR1 | 0.055 | 0.1240 | 0.0161 | 0.1269 | 0.133 |
| Homoskedastic, ρ = .6 | | | | | |
| OLS | 0.093 | 0.1005 | 0.0133 | 0.1231 | 0.074 |
| Cluster | 0.060 | 0.1167 | 0.0352 | 0.1231 | |
| AR1 | 0.051 | 0.1219 | 0.0181 | 0.1231 | 0.123 |
| Homoskedastic, ρ = .9 | | | | | |
| OLS | 0.145 | 0.0609 | 0.0090 | 0.0818 | 0.038 |
| Cluster | 0.053 | 0.0772 | 0.0249 | 0.0818 | |
| AR1 | 0.054 | 0.0795 | 0.0136 | 0.0818 | 0.085 |
| Heteroskedastic, ρ = 0 | | | | | |
| OLS | 0.126 | 0.1150 | 0.0126 | 0.1502 | 0.051 |
| Cluster | 0.057 | 0.1410 | 0.0458 | 0.1502 | |
| AR1 | 0.126 | 0.1140 | 0.0137 | 0.1502 | 0.042 |
| Heteroskedastic, ρ = .3 | | | | | |
| OLS | 0.171 | 0.1165 | 0.0137 | 0.1708 | 0.036 |
| Cluster | 0.068 | 0.1538 | 0.0500 | 0.1708 | |
| AR1 | 0.143 | 0.1284 | 0.0172 | 0.1708 | 0.044 |
| Heteroskedastic, ρ = .6 | | | | | |
| OLS | 0.187 | 0.1238 | 0.0153 | 0.1853 | 0.027 |
| Cluster | 0.074 | 0.1717 | 0.0572 | 0.1853 | |
| AR1 | 0.117 | 0.1503 | 0.0219 | 0.1853 | 0.049 |
| Heteroskedastic, ρ = .9 | | | | | |
| OLS | 0.198 | 0.1406 | 0.0209 | 0.2181 | 0.031 |
| Cluster | 0.087 | 0.1872 | 0.0641 | 0.2181 | |
| AR1 | 0.097 | 0.1830 | 0.0336 | 0.2181 | 0.074 |
| B. Random effects | | | | | |
| ρ = .3 | | | | | |
| OLS | 0.295 | 0.1063 | 0.0231 | 0.1926 | 0.017 |
| Cluster | 0.115 | 0.1561 | 0.0609 | 0.1926 | |
| RE | 0.097 | 0.1693 | 0.0460 | 0.1926 | 0.027 |
| ρ = .6 | | | | | |
| OLS | 0.399 | 0.1030 | 0.0248 | 0.2438 | 0.054 |
| Cluster | 0.118 | 0.2024 | 0.0788 | 0.2438 | |
| RE | 0.094 | 0.2180 | 0.0600 | 0.2438 | 0.023 |
| ρ = .9 | | | | | |
| OLS | 0.482 | 0.0987 | 0.0293 | 0.2925 | 0.093 |
| Cluster | 0.108 | 0.2346 | 0.0909 | 0.2925 | |
| RE | 0.095 | 0.2546 | 0.0723 | 0.2925 | 0.018 |
Table 2. N = 10, T = 50

| Data generating process | (1) t-test rejection rate | (2) Mean (s.e.) | (3) Std (s.e.) | (4) Std (b̂) | (5) HA test rejection rate |
|---|---|---|---|---|---|
| A. Fixed effects | | | | | |
| Homoskedastic, ρ = 0 | | | | | |
| OLS | 0.054 | 0.0462 | 0.0024 | 0.0472 | 0.184 |
| Cluster | 0.050 | 0.0449 | 0.0117 | 0.0472 | |
| AR1 | 0.057 | 0.0460 | 0.0026 | 0.0472 | 0.185 |
| Homoskedastic, ρ = .3 | | | | | |
| OLS | 0.088 | 0.0459 | 0.0024 | 0.0519 | 0.077 |
| Cluster | 0.043 | 0.0520 | 0.0133 | 0.0519 | |
| AR1 | 0.050 | 0.0529 | 0.0031 | 0.0519 | 0.159 |
| Homoskedastic, ρ = .6 | | | | | |
| OLS | 0.155 | 0.0447 | 0.0028 | 0.0590 | 0.049 |
| Cluster | 0.042 | 0.0574 | 0.0150 | 0.0590 | |
| AR1 | 0.047 | 0.0598 | 0.0044 | 0.0590 | 0.184 |
| Homoskedastic, ρ = .9 | | | | | |
| OLS | 0.225 | 0.0372 | 0.0034 | 0.0600 | 0.046 |
| Cluster | 0.046 | 0.0562 | 0.0159 | 0.0600 | |
| AR1 | 0.049 | 0.0583 | 0.0072 | 0.0600 | 0.150 |
| Heteroskedastic, ρ = 0 | | | | | |
| OLS | 0.158 | 0.0459 | 0.0021 | 0.0637 | 0.052 |
| Cluster | 0.051 | 0.0606 | 0.0169 | 0.0637 | |
| AR1 | 0.162 | 0.0458 | 0.0023 | 0.0637 | 0.057 |
| Heteroskedastic, ρ = .3 | | | | | |
| OLS | 0.199 | 0.0479 | 0.0022 | 0.0724 | 0.046 |
| Cluster | 0.041 | 0.0735 | 0.0198 | 0.0724 | |
| AR1 | 0.142 | 0.0553 | 0.0032 | 0.0724 | 0.047 |
| Heteroskedastic, ρ = .6 | | | | | |
| OLS | 0.229 | 0.0558 | 0.0031 | 0.0934 | 0.067 |
| Cluster | 0.043 | 0.0928 | 0.0260 | 0.0934 | |
| AR1 | 0.112 | 0.0748 | 0.0054 | 0.0934 | 0.059 |
| Heteroskedastic, ρ = .9 | | | | | |
| OLS | 0.239 | 0.0857 | 0.0079 | 0.1490 | 0.059 |
| Cluster | 0.046 | 0.1428 | 0.0451 | 0.1490 | |
| AR1 | 0.076 | 0.1338 | 0.0163 | 0.1490 | 0.099 |
| B. Random effects | | | | | |
| ρ = .3 | | | | | |
| OLS | 0.568 | 0.0471 | 0.0092 | 0.1636 | 0.147 |
| Cluster | 0.104 | 0.1356 | 0.0547 | 0.1636 | |
| RE | 0.097 | 0.1475 | 0.0413 | 0.1626 | 0.014 |
| ρ = .6 | | | | | |
| OLS | 0.703 | 0.0466 | 0.0105 | 0.2331 | 0.212 |
| Cluster | 0.104 | 0.1897 | 0.0727 | 0.2331 | |
| RE | 0.095 | 0.2079 | 0.0567 | 0.2331 | 0.007 |
| ρ = .9 | | | | | |
| OLS | 0.744 | 0.0450 | 0.0130 | 0.2785 | 0.245 |
| Cluster | 0.106 | 0.2310 | 0.0920 | 0.2785 | |
| RE | 0.103 | 0.2539 | 0.0701 | 0.2785 | 0.014 |
Table 3. N = 50, T = 10

| Data generating process | (1) t-test rejection rate | (2) Mean (s.e.) | (3) Std (s.e.) | (4) Std (b̂) | (5) HA test rejection rate |
|---|---|---|---|---|---|
| A. Fixed effects | | | | | |
| Homoskedastic, ρ = 0 | | | | | |
| OLS | 0.049 | 0.0522 | 0.0026 | 0.0526 | 0.106 |
| Cluster | 0.057 | 0.0515 | 0.0062 | 0.0526 | |
| AR1 | 0.047 | 0.0522 | 0.0028 | 0.0526 | 0.099 |
| Homoskedastic, ρ = .3 | | | | | |
| OLS | 0.080 | 0.0500 | 0.0027 | 0.0569 | 0.053 |
| Cluster | 0.059 | 0.0552 | 0.0072 | 0.0569 | |
| AR1 | 0.055 | 0.0556 | 0.0033 | 0.0569 | 0.092 |
| Homoskedastic, ρ = .6 | | | | | |
| OLS | 0.102 | 0.0447 | 0.0026 | 0.0539 | 0.132 |
| Cluster | 0.048 | 0.0549 | 0.0071 | 0.0539 | |
| AR1 | 0.049 | 0.0553 | 0.0037 | 0.0539 | 0.072 |
| Homoskedastic, ρ = .9 | | | | | |
| OLS | 0.156 | 0.0273 | 0.0273 | 0.0387 | 0.220 |
| Cluster | 0.075 | 0.0364 | 0.0367 | 0.0387 | |
| AR1 | 0.067 | 0.0367 | 0.0367 | 0.0387 | 0.078 |
| Heteroskedastic, ρ = 0 | | | | | |
| OLS | 0.119 | 0.0517 | 0.0025 | 0.0659 | 0.213 |
| Cluster | 0.047 | 0.0673 | 0.0093 | 0.0659 | |
| AR1 | 0.116 | 0.0516 | 0.0028 | 0.0659 | 0.210 |
| Heteroskedastic, ρ = .3 | | | | | |
| OLS | 0.197 | 0.0521 | 0.0026 | 0.0768 | 0.369 |
| Cluster | 0.062 | 0.0741 | 0.0114 | 0.0768 | |
| AR1 | 0.139 | 0.0581 | 0.0033 | 0.0768 | 0.140 |
| Heteroskedastic, ρ = .6 | | | | | |
| OLS | 0.214 | 0.0558 | 0.0031 | 0.0840 | 0.451 |
| Cluster | 0.048 | 0.0820 | 0.0126 | 0.0840 | |
| AR1 | 0.108 | 0.0688 | 0.0045 | 0.0840 | 0.056 |
| Heteroskedastic, ρ = .9 | | | | | |
| OLS | 0.152 | 0.0623 | 0.0043 | 0.0883 | 0.324 |
| Cluster | 0.038 | 0.0899 | 0.0144 | 0.0883 | |
| AR1 | 0.057 | 0.0834 | 0.0070 | 0.0883 | 0.023 |
| B. Random effects | | | | | |
| ρ = .3 | | | | | |
| OLS | 0.291 | 0.0451 | 0.0041 | 0.0822 | 0.673 |
| Cluster | 0.062 | 0.0776 | 0.0135 | 0.0822 | |
| RE | 0.059 | 0.0788 | 0.0091 | 0.0822 | 0.058 |
| ρ = .6 | | | | | |
| OLS | 0.357 | 0.0452 | 0.0049 | 0.1034 | 0.892 |
| Cluster | 0.073 | 0.1004 | 0.0183 | 0.1034 | |
| RE | 0.068 | 0.1028 | 0.0127 | 0.1034 | 0.054 |
| ρ = .9 | | | | | |
| OLS | 0.497 | 0.0447 | 0.0056 | 0.1246 | 0.943 |
| Cluster | 0.062 | 0.1192 | 0.0212 | 0.1246 | |
| RE | 0.063 | 0.1210 | 0.0147 | 0.1246 | 0.048 |
Table 4. N = 50, T = 20

| Data generating process | (1) t-test rejection rate | (2) Mean (s.e.) | (3) Std (s.e.) | (4) Std (b̂) | (5) HA test rejection rate |
|---|---|---|---|---|---|
| A. Fixed effects | | | | | |
| Homoskedastic, ρ = 0 | | | | | |
| OLS | 0.050 | 0.0342 | 0.0013 | 0.0341 | 0.097 |
| Cluster | 0.049 | 0.0341 | 0.0040 | 0.0341 | |
| AR1 | 0.052 | 0.0342 | 0.0014 | 0.0341 | 0.088 |
| Homoskedastic, ρ = .3 | | | | | |
| OLS | 0.094 | 0.0334 | 0.0013 | 0.0393 | 0.077 |
| Cluster | 0.051 | 0.0379 | 0.0045 | 0.0393 | |
| AR1 | 0.056 | 0.0382 | 0.0016 | 0.0393 | 0.086 |
| Homoskedastic, ρ = .6 | | | | | |
| OLS | 0.120 | 0.0315 | 0.0014 | 0.0414 | 0.300 |
| Cluster | 0.059 | 0.0407 | 0.0052 | 0.0414 | |
| AR1 | 0.050 | 0.0412 | 0.0021 | 0.0414 | 0.092 |
| Homoskedastic, ρ = .9 | | | | | |
| OLS | 0.200 | 0.0222 | 0.0013 | 0.0336 | 0.580 |
| Cluster | 0.059 | 0.0327 | 0.0047 | 0.0336 | |
| AR1 | 0.060 | 0.0329 | 0.0024 | 0.0336 | 0.094 |
| Heteroskedastic, ρ = 0 | | | | | |
| OLS | 0.168 | 0.0340 | 0.0011 | 0.0479 | 0.408 |
| Cluster | 0.063 | 0.0458 | 0.0056 | 0.0479 | |
| AR1 | 0.171 | 0.0340 | 0.0012 | 0.0479 | 0.406 |
| Heteroskedastic, ρ = .3 | | | | | |
| OLS | 0.209 | 0.0350 | 0.0012 | 0.0536 | 0.675 |
| Cluster | 0.051 | 0.0527 | 0.0068 | 0.0536 | |
| AR1 | 0.145 | 0.0399 | 0.0016 | 0.0536 | 0.294 |
| Heteroskedastic, ρ = .6 | | | | | |
| OLS | 0.228 | 0.0394 | 0.0017 | 0.0653 | 0.802 |
| Cluster | 0.050 | 0.0636 | 0.0084 | 0.0653 | |
| AR1 | 0.119 | 0.0514 | 0.0027 | 0.0653 | 0.123 |
| Heteroskedastic, ρ = .9 | | | | | |
| OLS | 0.196 | 0.0507 | 0.0028 | 0.0775 | 0.681 |
| Cluster | 0.036 | 0.0809 | 0.0131 | 0.0775 | |
| AR1 | 0.058 | 0.0751 | 0.0056 | 0.0775 | 0.034 |
| B. Random effects | | | | | |
| ρ = .3 | | | | | |
| OLS | 0.405 | 0.0320 | 0.0029 | 0.0756 | 0.915 |
| Cluster | 0.069 | 0.0726 | 0.0131 | 0.0756 | |
| RE | 0.063 | 0.0738 | 0.0085 | 0.0756 | 0.064 |
| ρ = .6 | | | | | |
| OLS | 0.515 | 0.0318 | 0.0033 | 0.1012 | 0.944 |
| Cluster | 0.066 | 0.0976 | 0.0169 | 0.1012 | |
| RE | 0.055 | 0.0996 | 0.0118 | 0.1012 | 0.055 |
| ρ = .9 | | | | | |
| OLS | 0.614 | 0.0314 | 0.0038 | 0.1203 | 0.948 |
| Cluster | 0.054 | 0.1166 | 0.0204 | 0.1203 | |
| RE | 0.051 | 0.1194 | 0.0140 | 0.1203 | 0.053 |
√n-consistency of the estimator, and does suggest that if a parametric estimator is available, it may have better properties for estimating the variance of b̂.
The clustered estimator performs less well in the random effects specification. For small n, tests based on the CCM estimator suffer from a substantial size distortion for all values of T. For moderate to large values of n, the tests have the correct size, and the overall performance does not appear to depend on T. In addition, the variance of b̂ does not appear to decrease as T increases. These results are consistent with the lack of √(nT)-consistency in this case.19
The performance of the HA test is much less robust than that of t-tests based on
clustered standard errors. For small n, the tests are badly size distorted and have essentially
no power against any alternative hypotheses. As n and T grow, the test performance
improves. With n ¼ 50, the test remains size distorted, but it does have some power against
alternatives that increases as T increases. The HA test also performs poorly for the random
effects speciﬁcation for small n. However, for moderate or large n, the test has both the
correct size and good power.
Overall, the simulation results support the use of clustered standard errors for performing inference on regression coefficient estimates in serially correlated panel data, though they also suggest care should be taken if n is small and one suspects a "random effects" structure. The poor performance of Ŵ in "random effects" models with small n is already well known; see for example Bell and McCaffrey (2002), who also suggest a bias reduction for Ŵ in this case. However, that the estimator does quite well even for small n in the serially correlated case where the errors are mixing is somewhat surprising and is a new result which is suggested by the asymptotic analysis of the previous section. The simulation results confirm the asymptotic results, suggesting that the clustered standard errors are consistent as long as n → ∞ and that they are not sensitive to the size of n relative to T. The chief drawback of the CCM estimator is that the robustness comes at the cost of increasing the variance of the standard error estimate relative to that of standard errors estimated through more parsimonious models.
The HA test offers one simple information based criterion for choosing between the
CCM estimator and a simple parametric model of the error process. However, the
simulation evidence regarding its usefulness is mixed. In particular, the properties of the
test are poor in small sample settings where there is likely to be the largest gain to using a
parsimonious model. However, in moderate sized samples, the test performs reasonably
well, and there still may be gains to using a simple parametric model in these cases.
5. Conclusion

This paper explores the asymptotic behavior of the robust covariance matrix estimator of Arellano (1987). It extends the usual analysis performed under asymptotics where n → ∞ with T fixed to cases where n and T go to infinity jointly, considering both non-mixing and mixing cases, and to the case where T → ∞ with n fixed. The limiting behavior of the OLS estimator, b̂, in each case is different. However, the analysis shows that the conventional estimator of the asymptotic variance and the usual t and F statistics have the same properties regardless of the behavior of the time series as long as n → ∞. In addition,
19. The inconsistency of b̂ when T increases with n fixed in differences-in-differences and policy evaluation studies has also been discussed in Donald and Lang (2001).
when T ! 1 with n ﬁxed and the data satisfy mixing conditions and an iid assumption
across individuals, the usual t and F statistics can be used for inference despite the fact that
the robust covariance matrix estimator is not consistent but converges in distribution to a
limiting random variable. In this case, it is shown that the t statistic constructed using
n=ðn À 1Þ times the estimator of Arellano (1987) is asymptotically tnÀ1 , suggesting the use
of n=ðn À 1Þ times the estimator of Arellano (1987) and critical values obtained from a tnÀ1
in all cases. The use of this procedure is also supported in a short simulation experiment,
which veriﬁes that it produces tests with approximately correct size regardless of the
relative size of n and T in cases where the time series correlation between observations
diminishes as the distance between observations increases. The simulations also verify that
tests based on the robust standard errors are consistent as n increases regardless of the
relative size of n and T even in cases when the data are equicorrelated.
Acknowledgments
The research reported in this paper was motivated through conversations with Byron
Lutz, to whom I am very grateful for input in developing this paper. I would like to thank
Whitney Newey and Victor Chernozhukov as well as anonymous referees and a coeditor
for helpful comments and suggestions. This work was partially supported by the William
S. Fishman Faculty Research Fund at the Graduate School of Business, the University of
Chicago. All remaining errors are mine.
Appendix

For brevity, sketches of the proofs are provided below. More detailed versions are available in an additional Technical Appendix from the author upon request and in Hansen (2004).

Proof of Theorem 1. b̂ − b →p 0 and √(nT)(b̂ − b) →d Q⁻¹N(0, W = lim_n (1/nT) Σ_{i=1}^n E[x_i′Ω_i x_i]) follow immediately under the conditions of Theorem 1 from the Markov LLN and the Liapounov CLT. The remaining conclusions follow from repeated use of the Cauchy–Schwarz inequality, Minkowski's inequality, the Markov LLN, and the Liapounov CLT. □
The proofs of Theorems 2 and 3 make use of the following lemmas, which provide a LLN and CLT for inid data as {n, T} → ∞ jointly.

Lemma 1. Suppose {Z_{i,T}} are independent across i for all T with E[Z_{i,T}] = μ_{i,T} and E|Z_{i,T}|^{1+δ} < Δ < ∞ for some δ > 0 and all i, T. Then (1/n) Σ_{i=1}^n (Z_{i,T} − μ_{i,T}) →p 0 as {n, T} → ∞ jointly.

Proof. The proof follows from standard arguments, cf. Chung (2001) Chapter 5. Details are given in Hansen (2004). □

Lemma 2. For k × 1 vectors Z_{i,T}, suppose {Z_{i,T}} are independent across i for all T with E[Z_{i,T}] = 0, E[Z_{i,T}Z′_{i,T}] = Ω_{i,T}, and E‖Z_{i,T}‖^{2+δ} < Δ < ∞ for some δ > 0. Assume Ω = lim_{n,T} (1/n) Σ_{i=1}^n Ω_{i,T} is positive definite with minimum eigenvalue λ_min > 0. Then (1/√n) Σ_{i=1}^n Z_{i,T} →d N(0, Ω) as {n, T} → ∞ jointly.
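A tiny simulation conveys what Lemma 2 asserts: scaled sums of independent but non-identically distributed (here deliberately non-Gaussian) draws are approximately normal with variance equal to the average of the individual variances. All design choices below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
n, reps = 500, 20_000
sig = 0.5 + rng.random(n)                          # fixed heterogeneous scales
Z = (rng.exponential(size=(reps, n)) - 1.0) * sig  # independent, mean 0, non-identical
S = Z.sum(axis=1) / np.sqrt(n)                     # (1/sqrt(n)) sum_i Z_{i,T}
omega = np.mean(sig ** 2)                          # (1/n) sum_i Omega_{i,T}
skew = np.mean(((S - S.mean()) / S.std()) ** 3)    # near 0 if S is close to normal
```

The sample variance of S matches the average variance and the skewness of the heavily skewed exponential inputs is washed out, as the lemma's normal limit predicts.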
Proof. The result follows from verifying the Lindeberg condition of Theorem 2 in Phillips and Moon (1999) using an argument similar to that used in the proof of Theorem 3 in Phillips and Moon (1999). Details are given in Hansen (2004). □

Proof of Theorem 2. The conclusions follow from conventional arguments making repeated use of the Cauchy–Schwarz inequality, Minkowski's inequality, and Lemmas 1 and 2. □

In addition to using Lemmas 1 and 2, I make use of the following mixing inequality, restated from Doukhan (1994) Theorem 2 with a slight change of notation, to establish the properties of the estimators as {n, T} → ∞ when mixing conditions are imposed. Its proof may be found in Doukhan (1994, pp. 25–30).

Lemma 3. Let {z_t} be a strong mixing sequence with E[z_t] = 0, E‖z_t‖^{τ+ε} < Δ < ∞, and mixing coefficient α(m) of size (1 − c)r/(r − c) where c ∈ 2ℕ, c ⩾ τ, and r > c. Then there is a constant C depending only on τ and α(m) such that E|Σ_{t=1}^T z_t|^τ ⩽ C D(τ, ε, T), with D(τ, ε, T) defined in Doukhan (1994) and satisfying D(τ, ε, T) = O(T) if τ ⩽ 2 and D(τ, ε, T) = O(T^{τ/2}) if τ > 2.

Proof of Theorem 3. The conclusions follow under the conditions of the theorem by making use of the Cauchy–Schwarz inequality, Minkowski's inequality, and Lemma 3 to verify the conditions of Lemmas 1 and 2. □
Proof of Theorem 4. Under the hypotheses of the theorem, √(nT)(b̂ − b) →d Q⁻¹N(0, W), x_i′x_i/T − Q_i →p 0, and x_i′ε_i/√T →d N(0, W_i) are immediate from a LLN and CLT for mixing sequences, cf. White (2001, Theorems 3.47 and 5.20). The conclusion then follows from the definition of Ŵ and ε̂_i. □
qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ
pﬃﬃﬃﬃﬃﬃﬃ
bÀ1 b bÀ1
Proof of Corollary 4.1. Consider tÃ ¼ nT ðRb À rÞ= RQ W Q R0 . Under the null
b
pﬃﬃﬃﬃﬃﬃﬃ
P
b
nT Rðb À bÞ ¼ Rðð1=nTÞ i x0i xi ÞÀ1
hypothesis, Rb ¼ r, so the numerator of tÃ is
pﬃﬃﬃﬃﬃﬃﬃ P
P
pﬃﬃﬃ
d
ðð1= nT Þ i x0i i Þ ! RQÀ1 L i Bi = n. From Theorem 4 and the hypotheses of the
Corollary, the denominator of tÃ converges in distribution to
vﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ
!
u
n
n
n
X
u
1X X 0
À1 1
0
tRQ
L
Bi Bi À
Bi
Bi LQÀ1 R0 .
n
n i¼1
i¼1
i¼1
It follows from the Continuous Mapping Theorem that
P
pﬃﬃﬃ
RQÀ1 L i Bi = n
Ã d
t ! qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ .
P
P
P
ð1=nÞRQÀ1 Lð n Bi B0i À ð1=nÞ n Bi n B0i ÞLQÀ1 R0
i¼1
i¼1
i¼1
Deﬁne d ¼ ðRQÀ1 LLQÀ1 R0 Þ1=2 , so
P
pﬃﬃﬃ
d i B1;i = n
d
tÃ ! U ¼ qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ
P
P
P
ðd2 =nÞð n B1;i B01;i À ð1=nÞ n B1;i n B01;i Þ
i¼1
i¼1
i¼1
e
B1;n
¼ qﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃﬃ .
P
e2
ð1=nÞð i B2 À B1;n Þ
1;i
ARTICLE IN PRESS
C.B. Hansen / Journal of Econometrics 141 (2007) 597–620
It is straightforward to show that $\tilde{B}_{1,n} \sim N(0,1)$, that $\sum_i B_{1,i}^2 - \tilde{B}_{1,n}^2 \sim \chi^2_{n-1}$, and that $\sum_i B_{1,i}^2 - \tilde{B}_{1,n}^2$ and $\tilde{B}_{1,n}$ are independent, from which it follows that
$$U = \Bigl(\frac{n}{n-1}\Bigr)^{1/2} \frac{\tilde{B}_{1,n}}{\sqrt{\bigl(\sum_i B_{1,i}^2 - \tilde{B}_{1,n}^2\bigr)/(n-1)}} \sim \Bigl(\frac{n}{n-1}\Bigr)^{1/2} t_{n-1}.$$
The result for $F^*$ is obtained through a similar argument, using a result from Rao (2002, Chapter 8b) to verify that the resulting quantity follows an $F$ distribution. $\square$
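The final step of the proof is purely algebraic: replacing the $1/n$ in the denominator of $U$ by $1/(n-1)$ rescales $U$ into the classical $t$-type ratio, at the cost of the factor $(n/(n-1))^{1/2}$. That identity (not the distributional claim) can be checked numerically; the values below are arbitrary stand-ins for the scalars $B_{1,i}$.

```python
import math

# Arbitrary made-up values standing in for B_{1,i}; any reals work,
# since the relationship being checked is an algebraic identity.
B = [0.3, -1.2, 0.7, 2.1, -0.4]
n = len(B)

B_tilde = sum(B) / math.sqrt(n)            # B~_{1,n} = (1/sqrt(n)) * sum_i B_{1,i}
S = sum(b * b for b in B) - B_tilde ** 2   # sum_i B_{1,i}^2 - B~_{1,n}^2

U = B_tilde / math.sqrt(S / n)             # limit of t* derived in the proof
t = B_tilde / math.sqrt(S / (n - 1))       # classical t_{n-1}-type ratio

# U equals sqrt(n/(n-1)) times the t ratio, matching the proof's conclusion.
assert math.isclose(U, math.sqrt(n / (n - 1)) * t)
```

Practically, this is why critical values based on $t_{n-1}$ need the $(n/(n-1))^{1/2}$ adjustment when $n$, the number of clusters, is small.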
References
Andrews, D.W.K., 1991. Heteroskedasticity and autocorrelation consistent covariance matrix estimation.
Econometrica 59 (3), 817–858.
Arellano, M., 1987. Computing robust standard errors for within-groups estimators. Oxford Bulletin of
Economics and Statistics 49 (4), 431–434.
Baltagi, B.H., Wu, P.X., 1999. Unequally spaced panel data regressions with AR(1) disturbances. Econometric
Theory 15, 814–823.
Bell, R.M., McCaffrey, D.F., 2002. Bias reduction in standard errors for linear regression with multi-stage
samples. Mimeo RAND.
Bertrand, M., Duflo, E., Mullainathan, S., 2004. How much should we trust differences-in-differences estimates?
Quarterly Journal of Economics 119 (1), 249–275.
Bhargava, A., Franzini, L., Narendranathan, W., 1982. Serial correlation and the fixed effects model. Review of
Economic Studies 49, 533–549.
Chung, K.L., 2001. A Course in Probability Theory, third ed. Academic Press, San Diego.
Donald, S., Lang, K., 2001. Inference with differences in differences and other panel data. Mimeo.
Doukhan, P., 1994. Mixing: properties and examples. In: Fienberg, S., Gani, J., Krickeberg, K., Olkin, I.,
Wermuth, N. (Eds.), Lecture Notes in Statistics, vol. 85. Springer, New York.
Drukker, D.M., 2003. Testing for serial correlation in linear panel-data models. Stata Journal 3, 168–177.
Hahn, J., Kuersteiner, G.M., 2002. Asymptotically unbiased inference for a dynamic panel model with fixed
effects when both N and T are large. Econometrica 70 (4), 1639–1657.
Hahn, J., Newey, W.K., 2004. Jackknife and analytical bias reduction for nonlinear panel models. Econometrica
72 (4), 1295–1319.
Hansen, C.B., 2004. Inference in linear panel data models with serial correlation and an essay on the impact of
401(k) participation on the wealth distribution. Ph.D. Dissertation, Massachusetts Institute of Technology.
Hansen, C.B., 2006. Generalized least squares inference in multilevel models with serial correlation and fixed
effects. Journal of Econometrics, doi:10.1016/j.jeconom.2006.07.011.
Kezdi, G., 2002. Robust standard error estimation in fixed-effects panel models. Mimeo.
Kiefer, N.M., Vogelsang, T.J., 2002. Heteroskedasticity–autocorrelation robust testing using bandwidth equal to
sample size. Econometric Theory 18, 1350–1366.
Kiefer, N.M., Vogelsang, T.J., 2005. A new asymptotic theory for heteroskedasticity–autocorrelation robust tests.
Econometric Theory 21, 1130–1164.
Lancaster, T., 2002. Orthogonal parameters and panel data. Review of Economic Studies 69, 647–666.
Liang, K.-Y., Zeger, S., 1986. Longitudinal data analysis using generalized linear models. Biometrika 73 (1),
13–22.
MaCurdy, T.E., 1982. The use of time series processes to model the error structure of earnings in a longitudinal
data analysis. Journal of Econometrics 18 (1), 83–114.
Nickell, S., 1981. Biases in dynamic models with fixed effects. Econometrica 49 (6), 1417–1426.
Phillips, P.C.B., Moon, H.R., 1999. Linear regression limit theory for nonstationary panel data. Econometrica 67
(5), 1057–1111.
Phillips, P.C.B., Sun, Y., Jin, S., 2003. Consistent HAC estimation and robust regression testing using sharp
origin kernels with no truncation. Cowles Foundation Discussion Paper 1407.
Rao, C.R., 2002. Linear Statistical Inference and Its Application. Wiley-Interscience.
Solon, G., 1984. Estimating autocorrelations in fixed effects models. NBER Technical Working Paper no. 32.
Solon, G., Inoue, A., 2004. A portmanteau test for serially correlated errors in fixed effects models. Mimeo.
Stata Corporation, 2003. Stata User’s Guide Release 8. Stata Press, College Station, Texas.
Vogelsang, T.J., 2003. Testing in GMM models without truncation. In: Fomby, T.B., Hill, R.C. (Eds.), Advances
in Econometrics, volume 17, Maximum Likelihood Estimation of Misspecified Models: Twenty Years Later.
Elsevier, Amsterdam, pp. 192–233.
White, H., 1980. A heteroskedasticity-consistent covariance matrix estimator and a direct test for
heteroskedasticity. Econometrica 48 (4), 817–838.
White, H., 2001. Asymptotic Theory for Econometricians, revised edition. Academic Press, San Diego.
Wooldridge, J.M., 2002. Econometric Analysis of Cross Section and Panel Data. The MIT Press, Cambridge,
MA.