Campbell et al v. Facebook Inc.
Filing
109
MOTION for Extension of Time to File Plaintiffs' Motion for Extension of Class Certification and Summary Judgment Deadlines filed by Matthew Campbell, Michael Hurley. (Attachments: # 1 Proposed Order, # 2 Declaration of David Rudolph, # 3 Exhibit 1, # 4 Exhibit 2, # 5 Exhibit 3, # 6 Exhibit 4, # 7 Exhibit 5, # 8 Exhibit 6, # 9 Exhibit 7, # 10 Exhibit 8, # 11 Exhibit 9, # 12 Exhibit 10, # 13 Exhibit 11, # 14 Exhibit 12, # 15 Exhibit 13, # 16 Exhibit 14, # 17 Exhibit 15, # 18 Exhibit 16, # 19 Exhibit 17, # 20 Exhibit 18, # 21 Exhibit 19, # 22 Exhibit 20, # 23 Exhibit 21)(Sobol, Michael) (Filed on 9/16/2015)
EXHIBIT 15
August 20, 2015
VIA ELECTRONIC MAIL
Michael Sobol, Esq.
David Rudolph, Esq.
Melissa Gardner, Esq.
Lieff Cabraser Heimann & Bernstein, LLP
275 Battery Street, 29th Floor
San Francisco, CA 94111-3339
Hank Bates, Esq.
Allen Carney, Esq.
David Slade, Esq.
Carney Bates & Pulliam, PLLC
11311 Arcade Drive
Little Rock, AR 72212
Re: Campbell et al. v. Facebook, Inc., N.D. Cal. Case No. 13-cv-05996-PJH
Dear David:
I write in response to your July 23, 2015 letter regarding Facebook’s use of predictive coding.
As discussed in our June 19 letter and during our call on July 17, Facebook’s predictive coding
process is intended to apply advanced machine learning techniques to the text of documents to
automatically classify un-reviewed documents into predefined categories of interest, such as
responsiveness. The classification models are “trained” through supervised learning—meaning
the model is built from a human-reviewed subset of documents—and can be iteratively
strengthened with fine-tuning techniques. Our team can then leverage the results to review the
documents that are most likely to be responsive so we can more quickly understand and produce
responsive content from the document population.
Terminology
As a preliminary matter, we wish to clarify some terminology used in your letter that we believe
may otherwise cause confusion in characterizing how the predictive coding process has been used
in this case. Accordingly, for purposes of the explanation below, the term
“training set” refers to the set of documents that is initially used to “train” the classification
model, as described above. The term “assessment” refers to a random sample of documents
used to evaluate the performance of the classification model—this is often also referred to as a
“QC set,” “validation set,” or (as in your letter) the “control set.” The term “recall” describes
a measurement used to determine the completeness of the review, and denotes the percentage of
all responsive documents that is returned by the model. Finally, the term “excluded documents”
describes documents that do not meet the requirements for predictive coding and are thus
excluded from the process altogether. In this case, Facebook has excluded the following
document types from the predictive coding process (in addition, of course, to any file we already
reviewed linearly, which may have been used for training or assessment of the predictive coding
model as described below). We have no reason to believe that responsive documents are
included among these documents.
.ARC File
Progressive JPEG
.ZIP/JAR File
Quicktime Movie
Adobe Indesign Interchange
Tagged Image File Format
Adobe Photoshop
TrueType Font Collection File
Compuserve GIF
UNIX GZip
Enhanced Windows Metafile
UNIX Tar
EPS (TIFF Header)
Unknown format
EXE / DLL File
Windows Bitmap
Extensible Markup Language (XML)
Windows Media Audio
Windows Metafile
Windows shortcut
Windows Sound
Microsoft Digital Video Recording
MPEG-4 file
TrueType Font File
Windows Icon
Windows Media Video
Windows Video
ISO Base Media File
Java Class File
JPEG File Interchange
Macromedia Flash 10
Macromedia Flash 4-8
Macromedia Flash 9
MPEG-1 audio - Layer 3
MPEG-2 audio - Layer 3
Portable Network Graphics Format
Post Script
Predictive Coding Process
We also wish to provide some clarification as to the nature of the process we have undertaken to
conduct predictive coding in this case.
First, we created a training set to teach the computer. The training set includes (i) a set of
documents identified as responsive in our linear review, (ii) the results of a review of randomly
selected documents from a subset of the overall data set, and (iii) the materials included in
Facebook’s Production Volumes 3, 4, and 6.[1] As we continue to review and produce responsive
documents, those documents will likewise be incorporated into training sets to further train the
computer. This method helps to identify enough responsive documents for training where—as
here—the responsiveness rate is very low.
Next, we conducted an assessment of the results by reviewing documents in an “assessment set”
(the equivalent of what is labelled in your letter as a “control set”). The assessment set is a
random, statistically valid, representative sample of the overall data set of unreviewed
documents. The statistical parameters used to determine the size of the assessment set were a
confidence level of 95%, a margin of error of +/- 2.5%, and a variance of 50%.
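The arithmetic behind these parameters can be illustrated with a minimal sketch. It assumes the standard normal-approximation formula for estimating a proportion, with no finite-population correction, and it reads the "Variance of 50%" as an assumed proportion of p = 0.5. The letter does not state which formula Equivio applies, and the function name sample_size is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size(confidence: float, margin: float, p: float = 0.5) -> int:
    """Minimum sample size for estimating a proportion.

    Assumption: normal approximation, no finite-population correction.
    p = 0.5 is the most conservative choice (largest variance).
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z * z * p * (1 - p) / (margin * margin))


# Parameters stated in the letter: 95% confidence, +/- 2.5% margin, p = 0.5.
n = sample_size(confidence=0.95, margin=0.025, p=0.5)
print(n)  # 1537
```

Under these assumptions the formula gives a minimum of 1,537 documents. The letter reports an assessment set of 1,576 documents and does not explain the difference.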
We then used the training set and the assessment set to generate a predictive model. More
specifically, as described during our call, we used a software suite—Equivio—to select
documents from the training population until a stable model was created. Equivio then tested the
model using the reviewed assessment set. More concretely, in order to evaluate the performance
of the predictive model, the assessment was used to evaluate the rate of agreement between the
legal team reviewers and the computer.
The predictive model assigned each document in the data set a probability score from 0 to 100,
with 0 being the most unlikely to be responsive and 100 being the most likely to be responsive.
The coding applied to the documents in the assessment was compared to the probability scores of
the documents in the assessment (generated by the predictive model).
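The comparison described above can be illustrated with a minimal sketch. It assumes that a cutoff score is chosen and that a document scoring at or above the cutoff counts as predicted responsive. The letter does not say how Equivio treats scores equal to the cutoff, and the function name, the cutoff of 50, and the sample data are all hypothetical.

```python
from collections import Counter


def tally(scored_docs, cutoff):
    """Count agreement between reviewer coding and model scores.

    scored_docs: iterable of (score from 0 to 100, reviewer_coded_responsive).
    Assumption: score >= cutoff means "predicted responsive".
    """
    counts = Counter(TP=0, FP=0, TN=0, FN=0)
    for score, responsive in scored_docs:
        predicted = score >= cutoff
        if predicted and responsive:
            counts["TP"] += 1   # model and reviewer both say responsive
        elif predicted and not responsive:
            counts["FP"] += 1   # model says responsive, reviewer disagrees
        elif not predicted and responsive:
            counts["FN"] += 1   # responsive document scored below the cutoff
        else:
            counts["TN"] += 1   # both say not responsive
    return counts


# Hypothetical (score, reviewer coding) pairs, for illustration only.
sample = [(92, True), (75, False), (40, True), (10, False), (55, True)]
result = tally(sample, cutoff=50)
print(dict(result))
```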
“Recall,” as described previously, is a measurement used to evaluate the results of the predictive
model, and describes the percentage of all relevant documents that is returned by the model. We
added documents to the model until it achieved a responsive recall of 80%. In other
words, the model accurately identified 80% of responsive documents in the assessment set. We
intend to review all documents identified by the model as likely to be responsive and produce
those that are indeed responsive.
[1] Productions 1 and 2 did not consist of emails, and Production 5 consisted of a single document; these materials
therefore were not ideal for inclusion in the training set.
In addition to reviewing the documents identified by the model as likely to be responsive, we
will use a “Test the Rest” process to finalize the results of the predictive coding process. “Test
the Rest” involves pulling a statistical sample from the documents identified by the model to be
unlikely to be responsive and analyzing those documents to confirm that the richness within the
“Rest” is not higher than expected (in other words, the documents with relevance scores below
the cut-off do not include an unexpectedly high proportion of relevant documents). The goal is
to reconfirm that the responsive recall rate was achieved.
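The “Test the Rest” check can be illustrated with a minimal sketch. It assumes the expected richness of the “Rest” is taken from the assessment-set figures reported later in this letter (6 responsive documents among the 898 that scored below the cutoff) and that the sample is evaluated with a Wilson score interval. The letter does not describe the statistical test actually used, and the sample of 1,500 documents with 9 responsive is hypothetical.

```python
import math
from statistics import NormalDist


def wilson_interval(hits: int, n: int, confidence: float = 0.95):
    """Wilson score interval for a proportion (hits out of n)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Expected richness of the "Rest", from the assessment set in the letter:
# 6 false negatives among 892 + 6 = 898 below-cutoff documents.
expected_elusion = 6 / 898

# Hypothetical "Test the Rest" sample: 1,500 below-cutoff documents
# reviewed, 9 found responsive. These numbers are not from the letter.
low, high = wilson_interval(hits=9, n=1500)

# Flag a problem only if even the lower bound exceeds the expectation.
higher_than_expected = low > expected_elusion
print(f"observed interval: {low:.4f} to {high:.4f}")
print(f"expected: {expected_elusion:.4f}, higher than expected: {higher_than_expected}")
```

In this hypothetical sample the expected rate falls inside the interval, so the sample would give no sign that the “Rest” is richer than expected.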
Responses to Additional Statements and Inquiries in your July 23rd Letter
Several statements in your letter concerning our July 17th call do not fully capture our
recollection of that call or the relevant facts, and we therefore provide the following
clarifications:
• We remind you, as stated during our call, that predictive coding is an iterative process,
and as such, if a new population requires review and a linear review cannot be performed
efficiently, it may be appropriate to conduct another iteration of the predictive coding
process.
• The assessment set (or “control set,” as your letter calls it) actually contains 1,576
documents, not 1,591 documents. This assessment had a point estimate richness of
1.90% (rather than 0.06% as you stated) calculated with a 95% confidence level and
2.5% margin of error.
• Your statement that “Facebook does not intend to further review or produce any of the
Filtered By Search Term Documents that fell beneath the cut-off score established during
the predictive coding” is incorrect. Facebook will manually review any document that
was excluded from the predictive coding process but has a family member with a score
above the cut-off.
• Your assertion that “No further training has been performed, although the model
produced by the training has been used to classify an unspecified number of additional
Filtered By Search Term Documents” also is incorrect. We performed training at
multiple stages, including after the most recent upload of documents. Similarly, the
assessment set was drawn randomly from all unreviewed “Filtered By Search Term
Documents.”
As my colleague Jeana Bisnar Maute mentioned in her email to you on July 27, 2015, we did not
agree during our call on July 17th to provide the information you list on page 2 of your letter.
Rather, we agreed to engage in a productive discussion and consider your further inquiries with
the goal of ensuring that this process is amenable to both parties as a means of making the
discovery process efficient and appropriate. We have considered your inquiries and provide the
following responses to the questions posed on page 2 of your letter:
1. The total number of documents against which the search terms were run is not readily
available using existing tools.
2. We identified the documents to include in the set against which search terms were
applied by identifying those documents sent or received by the following former or
current Facebook personnel:
Michael Adkins
Jordan Blackthorne
Peng Fan
Dan Fechete
Jonathan Gross
Ray He
Alex Himel
Matt Jones
Mark Kinsey
Ryan Lim
Jiakai Liu
Malorie Lucich
Caryn Marooney
Ben Mathews
Christopher Palow
Giri Rajaram
Scott Renfro
Rob Sherman
Mathew Verghese
Mike Vernal
Frederic Wolens
Gary Wu
3. The number of true positives, true negatives, false positives, and false negatives that
resulted from application of the predictive coding model against the assessment set are as
follows:
Above the cutoff:
True Positive: 24
False Positive: 654
Below the cutoff:
True Negative: 892
False Negative: 6
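The figures above can be checked against the recall and richness stated elsewhere in this letter with a short sketch. Recall (80%) and richness (1.90%) appear in the letter. Precision and the below-cutoff rate (often called elusion) are derived here for illustration and are not stated in the letter. The variable names are hypothetical.

```python
# Counts reported in the letter for the assessment set.
tp, fp = 24, 654   # above the cutoff
tn, fn = 892, 6    # below the cutoff

total = tp + fp + tn + fn      # size of the assessment set
responsive = tp + fn           # documents reviewers coded responsive

recall = tp / responsive       # share of responsive documents above the cutoff
richness = responsive / total  # share of the whole sample that is responsive
precision = tp / (tp + fp)     # derived: share of above-cutoff documents that are responsive
elusion = fn / (tn + fn)       # derived: share of below-cutoff documents that are responsive

print(total)                # 1576
print(responsive)           # 30
print(f"{recall:.1%}")      # 80.0%
print(f"{richness:.2%}")    # 1.90%
print(f"{precision:.2%}")   # 3.54%
print(f"{elusion:.2%}")     # 0.67%
```

The totals agree with the assessment set size of 1,576, the 30 responsive documents, the 80% recall, and the 1.90% richness reported in the letter.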
We also provide the following additional information in response to your requests:
1. The assessment set is 1,576 documents, comprising a random sample from all
unreviewed “Filtered By Search Term Documents” that are appropriate for predictive
coding (i.e., not an excluded file type, etc., described above).
2. For information about “whether seeding was used, and if so, when and how,” please see
the section above describing the contents and procedure undertaken for the training set.
3. No subset of documents below the cutoff score has been manually reviewed yet (aside
from family members of documents above the cutoff score).
4. The former and current Facebook personnel whose documents were included in the
roughly 590,000 unique Filtered By Search Terms documents work or worked in the following
areas:
• Social Plugins: Dan Fechete, Jonathan Gross, Ray He, Alex Himel, Mark Kinsey,
Scott Renfro, Mike Vernal
• Communications: Malorie Lucich, Caryn Marooney, Frederic Wolens
• Messages: Michael Adkins, Ryan Lim, Jiakai Liu
• Site Integrity: Matt Jones, Ben Mathews, Christopher Palow
• Advertising: Jordan Blackthorne, Peng Fan, Giri Rajaram, Mathew Verghese, and
Gary Wu
• Policy: Rob Sherman
5. None of the custodians from whom we have collected documents has been excluded from
the predictive coding population.
6. The assessment set was classified as follows:
Responsive: 30
Not Responsive: 1,546
7. No documents were unclassified in the training set. All documents were coded. Equivio
was able to create a stable model after selecting 1,040 of the documents classified for
training. Those documents were classified as follows:
Responsive: 393
Not Responsive: 647
8. The individuals who classified the documents in both the training and assessment sets are
Gibson Dunn attorneys.
9. Gibson Dunn attorneys considered the responsiveness of each document and conferred as
appropriate.
10. The individuals who classified the documents in both the training and assessment sets are
Gibson Dunn attorneys.
11. Gibson Dunn attorneys considered the responsiveness of each document and conferred as
appropriate.
12. As described above (supra pp. 3-4), both the training and assessment sets contained
random samples of documents that were manually classified by Gibson Dunn attorneys,
who considered the responsiveness of each document and conferred as appropriate.
13. We have produced such documents, and will continue to produce them, to the extent they
are discoverable, responsive, and non-privileged.
As we have explained, Facebook is using these methods to identify the most relevant documents
from an enormous set, to expedite the production, and to ensure that there is a fair, reasonable,
and proportionate review process. See N.D. Cal. ESI Guideline 2.02 (recommending conferring
regarding “[o]pportunities to reduce costs and increase efficiency and speed, such as by
conferring about the methods and technology used for searching ESI to help identify the relevant
information and sampling methods to validate the search for relevant information”).
Please let us know if you have any additional questions.
Sincerely,
/s/ Priyanka Rajagopalan
Priyanka Rajagopalan
cc: All counsel of record (via e-mail only)