23andMe, Inc. v. Ancestry.com DNA, LLC et al
Filing
1
COMPLAINT for Patent Infringement, Violations of the Lanham Act, Cal. Bus. & Prof. Code 17200 and 17500, and Declaratory Relief of No Trademark Infringement and Trademark Invalidity; Demand for Jury Trial against All Defendants. (Filing Fee $400.00, Receipt No. 0971-12346102) Filed by 23andMe, Inc.. (Attachments: # 1 Exhibit 1, # 2 Exhibit 2, # 3 Exhibit 3, # 4 Exhibit 4, # 5 Exhibit 5, # 6 Exhibit 6, # 7 Exhibit 7, # 8 Exhibit 8, # 9 Exhibit 9, # 10 Exhibit 10, # 11 Exhibit 11, # 12 Civil Cover Sheet )(Gaede, William) (Filed on 5/11/2018) Modified on 5/11/2018 (jmlS, COURT STAFF). Modified on 5/15/2018 (gbaS, COURT STAFF).
EXHIBIT
8
Scaling AncestryDNA with the Hadoop
Ecosystem
June 5, 2014
What Ancestry uses from the Hadoop ecosystem
● Hadoop, HDFS, and MapReduce
● HBase
  ● Columnar NoSQL data store with effectively unlimited rows and columns
● Azkaban
  ● Workflow scheduling
What will this presentation cover?
● Describe the problem
● Discoveries with DNA
● Three key steps in the pipeline process
● Measure everything principle
● Three steps with Hadoop
  ● Hadoop as a job scheduler for the ethnicity step
  ● Scaling the matching step
  ● MapReduce implementation of phasing
● Performance
● What comes next?
Discoveries with DNA
● Autosomal DNA test that analyzes 700,000 SNPs
● Over 400,000 DNA samples in our database
● Identified 30 million relationships that connect the genotyped members
through shared ancestors
[Chart: DNA database size (# of samples) and cousin matches (4th or closer), January 2012 through April 2014; the database grows past 400,000 samples and matches past 25 million]
Three key steps in the pipeline
● What is a pipeline? Init → Processing → Finalize → Results
● Three key processing steps:
  1. Ethnicity (AdMixture)
  2. Matching (GERMLINE and Jermline)
  3. Phasing (Beagle and Underdog)
● The first pipeline executed on a single, beefy box (24 cores, 256 GB of memory)
● The only option was to scale vertically
Measure everything principle
● Record the start time, end time, duration in seconds, and sample count for every step in the pipeline, plus the full end-to-end processing time
● Put the data in pivot tables and graph each step
● Normalize the data (sample size was changing)
● Use the collected data to predict future performance
Challenges and pain points
Performance degrades as the DNA pool grows
● Static steps (scale with batch size)
● Linear steps (scale with DNA pool size)
● Quadratic steps (matching-related) – a time bomb
Ethnicity step on Hadoop
Using Hadoop as a job scheduler to scale AdMixture
First step with Hadoop
● What was in place?
  ● Smart engineers with no Hadoop experience
  ● A pipeline running on a single computer that would not scale
  ● A new business that needed a scalable solution to grow
● First steps using Hadoop:
  ● Run the AdMixture step in parallel on Hadoop
  ● Self-contained program with set inputs and outputs
  ● Simple MapReduce implementation
  ● Gained experience running jobs on Hadoop
  ● Freed up CPU and memory on the single computer for the other steps
What did we do? (Don’t cringe…)
[Diagram: the pipeline "beefy box" (process step control) submits AdMixture jobs #1 through #40 to the DNA Hadoop cluster (40 nodes), whose output shuffles to reducers]
● For a batch of 1,000 samples, submit 40 jobs with 25 samples per job
● Each mapper forks AdMixture sequentially for each of its samples
● Results go to a simple reducer that merges them into a single results file
1. Mapper -> Key: User ID, Ethnicity Result
2. Reducer -> Key: User ID, Array[Ethnicity Result]
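To make the pattern concrete, here is a minimal mapper sketch, assuming a hypothetical input of one sample ID per line, an AdMixture binary on each node's path, and an invented .Q output naming scheme; it illustrates the fork-per-sample idea rather than Ancestry's actual code.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class AdmixtureMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String sampleId = value.toString().trim();
    // Fork the AdMixture binary sequentially for each sample in this split
    // ("admixture <input.bed> <K>" is the stock command-line form; K=26 is a guess).
    Process p = new ProcessBuilder("admixture", sampleId + ".bed", "26")
        .redirectErrorStream(true)
        .start();
    if (p.waitFor() != 0) {
      throw new IOException("AdMixture failed for sample " + sampleId);
    }
    // AdMixture writes ancestry fractions to a .Q file next to the input;
    // emit (user ID, ethnicity result) so a simple reducer can merge them.
    String result = new String(
        Files.readAllBytes(Paths.get(sampleId + ".26.Q")), StandardCharsets.UTF_8);
    context.write(new Text(sampleId), new Text(result.trim()));
  }
}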
Performance results
● Went from processing 500 samples in 20 hours to processing 1,000 samples in 2½ hours
● Reduced the Beagle phasing step by 4 hours
[Chart: AdMixture time (sec) and sum of run size per batch; the "H1" release marks AdMixture moving onto Hadoop]
Provided valuable experience and bought us time
GERMLINE to Jermline
Moving the matching step to MapReduce and HBase
Introducing … GERMLINE!
● GERMLINE is an algorithm that finds hidden relationships within a pool of DNA
● It also refers to the reference implementation of that algorithm, written in C++. You can find it here:
  http://www1.cs.columbia.edu/~gusev/germline/
So what's the problem?
● GERMLINE (the implementation) was not meant to be used in an industrial setting
  ● Stateless, single-threaded, and prone to swapping (heavy memory usage)
  ● Performs poorly on large data sets
● Our metrics predicted exactly where the process would slow to a crawl
● Put simply: GERMLINE couldn't scale
GERMLINE run times (in hours)
[Chart: run time climbing toward 25 hours as the sample count grows from 2,500 to 60,000]
Projected GERMLINE run times (in hours)
[Chart: actual and projected GERMLINE run times as samples grow from 2,500 to 122,500; the projection reaches 700 hours, i.e. 29+ days]
DNA matching walkthrough
Simplified example showing how the code works
DNA matching: How it works
The input:
Cersei  : ACTGACCTAGTTGAC
Joffrey : TTAAGCCTAGTTGAC

Cersei Baratheon
● Former queen of Westeros
● Machiavellian manipulator
● Mostly evil, but occasionally sympathetic

Joffrey Baratheon
● Pretty much the human embodiment of evil
● Needlessly cruel
● Kinda looks like Justin Bieber
DNA matching: How it works
Separate into words:

           0     1     2
Cersei  : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC
DNA matching: How it works
Build the hash table:

           0     1     2
Cersei  : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC

ACTGA_0 : Cersei
TTAAG_0 : Joffrey
CCTAG_1 : Cersei, Joffrey
TTGAC_2 : Cersei, Joffrey
DNA matching: How it works
Iterate through the genome and find matches:

           0     1     2
Cersei  : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC

ACTGA_0 : Cersei
TTAAG_0 : Joffrey
CCTAG_1 : Cersei, Joffrey
TTGAC_2 : Cersei, Joffrey

Cersei and Joffrey match from position 1 to position 2
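Condensed into a toy, self-contained sketch (names and word size taken from the slides; illustrative only, not the GERMLINE or Jermline source), the walkthrough looks like this:

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordMatchDemo {
  private static final int WORD = 5; // word length used in the slides

  public static void main(String[] args) {
    Map<String, String> genomes = new LinkedHashMap<>();
    genomes.put("Cersei", "ACTGACCTAGTTGAC");
    genomes.put("Joffrey", "TTAAGCCTAGTTGAC");

    // Build the hash table: "WORD_position" -> users carrying that word there.
    Map<String, List<String>> table = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : genomes.entrySet()) {
      String dna = e.getValue();
      for (int pos = 0; (pos + 1) * WORD <= dna.length(); pos++) {
        String word = dna.substring(pos * WORD, (pos + 1) * WORD);
        table.computeIfAbsent(word + "_" + pos, k -> new ArrayList<>()).add(e.getKey());
      }
    }
    table.forEach((k, users) -> System.out.println(k + " : " + users));

    // Iterate through the entries; any entry shared by 2+ users is a match.
    table.forEach((k, users) -> {
      if (users.size() > 1) {
        String pos = k.substring(k.lastIndexOf('_') + 1);
        System.out.println(users + " match at position " + pos);
      }
    });
  }
}

Running it prints the four hash-table entries and then reports Cersei and Joffrey matching at positions 1 and 2, exactly as in the walkthrough.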
Does that mean they’re related?
...maybe
IBD to relationship estimation
● We use the total length of all shared segments to estimate the relationship between two genetic relatives
● This is basically a classification problem
[Chart: ERSA probability curves for relationship classes m1 through m11 versus total IBD shared (cM, log scale from 5 to 5000)]
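For intuition about the classification, a nearest-expected-value toy over total shared cM might look like the sketch below; the averages are rough, publicly cited figures, and the approach is far cruder than the ERSA-style probability model pictured above.

import java.util.LinkedHashMap;
import java.util.Map;

public class IbdClassifier {
  // Rough average total shared cM per relationship class (illustrative values only).
  private static final Map<String, Double> EXPECTED = new LinkedHashMap<>();
  static {
    EXPECTED.put("parent/child", 3400.0);
    EXPECTED.put("full siblings", 2600.0);
    EXPECTED.put("first cousins", 850.0);
    EXPECTED.put("second cousins", 230.0);
    EXPECTED.put("third cousins", 75.0);
    EXPECTED.put("distant or unrelated", 10.0);
  }

  public static String estimate(double totalCm) {
    String best = null;
    double bestDist = Double.MAX_VALUE;
    for (Map.Entry<String, Double> e : EXPECTED.entrySet()) {
      // Compare in log space: IBD totals differ by ratios, not absolute amounts.
      double d = Math.abs(Math.log(totalCm) - Math.log(e.getValue()));
      if (d < bestDist) {
        bestDist = d;
        best = e.getKey();
      }
    }
    return best;
  }

  public static void main(String[] args) {
    System.out.println(estimate(900)); // prints "first cousins"
  }
}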
But wait...what about Jaime?
Jaime : TTAAGCCTAGGGGCG

Jaime Lannister
● Kind of a has-been
● Killed the Mad King
● Has the hots for his sister, Cersei
The Jermline way
Step one: Update the hash table

Already stored in HBase:
              Cersei   Joffrey
2_ACTGA_0     1
2_TTAAG_0              1
2_CCTAG_1     1        1
2_TTGAC_2     1        1

New sample to add:
Jaime : TTAAG CCTAG GGGCG

Key        : [CHROMOSOME]_[WORD]_[POSITION]
Qualifier  : [USER ID]
Cell value : A byte set to 1, denoting that the user has that word at that position on that chromosome
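A sketch of that write using the stock HBase client API; the table name, column family, and connection details are assumptions, not Ancestry's actual schema.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HashTableUpdate {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table hashTable = conn.getTable(TableName.valueOf("dna_hash_table"))) {
      // Row key [CHROMOSOME]_[WORD]_[POSITION]; qualifier = user ID; value = 1.
      Put put = new Put(Bytes.toBytes("2_TTAAG_0"));
      put.addColumn(Bytes.toBytes("u"), Bytes.toBytes("Jaime"), new byte[] {1});
      hashTable.put(put);
    }
  }
}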
The Jermline way
Step two: Find matches, update the results table

Already stored in HBase:
            2_Cersei        2_Joffrey
2_Cersei                    { (1, 2), ...}
2_Joffrey   { (1, 2), ...}

New matches to add:
Jaime and Joffrey match from position 0 to position 1
Jaime and Cersei match at position 1

Key        : [CHROMOSOME]_[USER ID]
Qualifier  : [CHROMOSOME]_[USER ID]
Cell value : A list of ranges where the two users match on a chromosome
The Jermline way

Hash table after the update:
              Cersei   Joffrey   Jaime
2_ACTGA_0     1
2_TTAAG_0              1         1
2_CCTAG_1     1        1         1
2_TTGAC_2     1        1
2_GGGCG_2                        1

Results table after the update:
            2_Cersei        2_Joffrey       2_Jaime
2_Cersei                    { (1, 2), ...}  { (1), ...}
2_Joffrey   { (1, 2), ...}                  { (0, 1), ...}
2_Jaime     { (1), ...}     { (0, 1), ...}
But wait...what about Daenerys, Tyrion,
Arya, and Jon Snow?
Run them in parallel with Hadoop!
Parallelism with Hadoop
● Batches are usually about a thousand people
● Each mapper takes a single chromosome for a single person
● MapReduce jobs:
  ● Job #1: Match words
    - Updates the hash table
  ● Job #2: Match segments
    - Identifies areas where the samples match
Matching steps on Hadoop
Input: User ID, Phased DNA
[Diagram: the pipeline "beefy box" (process step control) drives two MapReduce jobs on the DNA Hadoop cluster (40 nodes): Hash Table Mappers #1..#N feed Hash Table Reducers #1..#N, then Results Mappers #1..#N feed Results Reducers #1..#N]
1. Hash Table Mapper -> Breaks input into words and fills the hash table (HBase Table #1)
2. Hash Table Reducer -> Default reducer (does nothing)
3. Results Mapper -> For each new user, reads the hash table and fills in the results table (HBase Table #2)
4. Results Reducer -> Key: Object(User ID #1, User ID #2), Array[Object(chrom + pos, Matching DNA Segment)]
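As an illustration of Job #1, a hash-table mapper might look like this minimal sketch, assuming tab-separated input of user ID, chromosome, and phased DNA, a 5-character word size, and writes routed through HBase's TableOutputFormat; it is not the actual Jermline source.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class HashTableMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
  private static final int WORD = 5;                       // assumed word size
  private static final byte[] FAMILY = Bytes.toBytes("u"); // assumed column family

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    // Assumed input line: userId<TAB>chromosome<TAB>phasedDNA
    String[] fields = value.toString().split("\t");
    String userId = fields[0], chrom = fields[1], dna = fields[2];
    for (int pos = 0; (pos + 1) * WORD <= dna.length(); pos++) {
      String word = dna.substring(pos * WORD, (pos + 1) * WORD);
      // Row key [CHROMOSOME]_[WORD]_[POSITION], qualifier = user ID, value = 1.
      byte[] row = Bytes.toBytes(chrom + "_" + word + "_" + pos);
      Put put = new Put(row);
      put.addColumn(FAMILY, Bytes.toBytes(userId), new byte[] {1});
      ctx.write(new ImmutableBytesWritable(row), put);
    }
  }
}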
Run times for matching with Jermline
A 1700% performance improvement over GERMLINE!
[Chart: Jermline matching run times in hours (y-axis 0 to 25) for sample counts from 2,500 to 117,500]
Run times for matching (in hours)
[Chart: GERMLINE run times, Jermline run times, and projected GERMLINE run times (y-axis 0 to 180 hours) for sample counts from 2,500 to 117,500]
Jermline performance
● The science team is sure the Jermline algorithm is linear
● Improving the accuracy
  ● Found a bug in the original C++ reference code
  ● Balancing false positives and false negatives
● Binary version of Jermline
  ● Uses less memory and improves speed
● Paper submitted describing the implementation
● Releasing as an open-source project soon
Beagle to Underdog
Moving the phasing step from a single process to MapReduce
Phasing goes to the dogs
● Beagle
  ● Open-source, freely available program
  ● Multi-threaded process that runs on one computer
  ● More accurate with a large sample set
● Underdog
  ● Does the same statistical calculations with a larger reference set, which increases accuracy
  ● Carefully split into a MapReduce implementation that allows parallel processing
  ● A collaboration between the DNA Science and Pipeline Developer teams
What did we do?
Input is per user: ATGCATCGTACGGACT...
[Diagram: the pipeline "beefy box" (process step control) drives two MapReduce jobs on the DNA Hadoop cluster (40 nodes): Window Mappers #1..#N feed Window Reducers #1..#N, then Phase Mappers #1..#N (which load the window reference data once) feed Phase Reducers #1..#N]
1. Window Mapper -> Key: Window ID, Object(User ID, DNA)
2. Window Reducer -> Key: Window ID, Array[Object(User ID, DNA)]
3. Phase Mapper (loads window data) -> Key: User ID, Object(Window ID, Phased DNA)
4. Phase Reducer -> Key: User ID, Array[Object(Window ID, Phased DNA)]
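As a sketch of step 1, a window mapper might split each user's DNA into fixed-width windows like this; the window width, input format, and key format are assumptions, not Underdog's actual code.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WindowMapper extends Mapper<LongWritable, Text, Text, Text> {
  private static final int WINDOW = 1000; // assumed markers per window

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    // Assumed input per user: "userId<TAB>ATGCATCGTACGGACT..."
    String[] fields = value.toString().split("\t");
    String userId = fields[0], dna = fields[1];
    for (int w = 0; w * WINDOW < dna.length(); w++) {
      String slice = dna.substring(w * WINDOW, Math.min((w + 1) * WINDOW, dna.length()));
      // Key: Window ID -> the reducer then gathers Array[Object(User ID, DNA)],
      // so every user's slice of the same window lands on one reducer.
      ctx.write(new Text("window_" + w), new Text(userId + "\t" + slice));
    }
  }
}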
Underdog performance
● Went from 12 hours to process 1,000 samples to under 25 minutes with a MapReduce implementation
[Chart: total run size versus total Beagle-Underdog duration; the drop marks where Underdog replaced Beagle]
With improved accuracy!
Performance and next steps
Incremental change
Pipeline steps and incremental change…
● Incremental change over time
● Supporting the business in a “just in time” Agile way
[Chart: stacked per-step durations (y-axis 0 to 250,000) as run size grows from 500 to ~400,000 samples, broken out by step: Pipeline Initialization, Plink Prep, Admixture, Beagle Post Phasing, Germline-Jermline Processing, Germline-Jermline Results Processing, Relationship Processing, Pipeline Finalize, and Beagle-Underdog Phasing; annotations mark AdMixture on Hadoop, Jermline replacing Germline, the Ethnicity V2 release, and Underdog replacing Beagle]
…while the business continues to grow rapidly
[Chart: DNA database size (# of processed samples), growing from near zero in January 2012 to more than 400,000 by April 2014]
What’s next? Building different pipelines
● Azkaban
  ● Allows us to easily tie together steps on Hadoop
  ● Drop different steps in or out to create different pipelines (see the sketch below)
  ● A significant improvement over a hand-coded pipeline
● Cloud
  ● New algorithm changes will force a complete re-run of the entire DNA pool
  ● Best example: a new matching or ethnicity algorithm will force us to reprocess 400K+ samples
  ● Solution: use the cloud for this processing while the current pipeline keeps chugging along
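For flavor, here is a hypothetical set of Azkaban .job files (file names, jar, and driver classes are invented) chaining three pipeline steps; Azkaban derives the execution DAG from the dependencies entries.

# phasing.job
type=command
command=hadoop jar dna-pipeline.jar com.example.PhasingDriver

# ethnicity.job
type=command
command=hadoop jar dna-pipeline.jar com.example.EthnicityDriver
dependencies=phasing

# matching.job
type=command
command=hadoop jar dna-pipeline.jar com.example.MatchingDriver
dependencies=ethnicity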
What’s next? Other areas for improvement
● Admixture as a MapReduce implementation
  ● Last major algorithm that needs to be addressed
  ● Expect performance improvements similar to Underdog's
● Matching growth will cause problems
  ● Matches per run are increasing
  ● Change the handoff
[Chart: batch run size, total matches per run (0 to 9,000,000), and Jermline processing duration as run size grows from ~84,000 to ~393,000 samples]
Keep measuring everything and adjust
Questions?
Ancestry is hiring for the DNA Pipeline Team!
Tech Roots Blog: http://blogs.ancestry.com/techroots
byetman@ancestry.com
Special thanks to the DNA Science and Pipeline Development Teams at
Ancestry