23andMe, Inc. v. Ancestry.com DNA, LLC et al

Filing 1

COMPLAINT for Patent Infringement, Violations of the Lanham Act, Cal. Bus. & Prof. Code 17200 and 17500, and Declaratory Relief of No Trademark Infringement and Trademark Invalidity; Demand for Jury Trial against All Defendants. (Filing Fee $400.00, Receipt No. 0971-12346102) Filed by 23andMe, Inc.. (Attachments: # 1 Exhibit 1, # 2 Exhibit 2, # 3 Exhibit 3, # 4 Exhibit 4, # 5 Exhibit 5, # 6 Exhibit 6, # 7 Exhibit 7, # 8 Exhibit 8, # 9 Exhibit 9, # 10 Exhibit 10, # 11 Exhibit 11, # 12 Civil Cover Sheet )(Gaede, William) (Filed on 5/11/2018) Modified on 5/11/2018 (jmlS, COURT STAFF). Modified on 5/15/2018 (gbaS, COURT STAFF).

EXHIBIT 8

Scaling AncestryDNA with the Hadoop Ecosystem
June 5, 2014

What Ancestry uses from the Hadoop ecosystem
● Hadoop, HDFS, and MapReduce
● HBase: a columnar, NoSQL data store with unlimited rows and columns
● Azkaban: workflow management

What will this presentation cover?
● Describe the problem
● Discoveries with DNA
● Three key steps in the pipeline process
● Measure everything principle
● Three steps with Hadoop
  ● Hadoop as a job scheduler for the ethnicity step
  ● Scaling the matching step
  ● MapReduce implementation of phasing
● Performance
● What comes next?

Discoveries with DNA
● Autosomal DNA test that analyzes 700,000 SNPs
● Over 400,000 DNA samples in our database
● Identified 30 million relationships that connect the genotyped members through shared ancestors
[Chart: database size (# of samples) and cousin matches (4th cousin or closer), Jan-12 through Apr-14]

Three key steps in the pipeline
● What is a pipeline? Init -> processing steps -> Finalize -> Results Processing
  1. Ethnicity (AdMixture)
  2. Matching (GERMLINE and Jermline)
  3. Phasing (Beagle and Underdog)
● First pipeline executed on a single, beefy box (the "Beefy Box": 24 cores, 256G mem)
● Only option is to scale vertically

Measure everything principle
● Record the start time, end time, duration in seconds, and sample count for every step in the pipeline, plus the full end-to-end processing time
● Put the data in pivot tables and graph each step
● Normalize the data (the sample size was changing)
● Use the data collected to predict future performance

Challenges and pain points
Performance degrades as the DNA pool grows:
● Static (by batch size)
● Linear (by DNA pool size)
● Quadratic (matching-related steps) – a time bomb

Ethnicity step on Hadoop
Using Hadoop as a job scheduler to scale AdMixture

First step with Hadoop
● What was in place?
  ● Smart engineers with no Hadoop experience
  ● A pipeline running on a single computer that would not scale
  ● A new business that needed a scalable solution to grow
● First step using Hadoop
  ● Run the AdMixture step in parallel on Hadoop
  ● Self-contained program with set inputs and outputs
  ● Simple MapReduce implementation
  ● Gained experience running jobs on Hadoop
  ● Freed up CPU and memory on the single computer for the other steps

What did we do? (Don't cringe…)
● For a batch of 1,000 samples, submit 40 jobs with 25 samples per job
● Each mapper forks AdMixture sequentially for each of its samples
● Results go to a simple reducer that merges them into a single results file
[Diagram: the pipeline "Beefy Box" (process step control) submits AdMixture jobs #1 through #40 to the DNA Hadoop cluster (40 nodes); mapper output shuffles to the reducers]
1. Mapper -> Key: User ID, Ethnicity Result
2. Reducer -> Key: User ID, Array[Ethnicity Result]
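The forking pattern above can be sketched as a small Hadoop job. This is a minimal illustration, not Ancestry's code: the admixture-wrapper.sh script, the tab-separated input of user IDs and genotype paths, and the class names are all assumptions made for the example.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Each input line names one sample: "userId<TAB>/path/to/genotypes" (assumed layout).
    // The mapper forks the AdMixture binary sequentially for each sample it receives,
    // mirroring "Mapper -> Key: User ID, Ethnicity Result".
    public class AdmixtureMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            String userId = parts[0];
            // "admixture-wrapper.sh" is a hypothetical script that runs the AdMixture
            // binary on one sample and prints the ethnicity result to stdout.
            Process p = new ProcessBuilder("admixture-wrapper.sh", parts[1]).start();
            StringBuilder result = new StringBuilder();
            try (BufferedReader r =
                     new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String out;
                while ((out = r.readLine()) != null) {
                    result.append(out).append(' ');
                }
            }
            p.waitFor();
            ctx.write(new Text(userId), new Text(result.toString().trim()));
        }
    }

    // A simple reducer that merges the per-sample results into a single results
    // file, mirroring "Reducer -> Key: User ID, Array[Ethnicity Result]".
    class AdmixtureReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text userId, Iterable<Text> results, Context ctx)
                throws IOException, InterruptedException {
            for (Text result : results) {
                ctx.write(userId, result);
            }
        }
    }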
Performance results
● Went from processing 500 samples in 20 hours to processing 1,000 samples in 2½ hours
● Reduced the Beagle phasing step by 4 hours
● Provided valuable experience and bought us time
[Chart: sum of run size and AdMixture time (sec); annotation marks the H1 release of AdMixture on Hadoop]

GERMLINE to Jermline
Moving the matching step to MapReduce and HBase

Introducing … GERMLINE!
● GERMLINE is an algorithm that finds hidden relationships within a pool of DNA
● It also refers to the reference implementation of that algorithm, written in C++. You can find it here: http://www1.cs.columbia.edu/~gusev/germline/

So what's the problem?
● GERMLINE (the implementation) was not meant to be used in an industrial setting
  ● Stateless, single-threaded, and prone to swapping (heavy memory usage)
  ● GERMLINE performs poorly on large data sets
● Our metrics predicted exactly where the process would slow to a crawl
● Put simply: GERMLINE couldn't scale
[Chart: GERMLINE run times (in hours) from 2,500 to 60,000 samples]
[Chart: projected GERMLINE run times (in hours) out to 122,500 samples; 700 hours = 29+ days]

DNA matching walkthrough
A simplified example showing how the code works

DNA matching: How it works
The input:
Cersei  : ACTGACCTAGTTGAC
Joffrey : TTAAGCCTAGTTGAC

Cersei Baratheon
• Former queen of Westeros
• Machiavellian manipulator
• Mostly evil, but occasionally sympathetic

Joffrey Baratheon
• Pretty much the human embodiment of evil
• Needlessly cruel
• Kinda looks like Justin Bieber

Separate into words:
            0     1     2
Cersei  : ACTGA CCTAG TTGAC
Joffrey : TTAAG CCTAG TTGAC

Build the hash table:
ACTGA_0 : Cersei
TTAAG_0 : Joffrey
CCTAG_1 : Cersei, Joffrey
TTGAC_2 : Cersei, Joffrey

Iterate through the genome and find matches:
Cersei and Joffrey match from position 1 to position 2

Does that mean they're related? ...maybe

IBD to relationship estimation
● We use the total length of all shared segments to estimate the relationship between two genetic relatives
● This is basically a classification problem
[Chart: ERSA probability curves for relationships m1 through m11 vs. total_IBD (cM)]

But wait...what about Jaime?
Jaime : TTAAGCCTAGGGGCG
Jaime Lannister
• Kind of a has-been
• Killed the Mad King
• Has the hots for his sister, Cersei

The Jermline way
Step one: Update the hash table.
New sample to add: Jaime : TTAAG CCTAG GGGCG
Already stored in HBase:
            Cersei   Joffrey
2_ACTGA_0   1
2_TTAAG_0            1
2_CCTAG_1   1        1
2_TTGAC_2   1        1
Key: [CHROMOSOME]_[WORD]_[POSITION]
Qualifier: [USER ID]
Cell value: a byte set to 1, denoting that the user has that word at that position on that chromosome

The Jermline way
Step two: Find matches, update the results table.
Already stored in HBase:
            2_Cersei        2_Joffrey
2_Cersei                    { (1, 2), ...}
2_Joffrey   { (1, 2), ...}
New matches to add:
Jaime and Joffrey match from position 0 to position 1
Jaime and Cersei match at position 1
Key: [CHROMOSOME]_[USER ID]
Qualifier: [CHROMOSOME]_[USER ID]
Cell value: a list of ranges where the two users match on a chromosome

The Jermline way
After adding Jaime:
Hash table:
            Cersei   Joffrey   Jaime
2_ACTGA_0   1
2_TTAAG_0            1         1
2_CCTAG_1   1        1         1
2_TTGAC_2   1        1
2_GGGCG_2                      1
Results table:
            2_Cersei        2_Joffrey       2_Jaime
2_Cersei                    { (1, 2), ...}  { (1), ...}
2_Joffrey   { (1, 2), ...}                  { (0, 1), ...}
2_Jaime     { (1), ...}     { (0, 1), ...}
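To make the walkthrough concrete, here is a self-contained sketch that reproduces the Cersei/Joffrey/Jaime example, with in-memory maps standing in for the two HBase tables. The keys follow the [CHROMOSOME]_[WORD]_[POSITION] scheme above (chromosome 2 and 5-letter words, as in the slides); the class name and the harness itself are assumptions for illustration.

    import java.util.*;

    // In-memory sketch of the matching walkthrough: split each genome into 5-letter
    // words, record who has which word at which position (the "hash table"), then
    // turn runs of shared word positions into match ranges (the "results table").
    public class MatchingWalkthrough {
        static final int WORD = 5;
        static final int CHROM = 2; // the walkthrough's keys use chromosome 2

        public static void main(String[] args) {
            Map<String, String> genomes = new LinkedHashMap<>();
            genomes.put("Cersei",  "ACTGACCTAGTTGAC");
            genomes.put("Joffrey", "TTAAGCCTAGTTGAC");
            genomes.put("Jaime",   "TTAAGCCTAGGGGCG");

            // Hash table: "[CHROMOSOME]_[WORD]_[POSITION]" -> users with that word there.
            Map<String, Set<String>> hashTable = new LinkedHashMap<>();
            for (Map.Entry<String, String> e : genomes.entrySet()) {
                String dna = e.getValue();
                for (int pos = 0; pos * WORD < dna.length(); pos++) {
                    String word = dna.substring(pos * WORD, (pos + 1) * WORD);
                    hashTable.computeIfAbsent(CHROM + "_" + word + "_" + pos,
                                              k -> new TreeSet<>()).add(e.getKey());
                }
            }

            // For each pair of users, collect the positions where they share a word.
            Map<String, TreeSet<Integer>> shared = new TreeMap<>();
            for (Map.Entry<String, Set<String>> e : hashTable.entrySet()) {
                String key = e.getKey();
                int pos = Integer.parseInt(key.substring(key.lastIndexOf('_') + 1));
                List<String> users = new ArrayList<>(e.getValue());
                for (int i = 0; i < users.size(); i++) {
                    for (int j = i + 1; j < users.size(); j++) {
                        shared.computeIfAbsent(users.get(i) + " & " + users.get(j),
                                               k -> new TreeSet<>()).add(pos);
                    }
                }
            }

            // Report each run of consecutive shared positions as one match range.
            for (Map.Entry<String, TreeSet<Integer>> e : shared.entrySet()) {
                List<String> ranges = new ArrayList<>();
                int start = -2, prev = -2;
                for (int pos : e.getValue()) {
                    if (pos != prev + 1) {
                        if (start >= 0) ranges.add("(" + start + ", " + prev + ")");
                        start = pos;
                    }
                    prev = pos;
                }
                ranges.add("(" + start + ", " + prev + ")");
                System.out.println(e.getKey() + " match at " + ranges);
            }
        }
    }

Running it prints the same results the slides derive: Cersei & Joffrey match from position 1 to 2, Jaime & Joffrey from 0 to 1, and Cersei & Jaime at position 1.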
But wait...what about Daenerys, Tyrion, Arya, and Jon Snow?
Run them in parallel with Hadoop!

Parallelism with Hadoop
● Batches are usually about a thousand people
● Each mapper takes a single chromosome for a single person
● MapReduce jobs:
  ● Job #1: Match words – updates the hash table
  ● Job #2: Match segments – identifies areas where the samples match

Matching steps on Hadoop
Input: User ID, Phased DNA
[Diagram: the pipeline "Beefy Box" (process step control) drives hash table mappers and reducers, then results mappers and reducers, on the DNA Hadoop cluster (40 nodes)]
1. Hash Table Mapper -> breaks the input into words and fills the hash table (HBase Table #1)
2. Hash Table Reducer -> default reducer (does nothing)
3. Results Mapper -> for each new user, reads the hash table and fills in the results table (HBase Table #2)
4. Results Reducer -> Key: Object(User ID #1, User ID #2), Array[Object(chrom + pos, Matching DNA Segment)]

[Chart: run times for matching with Jermline (in hours), 2,500 to 117,500 samples – a 1700% performance improvement over GERMLINE!]
[Chart: GERMLINE run times, Jermline run times, and projected GERMLINE run times (in hours) vs. samples]

Jermline performance
● The science team is sure the Jermline algorithm is linear
● Improving the accuracy
  ● Found a bug in the original C++ reference code
  ● Balancing false positives and false negatives
● Binary version of Jermline
  ● Uses less memory and improves speed
● Paper submitted describing the implementation
● Releasing as an open source project soon

Beagle to Underdog
Moving the phasing step from a single process to MapReduce

Phasing goes to the dogs
● Beagle
  ● Open source, freely available program
  ● Multi-threaded process that runs on one computer
  ● More accurate with a large sample set
● Underdog
  ● Does the same statistical calculations with a larger reference set, which increases accuracy
  ● Carefully split into a MapReduce implementation that allows parallel processing
  ● A collaboration between the DNA Science and Pipeline Developer teams

What did we do?
Window reference data is loaded once; the input is per user: ATGCATCGTACGGACT...
[Diagram: the pipeline "Beefy Box" (process step control) drives window mappers and reducers, then phase mappers and reducers, on the DNA Hadoop cluster (40 nodes)]
1. Window Mapper -> Key: Window ID, Object(User ID, DNA)
2. Window Reducer -> Key: Window ID, Array[Object(User ID, DNA)]
3. Phase Mapper (loads window data) -> Key: User ID, Object(Window ID, Phased DNA)
4. Phase Reducer -> Key: User ID, Array[Object(Window ID, Phased DNA)]
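A minimal sketch of the first two steps in that flow, under stated assumptions: the input is one "userId<TAB>sequence" line per user, the window width is an arbitrary illustrative constant, and the statistical phasing itself (steps 3 and 4) is omitted. None of this is Ancestry's actual Underdog code.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // "1. Window Mapper -> Key: Window ID, Object(User ID, DNA)"
    // Cuts each user's DNA into fixed-width windows so every window can be
    // phased independently (and in parallel) by the next step.
    public class WindowMapper extends Mapper<LongWritable, Text, Text, Text> {
        static final int WINDOW = 1000; // illustrative window width in bases

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t");
            String userId = parts[0];
            String dna = parts[1];
            for (int w = 0; w * WINDOW < dna.length(); w++) {
                String slice = dna.substring(w * WINDOW,
                        Math.min((w + 1) * WINDOW, dna.length()));
                ctx.write(new Text("window_" + w), new Text(userId + "\t" + slice));
            }
        }
    }

    // "2. Window Reducer -> Key: Window ID, Array[Object(User ID, DNA)]"
    // Gathers every user's slice of one window into a single record that a
    // phase mapper could load together with that window's reference data.
    class WindowReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text windowId, Iterable<Text> slices, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder all = new StringBuilder();
            for (Text slice : slices) {
                all.append(slice.toString()).append('|');
            }
            ctx.write(windowId, new Text(all.toString()));
        }
    }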
Underdog performance
● Went from 12 hours to process 1,000 samples to under 25 minutes with a MapReduce implementation – with improved accuracy!
[Chart: total run size and total Beagle/Underdog duration; annotation marks where Underdog replaces Beagle]

Performance and next steps
Incremental change

Pipeline steps and incremental change…
● Incremental change over time
● Supporting the business in a "just in time" Agile way
[Chart: stacked run time by pipeline step (Pipeline Initialization, Plink Prep, Admixture, Beagle Post Phasing, Germline-Jermline Processing, Germline-Jermline Results Processing, Relationship Processing, Pipeline Finalize, Beagle-Underdog Phasing) as the pool grows from 500 to roughly 400,000 samples; annotations mark AdMixture on Hadoop, Jermline replaces Germline, the Ethnicity V2 release, and Underdog replaces Beagle]

…while the business continues to grow rapidly
[Chart: DNA database size (# of processed samples), Jan-12 through Apr-14]

What's next? Building different pipelines
● Azkaban
  ● Allows us to easily tie together steps on Hadoop
  ● Drop different steps in or out and create different pipelines
  ● A significant improvement over a hand-coded pipeline
● Cloud
  ● New algorithm changes will force a complete re-run of the entire DNA pool
  ● Best example: a new matching or ethnicity algorithm will force us to reprocess 400K+ samples
  ● Solution: use the cloud for this processing while the current pipeline keeps chugging along

What's next? Other areas for improvement
● AdMixture as a MapReduce implementation
  ● The last major algorithm that needs to be addressed
  ● Expect performance improvements similar to Underdog
● Matching growth will cause problems
  ● Matches per run are increasing
  ● Change the handoff
[Chart: batch run size, total matches per run, and Jermline processing duration]
● Keep measuring everything and adjust

Questions?
Ancestry is hiring for the DNA Pipeline Team!
Tech Roots Blog: http://blogs.ancestry.com/techroots
byetman@ancestry.com
Special thanks to the DNA Science and Pipeline Development Teams at Ancestry


