The Football Association Premier League Limited et al v. Youtube, Inc. et al

Filing 276

DECLARATION of Elizabeth Anne Figueira, Esq. in Opposition re: 167 MOTION for Summary Judgment.. Document filed by The Music Force LLC, Cal IV Entertainment, LLC, Cherry Lane Music Publishing Company, Inc., The Football Association Premier League Limited, Robert Tur, National Music Publishers' Association, The Rodgers & Hammerstein Organization, Edward B. Marks Music Company, Freddy Bienstock Music Company, Alley Music Corporation, X-Ray Dog Music, Inc., Federation Francaise De Tennis, The Scottish Premier League Limited, The Music Force Media Group LLC, Sin-Drome Records, Ltd., Murbo Music Publishing, Inc., Stage Three Music (US), Inc., Bourne Co.. (Attachments: # 1 Exhibit 189, # 2 Exhibit 190, # 3 Exhibit 191, # 4 Exhibit 192, # 5 Exhibit 193, # 6 Exhibit 194, # 7 Exhibit 195, # 8 Exhibit 196, # 9 Exhibit 197, # 10 Exhibit 198, # 11 Exhibit 199, # 12 Exhibit 200, # 13 Exhibit 201, # 14 Exhibit 202, # 15 Exhibit 203, # 16 Exhibit 204, # 17 Exhibit 205, # 18 Exhibit 206, # 19 Exhibit 207, # 20 Exhibit 208, # 21 Exhibit 209, # 22 Exhibit 210, # 23 Exhibit 211, # 24 Exhibit 212, # 25 Exhibit 213, # 26 Exhibit 214, # 27 Exhibit 215, # 28 Exhibit 216, # 29 Exhibit 217, # 30 Exhibit 218, # 31 Exhibit 219, # 32 Exhibit 220, # 33 Exhibit 221, # 34 Exhibit 222, # 35 Exhibit 223, # 36 Exhibit 224 Part 1, # 37 Exhibit 224 Part 2, # 38 Exhibit 225, # 39 Exhibit 226, # 40 Exhibit 227 Part 1, # 41 Exhibit 227 Part 2, # 42 Exhibit 227 Part 3, # 43 Exhibit 227 Part 4, # 44 Exhibit 228, # 45 Exhibit 229, # 46 Exhibit 230, # 47 Exhibit 231, # 48 Exhibit 232, # 49 Exhibit 233, # 50 Exhibit 234, # 51 Exhibit 235, # 52 Exhibit 236, # 53 Exhibit 237, # 54 Exhibit 238, # 55 Exhibit 239, # 56 Exhibit 240, # 57 Exhibit 241, # 58 Exhibit 242, # 59 Exhibit 243, # 60 Exhibit 244, # 61 Exhibit 245, # 62 Exhibit 246, # 63 Exhibit 247, # 64 Exhibit 248, # 65 Exhibit 249, # 66 Exhibit 250, # 67 Exhibit 251, # 68 Exhibit 252, # 69 Exhibit 253, # 70 Exhibit 254, # 71 Exhibit 255, # 72 Exhibit 256, # 73 Exhibit 257, # 74 Exhibit 258, # 75 Exhibit 259, # 76 Exhibit 260, # 77 Exhibit 261, # 78 Exhibit 262, # 79 Exhibit 263, # 80 Exhibit 264, # 81 Exhibit 265, # 82 Exhibit 266, # 83 Exhibit 267, # 84 Exhibit 268, # 85 Exhibit 269, # 86 Exhibit 270, # 87 Exhibit 271, # 88 Exhibit 272 Part 1, # 89 Exhibit 272-2, # 90 Exhibit 272 Part 3, # 91 Exhibit 272 Part 4, # 92 Exhibit 272 Part 5, # 93 Exhibit 272 Part 6, # 94 Exhibit 272 Part 7, # 95 Exhibit 272 Part 8, # 96 Exhibit 272 Part 9, # 97 Exhibit 272 Part 10, # 98 Exhibit 272 Part 11, # 99 Exhibit 272 Part 12, # 100 Exhibit 272 Part 13, # 101 Exhibit 272 Part 14, # 102 Exhibit 272 Part 15, # 103 Exhibit 272 Part 16, # 104 Exhibit 272 Part 17, # 105 Exhibit 272 Part 18, # 106 Exhibit 272 Part 19, # 107 Exhibit 273, # 108 Exhibit 274, # 109 Exhibit 275, # 110 Exhibit 276, # 111 Exhibit 277, # 112 Exhibit 278, # 113 Exhibit 279, # 114 Exhibit 280, # 115 Exhibit 281, # 116 Exhibit 282, # 117 Exhibit 283, # 118 Exhibit 284, # 119 Exhibit 285, # 120 Exhibit 286, # 121 Exhibit 287, # 122 Exhibit 288, # 123 Exhibit 289, # 124 Exhibit 290, # 125 Exhibit 291, # 126 Exhibit 292, # 127 Exhibit 293, # 128 Exhibit 294, # 129 Exhibit 295, # 130 Exhibit 296, # 131 Exhibit 297, # 132 Exhibit 298, # 133 Exhibit 299, # 134 Exhibit 300, # 135 Exhibit 301, # 136 Exhibit 302, # 137 Exhibit 303, # 138 Exhibit 304, # 139 Exhibit 305, # 140 Exhibit 306, # 141 Exhibit 307, # 142 Exhibit 308, # 143 Exhibit 309, # 144 Exhibit 310, # 145 Exhibit 311, # 146 Exhibit 312, # 147 Exhibit 313, # 148 Exhibit 
314, # 149 Exhibit 315, # 150 Exhibit 316, # 151 Exhibit 317, # 152 Exhibit 318, # 153 Exhibit 319, # 154 Exhibit 320, # 155 Exhibit 321, # 156 Exhibit 322, # 157 Exhibit 323, # 158 Exhibit 324, # 159 Exhibit 325, # 160 Exhibit 326, # 161 Exhibit 327, # 162 Exhibit 328, # 163 Exhibit 329, # 164 Exhibit 330, # 165 Exhibit 331, # 166 Exhibit 332, # 167 Exhibit 333 Part 1, # 168 Exhibit 333 Part 2, # 169 Exhibit 334, # 170 Exhibit 335, # 171 Exhibit 336, # 172 Exhibit 337, # 173 Exhibit 338)(Figueira, Elizabeth)

Copy Detection Mechanisms for Digital Documents

Sergey Brin, James Davis, Hector Garcia-Molina
Department of Computer Science, Stanford University, Stanford, CA 94305-2140
e-mail: sergey@cs.stanford.edu

October 31, 1994

Abstract

In a digital library system, documents are available in digital form and therefore are more easily copied and their copyrights more easily violated. This is a very serious problem, as it discourages owners of valuable information from sharing it with authorized users. There are two main philosophies for addressing this problem: prevention and detection. The former makes unauthorized use of documents difficult or impossible, while the latter makes it easier to discover such activity. In this paper we propose a system for registering documents and then detecting copies, either complete copies or partial copies. We describe algorithms for such detection and metrics required for evaluating detection mechanisms, covering accuracy, efficiency, and security. We describe a working prototype, called COPS, discuss implementation issues, and present experimental results that suggest the proper settings for copy detection parameters.

1 Introduction

Digital libraries are a concrete possibility today because of many technological advances in areas such as scanning, storage, networking, database systems, and user interfaces. Many aspects of building a digital library today are just a matter of doing it. However, there is a real danger that such digital libraries will either have relatively few documents or will be a patchwork of isolated systems that provide very restricted access. [Footnote: This research was sponsored by the Advanced Research Projects Agency (ARPA) of the Department of Defense under Grant No. MDA972-92-J-1029 with the Corporation for National Research Initiatives (CNRI). The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsement, either expressed or implied, of ARPA, the U.S. Government, or CNRI.]

The reason for this danger is that the electronic medium makes it much easier to illegally copy and distribute information. If an information provider gives a document to a customer, the customer can easily distribute it on a large mailing list or post it on a bulletin board. Illegal copies are of course not new; however, with paper or videotape technology it is much more time consuming to reproduce and distribute copies than it is with on-line documents. [Footnote 1: As just one example, Knight-Ridder Tribune recently (June 23, 1994) ceased publishing the Dave Barry and Mike Royko columns on ClariNet because subscribers re-distributed the articles on large mailing lists.]

Current technology for protecting intellectual property, such as giving out paper copies or CDs, does not strike a good balance between the needs of the owners of the information and of those who need free access to it. At one extreme are open sources such as the Internet, where everything is freely available but valuable information is frequently unavailable because of the dangers of unauthorized distribution. At the other extreme are closed systems, such as the one the IEEE currently uses to distribute papers on CD-ROM. This is a completely stand-alone system: users can look for specific articles, view them, and print them, but they cannot move any data out of the system in electronic form and cannot add any data of their own.
Clearly, one would like a digital library infrastructure that gives access to a wide variety of information sources, and that at the same time gives information providers good economic incentives for offering their information. We believe this is the central issue for future digital library systems.

In this paper we present one key component of such an infrastructure: a copy detection service. The central idea is a service where original documents can be registered and copies can be detected. The service detects not just exact copies, but also documents that overlap in significant ways. It can be used in a variety of ways by information providers and communication agents to detect violations of intellectual property laws. Although the idea of copy detection is simple, there are several challenging issues that we address here, involving performance, storage capacity, and accuracy. Database technology is also relevant, since the central component of the service is a large database of registered documents.

We stress that copy detection is not by any means the complete solution; it is simply a helpful tool that will assist in safeguarding intellectual property in some cases. A number of other important tools are needed. For example, it will also be important to have good encryption and authorization mechanisms, charging mechanisms for access to information, and a variety of others. These other tools and related intellectual property topics will not be covered in this paper.

In the following section we briefly discuss some of the options for safeguarding intellectual property and argue that copy detection is a very promising approach. In Section 3 we define the basic terms and evaluation metrics for copy detection. Then in Section 5 we describe our working prototype COPS and report on some initial experiments. A sampling technique that can reduce the storage space for registered documents or can speed up checking time is presented and analyzed in Section 6. Some security considerations are discussed in Section 3.3.

2 Safeguarding Intellectual Property

To illustrate the possibilities and the problems, let us see how we can ensure that a document is only seen and used by a person who is authorized, e.g., has paid for it. We illustrate with two suggested techniques.

The first technique is based on public key encryption and the notion of a secure printer. Such a printer is a sealed device that contains a private key unique to the printer and known only to the printer itself. The printer's public key and the name of its owner are registered in a database provided by the manufacturer. When an owner requests a document from a vendor, the vendor first ensures the owner is authorized (e.g., has paid), then fetches the public key of the owner's printer from the registry, encrypts the document using it, and sends the result. When the owner receives the data, he can send it to the printer, which decrypts and prints the document. Only that printer can decrypt the data, so the electronic data cannot be used for anything else and cannot be re-sent to another printer. The data can of course be used to create paper copies, and a paper copy can be reproduced illegally, but that is an unsolved problem that this scheme does not attempt to address.
The main problem with this scheme is that it is too restrictive; it is more of an electronic document delivery system than anything else. Users cannot, for example, browse through documents before buying them, and cannot use parts of a document in other documents, e.g., for quotes. Furthermore, it requires special purpose hardware. The scheme may still be useful in conjunction with others, however. For example, users can be allowed to browse through low-resolution copies of documents, or through copies where key components are missing. Once a user decides he or she wants to read a document, a high quality copy can be purchased and delivered via the secure printer. The scheme can also be adapted for a secure computer instead of a printer.

The second technique we wish to illustrate is that of active documents. The idea is that the vendor does not send out documents; instead it sends out programs that generate documents. When a user receives one of these programs, he can run it on his local machine. Embedded within the program and its data structures is the encrypted document. As the program runs, it displays the document; but before displaying it, the program sends a message to the vendor informing it that the document is being run, and waits for a response. This way the vendor can charge each time the document runs, or limit the number of times a user can run it. This scheme also has drawbacks. The user cannot read the document in his favorite viewer. The vendor must know the architecture of the user's machine in advance in order to generate appropriate code, and the user cannot see the document if the vendor's machine is unavailable or the network is down. The scheme is not foolproof either, since the user could run the program in a software emulator that records the characters of the document as they are displayed.

While we have given only two examples, we believe they illustrate a common problem: protection techniques are often cumbersome and usually get in the way of honest users. The alternative we pursue is detection techniques. That is, we assume that most users are honest, we allow any user to access the documents, and we focus on detecting those that violate the rules. (Many software vendors have found this approach superior to protection mechanisms that get in the way of honest users, to the point where sales may actually decrease.)

One possible direction is to incorporate into the document a watermark that identifies its origin, as has been suggested for images. For example, if we think of documents as images, we may encode the watermark as a small number of bits hidden at random places throughout the image. The users are unaware of where the watermark bits are, but the vendor keeps the information of where and to whom the document was originally sold. If a copy of the document is later detected in the possession of a different person or organization, the vendor can extract the watermark, determine to whom the document was originally sold, and thereby detect the violation. The main weakness of approaches such as these is that users may easily destroy the watermark, for instance by passing the document through a filter that changes image bits without really altering the image; adding noise or applying lossy compression could be enough to destroy the watermark.

The second approach to copy detection, and the one we advocate in this paper (it works well for text documents), is a copy detection server. The idea is as follows. When an author creates a new work, he or she registers it at the server.
The server could be, for example, the repository of a copyright recordation and registration system such as has recently been suggested. Registered documents are broken into small units, for example sentences. Each unit is hashed, and a pointer to it is stored in a large hash table. When a document is to be checked against the registered documents, for instance to see whether it plagiarizes any of them, it is also broken into sentences. Each of its sentences is used to probe the hash table, to see whether that particular sentence has been seen before. If the new document and some previously registered document share more than a threshold number of sentences, a violation is flagged. The threshold can be set depending on the type of violation we are looking for: larger if we only want to check whether large portions of a document have been copied, smaller if we are looking for copied paragraphs. A human would then examine both documents to decide whether there truly was a violation.

Unlike the case of watermarks, it is not easy for a user to automatically subvert the system. The user must make an undetectable change to a large number of sentences, the units of decomposition. Trivial global transformations do not help; for example, adding extra blank spaces between words changes nothing, assuming that the hashing scheme ignores spaces. Of course, a determined user could rewrite every sentence, but our goal is to make copying hard, not to make it impossible. This at least makes it hard to rapidly distribute copies of documents.

The detection server can be used in a variety of ways. For example, a publisher is liable if it publishes materials the author does not legally hold the copyright on, so an author or publisher may wish to check a soon-to-be-published document against the registered documents. Similarly, bulletin-board software may automatically check new postings, and a mail gateway may check the messages that go through it, much as a transportation company may check the goods it carries for stolen property. Program committee members may want to check whether a submission overlaps too much with the authors' previous papers. Lawyers may check subpoenaed documents. (Copy detection for computer programs is also of interest, but in this paper we focus only on text documents.)

Some applications of copy detection do not involve illegal or undesirable behavior. For example, a user retrieving documents from an information retrieval system may want duplicates flagged, i.e., documents that overlap, beyond a given threshold, with documents that have already been seen. Such documents may represent copies or versions of the same work: messages that are re-transmitted or forwarded, or different editions of the same work. Of course, it is up to the user to decide whether potential duplicates should be deleted automatically or whether he wants to view them.

In summary, we think that detecting copies of text documents is a fundamental problem for distributed information and database systems, and there are many issues that need to be addressed. For instance, should the units of decomposition be sentences, paragraphs, sequences of words, or something else? Should we take into account the order of the units, e.g., of the sentences or of the paragraphs? Is it feasible to register documents by hashing only a fraction of their sentences? This would make the hash table smaller while, hopefully, still making it very likely that violations are caught. A smaller table could also be cloned locally.
Our mail gateway above, for example, could keep a copy of the detection data locally; it could then perform its checks for each message without needing to contact a remote server. There are also implementation issues that need to be addressed: for example, how are sentences extracted from LaTeX or Word documents? Can one extract them from Postscript documents, or from bit maps via OCR? These and other questions will be addressed in the rest of this paper. We start in Sections 3 and 4 by defining the basic terms, the evaluation metrics, and the options for copy detection. Then in Section 5 we describe our working prototype COPS and report on some initial experiments. A sampling technique that can reduce the COPS storage space for registered documents or can speed up checking time is presented and analyzed in Section 6.

3 General Concepts

In this section we define some of the basic concepts for text copy detection and for evaluating detection mechanisms. As far as we know, copy detection has not been formally studied, so we start from the basics.

Our starting point is the concept of a document: a body of text from which some structural information, such as word, sentence, and paragraph boundaries, can be extracted. In an initial phase, non-textual components are removed from documents (see Section 5). The resulting canonical form of a document consists of a string of ASCII words separated by whitespace, with punctuation characters separating sentences and a standard method of marking the beginning of paragraphs.

A violation occurs when one document infringes upon another document in some way, for example by duplicating portions of its text. A number of types of violations can occur, from plagiarism of a few sentences, to exact replication of the entire document, and many steps in between. The notion of checking for a particular type of violation between a test document d and a registered document r is captured by a violation test v(d, r), which holds (is true) if d violates r according to that particular type of violation. For example, Plagiarism(d, r) holds if document d has plagiarized document r. We also extend the notation to checking a document against a set of documents: v(d, R) is true if and only if v(d, r) holds for some document r in R.

Most of the violation tests we are interested in are not well defined and require a decision by a human being. Plagiarism, for instance, is particularly difficult to pin down: if the sentence "The proof is as follows" occurred in two scientific papers, it would not be considered plagiarism, while if an essentially significant sentence occurred in both, it most certainly would. Another test we consider is Subset(d, r), which detects whether one document is essentially a subset of another; evaluating it again requires a human to judge whether the smaller document makes any contributions of its own.

The goal of a copy detection system is to implement well-defined algorithmic tests, termed operating tests, that approximate the desired violation tests. For instance, consider the operating test that holds if 90% of the sentences of the test document are contained in the registered document. This test may be considered an approximation of the Subset violation test described above: if the system flags a pair of documents under this operating test, a human can then check whether they are indeed a Subset violation.

3.1 Ordinary Operational Tests

In the rest of this paper we focus on a specific class of operational tests, ordinary operational tests (OOTs), that can be implemented efficiently. We believe OOTs can accurately approximate many violation tests of interest, such as Subset, Overlap, and Plagiarism.
Before we describe OOTs, we define some primitives for specifying the level of detail at which we look into documents. As mentioned above, documents contain some structural information; in particular, a document can be divided into units consistent with each type of structure. Examples of such unit types are sections, paragraphs, sentences, words, and characters; instances of these types are called units.

We define a chunk to be a sequence of consecutive units in a document. A document may be divided into chunks in a number of different ways, since chunks need not be of the same size, may overlap, and need not completely cover the document. For example, assume we have a document ABCDEFG, where the letters represent sentences or some other unit. The document can be divided into chunks as (ABC)(DEFG), or (AB)(BC)(CD)(DE)(EF)(FG), or (ABC)(CDEFG), or (A)(D)(G). The method of selecting the chunks of a document is called the chunking strategy. It is important to note that chunks, unlike units, have no structural significance; an OOT does not use structural information about a document, and chunks are simply the key by which matching text is detected efficiently.

An OOT is implemented by a hash table and the set of procedures shown in Figure 1. The pseudo-code is intended to convey the concepts, not a complete or efficient implementation; Section 5 describes an actual prototype system. First, there is the preprocessing operation PREPROCESS, which takes a set R of registered documents and creates a hash table H for them. Second, there are procedures for adding documents to H on the fly (registering new documents) and for removing them (unregistering documents). Third, the function EVALUATE computes, for a test document d, the set of registered documents in H that d violates.

To insert a document r, the procedure INSERT uses the function INSCHUNKS(r) to break the document up into chunks. Each chunk is represented by a tuple <c, l>, where c is the text of the chunk and l is the location of the chunk in the document, measured in units. Each tuple is hashed on c and an entry is stored in the hash table. The procedure EVALUATE tests a given document d for violations. It uses a chunking function EVALCHUNKS to break up d. (The reason why we use two chunking functions will become apparent in Section 6; for now we can assume that INSCHUNKS and EVALCHUNKS are identical, and refer to them both as CHUNKS.) After chunking d, EVALUATE looks up each chunk in the hash table H, producing a set of tuples MATCH. Each tuple <ld, lr, id(r), s> in MATCH represents a match of the chunk at location ld of d with a chunk at location lr of registered document r, where s is the size (number of chunks) of the registered document. The set MATCH is then given to the decision function DECIDE, together with SIZE(d), the number of chunks in d; DECIDE returns the set of matching registered documents, i.e., the registered documents that d violates. If the set is non-empty, then there was a violation; that is, v(d, R) holds.

    PREPROCESS(R):
        CREATE_TABLE(H)                        -- implementation unspecified
        for each r in R: INSERT(r, H)

    INSERT(r, H):
        T := INSCHUNKS(r)                      -- OOT dependent
        for each <c, l> in T:
            h := HASH(c)
            INSERT_CHUNK(h, <id(r), l>, H)     -- assume size of reg. doc. may be obtained from id

    DELETE(r, H):
        T := INSCHUNKS(r)
        for each <c, l> in T:
            h := HASH(c)
            DELETE_CHUNK(h, <id(r), l>, H)     -- implementation unspecified

    EVALUATE(d, H):
        T := EVALCHUNKS(d)                     -- OOT dependent
        MATCHES := empty set
        for each <c, ld> in T:
            h := HASH(c)
            SS := LOOKUP(h, H)                 -- returns all <id, lr, s> with matching hash
            for each <id, lr, s> in SS:
                MATCHES := MATCHES + {<ld, lr, id, s>}
        return DECIDE(MATCHES, SIZE(d))        -- OOT dependent

    Figure 1: Pseudo-code for an OOT.

Note that an instance of an OOT is specified simply by its INSCHUNKS, EVALCHUNKS, and DECIDE functions; these are the only ways in which OOTs differ. As a first example, consider an OOT where both chunking functions extract sentences, i.e., each chunk is one sentence, and where DECIDE selects the registered documents for which the fraction of matching chunks exceeds some threshold th. That is, let COUNT(r, MATCH) be the number of tuples in MATCH that refer to registered document r; then r is selected if COUNT(r, MATCH) is greater than th x SIZE(d). For example, if th = 0.4 and the document being checked has 100 sentences, registered documents with 41 or more matching sentences will be selected. We call this DECIDE function the match_ratio function.

Note also that we only store in the hash table the ids of the registered documents, not the documents themselves; the system may separately store the registered documents, or simply the name of the user that registered each one. Our COPS prototype does store the documents, which can be useful for showing the matching documents with the matching chunks highlighted.
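To make Figure 1 concrete, the following is a minimal Python sketch of an OOT that uses one sentence per chunk and the match_ratio decision function. It is only an illustration of the structure just described, not the COPS implementation: the sentence splitting is deliberately naive, Python's built-in hash stands in for a real hash function, and chunk locations are stored but not used by the decision function.

    import re
    from collections import defaultdict

    def sentences(text):
        # crude chunker: one chunk per sentence, split on '.', '!' or '?'
        return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]

    class SentenceOOT:
        def __init__(self, threshold=0.4):
            self.threshold = threshold        # match_ratio threshold th
            self.table = defaultdict(list)    # hash table H: chunk hash -> list of (doc_id, location)
            self.sizes = {}                   # SIZE(r) for each registered document

        def insert(self, doc_id, text):       # INSERT, with INSCHUNKS = sentences
            chunks = sentences(text)
            self.sizes[doc_id] = len(chunks)
            for loc, c in enumerate(chunks):
                self.table[hash(c)].append((doc_id, loc))

        def evaluate(self, text):             # EVALUATE, with EVALCHUNKS = sentences
            chunks = sentences(text)
            counts = defaultdict(int)         # COUNT(r, MATCH) per registered document
            for c in chunks:
                for doc_id, _loc in self.table.get(hash(c), []):
                    counts[doc_id] += 1
            size = max(len(chunks), 1)
            # DECIDE = match_ratio: flag r whenever COUNT(r, MATCH) > th * SIZE(d)
            return {r for r, n in counts.items() if n > self.threshold * size}

    oot = SentenceOOT(threshold=0.4)
    oot.insert("r1", "The proof is as follows. We consider two cases. Each case is easy.")
    oot.insert("r2", "Digital libraries raise new copyright questions.")
    print(oot.evaluate("We consider two cases. Each case is easy. Plus some new text here."))  # {'r1'}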
3.2 Measuring Accuracy

As described earlier, operational tests such as OOTs are intended to approximate violation tests such as Plagiarism and Subset. It is therefore important to evaluate how well an OOT approximates some other test, i.e., its accuracy. It is also important to evaluate the efficiency of OOTs (what computational resources they require) and their security (how hard they are to subvert). Accuracy and security are discussed in this section; efficiency is addressed in Section 6.

Assume that a random registered document Y is chosen from R following some probability distribution over registered documents; that is, the probability of choosing a particular document out of the population of registered documents is given implicitly by this distribution. Similarly, assume a random test document X is selected following a distribution over test documents. We can then define the following accuracy metrics, each parametrized by these two distributions.

Definition 3.1 For a test t we define freq(t) = P{t(X, Y)}, where P stands for probability.

Intuitively, freq measures how frequently the test is likely to be true. For example, suppose X is uniform over two documents x1, x2 (only these two documents are ever tested) and Y is uniform over three documents y1, y2, y3 (all three equally likely to be registered). Further assume that t(x1, y2) and t(x2, y3) hold, and that t does not hold for the other pairs. Then freq(t) = 2/6 = 1/3, since the test is true for only two of the six possible choices of pairs.

If an operating test approximates a violation test well, then their freq values should be close. The converse is not true, since the two tests can hold on disjoint sets of pairs. If the freq of the operating test is small compared to that of the violation test it approximates, the operating test is being too conservative; if it is large, the operating test is too liberal.

Suppose in general that we have an operating test t1 and a violation test t2. Then we can define the following two accuracy metrics. Note that they can also be applied between two operating tests, and indeed between any two tests.

Definition 3.2 The Alpha metric corresponds to a measure of false negatives, i.e., Alpha(t1, t2) = P{ not t1(X, Y) | t2(X, Y) }. Note that Alpha is not symmetric. A high Alpha(t1, t2) value indicates that t1 is missing too many violations of t2.

Definition 3.3 The Beta metric is analogous to Alpha, except that it measures false positives, i.e., Beta(t1, t2) = P{ t1(X, Y) | not t2(X, Y) }. A high Beta(t1, t2) value indicates that t1 is finding too many violations that are not t2 violations. Beta is not symmetric either.

Definition 3.4 The Error metric is a combination of Alpha and Beta in that it takes into account both false positives and false negatives. It is defined as Error(t1, t2) = P{ t1(X, Y) != t2(X, Y) }. A high Error value indicates that the two tests are dissimilar. Error is symmetric.
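As an illustration of Definitions 3.1 through 3.4, the short sketch below estimates freq, Alpha, Beta, and Error empirically from a labelled sample of (test, registered) pairs. The pair outcomes reuse the freq example above; nothing here is part of the system itself.

    def freq(t):
        # t: list of booleans giving t(X, Y) over a sample of (test, registered) pairs
        return sum(t) / len(t)

    def alpha(t1, t2):
        # P{ not t1 | t2 }: fraction of true violations (t2) that t1 misses
        violations = [a for a, b in zip(t1, t2) if b]
        return sum(1 for a in violations if not a) / len(violations)

    def beta(t1, t2):
        # P{ t1 | not t2 }: fraction of non-violations that t1 flags anyway
        non_violations = [a for a, b in zip(t1, t2) if not b]
        return sum(1 for a in non_violations if a) / len(non_violations)

    def error(t1, t2):
        # P{ t1 != t2 }
        return sum(1 for a, b in zip(t1, t2) if a != b) / len(t1)

    # t2 plays the role of the (human-judged) violation test, t1 the operating test,
    # over the same six (X, Y) pairs as in the freq example above
    t2 = [True, True, False, False, False, False]
    t1 = [True, False, True, False, False, False]
    print(freq(t2), alpha(t1, t2), beta(t1, t2), error(t1, t2))   # 1/3, 0.5, 0.25, 1/3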
3.3 Security

So far we have assumed that the author of a test document does not know how our copy detection system works and does not intend to sabotage it. If a user does wish to subvert the system, however, another important measure of an OOT is how hard it is for such a malicious user to break it. We measure this notion of security in terms of how many changes must be made to a registered document so that the copy will not be identified by the OOT.

Definition 3.5 The security of an OOT o on a given registered document r, SEC(o, r), is the minimum number of characters that must be inserted, deleted, or modified to produce a new document r' such that o(r', r) is false. The higher the value, the more secure the test. (The definition is also applicable to any operating test.)

We can use this notion to evaluate and compare OOTs. For example, consider an OOT o1 that treats the entire document as a single chunk. Then SEC(o1, r) = 1, because changing a single character makes a copy of the document undetectable. [Footnote 2: This assumes that the decision function flags a violation if there are matches and no violation if there are none, which is the only reasonable condition for a single-chunk OOT. If this condition does not hold, our statement does not apply.]

As another example, consider an OOT o2 that uses sentences as chunks and the match_ratio decision function with threshold th. Then SEC(o2, r) is about (1 - th) x SIZE(r), since to avoid detection we need to change at least that many sentences. For instance, if th = 0.6 and a document has 100 sentences, we need to change at least 40 of them.

As a third example, consider an OOT o3 that uses overlapping chunks of two sentences each. For instance, if a document has sentences A, B, C, D, ..., then o3 considers chunks AB, BC, CD, and so on. Here, to avoid detection, we need to modify roughly half as many sentences as before, since each modification affects two chunks. Thus SEC(o3, r) is approximately SEC(o2, r)/2, i.e., o3 is approximately half as secure as o2.

Note that our security definition is a weak one because it assumes the adversary knows everything about the OOT. We can enhance security by keeping some information about the OOT secret. For example, we can define a class O of OOTs that vary only by some parameter, and then secretly choose one OOT from the class. We assume the adversary does not know which OOT was chosen, and thus, to be sure of subverting the system, he must use a strategy that makes o(r', r) false for every OOT o in O. The security SEC(O, r) of the class is then the number of characters that must be inserted, deleted, or modified to achieve this. For examples of chunking strategies that can be parametrized in this way, e.g., by the seed of a random number generator, see Section 4.2.

Finally, notice that the security measures we have presented here do not address user authorization issues. For example, when a user registers a document, how does the system check that the user who claims to be the owner actually owns the document? When we show violations to the person submitting a test document, do we inform him of the identity of the owners of the matching registered documents? Should the owner of a registered document be notified that someone was checking a document against his? These are important administrative questions that we do not attempt to address in this paper.
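The security bounds just discussed can be tabulated with a few lines of code. The sketch below is a simplification of Definition 3.5 in that it counts edited chunks rather than edited characters; the chunks_per_edit parameter is our own shorthand for how many overlapping chunks a single edited unit can break.

    import math

    def sec_match_ratio(th, n_chunks, chunks_per_edit=1):
        # A copy stays flagged while more than th * n_chunks chunks still match,
        # so at least (1 - th) * n_chunks chunks must be broken. One edited unit
        # breaks up to chunks_per_edit chunks when chunks overlap.
        chunks_to_break = math.ceil((1 - th) * n_chunks)
        return math.ceil(chunks_to_break / chunks_per_edit)

    print(sec_match_ratio(0.6, 100))                      # o2: change at least 40 of 100 sentences
    print(sec_match_ratio(0.6, 100, chunks_per_edit=2))   # o3: roughly half as many, 20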
4 A Taxonomy of OOTs

The chunking strategy of an OOT, its selected unit, and its decision function can all affect the accuracy and the security of the OOT. In this section we consider some of the options and the tradeoffs involved.

4.1 Units

To determine how documents are to be divided into chunks, we must first choose the unit. One key factor is how selective the unit will be: all else being equal, larger units (a sentence rather than a word, say) tend to generate fewer matches and hence lead to a smaller freq. This can of course be compensated for by changing the chunking strategy or the decision function. Another important factor in the choice of unit type is the ease of detecting the separators. For example, words are easier to detect than paragraphs, since words are separated by spaces and punctuation in easily distinguished ways. Perhaps the most important factor in unit selection is the violation test of interest. For instance, if copied sequences of sentences are what should be detected, then sentences rather than words are the more meaningful unit, since it is sequences of sentences, not word fragments, that matter.

4.2 Chunks

There are a number of strategies for selecting chunks. To contrast them, we consider the number of units per chunk, the number of hash table entries required for a document (the space cost), and an upper bound on the security SEC(o, r). Table 1 summarizes the four strategies we consider here; many variations are not covered. In the table, n refers to the number of units in the document being chunked and k is a parameter of the strategies. For our discussion we assume that documents do not have significant numbers of repeating units.

    Strategy   Summary                                      Example (units ABCDEF, k = 2)   Space
    A          one chunk = one unit                         (A)(B)(C)(D)(E)(F)              n
    B          one chunk = k non-overlapping units          (AB)(CD)(EF)                    n/k
    C          one chunk = k overlapping units              (AB)(BC)(CD)(DE)(EF)            n - k + 1
    D          non-overlapping chunks, hashed breakpoints   (break points data dependent)   about n/k

    Table 1: Properties of chunking strategies.

(A) One chunk equals one unit. Here every unit, e.g., every sentence, is a chunk of its own. As with the choice of units, small chunks are less selective. The major weakness of this strategy is its high storage cost: n hash table entries are required for a document of n units. On the other hand, it is the most secure scheme: depending on the decision function, SEC(o, r) is bounded only by n, that is, it may be necessary to alter up to n characters, one per chunk, to subvert the OOT.

(B) One chunk equals k non-overlapping units. In this strategy we break the document up into sequences of k consecutive units and use these sequences as chunks. It uses only 1/k-th of the space of Strategy A, but it is insecure: altering a document merely by adding a single unit at the start will cause it to have no matches with the original. We call this effect phase dependence. The same effect also leads to high Alpha errors.

(C) One chunk equals k overlapping units. Here we take every sequence of k consecutive units of the document as a chunk. The space cost is therefore essentially that of Strategy A, but the strategy does not suffer from the phase dependence of Strategy B. It is, unfortunately, not equivalent to Strategy A either. Comparing an OOT oC that uses Strategy C to an OOT oA that uses Strategy A, the two being otherwise the same, one can argue that for a violation test v, Alpha(oC, v) >= Alpha(oA, v) and Beta(oC, v) <= Beta(oA, v), because oC(d, r) being true implies that oA(d, r) is true. Thus Strategy C is relatively prone to higher Alpha errors but lower Beta errors. Strategy C is also more secure than B, though still insecure: modifying every k-th unit of a registered document is sufficient to fool the system.

(D) Use non-overlapping chunks, with the break points determined by hashing units. We start by hashing the first unit of the document. If the hash value equals some constant modulo k, then the first chunk is simply the first unit. If not, we consider the hash of the second unit: if it equals the constant modulo k, the first chunk is the first two units; if not, we hash the third unit, and so on, until we find a unit whose hash equals the constant modulo k. That unit ends the first chunk. We then repeat the procedure, starting at the following unit, to identify the second chunk, the third chunk, and so on. It can be shown that the expected number of units in a chunk is k, so the hash table requirements are similar to those of Strategy B. Unlike B, however, this strategy is not affected by phase dependence, since similar text produces the same break points; all else being equal, its Alpha errors should be slightly higher and its Beta errors lower than those of Strategy A, just as with Strategy C, and significant duplicated portions of the same text will still be detected. The key advantage of Strategy D is that it is very secure: it is really a family of strategies with a secret parameter, the hash function, in the sense of Section 3.3. Without knowing the parameter, to be sure that a copied document will get through the system without warnings, one must change essentially every unit of the document.
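The following sketch illustrates the four chunking strategies of Table 1 on a list of already-extracted units. It is an illustration only; in particular, Python's built-in hash stands in for whatever (possibly secret) hash function Strategy D would really use.

    def strategy_a(units):
        # A: one chunk = one unit
        return [(u,) for u in units]

    def strategy_b(units, k):
        # B: non-overlapping chunks of k consecutive units (phase dependent)
        return [tuple(units[i:i + k]) for i in range(0, len(units), k)]

    def strategy_c(units, k):
        # C: every sequence of k consecutive units, i.e. overlapping chunks
        return [tuple(units[i:i + k]) for i in range(len(units) - k + 1)]

    def strategy_d(units, k, const=0):
        # D: non-overlapping chunks ending at every unit whose hash equals const mod k;
        # the expected chunk length is k
        chunks, current = [], []
        for u in units:
            current.append(u)
            if hash(u) % k == const:       # this unit is a break point
                chunks.append(tuple(current))
                current = []
        if current:                        # any trailing units form a last chunk
            chunks.append(tuple(current))
        return chunks

    units = list("ABCDEF")                 # the letters stand for sentences
    print(strategy_a(units))               # (A)(B)(C)(D)(E)(F)
    print(strategy_b(units, 2))            # (AB)(CD)(EF)
    print(strategy_c(units, 2))            # (AB)(BC)(CD)(DE)(EF)
    print(strategy_d(units, 2))            # break points depend on the hash values

Phase dependence is easy to see with these functions: prepending a single extra unit changes every chunk produced by strategy_b, while strategies A, C, and D keep most of their chunks intact.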
4.3 Decision Functions

There are many options for choosing decision functions. One useful function is the match_ratio function of Section 3.1, with the fraction threshold as its parameter; it is useful for approximating tests such as Subset and Overlap. Another simple decision function is matches, which simply tests whether the number of matches between the test document and a registered document exceeds a certain value k; this would be useful for detecting violations such as Plagiarism, where even a small number of matching chunks may constitute a violation. One might also consider an ordered_matches function, which tests whether more than k matches occur in the same order in both documents. This would be useful if unordered matches are likely to be coincidental.
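A minimal sketch of the matches and ordered_matches decision functions, operating on (ld, lr) location pairs for a single registered document. Reading "in the same order" as the longest increasing subsequence of registered-document locations is our interpretation for illustration, not necessarily the rule a real implementation would adopt.

    import bisect

    def matches(pairs, k):
        # pairs: (ld, lr) locations of matching chunks for one registered document;
        # flag a violation if there are more than k matches at all
        return len(pairs) > k

    def ordered_matches(pairs, k):
        # flag if more than k matches occur in the same order in both documents:
        # longest increasing subsequence of registered-document locations once the
        # matches are sorted by their location in the test document
        lr_values = [lr for _ld, lr in sorted(pairs)]
        tails = []                                  # patience-sorting LIS, O(n log n)
        for lr in lr_values:
            i = bisect.bisect_left(tails, lr)
            if i == len(tails):
                tails.append(lr)
            else:
                tails[i] = lr
        return len(tails) > k

    pairs = [(0, 10), (1, 3), (2, 11)]              # three matches, only two in consistent order
    print(matches(pairs, 2), ordered_matches(pairs, 2))   # True False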
5 Prototype and Preliminary Results

We have built a working prototype to test our ideas and to understand how to select good OOTs, i.e., good CHUNKS and DECIDE functions. The prototype, called COPS (COpy Protection System), is shown with its major modules in Figure 2. Documents can be submitted to the system via either email or ftp, in formats including TeX, LaTeX, DVI, troff, and plain ASCII. New documents can be registered, or a new document can be tested against the existing set of registered documents, in which case a summary is returned listing the registered documents that it violates.

    [Figure 2: Modules in the COPS implementation. TeX-to-ASCII, DVI-to-ASCII, and troff-to-ASCII converters feed sentence extraction and hashing, which supports document registration and query processing.]

COPS allows modules to be easily replaced, permitting experimentation with different strategies and different functions. We begin our explanation with the simplest case, sentence chunking for both insertion and evaluation (i.e., INSCHUNKS and EVALCHUNKS both extract sentences) and the match_ratio decision function, and discuss possible improvements later.

When a document is submitted, it is given a unique document ID. The ID is used to index into a table of document information such as title and author. To register a document, it must first be converted to the canonical form, plain ASCII text. This process depends on the format of the document: TeX and LaTeX documents can be piped through the Unix utility detex, which filters out formatting commands; troff documents can be handled similarly (e.g., with nroff); and other document formats have their own conversion filters to plain ASCII text. After producing the ASCII text, we break the document into its individual sentences and determine a hash key for each. Using periods, question marks, and exclamation points as sentence delimiters, we compute a numeric key for each sentence and store it, together with the document ID, in the permanent hash table. When we wish to check a new document against the existing registered documents, we use a very similar procedure: we generate the plain ASCII text, determine the sentences, generate their hash keys, and look them up in the hash table (see Section 3.1). If more than th x SIZE(d) sentences match a given registered document, we report a possible violation.

5.1 Conversion to ASCII

The procedure described above is the idealized case; in practice, a number of interesting difficulties arise. Let us first consider some of the challenges associated with the conversion to ASCII text. The most important is that there is no exact or objective method of reducing a formatted TeX or troff document to plain ASCII, because the formatted document has value added over the plain text. Tables, graphs, equations, and pictures cannot be represented precisely in ASCII, and any associated structure is lost. For example, we can retain the text items embedded in a graph, but not the structure of the graph itself, and equations may not be translatable either. Our implementation chooses to discard graphs, tables, equations, pictures, and other pieces of information that cannot naturally be represented as plain ASCII. We also discard formatting commands that affect the presentation but not the content of the document; for example, command sequences that change the font type to italic are removed and ignored.

Hence the conversion process is not perfect. If a document is submitted in TeX format, for instance, it is sometimes impossible to distinguish equations from plain text. Consider the sentence "Let X + Y be equal to the answer." If this sentence were typed in plain TeX, the equation X + Y would be discarded, leaving our system with the sentence "Let be equal to the answer." The system would then be unable to match this sentence against the same sentence produced by a different conversion path, and could fail to detect that a copy occurred. Later in this section we discuss some enhancements that allow us to recognize matching sentences despite imperfect translations.

Another complication is with DVI. The DVI format gives directions for placing text on a page, but it does not specify which text is part of the main body and which belongs to subsidiary structures like footnotes, page headers, and bibliographies. Our DVI converter does not attempt to rearrange the text; it simply takes the text in the order it appears on the page, reading left to right and top to bottom. One case the converter does handle is two-column format: reading straight across two columns would corrupt most sentences, so the converter detects the inter-column gap and reads down the left column and then down the right one.

An input format that COPS cannot handle well in general is Postscript. Since Postscript is actually a programming language, it is very difficult to convert it and its layout commands to plain ASCII. Some Postscript generators, such as dvips, enscript, and Microsoft Word, produce relatively simple Postscript from which text can be extracted. Others, such as Interleaf, produce Postscript code that would be difficult to analyze, and scanned documents produce Postscript that is little more than page bit maps. These would require reconstructing the text with optical character recognition (OCR), a process that could be error prone.

In summary, the approach we have taken with the COPS converters is to do a reasonable, but not necessarily perfect, job of converting to ASCII. Most sentences that are translated identically will still be found by the system, and later enhancements attempt to negate the effects of common translation misinterpretations. Even if some matching sentences are missed, there should be enough other matches that overlapping documents are still flagged; the experimental results we present later confirm this.
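The converters themselves are format-specific tools (detex and the DVI and troff filters), but the flavor of the reduction to canonical ASCII can be shown with a crude, hypothetical LaTeX-stripping filter like the one below. It is not the COPS converter; it merely illustrates what gets discarded (math, formatting commands) and what is kept (running text), and it reproduces the "Let be equal to the answer" effect described above.

    import re

    def strip_latex(source):
        # discard display and inline math, as a converter effectively does when no
        # plain-text equivalent exists
        text = re.sub(r"\$\$.*?\$\$", " ", source, flags=re.S)
        text = re.sub(r"\$[^$]*\$", " ", text)
        # drop comments and formatting commands such as \emph{...} or \section{...},
        # keeping their textual arguments
        text = re.sub(r"(?<!\\)%.*", " ", text)
        text = re.sub(r"\\[A-Za-z]+\*?(\[[^\]]*\])?", " ", text)
        text = text.replace("{", " ").replace("}", " ")
        # normalize whitespace toward the canonical form of Section 3
        return re.sub(r"\s+", " ", text).strip()

    print(strip_latex(r"Let $X + Y$ be equal to \emph{the} answer."))
    # -> "Let be equal to the answer."  (the equation is lost, as in the example above)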
5.2 Sentence Identification and Hashing

Difficult problems also arise in the sentence identification and hashing module. In particular, even if the document is correctly translated to plain ASCII, it is not always clear how to extract all the sentences. As a first approximation, we can merely take periods, question marks, and exclamation points to be sentence delimiters. However, because of abbreviations and other embedded periods, sentences will be broken into multiple parts. An extension to our simple model explicitly watches for and eliminates common abbreviations such as "e.g." and "i.e.", but unexpected abbreviations will still cause sentences to be broken and will cause difficulties. For example, given the actual sentences "I am a U.S. citizen. The U.S. is large.", our system identifies the following "sentences": "I am a U.", "S.", "citizen.", "The U.", "S.", "is large." Notice that the fragment "S." will be flagged as a match even though the actual sentences are not the same. To reduce errors of this sort we can disregard sentences composed of a single word; other similar errors may still occur, however. For example, the title and author names at the head of a document are difficult to extract as sentences, since they rarely end with punctuation. We discuss some further improvements later. Note also that if paragraph detection were needed, a similarly simple algorithm would face similar issues; COPS does not currently detect paragraphs.

The two unit types used in COPS OOTs are words and sentences (see Section 3.1). COPS first converts the text into a sequence of word hash keys interspersed with end-of-sentence markers. The chunking of this sequence of units is done by calling the procedure COMBINE(NUNITS, STEP, UNITTYPE), where NUNITS is the number of units to be combined into each chunk, STEP is the number of units to advance for the next chunk, and UNITTYPE indicates what is considered a unit, word or sentence. For example, calling COMBINE(1, 1, WORD) creates a chunk for every word in the input sequence. Calling COMBINE(3, 3, WORD) creates a chunk for every three words, while COMBINE(1, 1, SENTENCE) creates a chunk for each sentence. Using COMBINE(3, 1, SENTENCE) takes every three overlapping sentences as a chunk, and COMBINE(2, 1, WORD) produces two-word overlapping chunks. This scheme gives us great flexibility for experimenting with different CHUNKS functions. It should be noted, however, that once a CHUNKS function is chosen, it must be used consistently for all documents; the flexibility is only useful in an experimental setting.
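A Python sketch of the COMBINE chunking interface just described. The word and sentence extraction is again simplistic and the hash keys come from Python's built-in hash, so this illustrates the interface rather than the COPS internals.

    import re

    WORD, SENTENCE = "WORD", "SENTENCE"

    def units_of(text, unittype):
        sents = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
        if unittype == SENTENCE:
            return sents
        return [w for s in sents for w in s.split()]          # WORD units

    def combine(nunits, step, unittype, text):
        # group NUNITS consecutive units per chunk, advancing STEP units each time,
        # and return one hash key per chunk
        units = units_of(text, unittype)
        return [hash(tuple(units[i:i + nunits]))
                for i in range(0, len(units) - nunits + 1, step)]

    text = "This is one sentence. Here is another. And a third one."
    print(len(combine(1, 1, SENTENCE, text)))    # one chunk per sentence: 3
    print(len(combine(3, 1, SENTENCE, text)))    # overlapping three-sentence chunks: 1
    print(len(combine(2, 1, WORD, text)))        # overlapping two-word chunks
    print(len(combine(3, 3, WORD, text)))        # non-overlapping three-word chunks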
5.3 Exploratory Tests

To evaluate the accuracy of the system, we conducted some exploratory experiments using a set of ninety LaTeX technical documents, i.e., real papers like this one. These experiments are not intended to be comprehensive; the goal is simply to understand how many matching chunks documents might be expected to have and how well our converters work. The documents range from roughly 1 to 50 pages in length. Approximately half of them are grouped into topical sets; the documents within each group are closely related, usually being multiple revisions, or the conference and journal versions, of the same work. Documents in different groups come from separate research efforts, although most authors have an affiliation with Stanford. The remaining half of the documents are not related to any topical group.

All of these documents were registered in COPS, and then each was queried against the complete set. Our goal is to see whether COPS can determine the closely related documents, i.e., the documents that each document "violates." Using the terminology of Section 3, we are considering a violation test Related(d, r) that is true if d and r are in the same topical group. This is approximated by an OOT that computes the percentage of matching sentences; if the percentage is high, the documents are assumed to be related.

Table 2 shows the results of this exploration. Instead of reporting the number of violations found at one particular threshold, which would give only a pass/fail result, we show the percentage of matching sentences for each document, which gives more information regarding the closeness of the documents. The first column of Table 2 gives the percent match of each document against itself: for each document d in a group we compute 100 x COUNT(d, MATCH)/SIZE(d) (see Section 3.1) and report the average for the group. The fact that all values in this column are 100% simply confirms that COPS is working properly. The values in the second column are computed as follows: for each document d in a group, we compute 100 x COUNT(r, MATCH)/SIZE(d) for every other document r in the same group, and average the results. We refer to the values in this second column as affinity values, since they represent how closely related the documents of a group are. For the third column we compare each document in a group against all the documents outside its group; we refer to these numbers as noise, since they represent undesired matches. The numbers at the bottom of Table 2 are the averages over all the document comparisons performed.

    [Table 2: Average percentage of matching sentences per topical group (match against self, match against related documents, match against unrelated documents). Every document matched itself 100%. The per-group affinity values varied widely, reaching as high as 93.3%, while the per-group noise values ranged from 0.1% to 1.3%. Overall, the average affinity was 52.9% (standard deviation 25.16%) and the average noise was 0.6% (standard deviation 2.1%).]

Ideally, one wants affinity values that are as high as possible and noise values that are as low as possible. This makes it possible to choose a threshold that lies between the affinity and noise levels and thereby distinguish related from unrelated documents. Table 2 reports that related documents have on average 53% matching sentences, while unrelated ones have 0.6%. The reason why the affinity is relatively low is that the notion of Related documents we have used here is quite broad; for example, the journal version and the conference version of the same work are often quite different.
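For concreteness, here is a small sketch of how the affinity and noise columns of Table 2 are computed from per-pair match percentages. The numbers in the example are invented, not the experimental data.

    from statistics import mean

    def affinity_and_noise(match_pct, group_of):
        # match_pct[(d, r)] = 100 * COUNT(r, MATCH) / SIZE(d) for every ordered pair d != r;
        # group_of[d] is the topical group of document d
        related   = [p for (d, r), p in match_pct.items() if group_of[d] == group_of[r]]
        unrelated = [p for (d, r), p in match_pct.items() if group_of[d] != group_of[r]]
        return mean(related), mean(unrelated)        # (affinity, noise)

    group_of = {"a1": "g1", "a2": "g1", "b1": "g2"}
    match_pct = {("a1", "a2"): 62.0, ("a2", "a1"): 58.0,
                 ("a1", "b1"): 0.4,  ("b1", "a1"): 0.7,
                 ("a2", "b1"): 0.2,  ("b1", "a2"): 0.5}
    print(affinity_and_noise(match_pct, group_of))   # (60.0, 0.45)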
The noise level of 0.6% is larger than we expected; for these documents, 0.6% corresponds to several sentences. The discrepancy is caused by several things. A few sentences, such as "This work was partially supported by NSF ...", are quite common even in articles by unrelated authors, and other sentences may match by sheer coincidence. Hash collisions may also become a factor when there are large numbers of registered documents, although they do not play a role in our relatively small experiments. The noise also has a relatively large variance, as reported in the table.

In particular, the process by which a document is translated to ASCII has some effect on the noise level. For example, our translation from DVI produces somewhat less noise than the translation we use to convert TeX documents directly. This difference is caused by the handling of the references. Many unrelated TeX documents cite the same references, possibly generating matching sentences; the TeX filter includes the references in its ASCII output, while documents whose references are kept in separate .bib files do not have them included, so the noise is reduced. The differences introduced by translation become less significant when the enhancements discussed below are added to the system.

If the noise level is large, it is harder to detect plagiarism of small passages, e.g., a couple of sentences or a paragraph. If we set the threshold at a small value, say 5/SIZE(d), the OOT would flag too many unrelated documents as Plagiarism violations, a high Beta error; if we set it higher to ride above the noise, say 10/SIZE(d), we would miss actual violations, a high Alpha error. Thus it is important to reduce the noise as much as possible.

5.4 Enhancements

We need to decrease the noise without sacrificing affinity: if affinity drops too low, it becomes hard to approximate the Related target test without again incurring high Alpha or Beta errors. With this goal in mind we have considered a series of enhancements to the basic COPS algorithms, summarized in Table 3. The values reported are averages over all document groups, i.e., they are equivalent to the last row of Table 2. The first line of the table is the base case, each of the next lines represents an independent enhancement to it, and the last line combines them all.

    Method               Match self   Match Related (Affinity)   Match Unrelated (Noise)
    Simple method        100%         53.0%                      0.61% (s.d. 2.05%)
    No common chunks     100%         53.4%                      0.06% (s.d. 0.33%)
    Drop numbers         100%         54.1%                      0.47% (s.d. 1.34%)
    No short sentences   100%         51.8%                      0.04% (s.d. 0.21%)
    No short words       100%         54.4%                      0.36% (s.d. 0.93%)
    All enhancements     100%         53.6%                      0.03% (s.d. 0.23%)

    Table 3: COPS enhancements.

In the "no common chunks" enhancement, chunks occurring in our hash table more than ten times are eliminated by the LOOKUP function of Figure 1. This rules out common phrases while keeping the legitimate matches, i.e., the passages by which a document can be in violation. For example, the sentence "This work was supported by NSF" appears in many documents but will not be reported as a match. The last three enhancements remove numeric and short items from the input stream: for "drop numbers", any word containing a digit is dropped; short sentences are defined to be those with three or fewer words; and short words are defined to be those with three or fewer characters. These enhancements were motivated by the discovery that short sentences and short words were sometimes involved in incorrect matches, because of the problem with abbreviations described in Section 5.2. The last row of Table 3 shows the effect of using all the enhancements at once: the combined enhancements are quite effective at reducing the noise level while keeping the affinity values roughly the same. Note that the parameter settings for the enhancements (e.g., the number of occurrences that makes a chunk "common") worked well for our collection, but would probably have to be adjusted for larger collections.
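A sketch of the enhancement filters summarized in Table 3 and described above. The regular expressions and exact cut-offs are simplified stand-ins; only the general rules (digits, three-or-fewer-word sentences, three-or-fewer-character words, chunks seen more than ten times) come from the text.

    import re

    def keep_chunk(sentence, occurrences):
        words = sentence.split()
        if len(words) <= 3:                     # "no short sentences": three or fewer words
            return False
        if occurrences(sentence) > 10:          # "no common chunks": seen more than ten times
            return False
        return True

    def clean_sentence(sentence):
        words = sentence.split()
        words = [w for w in words if not re.search(r"\d", w)]    # "drop numbers"
        words = [w for w in words if len(w) > 3]                 # "no short words"
        return " ".join(words)

    # a toy occurrence counter standing in for a lookup against the real hash table
    seen = {"this work was partially supported by nsf": 37}
    occurrences = lambda s: seen.get(s.lower(), 0)

    for s in ["This work was partially supported by NSF",
              "e.g.",
              "Registered documents are broken into 1000 small units"]:
        if keep_chunk(s, occurrences):
            print(clean_sentence(s))
    # only the last sentence survives, with "1000" and the short words removed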
In Figure 3 we study the effect of increasing the number of overlapping sentences per chunk, without any of the enhancements of Table 3. The solid line shows the average noise as a function of the number of overlapping sentences per chunk: as the chunk size grows, the noise decreases dramatically. This decrease is beneficial, since lower noise means that smaller amounts of plagiarism can be detected. The second curve shows the "effective noise", the average noise plus a few standard deviations. If we assume that the noise is normally distributed, we can interpret the effective noise curve as a lower bound on the threshold needed to eliminate the vast majority (on the order of 99%) of false positives due to noise. For example, if we use three-sentence chunks, we can set our threshold at about 0.01 and the Beta error due to noise will be small. However, as described in Section 4.2, the Alpha error increases as we combine sentences into chunks, and we become unable to detect plagiarism of isolated, non-contiguous sentences. The security of the system is also reduced (Section 4.2), since fewer sentence changes suffice to make a new document pass undetected.

    [Figure 3: The effect of chunk size on document noise: average noise and effective noise as a function of the number of overlapping sentences per chunk.]

5.5 Effect of Converters

The final issue we investigate is the impact of the different input converters. For example, say the LaTeX version of a document is initially registered with COPS. Later, the DVI version of the same document, produced by running the original through the LaTeX processor, is submitted for testing. Clearly we would like the DVI copy to match the registered original, and to have a similar number of matches with other documents as the original would have had. Table 4 explores this issue. The first row is for the basic COPS algorithm; the second includes all the enhancements of Section 5.4. The "Self" and "Related" columns repeat values from the earlier tables and are included only for reference. The "Altered Self" column reports the average percent of matching sentences when the mechanically altered (DVI) version of a document is compared against its registered LaTeX original, and the "Altered Rel." column gives the percent of matching sentences when the DVI document is compared to the documents related to its original. Although the results are far from perfect, enough matches remain that the DVI copy is flagged as related both to its original and to the documents the original was related to.

    Method     Self   Altered Self   Related   Altered Rel.   Unrelated
    Simple     100%   60.9%          52.9%     36.0%          0.50%
    Enhanced   100%   76.5%          53.6%     46.2%          0.03%

    Table 4: Results for mechanically altered documents.

We believe that the results presented in this section, although not definitive, provide some insight into the selection of good threshold values for COPS, at least for the Related target test. A threshold value of about 0.05, say 25 out of 500 sentences, identifies the vast majority of related documents without triggering false violations due to noise, i.e., without either high Alpha or high Beta errors. We also conclude that detecting plagiarism of roughly 10 sentences or less, about 2% of a document, will be quite hard.
6 Approximating OOTs

In this section we address the efficiency and scalability of OOTs. For copy detection to scale well to very large collections of registered documents, we require OOTs that can operate quickly when testing a new document. One effective way to achieve scalability is sampling.

To illustrate, say we have an OOT o1 whose DECIDE function tests whether more than 15 percent of the chunks in test document d match chunks of a registered document. Instead of checking all of the chunks of d, we could simply take 20 random samples and check whether more than 15% of them matched. One would expect this new OOT, based on sampling, to approximate the original one. If the average test document contains 1000 chunks, we have reduced our evaluation time by a factor of 50. The cost, of course, is a loss in accuracy, analyzed in Section 6.1.

Another option is to sample the registered documents: for each registered document, we insert into the hash table only a random sample of its chunks. For example, say that for all registered documents only 10% of the chunks are hashed, and suppose that when checking a new document of 100 chunks we find 20 matches with some registered document. Since 20/100 exceeds the 15% threshold, the document is flagged as a violation, just as it would have been by the original OOT. In this case the savings are in storage: the hash table only holds 10% of the registered chunks. A smaller table also makes it possible to distribute copies of it to other sites, so that copy detection can be done in a distributed fashion. Again, the cost is a loss of accuracy. A third option is to sample both at registration time and at testing time. Due to space limitations, in this paper we only consider the first option, sampling at testing time; we note, however, that the analysis for sampling at registration time is analogous, and its results are almost identical to what we present here.

We start by giving a more precise definition of the sampling-at-testing strategy. We are given an OOT o1 with chunking functions INSCHUNKS1 and EVALCHUNKS1 and the match_ratio DECIDE1 function with threshold th (Section 3.1). We define a second OOT o2, intended to approximate o1. Its chunking function for evaluation is simply

    EVALCHUNKS2(d) = RANDOMSELECT(N, EVALCHUNKS1(d))

where RANDOMSELECT(N, S) picks N chunks at random from S. [Footnote 4: This is not the most efficient way to sample; it is stated this way just for explanation purposes.] The chunking function for insertions is not changed, i.e., INSCHUNKS2 = INSCHUNKS1. The decision function DECIDE1 of o1 selects registered documents r for which the number of matching chunks COUNT(r, MATCH) is greater than th x SIZE(d). For o2, the threshold is applied not to SIZE(d) but to the number of sampled chunks: DECIDE2 selects documents r for which COUNT(r, MATCH) is greater than th x N.
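A sketch of the sampled evaluation performed by o2: the evaluation chunker draws N chunks at random and the match_ratio threshold is applied to N rather than to SIZE(d). The toy hash table and documents are invented for the example.

    import random
    from collections import defaultdict

    def evaluate_sampled(chunks, table, n_samples, th, seed=None):
        # EVALCHUNKS2(d) = RANDOMSELECT(N, EVALCHUNKS1(d))
        rng = random.Random(seed)
        sample = rng.sample(chunks, min(n_samples, len(chunks)))
        counts = defaultdict(int)
        for c in sample:
            for doc_id in table.get(hash(c), []):
                counts[doc_id] += 1
        # DECIDE2: flag r when COUNT(r, MATCH) > th * N  (not th * SIZE(d))
        return {r for r, n in counts.items() if n > th * len(sample)}

    # toy hash table: registered document r1 consists of chunks c0 .. c99
    table = defaultdict(list)
    registered = ["c%d" % i for i in range(100)]
    for c in registered:
        table[hash(c)].append("r1")

    # the test document copies 40 of r1's chunks and adds 60 new ones
    test_chunks = registered[:40] + ["new%d" % i for i in range(60)]
    print(evaluate_sampled(test_chunks, table, n_samples=20, th=0.15, seed=1))
    # almost always {'r1'}: about 8 of the 20 sampled chunks match, well above 0.15 * 20 = 3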
6.2 Results

Before we can evaluate the expressions above, we need the W(x) distribution. Recall that it tells us how likely a given proportion of matches is between a test document and a registered document. One option would be to measure W(x) for a particular body of documents, but then our results would be specific to that particular body. Instead, we use a parametrized function that lets us consider a variety of scenarios. Using the observations of the previous sections, we arrive at the following function. With very high probability Pa, the test document will be unrelated to the registered one; in this case there can still be noise matches, which we model as normally distributed with mean and standard deviation that will probably both be very small. With the remaining probability, the test document is related to the registered one; in this case we assume the proportion of matching chunks to be normally distributed with a larger mean and a large standard deviation, since we have seen that related documents tend to have widely varying numbers of matches. Our W(x) is thus the weighted sum of these two distributions, truncated at 0 and 1 and normalized. Figure 4 shows a sample W(x) function with exaggerated parameters to make its form more apparent. The area under the curve in the low range represents the likelihood of noise matches, while the rest of the curve represents mainly matches to a related document. In practice, of course, we would expect most comparisons to be between unrelated documents, so Pa would be much closer to one and the noise deviation would be much smaller.

[Figure 4: A sample W(x) function with exaggerated parameters.]

Given the parametrized W(x), an important issue that we can study is the number of samples N required for accurate results. Figure 5 shows the values of Alpha(o1, o2), Beta(o1, o2), and Error(o1, o2) as a function of N; the W(x) parameters and the threshold are given in the figure. Recall that the threshold means we are looking for registered documents whose chunks match 10% of the chunks in the test document.

[Figure 5: The effect of the number of sample points N on accuracy, showing the Alpha, Beta, and Error curves; the W(x) parameters and threshold are given in the figure.]

Note that the Alpha and Error values are not simply monotonically decreasing in N; rounding is the cause of this. DECIDE2 selects documents whose number of matching sampled chunks is greater than t * N, so as N grows by one, the integer number of hits required can jump. Consider a test document that matches a registered document on slightly more than the threshold fraction of its chunks, so that it is selected by o1. With a given N it needs a certain number of hits to be selected by o2; with one extra sample it may need one more hit, and is therefore less likely to be selected, which leads to a locally higher Alpha error. In spite of this nonmonotonicity, the important point is how rapidly the Error decreases as N increases: it shows that o2 can approximate o1 very well with a small number of sampled chunks. Note, however, that the Alpha error does not decrease as rapidly, but this is not as serious.
The Alpha error is caused mainly by test documents whose match ratio is higher than the threshold t, but only slightly; the area of the W(x) curve in the vicinity just to the right of t gives the probability of encountering these documents. In these cases the sampling OOT o2 may not muster enough hits to trigger detection of the violation, even though the original OOT o1 would detect it.
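To make the accuracy analysis above reproducible in spirit, here is a hedged numerical sketch: it models W(x) as the weighted sum of two normal densities truncated to [0, 1] as described in Section 6.2, models the number of matching samples as Binomial(N, x) (an assumption; the paper's exact derivation is in its appendix), and integrates W(x)Q(x) numerically to estimate Alpha, Beta, and Error. The parameter values, names, and grid-integration approach are illustrative choices, not the paper's code.

    import math

    def norm_pdf(x, mu, sigma):
        # Normal density with mean mu and standard deviation sigma.
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

    def w(x, pa, mu_a, sig_a, mu_b, sig_b):
        # Two-component mixture: noise matches (weight pa) plus related-document
        # matches (weight 1 - pa); truncation to [0, 1] is handled by the caller.
        return pa * norm_pdf(x, mu_a, sig_a) + (1 - pa) * norm_pdf(x, mu_b, sig_b)

    def binom_cdf(k, n, p):
        # P(Binomial(n, p) <= k), computed directly from the pmf.
        return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

    def q(x, n, t):
        # Probability that the sampled decision disagrees with the full decision
        # for a document whose true match proportion is x, assuming the number
        # of matching samples is Binomial(n, x). o2 flags when matches > t * n.
        k = math.floor(t * n)
        p_flag = 1.0 - binom_cdf(k, n, x)
        return 1.0 - p_flag if x > t else p_flag

    def errors(n, t, pa, mu_a, sig_a, mu_b, sig_b, steps=2000):
        # Estimate Alpha, Beta, and overall Error by integrating W(x) * Q(x)
        # over [0, 1] on a uniform grid, renormalizing for the truncation.
        dx = 1.0 / steps
        num_a = den_a = num_b = den_b = err = total = 0.0
        for i in range(steps):
            x = (i + 0.5) * dx
            wx, qx = w(x, pa, mu_a, sig_a, mu_b, sig_b), q(x, n, t)
            total += wx * dx
            err += wx * qx * dx
            if x > t:
                num_a += wx * qx * dx
                den_a += wx * dx
            else:
                num_b += wx * qx * dx
                den_b += wx * dx
        return num_a / den_a, num_b / den_b, err / total

    # Example (illustrative parameters only): mostly-unrelated comparisons,
    # small noise, a broad related-document component, t = 0.1, N = 20 samples.
    # alpha, beta, error = errors(20, 0.1, 0.95, 0.02, 0.02, 0.3, 0.2)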


