The Football Association Premier League Limited et al v. Youtube, Inc. et al
Filing
276
DECLARATION of Elizabeth Anne Figueira, Esq. in Opposition re: 167 MOTION for Summary Judgment.. Document filed by The Music Force LLC, Cal IV Entertainment, LLC, Cherry Lane Music Publishing Company, Inc., The Football Association Premier League Limited, Robert Tur, National Music Publishers' Association, The Rodgers & Hammerstein Organization, Edward B. Marks Music Company, Freddy Bienstock Music Company, Alley Music Corporation, X-Ray Dog Music, Inc., Federation Francaise De Tennis, The Scottish Premier League Limited, The Music Force Media Group LLC, Sin-Drome Records, Ltd., Murbo Music Publishing, Inc., Stage Three Music (US), Inc., Bourne Co.. (Attachments: # 1 Exhibit 189, # 2 Exhibit 190, # 3 Exhibit 191, # 4 Exhibit 192, # 5 Exhibit 193, # 6 Exhibit 194, # 7 Exhibit 195, # 8 Exhibit 196, # 9 Exhibit 197, # 10 Exhibit 198, # 11 Exhibit 199, # 12 Exhibit 200, # 13 Exhibit 201, # 14 Exhibit 202, # 15 Exhibit 203, # 16 Exhibit 204, # 17 Exhibit 205, # 18 Exhibit 206, # 19 Exhibit 207, # 20 Exhibit 208, # 21 Exhibit 209, # 22 Exhibit 210, # 23 Exhibit 211, # 24 Exhibit 212, # 25 Exhibit 213, # 26 Exhibit 214, # 27 Exhibit 215, # 28 Exhibit 216, # 29 Exhibit 217, # 30 Exhibit 218, # 31 Exhibit 219, # 32 Exhibit 220, # 33 Exhibit 221, # 34 Exhibit 222, # 35 Exhibit 223, # 36 Exhibit 224 Part 1, # 37 Exhibit 224 Part 2, # 38 Exhibit 225, # 39 Exhibit 226, # 40 Exhibit 227 Part 1, # 41 Exhibit 227 Part 2, # 42 Exhibit 227 Part 3, # 43 Exhibit 227 Part 4, # 44 Exhibit 228, # 45 Exhibit 229, # 46 Exhibit 230, # 47 Exhibit 231, # 48 Exhibit 232, # 49 Exhibit 233, # 50 Exhibit 234, # 51 Exhibit 235, # 52 Exhibit 236, # 53 Exhibit 237, # 54 Exhibit 238, # 55 Exhibit 239, # 56 Exhibit 240, # 57 Exhibit 241, # 58 Exhibit 242, # 59 Exhibit 243, # 60 Exhibit 244, # 61 Exhibit 245, # 62 Exhibit 246, # 63 Exhibit 247, # 64 Exhibit 248, # 65 Exhibit 249, # 66 Exhibit 250, # 67 Exhibit 251, # 68 Exhibit 252, # 69 Exhibit 253, # 70 Exhibit 254, # 71 Exhibit 255, # 72 Exhibit 
256, # 73 Exhibit 257, # 74 Exhibit 258, # 75 Exhibit 259, # 76 Exhibit 260, # 77 Exhibit 261, # 78 Exhibit 262, # 79 Exhibit 263, # 80 Exhibit 264, # 81 Exhibit 265, # 82 Exhibit 266, # 83 Exhibit 267, # 84 Exhibit 268, # 85 Exhibit 269, # 86 Exhibit 270, # 87 Exhibit 271, # 88 Exhibit 272 Part 1, # 89 Exhibit 272-2, # 90 Exhibit 272 Part 3, # 91 Exhibit 272 Part 4, # 92 Exhibit 272 Part 5, # 93 Exhibit 272 Part 6, # 94 Exhibit 272 Part 7, # 95 Exhibit 272 Part 8, # 96 Exhibit 272 Part 9, # 97 Exhibit 272 Part 10, # 98 Exhibit 272 Part 11, # 99 Exhibit 272 Part 12, # 100 Exhibit 272 Part 13, # 101 Exhibit 272 Part 14, # 102 Exhibit 272 Part 15, # 103 Exhibit 272 Part 16, # 104 Exhibit 272 Part 17, # 105 Exhibit 272 Part 18, # 106 Exhibit 272 Part 19, # 107 Exhibit 273, # 108 Exhibit 274, # 109 Exhibit 275, # 110 Exhibit 276, # 111 Exhibit 277, # 112 Exhibit 278, # 113 Exhibit 279, # 114 Exhibit 280, # 115 Exhibit 281, # 116 Exhibit 282, # 117 Exhibit 283, # 118 Exhibit 284, # 119 Exhibit 285, # 120 Exhibit 286, # 121 Exhibit 287, # 122 Exhibit 288, # 123 Exhibit 289, # 124 Exhibit 290, # 125 Exhibit 291, # 126 Exhibit 292, # 127 Exhibit 293, # 128 Exhibit 294, # 129 Exhibit 295, # 130 Exhibit 296, # 131 Exhibit 297, # 132 Exhibit 298, # 133 Exhibit 299, # 134 Exhibit 300, # 135 Exhibit 301, # 136 Exhibit 302, # 137 Exhibit 303, # 138 Exhibit 304, # 139 Exhibit 305, # 140 Exhibit 306, # 141 Exhibit 307, # 142 Exhibit 308, # 143 Exhibit 309, # 144 Exhibit 310, # 145 Exhibit 311, # 146 Exhibit 312, # 147 Exhibit 313, # 148 Exhibit 314, # 149 Exhibit 315, # 150 Exhibit 316, # 151 Exhibit 317, # 152 Exhibit 318, # 153 Exhibit 319, # 154 Exhibit 320, # 155 Exhibit 321, # 156 Exhibit 322, # 157 Exhibit 323, # 158 Exhibit 324, # 159 Exhibit 325, # 160 Exhibit 326, # 161 Exhibit 327, # 162 Exhibit 328, # 163 Exhibit 329, # 164 Exhibit 330, # 165 Exhibit 331, # 166 Exhibit 332, # 167 Exhibit 333 Part 1, # 168 Exhibit 333 Part 2, # 169 Exhibit 334, # 170 Exhibit 335, # 171 
Exhibit 336, # 172 Exhibit 337, # 173 Exhibit 338)(Figueira, Elizabeth)
Tab 259

Copy Detection Mechanisms for Digital Documents

Sergey Brin, James Davis, Hector Garcia-Molina
Department of Computer Science
Stanford University
Stanford, CA 94305-2140
e-mail: sergey@cs.stanford.edu

October 31, 1994
Abstract

In a digital library system, documents are available in digital form and therefore are more easily copied and their copyrights are more easily violated. This is a very serious problem, as it discourages owners of valuable information from sharing it with authorized users. There are two main philosophies for addressing this problem: prevention and detection. The former actually makes unauthorized use of documents difficult or impossible, while the latter makes it easier to discover such activity.

In this paper we propose a system for registering documents and then detecting copies, either complete copies or partial copies. We describe algorithms for such detection, and metrics required for evaluating detection mechanisms (covering accuracy, efficiency, and security). We also describe a working prototype, called COPS, describe implementation issues, and present experimental results that suggest the proper settings for copy detection parameters.
1 Introduction

Digital libraries are a concrete possibility today because of many technological advances in areas such as storage and processor technology, networks, database systems, scanning systems, and user interfaces. In many aspects, building a digital library today is just a matter of "doing it." However, there is a real danger that such a digital library will either have relatively few documents of interest, or will be a patchwork of isolated systems that provide very restricted access.

The reason for this danger is that the electronic medium makes it much easier to illegally copy and distribute information. If an information provider gives a document to a customer, the customer can easily distribute it on a large mailing list or post it on a bulletin board. The danger of illegal copies is not new, of course; however, it is much more time consuming to reproduce and distribute paper, CDs, or videotape copies than on-line documents.

Current technology does not strike a good balance between protecting the owners of intellectual property and giving access to those who need the information. At one extreme are open sources such as the Internet, where everything is free, but valuable information is frequently unavailable because of the dangers of unauthorized distribution.
At the other extreme are closed systems, such as the one that the IEEE currently uses to distribute its papers on CD-ROM. This is a completely stand-alone system where users can look up specific articles, view them, and print them, but cannot move any data in electronic form out of the system and cannot add any of their own data.

Footnote: This research was sponsored by the Advanced Research Projects Agency (ARPA) of the Department of Defense under Grant No. MDA972-92-J-1029 with the Corporation for National Research Initiatives (CNRI). The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies or endorsement, either expressed or implied, of ARPA, the U.S. Government, or CNRI.

Footnote 1: As just one example, Knight-Ridder Tribune recently (June 23, 1994) ceased publishing the Dave Barry and Mike Royko columns on ClariNet because subscribers re-distributed the articles on large mailing lists.
Clearly, one would like to have an infrastructure that gives users access to a wide variety of digital libraries and information sources, but that at the same time gives information providers good economic incentives for offering their information. In many ways, we believe this is the central issue for future digital information and library systems.

In this paper we present one component of the information infrastructure that addresses this issue. The key idea is quite simple: provide a copy detection service where original documents can be registered and copies can be detected. The service will detect not just exact copies, but also documents that overlap in significant ways. The service can be used (see Section 2) in a variety of ways by information providers and communications agents to detect violations of intellectual property laws. Although the copy detection idea is simple, there are several challenging issues involving performance, storage capacity, and accuracy that need to be resolved, and we address them here. Furthermore, copy detection is relevant to the database community, since its central component is a large database of registered documents.

We stress that copy detection is not the complete solution by any means; it is simply a helpful tool. There are a number of other important tools that will also assist in safeguarding intellectual property. For example, good encryption and authorization mechanisms are needed in some cases. It is also important to have mechanisms for charging for access to information. Other articles discuss a variety of topics related to intellectual property. These other tools and topics will not be covered in this paper.

In the following section we will briefly discuss some of the options for safeguarding intellectual property, and argue that copy detection is a very promising approach. In Section 3 we define the basic terms and evaluation metrics for copy detection. Then in Section 5 we describe our working prototype, COPS, and report on some initial experiments. A sampling technique that can reduce the storage space of registered documents or can speed up checking time is presented and analyzed in Section 6. Finally, some security considerations are discussed.
2 Safeguarding Intellectual Property

How can we ensure that a document is only seen and used by a person who is authorized (e.g., has paid) to see it? Let us illustrate the possibilities and problems by suggesting two particular techniques.
The first technique is based on the notion of a secure printer. Such a printer is a sealed device that cannot be opened by its owner. It contains a unique private key (this is public key encryption) that is known only to the printer itself. The public key of the printer and the name of the owner are registered in a trusted database provided by the manufacturer.

When an owner requests a document from an information vendor, the vendor first ensures that the owner is authorized (e.g., has paid), then fetches the public key of the printer from the registry, encrypts the document using it, and sends the result. When the owner receives the data, he can send it to the printer, which can decrypt and print the document. However, the electronic data cannot be used for anything else, as only this one printer can decrypt it. The data can be resent to the printer to illegally create another paper copy, but a paper document can be reproduced in other ways anyway; this scheme does not solve that previously unsolved problem.

The main problem with this scheme is that it is too restrictive. It is more of an electronic paper delivery system than a digital library. Users cannot browse through documents before buying them, and cannot use parts of a document for anything else (e.g., for quotes). Furthermore, it requires special purpose hardware. However, it may still be useful in conjunction with other schemes. For example, users can be allowed to browse through low-resolution copies of documents, or through documents that have key components missing. Once the user decides that he wants to read or purchase a document, a high quality copy can be delivered via the secure printer. The scheme can also be adapted for a secure computer instead of a printer.

The second technique that we wish to illustrate is that of an active document, an idea that has been suggested elsewhere. The idea is that the information vendor does not send out documents; instead it sends out programs. Call one of these programs P. When a user receives P, he can run it on his local machine, and P generates and displays the document. Embedded within P, in its code or data structures, is the encrypted document. However, each time P runs, before displaying the document, it sends a message to the vendor informing it that it is being run, and waits for a response. This way the vendor can charge each time P runs, or limit the number of times it can run.

This scheme also has drawbacks. The user cannot read the document through his favorite viewer. The vendor must know in advance the architecture of the user's machine in order to generate the appropriate code. The user cannot see the document if the vendor's machine or the network is unavailable. Finally, the scheme is not foolproof, since the user could run P in a software emulator of his machine that would record the characters of the document as they are displayed.
While we have only given two examples, we believe they illustrate a common problem with protection techniques: they are usually cumbersome to use and often get in the way of honest users. That is the reason we advocate the alternative, detection techniques. That is, we assume that most users are honest, allow them access to the documents, and focus on detecting those that violate the rules. Many software vendors have found this approach superior: when protection mechanisms get in the way of honest users, sales may actually decrease.
One possible direction is to incorporate a "watermark" into a document that identifies its origin. For example, if we think of documents as images, we would encode the watermark in a small number of bits placed at random throughout the image. The users may be unaware of where the watermark bits are, but the information vendor could extract them to determine to whom the document was originally sold. If the document is then found possessed by a different person or organization, a violation is detected.

The main weakness of approaches such as these is that users may easily destroy the watermark by processing the document. For instance, passing an image document through a basic filter algorithm (adding noise, or lossy compression) could change enough bits to destroy the watermark without really altering the image.
A second detection approach, and the one we advocate in this paper for text documents, is that of a copy detection server. The basic idea is as follows. When an author creates a new work, he or she registers it at the server. The server could also be the repository for a copyright recordation and registration system, as has been suggested. As documents are registered, they are broken into small units, for now say sentences. Each sentence is hashed, and a pointer to it is stored in a large hash table.

Documents can be compared to existing documents in the repository to check for plagiarism or other types of significant overlap. When a document is to be checked, it is also broken into sentences. For each sentence, we probe the hash table to see if that particular sentence has been seen before. If the document and a previously registered document share more than some threshold number of sentences, then a violation is flagged. The threshold can be set depending on the desired checks: smaller if we are looking for copied paragraphs, larger if we only want to check whether documents share large portions. A human would then have to examine both documents to see if it was truly a violation.
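The registration and checking steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the sentence splitter (a regular expression on `.`, `!`, `?`), the whitespace and case normalization, the use of SHA-1, and the names `SentenceRegistry`, `register`, and `check` are all hypothetical choices made for the example.

```python
import hashlib
import re
from collections import defaultdict


def sentences(text):
    """Split text into normalized sentences (assumed: lowercase, collapsed spaces)."""
    parts = re.split(r"[.!?]+", text)
    return [" ".join(p.lower().split()) for p in parts if p.strip()]


def sentence_hash(sentence):
    """Hash one sentence; the choice of SHA-1 is an assumption for this sketch."""
    return hashlib.sha1(sentence.encode("utf-8")).hexdigest()


class SentenceRegistry:
    """Hypothetical copy detection server keyed on sentence hashes."""

    def __init__(self):
        # hash of a sentence -> ids of registered documents containing it
        self.table = defaultdict(set)

    def register(self, doc_id, text):
        for s in sentences(text):
            self.table[sentence_hash(s)].add(doc_id)

    def check(self, text, threshold):
        """Return ids of registered documents sharing more than `threshold` sentences."""
        shared = defaultdict(int)
        for s in set(sentences(text)):
            for doc_id in self.table.get(sentence_hash(s), ()):
                shared[doc_id] += 1
        return {doc_id for doc_id, n in shared.items() if n > threshold}


registry = SentenceRegistry()
registry.register("r1", "The cat sat. The dog ran. Birds fly south.")
# Extra blanks between words do not matter because whitespace is normalized.
flagged = registry.check("The  cat sat.  The dog ran! Fish swim.", threshold=1)
```

Because the hash is computed on the normalized sentence, padding a sentence with blank space does not change its hash, which matches the assumption in the text that the hashing scheme ignores spaces.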
Unlike the case with watermarks, it is not easy for a user to automatically subvert the system, i.e., to make an undetectable copy. For example, if the decomposition units are sentences, a user would have to change a large number of sentences in the document. This involves more than just adding a blank space between words (assuming that the hashing scheme ignores spaces). Of course, a determined user could change all sentences, but our goal is to make it hard to copy documents, not to make it impossible. This makes it hard to rapidly distribute copies of documents.
The copy detection server can be used in a variety of ways. For example, a publisher is legally liable for publishing materials the author does not have copyright on. Thus, it may wish to check if a soon-to-be-published document is actually an original document. Similarly, bulletin-board software may automatically check new postings in this fashion. An electronic mail gateway may also check the messages that go through it (checking for "transportation of stolen goods"). Program committee members may want to check if a submission overlaps too much with an author's previous paper. Lawyers may want to check subpoenaed documents to prove illegal behavior. (Copy detection can also be used for computer programs, but we focus only on text in this paper.)

There are also applications that do not involve undesirable behavior. For example, a user who is retrieving documents from an information retrieval system, or who is reading electronic mail, may want to flag duplicate items (with a given overlap threshold). Here the "registered" documents are those that have been seen already; the "copies" represent messages that are retransmitted or forwarded many times, different editions or versions of the same work, and so on. Of course, potential duplicates should not be deleted automatically; it is up to the user to decide if he wants to view possible duplicates.

In summary, we think that detecting copies of text documents is a fundamental problem for distributed information or database systems, and there are many issues that need to be addressed. For instance, should the decomposition units be paragraphs or something else instead of sentences? Should we take into account the order of the units (paragraphs or sentences), e.g., by hashing sequences of units? Is it feasible to hash only a fraction of the sentences of registered documents? This would make the hash table smaller, hopefully still making it very likely that we will catch major violations. If the hash table is relatively small, it can also be cloned. Our mail gateway above could then perform its checks locally, instead of having to contact a remote copy detection server for each message. There are also implementation issues that need to be addressed. For example, how are sentences extracted from, say, LaTeX or Word documents? Can one extract them from Postscript documents, or from bitmaps via OCR?

These and other questions will be addressed in the rest of this paper. We start in Sections 3 and 4 by defining the basic terms, evaluation metrics, and options for copy detection. Then in Section 5 we describe our working prototype, COPS, and report on some initial experiments. A sampling technique that can reduce the storage space of registered documents or can speed up checking time is presented and analyzed in Section 6.
3 General Concepts

In this section we define some of the basic concepts for text copy detection and for evaluating mechanisms that implement it. As far as we know, text copy detection has not been formally studied, so we start from the basics.

As a starting point, we define the concept of a document as a body of text from which some structural information (such as word and sentence boundaries) can be extracted. In an initial phase, formatting information and non-textual components are removed from documents (see Section 5). The resulting canonical form document consists of a string of ASCII characters with whitespace separating words, punctuation separating sentences, and possibly a standard method of marking the beginning of paragraphs.

A violation occurs when a document infringes upon another document in some way (e.g., by duplicating a portion of it). There are a number of violation types that can occur, including plagiarism of a few sentences, exact replication of the entire document, and many steps in between. The notion of checking for a violation of a particular type between two documents is captured by a violation test. If t is a violation test, then t(d, r) holds (is true) if document d violates document r according to the particular test t. For example, Plagiarism(d, r) is true if document d has plagiarized from document r. We also extend this notation to checking a document d against a set of documents R: t(d, R) is true if and only if t(d, r) holds for some document r in R.

Most of the violation tests we are interested in are not well defined and require a decision by a human being. For example, plagiarism is particularly difficult to test for. For instance, the sentence "The proof is as follows" may occur in many scientific papers and would not be considered plagiarism if it occurred in two documents, while this sentence most certainly would. If we consider a test Subset that detects if a document is essentially a subset of another one, we again need to consider whether the smaller document makes any significant contributions. This again requires human evaluation.

The goal of a copy detection system is to implement well defined algorithmic tests, termed operational tests (with the same notation as violation tests), that approximate the desired violation tests. For instance, consider the operational test t1(d, r) that holds if 90% of the sentences in d are contained in r. This test may be considered to be an approximation to the Subset test described above. If the system flags t1 violations, then a human can check if they are indeed Subset violations.
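To make the notation concrete, the operational test t1 and its extension from a single registered document r to a set R can be written as a short Python sketch. The sentence splitter and normalization are assumptions made for illustration, and the helper names `split_sentences`, `t1`, and `holds_for_set` are hypothetical; only the 90% rule and the "holds for some r in R" rule come from the text.

```python
import re


def split_sentences(text):
    """Assumed canonical form: lowercase sentences with collapsed whitespace."""
    parts = re.split(r"[.!?]+", text)
    return [" ".join(p.lower().split()) for p in parts if p.strip()]


def t1(d, r, fraction=0.9):
    """Operational test: at least `fraction` of d's sentences appear in r."""
    d_sents = split_sentences(d)
    if not d_sents:
        return False  # an empty document is treated as violating nothing
    r_sents = set(split_sentences(r))
    contained = sum(1 for s in d_sents if s in r_sents)
    return contained >= fraction * len(d_sents)


def holds_for_set(test, d, registered):
    """t(d, R) is true if and only if t(d, r) holds for some r in R."""
    return any(test(d, r) for r in registered)


r = "A one. B two. C three. D four."
subset_like = t1("A one. B two.", r)       # every sentence of d is in r
partial = t1("A one. X nine.", r)          # only half of d is in r
against_set = holds_for_set(t1, "A one. B two.", ["Unrelated text.", r])
```

A document flagged by `t1` is only a candidate Subset violation; as the text notes, a human still has to judge whether the smaller document makes its own contribution.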
3.1 Ordinary Operational Tests

In the rest of this paper we will focus on a specific class of operational tests, ordinary operational tests (OOTs), that can be implemented efficiently. We believe they can accurately approximate many violation tests of interest, such as Subset, Overlap, and Plagiarism.

Before we describe OOTs, we need to define some primitives for specifying the level of detail at which we look into the documents. As mentioned earlier, documents contain some structural information. In particular, documents can be divided into well defined parts, consistent with the underlying structure, such as sections, paragraphs, sentences, words, or characters. We call each of these types of divisions a unit type, and particular instances of these unit types are called units.

We define a chunk as a sequence of consecutive units in a document of a given unit type. A document may be divided into chunks in a number of ways, since chunks can overlap, may be of different sizes, and need not completely cover the document. For example, let us assume we have a document ABCDEFG, where the letters represent sentences or some other unit. Then it can be organized into chunks as follows:

    A, B, C, D, E, F, G
    AB, CD, EF, G
    AB, BC, CD, DE, EF, FG
    ABC, CDE, EFG
    A, D, G

A method of selecting chunks from a document divided into units is a chunking strategy. It is important to note that, unlike units, chunks have no structural significance to the document, and so chunking strategies cannot use structural information about the document.

An OOT o uses hashing to detect matching chunks and is implemented by the set of procedures in Figure 1. The code is intended to convey key concepts, not an efficient or complete implementation. (A later section describes our actual prototype system.) First, there is the preprocessing operation PREPROCESS(R, H), which takes as input a set of registered documents R and creates a hash table H. Second, there are procedures for adding documents to H on the fly (registering new documents) and for removing them from H (unregistering documents). Third, the function EVALUATE(d, H) computes o(d, R).

To insert documents in the hash table, procedure INSERT uses a function INS-CHUNKS(r) to break up a document r into its chunks. The function returns a set of tuples. Each tuple <t, l> represents one chunk in r, where t is the text in the chunk and l is the location of the chunk, measured in some unit. An entry is stored in the hash table for every chunk in the document.

Procedure EVALUATE(d, H) tests a given document d for violations. The procedure uses the EVAL-CHUNKS function to break up d. The reason why we use a different chunking function at evaluation time will become apparent in Section 6. For now, we can assume that INS-CHUNKS and EVAL-CHUNKS are identical, and we refer to them as CHUNKS. After chunking, procedure EVALUATE looks up the chunks in the hash table H, producing a set of tuples MATCH. Each <s, ld, r, lr> in MATCH represents a match: a chunk of size s at location ld in document d matches (has the same hash key as) a chunk at location lr in registered document r. The MATCH set is then given to function DECIDE(MATCH, SIZE), where SIZE is the number of chunks in d, which returns the set of matching registered documents. If the set is non-empty, then there was a violation, i.e., o(d, R) holds.
    PREPROCESS(R, H)
        CREATETABLE(H)
        for each r in R
            INSERT(r, H)

    INSERT(r, H)
        C = INS-CHUNKS(r)                  /* OOT dependent */
        for each <t, l> in C
            h = HASH(t)
            /* assume size of reg. doc. may be obtained from id */
            INSERTCHUNK(<h, r, l>, H)      /* implementation unspecified */

    DELETE(r, H)
        C = INS-CHUNKS(r)
        for each <t, l> in C
            h = HASH(t)
            DELETECHUNK(<h, r, l>, H)      /* implementation unspecified */

    EVALUATE(d, H)
        C = EVAL-CHUNKS(d)                 /* OOT dependent */
        SIZE = |C|
        MATCHES = {}                       /* empty set */
        for each <t, ld> in C
            h = HASH(t)
            SS = LOOKUP(h, H)              /* returns all <r, lr> with matching h */
            for each <r, lr> in SS
                MATCHES += <|t|, ld, r, lr>
        return DECIDE(MATCHES, SIZE)       /* OOT dependent */

Figure 1: Pseudo-code for OOT
Note that an instance of an OOT is specified simply by its INS-CHUNKS, EVAL-CHUNKS, and DECIDE functions. That is, this is the only way in which OOTs differ. In particular, we will start by considering an OOT where both CHUNKS functions extract sentences, and whose DECIDE function selects the registered documents for which the fraction of matching chunks exceeds a threshold. Let COUNT(r, MATCH) be the number of tuples of the form <s, ld, r, lr> in MATCH. A registered document r is selected if COUNT(r, MATCH) is greater than the threshold times SIZE. For example, if the threshold is 0.4 and the document to check has 100 sentences, then registered documents with 41 or more matching sentences will be selected. We call this DECIDE function the match_ratio function.

In the code of Figure 1, we only store the ids of registered documents, not the full documents. The copy detection system may also store the registered documents separately. This can be useful for showing the user the matching documents and highlighting the matching chunks. Our COPS prototype does this.
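The procedures of Figure 1 translate fairly directly into Python. The sketch below is one possible rendering under stated assumptions: documents are passed as lists of units (for instance, sentences) that have already been extracted, the hash table H is a dictionary of sets, SHA-256 stands in for the unspecified HASH function, and the class and function names (`OOT`, `sentence_chunks`, `match_ratio`) are hypothetical.

```python
import hashlib
from collections import defaultdict


def sentence_chunks(units):
    """CHUNKS: one chunk per unit, returned as (text, location) tuples."""
    return [(u, i) for i, u in enumerate(units)]


def chunk_hash(text):
    """HASH: SHA-256 is an assumption; the pseudo-code leaves HASH unspecified."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def match_ratio(threshold):
    """Build a DECIDE function: select r when COUNT(r, MATCH) > threshold * SIZE."""
    def decide(matches, size):
        count = defaultdict(int)
        for _size, _loc_d, r, _loc_r in matches:
            count[r] += 1
        return {r for r, c in count.items() if c > threshold * size}
    return decide


class OOT:
    """An OOT is fixed by its INS-CHUNKS, EVAL-CHUNKS, and DECIDE functions."""

    def __init__(self, ins_chunks, eval_chunks, decide):
        self.ins_chunks = ins_chunks
        self.eval_chunks = eval_chunks
        self.decide = decide
        self.table = defaultdict(set)  # H: hash -> {(registered doc id, location)}

    def insert(self, doc_id, units):
        for text, loc in self.ins_chunks(units):
            self.table[chunk_hash(text)].add((doc_id, loc))

    def delete(self, doc_id, units):
        for text, loc in self.ins_chunks(units):
            self.table[chunk_hash(text)].discard((doc_id, loc))

    def evaluate(self, units):
        chunks = self.eval_chunks(units)
        matches = []
        for text, loc_d in chunks:
            for doc_id, loc_r in self.table.get(chunk_hash(text), ()):
                matches.append((len(text), loc_d, doc_id, loc_r))
        return self.decide(matches, len(chunks))


registered = [f"sentence {i}" for i in range(10)]
test_doc = registered[:5] + [f"new sentence {i}" for i in range(5)]

oot = OOT(sentence_chunks, sentence_chunks, match_ratio(0.4))
oot.insert("r", registered)
selected = oot.evaluate(test_doc)   # 5 matches out of 10 chunks, 5 > 0.4 * 10
```

Only document ids and locations are stored in the table, as in Figure 1; a system that wants to show the matching text to a user would keep the registered documents separately.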
3.2 Measuring Accuracy

As described earlier, OOTs, and operational tests in general, are intended for approximating violation tests such as Plagiarism and Subset. It is therefore important to evaluate how well an OOT approximates some other test. It is also important to evaluate the security of OOTs (i.e., how hard it is to subvert their copy detection) as well as their efficiency (i.e., what computational resources they require). Accuracy and security are discussed in the rest of this section; efficiency is addressed in a later section.

Assume a random registered document Y is chosen from a distribution R of registered documents. That is, R gives the probability of a particular document being registered. Similarly, assume a random test document X is selected from a distribution D of documents. We can then define the following accuracy metrics, each implicitly parametrized by R and D.

Definition 3.1 For a test t, we define freq(t) = P(t(X, Y)). (P stands for probability.)

Intuitively, freq measures how frequently a test is true. For example, suppose that D is uniform over {x1, x2} (i.e., only one of these two documents is tested, each equally likely) and R is uniform over {y1, y2, y3} (all three documents are equally likely to be registered). Further, assume that t holds for exactly three of the six possible (x, y) pairs. Then freq(t) = 3/6 = 1/2, since t holds for only these 3 out of 6 possible pairs of document choices.

If an operating test approximates a violation test well, then their freq values should be close. But the converse is not true, since the two tests can accept on disjoint sets. If the freq of the operating test is small compared to that of the violation test it is approximating, then it is being too conservative. If it is too large, then the operating test is too liberal.

Suppose we have an operating test t1 and a violation test t2. Then we define the following measures for accuracy. Note that these can also be applied between two operating tests, and in general between any two tests.

Definition 3.2 The Alpha metric corresponds to a measure of false negatives, i.e., Alpha(t1, t2) = P(not t1(X, Y) | t2(X, Y)). Note that Alpha is not symmetric. A high Alpha(t1, t2) value indicates that operating test t1 is missing too many violations of t2.

Definition 3.3 The Beta metric is analogous to Alpha, except that it measures false positives, i.e., Beta(t1, t2) = P(t1(X, Y) | not t2(X, Y)). Beta is not symmetric either. A high Beta(t1, t2) value indicates that t1 is finding too many violations not in t2.

Definition 3.4 The Error metric is the combination of Alpha and Beta in that it takes into account both false positives and false negatives, and is defined as Error(t1, t2) = P(t1(X, Y) is not equal to t2(X, Y)). It is symmetric. A high Error value indicates that the two tests are dissimilar.
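Under uniform distributions D and R, the four metrics reduce to counting over all (x, y) pairs, which the following Python sketch does. The particular pairs on which the two tests hold are made up for illustration; they are not taken from the paper, and the function names are hypothetical.

```python
from itertools import product


def all_pairs(test_docs, reg_docs):
    """Every (x, y) pair; equally likely when D and R are uniform."""
    return list(product(test_docs, reg_docs))


def freq(t, pairs):
    """freq(t) = P(t(X, Y))."""
    return sum(1 for x, y in pairs if t(x, y)) / len(pairs)


def alpha(t1, t2, pairs):
    """Alpha(t1, t2) = P(not t1 | t2); undefined if t2 never holds."""
    cond = [(x, y) for x, y in pairs if t2(x, y)]
    return sum(1 for x, y in cond if not t1(x, y)) / len(cond)


def beta(t1, t2, pairs):
    """Beta(t1, t2) = P(t1 | not t2); undefined if t2 always holds."""
    cond = [(x, y) for x, y in pairs if not t2(x, y)]
    return sum(1 for x, y in cond if t1(x, y)) / len(cond)


def error(t1, t2, pairs):
    """Error(t1, t2) = P(t1 differs from t2)."""
    return sum(1 for x, y in pairs if t1(x, y) != t2(x, y)) / len(pairs)


pairs = all_pairs(["x1", "x2"], ["y1", "y2", "y3"])

# Hypothetical truth sets: where the violation test and the operating test hold.
violation_set = {("x1", "y1"), ("x1", "y2"), ("x2", "y3")}
operating_set = {("x1", "y1"), ("x2", "y2")}


def violation(x, y):
    return (x, y) in violation_set


def operating(x, y):
    return (x, y) in operating_set


f = freq(violation, pairs)              # 3 of 6 pairs
a = alpha(operating, violation, pairs)  # 2 of 3 violations are missed
b = beta(operating, violation, pairs)   # 1 of 3 non-violations is flagged
e = error(operating, violation, pairs)  # the tests disagree on 3 of 6 pairs
```

In this made-up case the operating test is too conservative (its freq is 2/6 against 3/6) and also misplaced, which is why both Alpha and Beta are nonzero even though the freq values are not far apart.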
3.3 Security

So far we have assumed that the author of a test document does not know how our copy detection system works and does not intend to sabotage it. However, another important measure for an OOT is how hard it is for a malicious user to break it. We measure this notion of security in terms of how many changes need to be made to a registered document so that it will not be identified by the OOT as a copy.

Definition 3.5 The security of an OOT o (also applicable to any operating test) on a given document r, SEC(o, r), is the minimum number of characters that must be inserted, deleted, or modified in r to produce a new document r' such that o(r', r) is false. The higher the value of SEC(o, r), the more secure o is.

We can use this notion to evaluate and compare OOTs. For example, consider an OOT o1 that considers the entire document as a single chunk. Then SEC(o1, r) = 1 for all r, because changing a single character makes the document not detectable as a copy. (Footnote 2: This assumes a reasonable decision function, one that does not flag a violation if there are no matches. A decision function that is always true, no matter if there are matches or not, is an instance where our statement does not hold.)

As another example, consider OOT o2 that uses sentences as chunks and match_ratio with threshold q as its decision function. Then SEC(o2, r) is roughly (1 - q) SIZE, where SIZE is the number of sentences in r. For instance, if q = 0.6 and our document has 100 sentences, we need to change at least 40 of them.

As a third example, consider OOT o3 that uses pairs of overlapping sentences as chunks. For instance, if the document has sentences A, B, C, ..., then o3 considers AB, BC, CD, ... as chunks. Here we need to modify roughly half as many sentences as before, since each modification can affect two chunks. Thus SEC(o3, r) is approximately equal to SEC(o2, r)/2, i.e., o3 is approximately half as secure as o2.
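The comparison between single-sentence chunks and overlapping pairs can be checked with a small experiment. The Python sketch below counts which chunks of a modified copy still match the registered original; the synthetic sentences, the way an edit is simulated, and the function names are assumptions made for this illustration.

```python
def single_chunks(units):
    """Chunks as in o2: every sentence is a chunk."""
    return [tuple(units[i:i + 1]) for i in range(len(units))]


def pair_chunks(units):
    """Chunks as in o3: overlapping pairs AB, BC, CD, ..."""
    return [tuple(units[i:i + 2]) for i in range(len(units) - 1)]


def modify(units, positions):
    """Simulate a small edit in each sentence whose index is in `positions`."""
    return [u + " (edited)" if i in positions else u for i, u in enumerate(units)]


def surviving_fraction(original, modified, chunker):
    """Fraction of the modified document's chunks that still match the original."""
    registered = set(chunker(original))
    test = chunker(modified)
    return sum(1 for c in test if c in registered) / len(test)


original = [f"sentence {i}" for i in range(100)]

# Editing 40 sentences brings the single-sentence match ratio down to 0.6.
edited_40 = modify(original, set(range(40)))
ratio_single = surviving_fraction(original, edited_40, single_chunks)

# Editing only 20 alternating sentences breaks 40 of the 99 pair chunks.
edited_20 = modify(original, set(range(1, 40, 2)))
ratio_pairs = surviving_fraction(original, edited_20, pair_chunks)
```

Twenty well-placed edits under pair chunking lower the match ratio about as far as forty edits under single-sentence chunking, which is the "half as secure" observation in the text.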
Note that our security definition is weak because it assumes the adversary knows all about our OOT. However, by keeping certain information about our OOT secret, we can enhance security. We can model this by having a large class of OOTs, O, that vary only by some parameters, and then secretly choosing one OOT from O. We assume that the adversary does not know which OOT we have chosen and thus needs to subvert all of them. For this model we define SEC(O, r) as the number of characters that must be inserted, deleted, or modified to make o(r', r) false for all o in O. For examples of using classes of OOTs, see Section 4.2 (chunking strategy as the parameter) and a later section (the seed for the random number generator as the parameter).

Finally, notice that the security measures we have presented here do not address authorization issues. For example, when a user registers a document, how does the system ensure that the user is who he claims he is and that he actually owns the document? When a user checks for violations, can we show him the matching documents, or do we just inform him that there were violations? Should the owner of a document be notified that someone was checking a document that violates his? Should the owner be given the identity of the person submitting the test document? These are important administrative questions that we do not attempt to address in this paper.
4 Taxonomy of OOTs

The units selected, the chunking strategy, and the decision function can affect the accuracy and the security of an OOT. In this section we consider some of the options and the tradeoffs involved.

4.1 Units

To determine how documents are to be divided into chunks, we must first choose the units. One key factor to consider is the number of characters in a unit. Larger units (all else being equal) will tend to generate fewer matches and hence will have a smaller freq and be more selective. This, of course, can be compensated by changing the chunk selection strategy or the decision function.

Another important factor in the choice of a unit type is the ease of detecting the unit separators. For example, words that are separated by spaces and punctuation are easier to detect than paragraphs, which can be distinguished in many ways.

Perhaps the most important factor in unit selection is the violation test of interest. For instance, if it is more meaningful that sequences of sentences were copied rather than sequences of words (e.g., sentence fragments), then sentences, and not words, should be used as units.
4.2 Chunks
There are a number of strategies for selecting chunks. To contrast them we can consider the number of units involved, the number of hash table entries that are required for a document, and an upper bound for the security SEC(o, d). See Table 1 for a summary of the four strategies we consider. (There are also variations not covered here.) In the table, |d| refers to the number of units in the document d being chunked, and k is a parameter of the strategies. The space column gives the number of hash table entries needed for d, while the units column gives the chunk size. For our discussion we assume that documents do not have significant numbers of repeating units.

Table 1: Properties of Chunking Strategies. Columns: strategy; summary; example on ABCDEF; space; number of units; SEC.
One chunk
smallest
quaL
one
unit.
Here
every
unit
e.g.
every sentence
to
is
chin
of
k.
his
yields
he
chunks.
As with
is
units
small chunks cost
ri
tend hash
make
the frcq
are
an
OOT
is.
smaller.
The
major weakness however
the
it is
the
rnosL
high
storage
table
is
entries
required for
ri
document. depeiiding
oii
the
secure scheme
it
SECc.
to
bounded
up
to ii
by
ThaL
decision
hi
nction
may
be
necessar
alter
characters
one
per
ch
ii
to
subvert
the
OOt.
equals
One
chun1
nonoierlappmg
ii
units. use
re
In these since
Lhis seq
sLraLegy. iences
we
ou docu
break chin
the
ks.
document
It
up
inLo
lie
sequences space
at of
of
consecutive but cause
it is
nits
and
nsecu
uses
/kth
single
Strategy
will
very have
altering
ment
by
adding
this
unit
the
start
to also
no
to
matdes
high
with
the
errors.
original.
We
call
effect
phase
dependence. One
chunk
This
effect
leads
Alpha
Ic
equals units as
iii
ktnits
in
overlapping
on our
units.
Here.
we we
is
take do
every
sequence from
of
consecutive dependeitce
our
document
buL
that uses
as
chunks.
Lhe
Therefore space an
cosL
not
suffer to the
phase A.
SLraLegy O4
ujiforLunately
equivaleiiL LhaL
is
SLraLegy
Comparing
for
its
an
of
001
Strategy any
SLraLegy
see
A.
and
001
being
same Rta.o.
that
excepL
use
one
can
test
o.
hat.A /pha
his
is
or
errors
4ipha.o
true
and
iii
Dc
o4 I.
Retao
is
0/1
for
violation
is
because
nlies
true.
Thus
is
Strategy
relatively
prone
to
higher
Alpha
but
in
lower that
Beta
errors.
Also
kth
Strategy
unit of
insecure
is
though
sufficienL
more secure
to fool the
than
modifying
every
regisLered
document
system.
(D) Use nonoverlapping units, determining break points by hashing units. We start by hashing the first unit in the document. If the hash value is equal to some constant modulo k, then the first chunk is simply the first unit. If not, we consider the second unit. If its hash value equals the constant modulo k, then the first two units are the first chunk. If not, we consider the third unit, and so on until we find some unit that hashes to the constant modulo k; this identifies the end of the first chunk. We then repeat the procedure to identify the following nonoverlapping chunks.
It can be shown that the expected number of units in each chunk will be k. Thus, Strategy D is similar to B in its hash table requirements. However, unlike B, it is not affected by phase dependence, since significant portions of duplicated text will have the same break points. Like B, Strategy D should have higher Alpha errors and lower Beta errors as compared to A.
The key advantage of Strategy D is that it is very secure. It is really a family of strategies with a secret parameter, the hash function (see Section 3.3). Without knowing the hash function, one must change every unit of the test document to be sure it will get through the system without warnings.
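The four chunking strategies discussed in this section can be illustrated with a short Python sketch. This is only a sketch under stated assumptions, not the authors' implementation: the function names, the keyed BLAKE2 hash, the constant `x`, and the handling of trailing units are all hypothetical choices made for illustration.

```python
import hashlib


def fixed_chunks(units, k, step):
    """Group k consecutive units per chunk, advancing `step` units each time.

    step == k gives nonoverlapping chunks; step == 1 gives overlapping ones.
    Trailing units that do not fill a whole chunk are dropped (an assumption).
    """
    return [tuple(units[i:i + k]) for i in range(0, len(units) - k + 1, step)]


def hashed_breakpoint_chunks(units, k, x=0, secret=b"demo-key"):
    """End a chunk at every unit whose keyed hash equals x modulo k.

    `secret` plays the role of the secret parameter; x is a hypothetical constant.
    """
    chunks, current = [], []
    for unit in units:
        current.append(unit)
        digest = hashlib.blake2b(unit.encode("utf-8"), key=secret).digest()
        if int.from_bytes(digest[:8], "big") % k == x:
            chunks.append(tuple(current))
            current = []
    if current:  # assumption: leftover units form a final chunk
        chunks.append(tuple(current))
    return chunks


units = list("ABCDEF")
strategy_a = fixed_chunks(units, k=1, step=1)   # one chunk equals one unit
strategy_b = fixed_chunks(units, k=3, step=3)   # k nonoverlapping units
strategy_c = fixed_chunks(units, k=3, step=1)   # k overlapping units
strategy_d = hashed_breakpoint_chunks(units, k=3)  # hashed breakpoints
```

Because the break points in the last function depend only on the content of each unit, inserting a unit at the start of a document shifts at most the first chunk, which is the intuition behind the claim that this strategy avoids phase dependence.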
4.3 Decision Functions
There are many options for choosing decision functions. The match_ratio function of Section 3.1 can be useful for approximating Subset and Overlap tests. Another simple decision function is matches with parameter k, which simply tests whether the number of matches between the test and the registered document is above a certain value k. This would be useful for detecting violations such as Plagiarism. One might also consider ordered_matches, which tests whether there are more than a certain number of matches occurring in the same order in both documents. This would be useful if unordered matches are likely to be coincidental.
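The three decision functions named above can be sketched in Python as follows. The signatures are assumptions made for illustration. In particular, the text does not say how ordered_matches counts matches in the same order; the sketch interprets it as the longest sequence of matches whose positions increase in both documents, and it assumes each test position appears at most once in the input.

```python
from bisect import bisect_left


def match_ratio(match_count, size, th):
    """True if more than a fraction th of the `size` test chunks matched."""
    return match_count > th * size


def matches(match_count, k):
    """True if the number of matching chunks is above the value k."""
    return match_count > k


def ordered_matches(matched_positions, k):
    """True if more than k matches occur in the same order in both documents.

    matched_positions: (test_position, registered_position) pairs.
    Sorting by test position and taking the longest strictly increasing
    subsequence of registered positions is one interpretation (an assumption).
    """
    registered_order = [reg for _, reg in sorted(matched_positions)]
    tails = []  # tails[i] = smallest tail of an increasing run of length i + 1
    for reg in registered_order:
        i = bisect_left(tails, reg)
        if i == len(tails):
            tails.append(reg)
        else:
            tails[i] = reg
    return len(tails) > k
```

For example, with pairs `[(0, 5), (1, 1), (2, 2), (3, 3)]` three of the four matches appear in the same order in both documents, so `ordered_matches(pairs, 2)` holds while `ordered_matches(pairs, 3)` does not.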
Prototype and Preliminary Results
We have built a working OOT prototype to test our ideas and to understand how to select good CHUNKS and DECIDE functions. The prototype is called COPS (COpy Protection System), and the figure below shows its major modules. Documents can be submitted via email in TeX (including LaTeX), DVI, troff and ASCII formats. New documents can be either registered in the system or tested against the existing set of registered documents. If a new document is tested, a summary is returned listing the registered documents that it violates.
[Figure: Modules in COPS implementation. Modules shown: TeX to ASCII converter, DVI to ASCII converter, troff to ASCII converter, sentence identification and hashing, document registration, query processing.]
COPS allows modules to be easily replaced, permitting experimentation with different strategies (e.g., with different INSCHUNKS, EVALCHUNKS and DECIDE functions). We will begin our explanation with the simplest case (sentence chunking for both insertion and evaluation, and a match_ratio decision function) and later discuss possible improvements.
A document that is submitted to the system is given a unique document ID. This ID is used to index a table of document information such as title and author. To register a document, it must first be converted into the canonical form, i.e., plain ASCII text. The process by which this occurs is dependent upon the document format. A TeX document can be piped through the Unix utility detex, while a document with troff formatting commands can be converted with nroff. Similarly, DVI and other document formats have filters to handle their conversion to plain ASCII text.
After producing plain ASCII, we are ready to determine and hash the document's individual sentences. Using periods, exclamation points and question marks as sentence delimiters, we hash each sentence into a numeric key. The document's unique ID is then stored in a permanent hash table, once for each sentence.
When we wish to check a new document against the existing set of registered documents, we use a very similar procedure. We generate the plain ASCII, determine sentences, generate a list of hash keys, and look them up in the hash table (see Section 3.1). If more than th · SIZE sentences match with any given registered document, we report a possible violation.
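The registration and testing procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the COPS code: the class and function names are hypothetical, SHA-1 stands in for whatever hash the prototype used, and the lowercasing and whitespace normalization are added assumptions. Like the simple model in the text, it splits on periods, exclamation points and question marks and so mishandles abbreviations.

```python
import hashlib
import re
from collections import Counter, defaultdict

SENTENCE_END = re.compile(r"[.!?]")  # the three delimiters named in the text


def sentences(text):
    """Split plain ASCII text into normalized sentences (normalization is assumed)."""
    parts = (p.strip() for p in SENTENCE_END.split(text))
    return [" ".join(p.lower().split()) for p in parts if p]


def sentence_key(sentence):
    """Hash one sentence into a key (SHA-1 is a stand-in choice)."""
    return hashlib.sha1(sentence.encode("utf-8")).hexdigest()


class Registry:
    def __init__(self):
        # hash key -> IDs of the registered documents containing that sentence
        self.table = defaultdict(set)

    def register(self, doc_id, text):
        for s in sentences(text):
            self.table[sentence_key(s)].add(doc_id)

    def test(self, text, th):
        """Return {doc_id: match count} for documents exceeding th * SIZE."""
        keys = [sentence_key(s) for s in sentences(text)]
        counts = Counter()
        for key in keys:
            for doc_id in self.table.get(key, ()):
                counts[doc_id] += 1
        return {d: c for d, c in counts.items() if c > th * len(keys)}
```

A test document sharing two of its three sentences with a registered document would be reported at a threshold of 0.5, since 2 exceeds 0.5 · 3.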
5.1 Conversion to ASCII
The procedure described above is the ideal case. In practice, a number of interesting difficulties arise. We first consider some of the challenges associated with the conversion to ASCII text.
The
most important Documents
Lext
that
exact
objective
method
or troff
reducing
bec.ause
in
formatted
there
is
document some
so LexL
will
ASCII
exists.
are Tills
formatted
using
TFX
precisely
value
he
added
For
labtis tables
over
plain
exLra formaLLing graphs
Iia\
cannoL
he
represenLed
ASCII
reLa
and any
losL or
example
associated are
all
embedded
with
Ities
no the
ASCII
pri
equivalenL. structure
is
We
not
can
items
the as of
gra ph
but on
mar
be
translatable. eq
Kq nations
ta
and
pictu
diffcn
well.
implementation that cannot that
we
discard
graphs
naturally
nations
in
ble
res and choose
of to
other
pieces
all
information
represented the produce
ASCII. not change
We
the
also
discard document. and
text
formatting
commands
effect
presentation.
iLalic
but and
content
are
the
For
example
command
sequences
Lo
Lype
Font
removed
ignored. he
conversion
to
process
is
not
perfect.
If
the docu plain
to
ment
text.
in
pit format the
as
it
is
then
it
is
someti rues
impossible the
distinguish
equations
will
from
Consider exactly
Lhe
sentence
is
Let
XY
be
equal
if
answer.
wiLh
Lhe
This
sentence
Lhe to
be
translated he
ASCII
leaving
shown.
However
we
hegm
Since
to
TEX.
then
equaLion
plain
will
discarded
senLence our
Let
system
will
ci
equal Lhe would
iscnss
answer.
unable system
coiiverioii that
ASCII
match
to detect
produced occn
rred
hiferent
later
in
senLences
this section
recognize
sentence
allovs ns
we
some
enhancements Another
does not
that
matching
is
sentences
gives
despite for
is
imperfect
placing of
translations.
complication
with
is
DVI
that
it
directions
text
on
page
structures
but
it
spec.ify
what
headers the
text
part of
the main
body
and what
part
subsidiary aLLemnpL to
cafle
like
fooLnoLes
it si
page
and
text
hibliograpiHes.
in
Our DVI converter
it
clues
not
rearrange
it
Lext
ply of
considers
the
order
appears
reading
on
the
page.
left
However
to
one to1
does
handle
is
th at
two
colii
format.
in
Instead
of
characters
right
detects
to
bottom
which
gap
would corrupt most sentences and reads down
the
left
two column
and noL then the
in
format
right one.
the converter
the
inter-column
column
can
is
An
ipuL
format
ig
COPS
IL
handle
thificulL
general
is
PosLscripL.
Since
Postscript Lo plain
is
acLually Lext.
programnmn
laiiguage.
very as he
Lo
converL and
its Nl
layouL comnmands
ic
ASCII
Some
ostsc
pt
generators which
text
snch can the
ilvips
enseri.pt
rosoft
\\ord
as
prod nce
relatively
simple
pt
Postscript
from
extracted. of
However
page
bit
others maps.
snch These
Interleaf
prod nce scanned
is
Iostscri
code
which
would
require
generation
to
could This
be
with
difficult
OCR. and
optical
error Tn
cJarac.ter
recognition
analyze
and
rec.onstruc.t
the
text.
process
prone.
sn
mary the
to
approach hnt be
not
we
have
taken
with
the
CO
PS converters
ng sentences discussed
is
to
do
are
reason not
able
job
converting
identically
ASCII
still
necessarily
perfect. since
Most
match
that
later
translated to are
will
found
by
the
system.
enhancements Even
if
attempt
negate
the
Lhere
effects
of
common
he
translation
misinterpretations.
in
some
matching
so LhaL
sentences
missed
flag the
should
enough oLher we
presenL
maLches
overlapping
resulLs LhaL
documents
confirmn
COPS
can
sLill
violaLions.
Later
experimental
thIs.
5.2 Sentence Identification and Hashing
Difficult problems also arise in the sentence identification and hashing module. In particular, even if we are given correctly translated plain ASCII, it is not always clear how to extract sentences. As a first approximation, we can identify a sentence by merely taking all words up to a period, question mark or exclamation mark. However, sentences that contain embedded periods because of abbreviations (e.g., "e.g.") will be broken into multiple parts. An extension to our simple model explicitly watches for and eliminates common abbreviations such as "e.g." and "i.e." so that sentences will not be broken in this way.
Nevertheless, unexpected abbreviations will still cause difficulties. For example, given the sentences "I am a U.S. citizen." and "The U.S. is large.", our system will identify the following set of sentences: "I am a U.", "S.", "citizen.", "The U.", "S.", and "is large." Notice that the sentence "S." is identified twice. The system will flag this as a match even though the actual sentences are not the same. To reduce errors of this sort, we can disregard sentences composed of a single word; however, other similar errors may still occur. For example, since the title and author names at the head of a document rarely end with punctuation, they are also difficult to extract as sentences. We discuss some further improvements later.
Note that we have described a simple algorithm to detect sentences; if paragraph detection were needed, it would involve similar issues. COPS currently does not detect paragraphs.
The two units used in COPS are words and sentences (see Section 3.1). COPS first converts each word of the text to a hash key. The result of this is a sequence of hash keys interspersed with end-of-sentence markers. The chunking of this sequence is done by calling a procedure COMBINE(NUNITS, STEP, UNITTYPE), where NUNITS is the number of units to be combined into the next chunk, STEP is the number of units to advance for the next chunk, and UNITTYPE indicates what should be considered a unit.
hr
exam pIe
repreated Calling
calling
COMBINE
WORD
creates
seates
for
word
in
the
input
sequence.
COMBINE1
every
three
SENTENCE
words
as
chunk
while
for
each
sentence.
Using
COMBINE3
overlapping chunks. sentence
WORD
three
takes
chunk SENTENCE
COMBINE3
flexihiliLv
WORD
overlapping
produces two
word we
chunks. can
see
COMBINE
LhaL Lhis
it
would produce
for
Titus
scheme
should
he
gives
us
great that the
experimenting function described
is
with
different
it lu
CHUNKS
be used
functions. consistently
However
for
all
noted
Fh at
once
CHUNKS
ust
chosen
useful
ust
in
doc
ii
ments.
flexibility
is
only
an
experimental
setting.
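The COMBINE procedure described in this section can be sketched as follows. This is a hedged illustration rather than the COPS implementation: the marker `EOS`, the helper names, and the choice to chunk a whole key stream in one call (instead of being called repeatedly) are assumptions, and the argument values in the usage note are made up for the example.

```python
EOS = None  # hypothetical end-of-sentence marker in the stream of word hash keys


def split_units(stream, unit_type):
    """Turn the key stream into units: single words, or whole sentences."""
    if unit_type == "WORD":
        return [(key,) for key in stream if key is not EOS]
    units, current = [], []
    for key in stream:
        if key is EOS:
            if current:
                units.append(tuple(current))
            current = []
        else:
            current.append(key)
    if current:  # assumption: an unterminated final sentence still counts
        units.append(tuple(current))
    return units


def combine_all(stream, n_units, step, unit_type):
    """Yield chunks of n_units units, advancing `step` units between chunks."""
    units = split_units(stream, unit_type)
    for start in range(0, len(units) - n_units + 1, step):
        group = units[start:start + n_units]
        yield tuple(key for unit in group for key in unit)
```

With the stream `[1, 2, EOS, 3, 4, 5, EOS, 6, EOS]`, `combine_all(stream, 1, 1, "SENTENCE")` yields one chunk per sentence, `combine_all(stream, 2, 1, "SENTENCE")` yields overlapping two-sentence chunks, and `combine_all(stream, 3, 3, "WORD")` yields nonoverlapping three-word chunks.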
5.3
Exploratory
evaluaLe
ninety Lhe
Tests
of Lhe
To
of
accuracy
sysLem IVT
be corn
we
conducLed
some
exploratory
i.e...
experimenLs
like
usiiig
set
two
Iatec
not real
ASCII
intended
and
to
technica
documents
ou
to goal
papers
ply to
this
one. how
These
experiments matching
are
prehensive be expected
is
si
nderstand
man
chunks
documents
might
have and
i-50
and
how
well
in
our converters
length.
work.
The
half of
documents
Lhese or
average
approximately
are
73H
nine
words
sentences labeled
Approximately
iii
documents
Lhree
grouped wiLhtn
descri
hi
inLo
topical are
seLs
the
tables.
of
The
Lwo
documents
pa per for the
each ng
group
closely he
related.
usually
in
mulLiple
revisions topical
conference
are half Lo unrelated of
or
journal
the
same
work. our
docu
ments group
at
separate
groups
except
authors
in
affiliation
with
are
research
Stanford. Stanford
The
and
remaining not
related
the
documents
in
not
any
topical
group
drawn
from
outside
any
All
document
of these goal
is
our collecLion.
registered
in
documenLs were
to see
if
COPS.
and
Lhen
each
was
queried ments.
iates
against
Lhe
complete
set.
Our
Section
CO
PS can
determine
violation
the closely
related
docu
eva
Using
to true
the terminology
if il
of
we
group.
in
are considering
test
Related I.
that
and
of
are
in
the
same
This
will If
be
approximated
by an the
001
that
computes
will
the
percentage
to
matching
sentences Table
parLicular
his
and shows
the
number
our
if
high
documents
InsLead
Lhe of
be
assumed
Lhe
be
of
related. violations
in
resulLs
from
exploration.
reporLing
of
number
that case.
aleiLpaito us
rst
would yield
we
show
percenLage
of
maLchtng
senLeuices
each
gives he
fi
more information
result colu
in
regarding
able in
the closeness the
precent
docu
ments.
of
gives
matches
each
docu
ment
against
itself.
That
is.
for
each
the
document
values
group
iL
we
compute
for
IOOXCOUNTd
group.
NATCH/SIZE
facL LhaL
all
see
in
Section
the
first
3.1
average
are nu
and
report
in
Lhe
row
is
that
The
values
column
he
100%
hers
simply
in
confirms second
col
LhaL
liii
COPS
are for
workmg
as
properly.
fol
the
com puted
all
lows.
For each
in
docu
ment and
in
group
the
we
compute
100XCOUNTr
refer to values
MATCH/SIZE
in
other docu
as
ments
the group
since
average
results.
We
the
second
column
affinity
values
they
represent
how
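Table 2: Average number of matching sentences.
Columns: Group; Match self; Match Related Documents (Affinity); Match Unrelated Documents (Noise).
Match self: 100% for each of the nine groups.
Affinity values as extracted (group order uncertain): 71.9%, N/A, 3.6%, 42.9%, 38.4%, 63.0%, 66.0%, 3/i% (illegible), 93.3%.
Noise values as extracted (group order uncertain): 0.6%, 0.9%, 0.9%, 0.3%, 0.2%, 0.8%, 0.4%, 0.1%, 1.3%.
Total Avg: Match self 100%; Affinity 52.9% (25.16%); Noise 0.6% (2.1%).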
close
documents
p5. refer
are. to
For
nu at
the
her
third
in
column
colu of
we
as
compare
each
since
din
they
group
represent
agaThst
ii
all
in
others
grou
this
noise
are
ndesi red
matches.
The
numbers
for Lhe
reported
that
the
bottom
Table the
the
averages
over
all
document
comparisons
tests to
performed
illusLraLe Jdeally as to possible. distinguish
column.
of values.
affiniL
We
also
report
standard
deviation
between
individual
spread wants
one
values
LhaL for
ii
are
as
high
as
possible that
is
and
noise
values
LhaL
are noise
as
low
This
makes
it
possible
threshold value
related
between
the affinity that
and
levels
between
related
and
docu
ments.
Ta ble ones have
reports
related
doc
ii
ments
have
is
on
average low
is
3%
matciæng the
sentences
of
while unrelated
0.6%. used
The
here
reason
is
why
affinity
relatively
that
notion
version
Related
the to
documents
we
have
the
very
quiLe
broad.
For
example The
often noise
is
Lhe
level
journal
of
and
conference
or
versioll of
is
same
than
Fh
is
work what work
are
differenL.
0.6%
equivalent
hi
sentences sentences
so
larger
we
expected.
lv
The
discrepancy by the
caused
are quite
by seeral
ng.A
journal be
are
few
such even
as
partial
su pported
NSF
it.
common
when
Also
ad
in
articles
that by
of
unrelated
documents Hash
might
both be an
contain another
issue
iii
Other
sentences
may
noLe on
ii
also
exact
large
replicas
coincidence.. regisLered
collisions
may
iiot
facLor
especially
Lhere the
numbers
large 20
documents
in
but
In
are
our experiments
related process
relaLively
variance
reporLed
sentences. also has
the
LaMe.
parLicular
some
in
docu
ments
the
is
order of
translated
matching
to
he
by which
the
doc
ment
use
ASCII
some
produces
in
effect
on
the
noise
less
level.
For than
example
does our
translation
we
to
convert
is
7pX documents
by
differences
somewhat
inclusion of
noise
translation
from
ciLe
DVI.
the
This
caused
the
references.
Many TEX
red
unrelaLed
filLer
docurnenLs
itoL
same
in
references 1k ouLpuL by ASCII
possibly LIiev are
generaLing
in
inatcliiiig
senLences.
Our
noise
is
does
include
in
references noise are
separaLe
bib
less
files
so
iced.
ihe differences
discussed the
If
generated
translation
become
significant
when
the
enhancements The
graph BeLa
it
later level the
added
harder
to
it
our system.
is
larger or
noise
the
to say
detect
plagiarism
senLences. as
of
small passages
Lhe
e.g.
have
if
para
high
Lwo
raLe
we
set
Lhreshold
th
aL
.5/SIZE
flagged
001
would
error
Loo
many
we
as
unrelated
docurnenLs
actual
Plagiarism Alpha
violaLions
Fh us
while
it is
we
set
higher
red uce
say 10/SIZE the
noise
level
wou
Id
miss as
vioations
high
error.
portant
to
much
possible.
5.4 Enhancements
We need to decrease the noise, however, without sacrificing affinity. If affinity is too low, it makes it hard to approximate the target test Related, again leading to high Alpha or Beta errors.
With this goal in mind, we have considered a series of enhancements to the basic COPS algorithms. The results are summarized in Table 3. The first line of the table represents the base case, i.e., it is equivalent to the last row of Table 2. Each additional line represents an independent enhancement. The values reported are averages over all document groups.
Table 3: Enhancements.

Method               Match self   Match Related (Affinity)   Match Unrelated (Noise)
Simple               100%         53.0%                      0.61% (2.08)
No Common Chunks     100%         53.4%                      0.06% (0.33)
Drop Numbers         100%         54.1%                      0.47% (1.34)
No Short Sentences   100%         51.8%                      0.04% (0.21)
No Short Words       100%         54.4%                      0.36% (0.93)
All Enhancements     100%         53.6%                      0.03% (0.23)
In
ti
IJie
110
0111111011
chunks
enlianceinenL function
chunks
occurr
re
ig his
in
our
hash
Lable
more than mon
ph
ten rases
riles
are eliminated
by the LOOKUP
using
is
see
teigu
1.
rn
keeps
legitimate
his
corn
and by
passages the
from whicJ
ca
docu
in
ment
violation.
Ibr
exa
pie the be
sentence
as
work The
su pported last three
NSF
wiLit
present the
digiL
many
documents.
will
not
reported stream.
arbiLIarily
match.
enhancements word
fewer were
remove
nunieric sitorL
indic.ated
is
occ.urrenc.e sitorL Lo nu
II
from
the
input
are
For
drop numbers
Lo
any
or
dropped
are defined
sentences
Lliree
defined
have
Lhree
words
motivated
in
words
oti
have bers
or
fewer
characters.
These words
like
enltancemenLs were
sorneti
by
discovery matches.
that
Iteca
iii
short
sentences with
and
short
mes
in
involved
incorrect
tile
problem
bbreviations
VS.
One
described
Section
5.2.
last
The
Jie
row
of
Table
shows
are note
Lite
effect
of
using aL
all
enltancemenLs
Jie noise used
at while for well
once.
caii
see
that at
combi
ted
enhancements
levels.
quiLe
effective
reducitg
values
keepitg the
for
Jie
affinfly
roughly the
the
same
of
We
that
the
parameter chunk
we
enhancements
our
collection
e.g. but
number
occurrences
to
that
for
makes
larger of of
common
the
worked
probably
In
have
be
adjusted study the
collections. increasing
Figure.3we
any
of
Lite
effec.t
number
line
of
overlapping
the the
sentences
noise as
per
chunk
wiLliouL
of as
Lite
enhancements
Ltble
in
Tue
chunk
Fh
solid
shows see
average
funcLion
number
nu nt
iii
of of
overlapping overlapping detecta
sentences sentences
ble. teigu If
.A5
is is
we
noise
it
decreases decreases
that
dramaLically the
is
the
ber
grows.
beneficial effective noise for
is
since
mini
mu
amuu
noise
of
plagiarism three
re3 we
as
shows
an
noise
curve
the
average variable
plus
standard
Lite
deviations. noise Lo
assume that
lower exaniple
he
less
normally
Lliresliold
distributed
in
we
can
of
inLerpreL the at
false
effecLive
curve
bound
if
Jie
order Lo eliminaLe chunks
99%
posiLives
due the
in
noise.
For
we
use IA.
will
three
senLence
and
seL
our
threshold
cb
0.01
as
then
Beta
error
will
than
error
However
in
described
Section for
4.2
the
Alpha be
increase detec.t
as
we
corn of
bi
ne
sentences
chunks.
This
mean
that
Also. Lo
instance
security
iL
we
of as
will
unable
is
to
plagiarism Section
multiple.
it
non
fewer
contiguous changes
Lo
sentences.
the
the
system
reduc.ed
1.2
takes
documenL
make
pass
new
5.5
Effect
issue
is
of Converters we
investigate
is
final
the
in
impact
of
different the
input
converters. of the for
For
example document
say
Latex
document
by
find
initially Lhie
regisLered
COPS. Later
the
DVI
is
versioll
same
produced
like
running that
original
through clearl
Latex
processor the
registered
submiLLed latex
tesLing.
We
the
would
Lo has
the
VT
copy
matches
original
and
VT
copy
[Figure 3: Noise as a function of the number of overlapping sentences. Plot title: The effect of chunk size on document noise. Curves: average noise, effective noise. Horizontal axis: number of sentences per chunk.]
similar
number
of
matcJes
Lids
with
other
firsL
documents
row
is
as Lhe of
the
original
would
have
had.
the
Table
for are the as
explores
LitaL
issue.
all
The
the
for
basic
3.
COPS
The
Self
algoriLlun
firsL Lhird
second
fifth
row
is
version before of
includes
enliancemenLs
for reference.
Ltble
and
reports
columns average ihe
is
and
are
only
included
The
ment
is
Altered
corn
column
its
the
precent
matching
sentences
when
the Latex
so
docu average
pared
against
latex
original.
Altered compared
Lo
Related
Lo
all
column
Lhe
gives
percent
matching
Lhe
sentences
results as
when
are far Lo
DVI document
from
iLs
of
relaLed
documeiits
LhaL the
AlLhough
can he
perfecL
Lhere
remait
its
enough
original
matches
DVI
flagged
relaLed
original
and
Lo
docu
ments
was
related
to.
Table 4: Results for mechanically altered documents.

Match      Self   Altered Self   Related Group   Altered Rel.   Unrelated
Simple     100%   60.9%          52.9%           36.0%          0.50%
Enhanced   100%   76.5%          53.6%           16.2%          0.03%
We
insight
believe into
that
the
resu Its of
presented
in
this value
section
for
although
at
not
definitive
provide
target
some
test.
the
selection of
good threshold
0.05
COPS
due
least to
for the identify
Related
the also
will
threshold
of relaLed
value
say
while of
25
out
of
500
sentences
violations
seems
Lo
of
vast
majority
that
docurnenLs plagiarism
either high
not
10 Fleta
Lriggering or
less
false
noise.
We
conclude be
quite
deLecLiig without
abouL
or
sentences
roughly
2%
documenLs
hard
Alpha
errors.
Approximating OOTs
In this section we address the efficiency and scalability of OOTs. For copy detection to scale well, we require that it can operate with very large collections of registered documents, as well as be able to quickly test many new documents. One effective way to achieve scalability is to use sampling.
To illustrate, say we have an OOT with a DECIDE function that tests whether more than 15 percent of the chunks of a document d match. Instead of checking all chunks in d, we could simply take, say, 20 random chunks and check whether more than 15% of them matched. We would expect that this new OOT, based on sampling, approximates the original OOT. If the average test document contains 1000 chunks, we will have reduced our evaluation time by a factor of 50. The cost, of course, is in accuracy, and is analyzed in Section 6.1.
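The sampling idea in this illustration can be sketched in Python. The function names and the representation of the registered chunks as a set are assumptions for illustration; the sketch samples without replacement, which the text does not specify.

```python
import random


def full_decide(test_chunks, registered_chunks, th):
    """Original test: more than a fraction th of all test chunks match."""
    hits = sum(1 for c in test_chunks if c in registered_chunks)
    return hits > th * len(test_chunks)


def sampled_decide(test_chunks, registered_chunks, th, n, rng=random):
    """Approximate test: check only n randomly selected test chunks.

    The same threshold fraction th is applied to the sample size.
    """
    sample = rng.sample(test_chunks, min(n, len(test_chunks)))
    hits = sum(1 for c in sample if c in registered_chunks)
    return hits > th * len(sample)
```

With a 1000-chunk test document and `n=20`, the sampled version performs 20 lookups instead of 1000, which matches the factor of 50 mentioned above. The price is that a document near the threshold may be classified differently from one run to the next.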
Lo only
AnoLher
iii
sampling
Lable of
opLion
sample
of
registered for
documenLs. each
regisLered
The
idea
is
inserL
our
hash
random
the and chunks
find
sample
are
chunks
document.
are
Si
For
all
example
100 chunks doc
ii
say
of
that
only docu
10%
hashed. with be
Next
suppose
that
we
checking the
new was
ment
matches should
registered equivalent to
docu 20
ment. under
as
of
nec
registered
ment
sampled. the be
Lable
ist
ri
these
matches the
Lhe
the
original In
OOT.
this
Sinc.e
20/100 savings smaller
exceed
15%
sLorage
also Piited
threshold
document
hash
to Lable
would
will
be have
IL
flagged
violation.
Lite
case
the
would
hash
in
space
IL
only
10%
regisLered
chunks can
makes
fashion
possible
disLribuLe cost
is
to of
other siLes accu
rac.r.
so
that
copy
deLection
he
done
Again
option
is
the
to
loss
third
sampling
in
sample
both the
at
registration
and
at
testing for
time.
Due
to
space note
limitations
LhaL here. Lhe aiid
this
paper
Lhe are
we
only
consider
aL
first
option
is
sampling
testing. to
However we
will
analysis the
for
sampling analogous.
regisLration
time
almosL
idenLlcal
whaL
present
results
We
start
by giving any
more
precise
definition
of the
sampling
at
testing
strategy.
We
are
given
an
001
Its
oj
with with
chunking
functions Section
evaluation.
INSCHUNKSI
EVALCHUNKSI
second
is
and
the
match_ratioDECIDEl
to
function
threshold function
for
3.1.
We
define
001
02
intended
approximate
o.
c.hunking
EVALCHUNKS2
simply
EVALCHUNKS2
EVALCHUNKS1r
return
where
RANDONSELECTN
picks
RANDOMSELECT
i.e.
chunks
aL
random.
Tue
chunking
function
for
inserLions
is
not
changed
he
INSCHUNKS2
hi
INSCHUNKSI.
selects
DECIDEI
is
nction
of oj
docu
lv
ments chunks
where the
are tested
nu
ii
her
of
matching
so
chunks
COUNT
nu
ii
MATCH
of
greater
is
than
thSIZE.
For 02 on
selects
not
SIZE
the threshold chunks
her
chunks
is
oN.
Thus DECIDE2
than
documents
where the number
of matching
COUNTr
MATCH
greater
N. Randomized
how
let
6.1
Accuracy
we
pit wish doc
ii
of
OOTs
is
Now
of
in
Lo
determine and
differenL be
from
of cJocu
As
in
SecLion doc
ii
3.2
let
he
Jet.Jc
our disLribuLion
be
ments
the
he
distribution
registered
ments.
1.
random
he
docti
ment
that
of
follows
and according
to
random
ment
that
in
follows
Let
rn
in
the
proportion
let
chllnks the
01s cimnking
function
LhaL
function
whicJ
i.e.
match
cJunks
Y.
Then
Tta
the
1r2 of
Wd
be
prohahiliLy LhIs are
density
mX
02
are as
Using
we can
in
compuLe
Aiphao1
the
results
Be
Pxi
mX
The
x2
details
02
and
Erroroi. 02.
computation
Appendix
follows
J6
/7/Ui
WxQxdx
TtTadi
f/
TV1
f7 The
code
Qxdx
Bctaoi.o2
VVxdx
is
4This
is
not
the
most
efficient
way
to
sample.
just
for
explanation
purposes.
[Figure 4: An exaggerated W(x) function (plot omitted).]
Error
or
02
WxQdx
jxRi
Wi
Qxdx
where
Qx
jo
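To make the accuracy cost of sampling concrete, the following sketch computes the probability that a sampled test flags a document when a proportion x of the test document's chunks match a registered document. It assumes that each of the N sampled chunks matches independently with probability x (for example, sampling with replacement from a long document); this is a simplifying assumption for illustration and is not claimed to be the exact expression used in the paper. The function name is hypothetical.

```python
from math import comb, floor


def prob_sample_flags(x, n, th):
    """P(more than th * n of n sampled chunks match), under independence.

    This is the upper tail of a Binomial(n, x) distribution.
    """
    need = floor(th * n) + 1  # smallest match count strictly above th * n
    return sum(comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(need, n + 1))
```

The rounding in `need` explains why accuracy need not improve monotonically with the sample size: with th = 0.4, a sample of 9 chunks needs 4 matches while a sample of 10 needs 5, so a document with x = 0.5 is flagged with probability about 0.75 for n = 9 but only about 0.62 for n = 10.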
6.2
Results we
can evaluate
it is
F3efore
tells
on
to
expressions have
we
need
of
to
now
the
fVa
distri
bijtiori
Itecal
that doc
ii
us
how
likely
proportion
matches
given
between
of
test
and but
registered
ment.
One
he
option
would
Lo LhaL
be
to
measure
Wx
for
body
use
documents.
then
our results
leLs
would
specific varieLv of
parLicular
body.
IjisLead.
we
parametrized
function
Litat
us
consider
scenarios. of Section
will
Using
probability
the observations
we
be
in
arrive related
at
the following
to
i/
one.
fu
nction
In
With
there
very can
high
still
Pa
the
test
docu
ment
the
registered
this case
be
noise
matches
will
which be
we
model
as
normally
distributed pe Lhe
with
mean
the
and
test
standard
deviation
is
5a
whicJ
Lo the
probably
one.
very
In
small.
case
With we
probability LhaL
document
chunks be
large
unrelated normally
as
registered
this
assume
number
would
nu
of
maLcliiiig c6 Lo
is
disLribuLed
with
related
mean
doc
ii
/L6
and
sLajidard
deviation have
norniial
a.
varying
expecL
bers of
since
we
have
seen
is
ments
tend
of
to
widely
matches.
us our
normalized
to
function
the
weighted
1.
sum
two
truncated
at
and
distributions
make
VVx
shows
Tile
Figure apparent.
sample under
of
Wx
Lhe
function
in
with range
exaggerated
parameters
0.2 represenLs of related
to Lhe
make
its
form
of
more
noise
area
the
Id
curve range
to be
Lhe
likelihood
matches
of
while
rest
the
Pa
represents
mai
to
ly
matches
docu
will
ment.
between
Tn
practice
ii
con
rse we
won
and
expect
to
much
closer
most
comparisons
be
related
documents
Given
is
Ua
be
much
smaller.
parametrized Au nportanL
the that
Wx.
issue
1-
we
Lo
can sLudv
present
is
results
that
of
show
how
good an
approximation
resulLs. of
N.
to
.5
firsL
Lhe
utumber and
that
samples
required for accuraLe
values as funcLion
Figure
0.4.
shows
Itecal
AlpIta the
02
of
Beiaoi
0.4
02.
Erroroi 02
is
for
value
means
looking
for
registered
docu
ments
whose
[Figure: The Effect of the Number of Sample Points on Accuracy. Curves: Alpha, Beta, Error.]
chunks
bec.ause 1\ote
that
match
are
10%
of
the chunks
in
of
the
test
test.
document.
This
value
for
may
have
are
been
picked
in
say
we
interested Lhe values values
Subset Figure
as
.5
target are
The
parameters monoLonlcally
10.
VVx
is
given
the figure.
the
LhaL
iii
noL
simply
to
decreasing. error the
For cause
rig
example
for
in thIs.
Alpha exam
than i.e. the
will
and Error
pie
3.6 or for
ucrfa8E
seiects
goes ments
from with
Rounding the 10
riu
For
docu
COUNT
For
ber
of
match with
ch
ks
greater than
i.e.
with
selected.
or
more
matches.
documents
that
It
COUNT
say that
greater
more
of
IL
are
Consider
now
is
test
document
by
matches
is
with
likely to
10%
to
50%
of
chunks
select
registered
it
document
Lo Lo geL
hence
hiLs.
selected
oi.
02
is
more
likely
with because
for
since
only has
IL
\ViLh
10
effect
less
select
it
wiLh
10.
only
In
one
extra sample.
of the
has
geLS
hILs.
This
ii
leads to note well
to
Lhe
higher overa
0.01.
ii
Alpha
the This
error
spite as
nonmonotonicity
For
10. relatively Lhe
it
is
portant
stays
how
below
Error decreases shows that
02
very can
rapidly
increases. well
the
Error
approximate
with
LhaL
small number
error
of
sampled
as
chunks. rapidly but
Lhis
is
Note
The
Iiowevcr
error
Alpha
say
does
noL
decrease caused by
iiot
as
serious. ratio of
is
Alpha
for
th
beyond
0.4.
20
is
mainly
tinder
Lest
documenLs
in
whose
to
maLch
the
higher gives
than
The
one
area
of
the
Wa
in
curve
In
the
vicinity
right
0.4
the
probabiiity
of hits
getting to
these
docu
ment.
these
case In are
cases the
the
sampiing
001
0i
may
not
of
not very
muster
enough
at
trigger
detection. violation test clear
if
However
of interest
this
original
DOT
in
m