AMERICAN EDUCATIONAL RESEARCH ASSOCIATION, INC. et al v. PUBLIC.RESOURCE.ORG, INC.
Filing
30
REPLY to opposition to motion re #27 Amended MOTION to Compel filed on December 15, 2014, filed by AMERICAN EDUCATIONAL RESEARCH ASSOCIATION, INC., AMERICAN PSYCHOLOGICAL ASSOCIATION, INC., NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION, INC.. (Attachments: #1 Declaration in Reply of Jonathan Hudis, #2 Exhibit S to Hudis Reply Decl, #3 Exhibit T to Hudis Reply Decl, #4 Exhibit U to Hudis Reply Decl, #5 Exhibit V to Hudis Reply Decl, #6 Exhibit W to Hudis Reply Decl, #7 Exhibit X to Hudis Reply Decl, #8 Exhibit Y to Hudis Reply Decl, #9 Exhibit Z to Hudis Reply Decl, #10 Exhibit AA to Hudis Reply Decl, #11 Exhibit BB to Hudis Reply Decl, #12 Exhibit CC to Hudis Reply Decl, #13 Exhibit DD to Hudis Reply Decl, #14 Exhibit EE to Hudis Reply Decl, #15 Text of Proposed Order -Revised)(Hudis, Jonathan)
EXHIBIT W
Case No. 1:14-cv-00857-TSC-DAR
TR-CS-96-05
The rsync algorithm
Andrew Tridgell and Paul Mackerras
June 1996
Joint Computer Science Technical Report Series
Department of Computer Science
Faculty of Engineering and Information Technology
Computer Sciences Laboratory
Research School of Information Sciences and Engineering
This technical report series is published jointly by the Department of
Computer Science, Faculty of Engineering and Information Technology,
and the Computer Sciences Laboratory, Research School of Information
Sciences and Engineering, The Australian National University.
Please direct correspondence regarding this series to:
Technical Reports
Department of Computer Science
Faculty of Engineering and Information Technology
The Australian National University
Canberra ACT 0200
Australia
or send email to:
Technical.Reports@cs.anu.edu.au
A list of technical reports, including some abstracts and copies of some full
reports may be found at:
http://cs.anu.edu.au/techreports/
Recent reports in this series:
TR-CS-96-04 Peter Bailey and David Hawking. A parallel architecture for
query processing over a terabyte of text. June 1996.
TR-CS-96-03 Brendan D. McKay. autoson – a distributed batch system for
unix workstation networks (version 1.3). March 1996.
TR-CS-96-02 Richard P. Brent. Factorization of the tenth and eleventh Fermat
numbers. February 1996.
TR-CS-96-01 Weifa Liang and Richard P. Brent. Constructing the spanners
of graphs in parallel. January 1996.
TR-CS-95-08 David Hawking. The design and implementation of a parallel
document retrieval engine. December 1995.
TR-CS-95-07 Raymond H. Chan and Michael K. Ng. Conjugate gradient
methods for Toeplitz systems. September 1995.
The rsync algorithm
Andrew Tridgell
Paul Mackerras
Department of Computer Science
Australian National University
Canberra, ACT 0200, Australia
June 18, 1996
Abstract
This report presents an algorithm for updating a le on one machine
to be identical to a le on another machine. We assume that the two
machines are connected by a low-bandwidth high-latency bi-directional
communications link. The algorithm identi es parts of the source le
which are identical to some part of the destination le, and only sends
those parts which cannot be matched in this way. E ectively, the algorithm computes a set of di erences without having both les on the same
machine. The algorithm works best when the les are similar, but will
also function correctly and reasonably e ciently when the les are quite
di erent.
1 The problem
Imagine you have two les, A and B , and you wish to update B to be the same
as A. The obvious method is to copy A onto B .
Now imagine that the two les are on machines connected by a slow communications link, for example a dial up IP link. If A is large, copying A onto
B will be slow. To make it faster you could compress A before sending it, but
that will usually only gain a factor of 2 to 4.
Now assume that A and B are quite similar, perhaps both derived from the
same original le. To really speed things up you would need to take advantage
of this similarity. A common method is to send just the di erences between A
and B down the link and then use this list of di erences to reconstruct the le.
The problem is that the normal methods for creating a set of di erences
between two les rely on being able to read both les. Thus they require that
both les are available beforehand at one end of the link. If they are not both
available on the same machine, these algorithms cannot be used once you had
copied the le over, you wouldn't need the di erences. This is the problem
that rsync addresses.
The rsync algorithm e ciently computes which parts of a source le match
some part of an existing destination le. These parts need not be sent across
the link; all that is needed is a reference to the part of the destination le.
Only parts of the source le which are not matched in this way need to be sent
verbatim. The receiver can then construct a copy of the source le using the
references to parts of the existing destination le and the verbatim material.
1
Trivially, the data sent to the receiver can be compressed using any of a
range of common compression algorithms, for further speed improvements.
2 The rsync algorithm
Suppose we have two general purpose computers and . Computer has
access to a le A and has access to le B , where A and B are similar".
There is a slow communications link between and .
The rsync algorithm consists of the following steps:
1.
2.
3.
4.
5.
splits the le B into a series of non-overlapping xed-sized blocks of size
S bytes1 . The last block may be shorter than S bytes.
For each of these blocks calculates two checksums: a weak rolling"
32-bit checksum described below and a strong 128-bit MD4 checksum.
sends these checksums to .
searches through A to nd all blocks of length S bytes at any o set,
not just multiples of S that have the same weak and strong checksum
as one of the blocks of B . This can be done in a single pass very quickly
using a special property of the rolling checksum described below.
sends a sequence of instructions for constructing a copy of A. Each
instruction is either a reference to a block of B , or literal data. Literal
data is sent only for those sections of A which did not match any of the
blocks of B .
The end result is that gets a copy of A, but only the pieces of A that are
not found in B plus a small amount of data for checksums and block indexes
are sent over the link. The algorithm also only requires one round trip, which
minimises the impact of the link latency.
The most important details of the algorithm are the rolling checksum and
the associated multi-alternate search mechanism which allows the all-o sets
checksum search to proceed very quickly. These will be discussed in greater
detail below.
3 Rolling checksum
The weak rolling checksum used in the rsync algorithm needs to have the property that it is very cheap to calculate the checksum of a bu er X2 ::X +1 given
the checksum of bu er X1 ::X and the values of the bytes X1 and X +1 .
The weak checksum algorithm we used in our implementation was inspired
by Mark Adler's adler-32 checksum. Our checksum is de ned by
n
n
ak; l =
n
X X mod M
l
i
=k
1 We have found that values of S between 500 and 1000 are quite good for most purposes
i
2
X
l
bk; l = l , i + 1X mod M
i
=k
i
sk; l = ak; l + 216 bk; l
where sk; l is the rolling checksum of the bytes X : : : X . For simplicity
and speed, we use M = 216 .
The important property of this checksum is that successive values can be
computed very e ciently using the recurrence relations
k
l
ak + 1; l + 1 = ak; l , X + X +1 mod M
bk + 1; l + 1 = bk; l , l , k + 1X + ak + 1; l + 1 mod M
k
l
k
Thus the checksum can be calculated for blocks of length S at all possible
o sets within a le in a rolling" fashion, with very little computation at each
point.
Despite its simplicity, this checksum was found to be quite adequate as a
rst level check for a match of two le blocks. We have found in practice that
the probability of this checksum matching when the blocks are not equal is quite
low. This is important because the much more expensive strong checksum must
be calculated for each block where the weak checksum matches.
4 Checksum searching
Once has received the list of checksums of the blocks of B , it must search A
for any blocks at any o set that match the checksum of some block of B . The
basic strategy is to compute the 32-bit rolling checksum for a block of length S
starting at each byte of A in turn, and for each checksum, search the list for a
match. To do this our implementation uses a simple 3 level searching scheme.
The rst level uses a 16-bit hash of the 32-bit rolling checksum and a 216
entry hash table. The list of checksum values i.e., the checksums from the blocks
of B is sorted according to the 16-bit hash of the 32-bit rolling checksum. Each
entry in the hash table points to the rst element of the list for that hash value,
or contains a null value if no element of the list has that hash value.
At each o set in the le the 32-bit rolling checksum and its 16-bit hash are
calculated. If the hash table entry for that hash value is not a null value, the
second level check is invoked.
The second level check involves scanning the sorted checksum list starting
with the entry pointed to by the hash table entry, looking for an entry whose
32-bit rolling checksum matches the current value. The scan terminates when
it reaches an entry whose 16-bit hash di ers. If this search nds a match, the
third level check is invoked.
The third level check involves calculating the strong checksum for the current
o set in the le and comparing it with the strong checksum value in the current
list entry. If the two strong checksums match, we assume that we have found
a block of A which matches a block of B . In fact the blocks could be di erent,
but the probability of this is microscopic, and in practice this is a reasonable
assumption.
When a match is found, sends the data in A between the current le
o set and the end of the previous match, followed by the index of the block in
3
B that matched. This data is sent immediately a match is found, which allows
us to overlap the communication with further computation.
If no match is found at a given o set in the le, the rolling checksum is
updated to the next o set and the search proceeds. If a match is found, the
search is restarted at the end of the matched block. This strategy saves a
considerable amount of computation for the common case where the two les
are nearly identical. In addition, it would be a simple matter to encode the
block indexes as runs, for the common case where a portion of A matches a
series of blocks of B in order.
5 Pipelining
The above sections describe the process for constructing a copy of one le on
a remote system. If we have a several les to copy, we can gain a considerable
latency advantage by pipelining the process.
This involves initiating two independent processes. One of the processes
generates and sends the checksums to while the other receives the di erence
information from and reconstructs the les.
If the communications link is bu ered then these two processes can proceed
independently and the link should be kept fully utilised in both directions for
most of the time.
6 Results
To test the algorithm, tar les were created of the Linux kernel sources for two
versions of the kernel. The two kernel versions were 1.99.10 and 2.0.0. These
tar les are approximately 24MB in size and are separated by 5 released patch
levels.
Out of the 2441 les in the 1.99.10 release, 291 les had changed in the 2.0.0
release, 19 les had been removed and 25 les had been added.
A di " of the two tar les using the standard GNU di utility produced
over 32 thousand lines of output totalling 2.1 MB.
The following table shows the results for rsync between the two les with a
varying block size.2
block matches tag
size
hits
300
500
700
900
1100
64247
46989
33255
25686
20848
3817434
620013
571970
525058
496844
false data
alarms
948
64
22
24
21
5312200
1091900
1307800
1469500
1654500
written read
5629158
1283906
1444346
1575438
1740838
1632284
979384
699564
544124
445204
In each case, the CPU time taken was less than the time it takes to run
di " on the two les.3
All the tests in this section were carried out using rsync version 0.5
The wall clock time was approximately 2 minutes per run on a 50 MHz SPARC 10 running
SunOS, using rsh over loopback for communication. GNU di took about 4 minutes.
2
3
4
The columns in the table are as follows:
block size The size in bytes of the checksummed blocks.
matches The number of times a block of B was found in A.
tag hits The number of times the 16 bit hash of the rolling checksum matched
a hash of one of the checksums from B .
false alarms The number of times the 32 bit rolling checksum matched but
the strong checksum didn't.
data The amount of le data transferred verbatim, in bytes.
written The total number of bytes written by including protocol overheads.
This is almost all le data.
read The total number of bytes read by including protocol overheads. This
is almost all checksum information.
The results demonstrate that for block sizes above 300 bytes, only a small
fraction around 5 of the le was transferred. The amount transferred was
also considerably less than the size of the di le that would have been transferred if the di patch method of updating a remote le was used.
The checksums themselves took up a considerable amount of space, although
much less than the size of the data transferred in each case. Each pair of
checksums consumes 20 bytes: 4 bytes for the rolling checksum plus 16 bytes
for the 128-bit MD4 checksum.
The number of false alarms was less than 1=1000 of the number of true
matches, indicating that the 32 bit rolling checksum is quite good at screening
out false matches.
The number of tag hits indicates that the second level of the checksum
search algorithm was invoked about once every 50 characters. This is quite high
because the total number of blocks in the le is a large fraction of the size of the
tag hash table. For smaller les we would expect the tag hit rate to be much
closer to the number of matches. For extremely large les, we should probably
increase the size of the hash table.
The next table shows similar results for a much smaller set of les. In this
case the les were not packed into a tar le rst. Rather, rsync was invoked
with an option to recursively descend the directory tree. The les used were
from two source releases of another software package called Samba. The total
source code size is 1.7 MB and the di between the two releases is 4155 lines
long totalling 120 kB.
block matches tag false data
size
hits alarms
300
500
700
900
1100
3727
2158
1517
1156
921
3899
2325
1649
1281
1049
0
0
0
0
0
129775
171574
195024
222847
250073
5
written read
153999
189330
210144
236471
262725
83948
50908
36828
29048
23988
7 Availability
An implementation of rsync which provides a convenient interface similar to the
common UNIX command rcp has been written and is available for download
from ftp: samba.anu.edu.au pub rsync.
6
Disclaimer: Justia Dockets & Filings provides public litigation records from the federal appellate and district courts. These filings and docket sheets should not be considered findings of fact or liability, nor do they necessarily reflect the view of Justia.
Why Is My Information Online?