The Authors Guild, Inc. et al v. Hathitrust et al
DECLARATION of Jeremy S. Goldman (redacted) in Opposition re: 100 MOTION for Summary Judgment. Document filed by Authors' Licensing and Collecting Society, Pat Cummings, Erik Grundstrom, Angelo Loukakis, Norsk Faglitteraer Forfatter- og Oversetterforening, Roxana Robinson, Helge Ronning, Andre Roy, Jack R. Salamanca, James Shapiro, Daniele Simpson, T.J. Stiles, Sveriges Forfattarforbund, The Australian Society Of Authors Limited, The Authors Guild, Inc., The Authors League Fund, Inc, Union Des Ecrivaines Et Des Ecrivains Quebecois, Fay Weldon, the Writers' Union of Canada. (Rosenthal, Edward)
UNITED STATES DISTRICT COURT
SOUTHERN DISTRICT OF NEW YORK
THE AUTHORS GUILD, INC., et al.,
Index No. 11 Civ. 6351 (HB)
HATHITRUST, et al.,
DECLARATION OF JEREMY S. GOLDMAN
JEREMY S. GOLDMAN hereby declares as follows:
I am an associate at Frankfurt Kurnit Klein & Selz, P.C., attorneys for the
Plaintiffs in the above-captioned action.
I submit this declaration in opposition to Defendants’ motion for summary
judgment. I have personal knowledge of the facts set forth in this Declaration and could testify
competently at a hearing or trial if called upon to do so.
Attached hereto as Exhibit A is a true and correct copy of an interview of John
Wilkin by Mary Minow entitled Rising into the Public Domain: The Copyright Review
Management System (CRMS) at the University of Michigan, dated September 9, 2010, which I
downloaded from the Web address http://fairuse.stanford.edu/blog/2010/09/rising-into-the-public-domain.html.
Attached hereto as Exhibit B is a true and correct copy of an article by John P.
Wilkin entitled Bibliographic Indeterminacy and the Scale of Problems and Opportunities of
“Rights” in Digital Collection Building, dated February 2011, which I downloaded from the
Web address http://www.clir.org/pubs/ruminations/01wilkin.
On January 23, 2012, I purchased a used copy of the book Good Troupers All by
Gladys Malvern by ordering it on Amazon.com. I paid $0.70 for the book plus $3.99 for
shipping and handling. Attached hereto as Exhibit C is a true and correct copy of the order
confirmation I received from Amazon.com.
With the assistance of a summer associate at the firm, I reviewed the titles of and
in some instances conducted Internet research on each of the 116 books in which Plaintiffs own
the copyright and that were digitized by Defendants without authorization. I determined that of
the 116 books that were infringed, a total of 88 (approximately 76%) may be classified as fiction
and 28 (approximately 24%) may be classified as non-fiction.
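As an arithmetic check on the percentages stated above, a minimal Python sketch follows. The variable names are illustrative only and are not part of the declaration; the counts are those given in the text.

```python
# Counts as stated in the declaration.
total_books = 116
fiction = 88
non_fiction = 28

# The two categories should account for every book.
assert fiction + non_fiction == total_books

# Shares rounded to the nearest whole percent.
fiction_pct = round(100 * fiction / total_books)
non_fiction_pct = round(100 * non_fiction / total_books)

print(f"fiction: {fiction_pct}%, non-fiction: {non_fiction_pct}%")
```

Rounded to whole percentages, the sketch yields 76% and 24%, consistent with the approximations given.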
Attached hereto as Exhibit D is a true and correct copy of
I declare under penalty of perjury that the foregoing is true and correct.
New York, New York
July 20, 2012
FAIRLY USED BLOG
Rising Into the Public Domain: The Copyright Review Management System
(CRMS) at the University of Michigan
By Mary Minow on September 9, 2010
Interview with John Wilkin, Associate University Librarian for Library Information Technology and Executive
Director, HathiTrust, and Principal Investigator for CRMS
Mary Minow: Where does CRMS fit into the scheme of other copyright tools, such as the [Determinator]?
John Wilkin: The Determinator is a good point of comparison for us. It serves as a
resource for helping someone make a determination, and what we wanted to do is actually
make determinations. The focus is on materials in our collections, across the HathiTrust
partnership. We are not so concerned about where a book comes from, because we think of
[the corpus] as a “collective collection” ... materials from across the board.
I think we did have, early on, perhaps a naive sense that we might be able to make those determinations without the
materials being in front of us, digitally or in print. We quickly concluded, though, that the only way to do the work was
to have those works in hand. And we chose to have them in hand, digitally. And the digital flow of materials drives the
Minow: When you say digitally in hand, it sounds like researchers are allowed to look at the text, the preface, etc.
Wilkin: That’s right. We have a strong authentication and authorization system, and it’s tied into the Michigan
CoSign system, but also it uses Shibboleth. So that gives us a lot of tools there. In this case, we use a two-factor
authentication for all reviewers. They have to authenticate [with a password], and they have to be, essentially, at their
desk. They can’t take their identities home and start looking at materials that are still in copyright. So it’s very much
justified by the work they’re doing.
Minow: Doesn’t Google make its own determinations of what’s in the Public Domain? Do they come up with different
determinations? Is there duplicative work going on?
Wilkin: We’re doing the 1923-1963 work.
Minow: That is, a focus on books published between 1923 and 1963. Books published in the U.S. prior to 1923 are in
the Public Domain. The Copyright Renewal Act of 1992 automatically extended the copyright terms of works published
in 1964 and later.
Wilkin: Right. So far as we know, Google is not doing the 23-63 work. Both Google and HathiTrust do a layer of very
automatic determinations. Ours is entirely automatic, based on elements in the MARC record. They have reviewers
look at materials to do some [consultation] because occasionally the bibliographic information is not reliable. That’s
the point at which we’ll look most similar, with some exceptions.
There are important areas where we deviate. We are opening up U.S. Federal Docs, post 1922. Google is considering
that now, but they have been slow to do that. They’re considering what classes of materials they’ll open up. HathiTrust
will say that U.S. government docs are, by and large, in the Public Domain.
Then we diverge. For example, we’re going to look at U.S. pre-1923 materials as in the Public Domain, and we’re going
to look at users outside the U.S. differently for materials that were published outside the U.S. Does that make sense?
Minow: Help me out here.
Wilkin: For the user in the U.S. or really for anybody in the world, we deem U.S. works pre-1923 as being in the
Public Domain. And for the user in the U.S., we also deem non-U.S. works pre-1923 as in the Public Domain. For users
outside the U.S., we are fairly conservative with non-U.S. works. I think the date we’re using now is about 1870. It’s a
rolling wall, and essentially a best guess. It would be that date for a young author who lived a long time who published
something. We use statistical probability, and we roll that wall forward every year.
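The date rules Wilkin describes in this answer can be sketched roughly as follows. This is an illustration only, not HathiTrust's actual logic: the function name and parameters are hypothetical, and the sketch assumes the rolling wall sits a fixed 140 years back (so that it is 1870 in 2010, the interview year) and that "pre-1923" and the wall both mean strictly earlier publication years.

```python
def deemed_public_domain(pub_year, published_in_us, user_in_us, current_year=2010):
    """Rough sketch of the date rules described in the interview.

    Hypothetical helper, not HathiTrust code. Only the publication-date
    rules are modeled; review, takedown, and other factors are ignored.
    """
    us_cutoff = 1923
    # Assumption: the wall is a fixed 140 years behind the current year,
    # giving 1870 in 2010 and moving forward one year per year.
    rolling_wall = current_year - 140

    if published_in_us:
        # U.S. works pre-1923: public domain for anybody in the world.
        return pub_year < us_cutoff
    if user_in_us:
        # Non-U.S. works pre-1923: public domain for users in the U.S.
        return pub_year < us_cutoff
    # Non-U.S. works for users outside the U.S.: the conservative rolling wall.
    return pub_year < rolling_wall


print(deemed_public_domain(1900, published_in_us=False, user_in_us=True))
print(deemed_public_domain(1900, published_in_us=False, user_in_us=False))
```

Under these assumptions a non-U.S. work from 1900 is treated as public domain for a U.S. user but not for a user abroad, which is the asymmetry described above.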
Minow: How do you figure out if the work was published first outside the country?
Wilkin: We primarily use the bib record of the publication. If the place of publication is outside the U.S., we assume
that it was [first published there]. Effectively we are conservative unless we get a good look at something and make an
We ingested 700,000 volumes one month, so that gives you a sense of the scale we’re working at. We’re never going to
have the resources needed to do individual sorts of this one should go here and that one should go there.
Minow: You mentioned that you’re using the Determinator, but that’s only available for Class A books. Are most of
your materials Class A books?
Wilkin: They’re all Class A books. The reviewers use the Determinator and other tools, they look at the book and they
make an assessment. They look to see that there are not embedded rights problems in making those determinations.
Minow: Inserts - photos, stories, poems - you’d almost have to read every page.
Wilkin: Well, we look at acknowledgements, not the entire book. There are going to be some cases where the
acknowledgements are not that adequate. We have an advertised takedown policy, and we’ve never been contacted
about anything that is an insert.
Minow: It takes my breath away to look at that level.
Wilkin: The insert issue is of particular concern in Congressional materials, such as materials that are inserted into
the record for hearings. We work with the assumption that these inserts are part of the public record and that they are
provided or reproduced in that context.
Minow: In Section 108(h), the copyright law gives 20 years back to libraries and archives even on the web, if not
subject to normal commercial exploitation. Here’s a chart I made, showing, for example, that libraries and
archives may make and distribute copies of works up through 1934 this year, instead of 1922. The catch is that the
works cannot be subject to an undefined “normal commercial exploitation.”
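The 1934 figure in the question above can be reproduced with a short calculation. This is a sketch under stated assumptions, not a statement of the law: it assumes a 95-year term running through the end of the calendar year for published works of this period, and it ignores the "normal commercial exploitation" condition entirely. The names below are illustrative.

```python
TERM_YEARS = 95    # assumption: 95-year term for works published 1923-1977
WINDOW_YEARS = 20  # Section 108(h) applies during the last 20 years of the term


def latest_eligible_pub_year(current_year):
    """Latest publication year whose final 20 years of term have begun.

    A work published in year Y is assumed protected through the end of
    Y + TERM_YEARS, so its final 20 years begin on January 1 of
    Y + TERM_YEARS - WINDOW_YEARS + 1.
    """
    return current_year - (TERM_YEARS - WINDOW_YEARS + 1)


print(latest_eligible_pub_year(2010))
```

For 2010, the year of the interview, this gives 1934, matching the chart Minow describes.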
Wilkin: We’re not taking advantage of that at this point.
Minow: Another thought I had, after reading Melissa Levine’s article, is that many authors of older works retain their
digital rights, because when they signed publisher agreements, digital rights were not yet contemplated. Are you taking
advantage of that? [Opening Up Content in HathiTrust: Using HathiTrust Permissions Agreements to Make Authors’
Work Available, Research Library Issues, no. 269 (April 2010): Special Issue on Strategies for Opening Up Content]
Wilkin: We’re not. We’re just testing the waters, taking baby steps. We’re only dealing with works where the rights
have reverted to the author and when the author or publisher knows they own the rights. As it turns out, we’ve had
some fairly large lump permissions. For example, in at least one case where a journal died, the journal publisher gave
us permission to open up the full run of the journal. As it turns out, a few organizations have opened up a large number
Melissa’s article is an early step for us. We haven’t gone out to seek permissions from authors, yet. But it’s most
definitely something we want to do.
Minow: The University of Michigan is a player in the OCLC pilot project, the WorldCat Copyright Evidence Registry.
Does that mean your determinations of copyright for the works you examine then feed into that Registry?
Wilkin: I think that effort is in limbo right now. We did set up a mechanism that we could share our determinations
with them. The Registry was set up to allow institutions to identify records that need to be enhanced or annotated with
information about URLs and rights, etc. In our distribution mechanism, there’s one record for every volume in the
repository at this point.
We think of OCLC as a central switching point for bibliographic info, so it seemed like a natural for them to have a
registry of copyright evidence. We were making data available to them, but in fact we have now 6 million volumes,
each identified with our either automatic or manual copyright determination, so that’s more than what OCLC would
have, I guess, aspired to do.
In the CRMS process, that’s only been tens of thousands of volumes, but someone could start with our 6 million
volumes and look for changes.
Minow: But it wouldn’t be open in the sense that someone could put their own data in, right?
Wilkin: Exactly, and the Copyright Evidence Registry was intended to be that.
Minow: Is there anything you’d like to add?
Wilkin: Well, for us, the question is “what next?” The easiest “what next” is expanding to other partners. Anne’s been
[busy]; as we laid out in the grant, she is training staff in Indiana, Minnesota and Wisconsin (just finished Wisconsin),
the three pilots, along with the Michigan staff. [Anne Karle-Zenith, Copyright Review Project Librarian]. This winter
she’ll probably incorporate staff at a California partner.
Minow: Do you see members of the public as becoming able to add notes or comments in the future?
Wilkin: We have a tagging application for bib records. Probably not a day passes when someone doesn’t say, “I think
this is in the Public Domain” or ask, “is this in the public domain?” That’s what stimulates someone to look at it. So it is
user driven now. We won’t take someone’s assertion as fact, but it provides a good starting point to do investigation.
Minow: Do you have plans to add other materials, besides “Class A” books?
Wilkin: In HathiTrust, we have much more than “Class A,” but the only ones we’re pushing into the workflow right
now are “Class A.” So that becomes a question for you, then. How would we go beyond “Class A”? How could we build
a sustainable, cost-effective system? Probably going to be something piece by piece, right?
Minow: I’ve heard that the Copyright Office is working on a retrospective conversion of the copyright registration and
renewal records of the rest of the material types, beyond “Class A books.” If they make the records available in bulk, as
they did with “Class A,” then others can set up or build on databases like Stanford’s “Determinator.”
Wilkin: Did you know that we’ve found about between 55% and 60% of our materials have been found in the public domain?
Wilkin: The numbers you see out there say, like, only [illegible]% are in copyright. Some assertions are pretty wild. There was
some early work done by the copyright office, but the law was in flux at the time. Best to have something so statistically
sound. I’m guessing that between pre-CRMS and CRMS, we’ve gone through 100,000 titles and those numbers have
held. I think we have another 400,000 titles to deal with in that period. One question we have, how many titles ARE
there in the 23-63 period? There’s just so much indeterminacy because of variation in cataloguing practice and ways of
reporting things, and so on.
Minow: Are the other 40% ones that you’ve determined are in copyright or you just can’t figure them out?
Wilkin: I think early on it was about 30% in copyright and 10% in UND (undetermined or undeterminable). Anne
found that as staff got more experienced, they were getting stuck on complicated problems, and we often found a lower
yield of public domain determinations. So Anne encouraged staff to push things to UND rather than get some finality.
So the number of UND has gone up, but the numbers in the Public Domain have stayed constant. That’s really a
workflow strategy kind of thing.
It’s exciting to get those works opened up. The surprise has come in the titles. Because of the required renewal process,
it’s stunning to see what was not renewed. The first time I encountered this was with my 13-year-old daughter, who was
doing a book report on code breakers. We found really modern materials by living mathematicians. I thought, “oh,
we’re in trouble.” Then, looking further, these were ones where renewal did not take place. Interesting to learn the
But the numbers, the numbers are really very interesting, the 60/40 sort of thing.
Minow: And yet, going forward, this is not going to be the case, because now there’s no renewal required. An anomaly
really, unless law changes again in the other direction, which doesn’t seem likely.
Wilkin: That’s something for us to ponder as a society, as a culture, that these works are overwhelmingly not on the
market. What’s happening is, without this effort, no one is able to take advantage of the information that’s there, or
only in a limited way.
Another surprise is the Committee on Institutional Cooperation, the CIC, the non-Michigan, non-Wisconsin CIC
institutions, don’t get back their in-copyright materials ... by contract with Google. I think what we ought to say is they
don’t get back those things that are putatively in copyright. With those numbers in mind, think about what are we not
able to put online because they’re assumed to be in copyright, when we know that 60% or some large percent are in the
Minow: You mean, those institutions are not getting access to the full text of their own books?
Wilkin: They stay at Google, they’re embargoed. That may change with an amended agreement, but for now, Google
doesn’t provide them back.
Minow: I thought those were called “library copies.”
Wilkin: It is important to call them “embargoed copies.” Jack Bernard, our Assistant General Counsel, has asked us to
use the term “rising into the public domain” instead of “falling into the public domain.”
Minow: That’s a good title for this interview. Thanks so much for talking with us today.
Bibliographic Indeterminacy and the Scale of Problems and
Opportunities of “Rights” in Digital Collection Building
by John P. Wilkin
The research library community has little strong or reliable data on the number of unique books
in our collections and their “rights”—for example, whether they are in the public domain or in-copyright and, if in-copyright, whether they are orphan works. At its foundation, this problem is
created by the dearth of reliable bibliographic information, or what I’ve been calling bibliographic
indeterminacy. For example, we’d like to know how large the “collective collection” of all (or
even just all North American) research libraries is, and how many unique volumes research
libraries hold in aggregate; otherwise, there’s no way to know the cost of digitizing or caring for
these materials. We’d also like to have a better handle on the question of what’s in the public
domain and, by extension, what’s in copyright. We’d like to know how many orphan works there
are, or perhaps what proportion of the digitized content we have online is likely to be orphans.
And while these questions and more are regularly part of the conversation around digital
collection building, they’re also relevant to more conventional library problems such as print
storage and particularly shared print storage. We don’t know what’s in the collective collection.
The fact is, we have little reliable data about most of these questions. There’s been
considerable speculation in the wake of the proposed Google Books settlement and even years
before, when we first considered the probable shape of the growing digital collection or the
opportunities in front of us. Our biggest impediment to getting a good bearing on questions of
the size, nature and rights status of research library collections is the simple lack of an
Efforts to Date
To answer these questions, we often turn to WorldCat, but its records are overwhelmed by
noise: the high number of unique records that represent variations in cataloging
rather than separate manifestations of a work, non-book and non-journal material masquerading
as books and journals, and items with incomplete or unreliable metadata. As a database,
WorldCat is by far the best thing we have, but its purposes long ago shifted away from
documenting the collective collection to facilitating discovery (as a data source for
WorldCat.org). Brian Lavoie and Lorcan Dempsey worked through those challenges with
admirable adroitness in their “Beyond 1923: Characteristics of Potentially In-Copyright Print
Books in Library Collections,” providing the best picture of post-1923 book publishing. Still, their
analysis is just as certainly hampered by the chaos of the WorldCat database. And while
extraordinarily helpful, much of the focus of Lavoie’s and Dempsey’s work is on the aggregate
database (i.e., everything that has been cataloged) and then to a limited extent on a few Google
digitization partners. Dempsey’s “Libraries and the Long Tail: Some Thoughts About Libraries
in the Network Age,” which provides a picture of the shape of the collections of the first Google
partners, is also worthy of note.
One of the best pieces of analysis on the likely body of orphan works is “580,388 Orphan Works
significantly on publishing statistics as well as Lavoie’s and Dempsey’s analysis. The focus on
publishing statistics highlights the fundamental problem caused by a lack of empirical data.
Cairns relies on Bowker’s publishing data when, in fact, libraries buy many works that are never
described in these types of sources. It’s likely that a sizable body of gray literature and even
some scholarly literature (e.g., some monographs in series) skews the numbers and would
create many opportunities for opening access to content. Moreover, because of the informal
nature of the publishing process for these works, many more of them may be orphans. The
numbers are indeed hard to pin down. For example, in another study, “In From the Cold: An
Assessment of the Scope of ‘Orphan Works’ and its Impact on the Delivery of Services to the
Public,” the Joint Information Systems Committee (JISC) estimated 503 UK institutions could
hold in excess of 50 million orphan works.
New Insights Through HathiTrust
Over the past two years, HathiTrust, a partnership of major research libraries working together
to ensure that the cultural record is preserved and accessible long into the future, has built a
large and representative body of materials that gives us a much more reliable empirical window
into a number of questions around books. By the end of October 2010, the collection contained
digitized versions of slightly more than 5 million monographic volumes. Work by Constance
Malpas, Roy Tennant, and others in RLG Research has demonstrated that the composition of
the HathiTrust collection is remarkably representative of research library collections. Their data,
much of it published in “Cloud-sourcing Research Collections: Managing Print in the Mass-digitized Library Environment,” by Constance Malpas, shows that the HathiTrust collection holds
a growing percentage of titles that are also held by ARL libraries: the median rate of overlap
between HathiTrust and an ARL library was 19% in June 2009, 31% in June 2010, and 33% in
December 2010. The rate of overlap is fairly consistent across all ARLs, and grows in a fairly
constant way (see Figure 1: Overlap between HathiTrust and ARL libraries). The composition of
the collection, too, shows strong signs of representativeness. The HathiTrust collection contains
more than 400 languages and, like so many ARLs, slightly fewer than 50% of the volumes are in
English: as the collection grows, many bibliographic characteristics (e.g., language, period,
subject) hold fairly constant. This large and representative collection, then, may hold the key to
understanding the general parameters of some of the problems facing us.
Fig. 1: Overlap between HathiTrust and ARL libraries
[Figure not reproduced. Title: “Academic print book collection already substantially duplicated in mass digitized book corpus.” Axis: rank in 2008 ARL Investment Index. Annotations: median duplication 19%; median duplication 31%.]
Assuming the HathiTrust collection is representative or indicative, we’ve started to analyze it for
characteristics to help us better understand the scope of the public domain, orphan works, and
copyright challenges. Before beginning, I’d like to offer a frank apology about the US-centric
analysis that follows. US copyright law affords us a relatively clear framework in which to
understand these problems. The challenges I’ll identify are not specific to readers in the United
States, though the US-specific analysis also helps us understand the problems for readers in
other countries as well.
Distribution by Date
The first and most basic piece of analysis identifies how the collection breaks out according to
the boundaries of US copyright determination. Understanding the publishing
patterns in relation to the major markers in US legislation helps clarify some of the issues that
we should address. Specifically, we want to have a clear sense of how the corpus breaks down
in the following regard: works published before 1923, those published between 1923 and 1963,
those published between 1964 and 1977, and those published after 1977. For US law and US
users, we know that something approximating the following is true:
1. All works published before 1923 can be treated as public domain for a US audience.
2. US copyright law required a copyright notice and copyright renewal for US works
published between 1923 and 1963.
3. US copyright law required only copyright notice for US works published between 1964
and 1977. (Actually, works published until 1 March 1989 are in the public domain if
published without notice and without subsequent registration within 5 years.)
4. If the work was created after 1977 and published with notice, the work was afforded
copyright protection for the life of the author plus 70 years. Thus, nearly all works
created after 1977 will be given copyright protection for decades to come.
There is considerable nuance and some tricky exceptions to all of these rules, which I won’t try
to supply here. Peter Hirtle’s “Copyright Term and the Public Domain in the United States” and
other sources provide a fuller picture.
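The four date boundaries listed above can be expressed as a simple lookup. This is a minimal sketch of the approximation as stated, not a copyright determination tool: the function name is hypothetical, it keys on publication year only (the fourth rule in the article speaks of works created after 1977), and it ignores the nuances and exceptions the article itself sets aside.

```python
def us_copyright_period(pub_year):
    """Map a publication year to one of the four periods listed above.

    Hypothetical helper for illustration; the labels paraphrase the
    article's approximate rules for US works and US users.
    """
    if pub_year < 1923:
        return "pre-1923: treated as public domain for a US audience"
    if pub_year <= 1963:
        return "1923-1963: notice and renewal required"
    if pub_year <= 1977:
        return "1964-1977: notice required, no renewal"
    return "post-1977: life of the author plus 70 years"


for year in (1910, 1950, 1970, 1985):
    print(year, "->", us_copyright_period(year))
```

A lookup like this only sorts a volume into a period; as the article notes, status within 1923-1963 still has to be determined work by work.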
As shown in Figure 2, the distribution along these dates helps refine our sense of the certain
and likely public domain. Currently, 21% of the HathiTrust book corpus was published before
1923, and another 21% was published between 1923 and 1963. These numbers both mirror
and deviate from the Lavoie and Dempsey numbers based on the Google digitization partners:
their numbers for pre-1923 were a lower 15%, though the 1923-1963 numbers were a similar
20%. The higher HathiTrust pre-1923 numbers might be explained by the focus of some
partners on digitizing public domain works; nevertheless, most of the works digitized are from
Michigan and California, both of which have digitized more comprehensively. (About 60% of
Michigan’s print collections are currently online in HathiTrust.)
Fig. 2: Breakdown of HathiTrust book corpus by publication date
[Figure not reproduced. Axis: percent of total books in HathiTrust corpus.]
Distribution of the Corpus by US and Non-US Publication
Whether a work was published in the United States also has a bearing on its copyright status,
specifically for US users. For the periods 1923-1963 and 1964-1977, a work published in the
United States is subject to different rules of copyright status interpretation than works published
outside the United States. Though we might expect significant variation in the distribution of
US versus non-US published work over the years, if only because of the relative growth of US
publishing over this vast span of time, it’s remarkably uniform in the HathiTrust collection.
Applied to each of the four periods, the breakdown is as follows:
Fig. 3: HathiTrust book corpus: US vs. non-US-published holdings
[Figure not reproduced. Axis: percent of total books in HathiTrust corpus.]
Copyright Status Determination, Pre-1923 and 1923-1963
For the sake of this discussion, we will assume all pre-1923 books are in the public domain.
This is, of course, an oversimplification and a very US-centric perspective, but I’d like to posit
this for the sake of clarity in representing these numbers. The copyright status of books
published in the United States between 1923-1963 cannot be assumed, and must be
determined for each individual work. The University of Michigan was awarded a grant by the
Institute for Museum and Library Services (IMLS) to undertake large-scale and systematic work
to determine the copyright status of works published in this period. Over the last two years,
Michigan, in collaboration with several other partners, has amassed a large and compelling
picture of the likelihood of a US work published in this period being in the public domain. Month
Review Management System (CRMS) staff find 55% of the 1923-1963 works in the corpus to be
in the public domain, either because those works never received copyright protection when they
were published, or because their initial copyright was not renewed. Mind you, this is with more
than 100,000 titles having been reviewed, not some insignificant and skewed sample. We
confirmed the reliability of this work by asking the Library of Congress Copyright Office to
analyze a random sample of our determinations. Most of the works in the remaining 45% are in
copyright, though in some cases staff could not make a determination without more data.
Consequently, we have a well-defined picture of the copyright status of works first published in
the US during this period and found in research libraries:
Fig. 4: HathiTrust book corpus: Copyright status of books published
pre-1923 and US works published 1923-1963
[Figure not reproduced. Axis: percent of total books in HathiTrust corpus.]
Moving From “Certainty” to “Speculation”: 1923-1963
The data presented thus far come with a high degree of confidence, as they are based on large
numbers of volumes or, in the case of the CRMS work, tens of thousands of determinations.
Now, however, we enter the realm of speculation. Some of this speculation makes assumptions
grounded in work that has been done elsewhere. For example, Carnegie Mellon University’s
to gain permission to use a work, and this was more likely to be the case for older works than
for more recent works (Covey, “Acquiring Copyright Permission to Digitize and Provide Open
Access to Books”). As Covey notes, “We could not find the publishers of most of the books
published between 1920 and 1930 and of almost half of the books published between 1940 and
1950. Publishers of more than a third of the books published from 1950 to 1960 and 1960 to
1970 could not be found” (p. 19). Moreover, when a rights holder could be identified and an
attempt was made to contact them, the CMU project received no response from 30-40% of the
identifiable rights holders for works published during the periods 1930—1940 and 1970—1990;
and no response from 20-30% of the rights holders from works published between 1940 and
Based on the experience of Carnegie Mellon University, let’s hypothesize the following,
recognizing that more data are needed for each characterization:
1. For non-US works published between 1923-1963, roughly 20% will be in the public
domain (e.g., because the author died before 1941, as would be the case for
determining public domain status for works published in countries like the US that has a
term of life plus 70 years). I want to be clear that I have no basis for this assertion of how
many authors died between 1923 and 1941—it’s a wild guess.
2. For all works (i.e., both US and non-US works) published between 1923-1963, we will be
able to contact only 10% of the authors, publishers, or heirs who hold rights.
3. The remaining works published between 1923 and 1963, both US and non-US, (i.e.,
35% in the United States and 70% outside of the United States) are “orphan works,” i.e.,
works in copyright where no rights holder can be identified or contacted.
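The 35% and 70% orphan-work figures in the hypotheses above follow from simple subtraction. The sketch below makes that arithmetic explicit; it assumes, as the stated results imply, that the 10% of contactable rights holders is a share of all works in the period rather than of the in-copyright works only. The function and variable names are illustrative.

```python
def orphan_share(public_domain_pct, contactable_pct):
    """Share of works left over once public-domain works and works with a
    contactable rights holder are removed (all values in percent of the
    period's works)."""
    return 100 - public_domain_pct - contactable_pct


# 1923-1963, US works: 55% public domain (CRMS finding), 10% contactable.
us_orphans = orphan_share(public_domain_pct=55, contactable_pct=10)

# 1923-1963, non-US works: 20% public domain (the article's guess), 10% contactable.
non_us_orphans = orphan_share(public_domain_pct=20, contactable_pct=10)

print(us_orphans, non_us_orphans)  # 35 70
```

The same subtraction reproduces the 60% and 80% figures the article posits for 1964-1977, where it assumes 20% contactable rights holders, 20% of US works in the public domain, and essentially no non-US works in the public domain.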
Fig. 5: HathiTrust book corpus: Public domain, in-copyright, and orphan works,
pre-1923 and 1923-1963
[Figure not reproduced. Axis: percent of total books in HathiTrust corpus.]
Speculation Amplified: 1964-1977
Through HathiTrust we’ve made a considerable effort to get a bearing on the public domain
opportunity for US works published between 1923 and 1963, but we have absolutely no data on
the copyright status of works published between 1964 and 1977. In this later period, rights
holders were required only to affix a copyright notice on published works to secure protection;
moreover, if they included a copyright notice, they were not required to renew. Rights holders
between 1964-1977 were undoubtedly more likely to be aware of the requirements for copyright
protection than rights holders in previous periods (if only because of the increased public and
legislative attention to copyright), and the lack of an additional renewal requirement probably
also means that more volumes are in-copyright. This means that the percentage of volumes in
the public domain is likely to be much smaller. We can also speculate that because the
materials are closer to being contemporary, we are more likely to be able to locate the rights
holder than would be the case for older materials. The CMU data again note that “Publishers of
more than a third of the books published from 1950 to 1960 and 1960 to 1970 could not be
found,” and response rates for those who could be located were 20-30% (p. 19).
My numbers for copyright status in the 1964-1977 period are just guesses. This is based on
very little data because we have very little data to guide our speculation. Bear with me as I posit
1. 20% of US works published between 1964 and 1977 will be in the public domain.
2. With very few exceptions, no works published outside of the United States will be in the public domain.
3. Compared with the period 1923 and 1963, we are twice as likely to be able to identify
and successfully contact authors, publishers, or heirs who hold rights for those works in
copyright (i.e., 20%).
4. The remaining works, both US and non-US, are “orphan works”—works in copyright
where no rights holder can be identified or contacted (i.e., 60% of US works and 80% of non-US works).
Fig. 6: HathiTrust book corpus: Breakdown by US/non-US and rights status,
pre-1923, 1923-1963 and 1964-1977
Percent of total books in HathiTrust corpus
Guesses, Pure Guesses: 1978 to the Present
If we had guesses and informed speculation for the periods before the present, we’re clearly
working without a net for 1978 to the present. There are copyright wrinkles here (e.g., some
governmental publications, both US and otherwise, and occasional cases where a work may be
ineligible for copyright protection or is dedicated to the public domain), but in general we can
domain 70 years after the death of the author, but there are many cases where the period of
coverage is longer (e.g., 95 years for some works of corporate authorship, a number that may
actually be lower or higher depending on circumstances). Covey notes that most of the
publishers for works in this period could be located, but only 30-40% responded to inquiries (p.
19).[2] For discussion only, let’s assume that our ability to successfully contact authors,
publishers, or heirs that hold rights will double again, reaching 40% of the works.
Fig. 7: HathiTrust book corpus: Breakdown by US/non-US and rights status for all periods
Percent of total books in HathiTrust corpus
Our data spotlight the likely scope of the public domain and the probable large role of orphans in
our bibliographic landscape. The following are some key findings of our preliminary analysis:
1. The percentage of public domain books in the collective collection—not simply the
current 5+ million books, but the collection as it expands—is unlikely to grow to more
than 33% of the total number of books we will put online. Using the numbers assembled
here, the percentage of public domain materials, not including government documents,
will be 28%.
2. The body of orphan works—works whose rights holders we cannot locate—is likely to be
extremely large, and perhaps the largest body of materials. If the guesses made here
are right, 50% of the volumes will be orphan works. This 50% is comprised as follows:
12.6% will come from the years 1923-1963, 13.6% from 1964-1977, and 23.8% from
1978 and years that follow. (The percentage of orphan works relative to all works
decreases as time passes; the number of orphan works increases in more recent years
because more works are published in later years.) Indeed, if this speculation is right, our
incomplete collection today includes more than 2.5 million orphan works, of which more than
800,000 are US orphans.
3. The likely size of the corpus of in-copyright publications for which we are able to identify
a known rights holder will be roughly the same size as, or slightly smaller than, the body
of public domain materials. Again, using these speculative numbers, they may comprise
as little as 22% of the total number of books.
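As a rough consistency check on the figures in these findings, the sketch below adds up the per-period orphan shares and the three rights categories, then applies the orphan share to the corpus. The variable names are hypothetical, and the corpus size of 5,000,000 is an assumption standing in for the “5+ million” in item 1, so the resulting count is a lower bound.

```python
from decimal import Decimal

# Orphan works as a percent of all books, by publication period (item 2).
orphan_by_period = {
    "1923-1963": Decimal("12.6"),
    "1964-1977": Decimal("13.6"),
    "1978 onward": Decimal("23.8"),
}
orphan_total = sum(orphan_by_period.values())

# The three categories named in items 1-3, in percent of all books.
shares = {
    "public domain": Decimal("28"),
    "orphan": orphan_total,
    "identifiable rights holder": Decimal("22"),
}
total_share = sum(shares.values())

# Assumed floor for the "5+ million" books currently in the collection.
corpus_size = 5_000_000
estimated_orphans = corpus_size * orphan_total / 100

print(orphan_total, total_share, estimated_orphans)
```

On these numbers the period shares add up to the 50% stated in item 2, the three categories together account for the whole corpus, and half of a collection of at least five million books is in line with the 2.5 million orphan works cited.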
Even before we are finished digitizing our collections, the potential numbers are significant and
surprising: more than 800,000 US orphans and nearly 2 million non-US orphans.
There are two important conclusions to draw from this preliminary analysis. The first and most
obvious is that we still need better data to understand the extent of the problems and
opportunities. In the coming years, HathiTrust and its partners hope to gather more data on
orphan works in various periods, and on the extent of the public domain in works published
outside the United States. Making serious progress on the matter of orphan works, however, will
probably depend on a policy framework that allows us to make use of those volumes.
Nevertheless—and this is critically important for those who wish to see reasonable uses made
of digitized book content—most of the publications we hold in our collections and put online are
likely to be those we would consider orphan works, with no clearly identifiable or contactable
rights holder. In nearly all cases, there is no economic harm to any person or organization in
opening access to these in-copyright works, and there is a great loss in not providing access to
them. Without an effective legal or policy framework that allows us to do so, a significant portion
of our cultural heritage will be underused and undervalued.
[1] Although we attempt to segregate US works that may have also been published abroad in
our automatic rights determination process, as in our copyright review process, the numbers
here make no attempt to take simultaneous publication into account.
[2] I include these numbers about a lack of response because of their possible bearing on the
absence of copyright holders. Still, it should be noted, if a rights holder does not respond, it
does not mean the rights holder does not exist.
Special thanks to Suzanne Chapman for the handsome graphics. Many people helped me
clarify the points I’m making here. Friends with copyright knowledge, including Jack Bernard,
Peter Hirtle, Melissa Levine and Anne Karle-Zenith were kind enough to set me straight on
some points. Other friends immersed in these problems, including Constance Malpas, Jeremy
York and Lynne Raughley, were generous with their feedback and corrections.
Your Order with Amazon.com
Mon, Jan 23, 2012 at 7:41 PM
Thanks for your order, Jeremy S. Goldman!
Want to manage your order online?
If you need to check the status of your order or make changes, please visit our home page at
Amazon.com and click on Your Account at the top of any page.
Jeremy S. Goldman
Jeremy S. Goldman
Order Grand Total: $4.69
Get the Amazon.com Rewards Visa Card and earn 3% rewards on your Amazon.com
Shipping Details: betterworldbooks_
Shipping & Handling:
Group my items into as few shipments as possible
Total Before Tax:
Estimated Tax To Be Collected:* $0.00
Delivery estimate: Jan. 30, 2012 - Feb. 14, 2012
1 “Good troupers all: The story of Joseph Jefferson”
Malvern, Gladys; Unknown Binding; $0.70
Sold by: betterworldbooks_
7/19/2012 5:25 PM