UNITED STATES DISTRICT COURT
SOUTHERN DISTRICT OF NEW YORK

THE AUTHORS GUILD, INC., et al., Plaintiffs,
- against -
HATHITRUST, et al., Defendants.

Index No. 11 Civ. 6351 (HB)

DECLARATION OF JEREMY S. GOLDMAN

JEREMY S. GOLDMAN hereby declares as follows:

1. I am an associate at Frankfurt Kurnit Klein & Selz, P.C., attorneys for the Plaintiffs in the above-captioned action.

2. I submit this declaration in opposition to Defendants’ motion for summary judgment. I have personal knowledge of the facts set forth in this Declaration and could testify competently at a hearing or trial if called upon to do so.

3. Attached hereto as Exhibit A is a true and correct copy of an interview of John Wilkin by Mary Minow entitled Rising into the Public Domain: The Copyright Review Management System (CRMS) at the University of Michigan, dated September 9, 2010, which I downloaded from the Web address http:// 10/09/rising-into-thepublic-domain.html.

4. Attached hereto as Exhibit B is a true and correct copy of an article by John P. Wilkin entitled Bibliographic Indeterminacy and the Scale of Problems and Opportunities of “Rights” in Digital Collection Building, dated February 2011, which I downloaded from the Web address

5. On January 23, 2012, I purchased a used copy of the book Good Troupers All by Gladys Malvern by ordering it on I paid $0.70 for the book plus $3.99 for shipping and handling. Attached hereto as Exhibit C is a true and correct copy of the order confirmation I received from

6. With the assistance of a summer associate at the firm, I reviewed the titles of, and in some instances conducted Internet research on, each of the 116 books in which Plaintiffs own the copyright and that were digitized by Defendants without authorization. I determined that of the 116 books that were infringed, a total of 88 (approximately 76%) may be classified as fiction and 28 (approximately 24%) may be classified as non-fiction.

7.
Rising Into the Public Domain: The Copyright Review Management System (CRMS) at the University of Michigan

Interview with John Wilkin, Associate University Librarian for Library Information Technology and Executive Director, HathiTrust and Principal Investigator for CRMS

Mary Minow: Where does CRMS fit into the scheme of other copyright tools, such as the Determinator?

John Wilkin: The Determinator is a good point of comparison for us. It serves as a resource for helping someone make a determination, and what we wanted to do is actually make determinations. The focus is on materials in our collections, across the HathiTrust partnership. We are not so concerned about where a book comes from, because we think of [the corpus] as a "collective collection" ... materials from across the board.

I think we did have, early on, perhaps a naive sense that we might be able to make those determinations without the materials being in front of us, digitally or in print. We quickly concluded, though, that the only way to do the work was to have those works in hand. And we chose to have them in hand, digitally. And the digital flow of materials drives the prioritization process.

Minow: When you say digitally in hand, it sounds like researchers are allowed to look at the text, the preface, etc.

Wilkin: That's right. We have a strong authentication and authorization system, and it's tied into the Michigan CoSign system, but also it uses Shibboleth. So that gives us a lot of tools there. In this case, we use two-factor authentication for all reviewers. They have to authenticate [with a password], and they have to be, essentially, at their desk. They can't take their identities home and start looking at materials that are still in copyright. So it's very much justified by the work they're doing.

Minow: Doesn't Google make its own determinations of what's in the Public Domain? Do they come up with different determinations? Is there duplicative work going on?

Wilkin: We're doing the 1923-1963 work.

Minow: That is, a focus on books published between 1923 and 1963. Books published in the U.S. prior to 1923 are in the Public Domain. The Copyright Renewal Act of 1992 automatically extended the copyright terms of works published in 1964 and later.

Wilkin: Right. So far as we know, Google is not doing the 23-63 work. Both Google and HathiTrust do a layer of very automatic determinations. Ours is entirely automatic, based on elements in the MARC record. They have reviewers look at materials to do some [consultation] because occasionally the bibliographic information is not reliable. That's the point at which we'll look most similar, with some exceptions. There are important areas where we deviate. We are opening up U.S. Federal Docs, post 1922. Google is considering that now, but they have been slow to do that. They're considering what classes of materials they'll open up. HathiTrust will say that U.S. government docs are, by and large, in the Public Domain.

Then we diverge. For example, we're going to look at U.S. pre-1923 materials as in the Public Domain, and we're going to look at users outside the U.S. differently for materials that were published outside the U.S. Does that make sense?

Minow: Help me out here.

Wilkin: For the user in the U.S. or really for anybody in the world, we deem U.S. works pre-1923 as being in the Public Domain. And for the user in the U.S., we also deem non-U.S. works pre-1923 as in the Public Domain. For users outside the U.S., we are fairly conservative with non-U.S. works. I think the date we're using now is about 1870. It's a rolling wall, and essentially a best guess. It would be that date for a young author who lived a long time who published something. We use statistical probability, and we roll that wall forward every year.
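
The date-based access rules described in this answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not HathiTrust's actual implementation: the function name, its parameters, and the treatment of 1870 as an exclusive cutoff are hypothetical, and the sketch covers only the automatic date rule, not any individual CRMS review.

```python
# Hypothetical sketch of the date-based rule Wilkin describes.
# Assumption: "pre-1923" means a publication year strictly before 1923.
# Assumption: the rolling wall (about 1870 at the time of the interview)
# is also an exclusive cutoff; the interview does not say.
US_CUTOFF_YEAR = 1923


def is_treated_as_public_domain(pub_year, published_in_us, user_in_us,
                                rolling_wall_year=1870):
    """Return True if the date rule alone would open the work to this user."""
    if published_in_us:
        # U.S. works before 1923 are open to anybody in the world.
        return pub_year < US_CUTOFF_YEAR
    if user_in_us:
        # Non-U.S. works before 1923 are open to users in the U.S.
        return pub_year < US_CUTOFF_YEAR
    # Non-U.S. works for users outside the U.S.: the conservative rolling wall.
    return pub_year < rolling_wall_year


if __name__ == "__main__":
    print(is_treated_as_public_domain(1900, published_in_us=False, user_in_us=True))
    print(is_treated_as_public_domain(1900, published_in_us=False, user_in_us=False))
```

Because the wall is said to roll forward every year, `rolling_wall_year` is left as a parameter rather than a constant.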

Minow: How do you figure out if the work was published first outside the country?

Wilkin: We primarily use the bib record of the publication. If the place of publication is outside the U.S., we assume that it was [first published there]. Effectively we are conservative unless we get a good look at something and make an individual determination. We ingested 700,000 volumes one month, so that gives you a sense of the scale we're working at. We're never going to have the resources needed to do individual sorts of this one should go here and that one should go there.

Minow: You mentioned that you're using the Determinator, but that's only available for Class A books. Are most of your materials Class A books?

Wilkin: They're all Class A books. The reviewers use the Determinator and other tools, they look at the book and they make an assessment. They look to see that there are not embedded rights problems in making those determinations.

Minow: Inserts - photos, stories, poems - you'd almost have to read every page.

Wilkin: Well, we look at acknowledgements, not the entire book. There are going to be some cases where the acknowledgements are not that adequate. We have an advertised takedown policy, and we've never been contacted about anything that is an insert.

Minow: It takes my breath away to look at that level.

Wilkin: The insert issue is of particular concern in Congressional materials, such as materials that are inserted into the record for hearings. We work with the assumption that these inserts are part of the public record and that they are provided or reproduced in that context.

Minow: In Section 108(h), the copyright law gives 20 years back to libraries and archives even on the web, if not subject to normal commercial exploitation. Here's a chart I made, showing, for example, that libraries and archives may make and distribute copies of works up through 1934 this year, instead of 1922. The catch is that the works cannot be subject to an undefined "normal commercial exploitation."

Wilkin: We're not taking advantage of that at this point.

Minow: Another thought I had, after reading Melissa Levine's article, is that many authors of older works retain their digital rights, because when they signed publisher agreements, digital rights were not yet contemplated. Are you taking advantage of that? [Opening Up Content in HathiTrust: Using HathiTrust Permissions Agreements to Make Authors' Work Available, Research Library Issues, no. 269 (April 2010): Special Issue on Strategies for Opening Up Content]

Wilkin: We're not. We're just testing the waters, taking baby steps. We're only dealing with works where the rights have reverted to the author and when the author or publisher knows they own the rights. As it turns out, we've had some fairly large lump permissions. For example, in at least one case where a journal died, the journal publisher gave us permission to open up the full run of the journal. As it turns out, a few organizations have opened up a large number of publications. Melissa's article is an early step for us. We haven't gone out to seek permissions from authors, yet. But it's most definitely something we want to do.

Minow: The University of Michigan is a player in the OCLC pilot project, the WorldCat Copyright Evidence Registry. Does that mean your determinations of copyright for the works you examine then feed into that Registry?

Wilkin: I think that effort is in limbo right now. We did set up a mechanism that we could share our determinations with them. The Registry was set up to allow institutions to identify records that need to be enhanced or annotated with information about URLs and rights, etc. In our distribution mechanism, there's one record for every volume in the repository at this point. We think of OCLC as a central switching point for bibliographic info, so it seemed like a natural for them to have a registry of copyright evidence. We were making data available to them, but in fact we have now 6 million volumes, each identified with our either automatic or manual copyright determination, so that's more than what OCLC would have, I guess, aspired to do. In the CRMS process, that's only been tens of thousands of volumes, but someone could start with our 6 million volumes and look for changes.

Minow: But it wouldn't be open in the sense that someone could put their own data in, right?

Wilkin: Exactly, and the Copyright Evidence Registry was intended to be that.

Minow: Is there anything you'd like to add?

Wilkin: Well, for us, the question is 'what next?' The easiest "what next" is expanding to other partners. Anne's been busy. As we laid out in the grant, she is training staff in Indiana, Minnesota and Wisconsin (just finished Wisconsin), the three pilots, along with the Michigan staff. [Anne Karle-Zenith, Copyright Review Project Librarian]. This winter she'll probably incorporate staff at a California partner.

Minow: Do you see members of the public as becoming able to add notes or comments in the future?

Wilkin: We have a tagging application for bib records. Probably not a day passes when someone doesn't say, "I think this is in the Public Domain" or ask, "is this in the public domain?" That's what stimulates someone to look at it. So it is user driven now. We won't take someone's assertion as fact, but it provides a good starting point to do investigation.

Minow: Do you have plans to add other materials, besides "Class A" books?

Wilkin: In HathiTrust, we have much more than "Class A," but the only ones we're pushing into the workflow right now are "Class A." So that becomes a question for you. Then, how would we go beyond "Class A"? How could we build a sustainable, cost-effective system? Probably going to be something piece by piece, right?

Minow: I've heard that the Copyright Office is working on a retrospective conversion of the copyright registration and renewal records of the rest of the material types, beyond "Class A books." If they make the records available in bulk, as they did with "Class A," then others can set up or build on databases like Stanford's "Determinator."

Wilkin: Did you know that we've found that between 55% and 60% of our materials are in the public domain?

Minow: Fantastic!

Wilkin: The numbers you see out there say like, only 15% are in copyright. Some assertions are pretty wild. There was some early work done by the Copyright Office, but the law was in flux at the time. Best to have something statistically sound. I'm guessing that between pre-CRMS and CRMS, we've gone through 500,000 titles and those numbers have held. I think we have another 400,000 titles to deal with in that period. One question we have, how many titles ARE there in the 23-63 period? There's just so much indeterminacy because of variation in cataloguing practice and ways of reporting things, and so on.

Minow: Are the other 40% ones that you've determined are in copyright or you just can't figure them out?

Wilkin: I think early on it was about 30% in copyright and 10% in UND (undetermined or undeterminable). Anne found that as staff got more experienced, they were getting stuck on complicated problems, and we often found a lower yield of public domain determinations. So Anne encouraged staff to push things to UND rather than get some finality. So the number of UND has gone up, but the numbers in the Public Domain have stayed constant. That's really a workflow strategy kind of thing. It's exciting to get those works opened up. The surprise has come in the titles. Because of the required renewal process, it's stunning to see what was not renewed. The first time I encountered this was with my 13 year old daughter, who was doing a book report on code breakers. We found really modern materials by living mathematicians. I thought, "oh, we're in trouble." Then, looking further, these were ones where renewal did not take place. Interesting to learn the behavioral piece. But the numbers, the numbers are really very interesting, the 60/40 sort of thing.

Minow: And yet, going forward, this is not going to be the case, because now there's no renewal required. An anomaly really, unless law changes again in the other direction, which doesn't seem likely.

Wilkin: That's something for us to ponder as a society, as a culture, that these works are overwhelmingly not on the market. What's happening is, without this effort, no one is able to take advantage of the information that's there, or only in a limited way. Another surprise is the Committee on Institutional Cooperation, the CIC, the non-Michigan, non-Wisconsin CIC institutions, don't get back their in-copyright materials ... by contract with Google. I think what we ought to say is they don't get back those things that are putatively in copyright. With those numbers in mind, think about what are we not able to put online because they're assumed to be in copyright, when we know that 60% or some large percent are in the public domain.

Minow: You mean, those institutions are not getting access to the full text of their own books?

Wilkin: They stay at Google, they're embargoed. That may change with an amended agreement, but for now, Google doesn't provide them back.

Minow: I thought those were called "library copies."

Wilkin: It is important to call them "embargoed copies." Jack Bernard, our Assistant General Counsel, has asked us to use the term "rising into the public domain" instead of "falling into the public domain."

Minow: That's a good title for this interview. Thanks so much for talking with us today.

Bibliographic Indeterminacy and the Scale of Problems and Opportunities of "Rights" in Digital Collection Building

by John P. Wilkin

February 2011

The research library community has little strong or reliable data on the number of unique books in our collections and their “rights”: for example, whether they are in the public domain or in-copyright and, if in-copyright, whether they are orphan works. At its foundation, this problem is created by the dearth of reliable bibliographic information, or what I’ve been calling bibliographic indeterminacy. For example, we’d like to know how large the “collective collection” of all (or even just all North American) research libraries is, and how many unique volumes research libraries hold in aggregate; otherwise, there’s no way to know the cost of digitizing or caring for these materials. We’d also like to have a better handle on the question of what’s in the public domain and, by extension, what’s in copyright. We’d like to know how many orphan works there are, or perhaps what proportion of the digitized content we have online is likely to be orphans. And while these questions and more are regularly part of the conversation around digital collection building, they’re also relevant to more conventional library problems such as print storage and particularly shared print storage. We don’t know what’s in the collective collection.

The fact is, we have little reliable data about most of these questions. There’s been considerable speculation in the wake of the proposed Google Books settlement and even years before, when we first considered the probable shape of the growing digital collection or the opportunities in front of us. Our biggest impediment to getting a good bearing on questions of the size, nature and rights status of research library collections is the simple lack of an authoritative bibliography.
Efforts to Date

To answer these questions, we often turn to WorldCat, but its records are overwhelmed by the noise in WorldCat: the high number of unique records that represent variations in cataloging rather than separate manifestations of a work, non-book and non-journal material masquerading as books and journals, and items with incomplete or unreliable metadata. As a database, WorldCat is by far the best thing we have, but its purposes long ago shifted away from documenting the collective collection to facilitating discovery (as a data source for Brian Lavoie and Lorcan Dempsey worked through those challenges with admirable adroitness in their “Beyond 1923: Characteristics of Potentially In-Copyright Print Books in Library Collections,” providing the best picture of post-1923 book publishing. Still, their analysis is just as certainly hampered by the chaos of the WorldCat database. And while extraordinarily helpful, much of the focus of Lavoie’s and Dempsey’s work is on the aggregate database (i.e., everything that has been cataloged) and then to a limited extent on a few Google digitization partners. Dempsey’s “Libraries and the Long Tail: Some Thoughts About Libraries in the Network Age,” which provides a picture of the shape of the collections of the first Google partners, is also worthy of note.

One of the best pieces of analysis on the likely body of orphan works is “580,388 Orphan Works significantly on publishing statistics as well as Lavoie’s and Dempsey’s analysis. The focus on publishing statistics highlights the fundamental problem caused by a lack of empirical data. Cairns relies on Bowker’s publishing data when, in fact, libraries buy many works that are never described in these types of sources. It’s likely that a sizable body of gray literature and even some scholarly literature (e.g., some monographs in series) skews the numbers and would create many opportunities for opening access to content.
Moreover, because of the informal nature of the publishing process for these works, many more of them may be orphans. The numbers are indeed hard to pin down. For example, in another study, “In From the Cold: An Assessment of the Scope of ‘Orphan Works’ and its Impact on the Delivery of Services to the Public,” the Joint Information Systems Committee (JISC) estimated 503 UK institutions could hold in excess of 50 million orphan works.

New Insights Through HathiTrust

Over the past two years, HathiTrust, a partnership of major research libraries working together to ensure that the cultural record is preserved and accessible long into the future, has built a large and representative body of materials that gives us a much more reliable empirical window into a number of questions around books. By the end of October 2010, the collection contained digitized versions of slightly more than 5 million monographic volumes. Work by Constance Malpas, Roy Tennant, and others in RLG Research has demonstrated that the composition of the HathiTrust collection is remarkably representative of research library collections. Their data, much of it published in “Cloud-sourcing Research Collections: Managing Print in the Mass-digitized Library Environment” by Constance Malpas, shows that the HathiTrust collection holds a growing percentage of titles that are also held by ARL libraries: the median rate of overlap between HathiTrust and an ARL library was 19% in June 2009, 31% in June 2010, and 33% in December 2010. The rate of overlap is fairly consistent across all ARLs, and grows in a fairly constant way (see Figure 1: Overlap between HathiTrust and ARL libraries). The composition of the collection, too, shows strong signs of representativeness.
The HathiTrust collection contains more than 400 languages and, like so many ARLs, slightly fewer than 50% of the volumes are in English: as the collection grows, many bibliographic characteristics (e.g., language, period, subject) hold fairly constant. This large and representative collection, then, may hold the key to understanding the general parameters of some of the problems facing us.

Fig. 1: Overlap between HathiTrust and ARL libraries (academic print book collection already substantially duplicated in mass digitized book corpus; median duplication 19% in June 2009 and 31% in June 2010; horizontal axis: rank in 2008 ARL Investment Index)

Assuming the HathiTrust collection is representative or indicative, we’ve started to analyze it for characteristics to help us better understand the scope of the public domain, orphan works, and copyright challenges. Before beginning, I’d like to offer a frank apology about the US-centric analysis that follows. US copyright law affords us a relatively clear framework in which to understand these problems. The challenges I’ll identify are not specific to readers in the United States, and the US-specific analysis helps us understand the problems for readers in other countries as well.

Distribution by Date

The first and most basic piece of analysis identifies how the collection breaks out according to the boundaries of US copyright determination. Understanding the publishing patterns in relation to the major markers in US legislation helps clarify some of the issues that we should address. Specifically, we want to have a clear sense of how the corpus breaks down in the following regard: works published before 1923, those published between 1923 and 1963, those published between 1964 and 1977, and those published after 1977. For US law and US users, we know that something approximating the following is true:

1. All works published before 1923 can be treated as public domain for a US audience.
2. US copyright law required a copyright notice and copyright renewal for US works published between 1923 and 1963.
3. US copyright law required only copyright notice for US works published between 1964 and 1977. (Actually, works published until 1 March 1989 are in the public domain if published without notice and without subsequent registration within 5 years.)
4. If the work was created after 1977 and published with notice, the work was afforded copyright protection for the life of the author plus 70 years. Thus, nearly all works created after 1977 will be given copyright protection for decades to come.

There is considerable nuance and some tricky exceptions to all of these rules, which I won’t try to supply here. Peter Hirtle’s “Copyright Term and the Public Domain in the United States” and other sources provide a fuller picture.

As shown in Figure 2, the distribution along these dates helps refine our sense of the certain and likely public domain. Currently, 21% of the HathiTrust book corpus was published before 1923, and another 21% was published between 1923 and 1963. These numbers both mirror and deviate from the Lavoie and Dempsey numbers based on the Google digitization partners: their numbers for pre-1923 were a lower 15%, though the 1923-1963 numbers were a similar 20%. The higher HathiTrust pre-1923 numbers might be explained by the focus of some partners on digitizing public domain works; nevertheless, most of the works digitized are from Michigan and California, both of which have digitized more comprehensively. (About 60% of Michigan’s print collections are currently online in HathiTrust.)

Fig. 2: Breakdown of HathiTrust book corpus by publication date (percent of total books in HathiTrust corpus: pre-1923, 21%; 1923-1963, 21%; 1964-1977, 18%; 1978 and later, 40%)

Distribution of the Corpus by US and Non-US Publication

Whether a work was published in the United States also has a bearing on its copyright status, specifically for US users. For the periods 1923-1963 and 1964-1977, a work published in the United States is subject to different rules of copyright status interpretation than works published outside the United States.[1] Though we might expect significant variation in the distribution of US versus non-US published work over the years, if only because of the relative growth of US publishing over this vast span of time, it’s remarkably uniform in the HathiTrust collection. Applied to each of the four periods, the breakdown is as follows:

Fig. 3: HathiTrust book corpus: US vs. non-US-published holdings (horizontal bars for each of the four periods; horizontal axis: percent of total books in HathiTrust corpus)

Copyright Status Determination, Pre-1923 and 1923-1963

For the sake of this discussion, we will assume all pre-1923 books are in the public domain. This is, of course, an oversimplification and a very US-centric perspective, but I’d like to posit this for the sake of clarity in representing these numbers. The copyright status of books published in the United States between 1923-1963 cannot be assumed, and must be determined for each individual work. The University of Michigan was awarded a grant by the Institute of Museum and Library Services (IMLS) to undertake large-scale and systematic work to determine the copyright status of works published in this period. Over the last two years, Michigan, in collaboration with several other partners, has amassed a large and compelling picture of the likelihood of a US work published in this period being in the public domain.
Month after month, Copyright Review Management System (CRMS) staff find 55% of the 1923-1963 works in the corpus to be in the public domain, either because those works never received copyright protection when they were published, or because their initial copyright was not renewed. Mind you, this is with more than 100,000 titles having been reviewed, not some insignificant and skewed sample. We confirmed the reliability of this work by asking the Library of Congress Copyright Office to analyze a random sample of our determinations. Most of the works in the remaining 45% are in copyright, though in some cases staff could not make a determination without more data. Consequently, we have a well-defined picture of the copyright status of works first published in the US during this period and found in research libraries:

Fig. 4: HathiTrust book corpus: Copyright status of books published pre-1923 and US works published 1923-1963 (legend: In Copyright, Orphans, Public Domain; horizontal axis: percent of total books in HathiTrust corpus)

Moving From “Certainty” to “Speculation”: 1923-1963

The data presented thus far come with a high degree of confidence, as they are based on large numbers of volumes or, in the case of the CRMS work, tens of thousands of determinations. Now, however, we enter the realm of speculation. Some of this speculation makes assumptions grounded in work that has been done elsewhere. For example, Carnegie Mellon University’s project to secure rights for contemporary publications was unable to reach many rights holders to gain permission to use a work, and this was more likely to be the case for older works than for more recent works (Covey, “Acquiring Copyright Permission to Digitize and Provide Open Access to Books”).
As Covey notes, ‘We could not find the publishers of most of the books published between 1920 and 1930 and of almost half of the books published between 1940 and 1950. Publishers of more than a third of the books published from 1950 to 1960 and 1960 to 1970 could not be found” (p. 19). Moreover, when a rights holder could be identified and an attempt was made to contact them, the CMU project received no response from 30-40% of the identifiable rights holders for works published during the periods 1930—1940 and 1970—1990; and no response from 20-30% of the rights holders from works published between 1940 and 1970. Based on the experience of Carnegie Mellon University, let’s hypothesize the following, recognizing that more data are needed for each characterization: 1. For non-US works published between 1923-1963, roughly 20% will be in the public domain (e.g., because the author died before 1941, as would be the case for determining public domain status for works published in countries like the US that has a term of life plus 70 years). I want to be clear that I have no basis for this assertion of how many authors died between 1923 and 1941—it’s a wild guess. 2. For all works (i.e., both US and non-US works) published between 1923-1963, we will be able to contact only 10% of the authors, publishers, or heirs who hold rights. 3. The remaining works published between 1923 and 1963, both US and non-US, (i.e., 35% in the United States and 70% outside of the United States) are “orphan works,” i.e., works in copyright where no rights holder can be identified or contacted. Fig. 
5: HathiTrust book corpus: Public domain, in-copyright, and orphan works, pre-1923 and 1923-1963
[Figure: bar chart by publication period. Share of corpus: pre-1923, 21%; 1923-1963, 21%; 1964-1977, 18%; 1978-, 40%. Horizontal axis: percent of total books in HathiTrust corpus (0 to 100). Legend: In Copyright, Orphans, Public Domain.]

Speculation Amplified: 1964-1977

Through HathiTrust we’ve made a considerable effort to get a bearing on the public domain opportunity for US works published between 1923 and 1963, but we have absolutely no data on the copyright status of works published between 1964 and 1977. In this later period, rights holders were required only to affix a copyright notice on published works to secure protection; moreover, if they included a copyright notice, they were not required to renew. Rights holders between 1964 and 1977 were undoubtedly more likely to be aware of the requirements for copyright protection than rights holders in previous periods (if only because of the increased public and legislative attention to copyright), and the lack of an additional renewal requirement probably also means that more volumes are in copyright. This means that the percentage of volumes in the public domain is likely to be much smaller. We can also speculate that because the materials are closer to being contemporary, we are more likely to be able to locate the rights holder than would be the case for older materials. The CMU data again note that “Publishers of more than a third of the books published from 1950 to 1960 and 1960 to 1970 could not be found,” and response rates for those who could be located were 20-30% (p. 19).

My numbers for copyright status in the 1964-1977 period are just guesses, because we have very little data to guide our speculation. Bear with me as I posit the following:

1. 20% of US works published between 1964 and 1977 will be in the public domain.
2. With very few exceptions, no works published outside of the United States will be in the public domain.
3. Compared with the period between 1923 and 1963, we are twice as likely to be able to identify and successfully contact authors, publishers, or heirs who hold rights for those works in copyright (i.e., 20%).
4. The remaining works, both US and non-US, are “orphan works”—works in copyright where no rights holder can be identified or contacted (i.e., 60% of US works and 80% of non-US works).

Fig. 6: HathiTrust book corpus: Breakdown by US/non-US and rights status, pre-1923, 1923-1963 and 1964-1977
[Figure: bar chart by publication period. Share of corpus: pre-1923, 21%; 1923-1963, 21%; 1964-1977, 18%; 1978-, 40%. Horizontal axis: percent of total books in HathiTrust corpus (0 to 100). Legend: In Copyright, Orphans, Public Domain.]

Guesses, Pure Guesses: 1978 to the Present

If we had guesses and informed speculation for the periods before the present, we’re clearly working without a net for 1978 to the present. There are copyright wrinkles here (e.g., some governmental publications, both US and otherwise, and occasional cases where a work may be ineligible for copyright protection or is dedicated to the public domain), but in general we can [...] domain 70 years after the death of the author, but there are many cases where the period of coverage is longer (e.g., 95 years for some works of corporate authorship, a number that may actually be lower or higher depending on circumstances). Covey notes that most of the publishers for works in this period could be located, but only 30-40% responded to inquiries (p. 19).[2] For discussion only, let’s assume that our ability to successfully contact authors, publishers, or heirs that hold rights will double again, reaching 40% of the works.

Fig. 7: HathiTrust book corpus: Breakdown by US/non-US and rights status for all periods
[Figure: bar chart by publication period. Share of corpus: pre-1923, 21%; 1923-1963, 21%; 1964-1977, 18%; 1978-, 40%. Horizontal axis: percent of total books in HathiTrust corpus (0 to 100). Legend: In Copyright, Orphans, Public Domain.]

Conclusion

Our data spotlight the likely scope of the public domain and the probable large role of orphans in our bibliographic landscape. The following are some key findings of our preliminary analysis:

1. The percentage of public domain books in the collective collection—not simply the current 5+ million books, but the collection as it expands—is unlikely to grow to more than 33% of the total number of books we will put online. Using the numbers assembled here, the percentage of public domain materials, not including government documents, will be 28%.
2. The body of orphan works—works whose rights holders we cannot locate—is likely to be extremely large, and perhaps the largest body of materials. If the guesses made here are right, 50% of the volumes will be orphan works. This 50% breaks down as follows: 12.6% will come from the years 1923-1963, 13.6% from 1964-1977, and 23.8% from 1978 and the years that follow. (The percentage of orphan works relative to all works decreases as time passes; the number of orphan works increases in more recent years because more works are published in later years.) Indeed, if this speculation is right, our incomplete collection today includes more than 2.5 million orphan works, of which more than 800,000 are US orphans.
3. The likely size of the corpus of in-copyright publications for which we are able to identify a known rights holder will be roughly the same size as, or slightly smaller than, the body of public domain materials. Again, using these speculative numbers, they may comprise as little as 22% of the total number of books.
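
The arithmetic behind the three findings above can be checked with a short sketch. The Python below is illustrative only and is not part of the original analysis. It assumes the period shares of the corpus as read from the figure labels (21%, 18% and 40%), treats 5,000,000 as a round lower bound for the "5+ million" books mentioned in the first finding, and uses hypothetical names throughout.

```python
from decimal import Decimal

# Orphan-work contributions by period, as a percent of the whole corpus
# (the figures quoted in the second finding).
ORPHAN_SHARE_BY_PERIOD = {
    "1923-1963": Decimal("12.6"),
    "1964-1977": Decimal("13.6"),
    "1978-": Decimal("23.8"),
}

# Headline shares quoted in the first and third findings (percent of corpus).
PUBLIC_DOMAIN = Decimal("28")
KNOWN_RIGHTS_HOLDER = Decimal("22")

# Assumption: each period's share of the corpus, as read from the figure labels.
PERIOD_SHARE = {
    "1923-1963": Decimal("21"),
    "1964-1977": Decimal("18"),
    "1978-": Decimal("40"),
}


def total_orphan_share():
    """Sum the per-period orphan contributions (percent of corpus)."""
    return sum(ORPHAN_SHARE_BY_PERIOD.values(), Decimal("0"))


def orphan_rate_within_period(period):
    """Implied percent of one period's books that are orphans, to 0.1%."""
    rate = ORPHAN_SHARE_BY_PERIOD[period] / PERIOD_SHARE[period] * 100
    return rate.quantize(Decimal("0.1"))


def estimated_orphans(corpus_size):
    """Number of orphan works implied for a corpus of the given size."""
    return int(corpus_size * total_orphan_share() / 100)


if __name__ == "__main__":
    print("Orphan share of corpus (%):", total_orphan_share())
    print("All categories (%):",
          PUBLIC_DOMAIN + total_orphan_share() + KNOWN_RIGHTS_HOLDER)
    print("Orphans in a 5,000,000-volume corpus:", estimated_orphans(5_000_000))
    for period in ORPHAN_SHARE_BY_PERIOD:
        print(period, "implied orphan rate (%):",
              orphan_rate_within_period(period))
```

Under these figures the three orphan contributions sum to 50%, the three headline categories sum to 100%, and a corpus of 5,000,000 volumes implies 2,500,000 orphan works, which is consistent with the "more than 2.5 million" stated in the second finding.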
Even before we are finished digitizing our collections, the potential numbers are significant and surprising: more than 800,000 US orphans and nearly 2 million non-US orphans.

There are two important conclusions to draw from this preliminary analysis. The first and most obvious is that we still need better data to understand the extent of the problems and opportunities. In the coming years, HathiTrust and its partners hope to gather more data on orphan works in various periods, and on the extent of the public domain in works published outside the United States. Making serious progress on the matter of orphan works, however, will probably depend on a policy framework that allows us to make use of those volumes. Nevertheless—and this is critically important for those who wish to see reasonable uses made of digitized book content—most of the publications we hold in our collections and put online are likely to be those we would consider orphan works, with no clearly identifiable or contactable rights holder. In nearly all cases, there is no economic harm to any person or organization in opening access to these in-copyright works, and there is a great loss in not providing access to them. Without an effective legal or policy framework that allows us to do so, a significant portion of our cultural heritage will be underused and undervalued.

Notes

[1] Although we attempt to segregate US works that may have also been published abroad in our automatic rights determination process, as in our copyright review process, the numbers here make no attempt to take simultaneous publication into account.
[2] I include these numbers about a lack of response because of their possible bearing on the absence of copyright holders. Still, it should be noted, if a rights holder does not respond, it does not mean the rights holder does not exist.

Acknowledgments: Special thanks to Suzanne Chapman for the handsome graphics. Many people helped me clarify the points I’m making here.
Friends with copyright knowledge, including Jack Bernard, Peter Hirtle, Melissa Levine and Anne Karle-Zenith were kind enough to set me straight on some points. 