DECLARATION OF NEIL R. SMALHEISER IN SUPPORT OF DEFENDANTS' MOTION FOR SUMMARY JUDGMENT

I, Neil R. Smalheiser, pursuant to 28 U.S.C. § 1746, hereby declare as follows:

1. Since August 1996, I have been a faculty member in the Department of Psychiatry, University of Illinois at Chicago, in which I teach courses and conduct research on neuroscience and information science. Currently I am Associate Professor with Tenure. I submit this declaration in support of the defendant libraries’ (the “Libraries”) motion for summary judgment. Unless otherwise noted, I make this declaration based upon my own personal knowledge.

2. I received a Bachelor of Arts degree in Mathematics from the University of Iowa in 1974 and received my MD-PhD in Medicine and Neuroscience from the Albert Einstein College of Medicine in 1982.

3. I have worked in the field of text mining since 1991. “Text mining” is the use of technology to identify and extract new pieces of information from the enormous amount of knowledge available in large bodies of text. While text generally is written for people to read, text mining does not involve reading the text; instead, it uses text in digital form as data to be analyzed and processed through algorithms, which are sets of instructions or rules applied, usually by a computer, to compute a result.

4. Text mining can be applied to many different types of uses, such as retrieving and classifying documents; identifying new, interesting or particularly controversial findings; or identifying new emerging trends. In different contexts, the techniques of text mining can be put to a variety of uses, including identifying influential experts (thought leaders) in a particular subject, predicting civil unrest in third world countries, or tracking the emergence of infectious disease outbreaks or terrorist cells.

5.
A simple example of these many uses of text mining is as follows: Assume a historian discovers an unpublished manuscript of a play written in absurdist style and suspects that it may have been written by Edward Albee or Harold Pinter. A text mining approach to this question might begin by collecting all of the known works of Edward Albee in digital form and tabulating all of the words, phrases, and punctuation marks used therein. Besides counting their individual frequencies, one can also classify them in different aggregate ways, e.g., counting the frequencies of proper names, active verbs, or mentions of geographical locations, or calculating the average difficulty of the text in terms of the grade level required to understand it. This creates an overall profile of Edward Albee, and the same can be done for the known works of Harold Pinter. The profile of the unpublished manuscript is then compared to the profiles of Edward Albee and Harold Pinter: if it is very similar to Albee’s and not to Pinter’s, this would provide evidence that Albee is the likely author. If it is not very similar to either, this would suggest that some other author entirely may be responsible for writing it.

6. In fact, I understand that a professor at Vassar College, Donald Wayne Foster, used a form of text mining to identify Joe Klein as the writer of “Primary Colors,” a thinly veiled exposé of President Clinton’s 1992 run for the presidency, which was originally published anonymously.

7. As I will discuss in more detail below, my personal experience in text mining has mostly been in the biomedical field.
However, text mining processes and methods could be employed to conduct research over digital textual material of virtually any subject matter to discover new relationships, trends, correlations, and other information that may not be recognized through manually reading the texts, or that may only become apparent upon analysis of such a vast dataset that it would be virtually impossible to realize through reading.

8. I have published more than 90 peer-reviewed publications, of which more than 20 concern text mining. I have received five research grants for text mining from the National Institutes of Health (NIH) and private foundations. I have been a member of the program committee of many international conferences on medical informatics, am a member of eight journal editorial boards, and have been in leadership roles in prominent professional societies including the American Medical Informatics Association, Association for Computing Machinery, American Society of Information Science and Technology, and Society for Neuroscience. I have served on numerous grant review panels for NIH and the National Science Foundation (NSF). Attached as Exhibit A is a true and correct copy of my most recent curriculum vitae.

9. I have been asked by Kilpatrick Townsend & Stockton LLP to describe certain of the types of research that can be performed using a digital repository of works such as the repository of works offered by the Libraries through the HathiTrust Digital Library (“HDL”).

10. In working on this assignment, to date, I have read and/or referred to the HathiTrust website at

The Emerging Field of Text Mining

11. The studies of one of my mentors, Dr. Don Swanson, during the period 1986 to 1993 were an early impetus for the development of automated text mining research processes and their application in the biomedical field. Dr.
Swanson developed the technique of combining separate statements, found in separate works, together to form new statements that represent new scientific hypotheses.

12. For example, suppose the statement “A affects B” appears in one work, and the statement “B affects C” appears in another work. These two works may have been published in different years by different authors, in different medical sub-fields, and no one person may have even read both of them. However, juxtaposing and viewing both statements together, one may well infer the possibility that “A affects C,” and that statement might be novel and potentially represent an important scientific discovery.

13. Dr. Swanson used this type of procedure to propose several significant medical hypotheses that were subsequently tested and confirmed clinically. For example, he proposed that fish oil supplementation would ameliorate Raynaud’s syndrome[1] and that magnesium supplementation would ameliorate migraine headaches.[2]

[1] Swanson DR. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect Biol Med. 1986 Autumn;30(1):7-18. Raynaud syndrome is a disorder, believed to be the result of decreases in the blood supply to parts of the body, that causes pain to and discoloration of the fingers, toes, and other areas. In some cases, the effects can be more significant, including necrosis and gangrene.

14. Dr. Swanson’s early studies employing this technique were carried out by hand, reading numerous articles and identifying patterns. While a researcher might be able to identify a few “A – B – C” correlations of this type manually by reading articles or other texts, Dr. Swanson and I quickly realized that through computers it is possible to search through thousands of articles to identify a large number of potentially new scientific hypotheses. Such automated search processes carry the hope of discovering correlations that individuals could not discover without computers.

15. Dr.
Swanson and I created one such computer program together, called Arrowsmith,[3] which was designed to consider data in the bibliographic records for biomedical articles in medical databases (e.g., the PubMed database[4]), and which, given a topic A, would identify topics C that were likely to be related to it, on the basis that both topic A and topic C have some relationship to a common topic B. Arrowsmith used article bibliographic records to identify these “A – B – C” correlations where no articles explicitly mentioned A and C together.

[2] Swanson DR. Migraine and magnesium: eleven neglected connections. Perspect Biol Med. 1988 Summer;31(4):526-57.
[3] Swanson DR, Smalheiser NR. An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artificial Intelligence 1997; 91: 183-203.
[4] The PubMed database consists of bibliographic data concerning ~20 million biomedical articles (including author names, title, abstract, affiliation, Medical Subject Headings, etc.). (No full-text articles are contained within the PubMed database.) Public users can query the PubMed database freely at, or can apply for a relatively unrestricted license to download the entire database and manipulate the data locally on their own computers.

16. Arrowsmith operated by first running searches for a topic A (e.g., Huntington’s Disease) and retrieving the bibliographic records for all articles that discuss that topic. Next, it created a list of all of the terms included in the titles of those articles, and these terms were treated as the B items that had a relationship to topic A and might serve as a link to identifying new topics C that had not previously been identified as related to topic A. The program ran searches for these B items and retrieved the bibliographic records for all the articles that discussed each one, creating a number of B article sets.
Arrowsmith then created lists of all of the terms in the titles of each B article set, and the terms in these lists became the C items. To exclude from the results any A – C connections that may have been mentioned within the articles themselves, the program deleted from the lists of C items any terms that also appeared in the titles of the articles retrieved with the searches for topic A. Then the program ranked the remaining C items by potential relevance, according to the number of different B article sets in which they appeared (the more different B items that resulted in identifying a particular C item, the higher the possibility that the C item shared a relevant connection with the initial topic A). As a result, Arrowsmith provided a ranked list of items that may have been related to a topic but that were not identified in the existing medical literature as being related to that topic.

17. Using such a procedure, I identified a particular class of molecule called “microRNAs” as particularly likely to be involved in Huntington’s Disease, and this prediction was confirmed by subsequent research in this field.

18. In the years since we first designed and implemented the “Arrowsmith” technology, we have improved upon it and made modifications to it that have enabled new discoveries.

19. For example, during 2008, I was engaged in writing a review article on microRNA regulation[5] and became interested in assessing whether “phosphorylation,” a common modification of proteins that regulates their function, might be involved in regulating the formation of microRNAs. At the time of my analysis, many proteins had been reported to interact with microRNAs, and in separate studies many proteins were known to be phosphorylated, but no one had investigated directly whether phosphorylation was responsible for regulating microRNAs.

[5] Smalheiser NR. Regulation of mammalian microRNA processing and function by cellular signaling and subcellular localization. Biochim Biophys Acta. 2008 Nov; 1779(11): 678-681.

20. I hypothesized that microRNAs (topic A) were meaningfully linked to phosphorylation (topic C), and using a modified version of the Arrowsmith program, I sought to make a list of proteins (the B items) that were candidates to mediate this connection. I used the Arrowsmith system to carry out two searches of the PubMed database (one on microRNAs and one on phosphorylation), to collect all of the titles in each set of articles, and to identify all of the words and phrases that were shared in common in both sets. The Arrowsmith system then filtered the list of words and phrases to identify the names of proteins, and then ranked the proteins according to their likely relevance (using an algorithm that we developed). The result was a shortlist of proteins that represented good candidates for further study of their possible action in regulating microRNAs by virtue of their phosphorylation.

21. The analyses described above could not reasonably be carried out manually. Not only is it necessary to use computers in order to conduct the searches of thousands of articles identified in each set (A and C), but we needed to carry out statistical modeling based on many searches in order to create a quantitative model that could predict which B items are most likely to be relevant.

22. Automated text mining continues to evolve at a remarkable pace. As more full text becomes accessible and technology advances, increasingly these techniques focus on the full text of books and other texts, both in the general domain of digitized books (as illustrated by the example of assessing authorship of a manuscript in Paragraph 5, above) and in the biomedical domain.

The HathiTrust Digital Library and HathiTrust Research Center

23.
As described in the examples above, because of the scale on which it is conducted and the complexity of the algorithms applied, a great deal of valuable text mining research cannot be carried out manually, but requires large databases of digital textual material that can be processed by computers.

24. I understand that the HDL is a shared database of over ten million digitized volumes, many of which had not previously existed in digital form, from the library collections of major research universities.

25. I believe that the HDL, as a large database of widely varied digital textual material, presents an opportunity for valuable educational and scholarly text mining research to be conducted in a broad range of subjects and disciplines. Indeed, the same text mining techniques described above could be used to identify previously unknown trends, correlations, and relationships from information contained in the different books in the HDL.

26. I understand that the HathiTrust, through the HathiTrust Research Center, is exploring ways of enabling research similar to the text mining research conducted by myself and others as described above.

27. In my opinion, the HDL corpus is amenable to many of the same types of text mining analyses set out above. For example, scientists have developed algorithms and visualization tools designed to analyze digital text and detect "bursts," which are sudden increases in data, and in the context of text mining, refer to sudden increases in appearance or usage of a word or topic. These tools have been used by researchers in the science community to identify major research topics and to trace research topic trends.[6]

[6] Mane KK, Börner K. Mapping topics and topic bursts in PNAS. Proc Natl Acad Sci U S A. 2004 Apr 6;101 Suppl 1:5287-90.
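
For illustration only, the “A – B – C” procedure described in Paragraph 16 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions and is not the Arrowsmith program itself: the toy titles are invented for the example and are not real bibliographic records, the stop-word list is arbitrary, the function names (title_terms, search, rank_c_terms) are hypothetical, and the search function is a stand-in for a query against a bibliographic database such as PubMed. Single words are used as terms, which is a simplification.

```python
from collections import Counter

# Invented toy titles standing in for bibliographic records (not real articles).
TITLES = [
    "fish oil reduces blood viscosity",
    "fish oil inhibits platelet aggregation",
    "blood viscosity is elevated in raynaud syndrome",
    "platelet aggregation in raynaud syndrome",
    "platelet aggregation and aspirin",
]

# Arbitrary stop-word list (an assumption made for this sketch).
STOP_WORDS = {"a", "and", "in", "is", "of", "the"}


def title_terms(titles):
    """Return the set of non-stop-word terms that appear in the given titles."""
    return {word for title in titles for word in title.split()
            if word not in STOP_WORDS}


def search(phrase, titles):
    """Stand-in for a database query: titles containing the phrase as whole words."""
    return [title for title in titles if f" {phrase} " in f" {title} "]


def rank_c_terms(topic_a, titles):
    """Rank candidate C terms by the number of B article sets in which they appear."""
    a_terms = title_terms(search(topic_a, titles))  # title terms for topic A = B items
    counts = Counter()
    for b_term in a_terms:
        b_titles = search(b_term, titles)           # one B article set per B item
        # Drop terms already present in the topic A titles, then count each
        # remaining C term once for this B article set.
        counts.update(title_terms(b_titles) - a_terms)
    # Highest count first; ties broken alphabetically so the order is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for term, n_sets in rank_c_terms("fish oil", TITLES):
        print(term, n_sets)
```

In this toy data no title mentions fish oil and Raynaud syndrome together, yet the Raynaud terms rank highest because they are reached through several different B items. A real system would also need phrase extraction and a more careful relevance model than a raw count.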

NAME: Neil R. Smalheiser, MD, PhD

POSITION: Associate Professor (with tenure), Department of Psychiatry, as of 8/15/08; Adjunct Associate Professor, Department of Anatomy & Cell Biology; Member, Psychiatric Institute, University of Illinois at Chicago (9/96 - present).

EDUCATION
University of Iowa, Iowa City, IA (major: mathematics), B.A. with Honors, 1974
Albert Einstein College of Medicine, New York, NY (PhD in Neuroscience), MD-PhD, 1982

PREVIOUS EMPLOYMENT
University of Chicago, Chicago, IL, Department of Pediatrics: Intern, Postdoctoral Fellow, Instructor, and Assistant Professor, 1982-1996.
University of Illinois at Chicago, Chicago, IL, Department of Psychiatry: Research Assistant Professor and Assistant Professor, 1996-2008.

LICENSURE
Licensed physician, State of Illinois, 1983 – present.

CURRICULUM DESIGN ACTIVITIES
Advisory Committee Member for The Scientific Communications Initiative, 2006-2009. This is an NSF-funded curriculum grant in bioinformatics centered at the Graduate School of Library and Information Science at University of Illinois Urbana-Champaign.

Invited Presentations at other Universities since 1996: [list of universities and dates omitted for brevity]

FORMAL RESEARCH COLLABORATORS SINCE 1996 (shared active grants, were co-authors on published papers, or submitted research grant applications together)
[list of collaborators and institutions]
UIC, Department of Psychiatry: Erminio Costa, John Davis, Yogesh Dwivedi, Robert Gibbons, Dennis Grayson, Alessandro Guidotti, John Larson, Hari Manev, Rudmila Manev, George Pappas, Kiminobu Sugaya, John Sweeney, Vetle Torvik, Tolga Uz.
[additional collaborators listed]