PA Advisors, LLC v. Google Inc. et al
Filing
433
MOTION in Limine and Daubert Motion to Exclude the Testimony of Mr. Stanley Peters by PA Advisors, LLC. (Attachments: #1 Affidavit, #2 Exhibit A, #3 Exhibit B-1, #4 Exhibit B-2, #5 Exhibit B-3, #6 Exhibit B-4, #7 Exhibit B-5, #8 Exhibit B-6, #9 Exhibit B-7, #10 Exhibit B-8, #11 Exhibit B-9, #12 Exhibit B-10, #13 Exhibit B-11, #14 Exhibit B-12, #15 Exhibit B-13, #16 Exhibit C, #17 Text of Proposed Order)(Wiley, Elizabeth)
Exhibit B-2
ACC - 2
Invalidity Chart Braden in view of Herz and Additional Prior Art References
The '067 Patent
1. A data processing method for enabling a user utilizing a local computer system having a local data storage system to locate desired data from a plurality of data items stored in a remote data storage system in a remote computer system, the remote computer system being linked to the local computer system by a telecommunication link, the method comprising the steps of:
Braden
Braden 5:2-6 "In accordance with our broad teachings, the present invention satisfies this need by employing natural language processing to improve the accuracy of a keyword-based document search performed by, e.g., a statistical web search engine."
Herz
Herz 79:11-14 "A method for cataloging a plurality of target objects that are stored on an electronic storage media, where users are connected via user terminals and bidirectional data communication connections to a target server that accesses said electronic storage media."
Herz 1:19-21 "This invention relates to customized electronic identification of desirable objects, such as news articles, in an electronic media environment."
Herz See also Abstract; 1:18-43; 4:35-48; 28:41 55:42; Figures 1-16.
Additional Prior Art References
Salton '89 p. 229 "Information retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes attached to records and information requests."
Salton '68 p. 7 "Because of their special importance in the present context, it is useful to describe in more detail the operations that lead to the retrieval of stored information in answer to user search requests. In practice, searches often may be conducted by using author names or citations or titles as principal criteria. Such searches do not require a detailed content analysis of each item and are relatively easy to perform, provided that there is a unified system for generating and storing the bibliographic citations pertinent to each item."
Culliss 1:28-31 "Given the large amount of information available over the Internet, it is desirable to reduce this information down to a manageable number of articles which fit the needs of a particular user."
Ahn 1:31-33 "The present invention is directed to a system and method for searching through documents maintained in electronic form. The present invention is capable of searching through individual documents, or groups of documents."
Brookes 1:9-14 "This invention relates to information technology and, in particular, to a method and apparatus whereby users of a database system may be alerted to important information including text, graphics and other electronically stored information within the system and by which means information may be efficiently disseminated."
Dasan 1:10-15 "The present invention relates to information retrieval. More specifically, the present invention relates to a client server model for information retrieval based upon a user-defined profile, for example, for the generation of an "electronic" newspaper which contains information of interest to a particular user."
Dedrick See, e.g., Abstract, Figures 1-8.
Krishnan See 1:6-12.
Kupiec 3:23-29 "The present invention provides a method for answer extraction. A system operating according to this method accepts a natural-language input string such as a user supplied question and a set of relevant documents that are assumed to contain the answer to the question. In response, it generates answer hypotheses and finds these hypotheses within the documents."
Reese 1:55-57 "A method and a system for requesting and retrieving information from distinct web network content sites is disclosed."
Menczer p. 157 "In this paper we discuss the use of algorithms based on adaptive, intelligent, autonomous, distributed populations of agents making local decisions as a way to automate the on-line information search and discovery process in the Web or similar environments."
Armstrong p. 4 "We have experimented with a variety of representations that re-represent the arbitrary-length text associated with pages, links, and goals as a fixed-length feature vector. This idea is common within information retrieval systems [Salton and McGill, 1983]. It offers the advantage that the information in an arbitrary amount of text is summarized in a fixed length feature vector compatible with current machine learning methods."
(a) extracting, by one of the local computer system and the remote computer system, a user profile from user linguistic data previously provided by the user, said user data profile being representative of a first linguistic pattern of the said user linguistic data;
Braden 7:19-23 "Generally speaking and in accordance with our present invention, we have recognized that precision of a retrieval engine can be significantly enhanced by employing natural language processing to process, i.e., specifically filter and rank, the records, i.e., ultimately the documents, provided by a search engine used therein."
Braden See, e.g., 11:62-14:61.
Herz 56:19-27 "Initialize Users' Search Profile Sets. The news clipping service instantiates target profile interest summaries as search profile sets, so that a set of high interest search profiles is stored for each user. The search profiles associated with a given user change over time. As in any application involving search profiles, they can be initially determined for a new user (or explicitly altered by an existing user) by any of a number of procedures, including the following preferred methods: (1) asking the user to specify search profiles directly by giving keywords and/or numeric attributes, (2) using copies of the profiles of target objects or target clusters that the user indicates are representative of his or her interest, (3) using a standard set of search profiles copied or otherwise determined from the search profile sets of people who are demographically similar to the user."
Herz 6:58-60 "Each user's target profile interest summary is automatically updated on a continuing basis to reflect the user's changing interests."
Herz 7:26-29 "The accuracy of this filtering system improves over time by noting which articles the user reads and by generating a measurement of the depth to which the user reads each article. This information is then used to update the user's target profile interest summary."
Herz 27:47-49 "[T]he disclosed method for determining topical interest through similarity requires users as well as target objects to have profiles."
Herz 27:62-67 "In a variation, each user's user profile is subdivided into a set of long-term attributes, such as demographic characteristics, and a set of short-term attributes . . . such as the user's textual and multiple-choice answers to questions"
Herz 56:20-28 "As in any application involving search profiles, they can be initially determined for a new user (or explicitly altered by an existing user) by any of a number of procedures, including the following preferred methods: . . . (2) using copies of the profiles of target objects or target clusters that the user indicates are representative of his or her interest."
Herz 59:24-27 "The user's desired attributes . . . would be some form of word frequencies such as TF/IDF and potentially other attributes such as the source, reading level, and length of the article."
Herz See also Abstract; 1:18-43; 4:8:8; 55:44 56:14; 56:15-30; 58:57 60:9; Figures 1-16.
Salton '89 p. 405-6 "To help furnish semantic interpretations outside specialized or restricted environments, the existence of a knowledge base is often postulated. Such a knowledge base classifies the principal entities or concepts of interest and specifies certain relationships between the entities. [43-45] . . . . The literature includes a wide variety of different knowledge representations . . . [one of the] best-known knowledge-representation techniques [is] the semantic-net. . . . In generating a semantic network, it is necessary to decide on a method of representation for each entity, and to relate or characterize the entities. The following types of knowledge representations are recognized: [46-48]. . . . A linguistic level in which the elements are language specific and the links represent arbitrary relationships between concepts that exist in the area under consideration."
Salton '89 p. 378 "A prescription for a complete language-analysis package might be based on the following components: A knowledge base consisting of stored entities and predicates, the latter used to characterize and relate the entities."
Salton '68 p. 9, Fig. 1-3 "different content analysis procedures are available to generate identifiers for documents and requests. . . statistical and syntactic procedures to identify relations between words and concepts, and phrase generating methods."
Salton '68 p. 11 (Statistical association methods, Syntactic analysis methods, and Statistical phrase recognition methods)
Salton '68 p. 33 "The phrase dictionaries. Both the regular and the stem thesauruses are based on entries corresponding either to single words or to single word stems. In attempting to perform a subject analysis of written text, it is possible, however, to go further by trying to locate phrases consisting of sets of words that are judged to be important in a given subject area."
Salton '68 p. 35-36 "The syntactic phrase dictionary has a more complicated structure, as shown by the excerpt reproduced in Fig. 26. Here, each syntactic phrase, also known as criterion tree or criterion phrase, consists not only of a specification of the component concepts but also of syntactic indicators, as well as of syntactic relations that may obtain between the included concepts. . . . More specifically, there are four main classes of syntactic specifications, corresponding to noun phrases, subject-verb relations, verb-object relations, and subject-object relations."
Culliss 3:46-48 "Inferring Personal Data Users can explicitly specify their own personal data, or it can be inferred from a history of their search requests or article viewing habits. In this respect, certain key words or terms, such as those relating to sports (i.e. "football" and "soccer"), can be detected within search requests and used to classify the user as someone interested in sports."
Culliss 3:13-36 "The present embodiment of the invention utilizes personal data to further refine search results . . . . Personal activity data includes data about past actions of the user, such as reading habits, viewing habits, searching habits, previous articles displayed or selected, previous search requests entered, previous or current site visits, previous key terms utilized within previous search results, and time or date of any previous activity."
Brookes 12:38-43 "creating and storing an interest profile for each database user indicative of categories of information of interest to said each database user, said interest profile comprising (i) a list of keywords taken from said finite hierarchical set and (ii) an associated priority level value for each keyword."
Brookes See also, 1:66-2:3.
Chislenko 3:38-39 "Each user profile associates items with the ratings given to those items by the user. Each user profile may also store information in addition to the user's ratings."
Chislenko 4:15-18 "For example, the system may assume that Web sites for which the user has created "bookmarks" are liked by that user and may use those sites as initial entries in the user's profile."
Chislenko 4:40-50 "Ratings can be inferred by the system from the user's usage pattern. For example, the system may monitor how long the user views a particular Web page and store
in that user's profile an indication that the user likes the page, assuming that the longer the user views the page, the more the user likes the page. Alternatively, a system may monitor the user's actions to determine a rating of a particular item for the user. For example, the system may infer that a user likes an item which the user mails to many people and enter in the user's profile an indication that the user likes that item."
Chislenko 21:64-22:2 "(a) storing, using the machine, a user profile in a memory for each of the plurality of users, wherein at least one of the user profiles includes a plurality of values, one of the plurality of values representing a rating given to one of a plurality of items by the user and another of the plurality of values representing additional information."
Chislenko 22:29-35 "storing, using the machine, a user profile in a memory for each of the plurality of users, wherein at least one of the user profiles includes a plurality of values, one of the plurality of values representing a rating given to one of a plurality of items by the user and another of the plurality of values representing information relating to the given ratings."
Dasan 3:21-24 "The present invention is a method and apparatus for automatically scanning information using a user-defined profile, and providing relevant stories from that information to a user based upon that profile."
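For illustration only, the viewing-time inference described in the Chislenko 4:40-50 passage quoted above could be sketched as follows. Nothing in this sketch is taken from Chislenko or any other cited reference: the function names `infer_rating` and `build_profile`, the 0-to-1 rating scale, and the 60-second saturation point are all assumptions chosen to make the idea concrete.

```python
from collections import defaultdict


def infer_rating(view_seconds, saturation_seconds=60.0):
    """Map viewing time to a rating in [0, 1]; longer views score higher.

    The linear ramp and the 60-second cap are illustrative assumptions.
    """
    if view_seconds <= 0:
        return 0.0
    return min(view_seconds / saturation_seconds, 1.0)


def build_profile(view_log):
    """Build {user: {item: rating}} from (user, item, seconds) records,
    keeping the highest inferred rating seen for each item."""
    profile = defaultdict(dict)
    for user, item, seconds in view_log:
        rating = infer_rating(seconds)
        profile[user][item] = max(rating, profile[user].get(item, 0.0))
    return dict(profile)


# Hypothetical usage log: (user, page, seconds viewed)
log = [("alice", "page1", 90), ("alice", "page2", 15), ("bob", "page1", 30)]
profiles = build_profile(log)
```

Under these assumptions, each user profile associates items with inferred ratings, which is the general shape of profile the quoted passages describe; an actual system might weight or decay the ratings quite differently.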
Dasan 4:1-25 "[T]he user is able to connect to the remote server and specify a user profile, setting forth his interests. The user is able to specify the context for the information to be searched (e.g. the date). The user is able to save the profile on the remote machine. Finally the user is able to retrieve the personal profile (with any access control, if desired) and edit (add or delete entries) and save it for future operations."
Dasan 4:34-39 "Using this interface, and HTTP, the server may notify the client of the results of that execution upon completion. The server's application program, the personal newspaper generator maintains a record of the state of each user's profile, and thus, provides state functionality from session to session to an otherwise stateless protocol."
Dasan See, e.g., 5:37-6:3; 8:53-67.
Dedrick 7:28-38 "Data is collected for personal profile database 27 by direct input from the end user and also by client activity monitor 24 monitoring the end user's activity. When the end user consumes a piece of electronic information, each variable (or a portion of each variable) within the header block for that piece of electronic information is added to the database for this end user. For example, if this piece of electronic information is made available to the end user for consumption in both audio and video format, and the end user selects the audio format, then this choice of format selection is stored in
personal profile database 27 for this end user."
Dedrick 3:54-4:4 "The GUI may also have hidden fields relating to "consumer variables." Consumer variables refer to demographic, psychographic and other profile information. Demographic information refers to the vital statistics of individuals, such as age, sex, income and marital status. Psychographic information refers to the lifestyle and behavioral characteristics of individuals, such as likes and dislikes, color preferences and personality traits that show consumer behavioral characteristics. Thus, the consumer variables refer to information such as marital status, color preferences, favorite sizes and shapes, preferred learning modes, employer, job title, mailing address, phone number, personal and business areas of interest, the willingness to participate in a survey, along with various lifestyle information. This information will be referred to as user profile data, and is stored on a consumer owned portable profile device such as a Flash memory-based PCMCIA pluggable card."
Dedrick See, e.g., Abstract, Figures 1-8.
Eichstaedt 1:34-43 "The present invention provides a profiling technique that generates user interest profiles by monitoring and analyzing a user's access to a variety of hierarchical levels within a set of structured documents, e.g., documents available at a web site. Each information document has parts associated with it and the documents are
classified into categories using a known taxonomy. In other words, each document is hierarchically structured into parts, and the set of documents is classified as well."
Eichstaedt 3:28-31 "The profile generation algorithm in the present embodiment learns from positive feedback. Each view of a document signifies an interest level in the content of the document."
Eichstaedt 1:43-55 "In other words, each document is hierarchically structured into parts, and the set of documents is classified as well. The user interest profiles are automatically generated based on the type of content viewed by the user. The type of content is determined by the text within the parts of the documents viewed and the classifications of the documents viewed. In addition, the profiles also are generated based on other factors including the frequency and currency of visits to documents having a given classification, and/or the hierarchical depth of the levels or parts of the documents viewed. User profiles include an interest category code and an interest score to indicate a level of interest in a particular category. Unlike static registration information, the profiles in this invention are constantly changing to more accurately reflect the current interests of an individual."
Eichstaedt 2:15-41 "A preferred embodiment of the present invention automatically generates a profile that accurately captures a user's stable interest after monitoring the
user's interaction with a set of structured documents. The technique of the present embodiment is based on the following three assumptions. First, each document in the corpus has different levels, parts, or views. These views are used to determine the level of interest a user has in a particular document. A hierarchical document structure is a good example for a document with different views. Structured documents such as patents have a title, an abstract and a detailed description. These parts of the document may be categorized according to a 3-level hierarchy which then can be used to determine how interested a user is in a particular topic. For example, if a user only views the title of a patent document, the user probably has little or no interest in the content of the document. If the user views the abstract as well, the user can be assumed to have more interest in the content of the document. If the user goes on to view the detailed description, then there is good evidence that the user has a strong interest in the document, and the category into which it is classified. Generally, the more views, levels, or parts a document has, the finer will be the granularity of the present system. Although not all documents are structured at present, with the advent of XML, it is likely that the proportion of hierarchical documents available on the internet and in other databases will only increase."
Eichstaedt 3:15-18 "In the system of the present invention, a special access analyzer and profile generator 62 analyzes information about user access to database 60 to generate a
profile for the user. The profile is then used by a webcasting system 64 to provide or "push" customized information back to the user 54."
Eichstaedt 5:32-36 "The automatic profile generation algorithm is completely automated and derives the user profiles from implicit feedback. Therefore, the user community does not have to learn new rules to customize the pushed information stream."
Krishnan 2:37-41 "The information access monitor computes user/group profiles to identify information needs and interests within the organization and can then automatically associate users/groups with information of relevance."
Krishnan 4:1-4 "[A] profile of a user's attributes is termed a 'user profile'; a summary of digital profiles of objects accessed by a user and/or noted as of interest to the user, is termed the 'interest summary' of that user."
Krishnan See also Fig. 6.
Reese 4:35-53 "The user profile is intended to focus the retrieved results on meaningful data. One type of user profile is related to the demographics of the user. For example, the user profile might include the area code, zip code, state, sex, and age of a user. With such a profile, the matching server would retrieve data to the client related to the client's demographics. For example, if the user were interested in current events in the state of Oregon, the matching server would retrieve
data and compile an aggregate database relating to current events pertinent to the user's age and area, e.g., Portland. Similarly, if the user sought information regarding retail purchases, the matching server would retrieve data relevant to the user's demographics. A demographics user profile is also very effective for advertisers that wish to advertise their goods or services on the matching server so that specific advertisements can be targeted at user's with specific user profile demographics. Other user profiles include, but are not limited to, areas of interest, business, politics, religion, education, etc."
Reese 5:55-65 "The user profile form 600 includes a Search Type field 630 that allows a user to select whether the user wants an exact match of the user profile with the search data or whether the user will accept some lesser amount of exactness as acceptable for retrieved data. The user profile form 600 further allows the user to enter demographics specific to the user. In FIG. 6, the demographics include area code 640, zip code 650, state 660, sex 670, age 680, and some other identifiers 690. Once the user enters the appropriate data in the user profile form 600, the user is instructed to save the profile by a "Save Profile" 694 button."
Reese 8:26-35 "Thus far, the invention is focused on a user-created user profile. The invention also contemplates that the user profile may be constructed by the client based on the user's search habits. In other words, an artificial intelligence system may be created to
develop a user profile. In the same way that a system is trained to be associative with regard to matching profile elements, the entire profile may be trained based on a user's search habits. For instance, a user profile that relates to demographics can be trained by recognizing user habits relating to demographics."
Sheena 4:40-49 "Ratings can be inferred by the system from the user's usage pattern. For example, the system may monitor how long the user views a particular Web page and store in that user's profile an indication that the user likes the page, assuming that the longer the user views the page, the more the user likes the page. Alternatively, a system may monitor the user's actions to determine a rating of a particular item for the user. For example, the system may infer that a user likes an item which the user mails to many people and enter in the user's profile an indication that the user likes that item."
Sheena 2:9-14 "In one aspect the present invention relates to a method for recommending an item to one of a plurality of users. The method begins by storing a user profile in a memory by writing user profile data to a memory management data object. Item profile data is also written to a memory management data object."
Sheena 3:34-67 "Each user profile associates items with the ratings given to those items by the user. Each user profile may also store information in addition to the user's rating. In one embodiment, the user profile stores
information about the user, e.g. name, address, or age. In another embodiment, the user profile stores information about the rating, such as the time and date the user entered the rating for the item. User profiles can be any data construct that facilitates these associations, such as an array, although it is preferred to provide user profiles as sparse vectors of n-tuples. Each n-tuple contains at least an identifier representing the rated item and an identifier representing the rating that the user gave to the item, and may include any number of additional pieces of information regarding the item, the rating, or both. Some of the additional pieces of information stored in a user profile may be calculated based on other information in the profile, for example, an average rating for a particular selection of items (e.g., heavy metal albums) may be calculated and stored in the user's profile. In some embodiments, the profiles are provided as ordered n-tuples. Alternatively, a user profile may be provided as an array of pointers; each pointer is associated with an item rated by the user and points to the rating and information associated with the rating. A profile for a user can be created and stored in a memory element when that user first begins rating items, although in multi-domain applications user profiles may be created for particular domains only when the user begins to explore, and rate items within, those domains. Alternatively, a user profile may be created for a user before the user rates any items in a domain. For example, a default user profile may be created for a domain which the user has not yet begun to explore based on the
ratings the user has given to items in a domain that the user has already explored."
Sheena 28:16-21 "(a) storing a user profile, in the memory, for each of a plurality of users, wherein the user profile comprises a separate rating value, supplied by a particular one of the users, for each corresponding one of a plurality of items, said items including the item non-rated by the user."
Siefert 2:48-59 "In addition, in other forms of the invention, a profile is maintained which specifies certain preferences of the user. Two such preferences are (1) a preferred natural language (such as English or French), (2) the type of interface which the user prefers. The invention presents the resource in a manner compatible with the profile. Also, another profile, termed a "learning profile," is maintained, which, in a simplified sense, specifies the current status of a user with respect to a curriculum which the user is undertaking. The invention ensures compatibility between the resource and the learning profile, if possible."
Siefert 8:60-62 "As stated above, the user profile contains information identifying the preferences of the user."
Siefert 11:57-63 "The user profile specifies preferences of a user. It may not be possible, in all cases, to cause a resource selected by a user to become compatible with all specified preferences. However, insofar as the resource is transformed so that more preferences are
matched than previously, the invention can be said to "enhance" the compatibility between the resource and the preferences."
Belkin p. 397 "The search intermediary uses his knowledge about the IR system (with its data collections) and the searcher to formulate requests directly to the IR system. The search intermediary has formulated a model of the user and taken advantage of his existing model of the IR system."
Belkin p. 399 "In the general information seeking interaction, the IR system needs to have (see Table 1 for a brief listing of the ten functions and their acronyms): a model of the user himself, including goals, intentions and experience (UM)."
Han p. 409 "Personalized Web Agents Another group of Web agents includes those that obtain or learn user preferences and discover Web information sources that correspond to these preferences, and possibly those of other individuals with similar interests (using collaborative filtering)"
Han p. 409 "As the user browses the Web, the profile creation module builds a custom profile by recording documents of interest to the user. The number of times a user visits a document and the total amount of time a user spends viewing a document are just a few methods for determining user interest [1, 3, 4]. Once WebACE has recorded a sufficient number of interesting documents, each document is reduced to a document vector and
the document vectors are passed to the clustering modules."
Menczer p. 158-9 "Words are the principal asset in text collections, and virtually all information retrieval systems take advantage of words to describe and characterize documents, query, and concepts such as "relevance" or "aboutness" . . . This metric can be called word topology and is the reason why documents are usually represented as word vectors in information retrieval . . . [l]inks, constructed manually to point from one page to another, reflect an author's attempts to relate her writings to others. Word topology is a epiphenomenal consequence of word vocabulary choices made by many authors, across many pages. The entire field of free text information retrieval is based on the statistical patterns reliably present in such vocabulary usage. By making our agents perceptually sensitive to word topology features."
Menczer p. 160 "For the reasons outlined in Section 2, each agent's genotype also contains a list of keywords, initialized with the query terms." [Agent's genotype is its version of a user profile.]
Menczer p. 163 "The user initially provides a list of keywords and a list of starting points, in the form of a bookmark file." [The bookmarks and starting points are evidence of the profile the agent uses in creating its genotype.]
Armstrong p. 1 "In interactive mode,
WebWatcher acts as a learning apprentice [Mitchell et al., 1985; Mitchell et. al., 1994], providing interactive advice to the Mosaic user regarding which hyperlinks to follow next, then learning by observing the user's reaction to this advice as well as the eventual success or failure of the user's actions."
Armstrong p. 4 "1. Underlined words in the hyperlink. 200 boolean features are allocated to encode selected words that occur within the scope of the hypertext link (i.e., the underlined words seen by the user). These 200 features correspond to only the 200 words found to be most informative over all links in the training data (see below.)"
Armstrong p. 4: "The task of the learner is to learn the general function UserChoice?, given a sample of training data logged from users."
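For illustration only, the boolean word-feature encoding described in the Armstrong p. 4 passage quoted above could be sketched as follows. This is not Armstrong's implementation: the function names are hypothetical, the vocabulary size is reduced from 200 to 3 for readability, and simple frequency across training links is swapped in as a stand-in for whatever informativeness measure the paper actually used.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def select_vocabulary(training_links, size):
    """Pick `size` words for the feature vector.

    Assumption: words are ranked by how many training links contain them,
    with ties broken alphabetically. This is a stand-in for a real
    informativeness measure.
    """
    counts = {}
    for text in training_links:
        for word in set(tokenize(text)):
            counts[word] = counts.get(word, 0) + 1
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:size]


def encode(link_text, vocabulary):
    """Return one boolean per vocabulary word: True if the link contains it."""
    words = set(tokenize(link_text))
    return [word in words for word in vocabulary]


# Hypothetical training data: the underlined text of three hyperlinks
training = ["machine learning papers", "learning resources", "campus map"]
vocab = select_vocabulary(training, 3)
features = encode("Machine Learning group", vocab)
```

Whatever the length of the link text, the result has exactly one boolean per vocabulary word, which is the fixed-length property the quoted passage relies on.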
(b) constructing, by the remote computer system, a plurality of data item profiles, each plural data item profile corresponding to a different one of each plural data item stored in the remote data storage system, each of said plural data item profiles being representative of a second linguistic pattern of a
Braden 7:19-23 "Generally speaking and in accordance with our present invention, we have recognized that precision of a retrieval engine can be significantly enhanced by employing natural language processing to process, i.e., specifically filter and rank, the records, i.e., ultimately the documents, provided by a search engine used therein."
Braden 11:62-14:61 "In general, to generate logical form triples for an illustrative input string, e.g. for input string 510, that string is first parsed into its constituent words. Thereafter, using a predefined record (not to be confused with document records employed by a
Herz 79:11-22 "A method for cataloging a plurality of target objects that are stored on an electronic storage media, where users are connected via user terminals and bidirectional data communication connections to a target server that accesses said electronic storage media, said method comprising the steps of: storing on said electronic storage
Salton `89 p. 275 "[I]n these circumstances, it is advisable first to characterize record and query content by assigning special content descriptions, or profiles, identifying the items and representing text content. The text profiles can be used as short-form descriptions; they also serve as document, or query, surrogates during the text-search and [text] retrieval operations." Salton `89 p. 294-6 (see also fn. 28-30) (Linguistic methodologies including syntactic class indicators (adjective, noun, adverb, etc.) are assigned to the terms).
The `067 Patent corresponding plural data item, each said plural second linguistic pattern being substantially unique to each corresponding plural data item;
Braden search engine), in a stored lexicon, for each such word, the corresponding records for these constituent words, through predefined grammatical rules, are themselves combined into larger structures or analyses which are then, in turn, combined, again through predefined grammatical rules, to form even larger structures, such as a syntactic parse tree. A logical form graph is then built from the parse tree. Whether a particular rule will be applicable to a particular set of constituents is governed, in part, by presence or absence of certain corresponding attributes and their values in the word records. The logical form graph is then converted into a series of logical form triples. Illustratively, our invention uses such a lexicon having approximately 165,000 head word entries. This lexicon includes various classes of words, such as, e.g., prepositions, conjunctions, verbs, nouns, operators and quantifiers that define syntactic and semantic properties inherent in the words in an input string so that a parse tree can be constructed therefor. Clearly, a logical form (or, for that matter, any other representation, such as logical form triples or logical form graph within a logical form, capable of portraying a semantic relationship) can be precomputed, while a corresponding document is being indexed, and stored, within, e.g., a record for that document, for subsequent access and use rather than being computed later once that document has been retrieved. Using such precomputation and storage, as occurs in another embodiment of our invention discussed in detail below in conjunction with FIGS. 10-13B, drastically and advantageously reduces the amount of natural
Herz media each target object; automatically generating in said target server, target profiles for each of said target objects that are stored on said electronic storage media, each of said target profiles being generated from the contents of an associated one of said target objects and their associated target object characteristics" Herz 6:43-46 "The specific embodiment of this system disclosed herein illustrates the use of a first module which automatically constructs a "target profile" for each target object in the electronic media based on various descriptive attributes of the target object." Herz 12:54-13:53 "In particular, a textual attribute, such as the full text of a movie review, can be replaced by a collection of numeric attributes that represent scores to denote the presence and significance of the words "aardvark,"
Additional Prior Art References Salton `89 p. 389 (see also fn. 23-25) (Syntactic class markers, such as [noun], adjective, and pronoun, are first attached to the text words. Syntactic class patterns are then specified, such as "noun-noun", or "adjective-adjective-noun," and groups of text words corresponding to permissible syntactic class patterns are assigned to the texts for content identification. Word frequency and word distance constraints may also be used to refine phrase construction.) Salton `89 p. 391, Fig. 11.3 Salton `68 p. 11 (Statistical association methods, Syntactic analysis methods, and Statistical phrase recognition methods). Salton `68 p. 30 "The word stem thesaurus and suffix list. One of the earliest ideas in automatic information retrieval was the suggested use of words contained in documents and search requests for purposes of content identification. No elaborate content analysis is then required, and the similarity between different items can be measured simply by the amount of overlap between the respective vocabularies." Salton `68 p. 33 "The phrase dictionaries. Both the regular and the stem thesauruses are based on entries corresponding either to single words or to single word stems. In attempting to perform a subject analysis of written text, it is possible, however, to go further by trying to locate phrases consisting of sets of words that are judged to be important in a given subject
Braden language processing, and hence execution time associated therewith, required to handle any retrieved document in accordance with our invention. In particular, an input string, such as sentence 510 shown in FIG. 5A, is first morphologically analyzed, using the predefined record in the lexicon for each of its constituent words, to generate a so-called "stem" (or "base") form therefor. Stem forms are used in order to normalize differing word forms, e.g., verb tense and singular-plural noun variations, to a common morphological form for use by a parser. Once the stem forms are produced, the input string is syntactically analyzed by the parser, using the grammatical rules and attributes in the records of the constituent words, to yield the syntactic parse tree therefor. This tree depicts the structure of the input string, specifically each word or phrase, e.g. noun phrase "The octopus", in the input string, a category of its corresponding grammatical function, e.g., NP for noun phrase, and link(s) to each syntactically related word or phrase therein. For illustrative sentence 510, its associated syntactic parse tree would be:
Herz "aback," "abacus," and so on through "zymurgy" in that text. The score of a word in a text may be defined in numerous ways. The simplest definition is that the score is the rate of the word in the text, which is computed by computing the number of times the word occurs in the text, and dividing this number by the total number of words in the text. This sort of score is often called the "term frequency" (TF) of the word. The definition of term frequency may optionally be modified to weight different portions of the text unequally: for example, any occurrence of a word in the text's title might be counted as a 3-fold or more generally k-fold occurrence (as if the title had been repeated k times within the text), in order to reflect a heuristic assumption that the words in the title are particularly important indicators of the text's content or topic. However, for lengthy
Additional Prior Art References area." Salton `68 p. 35-36 "The syntactic phrase dictionary has a more complicated structure, as shown by the excerpt reproduced in Fig. 26. Here, each syntactic phrase, also known as criterion tree or criterion phrase, consists not only of a specification of the component concepts but also of syntactic indicators, as well as of syntactic relations that may obtain between the included concepts. . . . More specifically, there are four main classes of syntactic specifications, corresponding to noun phrases, subject-verb relations, verb-object relations, and subject-object relations." Culliss 2:33-37 "The articles can each be associated with one or more of these key terms by any conceivable method of association now known or later developed. A key term score is associated with each article for each of the key terms. Optionally, a key term total score can also be associated with the article." Ahn 2:32-34 "Also, a document tree and a document index table is maintained for each document (such as Document Dl)." Brookes 12:27-37 "storing in association with each information item in the database system a plurality of parameters including (i) at least one keyword indicative of the subject matter of said information item, and (ii) a priority level value for each information item, wherein said priority level value is selected from a predetermined set of priority level values, and
Braden A start node located in the upper-left hand corner of the tree defines the type of input string being parsed. Sentence types include "DECL" (as here) for a declarative sentence, "IMPR" for an imperative sentence and "QUES" for a question. Displayed vertically to the right and below the start node is a first level analysis. This analysis has a head node indicated by an asterisk, typically a main verb (here the word "has"), a premodifier (here the noun phrase "The octopus"), followed by a postmodifier (the noun phrase "three hearts"). Each leaf of the tree contains a lexical term or a punctuation mark. Here, as labels, "NP" designates a noun phrase, and "CHAR" denotes a punctuation mark. The syntactic parse tree is then further processed using a different set of rules to yield a logical form graph, such as graph 515 for input string 510. The process of producing a logical form graph involves extracting underlying structure from syntactic analysis of the input string; the logical form graph includes those words that are defined as
Herz textual attributes, such as the text of an entire document, the score of a word is typically defined to be not merely its term frequency, but its term frequency multiplied by the negated logarithm of the word's "global frequency," as measured with respect to the textual attribute in question. The global frequency of a word, which effectively measures the word's uninformativeness, is a fraction between 0 and 1, defined to be the fraction of all target objects for which the textual attribute in question contains this word. This adjusted score is often known in the art as TF/IDF ("term frequency times inverse document frequency"). When global frequency of a word is taken into account in this way, the common, uninformative words have scores comparatively close to zero, no matter how often or rarely they appear in the text. Thus, their rate has little influence on the object's target profile.
Additional Prior Art References wherein said at least one keyword is selected from a finite hierarchical set of keywords having a tree structure relating broad keywords to progressively narrower keywords." Brookes See also, 1:57-65. Dedrick 15:41-44 "The metering server 14 is capable of storing units of information relating to the content databases of the publisher/advertiser, including the entire content database." Dedrick See, e.g., Abstract, Figures 1-8. Eichstaedt 2:42-50 "The second assumption is that the documents must already be assigned to at least one category of a known taxonomy tree for the database. Notice, however, that this system works with any existing taxonomy tree and does not require any changes to a legacy system. FIG. 1 illustrates a taxonomy tree with six leaf categories 50. Each leaf category has an interest value associated with it. Taxonomies are available for almost all domain-specific document repositories because they add significant value for the human user." Eichstaedt 1:34-43 "The present invention provides a profiling technique that generates user interest profiles by monitoring and analyzing a user's access to a variety of hierarchical levels within a set of structured documents, e.g., documents available at a web site. Each information document has parts
Braden having a semantic relationship there between and the functional nature of the relationship. The "deep" cases or functional roles used to categorize different semantic relationships include: To identify all the semantic relationships in an input string, each node in the syntactic parse tree for that string is examined. In addition to the above relationships, other semantic roles are used. In any event, the results of such analysis for input string 510 is logical form graph 515. Those words in the input string that exhibit a
Herz Alternative methods of calculating word scores include latent semantic indexing or probabilistic models. Instead of breaking the text into its component words, one could alternatively break the text into overlapping word bigrams (sequences of 2 adjacent words), or more generally, word n-grams. These word n-grams may be scored in the same way as individual words. Another possibility is to use character n-grams. For example, this sentence contains a sequence of overlapping character 5-grams which starts "for e", "or ex", "r exa", " exam", "examp", etc. The sentence may be characterized, imprecisely but usefully, by the score of each possible character 5-gram ("aaaaa", "aaaab", ... "zzzzz") in the sentence. Conceptually speaking, in the character 5-gram case, the textual attribute would be decomposed into at least 26^5 = 11,881,376 numeric attributes. Of course, for a
Additional Prior Art References associated with it and the documents are classified into categories using a known taxonomy. In other words, each document is hierarchically structured into parts, and the set of documents is classified as well." Krishnan 3:64-4:1 "[I]nformation, which is typically electronic in nature and available for access by a user via the Internet, is termed an `object'; a digitally represented profile indicating an object's attributes is termed an `object profile.'" Krishnan 7:13-42 "The basic [document] indexing operation comprises three steps, noted above as: filtering, word breaking, and normalization . . . . Once the content filter has operated on the source file, the word breaker step is activated to divide the received text stream from the content filter into words and phrases. Thus, the word breaker accepts a stream of characters as an input and outputs words . . . . The final step of indexing is the normalization process, which removes `noise' words and eliminates capitalization, punctuation, and the like." Krishnan See also Fig. 6. Kupiec 13:13-20 "In step 250 the match sentences retained for further processing in step 245 are analyzed to detect phrases they contain. The match sentences are analyzed in substantially the same manner as the input string is analyzed in step 220 above. The detected phrases typically comprise noun phrases and can further comprise title phrases
Braden semantic relationship therebetween (such as, e.g. "Octopus" and "Have") are shown linked to each other with the relationship therebetween being specified as a linking attribute (e.g. Dsub). This graph, typified by graph 515 for input string 510, captures the structure of arguments and adjuncts for each input string. Among other things, logical form analysis maps function words, such as prepositions and articles, into features or structural relationships depicted in the graph. Logical form analysis also resolves anaphora, i.e., defining a correct antecedent relationship between, e.g., a pronoun and a co-referential noun phrase; and detects and depicts proper functional relationships for ellipsis. Additional processing may well occur during logical form analysis in an attempt to cope with ambiguity and/or other linguistic idiosyncrasies. Corresponding logical form triples are then simply read in a conventional manner from the logical form graph and stored as a set. Each triple contains two node words as depicted in the graph linked by a semantic relationship therebetween. For illustrative input string 510, logical form triples 525 result from processing graph 515. Here, logical form triples 525 contain three individual triples that collectively convey the semantic information inherent in input string 510. Similarly, as shown in FIGS. 5B-5D, for input strings 530, 550 and 570, specifically exemplary sentences "The octopus has three hearts and two lungs.", "The octopus has three hearts and it can swim.", and "I like shark fin soup bowls.", logical form graphs 535, 555 and 575, as well as logical form triples 540, 560 and 580, respectively result. There are three logical
Herz given target object, most of these numeric attributes have values of 0, since most 5-grams do not appear in the target object attributes. These zero values need not be stored anywhere. For purposes of digital storage, the value of a textual attribute could be characterized by storing the set of character 5-grams that actually do appear in the text, together with the nonzero score of each one. Any 5-gram that is not included in the set can be assumed to have a score of zero. The decomposition of textual attributes is not limited to attributes whose values are expected to be long texts. A simple, one-term textual attribute can be replaced by a collection of numeric attributes in exactly the same way. Consider again the case where the target objects are movies. The "name of director" attribute, which is textual, can be replaced by numeric attributes giving the
Additional Prior Art References or other kinds of phrases. The phrases detected in the match sentences are called preliminary hypotheses." Reese 7:1-24 "In collecting the information that matches the query request, the server may collect different forms of information. First, the server may collect entire content site data, for example, entire files or documents on a particular content server. Instead, the server may collect key words from particular sites (e.g., files) on individual content servers, monitor how often such key words are used in a document, and construct a database based on these key words (step 822). Another way of collecting data is through the collection of content summaries (step 824). In this manner, rather than entire files or documents being transmitted to the server and ultimately to the client, only summaries of the documents or files are collected and presented. The summaries offer a better description of the content of the particular files or documents than the key words, because the user can form a better opinion of what is contained in the abbreviated document or file based on summaries rather than a few key words. The summaries may be as simple as collective abstracts or may involve the matching server identifying often used key words and extracting phrases or sentences using these key words from the document. Finally, the invention contemplates that titles may also be retrieved by the matching server and submitted to the client rather than entire documents or files."
Braden form constructions for which additional natural language processing is required to correctly yield all the logical form triples, apart from the conventional manner, including a conventional "graph walk", in which logical form triples are created from the logical form graph. In the case of coordination, as in exemplary sentence "The octopus has three hearts and two lungs", i.e. input string 530, a logical form triple is created for a word, its semantic relation, and each of the values of the coordinated constituent. According to a "special" graph walk, we find in FIG. 540 two logical form triples "have-Dobj-heart" and "have-Dobj-lung". Using only a conventional graph walk, we would have obtained only one logical form triple "have-Dobj-and". Similarly, in the case of a constituent which has referents (Refs), as in exemplary sentence "The octopus has three hearts and it can swim", i.e. input string 550, we create a logical form triple for a word, its semantic relation, and each of the values of the Refs attribute, in additional to the triples generated by the conventional graph walk. According to this special graph walk, we find in triples 560 the logical form triple "swim-Dsub-octopus" in addition to the conventional logical form triple "swim-Dsub-it". Finally, in the case of a constituent with noun modifiers, as in the exemplary sentence "I like shark fin soup bowls", i.e. input string 570, additional logical form triples are created to represent possible internal structure of the noun compounds. The conventional graph walk created the logical form triples "bowl-Mods-shark", "bowl-Mods-fin" and "bowl-Mods-soup", reflecting the possible internal structure
Herz scores for "Federico-Fellini," "Woody-Allen," "Terence-Davies," and so forth, in that attribute." Herz 79:11-23 "A method for cataloging a plurality of target objects that are stored on an electronic storage media, . . . said method comprising the steps of: . . . automatically generating in said target server, target profiles for each of said target objects that are stored on said electronic storage media, each of said target profiles being generated from the contents of an associated one of said target objects and their associated target object characteristics." Herz 5:7-11 "The system for electronic identification of desirable objects of the present invention automatically constructs both a target profile for each target object in the electronic media based, for example, on the frequency with which each word appears in an
Additional Prior Art References Sheena 2:14-15 "Similarity factors are calculated for each of the users and the similarity factors are used to select a neighboring user set for each user of the system." Sheena 4:56-5:17 "Profiles for each item that has been rated by at least one user may also be stored in memory. Each item profile records how particular users have rated this particular item. Any data construct that associates ratings given to the item with the user assigning the rating can be used. It is preferred is to provide item profiles as a sparse vector of n-tuples. Each n-tuple contains at least an identifier representing a particular user and an identifier representing the rating that user gave to the item, and it may contain other information, as described above in connection with user profiles. As with user profiles, item profiles may also be stored as an array of pointers. Item profiles may be created when the first rating" Siefert 8:22-33 "In a very simple sense, the expert identifies the language of a sample of words, by reading the sample. Then, the invention analyzes samples of each language, in order to find unique character- and word patterns (or other patterns). Now the invention can associate unique patterns with each language. The invention stores the unique patterns, together with the corresponding language identities, in a reference table. Later, to identify a language, the invention looks for the unique patterns within a sample of the language, such as in a
Braden [[shark] [fin] [soup] bowl]. In the special graph walk, we create additional logical form triples to reflect the following possible internal structures [[shark fin] [soup] bowl] and [[shark] [fin soup] bowl] and [[shark [fin] soup] bowl], respectively: "fin-Mods-shark", "soup-Mods-fin", and "soup-Mods-shark". Inasmuch as the specific details of the morphological, syntactic, and logical form processing are not relevant to the present invention, we will omit any further details thereof. However, for further details in this regard, the reader is referred to co-pending United States patent applications entitled "Method and System for Computing Semantic Logical Forms from Syntax Trees", filed Jun. 28, 1996 and assigned Ser. No. 08/674,610 and particularly "Information Retrieval Utilizing Semantic Representation of Text", filed Mar. 7, 1997 and assigned Ser. No. 08/886,814; both of which have been assigned to the present assignee hereof and are incorporated by reference herein." Braden 7:47-53 "each of the documents in the set is subjected to natural language processing, specifically morphological, syntactic and logical form, to produce logical forms for each sentence in that document. Each such logical form for a sentence encodes semantic relationships, particularly argument and adjunct structure, between words in a linguistic phrase in that sentence."
Herz article relative to its overall frequency of use in all articles." Herz 10:63-67; 11:1-7 "However, a more sophisticated system would consider a longer target profile, including numeric and associative attributes: (a.) full text of document . . . (d.) language in which document is written . . . (g.) length in words . . . (h.) reading level." Herz See also Abstract; 1:18-43; 4:49-8:8; 9:1-16:62; 26:43-27:43; 55:44-56:14; 56:52-57:10.
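The word-scoring scheme in the Herz passages quoted above for element (b) (term frequency, TF/IDF, and sparse character 5-gram storage) can be illustrated with a short sketch. This is only an illustration under stated assumptions, not code from Herz or any other cited reference: the function names, the whitespace tokenization, and the sample sentences are hypothetical choices made for the example.

```python
import math
from collections import Counter


def term_frequency(words):
    """Rate of each word: occurrences divided by total words in the text."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def tf_idf_profiles(texts):
    """Build one word-score profile per text (hypothetical helper).

    Score = term frequency times the negated logarithm of the word's
    global frequency, i.e. the fraction of all texts containing the word.
    Tokenization by lowercase whitespace split is an assumption.
    """
    tokenized = [text.lower().split() for text in texts]
    n_texts = len(tokenized)
    containing = Counter()
    for words in tokenized:
        containing.update(set(words))  # count each text at most once per word
    profiles = []
    for words in tokenized:
        tf = term_frequency(words)
        # log(n / k) equals -log(k / n), the negated log of global frequency
        profiles.append(
            {w: f * math.log(n_texts / containing[w]) for w, f in tf.items()}
        )
    return profiles


def char_ngrams(text, n=5):
    """Sparse character n-gram counts: only n-grams that occur are stored."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


if __name__ == "__main__":
    profiles = tf_idf_profiles(
        ["the octopus has three hearts", "the shark has fins"]
    )
    print(profiles[0])
    print(list(char_ngrams("for example"))[:5])
```

In this sketch a word that appears in every text ("the", "has") receives a score of zero, matching the quoted observation that common, uninformative words have little influence on the target profile, while any 5-gram absent from the counter is treated as having a score of zero.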
Additional Prior Art References file whose language is to be identified. When a pattern is found, the invention identifies the language containing it, based on the table." Armstrong p. 4 "1. Underlined words in the hyperlink. 200 boolean features are allocated to encode selected words that occur within the scope of the hypertext link (i.e., the underlined words seen by the user). These 200 features correspond to only the 200 words found to be most informative over all links in the training data (see below.)"
The `067 Patent (c) providing, by the user to the local computer system, search request data representative of the user's expressed desire to locate data substantially pertaining to said search request data;
Braden Braden 7:35-38 "Specifically, in operation, a user supplies a search query to system 5. The query should be in full-text (commonly referred to as "literal") form in order to take full advantage of its semantic content through natural language processing."
Herz Herz 66:52-61 "However, in a variation, the user optionally provides a query consisting of textual and/or other attributes, from which query the system constructs a profile in the manner described herein, optionally altering textual attributes as described herein before decomposing them into numeric attributes. Query profiles are similar to the search profiles in a user's search profile set, except that their attributes are explicitly specified by a user, most often for one-time usage, and unlike search profiles, they are not automatically updated to reflect changing interests." Herz See also Abstract; 1:18-43; 4:49-8:8; 55:44 5:14; 56:15-30; 58:57-60:9; Figures 1-16.
Additional Prior Art References Salton `89 p. 160 "Several types of query specifications can be distinguished. A simple query is one containing the value of a single search key. A range query contains a range of values for a single key, for example, a request for all the records of employee ages 22 to 25. A functional query is specified by using a function for the values for certain search keys, for example the age of employees exceeding a given stated threshold." Salton `68 p. 7 "When the search criteria are based in one way or another on the contents of a document, it becomes necessary to use some system of content identification, such as an existing subject classification or a set of content identifiers attached to each item, which may help in restricting the search to items within a certain subject area and in distinguishing items likely to be pertinent from others to be rejected." Salton `68 p. 413 "The user participates in the system by furnishing information about his needs and interests, by directing the search and retrieval operations in accordance with his special requirements, by introducing comments out systems operations, by specifying output format requirements, and nearly by influencing file establishment and file maintenance procedures." Culliss 2:39-41 "[T]he invention can accept a search query from a user and a search engine will identify matched articles." Culliss 12:41-51 "A method of organizing a
Additional Prior Art References plurality of articles comprising . . . (b) accepting a first search query from a first user having first personal data." Ahn 3:37-42 "In step 408, the invention receives a user search request containing a keyword and determines whether the search request is directed to searching an individual document or a group of documents. If the search request is directed to searching an individual document, then step 414 is performed." Brookes 8:48-54 "In this manner the information in the system may be augmented by input from the users, questions may be asked of specific users and responses directed accordingly. A collection of information items related in this manner is termed a `discussion'. The context of a discussion is defined by the parameters (especially keywords) of its constituent information items." Brookes See, e.g., 12:27-37 "storing in association with each information item in the database system a plurality of parameters including (i) at least one keyword indicative of the subject matter of said information item, and (ii) a priority level value for each information item, wherein said priority level value is selected from a predetermined set of priority level values, and wherein said at least one keyword is selected from a finite hierarchical set of keywords having a tree structure relating broad keywords to progressively narrower keywords."
Additional Prior Art References Dasan 7:28-38 "the user specifies search terms used in the full-text search. These are illustrated in field 804. Any number of search terms may be used and the "l" character is treated as a disjunction ("or"). Then, by selecting either of user interface objects 806 or 808, the user specifies whether the search terms are case sensitive or not. This is detected at step 706. At step 708, using either a scrollable list containing selectable item(s), as illustrated in field 810, or other means, the user specifies the search context(s) (the publications, newsfeeds, etc...) in which to search. By the selection of icon 812 or other commit means." Dedrick See, e.g., Figures 1-8, 8:20-9:24, 14:55-64. Krishnan 7:61-63 "The query screen allows a user to express a query by simply filling out fields in a form." Krishnan 12:36-47 "[A] method for enhancing efficiencies with which objects retrieved from the Internet are maintained for access by the multiple members, the method comprising: . . . receiving a member-generated query for one or more objects that can be obtained from the Internet." Krishnan See also Fig. 6. Kupiec 4:7-8 "The method begins by accepting as input the user's question and a set of documents that are assumed to contain the
Additional Prior Art References answer to the question." Reese 7:1-23 "In collecting the information that matches the query request, the server may collect different forms of information." Menczer p. 162 "Consider for example the following query: "Political institutions: The structure, branches and offices of government." Menczer p. 163 "The user initially provides a list of keywords and a list of starting points, in the form of a bookmark file.2 In step (0), the population is initialized by pre-fetching the starting documents. Each agent is "positioned" at one of these document and given a random behavior (depending on the representation) and an initial reservoir of "energy". In step (2), each agent "senses" its local neighborhood by analyzing the text of the document where it is currently situated. This way, the relevance of all neighboring documents -those pointed to by the hyperlinks in the current document- is estimated. Based on these link relevance estimates, an agent "moves" by choosing and following one of the links from the current document." Armstrong p. 4 "4. Words used to define the user goal. These features indicate words entered by the user while defining the information search goal. In our experiments, the only goals considered were searches for technical papers, for which the user could optionally enter the title, author, organization, etc. (see Figure 3). All words entered in this
Additional Prior Art References way throughout the training set were included (approximately 30 words, though the exact number varied with the training set used in the particular experiment). The encoding of the boolean feature in this case is assigned a 1 if and only if the word occurs in the user-specified goal and occurs in the hyperlink, sentence, or headings associated with this example." Salton `89 p. 275 "In these circumstances, it is advisable first to characterize record and query content by assigning special content descriptions, or profiles, identifying the items and representing text content. The text profiles can be used as short-form descriptions; they also serve as document, or query, surrogates during the text-search and [text] retrieval operations." Salton `89 p. 294-6 (see also fn. 28-30) (Linguistic methodologies including syntactic class indicators (adjective, noun