Personalized User Model LLP v. Google Inc.

Filing 222

REDACTED VERSION of 215 Response to Motion Personalized User Model L.L.P.'s Opposition to Google, Inc.'s Motion for Leave to File Motion for Summary Judgment by Personalized User Model LLP. (Attachments: # 1 Exhibit 1-6)(Tigan, Jeremy)

Personalized User Model LLP v. Google Inc. Doc. 222 Att. 1

EXHIBIT 1

SNR Denton US LLP
1530 Page Mill Road, Suite 200
Palo Alto, CA 94304-1125 USA

Jennifer D. Bennett
Managing Associate
jennifer.bennett@snrdenton.com
D +1 650 798 0325  T +1 650 798 0300  F +1 650 798 0310
snrdenton.com

March 10, 2011

BY COURIER

SRI International
333 Ravenswood Avenue
Menlo Park, CA 94025

Re: Personalized User Model LLP v. Google Inc., C.A. No. 09-00525-LPS

To Whom It May Concern:

On July 16, 2010, my client Personalized User Model, LLP brought a civil action against Google, Inc. for patent infringement in the United States District Court for the District of Delaware. You are being contacted because SRI International is likely to have documents and other information relevant to the case arising from its association and dealings with Google, Inc. Please see the attached subpoena and exhibits for instructions on how to respond.

Kind regards,

/s/ Jennifer D. Bennett
Jennifer D. Bennett

Enclosure

AO 88A (Rev.) Subpoena to Testify at a Deposition or to Produce Documents in a Civil Action

UNITED STATES DISTRICT COURT
for the Northern District of California

Personalized User Model, LLP, Plaintiff
v.
Google, Inc., Defendant

Civil Action No. 1:09-cv-525 (LPS)
(If the action is pending in another district, state where: District of Delaware)

SUBPOENA TO TESTIFY AT A DEPOSITION OR TO PRODUCE DOCUMENTS IN A CIVIL ACTION

To: SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025

Testimony: YOU ARE COMMANDED to appear at the time, date, and place set forth below to testify at a deposition to be taken in this civil action. If you are an organization that is not a party in this case, you must designate one or more officers, directors, or managing agents, or designate other persons who consent to testify on your behalf about the following matters, or those set forth in an attachment.

Place: SNR Denton US LLP, 1530 Page Mill Road, Suite 200, Palo Alto, CA 94304
Date and Time: 03/21/2011, 09:00
The deposition will be recorded by this method: Stenographic and video

Production: You, or your representatives, must also bring with you to the deposition the following documents, electronically stored information, or objects, and permit their inspection, copying, testing, or sampling of the material.

The provisions of Fed. R. Civ. P. 45(c), relating to your protection as a person subject to a subpoena, and Rule 45(d) and (e), relating to your duty to respond to this subpoena and the potential consequences of not doing so, are attached.

Date: 03/10/2011

CLERK OF COURT
Signature of Clerk or Deputy Clerk

OR

/s/ Jennifer Bennett
Attorney's signature

The name, address, e-mail, and telephone number of the attorney representing (name of party) Personalized User Model, LLP, who issues or requests this subpoena, are:

Jennifer Bennett, SNR Denton US LLP, 1530 Page Mill Road, Suite 200, Palo Alto, CA 94304; T: 650.798.0300; Email: jennifer.bennett@snrdenton.com

Civil Action No. 1:09-cv-525 (LPS)

PROOF OF SERVICE
(This section should not be filed with the court unless required by Fed. R. Civ. P. 45.)
This subpoena for (name of individual and title, if any) ________ was received by me on (date) ________.

I personally served the subpoena on the individual at (place) ________ on (date) ________; or

I left the subpoena at the individual's residence or usual place of abode with (name) ________, a person of suitable age and discretion who resides there, on (date) ________, and mailed a copy to the individual's last known address; or

I served the subpoena on (name of individual) ________, who is designated by law to accept service of process on behalf of (name of organization) ________ on (date) ________; or

I returned the subpoena unexecuted because ________; or

Other (specify):

Unless the subpoena was issued on behalf of the United States, or one of its officers or agents, I have also tendered to the witness fees for one day's attendance, and the mileage allowed by law, in the amount of $________.

My fees are $________ for travel and $________ for services, for a total of $0.00.

I declare under penalty of perjury that this information is true.

Date: ________
Server's signature
Printed name and title
Server's address

Additional information regarding attempted service, etc.:

Federal Rule of Civil Procedure 45 (c), (d), and (e) (Effective 12/1/07)

(c) Protecting a Person Subject to a Subpoena.

(1) Avoiding Undue Burden or Expense; Sanctions. A party or attorney responsible for issuing and serving a subpoena must take reasonable steps to avoid imposing undue burden or expense on a person subject to the subpoena. The issuing court must enforce this duty and impose an appropriate sanction, which may include lost earnings and reasonable attorney's fees, on a party or attorney who fails to comply.

(2) Command to Produce Materials or Permit Inspection.

(A) Appearance Not Required. A person commanded to produce documents, electronically stored information, or tangible things, or to permit the inspection of premises, need not appear in person at the place of production or inspection unless also commanded to appear for a deposition, hearing, or trial.

(B) Objections. A person commanded to produce documents or tangible things or to permit inspection may serve on the party or attorney designated in the subpoena a written objection to inspecting, copying, testing or sampling any or all of the materials or to inspecting the premises, or to producing electronically stored information in the form or forms requested. The objection must be served before the earlier of the time specified for compliance or 14 days after the subpoena is served. If an objection is made, the following rules apply:

(i) At any time, on notice to the commanded person, the serving party may move the issuing court for an order compelling production or inspection.

(ii) These acts may be required only as directed in the order, and the order must protect a person who is neither a party nor a party's officer from significant expense resulting from compliance.

(3) Quashing or Modifying a Subpoena.

(A) When Required.
On timely motion, the issuing court must quash or modify a subpoena that:

(i) fails to allow a reasonable time to comply;

(ii) requires a person who is neither a party nor a party's officer to travel more than 100 miles from where that person resides, is employed, or regularly transacts business in person, except that, subject to Rule 45(c)(3)(B)(iii), the person may be commanded to attend a trial by traveling from any such place within the state where the trial is held;

(iii) requires disclosure of privileged or other protected matter, if no exception or waiver applies; or

(iv) subjects a person to undue burden.

(B) When Permitted. To protect a person subject to or affected by a subpoena, the issuing court may, on motion, quash or modify the subpoena if it requires:

(i) disclosing a trade secret or other confidential research, development, or commercial information;

(ii) disclosing an unretained expert's opinion or information that does not describe specific occurrences in dispute and results from the expert's study that was not requested by a party; or

(iii) a person who is neither a party nor a party's officer to incur substantial expense to travel more than 100 miles to attend trial.

(C) Specifying Conditions as an Alternative. In the circumstances described in Rule 45(c)(3)(B), the court may, instead of quashing or modifying a subpoena, order appearance or production under specified conditions if the serving party:

(i) shows a substantial need for the testimony or material that cannot be otherwise met without undue hardship; and

(ii) ensures that the subpoenaed person will be reasonably compensated.

(d) Duties in Responding to a Subpoena.

(1) Producing Documents or Electronically Stored Information. These procedures apply to producing documents or electronically stored information:

(A) Documents. A person responding to a subpoena to produce documents must produce them as they are kept in the ordinary course of business or must organize and label them to correspond to the categories in the demand.

(B) Form for Producing Electronically Stored Information Not Specified. If a subpoena does not specify a form for producing electronically stored information, the person responding must produce it in a form or forms in which it is ordinarily maintained or in a reasonably usable form or forms.

(C) Electronically Stored Information Produced in Only One Form. The person responding need not produce the same electronically stored information in more than one form.

(D) Inaccessible Electronically Stored Information. The person responding need not provide discovery of electronically stored information from sources that the person identifies as not reasonably accessible because of undue burden or cost. On motion to compel discovery or for a protective order, the person responding must show that the information is not reasonably accessible because of undue burden or cost. If that showing is made, the court may nonetheless order discovery from such sources if the requesting party shows good cause, considering the limitations of Rule 26(b)(2)(C). The court may specify conditions for the discovery.

(2) Claiming Privilege or Protection.

(A) Information Withheld. A person withholding subpoenaed information under a claim that it is privileged or subject to protection as trial-preparation material must:

(i) expressly make the claim; and

(ii) describe the nature of the withheld documents, communications, or tangible things in a manner that, without revealing information itself privileged or protected, will enable the parties to assess the claim.

(B) Information Produced.
If information produced in response to a subpoena is subject to a claim of privilege or of protection as trial-preparation material, the person making the claim may notify any party that received the information of the claim and the basis for it. After being notified, a party must promptly return, sequester, or destroy the specified information and any copies it has; must not use or disclose the information until the claim is resolved; must take reasonable steps to retrieve the information if the party disclosed it before being notified; and may promptly present the information to the court under seal for a determination of the claim. The person who produced the information must preserve the information until the claim is resolved.

(e) Contempt. The issuing court may hold in contempt a person who, having been served, fails without adequate excuse to obey the subpoena. A nonparty's failure to obey must be excused if the subpoena purports to require the nonparty to attend or produce at a place outside the limits of Rule 45(c)(3)(A)(ii).

IN THE UNITED STATES DISTRICT COURT FOR THE DISTRICT OF DELAWARE

PERSONALIZED USER MODEL, L.L.P., Plaintiff, v. GOOGLE, INC., Defendant.  C.A. No. 09-525 (LPS)

NOTICE OF RULE 30(b)(6) DEPOSITION OF SRI INTERNATIONAL

PLEASE TAKE NOTICE that, pursuant to Rules 26 and 30 of the Federal Rules of Civil Procedure, Plaintiff Personalized User Model, L.L.P. ("P.U.M.") will take the deposition of Third Party SRI International ("SRI") concerning the topics identified in Exhibit A. The deposition will begin at 9:00 a.m. on March 21, 2011, or at an otherwise mutually agreeable date, and will be held at the offices of SNR Denton US LLP, 1530 Page Mill Road, Suite 200, Palo Alto, CA 94304, or at an otherwise mutually agreeable location. If the deposition is not completed on the date set out above, the taking of the deposition will continue day to day thereafter or pursuant to the parties' agreement. The deposition will be recorded by stenographic, videographic, and/or audiographic means.

Pursuant to Rule 30(b)(6) of the Federal Rules of Civil Procedure, SRI is directed to designate one or more officers, directors, or managing agents, or other persons who will testify on its behalf, who are most knowledgeable regarding the matters identified in the attached Exhibit A. SRI is requested to provide a written designation of the names and positions of the officers, directors, managing agents, or other persons designated to testify concerning the matters identified in the attached Exhibit and, for each person, identify the matters on which he or she will testify. P.U.M. reserves the right to serve additional 30(b)(6) notices.

Dated: March 10, 2011

By: /s/ Jennifer D. Bennett
Jennifer D. Bennett (California State Bar No. 235196)
SNR Denton US LLP
1530 Page Mill Road, Suite 200
Palo Alto, CA 94304
Telephone: (650) 798-0300
Facsimile: (650) 798-0310
E-Mail: jennifer.bennett@snrdenton.com

Marc S. Friedman
SNR Denton US LLP
1221 Avenue of the Americas
New York, NY 10020-1089
Telephone: (212) 768-6700
Facsimile: (212) 768-6800
E-Mail: marc.friedman@snrdenton.com

Attorneys for Plaintiff PERSONALIZED USER MODEL, L.L.P.

CERTIFICATE OF SERVICE

I hereby certify that on March 10, 2011, copies of the foregoing were caused to be served by e-mail upon the following:

Richard L. Horwitz
David E. Moore
POTTER ANDERSON & CORROON LLP
1313 N. Market St., 6th Floor
Wilmington, DE 19801
rhorwitz@potteranderson.com
dmoore@potteranderson.com

Brian C. Cannon
QUINN EMANUEL URQUHART OLIVER & HEDGES, LLP
briancannon@quinnemanuel.com
Charles K. Verhoeven
QUINN EMANUEL URQUHART OLIVER & HEDGES, LLP
charlesverhoeven@quinnemanuel.com

David A. Perlson
QUINN EMANUEL URQUHART OLIVER & HEDGES, LLP
davidperlson@quinnemanuel.com

Antonio R. Sistos
QUINN EMANUEL URQUHART OLIVER & HEDGES, LLP
antoniosistos@quinnemanuel.com

Eugene Novikov
QUINN EMANUEL URQUHART OLIVER & HEDGES, LLP
eugenenovikov@quinnemanuel.com

/s/ Jennifer D. Bennett
Jennifer D. Bennett (Cal. Bar No. 235196)
SNR Denton US LLP
1530 Page Mill Road, Suite 200
Palo Alto, CA 94304-1125
(650) 798-0300

EXHIBIT A

I. DEFINITIONS

1. "SRI," "YOU," and "YOUR" means SRI International, and its officers, directors, current and former employees, counsel, agents, consultants, representatives, and any other persons acting on behalf of any of the foregoing, and SRI International's affiliates, parents, divisions, joint ventures, licensees, franchisees, assigns, predecessors and successors in interest, and any other legal entities, whether foreign or domestic, that are owned or controlled by SRI International, and all predecessors and successors in interest to such entities.

2. "Google" means Google, Inc. and its officers, directors, current and former employees, counsel, agents, consultants, representatives, attorneys, and any other persons acting on behalf of any of the foregoing, and Google's affiliates, parents, divisions, joint ventures, licensees, franchisees, assigns, predecessors and successors in interest, and any other legal entities, whether foreign or domestic, that are owned or controlled by Google, and all predecessors and successors in interest to such entities.

3. "Lawsuit" means the case styled Personalized User Model LLP v. Google, Inc., 1:09-cv-525, in the United States District Court for the District of Delaware.

4. "`040 PATENT" means U.S. Patent No. 6,981,040, entitled "Automatic, Personalized Online Information and Product Services," all underlying patent applications, all continuations, continuations-in-part, divisionals, reissues, and any other patent applications in the `040 patent family.

5. "`031 PATENT" means U.S. Patent No. 7,320,031, entitled "Automatic, Personalized Online Information and Product Services," all underlying patent applications, all continuations, continuations-in-part, divisionals, reissues, and any other patent applications in the `031 patent family.

6. "`276 PATENT" means U.S. Patent No. 7,685,276, entitled "Automatic, Personalized Online Information and Product Services," all underlying patent applications, all continuations, continuations-in-part, divisionals, reissues, and any other patent applications in the `276 patent family.

7. "PATENTS-IN-SUIT" shall refer to the `040 PATENT, the `031 PATENT, and the `276 PATENT, individually and collectively.

8. "DOCUMENT" shall mean all materials and information that are discoverable pursuant to Rule 34 of the Federal Rules of Civil Procedure. A draft or non-identical copy is a separate document within the meaning of this term.

9. "PUM" and "PLAINTIFF" shall mean Personalized User Model, L.L.P., Plaintiff in the civil case captioned Personalized User Model, LLP v. Google Inc., Case No. 09-525 (JJF).

10. The term "PERSON" shall refer to any individual, corporation, proprietorship, association, joint venture, company, partnership or other business or legal entity, including governmental bodies and agencies.
"REFLECT," "REFLECTING," "RELATE TO," "REFER TO," "RELATING TO," and "REFERRING TO" shall mean relating to, referring to, concerning, mentioning, reflecting, pertaining to, evidencing, involving, describing, discussing, commenting on, embodying, responding to, supporting, contradicting, or constituting (in whole or in part), as the context makes appropriate. 12. 13. 14. "Include" and "including" shall mean including without limitation. Use of the singular also includes the plural and vice-versa. The words "or" and "and" shall be read in the conjunctive and in the disjunctive 2 wherever they appear, and neither of these words shall be interpreted to limit the scope of these Interrogatories. 15. tenses. DEPOSITION TOPICS 1. All facts and circumstances, including but not limited to all communications whether The use of a verb in any tense shall be construed as the use of the verb in all other written, oral or otherwise, between Google and SRI, concerning all transactions, contracts, agreements and understandings, and payments between Google and SRI concerning the patentsin-suit or any invention(s) claimed therein, and/or Yochai Konig. 2. 3. The work performed by Yochai Konig while at SRI. Any and all documents or other evidence that Dr. Konig developed the inventions claimed in the patents-in-suit using SRI's equipment, supplies, facility, or trade secret information, or during the time of day when he was supposed to be working for SRI. 4. All documents provided by SRI to Google regarding Yochai Konig or work performed by him for SRI. 5. All invoices submitted by SRI to Google for work responding to discovery in connection with this lawsuit. 6. SRI's knowledge of Yochai Konig and/or Utopy's work after Dr. Konig left the employment of SRI. 7. Activities of the SRI Speech Technology and Research (STAR) Laboratory from 1996 through 1999. 8. All business relationships or contracts between SRI and Google, or subsidiary or affiliate of Google, including, but not limited to (a) all work performed by SRI for Google, or subsidiary or affiliate of Google, in the last 10 years; (b) all work performed by Google, or subsidiary or affiliate of Google, for SRI in the last 10 years, and (c) all sums of money received by SRI from 3 Google, or any subsidiary or affiliate of Google, or any officers or directors of these entities in the last 10 years. 9. All documents produced by SRI to PUM under the previously served subpoena, including, but not limited, to the authenticity of such documents and the manner in which they were created and kept. 10. All information received from third parties relating to any of the above subjects. 4 EXHIBIT 2 REDACTED REDACTED REDACTED REDACTED EXHIBIT A FULLY REDACTED EXHIBIT B FULLY REDACTED EXHIBIT 3 REDACTED EXHIBIT A FULLY REDACTED EXHIBIT B FULLY REDACTED EXHIBIT C Sentence You look good Production Model (e.g., HMM) P(acoustic vector sequence) P({(0.2, 0.3),...,(0.4,0.1)}) = 0.2 P({(0.1,0.5),...,(0.3,0.9)}) = 0.3 . . Acoustic Vector Sequence {(0.2,0.3),...,(0.3,-1.9)} Recognition Model P(sentence) P("You look good") = 0.1 P("Who looks good?") = 0.4 . . P (correct sentence) = 0.2 Parameters P ( other sentences) = 0.8 Acoustics Modified Parameters P (correct sentence) = 0.4 P (other sentences) = 0.6 p(q1 | q1) p(q2 | q2) p(q2 | q1) q1 q2 p(x | q1) p(x | q2) v b m r z Output Layer 54-61 Phones ... Hidden Layer: 500-4000 Fully Connected Units ... 
EXHIBIT D

NONLINEAR DISCRIMINANT FEATURE EXTRACTION FOR ROBUST TEXT-INDEPENDENT SPEAKER RECOGNITION

Yochai Konig, Larry Heck, Mitch Weintraub, and Kemal Sonmez
Speech Technology and Research Laboratory
SRI International
Menlo Park, CA 94025

RÉSUMÉ

This paper proposes a method based on nonlinear discriminant analysis for extracting and selecting a set of acoustic feature vectors used for speaker identification. The approach consists of gathering a large number of acoustic measurements (corresponding to several consecutive frames of data) and reducing the dimensionality of the resulting vector by means of an artificial neural network. The criterion used to optimize the network weights is to maximize a measure of the separation between the speakers of a development database. The network architecture is such that one of its intermediate layers represents the projection of the input acoustic vectors onto a lower-dimensional space. After training, this part of the network can be isolated and used to project the acoustic vectors of a test database, and the projected vectors can then be classified. Combined with a cepstral classifier, the classifier using these new acoustic vectors reduces the classification error rate by 15% on the database defined by NIST in 1997 for the evaluation of speaker recognition systems.

ABSTRACT

We study a nonlinear discriminant analysis (NLDA) technique that extracts a speaker-discriminant feature set. Our approach is to train a multilayer perceptron (MLP) to maximize the separation between speakers by nonlinearly projecting a large set of acoustic features (e.g., several frames) to a lower-dimensional feature set. The extracted features are optimized to discriminate between speakers and to be robust to mismatched training and testing conditions. We train the MLP on a development set and apply it to the training and testing utterances. Our results show that by combining the NLDA-based system with a state-of-the-art cepstrum-based system we improve the speaker verification performance on the 1997 NIST Speaker Recognition Evaluation set by 15% on average compared with our cepstrum-only system.

1. INTRODUCTION

Our goal is to extract and select features that are more invariant to non-speaker-related conditions such as handset type, sentence content, and channel effects. Such features will be robust to mismatched training and testing conditions of speaker verification systems.
With current feature sets (e.g., cepstrum) there is a big performance gap between matched and mismatched tests [8], even after applying standard channel compensation techniques [4]. In order to find these features, the feature extraction step should be directly optimized to increase discrimination between speakers and to filter out the non-relevant information. Our proposed solution is to train a multilayer perceptron (MLP) to nonlinearly project a large set of acoustic features to a lower-dimensional feature set, such that it maximizes speaker separation. We train the MLP on a development set that includes several realizations of the same speakers under different conditions. We then apply the learned transformation (MLP in feed-forward mode) to the training and testing utterances. Finally, we use the resulting features for training the speaker recognition system, e.g., a Bayesian adapted Gaussian mixture system [9].

We begin by reviewing related studies in Section 2. We describe the proposed feature extraction technique in Section 3. The development database is described in Section 4. In Section 5, we report the experimental results on the 1997 NIST evaluation set. We continue with an analysis of the results in Section 6. Finally, we conclude and describe directions for future work in Section 7.

2. RELATED STUDIES

Studies related to the NLDA technique can be divided into two main categories: robust speaker verification systems, and data-driven feature extraction techniques. Previously proposed approaches to increase robustness to mismatched training and testing conditions, especially to handset variations, include handset-dependent background models [3] and a handset-dependent score normalization procedure known as Hnorm [9]. Data-driven feature extraction techniques were mainly suggested for speech recognition tasks. Rahim, Bengio and LeCun suggested optimizing a set of parallel class-specific (e.g., phones) networks performing feature transformation based on a minimum classification error (MCE) criterion [7]. Fontaine, Ris and Boite used a 2-hidden-layer MLP to perform NLDA for an isolated-word, large-vocabulary speech recognition task [2]. The training criterion for the MLPs was phonetic classification. Bengio and his colleagues suggested a global optimization of a neural network-hidden Markov model (HMM) hybrid, where the outputs of the neural network constitute the observation sequence for the HMM [1].

3. NONLINEAR DISCRIMINANT ANALYSIS (NLDA)

We explore a nonlinear discriminant analysis (NLDA) technique that finds a nonlinear projection of the original feature space into a lower-dimensional space that maximizes speaker recognition performance. This maximization problem can be expressed as

    A* = argmax_A J{A(X)}    (1)

where A(X) is a nonlinear projection of the original feature space X onto a lower-dimensional space, and J{ } is a closed-set speaker identification performance measure. To find the best A we train a 5-layer multilayer perceptron (MLP) to discriminate between speakers in a carefully selected development set (as described below). The MLP is constructed from a large input layer, a first large nonlinear hidden layer, a small ("bottleneck") second linear hidden layer, a large third nonlinear hidden layer, and a softmax output layer (see Figure 1). The idea is that A is the projection of the input features onto the "bottleneck" layer. After training the 5-layer MLP (denoted `MLP5') we can remove the last hidden layer and the output layer, and use the remaining 3-layer MLP to project the target speaker data. Then, we use the transformed features to train the speaker verification system, for example, a Bayesian adapted GMM system (see Figure 2). The underlying assumption is that the transformation as found on the development set will be invariant across different speaker populations.

Figure 1: MLP5 for speaker recognition. Inputs (9 frames of cepstrum and pitch) pass through a nonlinear layer, the linear "bottleneck" of projected features, and a further nonlinear layer to a softmax output estimating P(speaker | inputs).

Figure 2: MLP3 for feature transformation. Inputs (9 frames of cepstrum and pitch) are nonlinearly projected onto the bottleneck features, which are then modeled by a Gaussian mixture model.
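To make the MLP5/MLP3 construction concrete, here is a minimal NumPy sketch (not code from the paper): the layer sizes follow the development setup described in Section 4 (162 inputs, 500 and 500 sigmoid units around a 34-unit linear bottleneck, 31 softmax speaker outputs), the weights are random placeholders rather than trained values, and the speaker-discriminative training itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [162, 500, 34, 500, 31]          # input, hidden, bottleneck, hidden, speakers
weights = [rng.normal(0.0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def mlp5_forward(x):
    """Full MLP5: returns per-speaker posteriors P(speaker | inputs)."""
    h1 = sigmoid(x @ weights[0] + biases[0])     # first nonlinear layer
    h2 = h1 @ weights[1] + biases[1]             # linear bottleneck (34-dim)
    h3 = sigmoid(h2 @ weights[2] + biases[2])    # third nonlinear layer
    return softmax(h3 @ weights[3] + biases[3])  # softmax over 31 speakers

def mlp3_features(x):
    """MLP3: MLP5 with the top two layers removed; returns the 34-dim
    bottleneck projection used as speaker-discriminant features."""
    h1 = sigmoid(x @ weights[0] + biases[0])
    return h1 @ weights[1] + biases[1]

# Example: project one 162-dimensional input vector (9 frames of features).
frame = rng.normal(size=162)
features = mlp3_features(frame)   # NLDA features fed to the GMM back end
```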
4. DEVELOPMENT DATABASE

To train the 5-layer MLP, we chose 855 Switchboard sentences (about 2 hours) from 31 speakers with a balanced mix of carbon and electret handsets, and balanced across gender. The input consists of 17 cepstral coefficients and an estimate of the pitch for the current frame, four past frames, and four future frames, resulting in a 162-dimensional vector. The first hidden layer has 500 sigmoidal units, the bottleneck layer has 34 linear units, the third hidden layer has 500 sigmoidal units, and the softmax output layer has 31 outputs (one for each speaker in the development set). After training the MLP5, we chopped the upper two layers. The resulting MLP (`MLP3') has one hidden layer and was used to transform the data of the target and impostor speakers in a test set as described above.

5. EXPERIMENTAL RESULTS

We used the 1997 NIST Speaker Recognition Evaluation corpus for testing. We report results for three different systems: (1) our best cepstrum system, which is our implementation of the state of the art in text-independent speaker verification systems [6], with 33 input features comprised of 10 cepstral coefficients, an energy term, and first and second time derivatives; (2) the NLDA-based system described in this paper; and (3) a combination of the cepstrum and the NLDA systems. The third system is a linear combination of the normalized scores with weights of 0.7 for the cepstrum system scores and 0.3 for the NLDA system scores (except for the 3-second cases, where we used 0.6 for the cepstrum system and 0.4 for the NLDA system). We use the equal error rate (EER) between misses and false alarms as a performance measure for reporting results.

In Table 1, we summarize the results for the 1h condition in the NIST evaluation. In this condition the training consists of 2 phone calls from the same handset, each 1 minute in duration. There are three different test lengths: 3, 10, and 30 seconds. We report the results for each gender separately, by pooling all the test data together (matched and mismatched telephone number).

Table 1: Equal Error Rate (EER) results on the 1997 NIST Eval., 1h condition
  Test        Cepstrum   NLDA    Combined
  female 3    18.4%      23.0%   16.7%
  female 10   12.1%      14.6%   10.8%
  female 30   10.5%      12.4%    9.0%
  male 3      14.9%      19.4%   14.4%
  male 10     13.2%      12.9%   11.1%
  male 30      7.9%      11.0%    7.1%

The results show a consistent win for the combined system over our state-of-the-art cepstrum system.
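The combined system above is a fixed linear fusion of normalized per-trial scores. The sketch below, with invented scores and a simple threshold sweep (not the paper's scoring code), illustrates the 0.7/0.3 fusion and an equal-error-rate computation of the kind reported in Tables 1 and 2.

```python
import numpy as np

def fuse_scores(cep_scores, nlda_scores, w_cep=0.7, w_nlda=0.3):
    """Linear fusion of normalized verification scores (0.6/0.4 for 3-second tests)."""
    return w_cep * np.asarray(cep_scores) + w_nlda * np.asarray(nlda_scores)

def equal_error_rate(scores, labels):
    """Sweep a threshold over the scores and return the point where the miss
    rate and false-alarm rate are closest (approximate EER)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        miss = np.mean(scores[labels] < t)       # target trials rejected
        fa = np.mean(scores[~labels] >= t)       # impostor trials accepted
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), (miss + fa) / 2.0
    return eer

# Hypothetical trial scores: higher means more likely the target speaker.
cep = np.array([2.1, 1.7, 0.3, -0.5, 1.9, -1.2])
nlda = np.array([1.8, 0.9, 0.5, -0.8, 2.2, -0.4])
labels = np.array([1, 1, 0, 0, 1, 0])
print(equal_error_rate(fuse_scores(cep, nlda), labels))
```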
We observe the same consistent win for another condition, 1s, in the 1997 NIST Speaker Recognition Evaluation, as demonstrated for the 10-second case in Table 2, and across all regions of the DET (false alarm probability versus miss probability) curves, as illustrated in Figure 3 for the male, 10-second tests (1h condition) for the cepstrum-only system and the combined system. These results are consistent with our initial results for the 1998 Evaluation corpus.

Table 2: Equal Error Rate (EER) results on the 1997 NIST Eval., 1s condition
  Test        Cepstrum   NLDA    Combined
  female 10   13.5%      17.0%   12.5%
  male 10     11.3%      14.4%   10.5%

Figure 3: DET curve (miss probability versus false alarm probability, both in %) for male, 1h condition, 10-second tests, comparing the cepstrum-only and combined systems.

6. RESULT ANALYSIS

In this section, we examine our "black box" approach, provide insight into its success, and give directions for potential improvements. In order to examine the importance of the pitch input, the 9-frame temporal window, and the degradation loss as a result of the dimension reduction from 162 inputs to 34 hidden units in the bottleneck layer, we trained several MLPs and tested their cross-validation, frame-level performance on a closed-set speaker recognition task (our development set as described above). In the development phase we found a strong correlation between these frame-level results and the "full cycle" results of the speaker verification system. The results are summarized in Table 3. We trained two types of MLPs: a 5-layer MLP, and a "vanilla" MLP with three layers including one hidden layer (denoted `MLP3'). As mentioned above, there were 31 speakers in our development set, with 687,156 frames for training and 77,904 for cross-validation. Our baseline MLP is the MLP5 described above with 162 inputs and 3 hidden layers with 500, 34, and 500 units (named `MLP5-34'). The output layer of all our nets has 31 outputs, one output for each speaker in our development set. The MLP5 named `MLP5-NO' is the same as the baseline but without pitch information (only 153 inputs). The MLP5 named `MLP5-1fr' is the same as the baseline but with only one input frame (as compared to the 9 frames used in the other systems).

Table 3: Frame-level results on the cross-validation set
  Name       Inputs               Frame Correct
  MLP3       9 frames + pitch     37.2%
  MLP5-34    9 frames + pitch     28.9%
  MLP5-50    9 frames + pitch     29.0%
  MLP5-NO    9 frames, no pitch   25.9%
  MLP5-1fr   1 frame + pitch      18.6%

Training a 5-layer MLP is difficult given the complex nonlinear error surface and requires a lot of training data, preferably a ratio of at least 10 between frames and free parameters. In these experiments the ratio was around 4.7 (700k frames to 150k parameters). This might explain the disparity in performance between the MLP3 and the MLP5. This is not due to the bottleneck size, as shown by the result of the MLP5 named `MLP5-50' (the same as `MLP5-34' but with 50 hidden units in the bottleneck layer). In our speech recognition experiments [5] with NLDA, with the right ratio between frames and free parameters, we did not observe any performance loss because of the dimension reduction at the bottleneck layer. Thus, we plan to increase the size of the development set and hopefully improve the performance of the MLP5 and the overall technique. Additionally, comparing the second row to the fourth and fifth rows in Table 3, we observe that we get a 3% absolute gain from the pitch information, and a 10.3% absolute gain from the temporal window.

Another set of interesting results is the correlation between the cepstrum and the NLDA scores on the 1997 Eval. set, 1h condition, as summarized in Table 4. From these results, we observe that the NLDA technique contributes a significant amount of new information, especially for the shorter test lengths. This is consistent with the results previously shown in Table 1.

Table 4: Correlation coefficients between the NLDA and cepstrum systems on the 1997 Eval. set, 1h condition
  Test length (s)   3      10     30
  Male              0.61   0.68   0.76
  Female            0.47   0.71   0.77
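The statistic in Table 4 is an ordinary correlation coefficient between the two systems' per-trial scores. A one-line check on hypothetical score arrays (the values below are illustrative, not the evaluation data):

```python
import numpy as np

# Hypothetical per-trial scores from the two systems; the Pearson correlation
# coefficient is the quantity reported in Table 4.
cep_scores = np.array([1.2, 0.4, -0.3, 2.0, 0.8, -1.1])
nlda_scores = np.array([0.9, 0.1, 0.2, 1.5, 1.0, -0.6])
r = np.corrcoef(cep_scores, nlda_scores)[0, 1]
print(f"correlation between cepstrum and NLDA scores: {r:.2f}")
```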
7. CONCLUSIONS AND FUTURE WORK

We presented a nonlinear discriminant analysis (NLDA) technique that extracts a speaker-discriminant feature set. Our results on the 1997 NIST evaluation show a consistent (across 12 different tests) and significant (around 15% in relative error) improvement when combining the system trained with the NLDA features with the cepstrum-based system. Our initial results on the 1998 NIST evaluation are consistent with the 1997 results. Furthermore, our analysis suggests that there is potential for performance improvement given more development data. We also plan to experiment with other types of input data such as speech over cellular phones and speaker-phone speech. In addition, we plan to extend this study by using a wider range of input representations and resolutions such as first and second derivatives of cepstrum, filter-bank energy levels, and different analysis windows. Finally, we want to note that although the training of the MLP with 5 layers is computationally expensive (25 x real time), the application of the MLP3 in feed-forward mode is very fast (less than 0.4 x real time); thus the NLDA approach is feasible in realistic settings.

8. REFERENCES

[1] Y. Bengio, R. De Mori, G. Flammia, and R. Kompe. Global optimization of a neural network-hidden Markov model hybrid. IEEE Trans. on Neural Networks, 3(2):252-258, March 1992.
[2] V. Fontaine, C. Ris, and J. M. Boite. Nonlinear discriminant analysis for improved speech recognition. In Proceedings European Conf. on Speech Communication and Technology (EUROSPEECH), Rhodes, Greece, 1997.
[3] L. P. Heck and M. Weintraub. Handset-dependent background models for robust text-independent speaker recognition. In Proceedings International Conference on Acoustics, Speech and Signal Processing (ICASSP), Munich, Germany, 1997.
[4] H. Hermansky, N. Morgan, A. Bayya, and P. Kohn. Compensation for the effect of the communication channel in auditory-like analysis of speech (RASTA-PLP). In Proceedings European Conf. on Speech Communication and Technology (EUROSPEECH), pages 1367-1370, 1991.
[5] Y. Konig and M. Weintraub. Acoustic modeling session - SRI site presentation. In NIST LVCSR Workshop, Linthicum Heights, MD, October 1996.
[6] NIST. Result summary. In Speaker Recognition Workshop Notes, Linthicum Heights, Maryland, 1997.
[7] M. Rahim, Y. Bengio, and Y. LeCun. Discriminative feature and model design for automatic speech recognition. In Proceedings European Conf. on Speech Communication and Technology (EUROSPEECH), Rhodes, Greece, 1997.
[8] D. A. Reynolds. The effects of handset variability on speaker recognition performance: experiments on the Switchboard corpus. In Proceedings International Conference on Acoustics, Speech and Signal Processing (ICASSP), Atlanta, GA, 1996.
[9] D. A. Reynolds. Comparison of background normalization methods for text-independent speaker verification. In Proceedings European Conf. on Speech Communication and Technology (EUROSPEECH), Rhodes, Greece, 1997.

EXHIBIT E

EXPLICIT WORD ERROR MINIMIZATION IN N-BEST LIST RESCORING

Andreas Stolcke, Yochai Konig, Mitchel Weintraub
Speech Technology and Research Laboratory
SRI International, Menlo Park, CA, U.S.A.
http://www.speech.sri.com/
{stolcke,konig,mw}@speech.sri.com

ABSTRACT

We show that the standard hypothesis scoring paradigm used in maximum-likelihood-based speech recognition systems is not optimal with regard to minimizing the word error rate, the commonly used performance metric in speech recognition. This can lead to sub-optimal performance, especially in high-error-rate environments where word error and sentence error are not necessarily monotonically related. To address this discrepancy, we developed a new algorithm that explicitly minimizes expected word error for recognition hypotheses. First, we approximate the posterior hypothesis probabilities using N-best lists. We then compute the expected word error for each hypothesis with respect to the posterior distribution, and choose the hypothesis with the lowest error. Experiments show improved recognition rates on two spontaneous speech corpora.

1. INTRODUCTION

The standard selection criterion for speech recognition hypotheses aims at maximizing the posterior probability of a hypothesis W given the acoustic evidence X [1]:

    Ŵ = argmax_W P(W | X) = argmax_W P(W) P(X | W) / P(X)    (1)
      = argmax_W P(W) P(X | W)    (2)

Here P(W) is the prior probability of a word sequence according to a language model, and P(X | W) is given by the acoustic model. Equation (1) is Bayes' Rule, while (2) is due to the fact that P(X) does not depend on W and can therefore be ignored during maximization. Bayes decision theory (see, e.g., [2]) tells us that this criterion (assuming accurate language and acoustic models) maximizes the probability of picking the correct W; i.e., it minimizes sentence error rate. However, speech recognizers are usually evaluated primarily for their word error rates. Empirically, sentence and word error rates are highly correlated, so that minimizing one tends to minimize the other. Still, if only for theoretical interest, two questions arise:

(A) Are there cases where optimizing expected word error and expected sentence error produce different results?
(B) Is there an effective algorithm to optimize expected word error explicitly?

Note that question (A) is not about the difference between word and sentence error in a particular instance of W and its correct transcription, since obviously the two error criteria would likely pick different best hypotheses in any given instance. Instead, we are concerned with the expected errors, as they would be obtained by averaging over many instances of the same acoustic evidence with varying true word sequences, i.e., if we sampled from the true posterior distribution P(W | X).

We will answer question (A) first by way of a constructed example, showing that indeed the two error metrics can diverge in their choice of the best hypothesis. Regarding question (B), we develop a new N-best rescoring algorithm that explicitly estimates and minimizes word error. We then verify that the algorithm produces lower word error on two benchmark test sets, thus demonstrating that question (A) can be answered in the affirmative even for practical purposes.

2. AN EXAMPLE

The following is a hypothetical list of recognition outputs with attached (true) posterior probabilities. For simplicity we assume that all hypotheses consist of exactly two words, w1 and w2, shown in the first two columns. The third column shows the assumed joint posterior probabilities for these hypotheses.

  w1   w2   P(w1,w2|X)   P(w1|X)   P(w2|X)   E[correct]
  a    d    .00          .44       .40       .84
  a    e    .24          .44       .34       .78
  a    f    .20          .44       .26       .70
  b    d    .20          .26       .40       .66
  b    e    .05          .26       .34       .60
  b    f    .01          .26       .26       .52
  c    d    .20          .30       .40       .70
  c    e    .05          .30       .34       .64
  c    f    .05          .30       .26       .56

Columns 4 and 5 give the posterior probabilities P(w1 | X) and P(w2 | X) for the individual words. These posterior word probabilities follow from the joint posteriors by summing over all hypotheses that share a word in a given position. For example, the posterior P(w1 = a | X) is obtained by summing P(w1, w2 | X) over all hypotheses such that w1 = a. Column 6 shows the expected number of correct words, E[correct], in each hypothesis under the assumed posterior distribution. This is simply the sum of P(w1 | X) and P(w2 | X), since E[words correct] = P(w1 correct | X) + P(w2 correct | X).

As can be seen, although the first hypothesis ("a d") has posterior 0, it has the highest expected number of words correct, i.e., the minimum expected word error. Thus, we have shown by construction that optimizing overall posterior probability (sentence error) does not always minimize expected word error. Of course the example was constructed such that two words that each have high posterior probability happen to have low (i.e., zero) probability when combined. Note that this is not unrealistic: for example, the language model could all but "prohibit" certain word combinations. Furthermore, we can expect the discrepancy between word and sentence error to occur more at high error rates. When error rates are low, i.e., when there is at most one or two word errors per sentence, each word error corresponds to a sentence error and vice-versa. Thus, if we had an algorithm to optimize the expected word error directly, we would expect to see its benefits mostly at high error rates.
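The example table can be reproduced mechanically: the word posteriors in columns 4 and 5 are marginals of the joint posterior, and column 6 is their sum. A short sketch using the paper's own numbers, confirming that "a d" maximizes the expected number of correct words even though its joint posterior is zero:

```python
from collections import defaultdict

# Joint posteriors P(w1, w2 | X) from the constructed example in Section 2.
joint = {("a", "d"): 0.00, ("a", "e"): 0.24, ("a", "f"): 0.20,
         ("b", "d"): 0.20, ("b", "e"): 0.05, ("b", "f"): 0.01,
         ("c", "d"): 0.20, ("c", "e"): 0.05, ("c", "f"): 0.05}

# Marginal word posteriors: sum the joint over hypotheses sharing a word.
p_w1, p_w2 = defaultdict(float), defaultdict(float)
for (w1, w2), p in joint.items():
    p_w1[w1] += p
    p_w2[w2] += p

# Expected number of correct words for each hypothesis (column 6).
expected_correct = {h: p_w1[h[0]] + p_w2[h[1]] for h in joint}

print(max(joint, key=joint.get))                        # ('a', 'e'): highest posterior
print(max(expected_correct, key=expected_correct.get))  # ('a', 'd'): most expected correct words
```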
3. THE ALGORITHM

We now give an algorithm that minimizes the expected word error rate (WER) in the N-best rescoring paradigm [5]. The algorithm has two components: (1) approximating the posterior distribution over hypotheses, and (2) computing the expected WER for N-best hypotheses (and picking the one with lowest expected WER).

3.1. Approximating posterior probabilities

An estimate of the posterior probability of a hypothesis W_i can be derived from Equation (1), with modifications to account for practical limitations:

- The true distributions P(W) and P(X | W) are replaced by their imperfect counterparts, the language model probability P_LM(W) and the acoustic model likelihood P_AC(X | W).
- The dynamic range of the acoustic model, due to unwarranted independence assumptions, needs to be attenuated by an exponent 1/λ (λ is the language model weight commonly used in speech recognizers, and optimized empirically).
- The normalization term P(X) is replaced by a finite sum over all the hypotheses in the N-best list. This is not strictly necessary for the algorithm, since it is invariant to constant factors on the posterior estimates, but it conveniently makes these estimates sum to 1.

Let W_i be the i-th hypothesis in the N-best list; the posterior estimate is thus

    P(W_i | X) ≈ P_LM(W_i) P_AC(X | W_i)^(1/λ) / Σ_{j=1}^{N} P_LM(W_j) P_AC(X | W_j)^(1/λ)

This N-best approximation to the posterior has previously been used, e.g., in the computation of posterior word probabilities for keyword spotting [7].

3.2. Computing expected WER

Given a list of N-best hypotheses and their posterior probability estimates, we approximate the expected WER as the weighted average word error relative to all the hypotheses in the N-best list. That is, we consider each of the hypotheses in turn as the "truth" and weight the word error counts from them with the corresponding posterior probability:

    E[WE(W)] = Σ_{i=1}^{N} P(W_i | X) WE(W, W_i)    (3)

where WE(W, W_i) denotes the word error of W using W_i as the reference string (computed in the standard way using dynamic programming string alignment).
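A compact sketch of the two components just described, assuming per-hypothesis language model log probabilities and acoustic log likelihoods are available; the word-error routine is an ordinary dynamic-programming alignment, and the scores in the usage example are invented:

```python
import numpy as np

def word_error(hyp, ref):
    """Word-level Levenshtein distance between two word lists (WE(., .) in Eq. (3))."""
    d = np.zeros((len(hyp) + 1, len(ref) + 1), dtype=int)
    d[:, 0] = np.arange(len(hyp) + 1)
    d[0, :] = np.arange(len(ref) + 1)
    for i, h in enumerate(hyp, 1):
        for j, r in enumerate(ref, 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (h != r))
    return d[len(hyp), len(ref)]

def nbest_posteriors(lm_logprobs, ac_loglikes, lm_weight):
    """Posterior estimates over the N-best list: acoustic log likelihoods are
    scaled by 1/lambda, then the result is normalized to sum to 1."""
    log_post = np.asarray(lm_logprobs) + np.asarray(ac_loglikes) / lm_weight
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

def min_expected_wer(hyps, posteriors, top_k=10):
    """Among the top_k most probable hypotheses, pick the one whose
    posterior-weighted word error against the whole list is lowest."""
    order = np.argsort(posteriors)[::-1]
    def expected_we(h):
        return sum(p * word_error(h, ref) for p, ref in zip(posteriors, hyps))
    return min((hyps[i] for i in order[:top_k]), key=expected_we)

# Toy 3-best list with made-up log scores, lambda = 10.
hyps = [["you", "look", "good"], ["who", "looks", "good"], ["you", "look", "god"]]
post = nbest_posteriors([-4.0, -3.5, -6.0], [-120.0, -128.0, -119.0], 10.0)
print(min_expected_wer(hyps, post))
```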
3.3. Computational Complexity

Rescoring N hypotheses requires N^2 word error computations, which can become quite expensive for N-best lists of 1000 or more hypotheses. We found empirically that the algorithm very rarely picks a hypothesis that is not within the top 10 according to posterior probability. This suggests a shortcut version of the algorithm that only computes expected word error for the top k hypotheses, where k is much smaller than N. Note that we still need to consider all N hypotheses to compute the expected word error as in Equation (3), otherwise these estimates become very poor and affect the final result noticeably. The practical version of our algorithm thus has complexity O(kN).

3.4. Other knowledge sources and weight optimization

Often other knowledge sources are added to the standard language model and acoustic scores to improve recognition, such as word transition penalties or scores expressing syntactic or semantic well-formedness (e.g., [4]). Even though these additional scores cannot always be interpreted as probabilities, they can still be combined with exponential weights; the weights are then optimized on a held-out set to minimize WER [5]. This weight optimization should not be confused with the word error minimization discussed here; instead, the two methods complement each other. The additional knowledge sources can be used to yield improved posterior probability estimates, based on which the algorithm described here can be applied. In this scheme, one should first optimize the language model and other knowledge source weights to achieve the best posterior probability estimates (e.g., by minimizing empirical sentence error). So far, we have not implemented combined weight and word error optimization. The experiments reported below used standard language model weights and word transition penalties that had previously been determined as near-optimal in the standard recognition paradigm.

4. EXPERIMENTS

We tested the new rescoring algorithm on 2000-best lists for two test sets taken from spontaneous speech corpora. Test set 1 consisted of 25 conversations from the Switchboard corpus [3]. Test set 2 was 25 conversations from the Spanish CallHome corpus collected by the Linguistic Data Consortium. Due to the properties of spontaneous speech, error rates are relatively high on these data, making word error minimization more promising, as discussed earlier. The results for both standard rescoring and WER minimization are shown in Table 1.

Table 1: Word (WER) and sentence error rates (SER) of standard and word-error-minimizing rescoring methods
                           WER    SER
  Switchboard
    Standard rescoring     52.7   84.0
    WER minimization       52.2   84.4
  CallHome Spanish
    Standard rescoring     68.4   80.9
    WER minimization       67.8   81.2

On both test sets the WER was reduced by about 0.5% (absolute) using the word error minimization method. A per-sentence analysis of the differences in word error shows that the improvement is highly significant in both cases (Sign test, p < 0.0005). Note that, as expected, the sentence error rate (SER) increased slightly, since we no longer were trying to optimize that criterion. For comparison, we also applied our algorithm to the 1995 ARPA Hub3 development test set. This data yields much lower word error rates, between 10% and 30%. In this case the algorithm invariably picked the hypothesis with the highest posterior probability estimate, confirming our earlier reasoning that word error minimization was less likely to make a difference at lower error rates.

5. DISCUSSION AND CONCLUSION

We have shown a discrepancy between the classical hypothesis selection method for speech recognizers and the goal of minimizing word error. A new N-best rescoring algorithm has been proposed that corrects this discrepancy by explicitly minimizing expected word error (as opposed to sentence error) according to the posterior distribution of hypotheses. Experiments show that the new algorithm results in small, but consistent (and statistically significant) reductions in word error under high error rate conditions. In our experiments so far, the improvement in WER is small. However, the experiments confirm that the theoretical possibility of suboptimal WER using the standard rescoring approach is manifest in practice. An important aspect of the WER minimization algorithm is that it can use other, more sophisticated posterior probability estimators, with the potential for larger improvements. Our experiments so far have been based on the commonly used acoustic and language model scores, but we are already experimenting with more complex posterior estimator methods based on neural network models [6].

REFERENCES

[1] L. R. Bahl, F. Jelinek, and R. L. Mercer. A maximum likelihood approach to continuous speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 5(2):179-190, 1983.
[2] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. John Wiley, New York, 1973.
[3] J. J. Godfrey, E. C. Holliman, and J. McDaniel. SWITCHBOARD: Telephone speech corpus for research and development. In Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, vol. I, pp. 517-520, San Francisco, 1992.
[4] R. Moore, D. Appelt, J. Dowding, J. M. Gawron, and D. Moran. Combining linguistic and statistical knowledge sources in natural language processing for ATIS. In Proceedings ARPA Spoken Language Systems Technology Workshop, pp. 261-264, Austin, Texas, 1995.
[5] M. Ostendorf, A. Kannan, S. Austin, O. Kimball, R. Schwartz, and J. R. Rohlicek. Integration of diverse recognition methodologies through reevaluation of N-best sentence hypotheses. In Proceedings DARPA Speech and Natural Language Processing Workshop, pp. 83-87, Pacific Grove, CA, 1991. Defense Advanced Research Projects Agency, Information Science and Technology Office.
[6] M. Weintraub, F. Beaufays, Z. Rivlin, Y. Konig, and A. Stolcke. Neural-network based measures of confidence for word recognition. In Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, vol. II, pp. 887-890, Munich, 1997.
[7] M. Weintraub. LVCSR log-likelihood ratio rescoring for keyword spotting. In Proceedings of the IEEE Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 297-300, Detroit, 1995.

EXHIBIT F

DYNAMO: An Algorithm for Dynamic Acoustic Modeling

Françoise Beaufays, Mitch Weintraub, Yochai Konig
Speech Technology and Research Laboratory
SRI International
Menlo Park, CA 94025

ABSTRACT

This paper summarizes part of SRI's effort to improve acoustic modeling in the context of the Large Vocabulary Continuous Speech Recognition (LVCSR) project. It concentrates on two problems that are believed to contribute to the large error rates observed with LVCSR databases: (1) the lack of discriminative power of the speech models in the acoustic space, and (2) the discrepancy between the criterion used to train the models (typically frame-level maximum likelihood) and the task expected from the models (word-level recognition).
We address the first issue by searching for features that help in narrowing the model distributions, and by proposing a neural-network-based architecture to combine these features. The neural networks (NNET) are used in association with a set of large Gaussian mixture models (GMM) whose mixture weights are dynamically estimated by the neural networks, for each frame of incoming data. We call the resulting algorithm DYNAMO, for dynamic acoustic modeling. To address the second problem, we propose two discriminative training criteria, both defined at the sentence level. We report preliminary results with the Spanish Callhome database.

1. Introduction

Many factors contribute to the relatively low performance of state-of-the-art speech recognizers operating on spontaneous, telephone speech. A few of these factors are: the diversity of speakers and speaking styles, the typically relaxed articulation, the multitude of pronunciation variants, the presence of extraneous noises, the superposition of more than one voice in some segments, and the distortion due to the communication channel. Whereas some of these factors can be efficiently dealt with by explicit modeling (e.g., vocal tract normalization [AKC94], pronunciation modeling [Slo95, FW97]), many others are left for the acoustic models' multi-modal distributions to model implicitly. This, however, has the well-known result of broad, overlapping distributions, which often lead to recognition errors. In this context, identifying features that act as discriminants in the acoustic space would be useful to narrow the acoustic distributions. If such features can be found, the problem becomes how to use them, and how to ensure that sufficient data sharing is allowed for the model parameters to be reliably estimated. These are the main issues that motivated this work.

In the past decade, contextual linguistic features have been widely used in conjunction with decision tree models, and have significantly improved recognition performance (e.g. [BdSG 91, YOW94]). Decision trees, however, make data sharing among different states difficult, and are not well suited to the use of features that are continuous in nature, as opposed to binary. For these reasons, we chose instead to base our models on neural networks. More recently, Ostendorf et al. [OBB 97] showed that a combination of acoustic and prosodic features could greatly help identifying speech segments that were erroneously recognized (32% predictability improvement for a 10-hour training subset of Switchboard). Similar results were reported by various researchers working on confidence measures for word recognition (e.g. [WBR 97]). Presumably, some of these features, which include various measures of speaking rate, SNR, energy, fundamental frequency, stress pattern, and syllable position, could be directly used to disambiguate large acoustic distributions. In the field of speaker recognition, the use of handset detectors has dramatically decreased recognition error rates by sorting out carbon button from electret handsets [Rey96, HW97]. The handset type could also be used as an input to the acoustic modeling algorithms.

Another important issue in acoustic modeling is how to capture the dynamics of the speech signal. Much research has recently been devoted to relaxing the independence assumption imposed by most hidden Markov modeling approaches (HMM) and to modeling the correlation between successive frames of data, leading to the family of so-called segment models [ODK96]. Without embarking on this level of complexity, and following a feature-based approach, we propose to include in the acoustic models time features similar to the time index proposed in [GN93, DASW94] and [KM94]. These features don't model correlation, but they do alleviate the independence assumption.

Our goal here is to explore the usefulness of such knowledge sources as acoustic discriminants, and to propose an efficient and robust architecture to incorporate them in the acoustic models. Clearly, the richness of the acoustic space representation will have a strong influence on how far this approach can be pushed, but the success of the experiments cited above (handset classification, feature-based error prediction, etc.) indicates that the cepstrum-based representation that most systems use offers enough flexibility for the acoustic models to be significantly improved.

As mentioned before, the architecture we propose relies on neural networks. An important issue related to this choice is the selection of a training criterion to optimize the weights of the networks. The desirable properties for this criterion are (1) to be discriminative, (2) to be closely related to the metric used to evaluate the performance of the recognizer (typically the word error rate (WER)), and (3) to be differentiable with respect to the weights of the neural networks.

Not all of the above issues will be discussed in the paper, since this work is still in an early stage. Our first goals were to validate the architecture we propose and to investigate different discriminative training criteria. These two points will be addressed. Feature selection, however, will be the object of future work: for our preliminary experiments, we used a set of generic knowledge sources including linguistic features and time indices.
2. Baseline System and Databases

The baseline system for this work is a speaker-independent continuous speech recognition system trained with 75 conversations of Callhome Spanish data and 80 conversations from Callfriend Spanish. It is based on continuous-density, genonic HMMs [DMM96], and uses a multipass recognition strategy [MBDW93] with a vocabulary of 8K words, non-cross-word acoustic models, and a bigram language model. N-best lists are generated, and rescored with the original acoustic models, a trigram language model, and additional acoustic models such as decision-tree-based cross-word models (DT) or large context-independent phone GMMs (CI).

3. Recognition with Large Context-Independent Models

Using the Spanish Callhome database, we conducted a series of N-best list rescoring experiments with decision tree models and with large context-independent GMMs. The numbers of Gaussians in the GMMs were chosen to be fractions of the numbers of Gaussians used in the corresponding decision tree models. The smallest models had 16 times fewer Gaussians than the decision tree models, and the largest models had exactly the same size. Recognition experiments were performed with two sets of 200 sentences selected at random from the male evaluation test sets of 1995 and 1996. The results, reported in Table 1, show that, for this database, context-independent models perform as well as or slightly better than decision tree models, provided that the numbers of parameters are equal.

Table 1: N-best list rescoring with decision tree models and context-independent phone models of different sizes: WER in %
  System                  Eval '95   Eval '96
  baseline                71.00      65.22
  DT                      67.77      64.37
  + CI (size: 1/16 DTs)   68.77      65.22
  + CI (size: 1/8 DTs)    68.27      65.22
  + CI (size: 1/4 DTs)    68.34      65.10
  + CI (size: 1/2 DTs)    67.98      64.49
  + CI (size: 1/1 DTs)    67.69      64.31
4. The DYNAMO Algorithm

The architecture we propose is based on a hybrid system combining feedforward neural networks and context-independent phone models. Each phone is modeled with a large GMM whose mixture weights are dynamically estimated by a neural network (see Fig. 1), hence the name of the algorithm, DYNAMO. The means and variances of the GMMs are held constant. The inputs to the neural network are the knowledge sources discussed in the introduction. For each data frame, the knowledge sources for each phone are evaluated and input into the corresponding NNET. Each NNET outputs a set of mixture weights, and the likelihood of the observed data is computed from the corresponding phone GMM.

Figure 1: A hybrid NNET-GMM model for dynamic acoustic modeling. For each phone (e.g. /aa/, /ae/, /ah/), a neural network maps the knowledge sources at time k to the mixture weights P_1(s_k), ..., P_N(s_k); the corresponding Gaussian mixture model combines these weights with the Gaussian densities of the observation x_k at time k.

Specifically, the likelihood of an observation x_k with respect to phone i is given by

    p(x_k \mid \mathrm{NNET}_i, \mathrm{GMM}_i) = \sum_{g=1}^{N_i} P_{i,g}(s_k) \, \mathcal{N}_{i,g}(x_k),    (1)

where NNET_i and GMM_i denote, respectively, the NNET and the GMM associated to phone i, N_i is the number of Gaussians in GMM_i, \mathcal{N}_{i,g} and P_{i,g} are, respectively, the g-th mixture component and the g-th mixture weight in GMM_i, and s_k represents the vector of knowledge sources for phone i at time k.

Because the mixture weights for each phone must sum to one, the training of the neural networks is a constrained optimization problem. To simplify the training procedure, we chose to hard-wire this constraint in the architecture of the neural networks by using a "softmax" output layer [Bri90]:

    P_{i,g}(s_k) = \frac{\exp\big(y_{i,g}(s_k)\big)}{\sum_{g'=1}^{N_i} \exp\big(y_{i,g'}(s_k)\big)},    (2)

where y_{i,g} is the output of the neural network, before the softmax layer.

The Gaussians in each phone model can be interpreted as a set of basis functions. A multimodal probability density function is then estimated for each observation by taking a linear combination of the basis functions, the weights of which are computed dynamically by the neural network. The discriminative emphasis of certain portions of the acoustic space at each instant has the effect of narrowing the distributions around the acoustic areas where the data are expected to lie. This architecture thus outputs the likelihoods of the observations. This is in contrast with NNET-HMM hybrids trained for state classification [BM90], where the outputs are state posterior probabilities that need to be converted into likelihoods, and with approaches such as REMAP [BKM95, KBM96] that estimate global posterior probabilities of word sequences.
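To make Eqs. 1 and 2 concrete, the following is a minimal numerical sketch in Python (not from the paper) of the likelihood computation for a single frame and a single phone: a small feedforward network maps a knowledge-source vector s_k to pre-softmax outputs, a softmax turns them into mixture weights, and the likelihood is the weighted sum of fixed diagonal-covariance Gaussian densities. All dimensions, network sizes, and parameter values are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Illustrative dimensions: knowledge-source vector, acoustic feature, mixture size, hidden layer.
    S_DIM, X_DIM, N_GAUSS, HIDDEN = 12, 39, 8, 16

    # Fixed GMM parameters for one phone (means and variances frozen after EM training).
    means = rng.normal(size=(N_GAUSS, X_DIM))
    variances = np.ones((N_GAUSS, X_DIM))

    # One-hidden-layer feedforward NNET with a softmax output layer (Eq. 2).
    W1, b1 = rng.normal(scale=0.1, size=(HIDDEN, S_DIM)), np.zeros(HIDDEN)
    W2, b2 = rng.normal(scale=0.1, size=(N_GAUSS, HIDDEN)), np.zeros(N_GAUSS)

    def mixture_weights(s_k):
        """Dynamic mixture weights P_g(s_k) for the current frame (Eq. 2)."""
        y = W2 @ np.tanh(W1 @ s_k + b1) + b2       # pre-softmax outputs y_g(s_k)
        y = y - y.max()                            # numerical stability
        e = np.exp(y)
        return e / e.sum()

    def gaussian_densities(x_k):
        """Diagonal-covariance Gaussian densities N_g(x_k) for all mixture components."""
        diff = x_k - means
        log_dens = -0.5 * (np.sum(diff**2 / variances, axis=1)
                           + np.sum(np.log(2 * np.pi * variances), axis=1))
        return np.exp(log_dens)

    def frame_likelihood(x_k, s_k):
        """p(x_k | NNET, GMM) = sum_g P_g(s_k) * N_g(x_k)   (Eq. 1)."""
        return mixture_weights(s_k) @ gaussian_densities(x_k)

    x_k, s_k = rng.normal(size=X_DIM), rng.normal(size=S_DIM)
    print('frame likelihood:', frame_likelihood(x_k, s_k))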
4.1. Training of the DYNAMO Models

The DYNAMO models are trained in two phases. First, the context-independent phone GMMs are trained with the expectation-maximization (EM) algorithm to maximize the log-likelihood of the training data. The means and variances of these models are retained; the mixture weights are discarded. Then, the adaptive parameters of the neural networks are trained with the stochastic steepest descent algorithm to optimize some criterion F. The neural network weights are thus updated according to

    w_i(n+1) = w_i(n) + \eta \, \hat{\nabla}_{w_i} F_i(n),    (3)

where w_i(n) denotes the set of neural network weights for phone i at iteration n, \hat{\nabla}_{w_i} F_i(n) is the instantaneous gradient of the optimization criterion for phone i, and \eta is a constant that controls the learning rate.

Note that the optimization criterion F does not need to be identical to the criterion used to train the GMMs (ML). Indeed, we argue in the next sections that discriminative training is better suited to this task. For now, however, we will assume for simplicity that F is the average log-likelihood of the data,

    F_i = \frac{1}{K_i} \sum_{k} \log p(x_k \mid \mathrm{NNET}_i, \mathrm{GMM}_i),    (5)

where the sum is taken over all the K_i observations x_k aligned to phone i.

Applying the chain rule to the derivatives of Eq. 5, and taking Eq. 2 into account, we find

    \frac{\partial F_i}{\partial w_i} = \frac{1}{K_i} \sum_{k} \sum_{g=1}^{N_i} \delta_{k,g} \, \frac{\partial y_{i,g}(s_k)}{\partial w_i},    (6)

where

    \delta_{k,g} = \gamma_{k,g} - P_{i,g}(s_k)    (7)

can be backpropagated through the neural network, as in the traditional backpropagation algorithm [RMT86], and \gamma_{k,g} = P_{i,g}(s_k) \, \mathcal{N}_{i,g}(x_k) / p(x_k \mid \mathrm{NNET}_i, \mathrm{GMM}_i) is the posterior probability of Gaussian g given observation x_k. Intuitively, the backpropagation term \delta_{k,g} for Gaussian g is large in absolute value if the posterior probability of the Gaussian is very different from its prior probability P_{i,g}(s_k), with both probabilities being functions of the knowledge sources for the current data frame. To hasten the convergence of the neural networks and steer them away from uninteresting local minima, we initially set their weights so that the network outputs are equal to the mixture weights estimated with the EM algorithm.
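Continuing the sketch above and reusing its definitions, the following illustrates the ML criterion of Eqs. 5-7: the backpropagation term of Eq. 7 is the difference between each Gaussian's posterior and its current mixture weight, and a stochastic gradient-ascent step follows Eq. 3. Restricting the update to the output-layer weights is a simplification for brevity, not part of the paper, and the learning rate is an assumed value.

    def ml_output_gradient(x_k, s_k):
        """Backpropagation term of Eq. 7: gamma_g - P_g(s_k), where gamma_g is the
        posterior of Gaussian g for the current frame."""
        P = mixture_weights(s_k)
        dens = gaussian_densities(x_k)
        gamma = P * dens / (P @ dens)              # posterior of each Gaussian
        return gamma - P

    def ml_training_step(frames, sources, eta=0.01):
        """One stochastic gradient step on the ML criterion (Eqs. 3, 5, and 6),
        updating only the output-layer weights W2, b2 for brevity."""
        global W2, b2
        for x_k, s_k in zip(frames, sources):
            h = np.tanh(W1 @ s_k + b1)             # hidden activations
            delta = ml_output_gradient(x_k, s_k)   # dF/dy, to be backpropagated
            W2 = W2 + eta * np.outer(delta, h)     # ascend the frame log-likelihood
            b2 = b2 + eta * delta

    # One toy update using the single random frame from the previous sketch.
    ml_training_step([x_k], [s_k])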
5. Recognition Experiments with ML-trained DYNAMO Models

We performed a set of rescoring experiments with ML-trained DYNAMO models, using linguistic questions and, in some experiments, time features. We chose the linguistic features to be identical to those selected by the decision trees in previous DT-rescoring experiments (Table 1). The time features for a hypothesized phone aligned to a sequence of data frames were the phone duration and the relative time index within the phone, which ranges between 0 and 1.

Results are given in Table 2, where the baseline obtained by rescoring the N-best lists with the GMMs is given for comparison. These numbers show that the introduction of the ML-trained networks increased the overall WER. Further analysis of the results revealed that the likelihood of the test data had increased as a result of training, but that the posterior probabilities of the correct models had decreased. This indicated that competing models scored higher than the correct model, which confirmed that discriminative training should be used instead.

Table 2: Rescoring experiments with ML-trained DYNAMO models: WER in %.

  GMM size   Experiment                           WER
  1/16       no NNETs (baseline)                  68.77
  1/16       NNETs w/ ling. feat. & time feat.    69.20
  1/16       NNETs w/ ling. feat. only            68.92
  1/8        no NNETs (baseline)                  68.27
  1/8        NNETs w/ ling. feat. & time feat.    69.35

6. Discriminative Training Criteria

Discriminative training of speech models was first introduced by Bahl et al. in the form of Maximum Mutual Information (MMI) estimation [BBdSM86]. In this framework, the speech models \Lambda are trained to maximize the mutual information between the observation sequence X and the correct word sequence W_c:

    \Lambda_{\mathrm{MMI}} = \arg\max_{\Lambda} \, I_{\Lambda}(X, W_c),    (8)

with

    I_{\Lambda}(X, W_c) = \log \frac{p_{\Lambda}(X \mid W_c) \, P(W_c)}{\sum_{W} p_{\Lambda}(X \mid W) \, P(W)},    (9)

where the sum in the denominator is taken over all possible word sequences W. Practical implementations of Eq. 9 for continuous speech recognition include the estimation of the denominator with a phone loop model [Mer88], and its approximation by a sum over the hypotheses in an N-best list [Cho90].

The first optimization criterion we propose is similar to the N-best list implementation of MMI, but differs in that we augment the N-best list (N denotes the N-best list depth) with the correct word sequence, W_c. We then maximize the posterior probability of the correct word sequence,

    P(W_c \mid X) = \frac{p(X, W_c)}{p(X, W_c) + \sum_{n=1}^{N} p(X, W_n)},    (10)

where p(X, W) denotes the joint probability of the observation sequence X and a word sequence W (defined more precisely in Section 6.1). The inclusion of the joint probability of the observation and the correct word sequence in the denominator makes the criterion depart from the original MMI, but has a useful property in terms of neural network training, as we will show.

Another family of discriminative criteria stems from the motivation of directly optimizing the metric used to evaluate the recognizer, i.e. the word error rate. Bahl et al. proposed the heuristic "corrective training" procedure in [BBdSM88]. Katagiri et al. developed the Generalized Probabilistic Descent method, which extends the idea of Bayes optimum classification by introducing smooth classification error functions, and generalizes this framework to the classification of patterns of variable lengths [KLJ91]. The second criterion we propose consists in minimizing the average number of errors over the N-best list,

    \mathrm{ANER}(X) = \sum_{n=1}^{N} P(W_n \mid X) \, \mathrm{NER}(W_n),    (11)

where NER(W_n) denotes the number of errors in the n-th hypothesis, and P(W_n | X) is the posterior probability of the n-th hypothesis in the (non-augmented) N-best list.

Both criteria are optimized in a stochastic optimization framework, as we will discuss shortly. In both cases, the training procedure requires N-best lists for all the training data. This is typically quite costly but not infeasible, especially if the N-best list depth is limited to a small number of hypotheses (5 or 10).

6.1. Maximizing the posterior probability of the correct sentence

Let p(X, W) denote the joint probability of a word sequence W (reference or hypothesis) and of the corresponding acoustic sequence X,

    p(X, W) = P_A(X \mid W) \, P_L(W)^{\lambda},    (12)

where P_A(X | W) and P_L(W) are shorthands for the acoustic model and language model probabilities, respectively, and where \lambda is the language model weight. With this notation, we can rewrite the posterior probability of the correct word sequence in Eq. 10 as

    P(W_c \mid X) = \frac{p(X, W_c)}{\sum_{n=0}^{N} p(X, W_n)},    (13)

with the convention W_0 \equiv W_c. Likewise,

    P(W_n \mid X) = \frac{p(X, W_n)}{\sum_{m=0}^{N} p(X, W_m)}    (14)

denotes the posterior probability of the n-th hypothesis in the augmented N-best list. (All posteriors and likelihoods are conditioned upon the set of acoustic models.)

The first training criterion can be expressed as

    F_1 = \frac{1}{S} \sum_{s=1}^{S} \log P\big(W_c^{(s)} \mid X^{(s)}\big),    (15)

where S is the number of sentences in the training set, and P(W_c^{(s)} | X^{(s)}) represents the posterior probability of the correct transcription of sentence s. Adapting the neural network weights according to this criterion amounts to adjusting them after the presentation of each training sentence by an amount proportional to (stochastic gradient update)

    \frac{\partial \log P(W_c \mid X)}{\partial w} = \sum_{n=0}^{N} P(W_n \mid X) \left[ \frac{\partial \log p(X, W_c)}{\partial w} - \frac{\partial \log p(X, W_n)}{\partial w} \right],    (16)

where we made use of the property

    \sum_{n=0}^{N} P(W_n \mid X) = 1.    (17)

Since the acoustic log-likelihoods can be expanded into sums over the observations x_k in the sentence, the above weight update formula modifies the neural network weights only for those frames where the reference and the hypothesis strings do not coincide. In that case, positive training is given to the correct model (c) and negative training is given to the erroneously hypothesized model (h). The log-likelihood gradients \partial \log p(X, W) / \partial w are calculated according to Eqs. 6 and 7. This property results from the fact that the N-best list was augmented with the correct transcription (Eq. 10). Another desirable feature of this training criterion is that more training is given to hypotheses with high posterior probabilities (the multiplicative term P(W_n | X)). A potential disadvantage is that the correct hypothesis is often not in the N-best list for databases with high error rates. Improving the posterior of the correct sentence may thus result in decreasing the probability of the best (although erroneous) hypothesis in the N-best list.
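As a numerical illustration of Eqs. 12-16 (a sketch under assumed values, not the paper's implementation), the code below forms joint log-probabilities with a language-model weight, computes the posteriors over the N-best list augmented with the correct transcription, and returns the per-hypothesis weights that multiply the log-likelihood gradient differences in Eq. 16. All scores and the weight value are hypothetical.

    import numpy as np

    def joint_log_prob(acoustic_loglik, lm_logprob, lm_weight=8.0):
        """log p(X, W) = log P_A(X | W) + lm_weight * log P_L(W)   (Eq. 12)."""
        return acoustic_loglik + lm_weight * lm_logprob

    def augmented_posteriors(log_joint_correct, log_joint_hyps):
        """Posteriors over the N-best list augmented with the correct transcription
        (Eqs. 13-14). Index 0 is the correct word sequence W_c."""
        log_joint = np.concatenate(([log_joint_correct], log_joint_hyps))
        log_joint = log_joint - log_joint.max()    # numerical stability
        p = np.exp(log_joint)
        return p / p.sum()

    def per_hypothesis_gradient_weights(posteriors):
        """Eq. 16: each hypothesis n contributes the gradient difference
        d log p(X, W_c)/dw - d log p(X, W_n)/dw with weight P(W_n | X)."""
        return posteriors[1:]                      # weights for n = 1..N

    # Hypothetical acoustic and LM scores: the correct transcription plus a 3-best list.
    correct = joint_log_prob(-5122.0, -12.0)
    hyps = np.array([joint_log_prob(-5120.5, -12.4),
                     joint_log_prob(-5121.3, -11.9),
                     joint_log_prob(-5124.0, -13.1)])
    post = augmented_posteriors(correct, hyps)
    print('posterior of correct transcription:', post[0])
    print('per-hypothesis gradient weights (Eq. 16):', per_hypothesis_gradient_weights(post))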
6.2. Minimizing the average number of errors in the N-best list

The second training criterion we propose is given by

    F_2 = \frac{1}{S} \sum_{s=1}^{S} \mathrm{ANER}\big(X^{(s)}\big),    (18)

where the average number of errors ANER in a sentence was defined in Eq. 11. Note that here the posterior probability of a hypothesis is computed only with respect to the other hypotheses in the N-best list (i.e. without taking the reference into account):

    \tilde{P}(W_n \mid X) = \frac{p(X, W_n)}{\sum_{m=1}^{N} p(X, W_m)}.    (19)

Intuitively, minimizing ANER "redistributes" the posterior probability mass to favor hypotheses with few errors and penalize hypotheses with more errors. Again, the weight update formula can be derived by taking the instantaneous gradient of F_2 with respect to the weights of the neural networks. The weight update for each sentence is therefore proportional to

    -\frac{\partial \, \mathrm{ANER}(X)}{\partial w} = \sum_{n=1}^{N} \tilde{P}(W_n \mid X) \, \big[\mathrm{ANER}(X) - \mathrm{NER}(W_n)\big] \, \frac{\partial \log p(X, W_n)}{\partial w}.    (20)

The characteristics of this weight update formula are quite different from those of the previous criterion. Negative training is given to hypotheses that have a number of errors above average, and positive training is given to hypotheses with a number of errors below average. Of course, this average, ANER, evolves with the training of the models. If the learning process progresses correctly, ANER decreases with time, thereby progressively decreasing the number of hypotheses that receive positive training. In the limit, all the posteriors converge to zero except the one that corresponds to the hypothesis with the lowest number of errors, and ANER converges to the NER of that hypothesis, thereby bringing the training process to an end. The main disadvantage of this criterion is that positive training is given to all the frames in the best hypothesis, including those associated with incorrectly recognized words. This criterion, however, is closer to the WER metric that we ultimately wish to optimize.
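The ANER criterion of Eqs. 11, 19, and 20 can be sketched in the same style (hypothetical scores and error counts; not the paper's implementation): posteriors are computed over the non-augmented N-best list, ANER is their error-weighted average, and each hypothesis receives a training weight proportional to its posterior times the difference between ANER and its own error count.

    import numpy as np

    def nbest_posteriors(log_joint_hyps):
        """Posteriors computed over the (non-augmented) N-best list only (Eq. 19)."""
        log_joint = np.asarray(log_joint_hyps, dtype=float)
        log_joint = log_joint - log_joint.max()    # numerical stability
        p = np.exp(log_joint)
        return p / p.sum()

    def aner(posteriors, num_errors):
        """Average number of errors over the N-best list (Eq. 11)."""
        return posteriors @ num_errors

    def aner_gradient_weights(posteriors, num_errors):
        """Eq. 20 (minimization): hypothesis n contributes d log p(X, W_n)/dw with
        weight P(W_n | X) * (ANER - NER_n), so below-average hypotheses receive
        positive training and above-average hypotheses receive negative training."""
        return posteriors * (aner(posteriors, num_errors) - num_errors)

    # Hypothetical 4-best list: joint log-probabilities and word-error counts.
    log_joint = [-5120.5, -5121.3, -5122.8, -5124.0]
    errors = np.array([2.0, 1.0, 4.0, 3.0])
    P = nbest_posteriors(log_joint)
    print('ANER:', aner(P, errors))
    print('per-hypothesis training weights (Eq. 20):', aner_gradient_weights(P, errors))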
7. Recognition Experiments with Discriminatively Trained DYNAMO Models

These experiments were limited to the training of small models (NNETs associated to GMMs 1/16), with linguistic and time features only. Fig. 2 shows the results of a self-test experiment (i.e. the test data is identical to the training data) with the 627 male sentences of the Eval'96 test set of the Spanish Callhome database. The N-best list depth was limited to 10 hypotheses.

Figure 2: Average number of errors as a function of the training epoch, for both training criteria (curves: "min NER training" and "max Pc training"; x-axis: training epoch; y-axis: average number of errors in the best hypothesis).

The N-best error rate for this set of sentences was 41.49%. The learning curves show that for the self-test experiment the ANER criterion shows more promise. This, however, is not a fair experiment, and the generalization properties of the max-posterior criterion may be superior. N-best rescoring of 200 randomly selected male sentences of the Eval'96 test set with the neural networks trained to minimize the ANER gave a significant WER improvement (see Table 3).

Table 3: N-best rescoring with ANER NNETs, self-test experiment: WER in %.

  Models               WER
  GMMs 1/16 baseline   65.22
  min ANER NNETs       63.89

A fair experiment was conducted with the max-posterior criterion. A set of neural networks was trained from linguistic and time features to output mixture weights for the same small phone models (GMMs 1/16). The training data consisted of all 15K male sentences in the training set, of which 10% was held out as a cross-validation set. The models were tested on the same subset of Eval'96 as in the previous experiments. The N-best list depth was limited to 5 hypotheses. The error rate is given in Table 4. The WER improvement is modest, but since the phone GMMs in these experiments were small and hence not very detailed, little margin for improvement was left to the NNETs.

Table 4: N-best rescoring with log-posterior NNETs, fair experiment: WER in %.

  Models               WER
  GMMs 1/16 baseline   65.22
  max log-post NNETs   64.79

8. Conclusions

We described an algorithm to incorporate new knowledge sources in a set of acoustic models, with the objective of dynamically increasing or decreasing the likelihoods of the different modes of the models, thereby narrowing their distributions. The algorithm makes use of feedforward neural networks to dynamically estimate the mixture weights of the speech models, given the knowledge sources for the current data frame. We argued that the neural networks need to be discriminatively trained, and we proposed two training criteria: maximizing the log-posterior probability of the correct transcription and minimizing the average number of errors in the N-best list. Preliminary experiments showed a modest but encouraging improvement in WER. We are currently experimenting with larger phone models and increased N-best list depths.

References

[AKC94] A. Andreou, T. Kamm, and J. Cohen. Experiments in vocal tract normalization. In Proc. of the CAIP Workshop: Frontiers in Speech Recognition II, 1994.

[BBdSM86] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer. Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1986.

[BBdSM88] L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer. A new algorithm for the estimation of hidden Markov model parameters. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, New York, NY, April 1988.

[BdSG 91] L. R. Bahl, P. V. de Souza, P. S. Gopalakrishnan, D. Nahamoo, and M. A. Picheny. Context dependent modeling of phones in continuous speech using decision trees. In Proc. DARPA Speech and Natural Language Workshop, Pacific Grove, CA, February 1991.

[BKM95] H. Bourlard, Y. Konig, and N. Morgan. REMAP: Recursive Estimation and Maximization of A Posteriori Probabilities, applications to transition-based connectionist speech recognition. Technical Report TR-94064, ICSI, Berkeley, CA, March 1995.

[BM90] H. Bourlard and N. Morgan. A continuous speech recognition system embedding MLP into HMM. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2. Morgan Kaufmann, 1990.

[Bri90] J. S. Bridle. Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2. Morgan Kaufmann, 1990.

[Cho90] Y. L. Chow. Maximum mutual information estimation of HMM parameters for continuous speech recognition using the N-best algorithm. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Albuquerque, NM, 1990.

[DASW94] L. Deng, M. Aksmanovic, D. Sun, and J. Wu. Speech recognition using hidden Markov models with polynomial regression functions as nonstationary states. IEEE Trans. Speech, Audio Processing, 2(4), 1994.

[DMM96] V. V. Digalakis, P. Monaco, and H. Murveit. Genones: Generalized mixture tying in continuous hidden Markov model-based speech recognizers. IEEE Trans. Speech, Audio Processing, 4(4), July 1996.
[FW97] M. Finke and A. Waibel. Speaking mode dependent pronunciation modeling in large vocabulary conversational speech recognition. In Proc. Eurospeech, Rhodes, Greece, September 1997.

[GN93] H. Gish and K. Ng. A segmental speech model with applications to word spotting. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, volume II, 1993.

[HW97] L. P. Heck and M. Weintraub. Handset-dependent background models for robust text-independent speaker recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Munich, Germany, April 1997.

[KBM96] Y. Konig, H. Bourlard, and N. Morgan. REMAP: Experiments with speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Atlanta, GA, May 1996.

[KLJ91] S. Katagiri, C.-H. Lee, and B.-H. Juang. New discriminative training algorithms based on the generalized probabilistic descent method. In Proc. Workshop on Neural Networks for Signal Processing, 1991.

[KM94] Y. Konig and N. Morgan. Modeling dynamics in connectionist speech recognition - the time index model. In Proc. Intl. Conf. on Speech and Language Processing, 1994.

[MBDW93] H. Murveit, J. Butzberger, V. V. Digalakis, and M. Weintraub. Large-vocabulary dictation using SRI's DECIPHER(TM) speech recognition system: Progressive-search techniques. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pages II-319 to II-322, April 1993.

[Mer88] B. Merialdo. Phonetic recognition using hidden Markov models and maximum mutual information training. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, New York, NY, April 1988.

[OBB 97] M. Ostendorf, W. Byrne, M. Bacchiani, M. Finke, A. Gunawardana, K. Ross, S. Roweis, E. Shriberg, D. Talkin, A. Waibel, B. Wheatley, and T. Zeppenfeld. Modeling systematic variations in pronunciation via a language-dependent hidden speaking mode. Technical Report, LVCSR Summer Research Workshop, Johns Hopkins University, 1997.

[ODK96] M. Ostendorf, V. V. Digalakis, and O. A. Kimball. From HMM's to segment models: A unified view of stochastic modeling for speech recognition. IEEE Trans. Speech, Audio Processing, 4(5), 1996.

[Rey96] D. A. Reynolds. MIT Lincoln Laboratory site presentation. In NIST Speaker Recognition Workshop, Linthicum Heights, MD, March 1996.

[RMT86] D. E. Rumelhart, J. L. McClelland, and the PDP Group, editors. Parallel Distributed Processing, volume 1. The MIT Press, Cambridge, MA, 1986.

[Slo95] T. Sloboda. Dictionary learning: Performance through consistency. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1995.

[WBR 97] M. Weintraub, F. Beaufays, Z. Rivlin, Y. Konig, and A. Stolcke. Neural-network based measures of confidence for word recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Munich, Germany, April 1997.

[YOW94] S. J. Young, J. J. Odell, and P. C. Woodland. Tree-based state tying for high accuracy acoustic modelling. In Proc. Human Language Technology Workshop, pages 307-312, Plainsboro, NJ, March 1994.

EXHIBIT 4
FULLY REDACTED

EXHIBIT 5
FULLY REDACTED

EXHIBIT 6
FULLY REDACTED
