Apple Computer Inc. v. Burst.com, Inc.

Filing 108

Declaration of Allen Gersho in Support of 107 Response of Burst.com, Inc.'s Opposition to Plaintiff Apple Computer, Inc.'s Motion for Summary Judgment on Invalidity Based on Kramer and Kepley Patents filed by Burst.com, Inc. (Attachments: # 1 Exhibit A to A. Gersho Declaration # 2 Exhibit B to A. Gersho Declaration # 3 Exhibit C to A. Gersho Declaration # 4 Exhibit D to A. Gersho Declaration # 5 Exhibit E to A. Gersho Declaration # 6 Exhibit F to A. Gersho Declaration (Part 1) # 7 Exhibit F to A. Gersho Declaration (Part 2) # 8 Exhibit G to A. Gersho Declaration # 9 Exhibit H to A. Gersho Declaration # 10 Exhibit I to A. Gersho Declaration (Part 1) # 11 Exhibit I to A. Gersho Declaration (Part 2) # 12 Exhibit J to A. Gersho Declaration # 13 Exhibit K to A. Gersho Declaration # 14 Exhibit L to A. Gersho Declaration # 15 Exhibit M to A. Gersho Declaration # 16 Exhibit N to A. Gersho Declaration # 17 Exhibit O to A. Gersho Declaration) (Related document(s) 107) (Crosby, Ian) (Filed on 6/7/2007)

Apple Computer Inc. v. Burst.com, Inc. Doc. 108 Att. 14
Case 3:06-cv-00019-MHP Document 108-15 Filed 06/07/2007

IMPROVING PERFORMANCE OF MULTI-PULSE LPC CODERS AT LOW BIT RATES

Sharad Singhal and Bishnu S. Atal
Acoustics Research Department
AT&T Bell Laboratories, Murray Hill, NJ 07974

CH1945-5/84/0000-0003 $1.00 © 1984 IEEE

ABSTRACT

The multi-pulse excitation model provides a method for producing natural-sounding speech at medium to low bit rates. Multi-pulse analysis obtains the all-pole filter excitation by minimizing a spectrally-weighted mean-squared error between the original and synthetic speech signals. Although the method provides high quality speech around 10 kbits/sec, speech quality suffers if the bit rate is lowered. In this paper, we focus on problems encountered in attempting to maintain speech quality while synthesizing speech using multi-pulse excitation at lower bit rates.

INTRODUCTION

The multi-pulse model of LPC excitation [1] provides a method for producing natural sounding speech. This model replaces the traditional pitch pulse and white noise excitation used in vocoders by a sequence of pulses. The optimum pulse sequence is chosen to minimize a spectrally-weighted mean-squared error between the original and synthetic speech signals. The optimization is carried out sequentially - one pulse at a time - over short time segments, typically 5 to 10 msec long, during which the speech signal can be considered quasi-stationary.

Although the multi-pulse excitation is capable of producing high quality synthetic speech at rates around 10 kbits/sec, the speech quality deteriorates rapidly at much lower bit rates. There are several reasons for this loss of speech quality at lower bit rates. First, present LPC analysis procedures are not perfect. Even when the all-pole model provides a reasonable approximation to the speech spectrum, these procedures introduce errors in estimating the filter parameters. At higher bit rates, these errors can be compensated by using additional pulses in the multi-pulse excitation. This however is not possible at lower bit rates, where the number of pulses has to be kept at a minimum. We thus need a more accurate LPC analysis procedure to achieve a significant reduction in the bit rate for the multi-pulse excitation. An important source of error in present LPC analysis procedures is introduced by the rectangular window used for computing the mean-squared prediction error. The rectangular window causes the resultant LPC parameters to become sensitive to the position of the analysis frame in the utterance [2]. In this paper, we describe an improved LPC analysis procedure to avoid this problem.

Second, in the multi-pulse analysis method [1], the amplitudes and locations of the pulses are determined in successive stages - one pulse at a time. For closely spaced pulses, successive optimization of individual pulses is inaccurate and often requires additional pulses to compensate for inaccuracies introduced earlier. Again, these additional pulses result in a higher bit rate than is necessary. This problem is avoided by using a procedure in which the amplitudes of the pulses are kept optimum at each stage of multi-pulse analysis.

Third, it has been our experience that the multi-pulse excitation needs approximately 8 pulses per pitch period to synthesize high quality speech. Thus, for proper synthesis of speech with an average pitch period of 10 msec, an average pulse rate of 800 per second for the excitation is necessary. However, this also means that a much higher pulse rate is needed for speech where the pitch period is shorter. The number of pulses in a pitch period can be reduced significantly by exploiting the periodic nature of voiced speech. We describe in this paper a method to achieve this reduction by using pitch prediction on the multi-pulse excitation signal.

IMPROVED LPC ANALYSIS PROCEDURE

Linear predictor coefficients are determined by minimizing the mean-squared prediction error between the current value of the speech signal and its predicted value based on the past samples. The prediction error $e(n)$ is defined as

$$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} a_k s(n-k), \tag{1}$$

where $s(n)$ is the speech sample at the nth sampling instant, $\hat{s}(n)$ is the predicted value of the nth speech sample, $a_k$ is the kth predictor coefficient, and $p$ is the number of predictor coefficients. For stationary signals, whose underlying spectral characteristics do not change with time, the mean-squared prediction error is given by

$$E = \lim_{N \to \infty} \frac{1}{N} \sum_{n=1}^{N} e^2(n). \tag{2}$$

For quasi-stationary signals, the sum over n in Eq. (2) is usually carried out over a time interval during which the spectral characteristics of the signal are approximately constant. The prediction error as defined in Eq. (2) shows considerable variation even when the position of the analysis frame is changed slightly. The predictor coefficients are determined by minimizing E and consequently show similar variations. Figure 1 shows a portion of the utterance "Joe brought a young girl" spoken by a male speaker, and a plot of the first 4 partial correlation coefficients as a function of the position of the analysis frame. LPC analysis was performed using the stabilized covariance method [3] with a 20 msec frame. Each data point in the plot represents a new set of LPC parameters obtained by shifting the analysis frame by one sampling interval (125 μsec at the sampling frequency of 8 kHz).

Fig. 1. The speech waveform and the first 4 partial correlation coefficients obtained with traditional LPC analysis.

These sample-to-sample fluctuations are not usually obvious when the LPC parameters are computed every 10 to 20 msec; in this case, the high-frequency fluctuations are folded back in the low frequency range. Let us consider this point in more detail. The prediction error defined in Eq. (1) has the same bandwidth as the speech signal. The squared prediction error has a bandwidth which is twice the bandwidth of the speech signal. The rectangular window used in averaging the squared prediction error as shown in Eq. (2) has sharp edges and thus a poor side-lobe response at high frequencies [4]. The sharp edges of the rectangular window cause the mean-squared prediction error E to vary rapidly as a function of the position of the analysis frame. Thus the rectangular window is not appropriate for computing the average squared prediction error.

This problem can be avoided by using a non-rectangular window, such as a Hamming window [4], which has lower side lobes at high frequencies. We now define a more general form for the mean-squared prediction error as

$$E(n) = \sum_{i=-m}^{m} e^2(n-i)\, w(i), \tag{3}$$

where $E(n)$ is the mean-squared prediction error at the nth sampling instant and $w(i)$ is the ith sample of the Hamming window (or any other appropriately tapered window) with a duration of $2m+1$ samples. Equation (3) can be rewritten as

$$E(n) = \sum_{i=-m}^{m} \Big[ s(n-i) - \sum_{k=1}^{p} a_k s(n-i-k) \Big]^2 w(i). \tag{4}$$

The predictor coefficients at the nth sampling instant are determined by minimizing the prediction error $E(n)$. We then obtain a set of linear equations

$$\sum_{k=1}^{p} a_k \phi_{kr} = c_r, \quad 1 \le r \le p, \tag{5}$$

where

$$\phi_{kr} = \sum_{i=-m}^{m} w(i)\, s(n-i-k)\, s(n-i-r), \quad 1 \le k \le p,\ 1 \le r \le p, \quad \text{and} \quad c_r = \sum_{i=-m}^{m} w(i)\, s(n-i)\, s(n-i-r). \tag{6}$$

These equations are solved to provide a new set of predictor coefficients. The extension of the above approach to determine partial correlations (or reflection coefficients) using the stabilized covariance method is straightforward. The covariance matrix $[\phi_{kr}]$ is first expressed as the product of a lower triangular matrix $L = [L_{kr}]$ and its transpose $L^t$. Next, a set of linear equations

$$\sum_{k=1}^{r} L_{rk}\, q_k = c_r, \quad 1 \le r \le p, \tag{7}$$

is solved. The partial correlation at delay m is given by

$$r_m = q_m \Big/ \Big( c_0 - \sum_{k=1}^{m-1} q_k^2 \Big)^{1/2}, \tag{8}$$

where $c_0$ is the weighted energy of the speech signal as defined in Eq. (6). It is also desirable to apply a correction for the high frequency roll-off of the low pass filter used in analog-to-digital conversion of the speech signals [3]. Figure 2 shows a plot of the first 4 partial correlation coefficients based on the above procedure using a 20 msec Hamming window for the same speech segment as in Fig. 1. Even though the effective duration of the Hamming window is only half of the rectangular window used in the traditional LPC analysis, the sample-to-sample fluctuations are reduced significantly with the new procedure.

Fig. 2. The partial correlation coefficients obtained with the improved LPC analysis procedure for the same speech segment as in Fig. 1.

OPTIMAL AMPLITUDE EXCITATION

Multi-pulse excitation is obtained by minimizing a weighted mean-squared error between the original and synthetic speech. The LPC synthesizer produces synthetic speech, which is compared to the original speech to produce the error signal e. The error signal is weighted to de-emphasize the error in the formant regions by using a linear filter of the form

$$W(z) = \frac{P(z^{-1})}{P(\alpha z^{-1})}, \tag{9}$$

where

$$P(z^{-1}) = 1 - \sum_{k=1}^{p} a_k z^{-k} \tag{10}$$

is the LPC inverse filter and $\alpha$ is a fraction between zero and one. The parameter $\alpha$ determines the degree to which the error is de-emphasized in the formant regions. Since the weighting filter is linear, the weighting can be done before the error computation. Both the speech signal $s(n)$ and the synthetic speech signal $\hat{s}(n)$ can be passed through $W(z)$ to produce the weighted signals $y(n)$ and $\hat{y}(n)$, respectively, and the weighted error signal computed as $y(n) - \hat{y}(n)$. Let $h(n)$ be the overall impulse response of the all-pole filter in cascade with the weighting filter at the nth sampling instant. Let input pulses of amplitude $\beta_1, \beta_2, \ldots, \beta_m$ occur at instants $n_1, n_2, \ldots, n_m$, respectively. The synthetic speech signal $\hat{y}(n)$ can then be expressed as

$$\hat{y}(n) = \sum_{k=1}^{m} \beta_k\, h(n - n_k). \tag{11}$$

The amplitudes and locations of the pulses are obtained by minimizing the total squared error E between $y(n)$ and $\hat{y}(n)$, where

$$E = \sum_{n=0}^{N-1} \big[ y(n) - \hat{y}(n) \big]^2. \tag{12}$$

Minimizing E with respect to $\beta_j$, $1 \le j \le m$, leads to

$$\sum_{k=1}^{m} \beta_k\, \alpha(n_k, n_j) = x(n_j), \quad 1 \le j \le m, \tag{13}$$

where

$$\alpha(i, j) = \sum_{n=0}^{N-1} h(n-i)\, h(n-j) \tag{14}$$

and

$$x(m) = \sum_{n=0}^{N-1} y(n)\, h(n-m). \tag{15}$$

The minimum error $E_{\min}$ is given by

$$E_{\min} = \sum_{n=0}^{N-1} y^2(n) - \sum_{k=1}^{m} \beta_k\, x(n_k). \tag{16}$$

Minimization with respect to the pulse locations leads to a set of nonlinear equations which do not have a closed form solution. Equations (13) cannot be solved directly, because they involve the unknown pulse locations. Thus suboptimal solutions have to be used. In [1], Atal and Remde obtained the pulse amplitudes and locations in successive stages, one pulse at a time. At stage j, all pulse amplitudes and locations up to stage j-1 were assumed to be known and only the pulse location $n_j$ and the pulse amplitude $\beta_j$ were computed. While this procedure is fast, it has several shortcomings. Successive optimization of individual pulses is inaccurate for closely spaced pulses. Additional pulses are often required to compensate for the inaccuracy introduced in earlier stages. As a result, the signal-to-noise ratio (SNR) saturates after a few pulses have been placed.

These problems can be avoided to a large extent if the amplitudes of all pulses found in earlier stages are kept optimal along with the amplitude of the current pulse during the search. Thus at stage j we assume that only the pulse locations $n_1, n_2, \ldots, n_{j-1}$ remain constant. Equations (13) and (16) are then used to compute the pulse amplitudes $\beta_1, \ldots, \beta_{j-1}$, as well as $\beta_j$ and $n_j$. Since the amplitudes of all pulses can now be modified, the pulse amplitudes remain accurate even when they are closely spaced and the SNR does not saturate.

Fig. 3. Average SNR obtained with and without amplitude optimization. (Horizontal axis: number of pulses/second; curves: optimized amplitudes, unoptimized amplitudes.)

Figure 3 shows the SNR obtained with and without amplitude optimization as a function of the pulse rate for the utterance "Both teams started from zero," spoken by a female speaker. The saturation effect is clearly visible when the amplitudes are not optimized. It can be seen that with optimized amplitudes, the SNR increases with increasing number of pulses. For the same pulse rate, the new method provides a higher SNR. Conversely, for the same SNR, the new method requires a smaller pulse rate.

Fig. 4. Illustration of the differences in the excitation obtained with and without amplitude optimization. (a) A superfluous pulse caused by amplitude inaccuracy, (b) a location error, and (c) an irregular pitch pulse. (Panels: original speech; excitation and synthetic speech with unoptimized amplitudes; excitation and synthetic speech with optimized amplitudes.)

Figure 4 illustrates the differences in the excitation obtained with and without amplitude optimization. A small section of speech is shown in the figure along with the excitation obtained with the two methods. The synthetic speech signals are also shown. It can be seen that amplitude optimization removes some superfluous pulses and corrects some pulse locations. It can also be seen that the excitation becomes more periodic for voiced speech.

Fig. 5. The signal energy and the SNR obtained with and without amplitude optimization for the utterance "Beige woodwork never clashes." (Horizontal axis: time in seconds.)

Figure 5 shows the signal energy and the SNR obtained with the two methods for a sentence at a pulse rate of 1600 pulses/sec. The improvement is consistent throughout the utterance and not limited to any one section of speech. The degree of improvement varies from 2 to 10 dB with an average of slightly over 3 dB.

BIT-RATE REDUCTION BY PITCH PREDICTION

It has been our experience that about 8 pulses are needed per pitch period in the multi-pulse excitation for high quality synthesis. Thus the bit rate increases as the pitch period is reduced. This implies that the bit rate required for multi-pulse excitation is higher for female speech than for male speech of the same speech quality. Alternatively, for the same excitation bit rate, speech quality is poorer for female speakers than for male speakers. Figure 6 shows the SNR obtained as a function of the pulse rate for a male and a female speaker. A difference of about 6 to 10 dB can be seen in the SNR for the same pulse rate. For voiced speech, multi-pulse excitation shows significant correlation from one period to the next. We use this correlation to reduce the number of pulses in the excitation substantially by pitch prediction on the multi-pulse excitation. The pitch predictor is shown in Fig. 7. The LPC synthesizer input $v(n)$ is given by

$$v(n) = u(n) + \gamma\, v(n-d), \quad 0 \le n < N, \tag{17}$$

where $u(n)$ is the nth sample of the multi-pulse excitation in the frame, $\gamma$ is the predictor gain and $d$ is the predictor delay. The predictor delay can be several pitch periods and in general is longer than the frame length N. The synthetic speech signal $\hat{y}(n)$ can now be expressed as

$$\hat{y}(n) = \sum_{k=0}^{n} h(k)\, u(n-k) + \gamma \sum_{k=0}^{n} h(k)\, v(n-k-d) + \hat{y}(-1)\, h(n+1). \tag{18}$$

Fig. 6. Average SNR as a function of pulse rate for a female and a male speaker.

Fig. 7. The pitch predictor filter used for multi-pulse excitation.

As before, we minimize the mean-squared error E between the original and synthetic speech signals as in Eq. (12). The pitch predictor and the multi-pulse excitation are now obtained in two steps. First, we assume that the multi-pulse excitation is zero and minimize E to compute the predictor gain $\gamma$ and the delay $d$. Let

$$\tilde{y}(n) = y(n) - \hat{y}(-1)\, h(n+1) \quad \text{and} \quad z(n) = \sum_{k=0}^{n} h(k)\, v(n-k-d). \tag{19}$$

The minimization of E with respect to $\gamma$ results in

$$\gamma(d) = \Big[ \sum_{n=0}^{N-1} \tilde{y}(n)\, z(n) \Big] \Big/ \Big[ \sum_{n=0}^{N-1} z^2(n) \Big], \tag{20}$$

and

$$E(d) = \sum_{n=0}^{N-1} \tilde{y}^2(n) - \Big[ \sum_{n=0}^{N-1} \tilde{y}(n)\, z(n) \Big]^2 \Big/ \Big[ \sum_{n=0}^{N-1} z^2(n) \Big]. \tag{21}$$

Equation (21) is computed for each value of d in a given range and the value which minimizes $E(d)$ is chosen as the predictor delay. The predictor gain $\gamma$ is then found from Eq. (20). Next, the pitch predictor gain and delay are held constant, and the multi-pulse excitation is found as described in the last section. This procedure implies that the multi-pulse excitation only has to represent the uncorrelated part of the LPC filter excitation. For voiced speech, the excitation is highly correlated, and only a few pulses are needed in the multi-pulse excitation.

Fig. 8. Average SNR as a function of pulse rate with and without pitch prediction of the excitation for a 1 second segment of speech by a female speaker.

Figure 8 illustrates this difference for a segment of speech for a female speaker. The SNR improvement varies from about 2 dB at high bit rates to 5 dB at bit rates around 1000 pulses/sec. This corresponds to a typical savings of 500-600 pulses/sec.

CONCLUSION

In this paper, we described three problems encountered by us in our attempt to reduce the bit rates for multi-pulse excited LPC speech synthesizers and the methods used by us to bypass these problems. Our results show that these synthesizers are capable of producing good quality speech even at low bit rates.

REFERENCES

[1] B. S. Atal and J. R. Remde, "A new model of LPC excitation for producing natural-sounding speech at low bit rates," Proc. Int. Conf. on Acoustics, Speech and Signal Processing, Paris, France, 1982, pp. 614-617.

[2] L. R. Rabiner, B. S. Atal and M. R. Sambur, "LPC prediction error - analysis of its variation with the position of the analysis frame," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-25, no. 5, Oct. 1977, pp. 434-442.

[3] B. S. Atal, "Predictive coding of speech at low bit rates," IEEE Trans. Commun., vol. COM-30, no. 4, April 1982, pp. 600-614.

[4] L. R. Rabiner and R. W. Schafer, "Digital Processing of Speech Signals," Prentice Hall, 1978.
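
The first step of the pitch-prediction procedure, choosing the delay d that minimizes Eq. (21) and then reading the gain from Eq. (20), can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names `filtered_past` and `pitch_predictor_search`, the argument layout, and the convention that `v_past[-1]` holds v(-1) are all hypothetical, and the sketch assumes every candidate delay is at least one frame long so that z(n) depends only on past samples.

```python
def filtered_past(h, v_past, d, frame_len):
    """z(n) = sum_{k=0}^{n} h(k) * v(n - k - d) for n = 0..frame_len-1 (Eq. 19).

    v_past holds earlier synthesizer input, newest last: v_past[-1] is v(-1).
    Assumes d >= frame_len, so every sample that is needed lies in the past.
    """
    z = []
    for n in range(frame_len):
        acc = 0.0
        for k in range(min(n, len(h) - 1) + 1):
            acc += h[k] * v_past[n - k - d]  # index is negative: a past sample
        z.append(acc)
    return z


def pitch_predictor_search(y_tilde, h, v_past, d_min, d_max):
    """Exhaustive search over the delay d; returns (d, gain, error) or None.

    y_tilde is the frame target with the filter-memory term already removed,
    as in Eq. (19). None is returned only if z(n) is zero for every delay.
    """
    frame_len = len(y_tilde)
    if d_min < frame_len or d_max > len(v_past):
        raise ValueError("need frame_len <= d_min and d_max <= len(v_past)")
    energy = sum(v * v for v in y_tilde)
    best = None
    for d in range(d_min, d_max + 1):
        z = filtered_past(h, v_past, d, frame_len)
        zz = sum(v * v for v in z)
        if zz == 0.0:
            continue  # no past energy at this delay, gain is undefined
        yz = sum(a * b for a, b in zip(y_tilde, z))
        gain = yz / zz               # Eq. (20)
        err = energy - yz * yz / zz  # Eq. (21)
        if best is None or err < best[2]:
            best = (d, gain, err)
    return best
```

Once the delay and gain are fixed this way, the remaining target (the part of the frame not explained by the predictor) would be handed to the multi-pulse search, which is why fewer pulses are needed for strongly periodic speech.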

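The amplitude re-optimization described in the section on optimal amplitude excitation can also be illustrated with a small brute-force sketch. At each stage it tries every unused location, re-solves the normal equations of Eq. (13) for all amplitudes placed so far plus the candidate, and keeps the location with the smallest error from Eq. (16). This is one straightforward reading of the text, not the authors' code: the function names, the direct Gaussian-elimination solver, and the exhaustive location scan are assumptions made for clarity rather than speed, and the sketch assumes h(0) is nonzero so the normal equations are nonsingular.

```python
def correlations(h, y):
    """Return alpha(i, j) of Eq. (14) and x(m) of Eq. (15) over one frame."""
    N = len(y)

    def hh(t):  # causal impulse response, zero outside its stored range
        return h[t] if 0 <= t < len(h) else 0.0

    alpha = [[sum(hh(n - i) * hh(n - j) for n in range(N)) for j in range(N)]
             for i in range(N)]
    x = [sum(y[n] * hh(n - m) for n in range(N)) for m in range(N)]
    return alpha, x


def solve(A, b):
    """Solve A * sol = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(M[r][k] * sol[k] for k in range(r + 1, n))
        sol[r] = (M[r][n] - tail) / M[r][r]
    return sol


def multipulse_joint(y, h, num_pulses):
    """Place pulses one at a time, re-optimizing all amplitudes at each stage."""
    alpha, x = correlations(h, y)
    energy = sum(v * v for v in y)
    locs, amps = [], []
    for _ in range(num_pulses):
        best = None
        for cand in range(len(y)):
            if cand in locs:
                continue
            trial = locs + [cand]
            A = [[alpha[i][j] for j in trial] for i in trial]
            beta = solve(A, [x[j] for j in trial])            # Eq. (13)
            err = energy - sum(b * x[n] for b, n in zip(beta, trial))  # Eq. (16)
            if best is None or err < best[0]:
                best = (err, cand, beta)
        _, cand, amps = best
        locs.append(cand)
    return locs, amps
```

For example, with a target built from two adjacent pulses (amplitudes 1.0 and -0.7 at locations 3 and 4), the second stage revises the first amplitude when it adds the second pulse, so both amplitudes come out exact instead of the first one staying at its single-pulse estimate.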