Apple Computer Inc. v. Burst.com, Inc.
Filing
108
Declaration of Allen Gersho in Support of 107 Response of Burst.com, Inc.'s Opposition to Plaintiff Apple Computer, Inc.'s Motion for Summary Judgment on Invalidity Based on Kramer and Kepley Patents filed by Burst.com, Inc. (Attachments: # 1 Exhibit A to A. Gersho Declaration # 2 Exhibit B to A. Gersho Declaration # 3 Exhibit C to A. Gersho Declaration # 4 Exhibit D to A. Gersho Declaration # 5 Exhibit E to A. Gersho Declaration # 6 Exhibit F to A. Gersho Declaration (Part 1) # 7 Exhibit F to A. Gersho Declaration (Part 2) # 8 Exhibit G to A. Gersho Declaration # 9 Exhibit H to A. Gersho Declaration # 10 Exhibit I to A. Gersho Declaration (Part 1) # 11 Exhibit I to A. Gersho Declaration (Part 2) # 12 Exhibit J to A. Gersho Declaration # 13 Exhibit K to A. Gersho Declaration # 14 Exhibit L to A. Gersho Declaration # 15 Exhibit M to A. Gersho Declaration # 16 Exhibit N to A. Gersho Declaration # 17 Exhibit O to A. Gersho Declaration) (Related document(s) 107) (Crosby, Ian) (Filed on 6/7/2007)
Doc. 108 Att. 15
Case 3:06-cv-00019-MHP
Document 108-16
Filed 06/07/2007
Page 1 of 11
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-34, NO. 5, OCTOBER 1986
Regular-Pulse Excitation: A Novel Approach to Effective and Efficient Multipulse Coding of Speech
Abstract: This paper describes an effective and efficient time domain speech encoding technique that has an appealing low complexity and produces toll quality speech at rates below 16 kbits/s. The proposed coder uses linear predictive techniques to remove the short-time correlation in the speech signal. The remaining (residual) information is then modeled by a low bit rate reduced excitation sequence that, when applied to the time-varying model filter, produces a signal that is "close" to the reference speech signal. The procedure for finding the optimal constrained excitation signal incorporates the solution of a few strongly coupled sets of linear equations and is of moderate complexity compared to competing coding systems such as adaptive transform coding and multipulse excitation coding. The paper describes the novel coding idea and the procedure for finding the excitation sequence. We then show that the coding procedure can be considered as an "optimized" baseband coder with spectral folding as high-frequency regeneration technique. The effect of various analysis parameters on the quality of the reconstructed speech is investigated using both objective and subjective tests. Further, modifications of the basic algorithm, and their impact on both the quality of the reconstructed speech signal and the complexity of the encoding algorithm, are discussed. Using the generalized baseband coder formulation, we demonstrate that under reasonable assumptions concerning the weighting filter, an attractive low-complexity/high-quality coder can be obtained.

Manuscript received August 23, 1985; revised March 5, 1986. This work was supported in part by Philips Research Laboratories, Eindhoven, The Netherlands, and by the Dutch National Applied Science Foundation under Grant STW DEL 44.0643.
P. Kroon was with the Department of Electrical Engineering, Delft University of Technology, Delft, The Netherlands. He is now with the Acoustics Research Department, AT&T Bell Laboratories, Murray Hill, NJ 07974.
E. F. Deprettere is with the Department of Electrical Engineering, Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands.
R. J. Sluyter is with the Philips Research Laboratories, 5600 MD Eindhoven, The Netherlands.
IEEE Log Number 8609633.
0096-3518/86/1000-1054$01.00 © 1986 IEEE

I. INTRODUCTION
An interesting application area for digital speech coding can be found in mobile telephony systems and computer networks. For these applications, toll quality speech at bit rates below 16 kbits/s is a prerequisite. Many of the conventional speech coding techniques [1] fail to meet this condition. However, a class of coders, the so-called delayed decision coders (DDC) [1, ch. 9], seems to be promising for these applications. Coders that belong to this class utilize an encoding delay to find the "best" quantized version of the input speech signal or a transformed version of it. Quite effective algorithms can be designed by combining predictive and DDC techniques to yield low bit rate waveform matching encoding schemes. A powerful and common approach is to use a slowly time-varying linear predictive (LP) filter to model the short-time spectral envelope of the quasi-stationary speech signal. The problem that remains is how to describe the resulting prediction residual that contains the necessary information to describe the fine structure of the underlying spectrum. In other words, what is the "best" low-capacity model for the speech prediction residual subjected to one or more judgment criteria? These may include objective and subjective quality measures (such as rate distortions and listening scores, respectively), but coder complexity can also be taken into account. Although certain models have been shown to behave very satisfactorily [2]-[4], the question of optimality remains difficult to answer.

In this paper we address the problem of finding an excitation signal for an LP speech coder that not only ensures a comparable quality with existing approaches, but is also structurally powerful. By the latter we mean that a fast realization algorithm and a corresponding high throughput (VLSI) implementation can be obtained. We propose a method in which the prediction residual is modeled by a signal that resembles an upsampled sequence and has, therefore, a regular (in time) structure. Because of this regularity, we refer to this coder as the regular-pulse excitation (RPE) coder [5]. The values of the nonzero samples in this signal are optimally determined by a least-squares analysis-by-synthesis fitting procedure that can be expressed in terms of matrix arithmetic.

In Section II we describe in more detail the regular-pulse excitation coding procedure and the algorithm for finding the excitation sequence. In Section III we show that the proposed encoding procedure can be interpreted in terms of optimized baseband coding. In Section IV, the influence of the various analysis parameters on the quality of the reconstructed speech is investigated. Further, to exploit the long-term correlation in the speech signal, the use of a pitch predictor is discussed. Modifications to the basic procedure, to attain a further reduction in complexity without noticeable quality loss, are described in Section V. Finally, in Section VI, we describe the effect of quantization on the quality of the reconstructed speech signal.

II. BASIC CODER STRUCTURE
The basic coder structure can be viewed as a residual modeling process, as depicted in Fig. 1. In this figure, the residual $r(n)$ is obtained by filtering the speech signal $s(n)$
through the $p$th-order time-varying filter $A(z)$ of (1), which can be determined with the use of linear prediction (LP) techniques as described in, e.g., [6]. The difference between the LP residual $r(n)$ and a certain model residual $u(n)$ (to be defined below) is fed through the shaping filter $1/A(z/\gamma)$ of (2). This filter, which serves as an error weighting function, plays the same role as the feedback filter in adaptive predictive coding with noise shaping (APC-NS) [7] and the weighting filter in multipulse excitation (MPE) coders [2]. The resulting weighted difference $e(n)$ is squared and accumulated, and is used as a measure for determining the effectiveness of the presumed model $u(n)$ of the residual $r(n)$.

Fig. 1. Block diagram of the regular-pulse excitation coder: (a) encoder, (b) decoder.

The excitation sequence $u(n)$ is determined for adjacent frames consisting of $L$ samples each, and is constrained as follows. Within a frame, it is required to correspond to an upsampled version of a certain "optimal" vector $b = (b(1), \ldots, b(Q))$ of length $Q$ ($Q < L$). Thus, each segment of the excitation signal contains $Q$ equidistant samples of nonzero amplitude, while the remaining samples are equal to zero. The spacing between nonzero samples is $N = L/Q$. For a particular coder, the parameters $L$ and $N$ are optimally chosen but are otherwise fixed quantities. The duration of a frame of size $L$ is typically 5 ms. Each excitation frame can support $N$ sets of $Q$ equidistant nonzero samples, resulting in $N$ candidate excitation sequences. Fig. 2 shows the possible excitation patterns for a frame containing 40 samples and a spacing of $N = 4$. In this figure, the locations of the pulses are marked by a vertical dash and the zero samples by dots. If $k$ ($k = 1, 2, \ldots, N$) denotes the phase of the upsampled version of the vector $b^{(k)}$, i.e., the position of the first nonzero sample in a particular segment, then we have to compute for every value of $k$ the amplitudes $b^{(k)}(\cdot)$ that minimize the accumulated squared error. The vector that yields the minimum error is selected and transmitted. The decoding procedure is then straightforward, as is shown in Fig. 1(b).

Fig. 2. Possible excitation patterns with $L = 40$ and $N = 4$.

A. Encoding Algorithm
Denoting by $M_k$ the $Q$ by $L$ position matrix with entries

$$ m_{ij} = \begin{cases} 1, & j = iN + k - 1 \\ 0, & \text{otherwise} \end{cases} \qquad 0 \le i \le Q-1, \quad 0 \le j \le L-1, \tag{3} $$

the segmental excitation row vector $u^{(k)}$, corresponding to the $k$th excitation pattern, can be written as

$$ u^{(k)} = b^{(k)} M_k. \tag{4} $$

Let $H$ be an upper-triangular $L$ by $L$ matrix whose $j$th row ($j = 0, \ldots, L-1$) contains the (truncated) response $h(n)$ of the error weighting filter $1/A(z/\gamma)$ caused by a unit impulse $\delta(n - j)$. That is,

$$ H = \begin{bmatrix} h(0) & h(1) & \cdots & h(L-1) \\ 0 & h(0) & \cdots & h(L-2) \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & h(0) \end{bmatrix}. \tag{5} $$

If $e_0$ denotes the output of the weighting filter due to the memory hangover (i.e., the output as a result of the initial filter state) of previous intervals, then the signal $e(n)$ produced by the input vector $b^{(k)}$ can be described as

$$ e^{(k)} = e^{(0)} - b^{(k)} H_k, \qquad k = 1, \ldots, N, \tag{6} $$

where

$$ e^{(0)} = e_0 + rH, \tag{7} $$

$$ H_k = M_k H, \tag{8} $$

and the vector $r$ represents the residual $r(n)$ for the current frame. The objective is to minimize the squared error

$$ E^{(k)} = e^{(k)} e^{(k)t}, \tag{9} $$

where $t$ denotes transpose. For a given phase $k$ the optimal amplitudes $b^{(k)}(\cdot)$ can be computed from (6) and (9), by requiring $e^{(k)} H_k^t$ to be equal to zero. Hence,

$$ b^{(k)} = e^{(0)} H_k^t \left[ H_k H_k^t \right]^{-1}. \tag{10} $$

By substituting (10) in (6) and thereafter the resulting expression in (9), we obtain the following expression for the error:

$$ E^{(k)} = e^{(0)} \left[ I - H_k^t \left[ H_k H_k^t \right]^{-1} H_k \right] e^{(0)t}. \tag{11} $$
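The search over the $N$ phases in (3)-(11) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation (the fast algorithm of [8] and [9] is not reproduced): the function names `position_matrix`, `weighting_matrix`, and `rpe_encode_frame` are hypothetical, the impulse response `h` is assumed to be supplied by the caller, and each of the $N$ systems in (10) is solved directly.

```python
import numpy as np

def position_matrix(k, Q, L, N):
    """Q-by-L position matrix M_k of (3); the phase k runs from 1 to N."""
    M = np.zeros((Q, L))
    for i in range(Q):
        M[i, i * N + k - 1] = 1.0
    return M

def weighting_matrix(h, L):
    """Upper-triangular L-by-L matrix H of (5): row j holds h(0..L-1-j) from column j on."""
    H = np.zeros((L, L))
    for j in range(L):
        H[j, j:] = h[:L - j]
    return H

def rpe_encode_frame(r, h, N, e_hang=None):
    """Return (phase k, amplitudes b, error E) with the smallest error (9) over k = 1..N.

    r      : residual samples of the current frame (length L, a multiple of N)
    h      : impulse response of the weighting filter (at least L samples)
    e_hang : filter hangover e_0 from previous frames (assumed zero if None)
    """
    r = np.asarray(r, dtype=float)
    L = len(r)
    Q = L // N
    H = weighting_matrix(np.asarray(h, dtype=float), L)
    e0 = r @ H if e_hang is None else e_hang + r @ H      # (7)
    best = None
    for k in range(1, N + 1):
        Hk = position_matrix(k, Q, L, N) @ H              # (8)
        # (10): b (Hk Hk^t) = e0 Hk^t, solved as a symmetric linear system
        b = np.linalg.solve(Hk @ Hk.T, Hk @ e0)
        e = e0 - b @ Hk                                   # (6)
        E = float(e @ e)                                  # (9)
        if best is None or E < best[2]:
            best = (k, b, E)
    return best
```

A residual that is itself a regular pulse train should be recovered exactly, which gives a simple sanity check: build `r` from known amplitudes at phase 3 and confirm that the search returns that phase and those amplitudes.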
The vector $b^{(k)}$ that yields the minimum value of $E^{(k)}$ over all $k$ is then selected. The resulting optimal excitation vector $u^{(k)}$ is entirely characterized by its phase $k$ and the corresponding amplitude vector $b^{(k)}$. The whole procedure comprises the solution of $N$ sets of linear equations as given by (10). A fast algorithm to compute the $N$ vectors $b^{(k)}$ simultaneously has been presented in [8] and [9]. We shall show in Section V that a further reduction in complexity can be obtained by exploiting the nature of the matrix product $H_k H_k^t$ in (10).

III. GENERALIZED BASEBAND CODING
It may be observed that the regular-pulse excitation sequence bears some resemblance to the excitation signal of a residual excited baseband coder (BBC) using spectral folding as high-frequency regeneration technique [4], [10]. In this section we show that the RPE coder can be interpreted as a generalized version of this baseband coder. For this purpose we use the block diagram of Fig. 3. The blocks drawn with solid lines represent the conceptual structure of a residual excited BBC coder with spectral folding. For this coder, the index $k$ has no significance and is set to zero. In this scheme, the LP residual signal $r(n)$, obtained by filtering the speech signal through the filter $A(z)$, is band-limited by an (almost) ideal low-pass filter $F_0(z)$, downsampled to $b^{(0)}(n)$ and transmitted. At the receiver, this signal is upsampled to $u^{(0)}(n)$ to recover the original bandwidth, and is fed through the synthesis filter to retrieve the speech signal $\hat{s}(n)$. When the dashed blocks are included in Fig. 3, one provides a possibility to optimize the filter $F_k(z)$, i.e., to replace the ideal low-pass filter $F_0(z)$ by another filter, which is more tailored to "optimal" waveform matching, where the optimality criterion is to minimize the (weighted) mean-squared error between the original and the reconstructed signal.

Fig. 3. Block diagram of a BBC coder (solid lines), and an RPE coder (solid and dashed lines).

We shall now show that for this "optimized" BBC version, the output of the filter $F_k(z)$, after down- and upsampling, is exactly the excitation signal $u^{(k)}(n)$ as computed by the RPE algorithm. Thus, let there exist for each $k$ ($k = 1, \ldots, N$) an FIR filter $F_k(z)$ such that the weighted least-squares error $\sum_{n=0}^{L-1} e^2(n)$ over the interval $L$ is minimal. Define $F_k(z)$ as

$$ F_k(z) = \sum_{i=0}^{L-1} f_i^{(k)} z^{-i}, \tag{12} $$

and

$$ f^{(k)} = \left[ f^{(k)}(0) \;\; f^{(k)}(1) \;\; \cdots \;\; f^{(k)}(L-1) \right]. \tag{13} $$

Let $r_+(n)$ and $r_-(n)$ ($n = 0, \ldots, L-1$) denote the residual samples of the current frame and those of the previous frame, respectively. Then we can write for the output $u^{(k)}(n)$ of the filter $F_k(z)$

$$ u^{(k)} = f^{(k)} \begin{bmatrix} r_+(0) & r_+(1) & \cdots & r_+(L-1) \\ r_-(L-1) & r_+(0) & \cdots & r_+(L-2) \\ r_-(L-2) & r_-(L-1) & \cdots & r_+(L-3) \\ \vdots & & & \vdots \\ r_-(1) & r_-(2) & \cdots & r_+(0) \end{bmatrix} = f^{(k)} R. \tag{14} $$

The vector $b^{(k)}$, which is the downsampled version of $u^{(k)}$ (with downsampling factor $N$), can be written as

$$ b^{(k)} = f^{(k)} R M_k^t = f^{(k)} R_k, \tag{15} $$

with

$$ R_k = R M_k^t = \begin{bmatrix} r_+(k-1) & \cdots & r_+((Q-1)N - 1 + k) \\ \vdots & & \vdots \\ r_-(k) & \cdots & r_-((Q-1)N + k) \end{bmatrix}, \tag{16} $$

where $M_k$ is the position matrix as defined in (3), and where we use the definition $r_-(L + k) = r_+(k)$. The excitation vector $u^{(k)}$ can be expressed as the product

$$ u^{(k)} = b^{(k)} M_k = f^{(k)} R_k M_k, \tag{17} $$

where $R_k M_k$ is the matrix $R$ with all columns set to zero except columns $k-1, N+k-1, \ldots, (Q-1)N+k-1$; that is, $k-1$ zero columns precede the first retained column, $N-1$ zero columns separate successive retained columns, and $N-k$ zero columns follow the last one.
Hence, with the matrix $H$ and the initial error $e^{(0)}$ as defined in the previous section,

$$ e^{(k)} = e^{(0)} - f^{(k)} R_k M_k H = e^{(0)} - f^{(k)} R_k H_k. \tag{18} $$

Minimizing $E^{(k)}$ we obtain as solution

$$ f^{(k)} = e^{(0)} (R_k H_k)^t \left[ R_k H_k (R_k H_k)^t \right]^{-1}. \tag{19} $$

Substituting this result in (15), we obtain the vector $b^{(k)}$, which is equal to the pulse amplitude vector $b^{(k)}$ obtained via the procedure described in Section II (see the proof in the Appendix). Fig. 4 gives an example of the spectra $|F_k(e^{j\theta})|^2$ obtained from real speech data. From this figure we see that the filters $F_k(z)$ are rather different from the one ($F_0(z)$) used in the classical baseband coder, and have a more all-pass character. Although the RPE algorithm and the optimal BBC algorithm are conceptually equivalent, the optimized BBC variant will in general not offer any computational advantage over the RPE approach. However, in Section V, it is demonstrated that under certain reasonable assumptions concerning the weighting filter, the BBC approach can provide an attractive alternative in practice.

Fig. 4. Power spectra $|F_k(e^{j\theta})|^2$ for different values of $k$, obtained from a 5 ms speech segment.

Fig. 5. (a) Speech signal $s(n)$, (b) reconstructed speech signal $\hat{s}(n)$, (c) excitation signal $u(n)$, and (d) difference signal $s(n) - \hat{s}(n)$ in the RPE coding procedure.

TABLE I
DEFAULT PARAMETERS RPE ANALYSIS

Parameter                   Value
sampling frequency          8 kHz
LP analysis procedure       autocorrelation
order (p)                   12
update rate coefficients    10 ms
analysis frame size         25 ms Hamming window
pulse spacing N             4
frame size L                5 ms
weight factor γ             0.80

IV. EVALUATION OF THE RPE ALGORITHM
Fig. 5 shows a typical example of the waveforms as produced by the RPE coder, using the analysis parameters listed in Table I. The corresponding short-time power spectra of the speech signal $s(n)$ (solid line) and the reconstructed signal $\hat{s}(n)$ (dashed line) are shown in Fig. 6. To give an impression of the signal-to-noise ratio over a complete utterance, we show in Fig. 7 the segmental SNR (SNRSEG) computed every 10 ms for the utterance "a lathe is a big tool" spoken by both a female and a male speaker.
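The segmental SNR used in this evaluation can be illustrated with a short sketch. The source does not spell out its exact formula, so the code below assumes the common definition: the SNR in dB is computed separately for each 10 ms frame (80 samples at 8 kHz), and the per-frame values can then be plotted or averaged. The function name `segmental_snr` and the choice to skip frames with zero signal or zero error energy are assumptions of this sketch.

```python
import numpy as np

def segmental_snr(s, s_hat, frame_len=80):
    """Per-frame SNR in dB between a reference s and its reconstruction s_hat.

    frame_len=80 corresponds to 10 ms at an 8 kHz sampling rate.
    Frames with zero signal energy or zero error energy are skipped
    (an assumption of this sketch, to avoid log of zero or division by zero).
    """
    s = np.asarray(s, dtype=float)
    s_hat = np.asarray(s_hat, dtype=float)
    values = []
    for m in range(len(s) // frame_len):
        seg = slice(m * frame_len, (m + 1) * frame_len)
        signal_energy = np.sum(s[seg] ** 2)
        error_energy = np.sum((s[seg] - s_hat[seg]) ** 2)
        if signal_energy > 0 and error_energy > 0:
            values.append(10.0 * np.log10(signal_energy / error_energy))
    return np.array(values)
```

For example, a reconstruction that is a copy of the reference scaled by 0.9 leaves an error of one tenth of the signal amplitude, so every frame gives 20 dB; averaging the returned array yields a single SNRSEG figure of the kind listed in the tables.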
Fig. 6. Power spectra of the original speech segment (solid line) and the reconstructed speech segment (dashed line). The spectra were obtained with a Hamming window using the last 32 ms segment of the data displayed in Fig. 5.

A. RPE Analysis Parameters
The RPE analysis parameters that could affect the final speech quality are listed below:
1) predictor parameters,
2) pulse spacing N,
3) frame size L, and
4) error weighting filter.
To evaluate the coder behavior, we used a set of default parameter values (see Table I), while the parameter under investigation was varied. The effects of the predictor parameters in APC-like schemes have been extensively studied in the literature (e.g., [1]), and will not be discussed in detail in this paper. We found that good results were obtained with the autocorrelation method using a Hamming window on 25 ms frames. The predictor coefficients were updated every 20 ms and the predictor order p was chosen to be equal to 12.
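The autocorrelation method mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, the normal equations are solved with a general linear solver instead of a Levinson-type recursion, and the coefficients `c` follow the convention $s(n) \approx \sum_k c_k\, s(n-k)$, which may differ in sign from the $a_k$ of the source.

```python
import numpy as np

def lp_coefficients(frame, order=12, window=True):
    """Predictor coefficients c_1..c_order from the autocorrelation method.

    Convention (an assumption of this sketch): s(n) is predicted as
    sum_k c_k * s(n - k), so the inverse filter is 1 - sum_k c_k z^-k.
    """
    x = np.asarray(frame, dtype=float)
    if window:
        x = x * np.hamming(len(x))            # Hamming-windowed analysis frame
    # Autocorrelation at lags 0..order
    acf = np.array([x[:len(x) - m] @ x[m:] for m in range(order + 1)])
    # Toeplitz normal equations R c = [acf(1), ..., acf(order)]
    R = np.array([[acf[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, acf[1:])

def lp_residual(frame, c):
    """Prediction residual r(n) = s(n) - sum_k c_k s(n - k), with zero initial state."""
    x = np.asarray(frame, dtype=float)
    prediction = np.zeros_like(x)
    for k, ck in enumerate(c, start=1):
        prediction[k:] += ck * x[:-k]
    return x - prediction
```

With the default parameters of Table I, a 25 ms frame at 8 kHz holds 200 samples, so a call would look like `lp_coefficients(frame_of_200_samples, order=12)`. As a quick check, a decaying exponential `0.9 ** n` analyzed with `order=1` and `window=False` gives a single coefficient very close to 0.9 and a residual that is essentially zero after the first sample.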
Fig. 9. RPE excitation patterns with D = 12, L = 24, and N = 3.

TABLE II
SNR VALUES FOR DIFFERENT VALUES OF L AND D

L:D       SNRSEG      SNR
20:20     14.58 dB    11.SO dB
40:20     14.15 dB    11.90 dB
40:40     14.28 dB    11.17 dB
80:40     14.58 dB    11.29 dB
80:80     13.80 dB    10.44 dB

Fig. 7. Segmental SNR for successive time frames for a female speaker (a) and a male speaker (b). The upper curve represents the speech power +15 dB.
We mentioned earlier that for the case in which there is no phase adaptation (on a frame basis), that is, $k$ is fixed and equal to 1, the structure of the excitation signal resembles the upsampled residual signals used in BBC coders with spectral folding. This observation can give us a rough estimate of the maximum spacing ($N$) between the pulses, to ensure a good synthetic speech quality. Assuming a maximum fundamental frequency of 500 Hz, we have to use a sampling rate of minimally 1000 Hz. Hence, for an 8 kHz sampling rate, the pulse spacing should be less than or equal to 8.

To investigate the effect of different frame sizes $L$ and pulse spacings $N$, we computed the segmental SNR values of the reconstructed speech signals for various values of these parameters. Fig. 8 shows the averaged segmental SNR values for two female and two male speakers¹ for different values of $N$ and $L$. As far as possible, we have chosen the same frame size for different values of $N$. From this figure, we see that the SNR increases with the number of pulses and decreases with increasing frame size. However, there is no real tradeoff between the values of $L$ and $N$. Informal listening tests confirmed the ranking as introduced by the SNR measurements. For values of $N$ greater than 5, some of the utterances (especially those by female speakers) sounded distorted. From our experiments, we found that $N = 4$ and $L = 5$ ms will give the best results considering the bit rate constraints.

Fig. 8. Segmental SNR values for different frame sizes L and pulse spacings N. The results for L = X were obtained with a fixed value for k and L = 40.

¹The utterances are: "A lathe is a big tool" and "An icy wind raked the beach."

The pulse amplitudes $b^{(k)}(\cdot)$ and phase $k$ are computed every $L$ samples, which means that the phase adaptation rate is equal to $1/L$. To investigate the effect of this "disturbance," without changing the size of $L$, we considered phase adaptation every $D$ samples, where the value of $D$ is less than or equal to $L$, and $L/D$ must be an integer ratio. Within a frame of size $L$, the possible number of excitation sequences is then given by

$$ B = N^d, \qquad d = L/D. \tag{20} $$

Hence, a value of $D$ smaller than $L$ results in a more complex procedure for the computation of the optimum excitation. Fig. 9 shows the possible excitation patterns for $L = 24$, $D = 12$, and $N = 3$. Table II lists the resulting averaged SNR values for different frame sizes $L$ and ratios $L/D = 1$ and 2. From this table we see a small improvement in SNR for values of $D$ less than $L$, at the expense of a much higher complexity.

Fig. 10. (a) Speech signal $s(n)$, (b) reconstructed speech signal $\hat{s}(n)$, (c) excitation signal (i.e., output of the pitch generator), (d) difference signal $s(n) - \hat{s}(n)$ in the RPE coding procedure with pitch prediction.

B. Application of a Pitch Predictor
An examination of the regular-pulse excitation (see, for example, Fig. 5) reveals the periodic structure of the excitation for voiced sounds. Obviously, the RPE algorithm aligns the excitation "grid" to the major pitch pulses, thereby introducing the possibility that the remaining pulses within the grid are not optimally located. If we model the major pitch pulses with a pitch predictor/synthesizer, the remaining excitation sequence can be modeled by the regular-pulse excitation sequence. A simple but effective pitch predictor is the so-called one-tap predictor,

$$ 1 - P(z) = \beta z^{-M}, \tag{21} $$

where $M$ represents the distance between adjacent pitch pulses and $\beta$ is a gain factor. The pitch predictor parameters can be determined either in an open-loop configuration [11], or in a closed-loop configuration [12]. In the latter case, the parameters can be optimally computed by including a pitch generator $1/P(z)$ in the closed-loop diagram of Fig. 1. The parameters $\beta$ and $M$ are determined such that the output of the pitch generator due to its initial state is optimally close (in the weighted sense) to the initial error signal $e^{(0)}(n)$. Once $\beta$ and $M$ have been determined, the remaining regular-pulse excitation signal is computed as described in Section II, except that this signal is now to be fed through both the pitch generator and the weighting filter. The advantage of determining the pitch parameters within the analysis loop is that the pitch generator is then optimally contributing to the minimization of the weighted error. To be more specific, let $y_M(n)$ be the response of the pitch generator to an input $u(n)$, which is zero for $n \ge 0$,

$$ y_M(n) = u(n) + \beta\, y_M(n - M). \tag{22} $$
Let $z_M(n)$ represent the response of the weighting filter to the input signal $y_M(n)$, defined in (22), and let $e^{(0)}(n)$ represent the initial error as defined in (7). The error to be minimized will then be

$$ E(M, \beta) = \sum_{n=0}^{L-1} \left( e^{(0)}(n) - \beta\, z_M(n) \right)^2. \tag{23} $$

The approach is to compute $\beta$ for all possible values of $M$ within a specified range, and then select the pair $(M, \beta)$ for which $E(M, \beta)$ is minimal. The range of $M$ should be chosen to accommodate the variation in pitch frequency in the speech signal. However, in simulations with a one-tap predictor using different ranges of $M$, we found that a range of $M$ between 16 and 80 (i.e., a fundamental frequency between 100 and 470 Hz) is satisfactory. The effect of pitch prediction is demonstrated in Fig. 10, by using the same speech segment as used in Fig. 5. The short-time power spectra of the speech signal $s(n)$ (solid line) and of the error signal $s(n) - \hat{s}(n)$ (dashed line) for $\gamma = 0.80$, without and with pitch filter, are shown in Figs. 11 and 12, respectively. The effect of pitch prediction on the averaged segmental SNR values is shown in Fig. 13. These figures show that the effect of pitch prediction is to decrease the absolute level of noise power and to flatten its spectrum, thereby improving the performance in terms of SNR. This effect was most noticeable for high-pitched (average pitch > 250 Hz) speakers.

Fig. 11. Power spectra of the speech signal (solid line) and the difference signal $s(n) - \hat{s}(n)$ (dashed line) for $\gamma = 0.80$. The spectra were obtained from the last 32 ms segment of Fig. 5.

Fig. 12. Power spectra of the speech signal (solid line) and the difference signal $s(n) - \hat{s}(n)$ (dashed line) for $\gamma = 0.80$, and pitch prediction. The spectra were obtained from the last 32 ms segment of Fig. 10.
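The closed-loop search for the pair $(M, \beta)$ can be sketched as below. For a fixed lag $M$, (23) is quadratic in $\beta$, so the best gain is $\beta = \langle e^{(0)}, z_M \rangle / \langle z_M, z_M \rangle$ and the remaining error is $\langle e^{(0)}, e^{(0)} \rangle - \beta \langle e^{(0)}, z_M \rangle$. Several details are assumptions of this sketch and are not taken from the source: the function name `closed_loop_pitch`, the representation of the generator state as an array `past` of previous generator output, and the periodic repetition of the last $M$ samples when the lag is shorter than the frame.

```python
import numpy as np

def closed_loop_pitch(e0, past, h, m_min=16, m_max=80):
    """Search the lag M and gain beta that minimize (23) for one frame.

    e0   : initial weighted error for the frame (length L)
    past : previous pitch-generator output, most recent sample last
           (at least m_max samples)
    h    : impulse response of the weighting filter (at least L samples)
    Returns (M, beta, E); M is None if no lag improves on beta = 0.
    """
    e0 = np.asarray(e0, dtype=float)
    past = np.asarray(past, dtype=float)
    L = len(e0)
    n = np.arange(L)
    best_M, best_beta, best_E = None, 0.0, float(e0 @ e0)
    for M in range(m_min, m_max + 1):
        # Delayed generator output y(n - M); for lags shorter than the frame
        # the last M samples are repeated periodically (sketch assumption).
        y = past[len(past) - M + (n % M)]
        z = np.convolve(y, h)[:L]             # weighting-filter response z_M(n)
        zz = float(z @ z)
        if zz == 0.0:
            continue
        ez = float(e0 @ z)
        beta = ez / zz                         # least-squares gain for this lag
        E = float(e0 @ e0) - beta * ez         # value of (23) at that gain
        if E < best_E:
            best_M, best_beta, best_E = M, beta, E
    return best_M, best_beta, best_E
```

The default range of 16 to 80 samples follows the range reported above. In a complete coder the contribution $\beta z_M$ would be subtracted from $e^{(0)}$ before the regular-pulse amplitudes are computed; that step is omitted here.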
C. Error Weighting Filter Although the effect of noise shaping can be heard, the real mechanism behind this effect is not clear. We will not pursue the question whether the proposed noiseshap
N=4
N=5
N=4
N=5
UPDATE RATE IOrns 20 rns Fig. 13. Segmental SNR values obtained from RPE encoded speech with ( + p p ) and without (  p p ) pitch prediction for different update rates of the predictors and different pulse spacings N (f = female, m =.male).
Case 3:06cv00019MHP
1060
Document 10816
Filed 06/07/2007
SIGNAL PROCESSING.
Page 8 of 11
VOL. ASSP34, NO. 5, OCTOBER 1986
IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND
Notice that the matrix HH' is also a Toeplitz matrix. Moreover, when substituting H from (26) into (8), we shall have that the matrices HkHL are independent of the phase index k and are equal to a single Toeplitz matrix. It should also be remarked that the matrix of (26) is an L by 2 L matrix instead of an L by L. Thus, when substituting H of (26) in ( 7 ) and (8), the vectors eo, e('), and e(k) in (6) and ( 7 ) will now be of length 2L, while the vectors u ( ~ ) r in (4) and ( 7 ) remain of dimension L. The RPE and encoding procedure that is based on the mapping H in k= I (26), and for which g(n) in (25) is the impulse response where ( c k )are the coefficients of the$xed loworder pre of the transfer function l/A(z), will be referred to as dictors as used in DPCM systems, which are based on the RPM1. Fig. 14(a) shows the segmental SNR values per averaged spectral characteristics of speech. We carried out 10 ms for this method (dashed line) and the original comparative listening tests on the results obtained with method (solid line) for the utterance "a lathe is a big tool" fixed weighting filters of different orders ( q = 1 to 3). spoken by both a male and a female speaker. The value of y was set to 0.80 and we used for ( c k } the f coefficients tabulated in [ 131. It was surprising to find that B. Modification o Hk& to a Band Matrix In the previous subsection, a computationally attractive the effects of the weighting filters l/A(z/y) and l/C(z/y) were judged to be almost equivalent. This remarkable re scheme was obtained by forcing the matrix operator H to sult can be exploited to dramatically reduce the complex be of the form of (26). Recall, however, that this structure ity of the proposed coder, as we will show in the next is almost naturally emerging when the mapping originally defined via ( 5 ) is taken to be of dimension L by 2 L instead section. of L by L. This is the more so when h(n) in (26) is the V. 
COMPLEXITY REDUCTION THE RPECODER OF impulse response of the fixed filter l/C(z/y) of (24). But The analysis procedure of the RPE coder necessitates an even more interesting observation is that the resulting the solution of N sets of linear equations, where N rep single Toeplitz matrix, whether data dependent or not, is Hence, when minimizing resents the spacing between successive pulses within a strongly diagonal dominant. frame in the excitation model. However, the matrices E(k' in (1 l), where now H is built on l/C(zly), or equivH k H i , which have to be inverted, can be solved very ef alently, when maximizing ficiently as was described in [8] and [9]. We shall not T'k) e(')H:[HkHi] lHk (27) pursue the details of this procedure here, but we shall instead look for modifications of the algorithm to reduce the we can conveniently replace the (Toeplitz) matrix HkHi with a diagonal matrix roZ, where ro = h2(i), yieldcomplexity without affecting the coder performance.
+
ing filter of ( 2 ) is an effective choice or not, and concentrate instead on the effect of the suggested filter and its parameter determines the control parameter y. This amount of noise power in the formant regions of the speech spectrum. Noise shaping reduces the SNR, but improves the perceived speech quality. An optimal value for y was found to be between 0.80 and 0.90 at an 8 kHz sampling rate, and resulted in an average 2 dB decrease in SNR. Aside from the value of y,the order of the noiseshaping filter could also be of importance. By default, thecoefficients (ak} and the order p of l/A(z/y) areequal to those of the predictor A(z), but instead, we can compute a qthorder predictor ( q < p ) and use the resulting q coefficients to define the weighting filter. While reducing the order, we nevertheless must takecare that the noise remains properly weighted. We examined the effect of decreasing the order of the weighting filter l / A ( z l y ) , and observed that for low orders (24), the results were close to those obtained with a 16thorder filter. However, the computational savings obtained by reducing the order of the weighting filter are marginal. The timevarying nature of the weighting filter provides a significant contribution to the complexity of the analysis procedure,since the system of linear equations to be solved is entirely built on the impulse response of this filter. It is obvious that the computational complexity would be considerably lower in casea weighting filter could be chosen such that the matrix to be inverted no longer depends on shorttime data. It turns out that this is possible by choosing the weighting filter equal to l/C(zly), 1 1  9 (24) c(z'y) 1 C c k ykz k '
A. Modijication of Hk H i to a Toeplitz Matrix To begin with, we can reconfigure the algorithm to force the matrix product H k H : in (10) to become a single Toeplitz matrix which is independent of the phase k . Thus, let h(n) = y"g(n), n
=
0, 1, 2, *

,
(25)
be the impulse response of the weighting filter l/A(z/y), where g(n) is the impulse response of the allpole filter l/A(z). For values of /y less than one, h(n) converges I faster to zero than g(n) and, as a result, L by 2 L matrix the built on h(n) can be very well approximated by the uppertriangular Toeplitz matrix H in (26).
.=I"
h(0) h(1)
*
0
h(0) *
 h(L  1) .  h(L  2)
*
0
h(L  1)
...
*
0 0
.

.
h(L  3) h(L  2)
:
0

**
k(0)
...
h(L  1) 0
j.
(26)
Case 3:06cv00019MHP
EXCITATION : REGULARPULSE KROON et al.
Document 10816
Filed 06/07/2007
Page 9 of 11
1061
ALATHE
ISAB I G
T0 0 L
TIME (SI ALATHE ISAB I G T 0 0 L
I
h.
h
7
ing

$$
T^{(k)} = \frac{1}{r_0}\, e^{(0)} H_k^t H_k\, e^{(0)t}, \tag{28}
$$

which means that no matrix inversion is needed to find the optimum phase k. Table III lists the SNR and SNRSEG values for the different methods, obtained by averaging the results of the same four utterances used in previous examples. Method RPM2 refers to the procedure described in this subsection, where the optimal phase index k is determined from (28), after which the excitation string $b^{(k)}$ is computed according to (10). From this table, we see that the modifications introduced resulted in a slight decrease in SNR. But from informal listening tests, the modified methods were judged to be almost equivalent to the original RPE method. Fig. 14(b) shows the segmental SNR values per 10 ms for RPM2 (dashed lines) and the original method (solid line) for the utterance "a lathe is a big tool" spoken by both a male and a female speaker.

Fig. 14. Segmental SNR for RPE (solid line) and modified methods, (a) RPM1 and (b) RPM2 (dashed line), for a female and a male speaker.

TABLE III
SNR VALUES OBTAINED WITH THE ORIGINAL AND THE MODIFIED RPE ALGORITHMS. RPM1 AND RPM2 ARE DESCRIBED IN SECTIONS V-A AND V-B. THE PROCEDURES RPF1 AND RPF2 ARE DESCRIBED IN SECTION V-C.

Method    SNRSEG      SNR
RPE       14.28 dB    11.17 dB
RPM1      12.98 dB    10.93 dB
RPM2      13.00 dB    11.03 dB
RPF1      10.04 dB     9.38 dB
RPF2      10.40 dB     9.21 dB

C. Avoiding Matrix Inversion
The discussions in the previous two subsections have led to the conclusion that the complexity of the RPE coder, although moderate by itself, can be substantially reduced without any significant degradation of the speech quality. We shall show in this subsection that it is even possible to obtain an extremely simple encoding algorithm that turns out to yield a practical version of the (conceptual) optimal baseband coder, which was described in Section III and was shown there to be equivalent to the RPE coder. Thus, let $h(n)$ in (26) be the impulse response of the time-invariant filter $1/C(z/\gamma)$ as defined in (24). Next, use in (8) the matrix H as defined in (26) and discard the zeroth-order approximation $e_0$ in (7). Then (6) and (10) become

$$
e^{(k)} = rH - b^{(k)} H_k, \tag{29}
$$

and

$$
b^{(k)} [H_k H_k^t] = r H H^t M_k^t, \tag{30}
$$

respectively. Now denoting

$$
S = H H^t, \tag{31}
$$

and recalling that

$$
H_k H_k^t \approx r_0 I, \tag{32}
$$

with

$$
r_0 = \sum_{i=0}^{L-1} h^2(i),
$$

as a coder constant, it is easy to show that

$$
b^{(k)} = \frac{1}{r_0}\, r S M_k^t. \tag{33}
$$

Interpreting $M_k^t$ as a downsampling operator, (33) says that $b^{(k)}$ resembles² a downsampled output of a smoother S whose input is a scaled version of the residual r [see Fig. 15(a)]. The excitation selection in the diagram of Fig. 15(a) is based on the minimization of the approximation error given by (11). Under the above-mentioned constraints, this equation becomes

$$
E^{(k)} = r S r^t - r_0\, b^{(k)} b^{(k)t}. \tag{34}
$$

² This statement must be carefully interpreted. In fact, (33) is a block smoother, and hence, the boundary conditions of the smoother's internal state must be properly taken into account.
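The step from (30)-(32) to (33) and (34) can be spelled out as follows. This is a sketch under two assumptions that are not restated in this section: that $H_k = M_k H$ (which is what the right-hand side of (30) implies), and that the error measure is the squared norm $E^{(k)} = e^{(k)} e^{(k)t}$ of the error vector in (29).

$$
\begin{aligned}
b^{(k)} [H_k H_k^t] = r S M_k^t \ \text{and}\ H_k H_k^t \approx r_0 I
  \;&\Rightarrow\; b^{(k)} = \frac{1}{r_0}\, r S M_k^t, \\
E^{(k)} &= (rH - b^{(k)} H_k)(rH - b^{(k)} H_k)^t \\
        &= r S r^t - 2\, r H H_k^t\, b^{(k)t} + b^{(k)} H_k H_k^t\, b^{(k)t} \\
        &= r S r^t - 2 r_0\, b^{(k)} b^{(k)t} + r_0\, b^{(k)} b^{(k)t} \\
        &= r S r^t - r_0\, b^{(k)} b^{(k)t},
\end{aligned}
$$

where the third line uses $r H H_k^t = r S M_k^t = r_0 b^{(k)}$. Since $r S r^t$ does not depend on the phase k, minimizing $E^{(k)}$ amounts to maximizing the energy $b^{(k)} b^{(k)t}$ of the candidate excitation.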
VI. QUANTIZATION
Fig. 15. Simplified RPE procedure (a) and excitation selection (b). The smoother is represented by a triangle shape.

Fig. 16. Segmental SNR for RPF1 procedure (solid line) and RPF2 procedure (dashed line) for a female (a) and a male (b) speaker.
To quantize the pulses (i.e., entries of $b^{(k)}$), we used an 8-level adaptive quantizer whose input range was adjusted to the largest pulse amplitude within the current frame of size L. The quantization bins can be determined by a Lloyd-Max procedure (nonuniform), but we found that a uniform quantizer also performs quite well. The quantizer normalization factor is logarithmically encoded with 6 bits and is transmitted every L samples (typically 5 ms). The normalized pulses are encoded using 3 bits per pulse. To minimize quantization errors, the quantizer has to be incorporated in the minimization procedure. This can be done in two ways. In the first case (RPQ1), only the optimal excitation vector is quantized; in the second case (RPQ2), every candidate $b^{(k)}$ is quantized and the quantized vector that produces a minimum error is selected. From segmental SNR measurements, we found that RPQ2 yields a higher SNR, and in listening tests the quality of the reconstructed speech of RPQ2 was judged to be somewhat better than that of RPQ1.

The 12 reflection coefficients were transformed to inverse sine coefficients and encoded with 44 bits/set. The bit allocation and quantizer characteristics were determined by the minimum deviation method [14]. Using 3 bits/pulse and a pulse spacing of N = 4, the excitation signal can be encoded with 7 kbits/s. The predictor coefficients can be encoded with 2.2 kbits/s, resulting in a total bit rate of 9.2 kbits/s. The quality of the reconstructed speech was judged to be good but definitely not transparent. In informal listening tests, it was determined that the RPE approach has fewer artifacts than the baseband coder as proposed in [4], and that the performance is comparable to that of the MPE schemes. A pitch predictor will enhance the coder performance, but at the cost of an additional 1000 bits/s (4 bits for β and 6 bits for M).
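The uniform variant of the pulse quantizer and the excitation bit budget can be illustrated with a short Python sketch. It is not the authors' implementation: the function names are hypothetical, the mid-rise level placement is an assumption, the 6-bit logarithmic coding of the normalization factor is not modeled, and the 8 kHz sampling rate (so that a 5 ms frame holds L = 40 samples) is assumed rather than stated in this section.

```python
def quantize_pulses(pulses, levels=8):
    """Open-loop (RPQ1-style) uniform quantizer, normalized to the frame peak.

    Returns (peak, codes); each code is an integer in [0, levels - 1].
    """
    peak = max(abs(p) for p in pulses)
    if peak == 0.0:
        return 0.0, [levels // 2] * len(pulses)
    step = 2.0 / levels                   # bin width on the range [-1, 1]
    codes = []
    for p in pulses:
        x = p / peak                      # normalized amplitude
        codes.append(min(levels - 1, int((x + 1.0) / step)))
    return peak, codes


def dequantize_pulses(peak, codes, levels=8):
    """Reconstruct each pulse at the midpoint of its quantization bin."""
    step = 2.0 / levels
    return [peak * (-1.0 + (c + 0.5) * step) for c in codes]


def excitation_bit_rate(frame_samples, spacing, bits_per_pulse,
                        gain_bits, sample_rate):
    """Bits per second for pulses plus normalization factor (phase bits ignored)."""
    pulses_per_frame = frame_samples // spacing
    bits_per_frame = pulses_per_frame * bits_per_pulse + gain_bits
    return bits_per_frame * sample_rate / frame_samples


peak, codes = quantize_pulses([1.0, -1.0, 0.0, 0.5])
decoded = dequantize_pulses(peak, codes)
rate = excitation_bit_rate(40, 4, 3, 6, 8000)   # assumed 8 kHz, L = 40, N = 4
```

Under these assumptions a frame carries 10 pulses at 3 bits plus 6 bits for the normalization factor, i.e. 36 bits per 5 ms or 7200 bits/s, which is in line with the roughly 7 kbits/s quoted above. Likewise, 44 bits/set at 2.2 kbits/s corresponds to 50 coefficient sets per second.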
VII. CONCLUSION
Hence,

$$
\min \{E^{(k)}\} = \max \{b^{(k)} b^{(k)t}\}. \tag{35}
$$

The whole procedure is now extremely simple. The residual signal r is "smoothed" with the smoother $S = HH^t$. The resulting output vector is downsampled by applying $M_k^t$, and the $b^{(k)}$ for which $b^{(k)} b^{(k)t}$ is maximum is selected [see Fig. 15(b)]. Notice that since H is built on $1/C(z/\gamma)$, the smoother S will be of low order (typically 3rd order), since $h(n)$ is a rapidly decaying sequence. For comparison, the averaged SNRSEG values obtained with this procedure have been included in Table III. In this table, the RPE coder using a fixed weighting filter is referred to as RPF1, while the procedure outlined above is referred to as RPF2. In Fig. 16, the same comparison is made of the segmental SNR as a function of time for the utterance "a lathe is a big tool" spoken by both a male and a female speaker. From this figure, it is clear that, for a fixed weighting filter, procedure RPF2 provides a quality comparable to that of procedure RPF1. The advantage of RPF2 is its ease of implementation.
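The smooth, downsample, and select-maximum procedure can be sketched as below. The sketch is illustrative only: the function names are hypothetical, phases are numbered from 0, and S is assumed to be the Toeplitz matrix of autocorrelations of the truncated response $h(n)$ (that is, $S_{ij} = \sum_m h(m)\,h(m+|i-j|)$, with $r_0$ on the diagonal). Block-boundary effects of the smoother's internal state are ignored here.

```python
def autocorrelation(h, lag):
    """R(lag) = sum over m of h(m) * h(m + lag) for the truncated response h."""
    return sum(h[m] * h[m + lag] for m in range(len(h) - lag))


def select_excitation(r, h, spacing):
    """Return (phase k, excitation b_k) maximizing the energy of b_k.

    r       : residual block of length L (L assumed divisible by `spacing`)
    h       : truncated impulse response, zero-padded to length L if shorter
    spacing : pulse spacing N, giving N candidate phases
    """
    L = len(r)
    h = list(h[:L]) + [0.0] * max(0, L - len(h))
    r0 = autocorrelation(h, 0)
    # Smoothed and scaled residual: (1 / r0) * r * S, with S[i][j] = R(|i - j|).
    smoothed = [sum(r[i] * autocorrelation(h, abs(i - j)) for i in range(L)) / r0
                for j in range(L)]
    # Downsampling: phase k keeps samples k, k + N, k + 2N, ...
    candidates = [smoothed[k::spacing] for k in range(spacing)]
    energies = [sum(v * v for v in b) for b in candidates]
    k_best = max(range(spacing), key=energies.__getitem__)
    return k_best, candidates[k_best]


# With h = [1.0] the smoother is the identity, so the residual passes through.
k1, b1 = select_excitation([0, 3, 0, 0, 0, -2, 0, 0], [1.0], spacing=4)
# A two-tap response spreads each residual sample onto its neighbours.
k2, b2 = select_excitation([1.0, 0.0, 0.0, 0.0], [1.0, 0.5], spacing=2)
```

No linear system is solved anywhere in this selection, which is the point of the simplification; the cost is dominated by the low-order smoothing step.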
In this paper, a novel coding concept has been proposed that uses linear prediction to remove the short-time correlation in the speech signal. The remaining residual signal is then modeled by a regular (in time) excitation sequence that resembles an upsampled sequence. This model excitation signal is determined in such a way that the perceptual error between the original and the reconstructed signal is minimized. The computational effort is only moderate and can be further reduced by using a fixed error weighting filter and an appropriate vector size (minimization segment length). The coder can produce high-quality speech at bit rates around 9600 bits/s by using a pulse spacing equal to 4 and quantizing each pulse with 3 bits. The use of pitch prediction improves the speech quality but, in general, the RPE coder performs adequately without a pitch predictor. Other applications for the proposed coder can be found in the area of wideband speech coding (7 kHz bandwidth) as encountered in tele- and videoconferencing applications [15].
APPENDIX
The excitation vector obtained with the optimized BBC (Section III) coincides with the vector $b^{(k)}$ produced by the RPE algorithm (Section II).

Proof: Equation (19) can be written as

$$
f^{(k)} R_k [H_k H_k^t] R_k^t = e^{(0)} H_k^t R_k^t.
$$

Multiplying both sides to the right by $R_k$ gives

$$
f^{(k)} R_k [H_k H_k^t] R_k^t R_k = e^{(0)} H_k^t R_k^t R_k.
$$

Now assuming that $R_k^t R_k$ is nonsingular (which will almost always be the case for speech signals), we can as well write

$$
f^{(k)} R_k [H_k H_k^t] = e^{(0)} H_k^t.
$$

Substituting $b^{(k)}$ for $f^{(k)} R_k$, see (15), in this equation, we obtain (10).

REFERENCES
[1] N. S. Jayant and P. Noll, Digital Coding of Waveforms. Englewood Cliffs, NJ: Prentice-Hall, 1984.
[2] B. S. Atal and J. R. Remde, "A new model of LPC excitation for producing natural-sounding speech at low bit rates," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Apr. 1982, pp. 614-617.
[3] M. R. Schroeder and B. S. Atal, "Code-excited linear prediction (CELP): High-quality speech at very low bit rates," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1985, pp. 937-940.
[4] R. J. Sluyter, G. J. Bosscha, and H. M. P. T. Schmitz, "A 9.6 kbit/s speech coder for mobile radio applications," in Proc. IEEE Int. Conf. Commun., May 1984, pp. 1159-1162.
[5] E. F. Deprettere and P. Kroon, "Regular excitation reduction for effective and efficient LP-coding of speech," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Mar. 1985, pp. 25.8.1-25.8.4.
[6] J. D. Markel and A. H. Gray, Linear Prediction of Speech. Berlin, Germany: Springer-Verlag, 1976.
[7] B. S. Atal, "Predictive coding of speech at low bit rates," IEEE Trans. Commun., vol. COM-30, pp. 600-614, Apr. 1982.
[8] E. F. Deprettere and K. Jainandunsing, "Design and VLSI implementation of a concurrent solver for N-coupled least-squares fitting problems," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Mar. 1985, pp. 6.3.1-6.3.4.
[9] K. Jainandunsing and E. F. Deprettere, "Design and VLSI implementation of a concurrent solver for N-coupled least-squares fitting problems," IEEE J. Select. Areas Commun., pp. 39-48, Jan. 1986.
[10] V. R. Viswanathan, A. L. Higgins, and W. H. Russel, "Design of a robust baseband LPC coder for speech transmission over noisy channels," IEEE Trans. Commun., vol. COM-30, pp. 663-673, Apr. 1982.
[11] P. Kroon and E. F. Deprettere, "Experimental evaluation of different approaches to the multipulse coder," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Mar. 1984, pp. 10.4.1-10.4.4.
[12] S. Singhal and B. S. Atal, "Improving performance of multipulse LPC coders at low bit rates," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Mar. 1984, pp. 1.3.1-1.3.4.
[13] J. L. Flanagan, M. R. Schroeder, B. S. Atal, R. E. Crochiere, N. S. Jayant, and J. M. Tribolet, "Speech coding," IEEE Trans. Commun., vol. COM-27, pp. 710-736, Apr. 1979.
[14] A. H. Gray and J. D. Markel, "Implementation and comparison of two transformed reflection coefficient scalar quantization methods," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, pp. 575-583, Oct. 1980.
[15] C. Horne, P. Kroon, and E. F. Deprettere, "VLSI implementable algorithm for transparent coding of wide band speech below 32 kb/s," in Proc. IASTED Symp. Appl. Signal Processing Dig. Filter., June 1985, pp. A3.1-A3.4.

Peter Kroon (S'82-M'86) was born in Vlaardingen, The Netherlands, on September 7, 1957. He received the B.S., M.S., and Ph.D. degrees in electrical engineering from Delft University of Technology, Delft, The Netherlands, in 197?, 1981, and 1985, respectively. His Ph.D. work focused on time-domain techniques for toll quality speech coding at rates below 16 kbits/s. From 1982 to 1983 he was a Research Assistant at the Network Theory Group, Delft University of Technology. During the years 1984 and 1985 he was sponsored by Philips Research Labs to work on coders suitable for mobile radio applications. He is currently with the Acoustics Research Department, AT&T Bell Laboratories, Murray Hill, NJ. His research interests include speech coding, signal processing, and the development of software for signal processing.
Lecturer at DUT, where he is now Associate Professor in the Department of Electrical Engineering. His current research interests are in VLSI and signal processing, particularly speech and image processing, filter design and modeling, systolic signal processors, and matrix equation solvers.
Rob J. Sluyter was born in Nijmegen, The Netherlands, on July 12, 1946. In 1968 he graduated in electronic engineering from the Eindhoven Instituut voor Hoger Beroepsonderwijs. He joined Philips Research Laboratories, Eindhoven, The Netherlands, in 1962. Until 1978 he was a Research Assistant involved in data transmission and low bit rate speech coding. In 1978 he became a Staff Researcher engaged in speech analysis, synthesis, digital coding of speech, and digital signal processing. Since 1982 he has been engaged in research on medium bit rate coding of speech for mobile radio applications as a member of the Digital Signal Processing Group. His current interests are in digital signal processing for television signals.