The primary objective of this experiment is to study the characteristics of speech using linear prediction (LP) analysis. This includes observing the LP spectrum and LP residual for voiced and unvoiced segments, studying the effect of the order of LP analysis (normalized error), examining the autocorrelation of the signal and of the LP residual for voiced and unvoiced segments, and studying glottal pulse shapes.
Source-system modeling of speech signals using LP analysis
The vocal tract system can be modeled as a time-varying all-pole filter using segmental analysis. Segmental analysis processes speech as short (10-30 ms) overlapped (5-15 ms) windows. The vocal tract system is assumed to be stationary within each window and is modeled as an all-pole filter of order \( p \) using linear prediction (LP) analysis. LP analysis works on the principle that a sample value in a correlated, stationary sequence can be predicted as a linear weighted sum of the past \( p \) samples. If \( s(n) \) denotes a sequence of speech samples, then the predicted value at the time instant \( n \) is given by
$$ \hat{s}(n) = -\sum_{k=1}^{p}{a_k~s(n-k)} $$
where \( \{a_k\},~k=1,2,...,p \) is the set of linear predictor coefficients (LPCs) and \( p \) is the order of the LP filter. With this sign convention the inverse filter is \( A(z) = 1 + \sum_{k=1}^{p} a_k z^{-k} \). The error at time \( n \) and the sum of squared errors \( E \) are given by
$$ r(n) = s(n) - \hat{s}(n) = s(n) + \sum_{k=1}^{p}{a_k~s(n-k)} $$
$$ E = \sum_{n}{r^2(n)} $$
The cost function \( E \) is minimized with respect to \( \{a_i\},~i=1,2,...,p \) over the interval \( -\infty \leq n \leq \infty \) (autocorrelation formulation) by setting
$$ \frac{\partial E}{\partial a_i} = 0, \quad 1 \leq i \leq p $$
This minimization leads to a set of normal equations,
$$ \sum_{k=1}^{p}{a_k~R(i-k)} = -R(i), \quad 1 \leq i \leq p $$
where
$$ R(i) = \sum_{n=-\infty}^{\infty} s(n)~s(n+i), \quad -\infty \leq i \leq \infty $$
is the autocorrelation function of the signal. The solution of these normal equations gives the values of the predictor coefficients \( \{a_k\},~k=1,2,...,p \). The error signal \( r(n) \) obtained by inverse filtering the speech signal is referred to as the LP residual. The smooth (highly correlated) variations in the speech signal are captured by the LPCs and are attributed to the vocal tract characteristics.
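As a minimal sketch of the autocorrelation formulation above, the normal equations can be set up and solved with core MATLAB/Octave functions only. The test signal (the impulse response of a known second-order all-pole filter) and the names a_true, R, a and ak are assumptions made for illustration; in the experiment itself the lpc function is used instead.

```matlab
% Sketch: solve the LP normal equations directly (autocorrelation method).
% Assumption: synthetic test signal = impulse response of a known all-pole filter.
a_true = [1 -1.2 0.8];                      % hypothetical A(z) = 1 - 1.2 z^-1 + 0.8 z^-2
s = filter(1, a_true, [1; zeros(399,1)]);   % 400 samples; the response has decayed by then
p = 2;                                      % LP order
N = length(s);

% Autocorrelation R(0..p); R(i) is stored in R(i+1)
R = zeros(p+1,1);
for i = 0:p
    R(i+1) = sum(s(1:N-i).*s(1+i:N));
end

% Normal equations: sum_k a_k R(i-k) = -R(i), i = 1..p (symmetric Toeplitz system)
a  = toeplitz(R(1:p)) \ (-R(2:p+1));
ak = [1; a];                                % inverse filter A(z) = 1 + sum_k a_k z^-k

r = filter(ak, 1, s);                       % LP residual by inverse filtering
E = sum(r.^2);                              % sum of squared errors
disp(ak.');                                 % close to a_true for this signal
```

For this particular signal the excitation is a single unit impulse, so the residual is close to an impulse and the estimated coefficients are close to a_true.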
The complex poles of the LP filter occur as conjugate pairs. Each pair represents a resonator cavity whose response is maximum at a frequency (the resonant frequency) determined by the angle of the poles in the z-plane. The vocal tract can be considered as a cascade of resonator cavities with different shapes and sizes, and the resonant frequencies of these cavities are referred to as formants. For voiced speech, the LP residual has large error values at regular intervals, which can be attributed to the periodic impulses of excitation. Hence the LP residual is a good approximation to the excitation source signal and can be used to extract the excitation source characteristics. A windowed segment of voiced speech, the frequency response of the inverse filter and the corresponding LP residual are shown in Figure 1.
Figure 1: Inverse filtering the speech signal for estimating the excitation source (LP residual) signal.
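The relation between pole angles and formants described above can be sketched as follows. The coefficient vector ak here is a hypothetical single resonator (700 Hz, pole radius 0.95) standing in for the output of lpc, and the bandwidth expression is the usual approximation from the pole radius; both are assumptions for illustration.

```matlab
% Sketch: formant frequencies from the roots of the LP polynomial A(z).
fs = 10000;                                 % sampling frequency in Hz
f0 = 700; r0 = 0.95;                        % hypothetical resonance and pole radius
ak = [1, -2*r0*cos(2*pi*f0/fs), r0^2];      % stands in for ak = lpc(a200,14)

poles = roots(ak);                          % poles of the all-pole filter 1/A(z)
poles = poles(imag(poles) > 0);             % keep one pole of each conjugate pair
[formants, idx] = sort(angle(poles)*fs/(2*pi));   % pole angle -> frequency in Hz
bandwidths = -log(abs(poles(idx)))*fs/pi;         % approximate 3 dB bandwidth in Hz
```

With real LPCs, real poles and very broad poles (small radius) usually do not correspond to formants and would have to be discarded before interpreting the result.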
Collection of voiced (/a/) and unvoiced (/s/) speech segments
- Record a vowel /a/ for one second at 10 kHz sampling frequency with 16-bit quantization. From this recorded speech file, extract 200 ms from the steady portion of the waveform.
- Record an unvoiced segment /s/ for one second at 10 kHz sampling frequency with 16-bit quantization. From this recorded speech file, extract 200 ms from the steady portion of the waveform.
- The short voiced and unvoiced speech segments are shown in Figure 1.
Figure 1: (a) Segment of voiced speech /a/ and (b) segment of unvoiced speech /s/
Short time spectrum, LP spectrum and Inverse spectrum for voiced segment /a/
Short time spectrum
The short time spectrum describes the frequency content (magnitude and phase components) of a short segment (10-30 ms) of a signal. It is computed with the following procedure:
- Take a 20 ms voiced speech segment of /a/ after pre-emphasis:
a=audioread('vowel.wav');   % use wavread('vowel.wav') in older MATLAB releases
a=diff(a);                  % first difference acts as pre-emphasis
a200=a(501:700);            % 200 samples = 20 ms at 10 kHz
- Apply a Hamming window to the voiced segment (a200), then compute the fast Fourier transform of the windowed segment and plot the magnitude of the spectrum in dB:
ham=hamming(200);
a200ham=a200.*ham;
a200hamspec=fft(a200ham,1024);
y=abs(a200hamspec).^2;      % power spectrum
logy=10*log10(y);
figure;plot((0:511)*5000/512,logy(1:512));grid;   % frequency axis in Hz, 0 to 5 kHz
LP spectrum
- The LP spectrum provides a smoothed envelope of the short time spectrum, in which only the formant frequencies (resonances) are observed. To obtain it, the linear prediction coefficients (LPCs), i.e. the filter parameters, are first computed from the speech signal:
ak=lpc(a200,14);            % 14th-order LPCs, ak = [1 a1 ... a14]
lpspec=freqz(1,ak);         % all-pole filter 1/A(z), 512 points from 0 to fs/2
y=abs(lpspec).^2;
logy=10*log10(y);
figure;plot((0:511)*5000/512,logy);grid;
Inverse spectrum
- The inverse filter is the inverse of the LP filter, i.e. the all-zero filter A(z). Its spectrum is therefore computed as follows:
invspec=freqz(ak,1);        % all-zero filter A(z)
y=abs(invspec).^2;
logy=10*log10(y);
figure;plot((0:511)*5000/512,logy);grid;
The short time spectrum, LP spectrum and inverse spectrum for a segment of voiced speech are shown in Figure 2.
Figure 2: (a) Segment of voiced speech /a/ and its (b) short time spectrum, (c) LP spectrum and (d) inverse spectrum
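Where freqz is not available, the LP and inverse spectra above can also be obtained by evaluating A(z) on the unit circle with an FFT. The sketch below assumes a hypothetical single-resonator ak in place of the LPCs from lpc; the names nfft, fpeak, lpspec_dB and invspec_dB are illustrative.

```matlab
% Sketch: LP spectrum and inverse spectrum from one FFT of the LP coefficients.
fs = 10000; nfft = 1024;
f0 = 700; r0 = 0.95;                        % hypothetical resonance
ak = [1, -2*r0*cos(2*pi*f0/fs), r0^2];      % stands in for ak = lpc(a200,14)

A = fft(ak(:), nfft);                       % A(e^{jw}) sampled at nfft frequencies
A = A(1:nfft/2);                            % keep 0 .. fs/2
f = (0:nfft/2-1).'*fs/nfft;                 % frequency axis in Hz

invspec_dB = 20*log10(abs(A));              % inverse (all-zero) spectrum in dB
lpspec_dB  = -20*log10(abs(A));             % LP (all-pole) spectrum in dB

[~, k] = max(lpspec_dB);
fpeak = f(k);                               % LP spectral peak, close to f0
% figure; plot(f, lpspec_dB, f, invspec_dB); grid;
```

Because one spectrum is the negative of the other in dB, every peak of the LP spectrum appears as a valley of the inverse spectrum at the same frequency.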
Short time spectrum, LP spectrum and Inverse spectrum for unvoiced segment /s/
- The short time spectrum, LP spectrum and inverse spectrum for an unvoiced segment /s/ are computed with the same procedure as for the voiced speech segment /a/.
- The short time spectrum, LP spectrum and inverse spectrum for a segment of unvoiced speech (/s/) are shown in Figure 3.
Figure 3: (a) Segment of unvoiced speech /s/ and its (b) short time spectrum, (c) LP spectrum and (d) inverse spectrum.
LP residual for voiced and unvoiced segments
- The LP residual signal is obtained by passing the speech signal through the inverse filter designed with the LP coefficients (LPCs). The block diagram of the inverse filter is shown in Figure 4.
- The LP residual is computed for voiced and unvoiced speech segments using the following MATLAB commands:
res=filter(ak,1,a200);      % inverse filtering with A(z)
figure;plot(real(res));grid;
- Voiced and unvoiced speech segments and their LP residual signals are shown in Figure 5.
Figure 5: (a) Segment of voiced speech /a/ and its (b) LP residual signal, (c) segment of unvoiced speech /s/ and its (d) LP residual signal.
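The behaviour of the residual for voiced speech can be reproduced on a synthetic signal. The sketch below assumes an impulse-train excitation with a pitch period of 80 samples and a single hypothetical resonance at 700 Hz, and computes the LPCs with core functions rather than lpc; it is an illustration, not the recorded /a/ segment.

```matlab
% Sketch: LP residual of a synthetic voiced-like signal.
% Assumptions: impulse-train excitation (period T) and one resonance.
fs = 10000; T = 80;                         % 80 samples = 8 ms pitch period
e = zeros(400,1); e(1:T:end) = 1;           % excitation impulses
f0 = 700; r0 = 0.9;                         % hypothetical resonance and pole radius
s = filter(1, [1, -2*r0*cos(2*pi*f0/fs), r0^2], e);

% LPCs by the autocorrelation method (order 2 matches the single resonance)
p = 2; N = length(s);
R = zeros(p+1,1);
for i = 0:p
    R(i+1) = sum(s(1:N-i).*s(1+i:N));
end
ak = [1; toeplitz(R(1:p)) \ (-R(2:p+1))];

res = filter(ak, 1, s);                     % inverse filtering gives the residual
[~, order] = sort(abs(res), 'descend');
peaks = sort(order(1:5)).';                 % positions of the 5 largest residual samples
nerr = sum(res.^2)/sum(s.^2);               % residual energy relative to signal energy
```

The largest residual samples fall on the excitation instants, one pitch period apart, which is the periodic structure seen in the voiced residual of Figure 5.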
Autocorrelation function for voiced/unvoiced speech segments and their LP residuals
- The autocorrelation function of the signal x(t) is computed using the following formulation:
$$ R_{xx}(\tau) = \sum_{t} x(t)~x(t+\tau) $$
- This formulation is implemented in MATLAB by the command xcorr(x). The autocorrelation function is computed for the voiced and unvoiced segments and for their LP residuals:
a200corr=xcorr(a200);
- The autocorrelation function for the voiced speech segment and its LP residual signal is shown in Figure 6.
- The autocorrelation function for the unvoiced speech segment and
its LP residual signal is shown in Figure 7.
Figure 6: (a) Segment of voiced speech /a/ and its (b) autocorrelation function, (c) LP residual for the voiced speech segment and its (d) autocorrelation function.
Figure 7: (a) Segment of unvoiced speech /s/ and its (b) autocorrelation function, (c) LP residual for the unvoiced speech segment and its (d) autocorrelation function.
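The autocorrelation computation above can be sketched without xcorr, together with a simple pitch estimate from the largest peak away from the center. The synthetic signal, the 2.5-20 ms search range and the names lo, hi and pitch_lag are assumptions for illustration.

```matlab
% Sketch: autocorrelation via conv, and a pitch estimate from its peak.
% Assumptions: synthetic voiced-like signal, pitch search range 2.5-20 ms.
fs = 10000; T = 80;                         % true pitch period: 80 samples (125 Hz)
e = zeros(400,1); e(1:T:end) = 1;
f0 = 700; r0 = 0.9;                         % hypothetical resonance
x = filter(1, [1, -2*r0*cos(2*pi*f0/fs), r0^2], e);

N = length(x);
Rxx  = conv(x, flipud(x));                  % lags -(N-1)..(N-1), same values as xcorr(x)
lags = (-(N-1):(N-1)).';                    % lag 0 sits at index N

% Largest peak among the positive lags of the plausible pitch range
lo = round(0.0025*fs); hi = round(0.020*fs);    % 25..200 samples
[~, k] = max(Rxx(N+lo : N+hi));
pitch_lag = lo + k - 1;                     % pitch period in samples
pitch_hz  = fs/pitch_lag;                   % pitch frequency in Hz
```

The same search applied to the autocorrelation of the LP residual is usually more reliable, since the formant structure that produces the secondary peaks has been removed.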
Glottal pulse shape in voiced portion of a speech signal
- Integrating the LP residual gives the glottal pulse shape, which is also known as the glottal volume velocity.
- The integration is implemented in MATLAB with the function cumsum:
gp=cumsum(res);
- A segment of voiced speech, its LP residual and its glottal pulse (glottal volume velocity) waveform are shown in Figure 8.
Figure 8: (a) Segment of voiced speech /a/, its (b) LP residual and (c) glottal pulse waveform.
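The integration step can be read as filtering with a discrete-time integrator. In the sketch below res is a hypothetical stand-in residual (an impulse train); removing its mean before integrating is an optional step assumed here, not part of the procedure above, and is included because any DC offset accumulates into a ramp.

```matlab
% Sketch: cumsum as a discrete-time integrator.
% Assumption: res is a stand-in; in the experiment it comes from filter(ak,1,a200).
res = zeros(200,1);
res(1:80:end) = 1;                 % hypothetical excitation impulses
res = res - mean(res);             % optional (assumed) DC removal to avoid drift
gp  = cumsum(res);                 % running sum: gp(n) = res(1) + ... + res(n)
gp2 = filter(1, [1 -1], res);      % the same integrator written as 1/(1 - z^-1)
```

Both forms give the same waveform; the filter form makes it easy to try a leaky integrator by moving the pole slightly inside the unit circle.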
LP spectrum for different LP orders
- Compute LPCs for different LP orders (14, 10, 6, 3 and 1), and compute LP spectrum for each set of LPCs.
- A segment of voiced speech and its LP spectrum for different LP orders (14, 10, 6, 3 and 1) are shown in Figure 9.
Figure 9: (a) Segment of voiced speech /a/, its LP spectrum for the LP order (b) 14, (c) 10, (d) 6, (e) 3 and (f) 1
Normalized error for different LP orders for voiced/unvoiced speech segments
- Normalized error is obtained by normalizing the LP residual energy with respect to the speech signal energy.
- Normalized error is computed for both voiced and unvoiced segments of speech with different LP orders:
for i=1:15
ak=lpc(a200,i);
res=filter(ak,1,a200);
nerr(i)=sum(res.^2)/sum(a200.^2);   % nerr (name chosen here): residual energy / signal energy
end
figure;plot(1:15,nerr);grid;
- Normalized error for voiced and unvoiced segments of speech for different LP orders is shown in Figure 10.
Figure 10: Normalized error for voiced and unvoiced speech segments for different LP orders.
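The dependence of the normalized error on the LP order can also be sketched without lpc, using the fact that the minimum error of the autocorrelation method is \( E_p = R(0) + \sum_{k=1}^{p} a_k R(k) \). The synthetic signal below (generated by a known second-order all-pole filter) and the names maxorder and nerr are assumptions for illustration.

```matlab
% Sketch: normalized error versus LP order from the autocorrelation values.
% Assumption: synthetic signal from a known 2nd-order all-pole filter.
s = filter(1, [1 -1.2 0.8], [1; zeros(399,1)]);
N = length(s); maxorder = 6;

R = zeros(maxorder+1,1);                    % R(i) is stored in R(i+1)
for i = 0:maxorder
    R(i+1) = sum(s(1:N-i).*s(1+i:N));
end

nerr = zeros(maxorder,1);
for p = 1:maxorder
    a = toeplitz(R(1:p)) \ (-R(2:p+1));         % normal equations for order p
    nerr(p) = (R(1) + sum(a.*R(2:p+1)))/R(1);   % minimum error E_p divided by R(0)
end
% figure; plot(1:maxorder, nerr); grid;
```

For this signal the error drops until the order reaches that of the generating filter and stays flat afterwards, which mirrors the flattening of the voiced curve in Figure 10 once the resonances are captured.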
- Take a segment (200 msec) of voiced speech /a/ and a segment (200 msec) of unvoiced speech /s/.
- Compute short-time (20 msec) spectrum, inverse spectrum and 14th-order LP spectrum for a voiced segment (/a/).
- Compute short-time (20 msec) spectrum, inverse spectrum and 14th-order LP spectrum for an unvoiced segment (/s/).
- Examine the LP residual for voiced and unvoiced segments.
- Compute autocorrelation function of signal and its LP residual for voiced and unvoiced segments.
- Obtain LP residual for the entire 200 msec of vowel and integrate to examine the glottal pulse shape
- Obtain LP spectrum for a voiced segment for different orders of LP p=14,10,6,3,1
- Obtain normalized error for different orders for a voiced and unvoiced segment
- Write a brief note on the observations
- The short time spectrum gives both source and system information. The envelope of the spectrum gives system information (i.e., resonances in terms of formant frequencies) and the spectral ripples (fine variations) give source information (i.e., pitch harmonics). Its magnitude is a real and even function of ω.
- Linear prediction (LP) analysis models the vocal tract system. The LP spectrum is observed to be an envelope (a smoothed version) of the short time spectrum, and its peaks indicate the formant frequencies (resonances) of the vocal tract system. The shape of the LP spectrum makes it evident that it is derived from an all-pole filter.
- The inverse spectrum is observed to be the reciprocal of the LP spectrum, so its valleys correspond to the peaks of the LP spectrum. It is the response of an all-zero filter.
- For a voiced speech segment, the LP residual is observed to be periodic. The peak amplitudes in the LP residual correspond to the closure of the vocal folds (glottal closure), where the prediction is poor and the error is therefore maximum.
- The periodicity of the LP residual also gives the pitch information.
- LP residual is a result of passing the speech signal through inverse filter (i.e., removing the vocal tract information). This is also considered to be the excitation signal (source information).
- The LP residual for an unvoiced speech segment looks like random noise. Since the unvoiced speech signal has no periodicity and itself looks like random noise (little correlation among the samples), its excitation also looks like noise.
- The basic property of the autocorrelation function (even symmetry) is evident in all the plots (voiced and unvoiced speech and their LP residual signals).
- The samples in a voiced speech segment are highly correlated, so peaks other than the central one are also prominent. As voiced speech is periodic, its autocorrelation function is periodic as well.
- In the LP residual the correlation among neighbouring samples is low, so its autocorrelation function contains prominent peaks only at multiples of the pitch period. Hence the autocorrelation function of an LP residual is useful for pitch computation.
- The autocorrelation function of an unvoiced speech segment shows a major peak at the center, and the other peaks are not significant, since the unvoiced speech signal resembles a random signal.
- The autocorrelation function of the LP residual of an unvoiced speech segment shows a dominant peak at the center and no other peaks elsewhere. This is because unvoiced speech itself looks random (no correlation among the samples) and its residual remains random.
- The glottal pulse shape shows the change in the volume of air. It is also referred to as the glottal volume velocity. It gives information about the build-up of air pressure from the lungs near the vocal folds, which causes the vocal folds to vibrate, i.e. to open and close.
- From the glottal pulse waveform, it is observed that the volume of air and its pressure are maximum at the instant of closure. This is followed by the opening of the vocal folds, as a result of which the air pressure decreases.
- The LP order determines, to some extent, the accuracy with which the speech production mechanism is modeled.
- LP analysis uses an all-pole model to characterize the vocal tract system, capturing the resonances in the LP spectrum and the source information in the LP residual (obtained with the inverse filter, i.e. an all-zero filter).
- The LP order determines the number of resonances that can be captured by the model. The maximum number of resonances captured by a model with LP order P is P/2.
- The length of the vocal tract from glottis to lips is approximately 17 cm. This can generate four to five prominent resonances in the 0-4 kHz range. These resonances can be captured with an LP model of order 10. Radiation and windowing effects must also be accounted for. Therefore, with an LP order of 10-14 the system can be modeled so that the required resonances are captured.
- A system with LP order higher than 14 introduces spurious resonances, which lead to an improper representation of the vocal tract system.
- The normalized error for the voiced speech signal decreases as the LP order is increased from 0 to 15, since the vocal tract system (speech production mechanism) is modeled more accurately as the LP order increases.
- P=0: no prediction, therefore maximum error.
- P=1: only one coefficient is used for prediction, therefore the error is slightly less than for P=0.
- P=10: the model correctly approximates the resonances of the vocal tract system, which leads to minimum error.
- P ≥ 10: the vocal tract system is still modeled correctly, which leads to an error similar to that of the model with P=10.
- For the unvoiced speech signal, the signal and residual energies remain nearly the same, so the error is relatively insensitive to the LP order. The unvoiced speech signal itself resembles random noise, so the prediction remains poor even if the LP order is increased. Both the unvoiced speech signal and its LP residual therefore look like noise, and the normalized error for the unvoiced speech signal depends only weakly on the LP order.
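The vocal tract length argument in the observations above can be checked with the quarter-wavelength resonances of a uniform tube closed at the glottis and open at the lips, \( F_k = (2k-1)c/(4L) \). The sketch assumes a lossless uniform tube and a speed of sound of 340 m/s; the names F and p_min are illustrative.

```matlab
% Sketch: resonances of a uniform tube and the LP order they suggest.
% Assumptions: lossless uniform tube, speed of sound c = 340 m/s.
c  = 340;                          % speed of sound in m/s (assumed)
L  = 0.17;                         % vocal tract length in m
fs = 10000;                        % sampling frequency used in this experiment
k  = 1:10;
F  = (2*k - 1)*c/(4*L);            % quarter-wavelength resonances in Hz
F  = F(F < fs/2);                  % resonances inside the analysis band 0..fs/2
p_min = 2*numel(F);                % one conjugate pole pair per resonance
```

Under these assumptions the resonances are spaced about 1 kHz apart starting near 500 Hz, and a few extra poles on top of p_min for radiation and windowing effects lead to the order range of 10-14 quoted above.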
- What is the minimum number of speech samples required to compute the LP coefficients of order p?
- Explain the differences between the autocorrelation and autocovariance formulations of LP analysis. Which is better and why?
- Suggest an algorithm for voiced/non-voiced region separation using:
- a) LP residual energy
- b) periodicity information in the LP residual
- Write an Octave/Scilab program that implements 3a and 3b
- Digital Processing of Speech Signals, L.R. Rabiner and R.W. Schafer, Chapter 8
- Discrete-Time Speech Signal Processing, Thomas F. Quatieri, Chapter 5