The objective of this experiment is to analyse speech signals by computing the cepstrum for various voiced and unvoiced sounds, to study their characteristics using the computed cepstrum, and to reconstruct the spectrum by retaining selected components of the cepstrum.
The Short-time Cepstrum
In the source-filter model of speech production (shown in Figure 1), speech is produced by exciting a time-varying filter. The excitation generated by the source may be either periodic (in the case of voiced sounds) or noise-like (in the case of unvoiced sounds). Thus the speech signal is produced by convolving the excitation signal with the impulse response of the filter.
We know that convolution of two signals in the time domain is equivalent to multiplication in the spectral domain. Since the source and filter are convolutionally combined in the speech signal, they cannot be separated by ordinary linear filtering techniques. To do so, one would need to know either the source or the filter in addition to the speech signal. Short-time spectral analysis of speech signals will generally show the magnitude response of the filter overlaid with pitch harmonics when the analysis window is longer than a pitch period.
Homomorphic signal processing provides us with techniques that can separate the source and filter to some extent. It does so by processing the signal in the cepstral domain. The word cepstral is derived from cepstrum, which in turn is formed by reversing the order of the first four letters of the word spectrum. In the cepstral domain, the signal components are additive, which allows linear filtering techniques to be used.
To illustrate the process of computing the cepstrum, consider a speech signal \(s(n)\) produced by exciting a filter \(h(n)\) by \(e(n)\), i.e., $$ s(n)=h(n)*e(n). $$ Let \(S(\omega)\), \(H(\omega)\) and \(E(\omega)\) be the Fourier transforms corresponding to \(s(n)\), \(h(n)\) and \(e(n)\) respectively. Then $$ S(\omega)=H(\omega)E(\omega). $$ By applying the logarithmic operator, the multiplication operation can be represented in terms of addition, $$ \log S(\omega) = \log H(\omega) + \log E(\omega). $$ Because the Fourier transform is linear, applying the Fourier transform (or the inverse Fourier transform) preserves this addition, and we obtain $$ c_S(n)=c_H(n)+c_E(n) $$ where \(c_S(n)\), \(c_H(n)\) and \(c_E(n)\) are the Fourier transforms (or the inverse Fourier transforms) of \(\log S(\omega)\), \(\log H(\omega)\) and \(\log E(\omega)\) respectively and are the corresponding cepstral sequences. The independent variable in the cepstral domain has the dimension of time and is referred to as 'quefrency'. Linear filtering techniques can now be applied in the cepstral domain, and the inverse Fourier transform (or the Fourier transform) can then be applied to obtain the corresponding log spectrum.
Since \(\log S(\omega)\) is complex, the computed cepstrum is referred to as the complex cepstrum. Since \(S(\omega)\) is complex, it can be represented as $$ S(\omega)=|S(\omega)|e^{j \theta (\omega)}. $$ Application of the logarithmic operator yields $$ \log S(\omega)=\log |S(\omega)|+j\theta (\omega). $$ The cepstrum can also be computed from the real part of \(\log S(\omega)\) in which case the cepstrum is referred to as the real cepstrum. Considering only the real part, \begin{eqnarray} \mbox{real} \{\log S(\omega)\} & = & \log |S(\omega)|,\\ & = & \log |H(\omega) E(\omega)| \\ & = & \log |H(\omega)| + \log|E(\omega)|. \end{eqnarray}
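The additive property of the real cepstrum can be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the original experiment: the function name `real_cepstrum`, the toy signals `h` and `e`, the 1024-point FFT size and the magnitude floor of `1e-12` are all assumptions made for illustration.

```python
import numpy as np

def real_cepstrum(x, n_fft=1024):
    """Real cepstrum: inverse FFT of the log-magnitude spectrum."""
    spectrum = np.fft.fft(x, n_fft)
    # Floor the magnitude to avoid log(0); the floor value is an arbitrary choice.
    log_mag = np.log(np.maximum(np.abs(spectrum), 1e-12))
    return np.fft.ifft(log_mag).real

# Toy "filter": a decaying impulse response (assumed, for illustration only).
h = 0.9 ** np.arange(64)
# Toy "excitation": four pulses, 80 samples apart, with decreasing amplitude.
e = np.zeros(320)
e[::80] = [1.0, 0.8, 0.6, 0.4]
# Linear convolution; its length (383 samples) fits inside the 1024-point FFT,
# so S(w) = H(w) E(w) holds on the FFT grid.
s = np.convolve(h, e)

c_s, c_h, c_e = real_cepstrum(s), real_cepstrum(h), real_cepstrum(e)
max_error = np.max(np.abs(c_s - (c_h + c_e)))
print(max_error)  # expected to be on the order of floating-point round-off
```

The FFT length is chosen larger than the length of the convolved signal so that zero-padding makes the product relation exact; with a shorter FFT the comparison would only hold approximately.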
Application of the Cepstrum to Speech Analysis
Figure 2 shows a 25 ms segment of voiced speech. Its log-magnitude spectrum, computed using a 1024-point FFT, and the corresponding real cepstrum are shown in Figure 3. The low-quefrency region of the cepstrum is dominated by the filter, and the higher-quefrency regions are dominated by the excitation source. This can be easily verified by liftering the cepstrum (a filtering operation in the cepstral domain). Figure 4 shows the low-quefrency liftered cepstrum and the corresponding spectrum. The spectral characteristics corresponding to the filter (vocal-tract system) can be clearly observed. Similarly, Figure 5 shows the high-quefrency liftered cepstrum with the corresponding spectrum. The spectrum shows the periodic structure that is normally superimposed on the speech spectrum.
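The liftering operation described above can be sketched as follows. This is an illustrative example on a synthetic frame, not the code used to produce the figures: the function name `lifter`, the sampling rate of 8 kHz, the toy resonance at 700 Hz, the 100 Hz pulse train and the cutoff of 20 samples are all assumptions.

```python
import numpy as np

def lifter(cepstrum, cutoff, keep="low"):
    """Zero out part of a real cepstrum, keeping low or high quefrencies."""
    n = len(cepstrum)
    window = np.zeros(n)
    window[:cutoff] = 1.0
    if cutoff > 1:
        # The real cepstrum is even, so mirror the window onto the
        # negative quefrencies stored at the end of the array.
        window[n - cutoff + 1:] = 1.0
    if keep == "high":
        window = 1.0 - window
    return cepstrum * window

fs = 8000                                   # assumed sampling rate (Hz)
n_fft = 1024
t = np.arange(64)
h = 0.9 ** t * np.cos(2 * np.pi * 700 / fs * t)   # toy vocal-tract response
excitation = np.zeros(200)                  # 25 ms at 8 kHz
excitation[::80] = 1.0                      # pulse train with an 80-sample period
frame = np.convolve(excitation, h)[:200] * np.hamming(200)

log_mag = np.log(np.maximum(np.abs(np.fft.fft(frame, n_fft)), 1e-12))
cepstrum = np.fft.ifft(log_mag).real

cutoff = 20                                 # samples, below the pitch period
envelope = np.fft.fft(lifter(cepstrum, cutoff, "low")).real
fine_structure = np.fft.fft(lifter(cepstrum, cutoff, "high")).real
```

Here `envelope` is the smoothed log spectrum obtained from the low-quefrency part and `fine_structure` is the log spectrum obtained from the high-quefrency part. Because liftering is linear, the two add up to the original log-magnitude spectrum; the choice of cutoff decides how much detail is assigned to each part.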
- Study the cepstrum for different regions (voiced/unvoiced/silence) of the speech signal using the examples provided.
- Select one of the example waveforms provided.
- Select a short segment (30 ms) of voiced speech and observe the spectrum as well as cepstrum.
- Repeat the experiment for unvoiced (fricative and stop) and silence regions.
- Note that the cepstrum shows a peak at around the time lag corresponding to the pitch period for segments of voiced speech.
- Measure the pitch period of a voiced segment using the cepstrum.
- Cepstral smoothing of the short-time spectrum
- Select only the first few coefficients of the cepstrum for a voiced segment of speech, and observe that the liftered spectrum appears smooth.
- Select the cepstral coefficients around the peak corresponding to the pitch period, and observe that the liftered spectrum has strong harmonics corresponding to the pitch frequency.
- Record different sounds in your own voice and study the cepstral features.
- Write a brief note on the observations.
- The cepstrum provides a compact representation of the vocal tract system information.
- The vocal tract system information is characterized by the spectral envelope. This slowly varying component of the spectrum is captured by the first few coefficients in the cepstral domain. Note that cepstrum computation can be interpreted as spectral analysis in which the signal being analysed is the log-magnitude spectrum.
- Note that the pitch period of a voiced segment of speech can be measured from the cepstrum.
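The pitch measurement noted above can be sketched as a peak search in the cepstrum. This is a minimal illustration under stated assumptions, not a complete pitch tracker: the function name `cepstral_pitch`, the search range of 60 to 400 Hz, the Hamming window and the synthetic test frame are all hypothetical choices, and no voiced/unvoiced decision is made.

```python
import numpy as np

def cepstral_pitch(frame, fs, f_min=60.0, f_max=400.0, n_fft=1024):
    """Estimate the pitch period (s) and pitch frequency (Hz) from the cepstral peak."""
    windowed = frame * np.hamming(len(frame))
    log_mag = np.log(np.maximum(np.abs(np.fft.fft(windowed, n_fft)), 1e-12))
    cepstrum = np.fft.ifft(log_mag).real
    # Search only the quefrencies that correspond to plausible pitch values.
    q_min = int(fs / f_max)
    q_max = int(fs / f_min)
    peak = q_min + int(np.argmax(cepstrum[q_min:q_max + 1]))
    return peak / fs, fs / peak

# Synthetic 30 ms "voiced" frame: a pulse train with a 40-sample period
# (200 Hz at 8 kHz) passed through a toy damped resonance.
fs = 8000
t = np.arange(64)
h = 0.85 ** t * np.cos(2 * np.pi * 700 / fs * t)
excitation = np.zeros(240)
excitation[::40] = 1.0
frame = np.convolve(excitation, h)[:240]

period, f0 = cepstral_pitch(frame, fs)
print(period * 1000, f0)  # pitch period in ms and pitch frequency in Hz
```

The FFT length must be at least the frame length, and the largest searched quefrency `fs / f_min` must stay below half the FFT length. For unvoiced or silent frames the peak found this way is not meaningful, so a practical method would also compare the peak height against a threshold.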
- Explain how cepstral analysis of speech signals helps in pattern representation for pattern recognition tasks.
- What are the advantages and limitations of cepstral-domain processing over spectral-domain processing?
- What is the dimension of a unit in the cepstral domain?
- Explain how the cepstral analysis technique can be used to represent convolutionally combined signals as additively combined signals.
- Explain the process of separating source and system characteristics using the cepstrum.
- Explain how pitch information can be extracted from the cepstrum. Write an algorithm to extract pitch using cepstral analysis.
- Digital Processing of Speech Signals, L. R. Rabiner and R. W. Schafer, Chapter 6
- Discrete-Time Speech Signal Processing, Thomas F. Quatieri, Chapter 7
- Digital Speech Processing, Synthesis, and Recognition, Sadaoki Furui, Marcel Dekker, Inc., New York