To understand the issues in short-time spectrum analysis of speech signals.
- To study the effect of size of the analysis window (less than one pitch period, one pitch period, two to four pitch periods).
- To study the effect of shape of the analysis window (rectangular, Hamming and Hann window functions).
Due to the time-varying nature of the vocal tract (VT) system during the production of speech, the speech signal must be analysed in segments of 10-30 msec to extract the characteristics of the VT system. Important characteristics of the VT system can be seen clearly in the spectrum of the signal, and hence short-time spectrum analysis is carried out on each segment of speech. The spectrum of a segment of speech is obtained by computing the square of the magnitude of the DFT of the segment, and it is often displayed as a log spectrum in dB. Note that only the magnitude of the DFT of the signal segment is considered in this analysis, as features of the VT system are better displayed in the magnitude spectrum. The phase spectrum is not analysed in this experiment.
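As a minimal sketch of the computation described above (not the tool's own implementation), the log spectrum of one segment can be obtained with NumPy. The function name `log_spectrum_db`, the 8 kHz sampling rate, and the synthetic 1 kHz cosine standing in for a speech segment are assumptions made for illustration only.

```python
import numpy as np

def log_spectrum_db(segment, nfft=512, eps=1e-12):
    """Log spectrum in dB: 10*log10 of the squared DFT magnitude.

    The segment is zero-padded to nfft points (or truncated if it is longer).
    eps guards against taking the log of zero.
    """
    spectrum = np.fft.rfft(segment, n=nfft)   # bins from 0 to fs/2
    power = np.abs(spectrum) ** 2             # phase is discarded
    return 10.0 * np.log10(power + eps)

fs = 8000                                     # assumed sampling rate in Hz
n = np.arange(240)                            # 30 msec at 8 kHz
segment = np.cos(2 * np.pi * 1000 * n / fs)   # stand-in for a speech segment

log_spec = log_spectrum_db(segment, nfft=512)
freqs = np.fft.rfftfreq(512, d=1.0 / fs)      # frequency of each bin in Hz
peak_hz = freqs[np.argmax(log_spec)]
```

Note that `np.fft.rfft` returns NFFT/2 + 1 values, because it includes both the 0 Hz and the fs/2 bins.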
The objective of this experiment is to study the effect of various analysis parameters on the resulting short-time spectrum. For this study, a pre-stored speech signal for an utterance may be loaded, or a person's speech can be recorded and loaded for analysis. The studies examine the effects of the size of the window, the shape of the window, and the pitch period (in the case of voiced segments) on different types of segments: silence, unvoiced and voiced. These effects need to be studied most carefully for voiced segments.
A file, say ex1.wav, can be loaded by selecting it and pressing the load button. The loaded signal appears in the first display panel and can be played using the play button. A segment of speech can be selected using the fourth button in the selection panel. Once a segment is selected by marking a region, its spectrum can be computed using the spectrum button. The selected segment is displayed in the windowed-waveform parts of the lower two panels, and the log spectra of the segment, computed using the minimum required number of FFT points, are displayed on the right-hand side of the lower two panels. Two panels are provided so that the effects of the analysis parameters on the short-time spectrum can be compared.
First, let us examine the short-time spectral characteristics of a 30 msec segment of voiced speech. The log spectrum is plotted using the chosen number of DFT points (i.e., the choice of NFFT), which is 512 in this case. (Note that the log spectrum is displayed for NFFT/2 points, covering the desired frequency range of 0 to fs/2.)
The effect of the number of DFT points can be seen by increasing NFFT to 1024 points or higher. Note that in this case the additional DFT points merely bring out the values that are missing when a lower NFFT value is used. Those missing values are seen prominently in regions of rapid change in the spectrum (near spectral valleys). Compare the log spectrum for NFFT=512 with that for NFFT=4096. Note that the peaks in the spectrum, especially at low frequencies, correspond to pitch harmonics when the waveform segment contains two or more pitch periods. Note that these peaks are not seen in the higher frequency range due to side-lobe leakage of the rectangular window.
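The interpolation role of NFFT can be checked numerically. The sketch below assumes a 240-sample segment (30 msec at an assumed 8 kHz sampling rate) filled with random values as a stand-in for speech. It shows that every 8th sample of the 4096-point DFT coincides with the 512-point DFT, so the larger NFFT only fills in values between the existing ones.

```python
import numpy as np

rng = np.random.default_rng(0)
segment = rng.standard_normal(240)        # stand-in for a 30 msec segment

X_512 = np.fft.rfft(segment, n=512)       # zero-padded to 512 points
X_4096 = np.fft.rfft(segment, n=4096)     # zero-padded to 4096 points

# 4096 / 512 = 8, so bin 8k of the long DFT lies at the same frequency
# as bin k of the short DFT.
same_samples = np.allclose(X_4096[::8], X_512)
```

Both transforms sample the same underlying spectrum of the 240-sample segment; the frequency resolution is set by the segment length, not by NFFT.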
The effect of the shape of the window can be seen by using a Hamming window for the same segment in the lowest (third) panel. Now the log spectrum appears a little smoother compared to the rectangular window case. The harmonics are seen better due to smoothing, but each narrow peak is broadened. The effect of smoothing of the envelope can be seen better if NFFT is made large, say NFFT=4096, for both the rectangular and Hamming window cases. Note also that the dynamic range of the spectral envelope is larger for the Hamming window than for the rectangular window.
The regular peaks in the envelope correspond to harmonics of the fundamental frequency or pitch frequency. The harmonic peaks are smoother in the Hamming window case than in the rectangular window case.
The smoothed spectral envelope over the harmonics corresponds to the frequency response of the VT system. The peaks of this envelope correspond to resonances or formants of the vocal tract system. Due to the effects of windowing (finite duration) and pitch, the spectral envelope due to the VT system response cannot be seen clearly.
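The window-shape effects described above (less leakage but broader peaks with the Hamming window) follow from the spectra of the windows themselves. The sketch below is an illustration under assumed values: a 240-sample window and a hypothetical helper `mainlobe_and_sidelobe`. It locates the end of the main lobe as the first local minimum of the window's log magnitude response and reports the highest side lobe beyond it.

```python
import numpy as np

def window_response_db(window, nfft=8192):
    """Log magnitude response of a window, normalised to 0 dB at its peak."""
    mag = np.abs(np.fft.rfft(window, n=nfft))
    return 20.0 * np.log10(mag / mag.max() + 1e-12)

def mainlobe_and_sidelobe(window, nfft=8192):
    """Return (bin index where the main lobe ends, peak side-lobe level in dB)."""
    resp = window_response_db(window, nfft)
    # The response falls monotonically along the main lobe; the first bin
    # after which it rises again marks the first local minimum.
    first_min = int(np.nonzero(np.diff(resp) > 0)[0][0])
    peak_sidelobe_db = float(resp[first_min:].max())
    return first_min, peak_sidelobe_db

n = 240                                             # assumed window length
rect_null, rect_sl = mainlobe_and_sidelobe(np.ones(n))
ham_null, ham_sl = mainlobe_and_sidelobe(np.hamming(n))
```

With these settings the rectangular window's highest side lobe is roughly 13 dB below the main-lobe peak, while that of the Hamming window is more than 40 dB below it, at the cost of a main lobe about twice as wide.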
Now, selecting a short (say 3-4 msec) segment of voiced speech, which is less than a pitch period, one can observe the effect of windowing (rectangular or Hamming) for different lengths of DFT, i.e., the value of NFFT. Note that here the pitch harmonics are not visible in the spectrum. Also, the spectral envelope shows the gross characteristics of the VT response, but not the individual resonances (formants), due to the poor frequency resolution caused by the selection of a short segment. This shows that the width of the segment decides the resolution of the spectral envelope in short-time spectral analysis. Larger NFFT values merely help to interpolate between the values obtained for a smaller value of NFFT.
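The dependence of resolution on segment width can be made concrete with a rough calculation. The sketch below uses the standard approximate null-to-null main-lobe widths (2·fs/N for a rectangular window and 4·fs/N for Hamming and Hann windows of N samples). The helper name `mainlobe_width_hz` and the 8 kHz sampling rate are assumptions for illustration.

```python
def mainlobe_width_hz(duration_ms, fs, shape):
    """Approximate null-to-null main-lobe width in Hz for a given window."""
    n = int(round(duration_ms * 1e-3 * fs))          # window length in samples
    factor = {"rectangular": 2.0, "hamming": 4.0, "hann": 4.0}[shape]
    return factor * fs / n

fs = 8000  # assumed sampling rate in Hz

short_hamming = mainlobe_width_hz(4, fs, "hamming")      # 4 msec segment
long_hamming = mainlobe_width_hz(30, fs, "hamming")      # 30 msec segment
long_rect = mainlobe_width_hz(30, fs, "rectangular")
```

For a 4 msec Hamming window the main lobe is about 1000 Hz wide, far wider than the spacing between pitch harmonics (100 Hz for an assumed 100 Hz pitch), so neither harmonics nor closely spaced formants can be separated. For a 30 msec window the width drops to roughly 133 Hz (Hamming) or 67 Hz (rectangular).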
It is also interesting to compare the effect of the size of the window in terms of the pitch periods of a voiced segment. Observe the spectra of rectangular and Hamming windowed segments for a large value of NFFT. The Hamming windowed signal spectrum shows the pitch harmonics better than the rectangular windowed spectrum. Note that the rectangular window produces sharper ripple, and the side-lobe effects mask the details of the harmonics in the spectrum. If only two pitch periods are selected, then the Hamming windowed speech does not show any harmonic information, as the two cycles are no longer equal in shape after windowing. In all cases, also observe the y-axis scale for the dynamic range of the log spectrum.
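The link between window size and the visibility of pitch harmonics can be illustrated with an idealised excitation. The sketch below assumes an impulse train with a 100 Hz pitch at an 8 kHz sampling rate in place of real voiced speech (so there is no vocal tract shaping). It compares a 30 msec Hamming-windowed segment spanning three pitch periods with a 4 msec segment containing a single pulse.

```python
import numpy as np

fs = 8000                      # assumed sampling rate in Hz
f0 = 100                       # assumed pitch in Hz
period = fs // f0              # 80 samples per pitch period
x = np.zeros(800)
x[::period] = 1.0              # idealised excitation: one pulse per period

NFFT = 4096
freqs = np.fft.rfftfreq(NFFT, d=1.0 / fs)

def log_spectrum_db(segment, nfft=NFFT):
    """Hamming-windowed log magnitude spectrum in dB."""
    windowed = segment * np.hamming(len(segment))
    return 20.0 * np.log10(np.abs(np.fft.rfft(windowed, n=nfft)) + 1e-12)

long_db = log_spectrum_db(x[0:240])     # 30 msec, three pitch periods
short_db = log_spectrum_db(x[70:102])   # 4 msec, a single pulse

# Strongest component between 150 Hz and 250 Hz in the long segment
band = (freqs >= 150) & (freqs <= 250)
peak_hz = freqs[band][np.argmax(long_db[band])]

# Peak-to-valley range of each log spectrum in dB
ripple_long = long_db.max() - long_db.min()
ripple_short = short_db.max() - short_db.min()
```

The long segment shows peaks at multiples of the pitch frequency (for example near 200 Hz), whereas the spectrum of the single-pulse segment is flat: no harmonic structure can appear when the segment contains less than one pitch period.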
The effect of window size, shape and NFFT on the short-time spectrum can also be observed on segments of unvoiced sounds like fricatives, by selecting a segment of frication in the signal in the first panel. Since there are no periodic segments in the waveform, the spectral envelope corresponds to the response of the VT system. Usually, the response has larger values in the middle and higher frequencies compared to the values in the lower frequencies. It is difficult to notice any significant differences between the spectral envelopes for rectangular and Hann windowed speech segments, although the overall dynamic range of the spectral envelope is lower for the rectangular window due to the side-lobe leakage effect. Therefore, it is generally preferable to use a window that tapers off the signal at the ends. The size of the window does not have any significant effect, as the spectral envelope for frication usually has broad peaks and does not have any sharp resonance peaks.
Note that if one wants to observe the spectra of two or more different analysis segments (i.e., unvoiced and voiced, or different sizes of the voiced segments), then the spectra can be displayed by opening another browser for each type of segment and comparing the spectra displayed in the different browsers.
- Record or select from existing files about 2 sec of vowel /a/ and fricative /s/.
- Short-time spectrum - Effects of size and shape of window.
- Consider x(n), 160 samples of /a/.
- Use 512 point DFT to get X(k).
- Plot log spectrum log |X(k)|^2
- Study the effect of size of window - 5 msec, 20 msec, 50 msec.
- Study the effect of shape of window on 20 msec data namely, rectangular, Hamming and Hann windows.
- 20 msec of x(n), Hamming window and 512 pt DFT
- Plot log spectrum for voiced and unvoiced segments
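The assignment steps above can be sketched as follows. This is a hedged outline, not a reference solution: the 8 kHz sampling rate is inferred from 160 samples corresponding to 20 msec, the signal `x` is a hypothetical synthetic stand-in (harmonics of an assumed 100 Hz pitch) that should be replaced by the recorded /a/ or /s/, and the helper name `log_spectrum_db` is illustrative. Plotting is left out; each result can be plotted against `freqs`.

```python
import numpy as np

fs = 8000        # assumed sampling rate: 160 samples = 20 msec
nfft = 512

# Hypothetical stand-in for 2 sec of /a/; replace with the recorded signal.
t = np.arange(2 * fs) / fs
x = sum(np.cos(2 * np.pi * 100 * k * t) / k for k in range(1, 31))

windows = {
    "rectangular": np.ones,
    "hamming": np.hamming,
    "hann": np.hanning,       # NumPy's name for the Hann window
}

def log_spectrum_db(segment, window_fn, nfft):
    """Window the segment, take an nfft-point DFT, return 10*log10|X(k)|^2."""
    windowed = segment * window_fn(len(segment))
    power = np.abs(np.fft.rfft(windowed, n=nfft)) ** 2
    return 10.0 * np.log10(power + 1e-12)

freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
start = fs // 2               # arbitrary starting point inside the signal

# Effect of window size (rectangular window): 5, 20 and 50 msec
size_study = {}
for ms in (5, 20, 50):
    n = int(round(ms * 1e-3 * fs))
    size_study[ms] = log_spectrum_db(x[start:start + n], np.ones, nfft)

# Effect of window shape on 20 msec (160 samples) of data
n20 = int(round(20e-3 * fs))
shape_study = {
    name: log_spectrum_db(x[start:start + n20], fn, nfft)
    for name, fn in windows.items()
}
```

The same two loops can be repeated on a segment of /s/ to compare voiced and unvoiced log spectra.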
The size of the window (rectangular in this case) determines the temporal resolution. Increase in the size of window increases the spectral resolution but reduces the temporal resolution, and vice versa. If the window length is too large, then the spectrum looks noisy and it is difficult to distinguish spectral components of the signal from the components introduced by the windowing effect.
Side-lobe effect is dominant in the rectangular window.
Pitch harmonics can be clearly seen when a Hamming or Hann window is used.
Side-lobe attenuation is greater with Hamming and Hann windows, but the main-lobe width also increases.
The log spectrum of voiced speech clearly shows the harmonic structure (source feature) and the formant structure (system feature), whereas there is no defined structure in the log spectrum of an unvoiced speech segment.
- What is the need for windowing of the signal?
- Why is the rectangular window not suitable for short-time DFT spectral analysis?
- Explain the influence of the number of DFT points on the DFT spectrum.
- What is the effect of the window size on the DFT spectrum?
- What is the significance of short-time spectral envelope representation of speech?
- What features of the speech production system can be observed at different window sizes?
- What are the source features that can be observed in the short-time DFT spectrum?
- What vocal tract system characteristics are observed in the short-time DFT spectrum?
- Why is it difficult to analyze the vocal tract system characteristics from a short-time DFT spectrum?
- Digital Processing of Speech Signals, L.R. Rabiner and R.W. Schafer, Chapter 6
- Discrete-Time Speech Signal Processing, Thomas F. Quatieri, Chapter 7