Expt-1 : Manual speech signal-to-symbol transformation
Objective

The objective of this experiment is to make the student appreciate the difficulties in automating the task of speech signal-to-symbol transformation, by trying to manually segment a given speech signal into its constituent symbols.

Tutorial

The main objective of this experiment is to make the student understand and appreciate the difficulties in automating the task of speech signal-to-symbol transformation. The students are expected to understand terms such as phones, phonemes, syllables, transliteration, pitch (fundamental frequency) and quasi-periodicity. The student should gain a good understanding of the speech signal by visual inspection. Given a speech waveform, one should be able to discriminate between silence and speech, and between voiced and unvoiced speech. The student should be able to draw the speech waveform for a given word with appropriate time durations and relative amplitudes of the constituent phones.

Record the speech signal of a sentence in any Indian language. By listening, manually identify and mark the boundaries of words and subword units (such as syllables and phonemes). Draw on a piece of paper the speech waveform for the first few words of the sentence with appropriate time durations and relative amplitudes of the various sounds, clearly marking the subword boundaries.

Step 1: Write down a sentence

Write down a sentence in any Indian language, preferably your native language, on a piece of paper or in your notebook. Also write down the sentence in English letters, the way you would write it if you were sending a mail to a friend in your language.

Eg: Let us consider a sentence in Hindi:

Utt #1: भारत हमारा देश है

Transliteration in English: Bharat hamara desh hai

or in Telugu:

Utt #2: శ్రీ ఎమ్ వెన్కయ్య నాయుదు చెప్పెరు

Transliteration in English: ShrI M. Venkaiahnaidu chepperu

Step 2: Transliteration

Transliteration is the process of writing one language using the script of another language. For example, the Hindi utterance considered in Step 1 is also written using the English alphabet. The primary advantage of writing Hindi or any other Indian language in English is that one can read it on a computer even if fonts for Devanagari (the script used for Hindi and Sanskrit) or other Indian scripts are not installed or rendered properly on the machine. It is also easier to write and debug programs using the English alphabet until Indian language support on computers becomes widespread. Once that happens, all software on a computer could be in the native script, and one could write programs using the native script instead of English.

The transliteration of the Hindi utterance given in Step 1 (written using the English alphabet, i.e., Latin script) uses a convention to write Hindi letters in English. E.g., Bhaa is used for भा. Somebody else may write it as bhaa, BhA or bhA. The point is that there is a need for a common rule or convention which, if followed by everyone, makes life easier and avoids confusion during discussions or when programs are shared. ITRANS code is one such convention used to write Indian language letters in English.

Transliteration in English (Utt #1): Bharat hamara desh hai

ITRANS code (Utt #2): sri em ve.nkayyanayudu chepperu
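To make the idea of a fixed convention concrete, the following is a minimal sketch of a table-driven transliterator. The table is a small hand-picked subset covering only the letters of Utt #1, not the full ITRANS scheme, and the function name to_itrans is hypothetical. A consonant is assumed to carry the inherent vowel "a" unless a vowel sign or a virama follows it.

```python
# Illustrative subset of a Devanagari-to-ITRANS table (assumption: only the
# letters needed for Utt #1 are listed; the real tables are much larger).
CONSONANTS = {"भ": "bh", "र": "r", "त": "t", "ह": "h", "म": "m", "द": "d", "श": "sh"}
VOWEL_SIGNS = {"ा": "aa", "े": "e", "ै": "ai"}
VIRAMA = "्"  # suppresses the inherent vowel of the preceding consonant


def to_itrans(text):
    """Transliterate Devanagari text using the small table above."""
    out = []
    chars = list(text)
    for i, ch in enumerate(chars):
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            nxt = chars[i + 1] if i + 1 < len(chars) else ""
            # Add the inherent vowel unless a vowel sign or virama follows.
            if nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch == VIRAMA:
            continue
        else:
            out.append(ch)  # spaces and unknown characters pass through
    return "".join(out)


print(to_itrans("भारत हमारा देश है"))  # bhaarata hamaaraa desha hai
```

The output follows the written form letter by letter, so it keeps the inherent vowel at the end of words (bhaarata, desha), whereas the informal transliteration in Step 1 follows the pronunciation and drops it (Bharat, desh).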

Step 3: Record the speech signal

Record a speech utterance using any sound recording utility at a sampling rate of 8 kHz or 16 kHz, at 16 bits per sample. You can use the recording utility provided as part of this virtual lab, or third-party utilities such as wavesurfer, audacity or praat, which can be freely downloaded from the Internet.

It is assumed that the sound card is configured properly on your machine and you are able to record and playback audio signals.
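If the recording is saved as an uncompressed WAV file, its format can be checked against the settings above. The sketch below uses Python's standard wave module; the function names and the file name in the usage line are hypothetical.

```python
import wave


def describe_wav(path):
    """Read the basic format parameters of an uncompressed WAV file."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return {
            "sample_rate_hz": rate,
            "bits_per_sample": 8 * wf.getsampwidth(),
            "channels": wf.getnchannels(),
            "duration_s": wf.getnframes() / rate,
        }


def meets_recording_spec(info):
    """True if the file uses 8 kHz or 16 kHz sampling at 16 bits per sample."""
    return info["sample_rate_hz"] in (8000, 16000) and info["bits_per_sample"] == 16
```

For example, meets_recording_spec(describe_wav("utt1.wav")) would report whether a recording named utt1.wav matches the recommended settings.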

Step 4: Study the speech waveform

Display the recorded speech signal using any waveform analysis utility.

Zoom in, select and listen to shorter segments of the speech waveform.

Try to identify regions of speech and nonspeech (silence) by visual inspection.
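As a rough automatic counterpart to the visual inspection above, speech and silence can be separated by short-time energy: frames with low energy are treated as silence. This is only a sketch. The non-overlapping framing and the threshold of 10% of the peak frame energy are assumptions chosen for illustration, and real recordings with background noise usually need more careful settings.

```python
def short_time_energy(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame of frame_len samples."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def speech_frames(energies, ratio=0.1):
    """Flag frames whose energy exceeds ratio * peak energy (ratio is an assumed value)."""
    peak = max(energies, default=0.0)
    if peak <= 0:
        return [False] * len(energies)
    return [e > ratio * peak for e in energies]


# Synthetic example: silence, a burst of signal, silence.
# 160 samples correspond to 20 ms at a sampling rate of 8 kHz.
samples = [0] * 160 + [1000, -1000] * 80 + [0] * 160
mask = speech_frames(short_time_energy(samples, 160))
print(mask)  # [False, True, False]
```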

Step 5: Transcription using native script of the language

Play the speech waveform.

By listening to the speech, write down on a piece of paper the utterance in the native script of the language.

Step 6: Transcription into English using ITRANS code tables

Transcribe the utterance into English using ITRANS code tables.

ITRANS code: sri em ve.nkayyanayudu chepperu

ITRANS code is a common transliteration code for Indian languages. Using the ITRANS code tables, any Indian language script can be transcribed into English. A separate ITRANS code table is used for each Indian language to transcribe its script into English.

Step 7: Deriving the syllable-like units from English transcription

Split the utterance (text in ITRANS code) into syllable-like units. That is, the ITRANS text representation of the utterance is divided into subword units such as syllable-like units (text-to-symbol transformation, i.e., sentence to subword units). Here, the syllable-like units are the symbols that correspond to segments of the speech signal.
Reasons for choosing syllable-like units as symbols rather than phonemes (consonants or vowels, C or V):

(1) It is very difficult to segment the speech data into phonemes.
(2) Due to coarticulation the effect of consonants is observed in the adjacent vowels.
(3) A character in an Indian language script is close to a syllable, and typically takes one of the forms V, CV, CCV, CVC, CCVC and CVCC, where C is a consonant and V is a vowel.

Utterance in ITRANS code: sri em ve.nkayyanayudu chepperu

Syllable-like units of the utterance : sri em ve.n ka yya na yu du che ppe ru
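The split shown above can be reproduced with a simple rule: each unit is a run of consonants followed by a vowel, an anusvara written as ".n" stays with the preceding vowel, and any consonants left at the end of a word join the last unit. The sketch below encodes that rule with a regular expression. The rule and the function names are assumptions inferred from this one example, and they will not give the intended split for every ITRANS string (for instance, word-internal CVC units are never produced).

```python
import re

# Optional consonant run, then a vowel run, then an optional anusvara ".n".
# Assumption: the vowel letters are a, e, i, o, u and the ITRANS capitals A, I, U.
UNIT = re.compile(r"[^aeiouAIU.\s]*[aeiouAIU]+(?:\.n)?")


def syllable_like_units(word):
    """Split one ITRANS word into syllable-like units."""
    units = []
    pos = 0
    for match in UNIT.finditer(word):
        units.append(match.group())
        pos = match.end()
    tail = word[pos:]  # consonants left over at the end of the word
    if tail:
        if units:
            units[-1] += tail
        else:
            units.append(tail)
    return units


def split_utterance(text):
    """Split every word of an utterance and return a flat list of units."""
    return [unit for word in text.split() for unit in syllable_like_units(word)]


print(" ".join(split_utterance("sri em ve.nkayyanayudu chepperu")))
# sri em ve.n ka yya na yu du che ppe ru
```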

Step 8: Identifying the boundaries of the syllable-like units in speech waveform

Manually mark the boundaries of the syllable-like units by listening to the speech file segment by segment. The marked syllable boundaries and the associated waveform are shown in Figure 2. In the manually derived boundaries, some of the syllables present in the transcription are missing and some new syllables are marked. The expected syllable boundaries for the given speech file, as per the transcription, are shown in Figure 3.

Repeat the entire exercise with phoneme as the subword unit.
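One simple way to compare manually marked boundaries with the expected ones is to pair each expected boundary with the nearest marked boundary within a tolerance. The sketch below is a hypothetical illustration, not the method used by the lab. Times are in milliseconds, the 20 ms tolerance is an assumed value, and the numbers in the example are made-up placeholders rather than measurements.

```python
def match_boundaries(marked, expected, tol_ms=20):
    """Pair each expected boundary with the closest unused marked boundary.

    Returns (hits, missed, extra): matched (expected, marked) pairs,
    expected boundaries with no mark within tol_ms, and marks left unmatched.
    """
    remaining = sorted(marked)
    hits, missed = [], []
    for t in sorted(expected):
        close = [m for m in remaining if abs(m - t) <= tol_ms]
        if close:
            best = min(close, key=lambda m: abs(m - t))
            remaining.remove(best)
            hits.append((t, best))
        else:
            missed.append(t)
    return hits, missed, remaining


# Placeholder boundary times in milliseconds (illustrative values only).
hits, missed, extra = match_boundaries(marked=[110, 390, 550], expected=[100, 250, 400])
print(hits)    # [(100, 110), (400, 390)]
print(missed)  # [250]
print(extra)   # [550]
```

Missed boundaries correspond to units present in the transcription but not marked, and extra boundaries correspond to units marked by the listener but absent from the transcription.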

Procedure

Part-1: Learn by pre-segmented examples

  • Select an utterance of your choice from the list of examples provided.

  • Note down the utterance in the native script of the language.

  • Note down the word-level transliteration of the Indian language utterance using English alphabets.

  • Study the subword-level (syllable-level and phoneme-level) transliterations by changing the subword unit.

  • Now choose the subword unit as word, and note the list of words and their boundaries given in the table by the side of the speech waveform.

  • Use the zoom, select and play buttons to select a portion of the waveform, listen, and verify the boundaries of the subword units given in the table.

  • Change the subword unit to syllable and repeat the experiment.

  • Change the subword unit to phoneme and repeat the experiment.

Part-2: Segmentation with feedback

  • Select an utterance of your choice from the list of examples provided.

  • Note down the utterance in the native script of the language.

  • Note down the word-level transliteration of the Indian language utterance using English alphabets.

  • Study the subword-level (syllable-level and phoneme-level) transliterations by changing the subword unit.

  • Now choose the subword unit as word, and note the list of words and their boundaries given in the table by the side of the speech waveform.

  • Use the zoom, select and play buttons to select a portion of the waveform, listen, and verify the boundaries of the subword units given in the table.

  • Change the subword unit to syllable and repeat the experiment.

  • Change the subword unit to phoneme and repeat the experiment.

Part-3: Segmentation without feedback

  • Select an utterance of your choice from the list of examples provided.

  • Note down the utterance in the native script of the language.

  • Note down the word-level transliteration of the Indian language utterance using English alphabets.

  • Study the subword-level (syllable-level and phoneme-level) transliterations by changing the subword unit.

  • Now choose the subword unit as word, and note the list of words and their boundaries given in the table by the side of the speech waveform.

  • Use the zoom, select and play buttons to select a portion of the waveform, listen, and verify the boundaries of the subword units given in the table.

  • Change the subword unit to syllable and repeat the experiment.

  • Change the subword unit to phoneme and repeat the experiment.

Experiment

Observations
  • Marking boundaries at the phoneme level is more difficult than marking them for larger units such as syllables or words.


  • It is difficult to mark, listen to and unambiguously perceive shorter units of sound such as phonemes, as compared to syllables or words.


  • The boundaries between subword units are fuzzy due to the coarticulation effect.


  • The spoken form of a speech utterance (especially spontaneous speech) may not always contain all the phonemes used in the written form. But while listening we use the higher-level language and context information to fill in these sounds and perceive the intended word or phrase.


  • The above observations highlight the complexity of the task involved in speech signal-to-symbol transformation, even for human beings who are endowed with the ability to produce and comprehend natural speech.


  • Choice of subword unit is one of the important issues in designing an automatic speech signal-to-symbol transformation system.

    Words seem to be a good choice, but the number of words in a language runs to several tens of thousands. Devising a pattern recognition/classification strategy for such a large number of classes is a tough task. Also, collecting a large number of examples for each word, as needed for a solution within the statistical framework, is difficult.

    Syllables are a better option, as the number of classes will be a few hundred. The context of a sound is inherently captured in the syllable. But the number of classes is still large, and getting enough examples for all possible syllables in a language is difficult.

    Choosing the phoneme as the subword unit makes the design of a system using the statistical framework easier, because the number of classes is small (a few tens) and it is easy to collect a large number of examples for each class. But handling the variability of phones occurring in different contexts, caused by the coarticulation effect, is difficult.


Assessment

  1. Draw on a sheet of paper the speech waveform for the English word /mask/ as uttered by an adult male speaker. Use a scale of approximately 1 cm = 10 ms. Mark clearly the beginning and ending of the various phonemes in the word (/m/, /a/, /s/ and /k/). Also mark regions of silence using the symbol [sil]. Clearly show the difference in amplitude of the various sounds. What changes would you make if the speaker were a female or a child?


  2. Record the speech signal for the word /mask/ and compare or verify with the waveform you sketched in assignment #1.


  3. Sketch the waveforms for the following words and write down the main differences:

    1. /mask/
    2. /mass/
    3. /boss/
    4. /mark/


  4. Write a small program in C or any scripting language (bash, csh, awk, perl, python, etc.) to convert a given text stream of word-level transcription into a stream of syllables and phonemes. Assume that the input stream of word-level transcription uses ITRANS code. In the output stream, use a space for word boundaries, _ for syllable boundaries and - for phoneme boundaries, as in the example below.

Eg: Input: namaskAr aapka swaagat hai
      Output: n-a_m-a-s_k-A-r aa-p_k-a s-w-aa_g-a-t h-ai
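The output format in the example can be illustrated with a short formatting routine. The sketch below covers only the final formatting step: it assumes the utterance has already been segmented into words, syllables and phonemes, and that segmentation (the substance of the exercise) is not attempted here. The delimiters follow the example output, where _ separates syllables and - separates phonemes. The function name and the nested-list input layout are hypothetical.

```python
def format_segmentation(words):
    """Join a nested segmentation into a single output string.

    words is a list of words, each word a list of syllables,
    each syllable a list of phoneme strings.
    """
    return " ".join(
        "_".join("-".join(syllable) for syllable in word)
        for word in words
    )


# Segmentation of the example input, copied from the expected output above.
segmented = [
    [["n", "a"], ["m", "a", "s"], ["k", "A", "r"]],
    [["aa", "p"], ["k", "a"]],
    [["s", "w", "aa"], ["g", "a", "t"]],
    [["h", "ai"]],
]
print(format_segmentation(segmented))
# n-a_m-a-s_k-A-r aa-p_k-a s-w-aa_g-a-t h-ai
```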

Quiz

  1. What is transcription or speech transcription?

  2. How would you transcribe speech of an unknown language?

  3. How would you handle new sounds in an unknown language which are not present in any of your known languages?

  4. What choices does one have for symbols other than words, syllables or phonemes?

  5. What symbols can one choose based on auditory perception?

  6. What symbols can one choose based on production mechanism or apparatus?

  7. What is a phoneme?

  8. What is an allophone?

  9. What is the difference between the terms phonetic and phonemic?

  10. What is a syllable?

  11. What is transliteration?

  12. What is ITRANS code?

References

Tools and utilities