Overview

Generic scripts to build speech recognition systems using the Bavieca toolkit

A series of scripts for building speech recognition systems can be found in the Bavieca Git repository at /home/bavieca-git/bavieca-code/scripts. These scripts provide a higher-level interface to the Bavieca toolkit and facilitate the execution of common tasks, such as feature extraction, training and decoding, in parallel (using multiple cores/processors).
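
As an illustration of the parallelization idea only (the actual scripts are Perl and their interface differs), the sketch below fans independent per-utterance jobs out to multiple cores; the echo command is a stand-in for a real Bavieca feature-extraction invocation.

from concurrent.futures import ProcessPoolExecutor
import subprocess

def extract_features(wav_path: str) -> int:
    # Hypothetical invocation; substitute the real feature-extraction
    # command used by your setup for the echo placeholder.
    return subprocess.call(["echo", "extracting", wav_path])

if __name__ == "__main__":
    utterances = [f"utt{i:04d}.wav" for i in range(16)]
    # One worker per core; each utterance is an independent job.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(extract_features, utterances))
    print(f"{results.count(0)} of {len(utterances)} jobs succeeded")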

The best way to see how these scripts work is to look at the WSJ recipe for building and evaluating a speech recognition system.

Training acoustic models for the Wall Street Journal task

This page describes the process of training acoustic models for the Wall Street Journal task using the Bavieca toolkit. It also points to the scripts and resources needed to accomplish this task.

The listing below summarizes a series of details about the training process.

Training scripts for the WSJ task can be obtained from Bavieca's Git repository at SourceForge:

git clone git://git.code.sf.net/p/bavieca/code bavieca-code

1. Maximum Likelihood Estimation

The training script for MLE can be found in the repository at: /bavieca-code/tasks/wsj/scripts/train/trainMLE.pl. Use Bavieca's message boards to report any issues running the training script.

Training output

The listing below shows the training output resulting from the MLE training (speaker independent). Acoustic models are initialized to the global distribution of the data, and 5 reestimation iterations are carried out to produce a set of single-Gaussian context-independent acoustic models. Then triphone clustering is carried out to produce a set of 3453 physical triphones, which are settled in the feature space with three reestimation iterations. Finally, 26 reestimation iterations are performed to incrementally grow the acoustic models, adding at most 2 Gaussian components to each state's mixture per iteration.
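
Before the output listing, here is a sketch of one common way mixtures are grown between reestimation iterations: split the heaviest Gaussian by perturbing its mean along its standard deviation and halving its weight. The exact splitting heuristic Bavieca uses may differ, and the 39-dimensional feature size is only an assumption.

import numpy as np

def split_heaviest(weights, means, variances, perturbation=0.2):
    # Split the component with the largest weight into two components
    # whose means sit on either side of the original mean.
    i = int(np.argmax(weights))
    offset = perturbation * np.sqrt(variances[i])
    half = weights[i] / 2.0
    weights = np.append(weights, half)
    weights[i] = half
    means = np.vstack([means, means[i] + offset])
    means[i] = means[i] - offset
    variances = np.vstack([variances, variances[i]])
    return weights, means, variances

w = np.array([1.0])
m = np.zeros((1, 39))        # 39-dim features are just an assumption
v = np.ones((1, 39))
for it in range(3):
    for _ in range(2):       # at most 2 new Gaussians per iteration
        w, m, v = split_heaviest(w, m, v)
    print(f"iteration {it + 1}: {len(w)} components")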

loading MLF: 80h:07' of speech loaded
iteration: 1 likelihood: -686325712.2866 (-23.79) Gauss: 120 [RTF=0.0073][80:07'14''][80:07'14'']
iteration: 2 likelihood: -675123907.2163 (-23.41) Gauss: 120 [RTF=0.0059][80:07'14''][80:07'14'']
iteration: 3 likelihood: -514242209.1551 (-17.83) Gauss: 120 [RTF=0.0035][80:07'14''][80:07'14'']
iteration: 4 likelihood: -453244819.6625 (-15.71) Gauss: 120 [RTF=0.0026][80:07'14''][80:07'14'']
iteration: 5 likelihood: -444156891.8289 (-15.40) Gauss: 120 [RTF=0.0037][80:07'14''][80:07'14'']
likelihood before clustering: -445078545.1412 100.00% ( 79258 observed triphones, 120 monophones)
likelihood top-down clustering: -385211113.3582 86.55% ( 4530 physical triphones)
likelihood bottom-up merging: -386616877.6686 86.86% ( 3453 physical triphones)
loading time: 10.15 seconds
clustering time: 9.36 seconds
iteration: 1 likelihood: -371633487.7811 (-12.88) Gauss: 3453 [RTF=0.0030][80:07'14''][80:07'14'']
iteration: 2 likelihood: -366453269.9656 (-12.70) Gauss: 3453 [RTF=0.0029][80:07'14''][80:07'14'']
iteration: 3 likelihood: -364365265.6462 (-12.63) Gauss: 3453 [RTF=0.0029][80:07'14''][80:07'14'']
iteration: 1 likelihood: -363931720.9124 (-12.62) Gauss: 6906 [RTF=0.0048][80:07'14''][80:07'14'']
iteration: 2 likelihood: -362334348.4161 (-12.56) Gauss: 13810 [RTF=0.0079][80:07'14''][80:07'14'']
iteration: 3 likelihood: -354510609.7717 (-12.29) Gauss: 20633 [RTF=0.0108][80:07'14''][80:07'14'']
iteration: 4 likelihood: -337409032.9950 (-11.70) Gauss: 27443 [RTF=0.0137][80:07'14''][80:07'14'']
iteration: 5 likelihood: -321015747.5687 (-11.13) Gauss: 33950 [RTF=0.0163][80:07'14''][80:07'14'']
iteration: 6 likelihood: -303313875.4805 (-10.52) Gauss: 39974 [RTF=0.0191][80:07'14''][80:07'14'']
iteration: 7 likelihood: -288350360.4445 (-10.00) Gauss: 45261 [RTF=0.0219][80:07'14''][80:07'14'']
iteration: 8 likelihood: -277676776.6275 (-9.63) Gauss: 50436 [RTF=0.0246][80:07'14''][80:07'14'']
iteration: 9 likelihood: -269235857.6136 (-9.33) Gauss: 55508 [RTF=0.0276][80:07'14''][80:07'14'']
iteration: 10 likelihood: -262588597.8260 (-9.10) Gauss: 60427 [RTF=0.0306][80:07'14''][80:07'14'']
iteration: 11 likelihood: -256675854.9373 (-8.90) Gauss: 65201 [RTF=0.0338][80:07'14''][80:07'14'']
iteration: 12 likelihood: -251425579.7726 (-8.72) Gauss: 69767 [RTF=0.0371][80:07'14''][80:07'14'']
iteration: 13 likelihood: -246863248.6860 (-8.56) Gauss: 74201 [RTF=0.0402][80:07'14''][80:07'14'']
iteration: 14 likelihood: -242844183.1723 (-8.42) Gauss: 78443 [RTF=0.0434][80:07'14''][80:07'14'']
iteration: 15 likelihood: -239285388.9468 (-8.30) Gauss: 82410 [RTF=0.0464][80:07'14''][80:07'14'']
iteration: 16 likelihood: -236104468.6583 (-8.19) Gauss: 86099 [RTF=0.0492][80:07'14''][80:07'14'']
iteration: 17 likelihood: -233258900.4046 (-8.09) Gauss: 89669 [RTF=0.0532][80:07'14''][80:07'14'']
iteration: 18 likelihood: -230685248.0689 (-8.00) Gauss: 92988 [RTF=0.0551][80:07'14''][80:07'14'']
iteration: 19 likelihood: -228327423.8761 (-7.92) Gauss: 96133 [RTF=0.0580][80:07'14''][80:07'14'']
iteration: 20 likelihood: -226161864.0132 (-7.84) Gauss: 99006 [RTF=0.0607][80:07'14''][80:07'14'']
iteration: 21 likelihood: -224162528.0651 (-7.77) Gauss: 101738 [RTF=0.0633][80:07'14''][80:07'14'']
iteration: 22 likelihood: -222334485.5593 (-7.71) Gauss: 104222 [RTF=0.0656][80:07'14''][80:07'14'']
iteration: 23 likelihood: -220666124.3353 (-7.65) Gauss: 106534 [RTF=0.0684][80:07'14''][80:07'14'']
iteration: 24 likelihood: -219122608.6269 (-7.60) Gauss: 108726 [RTF=0.0706][80:07'14''][80:07'14'']
iteration: 25 likelihood: -217708198.2841 (-7.55) Gauss: 110688 [RTF=0.0729][80:07'14''][80:07'14'']
iteration: 26 likelihood: -216365384.8259 (-7.50) Gauss: 112647 [RTF=0.0752][80:07'14''][80:07'14'']
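
Each per-iteration line above follows a fixed format, so quantities such as the average log-likelihood per frame and the real-time factor (RTF) can be pulled out with a small parser like the one below (an illustrative helper, not part of the toolkit):

import re

# Matches lines like:
# iteration: 26 likelihood: -216365384.8259 (-7.50) Gauss: 112647 [RTF=0.0752]...
LINE = re.compile(r"iteration:\s*(\d+)\s+likelihood:\s*(-?[\d.]+)\s+"
                  r"\((-?[\d.]+)\)\s+Gauss:\s*(\d+)\s+\[RTF=([\d.]+)\]")

log = "iteration: 26 likelihood: -216365384.8259 (-7.50) Gauss: 112647 [RTF=0.0752][80:07'14''][80:07'14'']"
match = LINE.search(log)
if match:
    it, total, per_frame, gauss, rtf = match.groups()
    print(f"iter {it}: {total} total ({per_frame} per frame), "
          f"{gauss} Gaussians, RTF {rtf}")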

2. Discriminative Training (bMMI)

The training script for discriminative estimation (bMMI) can be found in the repository at: /bavieca-code/tasks/wsj/scripts/train/trainDT.pl. Additionally, the scripts to recognize the training dataset, which is the first step of discriminative training, can be found in the repository at: /bavieca-code/tasks/wsj/scripts/test/trainingSet. Use Bavieca's message boards to report any issues running the training script.

Discriminative training consists of three steps:

  1. Decoding the training dataset with lattice generation enabled
  2. Processing the lattices produced in the previous step so they can be used for discriminative training (see the "Using Bavieca" section)
  3. Actual discriminative reestimation using the lattices from the previous step (a numeric sketch of the bMMI criterion follows below)
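
For intuition about step 3, the sketch below evaluates a boosted MMI style criterion for a single utterance: denominator hypotheses are penalized less when they contain more errors, so wrong but highly competing hypotheses dominate the denominator. The numbers and the boosting factor b are illustrative, not Bavieca's actual internals.

import math

def bmmi_objective(num_loglik, den_hyps, b=0.5):
    # den_hyps: (log-likelihood, accuracy) pairs from the denominator
    # lattice; less accurate hypotheses receive a smaller penalty, so
    # they are effectively boosted relative to accurate ones.
    boosted = [ll - b * acc for ll, acc in den_hyps]
    m = max(boosted)
    den = m + math.log(sum(math.exp(x - m) for x in boosted))  # log-sum-exp
    return num_loglik - den

reference = -1000.0                                   # numerator score
competitors = [(-1001.0, 8.0), (-1003.0, 5.0), (-1010.0, 2.0)]
print(f"bMMI objective: {bmmi_objective(reference, competitors):.3f}")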

The listing below shows the training output for bMMI training after the 17th MLE iteration. Each line corresponds to a reestimation iteration. The third and fourth columns show the likelihood of the numerator and denominator statistics, respectively; the fifth column shows the difference between the two.

64 -230907679.243 -11565658.8865 -11189935.2277 -> -375723.6588
64 -231503511.6597 -11589492.1825 -11286079.2574 -> -303412.9254
64 -233658608.2063 -11675696.0421 -11415622.9452 -> -260073.0969
64 -236809154.1603 -11801717.8776 -11579005.9105 -> -222711.9672

The listing below shows the training output for bMMI training after the 20th MLE iteration. Each line corresponds to a reestimation iteration. The third and fourth columns show the likelihood of the numerator and denominator statistics, respectively; the fifth column shows the difference between the two.

64 -227045341.8148 -11906252.6929 -11256710.7753 -> -649541.9177
64 -226683787.0273 -11891790.5014 -11364880.3029 -> -526910.1987
64 -228490852.378 -11964073.114 -11499322.1646 -> -464750.9497
64 -231344845.7854 -12078232.8481 -11664069.7174 -> -414163.1309
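
As a quick sanity check on the listings above, the last column is simply the difference between the numerator and denominator columns:

# Recompute the objective column from the numerator (third) and
# denominator (fourth) columns of the listing above.
pairs = [(-11906252.6929, -11256710.7753),
         (-11891790.5014, -11364880.3029),
         (-11964073.114, -11499322.1646),
         (-12078232.8481, -11664069.7174)]
for num, den in pairs:
    print(f"{num - den:.4f}")  # matches the listing up to display rounding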

Wall Street Journal evaluation

Two evaluation conditions were examined: the Nov'92 5k closed-vocabulary condition and the Nov'92 20k open-vocabulary condition. In both cases, non-verbalized pronunciations (NVP) and the standard bigram and trigram LMs were used. The whole SI-284 dataset was used for training, as described in the training section.

All the scripts and acoustic models needed to reproduce the evaluation results reported on this page can be found on the Bavieca project page at SourceForge. Acoustic models are in the "Files" section, while decoding scripts and configuration files are in the Git repository.

5k closed vocabulary task

Decoding scripts for this task can be found in the repository at: /bavieca-code/tasks/wsj/scripts/test/5k

The following listings show the Word Error Rate (WER) for the 5k task under different testing conditions, for several reestimation iterations.
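
WER is the standard word-level edit distance (substitutions plus insertions plus deletions) divided by the number of reference words; a minimal reference implementation:

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(f"{wer('the cat sat', 'the cat sat down'):.2f}%")  # 33.33%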

20k open vocabulary task

Decoding scripts for this task can be found in the repository at: /bavieca-code/tasks/wsj/scripts/test/20k