Generic scripts to build speech recognition systems using the Bavieca Toolkit
A series of scripts to build speech recognition systems can be found in the Bavieca Git repository at /home/bavieca-git/bavieca-code/scripts.
These scripts provide a higher-level interface to the Bavieca toolkit and facilitate the execution of
common tasks, such as feature extraction, training, and decoding, in parallel (using multiple cores/processors).
- Feature extraction: extractFeatures.pl
- Forced alignment: align.pl
- Training acoustic models (accumulation and estimation using MLE and bMMI):
  - BaviecaTrain.pm: generic training functions
  - accumulate.pl: parallel accumulation of sufficient statistics for MLE
  - accumulateDT.pl: parallel accumulation of sufficient statistics for discriminative training
- Speech recognition (including speaker adaptation): BaviecaDecode.pm
- Lattice processing:
  - amMarkingLattices.pl: align and mark the lattices using a set of acoustic models
  - compactLattices.pl: make the lattices deterministic by merging redundant nodes and edges
  - insertPathLattices.pl: insert a given path into the lattice
  - lmMarkingLattices.pl: mark the lattice edges using the given language model
  - nbestLattices.pl: compute n-best lists from lattices
  - ppLattices.pl: compute posterior probabilities and confidence measures from the lattices
  - rescoreLattices.pl: rescore lattices and generate the corresponding hypotheses
  - werLattices.pl: compute the lattice Word Error Rate (also known as the oracle WER) and generate the corresponding hypotheses
The best way to see how these scripts work is to look at the WSJ recipe for building and evaluating a speech recognition system.
Training acoustic models for the Wall Street Journal task
This page describes the process of training acoustic models for the Wall Street Journal task using the Bavieca toolkit. It also points to the scripts and resources needed to accomplish this task.
The list below summarizes the main details of the training process:
- Training of acoustic models was performed on 80 hours of data from the SI-284 dataset
- 39-dimensional feature vectors: 12 MFCCs and energy plus first and second order derivatives
- CMN at utterance level
- Pronunciations were extracted from the CMU dictionary (cmudict.0.7a) using the standard CMU phonetic symbol set without stress markers (39 phonetic classes plus silence).
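The front end described above (utterance-level CMN plus first- and second-order derivatives of 13 static coefficients) can be sketched as follows. This is an illustrative reimplementation, not Bavieca's code; the regression window size and edge padding are assumptions:

```python
import numpy as np

def deltas(feats, win=2):
    """Regression-based derivatives over a +/-win frame window (an assumed
    window size; standard in MFCC front ends)."""
    T = len(feats)
    denom = 2 * sum(t * t for t in range(1, win + 1))
    padded = np.pad(feats, ((win, win), (0, 0)), mode="edge")
    return np.stack([
        sum(t * (padded[i + win + t] - padded[i + win - t])
            for t in range(1, win + 1)) / denom
        for i in range(T)
    ])

def front_end(static):
    """static: (T, 13) matrix of 12 MFCCs + energy per frame."""
    static = static - static.mean(axis=0)   # utterance-level CMN
    d1 = deltas(static)                     # first-order derivatives
    d2 = deltas(d1)                         # second-order derivatives
    return np.hstack([static, d1, d2])      # (T, 39) feature vectors

X = front_end(np.random.randn(100, 13))
```

After CMN the static coefficients have zero mean within the utterance, which removes per-channel bias before the derivatives are appended.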
Training scripts for the WSJ task can be obtained from Bavieca's Git repository at SourceForge:
git clone git://git.code.sf.net/p/bavieca/code bavieca-code
1. Maximum Likelihood Estimation
The training script for MLE can be found in the repository at: /bavieca-code/tasks/wsj/scripts/train/trainMLE.pl
Use Bavieca's message boards to report any issues running the training script.
Training output
The listing below shows the training output resulting from MLE training (speaker independent). Acoustic models are initialized to the global distribution of the data, and 5 reestimation iterations are carried out to produce a set of single-Gaussian context-independent acoustic models. Triphone clustering is then carried out to produce a set of 3453 physical triphones, which are settled in the feature space by performing three reestimation iterations. Finally, 26 reestimation iterations incrementally grow the acoustic models, adding at most 2 Gaussian components at a time to the mixture of each state.
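The incremental mixture growth can be illustrated with the classic mean-perturbation splitting heuristic. This is a sketch, not Bavieca's actual implementation; the perturbation of 0.2 standard deviations is an assumption borrowed from the HTK-style mixture-up procedure:

```python
import numpy as np

def split_component(mean, var, eps=0.2):
    """Split one diagonal-covariance Gaussian into two by perturbing the
    mean by +/- eps standard deviations (assumed heuristic)."""
    offset = eps * np.sqrt(var)
    return (mean - offset, var), (mean + offset, var)

def grow_mixture(components, max_new=2):
    """Add at most max_new components to one HMM state's mixture by
    splitting the heaviest components. components: (mean, var, weight)."""
    ranked = sorted(components, key=lambda c: c[2], reverse=True)
    grown = []
    for i, (m, v, w) in enumerate(ranked):
        if i < max_new:
            (m1, v1), (m2, v2) = split_component(m, v)
            grown += [(m1, v1, w / 2), (m2, v2, w / 2)]  # halve the weight
        else:
            grown.append((m, v, w))
    return grown

# A single-Gaussian state grows into a 2-component mixture.
state = [(np.zeros(39), np.ones(39), 1.0)]
state = grow_mixture(state)
```

After each such growth step the new components would be settled by further EM reestimation, matching the iterative schedule described above.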
2. Discriminative Training (bMMI)
The training script for discriminative estimation (bMMI) can be found in the repository at /bavieca-code/tasks/wsj/scripts/train/trainDT.pl. Additionally, the scripts to recognize the training dataset, which is the first step of discriminative training, can be found in the repository at /bavieca-code/tasks/wsj/scripts/test/trainingSet.
Use Bavieca's message boards to report any issues running the training script.
Discriminative training consists of three steps:
- Decoding the training dataset with lattice generation enabled
- Processing the lattices produced in the previous step so they can be used for discriminative training (see the "Using Bavieca" section)
- Actual discriminative reestimation using the lattices from the previous step
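The criterion optimized in the last step can be sketched as follows. This is an illustrative, lattice-free version of the boosted-MMI objective for a single utterance; the explicit path list and accuracies are toy inputs, whereas in practice both terms are accumulated over lattices with forward-backward:

```python
import math

def bmmi_objective(num_loglik, den_paths, b=0.5):
    """Boosted-MMI objective for one utterance.
    num_loglik: acoustic+LM log-likelihood of the reference path.
    den_paths:  (loglik, accuracy) pairs for competing paths, where
                accuracy measures agreement with the reference.
    b:          boosting factor (0.5 in the recipe on this page)."""
    # Boost low-accuracy competitors: subtracting b * accuracy raises the
    # relative weight of paths that disagree more with the reference.
    boosted = [ll - b * acc for ll, acc in den_paths]
    m = max(boosted)                       # log-sum-exp, numerically stable
    den = m + math.log(sum(math.exp(x - m) for x in boosted))
    return num_loglik - den

v = bmmi_objective(-10.0, [(-10.0, 1.0), (-12.0, 0.0)])
```

A positive value means the reference path outweighs the boosted competition; training adjusts the Gaussian parameters to increase this difference, which is what the numerator/denominator likelihood columns in the listings below track.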
Training output for bMMI training after the 17th MLE iteration. Each line corresponds to a reestimation iteration. Third and fourth columns show the likelihood for numerator and denominator statistics respectively. The fifth column shows the difference in likelihood between numerator and denominator stats.
Training output for bMMI training after the 20th MLE iteration. Each line corresponds to a reestimation iteration. Third and fourth columns show the likelihood for numerator and denominator statistics respectively. The fifth column shows the difference in likelihood between numerator and denominator stats.
Wall Street Journal evaluation
Two evaluation conditions were examined, the Nov'92 5k closed vocabulary and the Nov'92 20k open vocabulary conditions. In both cases non-verbalized pronunciations (NVP) and the standard bigram and trigram LMs were used. The whole SI-284 dataset was used for training as described in the "training section".
All the scripts and acoustic models to reproduce the evaluation results reported in this page can be found in the Bavieca project page at Sourceforge. Acoustic models can be found in the "Files" section, while decoding scripts and configuration files can be found in the Git repository.
5k closed vocabulary task
Decoding scripts for this task can be found in the repository at: /bavieca-code/tasks/wsj/scripts/test/5k
The following listings show the Word Error Rate (WER) for the 5k task under different testing conditions, for several reestimation iterations.
- MLE acoustic models. Speaker and gender independent. Trigram language model.
  iteration   utt   words    acc   sub   del   ins    WER    SER
         18   330    5353   97.2   2.5   0.3   0.6    3.4   33.0
         19   330    5353   97.3   2.4   0.3   0.6    3.3   32.1
         20   330    5353   97.4   2.4   0.3   0.5    3.1   31.2
         21   330    5353   97.5   2.3   0.3   0.5    3.0   30.6
         22   330    5353   97.4   2.3   0.3   0.5    3.1   30.9
         23   330    5353   97.4   2.3   0.3   0.6    3.2   31.8
         24   330    5353   97.4   2.3   0.3   0.6    3.1   31.8
         25   330    5353   97.4   2.3   0.3   0.6    3.1   31.5
         26   330    5353   97.5   2.2   0.3   0.6    3.1   30.6
- MLE acoustic models. Speaker and gender independent. Bigram language model.
  iteration   utt   words    acc   sub   del   ins    WER    SER
         18   330    5353   95.7   3.7   0.6   0.7    5.0   46.4
         19   330    5353   95.7   3.8   0.6   0.7    5.0   45.8
         20   330    5353   95.8   3.7   0.5   0.6    4.8   45.2
         21   330    5353   95.6   3.8   0.6   0.6    5.0   46.4
         22   330    5353   95.7   3.7   0.6   0.7    5.0   46.4
         23   330    5353   95.8   3.7   0.6   0.7    4.9   46.4
         24   330    5353   95.8   3.7   0.6   0.7    4.9   45.5
         25   330    5353   95.8   3.6   0.6   0.6    4.8   45.5
         26   330    5353   95.8   3.6   0.6   0.7    4.8   45.2
- Discriminatively trained models (bMMI). Speaker and gender independent. Trigram language model. Boosting factor: 0.5; constant to compute the Gaussian-specific learning rate: 3.0; tau: 200. bMMI training was started after the 20th MLE training iteration (99006 Gaussian components).
  iteration   utt   words    acc   sub   del   ins    WER    SER
          1   330    5353   97.4   2.3   0.2   0.5    3.0   31.2
          2   330    5353   97.6   2.2   0.2   0.5    2.9   30.0
          3   330    5353   97.6   2.1   0.2   0.5    2.8   28.8
          4   330    5353   97.6   2.2   0.2   0.4    2.9   28.8
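The WER figures in these listings follow the standard definition: substitutions, deletions, and insertions from a Levenshtein alignment against the reference, divided by the number of reference words. A minimal sketch of the computation (scoring tools such as sclite additionally break the errors down per column):

```python
def wer(ref, hyp):
    """Word Error Rate = edit distance(ref, hyp) / len(ref),
    where the edit distance counts substitutions, deletions, insertions."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # i deletions
    for j in range(m + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / n

# One insertion against a 3-word reference: WER = 1/3, i.e. 33.3%
print(round(100 * wer("the cat sat".split(), "the cat sat down".split()), 1))
```

The SER (sentence error rate) column is simpler: the fraction of utterances with at least one word error.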
20k open vocabulary task
Decoding scripts for this task can be found in the repository at: /bavieca-code/tasks/wsj/scripts/test/20k
- MLE acoustic models. Speaker and gender independent. Bigram language model.
  iteration   utt   words    acc   sub   del   ins    WER    SER
         14   333    5643   90.6   8.0   1.3   1.6   11.0   66.1
         15   333    5643   90.8   7.9   1.3   1.6   10.8   65.5
         16   333    5643   91.0   7.7   1.3   1.7   10.7   65.2
         17   333    5643   91.0   7.7   1.3   1.6   10.7   64.3
         18   333    5643   90.9   7.7   1.4   1.6   10.8   64.9
         19   333    5643   91.1   7.6   1.3   1.6   10.6   65.2
         20   333    5643   91.1   7.6   1.3   1.7   10.6   64.0
         21   333    5643   91.0   7.7   1.3   1.7   10.7   64.6
         22   333    5643   91.1   7.6   1.3   1.7   10.6   64.3
         23   333    5643   91.1   7.7   1.3   1.7   10.6   65.2
         24   333    5643   91.1   7.6   1.3   1.7   10.6   64.9
         25   333    5643   91.1   7.7   1.2   1.7   10.6   65.2
         26   333    5643   91.1   7.7   1.2   1.6   10.5   64.6
- MLE acoustic models. Speaker and gender independent. Trigram language model.
  iteration   utt   words    acc   sub   del   ins    WER    SER
         14   333    5643   92.5   6.5   1.0   1.4    8.9   58.6
         15   333    5643   92.8   6.3   0.9   1.3    8.6   56.8
         16   333    5643   92.8   6.3   0.9   1.3    8.5   57.1
         17   333    5643   92.9   6.2   0.9   1.4    8.5   57.1
         18   333    5643   92.7   6.3   1.0   1.4    8.7   59.5
         19   333    5643   92.5   6.4   1.0   1.4    8.9   59.5
         20   333    5643   92.3   6.7   1.0   1.5    9.3   60.7
         21   333    5643   92.5   6.6   0.9   1.5    9.0   59.8
         22   333    5643   92.6   6.5   1.0   1.5    8.9   58.9
         23   333    5643   92.6   6.5   0.9   1.4    8.8   58.6
         24   333    5643   92.7   6.4   0.9   1.5    8.8   58.3
         25   333    5643   92.7   6.5   0.9   1.5    8.8   59.2
         26   333    5643   92.7   6.4   0.9   1.5    8.7   59.2
- Discriminatively trained models (bMMI). Speaker and gender independent. Trigram language model. Boosting factor: 0.5; constant to compute the Gaussian-specific learning rate: 3.0; tau: 200. bMMI training was started after the 17th MLE training iteration (89669 Gaussian components).
  iteration   utt   words    acc   sub   del   ins    WER    SER
          1   333    5643   93.1   6.0   0.9   1.3    8.3   57.1
          2   333    5643   93.2   5.9   0.9   1.3    8.0   57.1
          3   333    5643   93.0   6.2   0.9   1.3    8.3   58.9
          4   333    5643   93.0   6.2   0.9   1.3    8.4   58.6