Generic scripts to build speech recognition systems using the Bavieca Toolkit
A series of scripts to build speech recognition systems can be found in the Bavieca Git repository at /home/bavieca-git/bavieca-code/scripts.
These scripts provide a higher-level interface to the Bavieca toolkit and facilitate the execution of
common tasks, such as feature extraction, training, and decoding, in parallel (using multiple cores/processors).
- Feature extraction: extractFeatures.pl
- Forced alignment: align.pl
- Training acoustic models (accumulation and estimation using MLE and bMMI):
  - BaviecaTrain.pm: generic training functions
  - accumulate.pl: parallel accumulation of sufficient statistics for MLE
  - accumulateDT.pl: parallel accumulation of sufficient statistics for discriminative training
- Speech recognition (including speaker adaptation): BaviecaDecode.pm
- Lattice processing:
  - amMarkingLattices.pl: align and mark the lattices using a set of acoustic models
  - compactLattices.pl: make the lattices deterministic by merging redundant nodes and edges
  - insertPathLattices.pl: insert a given path into the lattice
  - lmMarkingLattices.pl: mark the lattice edges using the given language model
  - nbestLattices.pl: compute n-best lists from lattices
  - ppLattices.pl: compute posterior probabilities and confidence measures from the lattices
  - rescoreLattices.pl: rescore lattices and generate the corresponding hypotheses
  - werLattices.pl: compute the lattice Word Error Rate (also known as the oracle WER) and generate the corresponding hypotheses
The best way to see how these scripts work is to look at the WSJ recipe for building and evaluating a speech recognition system.
Training acoustic models for the Wall Street Journal task
This page describes the process of training acoustic models for the Wall Street Journal task using the Bavieca toolkit. It also points to the scripts and resources needed to accomplish this task.
The list below summarizes the main details of the training process:
- Training of acoustic models was performed on 80 hours of data from the SI-284 dataset
- 39-dimensional feature vectors: 12 MFCCs and energy plus first and second order derivatives
- CMN at utterance level
- Pronunciations were extracted from the CMU dictionary (cmudict.0.7a) using the standard CMU phonetic symbol set without stress markers (39 phonetic classes plus silence).
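The front end described above (utterance-level CMN plus first- and second-order derivatives of 13 static coefficients) can be sketched as follows. This is an illustrative reimplementation, not Bavieca's code; the regression window size and edge padding are assumptions:

```python
import numpy as np

def deltas(feats, win=2):
    """Regression-based derivatives over a +/-win frame window (an assumed
    window size; standard in MFCC front ends)."""
    T = len(feats)
    denom = 2 * sum(t * t for t in range(1, win + 1))
    padded = np.pad(feats, ((win, win), (0, 0)), mode="edge")
    return np.stack([
        sum(t * (padded[i + win + t] - padded[i + win - t])
            for t in range(1, win + 1)) / denom
        for i in range(T)
    ])

def front_end(static):
    """static: (T, 13) matrix of 12 MFCCs + energy per frame."""
    static = static - static.mean(axis=0)   # utterance-level CMN
    d1 = deltas(static)                     # first-order derivatives
    d2 = deltas(d1)                         # second-order derivatives
    return np.hstack([static, d1, d2])      # (T, 39) feature vectors

X = front_end(np.random.randn(100, 13))
```

After CMN the static coefficients have zero mean within the utterance, which removes per-channel bias before the derivatives are appended.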
Training scripts for the WSJ task can be obtained from Bavieca's Git repository at SourceForge:
git clone git://git.code.sf.net/p/bavieca/code bavieca-code
1. Maximum Likelihood Estimation
The training script for MLE can be found in the repository at: /bavieca-code/tasks/wsj/scripts/train/trainMLE.pl
Use Bavieca's message boards to report any issues running the training script.
Training output
The listing below shows the training output resulting from MLE training (speaker independent). Acoustic models are initialized to the global distribution of the data, and 5 reestimation iterations are carried out to produce a set of single-Gaussian context-independent acoustic models. Triphone clustering is then carried out to produce a set of 3453 physical triphones, which are settled in the feature space by performing three reestimation iterations. Finally, 26 reestimation iterations incrementally grow the acoustic models, adding at most 2 Gaussian components at a time to the mixture of each state.
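The incremental mixture growth can be illustrated with the classic mean-perturbation splitting heuristic. This is a sketch, not Bavieca's actual implementation; the perturbation of 0.2 standard deviations is an assumption borrowed from the HTK-style mixture-up procedure:

```python
import numpy as np

def split_component(mean, var, eps=0.2):
    """Split one diagonal-covariance Gaussian into two by perturbing the
    mean by +/- eps standard deviations (assumed heuristic)."""
    offset = eps * np.sqrt(var)
    return (mean - offset, var), (mean + offset, var)

def grow_mixture(components, max_new=2):
    """Add at most max_new components to one HMM state's mixture by
    splitting the heaviest components. components: (mean, var, weight)."""
    ranked = sorted(components, key=lambda c: c[2], reverse=True)
    grown = []
    for i, (m, v, w) in enumerate(ranked):
        if i < max_new:
            (m1, v1), (m2, v2) = split_component(m, v)
            grown += [(m1, v1, w / 2), (m2, v2, w / 2)]  # halve the weight
        else:
            grown.append((m, v, w))
    return grown

# A single-Gaussian state grows into a 2-component mixture.
state = [(np.zeros(39), np.ones(39), 1.0)]
state = grow_mixture(state)
```

After each such growth step the new components would be settled by further EM reestimation, matching the iterative schedule described above.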
2. Discriminative Training (bMMI)
The training script for discriminative estimation (bMMI) can be found in the repository at /bavieca-code/tasks/wsj/scripts/train/trainDT.pl. Additionally, the scripts to recognize the training dataset, which is the first step of discriminative training, can be found in the repository at /bavieca-code/tasks/wsj/scripts/test/trainingSet.
Use Bavieca's message boards to report any issues running the training script.
Discriminative training consists of three steps:
- Decoding the training dataset with lattice generation enabled
- Processing the lattices produced in the previous step so they can be used for discriminative training (see the "Using Bavieca" section)
- Actual discriminative reestimation using the lattices from the previous step
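The criterion optimized in the last step can be sketched as follows. This is an illustrative, lattice-free version of the boosted-MMI objective for a single utterance; the explicit path list and accuracies are toy inputs, whereas in practice both terms are accumulated over lattices with forward-backward:

```python
import math

def bmmi_objective(num_loglik, den_paths, b=0.5):
    """Boosted-MMI objective for one utterance.
    num_loglik: acoustic+LM log-likelihood of the reference path.
    den_paths:  (loglik, accuracy) pairs for competing paths, where
                accuracy measures agreement with the reference.
    b:          boosting factor (0.5 in the recipe on this page)."""
    # Boost low-accuracy competitors: subtracting b * accuracy raises the
    # relative weight of paths that disagree more with the reference.
    boosted = [ll - b * acc for ll, acc in den_paths]
    m = max(boosted)                       # log-sum-exp, numerically stable
    den = m + math.log(sum(math.exp(x - m) for x in boosted))
    return num_loglik - den

v = bmmi_objective(-10.0, [(-10.0, 1.0), (-12.0, 0.0)])
```

A positive value means the reference path outweighs the boosted competition; training adjusts the Gaussian parameters to increase this difference, which is what the numerator/denominator likelihood columns in the listings below track.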
Training output for bMMI training after the 17th MLE iteration. Each line corresponds to a reestimation iteration. Third and fourth columns show the likelihood for numerator and denominator statistics respectively. The fifth column shows the difference in likelihood between numerator and denominator stats.
Training output for bMMI training after the 20th MLE iteration. Each line corresponds to a reestimation iteration. Third and fourth columns show the likelihood for numerator and denominator statistics respectively. The fifth column shows the difference in likelihood between numerator and denominator stats.
Wall Street Journal evaluation
Two evaluation conditions were examined, the Nov'92 5k closed vocabulary and the Nov'92 20k open vocabulary conditions. In both cases non-verbalized pronunciations (NVP) and the standard bigram and trigram LMs were used. The whole SI-284 dataset was used for training as described in the "training section".
All the scripts and acoustic models to reproduce the evaluation results reported in this page can be found in the Bavieca project page at Sourceforge. Acoustic models can be found in the "Files" section, while decoding scripts and configuration files can be found in the Git repository.
5k closed vocabulary task
Decoding scripts for this task can be found in the repository at: /bavieca-code/tasks/wsj/scripts/test/5k
The following listings show the Word Error Rate (WER) for the 5k task under different testing conditions, for several reestimation iterations.
- MLE acoustic models. Speaker and gender independent. Trigram language model.
  iteration   utt   words    acc   sub   del   ins    WER    SER
         18   330    5353   97.2   2.5   0.3   0.6    3.4   33.0
         19   330    5353   97.3   2.4   0.3   0.6    3.3   32.1
         20   330    5353   97.4   2.4   0.3   0.5    3.1   31.2
         21   330    5353   97.5   2.3   0.3   0.5    3.0   30.6
         22   330    5353   97.4   2.3   0.3   0.5    3.1   30.9
         23   330    5353   97.4   2.3   0.3   0.6    3.2   31.8
         24   330    5353   97.4   2.3   0.3   0.6    3.1   31.8
         25   330    5353   97.4   2.3   0.3   0.6    3.1   31.5
         26   330    5353   97.5   2.2   0.3   0.6    3.1   30.6
- MLE acoustic models. Speaker and gender independent. Bigram language model.
  iteration   utt   words    acc   sub   del   ins    WER    SER
         18   330    5353   95.7   3.7   0.6   0.7    5.0   46.4
         19   330    5353   95.7   3.8   0.6   0.7    5.0   45.8
         20   330    5353   95.8   3.7   0.5   0.6    4.8   45.2
         21   330    5353   95.6   3.8   0.6   0.6    5.0   46.4
         22   330    5353   95.7   3.7   0.6   0.7    5.0   46.4
         23   330    5353   95.8   3.7   0.6   0.7    4.9   46.4
         24   330    5353   95.8   3.7   0.6   0.7    4.9   45.5
         25   330    5353   95.8   3.6   0.6   0.6    4.8   45.5
         26   330    5353   95.8   3.6   0.6   0.7    4.8   45.2
- Discriminatively trained models (bMMI). Speaker and gender independent. Trigram language model. Boosting factor: 0.5; constant to compute the Gaussian-specific learning rate: 3.0; tau: 200. bMMI training was started after the 20th MLE training iteration (99006 Gaussian components).
  iteration   utt   words    acc   sub   del   ins    WER    SER
          1   330    5353   97.4   2.3   0.2   0.5    3.0   31.2
          2   330    5353   97.6   2.2   0.2   0.5    2.9   30.0
          3   330    5353   97.6   2.1   0.2   0.5    2.8   28.8
          4   330    5353   97.6   2.2   0.2   0.4    2.9   28.8
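The WER figures in these listings follow the standard definition: substitutions, deletions, and insertions from a Levenshtein alignment against the reference, divided by the number of reference words. A minimal sketch of the computation (scoring tools such as sclite additionally break the errors down per column):

```python
def wer(ref, hyp):
    """Word Error Rate = edit distance(ref, hyp) / len(ref),
    where the edit distance counts substitutions, deletions, insertions."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # i deletions
    for j in range(m + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / n

# One insertion against a 3-word reference: WER = 1/3, i.e. 33.3%
print(round(100 * wer("the cat sat".split(), "the cat sat down".split()), 1))
```

The SER (sentence error rate) column is simpler: the fraction of utterances with at least one word error.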
20k open vocabulary task
Decoding scripts for this task can be found in the repository at: /bavieca-code/tasks/wsj/scripts/test/20k
- MLE acoustic models. Speaker and gender independent. Bigram language model.
  iteration   utt   words    acc   sub   del   ins    WER    SER
         14   333    5643   90.6   8.0   1.3   1.6   11.0   66.1
         15   333    5643   90.8   7.9   1.3   1.6   10.8   65.5
         16   333    5643   91.0   7.7   1.3   1.7   10.7   65.2
         17   333    5643   91.0   7.7   1.3   1.6   10.7   64.3
         18   333    5643   90.9   7.7   1.4   1.6   10.8   64.9
         19   333    5643   91.1   7.6   1.3   1.6   10.6   65.2
         20   333    5643   91.1   7.6   1.3   1.7   10.6   64.0
         21   333    5643   91.0   7.7   1.3   1.7   10.7   64.6
         22   333    5643   91.1   7.6   1.3   1.7   10.6   64.3
         23   333    5643   91.1   7.7   1.3   1.7   10.6   65.2
         24   333    5643   91.1   7.6   1.3   1.7   10.6   64.9
         25   333    5643   91.1   7.7   1.2   1.7   10.6   65.2
         26   333    5643   91.1   7.7   1.2   1.6   10.5   64.6
- MLE acoustic models. Speaker and gender independent. Trigram language model.
  iteration   utt   words    acc   sub   del   ins    WER    SER
         14   333    5643   92.5   6.5   1.0   1.4    8.9   58.6
         15   333    5643   92.8   6.3   0.9   1.3    8.6   56.8
         16   333    5643   92.8   6.3   0.9   1.3    8.5   57.1
         17   333    5643   92.9   6.2   0.9   1.4    8.5   57.1
         18   333    5643   92.7   6.3   1.0   1.4    8.7   59.5
         19   333    5643   92.5   6.4   1.0   1.4    8.9   59.5
         20   333    5643   92.3   6.7   1.0   1.5    9.3   60.7
         21   333    5643   92.5   6.6   0.9   1.5    9.0   59.8
         22   333    5643   92.6   6.5   1.0   1.5    8.9   58.9
         23   333    5643   92.6   6.5   0.9   1.4    8.8   58.6
         24   333    5643   92.7   6.4   0.9   1.5    8.8   58.3
         25   333    5643   92.7   6.5   0.9   1.5    8.8   59.2
         26   333    5643   92.7   6.4   0.9   1.5    8.7   59.2
- Discriminatively trained models (bMMI). Speaker and gender independent. Trigram language model. Boosting factor: 0.5; constant to compute the Gaussian-specific learning rate: 3.0; tau: 200. bMMI training was started after the 17th MLE training iteration (89669 Gaussian components).
  iteration   utt   words    acc   sub   del   ins    WER    SER
          1   333    5643   93.1   6.0   0.9   1.3    8.3   57.1
          2   333    5643   93.2   5.9   0.9   1.3    8.0   57.1
          3   333    5643   93.0   6.2   0.9   1.3    8.3   58.9
          4   333    5643   93.0   6.2   0.9   1.3    8.4   58.6