Overview

The Bavieca toolkit comprises a set of about 25 command line tools that can be used to build very sophisticated large vocabulary speech recognition systems from scratch. Additionally it offers an Application Programming Interface (API) that exposes speech processing features such as speech recognition, speech activity detection, forced alignment, etc. This API is provided as a C++ library that can be used to create stand-alone applications that exploit Bavieca's speech recognition features.

Phonetic symbol set

The phonetic symbol set can contain an arbitrary set of symbols.

Below there is an example of a file containing a phonetic symbol set. Each phonetic symbol must be on a separate line. Symbols written between parentheses are context independent (see the contextclustering tool).
#
# phonetic symbol set
#
AA
AE
AH
AO
AW
AY
B
CH
D
DH
EH
ER
EY
F
G
HH
IH
IY
JH
K
L
M
N
NG
OW
OY
P
R
S
SH
T
TH
UH
UW
V
W
Y
Z
ZH
(SIL)
(_BREATH)
(_COUGH)
(_FP)
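
The format is simple enough to be read directly by applications outside the toolkit. Below is a minimal C++ sketch of such a reader; the PhoneInfo structure, the function name and the file name are illustrative assumptions, not part of Bavieca.

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// illustrative structure: one entry per phonetic symbol
struct PhoneInfo {
    std::string symbol;         // e.g. "AA" or "SIL"
    bool contextIndependent;    // true if the symbol is written between parentheses
};

// read a phonetic symbol set file: one symbol per line, '#' starts a comment line,
// parentheses mark context-independent symbols
std::vector<PhoneInfo> loadPhoneticSymbolSet(const char *file) {
    std::vector<PhoneInfo> phones;
    std::ifstream in(file);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;
        PhoneInfo phone;
        phone.contextIndependent = (line[0] == '(');
        phone.symbol = phone.contextIndependent ? line.substr(1, line.size() - 2) : line;
        phones.push_back(phone);
    }
    return phones;
}

int main() {
    std::vector<PhoneInfo> phones = loadPhoneticSymbolSet("phoneset.txt");
    std::cout << phones.size() << " phonetic symbols loaded" << std::endl;
    return 0;
}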

Phonetic rules

Phonetic rules serve to express groupings of phones that share similar properties according to some criterion (e.g. fricatives, front vowels, etc.). They are used to guide the procedure of clustering context dependent units (see the contextclustering tool) under the assumption that phones with similar properties are likely to affect the realization of neighboring phones in a similar fashion. During the process of building the decision tree, a cluster of allophones is split into subclusters by testing the applicable phonetic rules and picking the rule that yields the highest likelihood increase.

A good way to see how to define phonetic rules is to look at the example below. The character '#' can be used to write comments. Note that each phonetic symbol in the phonetic symbol set must appear on the right side of at least one rule. However, phonetic symbols typically appear on the right side of many rules.

#
# a few phonetic rules for a few phonetic symbols
#
$CH CH
$AE JH
$F F
$V V
$TH TH
$DH DH
$S S
$SH SH
$Z Z
$ZH ZH
$affricate CH JH
$frontfricative F V TH DH
$centralfricative S SH Z ZH
$fricative $frontfricative $centralfricative $affricate

Pronunciation lexicon

The pronunciation lexicon contains pronunciations for all the words in the vocabulary. A pronunciation for a word is defined by the actual word followed by the sequence of phonemes that express the pronunciation. Alternative pronunciations of a word can be defined by appending '(n)' to the word. This suffix applies once the first pronunciation of the word has been defined (n starts at 2).

The lexicon file serves different purposes in training and recognition. For training, it must contain pronunciations for all the words in the master label file; for recognition, it determines the active vocabulary (i.e. those words that the recognizer can potentially recognize).

Below there is an example of a lexicon file containing alternative pronunciations for some words.

##
## this is a very short pronunciation lexicon
##
<SIL> SIL
<BREATH> _BREATH
<COUGH> _COUGH
<HMM> _HMM
<MMM> _MMM
<UM> _UM
A AH
A(2) EY
AB AE B
AB(2) EY B IY
ABBREVIATION AH B R IY V IY EY SH AH N
ABBREVIATIONS AH B R IY V IY EY SH AH N Z
ABBY AE B IY
ABDUCTS AE B D AH K T S
ABILITY AH B IH L AH T IY
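
The sketch below shows one way such a lexicon could be parsed in C++, including the handling of the '(n)' suffix used for alternative pronunciations; the Lexicon typedef, the function name and the file name are hypothetical, not part of the toolkit.

#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// illustrative only: map each word to its list of pronunciations (each a phone sequence)
typedef std::map<std::string, std::vector<std::vector<std::string> > > Lexicon;

Lexicon loadLexicon(const char *file) {
    Lexicon lexicon;
    std::ifstream in(file);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;       // skip comments and blank lines
        std::istringstream ss(line);
        std::string word, phone;
        ss >> word;
        // strip the "(n)" suffix used for alternative pronunciations, e.g. "A(2)" -> "A"
        std::string::size_type par = word.find('(');
        if (par != std::string::npos) word = word.substr(0, par);
        std::vector<std::string> phones;
        while (ss >> phone) phones.push_back(phone);
        lexicon[word].push_back(phones);
    }
    return lexicon;
}

int main() {
    Lexicon lexicon = loadLexicon("lexicon.txt");
    std::cout << lexicon.size() << " words loaded" << std::endl;
    return 0;
}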

Language model

Currently the Bavieca speech recognition toolkit supports language models in two formats: the ARPA format and a binary format. Internally, for efficiency reasons, language models are represented as Finite State Machines (FSMs). Any n-gram order is supported; however, only orders up to fourgrams have been tested.

ARPA format

The ARPA format is a simple and human-readable language model format. There are several freely available language modeling toolkits that can be used to build language models in the ARPA format. Two excellent resources are The CMU Statistical Language Modeling (SLM) Toolkit and the MIT Language Modeling Toolkit. Language models built with these toolkits have been successfully used with Bavieca.

Fragments of a trigram language model in the ARPA format are listed below.

\data\
ngram 1=7001
ngram 2=129859
ngram 3=448367
\1-grams:
-1.601124 </s>
-99 <s> -0.629519
-1.990822 A -0.847622
-4.108753 A'S -0.167362
-4.115170 AB -0.181871
-5.289563 ABBREVIATION -0.048269
-5.590593 ABBREVIATIONS -0.048251
-5.289563 ABBY -0.048249
-4.647300 ABILITY -0.159118
-3.341374 ABLE -0.856884
...
\2-grams:
-2.588038 <s> A -0.594312
-4.355186 <s> A'S -0.035111
-4.360168 <s> AB -0.034363
-3.064455 <s> ABOUT -0.318412
-3.929392 <s> ABOVE -0.037674
-4.298169 <s> ABSORBED -0.034854
-4.581251 <s> ACCELERATING -0.027021
...
\3-grams:
-1.872919 <s> </s> A
-5.586053 <s> </s> A'S
-5.081562 <s> </s> AB
-2.974584 <s> </s> ABOUT
...
-1.750264 CIRCUIT WE CREATED
-1.343804 CIRCUIT WE HAVE
-1.546748 CIRCUIT WE MADE
-1.690600 CIRCUIT WE NEEDED
-1.152217 CIRCUIT WE PUT
-1.640675 CIRCUIT WE SAW
-1.350947 CIRCUIT WE USED

Binary format

The binary format allows much faster loading from disk, usually about an order of magnitude faster than loading a language model in the ARPA format. The binary representation is lossless and encodes the language model as a Finite State Machine. In addition, the binary representation uses about half the space on disk. A language model in ARPA format can be converted to binary format using the lmfsm command line tool.

Master Label File

Master Label Files (MLFs) are used to describe a dataset composed of a series of utterances with corresponding transcriptions. MLFs serve as input to tools that initialize acoustic model parameters or accumulate sufficient statistics to reestimate acoustic model parameters.

Each utterance in the MLF is expressed by a line with a relative path to a feature file (containing features extracted from the utterance) followed by a line for each word in the transcription of the utterance. Each transcription must contain at least one word (or symbol). Typically, Bavieca tools produce absolute paths to feature files by appending relative paths in the MLFs to base folders specified as parameters. This is a flexible mechanism to, for example, use different sets of features with a single MLF.

Below there is an example of an MLF. Note the use of the character '"' at the beginning and end of the relative paths.

"/MS/4/EI291/3/MS_4_3_EI291_KB__01-21-2009_trans/38.fea" BECAUSE UM UH IT'S AT TWENTY AND UH SO YEAH <LAUGH> "/MS/4/EI291/3/MS_4_3_EI291_KB__01-21-2009_trans/128.fea" BECAUSE UM A THERMOMETER <BREATH> IT IS ACTUALLY UM YEAH SO LIKE A PERSON IS IN A THERMOMETER SO THEY DON'T KNOW THE EXACT TEMPERATURE

Features

Feature files are binary files containing a series of feature vectors. These files are produced by the param tool according to a series of parameters defined in a feature configuration file. The characteristics of the features, along with their dimensionality, are not specified in the feature file. They are only available through the feature configuration file used to create them.
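
Because the feature file itself carries no header, any code that reads it must obtain the dimensionality from the feature configuration used to create it. The C++ sketch below illustrates this; it assumes the coefficients are stored as consecutive 4-byte floats, and the function name, file name and 39-coefficient dimensionality are examples rather than fixed properties of the format.

#include <fstream>
#include <iostream>
#include <vector>

// read a headerless binary feature file as a sequence of fixed-size float vectors;
// the dimensionality is not stored in the file and must come from the feature
// configuration file used to create the features
std::vector<std::vector<float> > loadFeatures(const char *file, int dim) {
    std::ifstream in(file, std::ios::binary);
    in.seekg(0, std::ios::end);
    long iBytes = static_cast<long>(in.tellg());
    in.seekg(0, std::ios::beg);
    int iVectors = static_cast<int>(iBytes / (dim * (long)sizeof(float)));
    std::vector<std::vector<float> > features(iVectors, std::vector<float>(dim));
    for (int i = 0; i < iVectors; ++i)
        in.read(reinterpret_cast<char*>(&features[i][0]), dim * sizeof(float));
    return features;
}

int main() {
    // 39 coefficients is an assumption taken from a typical MFCC configuration
    std::vector<std::vector<float> > features = loadFeatures("utterance.fea", 39);
    std::cout << features.size() << " feature vectors read" << std::endl;
    return 0;
}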

Alignment file

Alignment files can be produced by a Viterbi aligner or a Forward-Backward aligner. They can be either in binary or text format.

Text format

Alignment files in text format can be produced by a Viterbi aligner and keep time-alignment information for each word/symbol aligned. This format is intended to be easily readable by humans; nonetheless, it can be used as input to some tools. Below there is an example of an alignment file in text format. Each line represents a phone alignment and the six numbers on the left are the initial and final feature-frames aligned to each of the three HMM-states associated to the phone. The next column is the phonetic symbol ('TH' in the first line). The next column is the acoustic score (log-likelihood) for the phone given the features and acoustic models used. Finally, the last column shows the word in the event that the phone is the initial phone of a word.
0 0 1 2 3 5 TH -158.5670 THANK
6 6 7 9 10 10 AE -125.2262
11 11 12 13 14 15 NG -132.1665
16 17 18 18 19 23 KD -164.0810
24 26 27 28 29 29 Y -142.7290 YOU
30 33 34 37 38 77 UW -559.0276
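
Each line can therefore be read as three (begin, end) frame pairs, a phonetic symbol, an acoustic score and an optional word. The C++ sketch below shows one way to parse it; the PhoneAlignment structure and function are illustrative, not Bavieca data types.

#include <iostream>
#include <sstream>
#include <string>

// illustrative structure for one line of a text-format alignment file
struct PhoneAlignment {
    int stateBegin[3];      // first feature-frame aligned to each of the three HMM-states
    int stateEnd[3];        // last feature-frame aligned to each of the three HMM-states
    std::string phone;      // phonetic symbol, e.g. "TH"
    float score;            // acoustic log-likelihood of the phone
    std::string word;       // word identity, only present for word-initial phones
};

// parse one alignment line, e.g. "0 0 1 2 3 5 TH -158.5670 THANK"
bool parseAlignmentLine(const std::string &line, PhoneAlignment &a) {
    std::istringstream ss(line);
    for (int i = 0; i < 3; ++i)
        if (!(ss >> a.stateBegin[i] >> a.stateEnd[i])) return false;
    if (!(ss >> a.phone >> a.score)) return false;
    ss >> a.word;           // optional last column
    return true;
}

int main() {
    PhoneAlignment a;
    if (parseAlignmentLine("0 0 1 2 3 5 TH -158.5670 THANK", a))
        std::cout << a.phone << " starts the word " << a.word << std::endl;
    return 0;
}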

Binary format

Alignments in binary format keep occupation counts of each HMM-state for each time-slice. In addition to that, in the case of Viterbi alignments (in which there is a hard assignment between time-slices and HMM-states), alignment information is kept for each word/symbol aligned.

Accumulator

Accumulator files keep sufficient statistics to estimate acoustic model parameters (Gaussian distributions). There exist two types of accumulator files: logical accumulators and physical accumulators (see the mlaccumulator tool).

Hypothesis

Hypothesis files contain word hypotheses generated by recognition or rescoring tools. Bavieca supports two types of hypothesis file: trn and ctm, as defined in the SCLITE scoring toolkit.

Below there is an example of a hypothesis file in the trn format. There are hypotheses for four utterances (one per line). The trailing code between parentheses is the utterance identifier, which scoring tools use to match hypothesis-reference pairs.

THE NAIL IS NOT CONNECTED TO THE PAPER CLIP BECAUSE THE MAGNET JUST MADE THE PAPER THE NAIL A MAGNET (ME|1|CV683-2|NON|28_0)
THE MAGNET TO MAKE THE NAIL INTO A MAGNET (ME|1|CV683-2|NON|29_1)
THAT MAGNETS ONLY STICK TO HAVE TWO WIRES AND THEN METAL (ME|1|CV683-2|NON|3_1)
CAUSE THE MAGNET TURNS THE NAIL INTO A MAGNET (ME|1|CV683-2|NON|30_0)

Transcription

Transcription files keep orthographic transcriptions of utterances and are typically used to score recognition hypotheses and compute the word error rate (WER). The WER is defined as the number of edit errors resulting from aligning a hypothesis against a reference file (transcription) divided by the number of words in the reference file; it is the most common metric of speech recognition accuracy. Transcription files in Bavieca are in the trn format, as defined in the SCLITE scoring toolkit.
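
For example, if aligning a hypothesis against a 10-word reference transcription yields one substitution, one deletion and one insertion, the WER is (1 + 1 + 1) / 10 = 30%.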

Below there is an example of a transcription file in the trn format. There are transcriptions for six utterances (one per line). The trailing code between parentheses is the utterance identifier, which scoring tools use to match hypothesis-reference pairs.

GOOD (ME|1|CV232-3|SKH|1)
IT DEPENDS ON IF YOU HAVE ANY OF THOSE YELLOW THINGS (ME|1|CV232-3|SKH|25)
YES (ME|1|CV232-3|SKH|23)
BECAUSE THE WASHER IS SO HEAVY IT MAKES THE OTHER SIDE GO UP (ME|1|CV232-3|SKH|19)
THEY STICK (ME|1|CV232-3|SKH|11)
YES (ME|1|CV232-3|SKH|17)
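
The utterance identifier is simply the text between the trailing parentheses, so matching hypotheses to references only requires extracting that code from each line (this applies both to trn hypothesis files and to trn transcription files). A minimal, purely illustrative C++ sketch:

#include <iostream>
#include <string>

// extract the utterance identifier from a trn line, i.e. the trailing code between
// parentheses (scoring tools such as SCLITE perform this matching themselves)
std::string utteranceId(const std::string &line) {
    std::string::size_type open = line.rfind('(');
    std::string::size_type close = line.rfind(')');
    if (open == std::string::npos || close == std::string::npos || close < open) return "";
    return line.substr(open + 1, close - open - 1);
}

int main() {
    // prints: ME|1|CV232-3|SKH|11
    std::cout << utteranceId("THEY STICK (ME|1|CV232-3|SKH|11)") << std::endl;
    return 0;
}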

Lattice file

Lattice files can be either in binary or text format.

Text format

Decoders and lattice editing tools can output lattices in text format; however, this is exclusively an output format and it is only intended for easy visualization of the lattice.

Example 1: Simple lattice in text format. The first four lines contain mandatory lattice properties: the number of edges in the lattice, the number of time-slices (frames), the number of nodes, and the lattice version.

[edges] 23
[frames] 130
[nodes] 17
[version] 0.1
(5 12 ) 0 48 BYE
(5 10 ) 0 48 BI
(5 3 ) 0 48 BUY
(5 0 ) 0 48 HI
(5 2 ) 0 48 HIGH
(5 16 ) 0 48 BY
(16 4 ) 49 68 <SIL>
(4 7 ) 69 129 <SIL>
(4 8 ) 69 111 <SIL>
(4 14 ) 69 121 <SIL>
(14 7 ) 122 129 IT
(8 15 ) 112 117 THE
(15 7 ) 118 129 <SIL>
(2 13 ) 49 68 <SIL>
(13 7 ) 69 129 <SIL>
(0 1 ) 49 68 <SIL>
(1 7 ) 69 129 <SIL>
(3 6 ) 49 68 <SIL>
(6 7 ) 69 129 <SIL>
(10 9 ) 49 68 <SIL>
(9 7 ) 69 129 <SIL>
(12 11 ) 49 68 <SIL>
(11 7 ) 69 129 <SIL>
(0) 48
(1) 68
(2) 48
(3) 48
(4) 68
(5) -1
(6) 68
(7) 129
(8) 111
(9) 68
(10) 48
(11) 68
(12) 48
(13) 68
(14) 121
(15) 117
(16) 48

Example 2: Graphical representation of a lattice built from its text-format representation. Word identities and word boundaries are depicted for each edge in the lattice.

Example 3: Segment of a lattice in text format with a number of properties. Each edge is followed by the phone alignment (including HMM-state identifiers) of the word attached to the edge.

[am-prob] yes
[bwd-prob] yes
[edges] 477
[frames] 433
[fwd-prob] yes
[hmms] yes
[ip] yes
[lm-prob] yes
[n-gram] trigram
[nodes] 283
[ph-align] yes
[pp] yes
[version] 0.1
(99 123 ) 0 44 <SIL> am=312.042 lm=0 ip=-10 fw=12.0817 bw=-177.123 pp=0.000191005 SIL 0 0 [3448 ] 1 1 [3449 ] 2 44 [3450 ]
(99 124 ) 0 43 <SIL> am=316.765 lm=0 ip=-10 fw=12.2706 bw=-177.147 pp=0.000225194 SIL 0 0 [3448 ] 1 1 [3449 ] 2 43 [3450 ]
(99 125 ) 0 44 <SIL> am=312.042 lm=0 ip=-10 fw=12.0817 bw=-177.334 pp=0.000154648 SIL 0 0 [3448 ] 1 1 [3449 ] 2 44 [3450 ]
(99 43 ) 0 44 <SIL> am=312.042 lm=0 ip=-10 fw=12.0817 bw=-174.922 pp=0.00172563 SIL 0 0 [3448 ] 1 1 [3449 ] 2 44 [3450 ]
(99 282 ) 0 44 <SIL> am=312.042 lm=0 ip=-10 fw=12.0817 bw=-178.44 pp=5.11558e-05 SIL 0 0 [3448 ] 1 1 [3449 ] 2 44 [3450 ]
(99 234 ) 0 44 <SIL> am=312.042 lm=0 ip=-10 fw=12.0817 bw=-169.834 pp=0.279544 SIL 0 0 [3448 ] 1 1 [3449 ] 2 44 [3450 ]
(99 21 ) 0 17 <SIL> am=116.77 lm=0 ip=-10 fw=4.27082 bw=-169.123 pp=0.000230687 SIL 0 0 [3448 ] 1 1 [3449 ] 2 17 [3450 ]
(99 22 ) 0 20 OH am=-166.985 lm=-4.0015 ip=-10 fw=-11.0809 bw=-163.599 pp=1.2449e-08 OW 0 1 [2378 ] 2 2 [2414 ] 3 20 [2424 ]
(99 26 ) 0 42 <SIL> am=317.296 lm=0 ip=-10 fw=12.2918 bw=-177.079 pp=0.000246154 SIL 0 0 [3448 ] 1 1 [3449 ] 2 42 [3450 ]
(99 27 ) 0 49 IT am=-216.481 lm=-1.6863 ip=-10 fw=-10.7455 bw=-168.938 pp=8.35046e-11 IH 0 0 [1509 ] 1 1 [1602 ] 2 2 [1658 ] T 3 3 [2926 ] 4 4 [2973 ] 5 49 [3040 ]
(99 29 ) 0 49 <SIL> am=199.937 lm=0 ip=-10 fw=7.5975 bw=-172.466 pp=0.000226988 SIL 0 42 [3448 ] 43 48 [3449 ] 49 49 [3450 ]
(99 30 ) 0 58 BUT am=-362.411 lm=-1.3901 ip=-10 fw=-16.2866 bw=-160.109 pp=2.23863e-09 B 0 45 [714 ] 46 46 [719 ] 47 47 [747 ] AH 48 48 [283 ] 49 49 [390 ] 50 50 [523 ] T 51 51 [2936 ] 52 53 [2977 ] 54 58 [3047 ]
(99 31 ) 0 49 IF am=-433.365 lm=-2.0184 ip=-10 fw=-19.753 bw=-167.319 pp=5.1672e-14 IH 0 0 [1509 ] 1 1 [1602 ] 2 2 [1647 ] F 3 3 [1333 ] 4 4 [1351 ] 5 49 [1360 ]
(99 33 ) 0 2 OR(2) am=-72.4437 lm=-3.2337 ip=-10 fw=-6.53145 bw=-159.416 pp=7.71802e-05 ER 0 0 [1130 ] 1 1 [1159 ] 2 2 [1190 ]

Binary format

For performance reasons, the binary format is the only input format supported in the latticeeditor tool. This format is intended to be compact, easily extendable and readable by machines.
param

The param tool is used to extract features from raw audio. Feature vectors can then be used to train acoustic models, compute transforms, etc. The table below summarizes its optional and required command line parameters.

parameter | type | description | optional | default value
-cfg[file]feature configuration fileno
-raw[file]file containing samples of raw audio (either '-raw' or '-bat' must be specified)yes
-fea[file]file where features will be written to (either '-fea' or '-bat' must be specified)yes
-bat[file]batch file containing pairs [rawFile featureFile]yes
-wrp[float]warp factor (see Vocal Tract Length Normalization)yes1.0
-nrm[string]cepstral normalization mode
  • 'none': no cepstral normalization
  • 'utterance': utterance-based cepstral normalization
  • 'session': session-based cepstral normalization
yes | utterance
-met[string]cepstral normalization method
  • 'none': no cepstral normalization
  • 'CMN': cepstral mean normalization
  • 'CMVN': cepstral mean variance normalization
yes | CMN
-hlt[bool]whether to halt the batch processing if an error is found yesno

Feature configuration file

This file specifies parameters needed for feature extraction.

parameter | type | description | optional | typical values
waveform.samplingRate[integer]sampling rate of input audio (in Hertz (Hz) or samples per sec). This parameter is typically 16000 Hz, except for telephone speech where a value of 8000 Hz is used. no 8000|16000
waveform.sampleSize[integer]sample size of input audio (bits per sample), currently only 16 bits is supported. no 16
preemphasis[boolean]whether to apply pre-emphasis to the waveform. Pre-emphasis is intended to compensate the high-frequency part suppressed during human speech production. In addition, it can amplify the importance of high-frequency formants. The pre-emphasis coefficient utilized is 0.97. no yes
window.width[integer]size of the analysis window in number of milliseconds. no 20
window.shift[integer]number of milliseconds in between consecutive analysis windows. no 10
window.tapering[string]window tapering function used to multiply each sample in the window of samples in order to ensure the continuity of the first and last points in the window
  • 'none': no window tapering
  • 'Hann': Hann window
  • 'HannModified': modified version of the Hann window so the initial and final samples are not lost
  • 'Hamming': Hamming window
no | Hamming
features.type[string]feature type, currently only Mel Frequency Cepstral Coefficients (MFCC) are supported. no mfcc
filterbank.frequency.min[integer]minimum frequency in Hz used to build the bank of filters. There is no speech information below 100Hz. no 0
filterbank.frequency.max[integer]maximum frequency in Hz used to build the bank of filters. This value cannot be higher than the Nyquist frequency (i.e. half the sampling rate). Additionally little speech information is present above 6800Hz, so that value can be used to exclude high frequency noise from the filterbank analysis. no 8000
filterbank.filters[integer]number of triangular filters in the filterbank used to compute mfcc coefficients.no 20
cepstralCoefficients[integer]total number of static cepstral coefficients, currently only 12 is supported.no 12
energy[boolean]whether to append the signal energy to the vector of static cepstral coefficients. no yes
derivatives.order[integer]order of higher derivatives that will be computed, this value is typically set to 2 so speed and acceleration are computed, a value of 3 can be used for extracting feature vectors whose dimensionality will be later reduced using a linear transform like HLDA or LDA. no 2
derivatives.delta[integer]size of the window (in number of feature vectors) used to compute derivatives. no 2
spliced.size[integer]number of consecutive static feature vectors centered at each time-slice that will be spliced together. This is an alternative method for incorporating dynamic information into the feature vectors. This method may produce better features than computing first and second derivatives when enough training data is available. This parameter can only be specified if the parameter 'derivatives.delta' is absent.yes 9
Below there is an example of a typical configuration file.
# ---------------------------------------
# feature extraction parameters
# ---------------------------------------
waveform.samplingRate = 16000
waveform.sampleSize = 16
dcRemoval = yes
preemphasis = yes
window.width = 20
window.shift = 10
window.tapering = Hamming
features.type = mfcc
filterbank.frequency.min = 0
filterbank.frequency.max = 8000
filterbank.filters = 20
cepstralCoefficients = 12
energy = yes
derivatives.order = 2
derivatives.delta = 2
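
With this configuration, the 16000 Hz waveform is analyzed in 20 ms windows (320 samples) shifted every 10 ms (160 samples), so roughly 100 feature vectors are produced per second of audio. Each static vector contains 13 coefficients (12 cepstral coefficients plus energy) and, with derivatives.order = 2, the final vectors contain 13 x 3 = 39 coefficients (static plus first and second derivatives).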
lmfsm

The lmfsm tool is used to convert a language model in ARPA format to binary format. Language models in binary format enable fast loading times from disk and use less storage space.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-lex[file]pronunciation lexiconno
-lm[file]input language model in ARPA format no
-fsm[file]output language model in binary format no
aligner

The aligner tool is used to align features against a given sequence of words/symbols using a set of acoustic models. It transforms the given sequence of words along with optional symbols like silence or fillers into a graph of phones, which is then transformed into an optimized graph of HMM-states. Finally it aligns the graph of HMM-states against the feature vectors using the Viterbi algorithm and produces a time alignment of the input lexical units.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-lex[file]pronunciation lexiconno
-mod[file]acoustic modelsno
-fea[file]feature file containing features to alignyes
-txt[file]text file containing words and symbols to align (pronunciations for all words and symbols must be included in the lexicon file)yes
-for[string]alignment file format
  • 'binary': binary format
  • 'text': text format
no
-out[file]output alignment fileno
-fof[folder]base-folder containing feature files to align ('-mlf' must be specified) yes
-mlf[file]master label file containing features and words to align toyes
-dir[folder]output directory to store the alignments ('-mlf must be specified')yes
-bat[file]batch file containing entries (featuresFile txtFile alignmentFile)yes
-opt[file]file containing optional symbols that can be inserted at word boundaries. Symbols are only inserted at word boundaries when the insertion increases the alignment's log-likelihood. If no file is specified the silence symbol ('SIL') will be assumed.yes
-pro[boolean]whether to allow alternative pronunciations. Multiple pronunciations of the same word can receive occupation simultaneously. Enabling alternative pronunciations will typically increase the likelihood of the training data, which may or may not result in more discriminative acoustic models.yesno
-bea[float]beam width used for likelihood-based pruning (Viterbi search).yes1000.0
-hlt[boolean]whether to stop the batch processing if an error is found (either '-mlf' or '-bat' must be specified).yesno
hmminitializer

The hmminitializer tool is used to create an initial set of acoustic models. Specifically this tool creates a set of single-Gaussian Hidden Markov Models (HMMs) initialized to the global distribution of the data, or to the HMM-state distributions given by an input set of alignment files. Acoustic models created with this tool can be later refined using the mlaccumulator and mlestimator tools.

parameter | type | description | optional | default value
-cfg[file]feature configuration fileno
-fea[folder] base folder containing feature files needed for the accumulation of statistics (this folder will be prepended to feature filenames found in the MLF)no
-pho[file]phonetic symbol set no
-lex[file]pronunciation lexiconno
-mlf[file]master label file containing features and words to align tono
-met[string]initialization method
  • 'flatStart': Gaussian distributions of all HMM-states are initialized to the global distribution of the data
  • 'alignment': the Gaussian distribution of each HMM-state is initialized to the distribution of the data aligned to the HMM-state
yes | flatStart
-cov[string]covariance modeling type
  • 'diagonal': diagonal covariance (n parameters)
  • 'full': full covariance (n(n+1)/2 parameters)
yes | diagonal
-mod[file]output acoustic modelsno
mlaccumulator

The mlaccumulator tool accumulates sufficient statistics necessary to estimate acoustic model parameters under the Maximum Likelihood Estimation (MLE) criterion. Statistics are accumulated from the training data by aligning feature vectors extracted from the audio against the transcriptions. For each utterance in the master label file, a graph is constructed from its words considering alternative pronunciations (optional) and optional filler symbols. This graph is then aligned against the feature vectors using the forward-backward algorithm and occupation statistics are dumped into the accumulator file.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-mod[file]acoustic models for aligning the datano
-lex[file]pronunciation lexiconno
-fea[folder] base folder containing feature files needed for the accumulation of statistics (this folder will be prepended to feature filenames found in the MLF).no
-cfg[file]feature configuration fileno
-feaA[folder] base folder containing feature files that will be used for accumulation of statistics. This parameter is only used for single pass retraining, and must be specified along with '-cfgA' and '-covA'.yes
-cfgA[file]feature configuration file of features used for statistics accumulation. This parameter is only used for single pass retrainingyes
-covA[string]covariance modeling type used for statistics accumulation (single pass retraining)
  • diagonal: diagonal covariance (n parameters)
  • full: full covariance (n(n+1)/2 parameters)
yes | diagonal
-mlf[file]master label file containing training data for statistics accumulationno
-ww[string]within-word context modeling order for logical accumulators. Possible values are: 'triphones', 'pentaphones', 'heptaphones', 'nonaphones' or 'endecaphones'. If not specified, physical accumulators will be generated.yestriphones
-cw[string]cross-word context modeling order for the accumulators. Possible values are: 'triphones', 'pentaphones', 'heptaphones', 'nonaphones' or 'endecaphones'. If not specified, physical accumulators will be generated. Currently most tools within the Bavieca toolkit only support acoustic models with the same cross-word and within-word order.yestriphones
-opt[file]file containing optional symbols that can be inserted at word boundaries. Symbols are only inserted at word boundaries when the insertion increases the alignment's log-likelihood. If no file is specified the silence symbol ('SIL') will be assumedyes
-pro[boolean]whether to allow alternative pronunciations. Multiple pronunciations of the same word can receive occupation simultaneously. Enabling alternative pronunciations will typically increase the likelihood of the training data, which may or may not result in more discriminative acoustic models.yesno
-fwd[float]beam size used for likelihood-based forward pruning. Forward pruning is safe since the forward pass is performed after the backward pass and therefore the log-likelihood of the whole utterance is known beforehand. yes-20
-bwd[float]beam size used for likelihood-based backward pruning. High values are recommended for accurate estimation of occupation probabilitiesyes800
-tre[int]maximum size (in MBs) allowed when allocating a trellis for the forward-backward algorithm. Utterances that are long enough to exceed this limit will be discarded from the accumulation process and a Warning message will be generated yes500
-dAcc[file]output accumulator fileno
mlestimator

The mlestimator tool reestimates the parameters of a given set of acoustic models using a list of accumulator files containing sufficient statistics. Estimation of acoustic model parameters is carried out under the Maximum Likelihood Estimation (MLE) criterion (for discriminative estimation see dtestimator). After the MLE is carried out, Gaussian covariances are floored using the flooring ratio provided.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-mod[file]input acoustic modelsno
-acc[file]file containing the list of accumulator files that will be used for the estimationno
-cov[float]ratio used to apply covariance flooringyes0.05
-out[file]output acoustic modelsno
mapestimator

The mapestimator tool reestimates the parameters of a given set of acoustic models using a list of accumulator files containing sufficient statistics. A Maximum A Posteriori (MAP) adaptation of the acoustic model parameters is performed, which is typically used for adapting a set of well trained acoustic models to a new domain, environment or even to a particular speaker when enough adaptation data is available.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-mod[file]input acoustic modelsno
-acc[file]file containing the list of accumulator files with the adaptation data (domain data) that will be used for the estimationno
-pkw[float]prior knowledge weightyes2.0
-out[file]output acoustic modelsno
dtaccumulator

The dtaccumulator tool accumulates sufficient statistics necessary to estimate acoustic model parameters under two alternative discriminative criteria: Maximum Mutual Information (MMI) or boosted Maximum Mutual Information (bMMI). Statistics are accumulated from the training data by aligning feature vectors extracted from the audio against the transcriptions. For each utterance in the master label file, a graph is constructed from its words considering alternative pronunciations (optional) and optional filler symbols. This graph is then aligned against the feature vectors using the forward-backward algorithm and occupation statistics are dumped into the accumulator file.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-mod[file]acoustic models for aligning the datano
-lex[file]pronunciation lexiconno
-fea[folder] base folder containing feature files needed for the accumulation of statistics (this folder will be prepended to relative paths found in the MLF)no
-cfg[file]feature configuration fileno
-mlf[file]master label file containing training data for statistics accumulationno
-lat[folder]base folder containing lattice files needed for the accumulation of statisticsno
-ams[float]scale factor applied to acoustic log-likelihoods before doing the forward-backward algorithm to compute lattice occupation (typically set to the inverse of the language model scale factor used during the recognition process that generated the lattices)no
-lms[float]scale factor applied to language model log-likelihoods before doing the forward-backward algorithm to compute lattice occupation (typically set to 1.0)yes1.0
-opt[file]file containing optional symbols that can be inserted at word boundaries. Symbols are only inserted at word boundaries when the insertion increases the alignment's log-likelihood. If no file is specified the silence symbol ('SIL') will be assumedyes
-pro[boolean]whether to allow alternative pronunciations. Multiple pronunciations of the same word can receive occupation simultaneously. Enabling alternative pronunciations will typically increase the likelihood of the training data, which may or may not result in more discriminative acoustic models.yesno
-fwd[float]beam size used for likelihood-based forward pruning. Forward pruning is safe since the forward pass is performed after the backward pass and therefore the log-likelihood of the whole utterance is known beforehand. yes-20
-bwd[float]beam size used for likelihood-based backward pruning. High values are recommended for accurate estimation of occupation probabilitiesyes800
-tre[int]maximum size (in MBs) allowed when allocating a trellis for the forward-backward algorithm. Utterances that are long enough to exceed this limit will be discarded from the accumulation process and a Warning message will be generated yes500
-dAccNum[file]output accumulator file where numerator statistics will be storedno
-dAccDen[file]output accumulator file where denominator statistics will be storedno
-obj[string] objective function
  • 'MMI': Maximum Mutual Information. This discriminative training criterion typically outperforms MLE.
  • 'bMMI': boosted Maximum Mutual Information, it can be seen as a type of large margin discriminative training criterion. It typically produces superior results to those of standard MMI.
yes | MMI
-bst[float]boosting factor used for bMMI yes0.5
-can[boolean]whether to perform cancellation of statistics between numerator and denominator yesyes
dtestimator

The dtestimator tool reestimates the parameters of a given set of acoustic models discriminatively. Unlike the mlestimator tool, this tool needs two sets of accumulators, one with numerator statistics and another one with denominator statistics. Numerator and denominator statistics are typically accumulated over a set of lattices using the dtaccumulator tool. After the discriminative estimation is carried out, Gaussian covariances are floored using the flooring ratio provided.

parameter | type | description | optional | default value
-pho[file]file containing the phonetic symbol set no
-mod[file]file containing input acoustic modelsno
-accNum[file]file containing the list of accumulator files used for the estimation (numerator)no
-accDen[file]file containing the list of accumulator files used for the estimation (denominator)no
-cov[float]ratio used to apply covariance flooringyes0.05
-out[file]output acoustic modelsno
-E[float]learning rate constantyes2.0
-I[string]I-smoothing type
  • 'none': no I-smoothing
  • 'prev': I-smoothing to the previous iteration
no | prev
-tau[float]I-smoothing constantyes100.0
contextclustering

The contextclustering tool refines a given set of context-independent acoustic models by performing context clustering. Context clustering is carried out using logical accumulators obtained from single Gaussian HMMs and decision trees that are either state-specific or global. Decision trees are generated following a standard top-down procedure that iteratively splits the data by applying binary questions using a ML criterion. Questions are asked about the correspondence to phonetic groups (defined by hand-made phonetic rules) and the within-word position (initial, internal and final). The splitting process is governed by two parameters: a minimum occupation count for each leaf and a minimum likelihood increase for each split. Finally, a bottom-up merging process is applied to merge those leaves which, when merged, produce a likelihood decrease below the minimum value used to allow a split.
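
As a rough illustration of the splitting criterion described above (not Bavieca code; the Cluster structure and the numbers are made up), a candidate question is only accepted when both resulting clusters retain enough occupation and the likelihood gain of the split exceeds the minimum gain:

#include <iostream>

// hypothetical placeholder for the statistics of a cluster of allophones
struct Cluster {
    double occupation;     // number of feature frames assigned to the cluster
    double logLikelihood;  // log-likelihood of the data given the cluster's Gaussian
};

// a question is accepted only if both children keep enough occupation ("-occ")
// and the likelihood gain of the split exceeds the minimum gain ("-gan")
bool acceptSplit(const Cluster &parent, const Cluster &left, const Cluster &right,
                 double minOccupation, double minGain) {
    if (left.occupation < minOccupation || right.occupation < minOccupation) return false;
    double gain = (left.logLikelihood + right.logLikelihood) - parent.logLikelihood;
    return gain >= minGain;
}

int main() {
    Cluster parent = {1200.0, -52000.0}, left = {700.0, -29000.0}, right = {500.0, -20500.0};
    // with the default thresholds (-occ 200, -gan 2000) this split is accepted
    std::cout << (acceptSplit(parent, left, right, 200.0, 2000.0) ? "split" : "keep") << std::endl;
    return 0;
}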

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-mod[file]input acoustic modelsno
-rul[file]phonetic rules used for the top-down n-phone clusteringno
-ww[string]within-word context modeling order. Possible values are: 'triphones', 'pentaphones', 'heptaphones', 'nonaphones' or 'endecaphones'. While a few tens of hours are typically enough to get the benefit of triphone context modeling, using pentaphones and above is only helpful when hundreds of hours of training data are used.yestriphones
-cw[string]cross-word context modeling order. Possible values are: 'triphones', 'pentaphones', 'heptaphones', 'nonaphones' or 'endecaphones'.yestriphones
-met[string]clustering method
  • 'local': a decision tree is built for each HMM-state. The clustering starts with all the allophones placed at the top of the tree
  • 'global': a single global decision tree is built for all HMM-states. The clustering starts with all the allophones observed in the training data placed at the top of the tree. This method is substantially slower than local clustering.
yes | local
-mrg[boolean]whether to perform bottom-up merging after the top-down clustering process. Bottom-up merging consists of examining the leaves of the decision tree and merging those leaves for which the resulting likelihood decrease remains below the value specified by '-gan'. It typically results in a more compact set of context modeling units.yesyes
-acc[file]file containing the list of logical accumulator files that will be used for the estimationno
-occ[float]minimum cluster occupation (no cluster of n-phones will be split unless the resulting clusters have an occupation above this value)yes200
-gan[float]minimum likelihood gain to split a cluster of n-phonesyes2000
-out[file]output acoustic modelsno
gmmeditor

The gmmeditor tool is used to refine a set of acoustic models by applying mixture splitting and merging to the mixtures of Gaussian distributions. Gaussian splitting allows for a more detailed modeling of the acoustics while Gaussian merging eliminates Gaussian distributions that become unnecessary. Acoustic model refinement through mixture splitting and merging can be performed after each reestimation iteration using the gmmeditor tool, the original set of HMMs and the accumulated statistics. After applying the gmmeditor tool, GMMs will typically have a variable number of components depending on the data aligned to the HMM-state, which varies across reestimation iterations.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-mod[file]input acoustic modelsno
-acc[file]list of accumulator files needed by the splitting and merging processno
-dbl[boolean]whether to double the number of Gaussian components per mixtureyesno
-inc[int]number of Gaussian components that will be added to the mixture. Note that the resulting number of Gaussian components in the mixture will never exceed two times the original number of componentsyes
-crt[string]criterion used to decide which Gaussian component will be split
  • 'covariance': largest average covariance
  • 'weight': maximum weight (occupation)
yes | covariance
-occ[float]minimum occupation (number of feature vectors aligned to the Gaussian component). Typically at least 100 feature vectors are required to robustly train a Gaussian componentyes100.0
-wgh[float]Gaussian components whose weight falls below this threshold will be merged to the closest component in the mixtureyes0.05
-eps[float]epsilon value used to perturb the mean of a Gaussian component in order to estimate the mean of the resulting pair of Gaussian componentsyes0.00001
-mrg[boolean]whether to perform Gaussian mergingyesyes
-cov[float]ratio used to apply covariance flooringyes0.05
-out[file]output acoustic modelsno
-vrb[boolean]verbose outputyesno
dynamicdecoder

This tool is used to recognize speech in batch mode, which means that the audio to recognize is completely available beforehand as opposed to live mode, in which the samples of audio are captured and passed to the recognition engine in real time. In order to perform speech recognition in live mode see Bavieca's API.

parameter | type | description | optional
-cfg[file]configuration file no
-hyp[file]file where the recognition hypotheses will be storedno
-bat[file]batch file containing entries [rawFile/featureFile utteranceId]no

Dynamic decoder configuration file

This file specifies parameters needed for the dynamicdecoder tool.

parameter | type | description | optional | typical values
input
input.type[string]type of input data that will be fed to the decoder noaudio
feature extraction
feature.configurationFile[file]feature configuration file no
feature.cepstralNormalization.mode[string]cepstral normalization mode
  • 'none': no cepstral normalization
  • 'utterance': utterance-based cepstral normalization
  • 'session': session-based cepstral normalization
no | utterance
feature.cepstralNormalization.method[string] cepstral normalization method
  • 'none': no cepstral normalization
  • 'CMN': cepstral mean normalization
  • 'CMVN': cepstral mean variance normalization
no | CMN
feature.cepstralNormalization.bufferSize[int]size (in number of feature vectors) of the circular buffer used to perform cepstral normalization no[2000-360000]
feature.warpFactor[float]warp factor to be applied during feature extraction no 1.0
feature.transformFile[file]file containing feature transforms that will be applied to extracted features yes
phonetic symbol set
phoneticSymbolSet.file[file]phonetic symbol set
acoustic models
acousticModels.file[file]acoustic models
language model
languageModel.file[file]language model
languageModel.format[string]language model format, language models in binary ('FSM') and 'ARPA' format are supported. See language model formats. no
languageModel.type[string]language model type, currently only 'ngram' language models are supported nongram
languageModel.scalingFactor[float]weight applied to language model log-likelihoods when combined with acoustic scores to compute the most likely recognition path no[10,50]
languageModel.crossUtterance[boolean]whether the language model state will be kept from the end of one utterance to the beginning of the next one no no
pronunciation lexicon
lexicon.file[file]pronunciation lexicon no
insertion penalty
insertionPenalty.standard[float]insertion penalty added to a path score when transitioning to a word. This value is typically negative (penalizing word insertions) although it can also be positive in order to compensate for a heavy language model scale factor. Its optimal value must be determined empirically no [5,25]
insertionPenalty.filler[float]insertion penalty added to a path score when transitioning to a filler symbol (including silence). Its value must be determined empirically. no [0,40]
insertionPenalty.filler.file[file]file containing pairs [fillerSymbol insertionPenalty], insertion penalties defined in this file override the generic filler insertion penalty specified by 'insertionPenalty.filler' yes
Viterbi pruning
pruning.maxActiveArcs[int]maximum number of active arcs (histogram pruning) no [1000,10000]
pruning.maxActiveArcsWE[int]maximum number of active arcs at word ends (histogram pruning) no [500,2000]
pruning.maxActiveTokensArc[int]maximum number of active tokens per arc (histogram pruning) no [5-50]
pruning.likelihoodBeam[float]beam size for likelihood based pruning at all arcs no [50.0-300.0]
pruning.likelihoodBeamWE[float]beam size for likelihood based pruning at word ends no [50.0-250.0]
pruning.likelihoodBeamTokensArc[float]beam size for likelihood based pruning within each active arc no [50.0-200.0]
output
output.bestSinglePath[boolean]whether to output the best recognition path no
output.lattice.folder[folder]folder where lattices will be dumped. If this parameter is not defined the recognizer will not produce lattices. This parameter must be defined in addition to the parameter 'output.lattice.maxWordSequencesState'. yes
output.lattice.maxWordSequencesState[integer]This parameter, whose value must be 2 or higher, controls the lattice depth. The higher the value, the deeper the lattices produced will be, which means that they will contain a higher number of alternative recognition hypotheses. Specifically it is the number of best unique word sequences that are kept at each token during the Viterbi search. yes
output.audio.folder[folder]folder to store raw audio used for recognition yes
output.features.folder[folder]folder to store extracted features yes
output.alignment.folder[folder]folder to store time-alignments of the best recognition paths yes
wfsabuilder

Tool used to build a static decoding network in the form of a Weighted Finite State Acceptor (WFSA). Decoding networks built using this tool can be used for recognition with the wfsadecoder tool. WFSA-based decoding is very fast since all sources of information (acoustic models, pronunciation lexicon and language model) are combined and optimized statically before the actual recognition process, which is therefore substantially simplified. For large language model sizes the process of building a WFSA decoding network can be time and memory intensive, which requires large amounts of physical memory. In those cases the use of the dynamicdecoder tool is recommended.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-mod[file]input acoustic modelsno
-lex[file]pronunciation lexiconno
-lm[file]language model no
-scl[float]scale factor applied to language model log-likelihoods no [10-50]
-ip[float]penalty for inserting a word no [5-25]
-ips[float]penalty for inserting silence no [0-40]
-ipf[file]file containing pairs [fillerSymbol insertionPenalty], a silence insertion penalty defined in this file may override the insertion penalty specified by '-ips' yes
-srg[string]semiring used for weight pushing
  • 'none': no weight pushing
  • 'tropical': tropical semiring
  • 'log': log semiring
yes | log
-net[file]file to store the decoding network built no
wfsadecoder

This tool is a Weighted Finite State Acceptor (WFSA) based speech decoder which, in the same way as the dynamicdecoder tool, is used to process input speech and produce recognition hypotheses. In particular, this tool is used to recognize speech in batch mode, which means that the audio to recognize is completely available beforehand as opposed to live mode, in which the samples of audio are captured and passed to the recognition engine in real time. In order to perform speech recognition in live mode see Bavieca's API.

parameter | type | description | optional
-cfg[file]configuration file no
-hyp[file]file where the recognition hypotheses will be storedno
-bat[file]batch file containing entries [rawFile/featureFile utteranceId]no

WFSA-decoder configuration file

This file specifies configuration parameters for the wfsadecoder tool.

parameter | type | description | optional | typical values
decodingNetwork.file[file]decoding network, which is a WFSA built using the wfsabuilder tool no
input
input.type[string]type of input data that will be fed to the decoder
  • 'audio': raw audio files input formatted as specified in the feature configuration file
  • 'features': feature files
no
feature extraction
feature.configurationFile[file]feature configuration file no
feature.cepstralNormalization.mode[string]cepstral normalization mode
  • 'none': no cepstral normalization
  • 'utterance': utterance-based cepstral normalization
  • 'session': session-based cepstral normalization
no
feature.cepstralNormalization.method[string] cepstral normalization method
  • 'none': no cepstral normalization
  • 'CMN': cepstral mean normalization
  • 'CMVN': cepstral mean variance normalization
no
feature.cepstralNormalization.bufferSize[int]size (in number of feature vectors) of the circular buffer used to perform cepstral normalization no2000-360000
feature.warpFactor[float]warp factor to be applied during feature extraction no
feature.transformFile[file]file containing feature transforms that will be applied to extracted features yes
phonetic symbol set
phoneticSymbolSet.file[file]phonetic symbol set
acoustic models
acousticModels.file[file]acoustic models
pronunciation lexicon
lexicon.file[file]pronunciation lexicon no
Viterbi pruning
pruning.maxActiveStates[int]maximum number of active states (histogram pruning) no 100-20000
pruning.likelihoodBeam[float]beam size for likelihood based pruning at all states no 50.0-300.0
output
output.bestSinglePath[boolean]whether to output the best recognition path no
output.lattice.folder[folder]folder where lattices will be dumped. If this parameter is not defined the recognizer will not produce lattices. This parameter must be defined in addition to the parameter 'output.lattice.maxWordSequencesState'. yes
output.lattice.maxWordSequencesState[integer]This parameter, whose value must be 2 or higher, controls the lattice depth. The higher the value, the deeper the lattices produced will be, which means that they will contain a higher number of alternative recognition hypotheses. Specifically it is the number of best unique word sequences that are kept at each token during the Viterbi search. yes
output.audio.folder[folder]folder to store raw audio used for recognition yes
output.features.folder[folder]folder to store extracted features yes
output.alignment.folder[folder]folder to store time-alignments of the best recognition path yes
sadmodule

This command line tool is used to perform Speech Activity Detection (SAD) over features extracted from an audio file. SAD is useful to spot speech segments within an audio stream for further processing. Since speech recognition can be a time consuming process and SAD usually is not, a typical procedure is to direct recognition only to those segments of audio where speech is detected. This implementation of SAD is based on two Hidden Markov Models (HMMs) with three states each, one HMM for silence and another HMM for speech. The HMMs for silence and speech share the same set of Gaussian distributions, which are drawn from the HMM for silence and the HMMs for speech found in the set of acoustic models passed as a parameter. Viterbi search is used to find the most likely alignment of features to speech and silence and the resulting segmentation is written to a file.

The accuracy of this tool is very sensitive to the following parameters: '-sil', '-sph' and '-pen', whose values should be determined empirically.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-mod[file]acoustic modelsno
-fea[file]file containing features to process no
-sil[int]maximum number of Gaussian components used to build the HMM for silence (-1 for all)no
-sph[int]maximum number of Gaussian components used to build the HMM for speech (-1 for all)no
-pad[int]number of time-slices to pad speech segments with no 10
-pen[float]penalty for transitioning from the silence state to the speech state no
-out[file]output file where the speech/silence segmentation will be written no
hldaestimator

The hldaestimator tool is used to perform Heteroscedastic Linear Discriminant Analysis (HLDA). It consists of estimating a feature transform to decorrelate features and reduce their dimensionality while preserving the most discriminative information. In order to compute the transform, a set of full-covariance acoustic models along with physical accumulators is passed as input. Typically, full-covariance acoustic models are trained on high dimensional feature vectors (from scratch or doing single pass retraining) and then an HLDA transform is estimated to decorrelate the features and reduce their dimensionality. A common scenario consists of training full-covariance acoustic models on feature vectors with 52 coefficients (static features+Δ+ΔΔ+ΔΔΔ) and then estimating an HLDA transform to reduce the dimensionality to 39 coefficients.
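
For instance, with 12 cepstral coefficients plus energy the static vector has 13 coefficients, so static+Δ+ΔΔ+ΔΔΔ gives 13 x 4 = 52 coefficients; applying the default dimensionality reduction of '-red' (13) then yields the usual 52 - 13 = 39 coefficients.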

parameter | type | description | optional | default value
-pho[file]file containing the phonetic symbol set no
-mod[file]acoustic models for the estimationno
-acc[file]file containing the list of physical accumulator files that will be used for the estimationno
-itt[integer]number of iterations for the transform updateyes10
-itp[integer]number of iterations for the parameter updateyes10
-red[integer]dimensionality reduction resulting from the transformationyes13
-out[folder]output acoustic modelsno
vtlestimator

The vtlestimator command line tool performs Maximum Likelihood (ML) based Vocal Tract Length estimation over a set of feature files. Features are extracted for different warp factors (starting with the warp factor specified by '-floor' up to the warp factor specified by '-ceiling', using a step size specified by '-step') and the warp factor that results in the highest overall likelihood is written to the output file. Unvoiced phones and symbols should be excluded from the estimation using the parameter '-fil'.
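
The search itself is a simple one-dimensional grid search, sketched below in C++; the alignmentLikelihood function is a toy placeholder standing in for the feature extraction and alignment that vtlestimator actually performs, and the quadratic it returns is made up for illustration.

#include <cfloat>
#include <cmath>
#include <iostream>

// toy placeholder: pretends the adaptation data is best explained by a warp factor of 0.94;
// the real tool extracts warped features and aligns them against the acoustic models
double alignmentLikelihood(double warpFactor) {
    return -std::pow(warpFactor - 0.94, 2.0);
}

// grid search over warp factors: from '-floor' to '-ceiling' in '-step' increments,
// keeping the factor that yields the highest overall likelihood
double estimateWarpFactor(double warpFloor, double warpCeiling, double step) {
    double bestWarp = warpFloor, bestLikelihood = -DBL_MAX;
    for (double warp = warpFloor; warp <= warpCeiling + 1e-6; warp += step) {
        double likelihood = alignmentLikelihood(warp);
        if (likelihood > bestLikelihood) { bestLikelihood = likelihood; bestWarp = warp; }
    }
    return bestWarp;
}

int main() {
    // default search range of vtlestimator: floor 0.80, ceiling 1.20, step 0.02
    std::cout << "estimated warp factor: " << estimateWarpFactor(0.80, 1.20, 0.02) << std::endl;
    return 0;
}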

parameter | type | description | optional | default value
-cfg[file]feature configuration fileno
-pho[file]phonetic symbol set no
-mod[file]acoustic modelsno
-lex[file]pronunciation lexiconno
-bat[file]batch file containing pairs [rawFile alignmentFile]no
-for[string]format of alignment files passed as input no
-out[file]file to store pairs [rawFile warpFactor]no
-fil[file]list of phones and symbols that will be ignored, typically unvoiced phonemes and filler symbols should be excluded from the vocal tract length estimationyes
-floor[float]warp factor flooryes0.80
-ceiling[float]warp factor ceilingyes1.20
-step[float]warp factor increment among testsyes0.02
-ali[boolean]whether to realign data for each warp factoryesno
-nrm[string]cepstral normalization mode
  • none: no cepstral normalization
  • utterance: utterance-based cepstral normalization
  • session: session-based cepstral normalization
yes | utterance
-met[string]cepstral normalization method
  • none: no cepstral normalization
  • CMN: cepstral mean normalization
  • CMVN: cepstral mean variance normalization
yes | CMN
regtree

The regtree tool builds regression trees that can be used to collect statistics for a form of adaptation called Maximum Likelihood Linear Regression (MLLR). The regression tree is built by performing top-down clustering of the means of all Gaussian distributions present in the acoustic models. Regression trees created using this tool are then passed to the mllrestimator tool to perform the actual MLLR adaptation.

parameter | type | description | optional | default value
-pho[file]file containing the phonetic symbol set no
-mod[file]file containing input acoustic modelsno
-met[string]clustering method
  • 'kMeans': k-means clustering, implies hard assignment of Gaussian means to clusters
  • 'EM': Expectation Maximization clustering, implies soft assignment of Gaussian means to clusters
yes | EM
-rgc[int]number of regression classes (base-classes)yes50
-gau[int]minimum number of Gaussian components per base-classyes1
-out[file]file to store the regression treeno
mllrestimator

The mllrestimator tool performs Maximum Likelihood Linear Regression (MLLR) to adapt the Gaussian distributions of a set of acoustic models. This tool receives as input a baseline set of acoustic models and adaptation data (a set of feature files with corresponding transcriptions, which are typically hand-made transcriptions or recognition hypotheses) and estimates a mean and, optionally, a covariance transform for each regression class. The total number of transforms computed is automatically determined based on the adaptation data and the value of the parameters '-occ' and '-gau'. Estimated transforms can be stored into a file for later utilization using the hmmx tool.

parameter | type | description | optional | default value
-pho[file]file containing the phonetic symbol set no
-mod[file]file containing input acoustic modelsno
-bat[file]batch file containing pairs [featureFile alignmentFile]no
-for[string]format of alignment files passed as input no
-out[file]file to store adapted acoustic modelsyes
-tra[folder]folder to store the transformsyes
-rgt[file]file containing the regression tree (see the regtree tool)no
-occ[int]minimum number of frames needed to compute a transform. Each frame corresponds to one hundredth of a second; typically at least 10 seconds of adaptation data (1000 frames) are needed to reliably estimate a transformyes3500
-gau[int]minimum number of Gaussian distributions with occupation to compute a transformyes1
-bst[boolean]whether to assign all occupation to best scoring Gaussian component. Enabling this option speeds up the estimation process, however it slightly reduces the likelihood increase resulting from the adaptationyesyes
-cov[boolean]whether to compute the covariance transform. If disabled only the mean transform will be computed, which is sufficient for most tasks.yesno
fmllrestimator

The fmllrestimator tool performs feature-space Maximum Likelihood Linear Regression (fMLLR) to adapt a set of feature vectors. This tool receives as input a baseline set of acoustic models and adaptation data (a set of feature files with corresponding transcriptions, which are typically hand-made transcriptions or recognition hypotheses) and estimates a transform. The estimated transform is stored into a file for later utilization using the paramx tool.

parameter | type | description | optional | default value
-pho[file]file containing the phonetic symbol set no
-mod[file]file containing input acoustic modelsno
-bat[file]batch file containing pairs [featureFile alignmentFile]no
-for[string]format of alignment files passed as input no
-tra[file]file to store the computed feature transformno
-bst[boolean]whether to assign all occupation to best scoring Gaussian component. Enabling this option speeds up the estimation process, however it slightly reduces the likelihood increase resulting from the adaptationyesyes
paramx

The paramx tool applies linear and affine transformations to feature vectors. Transformations can be applied to multiple pairs of input and output feature files using the parameter '-bat' or, alternatively, to a single pair using the parameters '-in' and '-out'.

parameter | type | description | optional
-cfg[file]feature configuration fileno
-tra[file]file containing the feature transform to be appliedno
-bat[file]batch file containing pairs [featureFileIn featureFileOut]yes
-in[file]input feature vectorsyes
-out[file]output feature vectorsyes
hmmx

The hmmx tool applies transformations to acoustic model parameters. Transformations are applied to a set of input acoustic models and adapted acoustic models are stored in a file.

parameter | type | description | optional
-pho[file]file containing the phonetic symbol set no
-tra[file]file containing the transform to be appliedno
-rgt[file]file containing the regression tree (see the regtree tool)no
-in[file]input acoustic modelsno
-out[file]output acoustic modelsno
latticeeditor

The latticeeditor tool can be used to perform a wide variety of operations over lattices. It can be used for computing the lattice Word Error Rate, computing lattice-based posterior probabilities and confidence scores, aligning lattices against feature vectors and attaching acoustic log-likelihoods and HMM-state information to their edges, compacting a lattice by merging redundant edges and nodes, rescoring lattices according to a given criterion, attaching language model log-likelihoods to edges in the lattice, and adding paths to lattices. Different operations require different combinations of input parameters.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-lex[file]pronunciation lexiconno
-mod[file]input acoustic modelsyes
-lm[file]language model yes
-bat[file]batch file containing entries whose elements depend on the action to performno
-act[string] action to perform
  • 'wer': compute lattice Word Error Rate (WER), also called oracle. The WER is computed by performing a Viterbi alignment between hypotheses in the lattice and the reference string of words found in the transcription file.
  • 'pp': compute posterior probabilities using the forward-backward algorithm
  • 'align': align the lattice against a set of feature vectors. Acoustic scores are attached to edges in the lattice; HMM-marking is also performed.
  • 'compact': compact the lattice by doing a forward-backward merging of redundant nodes and edges
  • 'rescore': rescore the lattice and produce the best hypothesis according to the rescoring criterion. Rescoring is based on the Dijkstra algorithm.
  • 'lm': attach language model scores to edges in the lattice
  • 'addpath': add a path to the lattice (needed for discriminative training)
  • 'nbest': generate n-best lists from lattices using maximum likelihood or posterior probabilities
no
-trn[file]transcription fileyes
-hyp[file]hypothesis fileyes
-hypf[string]hypothesis file format
  • 'full': full hypothesis format
  • 'trn': trn hypothesis format
yes | trn
-ip[float]penalty for inserting a word, silence or filler yes
-ipf[file]file containing pairs [fillerSymbol insertionPenalty] yes
-ams[float]scale factor applied to acoustic log-likelihoodsyes
-lms[float]scale factor applied to language model log-likelihoodsyes
-res[string]rescoring method
  • 'likelihood': finds the lattice path with the highest likelihood
  • 'pp': finds the lattice path with the highest posterior probability
yes | likelihood
-conf[string]confidence annotation method (applies to '-act pp')
  • 'posteriors': each edge is annotated with its lattice-based posterior probability
  • 'accumulated': each edge is annotated with the sum of posterior probabilities of overlapping edges with the same word identity
  • 'maximum': for each time-slice within the edge the maximum posterior probability of overlapping edges with same word identity is kept, finally the edge is annotated with the maximum per-frame posterior probability
yes | maximum
-map[file]file containing word mappings to be used for computing WERyes
-nbest[int]maximum number of entries (word sequences) in the n-best listsyes 100

Bavieca's Application Programming Interface

Bavieca's API provides an easy way to incorporate speech-recognition capabilities into an application. These capabilities include speech recognition, speech activity detection and forced alignment.

Bavieca's API comprises a relatively small set of functions and data structures, all of which are declared in the header file BaviecaAPI.h. Below there is an example showing how to use the API.

Example

The example below shows a very simple way to use Bavieca's API to recognize speech in live mode. For the sake of simplicity, speech samples in the example are retrieved from a file instead of from the microphone; however, they are fed into the recognition process as if they were captured in real time.



	// initialize API
	const char *strFileConfiguration = "configuration.txt";
	BaviecaAPI baviecaAPI(strFileConfiguration);
	if (baviecaAPI.initialize(INIT_SAD|INIT_ALIGNER|INIT_DECODER) == false) {  
		return -1; 
	} 
	
	// load audio samples
	const char *strFileRaw = "audio.raw";
	ifstream is;
	is.open(strFileRaw,ios::binary);
	if (!is.is_open()) {
		cerr << "unable to open the file: \"\"" << strFileRaw; 
	}	
	is.seekg(0, ios::end);
	int iBytes = is.tellg();
	is.seekg(0, ios::beg);	
	int iSamples = (iBytes/2);
	short *sSamples = new short[iSamples];
	is.read((char*)sSamples,iSamples*2); 
	if (is.fail()) {
		cerr << "error reading from stream at position: " << is.tellg();
	}
	is.close();
	
	// begin utterance processing
	baviecaAPI.decBeginUtterance();
	
	// simulate streaming data (1 second chunks)
	int iSamplesChunk = 16000;
	int iSamplesUsed = 0;
	while((iSamplesUsed+iSamplesChunk) < iSamples) {
		
		// extract features from the audio
		int iFeatures = -1;
		float *fFeatures = baviecaAPI.extractFeatures(sSamples+iSamplesUsed,iSamplesChunk,&iFeatures);
		
		baviecaAPI.decProcess(fFeatures,iFeatures);		
		baviecaAPI.free(fFeatures);
		iSamplesUsed += iSamplesChunk;
	}
	delete [] sSamples;
	
	// print recognition hypothesis
	int iWords = -1;
	const char *strFileHypothesisLattice = NULL;
	WordHypothesisI *wordHypothesis = baviecaAPI.decGetHypothesis(&iWords,strFileHypothesisLattice);
	if (wordHypothesis) {
		for(int i=0 ; i < iWords ; ++i) {
			cout << " [" << wordHypothesis[i].iFrameStart << " ";
			cout << wordHypothesis[i].strWord << " ";
			cout << wordHypothesis[i].iFrameEnd << "] ";
		}
		cout << endl;
	}
	
	baviecaAPI.free(wordHypothesis,iWords);
		
	// end utterance processing
	baviecaAPI.decEndUtterance();
	
	baviecaAPI.uninitialize();