Overview

The Bavieca toolkit comprises a set of about 25 command line tools that can be used to build very sophisticated large vocabulary speech recognition systems from scratch. Additionally it offers an Application Programming Interface (API) that exposes speech processing features such as speech recognition, speech activity detection, forced alignment, etc. This API is provided as a C++ library that can be used to create stand-alone applications that exploit Bavieca's speech recognition features.

Phonetic symbol set

The phonetic symbol set can contain an arbitrary set of symbols.

Below there is an example of a file containing a phonetic symbol set. Each phonetic symbol must be on a separate line. Symbols written between parentheses are context independent (see the contextclustering tool).
#
# phonetic symbol set
#
AA
AE
AH
AO
AW
AY
B
CH
D
DH
EH
ER
EY
F
G
HH
IH
IY
JH
K
L
M
N
NG
OW
OY
P
R
S
SH
T
TH
UH
UW
V
W
Y
Z
ZH
(SIL)
(_BREATH)
(_COUGH)
(_FP)
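
The format is simple enough to be read directly by applications outside the toolkit. Below is a minimal C++ sketch of such a reader; the PhoneInfo structure, the function name and the file name are illustrative assumptions, not part of Bavieca.

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

// illustrative structure: one entry per phonetic symbol
struct PhoneInfo {
    std::string symbol;         // e.g. "AA" or "SIL"
    bool contextIndependent;    // true if the symbol is written between parentheses
};

// read a phonetic symbol set file: one symbol per line, '#' starts a comment line,
// parentheses mark context-independent symbols
std::vector<PhoneInfo> loadPhoneticSymbolSet(const char *file) {
    std::vector<PhoneInfo> phones;
    std::ifstream in(file);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;
        PhoneInfo phone;
        phone.contextIndependent = (line[0] == '(');
        phone.symbol = phone.contextIndependent ? line.substr(1, line.size() - 2) : line;
        phones.push_back(phone);
    }
    return phones;
}

int main() {
    std::vector<PhoneInfo> phones = loadPhoneticSymbolSet("phoneset.txt");
    std::cout << phones.size() << " phonetic symbols loaded" << std::endl;
    return 0;
}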

Phonetic rules

Phonetic rules serve to express groupings of phones that share similar properties according to some criterion (e.g. fricatives, front vowels, etc.). They are used to guide the procedure of clustering context dependent units (see the contextclustering tool) under the assumption that phones with similar properties are likely to affect the realization of neighboring phones in a similar fashion. During the process of building the decision tree, a cluster of allophones is split into subclusters by testing the applicable phonetic rules and picking the rule that yields the highest likelihood increase.

A good way to see how to define phonetic rules is to look at the example below. The character '#' can be used to write comments. Note that each phonetic symbol in the phonetic symbol set must appear on the right side of at least one rule. However, phonetic symbols typically appear on the right side of many rules.

#
# a few phonetic rules for a few phonetic symbols
#
$CH CH
$AE JH
$F F
$V V
$TH TH
$DH DH
$S S
$SH SH
$Z Z
$ZH ZH
$affricate CH JH
$frontfricative F V TH DH
$centralfricative S SH Z ZH
$fricative $frontfricative $centralfricative $affricate

Pronunciation lexicon

The pronunciation lexicon contains pronunciations for all the words in the vocabulary. A pronunciation for a word is defined by the actual word followed by the sequence of phonemes that express the pronunciation. Alternative pronunciations of a word can be defined by appending '(n)' to the word. This suffix applies once the first pronunciation of the word has been defined (n starts at 2).

The lexicon file serves different purposes in training and recognition. For training, it must contain pronunciations for all the words in the master label file; for recognition, it determines the active vocabulary (i.e. those words that the recognizer can potentially recognize).

Below there is an example of a lexicon file containing alternative pronunciations for some words.

##
## this is a very short pronunciation lexicon
##
<SIL> SIL
<BREATH> _BREATH
<COUGH> _COUGH
<HMM> _HMM
<MMM> _MMM
<UM> _UM
A AH
A(2) EY
AB AE B
AB(2) EY B IY
ABBREVIATION AH B R IY V IY EY SH AH N
ABBREVIATIONS AH B R IY V IY EY SH AH N Z
ABBY AE B IY
ABDUCTS AE B D AH K T S
ABILITY AH B IH L AH T IY
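
The sketch below shows one way such a lexicon could be parsed in C++, including the handling of the '(n)' suffix used for alternative pronunciations; the Lexicon typedef, the function name and the file name are hypothetical, not part of the toolkit.

#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// illustrative only: map each word to its list of pronunciations (each a phone sequence)
typedef std::map<std::string, std::vector<std::vector<std::string> > > Lexicon;

Lexicon loadLexicon(const char *file) {
    Lexicon lexicon;
    std::ifstream in(file);
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty() || line[0] == '#') continue;       // skip comments and blank lines
        std::istringstream ss(line);
        std::string word, phone;
        ss >> word;
        // strip the "(n)" suffix used for alternative pronunciations, e.g. "A(2)" -> "A"
        std::string::size_type par = word.find('(');
        if (par != std::string::npos) word = word.substr(0, par);
        std::vector<std::string> phones;
        while (ss >> phone) phones.push_back(phone);
        lexicon[word].push_back(phones);
    }
    return lexicon;
}

int main() {
    Lexicon lexicon = loadLexicon("lexicon.txt");
    std::cout << lexicon.size() << " words loaded" << std::endl;
    return 0;
}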

Language model

Currently the Bavieca speech recognition toolkit supports language models in two formats: the ARPA format and a binary format. Internally, for efficiency reasons, language models are represented as Finite State Machines (FSMs). Any n-gram order is supported; however, only orders up to fourgrams have been tested.

ARPA format

The ARPA format is a simple and human-readable language model format. There are several freely available language modeling toolkits that can be used to build language models in the ARPA format. Two excellent resources are The CMU Statistical Language Modeling (SLM) Toolkit and the MIT Language Modeling Toolkit. Language models built with these toolkits have been successfully used with Bavieca.

Fragments of a trigram language model in the ARPA format are listed below.

\data\
ngram 1=7001
ngram 2=129859
ngram 3=448367
\1-grams:
-1.601124 </s>
-99 <s> -0.629519
-1.990822 A -0.847622
-4.108753 A'S -0.167362
-4.115170 AB -0.181871
-5.289563 ABBREVIATION -0.048269
-5.590593 ABBREVIATIONS -0.048251
-5.289563 ABBY -0.048249
-4.647300 ABILITY -0.159118
-3.341374 ABLE -0.856884
...
\2-grams:
-2.588038 <s> A -0.594312
-4.355186 <s> A'S -0.035111
-4.360168 <s> AB -0.034363
-3.064455 <s> ABOUT -0.318412
-3.929392 <s> ABOVE -0.037674
-4.298169 <s> ABSORBED -0.034854
-4.581251 <s> ACCELERATING -0.027021
...
\3-grams:
-1.872919 <s> </s> A
-5.586053 <s> </s> A'S
-5.081562 <s> </s> AB
-2.974584 <s> </s> ABOUT
...
-1.750264 CIRCUIT WE CREATED
-1.343804 CIRCUIT WE HAVE
-1.546748 CIRCUIT WE MADE
-1.690600 CIRCUIT WE NEEDED
-1.152217 CIRCUIT WE PUT
-1.640675 CIRCUIT WE SAW
-1.350947 CIRCUIT WE USED

Binary format

The binary format allows much faster loading from disk, usually about an order of magnitude faster than loading a language model in the ARPA format. The binary representation is lossless and encodes the language model as a Finite State Machine. In addition, the binary representation uses about half the space on disk. A language model in ARPA format can be converted to binary format using the lmfsm command line tool.

Master Label File

Master Label Files (MLFs) are used to describe a dataset composed of a series of utterances with corresponding transcriptions. MLFs serve as input to tools that initialize acoustic model parameters or accumulate sufficient statistics to reestimate acoustic model parameters.

Each utterance in the MLF is expressed by a line with a relative path to a feature file (containing features extracted from the utterance) followed by a line for each word in the transcription of the utterance. Each transcription must contain at least one word (or symbol). Typically, Bavieca tools produce absolute paths to feature files by appending relative paths in the MLFs to base folders specified as parameters. This is a flexible mechanism to, for example, use different sets of features with a single MLF.

Below there is an example of an MLF. Note the use of the character '"' at the beginning and end of the relative paths.

"/MS/4/EI291/3/MS_4_3_EI291_KB__01-21-2009_trans/38.fea" BECAUSE UM UH IT'S AT TWENTY AND UH SO YEAH <LAUGH> "/MS/4/EI291/3/MS_4_3_EI291_KB__01-21-2009_trans/128.fea" BECAUSE UM A THERMOMETER <BREATH> IT IS ACTUALLY UM YEAH SO LIKE A PERSON IS IN A THERMOMETER SO THEY DON'T KNOW THE EXACT TEMPERATURE

Features

Feature files are binary files containing a series of feature vectors. These files are produced by the param tool according to a series of parameters defined in a feature configuration file. The characteristics of the features, along with their dimensionality, are not specified in the feature file. They are only available through the feature configuration file used to create them.
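
Because the feature file itself carries no header, any code that reads it must obtain the dimensionality from the feature configuration used to create it. The C++ sketch below illustrates this; it assumes the coefficients are stored as consecutive 4-byte floats, and the function name, file name and 39-coefficient dimensionality are examples rather than fixed properties of the format.

#include <fstream>
#include <iostream>
#include <vector>

// read a headerless binary feature file as a sequence of fixed-size float vectors;
// the dimensionality is not stored in the file and must come from the feature
// configuration file used to create the features
std::vector<std::vector<float> > loadFeatures(const char *file, int dim) {
    std::ifstream in(file, std::ios::binary);
    in.seekg(0, std::ios::end);
    long iBytes = static_cast<long>(in.tellg());
    in.seekg(0, std::ios::beg);
    int iVectors = static_cast<int>(iBytes / (dim * (long)sizeof(float)));
    std::vector<std::vector<float> > features(iVectors, std::vector<float>(dim));
    for (int i = 0; i < iVectors; ++i)
        in.read(reinterpret_cast<char*>(&features[i][0]), dim * sizeof(float));
    return features;
}

int main() {
    // 39 coefficients is an assumption taken from a typical MFCC configuration
    std::vector<std::vector<float> > features = loadFeatures("utterance.fea", 39);
    std::cout << features.size() << " feature vectors read" << std::endl;
    return 0;
}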

Alignment file

Alignment files can be produced by a Viterbi aligner or a Forward-Backward aligner. They can be either in binary or text format.

Text format

Alignment files in text format can be produced by a Viterbi aligner and keep time-alignment information for each word/symbol aligned. This format is intended to be easily readable by humans; nonetheless, it can be used as input to some tools. Below there is an example of an alignment file in text format. Each line represents a phone alignment and the six numbers on the left are the initial and final feature-frames aligned to each of the three HMM-states associated to the phone. The next column is the phonetic symbol ('TH' in the first line). The next column is the acoustic score (log-likelihood) for the phone given the features and acoustic models used. Finally, the last column shows the word in the event that the phone is the initial phone of a word.
0 0 1 2 3 5 TH -158.5670 THANK
6 6 7 9 10 10 AE -125.2262
11 11 12 13 14 15 NG -132.1665
16 17 18 18 19 23 KD -164.0810
24 26 27 28 29 29 Y -142.7290 YOU
30 33 34 37 38 77 UW -559.0276
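
Each line can therefore be read as three (begin, end) frame pairs, a phonetic symbol, an acoustic score and an optional word. The C++ sketch below shows one way to parse it; the PhoneAlignment structure and function are illustrative, not Bavieca data types.

#include <iostream>
#include <sstream>
#include <string>

// illustrative structure for one line of a text-format alignment file
struct PhoneAlignment {
    int stateBegin[3];      // first feature-frame aligned to each of the three HMM-states
    int stateEnd[3];        // last feature-frame aligned to each of the three HMM-states
    std::string phone;      // phonetic symbol, e.g. "TH"
    float score;            // acoustic log-likelihood of the phone
    std::string word;       // word identity, only present for word-initial phones
};

// parse one alignment line, e.g. "0 0 1 2 3 5 TH -158.5670 THANK"
bool parseAlignmentLine(const std::string &line, PhoneAlignment &a) {
    std::istringstream ss(line);
    for (int i = 0; i < 3; ++i)
        if (!(ss >> a.stateBegin[i] >> a.stateEnd[i])) return false;
    if (!(ss >> a.phone >> a.score)) return false;
    ss >> a.word;           // optional last column
    return true;
}

int main() {
    PhoneAlignment a;
    if (parseAlignmentLine("0 0 1 2 3 5 TH -158.5670 THANK", a))
        std::cout << a.phone << " starts the word " << a.word << std::endl;
    return 0;
}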

Binary format

Alignments in binary format keep occupation counts of each HMM-state for each time-slice. In addition to that, in the case of Viterbi alignments (in which there is a hard assignment between time-slices and HMM-states), alignment information is kept for each word/symbol aligned.

Accumulator

Accumulator files keep sufficient statistics to estimate acoustic model parameters (Gaussian distributions). There exist two types of accumulator files: logical accumulators and physical accumulators (see the mlaccumulator tool).

Hypothesis

Hypothesis files contain word hypotheses generated by recognition or rescoring tools. Bavieca supports two types of hypothesis file: trn and ctm, as defined in the SCLITE scoring toolkit.

Below there is an example of a hypothesis file in the trn format. There are hypotheses for four utterances (one per line). The trailing code between parentheses is the utterance identifier, which scoring tools use to match hypothesis-reference pairs.

THE NAIL IS NOT CONNECTED TO THE PAPER CLIP BECAUSE THE MAGNET JUST MADE THE PAPER THE NAIL A MAGNET (ME|1|CV683-2|NON|28_0)
THE MAGNET TO MAKE THE NAIL INTO A MAGNET (ME|1|CV683-2|NON|29_1)
THAT MAGNETS ONLY STICK TO HAVE TWO WIRES AND THEN METAL (ME|1|CV683-2|NON|3_1)
CAUSE THE MAGNET TURNS THE NAIL INTO A MAGNET (ME|1|CV683-2|NON|30_0)

Transcription

Transcription files keep orthographic transcriptions of utterances and are typically used to score recognition hypotheses and compute the word error rate (WER). The WER is defined as the number of edit errors resulting from aligning a hypothesis against a reference file (transcription) divided by the number of words in the reference file; it is the most common metric of speech recognition accuracy. Transcription files in Bavieca are in the trn format, as defined in the SCLITE scoring toolkit.
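
For example, if aligning a hypothesis against a 10-word reference transcription yields one substitution, one deletion and one insertion, the WER is (1 + 1 + 1) / 10 = 30%.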

Below there is an example of a transcription file in the trn format. There are transcriptions for six utterances (one per line). The trailing code between parentheses is the utterance identifier, which scoring tools use to match hypothesis-reference pairs.

GOOD (ME|1|CV232-3|SKH|1)
IT DEPENDS ON IF YOU HAVE ANY OF THOSE YELLOW THINGS (ME|1|CV232-3|SKH|25)
YES (ME|1|CV232-3|SKH|23)
BECAUSE THE WASHER IS SO HEAVY IT MAKES THE OTHER SIDE GO UP (ME|1|CV232-3|SKH|19)
THEY STICK (ME|1|CV232-3|SKH|11)
YES (ME|1|CV232-3|SKH|17)
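
The utterance identifier is simply the text between the trailing parentheses, so matching hypotheses to references only requires extracting that code from each line (this applies both to trn hypothesis files and to trn transcription files). A minimal, purely illustrative C++ sketch:

#include <iostream>
#include <string>

// extract the utterance identifier from a trn line, i.e. the trailing code between
// parentheses (scoring tools such as SCLITE perform this matching themselves)
std::string utteranceId(const std::string &line) {
    std::string::size_type open = line.rfind('(');
    std::string::size_type close = line.rfind(')');
    if (open == std::string::npos || close == std::string::npos || close < open) return "";
    return line.substr(open + 1, close - open - 1);
}

int main() {
    // prints: ME|1|CV232-3|SKH|11
    std::cout << utteranceId("THEY STICK (ME|1|CV232-3|SKH|11)") << std::endl;
    return 0;
}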

Lattice file

Lattice files can be either in binary or text format.

Text format

Decoders and lattice editing tools can output lattices in text format; however, this is exclusively an output format and it is only intended for easy visualization of the lattice.

Example 1: Simple lattice in text format. The first four lines contain mandatory lattice properties: the number of edges in the lattice, the number of time-slices (frames), the number of nodes, and the lattice version.

[edges] 23
[frames] 130
[nodes] 17
[version] 0.1
(5 12 ) 0 48 BYE
(5 10 ) 0 48 BI
(5 3 ) 0 48 BUY
(5 0 ) 0 48 HI
(5 2 ) 0 48 HIGH
(5 16 ) 0 48 BY
(16 4 ) 49 68 <SIL>
(4 7 ) 69 129 <SIL>
(4 8 ) 69 111 <SIL>
(4 14 ) 69 121 <SIL>
(14 7 ) 122 129 IT
(8 15 ) 112 117 THE
(15 7 ) 118 129 <SIL>
(2 13 ) 49 68 <SIL>
(13 7 ) 69 129 <SIL>
(0 1 ) 49 68 <SIL>
(1 7 ) 69 129 <SIL>
(3 6 ) 49 68 <SIL>
(6 7 ) 69 129 <SIL>
(10 9 ) 49 68 <SIL>
(9 7 ) 69 129 <SIL>
(12 11 ) 49 68 <SIL>
(11 7 ) 69 129 <SIL>
(0) 48
(1) 68
(2) 48
(3) 48
(4) 68
(5) -1
(6) 68
(7) 129
(8) 111
(9) 68
(10) 48
(11) 68
(12) 48
(13) 68
(14) 121
(15) 117
(16) 48

Example 2: Graphical representation of a lattice built from its text-format representation. Word identities and word boundaries are depicted for each edge in the lattice.

Example 3: Segment of a lattice in text format with a number of properties. Each edge is followed by the phone alignment (including HMM-state identifiers) of the word attached to the edge.

[am-prob] yes
[bwd-prob] yes
[edges] 477
[frames] 433
[fwd-prob] yes
[hmms] yes
[ip] yes
[lm-prob] yes
[n-gram] trigram
[nodes] 283
[ph-align] yes
[pp] yes
[version] 0.1
(99 123 ) 0 44 <SIL> am=312.042 lm=0 ip=-10 fw=12.0817 bw=-177.123 pp=0.000191005 SIL 0 0 [3448 ] 1 1 [3449 ] 2 44 [3450 ]
(99 124 ) 0 43 <SIL> am=316.765 lm=0 ip=-10 fw=12.2706 bw=-177.147 pp=0.000225194 SIL 0 0 [3448 ] 1 1 [3449 ] 2 43 [3450 ]
(99 125 ) 0 44 <SIL> am=312.042 lm=0 ip=-10 fw=12.0817 bw=-177.334 pp=0.000154648 SIL 0 0 [3448 ] 1 1 [3449 ] 2 44 [3450 ]
(99 43 ) 0 44 <SIL> am=312.042 lm=0 ip=-10 fw=12.0817 bw=-174.922 pp=0.00172563 SIL 0 0 [3448 ] 1 1 [3449 ] 2 44 [3450 ]
(99 282 ) 0 44 <SIL> am=312.042 lm=0 ip=-10 fw=12.0817 bw=-178.44 pp=5.11558e-05 SIL 0 0 [3448 ] 1 1 [3449 ] 2 44 [3450 ]
(99 234 ) 0 44 <SIL> am=312.042 lm=0 ip=-10 fw=12.0817 bw=-169.834 pp=0.279544 SIL 0 0 [3448 ] 1 1 [3449 ] 2 44 [3450 ]
(99 21 ) 0 17 <SIL> am=116.77 lm=0 ip=-10 fw=4.27082 bw=-169.123 pp=0.000230687 SIL 0 0 [3448 ] 1 1 [3449 ] 2 17 [3450 ]
(99 22 ) 0 20 OH am=-166.985 lm=-4.0015 ip=-10 fw=-11.0809 bw=-163.599 pp=1.2449e-08 OW 0 1 [2378 ] 2 2 [2414 ] 3 20 [2424 ]
(99 26 ) 0 42 <SIL> am=317.296 lm=0 ip=-10 fw=12.2918 bw=-177.079 pp=0.000246154 SIL 0 0 [3448 ] 1 1 [3449 ] 2 42 [3450 ]
(99 27 ) 0 49 IT am=-216.481 lm=-1.6863 ip=-10 fw=-10.7455 bw=-168.938 pp=8.35046e-11 IH 0 0 [1509 ] 1 1 [1602 ] 2 2 [1658 ] T 3 3 [2926 ] 4 4 [2973 ] 5 49 [3040 ]
(99 29 ) 0 49 <SIL> am=199.937 lm=0 ip=-10 fw=7.5975 bw=-172.466 pp=0.000226988 SIL 0 42 [3448 ] 43 48 [3449 ] 49 49 [3450 ]
(99 30 ) 0 58 BUT am=-362.411 lm=-1.3901 ip=-10 fw=-16.2866 bw=-160.109 pp=2.23863e-09 B 0 45 [714 ] 46 46 [719 ] 47 47 [747 ] AH 48 48 [283 ] 49 49 [390 ] 50 50 [523 ] T 51 51 [2936 ] 52 53 [2977 ] 54 58 [3047 ]
(99 31 ) 0 49 IF am=-433.365 lm=-2.0184 ip=-10 fw=-19.753 bw=-167.319 pp=5.1672e-14 IH 0 0 [1509 ] 1 1 [1602 ] 2 2 [1647 ] F 3 3 [1333 ] 4 4 [1351 ] 5 49 [1360 ]
(99 33 ) 0 2 OR(2) am=-72.4437 lm=-3.2337 ip=-10 fw=-6.53145 bw=-159.416 pp=7.71802e-05 ER 0 0 [1130 ] 1 1 [1159 ] 2 2 [1190 ]

Binary format

For performance reasons, the binary format is the only input format supported in the latticeeditor tool. This format is intended to be compact, easily extendable and readable by machines.
param

The param tool is used to extract features from raw audio. Feature vectors can then be used to train acoustic models, compute transforms, etc. The table below summarizes its optional and required command line parameters.

parameter | type | description | optional | default value
-cfg[file]feature configuration fileno
-raw[file]file containing samples of raw audio (either '-raw' or '-bat' must be specified)yes
-fea[file]file where features will be written to (either '-fea' or '-bat' must be specified)yes
-bat[file]batch file containing pairs [rawFile featureFile]yes
-wrp[float]warp factor (see Vocal Tract Length Normalization)yes1.0
-nrm[string]cepstral normalization mode
  • 'none': no cepstral normalization
  • 'utterance': utterance-based cepstral normalization
  • 'session': session-based cepstral normalization
yes | utterance
-met[string]cepstral normalization method
  • 'none': no cepstral normalization
  • 'CMN': cepstral mean normalization
  • 'CMVN': cepstral mean variance normalization
yes | CMN
-hlt[bool]whether to halt the batch processing if an error is found yesno

Feature configuration file

This file specifies parameters needed for feature extraction.

parameter | type | description | optional | typical values
waveform.samplingRate[integer]sampling rate of input audio (in Hertz (Hz) or samples per sec). This parameter is typically 16000 Hz, except for telephone speech where a value of 8000 Hz is used. no 8000|16000
waveform.sampleSize[integer]sample size of input audio (bits per sample), currently only 16 bits is supported. no 16
preemphasis[boolean]whether to apply pre-emphasis to the waveform. Pre-emphasis is intended to compensate the high-frequency part suppressed during human speech production. In addition, it can amplify the importance of high-frequency formants. The pre-emphasis coefficient utilized is 0.97. no yes
window.width[integer]size of the analysis window in number of milliseconds. no 20
window.shift[integer]number of milliseconds in between consecutive analysis windows. no 10
window.tapering[string]window tapering function used to multiply each sample in the window of samples in order to ensure the continuity of the first and last points in the window
  • 'none': no window tapering
  • 'Hann': Hann window
  • 'HannModified': modified version of the Hann window so the initial and final samples are not lost
  • 'Hamming': Hamming window
no | Hamming
features.type[string]feature type, currently only Mel Frequency Cepstral Coefficients (MFCC) are supported. no mfcc
filterbank.frequency.min[integer]minimum frequency in Hz used to build the bank of filters. There is no speech information below 100Hz. no 0
filterbank.frequency.max[integer]maximum frequency in Hz used to build the bank of filters. This value cannot be higher than the Nyquist frequency (i.e. half the sampling rate). Additionally little speech information is present above 6800Hz, so that value can be used to exclude high frequency noise from the filterbank analysis. no 8000
filterbank.filters[integer]number of triangular filters in the filterbank used to compute mfcc coefficients.no 20
cepstralCoefficients[integer]total number of static cepstral coefficients, currently only 12 is supported.no 12
energy[boolean]whether to append the signal energy to the vector of static cepstral coefficients. no yes
derivatives.order[integer]order of higher derivatives that will be computed, this value is typically set to 2 so speed and acceleration are computed, a value of 3 can be used for extracting feature vectors whose dimensionality will be later reduced using a linear transform like HLDA or LDA. no 2
derivatives.delta[integer]size of the window (in number of feature vectors) used to compute derivatives. no 2
spliced.size[integer]number of consecutive static feature vectors centered at each time-slice that will be spliced together. This is an alternative method for incorporating dynamic information into the feature vectors. This method may produce better features than computing first and second derivatives when enough training data is available. This parameter can only be specified if the parameter 'derivatives.delta' is absent.yes 9
Below there is an example of a typical configuration file.
# ---------------------------------------
# feature extraction parameters
# ---------------------------------------
waveform.samplingRate = 16000
waveform.sampleSize = 16
dcRemoval = yes
preemphasis = yes
window.width = 20
window.shift = 10
window.tapering = Hamming
features.type = mfcc
filterbank.frequency.min = 0
filterbank.frequency.max = 8000
filterbank.filters = 20
cepstralCoefficients = 12
energy = yes
derivatives.order = 2
derivatives.delta = 2
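
With this configuration, the 16000 Hz waveform is analyzed in 20 ms windows (320 samples) shifted every 10 ms (160 samples), so roughly 100 feature vectors are produced per second of audio. Each static vector contains 13 coefficients (12 cepstral coefficients plus energy) and, with derivatives.order = 2, the final vectors contain 13 x 3 = 39 coefficients (static plus first and second derivatives).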
lmfsm

The lmfsm tool is used to convert a language model in ARPA format to binary format. Language models in binary format enable fast loading times from disk and use less storage space.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-lex[file]pronunciation lexiconno
-lm[file]input language model in ARPA format no
-fsm[file]output language model in binary format no
aligner

The aligner tool is used to align features against a given sequence of words/symbols using a set of acoustic models. It transforms the given sequence of words along with optional symbols like silence or fillers into a graph of phones, which is then transformed into an optimized graph of HMM-states. Finally it aligns the graph of HMM-states against the feature vectors using the Viterbi algorithm and produces a time alignment of the input lexical units.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-lex[file]pronunciation lexiconno
-mod[file]acoustic modelsno
-fea[file]feature file containing features to alignyes
-txt[file]text file containing words and symbols to align (pronunciations for all words and symbols must be included in the lexicon file)yes
-for[string]alignment file format
  • 'binary': binary format
  • 'text': text format
no
-out[file]output alignment fileno
-fof[folder]base-folder containing feature files to align ('-mlf' must be specified) yes
-mlf[file]master label file containing features and words to align toyes
-dir[folder]output directory to store the alignments ('-mlf must be specified')yes
-bat[file]batch file containing entries (featuresFile txtFile alignmentFile)yes
-opt[file]file containing optional symbols that can be inserted at word boundaries. Symbols are only inserted at word boundaries when the insertion increases the alignment's log-likelihood. If no file is specified the silence symbol ('SIL') will be assumed.yes
-pro[boolean]whether to allow alternative pronunciations. Multiple pronunciations of the same word can receive occupation simultaneously. Enabling alternative pronunciations will typically increase the likelihood of the training data, which may or may not result in more discriminative acoustic models.yesno
-bea[float]beam width used for likelihood-based pruning (Viterbi search).yes1000.0
-hlt[boolean]whether to stop the batch processing if an error is found (either '-mlf' or '-bat' must be specified).yesno
hmminitializer

The hmminitializer tool is used to create an initial set of acoustic models. Specifically this tool creates a set of single-Gaussian Hidden Markov Models (HMMs) initialized to the global distribution of the data, or to the HMM-state distributions given by an input set of alignment files. Acoustic models created with this tool can be later refined using the mlaccumulator and mlestimator tools.

parameter | type | description | optional | default value
-cfg[file]feature configuration fileno
-fea[folder] base folder containing feature files needed for the accumulation of statistics (this folder will be prepended to feature filenames found in the MLF)no
-pho[file]phonetic symbol set no
-lex[file]pronunciation lexiconno
-mlf[file]master label file containing features and words to align tono
-met[string]initialization method
  • 'flatStart': Gaussian distributions of all HMM-states are initialized to the global distribution of the data
  • 'alignment': the Gaussian distribution of each HMM-state is initialized to the distribution of the data aligned to the HMM-state
yes | flatStart
-cov[string]covariance modeling type
  • 'diagonal': diagonal covariance (n parameters)
  • 'full': full covariance (n(n+1)/2 parameters)
yes | diagonal
-mod[file]output acoustic modelsno
mlaccumulator

The mlaccumulator tool accumulates sufficient statistics necessary to estimate acoustic model parameters under the Maximum Likelihood Estimation (MLE) criterion. Statistics are accumulated from the training data by aligning feature vectors extracted from the audio against the transcriptions. For each utterance in the master label file, a graph is constructed from its words considering alternative pronunciations (optional) and optional filler symbols. This graph is then aligned against the feature vectors using the forward-backward algorithm and occupation statistics are dumped into the accumulator file.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-mod[file]acoustic models for aligning the datano
-lex[file]pronunciation lexiconno
-fea[folder] base folder containing feature files needed for the accumulation of statistics (this folder will be prepended to feature filenames found in the MLF).no
-cfg[file]feature configuration fileno
-feaA[folder] base folder containing feature files that will be used for accumulation of statistics. This parameter is only used for single pass retraining, and must be specified along with '-cfgA' and '-covA'.yes
-cfgA[file]feature configuration file of features used for statistics accumulation. This parameter is only used for single pass retrainingyes
-covA[string]covariance modeling type used for statistics accumulation (single pass retraining)
  • diagonal: diagonal covariance (n parameters)
  • full: full covariance (n(n+1)/2 parameters)
yes | diagonal
-mlf[file]master label file containing training data for statistics accumulationno
-ww[string]within-word context modeling order for logical accumulators. Possible values are: 'triphones', 'pentaphones', 'heptaphones', 'nonaphones' or 'endecaphones'. If not specified, physical accumulators will be generated.yestriphones
-cw[string]cross-word context modeling order for the accumulators. Possible values are: 'triphones', 'pentaphones', 'heptaphones', 'nonaphones' or 'endecaphones'. If not specified, physical accumulators will be generated. Currently most tools within the Bavieca toolkit only support acoustic models with the same cross-word and within-word order.yestriphones
-opt[file]file containing optional symbols that can be inserted at word boundaries. Symbols are only inserted at word boundaries when the insertion increases the alignment's log-likelihood. If no file is specified the silence symbol ('SIL') will be assumedyes
-pro[boolean]whether to allow alternative pronunciations. Multiple pronunciations of the same word can receive occupation simultaneously. Enabling alternative pronunciations will typically increase the likelihood of the training data, which may or may not result in more discriminative acoustic models.yesno
-fwd[float]beam size used for likelihood-based forward pruning. Forward pruning is safe since the forward pass is performed after the backward pass and therefore the log-likelihood of the whole utterance is known beforehand. yes-20
-bwd[float]beam size used for likelihood-based backward pruning. High values are recommended for accurate estimation of occupation probabilitiesyes800
-tre[int]maximum size (in MBs) allowed when allocating a trellis for the forward-backward algorithm. Utterances that are long enough to exceed this limit will be discarded from the accumulation process and a Warning message will be generated yes500
-dAcc[file]output accumulator fileno
mlestimator

The mlestimator tool reestimates the parameters of a given set of acoustic models using a list of accumulator files containing sufficient statistics. Estimation of acoustic model parameters is carried out under the Maximum Likelihood Estimation (MLE) criterion (for discriminative estimation see dtestimator). After the MLE is carried out, Gaussian covariances are floored using the flooring ratio provided.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-mod[file]input acoustic modelsno
-acc[file]file containing the list of accumulator files that will be used for the estimationno
-cov[float]ratio used to apply covariance flooringyes0.05
-out[file]output acoustic modelsno
mapestimator

The mapestimator tool reestimates the parameters of a given set of acoustic models using a list of accumulator files containing sufficient statistics. A Maximum A Posteriori (MAP) adaptation of the acoustic model parameters is performed, which is typically used for adapting a set of well trained acoustic models to a new domain, environment or even to a particular speaker when enough adaptation data is available.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-mod[file]input acoustic modelsno
-acc[file]file containing the list of accumulator files with the adaptation data (domain data) that will be used for the estimationno
-pkw[float]prior knowledge weightyes2.0
-out[file]output acoustic modelsno
dtaccumulator

The dtaccumulator tool accumulates sufficient statistics necessary to estimate acoustic model parameters under two alternative discriminative criteria: Maximum Mutual Information (MMI) or boosted Maximum Mutual Information (bMMI). Statistics are accumulated from the training data by aligning feature vectors extracted from the audio against the transcriptions. For each utterance in the master label file, a graph is constructed from its words considering alternative pronunciations (optional) and optional filler symbols. This graph is then aligned against the feature vectors using the forward-backward algorithm and occupation statistics are dumped into the accumulator file.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-mod[file]acoustic models for aligning the datano
-lex[file]pronunciation lexiconno
-fea[folder] base folder containing feature files needed for the accumulation of statistics (this folder will be prepended to relative paths found in the MLF)no
-cfg[file]feature configuration fileno
-mlf[file]master label file containing training data for statistics accumulationno
-lat[folder]base folder containing lattice files needed for the accumulation of statisticsno
-ams[float]scale factor applied to acoustic log-likelihoods before doing the forward-backward algorithm to compute lattice occupation (typically set to the inverse of the language model scale factor used during the recognition process that generated the lattices)no
-lms[float]scale factor applied to language model log-likelihoods before doing the forward-backward algorithm to compute lattice occupation (typically set to 1.0)yes1.0
-opt[file]file containing optional symbols that can be inserted at word boundaries. Symbols are only inserted at word boundaries when the insertion increases the alignment's log-likelihood. If no file is specified the silence symbol ('SIL') will be assumedyes
-pro[boolean]whether to allow alternative pronunciations. Multiple pronunciations of the same word can receive occupation simultaneously. Enabling alternative pronunciations will typically increase the likelihood of the training data, which may or may not result in more discriminative acoustic models.yesno
-fwd[float]beam size used for likelihood-based forward pruning. Forward pruning is safe since the forward pass is performed after the backward pass and therefore the log-likelihood of the whole utterance is known beforehand. yes-20
-bwd[float]beam size used for likelihood-based backward pruning. High values are recommended for accurate estimation of occupation probabilitiesyes800
-tre[int]maximum size (in MBs) allowed when allocating a trellis for the forward-backward algorithm. Utterances that are long enough to exceed this limit will be discarded from the accumulation process and a Warning message will be generated yes500
-dAccNum[file]output accumulator file where numerator statistics will be storedno
-dAccDen[file]output accumulator file where denominator statistics will be storedno
-obj[string] objective function
  • 'MMI': Maximum Mutual Information. This discriminative training criterion typically outperforms MLE.
  • 'bMMI': boosted Maximum Mutual Information, it can be seen as a type of large margin discriminative training criterion. It typically produces superior results to those of standard MMI.
yes | MMI
-bst[float]boosting factor used for bMMI yes0.5
-can[boolean]whether to perform cancellation of statistics between numerator and denominator yesyes
dtestimator

The dtestimator tool reestimates the parameters of a given set of acoustic models discriminatively. Unlike the mlestimator tool, this tool needs two sets of accumulators, one with numerator statistics and another one with denominator statistics. Numerator and denominator statistics are typically accumulated over a set of lattices using the dtaccumulator tool. After the discriminative estimation is carried out, Gaussian covariances are floored using the flooring ratio provided.

parameter | type | description | optional | default value
-pho[file]file containing the phonetic symbol set no
-mod[file]file containing input acoustic modelsno
-accNum[file]file containing the list of accumulator files used for the estimation (numerator)no
-accDen[file]file containing the list of accumulator files used for the estimation (denominator)no
-cov[float]ratio used to apply covariance flooringyes0.05
-out[file]output acoustic modelsno
-E[float]learning rate constantyes2.0
-I[string]I-smoothing type
  • 'none': no I-smoothing
  • 'prev': I-smoothing to the previous iteration
no | prev
-tau[float]I-smoothing constantyes100.0
contextclustering

The contextclustering tool refines a given set of context-independent acoustic models by performing context clustering. Context clustering is carried out using logical accumulators obtained from single Gaussian HMMs and decision trees that are either state-specific or global. Decision trees are generated following a standard top-down procedure that iteratively splits the data by applying binary questions using a ML criterion. Questions are asked about the correspondence to phonetic groups (defined by hand-made phonetic rules) and the within-word position (initial, internal and final). The splitting process is governed by two parameters: a minimum occupation count for each leaf and a minimum likelihood increase for each split. Finally, a bottom-up merging process is applied to merge those leaves which, when merged, produce a likelihood decrease below the minimum value used to allow a split.
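
As a rough illustration of the splitting criterion described above (not Bavieca code; the Cluster structure and the numbers are made up), a candidate question is only accepted when both resulting clusters retain enough occupation and the likelihood gain of the split exceeds the minimum gain:

#include <iostream>

// hypothetical placeholder for the statistics of a cluster of allophones
struct Cluster {
    double occupation;     // number of feature frames assigned to the cluster
    double logLikelihood;  // log-likelihood of the data given the cluster's Gaussian
};

// a question is accepted only if both children keep enough occupation ("-occ")
// and the likelihood gain of the split exceeds the minimum gain ("-gan")
bool acceptSplit(const Cluster &parent, const Cluster &left, const Cluster &right,
                 double minOccupation, double minGain) {
    if (left.occupation < minOccupation || right.occupation < minOccupation) return false;
    double gain = (left.logLikelihood + right.logLikelihood) - parent.logLikelihood;
    return gain >= minGain;
}

int main() {
    Cluster parent = {1200.0, -52000.0}, left = {700.0, -29000.0}, right = {500.0, -20500.0};
    // with the default thresholds (-occ 200, -gan 2000) this split is accepted
    std::cout << (acceptSplit(parent, left, right, 200.0, 2000.0) ? "split" : "keep") << std::endl;
    return 0;
}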

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-mod[file]input acoustic modelsno
-rul[file]phonetic rules used for the top-down n-phone clusteringno
-ww[string]within-word context modeling order. Possible values are: 'triphones', 'pentaphones', 'heptaphones', 'nonaphones' or 'endecaphones'. While a few tens of hours are typically enough to get the benefit of triphone context modeling, using pentaphones and above is only helpful when hundreds of hours of training data are used.yestriphones
-cw[string]cross-word context modeling order. Possible values are: 'triphones', 'pentaphones', 'heptaphones', 'nonaphones' or 'endecaphones'.yestriphones
-met[string]clustering method
  • 'local': a decision tree is built for each HMM-state. The clustering starts with all the allophones placed at the top of the tree
  • 'global': a single global decision tree is built for all HMM-states. The clustering starts with all the allophones observed in the training data placed at the top of the tree. This method is substantially slower than local clustering.
yes | local
-mrg[boolean]whether to perform bottom-up merging after the top-down clustering process. Bottom-up merging consists of examining the leaves of the decision tree and merging those leaves for which the resulting likelihood decrease remains below the value specified by '-gan'. It typically results in a more compact set of context modeling units.yesyes
-acc[file]file containing the list of logical accumulator files that will be used for the estimationno
-occ[float]minimum cluster occupation (no cluster of n-phones will be split unless the resulting clusters have an occupation above this value)yes200
-gan[float]minimum likelihood gain to split a cluster of n-phonesyes2000
-out[file]output acoustic modelsno
gmmeditor

The gmmeditor tool is used to refine a set of acoustic models by applying mixture splitting and merging to the mixtures of Gaussian distributions. Gaussian splitting allows for a more detailed modeling of the acoustics while Gaussian merging eliminates Gaussian distributions that become unnecessary. Acoustic model refinement through mixture splitting and merging can be performed after each reestimation iteration using the gmmeditor tool, the original set of HMMs and the accumulated statistics. After applying the gmmeditor tool, GMMs will typically have a variable number of components depending on the data aligned to the HMM-state, which varies across reestimation iterations.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-mod[file]input acoustic modelsno
-acc[file]list of accumulator files needed by the splitting and merging processno
-dbl[boolean]whether to double the number of Gaussian components per mixtureyesno
-inc[int]number of Gaussian components that will be added to the mixture. Note that the resulting number of Gaussian components in the mixture will never exceed two times the original number of componentsyes
-crt[string]criterion used to decide which Gaussian component will be split
  • 'covariance': largest average covariance
  • 'weight': maximum weight (occupation)
yes | covariance
-occ[float]minimum occupation (number of feature vectors aligned to the Gaussian component). Typically at least 100 feature vectors are required to robustly train a Gaussian componentyes100.0
-wgh[float]Gaussian components whose weight falls below this threshold will be merged to the closest component in the mixtureyes0.05
-eps[float]epsilon value used to perturb the mean of a Gaussian component in order to estimate the mean of the resulting pair of Gaussian componentsyes0.00001
-mrg[boolean]whether to perform Gaussian mergingyesyes
-cov[float]ratio used to apply covariance flooringyes0.05
-out[file]output acoustic modelsno
-vrb[boolean]verbose outputyesno
dynamicdecoder

This tool is used to recognize speech in batch mode, which means that the audio to recognize is completely available beforehand as opposed to live mode, in which the samples of audio are captured and passed to the recognition engine in real time. In order to perform speech recognition in live mode see Bavieca's API.

parameter | type | description | optional
-cfg[file]configuration file no
-hyp[file]file where the recognition hypotheses will be storedno
-bat[file]batch file containing entries [rawFile/featureFile utteranceId]no

Dynamic decoder configuration file

This file specifies parameters needed for the dynamicdecoder tool.

parameter | type | description | optional | typical values
input
input.type[string]type of input data that will be fed to the decoder noaudio
feature extraction
feature.configurationFile[file]feature configuration file no
feature.cepstralNormalization.mode[string]cepstral normalization mode
  • 'none': no cepstral normalization
  • 'utterance': utterance-based cepstral normalization
  • 'session': session-based cepstral normalization
no | utterance
feature.cepstralNormalization.method[string] cepstral normalization method
  • 'none': no cepstral normalization
  • 'CMN': cepstral mean normalization
  • 'CMVN': cepstral mean variance normalization
no | CMN
feature.cepstralNormalization.bufferSize[int]size (in number of feature vectors) of the circular buffer used to perform cepstral normalization no[2000-360000]
feature.warpFactor[float]warp factor to be applied during feature extraction no 1.0
feature.transformFile[file]file containing feature transforms that will be applied to extracted features yes
phonetic symbol set
phoneticSymbolSet.file[file]phonetic symbol set
acoustic models
acousticModels.file[file]acoustic models
language model
languageModel.file[file]language model
languageModel.format[string]language model format, language models in binary ('FSM') and 'ARPA' format are supported. See language model formats. no
languageModel.type[string]language model type, currently only 'ngram' language models are supported nongram
languageModel.scalingFactor[float]weight applied to language model log-likelihoods when combined with acoustic scores to compute the most likely recognition path no[10,50]
languageModel.crossUtterance[boolean]whether the language model state will be kept from the end of one utterance to the beginning of the next one no no
pronunciation lexicon
lexicon.file[file]pronunciation lexicon no
insertion penalty
insertionPenalty.standard[float]insertion penalty added to a path score when transitioning to a word. This value is typically negative (penalizing word insertions) although it can also be positive in order to compensate for a heavy language model scale factor. Its optimal value must be determined empirically no [5,25]
insertionPenalty.filler[float]insertion penalty added to a path score when transitioning to a filler symbol (including silence). Its value must be determined empirically. no [0,40]
insertionPenalty.filler.file[file]file containing pairs [fillerSymbol insertionPenalty], insertion penalties defined in this file override the generic filler insertion penalty specified by 'insertionPenalty.filler' yes
Viterbi pruning
pruning.maxActiveArcs[int]maximum number of active arcs (histogram pruning) no [1000,10000]
pruning.maxActiveArcsWE[int]maximum number of active arcs at word ends (histogram pruning) no [500,2000]
pruning.maxActiveTokensArc[int]maximum number of active tokens per arc (histogram pruning) no [5-50]
pruning.likelihoodBeam[float]beam size for likelihood based pruning at all arcs no [50.0-300.0]
pruning.likelihoodBeamWE[float]beam size for likelihood based pruning at word ends no [50.0-250.0]
pruning.likelihoodBeamTokensArc[float]beam size for likelihood based pruning within each active arc no [50.0-200.0]
output
output.bestSinglePath[boolean]whether to output the best recognition path no
output.lattice.folder[folder]folder where lattices will be dumped. If this parameter is not defined the recognizer will not produce lattices. This parameter must be defined in addition to the parameter 'output.lattice.maxWordSequencesState'. yes
output.lattice.maxWordSequencesState[integer]This parameter, whose value must be 2 or higher, controls the lattice depth. The higher the value, the deeper the lattices produced will be, which means that they will contain a higher number of alternative recognition hypotheses. Specifically it is the number of best unique word sequences that are kept at each token during the Viterbi search. yes
output.audio.folder[folder]folder to store raw audio used for recognition yes
output.features.folder[folder]folder to store extracted features yes
output.alignment.folder[folder]folder to store time-alignments of the best recognition paths yes
wfsabuilder

Tool used to build a static decoding network in the form of a Weighted Finite State Acceptor (WFSA). Decoding networks built using this tool can be used for recognition with the wfsadecoder tool. WFSA-based decoding is very fast since all sources of information (acoustic models, pronunciation lexicon and language model) are combined and optimized statically before the actual recognition process, which is therefore substantially simplified. For large language model sizes the process of building a WFSA decoding network can be time and memory intensive, which requires large amounts of physical memory. In those cases the use of the dynamicdecoder tool is recommended.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-mod[file]input acoustic modelsno
-lex[file]pronunciation lexiconno
-lm[file]language model no
-scl[float]scale factor applied to language model log-likelihoods no [10-50]
-ip[float]penalty for inserting a word no [5-25]
-ips[float]penalty for inserting silence no [0-40]
-ipf[file]file containing pairs [fillerSymbol insertionPenalty], a silence insertion penalty defined in this file may override the insertion penalty specified by '-ips' yes
-srg[string]semiring used for weight pushing
  • 'none': no weight pushing
  • 'tropical': tropical semiring
  • 'log': log semiring
yes | log
-net[file]file to store the decoding network built no
wfsadecoder

This tool is a Weighted Finite State Acceptor (WFSA) based speech decoder which, in the same way as the dynamicdecoder tool, is used to process input speech and produce recognition hypotheses. In particular, this tool is used to recognize speech in batch mode, which means that the audio to recognize is completely available beforehand as opposed to live mode, in which the samples of audio are captured and passed to the recognition engine in real time. In order to perform speech recognition in live mode see Bavieca's API.

parameter | type | description | optional
-cfg[file]configuration file no
-hyp[file]file where the recognition hypotheses will be storedno
-bat[file]batch file containing entries [rawFile/featureFile utteranceId]no

WFSA-decoder configuration file

This file specifies configuration parameters for the wfsadecoder tool.

parameter | type | description | optional | typical values
decodingNetwork.file[file]decoding network, which is a WFSA built using the wfsabuilder tool no
input
input.type[string]type of input data that will be fed to the decoder
  • 'audio': raw audio files input formatted as specified in the feature configuration file
  • 'features': feature files
no
feature extraction
feature.configurationFile[file]feature configuration file no
feature.cepstralNormalization.mode[string]cepstral normalization mode
  • 'none': no cepstral normalization
  • 'utterance': utterance-based cepstral normalization
  • 'session': session-based cepstral normalization
no
feature.cepstralNormalization.method[string] cepstral normalization method
  • 'none': no cepstral normalization
  • 'CMN': cepstral mean normalization
  • 'CMVN': cepstral mean variance normalization
no
feature.cepstralNormalization.bufferSize[int]size (in number of feature vectors) of the circular buffer used to perform cepstral normalization no2000-360000
feature.warpFactor[float]warp factor to be applied during feature extraction no
feature.transformFile[file]file containing feature transforms that will be applied to extracted features yes
phonetic symbol set
phoneticSymbolSet.file[file]phonetic symbol set
acoustic models
acousticModels.file[file]acoustic models
pronunciation lexicon
lexicon.file[file]pronunciation lexicon no
Viterbi pruning
pruning.maxActiveStates[int]maximum number of active states (histogram pruning) no 100-20000
pruning.likelihoodBeam[float]beam size for likelihood based pruning at all states no 50.0-300.0
output
output.bestSinglePath[boolean]whether to output the best recognition path no
output.lattice.folder[folder]folder where lattices will be dumped. If this parameter is not defined the recognizer will not produce lattices. This parameter must be defined in addition to the parameter 'output.lattice.maxWordSequencesState'. yes
output.lattice.maxWordSequencesState[integer]This parameter, whose value must be 2 or higher, controls the lattice depth. The higher the value, the deeper the lattices produced will be, which means that they will contain a higher number of alternative recognition hypotheses. Specifically it is the number of best unique word sequences that are kept at each token during the Viterbi search. yes
output.audio.folder[folder]folder to store raw audio used for recognition yes
output.features.folder[folder]folder to store extracted features yes
output.alignment.folder[folder]folder to store time-alignments of the best recognition path yes
sadmodule

This command line tool is used to perform Speech Activity Detection (SAD) over features extracted from an audio file. SAD is useful to spot speech segments within an audio stream for further processing. Since speech recognition can be a time consuming process and SAD usually is not, a typical procedure is to direct recognition only to those segments of audio where speech is detected. This implementation of SAD is based on two Hidden Markov Models (HMMs) with three states each, one HMM for silence and another HMM for speech. The HMMs for silence and speech share the same set of Gaussian distributions, which are drawn from the HMM for silence and the HMMs for speech found in the set of acoustic models passed as a parameter. Viterbi search is used to find the most likely alignment of features to speech and silence and the resulting segmentation is written to a file.

The accuracy of this tool is very sensitive to the following parameters: '-sil', '-sph' and '-pen', whose values should be determined empirically.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-mod[file]acoustic modelsno
-fea[file]file containing features to process no
-sil[int]maximum number of Gaussian components used to build the HMM for silence (-1 for all)no
-sph[int]maximum number of Gaussian components used to build the HMM for speech (-1 for all)no
-pad[int]number of time-slices to pad speech segments with no 10
-pen[float]penalty for transitioning from the silence state to the speech state no
-out[file]output file where the speech/silence segmentation will be written no
hldaestimator

The hldaestimator tool is used to perform Heteroscedastic Linear Discriminant Analysis (HLDA). It consists of estimating a feature transform to decorrelate features and reduce their dimensionality while preserving the most discriminative information. In order to compute the transform, a set of full-covariance acoustic models along with physical accumulators is passed as input. Typically, full-covariance acoustic models are trained on high dimensional feature vectors (from scratch or doing single pass retraining) and then an HLDA transform is estimated to decorrelate the features and reduce their dimensionality. A common scenario consists of training full-covariance acoustic models on feature vectors with 52 coefficients (static features+Δ+ΔΔ+ΔΔΔ) and then estimating an HLDA transform to reduce the dimensionality to 39 coefficients.
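
For instance, with 12 cepstral coefficients plus energy the static vector has 13 coefficients, so static+Δ+ΔΔ+ΔΔΔ gives 13 x 4 = 52 coefficients; applying the default dimensionality reduction of '-red' (13) then yields the usual 52 - 13 = 39 coefficients.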

parameter | type | description | optional | default value
-pho[file]file containing the phonetic symbol set no
-mod[file]acoustic models for the estimationno
-acc[file]file containing the list of physical accumulator files that will be used for the estimationno
-itt[integer]number of iterations for the transform updateyes10
-itp[integer]number of iterations for the parameter updateyes10
-red[integer]dimensionality reduction resulting from the transformationyes13
-out[folder]output acoustic modelsno
vtlestimator

The vtlestimator command line tool performs Maximum Likelihood (ML) based Vocal Tract Length estimation over a set of feature files. Features are extracted for different warp factors (starting with the warp factor specified by '-floor' up to the warp factor specified by '-ceiling', using a step size specified by '-step') and the warp factor that results in the highest overall likelihood is written to the output file. Unvoiced phones and symbols should be excluded from the estimation using the parameter '-fil'.
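
The search itself is a simple one-dimensional grid search, sketched below in C++; the alignmentLikelihood function is a toy placeholder standing in for the feature extraction and alignment that vtlestimator actually performs, and the quadratic it returns is made up for illustration.

#include <cfloat>
#include <cmath>
#include <iostream>

// toy placeholder: pretends the adaptation data is best explained by a warp factor of 0.94;
// the real tool extracts warped features and aligns them against the acoustic models
double alignmentLikelihood(double warpFactor) {
    return -std::pow(warpFactor - 0.94, 2.0);
}

// grid search over warp factors: from '-floor' to '-ceiling' in '-step' increments,
// keeping the factor that yields the highest overall likelihood
double estimateWarpFactor(double warpFloor, double warpCeiling, double step) {
    double bestWarp = warpFloor, bestLikelihood = -DBL_MAX;
    for (double warp = warpFloor; warp <= warpCeiling + 1e-6; warp += step) {
        double likelihood = alignmentLikelihood(warp);
        if (likelihood > bestLikelihood) { bestLikelihood = likelihood; bestWarp = warp; }
    }
    return bestWarp;
}

int main() {
    // default search range of vtlestimator: floor 0.80, ceiling 1.20, step 0.02
    std::cout << "estimated warp factor: " << estimateWarpFactor(0.80, 1.20, 0.02) << std::endl;
    return 0;
}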

parameter | type | description | optional | default value
-cfg[file]feature configuration fileno
-pho[file]phonetic symbol set no
-mod[file]acoustic modelsno
-lex[file]pronunciation lexiconno
-bat[file]batch file containing pairs [rawFile alignmentFile]no
-for[string]format of alignment files passed as input no
-out[file]file to store pairs [rawFile warpFactor]no
-fil[file]list of phones and symbols that will be ignored, typically unvoiced phonemes and filler symbols should be excluded from the vocal tract length estimationyes
-floor[float]warp factor flooryes0.80
-ceiling[float]warp factor ceilingyes1.20
-step[float]warp factor increment among testsyes0.02
-ali[boolean]whether to realign data for each warp factoryesno
-nrm[string]cepstral normalization mode
  • none: no cepstral normalization
  • utterance: utterance-based cepstral normalization
  • session: session-based cepstral normalization
yes | utterance
-met[string]cepstral normalization method
  • none: no cepstral normalization
  • CMN: cepstral mean normalization
  • CMVN: cepstral mean variance normalization
yes | CMN
regtree

The regtree tool builds regression trees that can be used to collect statistics for a form of adaptation called Maximum Likelihood Linear Regression (MLLR). The regression tree is built by performing top-down clustering of the means of all Gaussian distributions present in the acoustic models. Regression trees created using this tool are then passed to the mllrestimator tool to perform the actual MLLR adaptation.

parameter | type | description | optional | default value
-pho[file]file containing the phonetic symbol set no
-mod[file]file containing input acoustic modelsno
-met[string]clustering method
  • 'kMeans': k-means clustering, implies hard assignment of Gaussian means to clusters
  • 'EM': Expectation Maximization clustering, implies soft assignment of Gaussian means to clusters
yes | EM
-rgc[int]number of regression classes (base-classes)yes50
-gau[int]minimum number of Gaussian components per base-classyes1
-out[file]file to store the regression treeno
mllrestimator

The mllrestimator tool performs Maximum Likelihood Linear Regression (MLLR) to adapt the Gaussian distributions of a set of acoustic models. This tool receives as input a baseline set of acoustic models and adaptation data (a set of feature files with corresponding transcriptions, which are typically hand-made transcriptions or recognition hypotheses) and estimates a mean and, optionally, a covariance transform for each regression class. The total number of transforms computed is automatically determined based on the adaptation data and the value of the parameters '-occ' and '-gau'. Estimated transforms can be stored into a file for later utilization using the hmmx tool.

parameter | type | description | optional | default value
-pho[file]file containing the phonetic symbol set no
-mod[file]file containing input acoustic modelsno
-bat[file]batch file containing pairs [featureFile alignmentFile]no
-for[string]format of alignment files passed as input no
-out[file]file to store adapted acoustic modelsyes
-tra[folder]folder to store the transformsyes
-rgt[file]file containing the regression tree (see the regtree tool)no
-occ[int]minimum number of frames needed to compute a transform. Each frame corresponds to one hundredth of a second; typically at least 10 seconds of adaptation data (1000 frames) are needed to reliably estimate a transformyes3500
-gau[int]minimum number of Gaussian distributions with occupation to compute a transformyes1
-bst[boolean]whether to assign all occupation to best scoring Gaussian component. Enabling this option speeds up the estimation process, however it slightly reduces the likelihood increase resulting from the adaptationyesyes
-cov[boolean]whether to compute the covariance transform. If disabled only the mean transform will be computed, which is sufficient for most tasks.yesno
fmllrestimator

The fmllrestimator tool performs feature-space Maximum Likelihood Linear Regression (fMLLR) to adapt a set of feature vectors. This tool receives as input a baseline set of acoustic models and adaptation data (a set of feature files with corresponding transcriptions, which are typically hand-made transcriptions or recognition hypotheses) and estimates a transform. The estimated transform is stored into a file for later utilization using the paramx tool.

parameter | type | description | optional | default value
-pho[file]file containing the phonetic symbol set no
-mod[file]file containing input acoustic modelsno
-bat[file]batch file containing pairs [featureFile alignmentFile]no
-for[string]format of alignment files passed as input no
-tra[file]file to store the computed feature transformno
-bst[boolean]whether to assign all occupation to best scoring Gaussian component. Enabling this option speeds up the estimation process, however it slightly reduces the likelihood increase resulting from the adaptationyesyes
paramx

The paramx tool applies linear and affine transformations to feature vectors. Transformations can be applied to multiple pairs of input and output feature files using the parameter '-bat' or, alternatively, to a single pair using the parameters '-in' and '-out'.

parameter | type | description | optional
-cfg[file]feature configuration fileno
-tra[file]file containing the feature transform to be appliedno
-bat[file]batch file containing pairs [featureFileIn featureFileOut]yes
-in[file]input feature vectorsyes
-out[file]output feature vectorsyes
hmmx

The hmmx tool applies transformations to acoustic model parameters. Transformations are applied to a set of input acoustic models and adapted acoustic models are stored in a file.

parameter | type | description | optional
-pho[file]file containing the phonetic symbol set no
-tra[file]file containing the transform to be appliedno
-rgt[file]file containing the regression tree (see the regtree tool)no
-in[file]input acoustic modelsno
-out[file]output acoustic modelsno
latticeeditor

The latticeeditor tool can be used to perform a wide variety of operations over lattices. It can be used for computing the lattice Word Error Rate, computing lattice-based posterior probabilities and confidence scores, aligning lattices against feature vectors and attaching acoustic log-likelihoods and HMM-state information to their edges, compacting a lattice by merging redundant edges and nodes, rescoring lattices according to a given criterion, attaching language model log-likelihoods to edges in the lattice, and adding paths to lattices. Different operations require different combinations of input parameters.

parameter | type | description | optional | default value
-pho[file]phonetic symbol set no
-lex[file]pronunciation lexiconno
-mod[file]input acoustic modelsyes
-lm[file]language model yes
-bat[file]batch file containing entries whose elements depend on the action to performno
-act[string] action to perform
  • 'wer': compute lattice Word Error Rate (WER), also called oracle. The WER is computed by performing a Viterbi alignment between hypotheses in the lattice and the reference string of words found in the transcription file.
  • 'pp': compute posterior probabilities using the forward-backward algorithm
  • 'align': align the lattice against a set of feature vectors. Acoustic scores are attached to edges in the lattice; HMM-marking is also performed.
  • 'compact': compact the lattice by doing a forward-backward merging of redundant nodes and edges
  • 'rescore': rescore the lattice and produce the best hypothesis according to the rescoring criterion. Rescoring is based on the Dijkstra algorithm.
  • 'lm': attach language model scores to edges in the lattice
  • 'addpath': add a path to the lattice (needed for discriminative training)
  • 'nbest': generate n-best lists from lattices using maximum likelihood or posterior probabilities
no
-trn[file]transcription fileyes
-hyp[file]hypothesis fileyes
-hypf[string]hypothesis file format
  • 'full': full hypothesis format
  • 'trn': trn hypothesis format
yes | trn
-ip[float]penalty for inserting a word, silence or filler yes
-ipf[file]file containing pairs [fillerSymbol insertionPenalty] yes
-ams[float]scale factor applied to acoustic log-likelihoodsyes
-lms[float]scale factor applied to language model log-likelihoodsyes
-res[string]rescoring method
  • 'likelihood': finds the lattice path with the highest likelihood
  • 'pp': finds the lattice path with the highest posterior probability
yes | likelihood
-conf[string]confidence annotation method (applies to '-act pp')
  • 'posteriors': each edge is annotated with its lattice-based posterior probability
  • 'accumulated': each edge is annotated with the sum of posterior probabilities of overlapping edges with the same word identity
  • 'maximum': for each time-slice within the edge the maximum posterior probability of overlapping edges with same word identity is kept, finally the edge is annotated with the maximum per-frame posterior probability
yes | maximum
-map[file]file containing word mappings to be used for computing WERyes
-nbest[int]maximum number of entries (word sequences) in the n-best listsyes 100

Bavieca's Application Programming Interface

Bavieca's API provides an easy way to incorporate speech-recognition capabilities into an application. These capabilities include speech recognition, speech activity detection and forced alignment.

Bavieca's API comprises a relatively small set of functions and data structures, all of which are declared in the header file BaviecaAPI.h. Below there is an example showing how to use the API.

Example

The example below shows a very simple way to use Bavieca's API to recognize speech in live mode. For the sake of simplicity, speech samples in the example are retrieved from a file instead of from the microphone; however, they are fed into the recognition process as if they were captured in real time.



	// initialize API
	const char *strFileConfiguration = "configuration.txt";
	BaviecaAPI baviecaAPI(strFileConfiguration);
	if (baviecaAPI.initialize(INIT_SAD|INIT_ALIGNER|INIT_DECODER) == false) {  
		return -1; 
	} 
	
	// load audio samples
	const char *strFileRaw = "audio.raw";
	ifstream is;
	is.open(strFileRaw,ios::binary);
	if (!is.is_open()) {
		cerr << "unable to open the file: \"\"" << strFileRaw; 
	}	
	is.seekg(0, ios::end);
	int iBytes = is.tellg();
	is.seekg(0, ios::beg);	
	int iSamples = (iBytes/2);
	short *sSamples = new short[iSamples];
	is.read((char*)sSamples,iSamples*2); 
	if (is.fail()) {
		cerr << "error reading from stream at position: " << is.tellg();
	}
	is.close();
	
	// begin utterance processing
	baviecaAPI.decBeginUtterance();
	
	// simulate streaming data (1 second chunks)
	int iSamplesChunk = 16000;
	int iSamplesUsed = 0;
	while((iSamplesUsed+iSamplesChunk) < iSamples) {
		
		// extract features from the audio
		int iFeatures = -1;
		float *fFeatures = baviecaAPI.extractFeatures(sSamples+iSamplesUsed,iSamplesChunk,&iFeatures);
		
		baviecaAPI.decProcess(fFeatures,iFeatures);		
		baviecaAPI.free(fFeatures);
		iSamplesUsed += iSamplesChunk;
	}
	delete [] sSamples;
	
	// print recognition hypothesis
	int iWords = -1;
	const char *strFileHypothesisLattice = NULL;
	WordHypothesisI *wordHypothesis = baviecaAPI.decGetHypothesis(&iWords,strFileHypothesisLattice);
	if (wordHypothesis) {
		for(int i=0 ; i < iWords ; ++i) {
			cout << " [" << wordHypothesis[i].iFrameStart << " ";
			cout << wordHypothesis[i].strWord << " ";
			cout << wordHypothesis[i].iFrameEnd << "] ";
		}
		cout << endl;
	}
	
	baviecaAPI.free(wordHypothesis,iWords);
		
	// end utterance processing
	baviecaAPI.decEndUtterance();
	
	baviecaAPI.uninitialize();