Overview
The Bavieca toolkit comprises a set of about 25 command line tools that can be used to build sophisticated large vocabulary speech recognition systems from scratch. Additionally, it offers an Application Programming Interface (API) that exposes speech processing features such as speech recognition, speech activity detection, forced alignment, etc. This API is provided as a C++ library that can be used to create stand-alone applications that exploit Bavieca's speech recognition features.
Phonetic symbol set
The phonetic symbol set can contain any set of symbols.
Below there is an example of a file containing a phonetic symbol set. Each phonetic symbol must be on a separate line. Symbols between parentheses are context independent (see the contextclustering tool).
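A minimal hypothetical symbol set following the format just described is shown below. The symbols themselves, and which of them are marked context independent, are invented for illustration:

```
AA
AE
AH
B
S
T
(SIL)
(_BREATH)
```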
Phonetic rules
Phonetic rules express groupings of phones that share similar properties according to some criterion
(e.g. fricatives, front vowels, etc.). They are used to guide the clustering of context dependent
units (see the contextclustering tool) under the assumption that phones with similar
properties are likely to affect the realization of neighboring phones in a similar fashion. During the
process of building the decision tree, a cluster of allophones is split into subclusters by testing
the applicable phonetic rules and picking the rule that results in the highest likelihood increase.
A good way to see how to define phonetic rules is to look at the example below. The character '#' can be used to write comments. Note that each phonetic symbol in the phonetic symbol set must appear on the right side of at least one rule; typically, a phonetic symbol appears on the right side of many rules.
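The coverage constraint above can be sketched in code. The rule names, phone symbols, and in-memory representation below are invented for this sketch; they are not Bavieca's on-disk rule syntax:

```python
# Hypothetical in-memory representation of a phonetic rule file:
# each rule maps a group name to the phones on its right side.
rules = {
    "fricatives": ["F", "V", "S", "Z"],
    "front-vowels": ["IY", "IH", "EH", "AE"],
    "silence": ["SIL"],
}

def uncovered_phones(symbol_set, rules):
    """Phones that appear on the right side of no rule; the
    contextclustering tool could never ask a question about them."""
    covered = {p for phones in rules.values() for p in phones}
    return sorted(set(symbol_set) - covered)

print(uncovered_phones(["IY", "S", "SIL", "B"], rules))  # ['B']
```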
Pronunciation lexicon
The pronunciation lexicon contains pronunciations for all the words in the vocabulary. A pronunciation for a word is defined by the actual word followed by the sequence of phonemes that express the pronunciation. Alternative pronunciations of a word can be defined by appending '(n)' to the word. This suffix applies once the first pronunciation of the word has been defined (n starts at 2).
Depending on whether it is used for training or recognition, the lexicon file has different uses. For training purposes the lexicon file must contain pronunciations for all words in the master label file, while for recognition, the lexicon file determines the active vocabulary (i.e. those words that the recognizer can potentially recognize).
Below there is an example of a lexicon file containing alternative pronunciations for some words. When creating a pronunciation lexicon the following considerations must be taken into account:
- Symbols between pointy brackets, such as '<SIL>' or '<BREATH>', denote filler symbols, which receive special treatment during recognition. In particular, a special insertion penalty can be specified for them and they are not affected by the language model (they do not appear in any n-gram).
- Due to the way the decoding network is built, two words/filler symbols that have the same initial phone cannot have different insertion penalties. Thus, it is recommended that pronunciations for filler symbols do not include phonetic symbols that are used in actual words but phonetic symbols created ad-hoc (such as '_BREATH' or '_UH' in the example below).
- The silence symbol '<SIL>' must always be defined.
- The characters "##" can be used to write comments.
- Although it is not a requirement (the pronunciation lexicon is internally reordered when loaded), entries in the lexicon should be ordered alphabetically.
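A hypothetical lexicon fragment following the conventions above is shown below. The words READ/READ(2) and the filler pronunciations are invented for illustration; the pronunciation of ABSOLUTE is the one used elsewhere in this document:

```
## hypothetical lexicon fragment
<BREATH>    _BREATH
<SIL>       SIL
ABSOLUTE    AE B S AH L UW T
READ        R IY D
READ(2)     R EH D
```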
Language model
Currently the Bavieca speech recognition toolkit supports language models in two formats: the ARPA format and a binary format. Internally, for efficiency reasons, language models are represented as Finite State Machines (FSMs). Any n-gram order is supported; however, only orders up to fourgrams have been tested.
ARPA format
The ARPA format is a simple and human-readable language model format. There are several freely available language modeling toolkits that can be used to build language models in the ARPA format. Two excellent resources are The CMU Statistical Language Modeling (SLM) Toolkit and the MIT Language Modeling Toolkit. Language models built with these toolkits have been successfully used with Bavieca.
Fragments of a trigram language model in the ARPA format are listed below.
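A small hypothetical trigram fragment in the standard ARPA layout is shown below (all log10 probabilities and backoff weights are invented values):

```
\data\
ngram 1=4
ngram 2=2
ngram 3=1

\1-grams:
-1.2041 </s>
-0.9031 <s>  -0.3010
-0.6021 the  -0.3010
-0.9031 cat  -0.3010

\2-grams:
-0.3010 <s> the  -0.3010
-0.4771 the cat

\3-grams:
-0.1761 <s> the cat

\end\
```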
Binary format
The binary format allows much faster loading from disk, usually about an order of magnitude faster than
loading a language model in the ARPA format. The binary representation is lossless and encodes the
language model as a Finite State Machine. In addition, it uses about half the space on disk.
A language model in ARPA format can be converted to binary format using the lmfsm
command line tool.
Master Label File
Master Label Files (MLFs) are used to describe a dataset composed of a series of utterances with corresponding transcriptions. MLFs serve as input to tools that initialize acoustic model parameters or accumulate sufficient statistics to reestimate acoustic model parameters.
Each utterance in the MLF is expressed by a line with a relative path to a feature file (containing features extracted from the utterance) followed by a line for each word in the transcription of the utterance. Each transcription must contain at least one word (or symbol). Typically, Bavieca tools produce absolute paths to feature files by appending relative paths in the MLFs to base folders specified as parameters. This is a flexible mechanism to, for example, use different sets of features with a single MLF.
Below there is an example of an MLF. Note the use of the character '"' at the beginning and end of the relative paths.
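A hypothetical MLF fragment following the layout just described is shown below (the paths and transcriptions are invented):

```
"speaker01/utt001.fea"
<SIL>
HELLO
WORLD
<SIL>
"speaker01/utt002.fea"
<SIL>
GOOD
MORNING
<SIL>
```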
Features
Feature files are binary files containing a series of feature vectors. These files are produced by the
param
tool according to a series of parameters defined in a
feature configuration file. The characteristics of the
features, along with their dimensionality, are not specified in the feature file. They are only available
through the feature configuration file used to create them.
Alignment file
Alignment files can be produced by a Viterbi aligner or a Forward-Backward aligner. They can be either in binary or text format.
Text format
Alignment files in text format can be produced by a Viterbi aligner and keep time-alignment information for each word/symbol aligned. This format is intended to be easily readable by humans; nonetheless, it can be used as the input of some tools. Below there is an example of an alignment file in text format. Each line represents a phone alignment: the six numbers on the left are the initial and final feature frames aligned to each of the three HMM-states associated with the phone. The next column contains the phonetic symbol ('TH' in the first line). The next column is the acoustic score (log-likelihood) for the phone given the features and acoustic models used. Finally, the last column shows the word when the phone is the initial phone of a word.
Binary format
Alignments in binary format keep occupation counts of each HMM-state for each time-slice. In addition, in the case of Viterbi alignments (in which there is a hard assignment between time-slices and HMM-states), alignment information is kept for each word/symbol aligned.
Accumulator
Accumulator files keep sufficient statistics to estimate acoustic model parameters (Gaussian distributions). There exist two types of accumulator files:
- Logical accumulators: this type of accumulator keeps sufficient statistics at the logical HMM-state
level. A logical HMM-state is an HMM-state in a particular phonetic context. For example, "S-AH-L+UW+T"
represents a logical HMM-state for a pentaphone (where L is the central phone and the characters "-" and "+"
are used to denote left and right phonetic context respectively). In reality, a logical HMM-state is also
characterized by the HMM-state number (HMMs in Bavieca have three states) and the position of the central
phone within the word, which can be initial, internal or final. Statistics for logical HMM-states are
accumulated every time a logical HMM-state is observed (i.e. receives some occupation) during the
accumulation process. For example, the logical accumulator for "S-AH-L+UW+T" will receive some data every time the
word ABSOLUTE ("AE B S AH L UW T") is observed in the training data.
Logical accumulators are used to build context dependent HMM-states using the contextclustering
tool. In order to manage data sparsity, this tool clusters a set of logical HMM-states into a physical context-dependent HMM-state. When large phonetic contexts are used, the number of different left and right contexts observed for each phone during the accumulation process can be very large, which means that a large number of logical accumulators (and a large storage space) will be needed to keep sufficient statistics.
- Physical accumulators: this type of accumulator keeps sufficient statistics at the physical HMM-state level. Physical HMM-states are actual HMM-states that can be used to estimate log-likelihoods during training and recognition, while logical HMM-states are just an abstraction used to describe how context modeling works. Specifically, each physical accumulator keeps sufficient statistics to estimate the mean and covariance of a Gaussian distribution. Physical accumulators are identified by a physical HMM-state identifier and an index within its mixture of Gaussian components. Physical accumulators do not know about context dependency; they can be generated to reestimate the parameters of context-independent monophones or clustered n-phones.
Hypothesis
Hypothesis files contain word hypotheses generated by recognition or rescoring tools. Bavieca supports two types of hypothesis files: trn and ctm, as defined in the SCLITE scoring toolkit.
Below there is an example of a hypothesis file in trn format. There are hypotheses for four utterances (one per line). The trailing code between parentheses is the utterance identifier, which scoring tools use to match hypothesis-reference pairs.
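A hypothetical trn fragment with four utterances is shown below (the word content and utterance identifiers are invented):

```
good morning everyone (speaker01_utt001)
absolute silence (speaker01_utt002)
read the report (speaker02_utt001)
thanks for coming (speaker02_utt002)
```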
Transcription
Transcription files keep orthographic transcriptions of utterances and are typically used to score recognition hypotheses and compute the word error rate (WER). The WER is defined as the number of edit errors resulting from aligning a hypothesis against a reference (transcription), divided by the number of words in the reference; it is the most common metric of speech recognition accuracy. Transcription files in Bavieca are in the trn format, as defined in the SCLITE scoring toolkit.
Below there is an example of a transcription file in the trn format. There are transcriptions for six utterances (one per line). The trailing code between parentheses is the utterance identifier, which scoring tools use to match hypothesis-reference pairs.
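The WER computation described above can be sketched as a standard word-level edit distance. This is an illustrative implementation, not Bavieca's scoring code (scoring is done with the SCLITE toolkit):

```python
def wer(reference, hypothesis):
    """Word error rate: edit distance (substitutions + insertions +
    deletions) between hypothesis and reference word sequences,
    divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```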
Lattice file
Lattice files can be either in binary or text format.
Text format
Decoders and lattice editing tools can output lattices in text format; however, this is exclusively an output format, intended only for easy visualization of the lattice.
Example 1: Simple lattice in text format. The first four lines contain mandatory lattice properties: the number of edges in the lattice, the number of nodes, the number of time-slices (frames), and the lattice version.
Example 2: Graphical representation of a lattice created from a lattice in text format. Word identities and word boundaries are depicted for each edge in the lattice
Example 3: Segment of a lattice in text format with a number of properties. Each edge is followed by the phone alignment (including HMM-state identifiers) of the word attached to the edge.
- "am" = acoustic model log-likelihood
- "lm" = language model log-likelihood
- "ip" = word insertion penalty
- "fw" = edge forward score
- "bw" = edge backward score
- "pp" = edge posterior probability
Binary format
For performance reasons, the binary format is the only input format supported by the latticeeditor tool.
This format is intended to be compact, easily extendable and readable by machines.
The param
tool is used to extract features from raw audio. Feature vectors can then be used
to train acoustic models, compute transforms, etc. The table below summarizes its optional and required
command line parameters.
parameter | type | description | optional | default value |
---|---|---|---|---|
-cfg | [file] | feature configuration file | no | |
-raw | [file] | file containing samples of raw audio (either '-raw' or '-bat' must be specified) | yes | |
-fea | [file] | file where features will be written to (either '-fea' or '-bat' must be specified) | yes | |
-bat | [file] | batch file containing pairs [rawFile featureFile] | yes | |
-wrp | [float] | warp factor (see Vocal Tract Length Normalization) | yes | 1.0 |
-nrm | [string] | cepstral normalization mode | yes | utterance |
-met | [string] | cepstral normalization method | yes | CMN |
-hlt | [boolean] | whether to halt the batch processing if an error is found | yes | no |
Feature configuration file
This file specifies parameters needed for feature extraction.
parameter | type | description | optional | typical values |
---|---|---|---|---|
waveform.samplingRate | [integer] | sampling rate of input audio (in Hertz (Hz) or samples per sec). This parameter is typically 16000 Hz, except for telephone speech where a value of 8000 Hz is used. | no | 8000|16000 |
waveform.sampleSize | [integer] | sample size of input audio (bits per sample), currently only 16 bits is supported. | no | 16 |
preemphasis | [boolean] | whether to apply pre-emphasis to the waveform. Pre-emphasis is intended to compensate the high-frequency part suppressed during human speech production. In addition, it can amplify the importance of high-frequency formants. The pre-emphasis coefficient utilized is 0.97. | no | yes |
window.width | [integer] | size of the analysis window in number of milliseconds. | no | 20 |
window.shift | [integer] | number of milliseconds in between consecutive analysis windows. | no | 10 |
window.tapering | [string] | window tapering function used to multiply each sample in the window of samples, in order to ensure the continuity of the first and last points in the window | no | Hamming |
features.type | [string] | feature type, currently only Mel Frequency Cepstral Coefficients (MFCC) are supported. | no | mfcc |
filterbank.frequency.min | [integer] | minimum frequency in Hz used to build the bank of filters. Very little speech information is present below 100 Hz. | no | 0 |
filterbank.frequency.max | [integer] | maximum frequency in Hz used to build the bank of filters. This value cannot be higher than the Nyquist frequency (i.e. half the sampling rate). Additionally, little speech information is present above 6800 Hz, so that value can be used to exclude high frequency noise from the filterbank analysis. | no | 8000 |
filterbank.filters | [integer] | number of triangular filters in the filterbank used to compute mfcc coefficients. | no | 20 |
cepstralCoefficients | [integer] | total number of static cepstral coefficients, currently only 12 is supported. | no | 12 |
energy | [boolean] | whether to append the signal energy to the vector of static cepstral coefficients. | no | yes |
derivatives.order | [integer] | order of higher derivatives that will be computed, this value is typically set to 2 so speed and acceleration are computed, a value of 3 can be used for extracting feature vectors whose dimensionality will be later reduced using a linear transform like HLDA or LDA. | no | 2 |
derivatives.delta | [integer] | size of the window (in number of feature vectors) used to compute derivatives. | no | 2 |
spliced.size | [integer] | number of consecutive static feature vectors centered at each time-slice that will be spliced together. This is an alternative method for incorporating dynamic information into the feature vectors. This method may produce better features than computing first and second derivatives when enough training data is available. This parameter can only be specified if the parameter 'derivatives.delta' is absent. | yes | 9 |
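Two of the steps above can be sketched in code: pre-emphasis with the documented 0.97 coefficient, and derivative (delta) computation over a window of ± 'derivatives.delta' frames. The regression formula is the one commonly used in MFCC front ends and is an assumption here, not necessarily Bavieca's exact implementation:

```python
def preemphasize(samples, coeff=0.97):
    """Pre-emphasis: y[n] = x[n] - coeff * x[n-1] (first sample kept)."""
    return [samples[0]] + [samples[n] - coeff * samples[n - 1]
                           for n in range(1, len(samples))]

def deltas(frames, delta=2):
    """First derivative of a 1-D sequence of frame values using the
    standard regression window of +/- `delta` frames (edges clamped)."""
    denom = 2.0 * sum(d * d for d in range(1, delta + 1))
    out = []
    for t in range(len(frames)):
        num = sum(d * (frames[min(t + d, len(frames) - 1)]
                       - frames[max(t - d, 0)])
                  for d in range(1, delta + 1))
        out.append(num / denom)
    return out

print(deltas([0.0, 1.0, 2.0, 3.0, 4.0]))  # mid-sequence slope is 1.0
```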
The lmfsm
tool is used to convert a language model in ARPA format to binary format. Language
models in binary format enable fast loading times from disk and use less storage space.
parameter | type | description | optional | default value |
---|---|---|---|---|
-pho | [file] | phonetic symbol set | no | |
-lex | [file] | pronunciation lexicon | no | |
-lm | [file] | input language model in ARPA format | no | |
-fsm | [file] | output language model in binary format | no |
The aligner
tool is used to align features against a given sequence of words/symbols using
a set of acoustic models. It transforms the given sequence of words along with optional symbols like
silence or fillers into a graph of phones, which is then transformed into an optimized graph of HMM-states.
Finally it aligns the graph of HMM-states against the feature vectors using the Viterbi algorithm and
produces a time alignment of the input lexical units.
parameter | type | description | optional | default value |
---|---|---|---|---|
-pho | [file] | phonetic symbol set | no | |
-lex | [file] | pronunciation lexicon | no | |
-mod | [file] | acoustic models | no | |
-fea | [file] | feature file containing features to align | yes | |
-txt | [file] | text file containing words and symbols to align (pronunciations for all words and symbols must be included in the lexicon file) | yes | |
-for | [string] | alignment file format | no | |
-out | [file] | output alignment file | no | |
-fof | [folder] | base-folder containing feature files to align ('-mlf' must be specified) | yes | |
-mlf | [file] | master label file containing features and words to align to | yes | |
-dir | [folder] | output directory to store the alignments ('-mlf' must be specified) | yes | |
-bat | [file] | batch file containing entries (featuresFile txtFile alignmentFile) | yes | |
-opt | [file] | file containing optional symbols that can be inserted at word boundaries. Symbols are only inserted at word boundaries when the insertion increases the alignment's log-likelihood. If no file is specified the silence symbol ('SIL') will be assumed. | yes | |
-pro | [boolean] | whether to allow alternative pronunciations. Multiple pronunciations of the same word can receive occupation simultaneously. Enabling alternative pronunciations will typically increase the likelihood of the training data, which may or may not result in more discriminative acoustic models. | yes | no |
-bea | [float] | beam width used for likelihood-based pruning (Viterbi search). | yes | 1000.0 |
-hlt | [boolean] | whether to stop the batch processing if an error is found (either '-mlf' or '-bat' must be specified). | yes | no |
The hmminitializer
tool is used to create an initial set of acoustic models. Specifically
this tool creates a set of single-Gaussian Hidden Markov Models (HMMs) initialized to the global
distribution of the data, or to the HMM-state distributions given by an input set of alignment files.
Acoustic models created with this tool can be later refined using the mlaccumulator
and mlestimator
tools.
parameter | type | description | optional | default value |
---|---|---|---|---|
-cfg | [file] | feature configuration file | no | |
-fea | [folder] | base folder containing feature files needed for the accumulation of statistics (this folder will be prepended to feature filenames found in the MLF) | no | |
-pho | [file] | phonetic symbol set | no | |
-lex | [file] | pronunciation lexicon | no | |
-mlf | [file] | master label file containing features and words to align to | no | |
-met | [string] | initialization method | yes | flatstart |
-cov | [string] | covariance modeling type | yes | diagonal |
-mod | [file] | output acoustic models | no |
The mlaccumulator
tool accumulates sufficient statistics necessary to estimate
acoustic model parameters under the Maximum Likelihood Estimation (MLE) criterion. Statistics are accumulated from the training
data by aligning feature vectors extracted from the audio against the transcriptions. For each utterance in the master
label file, a graph is constructed from its words considering alternative pronunciations (optional) and optional
filler symbols. This graph is then aligned against the feature vectors using the forward-backward algorithm and
occupation statistics are dumped into the accumulator file.
parameter | type | description | optional | default value |
---|---|---|---|---|
-pho | [file] | phonetic symbol set | no | |
-mod | [file] | acoustic models for aligning the data | no | |
-lex | [file] | pronunciation lexicon | no | |
-fea | [folder] | base folder containing feature files needed for the accumulation of statistics (this folder will be prepended to feature filenames found in the MLF). | no | |
-cfg | [file] | feature configuration file | no | |
-feaA | [folder] | base folder containing feature files that will be used for accumulation of statistics. This parameter is only used for single pass retraining, and must be specified along with '-cfgA' and '-covA'. | yes | |
-cfgA | [file] | feature configuration file of features used for statistics accumulation. This parameter is only used for single pass retraining | yes | |
-covA | [string] | covariance modeling type used for statistics accumulation (single pass retraining) | yes | diagonal |
-mlf | [file] | master label file containing training data for statistics accumulation | no | |
-ww | [string] | within-word context modeling order for logical accumulators. Possible values are: 'triphones', 'pentaphones', 'heptaphones', 'nonaphones' or 'endecaphones'. If not specified, physical accumulators will be generated. | yes | triphones |
-cw | [string] | cross-word context modeling order for the accumulators. Possible values are: 'triphones', 'pentaphones', 'heptaphones', 'nonaphones' or 'endecaphones'. If not specified, physical accumulators will be generated. Currently most tools within the Bavieca toolkit only support acoustic models with the same cross-word and within-word order. | yes | triphones |
-opt | [file] | file containing optional symbols that can be inserted at word boundaries. Symbols are only inserted at word boundaries when the insertion increases the alignment's log-likelihood. If no file is specified the silence symbol ('SIL') will be assumed | yes | |
-pro | [boolean] | whether to allow alternative pronunciations. Multiple pronunciations of the same word can receive occupation simultaneously. Enabling alternative pronunciations will typically increase the likelihood of the training data, which may or may not result in more discriminative acoustic models. | yes | no |
-fwd | [float] | beam size used for likelihood-based forward pruning. Forward pruning is safe since the forward pass is performed after the backward pass and therefore the log-likelihood of the whole utterance is known beforehand. | yes | -20 |
-bwd | [float] | beam size used for likelihood-based backward pruning. High values are recommended for accurate estimation of occupation probabilities | yes | 800 |
-tre | [int] | maximum size (in MBs) allowed when allocating a trellis for the forward-backward algorithm. Utterances that are long enough to exceed this limit will be discarded from the accumulation process and a Warning message will be generated | yes | 500 |
-dAcc | [file] | output accumulator file | no |
The mlestimator
tool reestimates the parameters of a given set of acoustic models
using a list of accumulator files containing sufficient statistics. Estimation of acoustic model parameters
is carried out under the Maximum Likelihood Estimation (MLE) criterion (for discriminative estimation see
dtestimator
). After the MLE is carried out, Gaussian covariances are floored using the flooring ratio provided.
parameter | type | description | optional | default value |
---|---|---|---|---|
-pho | [file] | phonetic symbol set | no | |
-mod | [file] | input acoustic models | no | |
-acc | [file] | file containing the list of accumulator files that will be used for the estimation | no | |
-cov | [float] | ratio used to apply covariance flooring | yes | 0.05 |
-out | [file] | output acoustic models | no |
The mapestimator
tool reestimates the parameters of a given set of acoustic models
using a list of accumulator files containing sufficient statistics. A Maximum A Posteriori (MAP)
adaptation of the acoustic model parameters is performed, which is typically used for adapting a set
of well trained acoustic models to a new domain, environment or even to a particular speaker when enough
adaptation data is available.
parameter | type | description | optional | default value |
---|---|---|---|---|
-pho | [file] | phonetic symbol set | no | |
-mod | [file] | input acoustic models | no | |
-acc | [file] | file containing the list of accumulator files with the adaptation data (domain data) that will be used for the estimation | no | |
-pkw | [float] | prior knowledge weight | yes | 2.0 |
-out | [file] | output acoustic models | no |
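The effect of the '-pkw' prior knowledge weight can be illustrated with the standard MAP mean update, which interpolates the prior mean with the ML estimate from the adaptation data. Whether mapestimator uses exactly this update is an assumption; the sketch below is illustrative:

```python
def map_mean(mu_prior, weighted_sum, occupancy, tau=2.0):
    """MAP estimate of a Gaussian mean. `weighted_sum` is the
    occupancy-weighted sum of adaptation feature values, `occupancy`
    the total state occupancy, and `tau` plays the role of '-pkw'.
    With little data the prior dominates; with a lot of data the
    estimate approaches the ML mean."""
    return (tau * mu_prior + weighted_sum) / (tau + occupancy)

# 8 frames of adaptation data with mean 1.0, prior mean 0.0, tau = 2.0:
print(map_mean(0.0, 8.0, 8.0))  # 0.8: pulled toward the data mean
```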
The dtaccumulator
tool accumulates sufficient statistics necessary to estimate
acoustic model parameters under two alternative discriminative criteria: Maximum Mutual Information (MMI) or
boosted Maximum Mutual Information (bMMI). Statistics are accumulated from the training data by aligning
feature vectors extracted from the audio against the transcriptions. For each utterance in the master
label file, a graph is constructed from its words considering alternative pronunciations (optional) and optional
filler symbols. This graph is then aligned against the feature vectors using the forward-backward algorithm and
occupation statistics are dumped into the accumulator file.
parameter | type | description | optional | default value |
---|---|---|---|---|
-pho | [file] | phonetic symbol set | no | |
-mod | [file] | acoustic models for aligning the data | no | |
-lex | [file] | pronunciation lexicon | no | |
-fea | [folder] | base folder containing feature files needed for the accumulation of statistics (this folder will be prepended to relative paths found in the MLF) | no | |
-cfg | [file] | feature configuration file | no | |
-mlf | [file] | master label file containing training data for statistics accumulation | no | |
-lat | [folder] | base folder containing lattice files needed for the accumulation of statistics | no | |
-ams | [float] | scale factor applied to acoustic log-likelihoods before doing the forward-backward algorithm to compute lattice occupation (typically set to the inverse of the language model scale factor used during the recognition process that generated the lattices) | no | |
-lms | [float] | scale factor applied to language model log-likelihoods before doing the forward-backward algorithm to compute lattice occupation (typically set to 1.0) | yes | 1.0 |
-opt | [file] | file containing optional symbols that can be inserted at word boundaries. Symbols are only inserted at word boundaries when the insertion increases the alignment's log-likelihood. If no file is specified the silence symbol ('SIL') will be assumed | yes | |
-pro | [boolean] | whether to allow alternative pronunciations. Multiple pronunciations of the same word can receive occupation simultaneously. Enabling alternative pronunciations will typically increase the likelihood of the training data, which may or may not result in more discriminative acoustic models. | yes | no |
-fwd | [float] | beam size used for likelihood-based forward pruning. Forward pruning is safe since the forward pass is performed after the backward pass and therefore the log-likelihood of the whole utterance is known beforehand. | yes | -20 |
-bwd | [float] | beam size used for likelihood-based backward pruning. High values are recommended for accurate estimation of occupation probabilities | yes | 800 |
-tre | [int] | maximum size (in MBs) allowed when allocating a trellis for the forward-backward algorithm. Utterances that are long enough to exceed this limit will be discarded from the accumulation process and a Warning message will be generated | yes | 500 |
-dAccNum | [file] | output accumulator file where numerator statistics will be stored | no | |
-dAccDen | [file] | output accumulator file where denominator statistics will be stored | no | |
-obj | [string] | objective function | yes | MMI |
-bst | [float] | boosting factor used for bMMI | yes | 0.5 |
-can | [boolean] | whether to perform cancellation of statistics between numerator and denominator | yes | yes |
The dtestimator
tool reestimates the parameters of a given set of acoustic models discriminatively.
Unlike the mlestimator
tool, this tool needs two sets of accumulators, one with numerator statistics
and another one with denominator statistics. Numerator and denominator statistics are typically accumulated
over a set of lattices using the dtaccumulator
tool. After the discriminative estimation is
carried out, Gaussian covariances are floored using the flooring ratio provided.
parameter | type | description | optional | default value |
---|---|---|---|---|
-pho | [file] | file containing the phonetic symbol set | no | |
-mod | [file] | file containing input acoustic models | no | |
-accNum | [file] | file containing the list of accumulator files used for the estimation (numerator) | no | |
-accDen | [file] | file containing the list of accumulator files used for the estimation (denominator) | no | |
-cov | [float] | ratio used to apply covariance flooring | yes | 0.05 |
-out | [file] | output acoustic models | no | |
-E | [float] | learning rate constant | yes | 2.0 |
-I | [string] | I-smoothing type | no | prev |
-tau | [float] | I-smoothing constant | yes | 100.0 |
The contextclustering
tool refines a given set of context-independent acoustic models by
performing context clustering. Context clustering is carried out using logical accumulators
obtained from single Gaussian HMMs and decision trees that are either state-specific or global.
Decision trees are generated following a standard top-down procedure that iteratively
splits the data by applying binary questions using a ML criterion. Questions are asked about
the correspondence to phonetic groups (defined by hand-made phonetic rules) and the within-word
position (initial, internal and final). The splitting process is governed by two parameters: a minimum
occupation count for each leaf and a minimum likelihood increase for each split. Finally, a
bottom-up merging process is applied to merge those leaves which, when merged, produce a likelihood
decrease below the minimum value used to allow a split.
parameter | type | description | optional | default value |
---|---|---|---|---|
-pho | [file] | phonetic symbol set | no | |
-mod | [file] | input acoustic models | no | |
-rul | [file] | phonetic rules used for the top-down n-phone clustering | no | |
-ww | [string] | within-word context modeling order. Possible values are: 'triphones', 'pentaphones', 'heptaphones', 'nonaphones' or 'endecaphones'. While a few tens of hours are typically enough to get the benefit of triphone context modeling, using pentaphones and above is only helpful when hundreds of hours of training data are used. | yes | triphones |
-cw | [string] | cross-word context modeling order. Possible values are: 'triphones', 'pentaphones', 'heptaphones', 'nonaphones' or 'endecaphones'. | yes | triphones |
-met | [string] | clustering method | yes | local |
-mrg | [boolean] | whether to perform bottom-up merging after the top-down clustering process. Bottom-up merging consists of examining the leaves of the decision tree and merging those leaves for which the resulting likelihood decrease remains below the value specified by "-gan". It typically results in a more compact set of context modeling units. | yes | yes |
-acc | [file] | file containing the list of logical accumulator files that will be used for the estimation | no | |
-occ | [float] | minimum cluster occupation (no cluster of n-phones will be split unless the resulting clusters have an occupation above this value) | yes | 200 |
-gan | [float] | minimum likelihood gain to split a cluster of n-phones | yes | 2000 |
-out | [file] | output acoustic models | no |
The gmmeditor
tool is used to refine a set of acoustic models by applying mixture splitting and
merging to the mixtures of Gaussian distributions. Gaussian splitting allows for a more detailed modeling
of the acoustics, while Gaussian merging eliminates Gaussian distributions that become unnecessary. Acoustic
model refinement through mixture splitting and merging can be performed after each reestimation iteration
using the gmmeditor tool, the original set of HMMs and the accumulated statistics. After applying the
gmmeditor
tool, GMMs will typically have a variable number of components depending on the amount of data aligned
to the HMM-state, which varies across reestimation iterations.
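The mean-perturbation step used to split a Gaussian component can be sketched as follows: a component is duplicated, its weight is shared between the two copies, and the means are displaced in opposite directions by a fraction of the standard deviation, which is the role of the '-eps' parameter. This is an illustrative sketch under assumed data structures, not Bavieca's implementation; criterion-based selection of which component to split, and the merging step, are omitted.

```cpp
#include <cmath>
#include <utility>
#include <vector>

// One diagonal-covariance Gaussian component of a mixture.
struct Gaussian {
    double weight;
    std::vector<double> mean;
    std::vector<double> var;   // diagonal covariance
};

// Split one component into two: the weight is halved and the means are
// perturbed in opposite directions by eps times the standard deviation.
std::pair<Gaussian, Gaussian> splitGaussian(const Gaussian &g, double eps) {
    Gaussian a = g, b = g;
    a.weight = b.weight = g.weight * 0.5;
    for (size_t i = 0; i < g.mean.size(); ++i) {
        double dev = eps * std::sqrt(g.var[i]);
        a.mean[i] = g.mean[i] + dev;
        b.mean[i] = g.mean[i] - dev;
    }
    return {a, b};
}
```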
parameter | type | description | optional | default value |
---|---|---|---|---|
-pho | [file] | phonetic symbol set | no | |
-mod | [file] | input acoustic models | no | |
-acc | [file] | list of accumulator files needed by the splitting and merging process | no | |
-dbl | [boolean] | whether to double the number of Gaussian components per mixture | yes | no |
-inc | [int] | number of Gaussian components that will be added to the mixture. Note that the resulting number of Gaussian components in the mixture will never exceed two times the original number of components | yes | |
-crt | [string] | criterion used to decide which Gaussian component will be split | yes | covariance |
-occ | [float] | minimum occupation (number of feature vectors aligned to the Gaussian component). Typically at least 100 feature vectors are required to robustly train a Gaussian component | yes | 100.0 |
-wgh | [float] | Gaussian components whose weight falls below this threshold will be merged to the closest component in the mixture | yes | 0.05 |
-eps | [float] | epsilon value used to perturb the mean of a Gaussian component in order to estimate the mean of the resulting pair of Gaussian components | yes | 0.00001 |
-mrg | [boolean] | whether to perform Gaussian merging | yes | yes |
-cov | [float] | ratio used to apply covariance flooring | yes | 0.05 |
-out | [file] | output acoustic models | no | |
-vrb | [boolean] | verbose output | yes | no |
This tool is used to recognize speech in batch mode, which means that the audio to recognize is completely available beforehand as opposed to live mode, in which the samples of audio are captured and passed to the recognition engine in real time. In order to perform speech recognition in live mode see Bavieca's API.
parameter | type | description | optional |
---|---|---|---|
-cfg | [file] | configuration file | no |
-hyp | [file] | file where the recognition hypotheses will be stored | no |
-bat | [file] | batch file containing entries [rawFile/featureFile utteranceId] | no |
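A batch file for this tool is a plain text file with one entry per line, each consisting of a raw audio (or feature) file followed by an utterance identifier. The file names below are made up for illustration:

```
audio/utterance0001.raw utterance0001
audio/utterance0002.raw utterance0002
```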
Dynamic decoder configuration file
This file specifies parameters needed for the dynamicdecoder
tool.
parameter | type | description | optional | typical values |
---|---|---|---|---|
input | ||||
input.type | [string] | type of input data that will be fed to the decoder | no | audio |
feature extraction | ||||
feature.configurationFile | [file] | feature configuration file | no | |
feature.cepstralNormalization.mode | [string] | cepstral normalization mode | no | utterance |
feature.cepstralNormalization.method | [string] | cepstral normalization method | no | CMN |
feature.cepstralNormalization.bufferSize | [int] | size (in number of feature vectors) of the circular buffer used to perform cepstral normalization | no | [2000-360000] |
feature.warpFactor | [float] | warp factor to be applied during feature extraction | no | 1.0 |
feature.transformFile | [file] | file containing feature transforms that will be applied to extracted features | yes | |
phonetic symbol set | ||||
phoneticSymbolSet.file | [file] | phonetic symbol set | ||
acoustic models | ||||
acousticModels.file | [file] | acoustic models | ||
language model | ||||
languageModel.file | [file] | language model | ||
languageModel.format | [string] | language model format, language models in binary ('FSM') and 'ARPA' format are supported. See language model formats. | no | |
languageModel.type | [string] | language model type, currently only 'ngram' language models are supported | no | ngram |
languageModel.scalingFactor | [float] | weight applied to language model log-likelihoods when combined with acoustic scores to compute the most likely recognition path | no | [10,50] |
languageModel.crossUtterance | [boolean] | whether the language model state will be kept from the end of one utterance to the beginning of the next one | no | no |
pronunciation lexicon | ||||
lexicon.file | [file] | pronunciation lexicon | no | |
insertion penalty | ||||
insertionPenalty.standard | [float] | insertion penalty added to a path score when transitioning to a word. This value is typically negative (penalizing word insertions) although it can also be positive in order to compensate for a heavy language model scale factor. Its optimal value must be determined empirically | no | [5,25] |
insertionPenalty.filler | [float] | insertion penalty added to a path score when transitioning to a filler symbol (including silence). Its value must be determined empirically. | no | [0,40] |
insertionPenalty.filler.file | [file] | file containing pairs [fillerSymbol insertionPenalty], insertion penalties defined in this file override the generic filler insertion penalty specified by 'insertionPenalty.filler' | yes | |
Viterbi pruning | ||||
pruning.maxActiveArcs | [int] | maximum number of active arcs (histogram pruning) | no | [1000,10000] |
pruning.maxActiveArcsWE | [int] | maximum number of active arcs at word ends (histogram pruning) | no | [500,2000] |
pruning.maxActiveTokensArc | [int] | maximum number of active tokens per arc (histogram pruning) | no | [5-50] |
pruning.likelihoodBeam | [float] | beam size for likelihood based pruning at all arcs | no | [50.0-300.0] |
pruning.likelihoodBeamWE | [float] | beam size for likelihood based pruning at word ends | no | [50.0-250.0] |
pruning.likelihoodBeamTokensArc | [float] | beam size for likelihood based pruning within each active arc | no | [50.0-200.0] |
output | ||||
output.bestSinglePath | [boolean] | whether to output the best recognition path | no | |
output.lattice.folder | [folder] | folder where lattices will be dumped. If this parameter is not defined the recognizer will not produce lattices. This parameter must be defined in addition to the parameter 'output.lattice.maxWordSequencesState'. | yes | |
output.lattice.maxWordSequencesState | [integer] | This parameter, whose value must be 2 or higher, controls the lattice depth. The higher the value, the deeper the lattices produced will be, which means that they will contain a higher number of alternative recognition hypotheses. Specifically, it is the number of best unique word sequences that are kept at each token during the Viterbi search. | yes | |
output.audio.folder | [folder] | folder to store raw audio used for recognition | yes | |
output.features.folder | [folder] | folder to store extracted features | yes | |
output.alignment.folder | [folder] | folder to store time-alignments of the best recognition paths | yes |
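The fragment below illustrates a minimal configuration using parameter names from the table above. The 'parameter = value' syntax, the file names, and the specific values shown are assumptions for illustration only; consult the tool's own documentation for the exact file format.

```
input.type = audio
feature.configurationFile = features.cfg
feature.cepstralNormalization.mode = utterance
feature.cepstralNormalization.method = CMN
feature.cepstralNormalization.bufferSize = 10000
feature.warpFactor = 1.0
phoneticSymbolSet.file = phoneset.txt
acousticModels.file = models.bin
languageModel.file = lm.arpa
languageModel.format = ARPA
languageModel.type = ngram
languageModel.scalingFactor = 30.0
languageModel.crossUtterance = no
lexicon.file = lexicon.txt
insertionPenalty.standard = -18.0
insertionPenalty.filler = -14.0
pruning.maxActiveArcs = 5000
pruning.maxActiveArcsWE = 1000
pruning.maxActiveTokensArc = 20
pruning.likelihoodBeam = 200.0
pruning.likelihoodBeamWE = 150.0
pruning.likelihoodBeamTokensArc = 150.0
output.bestSinglePath = yes
```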
Tool used to build a static decoding network in the form of a Weighted Finite State Acceptor (WFSA).
Decoding networks built using this tool can be used for recognition with the wfsadecoder
tool.
WFSA-based decoding is very fast since all sources of information (acoustic models, pronunciation lexicon and
language model) are combined and optimized statically before the actual recognition process, which is therefore
substantially simplified. For large language models, however, the process of building a WFSA decoding network can
be time consuming and can require large amounts of physical memory. In those cases the use of
the dynamicdecoder
tool is recommended.
parameter | type | description | optional | default value |
---|---|---|---|---|
-pho | [file] | phonetic symbol set | no | |
-mod | [file] | input acoustic models | no | |
-lex | [file] | pronunciation lexicon | no | |
-lm | [file] | language model | no | |
-scl | [float] | scale factor applied to language model log-likelihoods | no | [10-50] |
-ip | [float] | penalty for inserting a word | no | [5-25] |
-ips | [float] | penalty for inserting silence | no | [0-40] |
-ipf | [file] | file containing pairs [fillerSymbol insertionPenalty], a silence insertion penalty defined in this file may override the insertion penalty specified by '-ips' | yes | |
-srg | [string] | semiring used for weight pushing | yes | log |
-net | [file] | file to store the decoding network built | no |
This tool is a Weighted Finite State Acceptor (WFSA) based speech decoder which, in the same way as the
dynamicdecoder
tool, is used to process input speech and produce recognition hypotheses.
In particular, this tool is used to recognize speech in batch mode, which means that the audio to recognize is completely
available beforehand as opposed to live mode, in which the samples of audio are captured and passed to the
recognition engine in real time. In order to perform speech recognition in live mode see Bavieca's API.
parameter | type | description | optional |
---|---|---|---|
-cfg | [file] | configuration file | no |
-hyp | [file] | file where the recognition hypotheses will be stored | no |
-bat | [file] | batch file containing entries [rawFile/featureFile utteranceId] | no |
WFSA-decoder configuration file
This file specifies configuration parameters for the wfsadecoder
tool.
parameter | type | description | optional | typical values |
---|---|---|---|---|
decodingNetwork.file | [file] | decoding network, which is a WFSA built using the wfsabuilder tool | no | |
input | ||||
input.type | [string] | type of input data that will be fed to the decoder | no | |
feature extraction | ||||
feature.configurationFile | [file] | feature configuration file | no | |
feature.cepstralNormalization.mode | [string] | cepstral normalization mode | no | |
feature.cepstralNormalization.method | [string] | cepstral normalization method | no | |
feature.cepstralNormalization.bufferSize | [int] | size (in number of feature vectors) of the circular buffer used to perform cepstral normalization | no | 2000-360000 |
feature.warpFactor | [file] | warp factor to be applied during feature extraction | no | |
feature.transformFile | [file] | file containing feature transforms that will be applied to extracted features | yes | |
phonetic symbol set | ||||
phoneticSymbolSet.file | [file] | phonetic symbol set | ||
acoustic models | ||||
acousticModels.file | [file] | acoustic models | ||
pronunciation lexicon | ||||
lexicon.file | [file] | pronunciation lexicon | no | |
Viterbi pruning | ||||
pruning.maxActiveStates | [int] | maximum number of active states (histogram pruning) | no | 100-20000 |
pruning.likelihoodBeam | [float] | beam size for likelihood based pruning at all states | no | 50.0-300.0 |
output | ||||
output.bestSinglePath | [boolean] | whether to output the best recognition path | no | |
output.lattice.folder | [folder] | folder where lattices will be dumped. If this parameter is not defined the recognizer will not produce lattices. This parameter must be defined in addition to the parameter 'output.lattice.maxWordSequencesState'. | yes | |
output.lattice.maxWordSequencesState | [integer] | This parameter, whose value must be 2 or higher, controls the lattice depth. The higher the value, the deeper the lattices produced will be, which means that they will contain a higher number of alternative recognition hypotheses. Specifically, it is the number of best unique word sequences that are kept at each token during the Viterbi search. | yes | |
output.audio.folder | [folder] | folder to store raw audio used for recognition | yes | |
output.features.folder | [folder] | folder to store extracted features | yes | |
output.alignment.folder | [folder] | folder to store time-alignments of the best recognition path | yes |
This command line tool is used to perform Speech Activity Detection (SAD) over features extracted from an audio file. SAD is useful to spot speech segments within an audio stream for further processing. Since speech recognition can be a time consuming process and SAD usually is not, a typical procedure is to direct recognition only to those segments of audio where speech is detected. This implementation of SAD is based on two Hidden Markov Models (HMMs) with three states each, one for silence and one for speech. Both HMMs share the same set of Gaussian distributions, which are drawn from the silence and speech HMMs found in the set of acoustic models passed as a parameter. A Viterbi search finds the most likely alignment of features to speech and silence, and the resulting segmentation is written to a file.
The accuracy of this tool is very sensitive to the following parameters: '-sil', '-sph' and '-pen', whose values should be determined empirically.
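The decision made by this tool can be sketched with a minimal two-state Viterbi search over per-frame log-likelihoods. In this illustrative sketch (the real tool uses three-state HMMs and Gaussian mixtures; names are assumptions), 'penalty' plays the role of the '-pen' parameter, discouraging transitions from silence into speech.

```cpp
#include <algorithm>
#include <vector>

// Minimal two-state (silence=0 / speech=1) Viterbi segmentation sketch.
// llSil/llSph hold per-frame log-likelihoods under the silence and speech
// models; 'penalty' is subtracted when entering the speech state.
std::vector<int> segment(const std::vector<double> &llSil,
                         const std::vector<double> &llSph, double penalty) {
    size_t n = llSil.size();
    std::vector<double> score(2), prev(2);
    std::vector<std::vector<int>> back(n, std::vector<int>(2, 0));
    prev[0] = llSil[0];
    prev[1] = llSph[0] - penalty;
    for (size_t t = 1; t < n; ++t) {
        // state 0 (silence): best of staying silent or leaving speech
        back[t][0] = (prev[0] >= prev[1]) ? 0 : 1;
        score[0] = prev[back[t][0]] + llSil[t];
        // state 1 (speech): entering from silence pays the penalty
        double fromSil = prev[0] - penalty, fromSph = prev[1];
        back[t][1] = (fromSph >= fromSil) ? 1 : 0;
        score[1] = std::max(fromSil, fromSph) + llSph[t];
        prev = score;
    }
    // trace back the best state sequence
    std::vector<int> path(n);
    path[n - 1] = (prev[0] >= prev[1]) ? 0 : 1;
    for (size_t t = n - 1; t > 0; --t) path[t - 1] = back[t][path[t]];
    return path;
}
```

Raising the penalty makes the segmenter reluctant to open short speech segments, which is why the tool's accuracy is sensitive to '-pen'.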
parameter | type | description | optional | default value |
---|---|---|---|---|
-pho | [file] | phonetic symbol set | no | |
-mod | [file] | acoustic models | no | |
-fea | [file] | file containing features to process | no | |
-sil | [int] | maximum number of Gaussian components used to build the HMM for silence (-1 for all) | no | |
-sph | [int] | maximum number of Gaussian components used to build the HMM for speech (-1 for all) | no | |
-pad | [int] | number of time-slices to pad speech segments with | no | 10 |
-pen | [float] | penalty for transitioning from the silence state to the speech state | no | |
-out | [file] | output file where the speech/silence segmentation will be written | no |
The hldaestimator
tool is used to perform Heteroscedastic Linear Discriminant Analysis
(HLDA). It consists of estimating a feature transform to decorrelate features and
reduce their dimensionality while preserving the most discriminative information. In order to compute the
transform a set of full-covariance acoustic models along with physical accumulators is passed as input.
Typically full-covariance acoustic models are trained on high dimensional feature vectors (from scratch
or doing single pass retraining) and then an HLDA transform is estimated to decorrelate the features and reduce
their dimensionality. A common scenario consists of training full-covariance acoustic models on feature vectors
with 52 coefficients (static features+Δ+ΔΔ+ΔΔΔ) and then estimating an HLDA transform
to reduce the dimensionality to 39 coefficients.
parameter | type | description | optional | default value |
---|---|---|---|---|
-pho | [file] | file containing the phonetic symbol set | no | |
-mod | [file] | acoustic models for the estimation | no | |
-acc | [file] | file containing the list of physical accumulator files that will be used for the estimation | no | |
-itt | [integer] | number of iterations for the transform update | yes | 10 |
-itp | [integer] | number of iterations for the parameter update | yes | 10 |
-red | [integer] | dimensionality reduction resulting from the transformation | yes | 13 |
-out | [folder] | output acoustic models | no |
The vtlestimator
command line tool performs Maximum Likelihood (ML) based Vocal Tract Length
estimation over a set of feature files. Features are extracted for different warp factors (starting with the
warp factor specified by '-floor' up to the warp factor specified by '-ceiling', using a step size
specified by '-step') and the warp factor that results in the highest overall likelihood is written to the
output file. Unvoiced phones and symbols should be excluded from the estimation using the parameter '-fil'.
parameter | type | description | optional | default value |
---|---|---|---|---|
-cfg | [file] | feature configuration file | no | |
-pho | [file] | phonetic symbol set | no | |
-mod | [file] | acoustic models | no | |
-lex | [file] | pronunciation lexicon | no | |
-bat | [file] | batch file containing pairs [rawFile alignmentFile] | no | |
-for | [string] | format of alignment files passed as input | no | |
-out | [file] | file to store pairs [rawFile warpFactor] | no | |
-fil | [file] | list of phones and symbols that will be ignored, typically unvoiced phonemes and filler symbols should be excluded from the vocal tract length estimation | yes | |
-floor | [float] | warp factor floor | yes | 0.80 |
-ceiling | [float] | warp factor ceiling | yes | 1.20 |
-step | [float] | warp factor increment among tests | yes | 0.02 |
-ali | [boolean] | whether to realign data for each warp factor | yes | no |
-nrm | [string] | cepstral normalization mode | yes | utterance |
-met | [string] | cepstral normalization method | yes | CMN |
The regtree
tool builds regression trees that can be used to collect statistics for a form
of adaptation called Maximum Likelihood Linear Regression (MLLR). The regression tree is built by
performing top-down clustering of the means of all Gaussian distributions present in the acoustic
models. Regression trees created using this tool are then passed to the mllrestimator tool
to perform the actual MLLR adaptation.
parameter | type | description | optional | default value |
---|---|---|---|---|
-pho | [file] | file containing the phonetic symbol set | no | |
-mod | [file] | file containing input acoustic models | no | |
-met | [string] | clustering method | yes | EM |
-rgc | [int] | number of regression classes (base-classes) | yes | 50 |
-gau | [int] | minimum number of Gaussian components per base-class | yes | 1 |
-out | [file] | file to store the regression tree | no |
The mllrestimator
tool performs Maximum Likelihood Linear Regression (MLLR)
to adapt the Gaussian distributions of a set of acoustic models. This tool receives as input a baseline set of
acoustic models and adaptation data (a set of feature files with corresponding transcriptions, which are
typically hand-made transcriptions or recognition hypotheses) and estimates a mean and, optionally, a
covariance transform for each regression class. The total number of transforms computed is automatically
determined based on the adaptation data and the values of the parameters '-occ' and '-gau'. Estimated
transforms can be stored into a file for later use with the tool hmmx.
parameter | type | description | optional | default value |
---|---|---|---|---|
-pho | [file] | file containing the phonetic symbol set | no | |
-mod | [file] | file containing input acoustic models | no | |
-bat | [file] | batch file containing pairs [featureFile alignmentFile] | no | |
-for | [string] | format of alignment files passed as input | no | |
-out | [file] | file to store adapted acoustic models | yes | |
-tra | [folder] | folder to store the transforms | yes | |
-rgt | [file] | file containing the regression tree (see the regtree tool) | no | |
-occ | [int] | minimum number of frames to compute a transform. Each frame corresponds to one hundredth of a second; typically at least 10 seconds of adaptation data (1000 frames) are needed to reliably estimate a transform | yes | 3500 |
-gau | [int] | minimum number of Gaussian distributions with occupation to compute a transform | yes | 1 |
-bst | [boolean] | whether to assign all occupation to best scoring Gaussian component. Enabling this option speeds up the estimation process, however it slightly reduces the likelihood increase resulting from the adaptation | yes | yes |
-cov | [boolean] | whether to compute the covariance transform. If disabled only the mean transform will be computed, which is sufficient for most tasks. | yes | no |
The fmllrestimator
tool performs feature-space Maximum Likelihood Linear Regression (MLLR)
to adapt a set of feature vectors. This tool receives as input a baseline set of
acoustic models and adaptation data (a set of feature files with corresponding transcriptions, which are
typically hand-made transcriptions or recognition hypotheses) and estimates a transform. The estimated
transform is stored into a file for later utilization using the tool
paramx.
parameter | type | description | optional | default value |
---|---|---|---|---|
-pho | [file] | file containing the phonetic symbol set | no | |
-mod | [file] | file containing input acoustic models | no | |
-bat | [file] | batch file containing pairs [featureFile alignmentFile] | no | |
-for | [string] | format of alignment files passed as input | no | |
-tra | [file] | file to store the computed feature transform | no | |
-bst | [boolean] | whether to assign all occupation to best scoring Gaussian component. Enabling this option speeds up the estimation process, however it slightly reduces the likelihood increase resulting from the adaptation | yes | yes |
The paramx
tool applies linear and affine transformations to feature vectors. Transformations
can be applied over pairs of input and output feature vectors using the parameter '-bat' or, alternatively, over a single pair
of feature vectors using the parameters '-in' and '-out'.
parameter | type | description | optional |
---|---|---|---|
-cfg | [file] | feature configuration file | no |
-tra | [file] | file containing the feature transform to be applied | no |
-bat | [file] | batch file containing pairs [featureFileIn featureFileOut] | yes |
-in | [file] | input feature vectors | yes |
-out | [file] | output feature vectors | yes |
The hmmx
tool applies transformations to acoustic model parameters.
Transformations are applied to a set of input acoustic models and adapted acoustic models are stored in a
file.
parameter | type | description | optional |
---|---|---|---|
-pho | [file] | file containing the phonetic symbol set | no |
-tra | [file] | file containing the transform to be applied | no |
-rgt | [file] | file containing the regression tree (see the regtree tool) | no |
-in | [file] | input acoustic models | no |
-out | [file] | output acoustic models | no |
The latticeeditor
tool can be used to perform a wide variety of operations over lattices.
It can be used for computing the lattice Word Error Rate, computing lattice-based posterior probabilities and
confidence scores, aligning lattices against feature vectors and attaching acoustic log-likelihoods and
HMM-state information to its edges, compacting a lattice by merging redundant edges and nodes, rescoring
lattices according to a given criterion, attaching language model log-likelihoods to edges in the lattice,
and adding paths to lattices. Different operations require different combinations of input parameters.
parameter | type | description | optional | default value |
---|---|---|---|---|
-pho | [file] | phonetic symbol set | no | |
-lex | [file] | pronunciation lexicon | no | |
-mod | [file] | input acoustic models | yes | |
-lm | [file] | language model | yes | |
-bat | [file] | batch file containing entries whose elements depend on the action to perform | no | |
-act | [string] | action to perform | no | |
-trn | [file] | transcription file | yes | |
-hyp | [file] | hypothesis file | yes | |
-hypf | [string] | hypothesis file format | yes | trn |
-ip | [float] | penalty for inserting a word, silence or filler | yes | |
-ipf | [file] | file containing pairs [fillerSymbol insertionPenalty] | yes | |
-ams | [float] | scale factor applied to acoustic log-likelihoods | yes | |
-lms | [float] | scale factor applied to language model log-likelihoods | yes | |
-res | [string] | rescoring method | yes | likelihood |
-conf | [string] | confidence annotation method (applies to '-act pp') | yes | maximum |
-map | [file] | file containing word mappings to be used for computing WER | yes | |
-nbest | [int] | maximum number of entries (word sequences) in the n-best lists | yes | 100 |
Bavieca's Application Programming Interface
Bavieca's API provides an easy way to incorporate speech-recognition capabilities into an application. These capabilities are listed below:
- Stream-based feature extraction: speech features can be extracted from samples of audio as the audio becomes available and fed into the feature extraction process. Feature normalization is done in stream mode and feature vectors (instead of audio samples) are used as input to the various speech processing functions (recognition, speech detection, alignment, etc). The same set of features can be first used for speech activity detection and then for recognition.
- Stream-based speech recognition: while Bavieca's command line speech recognizers (dynamicdecoder and wfsadecoder) enable speech recognition in batch mode, which is ideal for experimentation, they do not provide the means to perform speech recognition over a live stream of audio. Bavieca's API enables speech recognition in live mode.
- Stream-based speech activity detection: speech activity detection is a very common mechanism to reduce the amount of audio fed to the speech recognition system in order to make better use of computational resources and reduce recognition errors.
- Speech-to-text alignment: time-alignment information of words and phonemes within an utterance can be generated to be used, for example, to highlight words and phonemes while playing back speech, or as input data to perform synthetic lip movement.
- Live-mode MLLR speaker adaptation: Coming soon...
Bavieca's API comprises a relatively small set of functions and data structures, all of which are declared in the header file BaviecaAPI.h. Below there is an example showing how to use the API.
Example
The example below shows a very simple way to use Bavieca's API to recognize speech in live-mode. For the sake of simplicity, speech samples in the example are retrieved from a file instead of from the microphone, however they are fed into the recognition process as if they were captured in real time.
```cpp
// initialize API
const char *strFileConfiguration = "configuration.txt";
BaviecaAPI baviecaAPI(strFileConfiguration);
if (baviecaAPI.initialize(INIT_SAD|INIT_ALIGNER|INIT_DECODER) == false) {
	return -1;
}

// load audio samples
const char *strFileRaw = "audio.raw";
ifstream is;
is.open(strFileRaw, ios::binary);
if (!is.is_open()) {
	cerr << "unable to open the file: \"" << strFileRaw << "\"";
}
is.seekg(0, ios::end);
int iBytes = is.tellg();
is.seekg(0, ios::beg);
int iSamples = (iBytes/2);
short *sSamples = new short[iSamples];
is.read((char*)sSamples, iSamples*2);
if (is.fail()) {
	cerr << "error reading from stream at position: " << is.tellg();
}
is.close();

// begin utterance processing
baviecaAPI.decBeginUtterance();

// simulate streaming data (1 second chunks)
int iSamplesChunk = 16000;
int iSamplesUsed = 0;
while ((iSamplesUsed+iSamplesChunk) < iSamples) {
	// extract features from the audio and feed them to the decoder
	int iFeatures = -1;
	float *fFeatures = baviecaAPI.extractFeatures(sSamples+iSamplesUsed, iSamplesChunk, &iFeatures);
	baviecaAPI.decProcess(fFeatures, iFeatures);
	baviecaAPI.free(fFeatures);
	iSamplesUsed += iSamplesChunk;
}
delete [] sSamples;

// print recognition hypothesis
int iWords = -1;
const char *strFileHypothesisLattice = NULL;
WordHypothesisI *wordHypothesis = baviecaAPI.decGetHypothesis(&iWords, strFileHypothesisLattice);
if (wordHypothesis) {
	for (int i=0 ; i < iWords ; ++i) {
		cout << " [" << wordHypothesis[i].iFrameStart << " ";
		cout << wordHypothesis[i].strWord << " ";
		cout << wordHypothesis[i].iFrameEnd << "] ";
	}
	cout << endl;
}
baviecaAPI.free(wordHypothesis, iWords);

// end utterance processing
baviecaAPI.decEndUtterance();
baviecaAPI.uninitialize();
```