
Most recognizers can be broken down into the following steps:
(1) Audio recording and utterance detection
(2) Pre-Filtering (pre-emphasis, normalization, banding, etc.)
(3) Framing and Windowing (chopping the data into a usable format)
(4) Filtering (further filtering of each window/frame/frequency band)
(5) Comparison and Matching (recognizing the utterance)
(6) Action (performing the function associated with the recognized pattern)
Although each step seems simple, each one can involve a multitude of different (and sometimes completely opposite) techniques.
(1) Audio/Utterance Recording can be accomplished in a number of ways. The starting point of an utterance can be found by comparing the ambient audio level (acoustic energy in some cases) with the sample just recorded. Endpoint detection is harder because speakers tend to leave "artifacts" such as breathing or sighing, teeth chatter, and echoes.
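As a rough illustration of the energy comparison described in (1), here is a minimal sketch. It assumes the first few frames of the recording contain only background noise. The function names, the frame length, and the threshold factor are all made up for this example and are not taken from any particular recognizer.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each consecutive, non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(s * s for s in frame) / frame_len)
    return energies


def find_utterance(samples, frame_len=4, noise_frames=2, factor=4.0):
    """Return (start, end) sample indices spanning the loud frames, or None.

    Assumption: the first `noise_frames` frames are ambient noise only.
    A frame counts as speech when its energy exceeds `factor` times
    the ambient energy.
    """
    energies = frame_energies(samples, frame_len)
    if len(energies) <= noise_frames:
        return None
    ambient = sum(energies[:noise_frames]) / noise_frames
    threshold = max(ambient * factor, 1e-12)  # guard against pure digital silence
    loud = [i for i, e in enumerate(energies) if e > threshold]
    if not loud:
        return None
    # First loud frame marks the start, last loud frame marks the end.
    return loud[0] * frame_len, (loud[-1] + 1) * frame_len
```

A plain threshold like this will happily treat a loud breath or an echo as part of the utterance, which is exactly why endpoint detection is the harder half of the problem.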
(2) Pre-Filtering is accomplished in a variety of ways, depending on other features of the recognition system. The most common methods are the "Bank-of-Filters" method, which uses a series of audio filters to prepare the sample, and the Linear Predictive Coding (LPC) method, which uses a prediction function to calculate the differences (errors) between predicted and actual sample values. Different forms of spectral analysis are also used.
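To make the pre-filtering ideas in (2) a little more concrete, the sketch below shows three small pieces: a pre-emphasis filter, peak normalization, and a one-tap linear predictor that returns its prediction errors. This is only a toy. Real LPC front ends use a higher predictor order, and the function names and the default coefficient of 0.95 are assumptions chosen for illustration.

```python
def pre_emphasize(samples, alpha=0.95):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1]."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - alpha * samples[n - 1])
    return out


def normalize_peak(samples, peak=1.0):
    """Scale the signal so its largest absolute value equals `peak`."""
    largest = max((abs(s) for s in samples), default=0.0)
    if largest == 0.0:
        return list(samples)
    return [s * peak / largest for s in samples]


def first_order_lpc(samples):
    """One-tap linear prediction: predict x[n] as a * x[n-1].

    The coefficient comes from the autocorrelation method, a = r(1) / r(0).
    Returns (a, errors), where errors[n-1] = x[n] - a * x[n-1].
    """
    r0 = sum(s * s for s in samples)
    r1 = sum(samples[n] * samples[n - 1] for n in range(1, len(samples)))
    a = r1 / r0 if r0 else 0.0
    errors = [samples[n] - a * samples[n - 1] for n in range(1, len(samples))]
    return a, errors
```

Note that pre-emphasis and the one-tap predictor have the same form; the difference is that pre-emphasis uses a fixed coefficient while the predictor estimates it from the signal.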
(3) Framing/Windowing involves separating the sample data into frames of a specific size. This is often rolled into step 2 or step 4. This step also involves preparing the sample boundaries for analysis (removing edge clicks, etc.).
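One common way to do the framing described in (3) is to cut overlapping frames and multiply each by a tapered window so the frame edges fade out instead of clicking. The sketch below uses a Hamming window. The choice of window, the function names, and the frame and hop sizes are assumptions for the example, not something any given recognizer is required to use.

```python
import math


def hamming(length):
    """Hamming window coefficients: 0.54 - 0.46 * cos(2*pi*n / (N-1))."""
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def split_frames(samples, frame_len, hop):
    """Cut `samples` into windowed frames of `frame_len`, advancing by `hop`.

    Trailing samples that do not fill a whole frame are dropped.
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append([samples[start + n] * window[n] for n in range(frame_len)])
    return frames
```

With a hop smaller than the frame length the frames overlap, so data attenuated near the edge of one frame sits near the middle of a neighbouring one.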
(4) Additional Filtering is not always present. It is the final preparation for each window before comparison and matching. Often this consists of time alignment and normalization.
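The simplest form of the time alignment mentioned in (4) is linear time normalization: stretch or squeeze a sequence of per-frame values to a fixed length so that a slow and a fast utterance can be compared point by point. The sketch below does this with linear interpolation. The function name is hypothetical, and real systems often use non-linear alignment instead.

```python
def resample_linear(values, target_len):
    """Stretch or squeeze `values` to `target_len` points by linear interpolation."""
    if not values or target_len <= 0:
        return []
    if len(values) == 1 or target_len == 1:
        return [values[0]] * target_len
    # Map output index i onto a fractional position in the input.
    scale = (len(values) - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, len(values) - 1)
        frac = pos - lo
        out.append(values[lo] * (1.0 - frac) + values[hi] * frac)
    return out
```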
(5) Comparison and Matching can draw on a huge number of techniques. Most involve comparing the current window with known samples. There are methods that use Hidden Markov Models (HMM), frequency analysis, differential analysis, linear algebra techniques/shortcuts, spectral distortion, and time distortion. All of these methods are used to produce a match together with an estimate of its probability or accuracy.
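As one example of a time distortion method, the sketch below implements a basic dynamic time warping (DTW) distance and uses it to pick the closest stored template. It works on plain sequences of numbers, where a real recognizer would compare feature vectors, and it returns a distance rather than a probability. The function names and the template dictionary are assumptions for the example.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    cost[i][j] holds the cheapest alignment of a[:i] with b[:j], where each
    step may advance in `a`, in `b`, or in both.
    """
    inf = float("inf")
    rows, cols = len(a), len(b)
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance in a only
                                 cost[i][j - 1],      # advance in b only
                                 cost[i - 1][j - 1])  # advance in both
    return cost[rows][cols]


def best_match(sample, templates):
    """Return (name, distance) of the template closest to `sample`.

    `templates` maps a name to a stored sequence and must not be empty.
    """
    scored = ((name, dtw_distance(sample, seq)) for name, seq in templates.items())
    return min(scored, key=lambda pair: pair[1])
```

A smaller distance means a better match. Turning that distance into a confidence figure, or rejecting the utterance when even the best distance is too large, is left out of this sketch.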
(6) Actions can be just about anything the developer wants. *GRIN*
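For step (6), one simple arrangement is a lookup table from recognized patterns to functions. The sketch below is just that. The command strings and handler functions are invented for the example.

```python
def make_dispatcher(actions, fallback=None):
    """Build a function that runs the handler registered for a recognized word.

    `actions` maps a recognized pattern (here a string) to a callable taking
    no arguments. Unknown patterns go to `fallback` when one is given;
    otherwise they are ignored and None is returned.
    """
    def dispatch(word):
        handler = actions.get(word, fallback)
        if handler is None:
            return None
        return handler()
    return dispatch


# Hypothetical usage: the commands and their results are made up.
dispatch = make_dispatcher({
    "lights on": lambda: "turning lights on",
    "lights off": lambda: "turning lights off",
})
```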