When you try to automatically process human speech, you frequently try to understand what the person was saying (or trying to say) from very ambiguous sounds. For instance, the phrases “Recognize speech” and “Wreck a nice beach” sound very similar when pronounced by the average American speaker. As humans, when we process speech, we always use the context of a word to judge what the speaker is probably trying to say. Our minds automatically discard options that are very inconsistent with what we have inferred about the sentence so far, and we reevaluate our earlier “guesses” based on new sounds that come in. For example, if your friend tells you “I didn’t recognize you”, you infer pretty safely from “you” that it wasn’t “I didn’t wreck a nice you”, whereas if your friend says “I didn’t wreck a nice car”, you find this more likely than “I didn’t recognize car”.

If we want to get machines to do this kind of reasoning, we can use the following simplistic model. There is a known directed graph $G = (V, E)$. Each node $v \in V$ encodes an initial portion of an (intended) sentence. Each directed edge $e$ has two things associated with it: (1) a probability $p_e \in [0, 1]$, and (2) a label $\lambda_e \in \Sigma$, where $\Sigma$ is an alphabet of phonemes. We assume that for each node $v$ and each label $\ell$, there is at least one edge $e = (v, u)$ out of $v$ labeled by $\lambda_e = \ell$ and at least one edge $e = (u, v)$ into $v$ labeled by $\lambda_e = \ell$. Also, we assume that the sum of probabilities of all edges out of $v$ is 1, i.e., $\sum_{u : (v,u) \in E} p_{(v,u)} = 1$.

In addition to the graph, you are given a start node $s$ (which corresponds to not having heard anything yet) and a sequence of observed phonemes $L = \ell_1 \ell_2 \ell_3 \cdots \ell_k$ of length $k$. Your goal is to find a directed path $P = (e_1, e_2, \ldots, e_k)$ in $G$ of length $k$, starting from $s$, such that (1) the sequence of labels on $P$ matches the given sequence, i.e., $\lambda_{e_i} = \ell_i$ for all $i$, and (2) subject to this requirement, the probability $\prod_{i=1}^{k} p_{e_i}$ is maximized.
Such a path P (and its final node v) give a good guess as to what the speaker was trying to say. Give (and analyze) a polynomial-time algorithm (polynomial in the size of the graph and the length of the sequence) for finding such a path P. Note: The main difficulty arises from the fact that a node v may have multiple outgoing edges with the same label. This corresponds to not being able to tell for sure whether you heard “Recognize” or “Wreck a nice”. Otherwise, this problem would be nearly trivial.
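One way to approach the problem stated above is a Viterbi-style dynamic program: after reading the first i phonemes, keep for every node the highest probability of any label-matching path from s that ends there, then extend all of these by one edge for phoneme i+1. The Python sketch below is only an illustration under assumptions not made in the problem statement: edges are given as a list of (u, v, probability, label) tuples, the function name best_labeled_path is hypothetical, and the result is reported as a node sequence rather than an edge sequence.

```python
def best_labeled_path(edges, start, labels):
    """Return (probability, node_path) for the most probable path from
    `start` whose edge labels spell out `labels`, or (0.0, None) if no
    path matches. `edges` is a list of (u, v, prob, label) tuples
    (an assumed input format, chosen for this sketch)."""
    # Index the outgoing edges by (source node, label).
    out = {}
    for u, v, p, lab in edges:
        out.setdefault((u, lab), []).append((v, p))

    # best[v]: highest probability of a matching path from start to v
    # for the labels consumed so far.
    best = {start: 1.0}
    back = []  # back[i][v]: predecessor of v on the best path at step i
    for lab in labels:
        nxt, pred = {}, {}
        for u, pu in best.items():
            for v, p in out.get((u, lab), ()):
                cand = pu * p
                if cand > nxt.get(v, -1.0):
                    nxt[v] = cand
                    pred[v] = u
        if not nxt:  # no edge carries this label from any reachable node
            return 0.0, None
        best = nxt
        back.append(pred)

    # Pick the best final node and follow predecessors back to start.
    end = max(best, key=best.get)
    path = [end]
    for pred in reversed(back):
        path.append(pred[path[-1]])
    path.reverse()
    return best[end], path
```

Each of the k steps looks at every edge at most once, so the sketch runs in time O(k · |E|), which is polynomial in the size of the graph and the length of the sequence. Keeping one value per node per step is what handles the difficulty noted above: when a node has several outgoing edges with the same label, all of them are extended, and the choice is settled only by later phonemes. For long sequences the product of probabilities can underflow; summing logarithms of the probabilities instead is a common remedy.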
A more technical definition is given by Jurafsky, who defines automatic
speech recognition (ASR) as the building of a system for mapping acoustic
signals to a string of words. He goes on to define automatic speech
understanding (ASU) as extending that goal to producing some sort of
understanding of the sentence.
We will consider speaker-independent ASR, i.e. systems that have not
been adapted to a single speaker but are meant, in some sense, to handle
all speakers of a particular language.
Humans use more than their ears when listening: they also use the knowledge
they have about the speaker and the subject. Words are not sequenced
arbitrarily; there is a grammatical structure and redundancy that humans use
to predict words not yet spoken. Furthermore, idioms and how we 'usually'
say things make prediction even easier.
In ASR we only have the speech signal. We can of course construct a
model for the grammatical structure and use some kind of statistical model
to improve prediction, but there is still the problem of how to model world
knowledge, knowledge of the speaker, and encyclopedic knowledge. We
cannot, of course, model world knowledge exhaustively, but an interesting
question is how much we actually need in ASR to measure up to human
comprehension.
Spoken language has for many years been viewed just as a less complicated
version of written language, with the main difference that spoken language is
grammatically less complex and that humans make more performance errors
while speaking. However, it has become clear in the last few years that
spoken language is essentially different from written language. In ASR, we
have to identify and address these differences.
Written communication is usually one-way, but speech is dialogue-oriented.
In a dialogue, we give feedback to signal that we understand, we negotiate
about the meaning of words, we adapt to the receiver, etc.
Another important issue is disfluencies in speech: normal speech is
filled with hesitations, repetitions, changes of subject in the middle of an
utterance, slips of the tongue, etc.