Explain in detail about the Maximum Entropy Model for POS Tagging. Use flowcharts and diagrams to explain your points better.
Maximum Entropy Model :
Maximum entropy probability models offer a clean way to combine diverse pieces of contextual evidence in order to estimate the probability of a certain linguistic class occurring with a certain linguistic context.
Representing Evidence :
Evidence is represented with two kinds of functions: contextual predicates and features.
If A = {a1, ..., an} represents the set of possible classes to be predicted, and B represents the set of possible contexts or textual material that can be observed, then a contextual predicate is a function

cp : B -> {true, false}
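In code, a contextual predicate is simply a yes/no question about the observed context. Below is a minimal Python sketch, assuming a dictionary-based context representation; the dictionary keys and the predicate chosen are illustrative assumptions, not part of the original formulation.

```python
# The context b is represented here as a plain dict holding the current word
# and some surrounding material; this layout is an assumption for illustration.
def current_suffix_is_ing(b):
    """A contextual predicate cp : B -> {true, false}."""
    return b["current_word"].endswith("ing")

b = {"current_word": "running", "prev_tag": "PRP"}
print(current_suffix_is_ing(b))  # True
```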
Maximum Entropy Model for POS Tagging :
Many natural language tasks require the accurate assignment of Part-Of-Speech (POS) tags to previously unseen text. Due to the availability of large corpora that have been manually annotated with POS information, many taggers use annotated text to "learn" either probability distributions or rules and use them to automatically assign POS tags to unseen text.
The POS tagger described here is implemented under the maximum entropy framework and learns a probability distribution for tagging from manually annotated data, namely the Wall Street Journal corpus of the Penn Treebank project. Since most realistic natural language applications must process words that were never seen in the training data, all experiments here are conducted on test data that include unknown words.
The Probability Model :
The probability p(a|b) represents the conditional probability of a tag a ∈ A, given some context or history b ∈ B, where A is the set of allowable tags and B is the set of possible word and tag contexts. The model takes the log-linear product form

p(a|b) = ( ∏ j=1..k αj^fj(a,b) ) / Z(b),    Z(b) = Σ a'∈A ∏ j=1..k αj^fj(a',b)
where Z(b) normalizes the distribution over the tag set and, as usual, each parameter αj corresponds to a feature fj. Given a sequence of words {w1, ..., wn} and tags {a1, ..., an} as training data, we define bi as the context available when predicting ai.
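To make this form concrete, here is a minimal sketch of evaluating p(a|b) under the log-linear product, with Z(b) summing over all candidate tags. The feature functions and parameter values below are invented for illustration and are not taken from a trained tagger.

```python
from math import prod

def p(a, b, tags, features, alphas):
    """p(a|b) = prod_j alphas[j] ** features[j](a, b), divided by Z(b)."""
    def unnormalized(tag):
        return prod(alpha ** f(tag, b) for f, alpha in zip(features, alphas))
    return unnormalized(a) / sum(unnormalized(t) for t in tags)

# Toy model: two binary features with made-up parameter values.
tags = ["VBG", "NN"]
features = [
    lambda a, b: 1 if b["current_word"].endswith("ing") and a == "VBG" else 0,
    lambda a, b: 1 if b["prev_tag"] == "DT" and a == "NN" else 0,
]
alphas = [4.0, 3.0]

b = {"current_word": "running", "prev_tag": "PRP"}
print(p("VBG", b, tags, features, alphas))  # 4 / (4 + 1) = 0.8
```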
Features for POS Tagging :
The conditional probability of a tag a given a history b is determined by those parameters whose corresponding features are active, i.e., those αj such that fj(a,b) = 1. A feature, given (a, b), may activate on any word or tag in the history b, and must encode any information that might help predict a, such as the spelling of the current word or the identity of the previous two tags. For example, define the contextual predicate currentsuffix_is_ing(b) to return true if the current word in b ends with the suffix "ing". A useful feature might then be
fj(a, bi) = 1   if currentsuffix_is_ing(bi) = true and a = VBG
fj(a, bi) = 0   otherwise
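The same feature written as a minimal Python sketch, reusing the illustrative dictionary context from the earlier examples:

```python
def f_j(a, b):
    """Returns 1 only when the current word ends in "ing" and the candidate
    tag is VBG; returns 0 otherwise."""
    if b["current_word"].endswith("ing") and a == "VBG":
        return 1
    return 0

print(f_j("VBG", {"current_word": "running"}))  # 1
print(f_j("NN",  {"current_word": "running"}))  # 0
```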
Contextual Predicates :
The contextual predicates are generated automatically by scanning each context bi in the training data with the "templates" shown in the table below.
| Condition | Contextual Predicates |
| --- | --- |
| wi is not rare | wi = X |
| wi is rare | X is a prefix of wi, \|X\| <= 4 |
| | X is a suffix of wi, \|X\| <= 4 |
| | wi contains a number |
| | wi contains an uppercase character |
| | wi contains a hyphen |
| ∀ wi | ti-1 = X |
| | ti-2 ti-1 = X Y |
| | wi-1 = X |
| | wi-2 = X |
| | wi+1 = X |
| | wi+2 = X |
For example, consider the following annotated training sentence:

| Word: | the | stories | about | well-heeled | communities | and | developers |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Tag: | DT | NNS | IN | JJ | NNS | CC | NNS |
| Position: | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
The generation of contextual predicates for tagging unknown words relies on the hypothesis that "rare" words in the training set are similar to unknown words in the test data with respect to how their spellings help predict their tags.
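A minimal sketch of how the templates could be expanded against the example sentence above. The function name, the string encoding of each predicate, and the choice of which word counts as "rare" are assumptions made for illustration.

```python
def generate_predicates(words, tags, i, rare):
    """Expand the templates from the table above at 0-based position i."""
    preds = []
    w = words[i]
    if not rare:
        preds.append(f"w_i={w}")
    else:
        # Spelling-based predicates used when w_i is rare or unknown.
        for n in range(1, min(4, len(w)) + 1):
            preds.append(f"prefix={w[:n]}")
            preds.append(f"suffix={w[-n:]}")
        if any(c.isdigit() for c in w):
            preds.append("contains_number")
        if any(c.isupper() for c in w):
            preds.append("contains_uppercase")
        if "-" in w:
            preds.append("contains_hyphen")
    # Templates applied to every word w_i (only previous tags are available
    # when predicting, but words on both sides can be used).
    if i >= 1:
        preds.append(f"t_i-1={tags[i - 1]}")
        preds.append(f"w_i-1={words[i - 1]}")
    if i >= 2:
        preds.append(f"t_i-2,t_i-1={tags[i - 2]},{tags[i - 1]}")
        preds.append(f"w_i-2={words[i - 2]}")
    if i + 1 < len(words):
        preds.append(f"w_i+1={words[i + 1]}")
    if i + 2 < len(words):
        preds.append(f"w_i+2={words[i + 2]}")
    return preds

words = ["the", "stories", "about", "well-heeled", "communities", "and", "developers"]
tags  = ["DT",  "NNS",     "IN",    "JJ",          "NNS",         "CC",  "NNS"]

# Treat "well-heeled" (position 4 in the example table) as a rare word.
print(generate_predicates(words, tags, 3, rare=True))
```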