Explain/demonstrate how to use NLTK to train a language model to predict a word's language given a limited training set of unigram, bigram, and trigram character sets.
Hi,
NLTK stands for Natural Language Toolkit. It is written in Python and is a suite of libraries and programs for symbolic and statistical natural language processing. Natural language processing (NLP) enables a computer to interact with humans in a natural manner: it helps the computer understand human language and derive meaning from it.
NLTK has been called a wonderful tool for teaching and working in computational linguistics using Python, and an amazing library to play with natural language.
Let's look at how NLTK builds unigram, bigram, and trigram sets, the three kinds of n-gram a language model would be trained on to predict a word's language.
1: Unigrams:
In a unigram model each individual word is treated as a token. Using NLTK:
import nltk
from nltk.corpus import brown  # word statistics are derived from the Brown corpus (run nltk.download('brown') once if it is not installed)
words = brown.words()
fdist = nltk.FreqDist(w.lower() for w in words)  # count of each lowercased word
total = fdist.N()  # total number of tokens, i.e. the sum of all counts in fdist
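As a small illustration of the unigram counts above, here is a sketch that turns a frequency distribution into a relative frequency (an estimated unigram probability). The token list is a made-up stand-in for brown.words() so that it runs without downloading a corpus.

```python
import nltk

# Hypothetical toy tokens standing in for brown.words(); not real corpus data.
words = ["The", "cat", "sat", "on", "the", "mat", "the", "end"]

# Count each lowercased token.
fdist = nltk.FreqDist(w.lower() for w in words)

total = fdist.N()           # total number of tokens counted
p_the = fdist.freq("the")   # relative frequency: fdist["the"] / total

print(total, fdist["the"], p_the)
```

With a real corpus the same two calls, N() and freq(), give the denominator and the unigram probability estimate.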
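2: Bigrams:
In a bigram model each pair of two adjacent words is treated as a token. Using NLTK:
import nltk
bigrams = nltk.bigrams(words)  # pairs of adjacent words from the word list built earlier
cfd = nltk.ConditionalFreqDist(bigrams)  # for each first word, counts of the words that follow it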
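To show what the conditional frequency distribution gives you, here is a minimal sketch on a made-up token list (an assumption for illustration, not corpus data). Indexing the distribution by a word returns the counts of the words seen right after it.

```python
import nltk

# Hypothetical toy tokens; any list of words works here.
words = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

# Condition on the first word of each bigram, count the second word.
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))

after_the = cfd["the"]            # FreqDist of words that follow "the"
best = after_the.max()            # most frequent follower of "the"
p_cat = after_the.freq("cat")     # estimate of P(cat | the)

print(best, after_the["cat"], p_cat)
```

The freq() value is the usual maximum-likelihood bigram estimate: count of the pair divided by the count of all pairs starting with the first word.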
3: Trigrams:
In a trigram model three consecutive words are taken as a token. Using NLTK:
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
tokens = word_tokenize(text1)  # text1 is the input string; tokenizing needs NLTK's Punkt tokenizer data
trigrams = list(ngrams(tokens, 3))  # all trigrams in text1
print(trigrams)
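The question asks about character n-grams, and the same NLTK calls work on characters because a Python string is a sequence of characters. Below is a rough sketch of one way to predict a word's language from character unigrams, bigrams, and trigrams. The training words, the "<" and ">" boundary markers, the helper names (char_ngrams, train, score, predict), and the add-one smoothing are all assumptions made for illustration; this is not a built-in NLTK language identifier.

```python
import math
from nltk import FreqDist
from nltk.util import ngrams

# Hypothetical, very limited training set: a few words per language.
TRAIN = {
    "english": ["the", "which", "through", "night", "water", "should"],
    "spanish": ["que", "noche", "agua", "porque", "llegar", "mañana"],
}

def char_ngrams(word, n):
    """Character n-grams of a word, with '<' and '>' marking its boundaries."""
    padded = "<" + word.lower() + ">"
    return list(ngrams(padded, n))

def train(words_by_lang, orders=(1, 2, 3)):
    """Build one FreqDist per language and per n-gram order."""
    models = {}
    for lang, words in words_by_lang.items():
        models[lang] = {
            n: FreqDist(g for w in words for g in char_ngrams(w, n))
            for n in orders
        }
    return models

def score(word, dists):
    """Sum of log add-one-smoothed n-gram probabilities over all orders."""
    total = 0.0
    for n, fd in dists.items():
        denom = fd.N() + fd.B() + 1   # +1 leaves room for unseen n-grams
        for g in char_ngrams(word, n):
            total += math.log((fd[g] + 1) / denom)
    return total

def predict(word, models):
    """Return the language whose n-gram counts score the word highest."""
    return max(models, key=lambda lang: score(word, models[lang]))

models = train(TRAIN)
for w in ["thought", "noches"]:
    print(w, predict(w, models))
```

With so little training data the scores are crude, so smoothing matters: without the +1, any unseen n-gram would give a probability of zero and the log would fail.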
Hope this helps.
Thank you.