In: Computer Science
In text analysis, discuss the implications of using the following types of weighting with this dataset:
(a) TF-IDF (no scaling or normalization);
(b) TF-IDF with sublinear TF scaling;
and (c) TF-IDF with TF normalization.
TF-IDF definition: TF-IDF (term frequency–inverse document frequency) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
Solution a) TF-IDF (no scaling or normalization);
Tf-idf is a simple twist on the bag-of-words approach. It stands for term frequency–inverse document frequency. Instead of looking at the raw count of each word in each document, tf-idf rescales each count: the count is multiplied by the total number of documents and divided by the number of documents in which the word appears. That is:
bow(w, d) = # times word w appears in document d
tf-idf(w, d) = bow(w, d) * N / (# documents in which word w appears)
N is the total number of documents in the dataset. The fraction N / (# documents ...) is what’s known as the inverse document frequency. If a word appears in many documents, then its inverse document frequency is close to 1. If a word appears in just a few documents, then the inverse document frequency is much higher.
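The raw weighting above can be sketched in a few lines of pure Python. This is a minimal illustration over a toy three-document corpus (the corpus itself is an assumption for the example), with no scaling or normalization applied:

```python
from collections import Counter

# Toy corpus: each document is a list of tokens (assumed for illustration).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
N = len(docs)

# bow(w, d): raw count of word w in document d
bow = [Counter(d) for d in docs]

# Document frequency: number of documents containing each word
df = Counter(w for d in docs for w in set(d))

# tf-idf(w, d) = bow(w, d) * N / df(w), with no scaling or normalization
tfidf = [{w: c * N / df[w] for w, c in counts.items()} for counts in bow]

# "the" occurs in all 3 documents, so its weight stays at 1 * 3/3 = 1.0,
# while "cat" (2 of 3 documents) is boosted to 1 * 3/2 = 1.5.
print(tfidf[0])
```

Note how the common word "the" is not boosted at all, while the rarer words receive larger weights, which is exactly the behaviour the inverse-document-frequency factor is meant to produce.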
Solution b) TF-IDF with sublinear TF scaling
It seems unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence. Accordingly, there has been considerable research into variants of term frequency that go beyond counting the number of occurrences of a term. A common modification is to use instead the logarithm of the term frequency, which assigns a weight given by
wf(t,d) = 1 + log tf(t,d)   if tf(t,d) > 0
wf(t,d) = 0                 otherwise
In this form, tf-idf is replaced by wf-idf:
wf-idf(t,d) = wf(t,d) * idf(t)
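The piecewise definition above can be checked with a short sketch. A minimal Python version (the helper name `wf` simply mirrors the notation in the formula):

```python
import math

def wf(tf):
    """Sublinear term-frequency scaling: 1 + log(tf) if tf > 0, else 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

# Twenty occurrences carry roughly 4x the weight of one, not 20x.
print(wf(1))   # 1.0
print(wf(20))  # ≈ 3.996
```

This directly captures the motivation stated above: the weight still grows with term frequency, but far more slowly than the raw count.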
Solution c) TF-IDF with TF normalization.
In TF-IDF, normalization is generally used in two ways: first, to remove the bias that raw term frequencies carry toward longer documents (Scikit-Learn does this by L2-normalizing each document's tf-idf vector); second, to smooth each term's idf value (inverse document frequency). For example, Scikit-Learn's default implementation adds 1 to both the document count N and the document frequency dfi, calculates the natural logarithm of (N+1)/(dfi+1), and then adds 1 to the result, so that no term is ever given a zero weight.
To express Scikit-Learn's idf transformation, we can state the following equation:
idfi = ln[(N+1)/(dfi+1)] + 1
Once idfi is calculated, tf-idfi is simply tfi multiplied by idfi:
tf-idfi = tfi × idfi
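The smoothed idf above can be reproduced in a few lines. This is a pure-Python sketch, assuming Scikit-Learn's default `smooth_idf=True` behaviour; the helper name `sklearn_idf` is ours, not part of the library:

```python
import math

def sklearn_idf(N, df):
    """Scikit-Learn-style smoothed idf: ln((N+1)/(df+1)) + 1."""
    return math.log((N + 1) / (df + 1)) + 1

# A term appearing in every one of N documents gets idf = ln(1) + 1 = 1,
# so it is never zeroed out entirely.
print(sklearn_idf(3, 3))  # 1.0

# A rarer term gets a larger weight; tf-idf is then tf * idf.
tf = 2
idf = sklearn_idf(3, 1)   # ln(4/2) + 1 ≈ 1.693
print(tf * idf)           # ≈ 3.386
```

The "+1" outside the logarithm is what keeps terms that occur in every document from being discarded, which matters on small corpora where even common words can be informative.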