Question

In: Computer Science

In text analysis, discuss the implications of using the following types of weighting with this dataset:

(a) TF-IDF (no scaling or normalization);

(b) TF-IDF with sublinear TF scaling;

and (c) TF-IDF with TF normalization.

please type, not hand write

Solutions

Expert Solution

TF-IDF definition: term frequency-inverse document frequency is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.

Solution a) TF-IDF (no scaling or normalization);

Tf-idf is a simple twist on the bag-of-words approach. It stands for term frequency-inverse document frequency. Instead of looking at the raw count of each word in each document in a dataset, tf-idf rescales each raw count: the count is multiplied by the total number of documents and divided by the number of documents in which the word appears. That is:

bow(w, d) = number of times word w appears in document d

tf-idf(w, d) = bow(w, d) * N / (number of documents in which word w appears)

N is the total number of documents in the dataset. The fraction N / (# documents ...) is what’s known as the inverse document frequency. If a word appears in many documents, then its inverse document frequency is close to 1. If a word appears in just a few documents, then the inverse document frequency is much higher.
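The raw (unscaled, unnormalized) weighting above can be sketched in a few lines of Python. This is a minimal illustration using a tiny hypothetical three-document corpus, not a production implementation:

```python
from collections import Counter

# Hypothetical corpus for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cats and the dogs",
]

tokenized = [d.split() for d in docs]
N = len(tokenized)  # total number of documents

# Document frequency: number of documents containing each word.
df = Counter()
for tokens in tokenized:
    df.update(set(tokens))

def tf_idf(word, tokens):
    """bow(w, d) * N / (# documents containing w), as defined above."""
    bow = tokens.count(word)
    return bow * N / df[word]

# "the" occurs twice in document 0 but appears in every document,
# so its high raw count is discounted by a low idf:
print(tf_idf("the", tokenized[0]))  # 2 * 3/3 = 2.0
# "cat" occurs once but appears in only one document:
print(tf_idf("cat", tokenized[0]))  # 1 * 3/1 = 3.0
```

Note how the rare word ends up weighted above the frequent one despite its lower raw count; with no scaling or normalization, though, the weights still grow linearly with raw counts and with document length.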

Solution b) TF-IDF with sublinear TF scaling

It seems unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence. Accordingly, there has been considerable research into variants of term frequency that go beyond counting the number of occurrences of a term. A common modification is to use instead the logarithm of the term frequency, which assigns a weight given by

wf(t, d) = 1 + log tf(t, d)   if tf(t, d) > 0

wf(t, d) = 0                  otherwise

In this form, we replace tf-idf by wf-idf:

wf-idf(t, d) = wf(t, d) * idf(t)
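The sublinear scaling above can be sketched directly. This sketch assumes the natural logarithm (base 10 is also common) and takes the raw count and idf as plain numbers:

```python
import math

def wf(tf_count):
    """Sublinear TF scaling: 1 + log(tf) for tf > 0, else 0."""
    return 1 + math.log(tf_count) if tf_count > 0 else 0.0

def wf_idf(tf_count, idf):
    """wf-idf(t, d) = wf(t, d) * idf(t)."""
    return wf(tf_count) * idf

# Twenty occurrences carry far less than twenty times the weight
# of a single occurrence:
print(wf(1))   # 1.0
print(wf(20))  # 1 + ln(20), about 4.0
```

This is exactly the intuition stated above: the contribution of repeated occurrences grows logarithmically rather than linearly.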

Solution c) TF-IDF with TF normalization.

In TF-IDF, normalization is generally used in two ways: first, to prevent bias in term frequency from terms in shorter or longer documents; second, to smooth each term's idf value (inverse document frequency). For example, Scikit-Learn's implementation (with smoothing enabled) replaces N with N+1 and dfi with dfi+1, calculates the natural logarithm of (N+1)/(dfi+1), and then adds 1 to the result.

To express Scikit-Learn's idf transformation, we can state the following equation:

idfi = ln[(N + 1) / (dfi + 1)] + 1

Once idfi is calculated, tf-idfi is tfi multiplied by idfi.

tf-idfi = tfi × idfi
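The smoothed idf and length normalization can be sketched without any library. This sketch assumes a hypothetical three-document corpus with given document frequencies and one document's term counts, and applies L2 (Euclidean) normalization so that document length cancels out:

```python
import math

N = 3                                 # total documents (hypothetical)
df = {"cat": 1, "sat": 2, "the": 3}   # document frequencies (assumed)
tf = {"cat": 1, "sat": 1, "the": 2}   # term counts in one document

def idf(term):
    """Smoothed idf: ln((N + 1) / (df + 1)) + 1."""
    return math.log((N + 1) / (df[term] + 1)) + 1

# Unnormalized tf-idf for this document.
raw = {t: tf[t] * idf(t) for t in tf}

# L2 normalization: divide by the Euclidean norm of the vector.
norm = math.sqrt(sum(v * v for v in raw.values()))
tfidf = {t: v / norm for t, v in raw.items()}

print(tfidf)
# After L2 normalization the document vector has unit length,
# so long and short documents are directly comparable:
print(sum(v * v for v in tfidf.values()))  # ~1.0
```

After this normalization, the weight of each term reflects its share of the document rather than the document's absolute length, which is precisely the bias the normalization is meant to remove.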

