In: Computer Science
In text analysis, discuss the implications of using the following types of weighting with this dataset:
(a) TF-IDF (no scaling or normalization);
(b) TF-IDF with sublinear TF scaling;
and (c) TF-IDF with TF normalization.
TF-IDF definition: TF-IDF (term frequency–inverse document frequency) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus.
Solution a) TF-IDF (no scaling or normalization);
Tf-idf is a simple twist on the bag-of-words approach. It stands for term frequency–inverse document frequency. Instead of looking at the raw count of each word in each document, tf-idf rescales each count: the count is multiplied by the total number of documents and divided by the number of documents in which the word appears. That is:
bow(w, d) = # times word w appears in document d
tf-idf(w, d) = bow(w, d) * N / (# documents in which word w appears)
N is the total number of documents in the dataset. The fraction N / (# documents ...) is what’s known as the inverse document frequency. If a word appears in many documents, then its inverse document frequency is close to 1. If a word appears in just a few documents, then the inverse document frequency is much higher.
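The raw weighting above can be sketched in a few lines of pure Python. This is a minimal illustration over a toy three-document corpus (the corpus itself is an assumption for the example), with no scaling or normalization applied:

```python
from collections import Counter

# Toy corpus: each document is a list of tokens (assumed for illustration).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
N = len(docs)

# bow(w, d): raw count of word w in document d
bow = [Counter(d) for d in docs]

# Document frequency: number of documents containing each word
df = Counter(w for d in docs for w in set(d))

# tf-idf(w, d) = bow(w, d) * N / df(w), with no scaling or normalization
tfidf = [{w: c * N / df[w] for w, c in counts.items()} for counts in bow]

# "the" occurs in all 3 documents, so its weight stays at 1 * 3/3 = 1.0,
# while "cat" (2 of 3 documents) is boosted to 1 * 3/2 = 1.5.
print(tfidf[0])
```

Note how the common word "the" is not boosted at all, while the rarer words receive larger weights, which is exactly the behaviour the inverse-document-frequency factor is meant to produce.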
Solution b) TF-IDF with sublinear TF scaling
It seems unlikely that twenty occurrences of a term in a document truly carry twenty times the significance of a single occurrence. Accordingly, there has been considerable research into variants of term frequency that go beyond counting the number of occurrences of a term. A common modification is to use instead the logarithm of the term frequency, which assigns a weight given by
wf(t,d) = 1 + log tf(t,d)   if tf(t,d) > 0
wf(t,d) = 0                 otherwise
In this form, tf-idf is replaced by wf-idf:
wf-idf(t,d) = wf(t,d) * idf(t)
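The piecewise definition above can be checked with a short sketch. A minimal Python version (the helper name `wf` simply mirrors the notation in the formula):

```python
import math

def wf(tf):
    """Sublinear term-frequency scaling: 1 + log(tf) if tf > 0, else 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0

# Twenty occurrences carry roughly 4x the weight of one, not 20x.
print(wf(1))   # 1.0
print(wf(20))  # ≈ 3.996
```

This directly captures the motivation stated above: the weight still grows with term frequency, but far more slowly than the raw count.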
Solution c) TF-IDF with TF normalization.
In TF-IDF, normalization is generally used in two ways: first, to remove the bias that raw term frequencies carry toward longer documents (Scikit-Learn does this by L2-normalizing each document's tf-idf vector); second, to smooth each term's idf value (inverse document frequency). For example, Scikit-Learn's default implementation adds 1 to both the document count N and the document frequency dfi, calculates the natural logarithm of (N+1)/(dfi+1), and then adds 1 to the result, so that no term is ever given a zero weight.
To express Scikit-Learn's idf transformation, we can state the following equation:
idfi = ln[(N+1)/(dfi+1)] + 1
Once idfi is calculated, tf-idfi is simply tfi multiplied by idfi:
tf-idfi = tfi × idfi
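The smoothed idf above can be reproduced in a few lines. This is a pure-Python sketch, assuming Scikit-Learn's default `smooth_idf=True` behaviour; the helper name `sklearn_idf` is ours, not part of the library:

```python
import math

def sklearn_idf(N, df):
    """Scikit-Learn-style smoothed idf: ln((N+1)/(df+1)) + 1."""
    return math.log((N + 1) / (df + 1)) + 1

# A term appearing in every one of N documents gets idf = ln(1) + 1 = 1,
# so it is never zeroed out entirely.
print(sklearn_idf(3, 3))  # 1.0

# A rarer term gets a larger weight; tf-idf is then tf * idf.
tf = 2
idf = sklearn_idf(3, 1)   # ln(4/2) + 1 ≈ 1.693
print(tf * idf)           # ≈ 3.386
```

The "+1" outside the logarithm is what keeps terms that occur in every document from being discarded, which matters on small corpora where even common words can be informative.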