ID | Documents |
1 | I love data mining |
2 | The seven dwarves love mining |
3 | Data science is a hot new career |
4 | I don't love my major or career |
Use the corpus of documents shown in the above table to answer the quiz questions below.
What is the inverse document frequency (IDF) of the term "love"? (Round your answer to 2 decimal places).
What is the TF-IDF value (importance) of the term "data" to document 1? (Round your answer to 2 decimal places)
Can you show me the calculations? Thank you!
After stemming the words and removing irrelevant (stop) words, the term frequency matrix is created as shown below.
a) What is the inverse document frequency (IDF) of the term "love"? (Round your answer to 2 decimal places).
Inverse document frequency can be calculated using the following equation.
IDF(t) = log((1 + |d|) / |dt|)
where d is the document collection, dt is the set of documents containing term t, and log is the natural logarithm.
Here |d| is the number of documents = 4.
|dt| is the number of documents containing the term "love" = 3 (documents d1, d2, and d4 contain the term "love").
IDF(love) = log((1 + 4) / 3)
= log(5/3) = 0.51
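The IDF calculation above can be checked with a short Python sketch. It assumes the natural logarithm (which is what reproduces 0.51) and a simple lowercase, whitespace-split tokenization with no stemming or stop-word removal; the names `docs` and `idf` are illustrative and not part of the original solution.

```python
import math

# Corpus from the question, lowercased by hand.
# Assumption: plain whitespace tokenization, no stemming or stop-word removal.
docs = {
    1: "i love data mining",
    2: "the seven dwarves love mining",
    3: "data science is a hot new career",
    4: "i don't love my major or career",
}


def idf(term, docs):
    """IDF(t) = ln((1 + |d|) / |dt|), where |dt| counts documents containing term."""
    containing = sum(1 for text in docs.values() if term in text.split())
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log((1 + len(docs)) / containing)


idf_love = idf("love", docs)
print(round(idf_love, 2))  # 0.51
```

The term "love" occurs in documents 1, 2, and 4, so the function evaluates ln(5/3), matching the hand calculation.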
b) What is the TF-IDF value (importance) of the term "data" to document 1? (Round your answer to 2 decimal places)
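TF-IDF(d,t) = TF(d,t) x IDF(t)
TF(d,t) can be calculated using the following equation.
TF(d,t) = 0 if freq(d,t) = 0; otherwise TF(d,t) = 1 + log(1 + log(freq(d,t)))
where freq(d,t) is the number of times term t appears in document d, and log is the natural logarithm.
TF(d1,data) = 1 + log(1 + log(1)) [In document 1, the term "data" appears 1 time.]
= 1 + log(1 + 0)
= 1 + log(1) = 1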
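As a quick check of the term frequency step, here is a minimal Python sketch of the damped TF used above. It assumes the natural logarithm and the usual convention that a term with zero occurrences gets TF = 0 (the logarithm is undefined there); the function name `tf` is illustrative.

```python
import math


def tf(freq):
    """Damped term frequency: 1 + ln(1 + ln(freq)) for freq > 0.

    Assumption: a term that does not occur (freq == 0) gets TF = 0,
    since ln(0) is undefined.
    """
    if freq == 0:
        return 0.0
    return 1 + math.log(1 + math.log(freq))


# "data" appears once in document 1.
freq_data_d1 = "i love data mining".split().count("data")
tf_data_d1 = tf(freq_data_d1)
print(tf_data_d1)  # 1.0
```

Because of the double logarithm, the value grows very slowly with the raw count: a single occurrence gives exactly 1, and repeated occurrences add progressively less.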
Inverse document frequency can be calculated using the following equation.
IDF(t) = log((1 + |d|) / |dt|)
where d is the document collection, dt is the set of documents containing term t, and log is the natural logarithm.
Here |d| is the number of documents = 4.
|dt| is the number of documents containing the term "data" = 2 (documents d1 and d3 contain the term "data").
IDF(data) = log((1 + 4) / 2)
= log(5/2) = 0.92
TF-IDF(d1,data) = TF(d1,data) x IDF(data)
= 1 x 0.92
= 0.92
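Putting both parts together, the following Python sketch recomputes the TF-IDF of "data" in document 1 from the raw corpus. It is an illustration under stated assumptions, not a reference implementation: natural logarithms, whitespace tokenization of hand-lowercased text, TF = 0 for absent terms, and hypothetical helper names (`tf`, `idf`, `tf_idf`).

```python
import math

# Corpus from the question, lowercased by hand.
# Assumption: plain whitespace tokenization, no stemming or stop-word removal.
docs = {
    1: "i love data mining",
    2: "the seven dwarves love mining",
    3: "data science is a hot new career",
    4: "i don't love my major or career",
}


def tf(doc_id, term):
    """TF(d,t) = 1 + ln(1 + ln(freq)) if freq > 0, else 0 (assumed convention)."""
    freq = docs[doc_id].split().count(term)
    if freq == 0:
        return 0.0
    return 1 + math.log(1 + math.log(freq))


def idf(term):
    """IDF(t) = ln((1 + |d|) / |dt|)."""
    containing = sum(1 for text in docs.values() if term in text.split())
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log((1 + len(docs)) / containing)


def tf_idf(doc_id, term):
    """TF-IDF(d,t) = TF(d,t) x IDF(t)."""
    return tf(doc_id, term) * idf(term)


score = tf_idf(1, "data")
print(round(score, 2))  # 0.92
```

Here TF(d1, data) is 1 and IDF(data) is ln(5/2), so the product rounds to 0.92, in agreement with the worked answer.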