Explain the vector space model and the term frequency-inverse document frequency.
The vector space model (or term vector model) is an algebraic model for representing text documents (and objects in general) as vectors of identifiers, such as index terms.
Translation: We represent each example in our dataset as a list of features.
Each document is a vector of feature weights.
The model is used to represent documents in an n-dimensional space. But a “document” can mean any object you’re trying to model.
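To make the idea concrete, here is a minimal sketch of turning a few short texts into vectors, with one dimension per distinct term. The helper names (build_vocabulary, to_count_vector), the toy documents, and the tokenization (lowercasing and splitting on whitespace) are assumptions made for illustration only.

```python
from collections import Counter

def build_vocabulary(docs):
    """Return the sorted list of distinct terms across all documents."""
    return sorted({term for doc in docs for term in doc.lower().split()})

def to_count_vector(doc, vocabulary):
    """Map one document to a vector of raw term counts,
    with one entry per vocabulary term (0 if the term is absent)."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocabulary]

# Toy corpus (hypothetical data)
docs = ["the cat sat on the mat", "the dog sat", "cats and dogs"]

vocabulary = build_vocabulary(docs)
vectors = [to_count_vector(d, vocabulary) for d in docs]

print(vocabulary)
for v in vectors:
    print(v)
```

Every document becomes a point in an n-dimensional space, where n is the vocabulary size (9 in this toy corpus).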
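The term frequency ($\mathrm{tf}_{ij}$) is computed with respect to the i-th term and the j-th document. In its simplest, raw-count form:
$$\mathrm{tf}_{ij} = n_{ij}$$
where $n_{ij}$ is the number of occurrences of the i-th term in the j-th document. A common variant normalizes by document length, dividing $n_{ij}$ by the total number of term occurrences in the j-th document.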
The idea is that if a document contains many occurrences of a given term, it probably deals with that topic.
Given a corpus $D$, a term $t_i$ and a document $d_j \in D$, we denote the number of occurrences of $t_i$ in $d_j$ by $\mathrm{tf}_{ij}$. This is referred to as the term frequency.
The inverse document frequency ($\mathrm{idf}_i$) for a term $t_i$ is defined as
$$\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$
where $|D|$ is the number of documents in our corpus, and $|\{d : t_i \in d\}|$ is the number of documents in which the term appears. If the term $t_i$ appears in every document of the corpus, $\mathrm{idf}_i$ is equal to 0. The fewer documents the term $t_i$ appears in, the higher the value of $\mathrm{idf}_i$.
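The behaviour described above can be checked with a small sketch. It assumes the natural logarithm and documents that are already tokenized into lists of terms. The function name inverse_document_frequency and the toy corpus are hypothetical.

```python
import math

def inverse_document_frequency(term, corpus):
    """idf = log(|D| / number of documents that contain the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        # The definition divides by this count, so the term must occur somewhere.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / n_containing)

# Toy corpus of pre-tokenized documents (hypothetical data)
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

idf_the = inverse_document_frequency("the", corpus)  # appears in 3 of 3 documents
idf_sat = inverse_document_frequency("sat", corpus)  # appears in 2 of 3 documents
idf_cat = inverse_document_frequency("cat", corpus)  # appears in 1 of 3 documents
print(idf_the, idf_sat, idf_cat)
```

A term found in every document gets an idf of 0, and the value grows as the term becomes rarer.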
The measure called term frequency-inverse document frequency (tf-idf) is defined as
$$\mathrm{tfidf}_{ij} = \mathrm{tf}_{ij} \cdot \mathrm{idf}_i$$
It is a measure of the importance of a term $t_i$ in a given document $d_j$. It is a term frequency measure that gives a larger weight to terms that are less common in the corpus. The importance of very frequent terms is thereby lowered, which is often a desirable feature.
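Putting the two pieces together, the following is a rough sketch of tf-idf weighting. It assumes the raw-count term frequency and the natural-log inverse document frequency. The function name tf_idf and the toy corpus are made up for illustration; library implementations often use smoothed or normalized variants that give different numbers.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """Return one {term: weight} dict per document,
    where weight = raw term count * log(|D| / document frequency)."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)  # raw counts of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus of pre-tokenized documents (hypothetical data)
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

weights = tf_idf(corpus)
for w in weights:
    print(w)
```

In the first document, "the" occurs twice but appears in every document, so its weight is 0, while "cat" occurs once in a single document and receives the largest weight.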
Feel free to comment if you have any doubts.