Question


Compare the TF-IDF pivoted normalization formula and the Okapi BM25 formula analytically. Both formulas are given in Table 1 of Singhal's review paper. (Note that there is an error in the Okapi formula.) What common statistical information about documents and queries do both formulas use? How are the two formulas similar to each other, and how are they different?

Solutions

Expert Solution

TF-IDF weighting and document length normalization are easy to understand intuitively. Exactly how to implement them in a formula, however, is challenging and still an open question. Empirically, people have found that some kind of sublinear transformation of the term frequency in a document is needed, and that incorporating document length normalization through the form "(1 - s + s*doclength/avgdoclen)" (i.e., pivoted length normalization) is effective. This kind of normalization was justified in a paper by Amit Singhal and others. While BM25/Okapi and pivoted normalization are among the most effective ways to implement TF-IDF, finding the optimal implementation of TF-IDF remains one of the most challenging research questions in information retrieval. Read Fang et al. 04 for a discussion of this issue in the axiomatic retrieval framework.
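To make the comparison concrete, here is a minimal sketch of one query term's contribution to the document score under each formula, following the pivoted-normalization and BM25 forms discussed above. The function names, the slope default s=0.2, and the k1/b/k3 defaults are illustrative assumptions, not taken from the original text:

```python
import math

def pivoted_score(tf, qtf, df, dl, avdl, N, s=0.2):
    """One query term's contribution under pivoted length normalization.

    tf: term frequency in the document; qtf: frequency in the query;
    df: document frequency; dl/avdl: document / average document length;
    N: collection size; s: pivot slope (0.2 is a commonly cited default).
    """
    if tf == 0:
        return 0.0
    # sublinear TF transform divided by the pivoted length normalizer
    norm_tf = (1 + math.log(1 + math.log(tf))) / (1 - s + s * dl / avdl)
    idf = math.log((N + 1) / df)
    return norm_tf * qtf * idf

def bm25_score(tf, qtf, df, dl, avdl, N, k1=1.2, b=0.75, k3=1000):
    """One query term's contribution under Okapi BM25.

    Note: this IDF can go negative when df > N/2; some implementations
    clamp it at zero.
    """
    idf = math.log((N - df + 0.5) / (df + 0.5))
    # TF saturates toward (k1 + 1); length normalization sits inside k1's term
    tf_part = ((k1 + 1) * tf) / (k1 * (1 - b + b * dl / avdl) + tf)
    qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
    return idf * tf_part * qtf_part
```

Both sketches use the same statistics (tf, qtf, df, dl, avdl, N); the visible difference is that pivoted normalization applies a logarithmic TF transform and then divides by the length normalizer, while BM25 saturates TF toward an upper bound and folds length normalization into the saturation denominator.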

An essential component of any retrieval model is the feedback mechanism: when the user is willing to judge documents and label some as relevant and others as non-relevant, the system should be able to learn from those examples to improve search accuracy. This is called relevance feedback. User studies have shown, however, that users are often unwilling to make such judgments, raising concerns about the practical value of relevance feedback. Pseudo feedback (also called blind or automatic feedback) simply assumes some top-ranked documents to be relevant, and thus requires no user labeling. Pseudo feedback has also been shown to be effective on average, though it may hurt performance for some queries. Intuitively, the pseudo-feedback approach relies on term co-occurrences in the top-ranked documents to mine terms related to the query terms. These new terms can be used to expand the query and increase recall. Pseudo feedback may also improve precision, by supplementing the original query terms with related terms and assigning more accurate weights to the query terms.
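The expansion step described above can be sketched in a Rocchio-style form: average the term vectors of the top-ranked documents (assumed relevant) and mix that centroid into the query. The function name, the alpha/beta weights, and the dict-based vector representation here are illustrative assumptions, not a specific method named in the text:

```python
from collections import Counter

def rocchio_expand(query_vec, top_doc_vecs, alpha=1.0, beta=0.5, n_terms=10):
    """Pseudo relevance feedback via a Rocchio-style update.

    query_vec and each document vector are {term: weight} dicts. The top
    n_terms of the centroid of the (pseudo-)relevant documents are added
    to the query with weight beta; original query weights are scaled by
    alpha. Defaults are illustrative.
    """
    # centroid of the assumed-relevant documents
    centroid = Counter()
    for vec in top_doc_vecs:
        for term, w in vec.items():
            centroid[term] += w / len(top_doc_vecs)
    # scaled original query plus the strongest centroid terms
    expanded = Counter({t: alpha * w for t, w in query_vec.items()})
    for term, w in centroid.most_common(n_terms):
        expanded[term] += beta * w
    return dict(expanded)
```

For example, a one-term query can pick up co-occurring terms from the top documents with smaller weights, which is how expansion can raise recall while reweighting the original terms.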


Related Solutions

Compare the TF-IDF pivoted normalization formula and Okapi formula analytically. Both formulas are given in the figure above. What are the common statistical information about documents and queries that they both use? How are the two formulas similar to each other, and how are they different?
Implement the TF-IDF pivoted normalization method and Okapi retrieval formula above. For the Okapi formula, set k3=1000 and k1=1.2, so that you have only one parameter b to vary. In this way, both algorithms have precisely one parameter to tune.
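With k3=1000 and k1=1.2 fixed as described, tuning reduces to a one-dimensional search over b. A minimal grid-search harness for that could look like the following; `evaluate` is a caller-supplied function mapping a value of b to an effectiveness score (e.g., MAP on a validation set), and the function name and grid are illustrative:

```python
def sweep_b(evaluate, b_values=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Grid search over BM25's b with k1 and k3 held fixed.

    evaluate: callable taking b and returning an effectiveness score
    (higher is better). Returns the best b and its score.
    """
    best_b, best_score = None, float("-inf")
    for b in b_values:
        score = evaluate(b)
        if score > best_score:
            best_b, best_score = b, score
    return best_b, best_score
```

The same harness works for the pivoted-normalization method by sweeping its slope parameter s instead.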