Compare the TF-IDF pivoted normalization formula and the Okapi BM25 formula analytically. Both formulas are given in Table 1 of Singhal's review paper (note that there is an error in the Okapi formula). What common statistical information about documents and queries do both formulas use? How are the two formulas similar to each other, and how do they differ?
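Since Table 1 is not reproduced here, the two formulas as they are commonly written in the literature are shown below for reference (notation assumed for this sketch: tf is the term's frequency in the document, qtf its frequency in the query, df its document frequency, N the number of documents, |d| the document length, and avdl the average document length; the widely used corrected idf factor is shown for Okapi):

```latex
% Pivoted length normalization (Singhal et al.), as commonly cited:
\sum_{t \in q \cap d}
  \frac{1 + \ln\bigl(1 + \ln \mathit{tf}_{t,d}\bigr)}
       {(1-s) + s \cdot \frac{|d|}{\mathit{avdl}}}
  \cdot \mathit{qtf}_t
  \cdot \ln\frac{N+1}{\mathit{df}_t}

% Okapi BM25, with the standard (corrected) idf factor:
\sum_{t \in q \cap d}
  \ln\frac{N - \mathit{df}_t + 0.5}{\mathit{df}_t + 0.5}
  \cdot \frac{(k_1 + 1)\,\mathit{tf}_{t,d}}
             {k_1\bigl((1-b) + b\,\frac{|d|}{\mathit{avdl}}\bigr) + \mathit{tf}_{t,d}}
  \cdot \frac{(k_3 + 1)\,\mathit{qtf}_t}{k_3 + \mathit{qtf}_t}
```

Both formulas combine the same statistics (tf, qtf, df, N, |d|, avdl): each applies a sublinear transformation to tf, an idf-like factor based on df and N, and a length normalization pivoted around avdl; they differ in the exact functional forms used for each component.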
TF-IDF weighting and document length normalization are quite easy to understand intuitively. However, how to implement them exactly in a formula is quite challenging and is still an open question. Empirically, people have found that some kind of sublinear transformation of the term frequency in a document is needed, and that incorporating document length normalization through the form "(1-s + s*doclength/avgdoclen)" (i.e., pivoted length normalization) is effective. This kind of normalization was justified in a paper by Amit Singhal and others; see that paper for the details. While BM25/Okapi and pivoted normalization are among the most effective ways to implement TF-IDF, finding the optimal way to implement TF-IDF remains one of the most challenging research questions in information retrieval. Read Fang et al. 04 for a discussion of this issue in the axiomatic retrieval framework.
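As a concrete illustration, here is a minimal sketch of both scoring functions over a toy corpus. The corpus, query, and parameter values (s, k1, b, k3) are made up for the example; the formulas follow the commonly cited pivoted-normalization and BM25 forms, not necessarily the exact variants in any one paper:

```python
import math

# Toy corpus: each document is a list of terms (assumed already tokenized).
docs = [
    "the quick brown fox jumps over the lazy dog".split(),
    "a slow green turtle crawls under a log".split(),
    "cats nap in the warm sun".split(),
    "rain falls on the quiet street".split(),
]

# Shared collection statistics used by BOTH formulas.
N = len(docs)                                  # number of documents
avgdl = sum(len(d) for d in docs) / N          # average document length
df = {}                                        # document frequency per term
for d in docs:
    for t in set(d):
        df[t] = df.get(t, 0) + 1

def pivoted_score(query, doc, s=0.2):
    """Pivoted length normalization: double-log tf transformation,
    pivoted length normalizer, and ln((N+1)/df) as the idf factor."""
    dl = len(doc)
    score = 0.0
    for t in set(query):
        tf = doc.count(t)
        if tf == 0:
            continue
        tf_part = (1 + math.log(1 + math.log(tf))) / ((1 - s) + s * dl / avgdl)
        score += tf_part * query.count(t) * math.log((N + 1) / df[t])
    return score

def bm25_score(query, doc, k1=1.2, b=0.75, k3=1000):
    """Okapi BM25 with the standard idf ln((N - df + 0.5)/(df + 0.5))."""
    dl = len(doc)
    score = 0.0
    for t in set(query):
        tf = doc.count(t)
        if tf == 0:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        tf_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * dl / avgdl) + tf)
        qtf = query.count(t)
        qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
        score += idf * tf_part * qtf_part
    return score

query = "quick fox".split()
for i, d in enumerate(docs):
    print(i, pivoted_score(query, d), bm25_score(query, d))
```

Note how the two functions consume exactly the same statistics (tf, qtf, df, N, document length, average document length) but differ in the sublinear tf transformation, the idf form, and where the length normalizer enters.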
An essential component of any retrieval model is the feedback mechanism: when the user is willing to judge documents, labeling some as relevant and others as non-relevant, the system should be able to learn from such examples to improve search accuracy. This is called relevance feedback. User studies have shown, however, that users are often unwilling to make such judgments, raising concerns about the practical value of relevance feedback. Pseudo feedback (also called blind or automatic feedback) simply assumes some top-ranked documents to be relevant, and thus does not require the user to label documents. Pseudo feedback has also been shown to be effective on average, though it may hurt performance for some queries. Intuitively, the pseudo-feedback approach relies on term co-occurrences in the top-ranked documents to mine terms related to the query terms. These new terms can be used to expand the query and increase recall. Pseudo feedback may also improve precision, by supplementing the original query terms with related terms and by assigning more accurate weights to the query terms.
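The expansion step described above can be sketched with a simple Rocchio-style heuristic. This is one illustrative way to do pseudo feedback, not the only one; the ranked documents, query, and the parameters k, n_expansion, alpha, and beta are all made up for the example:

```python
from collections import Counter

# Assumed input: documents ranked by an initial retrieval run.
ranked_docs = [
    "machine learning improves search ranking".split(),
    "learning to rank with neural networks".split(),
    "gardening tips for spring flowers".split(),
]

def pseudo_feedback_expand(query_terms, ranked_docs,
                           k=2, n_expansion=3, alpha=1.0, beta=0.5):
    """Rocchio-style pseudo feedback: blindly assume the top-k documents
    are relevant, then add their highest-weight terms to the query."""
    q = Counter(query_terms)
    # Build a centroid of the assumed-relevant documents using
    # length-normalized term frequencies.
    centroid = Counter()
    for d in ranked_docs[:k]:
        for t, c in Counter(d).items():
            centroid[t] += c / len(d)
    # Original terms keep weight alpha; expansion terms get a smaller
    # weight derived from the centroid (averaged over k, scaled by beta).
    expanded = Counter({t: alpha * w for t, w in q.items()})
    candidates = [(w, t) for t, w in centroid.items() if t not in q]
    for w, t in sorted(candidates, reverse=True)[:n_expansion]:
        expanded[t] = beta * w / k
    return expanded

new_query = pseudo_feedback_expand("search ranking".split(), ranked_docs)
```

Because only the top k documents contribute, terms that co-occur with the query terms in those documents (here, e.g., terms from the two ranking-related documents) are pulled into the expanded query, while terms from lower-ranked documents are not.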