Can you compare 2 search engines based on information retrieval models like Probability model or boolean model?
Search Engine
The second approach to organizing and locating information on the Web is the search engine. A search engine is a program that searches documents for specified keywords and returns a list of the documents in which those keywords are found. Search engine technology is a response to the rapid growth of Web data on the Internet: it helps Web users find the information they want. A number of commercial search engines are available online, such as Yahoo!, Google, AltaVista and Baidu.
Search engines can be categorized into two types: general-purpose and specific-purpose. A general-purpose search engine, such as Google, aims to retrieve as many Web pages as possible that are relevant to the user's query. The returned pages are ranked according to their relevance weights with respect to the query. User satisfaction with the search results depends on how quickly and how accurately the desired information can be found. Specific-purpose search engines, on the other hand, search Web pages for a specific task or an identified community. Google Scholar and the Digital Bibliography & Library Project (DBLP) are two representatives. Such search engines are designed for a specific research community and provide information on, for example, conferences and journals in the computer science domain.
No matter which type it is, each search engine mainly performs the following tasks:
(i) A user interface is provided for searching the information on the Web. The user submits a query through this interface to find relevant information. The query must consist of words or phrases describing the specific information the user is interested in.
(ii) The search engine then searches its repository for entries matching the given query.
(iii) The search engine returns all URLs that match the given query.
(iv) The returned list places the better-matching URLs at the top. The returned URLs may link to other Web pages, textual data, images, audio, video etc.
Architecture of Search Engine
The general architecture of a search engine is shown in figure 3.2. The architecture consists of the following modules: (a) User interface (b) Query processor (c) Searcher (d) Evaluator (e) Web Crawler (f) Indexer (g) Repository.
A model is an idealization or abstraction of an actual process. Information retrieval models describe the computational process, for example how documents are ranked. Note that how documents or indexes are stored is an implementation matter rather than part of the model.
The goal of information retrieval (IR) is to provide users with the documents that will satisfy their information need. Retrieval models can also attempt to describe the human process, such as the information need and the user's interaction with the system.
Retrieval models can be categorized as:
1. Boolean retrieval model
2. Vector space model
3. Probabilistic model
4. Model based on belief net
The Boolean model of information retrieval is a classical information retrieval (IR) model. It is the earliest and one of the most widely adopted models, and many IR systems still support it today.
Exact vs Best match
In exact match, a query specifies precise criteria. Each document either matches or fails to match the query. The result of an exact-match query is a set of documents (without ranking).
In best match, a query describes good or best-matching documents. In this case the result is a ranked list of documents. The Boolean model discussed here is the most common exact-match model.
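The difference between the two result types can be sketched in a few lines of Python. The three documents and the scoring rule (count of query terms present) are invented for illustration and are not taken from any particular search engine.

```python
# Hypothetical three-document collection, invented for illustration only.
docs = {
    "d1": "everest is the highest mountain in nepal",
    "d2": "nepal has many trekking routes",
    "d3": "the alps are in europe",
}
query_terms = ["everest", "nepal"]


def exact_match(docs, terms):
    """Return the unranked set of documents that contain every query term."""
    return {doc_id for doc_id, text in docs.items()
            if all(t in text.split() for t in terms)}


def best_match(docs, terms):
    """Return (doc_id, score) pairs ranked by number of query terms present.

    The score is an assumed, deliberately simple ranking function.
    """
    scores = {doc_id: sum(t in text.split() for t in terms)
              for doc_id, text in docs.items()}
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(doc_id, score) for doc_id, score in ranked if score > 0]


exact_result = exact_match(docs, query_terms)
best_result = best_match(docs, query_terms)
print(exact_result)  # {'d1'}
print(best_result)   # [('d1', 2), ('d2', 1)]
```

Exact match drops d2 because it lacks "everest", while best match keeps it with a lower score.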
Basic Assumptions of the Boolean Model
1. An index term is either present (1) or absent (0) in a document.
2. All index terms provide equal evidence with respect to the information need.
3. Queries are Boolean combinations of index terms.
o X AND Y: represents the documents that contain both X and Y
o X OR Y: represents the documents that contain X, Y, or both
o NOT X: represents the documents that do not contain X
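One way to picture the three operators is as set operations on the documents that contain each term. The sketch below assumes a made-up collection of four document ids and a hypothetical `postings` mapping. It is only an illustration, not how any specific system stores its index.

```python
# Hypothetical data: the ids of all documents, and for each term
# the set of document ids that contain it.
all_docs = {1, 2, 3, 4}
postings = {
    "everest": {1, 2},
    "nepal": {2, 3},
}


def docs_with(term):
    """Return the set of documents containing the term (empty if unknown)."""
    return postings.get(term, set())


both = docs_with("everest") & docs_with("nepal")     # everest AND nepal
either = docs_with("everest") | docs_with("nepal")   # everest OR nepal
not_everest = all_docs - docs_with("everest")        # NOT everest

print(both)         # {2}
print(either)       # {1, 2, 3}
print(not_everest)  # {3, 4}
```

Note that NOT needs the full set of document ids, since it is a complement relative to the whole collection.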
Boolean Queries Example
User information need: Interested to know about Everest and Nepal
User Boolean query: Everest AND Nepal
Implementation Part
Example of Input collection
Doc1 = English tutorial and fast track
Doc2 = Book on semantic analysis
Doc3 = Learning latent semantic indexing
Doc4 = Advance in structure and semantic indexing
Doc5 = Analysis of latent structures
Query problem: advance AND structure AND NOT analysis
Boolean Model Index Construction
First we build the term-document incidence matrix, which lists all the distinct terms and records their presence in each document (the incidence vector). If the document contains the term, the entry in the incidence vector is 1, otherwise 0.
Terms/doc | Doc1 | Doc2 | Doc3 | Doc4 | Doc5
English | 1 | 0 | 0 | 0 | 0
Tutorial | 1 | 0 | 0 | 0 | 0
Fast | 1 | 0 | 0 | 0 | 0
Track | 1 | 0 | 0 | 0 | 0
Books | 0 | 1 | 0 | 0 | 0
Semantic | 0 | 1 | 1 | 1 | 0
Analysis | 0 | 1 | 0 | 0 | 1
Learning | 0 | 0 | 1 | 0 | 0
Latent | 0 | 0 | 1 | 0 | 1
Indexing | 0 | 0 | 1 | 1 | 0
Advance | 0 | 0 | 0 | 1 | 0
Structures | 0 | 0 | 0 | 1 | 1
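The construction of the matrix can be sketched as follows. The term lists in `doc_terms` are read off the rows of the matrix above. The sketch assumes that preprocessing (lowercasing, dropping words such as "and" or "of", and mapping "structure" to "structures") has already been done. The variable names are illustrative only.

```python
# Index terms per document, as recorded in the incidence matrix above.
# Assumption: stop-word removal and term normalization already happened.
doc_terms = {
    "Doc1": ["english", "tutorial", "fast", "track"],
    "Doc2": ["books", "semantic", "analysis"],
    "Doc3": ["learning", "latent", "semantic", "indexing"],
    "Doc4": ["advance", "structures", "semantic", "indexing"],
    "Doc5": ["analysis", "latent", "structures"],
}

doc_ids = sorted(doc_terms)  # fixed column order: Doc1 .. Doc5
vocabulary = sorted({term for terms in doc_terms.values() for term in terms})

# One 0/1 incidence vector per term, with one entry per document.
incidence = {
    term: [1 if term in doc_terms[doc_id] else 0 for doc_id in doc_ids]
    for term in vocabulary
}

for term in vocabulary:
    print(f"{term:<11}", incidence[term])
```

Each printed row corresponds to one row of the table, with the terms in alphabetical order rather than in order of first appearance.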
So now we have a 0/1 vector for each term. To answer the query we take the vectors for advance, structure and analysis, complement the last, and then do a bitwise AND.
Operand | Doc1 | Doc2 | Doc3 | Doc4 | Doc5
advance | 0 | 0 | 0 | 1 | 0
structure | 0 | 0 | 0 | 1 | 1
advance AND structure | 0 | 0 | 0 | 1 | 0
NOT analysis | 1 | 0 | 1 | 1 | 0
advance AND structure AND NOT analysis | 0 | 0 | 0 | 1 | 0
Result: Doc4
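The same evaluation can be written with integer bitmasks, as a minimal sketch. The three vectors are copied from the incidence matrix (the query term "structure" uses the Structures row). The helper `to_bits` and the other names are introduced here purely for illustration.

```python
DOC_IDS = ["Doc1", "Doc2", "Doc3", "Doc4", "Doc5"]

# Incidence vectors for the three query terms, taken from the matrix.
vectors = {
    "advance":   [0, 0, 0, 1, 0],
    "structure": [0, 0, 0, 1, 1],
    "analysis":  [0, 1, 0, 0, 1],
}


def to_bits(vector):
    """Pack a 0/1 list into an int; the first entry becomes the highest bit."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | value
    return bits


n = len(DOC_IDS)
full_mask = (1 << n) - 1  # 0b11111, keeps the complement to n bits

advance = to_bits(vectors["advance"])
structure = to_bits(vectors["structure"])
analysis = to_bits(vectors["analysis"])

not_analysis = ~analysis & full_mask           # complement the last vector
result = advance & structure & not_analysis    # bitwise AND of all three

# Map the set bits back to document names (highest bit is Doc1).
matches = [doc for i, doc in enumerate(DOC_IDS)
           if (result >> (n - 1 - i)) & 1]

print(format(not_analysis, "05b"))  # 10110
print(format(result, "05b"))        # 00010
print(matches)                      # ['Doc4']
```

Masking with `full_mask` matters because Python integers are unbounded, so a bare `~analysis` would be negative rather than a 5-bit complement.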