1. In vector space models, you can use concepts or terms as basis vectors. Describe the advantages and disadvantages of these two types of vectors with respect to each other.
2. Consider the following two words: {precision, precise}. Shall we cluster them together if we set the similarity threshold to 0.5? Please justify your answer. (Hint: use the Dice coefficient to compute the similarity.)
Ans: The solutions are given below.
1) Vector Space Models
a) Definition
The vector space model is an algebraic model for representing text documents (or any objects, in general) as vectors of identifiers. It is used in information filtering, information retrieval, indexing and relevancy ranking.
Documents and queries are represented as vectors:
dj = (w1,j, w2,j, ..., wt,j)
q = (w1,q, w2,q, ..., wt,q)
Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed. One of the best-known schemes is tf-idf weighting.
The definition of term depends on the application. Typically terms are single words, keywords, or longer phrases. If words are chosen to be the terms, the dimensionality of the vector is the number of words in the vocabulary.
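As a minimal sketch of the representation described above, the following Python builds one vector per document with one dimension per word. The function name `tfidf_vectors`, the whitespace tokenizer, the sample documents and the particular tf-idf variant (raw term frequency times the log of inverse document frequency) are all assumptions made for illustration; other weighting variants exist.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors): one tf-idf vector per document."""
    # Assumption: terms are lowercase, whitespace-separated words.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Weight = term frequency * log(N / document frequency).
        vectors.append([tf[w] * math.log(n_docs / df[w]) for w in vocab])
    return vocab, vectors

# Hypothetical three-document collection.
docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, vectors = tfidf_vectors(docs)
```

The dimensionality of each vector equals the vocabulary size, and a word that occurs in every document (here "the") receives weight zero under this variant.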
b) Advantages
1. Simple model based on linear algebra
2. Term weights not binary
3. Allows computing a continuous degree of similarity between queries and documents
4. Allows ranking documents according to their possible relevance
5. Allows partial matching
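The ranking and partial-matching advantages listed above can be illustrated with cosine similarity, a common similarity measure in vector space models. This is only a sketch: the helper names `cosine_similarity` and `rank`, and the three hypothetical document vectors, are assumptions and not part of the original answer.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # A zero vector shares no terms with anything.
    return dot / (norm_a * norm_b)

def rank(query, documents):
    """Return (score, name) pairs sorted from most to least similar."""
    scored = [(cosine_similarity(query, vec), name)
              for name, vec in documents.items()]
    return sorted(scored, reverse=True)

# Hypothetical term-weight vectors over a three-word vocabulary.
documents = {
    "d1": [1.0, 1.0, 0.0],  # contains both query terms
    "d2": [0.0, 1.0, 1.0],  # contains one query term (partial match)
    "d3": [0.0, 0.0, 2.0],  # contains no query term
}
query = [1.0, 1.0, 0.0]
ranking = rank(query, documents)
```

Here d2 matches only one of the two query terms, yet it still gets a non-zero score and is ranked between the full match d1 and the non-matching d3.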
c) Disadvantages
1. Long documents are poorly represented because they have poor similarity values (a small scalar product and a large dimensionality)
2. Search keywords must precisely match document terms; word substrings might result in a "false positive match"
3. Semantic sensitivity; documents with similar context but different term vocabulary won't be associated, resulting in a "false negative match".
4. The order in which the terms appear in the document is lost in the vector space representation.
5. Theoretically assumes terms are statistically independent.
6. Weighting is intuitive but not very formal.
Two Types of Vector Products
1. Dot Product
The dot product of two vectors a and b (sometimes called the inner product, or, since its result is a scalar, the scalar product) is denoted by a ∙ b and is defined as:
a ∙ b = ‖a‖ ‖b‖ cos θ
where θ is the measure of the angle between a and b (see trigonometric function for an explanation of cosine). Geometrically, this means that a and b are drawn with a common start point, and then the length of a is multiplied by the length of the component of b that points in the same direction as a.
The dot product can also be defined as the sum of the products of the components of each vector:
a ∙ b = a1 b1 + a2 b2 + ... + an bn
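A small check that the geometric and component-wise definitions of the dot product agree, written as a sketch. The helper names `dot` and `norm` and the two sample vectors are assumptions chosen for illustration; the angle is computed independently with `math.atan2` so that the comparison is not circular.

```python
import math

def dot(a, b):
    """Component-wise definition: sum of products of components."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

# Hypothetical 2-D vectors: a lies on the x-axis, b at 45 degrees to it.
a = [3.0, 0.0]
b = [2.0, 2.0]

# Angle between a and b, obtained from their directions.
theta = math.atan2(b[1], b[0]) - math.atan2(a[1], a[0])

algebraic = dot(a, b)                             # a1*b1 + a2*b2
geometric = norm(a) * norm(b) * math.cos(theta)   # |a| |b| cos(theta)
```

Both expressions give 6 for these vectors (up to floating-point rounding in the geometric form).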
2. Cross Product
The cross product (also called the vector product or outer product) is only meaningful in three or seven dimensions. The cross product differs from the dot product primarily in that the result of the cross product of two vectors is a vector. The cross product, denoted a × b, is a vector perpendicular to both a and b and is defined as
a × b = ‖a‖ ‖b‖ sin(θ) n
where θ is the measure of the angle between a and b, and n is a unit vector perpendicular to both a and b which completes a right-handed system. The right-handedness constraint is necessary because there exist two unit vectors that are perpendicular to both a and b, namely, n and (−n).
The cross product a × b is defined so that a, b, and a × b also form a right-handed system (although a and b are not necessarily orthogonal). This is the right-hand rule.
The length of a × b can be interpreted as the area of the parallelogram having a and b as sides.
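The cross product can be written in components as
a × b = (a2 b3 − a3 b2) e1 + (a3 b1 − a1 b3) e2 + (a1 b2 − a2 b1) e3
where e1, e2 and e3 are the standard basis vectors.
For arbitrary choices of spatial orientation (that is, allowing for left-handed as well as right-handed coordinate systems) the cross product of two vectors is a pseudovector instead of a vector.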
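The properties of the three-dimensional cross product described above (perpendicularity, parallelogram area, dependence on orientation) can be checked with a short sketch. The helper names `cross`, `dot` and `norm` and the sample vectors are assumptions chosen for illustration.

```python
import math

def cross(a, b):
    """Cross product of two 3-D vectors in a right-handed system."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

# Hypothetical vectors in the xy-plane; they are not orthogonal.
a = [2.0, 0.0, 0.0]
b = [1.0, 3.0, 0.0]

c = cross(a, b)      # points along +z by the right-hand rule
area = norm(c)       # area of the parallelogram spanned by a and b
swapped = cross(b, a)  # reversing the order flips the direction
```

The parallelogram here has base 2 and height 3, so the length of a × b is 6, and the result is perpendicular to both inputs (its dot product with each is zero).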
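2) Clustering "precision" and "precise"
The Dice coefficient measures the similarity of two strings x and y as
S = 2nt / (nx + ny)
where nt is the number of character bigrams found in both strings, nx is the number of bigrams in string x and ny is the number of bigrams in string y. To calculate the similarity between:
precision
precise
we first find the set of bigrams in each word:
precision: {pr, re, ec, ci, is, si, io, on}
precise: {pr, re, ec, ci, is, se}
The two words share five bigrams, {pr, re, ec, ci, is}, so nt = 5, nx = 8 and ny = 6:
S = 2nt / (nx + ny) = (2 × 5) / (8 + 6) = 10/14 ≈ 0.71
Similarity threshold = 0.5
Since 0.71 > 0.5, the two words should be clustered together.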
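The Dice computation can be sketched in Python, taking every overlapping pair of adjacent characters as a bigram. The function names `bigrams` and `dice` are hypothetical, and treating the bigrams as a set (rather than counting repeated bigrams) is an assumption; it makes no difference for these two words because neither repeats a bigram.

```python
def bigrams(word):
    """Set of overlapping character bigrams, e.g. 'abc' -> {'ab', 'bc'}."""
    return {word[i:i + 2] for i in range(len(word) - 1)}

def dice(x, y):
    """Dice coefficient S = 2*nt / (nx + ny) over character bigrams."""
    bx, by = bigrams(x), bigrams(y)
    if not bx and not by:
        return 0.0  # Assumption: strings too short for bigrams score 0.
    nt = len(bx & by)  # bigrams found in both strings
    return 2 * nt / (len(bx) + len(by))

threshold = 0.5
similarity = dice("precision", "precise")
cluster_together = similarity > threshold
```

With 8 bigrams in "precision", 6 in "precise" and 5 shared, the similarity is 10/14, which exceeds the 0.5 threshold, so the two words fall in the same cluster.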