After reviewing the resources for this module, discuss the power of clustering and association models. Give an example of a company that collects or uses data for various reasons. How can clustering or association models help the company complete the sentence "You might also be interested in
Hey,
Note: If you have any questions about this answer, please leave a comment. I would be very happy to resolve them.
Clustering is an explorative analysis that tries to
recognize structures within the data. Clustering is used to
recognize groups of cases when the grouping is not previously known.
Clustering is often part of a sequence of analyses: factor
analysis, cluster analysis, and finally, discriminant analysis.
The general categories of cluster analysis methods are Joining
(Tree Clustering), Two-way Joining (Block Clustering),
Hierarchical Clustering, and k-means Clustering. In short,
whatever your line of business, sooner or later you will run
into a clustering problem in some form.
Hierarchical clustering is the most common method. It creates a
series of models with cluster solutions from “1” (all
cases in one cluster) to “n” (each case is its own cluster).
In addition, hierarchical cluster analysis can deal
with nominal, ordinal, and scale data; however, it is not
recommended to mix different levels of measurement. K-means
clustering is a method for quickly clustering large data sets, which
typically take a while to compute with the
preferred hierarchical cluster analysis.
The purpose of cluster analysis is to place objects into groups, or
clusters, suggested by the data, not defined a priori,
such that objects in a given cluster tend to be similar to each
other in some sense, and objects in different clusters
tend to be dissimilar. You can also use cluster analysis to
summarize data rather than to find "natural" or "real"
clusters; this use of clustering is sometimes called dissection.
Clustering techniques have been applied to a wide
variety of research problems.
For example, in the field of medicine, clustering diseases, cures
for diseases, or symptoms of diseases can lead to
very useful taxonomies.
Medicine – What are the diagnostic clusters? To answer
this question the researcher would devise a
diagnostic questionnaire that covers the symptoms (for example, in
psychology, standardized scales for anxiety,
depression, etc.). The cluster analysis can then identify groups of
patients that present with similar symptoms and
simultaneously maximize the difference between the groups.
Marketing – What are the customer segments? To
answer this question a market researcher conducts a survey, most
commonly covering needs, attitudes,
demographics, and behavior of customers. The researcher then uses
cluster analysis to identify homogeneous
groups of customers that have similar needs and attitudes but are
distinctly different from other customer
segments.
In general, whenever we need to classify a "mountain" of
information into manageable, meaningful piles,
cluster analysis is of great utility.
Association rules reveal interesting associations and correlation
relationships among large sets of data items.
Association rules show attribute-value conditions that frequently
occur together in a given data set. A typical
example of association rule mining is Market Basket Analysis. In
data mining, association rules are helpful for
analyzing and predicting customer behavior. Data is collected using
bar-code scanners in supermarkets. Such market
basket databases consist of a large number of transaction records.
Each record lists all items purchased by a
customer in a single purchase transaction. The association model is
therefore often linked with "market basket analysis",
which is used to find relationships or correlations in a set of
items. A typical association rule of this kind states
a probability, for instance: "70% of the people
who purchase spaghetti, wine, and sauce also
purchase garlic bread."
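To make the market basket idea concrete, here is a minimal Python sketch of how support and confidence could be computed and turned into a "You might also be interested in" suggestion. The transactions, the function names (`count`, `support`, `confidence`, `recommend`), and the `min_confidence` threshold are all made up for illustration; a real retailer would likely use a dedicated rule-mining algorithm on far more data.

```python
# Hypothetical transactions; each set is one customer's market basket.
transactions = [
    {"spaghetti", "wine", "sauce", "garlic bread"},
    {"spaghetti", "sauce", "garlic bread"},
    {"spaghetti", "wine", "sauce"},
    {"wine", "cheese"},
    {"spaghetti", "sauce", "garlic bread", "cheese"},
]

def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)

def support(itemset, baskets):
    """Fraction of all baskets that contain the itemset."""
    return count(itemset, baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Share of baskets containing the antecedent that also contain the consequent."""
    base = count(antecedent, baskets)
    if base == 0:
        return 0.0
    return count(set(antecedent) | set(consequent), baskets) / base

def recommend(cart, baskets, min_confidence=0.5):
    """Rank items not yet in the cart by the confidence of the rule cart -> item."""
    cart = set(cart)
    candidates = set().union(*baskets) - cart
    scored = [(item, confidence(cart, {item}, baskets)) for item in candidates]
    kept = [pair for pair in scored if pair[1] >= min_confidence]
    # Highest confidence first; ties broken alphabetically for stable output.
    return sorted(kept, key=lambda pair: (-pair[1], pair[0]))

suggestions = recommend({"spaghetti", "sauce"}, transactions)
```

With this toy data, a shopper whose cart holds spaghetti and sauce would be shown garlic bread first, because three of the four baskets containing both items also contain garlic bread.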
Clustering and association models can help organizations with
market segmentation. One of the best uses of data mining is to
segment your customers, and it is fairly simple: from your
data you can separate your market into meaningful segments
such as age, income, occupation, or gender.
Segmentation can also help you understand your
competition. This insight alone will help you recognize that the
usual suspects are not the only ones targeting the same customer
money as you are.
The SAS/STAT procedures for clustering are oriented toward disjoint or
hierarchical clusters from coordinate data, distance data, or a
correlation or covariance matrix.
The SAS/STAT cluster analysis procedures include the following:
ACECLUS Procedure: Obtains approximate estimates of the pooled within-cluster covariance matrix when the clusters are assumed to be multivariate normal with equal covariance matrices.
CLUSTER Procedure: Hierarchically clusters the observations in a SAS data set.
DISTANCE Procedure: Computes various measures of distance, dissimilarity, or similarity between the observations (rows) of a SAS data set. Proximity measures are stored as a lower triangular matrix or a square matrix in an output data set that can then be used as input to the CLUSTER, MDS, and MODECLUS procedures.
FASTCLUS Procedure: Disjoint cluster analysis on the basis of distances computed from one or more quantitative variables.
MODECLUS Procedure: Clusters observations in a SAS data set.
TREE Procedure: Produces a tree diagram, also known as a dendrogram or phenogram, from a data set created by the CLUSTER or VARCLUS procedure.
VARCLUS Procedure: Divides a set of numeric variables into disjoint or hierarchical clusters.
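To illustrate what "proximity measures stored as a lower triangular matrix" means, here is a small Python sketch. It is not SAS syntax and does not reproduce the DISTANCE procedure; the function name `lower_triangular_distances` and the sample rows are assumptions chosen for illustration, and Euclidean distance is just one of many possible measures.

```python
import math

def lower_triangular_distances(rows):
    """Euclidean distance between every pair of rows, lower triangle only.

    Row i of the result holds the distances from observation i to
    observations 0..i, so the last entry of each row is the zero diagonal.
    """
    return [
        [math.dist(rows[i], rows[j]) for j in range(i + 1)]
        for i in range(len(rows))
    ]

# Hypothetical observations with two quantitative variables each.
observations = [(0, 0), (3, 4), (6, 8)]
matrix = lower_triangular_distances(observations)
```

Because distance is symmetric, the lower triangle carries all the information a clustering step needs, which is why it is enough to store half of the square matrix.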
Statistical Significance Testing
Note that the above discussions refer to clustering algorithms and
do not mention anything about
statistical significance testing. In fact, cluster analysis is not
as much a typical statistical test as it
is a "collection" of different algorithms that "put objects into
clusters according to well defined
similarity rules." The point here is that, unlike many other
statistical procedures, cluster analysis
methods are mostly used when we do not have any a priori
hypotheses, but are still in the
exploratory phase of our research. In a sense, cluster analysis
finds the "most significant solution
possible." Therefore, statistical significance testing is really
not appropriate here, even in cases
when p-levels are reported (as in k-means clustering).
Area of Application
Clustering techniques have been applied to a wide variety of
research problems. Hartigan (1975)
provides an excellent summary of the many published studies
reporting the results of cluster
analyses. For example, in the field of medicine, clustering
diseases, cures for diseases, or
symptoms of diseases can lead to very useful taxonomies. In the
field of psychiatry, the correct
diagnosis of clusters of symptoms such as paranoia, schizophrenia,
etc. is essential for successful
therapy. In archeology, researchers have attempted to establish
taxonomies of stone tools, funeral
objects, etc. by applying cluster analytic techniques. In general,
whenever we need to classify a
"mountain" of information into manageable meaningful piles, cluster
analysis is of great utility.
What is Cluster Analysis?
The Cluster Analysis is an explorative analysis that tries to
identify structures within the data.
Cluster analysis is also called segmentation analysis or taxonomy
analysis. More specifically, it
tries to identify homogenous groups of cases, i.e., observations,
participants, respondents.
Cluster analysis is used to identify groups of cases if the
grouping is not previously known.
Because it is explorative, it does not make any distinction between
dependent and independent
variables.
The Cluster Analysis is often part of the sequence of analyses of
factor analysis, cluster analysis,
and finally, discriminant analysis. First, a factor analysis that
reduces the dimensions and
therefore the number of variables makes it easier to run the
cluster analysis. Also, the factor
analysis minimizes multicollinearity effects. The next analysis is
the cluster analysis, which
identifies the grouping. Lastly, a discriminant analysis checks the
goodness of fit of the model
that the cluster analysis found and profiles the clusters.
Education – What are student groups that need special attention?
The researcher measures a
couple of psychological, aptitude, and achievement characteristics.
A cluster analysis then
identifies what homogeneous groups exist among students (for
example, high achievers in all
subjects, or students that excel in certain subjects but fail in
others, etc.). A discriminant analysis
then profiles these performance clusters and tells us what
psychological, environmental,
aptitudinal, affective, and attitudinal factors characterize these
student groups.
Biology – What is the taxonomy of species? The researcher has
collected a data set of
different plants and noted different attributes of their
phenotypes. A hierarchical cluster analysis
groups those observations into a series of clusters and builds a
taxonomy tree of groups and
subgroups of similar plants.
K-means clustering is a method to quickly cluster large data sets,
which typically take a while to compute with the preferred
hierarchical cluster analysis. The researcher must define the
number of clusters in advance. This is useful to test different
models with a different assumed
number of clusters (for example, in customer segmentation).
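As a rough illustration of k-means with a number of clusters fixed in advance, here is a minimal pure-Python sketch of the standard assign-then-update loop (Lloyd's algorithm). The function name `kmeans`, its parameters, and the `customers` data are hypothetical; a production analysis would more likely rely on a statistics package.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Cluster a list of numeric tuples into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen input points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster in place
        if new_centroids == centroids:  # converged: nothing moved
            break
        centroids = new_centroids
    return centroids, labels

# Hypothetical customers described by (monthly visits, average basket size).
customers = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, labels = kmeans(customers, k=2)
```

Rerunning with different values of `k` is one simple way to compare models with a different assumed number of segments.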
Hierarchical clustering is the most common method. It takes
time to calculate, but it generates a series of models with cluster
solutions from 1 (all cases in one cluster) to n (each case is its
own cluster). Hierarchical clustering also works with
variables as opposed to cases; it can cluster variables together in
a manner somewhat similar to factor analysis. In addition,
hierarchical cluster analysis can handle nominal, ordinal, and scale
data; however, it is not recommended to mix different levels of
measurement.
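The "series of cluster solutions" can be sketched in a few lines of Python. This is only an illustration under assumptions: it uses single linkage (distance between the closest members of two clusters) and Euclidean distance, it builds the series bottom-up from n clusters to 1, and the name `single_linkage` and the sample points are invented for the example.

```python
def single_linkage(points):
    """Agglomerative clustering; returns every solution from n clusters down to 1."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    clusters = [[i] for i in range(len(points))]  # each case starts as its own cluster
    solutions = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Find the two clusters whose closest members are nearest to each other.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        solutions.append([sorted(c) for c in clusters])
    return solutions

# Hypothetical cases: two tight pairs and one outlier.
cases = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
solutions = single_linkage(cases)
```

Each entry of `solutions` lists the case indices in each cluster at that level, so an analyst can pick the level (number of clusters) that best fits the problem. The nested pair search is also why this approach becomes slow on large data sets.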
Please reply if you have any questions.
Thanks.