Scenario: Centers for Disease Control

The Centers for Disease Control and Prevention (CDC) is the national public health institute of the United States. Its main aim is to protect people's health and safety through the control and prevention of diseases. The CDC had to rely on doctors' reports of influenza outbreaks, and as a result it was weeks behind in providing vaccines to affected patients. Using historical data from the CDC, Google compared search term queries against geographical areas that were known to have had flu outbreaks. Google then found forty-five terms that correlated with the outbreak of flu. With this data, the CDC can act immediately.

Questions:
1. To effectively identify the terms related to influenza outbreaks, Google may need to apply clustering techniques. Which type or types of clusters would best fit this application? Please justify your answer.
2. To improve the clustering performance, what clustering configuration would you suggest Google use, given the characteristics of the input data? Please justify your answer.
3. Another option for Google is to use a classification approach. Please compare the classification and clustering approaches in the context of this scenario by listing at least three differences between them and their impact on the sorting process.
1. Clustering is the task of dividing a population or set of data points into groups such that the points in one group are more similar to one another than to the points in other groups. In simple words, the aim is to segregate the data into groups with similar traits, called clusters.
Clustering algorithms in machine learning fall into four broad families:
Connectivity models
Centroid models
Distribution models
Density models
Semantic Hierarchical Online Clustering (SHOC), which falls under the connectivity model, uses suffix arrays to extract frequent phrases and singular value decomposition to discover the cluster content. It would therefore be well suited to extracting the terms related to a medical condition from search queries, in this case influenza outbreaks.
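To make the connectivity (hierarchical) idea concrete, here is a minimal sketch in Python. It is not SHOC: it swaps suffix arrays and singular value decomposition for a much simpler word-overlap (Jaccard) distance with single-linkage merging. The queries, the function names and the 0.8 threshold are all assumptions made up for illustration.

```python
from itertools import combinations


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A & B| / |A | B| over the word sets of two queries."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)


def single_linkage(items, distance, threshold):
    """Merge the two closest clusters until no pair is within threshold."""
    clusters = [[item] for item in items]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Single linkage: cluster distance = distance of the closest members.
            d = min(distance(x, y) for x in clusters[i] for y in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break  # remaining clusters are too dissimilar to merge
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe because j > i
    return clusters


# Hypothetical search queries, invented for illustration only.
queries = [
    "flu symptoms",
    "flu symptoms in children",
    "flu fever remedy",
    "cheap flights paris",
    "flights to paris",
]

clusters = single_linkage(queries, jaccard_distance, threshold=0.8)
for cluster in clusters:
    print(cluster)
```

With this toy data the three flu queries end up in one cluster and the two flight queries in another. The threshold decides where the merging stops, so in practice it would have to be tuned on real query logs.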
2. Clustering results depend more on the chosen objective function than on the chosen algorithm. When the sum of squared errors (SSE) is the objective, clusters of variable sizes cause large clusters to be split and smaller ones to be merged. Alternative approaches such as genetic algorithms, random swap, particle swarm optimization, spectral clustering and density-based clustering can help with more complex cluster structures. If natural clusters are the expected outcome, a better result could be achieved by using an objective function based on the Mahalanobis distance or a Gaussian mixture model instead of SSE.
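The effect of swapping the distance inside the objective can be seen in a small sketch. The code below compares the Euclidean distance (the one behind SSE) with the Mahalanobis distance for one elongated two-dimensional cluster. The sample points and function names are assumptions chosen for illustration, and the closed-form matrix inverse only works for the 2x2 case.

```python
import math


def mean_and_covariance(points):
    """Sample mean and 2x2 sample covariance of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def euclidean(point, mean):
    """Straight-line distance, the one implied by an SSE objective."""
    return math.hypot(point[0] - mean[0], point[1] - mean[1])


def mahalanobis(point, mean, cov):
    """sqrt((p - m)^T S^-1 (p - m)) using the closed-form 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    # S^-1 = 1/det * [[d, -b], [-c, a]]
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# Hypothetical cluster that is stretched along the x axis.
cluster = [(-4, 0), (-2, 1), (0, -2), (2, 1), (4, 0)]
mean, cov = mean_and_covariance(cluster)

along = (3, 0)   # lies in the direction the cluster is stretched
across = (0, 3)  # same Euclidean distance, but across the cluster

print(euclidean(along, mean), euclidean(across, mean))
print(mahalanobis(along, mean, cov), mahalanobis(across, mean, cov))
```

Both test points are equally far from the centre in Euclidean terms, but the Mahalanobis distance rates the point along the stretched axis as much closer. That is why it suits clusters that are not round, which SSE implicitly assumes.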
3. Although the two techniques have certain similarities, the key difference is that classification assigns objects to predefined classes, whereas clustering has no predefined classes: it identifies similarities between objects and groups them according to the characteristics they share and that differentiate them from other groups of objects. These groups are known as "clusters".
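The contrast can be sketched with two tiny routines working on the same numbers. In this sketch each search term is reduced to a single hypothetical score (for example, how strongly its search volume tracks reported flu cases); the scores, labels and function names are invented for illustration. The classifier needs the labels up front, while the clustering routine never sees them.

```python
def train_nearest_centroid(samples, labels):
    """Classification: learn one centroid per predefined class label."""
    sums, counts = {}, {}
    for x, label in zip(samples, labels):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def classify(x, centroids):
    """Assign x to the class whose centroid is nearest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def kmeans_1d(samples, k, iterations=20):
    """Clustering: group unlabeled values (k >= 2); group ids carry no meaning."""
    ordered = sorted(samples)
    # Deterministic start: spread the initial centroids across the sorted data.
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    groups = [[] for _ in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for x in samples:
            nearest = min(range(k), key=lambda i: abs(x - centroids[i]))
            groups[nearest].append(x)
        # Keep the old centroid if a group happens to be empty.
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return groups


# Hypothetical per-term scores and hand-assigned labels.
scores = [0.91, 0.85, 0.88, 0.12, 0.05, 0.20]
labels = ["flu", "flu", "flu", "other", "other", "other"]

# Classification: uses the labels, returns a named class for a new term.
centroids = train_nearest_centroid(scores, labels)
print(classify(0.70, centroids))

# Clustering: ignores the labels, returns anonymous groups.
print(kmeans_1d(scores, k=2))
```

The classifier answers with a class name that a person defined beforehand, so it depends on labeled history. The clustering call only returns groups, and someone still has to inspect them to decide which one, if any, corresponds to flu-related terms.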