Scenario. Centers for Disease Control
The Centers for Disease Control and Prevention (CDC) is the national public health institute of the United States. Its main aim is to protect people's health and safety through the control and prevention of disease. In the past, the CDC had to rely on doctors' reports of influenza outbreaks, which left it weeks behind in providing vaccines to affected patients. Using historical data from the CDC, Google compared search term queries against geographical areas that were known to have had flu outbreaks. Google then found forty-five terms that correlated with the outbreak of flu. With this data, the CDC can act immediately.
Questions:
1. To effectively identify the terms related to influenza outbreaks, Google may need to apply clustering techniques. Which type or types of clusters would best fit this application? Please justify your answer.
2. To improve clustering performance, what clustering configuration would you suggest Google use, given the characteristics of the input data? Please justify your answer.
3. Another option for Google is to use the classification approach. Please compare the classification and clustering approaches in the context of this scenario by listing at least three differences between them and their impact on the sorting process.
1. Clustering is a method of data analysis in which the data are segregated into groups, so that items assigned to the same group share similar characteristics.
There are several types of clustering, and the right choice is subjective and depends on the goal. The most commonly used are:
Connectivity model
Centroid model
Distribution model
Density model
Semantic Hierarchical Online Clustering (SHOC), an algorithm that comes under the connectivity model, uses suffix arrays to extract frequent phrases and singular value decomposition to discover the cluster content. Hence it would be well suited to extracting the terms related to a medical condition, which in this question is influenza outbreaks.
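To make the connectivity model concrete, here is a minimal sketch in Python. It is not SHOC itself (no suffix arrays or singular value decomposition); it is plain average-linkage hierarchical clustering, using 1 minus the Pearson correlation between weekly query counts as the distance. The terms, the counts, and the 0.2 cut-off are all made-up values for illustration only.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length series."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def agglomerate(series, max_distance):
    """Average-linkage agglomerative clustering with distance 1 - r."""
    clusters = [[name] for name in series]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        return mean(1 - pearson(series[p], series[q]) for p in a for q in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters (i < j by construction).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        if linkage(clusters[i], clusters[j]) > max_distance:
            break  # Remaining clusters are too dissimilar to merge.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical weekly query counts; weeks 3 to 5 mimic a flu outbreak.
weekly_counts = {
    "flu symptoms": [10, 12, 40, 80, 75, 30, 12, 10],
    "fever remedy": [8, 9, 35, 70, 72, 28, 10, 9],
    "cough medicine": [15, 14, 38, 66, 70, 35, 16, 14],
    "football scores": [50, 52, 49, 51, 50, 53, 48, 50],
    "holiday flights": [60, 55, 20, 18, 22, 58, 61, 59],
}

clusters = agglomerate(weekly_counts, max_distance=0.2)
for members in clusters:
    print(members)
```

With this toy data the three flu-like terms end up in one group because their counts rise and fall together, while the unrelated terms stay on their own. In practice the cut-off would have to be tuned against known outbreak data.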
2. Clustering results depend more on the selected objective function than on the selected algorithm. When the objective is the sum of squared errors (SSE), clusters of variable sizes cause large clusters to be split and smaller ones to be merged. The use of genetic algorithms, random swap, particle swarm optimization, and spectral and density clustering helps in understanding the complex algorithms better.
A better clustering result could be achieved by using an objective function based on Mahalanobis distance or a Gaussian mixture model instead of SSE, if natural clusters are the expected outcome.
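The difference between an SSE-style (Euclidean) assignment and a Mahalanobis-based one can be illustrated with a small Python sketch. The two clusters, their means and covariance matrices, and the test point below are hypothetical values chosen for illustration; they are not taken from the CDC or Google data. Cluster A is wide along the first feature, while cluster B is small and compact.

```python
def euclidean(point, centre):
    """Straight-line distance, the quantity that SSE is built on."""
    return sum((p - c) ** 2 for p, c in zip(point, centre)) ** 0.5


def mahalanobis(point, centre, cov):
    """Mahalanobis distance in 2-D, given a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of the 2x2 covariance matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = point[0] - centre[0], point[1] - centre[1]
    squared = (dx * (inv[0][0] * dx + inv[0][1] * dy)
               + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return squared ** 0.5


# Hypothetical clusters: (centre, covariance matrix).
clusters = {
    "A": ((0.0, 0.0), ((25.0, 0.0), (0.0, 0.25))),  # elongated along x
    "B": ((8.0, 2.0), ((0.25, 0.0), (0.0, 0.25))),  # small and round
}

point = (6.0, 0.0)  # lies on the long axis of cluster A

nearest_euclidean = min(
    clusters, key=lambda k: euclidean(point, clusters[k][0]))
nearest_mahalanobis = min(
    clusters, key=lambda k: mahalanobis(point, *clusters[k]))

print("Euclidean assignment:  ", nearest_euclidean)
print("Mahalanobis assignment:", nearest_mahalanobis)
```

In this sketch the Euclidean rule hands the point to the compact cluster B simply because B's centre is closer, which is the same effect that splits large clusters under SSE. The Mahalanobis rule takes each cluster's spread into account and keeps the point in A. A Gaussian mixture model applies the same idea, with the covariances estimated from the data rather than given.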