You would like to build a classifier for an Autism early detection application. Each data point in the dataset represents a patient. Each patient is described by a set of attributes such as age, sex, ethnicity, communication and development figures, etc. You know from domain knowledge that autism is more prevalent in males than females.
Suppose the dataset you are using to build the classifier is noisy and contains redundant attributes and missing values. If you are considering a decision tree classifier and a k-nearest neighbor classifier, explain how each of these can handle the three problems mentioned:
1. Noise
2. Missing Values
3. Redundant Attributes
There are various methods for handling these problems. Some of them are as follows:
1. Noise: Use a feature correlation heatmap to examine the correlation between each feature and the target variable. Then select groups of features and run cross-validation on each group to find out which groups or individual features give lower accuracy. Analyze the attributes and data that perform poorly; this helps you decide which data to remove.
2. Missing Values: In the health care domain, we have to use real-world patient values, and we should avoid imputing simulated or biased data into the dataset. If the missing value is in an attribute that is not among your selected features, you do not have to do anything. If it is among them, there are some basic options: remove the affected records from the dataset, impute the mean or median of that column (e.g., age), or impute the most frequent value in that column (e.g., gender).
3. Redundant Attributes: Use a feature correlation heatmap to find and remove redundant attributes from the dataset. Alternatively, since the question mentions a tree classifier, you can use its feature importance scores to find the best features; the tree classifier computes these scores automatically for your model.
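The cross-validation idea from point 1 can be sketched as follows. This is only a minimal illustration in plain Python, assuming a small hand-written k-nearest neighbor classifier and synthetic data; the helper names `knn_predict` and `cv_accuracy` and the two columns are hypothetical and do not come from the question's dataset.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x)
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cv_accuracy(X, y, columns, folds=5, k=3):
    """Mean cross-validated accuracy using only the given feature columns."""
    rows = [[row[c] for c in columns] for row in X]
    scores = []
    for f in range(folds):
        # Every folds-th row (starting at f) forms the held-out fold.
        test_idx = set(range(f, len(rows), folds))
        train_X = [r for i, r in enumerate(rows) if i not in test_idx]
        train_y = [lab for i, lab in enumerate(y) if i not in test_idx]
        hits = sum(
            knn_predict(train_X, train_y, rows[i], k) == y[i] for i in test_idx
        )
        scores.append(hits / len(test_idx))
    return sum(scores) / folds


# Synthetic data (an assumption for illustration):
# column 0 carries the class signal, column 1 is pure noise.
rng = random.Random(0)
y = [i % 2 for i in range(100)]
X = [[label + rng.gauss(0, 0.3), rng.gauss(0, 5.0)] for label in y]

informative = cv_accuracy(X, y, columns=[0])
noisy = cv_accuracy(X, y, columns=[1])
both = cv_accuracy(X, y, columns=[0, 1])
print(f"informative only: {informative:.2f}")
print(f"noise only:       {noisy:.2f}")
print(f"both columns:     {both:.2f}")
```

A feature group whose score is close to chance is a candidate for closer inspection or removal. In practice a library implementation of cross-validation would normally replace the hand-written loop.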
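The basic imputation options from point 2 might look like the sketch below. The patient records, the column names `age` and `sex`, and the helper `impute` are all made up for illustration; the sketch fills a numeric column with its median and a categorical column with its most frequent value.

```python
from statistics import median, mode

# Hypothetical patient records; None marks a missing value.
patients = [
    {"age": 4, "sex": "M"},
    {"age": None, "sex": "M"},
    {"age": 6, "sex": None},
    {"age": 5, "sex": "F"},
    {"age": 9, "sex": "M"},
]


def impute(records, column, strategy):
    """Return copies of the records with missing values in `column` filled.

    `strategy` is a function such as median or mode that is applied to the
    observed (non-missing) values of the column.
    """
    observed = [r[column] for r in records if r[column] is not None]
    fill = strategy(observed)
    return [
        {**r, column: fill if r[column] is None else r[column]}
        for r in records
    ]


# Median for the numeric column, most frequent value for the categorical one.
filled = impute(patients, "age", median)
filled = impute(filled, "sex", mode)

for record in filled:
    print(record)
```

The function returns new records instead of modifying the originals, so the unimputed data stays available for comparison.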
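For point 3, the numbers behind a correlation heatmap can be used directly to flag redundant attributes. The sketch below is one possible approach, not a definitive method: it computes pairwise Pearson correlations in plain Python and marks the later column of any highly correlated pair. The column names, the data, and the 0.95 threshold are assumptions chosen for illustration.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def redundant_columns(table, threshold=0.95):
    """Return columns that are highly correlated with an earlier column."""
    drop = set()
    for a, b in combinations(table, 2):
        if a in drop or b in drop:
            continue  # one of the pair is already marked as redundant
        if abs(pearson(table[a], table[b])) >= threshold:
            drop.add(b)  # keep the earlier column, drop the later one
    return drop


# Hypothetical columns: age_months is age_years * 12, so it adds nothing new.
table = {
    "age_years": [3, 4, 5, 6, 8],
    "age_months": [36, 48, 60, 72, 96],
    "communication_score": [7, 2, 9, 4, 5],
}
to_drop = redundant_columns(table)
print(to_drop)
```

Correlation only detects linear redundancy between pairs of numeric columns, which is why the feature importance scores of a tree classifier are a useful complementary check.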