In: Computer Science
Run the following code and answer the following questions:
(a) How do accuracy change with changing the tree maximum depth? [Your answer]
(b) What are the ways to reduce overfitting in a decision tree? [Your answer]
from sklearn import datasets
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import plot_tree
iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=1, stratify=y)
tree = DecisionTreeClassifier(criterion='entropy',
max_depth=10,
random_state=1)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
test_accuracy = metrics.accuracy_score(y_test, y_pred)
print("Test accuracy of decision tree classifier on Iris dataset:
"+str(test_accuracy))
plt.figure(figsize=(10, 7))
plot_tree(tree,
filled=True,
rounded=True,
class_names=['Setosa',
'Versicolor',
'Virginica'],
feature_names=['petal length',
'petal width'])
plt.show()
How do accuracy change with changing the tree maximum depth?
ans-> Max depth of tree simply states that if we increase the depth of tree , the maximum we allow tree to grow the more complex our model will become. Higher the max_depth highly complex model will be there.
Hence for the training data,our accuracy increases as model complexity increases, but for the testing set our model might get overfits,and accuracy of the l will be lesser. So it is recommended to use random search cross-validation and grid-search cv to find the best value of parameter max_depth in order to get high accuracy.
* max_depth => very high then model get overfits
* max_depth => very low then model get underfits
Q. What are the ways to reduce overfitting in a decision tree?
Solution->
one way to reduce overfitting is to uses pruning. It means changing the various important parameter like max_depth and min_samples_split and min_samples_leaf and observe the accuracy of model according .
min_samples_split: min number of sample required to do split at each node, taking it too high causes overfitting and taking too low causes underfitting.
max_depth : HIgher is the max_depth higher will be number of sample split, so it may cause overfitting so choose it accordingly , you can use random search Cv(cross validation ).
I have corrected your below code and now its working fine and showing the accuracy as well and showing the decision tree as well.