Python - Decision Trees

The decision tree graph fails to appear

Problems came up in both the pydotplus and graphviz packages.

Unable to install the pydotplus package

Neither conda nor pip could find this package in a Python 3 environment. Per a stackoverflow answer, it can be installed from the conda-forge channel:
conda install -c conda-forge pydotplus
conda install -c conda-forge graphviz

Graphviz’s executable not found

  1. Installed graphviz with both conda and pip; the two report different versions, 2.38 and 0.7, and the difference was unclear at the time. (The conda 2.38 package ships the Graphviz binaries, while the pip 0.7 package is only the Python `graphviz` wrapper.)
  2. Even with both installed, it still did not work.
  3. A stackoverflow answer suggested the install order matters (graphviz first, then pydotplus), but that did not help either.
  4. What finally worked, also per stackoverflow: download the msi installer from the Graphviz website, install it, and add the absolute path of graphviz/bin to the system PATH environment variable (append a `;` and then the path after the existing entries).
  5. After restarting the Python IDE, the decision tree graph renders normally.
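Before restarting the IDE, you can check from Python whether the `dot` executable is actually visible on PATH. A quick sketch; `find_graphviz` is just a hypothetical helper name, not part of any library:

```python
import shutil

def find_graphviz(exe="dot"):
    """Return the full path to a Graphviz executable found on PATH, or None."""
    # shutil.which performs the same PATH lookup that pydotplus relies on
    # when it shells out to Graphviz, so it reproduces the error condition.
    return shutil.which(exe)

print(find_graphviz())  # stays None until graphviz/bin is actually on PATH
```

If this prints None after editing PATH, the new value has not reached the current process yet; restart the IDE or terminal.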

Ways to split data for a decision tree

  1. Training data and test data kept in separate, independent data files.
  2. Cross-validation: the model is repeatedly built and tested. Mainly used when data is scarce; because the procedure runs many rounds, the estimate averages more evenly over the data, at a higher computational cost.
    e.g. with 10 folds, the data is split into 10 parts; each round one part is held out for testing and the other 9 are used for training (leave-one-out when each part is a single sample), rotating until every part has served as the test set.
  3. Percentage split: a fixed fraction of the data builds the model and the remainder tests it. Mainly used when data is plentiful.
from sklearn.datasets import load_iris
from sklearn import tree
from sklearn.model_selection import train_test_split  # cross_validation is deprecated
from sklearn import metrics
from IPython.display import Image
import pydotplus

iris = load_iris()  # load the iris dataset
# split into training and test sets (70% / 30%)
train_X, test_X, train_y, test_y = train_test_split(iris.data, iris.target, test_size=0.3)
dt = tree.DecisionTreeClassifier()  # decision tree classifier; limit depth with max_depth=2
dt.fit(train_X, train_y)  # fit the model on the training data
test_y_predicted = dt.predict(test_X)  # predictions for the test set
accuracy = metrics.accuracy_score(test_y, test_y_predicted)  # model accuracy
print(accuracy)
# draw the decision tree
dot_data = tree.export_graphviz(dt, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
Image(graph.create_png())
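As a fallback when the Graphviz executable still cannot be found, newer scikit-learn (0.21+) can render a fitted tree as plain text with `export_text`, which needs no external executable. A minimal sketch:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# fixed random_state only so the printed tree is reproducible
dt = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
# export_text prints split thresholds and leaves as an indented ASCII diagram
print(export_text(dt, feature_names=list(iris.feature_names)))
```

The output is less pretty than the Graphviz image but shows the same splits, which is often enough for debugging a model.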

K-fold, Cross-validation, LeaveOneOut

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

# 10-fold cross-validation by hand
acc = []
kf = KFold(n_splits=10)
for train, test in kf.split(X):
    train_X, test_X, train_y, test_y = X[train], X[test], y[train], y[test]
    clf = DecisionTreeClassifier()
    clf.fit(train_X, train_y)
    predicted = clf.predict(test_X)
    acc.append(accuracy_score(test_y, predicted))
print(sum(acc) / len(acc))  # mean accuracy across the 10 folds

# the same thing in one call
from sklearn.model_selection import cross_val_score
acc1 = cross_val_score(clf, X=iris.data, y=iris.target, cv=10)
print(acc1)

# leave-one-out: every sample is the test set exactly once
from sklearn.model_selection import LeaveOneOut
res = []
loo = LeaveOneOut()
for train, test in loo.split(X):
    train_X, test_X, train_y, test_y = X[train], X[test], y[train], y[test]
    clf = DecisionTreeClassifier()
    clf.fit(train_X, train_y)
    predicted = clf.predict(test_X)
    res.extend((predicted == test_y).tolist())
print("LeaveOneOut:", sum(res) / len(res))
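The manual leave-one-out loop can also be collapsed into a single `cross_val_score` call by passing a `LeaveOneOut` splitter as `cv`. A sketch; since each fold's test set is one sample, every individual score is 0 or 1 and the mean is the leave-one-out accuracy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(random_state=0)
# one score per sample: 150 folds for the 150 iris observations
scores = cross_val_score(clf, iris.data, iris.target, cv=LeaveOneOut())
print(len(scores), scores.mean())
```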