How to Perform Feature Importance Analysis and Feature Selection with XGBoost in Python

Many newcomers are unsure how to perform feature importance analysis and feature selection with XGBoost in Python. This article walks through the process in detail for anyone who needs it.

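A benefit of using ensembles of decision tree methods such as gradient boosting is that they can automatically provide estimates of feature importance from a trained predictive model.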

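A benefit of using gradient boosting is that, once the boosted trees have been constructed, it is relatively straightforward to retrieve an importance score for each attribute. Generally, importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions in the decision trees, the higher its relative importance.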

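This importance is calculated explicitly for each attribute in the dataset, which allows attributes to be ranked and compared with each other. For a single decision tree, importance is calculated as the amount by which each attribute's split point improves the performance measure, weighted by the number of observations the node is responsible for. The performance measure may be the purity (Gini index) used to select the split points, or another more specific error function. The feature importances are then averaged across all of the decision trees within the model.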
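As a rough illustration of the calculation described above, the following sketch computes split-based importance for two hand-made toy trees using Gini impurity. The split tuples, function names, and class counts are all hypothetical, and XGBoost computes its scores internally in a way that may differ from this simplified formula.

```python
# Toy illustration of split-based feature importance (not XGBoost's internal code).


def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def tree_importance(splits, n_features, n_samples):
    """Sum each feature's impurity reduction, weighted by the share of samples at the node."""
    scores = [0.0] * n_features
    for feature, parent, left, right in splits:
        n_parent = sum(parent)
        n_left, n_right = sum(left), sum(right)
        # Impurity before the split minus the sample-weighted impurity after it.
        reduction = (
            gini(parent)
            - (n_left / n_parent) * gini(left)
            - (n_right / n_parent) * gini(right)
        )
        scores[feature] += (n_parent / n_samples) * reduction
    return scores


def ensemble_importance(trees, n_features, n_samples):
    """Average the per-tree scores over all trees in the ensemble."""
    per_tree = [tree_importance(t, n_features, n_samples) for t in trees]
    return [sum(column) / len(trees) for column in zip(*per_tree)]


# Hypothetical splits: (feature index, parent class counts, left counts, right counts).
tree_1 = [(0, [50, 50], [40, 10], [10, 40]), (1, [40, 10], [35, 0], [5, 10])]
tree_2 = [(1, [50, 50], [30, 20], [20, 30]), (0, [30, 20], [25, 5], [5, 15])]

importance = ensemble_importance([tree_1, tree_2], n_features=2, n_samples=100)
print(importance)
```

In this toy ensemble, feature 0 ends up with the larger score because its splits remove more impurity on larger nodes.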

Manually Plot Feature Importance

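A trained XGBoost model automatically calculates feature importance for your predictive modeling problem. These importance scores are available in the feature_importances_ member variable of the trained model. For example, they can be printed directly as follows: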

print(model.feature_importances_)

We can plot these scores directly on a bar chart to get a visual indication of the relative importance of each feature in the dataset. For example:

# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

We can demonstrate this by training an XGBoost model on the Pima Indians onset of diabetes dataset and creating a bar chart from the calculated feature importances.

Download the dataset and place it in your current working directory.

Dataset file:

https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv

Dataset details:

https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.names

# plot feature importance manually
from numpy import loadtxt
from xgboost import XGBClassifier
from matplotlib import pyplot

# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# feature importance
print(model.feature_importances_)
# plot
pyplot.bar(range(len(model.feature_importances_)), model.feature_importances_)
pyplot.show()

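Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running this example first outputs the importance scores.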

[ 0.089701    0.17109634  0.08139535  0.04651163  0.10465116  0.2026578 0.1627907   0.14119601]

We also get a bar chart of the relative importances.

[Figure: bar chart of the feature importance scores, ordered by input index]

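A downside of this plot is that the features are ordered by their input index rather than by their importance. We could sort the features before plotting.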
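One way to do that sorting by hand is sketched below, using the scores printed by the example run above. The f0 to f7 labels and the variable names are illustrative choices, not something the plotting code requires.

```python
# Importance scores as printed by the example run (list index = input column).
scores = [0.089701, 0.17109634, 0.08139535, 0.04651163,
          0.10465116, 0.2026578, 0.1627907, 0.14119601]

# Pair each score with its column index and sort by score, largest first.
ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Split the ranked pairs into labels and values, ready for a bar chart.
labels = ["f%d" % index for index, _ in ranked]
values = [score for _, score in ranked]

for label, value in zip(labels, values):
    print("%s %.4f" % (label, value))
```

The resulting labels and values lists can be passed to pyplot.bar(labels, values) in place of the index range used earlier.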

Thankfully, there is a built-in plot function to help us.

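Using the Built-in XGBoost Feature Importance Plot

The XGBoost library provides a built-in function that plots features ordered by their importance. The function is called plot_importance() and can be used as follows: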

# plot feature importance
plot_importance(model)
pyplot.show()

For example, below is a complete code listing that plots the feature importance for the Pima Indians dataset using the built-in plot_importance() function.

# plot feature importance using built-in function
from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance
from matplotlib import pyplot

# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit model on training data
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()

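Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

Running the example gives us a more useful bar chart.

[Figure: bar chart from plot_importance() with the features ordered by importance]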

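You can see that the features are automatically named according to their index in the input array (X), from f0 to f7. Manually mapping these indices to the names in the problem description, we can see that the plot shows f5 (body mass index) has the highest importance and f3 (skin fold thickness) has the lowest importance.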
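The manual mapping from automatic labels to column descriptions can be written down once and reused. The sketch below assumes the eight input columns appear in the order given in the dataset details file; the descriptions are paraphrased from that file, and readable_name is a hypothetical helper.

```python
# Column descriptions paraphrased from the dataset's .names file, in input order.
FEATURE_NAMES = [
    "number of times pregnant",
    "plasma glucose concentration",
    "diastolic blood pressure",
    "triceps skin fold thickness",
    "2-hour serum insulin",
    "body mass index",
    "diabetes pedigree function",
    "age",
]


def readable_name(label):
    """Translate an automatic label such as 'f5' into a column description."""
    index = int(label[1:])  # drop the leading letter, keep the column index
    return FEATURE_NAMES[index]


for label in ["f5", "f3"]:
    print(label, "->", readable_name(label))
```

A lookup like this makes it easier to relabel the plot or a printed ranking without recounting columns each time.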

Feature Selection with XGBoost Feature Importance Scores

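Feature importance scores can be used for feature selection in scikit-learn. This is done with the SelectFromModel class, which takes a model and can transform a dataset into a subset containing only the selected features. The class can take a pre-trained model, such as one trained on the entire training dataset. It then uses a threshold to decide which features to select. The threshold is applied when you call the transform() method on the SelectFromModel instance, so the same features are selected consistently on the training dataset and the test dataset.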
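The thresholding step can be pictured as a column filter. The sketch below is a plain-Python stand-in, not scikit-learn's implementation: select_columns and the toy data are hypothetical, and it assumes a feature is kept when its importance is greater than or equal to the threshold, which is how scikit-learn documents SelectFromModel.

```python
def select_columns(rows, importances, threshold):
    """Keep only the columns whose importance is at least the threshold."""
    kept = [i for i, score in enumerate(importances) if score >= threshold]
    reduced = [[row[i] for i in kept] for row in rows]
    return reduced, kept


# Hypothetical importances for three features and tiny train/test matrices.
importances = [0.1, 0.5, 0.4]
train_rows = [[1, 2, 3], [4, 5, 6]]
test_rows = [[7, 8, 9]]

# The same threshold is applied to both sets, so both keep the same columns.
train_selected, kept_columns = select_columns(train_rows, importances, threshold=0.4)
test_selected, _ = select_columns(test_rows, importances, threshold=0.4)

print(kept_columns)
print(train_selected)
print(test_selected)
```

Because the kept columns depend only on the importances and the threshold, the training and test data are always reduced in the same way.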

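In the example below, we first train an XGBoost model on the entire training dataset and evaluate it on the test dataset. Using the feature importances calculated from the training dataset, we then wrap the model in a SelectFromModel instance. We use it to select features on the training dataset, train a model on the selected subset of features, and then evaluate that model on the test set, subject to the same feature selection scheme.

For example: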

# select features using threshold
selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model
selection_model = XGBClassifier()
selection_model.fit(select_X_train, y_train)
# eval model
select_X_test = selection.transform(X_test)
y_pred = selection_model.predict(select_X_test)

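For interest, we can test multiple thresholds for selecting features by feature importance. Specifically, we use the feature importance of each input variable as a threshold, which essentially lets us test each subset of features by importance, starting with all features and ending with a subset containing only the most important feature.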
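To see why using each importance as a threshold walks through nested subsets, the short sketch below counts how many features survive each threshold. It reuses the scores printed earlier for the model trained on the full dataset purely as sample values, and assumes a feature is kept when its score is at least the threshold.

```python
# Sample importance scores (from the earlier full-dataset run).
scores = [0.089701, 0.17109634, 0.08139535, 0.04651163,
          0.10465116, 0.2026578, 0.1627907, 0.14119601]

counts = []
for thresh in sorted(scores):
    # Count the features whose importance meets or exceeds this threshold.
    n_kept = sum(1 for score in scores if score >= thresh)
    counts.append(n_kept)
    print("Thresh=%.3f, n=%d" % (thresh, n_kept))
```

Each step up in the threshold drops exactly the least important remaining feature, provided no two scores are tied.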

The complete code listing is provided below:

# use feature importance for feature selection
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel

# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in y_pred]
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

Note that if you are using XGBoost 1.0.2 (and perhaps other versions), there is a bug in the XGBClassifier class that results in the error:

KeyError: 'weight'

This can be fixed by using a custom XGBClassifier class that returns None for the coef_ attribute. The complete example is listed below.

# use feature importance for feature selection, with fix for xgboost 1.0.2
from numpy import loadtxt
from numpy import sort
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import SelectFromModel

# define custom class to fix bug in xgboost 1.0.2
class MyXGBClassifier(XGBClassifier):
    @property
    def coef_(self):
        return None

# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit model on all training data
model = MyXGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data and evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# Fit model using each importance as a threshold
thresholds = sort(model.feature_importances_)
for thresh in thresholds:
    # select features using threshold
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    predictions = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, predictions)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))

Running this example prints the following output.

Accuracy: 77.95%
Thresh=0.071, n=8, Accuracy: 77.95%
Thresh=0.073, n=7, Accuracy: 76.38%
Thresh=0.084, n=6, Accuracy: 77.56%
Thresh=0.090, n=5, Accuracy: 76.38%
Thresh=0.128, n=4, Accuracy: 76.38%
Thresh=0.160, n=3, Accuracy: 74.80%
Thresh=0.186, n=2, Accuracy: 71.65%
Thresh=0.208, n=1, Accuracy: 63.78%

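We can see that the performance of the model generally decreases as the number of selected features decreases.

On this problem there is a trade-off between the number of features and test set accuracy. We could decide to take a less complex model (fewer attributes, such as n=4) and accept a modest decrease in estimated accuracy from 77.95% down to 76.38%.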

This is likely to be a wash on such a small dataset, but it may be a more useful strategy on a larger dataset, using cross-validation as the model evaluation scheme.

Article title: How to Perform Feature Importance Analysis and Feature Selection with XGBoost in Python
Page URL: http://fisionsoft.com.cn/article/iihodj.html
