Random Forest Algorithm: A Basic Tutorial

Indexed on 2023-04-20 00:10:05

Introduction

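Random forest is a supervised learning algorithm that can be used for both classification and regression, although it is most often applied to classification problems. A forest is made up of trees, and more trees make for a more robust forest. In the same way, the random forest algorithm builds a decision tree on each of several random samples of the data, obtains a prediction from each tree, and selects the final answer by voting. It is an ensemble method that usually generalizes better than a single decision tree, because averaging the results of many trees reduces overfitting.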

How the Random Forest Algorithm Works

The following steps describe how the random forest algorithm works:

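Step 1 - Select random samples from the given dataset.
Step 2 - Build a decision tree for each sample, then obtain a prediction from each tree.
Step 3 - Hold a vote over the predictions.
Step 4 - Select the prediction with the most votes as the final prediction.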

[Figure: how the random forest algorithm works]
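The four steps above can be sketched directly in Python. This is a minimal illustration, not the implementation used by any library: it assumes the copy of the Iris dataset bundled with scikit-learn (`load_iris`), a hypothetical forest size of 25 trees, and `max_features="sqrt"` so that each split also considers only a random subset of features, a detail that random forests add on top of the steps listed above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Assumption: the Iris copy bundled with scikit-learn stands in for the dataset.
X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

n_trees = 25  # hypothetical forest size, chosen only for illustration
trees = []
for _ in range(n_trees):
    # Step 1: draw a bootstrap sample (same size as the data, with replacement).
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: fit one decision tree on that sample.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    trees.append(tree.fit(X[idx], y[idx]))

# Step 3: collect one vote per tree for every row (shape: n_trees x n_samples).
votes = np.stack([tree.predict(X) for tree in trees]).astype(int)

# Step 4: the class with the most votes in each column is the final prediction.
y_pred = np.array([np.bincount(column, minlength=3).argmax() for column in votes.T])

print("accuracy on the training data:", (y_pred == y).mean())
```

The accuracy printed here is measured on the same rows the trees were trained on, so it is optimistic; it only shows that the voting step works.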

Implementation in Python

First, import the necessary Python packages:

# Filename : example.py
# Copyright : 2020 By Lidihuo
# Author by : www.lidihuo.com
# Date : 2020-08-26
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Next, set the web address from which the Iris dataset will be downloaded:

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Next, assign column names to the dataset as follows:

headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

Now read the dataset into a pandas DataFrame as follows:

dataset = pd.read_csv(path, names = headernames)
dataset.head()

	sepal-length	sepal-width	petal-length	petal-width	Class
0	5.1	3.5	1.4	0.2	Iris-setosa
1	4.9	3.0	1.4	0.2	Iris-setosa
2	4.7	3.2	1.3	0.2	Iris-setosa
3	4.6	3.1	1.5	0.2	Iris-setosa
4	5.0	3.6	1.4	0.2	Iris-setosa

Data preprocessing is done with the following lines, which separate the feature columns (X) from the class labels (y):

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values

Next, split the data into training and test sets. The following code uses 70% of the dataset for training and 30% for testing:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)
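Because `train_test_split` shuffles the rows randomly, a call without a seed produces a different split on every run. The sketch below shows one way to make the split repeatable and to keep the class proportions equal in both parts. The values `random_state=42` and `stratify=y` are illustrative choices, not part of the original script, and the bundled `load_iris` data is assumed in place of the downloaded file.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: the Iris copy bundled with scikit-learn stands in for the CSV.
X, y = load_iris(return_X_y=True)

# random_state fixes the shuffle; stratify=y keeps the class balance in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
print("test samples per class:", np.bincount(y_test))
```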

Next, train the model with sklearn's RandomForestClassifier class, as follows:

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 50)
classifier.fit(X_train, y_train)
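It can help to look inside the fitted model. The sketch below, which again assumes the bundled `load_iris` data and an illustrative `random_state=0`, lists the individual trees through the `estimators_` attribute and checks how their outputs are combined. In scikit-learn the forest averages the class probabilities of its trees rather than counting one hard vote per tree, so the hand-computed average should match `predict_proba`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Assumption: the Iris copy bundled with scikit-learn stands in for the dataset.
X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# The fitted trees are exposed through the estimators_ attribute.
print("number of trees:", len(forest.estimators_))

# Average the per-tree class probabilities by hand ...
manual = np.mean([tree.predict_proba(X) for tree in forest.estimators_], axis=0)
# ... and compare with what the forest itself reports.
print("matches predict_proba:", np.allclose(manual, forest.predict_proba(X)))
```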

Finally, make predictions on the test set with the following line:

y_pred = classifier.predict(X_test)

Next, print the results as follows:

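from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Rows of the confusion matrix are true classes; columns are predicted classes.
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
# Per-class precision, recall and F1 score.
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
# Fraction of test samples classified correctly.
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)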

Output

Confusion Matrix:
[[14  0  0]
 [ 0 18  1]
 [ 0  0 12]]
Classification Report:
                 precision    recall  f1-score   support
    Iris-setosa       1.00      1.00      1.00        14
Iris-versicolor       1.00      0.95      0.97        19
 Iris-virginica       0.92      1.00      0.96        12
      micro avg       0.98      0.98      0.98        45
      macro avg       0.97      0.98      0.98        45
   weighted avg       0.98      0.98      0.98        45
Accuracy: 0.9777777777777777
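The figures in the report follow directly from the confusion matrix. As a worked check, the sketch below copies the matrix from the output above and recomputes accuracy, precision and recall by hand in plain Python; the variable names are illustrative only.

```python
# Confusion matrix copied from the output above:
# rows are true classes, columns are predicted classes.
labels = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]
cm = [
    [14, 0, 0],
    [0, 18, 1],
    [0, 0, 12],
]

total = sum(sum(row) for row in cm)               # all test samples
correct = sum(cm[i][i] for i in range(len(cm)))   # diagonal entries
accuracy = correct / total

precision = {}
recall = {}
for i, name in enumerate(labels):
    predicted_as_i = sum(row[i] for row in cm)  # column sum
    actually_i = sum(cm[i])                     # row sum
    precision[name] = cm[i][i] / predicted_as_i
    recall[name] = cm[i][i] / actually_i

print(f"accuracy = {correct}/{total} = {accuracy:.4f}")
for name in labels:
    print(f"{name}: precision={precision[name]:.2f} recall={recall[name]:.2f}")
```

The single misclassified sample is a true Iris-versicolor predicted as Iris-virginica, which is why versicolor recall drops to 18/19 and virginica precision to 12/13 while every other figure stays at 1.00.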

Pros and Cons of Random Forest

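Pros

The following are the advantages of the random forest algorithm:

It counters overfitting by averaging or combining the results of different decision trees.
It works well across a wider range of data items than a single decision tree.
It has lower variance than a single decision tree.
It is very flexible and achieves high accuracy.
It does not require feature scaling; accuracy stays good even when the data is supplied unscaled.
It can maintain good accuracy even when a large proportion of the data is missing, though some implementations require missing values to be imputed first.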
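The point about scaling can be checked with a small experiment. The sketch below is an illustration under stated assumptions, not a proof: it uses the bundled `load_iris` data, a fixed `random_state=0`, and an arbitrary scale factor of 1024 standing in for a change of units. Because decision trees split on the ordering of feature values, two forests trained with the same seed on the raw and on the rescaled features should make the same predictions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumption: the Iris copy bundled with scikit-learn stands in for the dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

scale = 1024.0  # arbitrary illustrative factor, e.g. a change of units

# Same seed for both forests, so the only difference is the feature scale.
raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(
    X_train * scale, y_train
)

pred_raw = raw.predict(X_test)
pred_scaled = scaled.predict(X_test * scale)

agreement = np.mean(pred_raw == pred_scaled)
print("fraction of identical predictions:", agreement)
```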

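Cons

The following are the disadvantages of the random forest algorithm:

Complexity is the main disadvantage of the random forest algorithm.
Building a random forest is harder and more time-consuming than building a single decision tree.
Implementing the algorithm requires more computational resources.
It becomes less intuitive when the forest contains a large number of decision trees.
Prediction is slow compared with other algorithms, because every tree has to be evaluated.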
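One rough way to see the extra cost is to count tree nodes instead of timing anything. The sketch below assumes the bundled `load_iris` data and an illustrative `random_state=0`; it compares the number of nodes in one decision tree with the total across a 50-tree forest, using the `tree_.node_count` attribute of each fitted tree. Every one of those nodes has to be stored, and a prediction must walk through all 50 trees.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: the Iris copy bundled with scikit-learn stands in for the dataset.
X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# node_count is the number of nodes (internal nodes plus leaves) in a fitted tree.
tree_nodes = tree.tree_.node_count
forest_nodes = sum(est.tree_.node_count for est in forest.estimators_)

print("nodes in the single tree:", tree_nodes)
print("nodes in the 50-tree forest:", forest_nodes)
```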
