Data Mining: Decision Trees

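This article introduces the decision tree algorithm: how it classifies data through a series of rules, the distinction between classification trees and regression trees, and its strengths and weaknesses, such as low computational cost. Using the UCI Adult dataset as an example, it then walks through a hand-written decision tree classifier and the equivalent sklearn implementation, covering data loading, preprocessing, model construction, visualization, and testing. Both reach the same classification accuracy on this dataset.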

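1. How the algorithm works

A decision tree classifies data by applying a series of rules. It offers a rule-like way of stating which value results under which conditions. There are two kinds of decision tree: classification trees, which predict a discrete target variable, and regression trees, which predict a continuous one. Surveys have indicated that decision trees are among the most frequently used data mining algorithms, and the concept is very simple. One major reason for their popularity is that users need little machine learning background and do not have to study how the algorithm works internally. Intuitively, a decision tree classifier looks like a flowchart made of decision blocks and terminal blocks. A terminal block holds a classification result (a leaf of the tree). A decision block tests the value of one feature, with one branch for each value the feature can take.

Leaving efficiency aside, chaining tests on every feature of a sample will eventually route that sample to the terminal block of some class. In practice, a few features play a decisive role in classification. Building a decision tree means finding those features and arranging them, by how decisive they are, into an upside-down tree: the most decisive feature becomes the root node, and the next most decisive feature is then found recursively within the sub-dataset under each branch, until all the data in a sub-dataset belongs to the same class. Building a decision tree is therefore, in essence, a recursive process of partitioning the dataset by its features, and the first problem to solve is which feature of the current dataset is most decisive for splitting it into classes.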
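
As a minimal sketch of what "most decisive" can mean, the snippet below scores each feature of a tiny made-up dataset by information gain (the drop in entropy after splitting on that feature) and picks the highest. The toy records, the feature names, and the helpers `entropy` and `information_gain` are illustrative assumptions, not part of the article's code.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy before the split minus the size-weighted entropy after it."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - after

# Hypothetical toy data: "married" separates the classes perfectly, "sex" does not.
rows = [
    {"married": "yes", "sex": "male"},
    {"married": "yes", "sex": "female"},
    {"married": "no", "sex": "male"},
    {"married": "no", "sex": "female"},
]
labels = [">50K", ">50K", "<=50K", "<=50K"]

gains = {f: information_gain(rows, labels, f) for f in ("married", "sex")}
root_feature = max(gains, key=gains.get)
print(gains, root_feature)
```

In this toy case the feature with the larger gain would become the root, and the same scoring would then be repeated inside each branch.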

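2. Strengths and weaknesses

Decision trees work with both numeric and nominal data (discrete data whose values come from a finite set). They can read a dataset and extract the rules contained in it. Using a decision tree model for classification has many advantages: the computational cost is low, the model is easy to use and efficient, it can handle data with irrelevant features, and it readily produces rules that are easy to interpret and understand.

Decision tree models also have drawbacks, such as difficulty handling missing data, a tendency to overfit, and ignoring correlations between attributes in the dataset.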

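3. The Adult dataset

This article uses the Adult dataset from the UCI repository. Download link: http://archive.ics.uci.edu/ml/datasets/Adult . A pre-downloaded copy is also provided: 【Data Mining】Adult.
The source dataset has 14 attributes and two classes (>50K and <=50K). Of the 14 attributes, 6 are continuous and 8 are discrete.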
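
To make the 6/8 breakdown concrete, here is a small sketch that groups the column names used later in this article. Which six attributes count as continuous is an assumption based on which columns hold plain numbers in the raw file.

```python
# Column names as used when the data is loaded later in this article.
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
           'marital_status', 'occupation', 'relationship', 'race', 'sex',
           'capital_gain', 'capital_loss', 'hours_per_week', 'native_country',
           'annual_salary']

label = 'annual_salary'
# Assumed grouping: the six attributes that hold plain numbers in the raw file.
continuous = ['age', 'fnlwgt', 'education_num', 'capital_gain',
              'capital_loss', 'hours_per_week']
# Everything else, apart from the class label, is treated as discrete.
discrete = [c for c in columns if c not in continuous and c != label]

print(len(continuous), len(discrete))
print(discrete)
```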

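4. Code walkthrough

4.1 Importing dependencies

The AI Studio platform does not provide the sklearn_pandas library, so we install it manually. Just run the code cell below:

In [1]

# Upgrade pip
!pip install --upgrade pip
# Install sklearn_pandas
!pip install sklearn_pandas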

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: pip in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (23.0)
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: sklearn_pandas in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (2.2.0)
Requirement already satisfied: scipy>=1.5.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from sklearn_pandas) (1.7.3)
Requirement already satisfied: numpy>=1.18.1 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from sklearn_pandas) (1.21.6)
Requirement already satisfied: pandas>=1.1.4 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from sklearn_pandas) (1.1.5)
Requirement already satisfied: scikit-learn>=0.23.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from sklearn_pandas) (1.0.2)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pandas>=1.1.4->sklearn_pandas) (2.8.2)
Requirement already satisfied: pytz>=2017.2 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pandas>=1.1.4->sklearn_pandas) (2019.3)
Requirement already satisfied: joblib>=0.11 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.23.0->sklearn_pandas) (0.14.1)
Requirement already satisfied: threadpoolctl>=2.0.0 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from scikit-learn>=0.23.0->sklearn_pandas) (3.1.0)
Requirement already satisfied: six>=1.5 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas>=1.1.4->sklearn_pandas) (1.16.0)

In [2]

import pandas as pd
import numpy as np
# sklearn-pandas bridges Scikit-Learn's machine learning methods and pandas-style data frames.
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier  # decision tree classifier provided by sklearn
from sklearn.tree import export_graphviz  # decision tree visualization
import graphviz  # renders graphs described in the DOT language
from matplotlib import pyplot as plt
from pylab import *
from collections import defaultdict, Counter
from tqdm import tqdm  # progress bar

4.2 Loading the dataset

In [3]

# 14 attributes + class label
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status', 'occupation', 'relationship',
           'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'annual_salary']
# Load the training set
adult_train_path = 'data/data87314/adult.data'
adult_train = pd.read_csv(adult_train_path, header=None, names=columns)
# Load the test set
adult_test_path = 'data/data87314/adult.test'
adult_test = pd.read_csv(adult_test_path, header=None, names=columns)

A look at the data:

In [4]

adult_train.head()

   age          workclass  fnlwgt   education  education_num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital_status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital_gain  capital_loss  hours_per_week  native_country annual_salary  
0          2174             0              40   United-States         <=50K  
1             0             0              13   United-States         <=50K  
2             0             0              40   United-States         <=50K  
3             0             0              40   United-States         <=50K  
4             0             0              40            Cuba         <=50K

4.3 Encoding the data

LabelEncoder and DataFrameMapper are used to convert the non-numeric columns to numbers, that is, to vectorize them.

Processing the training set:

In [5]

# Collect the names of the non-numeric columns
train_dtype = adult_train.dtypes
# print(train_dtype)
train_list = [name for name in train_dtype.index if train_dtype[name] == 'object']
# Use LabelEncoder and DataFrameMapper to convert the non-numeric columns to numbers.
# Note that the column order changes.
mapper = DataFrameMapper([(i, LabelEncoder()) for i in train_list], df_out=True, default=None)
adult_train = mapper.fit_transform(adult_train.copy()).astype(dtype='int64')

Processing the test set:

In [6]

test_dtype=adult_test.dtypes
test_list=[name for name in test_dtype.index if test_dtype[name]=='object']

# Note: this fits a separate set of encoders on the test set, so its integer codes
# match the training set only when both sets contain the same categories.
mapper=DataFrameMapper([(i, LabelEncoder()) for i in test_list], df_out=True, default=None)
adult_test = mapper.fit_transform(adult_test.copy()).astype(dtype='int64')
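
One thing to keep in mind with the two cells above: each call to fit_transform learns its own mapping, so a value receives the same code in both sets only if both contain the same categories. The pure-Python sketch below illustrates the effect with made-up category lists. `fit_codes` is a hypothetical helper that mimics label encoding by numbering the sorted unique values.

```python
def fit_codes(values):
    """Number the sorted unique values from 0, as label encoding does."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

# Hypothetical columns: the test split happens to lack one category.
train_col = ["Cuba", "India", "Japan", "United-States"]
test_col = ["Cuba", "Japan", "United-States"]

train_codes = fit_codes(train_col)
test_codes = fit_codes(test_col)  # fitted separately from the training data

# The same string now maps to different integers in the two splits.
print(train_codes["Japan"], test_codes["Japan"])

# Reusing the training mapping keeps the codes aligned.
aligned = [train_codes[v] for v in test_col]
print(aligned)
```

Whether this matters for the Adult files depends on whether every category appears in both splits, which is worth checking before comparing results.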

The data after encoding:

In [7]

adult_train.head()

   workclass  education  marital_status  occupation  relationship  race  sex  \
0          7          9               4           1             1     4    1   
1          6          9               2           4             0     4    1   
2          4         11               0           6             1     4    1   
3          4          1               2           6             0     2    1   
4          4          9               2          10             5     2    0   

   native_country  annual_salary  age  fnlwgt  education_num  capital_gain  \
0              39              0   39   77516             13          2174   
1              39              0   50   83311             13             0   
2              39              0   38  215646              9             0   
3              39              0   53  234721              7             0   
4               5              0   28  338409             13             0   

   capital_loss  hours_per_week  
0             0              40  
1             0              13  
2             0              40  
3             0              40  
4             0              40
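
The codes in the workclass column above are consistent with label encoding numbering the sorted unique values from 0. The sketch below reproduces that with plain Python. The list of workclass categories is an assumption about what the raw file contains (with surrounding whitespace stripped), and `encode` is a hypothetical helper, not the LabelEncoder API.

```python
# Assumed category list for the workclass attribute (whitespace stripped).
workclass_values = ["Private", "Self-emp-not-inc", "Self-emp-inc", "Federal-gov",
                    "Local-gov", "State-gov", "Without-pay", "Never-worked", "?"]

def encode(categories):
    """Map each category to its position in sorted order."""
    return {name: code for code, name in enumerate(sorted(categories))}

codes = encode(workclass_values)
# workclass values of the first three rows shown by adult_train.head()
first_rows = ["State-gov", "Self-emp-not-inc", "Private"]
print([codes[v] for v in first_rows])  # [7, 6, 4]
```

Because the codes follow alphabetical order, their numeric order carries no meaning of its own; a split such as workclass <= 4.5 only groups categories that happen to sort next to each other.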

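4.4 Splitting features and labels

The fnlwgt attribute has values that are too widely scattered, which makes tree construction very slow, so it is dropped during preprocessing.
The resulting training set has shape (32561, 13) (32561,), and the test set has shape (16281, 13) (16281,).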

In [8]

col = list(adult_train.columns)
label = 'annual_salary'
col.remove(label)
col.remove('fnlwgt')

x_train, y_train = adult_train[col].values, adult_train[label].values
x_test, y_test = adult_test[col].values, adult_test[label].values
print("Training set shape: ", x_train.shape, y_train.shape, "\nTest set shape: ", x_test.shape, y_test.shape)

Training set shape:  (32561, 13) (32561,) 
Test set shape:  (16281, 13) (16281,)

4.5 Implementing a decision tree classifier by hand and visualizing it

4.5.1 Defining the decision tree class:

In [9]

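class Branch:
    no=0 # node number within the tree
    column=0 # index of the attribute this node splits on
    entropy=0 # information entropy of the node
    samples=0 # number of samples at this node
    value=[] # per-class sample counts at this node
    split=0 # split threshold
    clss=-1 # class predicted at this node
    
    branch_positive=None # left branch (attribute <= threshold)
    branch_negative=None # right branch (attribute > threshold)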

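4.5.2 Building the decision tree classifier

The main steps are:

Define a function that computes entropy
Define a function that splits the data on a given attribute
Define a function that finds the best splitting attribute within the given data
Define a function that builds the decision tree classifier recursively

The entropy function: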

In [10]

def entroy(y):
    counter=Counter(y)
    res=0.0
    for num in counter.values():
        p=num/len(y) # proportion of each class
        res+=-p*np.log2(p)
    return res
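
As a quick sanity check of the formula H = -Σ p_i log2(p_i), the standalone sketch below evaluates it for a few class-count pairs. It uses only the standard library instead of NumPy, `entropy_from_counts` is a hypothetical helper, and the counts are illustrative.

```python
from math import log2

def entropy_from_counts(counts):
    """Entropy in bits of a node, given the number of samples in each class."""
    total = sum(counts)
    # Classes with zero samples are skipped: p * log2(p) tends to 0 as p tends to 0.
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

even = entropy_from_counts([5, 5])          # evenly mixed node
pure = entropy_from_counts([0, 410])        # node containing a single class
mixed = entropy_from_counts([4862, 2445])   # roughly two thirds vs one third

print(even, pure, round(mixed, 3))
```

An evenly mixed two-class node has the maximum entropy of 1 bit, a pure node has entropy 0, and everything else falls in between, which is why lower weighted entropy after a split indicates a better split.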

The data-splitting function:

In [11]

def split(x,y,d,value):
    # x: dataset features
    # y: dataset labels
    # d: dimension (column index) to split on
    # value: threshold to split at
    left=(x[:,d]<=value)
    
    right=(x[:,d]>value)
    return x[left],x[right],y[left],y[right]

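The function that selects the best splitting feature. Given the current data (x, y), it picks the most suitable feature and threshold and returns the data for the resulting left and right branches: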

In [12]

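def find_best_fearture(x,y):
    best_entroy=entroy(y) # start from the entropy of the unsplit data
    best_v=None # split threshold
    best_d='' # splitting attribute
    x_r=None
    x_l=None
    y_r=None
    y_l=None
    # Compare the attributes one by one
    for d in range(x.shape[1]):
        # Look for the best cut point within each attribute.
        # Some attributes are numeric, so a finer search is needed to pin down the best cut point.
        sorted_index=np.argsort(x[:,d]) # sort by dimension d
        for i in range(1,len(x)): # walk through the samples
            if x[sorted_index[i-1],d]!=x[sorted_index[i],d]:
                v=(x[sorted_index[i-1],d]+x[sorted_index[i],d])/2.0
                # Split the data with the split function
                xl,xr,yl,yr=split(x,y,d,v)
                n1=len(yl)
                n2=len(yr)
                n=n1+n2
                # Size-weighted entropy of the two branches
                e=n1/n*entroy(yl)+n2/n*entroy(yr)
                if e<best_entroy:
                    best_entroy,best_d,best_v,x_l,x_r,y_l,y_r=e,d,v,xl,xr,yl,yr
    # Returns: best attribute, split threshold, entropy of the unsplit data,
    # left-branch features, right-branch features, left-branch labels, right-branch labels
    return best_d,best_v,entroy(y),x_l,x_r,y_l,y_r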
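
The threshold search is easier to follow on a single small column. The sketch below, written with the standard library only, lists the candidate cut points (midpoints between consecutive distinct values) and picks the one with the lowest weighted entropy. The sample values and the helper names are illustrative assumptions. It removes duplicate values before taking midpoints, which yields the same set of thresholds as comparing neighbouring samples in sorted order.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def candidate_thresholds(values):
    """Midpoints between consecutive distinct values, in ascending order."""
    distinct = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]

def weighted_entropy(values, labels, threshold):
    """Size-weighted entropy of the two groups produced by value <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

# Hypothetical education_num values and class labels for six people.
education_num = [7, 9, 9, 10, 13, 13]
labels = [0, 0, 0, 0, 1, 1]

thresholds = candidate_thresholds(education_num)
best = min(thresholds, key=lambda t: weighted_entropy(education_num, labels, t))
print(thresholds, best)
```

Midpoints explain why thresholds for integer-valued attributes end in .5 when the neighbouring values differ by one.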

The tree-construction function, which recursively builds a decision tree:

In [13]

number=0
def decison_tree_in(x,y,depth,max_depth=3):
    global number
    branch=Branch()
    branch.no=number
    number+=1
    ddepth=depth # depth of this branch
    
    branch.samples=len(y) # number of samples at this node
    n_positive=y[y==1].shape[0]
    branch.value=[branch.samples-n_positive,n_positive] # counts of class 0 and class 1 at this node
    if branch.value[0]>branch.value[1]:
        branch.clss=0
    else :
        branch.clss=1
    branch.entropy=entroy(y) # information entropy at this node
    best_feature=find_best_fearture(x,y)
    branch.column=best_feature[0]
    branch.split=best_feature[1]
    # Stop at the depth limit, or when no split lowers the entropy
    if ddepth==max_depth or branch.column=='':
        return branch
    else:
        x_l,y_l=best_feature[3],best_feature[5]
        branch.branch_positive=decison_tree_in(x_l,y_l,ddepth+1,max_depth)
        x_r,y_r=best_feature[4],best_feature[6]
        branch.branch_negative=decison_tree_in(x_r,y_r,ddepth+1,max_depth)
    return branch
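
To see the recursion and its two stopping conditions (depth limit reached, or no split that lowers the entropy) in isolation, here is a compact standalone sketch that builds a tree over a single numeric feature and stores it as nested dicts. The data, the dict layout, and the helpers `best_split` and `build` are assumptions made for illustration. It is not a drop-in replacement for the function above.

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_split(xs, ys):
    """Return (threshold, weighted_entropy) of the best cut, or None if no cut helps."""
    best = None
    distinct = sorted(set(xs))
    for a, b in zip(distinct, distinct[1:]):
        t = (a + b) / 2.0
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(ys)
        if e < entropy(ys) and (best is None or e < best[1]):
            best = (t, e)
    return best

def build(xs, ys, depth=0, max_depth=2):
    """Recursively build a tree as nested dicts over one numeric feature."""
    node = {"samples": len(ys), "class": Counter(ys).most_common(1)[0][0]}
    cut = best_split(xs, ys)
    if depth == max_depth or cut is None:
        return node  # leaf: depth limit reached or nothing left to gain
    t = cut[0]
    node["split"] = t
    node["left"] = build([x for x in xs if x <= t],
                         [y for x, y in zip(xs, ys) if x <= t], depth + 1, max_depth)
    node["right"] = build([x for x in xs if x > t],
                          [y for x, y in zip(xs, ys) if x > t], depth + 1, max_depth)
    return node

# Hypothetical ages and class labels.
ages = [18, 22, 35, 40, 52, 67]
labels = [0, 0, 1, 1, 1, 0]
toy_tree = build(ages, labels)
print(toy_tree)
```

The left child of the root is already pure, so it becomes a leaf before the depth limit is reached, while the right child is split once more.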

In [14]

tree=decison_tree_in(x_train,y_train,0,max_depth=4)

4.5.3 Visualizing the trained decision tree classifier

The tree is visualized with graphviz, which draws graphs from DOT language scripts.

In [15]

def get_dot_data_innner(branch:Branch, dot_data):
   
    if branch.branch_positive:
        dot_data=dot_data+'{} [label=<{}≤{}<br/>entropy = {:.3f}<br/>samples = {}<br/>value = {}<br/>class = {}> , fillcolor="#FFFFFFFF"] ;\r\n'.format(
            branch.no, col[branch.column],branch.split, branch.entropy, branch.samples, branch.value,branch.clss)
    else:
        dot_data=dot_data+'{} [label=<{} <br/>entropy = {:.3f}<br/>samples = {}<br/>value = {}<br/>class = {}> , fillcolor="#FFFFFFFF"] ;\r\n'.format(
            branch.no,branch.column, branch.entropy, branch.samples, branch.value,branch.clss)
    if branch.branch_positive:
        dot_data=dot_data+'{} -> {} [labeldistance=2.5, labelangle=45, headlabel="True"]; \r\n'.format(branch.no, branch.branch_positive.no)
        dot_data=get_dot_data_innner(branch.branch_positive, dot_data)
    if branch.branch_negative:
        dot_data=dot_data+'{} -> {} [labeldistance=2.5, labelangle=45, headlabel="False"]; \r\n'.format(branch.no, branch.branch_negative.no)
        dot_data=get_dot_data_innner(branch.branch_negative, dot_data)
    return dot_data

In [16]

def get_dot_data(branch:Branch):
    dot_data="""
digraph Tree {
node [shape=box, style="filled, rounded", color="black", fontname=helvetica] ;
edge [fontname=helvetica] ;
"""
    dot_data=get_dot_data_innner(branch,  dot_data)
    dot_data=dot_data+'\r\n}'
    return dot_data
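
For readers who have not seen DOT before, the sketch below assembles the source text for a tiny three-node tree using only string formatting. The node labels and the `nodes` and `edges` lists are made up for illustration. The overall shape (a `digraph` block, one statement per node, one `->` statement per edge) is what the two functions above generate for the real tree.

```python
# Hypothetical three-node tree: (id, label) pairs plus parent -> child edges.
nodes = [(0, "age<=28.5"), (1, "class = 0"), (2, "class = 1")]
edges = [(0, 1, "True"), (0, 2, "False")]

lines = ["digraph Tree {", "node [shape=box] ;"]
for node_id, label in nodes:
    # One statement per node: its id followed by display attributes.
    lines.append('{} [label="{}"] ;'.format(node_id, label))
for parent, child, headlabel in edges:
    # One statement per edge: parent -> child, labelled near the arrow head.
    lines.append('{} -> {} [headlabel="{}"] ;'.format(parent, child, headlabel))
lines.append("}")

dot_source = "\n".join(lines)
print(dot_source)
```

A string like `dot_source` can be handed to `graphviz.Source` in the same way as `dot_data` in the cells that follow.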

In [17]

dot_data=get_dot_data(tree)

In [18]

graph = graphviz.Source(dot_data) 
graph.render('./data/my_dt', format='png')
graph

[Figure: the rendered decision tree (max_depth=4). Each node shows its split condition, entropy, samples, value, and class; edges are labeled True and False.]

<graphviz.files.Source at 0x7ff8f9df9d50>

4.5.4 Validating on the test set and computing the classification accuracy

In [20]

def cl(branch:Branch, x):
    
    # Pure dataset: no further split is needed
    if branch.split is None:
        return branch.clss
    # Keep descending until the maximum depth is reached
    if x[branch.column]<=branch.split:
        if branch.branch_positive is not None:
            return cl(branch.branch_positive,x)
        else:
            return branch.clss
    if x[branch.column]>branch.split:
        if branch.branch_negative is not None:
            return cl(branch.branch_negative,x)
        else:
            return branch.clss
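
Prediction is just a walk from the root to a leaf. The standalone sketch below shows the same idea on a hand-written tree stored as nested dicts. The tree, the sample layout `[age, education_num]`, and the `predict` helper are hypothetical and only meant to illustrate the traversal.

```python
def predict(node, sample):
    """Follow split conditions from the root until a node without children is reached."""
    while "split" in node:
        branch = "left" if sample[node["column"]] <= node["split"] else "right"
        node = node[branch]
    return node["class"]

# Hypothetical two-level tree over samples laid out as [age, education_num].
toy_tree = {
    "column": 1, "split": 12.5,
    "left": {"class": 0},
    "right": {"column": 0, "split": 28.5,
              "left": {"class": 0},
              "right": {"class": 1}},
}

print(predict(toy_tree, [45, 14]))  # education_num > 12.5 and age > 28.5
print(predict(toy_tree, [23, 14]))  # education_num > 12.5 but age <= 28.5
print(predict(toy_tree, [45, 9]))   # education_num <= 12.5
```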

In [21]

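def compute_score(branch:Branch,x,y):
    # Predict the class of every sample
    predictions=np.array([cl(branch,x[i]) for i in range(len(x))])
    if len(predictions)!=len(y):
        raise ValueError("The number of predictions differs from the number of labels; please check the program.")
    a=predictions==y # element-wise comparison against the true labels
    score=a[a==1].shape[0]/a.shape[0] # fraction of correct predictions
    return score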

In [22]

score=compute_score(tree,x_test,y_test)
print('Accuracy of the hand-built decision tree:', score)

Accuracy of the hand-built decision tree: 0.8443584546403784
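
Accuracy is the share of samples whose predicted class equals the true label. A score is easier to judge next to the accuracy of always predicting the most common class, so the sketch below computes both on made-up labels. The data and the `accuracy` helper are illustrative assumptions and say nothing about the actual Adult results.

```python
from collections import Counter

def accuracy(predicted, actual):
    """Fraction of positions where the prediction equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predictions and labels differ in length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical labels and predictions for ten samples.
actual = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
predicted = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

model_score = accuracy(predicted, actual)
# Baseline: predict the majority class for every sample.
majority = Counter(actual).most_common(1)[0][0]
baseline = accuracy([majority] * len(actual), actual)
print(model_score, baseline)
```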

4.6 Running the experiment with sklearn's decision tree classifier

4.6.1 Instantiating the decision tree classifier

In [23]

treeClassifier = DecisionTreeClassifier(max_depth=4,criterion='entropy')

4.6.2 Training the decision tree classifier

In [24]

treeClassifier.fit(x_train, y_train)

DecisionTreeClassifier(criterion='entropy', max_depth=4)

4.6.3 Visualizing the trained decision tree classifier

In [25]

# export_graphviz writes DOT source text, so the output file gets a .dot extension
export_graphviz(treeClassifier, out_file="dt_clf.dot", feature_names=col)
with open('dt_clf.dot', 'r') as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

<graphviz.files.Source at 0x7ff8fca5fb10>

4.6.4 Validating on the test set and computing the classification accuracy

In [26]

score = treeClassifier.score(x_test, y_test)
print('Accuracy of the sklearn decision tree:', score)

Accuracy of the sklearn decision tree: 0.8443584546403784


2025-07-16
