Gadzan

AI Course Notes (5)

Naive Bayes Classifier: From Bayes' Theorem to a Classification Model

Classification vs. Regression

- A classification model predicts a label (a type or category); its output is a discrete value.
- A regression model predicts a quantity; its output is a continuous value.

Bayes' Theorem

P(A|B) = P(B|A) * P(A) / P(B)

The generalized form, where B1, ..., Bn partition the sample space:

P(Bi|A) = P(A|Bi) * P(Bi) / Σj P(A|Bj) * P(Bj)

Naive Bayes Classifier

The model function of the naive Bayes classifier, where Fi denotes the i-th feature and C the class label:

C* = argmax_C P(C) * Πi P(Fi|C)

P(Fi|C) is the conditional probability of the i-th feature given that the sample belongs to class C. The "naive" part is the assumption that the features are conditionally independent given the class.

Parameter Estimation for the Conditional Probabilities

Two schools of statistics:

- Frequentist: the world is deterministic; there is a single underlying truth with a fixed true value. Our goal is to find that true value, or the range it lies in.
- Bayesian: the world is uncertain; the underlying truth has no fixed value but instead follows a probability distribution. Our goal is to find the distribution that best describes it.

The frequentist tool for parameter estimation (estimating the parameters of a probability distribution) is Maximum Likelihood Estimation (MLE). Likelihood refers to the chance of some event occurring, and is similar to probability:

L(θ|x) = P(x|θ)

MLE finds the parameter value that maximizes the likelihood function. Maximizing a likelihood function is equivalent to maximizing its natural logarithm.

MLE for a normal distribution:

μ̂ = (1/n) Σ xi
σ̂² = (1/n) Σ (xi − μ̂)²
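The MLE formulas for a normal distribution can be checked numerically. A minimal sketch, using a made-up sample (not from the course data): the MLE of the mean is the sample mean, and the MLE of the variance divides by n, not n−1.

```python
import numpy as np

# Made-up sample, for illustration only
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu_hat = x.mean()                         # MLE of the mean: sample average
sigma2_hat = ((x - mu_hat) ** 2).mean()   # MLE of the variance: divides by n, not n-1

print(mu_hat, sigma2_hat)  # 5.0 4.0
```

Note that `np.var(x)` uses the same 1/n convention by default (`ddof=0`), so it agrees with the MLE; the unbiased estimator would use `ddof=1`.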

Implementing a Naive Bayes Model in Code

# csv
no,985,education,skill,enrolled
1,Yes,bachlor,C++,No
2,Yes,bachlor,Java,Yes
3,No,master,Java,Yes
4,No,master,C++,No
5,Yes,bachlor,Java,Yes
6,No,master,C++,No
7,Yes,master,Java,Yes
8,Yes,phd,C++,Yes
9,No,phd,Java,Yes
10,No,bachlor,Java,No
import pandas as pd
import numpy as np
import time
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Importing dataset.
# Please refer to the CSV data above for the data file.
data = pd.read_csv("career_data.csv")

# Convert categorical variables to numeric codes
data["985_cleaned"] = np.where(data["985"] == "Yes", 1, 0)
data["education_cleaned"] = np.where(data["education"] == "bachlor", 1,
                                     np.where(data["education"] == "master", 2,
                                              np.where(data["education"] == "phd", 3, 4)))
# Note: the CSV stores "C++" and "Java" with capital letters;
# comparing against lowercase strings would silently mislabel every row.
data["skill_cleaned"] = np.where(data["skill"] == "C++", 1,
                                 np.where(data["skill"] == "Java", 2, 3))
data["enrolled_cleaned"] = np.where(data["enrolled"] == "Yes", 1, 0)

# Split dataset into training and test sets.
# Seeding with the current time makes each run's split different (non-reproducible).
X_train, X_test = train_test_split(data, test_size=0.1, random_state=int(time.time()))

# Instantiate the classifier
gnb = GaussianNB()
used_features = [
    "985_cleaned",
    "education_cleaned",
    "skill_cleaned"
]

# Train classifier
gnb.fit(
    X_train[used_features].values,
    X_train["enrolled_cleaned"]
)
y_pred = gnb.predict(X_test[used_features].values)

# Print results
print("Number of mislabeled points out of a total {} points : {}, performance {:05.2f}%"
      .format(
          X_test.shape[0],
          (X_test["enrolled_cleaned"] != y_pred).sum(),
          100 * (1 - (X_test["enrolled_cleaned"] != y_pred).sum() / X_test.shape[0])
      ))
# output: Number of mislabeled points out of a total 1 points : 0, performance 100.00%
