A telephone company wants to determine which customer characteristics are useful for predicting churn, that is, which customers will leave its service.
The data set is Churn. The fields are as follows:
Data background
State | discrete. |
account length | continuous. |
area code | continuous. |
phone number | discrete. |
international plan | discrete. |
voice mail plan | discrete. |
number vmail messages | continuous. |
total day minutes | continuous. |
total day calls | continuous. |
total day charge | continuous. |
total eve minutes | continuous. |
total eve calls | continuous. |
total eve charge | continuous. |
total night minutes | continuous. |
total night calls | continuous. |
total night charge | continuous. |
total intl minutes | continuous. |
total intl calls | continuous. |
total intl charge | continuous. |
number customer service calls | continuous. |
churn | discrete. |
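1. KNN overview
The simplest, most rudimentary classifier records the class of every training example and classifies a test object only when its attributes exactly match those of some training object.
K-nearest neighbours (KNN) is a basic classification method that classifies by measuring the distance between feature values.
The idea behind k-nearest neighbours is this: if most of the k samples most similar to a given sample (that is, its nearest neighbours in feature space) belong to one class, then the sample is assigned to that class as well. Here k is usually an integer no greater than 20.
In KNN, the neighbours that are selected are all objects that have already been correctly classified.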
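2. KNN example
- Which class should the green circle be assigned to: the red triangles or the blue squares?
- If k=3, red triangles make up 2/3 of the neighbours, so the green circle is assigned to the red triangle class.
- If k=5, blue squares make up 3/5 of the neighbours, so the green circle is assigned to the blue square class.
- The result of the KNN algorithm depends heavily on the choice of K.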
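To make the effect of k concrete, here is a minimal Python sketch of the vote. The neighbour labels are hypothetical and only mirror the counts in the example (two red triangles and one blue square among the three nearest points, then two more blue squares); `vote` is an illustrative helper, not a library function.

```python
from collections import Counter

# Hypothetical neighbour labels, ordered from nearest to farthest,
# chosen only to reproduce the counts in the example.
neighbours = ["red triangle", "red triangle", "blue square",
              "blue square", "blue square"]


def vote(labels, k):
    """Return the most common label among the k nearest neighbours."""
    return Counter(labels[:k]).most_common(1)[0][0]


print(vote(neighbours, 3))  # red triangle (2 of 3)
print(vote(neighbours, 5))  # blue square (3 of 5)
```

The same five neighbours give two different answers, which is why k is normally tuned rather than fixed in advance.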
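3. KNN distance computation
In KNN, the distance between objects serves as a measure of their dissimilarity, which avoids having to match objects attribute by attribute. The distance used is usually the Euclidean distance or the Manhattan distance. For two objects x and y with n features:
$$d_{\text{Euclidean}}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
$$d_{\text{Manhattan}}(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$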
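As a minimal sketch, the Euclidean and Manhattan distances can be written in a few lines of Python. The function names and the two sample points are illustrative assumptions, not taken from the article.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared differences between features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of the absolute differences between features."""
    return sum(abs(a - b) for a, b in zip(x, y))


# Two hypothetical points with two features each.
p, q = (0.0, 0.0), (3.0, 4.0)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7.0
```

Both measures are sensitive to feature scale, so variables such as minutes and call counts are commonly standardized before distances are computed.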
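4. KNN algorithm
Given a training set whose data and labels are known, the algorithm takes a test point, compares its features with the corresponding features in the training set, and finds the K training points most similar to it. The test point is then assigned the class that occurs most often among those K points. The steps are:
- Compute the distance between the test point and every training point.
- Sort the training points by increasing distance.
- Select the k points with the smallest distance.
- Count how often each class occurs among those k points.
- Return the most frequent class among the k points as the predicted class of the test point.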
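The five steps can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code behind the results in this article: `knn_predict`, the toy training points and their labels are all hypothetical, and the distance is Euclidean via `math.dist` (Python 3.8+).

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # 1. Distance from the query to every training point (Euclidean).
    distances = [(math.dist(x, query), label)
                 for x, label in zip(train_X, train_y)]
    # 2. Sort by increasing distance.
    distances.sort(key=lambda pair: pair[0])
    # 3. Keep the k closest points.
    nearest = distances[:k]
    # 4. Count how often each class occurs among them.
    counts = Counter(label for _, label in nearest)
    # 5. Return the most frequent class.
    return counts.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups.
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
train_y = ["no", "no", "no", "yes", "yes", "yes"]

print(knn_predict(train_X, train_y, (2, 2), k=3))  # no
print(knn_predict(train_X, train_y, (6, 5), k=3))  # yes
```

With an even k a tie is possible; in this sketch `Counter.most_common` then returns whichever tied class was encountered first, so an odd k is the simpler choice for two classes.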
Data Preparation and Exploration
View a summary of the data
## state account.length area.code phone.number
## WV : 158 Min. : 1.0 Min. :408.0 327-1058: 1
## MN : 125 1st Qu.: 73.0 1st Qu.:408.0 327-1319: 1
## AL : 124 Median :100.0 Median :415.0 327-2040: 1
## ID : 119 Mean :100.3 Mean :436.9 327-2475: 1
## VA : 118 3rd Qu.:127.0 3rd Qu.:415.0 327-3053: 1
## OH : 116 Max. :243.0 Max. :510.0 327-3587: 1
## (Other):4240 (Other) :4994
## international.plan voice.mail.plan number.vmail.messages
## no :4527 no :3677 Min. : 0.000
## yes: 473 yes:1323 1st Qu.: 0.000
## Median : 0.000
## Mean : 7.755
## 3rd Qu.:17.000
## Max. :52.000
##
## total.day.minutes total.day.calls total.day.charge total.eve.minutes
## Min. : 0.0 Min. : 0 Min. : 0.00 Min. : 0.0
## 1st Qu.:143.7 1st Qu.: 87 1st Qu.:24.43 1st Qu.:166.4
## Median :180.1 Median :100 Median :30.62 Median :201.0
## Mean :180.3 Mean :100 Mean :30.65 Mean :200.6
## 3rd Qu.:216.2 3rd Qu.:113 3rd Qu.:36.75 3rd Qu.:234.1
## Max. :351.5 Max. :165 Max. :59.76 Max. :363.7
##
## total.eve.calls total.eve.charge total.night.minutes total.night.calls
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 87.0 1st Qu.:14.14 1st Qu.:166.9 1st Qu.: 87.00
## Median :100.0 Median :17.09 Median :200.4 Median :100.00
## Mean :100.2 Mean :17.05 Mean :200.4 Mean : 99.92
## 3rd Qu.:114.0 3rd Qu.:19.90 3rd Qu.:234.7 3rd Qu.:113.00
## Max. :170.0 Max. :30.91 Max. :395.0 Max. :175.00
##
## total.night.charge total.intl.minutes total.intl.calls total.intl.charge
## Min. : 0.000 Min. : 0.00 Min. : 0.000 Min. :0.000
## 1st Qu.: 7.510 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
## Median : 9.020 Median :10.30 Median : 4.000 Median :2.780
## Mean : 9.018 Mean :10.26 Mean : 4.435 Mean :2.771
## 3rd Qu.:10.560 3rd Qu.:12.00 3rd Qu.: 6.000 3rd Qu.:3.240
## Max. :17.770 Max. :20.00 Max. :20.000 Max. :5.400
##
## number.customer.service.calls churn
## Min. :0.00 False.:4293
## 1st Qu.:1.00 True. : 707
## Median :1.00
## Mean :1.57
## 3rd Qu.:2.00
## Max. :9.00
##
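The summary shows that there are no missing values. It also shows that the phone number and the area code carry no useful information for prediction and can be dropped.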
Examine the variables graphically
The results above show that far more samples have churn = no than churn = yes, so non-churners make up the large majority of the sample.
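The size of this imbalance can be read off the churn counts in the data summary. The short Python sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# Class counts copied from the churn column of the data summary.
counts = {"False.": 4293, "True.": 707}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial rule that always predicts the majority class.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(total)                        # 5000
print(round(shares["True."], 4))    # 0.1414
print(round(baseline_accuracy, 4))  # 0.8586
```

A classifier on this data therefore needs to beat roughly 86% accuracy before it adds anything over always predicting "no churn".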
The results above also show that, apart from number.vmail.messages and area.code, the numeric variables are approximately normally distributed.
## account.length area.code number.vmail.messages total.day.minutes
## Min. : 1.0 Min. :408.0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 73.0 1st Qu.:408.0 1st Qu.: 0.000 1st Qu.:143.7
## Median :100.0 Median :415.0 Median : 0.000 Median :180.1
## Mean :100.3 Mean :436.9 Mean : 7.755 Mean :180.3
## 3rd Qu.:127.0 3rd Qu.:415.0 3rd Qu.:17.000 3rd Qu.:216.2
## Max. :243.0 Max. :510.0 Max. :52.000 Max. :351.5
## total.day.calls total.day.charge total.eve.minutes total.eve.calls
## Min. : 0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 87 1st Qu.:24.43 1st Qu.:166.4 1st Qu.: 87.0
## Median :100 Median :30.62 Median :201.0 Median :100.0
## Mean :100 Mean :30.65 Mean :200.6 Mean :100.2
## 3rd Qu.:113 3rd Qu.:36.75 3rd Qu.:234.1 3rd Qu.:114.0
## Max. :165 Max. :59.76 Max. :363.7 Max. :170.0
## total.eve.charge total.night.minutes total.night.calls total.night.charge
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.:14.14 1st Qu.:166.9 1st Qu.: 87.00 1st Qu.: 7.510
## Median :17.09 Median :200.4 Median :100.00 Median : 9.020
## Mean :17.05 Mean :200.4 Mean : 99.92 Mean : 9.018
## 3rd Qu.:19.90 3rd Qu.:234.7 3rd Qu.:113.00 3rd Qu.:10.560
## Max. :30.91 Max. :395.0 Max. :175.00 Max. :17.770
## total.intl.minutes total.intl.calls total.intl.charge
## Min. : 0.00 Min. : 0.000 Min. :0.000
## 1st Qu.: 8.50 1st Qu.: 3.000 1st Qu.:2.300
## Median :10.30 Median : 4.000 Median :2.780
## Mean :10.26 Mean : 4.435 Mean :2.771
## 3rd Qu.:12.00 3rd Qu.: 6.000 3rd Qu.:3.240
## Max. :20.00 Max. :20.000 Max. :5.400
## number.customer.service.calls
## Min. :0.00
## 1st Qu.:1.00
## Median :1.00
## Mean :1.57
## 3rd Qu.:2.00
## Max. :9.00
Relationships between variables
The results show a significant positive linear relationship between the two variables.
Using the statistics node, report
## account.length area.code
## account.length 1.0000000000 -0.018054187
## area.code -0.0180541874 1.000000000
## number.vmail.messages -0.0145746663 -0.003398983
## total.day.minutes -0.0010174908 -0.019118245
## total.day.calls 0.0282402279 -0.019313854
## total.day.charge -0.0010191980 -0.019119256
## total.eve.minutes -0.0095913331 0.007097877
## total.eve.calls 0.0091425790 -0.012299947
## total.eve.charge -0.0095873958 0.007114130
## total.night.minutes 0.0006679112 0.002083626
## total.night.calls -0.0078254785 0.014656846
## total.night.charge 0.0006558937 0.002070264
## total.intl.minutes 0.0012908394 -0.004153729
## total.intl.calls 0.0142772733 -0.013623309
## total.intl.charge 0.0012918112 -0.004219099
## number.customer.service.calls -0.0014447918 0.020920513
## number.vmail.messages total.day.minutes
## account.length -0.0145746663 -0.001017491
## area.code -0.0033989831 -0.019118245
## number.vmail.messages 1.0000000000 0.005381376
## total.day.minutes 0.0053813760 1.000000000
## total.day.calls 0.0008831280 0.001935149
## total.day.charge 0.0053767959 0.999999951
## total.eve.minutes 0.0194901208 -0.010750427
## total.eve.calls -0.0039543728 0.008128130
## total.eve.charge 0.0194959757 -0.010760022
## total.night.minutes 0.0055413838 0.011798660
## total.night.calls 0.0026762202 0.004236100
## total.night.charge 0.0055349281 0.011782533
## total.intl.minutes 0.0024627018 -0.019485746
## total.intl.calls 0.0001243302 -0.001303123
## total.intl.charge 0.0025051773 -0.019414797
## number.customer.service.calls -0.0070856427 0.002732576
## total.day.calls total.day.charge
## account.length 0.0282402279 -0.001019198
## area.code -0.0193138545 -0.019119256
## number.vmail.messages 0.0008831280 0.005376796
## total.day.minutes 0.0019351487 0.999999951
## total.day.calls 1.0000000000 0.001935884
## total.day.charge 0.0019358844 1.000000000
## total.eve.minutes -0.0006994115 -0.010747297
## total.eve.calls 0.0037541787 0.008129319
## total.eve.charge -0.0006952217 -0.010756893
## total.night.minutes 0.0028044650 0.011801434
## total.night.calls -0.0083083467 0.004234934
## total.night.charge 0.0028018169 0.011785301
## total.intl.minutes 0.0130972198 -0.019489700
## total.intl.calls 0.0108928533 -0.001306635
## total.intl.charge 0.0131613976 -0.019418755
## number.customer.service.calls -0.0107394951 0.002726370
## total.eve.minutes total.eve.calls
## account.length -0.0095913331 0.009142579
## area.code 0.0070978766 -0.012299947
## number.vmail.messages 0.0194901208 -0.003954373
## total.day.minutes -0.0107504274 0.008128130
## total.day.calls -0.0006994115 0.003754179
## total.day.charge -0.0107472968 0.008129319
## total.eve.minutes 1.0000000000 0.002763019
## total.eve.calls 0.0027630194 1.000000000
## total.eve.charge 0.9999997749 0.002778097
## total.night.minutes -0.0166391160 0.001781411
## total.night.calls 0.0134202163 -0.013682341
## total.night.charge -0.0166420421 0.001799380
## total.intl.minutes 0.0001365487 -0.007458458
## total.intl.calls 0.0083881559 0.005574500
## total.intl.charge 0.0001593155 -0.007507151
## number.customer.service.calls -0.0138234228 0.006234831
## total.eve.charge total.night.minutes
## account.length -0.0095873958 0.0006679112
## area.code 0.0071141298 0.0020836263
## number.vmail.messages 0.0194959757 0.0055413838
## total.day.minutes -0.0107600217 0.0117986600
## total.day.calls -0.0006952217 0.0028044650
## total.day.charge -0.0107568931 0.0118014339
## total.eve.minutes 0.9999997749 -0.0166391160
## total.eve.calls 0.0027780971 0.0017814106
## total.eve.charge 1.0000000000 -0.0166489191
## total.night.minutes -0.0166489191 1.0000000000
## total.night.calls 0.0134220174 0.0269718182
## total.night.charge -0.0166518367 0.9999992072
## total.intl.minutes 0.0001320238 -0.0067209669
## total.intl.calls 0.0083930603 -0.0172140162
## total.intl.charge 0.0001547783 -0.0066545873
## number.customer.service.calls -0.0138363623 -0.0085325365
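Keeping highly correlated variables in the model can cause multicollinearity, so one variable of each highly correlated pair should be removed. In the matrix above, every minutes/charge pair is almost perfectly correlated (for example, total.day.minutes and total.day.charge at 0.99999995).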
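The near-perfect correlations are what one would expect if each charge is simply a fixed rate times the minutes. The Python sketch below illustrates this with made-up values: the rate of 0.17 is an assumption (roughly the ratio of the two means in the summary), the minute values are illustrative rather than rows from the data set, and the 0.95 cut-off is an arbitrary choice.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Illustrative minutes and a charge derived from them at an assumed rate.
minutes = [95.0, 143.7, 180.1, 216.2, 351.5]
charge = [round(0.17 * m, 2) for m in minutes]

r = pearson(minutes, charge)

# Drop one variable of the pair when |r| exceeds the (arbitrary) cut-off.
columns = ["total.day.minutes", "total.day.charge"]
if abs(r) > 0.95:
    columns.remove("total.day.charge")

print(round(r, 4))
print(columns)
```

Which variable of a pair to keep is a modelling choice; dropping the charges and keeping the minutes is only one option.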
Data Manipulation
The results show some correlation between total.day.calls and total.day.charge.
In particular, the relationship is negative among customers whose voice mail plan is no.
Discretize (make categorical) a relevant numeric variable
Discretize the variable.
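As a rough sketch of discretization, the Python snippet below bins a minutes value into short, medium and long. The level names follow the total.day.minutes1medium and total.day.minutes1short terms that appear in the regression output later in this article; the cut points of 100 and 200 minutes and the sample values are hypothetical, since the article does not state the thresholds it used.

```python
from collections import Counter


def discretize(minutes, low=100.0, high=200.0):
    """Map a numeric minutes value to a category (cut points are assumed)."""
    if minutes < low:
        return "short"
    if minutes < high:
        return "medium"
    return "long"


# Illustrative values only, not rows from the data set.
sample_minutes = [50.0, 143.7, 180.1, 216.2, 351.5]
levels = [discretize(m) for m in sample_minutes]

print(levels)
print(Counter(levels))
```

Once the variable is categorical, counting churners within each level gives the kind of overlay described in the next steps.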
Construct a distribution of the variable with a churn overlay
Construct a histogram of the variable with a churn overlay
Find a pair of numeric variables which are correlated to churn.
The results show some correlation between total.day.calls and total.day.charge.
Model Building
In particular, the correlation appears among customers with churn = no.
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3082150 0.0735760 4.189 2.85e-05 ***
## stateAL 0.0151188 0.0462343 0.327 0.743680
## stateAR 0.0894792 0.0490897 1.823 0.068399 .
## stateAZ 0.0329566 0.0494195 0.667 0.504883
## stateCA 0.1951511 0.0567439 3.439 0.000588 ***
## international.plan yes 0.3059341 0.0151677 20.170 < 2e-16 ***
## voice.mail.plan yes -0.1375056 0.0337533 -4.074 4.70e-05 ***
## number.vmail.messages 0.0017068 0.0010988 1.553 0.120402
## total.day.minutes 0.3796323 0.2629027 1.444 0.148802
## total.day.calls 0.0002191 0.0002235 0.981 0.326781
## total.day.charge -2.2207671 1.5464583 -1.436 0.151056
## total.eve.minutes 0.0288233 0.1307496 0.220 0.825533
## total.eve.calls -0.0001585 0.0002238 -0.708 0.478915
## total.eve.charge -0.3316041 1.5382391 -0.216 0.829329
## total.night.minutes 0.0083224 0.0695916 0.120 0.904814
## total.night.calls -0.0001824 0.0002225 -0.820 0.412290
## total.night.charge -0.1760782 1.5464674 -0.114 0.909355
## total.intl.minutes -0.0104679 0.4192270 -0.025 0.980080
## total.intl.calls -0.0063448 0.0018062 -3.513 0.000447 ***
## total.intl.charge 0.0676460 1.5528267 0.044 0.965254
## number.customer.service.calls 0.0566474 0.0033945 16.688 < 2e-16 ***
## total.day.minutes1medium 0.0502681 0.0160228 3.137 0.001715 **
## total.day.minutes1short 0.2404020 0.0322293 7.459 1.02e-13 ***
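The results show that state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium and total.day.minutes1short have a significant effect. The output also marks international.plan and voice.mail.plan as highly significant.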
Use K-Nearest-Neighbors (K-NN) algorithm to develop a model for predicting Churn
## Direction.2005
## knn.pred 1 2
## 1 760 97
## 2 100 43
[1] 0.803
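A confusion matrix is a visualization tool used mainly in supervised learning; in unsupervised learning it is usually called a matching matrix. In the tables shown here, each row corresponds to a predicted class (knn.pred) and each column, as far as the output indicates, to an actual class.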
## Direction.2005
## knn.pred 1 2
## 1 827 104
## 2 33 36
[1] 0.863
The test-set results show an accuracy of about 86%.
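Both accuracy figures can be re-derived from the two tables. The Python sketch below does that arithmetic, and also computes the share of class 2 cases that were predicted correctly, assuming that rows are predictions and columns are actual classes (the layout of R's `table(predicted, actual)`). The helper names are illustrative.

```python
def accuracy(matrix):
    """Share of all cases that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall(matrix, j):
    """Share of actual class j (column j, by assumption) predicted as j."""
    return matrix[j][j] / sum(row[j] for row in matrix)


# Counts copied from the two tables; rows = knn.pred, columns = classes 1, 2.
first = [[760, 97], [100, 43]]
second = [[827, 104], [33, 36]]

print(accuracy(first))              # 0.803
print(accuracy(second))             # 0.863
print(round(recall(first, 1), 3))   # 0.307
print(round(recall(second, 1), 3))  # 0.257
```

Under that assumption, the second table has the higher overall accuracy but recovers a smaller share of class 2 cases, so accuracy alone does not tell the whole story here.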
Findings
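We find some correlation between total.day.calls and total.day.charge, particularly among customers with churn = no. We also find that state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium and total.day.minutes1short have a significant effect. Finally, the KNN model reaches an accuracy of about 80% on the training set and about 86% on the test set. At face value this indicates good predictive performance, although both figures should be read against the class imbalance, since roughly 86% of the customers in the data (4293 of 5000) do not churn.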
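About the author
Kaizong Ye is a researcher at 拓端研究室 (TRL). We sincerely thank him for his contribution to this article. He completed a master's degree in statistics at Shanghai University of Finance and Economics and focuses on artificial intelligence. His strengths include Python, Matlab simulation, visual processing, neural networks, and data analysis.
This article draws on material the author recently prepared for the course 《R语言数据分析挖掘必知必会》.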