A telephone company wants to determine which customer characteristics are useful for predicting churn, that is, which customers will leave its service.
The data set is Churn. The fields are as follows:
Data background
| Field | Type |
| --- | --- |
| state | discrete |
| account length | continuous |
| area code | continuous |
| phone number | discrete |
| international plan | discrete |
| voice mail plan | discrete |
| number vmail messages | continuous |
| total day minutes | continuous |
| total day calls | continuous |
| total day charge | continuous |
| total eve minutes | continuous |
| total eve calls | continuous |
| total eve charge | continuous |
| total night minutes | continuous |
| total night calls | continuous |
| total night charge | continuous |
| total intl minutes | continuous |
| total intl calls | continuous |
| total intl charge | continuous |
| number customer service calls | continuous |
| churn | discrete |
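1. KNN overview
The simplest, most rudimentary classifier records the class of every training example and classifies a test object only when its attributes exactly match those of some training object.
K-nearest neighbours (KNN) is a basic classification method that classifies by measuring the distance between feature values.
The idea behind KNN is this: if most of the k samples most similar to a given sample (that is, its nearest neighbours in feature space) belong to one class, the sample is assigned to that class as well. k is usually an integer no greater than 20.
In KNN, the neighbours consulted are all objects that have already been correctly classified.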
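2. KNN example
- Should the green circle be assigned to the red-triangle class or the blue-square class?
- If k=3, red triangles make up 2/3 of the nearest neighbours, so the green circle is assigned to the red-triangle class.
- If k=5, blue squares make up 3/5 of the nearest neighbours, so the green circle is assigned to the blue-square class.
- The result of KNN therefore depends heavily on the choice of k.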
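
The effect of k in this example can be reproduced with a short Python sketch. The neighbour list is hypothetical: it is ordered from nearest to farthest and chosen only so that the proportions match the 2/3 and 3/5 quoted above.

```python
from collections import Counter

# Hypothetical neighbour labels for the green circle, sorted from nearest
# to farthest; chosen only to reproduce the 2/3 and 3/5 proportions.
neighbours = ["red triangle", "red triangle", "blue square",
              "blue square", "blue square"]


def majority_vote(sorted_labels, k):
    """Return the most common label among the k nearest neighbours."""
    return Counter(sorted_labels[:k]).most_common(1)[0][0]


vote_k3 = majority_vote(neighbours, 3)  # 2 red triangles vs 1 blue square
vote_k5 = majority_vote(neighbours, 5)  # 2 red triangles vs 3 blue squares
print(vote_k3, "|", vote_k5)
```

The same five neighbours give a different class depending on how many of them are allowed to vote.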
 
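3. KNN distance calculation
KNN uses the distance between objects as a measure of their dissimilarity, which avoids having to match objects exactly. The distance is usually the Euclidean distance or the Manhattan distance.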
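
As a minimal Python sketch of the two distances: the Euclidean distance is the square root of the summed squared coordinate differences, and the Manhattan distance is the sum of the absolute coordinate differences. The two points below are illustrative and are not taken from the Churn data.

```python
import math


def euclidean(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


p, q = (0.0, 0.0), (3.0, 4.0)  # illustrative points
d_euc = euclidean(p, q)
d_man = manhattan(p, q)
print(d_euc, d_man)
```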
        
    
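4. The KNN algorithm
Given a training set whose data and labels are known, the features of a test point are compared with the corresponding features in the training set to find the k most similar training points; the class that occurs most often among those k points is the predicted class of the test point. The algorithm can be described as follows:
- Compute the distance between the test point and every training point.
- Sort the distances in increasing order.
- Select the k points with the smallest distances.
- Count how often each class occurs among these k points.
- Return the most frequent class among the k points as the predicted class of the test point.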
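
The five steps can be sketched in plain Python as follows. This is an illustration under assumptions rather than the code used for the article: the function name `knn_predict`, the toy points and the labels are all made up, and Euclidean distance is assumed.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Step 1: distance from the query to every training point (Euclidean).
    distances = [(math.dist(x, query), label)
                 for x, label in zip(train_X, train_y)]
    # Step 2: sort by increasing distance.
    distances.sort(key=lambda pair: pair[0])
    # Step 3: keep the k closest points.
    nearest = distances[:k]
    # Step 4: count how often each class occurs among them.
    counts = Counter(label for _, label in nearest)
    # Step 5: return the most frequent class.
    return counts.most_common(1)[0][0]


# Toy training data (hypothetical, two features per point).
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
           (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["stay", "stay", "stay", "churn", "churn", "churn"]

pred_near_stay = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
pred_near_churn = knn_predict(train_X, train_y, (5.0, 5.1), k=3)
print(pred_near_stay, pred_near_churn)
```

Sorting every distance is the simplest way to mirror the steps as listed; for large training sets a partial selection or a spatial index is the usual alternative.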
 
        
    
Data Preparation and Exploration 
View the data summary
##      state      account.length    area.code        phone.number 
##  WV     : 158   Min.   :  1.0   Min.   :408.0    327-1058:   1  
##  MN     : 125   1st Qu.: 73.0   1st Qu.:408.0    327-1319:   1  
##  AL     : 124   Median :100.0   Median :415.0    327-2040:   1  
##  ID     : 119   Mean   :100.3   Mean   :436.9    327-2475:   1  
##  VA     : 118   3rd Qu.:127.0   3rd Qu.:415.0    327-3053:   1  
##  OH     : 116   Max.   :243.0   Max.   :510.0    327-3587:   1  
##  (Other):4240                                   (Other)  :4994  
##  international.plan voice.mail.plan number.vmail.messages
##   no :4527           no :3677       Min.   : 0.000       
##   yes: 473           yes:1323       1st Qu.: 0.000       
##                                     Median : 0.000       
##                                     Mean   : 7.755       
##                                     3rd Qu.:17.000       
##                                     Max.   :52.000       
##                                                          
##  total.day.minutes total.day.calls total.day.charge total.eve.minutes
##  Min.   :  0.0     Min.   :  0     Min.   : 0.00    Min.   :  0.0    
##  1st Qu.:143.7     1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4    
##  Median :180.1     Median :100     Median :30.62    Median :201.0    
##  Mean   :180.3     Mean   :100     Mean   :30.65    Mean   :200.6    
##  3rd Qu.:216.2     3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1    
##  Max.   :351.5     Max.   :165     Max.   :59.76    Max.   :363.7    
##                                                                      
##  total.eve.calls total.eve.charge total.night.minutes total.night.calls
##  Min.   :  0.0   Min.   : 0.00    Min.   :  0.0       Min.   :  0.00   
##  1st Qu.: 87.0   1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00   
##  Median :100.0   Median :17.09    Median :200.4       Median :100.00   
##  Mean   :100.2   Mean   :17.05    Mean   :200.4       Mean   : 99.92   
##  3rd Qu.:114.0   3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00   
##  Max.   :170.0   Max.   :30.91    Max.   :395.0       Max.   :175.00   
##                                                                        
##  total.night.charge total.intl.minutes total.intl.calls total.intl.charge
##  Min.   : 0.000     Min.   : 0.00      Min.   : 0.000   Min.   :0.000    
##  1st Qu.: 7.510     1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300    
##  Median : 9.020     Median :10.30      Median : 4.000   Median :2.780    
##  Mean   : 9.018     Mean   :10.26      Mean   : 4.435   Mean   :2.771    
##  3rd Qu.:10.560     3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240    
##  Max.   :17.770     Max.   :20.00      Max.   :20.000   Max.   :5.400    
##                                                                          
##  number.customer.service.calls     churn     
##  Min.   :0.00                   False.:4293  
##  1st Qu.:1.00                   True. : 707  
##  Median :1.00                                
##  Mean   :1.57                                
##  3rd Qu.:2.00                                
##  Max.   :9.00                                
## 
The summary shows no missing values. The phone number and area code carry no useful information for prediction and can be dropped.
Examine the variables graphically



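From the results above, the number of samples with churn = no (4293 in the summary) is far larger than the number with churn = yes (707), so non-churners make up the large majority of the sample.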
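
The degree of imbalance follows directly from the churn counts printed in the summary output. A small Python sketch of the arithmetic (the counts are copied from the summary; the variable names are only illustrative):

```python
# Counts copied from the churn column of the summary output.
counts = {"False.": 4293, "True.": 707}

total = sum(counts.values())
churn_rate = counts["True."] / total            # share of customers who churn
majority_share = max(counts.values()) / total   # share of the larger class

print(total, round(churn_rate, 4), round(majority_share, 4))
```

With classes this unbalanced, the share of the larger class is a useful yardstick when reading any accuracy figure.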



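From the results above, apart from emailcode (presumably number.vmail.messages) and area.code, the other numeric variables are approximately normally distributed.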
##  account.length    area.code     number.vmail.messages total.day.minutes
##  Min.   :  1.0   Min.   :408.0   Min.   : 0.000        Min.   :  0.0    
##  1st Qu.: 73.0   1st Qu.:408.0   1st Qu.: 0.000        1st Qu.:143.7    
##  Median :100.0   Median :415.0   Median : 0.000        Median :180.1    
##  Mean   :100.3   Mean   :436.9   Mean   : 7.755        Mean   :180.3    
##  3rd Qu.:127.0   3rd Qu.:415.0   3rd Qu.:17.000        3rd Qu.:216.2    
##  Max.   :243.0   Max.   :510.0   Max.   :52.000        Max.   :351.5    
##  total.day.calls total.day.charge total.eve.minutes total.eve.calls
##  Min.   :  0     Min.   : 0.00    Min.   :  0.0     Min.   :  0.0  
##  1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4     1st Qu.: 87.0  
##  Median :100     Median :30.62    Median :201.0     Median :100.0  
##  Mean   :100     Mean   :30.65    Mean   :200.6     Mean   :100.2  
##  3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1     3rd Qu.:114.0  
##  Max.   :165     Max.   :59.76    Max.   :363.7     Max.   :170.0  
##  total.eve.charge total.night.minutes total.night.calls total.night.charge
##  Min.   : 0.00    Min.   :  0.0       Min.   :  0.00    Min.   : 0.000    
##  1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00    1st Qu.: 7.510    
##  Median :17.09    Median :200.4       Median :100.00    Median : 9.020    
##  Mean   :17.05    Mean   :200.4       Mean   : 99.92    Mean   : 9.018    
##  3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00    3rd Qu.:10.560    
##  Max.   :30.91    Max.   :395.0       Max.   :175.00    Max.   :17.770    
##  total.intl.minutes total.intl.calls total.intl.charge
##  Min.   : 0.00      Min.   : 0.000   Min.   :0.000    
##  1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300    
##  Median :10.30      Median : 4.000   Median :2.780    
##  Mean   :10.26      Mean   : 4.435   Mean   :2.771    
##  3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240    
##  Max.   :20.00      Max.   :20.000   Max.   :5.400    
##  number.customer.service.calls
##  Min.   :0.00                 
##  1st Qu.:1.00                 
##  Median :1.00                 
##  Mean   :1.57                 
##  3rd Qu.:2.00                 
##  Max.   :9.00
Relationships between variables

The result shows a significant positive linear relationship between the two variables.

Using the statistics node, report
##                               account.length    area.code
## account.length                  1.0000000000 -0.018054187
## area.code                      -0.0180541874  1.000000000
## number.vmail.messages          -0.0145746663 -0.003398983
## total.day.minutes              -0.0010174908 -0.019118245
## total.day.calls                 0.0282402279 -0.019313854
## total.day.charge               -0.0010191980 -0.019119256
## total.eve.minutes              -0.0095913331  0.007097877
## total.eve.calls                 0.0091425790 -0.012299947
## total.eve.charge               -0.0095873958  0.007114130
## total.night.minutes             0.0006679112  0.002083626
## total.night.calls              -0.0078254785  0.014656846
## total.night.charge              0.0006558937  0.002070264
## total.intl.minutes              0.0012908394 -0.004153729
## total.intl.calls                0.0142772733 -0.013623309
## total.intl.charge               0.0012918112 -0.004219099
## number.customer.service.calls  -0.0014447918  0.020920513
##                               number.vmail.messages total.day.minutes
## account.length                        -0.0145746663      -0.001017491
## area.code                             -0.0033989831      -0.019118245
## number.vmail.messages                  1.0000000000       0.005381376
## total.day.minutes                      0.0053813760       1.000000000
## total.day.calls                        0.0008831280       0.001935149
## total.day.charge                       0.0053767959       0.999999951
## total.eve.minutes                      0.0194901208      -0.010750427
## total.eve.calls                       -0.0039543728       0.008128130
## total.eve.charge                       0.0194959757      -0.010760022
## total.night.minutes                    0.0055413838       0.011798660
## total.night.calls                      0.0026762202       0.004236100
## total.night.charge                     0.0055349281       0.011782533
## total.intl.minutes                     0.0024627018      -0.019485746
## total.intl.calls                       0.0001243302      -0.001303123
## total.intl.charge                      0.0025051773      -0.019414797
## number.customer.service.calls         -0.0070856427       0.002732576
##                               total.day.calls total.day.charge
## account.length                   0.0282402279     -0.001019198
## area.code                       -0.0193138545     -0.019119256
## number.vmail.messages            0.0008831280      0.005376796
## total.day.minutes                0.0019351487      0.999999951
## total.day.calls                  1.0000000000      0.001935884
## total.day.charge                 0.0019358844      1.000000000
## total.eve.minutes               -0.0006994115     -0.010747297
## total.eve.calls                  0.0037541787      0.008129319
## total.eve.charge                -0.0006952217     -0.010756893
## total.night.minutes              0.0028044650      0.011801434
## total.night.calls               -0.0083083467      0.004234934
## total.night.charge               0.0028018169      0.011785301
## total.intl.minutes               0.0130972198     -0.019489700
## total.intl.calls                 0.0108928533     -0.001306635
## total.intl.charge                0.0131613976     -0.019418755
## number.customer.service.calls   -0.0107394951      0.002726370
##                               total.eve.minutes total.eve.calls
## account.length                    -0.0095913331     0.009142579
## area.code                          0.0070978766    -0.012299947
## number.vmail.messages              0.0194901208    -0.003954373
## total.day.minutes                 -0.0107504274     0.008128130
## total.day.calls                   -0.0006994115     0.003754179
## total.day.charge                  -0.0107472968     0.008129319
## total.eve.minutes                  1.0000000000     0.002763019
## total.eve.calls                    0.0027630194     1.000000000
## total.eve.charge                   0.9999997749     0.002778097
## total.night.minutes               -0.0166391160     0.001781411
## total.night.calls                  0.0134202163    -0.013682341
## total.night.charge                -0.0166420421     0.001799380
## total.intl.minutes                 0.0001365487    -0.007458458
## total.intl.calls                   0.0083881559     0.005574500
## total.intl.charge                  0.0001593155    -0.007507151
## number.customer.service.calls     -0.0138234228     0.006234831
##                               total.eve.charge total.night.minutes
## account.length                   -0.0095873958        0.0006679112
## area.code                         0.0071141298        0.0020836263
## number.vmail.messages             0.0194959757        0.0055413838
## total.day.minutes                -0.0107600217        0.0117986600
## total.day.calls                  -0.0006952217        0.0028044650
## total.day.charge                 -0.0107568931        0.0118014339
## total.eve.minutes                 0.9999997749       -0.0166391160
## total.eve.calls                   0.0027780971        0.0017814106
## total.eve.charge                  1.0000000000       -0.0166489191
## total.night.minutes              -0.0166489191        1.0000000000
## total.night.calls                 0.0134220174        0.0269718182
## total.night.charge               -0.0166518367        0.9999992072
## total.intl.minutes                0.0001320238       -0.0067209669
## total.intl.calls                  0.0083930603       -0.0172140162
## total.intl.charge                 0.0001547783       -0.0066545873
## number.customer.service.calls    -0.0138363623       -0.0085325365
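Keeping highly correlated variables can cause multicollinearity, so variables that are highly correlated with one another should be removed. In the matrix above, each minutes variable is almost perfectly correlated with its charge variable (for example, total.day.minutes and total.day.charge at 0.99999995), so one variable of each such pair is a candidate for removal.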
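
One way to find such pairs automatically is to compute the Pearson correlation for every pair of numeric columns and flag those above a threshold. The Python sketch below assumes a threshold of 0.9 and uses made-up rows in which the charge is set to 0.17 times the minutes purely for illustration; the helper names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > threshold:
                pairs.append((a, b, r))
    return pairs


# Made-up rows; the 0.17 rate is an assumption for the illustration.
minutes = [100.0, 150.0, 200.0, 250.0, 300.0]
toy = {
    "total.day.minutes": minutes,
    "total.day.charge": [0.17 * m for m in minutes],
    "total.day.calls": [90.0, 110.0, 85.0, 120.0, 95.0],
}

flagged = highly_correlated_pairs(toy)
for a, b, r in flagged:
    print(f"{a} ~ {b}: r = {r:.4f}")
```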
Data Manipulation

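The result shows some degree of correlation between total.day.calls and total.day.charge.
In particular, among customers without a voice mail plan (voice.mail.plan = no) the two variables are negatively correlated.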
Discretize (make categorical) a relevant numeric variable

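Discretize the variable. The model output later in the article includes the levels total.day.minutes1medium and total.day.minutes1short, which indicates that total.day.minutes was binned into a categorical variable, total.day.minutes1.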
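
The cut points used in the article are not given, so the Python sketch below only illustrates the idea. The thresholds 150 and 220 are hypothetical, and the level name "long" is an assumption, since the model output shows only the medium and short levels.

```python
def discretize_minutes(minutes, short_below=150.0, long_from=220.0):
    """Map total day minutes to 'short', 'medium' or 'long'.

    Both thresholds are hypothetical; the article does not state its cut points.
    """
    if minutes < short_below:
        return "short"
    if minutes < long_from:
        return "medium"
    return "long"


# Minimum, quartiles and maximum of total.day.minutes from the summary output.
sample = [0.0, 143.7, 180.1, 216.2, 351.5]
binned = [discretize_minutes(m) for m in sample]
print(binned)
```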
Construct a distribution of the variable with a churn overlay

Construct a histogram of the variable with a churn overlay
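
A churn overlay amounts to counting, within each level of the variable, how many customers churned and how many stayed. The Python sketch below prints a text version of such a chart; the records are hypothetical and are not drawn from the Churn data.

```python
from collections import Counter

# Hypothetical (level, churn) records, for illustration only.
records = [
    ("short", "False."), ("short", "False."), ("short", "True."),
    ("medium", "False."), ("medium", "False."), ("medium", "False."),
    ("medium", "True."),
    ("long", "False."), ("long", "True."), ("long", "True."),
]
overlay = Counter(records)


def churn_share(level):
    """Fraction of customers in `level` who churned."""
    churned = overlay[(level, "True.")]
    stayed = overlay[(level, "False.")]
    return churned / (churned + stayed)


# One text bar per level: '.' marks a customer who stayed, '#' one who churned.
for level in ("short", "medium", "long"):
    stayed = overlay[(level, "False.")]
    churned = overlay[(level, "True.")]
    print(f"{level:<7}{'.' * stayed}{'#' * churned}  churn share = {churn_share(level):.2f}")
```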



Find a pair of numeric variables which are correlated to churn.

The result shows some degree of correlation between total.day.calls and total.day.charge, particularly among customers with churn = no.
Model Building
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    0.3082150  0.0735760   4.189 2.85e-05 ***
## stateAL                        0.0151188  0.0462343   0.327 0.743680    
## stateAR                        0.0894792  0.0490897   1.823 0.068399 .  
## stateAZ                        0.0329566  0.0494195   0.667 0.504883    
## stateCA                        0.1951511  0.0567439   3.439 0.000588 ***
## international.plan yes         0.3059341  0.0151677  20.170  < 2e-16 ***
## voice.mail.plan yes           -0.1375056  0.0337533  -4.074 4.70e-05 ***
## number.vmail.messages          0.0017068  0.0010988   1.553 0.120402    
## total.day.minutes              0.3796323  0.2629027   1.444 0.148802    
## total.day.calls                0.0002191  0.0002235   0.981 0.326781    
## total.day.charge              -2.2207671  1.5464583  -1.436 0.151056    
## total.eve.minutes              0.0288233  0.1307496   0.220 0.825533    
## total.eve.calls               -0.0001585  0.0002238  -0.708 0.478915    
## total.eve.charge              -0.3316041  1.5382391  -0.216 0.829329    
## total.night.minutes            0.0083224  0.0695916   0.120 0.904814    
## total.night.calls             -0.0001824  0.0002225  -0.820 0.412290    
## total.night.charge            -0.1760782  1.5464674  -0.114 0.909355    
## total.intl.minutes            -0.0104679  0.4192270  -0.025 0.980080    
## total.intl.calls              -0.0063448  0.0018062  -3.513 0.000447 ***
## total.intl.charge              0.0676460  1.5528267   0.044 0.965254    
## number.customer.service.calls  0.0566474  0.0033945  16.688  < 2e-16 ***
## total.day.minutes1medium       0.0502681  0.0160228   3.137 0.001715 ** 
## total.day.minutes1short        0.2404020  0.0322293   7.459 1.02e-13 ***
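From the results, state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium and total.day.minutes1short have significant effects. The output also shows international.plan and voice.mail.plan as highly significant.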
Use K-Nearest-Neighbors (K-NN) algorithm to develop a model for predicting Churn 
##         Direction.2005
## knn.pred   1   2
##        1 760  97
##        2 100  43
 [1] 0.803
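A confusion matrix is a tool for visualising classifier performance. It is used mainly in supervised learning; in unsupervised learning it is usually called a matching matrix. In the tables shown here, each row is a predicted class (knn.pred) and each column is a class of the variable it is compared against (labelled Direction.2005 in the output), presumably the actual class, so correct predictions lie on the diagonal.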
##         Direction.2005
## knn.pred   1   2
##        1 827 104
##        2  33  36
 
 [1] 0.863
On the test set, the accuracy reaches about 86% (0.863).
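
The two accuracy figures can be recomputed from the confusion matrices as the sum of the diagonal divided by the total count. The Python sketch below copies the counts from the two tables and assumes that rows are predictions and columns are actual classes; under that assumption it also computes the accuracy of always predicting the most frequent actual class, for comparison.

```python
def accuracy(matrix):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def majority_baseline(matrix):
    """Accuracy of always predicting the most frequent actual class.

    Assumes columns hold the actual classes, so the largest column sum
    is the size of the majority class.
    """
    column_sums = [sum(col) for col in zip(*matrix)]
    return max(column_sums) / sum(column_sums)


# Counts copied from the two tables (rows: knn.pred, columns: Direction.2005).
first = [[760, 97], [100, 43]]
second = [[827, 104], [33, 36]]

acc_first = accuracy(first)
acc_second = accuracy(second)
base_second = majority_baseline(second)
print(acc_first, acc_second, base_second)
```

Under the stated assumption, both tables contain 1000 cases with 860 in class 1, so the accuracies of 0.803 and 0.863 are best read against a majority-class baseline of 0.860.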
Findings
We find some degree of correlation between total.day.calls and total.day.charge, particularly among customers with churn = no. The variables state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium and total.day.minutes1short have significant effects. Finally, the KNN model reaches an accuracy of about 80% on the training set and about 86% on the test set, which suggests that the model predicts well.
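About the author
Kaizong Ye is a researcher at 拓端研究室 (TRL), and we sincerely thank him for his contribution to this article. He completed a master's degree in statistics at Shanghai University of Finance and Economics and focuses on artificial intelligence. His areas of expertise are Python and Matlab simulation, visual processing, neural networks and data analysis.
This article draws on material the author recently prepared for the course 《R语言数据分析挖掘必知必会》.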
                        
                        
                    

