Predicting Telecom Customer Churn in R with a k-Nearest Neighbors (KNN) Model

A telephone company wants to determine which customer characteristics are useful for predicting churn, that is, which customers will leave its service.

Written by Kaizong Ye and Coin Ge

The data set is Churn. The fields are as follows:

Data background

state discrete.
account length continuous.
area code continuous.
phone number discrete.
international plan discrete.
voice mail plan discrete.
number vmail messages continuous.
total day minutes continuous.
total day calls continuous.
total day charge continuous.
total eve minutes continuous.
total eve calls continuous.
total eve charge continuous.
total night minutes continuous.
total night calls continuous.
total night charge continuous.
total intl minutes continuous.
total intl calls continuous.
total intl charge continuous.
number customer service calls continuous.
churn discrete.

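1. KNN overview

The simplest, most rudimentary classifier records the class of every training sample; when the attributes of a test object exactly match those of some training object, the test object can be classified accordingly.

K-nearest neighbors (KNN) is a basic classification method that classifies by measuring the distance between feature values.

The idea behind KNN is this: if most of the k samples most similar to a given sample (that is, its nearest neighbors in feature space) belong to a certain class, then the sample belongs to that class as well. Here k is usually an integer no greater than 20.

In the KNN algorithm, the selected neighbors are all objects that have already been classified correctly.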

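2. KNN example

  • Which class should the green circle be assigned to: the red triangles or the blue squares?

  • If k=3, red triangles make up 2/3 of the neighbors, so the green circle is assigned to the red triangle class.

  • If k=5, blue squares make up 3/5 of the neighbors, so the green circle is assigned to the blue square class.

  • The result of the KNN algorithm depends heavily on the choice of K.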
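The flip between k=3 and k=5 can be reproduced with a small sketch. The coordinates below are hypothetical, chosen only so that the three nearest neighbors contain two red points and the five nearest contain three blue points; `knn()` comes from the `class` package, which is assumed to be installed (it is bundled with most R installations).

```r
library(class)  # provides knn()

# Hypothetical coordinates; the query (the green circle) sits at the origin.
train <- rbind(
  c( 1.0,  0.0),   # red triangle, distance 1.0 from the query
  c( 0.0,  1.1),   # red triangle, distance 1.1
  c(-1.2,  0.0),   # blue square,  distance 1.2
  c( 0.0, -1.5),   # blue square,  distance 1.5
  c( 1.3,  1.3),   # blue square,  distance about 1.84
  c( 3.0,  3.0),   # red triangle, far away
  c(-3.0, -3.5)    # blue square,  far away
)
labels <- factor(c("red", "red", "blue", "blue", "blue", "red", "blue"))
query  <- matrix(c(0, 0), nrow = 1)

pred_k3 <- knn(train, query, cl = labels, k = 3)  # votes: 2 red, 1 blue
pred_k5 <- knn(train, query, cl = labels, k = 5)  # votes: 2 red, 3 blue

print(as.character(pred_k3))
print(as.character(pred_k5))
```

With these points the prediction is red for k=3 and blue for k=5, which mirrors the bullets above.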

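3. KNN distance computation

In KNN, the distance between objects serves as the dissimilarity measure between them, which avoids the problem of matching objects exactly. The distance used is usually the Euclidean distance or the Manhattan distance.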
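As a small illustration, the two distances can be computed by hand and checked against base R's `dist()`. The vectors `a` and `b` are made-up values, not rows from the Churn data.

```r
# Two made-up feature vectors
a <- c(1, 2, 3)
b <- c(4, 6, 3)

# Euclidean distance: square root of the sum of squared differences
euclidean <- sqrt(sum((a - b)^2))

# Manhattan distance: sum of absolute differences
manhattan <- sum(abs(a - b))

# The same values via dist(), which works on the rows of a matrix
d_euclid <- as.numeric(dist(rbind(a, b), method = "euclidean"))
d_manhat <- as.numeric(dist(rbind(a, b), method = "manhattan"))

print(c(euclidean = euclidean, manhattan = manhattan))
```

Here the differences are 3, 4 and 0, so the Euclidean distance is 5 and the Manhattan distance is 7.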

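4. The KNN algorithm

Given a training set whose data and labels are known, take a test sample as input, compare its features with the corresponding features in the training set, and find the K training samples most similar to it. The class predicted for the test sample is the class that occurs most often among those K samples. The algorithm can be described as follows:

  1. Compute the distance between the test sample and every training sample

  2. Sort the training samples by increasing distance

  3. Select the k points with the smallest distances

  4. Determine how often each class occurs among these k points

  5. Return the most frequent class among the k points as the predicted class of the test sample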
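The five steps map directly onto a few lines of base R. This is a minimal sketch, not the code used in the article: the function name `knn_predict`, the toy training points, and the labels "stay" and "leave" are all hypothetical, and Euclidean distance is assumed.

```r
knn_predict <- function(train_x, train_y, query, k = 3) {
  # Step 1: Euclidean distance from the query to every training row
  diffs <- sweep(train_x, 2, query)        # subtract the query from each row
  dists <- sqrt(rowSums(diffs^2))
  # Steps 2 and 3: sort by increasing distance, keep the k closest rows
  nearest <- order(dists)[seq_len(k)]
  # Step 4: how often each class occurs among the k neighbors
  votes <- table(train_y[nearest])
  # Step 5: the most frequent class (first one wins if counts are tied)
  names(votes)[which.max(votes)]
}

# Toy training data: two well separated groups
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(6, 6), c(7, 6), c(6, 7))
train_y <- c("stay", "stay", "stay", "leave", "leave", "leave")

pred_a <- knn_predict(train_x, train_y, query = c(1.5, 1.5), k = 3)
pred_b <- knn_predict(train_x, train_y, query = c(6.5, 6.5), k = 3)
print(c(pred_a, pred_b))
```

In practice a tested implementation such as `knn()` from the `class` package is usually preferable; the hand-written version is only meant to make the steps visible.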



Data Preparation and Exploration 

View a summary of the data



##      state      account.length    area.code        phone.number 
##  WV     : 158   Min.   :  1.0   Min.   :408.0    327-1058:   1  
##  MN     : 125   1st Qu.: 73.0   1st Qu.:408.0    327-1319:   1  
##  AL     : 124   Median :100.0   Median :415.0    327-2040:   1  
##  ID     : 119   Mean   :100.3   Mean   :436.9    327-2475:   1  
##  VA     : 118   3rd Qu.:127.0   3rd Qu.:415.0    327-3053:   1  
##  OH     : 116   Max.   :243.0   Max.   :510.0    327-3587:   1  
##  (Other):4240                                   (Other)  :4994  
##  international.plan voice.mail.plan number.vmail.messages
##   no :4527           no :3677       Min.   : 0.000       
##   yes: 473           yes:1323       1st Qu.: 0.000       
##                                     Median : 0.000       
##                                     Mean   : 7.755       
##                                     3rd Qu.:17.000       
##                                     Max.   :52.000       
##                                                          
##  total.day.minutes total.day.calls total.day.charge total.eve.minutes
##  Min.   :  0.0     Min.   :  0     Min.   : 0.00    Min.   :  0.0    
##  1st Qu.:143.7     1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4    
##  Median :180.1     Median :100     Median :30.62    Median :201.0    
##  Mean   :180.3     Mean   :100     Mean   :30.65    Mean   :200.6    
##  3rd Qu.:216.2     3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1    
##  Max.   :351.5     Max.   :165     Max.   :59.76    Max.   :363.7    
##                                                                      
##  total.eve.calls total.eve.charge total.night.minutes total.night.calls
##  Min.   :  0.0   Min.   : 0.00    Min.   :  0.0       Min.   :  0.00   
##  1st Qu.: 87.0   1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00   
##  Median :100.0   Median :17.09    Median :200.4       Median :100.00   
##  Mean   :100.2   Mean   :17.05    Mean   :200.4       Mean   : 99.92   
##  3rd Qu.:114.0   3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00   
##  Max.   :170.0   Max.   :30.91    Max.   :395.0       Max.   :175.00   
##                                                                        
##  total.night.charge total.intl.minutes total.intl.calls total.intl.charge
##  Min.   : 0.000     Min.   : 0.00      Min.   : 0.000   Min.   :0.000    
##  1st Qu.: 7.510     1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300    
##  Median : 9.020     Median :10.30      Median : 4.000   Median :2.780    
##  Mean   : 9.018     Mean   :10.26      Mean   : 4.435   Mean   :2.771    
##  3rd Qu.:10.560     3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240    
##  Max.   :17.770     Max.   :20.00      Max.   :20.000   Max.   :5.400    
##                                                                          
##  number.customer.service.calls     churn     
##  Min.   :0.00                   False.:4293  
##  1st Qu.:1.00                   True. : 707  
##  Median :1.00                                
##  Mean   :1.57                                
##  3rd Qu.:2.00                                
##  Max.   :9.00                                
## 



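The summary shows that the data contain no missing values. It also shows that phone number and area code carry no predictive value, so they can be dropped.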
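A check for missing values and the removal of the two columns could look like the sketch below. The data frame `churn_toy` is a three-row placeholder with made-up values that only borrows the column names from the summary; it is not the Churn data.

```r
# Placeholder data with made-up values; only the column names follow the summary
churn_toy <- data.frame(
  account.length    = c(100, 73, 127),
  area.code         = c(415, 408, 510),
  phone.number      = c("000-0001", "000-0002", "000-0003"),
  total.day.minutes = c(180.1, 143.7, 216.2),
  churn             = factor(c("False.", "False.", "True."))
)

# Number of missing values per column
missing_per_column <- colSums(is.na(churn_toy))

# Drop the identifier-like columns
drop_cols <- c("phone.number", "area.code")
churn_reduced <- churn_toy[, !(names(churn_toy) %in% drop_cols)]

print(missing_per_column)
print(names(churn_reduced))
```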

Examine the variables graphically 

From the results above, we can see that, apart from emailcode and areacode, the numeric variables are approximately normally distributed.

##  account.length    area.code     number.vmail.messages total.day.minutes
##  Min.   :  1.0   Min.   :408.0   Min.   : 0.000        Min.   :  0.0    
##  1st Qu.: 73.0   1st Qu.:408.0   1st Qu.: 0.000        1st Qu.:143.7    
##  Median :100.0   Median :415.0   Median : 0.000        Median :180.1    
##  Mean   :100.3   Mean   :436.9   Mean   : 7.755        Mean   :180.3    
##  3rd Qu.:127.0   3rd Qu.:415.0   3rd Qu.:17.000        3rd Qu.:216.2    
##  Max.   :243.0   Max.   :510.0   Max.   :52.000        Max.   :351.5    
##  total.day.calls total.day.charge total.eve.minutes total.eve.calls
##  Min.   :  0     Min.   : 0.00    Min.   :  0.0     Min.   :  0.0  
##  1st Qu.: 87     1st Qu.:24.43    1st Qu.:166.4     1st Qu.: 87.0  
##  Median :100     Median :30.62    Median :201.0     Median :100.0  
##  Mean   :100     Mean   :30.65    Mean   :200.6     Mean   :100.2  
##  3rd Qu.:113     3rd Qu.:36.75    3rd Qu.:234.1     3rd Qu.:114.0  
##  Max.   :165     Max.   :59.76    Max.   :363.7     Max.   :170.0  
##  total.eve.charge total.night.minutes total.night.calls total.night.charge
##  Min.   : 0.00    Min.   :  0.0       Min.   :  0.00    Min.   : 0.000    
##  1st Qu.:14.14    1st Qu.:166.9       1st Qu.: 87.00    1st Qu.: 7.510    
##  Median :17.09    Median :200.4       Median :100.00    Median : 9.020    
##  Mean   :17.05    Mean   :200.4       Mean   : 99.92    Mean   : 9.018    
##  3rd Qu.:19.90    3rd Qu.:234.7       3rd Qu.:113.00    3rd Qu.:10.560    
##  Max.   :30.91    Max.   :395.0       Max.   :175.00    Max.   :17.770    
##  total.intl.minutes total.intl.calls total.intl.charge
##  Min.   : 0.00      Min.   : 0.000   Min.   :0.000    
##  1st Qu.: 8.50      1st Qu.: 3.000   1st Qu.:2.300    
##  Median :10.30      Median : 4.000   Median :2.780    
##  Mean   :10.26      Mean   : 4.435   Mean   :2.771    
##  3rd Qu.:12.00      3rd Qu.: 6.000   3rd Qu.:3.240    
##  Max.   :20.00      Max.   :20.000   Max.   :5.400    
##  number.customer.service.calls
##  Min.   :0.00                 
##  1st Qu.:1.00                 
##  Median :1.00                 
##  Mean   :1.57                 
##  3rd Qu.:2.00                 
##  Max.   :9.00



Relationships between variables


The results show a significant positive linear relationship between the two variables.

Using the statistics node, report

##                               account.length    area.code
## account.length                  1.0000000000 -0.018054187
## area.code                      -0.0180541874  1.000000000
## number.vmail.messages          -0.0145746663 -0.003398983
## total.day.minutes              -0.0010174908 -0.019118245
## total.day.calls                 0.0282402279 -0.019313854
## total.day.charge               -0.0010191980 -0.019119256
## total.eve.minutes              -0.0095913331  0.007097877
## total.eve.calls                 0.0091425790 -0.012299947
## total.eve.charge               -0.0095873958  0.007114130
## total.night.minutes             0.0006679112  0.002083626
## total.night.calls              -0.0078254785  0.014656846
## total.night.charge              0.0006558937  0.002070264
## total.intl.minutes              0.0012908394 -0.004153729
## total.intl.calls                0.0142772733 -0.013623309
## total.intl.charge               0.0012918112 -0.004219099
## number.customer.service.calls  -0.0014447918  0.020920513
##                               number.vmail.messages total.day.minutes
## account.length                        -0.0145746663      -0.001017491
## area.code                             -0.0033989831      -0.019118245
## number.vmail.messages                  1.0000000000       0.005381376
## total.day.minutes                      0.0053813760       1.000000000
## total.day.calls                        0.0008831280       0.001935149
## total.day.charge                       0.0053767959       0.999999951
## total.eve.minutes                      0.0194901208      -0.010750427
## total.eve.calls                       -0.0039543728       0.008128130
## total.eve.charge                       0.0194959757      -0.010760022
## total.night.minutes                    0.0055413838       0.011798660
## total.night.calls                      0.0026762202       0.004236100
## total.night.charge                     0.0055349281       0.011782533
## total.intl.minutes                     0.0024627018      -0.019485746
## total.intl.calls                       0.0001243302      -0.001303123
## total.intl.charge                      0.0025051773      -0.019414797
## number.customer.service.calls         -0.0070856427       0.002732576
##                               total.day.calls total.day.charge
## account.length                   0.0282402279     -0.001019198
## area.code                       -0.0193138545     -0.019119256
## number.vmail.messages            0.0008831280      0.005376796
## total.day.minutes                0.0019351487      0.999999951
## total.day.calls                  1.0000000000      0.001935884
## total.day.charge                 0.0019358844      1.000000000
## total.eve.minutes               -0.0006994115     -0.010747297
## total.eve.calls                  0.0037541787      0.008129319
## total.eve.charge                -0.0006952217     -0.010756893
## total.night.minutes              0.0028044650      0.011801434
## total.night.calls               -0.0083083467      0.004234934
## total.night.charge               0.0028018169      0.011785301
## total.intl.minutes               0.0130972198     -0.019489700
## total.intl.calls                 0.0108928533     -0.001306635
## total.intl.charge                0.0131613976     -0.019418755
## number.customer.service.calls   -0.0107394951      0.002726370
##                               total.eve.minutes total.eve.calls
## account.length                    -0.0095913331     0.009142579
## area.code                          0.0070978766    -0.012299947
## number.vmail.messages              0.0194901208    -0.003954373
## total.day.minutes                 -0.0107504274     0.008128130
## total.day.calls                   -0.0006994115     0.003754179
## total.day.charge                  -0.0107472968     0.008129319
## total.eve.minutes                  1.0000000000     0.002763019
## total.eve.calls                    0.0027630194     1.000000000
## total.eve.charge                   0.9999997749     0.002778097
## total.night.minutes               -0.0166391160     0.001781411
## total.night.calls                  0.0134202163    -0.013682341
## total.night.charge                -0.0166420421     0.001799380
## total.intl.minutes                 0.0001365487    -0.007458458
## total.intl.calls                   0.0083881559     0.005574500
## total.intl.charge                  0.0001593155    -0.007507151
## number.customer.service.calls     -0.0138234228     0.006234831
##                               total.eve.charge total.night.minutes
## account.length                   -0.0095873958        0.0006679112
## area.code                         0.0071141298        0.0020836263
## number.vmail.messages             0.0194959757        0.0055413838
## total.day.minutes                -0.0107600217        0.0117986600
## total.day.calls                  -0.0006952217        0.0028044650
## total.day.charge                 -0.0107568931        0.0118014339
## total.eve.minutes                 0.9999997749       -0.0166391160
## total.eve.calls                   0.0027780971        0.0017814106
## total.eve.charge                  1.0000000000       -0.0166489191
## total.night.minutes              -0.0166489191        1.0000000000
## total.night.calls                 0.0134220174        0.0269718182
## total.night.charge               -0.0166518367        0.9999992072
## total.intl.minutes                0.0001320238       -0.0067209669
## total.intl.calls                  0.0083930603       -0.0172140162
## total.intl.charge                 0.0001547783       -0.0066545873
## number.customer.service.calls    -0.0138363623       -0.0085325365

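Keeping highly correlated variables may cause multicollinearity, so variables that are highly correlated with one another should be removed. For example, total.day.minutes and total.day.charge have a correlation of about 0.99999995 in the matrix above.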
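One way to find and drop such variables is to scan the upper triangle of the correlation matrix for pairs above a threshold. The sketch below uses simulated data, and both the 0.9 threshold and the 0.17 charge rate are assumptions made for illustration.

```r
set.seed(1)
minutes <- runif(100, 0, 350)

# Simulated stand-in: the charge is an exact multiple of the minutes
toy <- data.frame(
  total.day.minutes = minutes,
  total.day.charge  = 0.17 * minutes,   # assumed rate, illustration only
  total.day.calls   = rpois(100, 100)
)

cm <- cor(toy)
cm[lower.tri(cm, diag = TRUE)] <- NA     # look at each pair only once

# Pairs whose absolute correlation exceeds the (assumed) threshold
high <- which(abs(cm) > 0.9, arr.ind = TRUE)

# Drop the second variable of every highly correlated pair
to_drop <- unique(colnames(cm)[high[, "col"]])
reduced <- toy[, !(names(toy) %in% to_drop), drop = FALSE]

print(to_drop)
print(names(reduced))
```

Which variable of a pair to keep is a judgment call; here the later column is dropped simply because it appears second in the matrix.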




Data Manipulation

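The results show some degree of correlation between total.day.calls and total.day.charge.

In particular, the variables are negatively correlated among customers whose voice mail plan is no.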

Discretize (make categorical) a relevant numeric variable

Discretize the variable
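Discretization of a numeric variable can be sketched with `cut()`. The cut points 120 and 240 are hypothetical, and so is the level name "long"; only the level names "short" and "medium" are suggested by the term names that appear later in the regression output.

```r
# A few example values of call minutes (illustrative only)
minutes <- c(0, 95.5, 143.7, 180.1, 216.2, 351.5)

# Hypothetical cut points: up to 120 is short, up to 240 is medium, above is long
minutes_binned <- cut(
  minutes,
  breaks = c(-Inf, 120, 240, Inf),
  labels = c("short", "medium", "long")
)

print(table(minutes_binned))
```

By default `cut()` uses intervals that are closed on the right, so a value of exactly 120 would fall into "short".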

Construct a distribution of the variable with a churn overlay

Construct a histogram of the variable with a churn overlay



Find a pair of numeric variables which are correlated to churn.

The results show some degree of correlation between total.day.calls and total.day.charge.

Model Building

In particular, the variables are correlated among customers whose churn is no.

##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    0.3082150  0.0735760   4.189 2.85e-05 ***
## stateAL                        0.0151188  0.0462343   0.327 0.743680    
## stateAR                        0.0894792  0.0490897   1.823 0.068399 .  
## stateAZ                        0.0329566  0.0494195   0.667 0.504883    
## stateCA                        0.1951511  0.0567439   3.439 0.000588 ***
## international.plan yes         0.3059341  0.0151677  20.170  < 2e-16 ***
## voice.mail.plan yes           -0.1375056  0.0337533  -4.074 4.70e-05 ***
## number.vmail.messages          0.0017068  0.0010988   1.553 0.120402    
## total.day.minutes              0.3796323  0.2629027   1.444 0.148802    
## total.day.calls                0.0002191  0.0002235   0.981 0.326781    
## total.day.charge              -2.2207671  1.5464583  -1.436 0.151056    
## total.eve.minutes              0.0288233  0.1307496   0.220 0.825533    
## total.eve.calls               -0.0001585  0.0002238  -0.708 0.478915    
## total.eve.charge              -0.3316041  1.5382391  -0.216 0.829329    
## total.night.minutes            0.0083224  0.0695916   0.120 0.904814    
## total.night.calls             -0.0001824  0.0002225  -0.820 0.412290    
## total.night.charge            -0.1760782  1.5464674  -0.114 0.909355    
## total.intl.minutes            -0.0104679  0.4192270  -0.025 0.980080    
## total.intl.calls              -0.0063448  0.0018062  -3.513 0.000447 ***
## total.intl.charge              0.0676460  1.5528267   0.044 0.965254    
## number.customer.service.calls  0.0566474  0.0033945  16.688  < 2e-16 ***
## total.day.minutes1medium       0.0502681  0.0160228   3.137 0.001715 ** 
## total.day.minutes1short        0.2404020  0.0322293   7.459 1.02e-13 ***

The results show that state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium and total.day.minutes1short have a significant effect.

Use K-Nearest-Neighbors (K-NN) algorithm to develop a model for predicting Churn 

##         Direction.2005
## knn.pred   1   2
##        1 760  97
##        2 100  43


 [1] 0.803

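A confusion matrix is a visualization tool used especially in supervised learning; in unsupervised learning it is usually called a matching matrix. In the tables printed here, each row corresponds to a predicted class (knn.pred) and each column to an actual class.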
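The accuracy reported with the first table can be recomputed from its four cells: the correctly classified cases lie on the diagonal. The sketch below rebuilds that table by hand; the dimension name `actual` is a label chosen here for readability.

```r
# First confusion matrix from above: rows are predictions, columns are actual classes
conf <- matrix(
  c(760, 100,    # first column:  actual class 1
     97,  43),   # second column: actual class 2
  nrow = 2,
  dimnames = list(knn.pred = c("1", "2"), actual = c("1", "2"))
)

# Accuracy = correctly classified cases / all cases
accuracy <- sum(diag(conf)) / sum(conf)
print(accuracy)
```

This gives (760 + 43) / 1000 = 0.803, matching the value printed above.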

##         Direction.2005
## knn.pred   1   2
##        1 827 104
##        2  33  36
 


 [1] 0.863

On the test set, the accuracy reaches 86%.
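The overall workflow (split, standardize, fit, evaluate) might look like the following sketch. It is not the article's code: the data are simulated, the two predictors, the 70/30 split and k = 5 are assumptions, and `knn()` comes from the `class` package.

```r
library(class)  # provides knn()
set.seed(42)

n <- 200
# Simulated stand-in for two numeric predictors; not the Churn data
x <- data.frame(
  total.day.minutes = c(rnorm(n / 2, 170, 40), rnorm(n / 2, 230, 40)),
  number.customer.service.calls = c(rpois(n / 2, 1), rpois(n / 2, 3))
)
y <- factor(rep(c("False.", "True."), each = n / 2))

# Assumed 70/30 split into training and test rows
train_idx <- sample(n, round(0.7 * n))

# Standardize both sets with the training means and standard deviations,
# so that no single variable dominates the distance
mu   <- colMeans(x[train_idx, ])
sdev <- apply(x[train_idx, ], 2, sd)
train_x <- scale(x[train_idx, ], center = mu, scale = sdev)
test_x  <- scale(x[-train_idx, ], center = mu, scale = sdev)

# Fit and predict in one call, then tabulate predictions against actual classes
knn.pred <- knn(train_x, test_x, cl = y[train_idx], k = 5)
conf     <- table(knn.pred, actual = y[-train_idx])
accuracy <- mean(knn.pred == y[-train_idx])

print(conf)
print(accuracy)
```

Because only 707 of the 5,000 customers in the summary churn, it can be worth comparing the accuracy with the share of the majority class (4293 / 5000, about 0.86), which is what a model that always predicts "no churn" would achieve.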

Findings  

We find some degree of correlation between total.day.calls and total.day.charge, in particular among customers whose churn is no. We also find that state, total.intl.calls, number.customer.service.calls, total.day.minutes1medium and total.day.minutes1short have a significant effect. Finally, the KNN model reaches an accuracy of about 80% on the training set and 86% on the test set, which suggests that the model predicts well.



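About the author

Kaizong Ye is a researcher at 拓端研究室 (TRL). We sincerely thank him for his contribution to this article. He completed a master's degree in statistics at Shanghai University of Finance and Economics and focuses on artificial intelligence. His strengths include Python and Matlab simulation, visual processing, neural networks, and data analysis.

This article draws on material the author recently prepared for the course 《R语言数据分析挖掘必知必会》.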

 