介绍
有限混合模型在应用于数据时非常有用,其中观察来自不同的群体,并且群体隶属关系未知。
模拟数据
首先,我们将模拟一些数据。让我们模拟两个正态分布 – 一个平均值为0,另一个平均值为50,两者的标准差为5。
m1 <- 0
m2 <- 50
sd1 <- sd2 <- 5
N1 <- 100
N2 <- 10
a <- rnorm(n=N1, mean=m1, sd=sd1)
b <- rnorm(n=N2, mean=m2, sd=sd2)
data:image/s3,"s3://crabby-images/5457f/5457f7192c5dd5657faadc41c6ab408b99290bb4" alt=""
现在让我们将数据“混合”在一起……
data:image/s3,"s3://crabby-images/f1aff/f1aff0b9a07bed30f7f678a6ae003f140ba64603" alt=""
data:image/s3,"s3://crabby-images/a6ebe/a6ebe33f215474681fce4ab26c3ff6cc22c7459c" alt=""
print(table(clusters(flexfit), data$class))
##
## 1 2
## 1 100 0
## 2 0 10
data:image/s3,"s3://crabby-images/e4b9c/e4b9cf7b118e8cefc450483630d6aa24530e283a" alt=""
参数怎么样?
cat('pred:', c1[1], '\n')
cat('true:', m1, '\n\n')
cat('pred:', c1[2], '\n')
cat('true:', sd1, '\n\n')
cat('pred:', c2[1], '\n')
cat('true:', m2, '\n\n')
cat('pred:', c2[2], '\n')
cat('true:', sd2, '\n\n')
## pred: -0.5613484
## true: 0
##
## pred: 4.799484
## true: 5
##
## pred: 52.86911
## true: 50
##
## pred: 6.89413
## true: 5
data:image/s3,"s3://crabby-images/d1ea0/d1ea02e066355aac024296b479cee60acb498415" alt=""
让我们可视化真实数据和我们拟合的混合模型。
ggplot(data) +
geom_histogram(aes(x, ..density..), binwidth = 1, colour = "black", fill = "white") +
stat_function(geom = "line", fun = plot_mix_comps,
args = list(c1[1], c1[2], lam[1]/sum(lam)),
stat_function(geom = "line", fun = plot_mix_comps,
args = list(c2[1], c2[2], lam[2]/sum(lam)),
colour = "blue", lwd = 1.5) +
ylab("Density")
data:image/s3,"s3://crabby-images/903ba/903ba00d54f67d4a7b4b91a7dafa4841f87b77ed" alt=""
data:image/s3,"s3://crabby-images/2cf13/2cf1376fba8b7b86a62e5ed9eac3daae8de2509c" alt=""
data:image/s3,"s3://crabby-images/0b91d/0b91d8c7853869656ed19920188e3ce11eb034dd" alt=""
看起来我们做得很好!
data:image/s3,"s3://crabby-images/669eb/669eb2db4ab79b5d60bb5f40315aeb892a9e8f11" alt=""
data:image/s3,"s3://crabby-images/ee15e/ee15e626c412737429ca4df373c141a94dbdc9ef" alt=""
data:image/s3,"s3://crabby-images/4434e/4434e2442fb48c917c48d4e09d26721ac171e1fe" alt=""
data:image/s3,"s3://crabby-images/b44f6/b44f65846386c65a6e7b49289f58dadee5e7ff98" alt=""
data:image/s3,"s3://crabby-images/fa404/fa404b17e8362c34a9230dcf65178b2ab2e34a15" alt=""
data:image/s3,"s3://crabby-images/14d09/14d09474d1a2669e02bac82d48d209e8ec069440" alt=""
例子
现在,让我们考虑一个花瓣宽度为鸢尾花的真实例子。
p <- ggplot(iris, aes(x = Petal.Width)) +
geom_histogram(aes(x = Petal.Width, ..density..), binwidth = 0.1, colour = "black", fill = "white")
p
data:image/s3,"s3://crabby-images/11182/11182e20e8ece8ec69598beb97eb40e8435d072d" alt=""
data:image/s3,"s3://crabby-images/9d3eb/9d3eb9df71a5f3a973360c3cef89f5178afbdc1e" alt=""
data:image/s3,"s3://crabby-images/d40c5/d40c50f5e5761364088a7869de31a01a80677f6c" alt=""
flexfit <- flexmix(Petal.Width ~ 1, data = iris, k = 3, model = list(mo1, mo2, mo3))
print(table(clusters(flexfit), iris$Species))
##
## setosa versicolor virginica
## 1 0 2 46
## 2 0 48 4
## 3 50 0 0
geom_histogram(aes(x = Petal.Width, ..density..), binwidth = 0.1, colour = "black", fill = "white") +
args = list(c1[1], c1[2], lam[1]/sum(lam)),
colour = "red", lwd = 1.5) +
stat_function(geom = "line", fun = plot_mix_comps,
args = list(c2[1], c2[2], lam[2]/sum(lam)),
stat_function(geom = "line", fun = plot_mix_comps,
args = list(c3[1], c3[2], lam[3]/sum(lam)),
colour = "green", lwd = 1.5) +
ylab("Density")
data:image/s3,"s3://crabby-images/50a40/50a40390cf887b1c264073fcbbc331d6bf8d486a" alt=""
data:image/s3,"s3://crabby-images/8dd66/8dd669058583645949aec352fd20f84ed7e1f65d" alt=""
data:image/s3,"s3://crabby-images/b45fe/b45fe2ac9324ad3cf8f64680cf5451e662776e50" alt=""
即使我们不知道潜在的物种分配,我们也能够对花瓣宽度的基本分布做出某些陈述 。
非常感谢您阅读本文,有任何问题请在下面留言!
1
1
关于作者
Kaizong Ye是拓端研究室(TRL)的研究员。
本文借鉴了作者最近为《R语言数据分析挖掘必知必会 》课堂做的准备。