Here I discuss the functions that R provides for working with the normal distribution: dnorm, pnorm, qnorm, and rnorm.
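Distribution functions in R
Every distribution in R has four associated functions. The prefix (d, p, q, r) names the kind of function and the suffix (norm) names the distribution. The four normal distribution functions are:
- dnorm: density function of the normal distribution
- pnorm: cumulative distribution function of the normal distribution
- qnorm: quantile function of the normal distribution
- rnorm: random sampling from the normal distribution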
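As a quick orientation, the following sketch calls each of the four functions once for the standard normal distribution (mean 0 and standard deviation 1, which are the defaults). The variable names and the seed are illustrative choices, not part of the original text.

```r
# One call to each of the four functions, using the standard normal defaults
# (mean = 0, sd = 1). Variable names are illustrative.
d0 <- dnorm(0)       # density at x = 0, i.e. 1 / sqrt(2 * pi)
p0 <- pnorm(0)       # P[X <= 0], which is 0.5 by symmetry
q50 <- qnorm(0.5)    # value below which 50% of the mass lies, i.e. 0
set.seed(42)         # arbitrary seed so the draw is reproducible
r5 <- rnorm(5)       # five random draws

print(c(density = d0, cdf = p0, quantile = q50))
print(r5)
```

All four functions accept `mean` and `sd` arguments, so the same calls work for any normal distribution.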
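Probability density function: dnorm
The probability density function (PDF, or density for short) describes how likely it is to observe a measurement near a particular value, and the integral over the density is always 1. For a value $x$, the normal density is defined as
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
where $\mu$ is the mean, $\sigma$ is the standard deviation, and $\sigma^2$ is the variance.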
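The normal density formula can be checked numerically against dnorm. The sketch below is a minimal illustration: `normal.density` is a hypothetical helper written for this check, and the integration range of 10 standard deviations around the mean is an assumption chosen for numerical convenience (the mass outside it is negligible).

```r
# Hand-written normal density: 1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2)).
# `normal.density` is a hypothetical helper name, not an R built-in.
normal.density <- function(x, mu, sigma) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- c(70, 100, 140)
manual <- normal.density(x, mu = 100, sigma = 15)
builtin <- dnorm(x, mean = 100, sd = 15)
print(all.equal(manual, builtin))  # TRUE if both agree up to numerical tolerance

# The density integrates to (practically) 1 over mean +/- 10 standard deviations.
total <- integrate(dnorm, lower = 100 - 10 * 15, upper = 100 + 10 * 15,
                   mean = 100, sd = 15)
print(total$value)
```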
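Using the density, we can determine the probability of events. For example, you may wonder: what is the likelihood that a person has an IQ of exactly 140? In that case, you need the density of the IQ distribution at the value 140. The IQ distribution can be modeled with a mean of 100 and a standard deviation of 15. Because IQ scores are treated here as integers from 50 to 150, the density at each integer serves as an approximate probability for that score. The corresponding density is: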
sample.range <- 50:150
iq.mean <- 100
iq.sd <- 15
iq.dist <- dnorm(sample.range, mean = iq.mean, sd = iq.sd)
iq.df <- data.frame("IQ" = sample.range, "Density" = iq.dist)
library(ggplot2)
ggplot(iq.df, aes(x = IQ, y = Density)) + geom_point()
With this data, we can now answer the initial question as well as other questions:
# helper assumed here (its definition is not shown in the text): print a fraction as a percentage
pp <- function(x) print(paste0(round(x * 100, 3), "%"))
# likelihood of IQ == 140?
pp(iq.df$Density[iq.df$IQ == 140])
## [1] "0.076%"
# likelihood of IQ >= 140?
pp(sum(iq.df$Density[iq.df$IQ >= 140]))
## [1] "0.384%"
# likelihood of 50 <= IQ <= 90? (the grid starts at 50)
pp(sum(iq.df$Density[iq.df$IQ <= 90]))
## [1] "26.284%"
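Cumulative distribution function: pnorm
The cumulative distribution function (CDF) is monotonically increasing because it integrates the density $f$ from $-\infty$ up to $x$:
$$F(x \mid \mu, \sigma^2) = \int_{-\infty}^{x} f(t \mid \mu, \sigma^2)\,dt = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right]$$
where $\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \int_{0}^{z} e^{-t^2}\,dt$ is the error function.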
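The relationship between the density and the CDF can be illustrated numerically. In the sketch below, `area.to` is a hypothetical helper that integrates dnorm up to a given value. Starting the integration 10 standard deviations below the mean instead of at minus infinity is an assumption made for numerical convenience.

```r
iq.mean <- 100
iq.sd <- 15

# Area under the density up to x. The lower bound stands in for -Inf,
# since the mass below mean - 10 * sd is negligible.
# `area.to` is a hypothetical helper name.
area.to <- function(x) {
  integrate(dnorm, lower = iq.mean - 10 * iq.sd, upper = x,
            mean = iq.mean, sd = iq.sd)$value
}

x <- c(70, 90, 100, 140)
by.integration <- sapply(x, area.to)
by.pnorm <- pnorm(x, mean = iq.mean, sd = iq.sd)
print(round(cbind(x, by.integration, by.pnorm), 5))

# Monotonicity: the CDF never decreases as x grows.
print(all(diff(pnorm(50:150, mean = iq.mean, sd = iq.sd)) >= 0))
```

The two columns should agree up to the tolerance of the numerical integration.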
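To get an intuition for the CDF, let's create a plot for the IQ data:
iq.df$CDF_LowerTail <- pnorm(iq.df$IQ, mean = iq.mean, sd = iq.sd)  # lower tail, P[X <= x]
ggplot(iq.df, aes(x = IQ, y = CDF_LowerTail)) + geom_point()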
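As we can see, the plotted CDF shows the probability of obtaining an IQ less than or equal to a given value. This is because pnorm computes the lower tail by default, i.e. $P[X \le x]$. Using this knowledge, we can obtain the answers to some of the previous questions in a slightly different way: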
# likelihood of IQ <= 90? (the mass below 50 is only about 0.04%)
pp(pnorm(90, mean = iq.mean, sd = iq.sd))
## [1] "25.249%"
# set lower.tail to FALSE to obtain the upper tail P[X > x]
# probability for IQ >= 140? close to the value obtained before with dnorm
pp(pnorm(140, mean = iq.mean, sd = iq.sd, lower.tail = FALSE))
## [1] "0.383%"
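A question such as "50 < IQ <= 90" can also be answered as a difference of two CDF values, since $P[a < X \le b] = P[X \le b] - P[X \le a]$. The following is a minimal sketch of that idea, where `prob.between` is a hypothetical helper name.

```r
iq.mean <- 100
iq.sd <- 15

# P[a < X <= b] as a difference of two lower-tail probabilities.
# `prob.between` is a hypothetical helper name.
prob.between <- function(a, b, mean, sd) {
  pnorm(b, mean = mean, sd = sd) - pnorm(a, mean = mean, sd = sd)
}

p.50.90 <- prob.between(50, 90, iq.mean, iq.sd)
p.le.90 <- pnorm(90, mean = iq.mean, sd = iq.sd)
print(round(100 * c(between = p.50.90, at.most.90 = p.le.90), 3))
```

The two values differ only by the small mass below an IQ of 50, which is why the lower-tail probability at 90 is a good stand-in for the interval.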
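Note that the results from pnorm are close to those obtained by manually summing the values from dnorm. The small differences arise because the sum over integer IQ values only approximates the integral. Moreover, by setting lower.tail = FALSE, pnorm can be used to directly compute p-values, which measure how likely it is to observe a value at least as large as the one obtained.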
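To make the remark about p-values concrete, the sketch below computes the upper-tail probability for an observed IQ of 140 in two equivalent ways and compares it with the sum of integer densities. The observed value and the variable names are illustrative assumptions.

```r
iq.mean <- 100
iq.sd <- 15
observed <- 140  # illustrative observation

# Upper-tail probability P[X > observed]; for a continuous distribution this
# equals P[X >= observed].
p.upper <- pnorm(observed, mean = iq.mean, sd = iq.sd, lower.tail = FALSE)
p.complement <- 1 - pnorm(observed, mean = iq.mean, sd = iq.sd)
print(all.equal(p.upper, p.complement))

# Summing densities on the integer grid 140..150 is a discrete approximation
# of the same tail, so it is close but not identical.
p.summed <- sum(dnorm(observed:150, mean = iq.mean, sd = iq.sd))
print(round(100 * c(pnorm = p.upper, summed.dnorm = p.summed), 3))
```

Using lower.tail = FALSE is generally preferable to subtracting from 1 when the tail probability is very small, because it avoids loss of precision.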
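Quantile function: qnorm
The quantile function is simply the inverse of the cumulative distribution function (iCDF). Thus, the quantile function maps from probabilities to values. Let's take a look at the quantile function for $P[X \le x]$:
# input to qnorm is a vector of probabilities
prob.range <- seq(0, 1, 0.001)  # assumed grid; the step size is not given in the text
icdf.df <- data.frame("Probability" = prob.range, "IQ" = qnorm(prob.range, mean = iq.mean, sd = iq.sd))
ggplot(icdf.df, aes(x = Probability, y = IQ)) + geom_point()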
Using the quantile function, we can answer quantile-related questions:
# what is the 25th IQ percentile?
print(qnorm(0.25, mean = iq.mean, sd = iq.sd))
## [1] 89.88265
# what is the 75th IQ percentile?
print(qnorm(0.75, mean = iq.mean, sd = iq.sd))
## [1] 110.1173
# note: these are the same results as from the quantile function
quantile(icdf.df$IQ)
##        0%       25%       50%       75%      100%
##      -Inf  89.88265 100.00000 110.11735       Inf
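Because qnorm is the inverse of pnorm, applying one after the other should return the input. The following sketch checks this round trip and also shows that IQ quantiles are shifted and scaled standard normal quantiles. The chosen probabilities and variable names are illustrative.

```r
iq.mean <- 100
iq.sd <- 15

# Round trip: probabilities -> quantiles -> probabilities.
probs <- c(0.025, 0.25, 0.5, 0.75, 0.975)
iq.quantiles <- qnorm(probs, mean = iq.mean, sd = iq.sd)
recovered <- pnorm(iq.quantiles, mean = iq.mean, sd = iq.sd)
print(round(iq.quantiles, 3))
print(all.equal(recovered, probs))

# Quantiles of any normal distribution are a shifted, scaled version of the
# standard normal quantiles: q = mean + sd * z.
z <- qnorm(probs)
print(all.equal(iq.quantiles, iq.mean + iq.sd * z))
```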
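Random sampling function: rnorm
When you want to draw random samples from the normal distribution, you can use rnorm. For example, we could use rnorm to simulate random samples from the IQ distribution.
set.seed(1)  # fix the seed for reproducibility; the seed and sample sizes here are assumed
my.df <- do.call(rbind, lapply(c(100, 1000, 10000), function(n) data.frame("SampleSize" = n, "IQ" = rnorm(n, mean = iq.mean, sd = iq.sd))))
# show one facet per random sample of a given size
ggplot() + geom_histogram(data = my.df, aes(x = IQ)) + facet_wrap(.~SampleSize, scales = "free_y")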
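my.sample.df <- data.frame("IQ" = rnorm(100, mean = iq.mean, sd = iq.sd))  # assumed definition; not shown in the text
ggplot(my.sample.df, aes(x = IQ)) + geom_histogram()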
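Note that we call set.seed to ensure that the random number generator always produces the same sequence of numbers, which makes the results reproducible.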
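The effect of set.seed can be seen directly by drawing twice with the same seed. The sketch below also shows how the sample mean tends to move toward the true mean as the sample grows. The seed value and the sample sizes are arbitrary choices for illustration.

```r
iq.mean <- 100
iq.sd <- 15

# Same seed, same draws. The seed value 1 is an arbitrary choice.
set.seed(1)
first.draw <- rnorm(5, mean = iq.mean, sd = iq.sd)
set.seed(1)
second.draw <- rnorm(5, mean = iq.mean, sd = iq.sd)
print(identical(first.draw, second.draw))

# Larger samples tend to have means closer to the true mean of 100.
# The sample sizes are illustrative.
set.seed(1)
sample.means <- sapply(c(100, 1000, 10000), function(n) {
  mean(rnorm(n, mean = iq.mean, sd = iq.sd))
})
print(round(sample.means, 2))
```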
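About the author
Kaizong Ye is a researcher at 拓端研究室 (TRL), and we sincerely thank him for his contribution to this article. He completed a master's degree in statistics at Shanghai University of Finance and Economics and focuses on artificial intelligence. His expertise includes Python and Matlab simulation, visual processing, neural networks, and data analysis.
This article draws on the author's recent preparation for the course 《R语言数据分析挖掘必知必会》.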