Econ220 final: Differences between students in sleeping time and GPA

When I first came to the university, I found that my friends at Oxford had SAT scores around 1450, whereas my friends at Emory had scores around 1500. According to official data, admitted Oxford students generally have lower high school grades and SAT scores.

Written by Mark Ji

I’m interested in whether Oxford students still perform worse than Emory College students in terms of their college grades.

knitr::include_graphics("EmorySAT.png")

knitr::include_graphics("OxfordSAT.png")

Students who enrolled in Oxford College vs. students who enrolled in Emory College in 2017

Research questions

The purpose of this project is to analyze whether Oxford students and Emory students differ in sleeping time and academic success (GPA). If not, what else determines GPA?



The questions I chose are

1)“Are you coming from Oxford College?”(q11)

2)“What is your cumulative GPA up to this point?”(GPA)

3)“On average, how many hours of sleep do you get per night?”(q14)

## # A tibble: 7 x 3
##   Variables       Emorycollege Oxfordcollege
##   <chr>           <chr>        <chr>        
## 1 Applicants      28211        16687        
## 2 Accepted        5191         4034         
## 3 Enrolled        1408         515          
## 4 GPA(unweighted) 3.80-4.00    3.74-3.97    
## 5 ACT             32-35        31-35        
## 6 SAT_Reading     690-760      690-760      
## 7 SAT_Math        720-790      700-790
“Class of 2024 data”

In the table, I list the high school scores of Emory and Oxford students. I wonder whether the pattern (Oxford students having lower high school grades) persists after college education. I will run a hypothesis test to examine whether mean college GPA differs between Oxford students and Emory students.

After that, I would like to discuss the correlation between sleeping time and GPA.




According to research conducted by Professor Cari from UCLA, a student who sacrifices sleep to study more than usual has more trouble understanding material taught in class and is more likely to struggle on an assignment or test. The paper reports that students generally learn best when they keep a consistent study schedule and distribute their study time evenly. The researchers recruited students from 3 Los Angeles public high schools and surveyed them in 9th grade and again in 10th and 12th grades. The regression results supported this conclusion.




But it is also possible that students who spend more time studying improve their performance. It is difficult to know in advance which effect dominates, so I examine the question using sample data from our class.



knitr::include_graphics("student sleeping.jpg")

Also, to dig deeper into this problem, I suspect that time taken from sleep may go to other things, such as entertainment or shopping in town, rather than simply to studying. Since Oxford College is located in a relatively rural area, I’d like to investigate whether there is also a relationship between campus and sleeping time. Do students at Emory College sleep less because they have more options for spending their time? Or do students on both campuses work equally hard on weekdays, so that they have similar sleeping times?

Finally, I will run a multiple linear regression (MLR) and hypothesis tests on selected regressors (q11-campus, q14-sleeping time, q41-hours of study, q4-age, q18-time for social media, q28-time for work out, q45-high school GPA) and check how each relates to GPA. The purpose of the MLR is to determine which factor influences GPA the most.

Methodology

  • After basic data cleaning, I first compare the data for both groups (Oxford students and Emory students) and describe general trends. Using a hypothesis test, I check whether there is a difference in mean GPA between Oxford students and Emory students.
  • Then, I plot sleeping time versus academic success (GPA).
  • Finally, I perform the multiple regression.

Hypothesis

  • Students’ college GPA differs across campuses
  • Students who sleep less have lower GPAs
  • At least one of the selected regressors affects GPA

Coding section – data visualization

Data cleaning and sorting

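# packages used throughout the report
library(tidyverse)                    # dplyr, tidyr, ggplot2
library(knitr); library(kableExtra)   # kable(), kable_styling()
library(gridExtra); library(plotly)   # grid.arrange(), plot_ly(), ggplotly()
library(car); library(olsrr)          # qqPlot(), linearHypothesis(), ols_plot_resid_fit()
library(lmtest); library(sandwich)    # coeftest(), vcovHC()

data<-read.csv("Econ220DataF20_ano.csv")
#data cleaning: keep the variables of interest, rename them and exclude outliers
cleandata<-data %>%
  select(q11,q14,GPA,q41,age,q18,q28,q45)%>%
  mutate(q41=as.integer(q41),q18=as.integer(q18))%>%
  rename(Campus = q11,Sleep= q14,Hourstudy = q41,socialmedia=q18,workouttime=q28,hsGPA=q45)%>%
  filter(Sleep<=15,GPA>1)%>% #remove outliers
  drop_na()
#turn Campus into a labelled factor
cleandata$Campus<-factor(cleandata$Campus,labels = c("Emory","Oxford"))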

Comparison between Oxford students and Emory students’ GPA

In the dataset, we have a total of 27 Oxford students and 109 Emory students. The limited number of Oxford students may be a problem for drawing conclusions, but 27 observations are enough to give us a first look.

#make table to show numbers of students in each campus
kable(table(cleandata$Campus), col.names=c("Campus", "Students")) %>% kable_styling(bootstrap_options = "striped", full_width = F)

Campus   Students
Emory    109
Oxford   27

According to the boxplots generated from the data, Emory students do appear to have somewhat higher GPAs than Oxford students, both in college and in high school. Our next step is a t-test to see whether that difference is large enough to assert that Emory students perform better than Oxford students.

plot1<-ggplot(cleandata, aes(x=Campus, y=GPA,color=Campus))+
  geom_boxplot(aes(alpha=I(0.1),fill=Campus))+  
  scale_color_brewer(palette = "Paired")+
  geom_point(aes(alpha=I(0.5)))+
  ggtitle("College GPA for both campuses")

plot2<-ggplot(cleandata, aes(x=Campus, y=hsGPA,color=Campus))+
  geom_boxplot(aes(alpha=I(0.1),fill=Campus))+  
  geom_point(aes(alpha=I(0.5)))+
  ggtitle("High school GPA for both campuses")
# draw box plots of college GPA and high school GPA for students from both campuses, side by side for comparison
grid.arrange(plot1, plot2, ncol=2, nrow=1)
#compute averages as a table
cleandata %>% group_by(Campus) %>% summarize( AvgGPA=mean(GPA), AvghsGPA=mean(hsGPA)) %>% kable(digits=3)%>% kable_styling(bootstrap_options = "striped", full_width = F)

Campus   AvgGPA   AvghsGPA
Emory    3.604    3.788
Oxford   3.588    3.765

Null hypothesis: $H_0: \mu_1 - \mu_2 = 0$

Alternative hypothesis: $H_1: \mu_1 - \mu_2 \neq 0$

With the significance level: $\alpha = 0.05$

I used a two-sample t-test because we don’t know the population standard deviations and we have only 27 observations from Oxford, which is a small sample. We allow Emory students and Oxford students to have different means and standard deviations (Welch’s test), and we assume that observations are independent both within and between samples.

t.test(GPA ~ Campus, data = cleandata)
#do t test for two cases
t.test(hsGPA ~ Campus, data = cleandata)
## 
##  Welch Two Sample t-test
## 
## data:  GPA by Campus
## t = 0.23727, df = 45.152, p-value = 0.8135
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1171749  0.1484729
## sample estimates:
##  mean in group Emory mean in group Oxford 
##             3.603908             3.588259 
## 
## 
##  Welch Two Sample t-test
## 
## data:  hsGPA by Campus
## t = 0.46112, df = 40.718, p-value = 0.6472
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.07707745  0.12267718
## sample estimates:
##  mean in group Emory mean in group Oxford 
##             3.787615             3.764815

According to the tests, surprisingly, we fail to reject either null: the p-values (0.8135 for college GPA and 0.6472 for high school GPA) are far larger than 0.05, the common significance level. In other words, we find no statistically significant difference between Emory and Oxford students in either college or high school GPA. This is the first finding of the project: the data do not support the first hypothesis that college GPA differs across campuses (in particular, that Oxford students have lower GPAs).
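To make the Welch output above easier to read, here is a minimal sketch of how the t statistic and its degrees of freedom are computed. It uses simulated GPAs, not the survey data: only the group sizes (109 and 27) come from this report, while the means and standard deviations are made-up assumptions.

```r
# Simulated GPAs; group sizes mirror the sample (109 Emory, 27 Oxford),
# but the values themselves are invented for illustration only.
set.seed(220)
emory  <- rnorm(109, mean = 3.6, sd = 0.33)
oxford <- rnorm(27,  mean = 3.6, sd = 0.30)

# Welch t statistic: difference in means over the unpooled standard error
se2_e <- var(emory)  / length(emory)
se2_o <- var(oxford) / length(oxford)
t_manual <- (mean(emory) - mean(oxford)) / sqrt(se2_e + se2_o)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (se2_e + se2_o)^2 /
  (se2_e^2 / (length(emory) - 1) + se2_o^2 / (length(oxford) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# Compare with the built-in test (var.equal = FALSE is the default)
welch <- t.test(emory, oxford)
c(t_manual = t_manual, t_builtin = unname(welch$statistic))
c(df_manual = df_manual, df_builtin = unname(welch$parameter))
```

The non-integer degrees of freedom in the output above (45.152 and 40.718) come from this approximation, which is why they are smaller than the pooled value of n1 + n2 − 2.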

plot_ly(cleandata,x=~Campus, y=~GPA,type="violin")
#make a violin plot to see distribution

It’s interesting that, despite the apparent difference in the boxplots and in the reported admissions data, there is no statistically significant difference between the GPAs of students from the two campuses. When I go back and use a violin plot instead of a boxplot, it shows that the GPA distributions on the two campuses are indeed similar. In the following section, we shift our focus to sleeping time.

Sleeping time versus academic success (GPA)

sleeping_GPA<-ggplot(cleandata, aes(x=Sleep, y=GPA,frame=Campus))+
  geom_point(aes(color=Campus,alpha=I(0.5)))+
  scale_color_brewer(palette = "Dark2")
#make scatter plot that show difference in students from two campus
ggplotly(sleeping_GPA) %>%
  animation_opts(transition = 500, easing = "linear", mode = "immediate")
# correlation, which is small
cor(cleandata$Sleep,cleandata$GPA)
## [1] 0.0627165

The correlation between the two variables is 0.0627165, which is small: there is almost no linear relationship between sleeping time and students’ performance. I’m somewhat skeptical of the result because it is counter-intuitive. Given the research by Professor Cari, I would not expect sleeping time and GPA to be completely unrelated. The result may arise because our sample is too small to fill the scatter plot with enough points; this shows in the plot, where the sleeping times of Emory students (the larger group) are much more spread out. In addition, we are sampling only from our ECON220 class, which is composed of students with similar backgrounds: they were all capable of entering Emory University, they take majors related to Econ220, and they are by and large juniors or seniors (as shown in the dataset). Sampling from a more diverse population may give a better measure of the relationship.
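As a rough check on whether a correlation of this size is distinguishable from zero, the sketch below converts the reported coefficient into a t statistic. It assumes the cleaned sample has n = 136 rows (109 Emory plus 27 Oxford students, as counted earlier); on the actual data, `cor.test(cleandata$Sleep, cleandata$GPA)` performs the same test directly.

```r
r <- 0.0627165   # sample correlation between Sleep and GPA reported above
n <- 109 + 27    # assumed sample size: Emory + Oxford students after cleaning

# Under H0: rho = 0, t = r * sqrt(n - 2) / sqrt(1 - r^2) follows a
# t distribution with n - 2 degrees of freedom
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(c(t = t_stat, p = p_value), 3)
```

The t statistic comes out at roughly 0.73, well below the usual critical value of about 2, so in this sample the correlation is not significantly different from zero at the 5% level.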

My next step is to examine which factors actually influence students’ GPA, given that campus and sleeping time do not appear to be decisive.

Before that, I make one last check for a relationship between sleeping time and campus. As discussed previously, staying in downtown Atlanta may offer students more choices for entertainment and thus leave them less time to sleep. But both the graph and the t-test in the “Campus versus sleeping time” section suggest that this is not the case. There is, again, almost no difference in sleeping time between students from the two campuses!

One more interesting thing I found is that my own sleeping time is below the mean for both campuses. An average of about 7 hours of sleep for a college student is much higher than I expected.

With neither of the first two hypotheses supported, we proceed to the second part of the project: a multiple linear regression to look for significant regressors for GPA.

Campus versus sleeping time

#compare sleeping time between two campuses
ggplot(cleandata, aes(x=Campus, y=Sleep))+
  geom_boxplot(aes(color=Campus,alpha=I(0.5)))+
  scale_color_brewer(palette = "Set2")
#and do a t test to verify
t.test(Sleep ~ Campus, data = cleandata)
## 
##  Welch Two Sample t-test
## 
## data:  Sleep by Campus
## t = 0.57771, df = 52.616, p-value = 0.5659
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.2591806  0.4688307
## sample estimates:
##  mean in group Emory mean in group Oxford 
##             7.178899             7.074074

Coding section – Multiple linear regression

Multiple regression on selected regressors

$$Y_{GPA} = 4.40 + 0.013\,X_{Campus} + 0.031\,X_{Sleep} + 0.003\,X_{Hourstudy} - 0.079\,X_{age} - 0.022\,X_{socialmedia} - 0.023\,X_{workouttime} + 0.200\,X_{hsGPA} + u_i$$

n = 136 (128 residual degrees of freedom), R-squared = 0.1404

multi1<- lm(GPA~Campus+Sleep+Hourstudy+age+socialmedia+workouttime+hsGPA, data = cleandata)
#do multiple regression to find significant regressors
summary(multi1)
## 
## Call:
## lm(formula = GPA ~ Campus + Sleep + Hourstudy + age + socialmedia + 
##     workouttime + hsGPA, data = cleandata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.12352 -0.15928  0.06388  0.22667  0.73730 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.403810   0.824095   5.344 4.03e-07 ***
## CampusOxford  0.013443   0.070338   0.191  0.84873    
## Sleep         0.031017   0.027595   1.124  0.26311    
## Hourstudy     0.003006   0.003991   0.753  0.45276    
## age          -0.079418   0.025008  -3.176  0.00187 ** 
## socialmedia  -0.022246   0.011725  -1.897  0.06004 .  
## workouttime  -0.022718   0.016667  -1.363  0.17525    
## hsGPA         0.200136   0.123507   1.620  0.10760    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.319 on 128 degrees of freedom
## Multiple R-squared:  0.1404, Adjusted R-squared:  0.09344 
## F-statistic: 2.988 on 7 and 128 DF,  p-value: 0.006151

#check for normal assumption
qqPlot(multi1, main="Check for normal assumption")
## [1] 57 64
#check for heteroskedasticity
ols_plot_resid_fit(multi1)

In the Q-Q plot, the middle section of the points lies close to the theoretical line (blue line), so the normality assumption is approximately met. However, MLR5 (homoskedasticity) is violated according to the “Residual vs Fitted Values” plot: there is heteroskedasticity, since the variance of the residuals grows as the fitted value grows. Heteroskedasticity usually produces a distinctive fan or cone shape in residual plots.

Heteroskedasticity-robust inference

In this case, OLS is no longer BLUE and the usual standard errors are no longer valid, which means the usual t and F statistics can’t be relied on. Thus, I use heteroskedasticity-robust inference by calculating robust standard errors.
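For readers who want to see what a robust standard error is, the following sketch computes the HC1 covariance matrix by hand. It is an illustration on R’s built-in `mtcars` data with an arbitrary model, not the survey regression; the formula is the standard sandwich estimator with the n/(n − k) small-sample correction.

```r
# Illustration on the built-in mtcars data, not the survey data
fit <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit)   # n x k design matrix (including the intercept)
e <- residuals(fit)      # OLS residuals
n <- nrow(X)
k <- ncol(X)

bread <- solve(crossprod(X))   # (X'X)^{-1}
meat  <- crossprod(X * e)      # X' diag(e^2) X; row i of X is scaled by e_i

# HC1: sandwich estimator with the n / (n - k) degrees-of-freedom correction
vcov_hc1 <- n / (n - k) * bread %*% meat %*% bread

robust_se <- sqrt(diag(vcov_hc1))
usual_se  <- sqrt(diag(vcov(fit)))
cbind(usual_se, robust_se)
```

The coefficient estimates are unchanged; only the standard errors, and hence the t statistics and p-values, differ. This is the same quantity that `vcovHC(..., type="HC1")` from the sandwich package returns.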

Interpretation

Looking at the robust test results, we see that only age is significant (its p-value is smaller than the 0.05 significance level). This means age is the only regressor with a statistically significant effect on GPA. Numerically, holding all else constant, being one year older is associated with a GPA that is lower by 0.079 on average.

$$Y_{GPA} = 4.4038104 + 0.0134433\,X_{Campus} + 0.0310170\,X_{Sleep} + 0.0030058\,X_{Hourstudy} - 0.0794176\,X_{age} - 0.0222461\,X_{socialmedia} - 0.0227182\,X_{workouttime} + 0.2001355\,X_{hsGPA} + u_i$$

coeftest(multi1,vcov=vcovHC(multi1,type="HC1"))
## 
## t test of coefficients:
## 
##                Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   4.4038104  0.7840754  5.6166 1.158e-07 ***
## CampusOxford  0.0134433  0.0626101  0.2147  0.830331    
## Sleep         0.0310170  0.0235658  1.3162  0.190462    
## Hourstudy     0.0030058  0.0036632  0.8205  0.413440    
## age          -0.0794176  0.0237547 -3.3432  0.001086 ** 
## socialmedia  -0.0222461  0.0138964 -1.6009  0.111875    
## workouttime  -0.0227182  0.0174374 -1.3028  0.194966    
## hsGPA         0.2001355  0.1127011  1.7758  0.078141 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#compute robust se

Test for multicollinearity

To me, it is unusual that Hourstudy and (time for) socialmedia do not contribute to college GPA (both are insignificant). So I suspect a multicollinearity problem, in which the two variables are highly correlated.

What I test here is whether the two variables are individually insignificant but jointly significant.

linearHypothesis(multi1,c("Hourstudy=0","socialmedia=0"),vcov=vcovHC(multi1,type="HC1"))
## Linear hypothesis test
## 
## Hypothesis:
## Hourstudy = 0
## socialmedia = 0
## 
## Model 1: restricted model
## Model 2: GPA ~ Campus + Sleep + Hourstudy + age + socialmedia + workouttime + 
##     hsGPA
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F Pr(>F)
## 1    130                 
## 2    128  2 1.3923 0.2523

#check multicollinearity

The resulting p-value (0.2523) is much larger than the 0.05 significance level, so we fail to reject the null. Time spent studying and time spent on social media have no significant relationship with students’ GPA, either individually or jointly.
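The joint F-test above checks joint significance; a more direct diagnostic for multicollinearity is the variance inflation factor (VIF). This is a different technique from the one used in the report, shown here only as a hedged sketch on the built-in `mtcars` data with an arbitrary model.

```r
# Illustration on the built-in mtcars data, not the survey data
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# regressor j on all the other regressors
X <- model.matrix(fit)[, -1, drop = FALSE]   # drop the intercept column
vif_manual <- sapply(colnames(X), function(j) {
  others <- X[, colnames(X) != j, drop = FALSE]
  r2 <- summary(lm(X[, j] ~ others))$r.squared
  1 / (1 - r2)
})
vif_manual
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity. For the model in this report, `vif(multi1)` from the car package, or simply `cor(cleandata$Hourstudy, cleandata$socialmedia)`, would show whether the two regressors are in fact highly correlated.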

Conclusion

Looking back at the three hypotheses we proposed, the data and graphs generated in this project show no significant difference in sleeping time or GPA between Emory students and Oxford students. What’s more, in this sample, sleeping more or less each day doesn’t show a direct influence on students’ GPA. By examining several other variables, we find that older students on average have lower GPAs. That may be because, as they advance to higher class years, they take more demanding courses and consequently earn lower GPAs. In addition, I ran two regressions to check whether students in higher class years sleep less or spend more time studying because their courses are harder, but both results were insignificant (so I didn’t include them in the project).

There are several limitations to the project. Two notable ones were already discussed during the analysis: we are sampling from a rather homogeneous group, and the sample size is small (especially for Oxford students). For the sleeping time investigation, a better analysis would need an additional survey question on the reasons that may influence individuals’ sleeping time. Finally, the data are, to some extent, biased, as the survey was conducted during the pandemic. Because students were taking online courses, they had more flexibility in arranging their time, which may be why they did not sacrifice sleep for coursework. With the city locked down, they also had no outdoor entertainment to distract them from studying. That might account for the insignificant results.

References

  • Cari G. (2013). To Study or to Sleep? The Academic Costs of Extra Studying at the Expense of Sleep.
  • Apply.emory.edu (2020). Admitted Students: Class of 2024. [online] Available at: https://apply.emory.edu/discover/facts-stats/first-year.html
  • Nancy M., Jean E. (2013). Assessing Sleep in Adolescents Through a Better Understanding of Sleep Physiology. The American Journal of Nursing, Vol. 113, No. 6 (June 2013), pp. 26-32. Published by Lippincott Williams & Wilkins.

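About the author

We sincerely thank Mark Ji for his contribution to this article. He graduated from Emory University, is skilled in data analysis and retrieval, and is passionate about data programming.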

 