When I first came to the University, I noticed that my friends at Oxford had SAT scores around 1450,
whereas my friends at Emory had scores around 1500. According to official data, admitted Oxford students generally have lower high school grades and SAT scores.
I’m interested in whether Oxford students still perform worse than Emory College students in terms of their college grades.
knitr::include_graphics("EmorySAT.png")
knitr::include_graphics("OxfordSAT.png")
Students who enrolled in Oxford College vs. students who enrolled in Emory College in 2017
The questions I chose are:
1) “Are you coming from Oxford College?” (q11)
2) “What is your cumulative GPA up to this point?” (GPA)
3) “On average, how many hours of sleep do you get per night?” (q14)
## # A tibble: 7 x 3
## Variables Emorycollege Oxfordcollege
## <chr> <chr> <chr>
## 1 Applicants 28211 16687
## 2 Accepted 5191 4034
## 3 Enrolled 1408 515
## 4 GPA(unweighted) 3.80-4.00 3.74-3.97
## 5 ACT 32-35 31-35
## 6 SAT_Reading 690-760 690-760
## 7 SAT_Math 720-790 700-790
“Class of 2024 data”
In the table, I list the high school scores of Emory and Oxford students. I wonder whether the trend (Oxford students have lower high school grades) persists in college. I will run a hypothesis test to examine whether there’s a difference in mean college GPA between Oxford students and Emory students.
After that, I would like to discuss the correlation between sleeping time and GPA.
According to research conducted by Professor Cari from UCLA, a student who sacrifices sleep time to study more than usual has more trouble understanding material taught in class and is more likely to struggle on an assignment or test. The paper finds that students generally learn best when they keep a consistent study schedule and distribute their study time evenly. Beginning in 9th grade and continuing in 10th and 12th grades, the study surveyed students recruited from 3 Los Angeles public high schools. The regression results supported these findings.
But it’s also possible that spending more time studying improves students’ performance. It’s difficult to determine which effect dominates, so I want to examine it by looking at sample data from our class.
knitr::include_graphics("student sleeping.jpg")
Also, to dig deeper into this problem, I think that time not spent sleeping may be devoted to other things, like entertainment or shopping in town, rather than simply studying. Since Oxford College is located in a relatively rural place, I’d like to investigate whether there’s also a relationship between campus and sleeping time. Is it because students at Emory College have more options for spending their time that they sleep less? Or are they also working hard on weekdays, so that they have similar sleeping time?
Finally, I will run a multiple linear regression (MLR) and hypothesis tests on selected regressors (q11-campus, q14-sleeping time, q4-age, q18-time for social media, q28-time for work out, q45-high school GPA) and check the correlation. The purpose of the MLR is to determine which factor influences GPA most.
Methodology
- I’d like to first compare data for both groups (Oxford students and Emory students) and draw conclusions on general trends after basic data cleaning. Using a hypothesis test, I will check whether there’s a difference in mean GPA between Oxford students and Emory students.
- Then, I will plot sleeping time versus academic success (GPA).
- Finally, I will perform the multiple regression.
Hypothesis
- Students’ college GPAs differ across campuses
- Students who sleep less have lower GPAs
- At least one of the selected regressors affects GPA
Coding section – data visualization
Data cleaning and sorting
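library(tidyverse)                    # dplyr, tidyr (drop_na), ggplot2
library(knitr); library(kableExtra)   # kable tables and their styling
library(gridExtra); library(plotly)   # side-by-side plots and interactive plots
library(car); library(olsrr)          # qqPlot, linearHypothesis, residual plots
library(lmtest); library(sandwich)    # coeftest and robust covariance (vcovHC)
data <- read.csv("Econ220DataF20_ano.csv")
# data cleaning: keep the relevant columns, rename them, and exclude outliers
cleandata <- data %>%
  select(q11, q14, GPA, q41, age, q18, q28, q45) %>%
  mutate(q41 = as.integer(q41), q18 = as.integer(q18)) %>%
  rename(Campus = q11, Sleep = q14, Hourstudy = q41, socialmedia = q18, workouttime = q28, hsGPA = q45) %>%
  filter(Sleep <= 15, GPA > 1) %>% # remove outliers
  drop_na()
cleandata$Campus <- factor(cleandata$Campus, labels = c("Emory", "Oxford"))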
Comparison between Oxford students and Emory students’ GPA
In the dataset, we have a total of 27 Oxford students and 109 Emory students. The limited sample of Oxford students may be a problem for drawing conclusions, but 27 observations are enough to give us a first look.
kable(table(cleandata$Campus), col.names=c("Campus", "Students")) %>% kable_styling(bootstrap_options = "striped", full_width = F)
Campus | Students |
---|---|
Emory | 109 |
Oxford | 27 |
##make table to show numbers of students in each campus
According to the graphs generated from the data, Emory students indeed appear to have somewhat higher GPAs than Oxford students in both college and high school. Our next step is a t-test to see whether that difference is significant enough to assert that Emory students perform better than Oxford students.
plot1<-ggplot(cleandata, aes(x=Campus, y=GPA,color=Campus))+
geom_boxplot(aes(alpha=I(0.1),fill=Campus))+
scale_color_brewer(palette = "Paired")+
geom_point(aes(alpha=I(0.5)))+
ggtitle("College GPA for both campuses")
plot2<-ggplot(cleandata, aes(x=Campus, y=hsGPA,color=Campus))+
geom_boxplot(aes(alpha=I(0.1),fill=Campus))+
geom_point(aes(alpha=I(0.5)))+
ggtitle("High school GPA for both campuses")
# draw box plots of college GPA and high school GPA for students from both campuses for comparison; arrange them side by side
grid.arrange(plot1, plot2, ncol=2, nrow=1)
cleandata %>% group_by(Campus) %>% summarize( AvgGPA=mean(GPA), AvghsGPA=mean(hsGPA)) %>% kable(digits=3)%>% kable_styling(bootstrap_options = "striped", full_width = F)
#compute average as a table
Campus | AvgGPA | AvghsGPA |
---|---|---|
Emory | 3.604 | 3.788 |
Oxford | 3.588 | 3.765 |
Null hypothesis: $H_0: \mu_1 - \mu_2 = 0$
Alternative hypothesis: $H_1: \mu_1 - \mu_2 \neq 0$
With the significance level: $\alpha = 0.05$
I used a two-sample t-test because we don’t know the population standard deviations and we only have 27 observations from Oxford, which is a small sample. We allow Emory students and Oxford students to have different means and standard deviations (hence the Welch version of the test), and we assume independent observations within and between samples.
t.test(GPA ~ Campus, data = cleandata)
#do t test for two cases
t.test(hsGPA ~ Campus, data = cleandata)
##
## Welch Two Sample t-test
##
## data: GPA by Campus
## t = 0.23727, df = 45.152, p-value = 0.8135
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1171749 0.1484729
## sample estimates:
## mean in group Emory mean in group Oxford
## 3.603908 3.588259
##
##
## Welch Two Sample t-test
##
## data: hsGPA by Campus
## t = 0.46112, df = 40.718, p-value = 0.6472
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.07707745 0.12267718
## sample estimates:
## mean in group Emory mean in group Oxford
## 3.787615 3.764815
According to the tests, it’s surprising to see that we fail to reject both nulls, since the p-values (0.8135 for college GPA and 0.6472 for high school GPA) are much larger than 0.05 (the common significance level). In other words, we find no significant evidence that Emory and Oxford students have different college or high school GPAs. This is the first finding of the project: the data do not support the first hypothesis, that college GPA differs across campuses.
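To make the Welch test used above more concrete, here is a minimal sketch that computes the statistic by hand and compares it with `t.test`. The data are simulated, not the survey data: the group sizes match the report, but the means and the standard deviations (0.33 and 0.30) are assumptions chosen only for illustration.

```r
# Hypothetical data: simulated GPAs, NOT the survey responses
set.seed(220)
emory  <- rnorm(109, mean = 3.60, sd = 0.33)  # assumed spread
oxford <- rnorm(27,  mean = 3.59, sd = 0.30)  # assumed spread

# Estimated variance of each sample mean
v1 <- var(emory)  / length(emory)
v2 <- var(oxford) / length(oxford)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_manual  <- (mean(emory) - mean(oxford)) / sqrt(v1 + v2)
df_manual <- (v1 + v2)^2 /
  (v1^2 / (length(emory) - 1) + v2^2 / (length(oxford) - 1))
p_manual  <- 2 * pt(-abs(t_manual), df = df_manual)

# t.test() does not assume equal variances by default (var.equal = FALSE)
welch <- t.test(emory, oxford)
c(manual = t_manual, builtin = unname(welch$statistic))
c(manual = df_manual, builtin = unname(welch$parameter))
c(manual = p_manual, builtin = welch$p.value)
```

The non-integer degrees of freedom in the output above (45.152 and 40.718) come from this Welch-Satterthwaite approximation.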
plot_ly(cleandata,x=~Campus, y=~GPA,type="violin")
#make a violin plot to see distribution
It’s fairly interesting to see that, despite the differences in the graphs and the reported data, there’s no statistically significant difference between the GPAs of students from the two campuses. When I go back and use a violin plot instead of a boxplot, it shows that the GPA distributions of students from the two campuses are indeed similar. In the following section, we shift our focus to sleeping time.
Sleeping time versus academic success (GPA)
sleeping_GPA<-ggplot(cleandata, aes(x=Sleep, y=GPA,frame=Campus))+
geom_point(aes(color=Campus,alpha=I(0.5)))+
scale_color_brewer(palette = "Dark2")
#make scatter plot that show difference in students from two campus
ggplotly(sleeping_GPA) %>%
animation_opts(transition = 500, easing = "linear", mode = "immediate")
# correlation, which is small
cor(cleandata$Sleep,cleandata$GPA)
## [1] 0.0627165
The correlation between the two variables is 0.0627165, which is small. There’s almost no linear relationship between sleeping time and students’ performance. I’m somewhat skeptical about the result because it’s counter-intuitive. Given the research by Professor Cari, I would not expect sleeping time and GPA to be completely unrelated. The result is probably because our sample is not large enough for the scatter plot to reveal a pattern. One sign of this is that Emory students’ sleeping times (which come from a bigger sample) are much more spread out. Also, we are only sampling from our ECON220 class, which is composed of students with similar backgrounds: they were capable of entering Emory University, are taking a major related to Econ220, and are by and large juniors or seniors (as shown in the dataset). Sampling from a more diverse population may yield a better measure for the problem.
My next step is to examine which factors do influence students’ GPA, given that campus and sleeping time are not strong determinants.
Before that, I’ll make a final effort to check whether there’s a relationship between sleeping time and campus. As discussed previously, staying in downtown Atlanta may offer students more choices for entertainment and thus reduce their time to sleep. But both the graph and the t-test show that this may not be true. There’s, again, almost no difference between the sleeping times of students from the two campuses!
One more interesting thing I found is that my own sleeping time is below the mean sleeping time for both campuses. An average of 7 hours of sleep for a college student is much higher than I expected.
With neither hypothesis supported by the data, we proceed to the second part of my project: a multiple linear regression to look for significant determinants of GPA.
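As a rough check on how small this correlation is, the usual t-test for a Pearson correlation can be computed directly from the reported value. This is only a sketch: it takes the sample size to be 109 + 27 = 136, the number of students in the cleaned data, and it is the same test that `cor.test` would run on the raw columns.

```r
r <- 0.0627165   # correlation between Sleep and GPA reported above
n <- 109 + 27    # Emory + Oxford students in the cleaned data

# Test of H0: the true correlation is zero
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

round(c(t = t_stat, p = p_value), 3)
```

The p-value is well above 0.05, so in this sample the correlation is not statistically distinguishable from zero.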
Campus versus sleeping time
ggplot(cleandata, aes(x=Campus, y=Sleep))+
geom_boxplot(aes(color=Campus,alpha=I(0.5)))+
scale_color_brewer(palette = "Set2")
#compare sleeping time between two campuses
t.test(Sleep ~ Campus, data = cleandata)
#and do a t test to verify
##
## Welch Two Sample t-test
##
## data: Sleep by Campus
## t = 0.57771, df = 52.616, p-value = 0.5659
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2591806 0.4688307
## sample estimates:
## mean in group Emory mean in group Oxford
## 7.178899 7.074074
Coding section – Multiple linear regression
Multiple regression on selected regressors
$$Y_{GPA} = 4.40 + 0.013 \cdot X_{Campus} + 0.031 \cdot X_{Sleep} + 0.003 \cdot X_{Hourstudy} - 0.079 \cdot X_{age} - 0.022 \cdot X_{socialmedia} - 0.023 \cdot X_{workouttime} + 0.200 \cdot X_{hsGPA} + u_i$$
n = 136 (128 residual degrees of freedom), R-squared = 0.1404
multi1<- lm(GPA~Campus+Sleep+Hourstudy+age+socialmedia+workouttime+hsGPA, data = cleandata)
#do multiple regression to find significant regressors
summary(multi1)
##
## Call:
## lm(formula = GPA ~ Campus + Sleep + Hourstudy + age + socialmedia +
## workouttime + hsGPA, data = cleandata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.12352 -0.15928 0.06388 0.22667 0.73730
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.403810 0.824095 5.344 4.03e-07 ***
## CampusOxford 0.013443 0.070338 0.191 0.84873
## Sleep 0.031017 0.027595 1.124 0.26311
## Hourstudy 0.003006 0.003991 0.753 0.45276
## age -0.079418 0.025008 -3.176 0.00187 **
## socialmedia -0.022246 0.011725 -1.897 0.06004 .
## workouttime -0.022718 0.016667 -1.363 0.17525
## hsGPA 0.200136 0.123507 1.620 0.10760
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.319 on 128 degrees of freedom
## Multiple R-squared: 0.1404, Adjusted R-squared: 0.09344
## F-statistic: 2.988 on 7 and 128 DF, p-value: 0.006151
qqPlot(multi1, main="Check for normal assumption")
#check for normal assumption
ols_plot_resid_fit(multi1)
#check for heteroscedasticity
## [1] 57 64
The middle section of the Q-Q plot basically coincides with the theoretical line (blue line), so the normality assumption is approximately met. But MLR5 (homoskedasticity) is violated according to the “Residual vs Fitted Values” plot: there’s heteroskedasticity, since the variance of the residuals grows as the fitted value gets larger. Heteroskedasticity usually produces a distinctive fan or cone shape in residual plots.
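The visual check described above can be complemented with a formal test. The sketch below is not part of the original analysis: it applies the studentized (Koenker) form of the Breusch-Pagan test by hand to simulated data whose error spread grows with the regressor. All variable names and parameter values are hypothetical.

```r
# Hypothetical simulated data, NOT the survey data
set.seed(1)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)  # error spread grows with x

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the regressor(s)
aux <- lm(resid(fit)^2 ~ x)

# LM statistic = n * R^2 of the auxiliary regression,
# chi-squared with df = number of regressors (here 1) under H0
lm_stat <- n * summary(aux)$r.squared
p_value <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

c(LM = lm_stat, p = p_value)
```

A small p-value rejects the null of constant error variance, which is the same conclusion the fan shape in a residual plot suggests.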
Heteroskedasticity-robust inference
In this case, OLS is no longer BLUE and the usual standard errors are no longer valid, which means the test statistics based on them can’t be trusted. Thus, I’ll use heteroskedasticity-robust inference by calculating robust standard errors.
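To show what a robust standard error actually is, here is a minimal sketch of the HC1 estimator written out in base R. It uses simulated data with hypothetical names and values, not the survey data. HC1 is the White "sandwich" estimator scaled by n / (n - k), where k is the number of estimated coefficients.

```r
# Hypothetical simulated data, NOT the survey data
set.seed(2)
n <- 200
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)  # heteroskedastic errors

fit <- lm(y ~ x)
X <- model.matrix(fit)   # n x k design matrix
e <- resid(fit)          # OLS residuals
k <- ncol(X)

bread <- solve(crossprod(X))   # (X'X)^{-1}
meat  <- crossprod(X * e)      # sum of e_i^2 * x_i x_i'
vcov_hc1 <- n / (n - k) * bread %*% meat %*% bread

robust_se    <- sqrt(diag(vcov_hc1))
classical_se <- sqrt(diag(vcov(fit)))
cbind(classical_se, robust_se)
```

The coefficient estimates are unchanged. Only the standard errors, and therefore the t values and p-values, differ between the two columns.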
Interpretation
Looking at the new test, we can see that only age is significant (its p-value is smaller than the 0.05 significance level). This means age is the only regressor that significantly affects GPA. Numerically, holding all else unchanged, being one year older is associated with a GPA that is on average 0.079 lower.
$$Y_{GPA} = 4.4038104 + 0.0134433 \cdot X_{Campus} + 0.0310170 \cdot X_{Sleep} + 0.0030058 \cdot X_{Hourstudy} - 0.0794176 \cdot X_{age} - 0.0222461 \cdot X_{socialmedia} - 0.0227182 \cdot X_{workouttime} + 0.2001355 \cdot X_{hsGPA} + u_i$$
coeftest(multi1,vcov=vcovHC(multi1,type="HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4038104 0.7840754 5.6166 1.158e-07 ***
## CampusOxford 0.0134433 0.0626101 0.2147 0.830331
## Sleep 0.0310170 0.0235658 1.3162 0.190462
## Hourstudy 0.0030058 0.0036632 0.8205 0.413440
## age -0.0794176 0.0237547 -3.3432 0.001086 **
## socialmedia -0.0222461 0.0138964 -1.6009 0.111875
## workouttime -0.0227182 0.0174374 -1.3028 0.194966
## hsGPA 0.2001355 0.1127011 1.7758 0.078141 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#compute robust se
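As a quick check of the robust output above, the t value and p-value for age can be recomputed from the reported estimate and robust standard error. The 128 degrees of freedom are taken from the regression summary. The result should agree with the reported row for age up to rounding.

```r
est_age <- -0.0794176   # coefficient on age reported above
se_age  <-  0.0237547   # robust (HC1) standard error reported above
df_res  <- 128          # residual degrees of freedom of the regression

t_age <- est_age / se_age
p_age <- 2 * pt(-abs(t_age), df = df_res)

c(t = t_age, p = p_age)
```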
Test for multicollinearity
To me, it’s unusual to see that Hourstudy and (time for) socialmedia do not contribute to college GPA (both are insignificant). So I suspect a multicollinearity problem, where the two variables are highly correlated.
What I’m testing here is whether the variables are individually insignificant but jointly significant.
linearHypothesis(multi1,c("Hourstudy=0","socialmedia=0"),vcov=vcovHC(multi1,type="HC1"))
## Linear hypothesis test
##
## Hypothesis:
## Hourstudy = 0
## socialmedia = 0
##
## Model 1: restricted model
## Model 2: GPA ~ Campus + Sleep + Hourstudy + age + socialmedia + workouttime +
## hsGPA
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 130
## 2 128 2 1.3923 0.2523
#check multicollinearity
The resulting p-value (0.2523) is much larger than the significance level (0.05), so we fail to reject the null. Time spent studying and time spent on social media have no significant relationship with students’ GPA, either individually or jointly.
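For contrast, the sketch below shows what the multicollinearity case would look like if it were present. It uses simulated data with hypothetical names, not the survey data, and it uses the classical F-test from `anova` rather than the robust version used above. Two nearly identical regressors both affect the outcome, so the joint test is strongly significant, while collinearity inflates the individual standard errors.

```r
# Hypothetical simulated data, NOT the survey data
set.seed(3)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # x2 is almost a copy of x1
y  <- 1 + 0.5 * x1 + 0.5 * x2 + rnorm(n)

full       <- lm(y ~ x1 + x2)
restricted <- lm(y ~ 1)          # imposes x1 = 0 and x2 = 0

# Individual t-tests: standard errors are inflated by the collinearity
summary(full)$coefficients

# Joint F-test of H0: both slopes are zero
joint   <- anova(restricted, full)
joint_p <- joint$`Pr(>F)`[2]

# Variance inflation factor of x1: 1 / (1 - R^2) from regressing x1 on x2
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)

c(joint_p = joint_p, vif_x1 = vif_x1)
```

In the survey data the joint test is insignificant as well, so multicollinearity between Hourstudy and socialmedia does not appear to be hiding a joint effect.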
Conclusion
Looking back at the three hypotheses we proposed, the data and graphs generated in this project show no significant difference in sleeping time or GPA between Emory students and Oxford students. What’s more, in this sample, daily sleeping time doesn’t show a direct influence on students’ GPA. By examining multiple other variables, we find that older students on average have lower GPAs. That may be because, as they proceed to higher class years, they take more advanced courses and, as a result, earn lower GPAs. In addition, I ran two regressions to check whether students in higher class years sleep less or spend more time studying since their courses are harder, but both results are insignificant (so I didn’t include them in my project).
There are several limitations to the project. Two notable ones were already discussed during the analysis: we are sampling from a rather homogeneous group, and the sample size is small (especially for Oxford students). For the sleeping time investigation, a better analysis would need an additional question on the reasons that may influence individuals’ sleeping time. Finally, the data are, to some extent, biased, as the survey was conducted during the pandemic. As students were taking online courses, they had more flexibility in arranging their time. That may be the reason why they didn’t sacrifice sleeping time for coursework. With the city locked down, they also had no outdoor entertainment to distract them from studying. That might account for the insignificant results.
Reference
- Cari.G, 2013, To Study or to Sleep? The Academic Costs of Extra Studying at the Expense of Sleep
- Apply.emory.edu, 2020, Admitted Students: Class of 2024. [online] Available at: https://apply.emory.edu/discover/facts-stats/first-year.html
- Nancy M., Jean E., Assessing Sleep in Adolescents Through a Better Understanding of Sleep Physiology, The American Journal of Nursing, Vol. 113, No. 6 (June 2013), pp. 26-32 (7 pages), published by Lippincott Williams & Wilkins
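About the author
Mark Ji
We sincerely thank Mark Ji for his contribution to this article. He graduated from Emory University, is skilled in data analysis and retrieval, and is passionate about data programming.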