Accessing Post Source
We are still working on getting this site set up, so source code for this post is not yet available. Check back soon and you’ll be able to find it linked here.
With exam period approaching fast, every student is wondering how to score the best possible grade. Some factors—like how much sleep you’re getting or how healthy you are—seem to have an obvious correlation with your final grade. What about your relationship status? How much should you be studying to achieve the grade you want? Does the subject you’re studying influence your final grade? In this article, we will use two datasets containing student math and Portuguese language performance in two different Portuguese schools and see which factors affected student performance the most.
Exploratory Data Analysis
Dataset Overview
The variables are the same for the two datasets:
Variable | Description | Type | Possible Values |
---|---|---|---|
school | School | binary | GP—Gabriel Pereira; MS—Mousinho da Silveira |
sex | Sex | binary | F—female; M—male |
age | Age | numeric | 15–22, inclusive |
address | Address type | binary | U—Urban; R—Rural |
famsize | Family Size | binary | LE3—less than or equal to 3; GE3—greater than 3 |
Pstatus | Parent’s cohabitation status | binary | T—living together; A—living apart |
Medu | Mother’s Education | ordinal | 0—none; 1—up to 4th grade; 2—5th–9th grade; 3—secondary; 4—higher |
Fedu | Father’s Education | ordinal | 0—none; 1—up to 4th grade; 2—5th–9th grade; 3—secondary; 4—higher |
Mjob | Mother’s Job | nominal | teacher; health(-care related); (civil )services; at home; other |
Fjob | Father’s Job | nominal | teacher; health(-care related); (civil )services; at home; other |
reason | Reason for choosing school | nominal | (close to )home; (school )reputation; course(preference); other |
guardian | Student’s guardian | nominal | mother; father; other |
traveltime | Travel time to school | ordinal | 1—<15 min.; 2—15–30 min.; 3—30 min.–1 hour; 4—>1 hour |
studytime | Weekly study time | ordinal | 1—<2 hours; 2—2–5 hours; 3—5–10 hours; 4—>10 hours |
failures | Past class failures | numeric | 0–3, else 4 |
schoolsup | Extra educational support | binary | yes; no |
famsup | Family educational support | binary | yes; no |
paid | Extra paid classes | binary | yes; no |
activities | Extra-curricular activities | binary | yes; no |
nursery | Attend nursery | binary | yes; no |
higher | Wants to take higher education | binary | yes; no |
internet | Home internet access | binary | yes; no |
romantic | In a romantic relationship | binary | yes; no |
famrel | Quality of family relationships | ordinal | 1—very bad to 5—very good |
freetime | Free time after school | ordinal | 1—very low to 5—very high |
goout | Going out with friends | ordinal | 1—very low to 5—very high |
Dalc | Workday alcohol consumption | ordinal | 1—very low to 5—very high |
Walc | Weekend alcohol consumption | ordinal | 1—very low to 5—very high |
health | Current health status | ordinal | 1—very bad to 5—very good |
absences | Number absences | numeric | 0–93 |
G1 | First Period Grade | numeric | 0–20 |
G2 | Second Period Grade | numeric | 0–20 |
G3 | Final Grade | numeric | 0–20 |
I will be conducting a basic analysis of the dataset followed by visualizations of the correlations between different factors. Finally, I will build a linear regression model for each subject to predict the students’ final grades.
We will start by importing all the necessary packages and load the datasets into a pandas dataframe.
1 | # Import necessary packages |
We can start by taking a look at the first few rows of each dataset.
First 5 lines of the math performance dataset:
school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | course | mother | 2 | 2 | 0 | yes | no | no | no | yes | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | course | father | 1 | 2 | 0 | no | yes | no | no | no | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | other | mother | 1 | 2 | 3 | yes | no | yes | no | yes | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | home | mother | 1 | 3 | 0 | no | yes | yes | yes | yes | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | home | father | 1 | 2 | 0 | no | yes | yes | no | yes | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |
First 5 lines of the Portuguese performance dataset:
school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | course | mother | 2 | 2 | 0 | yes | no | no | no | yes | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 |
1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | course | father | 1 | 2 | 0 | no | yes | no | no | no | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 2 | 9 | 11 | 11 |
2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | other | mother | 1 | 2 | 0 | yes | no | no | no | yes | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 6 | 12 | 13 | 12 |
3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | home | mother | 1 | 3 | 0 | no | yes | no | yes | yes | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 0 | 14 | 14 | 14 |
4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | home | father | 1 | 2 | 0 | no | yes | no | no | yes | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 0 | 11 | 13 | 13 |
An important detail to note is that there are 395 high school students in the math dataset and 649 in the Portuguese dataset. The grades of the student are from 0 to 20. Furthermore, there are 16 numerical variables out of 33; the rest of the variables will need to be one-hot encoded when we will analyze correlations and build the regression model.
Now let’s visualize the final grades distributions for both subjects.
We can also calculate that the average final grades for math and Portuguese students are 10.42 and 11.91, respectively. This suggests that Portuguese students score higher on average than math students although this comparison could easily have been skewed by the large number of math students scoring zero.
Finding and Visualizing Correlations for Numerical Variables
We are now going to automatically find the variables with the strongest correlation to the final grades for both datasets. Finding correlations between non-numeric features and the outcome can get a bit messy, so we will focus on testing only the existing numerical values of the datasets at first. To better visualize the insights, we will also use correlation bar plots and heat maps for both datasets.
To interpret correlation bar plots and heat map:
- The darker the bar/square, the stronger the correlation is.
- Brown represents negative correlations, whereas purple represents positive correlations.
Insights:
- For both datasets, the number of past classes
failures
has a strong negative correlation withG3
. - Other common variables with a negative correlation are
age
, frequency of going out with friends (goout
),traveltime
,freetime
andhealth
. G1
andG2
have very strong positive correlation coefficients for both datasets because student performance usually remains constant throughout the year; we will therefore ignore them.- Other common variables with a positive correlation are:
studytime
, education of parents (Fedu
andMedu
) and family relationship (famrel
).
One-Hot Encoding
In order to get more insight from these datasets, we need to be able to use the categorical variables as well. An example of categorical variable is the school
variable (the student is either at Gabriel Pereira or Mousinho da Silveira) as there are multiple possible values with no intrinsic ordering. We will use a technique called one-hot encoding, which assigns binary value to each category level indicating whether or not that level was the value of the original predictor. Here is an example of how it would look like for the variable father’s job (Fjob
).
Father’s Job | Occupation_teacher | Occupation_health | Occupation_services | Occupation_at_home | Occupation_other |
---|---|---|---|---|---|
teacher | 1 | 0 | 0 | 0 | 0 |
health | 0 | 1 | 0 | 0 | 0 |
services | 0 | 0 | 1 | 0 | 0 |
at_home | 0 | 0 | 0 | 1 | 0 |
other | 0 | 0 | 0 | 0 | 1 |
Technical Detail
One-hot encoding is actually slightly more subtle than the description given above. The missing detail is that we often drop the first of the resulting binary columns. The reason we do this is that knowing the value of the other columns is enough to be certain of the value of the first. Indeed, if all of the other columns are zero, then the first column must be one, and vice-versa. Removing the first column is important as the algebra behind linear regression fails we duplicate predictor information.
1 | def one_hot_encode(df): |
Finding and Visualizing Correlations for Encoded Categorical Variables
We can now analyze the correlation coefficients for the final grades of all the variables for both datasets.
Insights:
- Variables that impact negatively final grades in both datasets: in a romantic relationship (
romantic_yes
), does not want to go to higher education (higher_no
), lives in a rural area (address_R
) and has no access to internet (internet_no
). - Variables that impact positively final grades in both datasets: not in a romantic relationship (
romantic_no
), wants to go to higher education (higher_yes
), lives in a urban area (address_U
), has access to internet (internet_no
). - In Portuguese performance dataset, the
school
variable has a very high impact on the final grade (negatively impacted if goes to MS and positively impacted if goes to GP). - Males seem to score higher in math whereas females score higher in portuguese.
Visualizing Key Trends
Some of the results are quite unexpected so let’s visualize them.
Effect of Address Type on Grades
Insights:
- For math performance, there is not too much difference between urban and rural students. However, urban students tend to score slightly more.
- For portuguese performance, we can see that urban students score higher more often than rural students.
Effect of Relationship Status on Grades
Note, that of the 395 math students, 132 (33.4%) were in a relationship. Likewise 239 (36.8%) or of the 649 Portuguese students were in a relationship
Insights:
- In both datasets, there are more single students than in a relationship (only 33% in math dataset and 36% in portuguese dataset). This might skew results as there is less data to analyze for students in a relationship. We can see that in the Portuguese dataset where there are more values to analyze, the scatter plot shapes tend to look more similar.
- Not enough data to say if relationship has true impact on math performance.
Effect of Sex on Grades
1 | fig, axs = plt.subplots(2, 1, figsize=(12,10)) |
Effect of School Choice on Grades
1 | #Analyzing impact of choice of school on Portuguese performance |
Insights:
- From the available data, MS students (
school_MS
) tend to score less than GP students (school_GP
) in Portuguese. Maybe GP is specialized in Portuguese and students have access to higher-quality resources. - However as for the relationship analysis, there are less students going to MS so it might affect results.
Model-fitting
We are now going to build a multi-linear regression model for both datasets. To avoid the impact of correlated variables, we only use the top twelve most influential predictors. We start with the math scores.
1 | def fit_regression_model(df, dummy_df): |
1 | r2_m, coeffs_m = fit_regression_model(data_m, dummy_dfm) |
Model R^2: 0.20
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
intercept | 10.2932 | 3.454 | 2.981 | 0.003 | 3.503 | 17.083 |
paid_yes | 0.2347 | 0.439 | 0.534 | 0.593 | -0.629 | 1.099 |
sex_M | 1.1617 | 0.436 | 2.665 | 0.008 | 0.305 | 2.019 |
address_U | 0.5889 | 0.543 | 1.084 | 0.279 | -0.479 | 1.657 |
Mjob_health | 1.1831 | 0.779 | 1.519 | 0.130 | -0.349 | 2.715 |
traveltime | -0.2877 | 0.324 | -0.887 | 0.376 | -0.926 | 0.350 |
romantic_yes | -0.8371 | 0.458 | -1.829 | 0.068 | -1.737 | 0.063 |
goout | -0.4753 | 0.194 | -2.450 | 0.015 | -0.857 | -0.094 |
Fedu | -0.1004 | 0.251 | -0.400 | 0.689 | -0.594 | 0.393 |
age | -0.0467 | 0.178 | -0.263 | 0.793 | -0.396 | 0.302 |
higher_yes | 1.4442 | 1.044 | 1.383 | 0.167 | -0.608 | 3.497 |
Medu | 0.4811 | 0.259 | 1.861 | 0.064 | -0.027 | 0.989 |
failures | -1.7399 | 0.314 | -5.547 | 0.000 | -2.357 | -1.123 |
Insights for math data set linear regression model:
- Our model explains explains 20% of the inputs into the final grade (
G3
), however it could still be improve if the goal of this article would be pure accuracy. - We can see that the willingness of the student to go into higher education (
higher_yes
) is a variable with one of the largest absolute coefficients. If the student is willing to go into higher education, their score will increase, on average, by 1.44 points. - There are other statistically significant coefficients such as
failures
,sex_M
, andgoout
. - For example,
failures
plays a decisive role in student performance: for each class the student has failed in the past, they can roughly except a decrease of 1.74 in their final score.
And now for the Portuguese scores.
1 | r2_p, coeffs_p = fit_regression_model(data_p, dummy_dfp) |
Model R^2: 0.305
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
intercept | 9.8593 | 0.612 | 16.114 | 0.000 | 8.658 | 11.061 |
Mjob_teacher | 0.2976 | 0.384 | 0.776 | 0.438 | -0.456 | 1.051 |
internet_yes | 0.3248 | 0.269 | 1.207 | 0.228 | -0.204 | 0.853 |
address_U | 0.3521 | 0.252 | 1.397 | 0.163 | -0.143 | 0.847 |
reason_reputation | 0.4602 | 0.270 | 1.704 | 0.089 | -0.070 | 0.990 |
Walc | -0.1570 | 0.108 | -1.455 | 0.146 | -0.369 | 0.055 |
Dalc | -0.3119 | 0.149 | -2.098 | 0.036 | -0.604 | -0.020 |
Fedu | 0.1503 | 0.129 | 1.169 | 0.243 | -0.102 | 0.403 |
Medu | 0.0980 | 0.137 | 0.718 | 0.473 | -0.170 | 0.366 |
studytime | 0.4366 | 0.137 | 3.192 | 0.001 | 0.168 | 0.705 |
school_MS | -1.0299 | 0.252 | -4.087 | 0.000 | -1.525 | -0.535 |
higher_yes | 1.6627 | 0.376 | 4.421 | 0.000 | 0.924 | 2.401 |
failures | -1.4374 | 0.193 | -7.449 | 0.000 | -1.816 | -1.058 |
Insights for the portuguese data set linear regression model:
- Our model explains explains 30.5% of the inputs into the final grade (
G3
), better than the math model but still leaving room for improvement. - We again see that the desire to go into higher education and the number of previous failures are highly influential when determining a student’s final grade.
- There are other statistically significant coefficients such as
school_MS
,failures
andstudytime
. - In fact, the influence of going to Mousinho da Silveira is strong, with an expected decrease in one mark in a student’s Portuguese grade
Conclusion
We have seen that many factors can influence your final grades, the strongest of which typically being socio-economic characteristics (address, parent’s education, family relationship, etc.) that cannot be changed. Those factors can also depend on the potential biases of the dataset. For example, maybe the mother’s unemployment status has a bigger cultural impact on Portuguese student than on UK students. However, some variables that are controllable by the student such as studytime
, going out (goout
), consumption of alcohol (Dalc
and Walc
) and potentially relationship status (romantic
) have been proved to have an impact on the final grade (G3
) of students in these datasets.
Although valuable insights have been gleaned from this dataset it is clear from our poorly fitting regression model that linear interactions alone are insufficient for capturing a system as complicated as a student’s school performance. If a purely performative model is what we desired, then moving towards a tree-based model or including carefully chosen interaction terms would be advised.