
What Factors Actually Affect Your Grades?


With exam period approaching fast, every student is wondering how to score the best possible grade. Some factors—like how much sleep you’re getting or how healthy you are—seem to have an obvious correlation with your final grade. What about your relationship status? How much should you be studying to achieve the grade you want? Does the subject you’re studying influence your final grade? In this article, we will use two datasets containing student math and Portuguese language performance in two different Portuguese schools and see which factors affected student performance the most.

Exploratory Data Analysis

Dataset Overview

The variables are the same for the two datasets:

| Variable | Description | Type | Possible Values |
| --- | --- | --- | --- |
| school | School | binary | GP = Gabriel Pereira; MS = Mousinho da Silveira |
| sex | Sex | binary | F = female; M = male |
| age | Age | numeric | 15–22, inclusive |
| address | Address type | binary | U = urban; R = rural |
| famsize | Family size | binary | LE3 = less than or equal to 3; GT3 = greater than 3 |
| Pstatus | Parents’ cohabitation status | binary | T = living together; A = living apart |
| Medu | Mother’s education | ordinal | 0 = none; 1 = up to 4th grade; 2 = 5th–9th grade; 3 = secondary; 4 = higher |
| Fedu | Father’s education | ordinal | 0 = none; 1 = up to 4th grade; 2 = 5th–9th grade; 3 = secondary; 4 = higher |
| Mjob | Mother’s job | nominal | teacher; health (health-care related); services (civil services); at_home; other |
| Fjob | Father’s job | nominal | teacher; health (health-care related); services (civil services); at_home; other |
| reason | Reason for choosing school | nominal | home (close to home); reputation (school reputation); course (course preference); other |
| guardian | Student’s guardian | nominal | mother; father; other |
| traveltime | Travel time to school | ordinal | 1 = <15 min.; 2 = 15–30 min.; 3 = 30 min.–1 hour; 4 = >1 hour |
| studytime | Weekly study time | ordinal | 1 = <2 hours; 2 = 2–5 hours; 3 = 5–10 hours; 4 = >10 hours |
| failures | Past class failures | numeric | 0–3, else 4 |
| schoolsup | Extra educational support | binary | yes; no |
| famsup | Family educational support | binary | yes; no |
| paid | Extra paid classes | binary | yes; no |
| activities | Extra-curricular activities | binary | yes; no |
| nursery | Attended nursery school | binary | yes; no |
| higher | Wants to take higher education | binary | yes; no |
| internet | Home internet access | binary | yes; no |
| romantic | In a romantic relationship | binary | yes; no |
| famrel | Quality of family relationships | ordinal | 1 (very bad) to 5 (very good) |
| freetime | Free time after school | ordinal | 1 (very low) to 5 (very high) |
| goout | Going out with friends | ordinal | 1 (very low) to 5 (very high) |
| Dalc | Workday alcohol consumption | ordinal | 1 (very low) to 5 (very high) |
| Walc | Weekend alcohol consumption | ordinal | 1 (very low) to 5 (very high) |
| health | Current health status | ordinal | 1 (very bad) to 5 (very good) |
| absences | Number of absences | numeric | 0–93 |
| G1 | First period grade | numeric | 0–20 |
| G2 | Second period grade | numeric | 0–20 |
| G3 | Final grade | numeric | 0–20 |

We will conduct a basic analysis of the datasets, followed by visualizations of the correlations between different factors. Finally, we will build a linear regression model for each subject to predict the students’ final grades.

We will start by importing all the necessary packages and loading the datasets into pandas DataFrames.

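# Import necessary packages
import pandas as pd
import numpy as np

import statistics as stats
import statsmodels.api as sm

# Plotting libraries used by the visualizations later in the post
import matplotlib.pyplot as plt
import seaborn as sns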

# Load the dataset from the csv file using pandas
data_m = pd.read_csv(r'data/student-mat.csv', sep=';')
data_p = pd.read_csv(r'data/student-por.csv', sep=';')

We can start by taking a look at the first few rows of each dataset.

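First 5 lines of the math performance dataset:

| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | course | mother | 2 | 2 | 0 | yes | no | no | no | yes | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | course | father | 1 | 2 | 0 | no | yes | no | no | no | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | other | mother | 1 | 2 | 3 | yes | no | yes | no | yes | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | home | mother | 1 | 3 | 0 | no | yes | yes | yes | yes | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | home | father | 1 | 2 | 0 | no | yes | yes | no | yes | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |

First 5 lines of the Portuguese performance dataset:

| | school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian | traveltime | studytime | failures | schoolsup | famsup | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | course | mother | 2 | 2 | 0 | yes | no | no | no | yes | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 4 | 0 | 11 | 11 |
| 1 | GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | course | father | 1 | 2 | 0 | no | yes | no | no | no | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 2 | 9 | 11 | 11 |
| 2 | GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | other | mother | 1 | 2 | 0 | yes | no | no | no | yes | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 6 | 12 | 13 | 12 |
| 3 | GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | home | mother | 1 | 3 | 0 | no | yes | no | yes | yes | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 0 | 14 | 14 | 14 |
| 4 | GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | home | father | 1 | 2 | 0 | no | yes | no | no | yes | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 0 | 11 | 13 | 13 |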

An important detail to note is that there are 395 high school students in the math dataset and 649 in the Portuguese dataset. Grades range from 0 to 20. Furthermore, 16 of the 33 variables are numerical; the remaining variables need to be one-hot encoded before we analyze correlations and build the regression models.

Now let’s visualize the final grades distributions for both subjects.

Distribution of student grades for math and Portuguese

We can also calculate that the average final grades for math and Portuguese students are 10.42 and 11.91, respectively. This suggests that Portuguese students score higher on average than math students, although the comparison could easily be skewed by the large number of math students scoring zero.
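One way to check how much the zero scores matter is to compute the mean twice, once over all students and once excluding zeros. The following is a minimal sketch; the helper name `grade_summary` and the toy data are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def grade_summary(df: pd.DataFrame, grade_col: str = "G3") -> dict:
    """Summarise final grades with and without zero scores (hypothetical helper)."""
    grades = df[grade_col]
    nonzero = grades[grades > 0]
    return {
        "mean": grades.mean(),
        "n_zero": int((grades == 0).sum()),
        "mean_excluding_zeros": nonzero.mean(),
    }

# Toy data for illustration only (not taken from the real datasets)
toy = pd.DataFrame({"G3": [0, 0, 10, 12, 14]})
summary = grade_summary(toy)
print(summary)
```

Applied to the real data, this would be called as `grade_summary(data_m)` and `grade_summary(data_p)`; if the gap between the two means shrinks once zeros are excluded, the zero scores account for part of the difference between subjects.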

Finding and Visualizing Correlations for Numerical Variables

We are now going to automatically find the variables with the strongest correlation to the final grades for both datasets. Finding correlations between non-numeric features and the outcome can get a bit messy, so at first we will test only the variables that are already numerical. To better visualize the insights, we will also use correlation bar plots and heat maps for both datasets.
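The post does not show the code behind this step, so here is a minimal sketch of one way to do it: take the numeric columns, compute their Pearson correlation with G3, and sort by absolute value. The function name `numeric_correlations` and the toy data are assumptions for illustration.

```python
import pandas as pd

def numeric_correlations(df: pd.DataFrame, target: str = "G3") -> pd.Series:
    """Pearson correlation of each numeric column with the target, strongest first."""
    # Non-numeric columns (e.g. school) are skipped by select_dtypes
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data for illustration only (not taken from the real datasets)
toy = pd.DataFrame({
    "failures": [0, 0, 1, 2, 3],
    "studytime": [2, 3, 2, 3, 1],
    "school": ["GP", "GP", "MS", "GP", "MS"],
    "G3": [15, 14, 10, 8, 5],
})
corr = numeric_correlations(toy)
print(corr)
```

On the real data, the resulting Series for `data_m` or `data_p` could be passed to a bar plot, and the full `corr()` matrix to a heat map.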

Correlations between numeric predictors and the response for math

To interpret the correlation bar plots and heat maps:

  • The darker the bar/square, the stronger the correlation is.
  • Brown represents negative correlations, whereas purple represents positive correlations.

Correlations between numeric predictors and the response for Portuguese

Insights:

  • For both datasets, the number of past class failures has a strong negative correlation with G3.
  • Other variables with a negative correlation in both datasets are age, frequency of going out with friends (goout), traveltime, freetime and health.
  • G1 and G2 have very strong positive correlations with G3 in both datasets, because a student’s performance usually stays consistent throughout the year; we will therefore ignore them.
  • Other variables with a positive correlation in both datasets are studytime, parents’ education (Fedu and Medu) and quality of family relationships (famrel).

One-Hot Encoding

To get more insight from these datasets, we need to be able to use the categorical variables as well. An example of a categorical variable is school (the student attends either Gabriel Pereira or Mousinho da Silveira): it takes one of several possible values with no intrinsic ordering. We will use a technique called one-hot encoding, which creates a binary column for each category level indicating whether or not that level was the value of the original predictor. Here is an example of what this looks like for the father’s job variable (Fjob).

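| Father’s Job | Occupation_teacher | Occupation_health | Occupation_services | Occupation_at_home | Occupation_other |
| --- | --- | --- | --- | --- | --- |
| teacher | 1 | 0 | 0 | 0 | 0 |
| health | 0 | 1 | 0 | 0 | 0 |
| services | 0 | 0 | 1 | 0 | 0 |
| at_home | 0 | 0 | 0 | 1 | 0 |
| other | 0 | 0 | 0 | 0 | 1 |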

Technical Detail
One-hot encoding is actually slightly more subtle than the description given above. The missing detail is that we often drop the first of the resulting binary columns. We do this because knowing the values of the other columns is enough to be certain of the value of the first: if all of the other columns are zero, then the first column must be one, and vice versa. Removing the first column is important because the algebra behind linear regression fails when we duplicate predictor information.
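To make the redundancy concrete, the sketch below encodes a toy Fjob column with and without `drop_first=True`, adds an intercept column, and compares the number of columns with the matrix rank. The toy data and the helper `design_matrix` are assumptions for illustration. Note that pandas drops the first level in sorted order, which here is `at_home` rather than `teacher`.

```python
import numpy as np
import pandas as pd

# Toy column using the Fjob levels from the table above (each level appears twice)
levels = ["teacher", "health", "services", "at_home", "other"]
fjob = pd.Series(levels * 2, name="Fjob")

full = pd.get_dummies(fjob, dtype=int)                      # one column per level
reduced = pd.get_dummies(fjob, drop_first=True, dtype=int)  # first level dropped

def design_matrix(dummies: pd.DataFrame) -> np.ndarray:
    """Prepend an intercept column of ones to the dummy columns (hypothetical helper)."""
    return np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])

X_full = design_matrix(full)
X_reduced = design_matrix(reduced)

# With every level kept, the dummy columns sum to the intercept column,
# so the matrix has fewer independent columns than it has columns.
print("all levels: ", X_full.shape[1], "columns, rank", np.linalg.matrix_rank(X_full))
print("drop_first: ", X_reduced.shape[1], "columns, rank", np.linalg.matrix_rank(X_reduced))
```

When the rank is lower than the number of columns, the least-squares coefficients are not uniquely determined, which is the failure the paragraph above refers to.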

def one_hot_encode(df):
    # Select only categorical variables
    cat_df = df.select_dtypes(include=['object'])

    # One-hot encode variables
    dummy_df = pd.get_dummies(cat_df, drop_first=True)

    # Add the response back and return
    dummy_df['G3'] = df['G3']
    return dummy_df

# One-hot encode both datasets
dummy_dfm = one_hot_encode(data_m)
dummy_dfp = one_hot_encode(data_p)

Finding and Visualizing Correlations for Encoded Categorical Variables

We can now analyze how strongly each encoded variable correlates with the final grade in both datasets.

Correlations between categorical predictors and the response for math

Correlations between categorical predictors and the response for Portuguese

Insights:

  • Variables negatively correlated with final grades in both datasets: being in a romantic relationship (romantic_yes), not wanting to go to higher education (higher_no), living in a rural area (address_R) and having no internet access (internet_no).
  • Variables positively correlated with final grades in both datasets: not being in a romantic relationship (romantic_no), wanting to go to higher education (higher_yes), living in an urban area (address_U) and having internet access (internet_yes).
  • In the Portuguese performance dataset, the school variable is strongly correlated with the final grade (negatively for MS students and positively for GP students).
  • Males seem to score higher in math, whereas females score higher in Portuguese.

Some of the results are quite unexpected so let’s visualize them.

Effect of Address Type on Grades

Impact of address type on student performance

Insights:

  • For math performance, there is not much difference between urban and rural students, although urban students tend to score slightly higher.
  • For Portuguese performance, we can see that urban students score higher more often than rural students.

Effect of Relationship Status on Grades

Impact of relationship status on student performance

Note that of the 395 math students, 132 (33.4%) were in a relationship. Likewise, 239 (36.8%) of the 649 Portuguese students were in a relationship.

Insights:

  • In both datasets, there are more single students than students in a relationship (only about 33% of the math dataset and 37% of the Portuguese dataset are in a relationship). This might skew results, as there is less data to analyze for students in a relationship. In the Portuguese dataset, where there are more values to analyze, the two scatter plot shapes look more similar.
  • There is not enough data to say whether relationship status has a true impact on math performance.
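The group sizes and shares quoted above, together with a per-group grade summary, can be computed in one pass. This is a minimal sketch on toy data; the helper name `relationship_summary` is an assumption for illustration.

```python
import pandas as pd

def relationship_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Group size, share of students, and mean final grade per relationship status."""
    out = df.groupby("romantic")["G3"].agg(n="count", mean_G3="mean")
    out["share"] = out["n"] / out["n"].sum()
    return out

# Toy data for illustration only (not taken from the real datasets)
toy = pd.DataFrame({
    "romantic": ["no", "no", "no", "yes", "yes"],
    "G3": [12, 10, 14, 9, 11],
})
summary = relationship_summary(toy)
print(summary)
```

Calling `relationship_summary(data_m)` and `relationship_summary(data_p)` on the real data should reproduce the counts given above. A difference in group means alone would still not establish that relationship status causes the gap.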

Effect of Sex on Grades

fig, axs = plt.subplots(2, 1, figsize=(12, 10))
plt.subplots_adjust(hspace=.25)

def sex_plot(data, ax, subject):
    # Overlay the final-grade density curves for female and male students
    ax.set_xlim(0, 20)
    # fill=True replaces shade=True, which is deprecated in recent seaborn versions
    sns.kdeplot(data.loc[data['sex'] == 'F', 'G3'], label='Female', fill=True, ax=ax)
    sns.kdeplot(data.loc[data['sex'] == 'M', 'G3'], label='Male', fill=True, ax=ax)
    ax.set_title(f'Female vs Male Students {subject} Performance')
    ax.set_xlabel('Grade')
    ax.set_ylabel('Density')
    # Recent seaborn versions do not draw the legend automatically
    ax.legend()

sex_plot(data_m, axs[0], subject='Math')
sex_plot(data_p, axs[1], subject='Portuguese')

Impact of sex on student performance

Effect of School Choice on Grades

# Analyze the impact of school choice on Portuguese performance
plt.subplots(figsize=(12, 8))
b = sns.swarmplot(x='school', y='G3', data=data_p)
b.axes.set_title('School Choice vs Final Grade Portuguese')
b.set_xlabel('School')
b.set_ylabel('Final Grade');

Impact of school choice on student performance

Insights:

  • From the available data, MS students (school_MS) tend to score lower than GP students (school_GP) in Portuguese. One possible explanation, which this data cannot confirm, is that GP specializes in Portuguese and its students have access to higher-quality resources.
  • However, as with the relationship analysis, there are fewer students at MS, which might affect the results.

Model-fitting

We are now going to build a multiple linear regression model for each dataset. To limit the impact of correlated variables, we use only the twelve predictors most strongly correlated with G3. We start with the math scores.

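from io import StringIO

def fit_regression_model(df, dummy_df):
    # Combine the numeric columns with the one-hot encoded categorical columns
    num_df = df.select_dtypes(exclude=['object'])
    full_df = pd.concat([num_df, dummy_df.drop('G3', axis=1)], axis=1)
    # Drop the earlier period grades, which we decided to ignore
    full_df.drop(['G1', 'G2'], axis=1, inplace=True)
    # Keep G3 plus the 12 predictors with the largest absolute correlation to it
    # (G3 correlates perfectly with itself, hence 13 columns)
    most_inf = np.abs(full_df.corr()['G3']).sort_values()[-13:].index
    red_df = full_df.loc[:, most_inf]

    # Cast to float so boolean dummy columns do not produce an object array
    X = np.array(red_df.drop('G3', axis=1), dtype=float)
    y = np.array(red_df['G3'], dtype=float)

    # Add an intercept column and fit ordinary least squares
    Z = sm.add_constant(X)
    mod = sm.OLS(y, Z).fit()

    # Convert the coefficient table to a DataFrame labelled with predictor names
    results_as_html = mod.summary().tables[1].as_html()
    coeffs = pd.read_html(StringIO(results_as_html), header=0, index_col=0)[0]
    coeffs = coeffs.set_index(pd.Index(['intercept']).append(red_df.drop('G3', axis=1).columns))

    return mod.rsquared, coeffs

r2_m, coeffs_m = fit_regression_model(data_m, dummy_dfm)

print(f"Model R^2: {r2_m:.02f}")
display(coeffs_m)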
Model R^2: 0.20
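| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| intercept | 10.2932 | 3.454 | 2.981 | 0.003 | 3.503 | 17.083 |
| paid_yes | 0.2347 | 0.439 | 0.534 | 0.593 | -0.629 | 1.099 |
| sex_M | 1.1617 | 0.436 | 2.665 | 0.008 | 0.305 | 2.019 |
| address_U | 0.5889 | 0.543 | 1.084 | 0.279 | -0.479 | 1.657 |
| Mjob_health | 1.1831 | 0.779 | 1.519 | 0.130 | -0.349 | 2.715 |
| traveltime | -0.2877 | 0.324 | -0.887 | 0.376 | -0.926 | 0.350 |
| romantic_yes | -0.8371 | 0.458 | -1.829 | 0.068 | -1.737 | 0.063 |
| goout | -0.4753 | 0.194 | -2.450 | 0.015 | -0.857 | -0.094 |
| Fedu | -0.1004 | 0.251 | -0.400 | 0.689 | -0.594 | 0.393 |
| age | -0.0467 | 0.178 | -0.263 | 0.793 | -0.396 | 0.302 |
| higher_yes | 1.4442 | 1.044 | 1.383 | 0.167 | -0.608 | 3.497 |
| Medu | 0.4811 | 0.259 | 1.861 | 0.064 | -0.027 | 0.989 |
| failures | -1.7399 | 0.314 | -5.547 | 0.000 | -2.357 | -1.123 |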

Insights for the math dataset linear regression model:

  • Our model explains about 20% of the variance in the final grade (G3). It could still be improved if the goal of this article were pure accuracy.
  • The willingness of the student to go into higher education (higher_yes) has one of the largest absolute coefficients: students who want to go into higher education score, on average, 1.44 points higher. However, this estimate is not statistically significant at the 5% level (p = 0.167).
  • The statistically significant coefficients are failures, sex_M, and goout.
  • For example, failures plays a decisive role in student performance: for each class the student has failed in the past, they can expect a decrease of roughly 1.74 points in their final score, holding the other predictors fixed.
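A quick way to read significance off the table is to check which 95% confidence intervals exclude zero, which corresponds to p < 0.05. The sketch below applies that check to the coefficients and bounds copied from the math regression table above; the dictionary layout and function name are assumptions for illustration.

```python
# (coef, lower 95% bound, upper 95% bound), copied from the math regression table
math_coeffs = {
    "paid_yes":     (0.2347, -0.629, 1.099),
    "sex_M":        (1.1617, 0.305, 2.019),
    "address_U":    (0.5889, -0.479, 1.657),
    "Mjob_health":  (1.1831, -0.349, 2.715),
    "traveltime":   (-0.2877, -0.926, 0.350),
    "romantic_yes": (-0.8371, -1.737, 0.063),
    "goout":        (-0.4753, -0.857, -0.094),
    "Fedu":         (-0.1004, -0.594, 0.393),
    "age":          (-0.0467, -0.396, 0.302),
    "higher_yes":   (1.4442, -0.608, 3.497),
    "Medu":         (0.4811, -0.027, 0.989),
    "failures":     (-1.7399, -2.357, -1.123),
}

def significant_at_5_percent(coeffs: dict) -> list:
    """Return predictors whose 95% confidence interval does not contain zero."""
    return [name for name, (_, low, high) in coeffs.items() if low > 0 or high < 0]

significant = significant_at_5_percent(math_coeffs)
print(significant)
```

The intervals for romantic_yes and Medu only narrowly include zero, so those effects are borderline rather than clearly absent.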

And now for the Portuguese scores.

r2_p, coeffs_p = fit_regression_model(data_p, dummy_dfp)

print(f"Model R^2: {r2_p:.03f}")
display(coeffs_p)
Model R^2: 0.305
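| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| intercept | 9.8593 | 0.612 | 16.114 | 0.000 | 8.658 | 11.061 |
| Mjob_teacher | 0.2976 | 0.384 | 0.776 | 0.438 | -0.456 | 1.051 |
| internet_yes | 0.3248 | 0.269 | 1.207 | 0.228 | -0.204 | 0.853 |
| address_U | 0.3521 | 0.252 | 1.397 | 0.163 | -0.143 | 0.847 |
| reason_reputation | 0.4602 | 0.270 | 1.704 | 0.089 | -0.070 | 0.990 |
| Walc | -0.1570 | 0.108 | -1.455 | 0.146 | -0.369 | 0.055 |
| Dalc | -0.3119 | 0.149 | -2.098 | 0.036 | -0.604 | -0.020 |
| Fedu | 0.1503 | 0.129 | 1.169 | 0.243 | -0.102 | 0.403 |
| Medu | 0.0980 | 0.137 | 0.718 | 0.473 | -0.170 | 0.366 |
| studytime | 0.4366 | 0.137 | 3.192 | 0.001 | 0.168 | 0.705 |
| school_MS | -1.0299 | 0.252 | -4.087 | 0.000 | -1.525 | -0.535 |
| higher_yes | 1.6627 | 0.376 | 4.421 | 0.000 | 0.924 | 2.401 |
| failures | -1.4374 | 0.193 | -7.449 | 0.000 | -1.816 | -1.058 |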

Insights for the Portuguese dataset linear regression model:

  • Our model explains about 30.5% of the variance in the final grade (G3), better than the math model but still leaving room for improvement.
  • We again see that the desire to go into higher education and the number of previous failures are highly influential when determining a student’s final grade.
  • There are other statistically significant coefficients, such as school_MS, studytime and Dalc.
  • In fact, the influence of going to Mousinho da Silveira is strong, with an expected decrease of about one mark in a student’s Portuguese grade, holding the other predictors fixed.
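For readers unsure what the R² figure measures, it is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The following is a minimal sketch of that calculation on toy data, using NumPy’s least-squares solver in place of statsmodels; the toy numbers and function name are assumptions for illustration.

```python
import numpy as np

def r_squared(y, y_pred) -> float:
    """Share of the variance in y accounted for by the predictions."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y - y_pred) ** 2)    # variation left unexplained by the model
    ss_tot = np.sum((y - y.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Toy data for illustration only: grades fall as past failures rise
failures = np.array([0, 0, 1, 1, 2, 3], dtype=float)
grades = np.array([14, 11, 12, 9, 8, 6], dtype=float)

# Fit grade = b0 + b1 * failures by ordinary least squares
X = np.column_stack([np.ones_like(failures), failures])
beta, *_ = np.linalg.lstsq(X, grades, rcond=None)
r2 = r_squared(grades, X @ beta)
print(f"R^2 on toy data: {r2:.3f}")
```

An R² of 0.305 therefore means that roughly 70% of the variation in Portuguese grades is left unexplained by the twelve selected predictors.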

Conclusion

We have seen that many factors are associated with your final grades. Many of the strongest are socio-economic characteristics (address, parents’ education, family relationships, etc.) that a student cannot change. These findings may also reflect biases in the datasets. For example, a mother’s employment status might have a bigger cultural impact on Portuguese students than on UK students. However, some variables that the student can control, such as studytime, going out (goout), alcohol consumption (Dalc and Walc) and potentially relationship status (romantic), were also associated with the final grade (G3) of students in these datasets.

Although valuable insights have been gleaned from these datasets, it is clear from our poorly fitting regression models that linear effects alone are insufficient for capturing a system as complicated as a student’s school performance. If a purely predictive model were the goal, then moving towards a tree-based model or including carefully chosen interaction terms would be advisable.

Author: Brandusa Draghici
Permalink: https://research.wdss.io/school-success/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 4.0 unless otherwise specified.
