Skip to content

Use data mining to explore secondary school student performance in Portugal.

Notifications You must be signed in to change notification settings

karlieym/student_portugal

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Exploratory data mining on secondary students in Portugal

✏️ Project description

This project uses four data mining techniques to answer four research questions related to secondary student school performance. The techniques are regression, random forest classification, relationship mining and PCA clustering. Full code and paper can be found in this repository.
The four research questions are:
1. Is it possible to predict student performance in Portuguese (mother tongue) Class? What are the factors that affect student achievement?
2. Do the same analysis of students’ math final test performance, does it have the same attribution as Portuguese performance?
3. Can we find any interesting associations between other factors other than student’s academic performance?
4. What are the group characteristics of students with high and low grades respectively?

📊 Data description

The data were obtained in a survey of students' math and Portuguese language courses in secondary school, which is available on Kaggle.

Variables description

Variable Description
school Student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
sex Student's sex (binary: 'F' - female or 'M' - male)
age Student's age (numeric: from 15 to 22)
address Student's home address type (binary: 'U' - urban or 'R' - rural)
famsize Family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
Pstatus Parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
Medu Mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education, or 4 - higher education)
Fedu Father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education, or 4 - higher education)
Mjob Mother's job (nominal: 'teacher', 'health' care related, civil 'services', 'at_home' or 'other')
Fjob Father's job (nominal: 'teacher', 'health' care related, civil 'services', 'at_home' or 'other')
reason Reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
guardian Student's guardian (nominal: 'mother', 'father' or 'other')
traveltime Home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime Weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures Number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup Extra educational support (binary: yes or no)
famsup Family educational support (binary: yes or no)
paid Extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities Extra-curricular activities (binary: yes or no)
nursery Attended nursery school (binary: yes or no)
higher Wants to take higher education (binary: yes or no)
internet Internet access at home (binary: yes or no)
romantic With a romantic relationship (binary: yes or no)
famrel Quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime Free time after school (numeric: from 1 - very low to 5 - very high)
goout Going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc Workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc Weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health Current health status (numeric: from 1 - very bad to 5 - very good)
absences Number of school absences (numeric: from 0 to 93)
G1 First period grade (numeric: from 0 to 20)
G2 Second period grade (numeric: from 0 to 20)
G3 Final grade (numeric: from 0 to 20, output target)

💡 Results and findings

Findings regarding student final test grade(Q1&2):

  1. Is it possible to predict student performance in Portuguese (mother tongue) Class? What are the factors that affect student achievement?
    2. Do the same analysis of students’ math final test performance, does it have the same attribution as Portuguese performance?
    Regression analysis was first carried out however the result is insignificant. Therefore, a random forest classification was carried out and luckily, the result was satisfactory. With 86% and 61.34% accuracy in predicting students' Portuguese and Math scores respectively.
    The following bar charts are the top ten contributing factors to our prediction:
Screenshot 2024-09-17 at 19 14 25 Screenshot 2024-09-17 at 19 17 18
Students' performance in Portuguese is mainly influenced by family factors, such as family support, family size, parental relationship, etc. Students' math performance is also affected by school factors, such as the school they attend and school support. We can draw the rough conclusion that children from small, supportive, close-knit families have a better perception of language and words. Good math results are more dependent on the school, such as the quality of teaching, the degree of emphasis and other factors.

Relationship mining findings(Q3):

  1. Can we find any interesting associations between other factors other than student’s academic performance?
    The first and most obvious one is the intention to pursue higher education, since it has the most arrows pointing to it in the figure. Therefore, it can be concluded that students who have access to the Internet at home, who have attended kindergarten, and whose families provide educational support are most likely to want to pursue higher education. The second obvious conclusion is that Internet access is also favored by students who want to pursue higher education, who attended kindergarten and who had family educational support.
Screenshot 2024-09-17 at 19 22 40

PCA clustering findings

  1. What are the group characteristics of students with high and low grades respectively?
    Students can be divided into 4 groups, below is a visualization of the clustering:
Screenshot 2024-09-18 at 15 27 57
The top four features that contribute to clustering are **fathers being teachers, fathers working in health-related, Number of students in School Mousinho da Silveira and being guarded by moms.**
Screenshot 2024-09-18 at 15 32 22 Screenshot 2024-09-18 at 15 33 56 Screenshot 2024-09-18 at 15 34 36 Screenshot 2024-09-18 at 15 35 37

About

Use data mining to explore secondary school student performance in Portugal.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages