Hasan Mohammed Tanvir (master’s student at the Institute of Computer Science, University of Tartu) writes about his research in colaps:
Despite its comparatively small population and economy, Estonia has secured its place among prestigious academic institutions through dedication and a continuous striving for excellence. As part of this effort, it has undertaken various projects, one of which aims to understand student retention and dropout based on collected logged data that can be grouped into the domains of students’ earlier academic background, current performance, and effort. In this context, we aim to understand how data-driven engineered features contribute to student dropout, and how their impact changes given Estonia’s status in the European Union and the consequent local and global socio-economic changes.
Our work in colaps…
For this project, we analyzed student data from 2010 to 2020 from an Estonian Higher Education Institution (HEI). The idea was to design predictive models that can identify students who are likely to drop out of their higher education curriculum. We decided to approach the problem with a combination of rule-based methods and machine learning. The rule-based part determines whether a student is likely to drop out along three dimensions: academic background, performance, and effort. A student flagged as a likely dropout in all three dimensions is labeled high-risk; similarly, a student flagged in one dimension is labeled low-risk, and in two dimensions mid-risk. Behind these rule-based dimensions, we deployed trained machine learning (ML) models that perform the actual prediction. While training the predictive models, we applied linear regression, random forest, and gradient boosting algorithms to analyze the models’ performance on the data. The results across the models differed more than expected: they were not supposed to be identical, but models trained on the same data should have produced results within a tolerable range of differences in terms of recall. We chose recall as the evaluation metric because we want to reduce type II errors (false negatives), i.e., predicting students as “not dropped out” who are actually likely to drop out. While investigating this anomaly in the results of the predictive ML models, we found that the reason may lie in the data itself.
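As a rough illustration of the rule-based labeling and the recall metric described above (function names, flag ordering, and thresholds are our own sketch, not the actual system), the logic might look like:

```python
def risk_label(flags):
    """Map per-dimension dropout flags to a risk group.

    `flags` holds one boolean per dimension: academic background,
    performance, and effort (ordering assumed for illustration).
    """
    n = sum(flags)
    if n == 3:
        return "high"
    if n == 2:
        return "mid"
    if n == 1:
        return "low"
    return "none"


def recall(y_true, y_pred):
    """Recall = TP / (TP + FN).

    Optimizing recall keeps false negatives low, i.e. students
    predicted as "not dropped out" who actually drop out.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp / (tp + fn) if (tp + fn) else 0.0
```

For example, a student flagged in the academic and effort dimensions but not in performance would receive `risk_label([True, False, True])`, i.e. the mid-risk label.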
Could data age be affecting the outcome?
While training and evaluating the models, we noticed that their performance was not consistent across different years of data. Since we wanted our models to be consistent in order to predict outcomes for future students, we took a deeper look at the dataset, which introduced a new research question: is the data time-independent? Or, in other words: does time play a role in the performance of the models?
So, on a separate track, we performed a statistical analysis to check the effect of time on the data as a means of explaining the inconsistency of the predictive models.
To understand the effect of time on the data, we took a three-step approach. First, we carried out a correlation analysis to investigate the potential relationship between the engineered features and dropout. Then, we performed a multivariate analysis of variance (MANOVA) to investigate whether the engineered features change significantly among student cohorts with different admission years. Finally, we carried out a regression analysis with admission year as an interaction term to confirm whether the engineered features’ impact on predicting dropout changes over the years.
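The third step, regression with admission year as an interaction term, can be sketched on synthetic data (the feature, effect sizes, and noise level below are invented for illustration, not taken from our dataset). The key idea is that a non-zero coefficient on the feature-by-year interaction column indicates that the feature’s effect on the outcome drifts across admission cohorts:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 2000
years = rng.integers(0, 10, n)     # admission year as an offset from 2010
feature = rng.normal(size=n)       # one standardized engineered feature

# Synthetic outcome: the feature's effect shrinks by 0.08 per admission
# year (an invented ground truth so the sketch has something to recover).
outcome = (1.0 - 0.08 * years) * feature + 0.3 * years \
          + rng.normal(scale=0.5, size=n)

# Design matrix: intercept, feature, year, and feature x year interaction.
X = np.column_stack([np.ones(n), feature, years, feature * years])
beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)

# beta[3] estimates how the feature's effect changes per admission year;
# a value significantly different from zero means the impact is not
# stable over time.
print(f"interaction coefficient: {beta[3]:.3f}")
```

In practice one would fit this with a statistics package that also reports p-values for the interaction term, but the plain least-squares version above shows what the interaction term is doing.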
The results suggest that the importance of features concerning students’ academic background (such as prior experience with the academic institution) and the effort students make (for example, the number of days spent on academic leave) may change over time. By contrast, performance-based features (such as credit points and grades) do not appear to interact with students’ admission year. Based on these findings, we argue that the performance of prediction models for assessing students at risk of dropping out can be affected by the age of the data, and we outline the possibility of introducing a forgetting factor for non-recent data to mitigate its impact on prediction performance.
While training and evaluating predictive models, we discovered that their performance is not consistent when trained on different years’ data, which led us to explore the time dependency of the dataset. Through several analyses, we concluded that the performance of the models is affected by the age of the data, which indicates that the importance of the features contributing to dropout varies from year to year. The reasons for such variation may relate to the socio-economic changes a society is going through (in the case of Estonia, for example, being a relatively new member of the European Union). For instance, the economic support an institution provides to its students has become less important for student retention over the years; the reason could be that there are more economic opportunities now than in 2010. Although the importance of many features has changed over time, student performance has contributed to student retention in the same way throughout, which suggests that socio-economic background does not significantly impact one’s performance in education.
We argue that this finding may suggest that the age of the data is pivotal to designing and training predictive models.
Hasan will present this work at the 14th International Conference on Educational Data Mining (EDM 2021), and our article under the title “Exploring the Importance of Factors Contributing to Dropouts in Higher Education Over Time” will appear in the conference’s proceedings.