Abstract:
Despite extensive retention research dating from the 1960s, community colleges still struggle to identify why students do not return to college. Data mining has allowed retention models to evolve and to identify new patterns among student populations and variables. The purpose of this study was to create a predictive model for student retention using background, academic, and financial factors, which can serve as a guide for other community colleges investigating institutional retention. Four data mining models (neural networks, random forests, support vector machines, and logistic regression) were used to identify significant factors for retention. The models were then compared on five evaluation metrics to determine whether one outperformed the others.
The number of credit hours was consistently the most important variable in retention. In addition, the interactions among the number of credit hours, GPA, and financial aid variables were significant for first-year retention. The interaction among GPA, financial aid variables, and the number of remedial hours was also crucial for first-year retention. No variables consistently predicted students' nonretention in the first year of college across the retention models. Many background predictors (age, gender, race, or ethnicity) were not significant in predicting retained or nonretained students. The comparison of the retention models found that the random forest model performed best at accurately classifying retained and nonretained students overall and retained students individually.
Keywords: Retention, Community College, Data Mining, Academic Factors, Background Factors, Financial Factors