Abstract
This study presents a comprehensive analysis of the prediction of carbon dioxide emissions from vehicles using machine learning-based regression models. Linear regression, lasso regression, k-nearest neighbor regression, random forest, and CatBoostRegressor algorithms were systematically evaluated on a dataset of vehicle specifications and emissions data. Hyper-parameter optimization was performed using grid search, and model performance was measured using mean squared error, root mean squared error, mean absolute error, and R-squared metrics. CatBoostRegressor stood out for its high predictive accuracy, and the random forest and k-nearest neighbor models also produced notable results, whereas the linear models failed to capture complex relationships in the data. Correlation analysis showed that engine displacement, number of cylinders, and fuel consumption were strongly correlated with carbon dioxide emissions (correlation coefficients of 0.92–0.99). A comparison with the literature showed that the study is distinguished by its multi-model approach, rigorous data pre-processing, and systematic optimization. However, the geographical limitation of the dataset and the lack of dynamic variables such as driving conditions restrict the generalizability of the results. In future work, explainable artificial intelligence methods and larger datasets may overcome these limitations. By highlighting the applicability of CatBoostRegressor, this study strengthens the contribution of machine learning to environmental sustainability policy and contributes a methodological innovation to the literature.
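The four evaluation metrics named above can be illustrated with a minimal sketch in plain Python. The function name `regression_metrics` and the sample emission values are assumptions made for illustration only; they are not taken from the study or its dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R-squared for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    # Mean squared error and mean absolute error
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n

    # R-squared: 1 - (residual sum of squares / total sum of squares)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R-squared is undefined when all targets are identical
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Hypothetical CO2 emissions in g/km, invented for this example
observed = [200.0, 250.0, 300.0, 350.0]
predicted = [210.0, 240.0, 310.0, 340.0]
scores = regression_metrics(observed, predicted)
```

For these invented values every prediction is off by 10 g/km, so the mean squared error is 100, the root mean squared error and the mean absolute error are both 10, and R-squared is 0.968.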