Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give
Carbon halogen bond dissociation energy predictions through automated machine learning pipeline
Predicting the bond dissociation energy of the carbon–halogen (C–X) bond is important in chemistry because of the wide range of applications of C–X bonds in drug design, reaction mechanism studies, and materials science. In the present research, a robust machine learning workflow was explored to accurately predict C–X bond dissociation energies. The Tree-based Pipeline Optimization Tool (TPOT), an automated machine learning (AutoML) framework, was employed to systematically identify an optimized LightGBM regressor as the top-performing model. Additionally, tenfold cross-validation was used to rigorously confirm the model's robustness. The final model exhibited strong predictive capability, with a coefficient of determination (R²) of 0.93 on the internal test set and 0.95 on a more stringent external validation set. Moreover, interpretation of the model via SHapley Additive exPlanations (SHAP) suggests that its predictions are based on chemically intuitive concepts, including electronegativity difference, halogen atomic number, and local atomic charges. This work thus provides a tool for bond dissociation energy prediction that is both highly accurate and interpretable, while also demonstrating a contemporary workflow for building interpretable machine learning models for fundamental problems in chemistry.
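The validation protocol named above (tenfold cross-validation scored by the coefficient of determination) can be illustrated with a minimal, dependency-free sketch. It is not the authors' pipeline. A one-descriptor least-squares line stands in for the LightGBM regressor, the data are synthetic placeholders, and all function names (`r2_score`, `fit_line`, `k_fold_indices`, `cross_val_r2`) are hypothetical helpers introduced only for this example.

```python
import random
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (stand-in for the real regressor)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, y_bar - a * x_bar


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k near-equal, disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_r2(xs, ys, k=10, seed=0):
    """Train on k-1 folds, score R^2 on the held-out fold, repeat for every fold."""
    scores = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        preds = [a * xs[i] + b for i in fold]
        scores.append(r2_score([ys[i] for i in fold], preds))
    return scores


# Synthetic placeholder data (NOT from the study): a single made-up descriptor
# and a noisy linear target, used only to exercise the protocol.
rng = random.Random(42)
xs = [rng.uniform(0.0, 2.0) for _ in range(200)]
ys = [50.0 * x + 200.0 + rng.gauss(0.0, 5.0) for x in xs]

scores = cross_val_r2(xs, ys, k=10)
print(f"mean R2 over {len(scores)} folds: {mean(scores):.3f}")
```

Reporting the per-fold scores alongside their mean shows how much the R² varies between splits, which is the sense in which cross-validation speaks to robustness rather than to a single favourable split.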