Structure–activity modeling and hybrid machine learning-based prediction of bioactivity in pyrazole derivatives for drug discovery applications

Abstract

Background/aim: Pyrazole derivatives are of growing interest because of their diverse pharmacological activities. However, their biological activity is often highly sensitive to subtle structural modifications. Existing quantitative structure–activity relationship (QSAR) approaches frequently fail to capture the conformational flexibility and nonlinear structure–activity relationships (SAR) of such heterocyclic scaffolds, which limits the accurate prediction of their biological profiles. More robust and predictive computational frameworks are therefore needed. This study addresses this gap by integrating four-dimensional (4D)-QSAR descriptors with hybrid machine learning (ML) techniques to improve predictive accuracy and provide a more reliable tool for structure-based drug design. The aim was to investigate the SAR of a series of pyrazole-based compounds using this integrative computational strategy.

Materials and methods: The dataset consisted of 54 pyrazole derivatives, of which 50 compounds were used for model construction and 4 compounds were reserved as a test set for validation. Although the test set was small, the selected compounds were structurally representative of the training set: they shared the same core scaffold while covering different substitution patterns and biological activity values. The 4D-QSAR approach included multiple conformations of each compound and used matrix-based representations of geometric and electronic properties to capture dynamic molecular behavior. A pharmacophore model was generated with EMRE software from the spatial and electronic features of the compounds studied. EMRE is in-house software developed by our research group and has been employed in several previously published 4D-QSAR studies for construction of the electron-conformational matrix of contiguity, pharmacophore modeling, descriptor matrix generation, and activity prediction. EMRE operates on standard geometric and electronic descriptors derived from quantum-chemical calculations, which supports methodological transparency and reproducibility despite its proprietary implementation. Comparable performance trends obtained with EMRE-based 4D-QSAR models have been reported in previous studies, supporting the validity of the software for pharmacophore-driven QSAR analysis (Şahin et al., 2011; Sahin and Saripinar, 2020; Sahin et al., 2021). Within this framework, a total of 204 molecular descriptors were computed with Spartan 07. To reduce redundancy and limit overfitting, descriptor selection was optimized through a genetic algorithm (GA)-based procedure (Fernandez et al., 2011), and only statistically significant descriptors with low intercorrelation were retained for model construction. Subsequently, multiple ML algorithms, including artificial neural networks, decision trees, and hybrid models, were evaluated to enhance prediction accuracy.
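The low-intercorrelation criterion mentioned above can be illustrated with a minimal sketch. This is not the GA-based procedure used in the study; it is a simpler greedy filter, shown only to make the idea of discarding redundant descriptors concrete. The function names, the 0.9 threshold, and the toy descriptor values are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant column carries no correlation information
    return cov / (sd_x * sd_y)


def filter_descriptors(descriptors, activity, max_r=0.9):
    """Greedy filter (an assumption, not the GA of the study).

    Descriptors are ranked by |correlation with activity|; each one is kept
    only if its |correlation| with every already-kept descriptor is below
    max_r (a hypothetical threshold).
    """
    ranked = sorted(
        descriptors,
        key=lambda name: abs(pearson(descriptors[name], activity)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        column = descriptors[name]
        if all(abs(pearson(column, descriptors[k])) < max_r for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: five compounds, three descriptors.
toy_descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.1, 4.0, 6.2, 8.1, 9.9],  # nearly collinear with d1
    "d3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
toy_activity = [1.1, 2.0, 2.9, 4.2, 5.0]

selected = filter_descriptors(toy_descriptors, toy_activity)
print(selected)
```

In this toy example only one of the two nearly collinear descriptors (d1, d2) survives, together with d3. A GA would instead search over descriptor subsets against a fitness function, which this sketch does not attempt.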

Results: Among all the tested models, the hybrid of the gradient boosting machine (GBM) and random forest (RF) algorithms, GBM+RF, yielded the highest predictive performance, with an R² value of 0.99978. To assess the robustness of the ML models, the training and validation procedures were repeated with different random seed initializations. The resulting performance metrics showed only minor variations across runs, indicating that the predictive performance of the GBM, RF, and GBM+RF hybrid models was not sensitive to random seed selection. Although the high R² value indicates strong internal consistency, it should be interpreted with caution because only 50 compounds were used for model construction and only 4 for the test subset. Overall, the integration of 4D-QSAR and ML approaches demonstrated strong predictive capability and effectively captured the key geometric and electronic features associated with biological activity.
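For readers less familiar with the metric, the sketch below shows how R² is computed on a four-compound test set and how a two-model hybrid might be formed. The abstract does not state how the GBM and RF outputs are combined, so the simple averaging used here is an assumption, and all activity and prediction values are hypothetical.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def average_predictions(pred_a, pred_b, weight_a=0.5):
    """Weighted average of two prediction lists.

    Averaging is an assumed combination rule for illustration only.
    """
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


# Hypothetical activities for a four-compound test set.
observed = [5.0, 6.0, 7.0, 8.0]
pred_gbm = [5.2, 5.9, 7.1, 7.8]  # hypothetical GBM predictions
pred_rf = [4.9, 6.2, 6.8, 8.1]   # hypothetical RF predictions

pred_hybrid = average_predictions(pred_gbm, pred_rf)

for label, pred in [("GBM", pred_gbm), ("RF", pred_rf), ("GBM+RF", pred_hybrid)]:
    print(label, round(r_squared(observed, pred), 5))
```

With only four test compounds, SS_tot is a sum of four terms, so a single poorly predicted compound can shift R² substantially. This is one reason a very high test-set R² on so few points warrants cautious interpretation.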

Conclusion: The electron conformational-GA computational strategy provides a robust framework for the rational design and virtual screening of pyrazole derivatives, leveraging multi-conformer modeling and a diverse set of molecular descriptors to identify potentially active compounds. The study was limited by the small size of the training and test sets and by the absence of experimental validation, which may constrain the generalizability of the findings. Future work could address these limitations by applying the model to larger and more diverse ligand datasets, performing virtual screening of compound libraries to discover novel hits, and validating promising candidates through in vitro assays. Overall, these findings support the potential of this scaffold for drug discovery, while further experimental studies are warranted to confirm the predicted activities and refine the predictive power of the model.