Dataset
In this study, we used a highly diverse experimental dataset from the PubChem BioAssay system (AID 1376). This dataset contains information on approximately 200,000 (2 lakh) molecules screened against the M. tuberculosis GlmU protein. From this dataset, we identified a total of 125 molecules with known IC50 values. We screened these molecules using a docking protocol, and only the molecules found active in docking were used for QSAR model development. The final QSAR models were therefore developed on 88 molecules that satisfied both the docking and the experimental criteria. For this study, the dataset was divided into two parts: a training set and an independent test set. The training set was used for building the model, and the independent test set was used for evaluating its performance.
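The division into training and independent test sets can be sketched as below. This is a minimal illustration only: the split ratio and the selection procedure are not stated in the text, so the 20% test fraction, the random shuffle, the seed, and the placeholder molecule IDs are all assumptions.

```python
import random

def split_dataset(molecule_ids, test_fraction=0.2, seed=0):
    """Shuffle molecule IDs and split them into training and independent test sets.

    test_fraction and seed are illustrative assumptions, not values from the study.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    ids = list(molecule_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    # The first n_test shuffled IDs form the test set, the rest the training set
    return ids[n_test:], ids[:n_test]

# 88 placeholder IDs standing in for the molecules retained after docking
molecule_ids = [f"mol_{i:03d}" for i in range(88)]
train_ids, test_ids = split_dataset(molecule_ids)
print(len(train_ids), len(test_ids))
```

Keeping the test molecules out of descriptor selection and model fitting is what makes the test set independent.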
Descriptor Calculation
For QSAR modeling, we calculated descriptors using several software packages (V-life, Web-Cdk, and Dragon), together with docking-based energy descriptors. These descriptors fall into different categories, such as topological, molecular, and constitutional descriptors.
Feature Selection
Feature selection is an important step in QSAR modeling. Some descriptors generally contribute negatively to a model, so it is necessary to identify and remove them. For this purpose, we used the CfsSubsetEval feature selection method of the Weka software, which returns the most important descriptors. We then used an F-stepping approach to further reduce the number of descriptors without any significant change in model performance.
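The descriptor-reduction step can be sketched as follows. The text does not define F-stepping, so this sketch assumes it means a remove-one style backward elimination: each descriptor is dropped in turn, and a removal is kept if the leave-one-out R2 of a least-squares model does not fall by more than a tolerance. The function names, the tolerance of 0.01, and the synthetic data are hypothetical, and the sketch does not reproduce Weka's CfsSubsetEval.

```python
import numpy as np

def loocv_r2(X, y):
    """Leave-one-out cross-validated R2 of an ordinary least-squares model."""
    n = len(y)
    Xb = np.column_stack([np.ones(n), X])  # prepend an intercept column
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # hold out sample i
        coef, *_ = np.linalg.lstsq(Xb[keep], y[keep], rcond=None)
        preds[i] = Xb[i] @ coef
    return 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)

def remove_one_elimination(X, y, tolerance=0.01):
    """Drop descriptors one at a time while the LOOCV R2 stays within tolerance.

    tolerance is an assumed threshold for "no significant change".
    """
    selected = list(range(X.shape[1]))
    best = loocv_r2(X[:, selected], y)
    while len(selected) > 1:
        # Score every subset obtained by removing a single descriptor
        trials = [
            (loocv_r2(X[:, [c for c in selected if c != d]], y), d)
            for d in selected
        ]
        score, drop = max(trials)
        if score < best - tolerance:
            break  # every removal degrades the model noticeably, so stop
        selected.remove(drop)
        best = max(best, score)
    return selected, best

# Synthetic stand-in data: only columns 0 and 2 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=40)
selected, score = remove_one_elimination(X, y)
print(selected, round(float(score), 3))
```

Comparing against the best score seen so far, rather than the previous step, keeps small losses from accumulating over many removals.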
Techniques
For model building, we used both linear (multiple linear regression, MLR) and non-linear (support vector machine, SVM) statistical approaches. Our findings suggest that MLR, a linear method, performs better than the non-linear SVM technique. The final QSAR model was therefore developed using MLR.
Performance Evaluation
The performance of the constructed models was evaluated using leave-one-out cross-validation (LOOCV). In LOOCV, each molecule is used once for testing while the remaining (n-1) molecules are used for training. The performance of the methods was computed using the following measures:
1. Correlation coefficient (R)
2. Coefficient of determination (R2)
3. Mean absolute error (MAE)
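The evaluation procedure and the three measures above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: MLR is fitted by ordinary least squares, R is taken as the Pearson correlation between observed and predicted values, and R2 is computed as 1 - SS_res/SS_tot (some QSAR papers instead report the square of R). The function names and the synthetic data are hypothetical.

```python
import numpy as np

def loocv_predictions(X, y):
    """Leave-one-out predictions from an ordinary least-squares (MLR) model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    Xb = np.column_stack([np.ones(n), X])  # prepend an intercept column
    preds = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i  # molecule i is tested, the other n-1 train
        coef, *_ = np.linalg.lstsq(Xb[train], y[train], rcond=None)
        preds[i] = Xb[i] @ coef
    return preds

def evaluate(y_true, y_pred):
    """Return R, R2 and MAE between observed and predicted activities."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    r = np.corrcoef(y_true, y_pred)[0, 1]            # Pearson correlation
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                       # assumed R2 definition
    mae = np.mean(np.abs(y_true - y_pred))           # mean absolute error
    return {"R": float(r), "R2": float(r2), "MAE": float(mae)}

# Synthetic stand-in for a descriptor matrix and activity values
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=30)
metrics = evaluate(y, loocv_predictions(X, y))
print(metrics)
```

Because every prediction comes from a model that never saw the held-out molecule, these values estimate predictive rather than fitted performance.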