GPSR stands for A Resource for Genomics, Proteomics and Systems Biology. This manual describes commonly used bioinformatics and chemoinformatics programs and software. Computational biology has changed tremendously over the years. In its infancy, it was used to solve only small biological problems. As the field advanced, scientists began to rely heavily on computational techniques for complex problems such as protein modeling. Today computational biology is dominated by bioinformatics, where managing, analyzing and mining biological data is a major challenge. A further challenge for any computer or bioinformatics professional is to understand the needs of biologists and to develop user-friendly software.
This manual describes two GPSR packages and has three major sections. The first section is written for students working in bioinformatics, particularly software developers. It describes (i) the major computational tools frequently used for developing bioinformatics software; (ii) the types of prediction methods; and (iii) the procedure for evaluating a newly developed method. The second section is written for users who wish to analyze proteins. It describes the small programs that are commonly used as building blocks for larger software packages. The third section describes standalone programs based on our servers/methods, which are important for users who want to run our methods on a whole proteome. We wish our users all the best.
These programs and the package are free software for academic users. Permission to use, copy, and modify any part of this software for educational, research and non-profit purposes is hereby granted, but distribution to third parties is prohibited. The programs are distributed in the hope that they will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose. If you want to include this software in a commercial product, please contact Dr. GPS Raghava at raghava@imtech.res.in.
1. Prediction at Protein Level: These methods are developed to predict the overall function or characteristics of proteins. They take the complete protein as input.
Cross-validation is a statistical method for validating a predictive model. Subsets of the data are held out to be used as validation sets; a model is fit to the remaining data (the training set) and used to predict the validation set. Averaging the quality of the predictions across the validation sets yields an overall measure of prediction accuracy.
The GPSR.1.0 package mainly contains small Perl programs related to bioinformatics problems.
These programs are useful for analyzing sequences as well as for developing web servers. The package contains nearly 50 Perl scripts and can be downloaded from http://www.imtech.res.in/raghava/gpsr/. Most of the programs generate input features from given protein sequences, e.g. amino acid composition, dipeptide composition, split amino acid composition, and N- or C-terminal composition. These programs include fasta2sfasta, pro2aac, pro2aac_split and pro2dpc. Other programs act on data given in two columns, e.g. multiplying each column by a number, finding the correlation between two columns, or finding the average of two columns. There are also Perl scripts that write the position-specific scoring matrix (PSSM) in column format, either without normalization or with normalized values. Together, these programs can convert a given protein sequence dataset into different types of input features for machine learning techniques such as Support Vector Machines. A few standalone packages are also provided, mainly EslPred, EslPred2, HslPred, PSLPred, SRTPred, OxyPred, DPROT, NRpred, PLpred, AntiBP, PolyApred, ABCpred, NADbinder, MITPRED, NpPred, Pprint, SPpred, ISSPred, GSTPpred, TBPred and PSEAPred2.
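The composition features mentioned above can be illustrated with a short Python sketch. It is not the code of pro2aac or pro2dpc; it assumes that composition means the percentage of each of the 20 standard residues (or of each of the 400 residue pairs) and that non-standard characters are simply ignored. The function names are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Percentage of each standard residue in seq (20 features)."""
    seq = seq.upper()
    counts = {aa: seq.count(aa) for aa in AMINO_ACIDS}
    total = sum(counts.values())  # non-standard characters are not counted
    if total == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: 100.0 * n / total for aa, n in counts.items()}


def dipeptide_composition(seq):
    """Percentage of each adjacent residue pair in seq (400 features)."""
    seq = seq.upper()
    pairs = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard characters
            counts[pair] += 1
    total = sum(counts.values())
    if total == 0:
        return {pair: 0.0 for pair in pairs}
    return {pair: 100.0 * n / total for pair, n in counts.items()}
```

For the toy sequence MKKLA, amino_acid_composition gives 40.0 for K (two of five residues), and dipeptide_composition gives 25.0 for each of MK, KK, KL and LA. The resulting vectors have fixed length regardless of sequence length, which is what makes them usable as machine learning input.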
GPSR version 2.0 is a collection of bioinformatics and chemoinformatics programs written in the Perl and R languages. Make sure that Perl and R are installed on your operating system before using this package (Perl is installed by default on Unix-based operating systems).
Part A deals with developing a method for classification of molecules (e.g. inhibitors and non-inhibitors) using binary descriptors, whose values are 0 or 1, for example fingerprints/descriptors from PADEL. In this situation, we advise using the programs of this category, e.g. desc_imp_a.pl, desc_sel_a.pl, desc_graph_a.pl.
Part B deals with developing a method for regression (e.g. predicting the IC50 value of chemicals) using binary descriptors, whose values are 0 or 1. In this situation, we advise exploring the programs of this category, e.g. desc_imp_b.pl, desc_sel_b.pl, desc_graph_b.pl.
Part C deals with developing a method for classification (e.g. inhibitors and non-inhibitors) using descriptors with real values. In this situation, users should explore the programs of this category, e.g. desc_imp_c.pl, desc_sel_c.pl.
Part D deals with developing a method for regression based on real-valued responses (e.g. IC50 values) using descriptors with real values. In this situation, we advise using the programs of this category, e.g. desc_imp_d.pl, desc_sel_d.pl.
The miscellaneous programs are used for file preparation and manipulation, which can be helpful in any bioinformatics or chemoinformatics work, e.g. make_selectdfile.pl, shiftcol.pl, mean.pl, median.pl.
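The four chemoinformatics cases above differ in only two respects: whether the task is classification or regression, and whether the descriptors are binary or real-valued. The following Python sketch makes that decision explicit. The helper choose_category is hypothetical and not part of GPSR; it only returns the letter suffix used in program names such as desc_imp_a.pl to desc_imp_d.pl.

```python
def choose_category(descriptor_rows, task):
    """Return 'a', 'b', 'c' or 'd', the suffix of the matching program family.

    descriptor_rows: list of rows, each a list of numeric descriptor values.
    task: 'classification' or 'regression'.
    """
    if task not in ("classification", "regression"):
        raise ValueError("task must be 'classification' or 'regression'")
    values = [v for row in descriptor_rows for v in row]
    if not values:
        raise ValueError("no descriptor values given")
    # Descriptors count as binary only if every value is exactly 0 or 1.
    binary = all(v in (0, 1) for v in values)
    if binary:
        return "a" if task == "classification" else "b"
    return "c" if task == "classification" else "d"
```

For example, a fingerprint matrix containing only 0 and 1 with an inhibitor/non-inhibitor label maps to 'a', while real-valued descriptors with IC50 values as the response map to 'd'.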
Introduction
Disclaimer and copyright
Types of prediction methods
1.1 Subcellular level prediction: The cellular localization of a protein is one of its most fundamental properties because of the cellular division of labour. Correct prediction of subcellular location can be a major step towards functional prediction, since to perform its function a protein must be in its native location, such as the nucleus or mitochondria, or outside the cell in the case of secretory proteins. The native subcellular localization of a protein is therefore one of the indicators of its function. Existing subcellular localization methods can be divided into the following categories.
(i) Similarity search based techniques: the query sequence is searched against experimentally annotated proteins.
(ii) Signal sequence based techniques: a number of methods fall under this category, in which the leader or sorting sequence present on the protein itself is used for prediction, e.g. TargetP, PSORTb, SignalP.
(iii) Sequence composition based techniques: a number of methods based on sequence composition have been developed, e.g. SubLoc, NNPSL.
(iv) Organism-specific and location-specific subcellular localization predictions: an organism-specific approach is more useful than a generalized approach.
1.2 Class level prediction: the user predicts the class to which a protein belongs, e.g. DNA-binding or non-binding protein. An example is GPCRsclass.
(i) Classification of proteins e.g. GPCRsclass.
(ii) Nucleotide binding protein predictions: Most functions of DNA/RNA are performed through interaction with proteins. Prediction of DNA/RNA binding proteins can be divided into two categories:
(a) Structure based methods
(b) Sequence-based methods. Examples are DISIS, DBS-Pred.
1.3 Family level prediction: methods in this class predict the protein family. Examples include GPCRpred, GSTPred, GPCRsIdentifier.
1.4 Structure class of proteins e.g. Proclass, TBBpred.
2. Prediction at Residue level: methods in this class predict the particular interacting/binding amino acid residues rather than a property of the full-length protein sequence.
2.1 Prediction of Nucleotide binding residues e.g. Pprint (Prediction of Protein RNA-Interaction).
2.2 Post-translational modification of proteins e.g. ISSPred, DictyOGlyc, NetAcet, NetCGlyc etc.
2.3 Secondary structure predictions e.g. APSSP2.
2.4 Turn predictions e.g. BhairPred, BTEVAL, BetaTPred, Betaturns, AlphaPred etc.
3. Prediction at peptide/epitope level: Because epitope identification is potentially important for developing vaccines against infectious, immune and other antigen-related diseases, epitopes are studied widely by researchers in various fields, and there has been a large expansion of databases, predictive methods and software focusing on different types of epitopes.
3.1 Sequence-based epitope predictions
3.2 Structure-based epitope predictions
3.3 Hybrid prediction methods: combining sequential with structural analysis
Some examples of general epitope prediction methods are ProPred, Propred1, nHLAPred, CTLPred, TAPPred, BcePred and ABCPred.
4. Prediction based on signal sequences
Some examples of signal-based prediction methods are pTARGET, SecretomeP, SignalP, TargetP and ChloroP.
5. Prediction based on Motifs
Major techniques used for developing such methods include MEME, Prosite and PRINTS. Examples include Pseapred and Tbpred.
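Motif-based prediction with Prosite-style patterns can be sketched in Python by translating a pattern into a regular expression and scanning the sequence. This is a minimal illustration, not the code used by Pseapred, Tbpred or the Prosite tools. It assumes the common PROSITE pattern syntax (elements separated by '-', x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) or (n,m) for repeats, '<' and '>' as terminal anchors) and does not handle rarer forms such as '>' inside square brackets. The function names are hypothetical.

```python
import re

# One pattern element: x, a single residue, [allowed] or {forbidden},
# optionally followed by a repeat count such as (3) or (2,4).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        match = ELEMENT.fullmatch(element.strip("<>"))
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except these
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        regex.append(prefix + core + repeat + suffix)
    return "".join(regex)


def find_motif(pattern, sequence):
    """Return 1-based start positions of all matches, including overlapping ones."""
    # The lookahead lets the scan report matches that overlap each other.
    compiled = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() + 1 for m in compiled.finditer(sequence.upper())]
```

As a usage example, the N-glycosylation site pattern N-{P}-[ST]-{P} becomes the regular expression N[^P][ST][^P], and find_motif("N-{P}-[ST]-{P}", "AANGSAPNPTA") reports a single site starting at position 3; the second N is rejected because it is followed by P.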
6. Prediction based on Domains
Protein domains are structural, functional and evolutionary units of proteins. The prediction of domains from sequence information can improve tertiary structure prediction, enhance function annotation and aid in structure determination. MITPred is one such method, for predicting proteins which are destined to localize in mitochondria.
7. Prediction based on profiles
Prosite is a database of protein families, domains and functional sites, described by patterns and profiles, that is used to determine the function of uncharacterized proteins translated from genomic or cDNA sequences.
Evaluation of Bioinformatics Methods
In cross-validation, the original data are partitioned into smaller data sets. The analysis is performed on a single subset, with the results validated against the remaining subsets. The subset used for the analysis is called the training set and the other subsets are called validation sets or testing sets.
Some of the major cross-validation techniques are: Jack Knife, Bootstrapping, Monte Carlo, Three ways and Disjoint. Jack Knife can be classified into two categories: LOOCV (Leave-One-Out Cross-Validation) and K-fold Cross-Validation. In most of our methods we have used five-fold cross-validation, in which the data are divided into five sets; four sets form the training set and the remaining set is the test set. This process is repeated five times so that each set is used once for testing.
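The five-fold procedure described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in our methods: the data are shuffled with a fixed seed, accuracy is used as the quality measure, and fit and predict are caller-supplied placeholder functions for whatever learning method is being evaluated.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if n_items < k:
        raise ValueError("need at least k items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(items, labels, fit, predict, k=5, seed=0):
    """Average accuracy over k folds; each item is tested exactly once."""
    scores = []
    for train, test in k_fold_splits(len(items), k, seed):
        model = fit([items[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, items[i]) == labels[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)
```

With k=5, each round trains on four folds and tests on the fifth, and the reported figure is the mean of the five test accuracies. Setting k equal to the number of items gives leave-one-out cross-validation.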
GPSR 1.0 package
GPSR 2.0 package
To find out which version of Perl / R is installed on your system, use the following commands:
perl -v
R --version
If Perl and R are already installed, these commands print their version numbers.
GPSR.2.0 has five parts:
Tools for Chemoinformatics: Part A
Tools for Chemoinformatics: Part B
Tools for Chemoinformatics: Part C
Tools for Chemoinformatics: Part D
Miscellaneous