Genomics, Proteomics and Systems Biology Resources

This page briefly describes the GPSR.1.0 and GPSR.2.0 packages. Both packages provide programs that can be used as building blocks for developing complex prediction modules. GPSR.1.0 mainly contains small Perl programs for bioinformatics problems, whereas GPSR.2.0 contains small Perl- and R-based programs for biostatistics and chemoinformatics.
The following are the important programs included in the GPSR.1.0 package:

Program | Purpose | Usage
fasta2sfasta | To convert fasta format to single fasta format | fasta2sfasta -i seq.fa -o seq.sfa
pro2aac | To calculate amino acid composition of a protein | pro2aac -i seq.sfa -o seq.out
pro2aac_nt | To calculate amino acid composition of N-terminal (nt) residues of a protein | pro2aac_nt -i seq.sfa -o seq.out -n 5
pro2aac_ct | To calculate amino acid composition of C-terminal (ct) residues of a protein | pro2aac_ct -i seq.sfa -o seq.out -n 5
pro2aac_rest.pl | To calculate amino acid composition of a protein after removing N- and C-terminal residues | pro2aac_rest -i seq.sfa -o seq.out -n 5 -c 5
pro2aac_split | To calculate split amino acid composition (SSAC) of a protein | pro2aac_split -i seq.sfa -o seq.out -n 3
pro2dpc | To calculate dipeptide composition of a protein | pro2dpc -i seq.sfa -o seq.out
pro2dpc_nt | To calculate dipeptide composition of N-terminal (nt) residues of a protein | pro2dpc_nt -i seq.sfa -o seq.out -n 5
pro2dpc_ct | To calculate dipeptide composition of C-terminal (ct) residues of a protein | pro2dpc_ct -i seq.sfa -o seq.out -n 5
pro2tpc | To calculate tripeptide composition of a protein | pro2tpc -i seq.sfa -o seq.out
add_cols | To add columns of two files | add_cols -i se1.out -c se2.out -o seq.out
col2svm | To generate SVM_light input format | col2svm -i se1.out -o svm.out -s +1
col_mult | To multiply each column of the input file by a number | col_mult -i se1.out -o se1_mult -n 0.1
col_mult_sel | To multiply selected columns by a number | col_mult_sel -i se1.out -o se1_mult -n 10 -a 1 -b 3
col_rem | To remove selected columns from a file | perl col_rem -i seq.out -o seq.rm -a 1 -b 2
col_ext | To extract selected columns from a file | col_ext -i seq.out -o seq.ext -a 5 -b 10
col_corr | To compute the correlation coefficient between two columns | col_corr -i pos -a 1 -b 6
col_avg | To calculate the average column of two files | col_avg -a pos1 -b pos2 -o out
seq2pssm_imp | To calculate the PSSM matrix in column format without any normalization | seq2pssm_imp -i seq1.fa -o pssm.out -d nr
pssm_n1 | To normalize a PSSM profile using the formula 1/(1+e^(-x)) | pssm_n1 -i pssm.out -o pssm_n1
pssm_n2 | To normalize a PSSM profile using the formula (numb - min)/(max - min) | pssm_n2 -i pssm.out -o pssm_n2
pssm_n3 | To normalize a PSSM profile using the formula (numb - min)*100/(max - min) | pssm_n3 -i pssm.out -o pssm_n3
pssm_n4 | To normalize a PSSM profile using the formula 1/(1+e^(-x/100)) | pssm_n4 -i pssm.out -o pssm_n4
pssm_comp | To compute PSSM composition (400 points) | pssm_comp -i pssm_n4 -o pssm_n4.out
col_sig | Significance of columns in two column files | col_sig -i file1 -j file2 >out
pssm2pat | To generate patterns of a given size from a PSSM matrix | pssm2pat -i pssm.out -o pssm_pat -w 5
pssm_smooth | To generate a smoothed PSSM profile for plotting | pssm_smooth -i pssm.out -o pssm_pat -w 5
seq2motif | To create motifs with a sliding window of user-defined length, with the option of adding terminal X | seq2motif -i seq1.fa -o motif.out -w 5 -x y
motif2bin | To make binary input from the multifasta motif file | motif2bin -i motif_1.out -o bin.out -x y
blast_similarity | To perform a BLAST search | blast_similarity -i fasta -d nr -j 3 -e 1 -o blast.out
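
As a rough illustration of what some of the programs above compute, the following Python sketch derives amino acid and dipeptide composition from a sequence and applies the four normalization formulas listed for pssm_n1 to pssm_n4. It is not the GPSR code: the function names are hypothetical, composition is assumed to be reported as a percentage, and the output layout of the Perl programs is not reproduced.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(seq):
    """Percentage of each standard amino acid in seq (assumed convention)."""
    seq = seq.upper()
    counts = {a: seq.count(a) for a in AMINO_ACIDS}
    total = sum(counts.values())
    return {a: (100.0 * c / total if total else 0.0) for a, c in counts.items()}


def dipeptide_composition(seq):
    """Percentage of each of the 400 overlapping dipeptides in seq."""
    seq = seq.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    total = sum(counts.values())
    return {p: (100.0 * c / total if total else 0.0) for p, c in counts.items()}


def pssm_n1(x):
    """1/(1+e^(-x))"""
    return 1.0 / (1.0 + math.exp(-x))


def pssm_n2(x, lo, hi):
    """(numb - min)/(max - min)"""
    return (x - lo) / (hi - lo)


def pssm_n3(x, lo, hi):
    """(numb - min)*100/(max - min)"""
    return (x - lo) * 100.0 / (hi - lo)


def pssm_n4(x):
    """1/(1+e^(-x/100))"""
    return 1.0 / (1.0 + math.exp(-x / 100.0))
```

Under the same assumptions, terminal variants such as pro2aac_nt with -n 5 would correspond to calling aa_composition(seq[:5]), and pro2aac_ct to aa_composition(seq[-5:]).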

The GPSR.2.0 package contains the following Perl- and R-based programs:

Installation of gpsR version 2.0:
gpsR version 2.0 is a collection of programs written in Perl and R. Before using it, ensure that Perl and R are installed on your operating system. (Perl is installed by default on Unix-based operating systems.)

To check whether Perl is installed on your system, type the following command: perl -v

If Perl is installed, the command prints details of its version number.

To check whether R is installed on your system, type the following command: R --version

If R is installed, the command prints details of its version number.
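
The two checks above can also be scripted. The sketch below is only a convenience and not part of gpsR; the helper name tool_version is hypothetical. It looks for each interpreter on the PATH and, if found, runs it with the same flag as the commands above.

```python
import shutil
import subprocess


def tool_version(command, version_flag):
    """Return the first line of the tool's version output, or None if absent."""
    path = shutil.which(command)
    if path is None:
        return None  # command is not on the PATH
    result = subprocess.run([path, version_flag], capture_output=True, text=True)
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    # Same checks as "perl -v" and "R --version"
    for command, flag in (("perl", "-v"), ("R", "--version")):
        version = tool_version(command, flag)
        print(f"{command}: {version if version is not None else 'not found'}")
```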

GPSR.2.0 is divided into five parts:


Tools for Chemo-informatics: Part A

This part covers the case where you are developing a method for classifying molecules, for example as inhibitors and non-inhibitors, using binary descriptors, where each descriptor has the value 0 or 1 (for example, fingerprints/descriptors from PaDEL). In this situation, we advise the following programs.

Program | Dependency | Purpose
desc_imp_a.pl | i. R; ii. desc_imp_a.R | Gives n most important descriptors for predicting positive and negative examples (n given by user)
desc_sel_a.pl | i. R; ii. desc_sel_a.R; iii. make_selectedfile.R | Selects the final set of descriptors for prediction by removing very similar descriptors.
desc_graph_a.pl | i. R; ii. desc_graph_a.R | Creates barplot of importance of descriptors (in terms of IDD) vs important descriptors
desc_mod_a.pl | i. R; ii. desc_mod_a.R | Modifies the binary descriptors based on relative frequency in positive and negative datasets
desc_clust_a.pl | i. R; ii. chem_desc_clust_a.R | Performs clustering of descriptors (i.e. column wise) with graphical representation.
chem_clust_a.pl | i. R; ii. chem_desc_clust_a.R | Performs clustering of chemicals (i.e. row wise) with graphical representation.
sim_chem_a.pl | i. R; ii. fingerprint package in R; iii. sim_chem_a.R | Finds the most similar chemical from the database of chemicals based on distance between descriptors of chemicals.
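
To make the similarity search performed by sim_chem_a.pl more concrete, here is a minimal Python sketch. The page does not say which distance sim_chem_a.pl uses, so the Tanimoto coefficient (a common choice for binary fingerprints) is an assumption, as are the function names and the dictionary layout of the database.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two equal-length 0/1 fingerprints."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0  # two all-zero fingerprints match


def most_similar(query, database):
    """Return (name, similarity) of the database entry closest to query.

    database maps a chemical name to its fingerprint (hypothetical layout).
    """
    return max(
        ((name, tanimoto(query, fp)) for name, fp in database.items()),
        key=lambda item: item[1],
    )
```

A distance can be obtained as 1 minus the similarity, so the most similar chemical is also the one at the smallest distance.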

Example:
Example | Description
desc_imp_a.pl | Calculates the importance of each descriptor (in terms of IDD/IDR/IDL) for classifying positive and negative examples and writes a file listing the top n important descriptors (the user can select the value of n)
usage | desc_imp_a.pl -i file.pos -j file.neg -n 10 -s 4
file.pos | Comma-separated file in which rows represent the samples to classify and columns represent descriptors, e.g.
    0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0
    0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0
    0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
    0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0
file.neg | Comma-separated file in which rows represent the samples to classify and columns represent descriptors, e.g.
    0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
    0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
    0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0
    0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0
-n = 10 | Number of top descriptors to select (here n is 10)
-s = 4 | Selects the measure used to calculate the importance of descriptors: 4 = IDD, 5 = IDR, 6 = IDL
Output | i) out.desc_imp_a, giving the top n descriptors with their names and their IDD/IDR/IDL values in tab-separated format
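
The general shape of this calculation can be sketched in Python using the example file.pos and file.neg rows shown above. This page does not define IDD, IDR or IDL, so the score used here (the absolute difference between the fraction of positive and negative samples in which a descriptor is set) is only a stand-in chosen for illustration, and all function names are hypothetical.

```python
import csv
import io

# Example rows copied from file.pos and file.neg above
POS = """0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0
0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0
0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0
"""
NEG = """0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0
0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0
"""


def read_matrix(text):
    """Parse comma-separated 0/1 rows into a list of integer lists."""
    return [[int(v) for v in row] for row in csv.reader(io.StringIO(text)) if row]


def column_frequency(matrix):
    """Fraction of samples (rows) in which each descriptor (column) is 1."""
    return [sum(col) / len(matrix) for col in zip(*matrix)]


def top_descriptors(pos, neg, n):
    """Top n (1-based column, score) pairs by the stand-in score (not IDD)."""
    scores = [abs(p - q)
              for p, q in zip(column_frequency(pos), column_frequency(neg))]
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return [(i + 1, scores[i]) for i in ranked[:n]]


if __name__ == "__main__":
    for column, score in top_descriptors(read_matrix(POS), read_matrix(NEG), 4):
        print(f"{column}\t{score}")  # tab-separated, like out.desc_imp_a
```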

Tools for Chemo-informatics: Part B

This part covers the case where you are developing a method for regression (such as predicting the IC50 value of chemicals) using binary descriptors, where each descriptor has the value 0 or 1. In this situation, we advise the following programs.

Program | Dependency | Purpose
desc_imp_b.pl | i. R; ii. desc_imp_b.R | Gives n most important descriptors for predicting positive and negative examples (n given by user)
desc_sel_b.pl | i. R; ii. desc_sel_b.R; iii. make_selectedfile.R | Selects the final set of descriptors for prediction by removing very similar descriptors
desc_graph_b.pl | i. R; ii. desc_graph_b.R | Creates barplot of importance of descriptors (in terms of IDD) vs important descriptors
desc_clust_b.pl | i. R; ii. chem_desc_clust_b.R | Performs clustering of descriptors (i.e. column wise) with graphical representation.
chem_clust_b.pl | i. R; ii. chem_desc_clust_b.R | Performs clustering of chemicals (i.e. row wise) with graphical representation.
sim_chem_b.pl | i. R; ii. fingerprint package in R; iii. sim_chem_b.R | Finds the most similar chemical from the database of chemicals based on distance between descriptors of chemicals.
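
The idea of removing very similar descriptors, as done by the desc_sel programs, can be sketched as a greedy filter. How GPSR measures similarity between descriptors is not stated on this page; the agreement fraction, the 0.95 threshold and the function names below are assumptions made for illustration only.

```python
def column_similarity(a, b):
    """Fraction of samples in which two descriptor columns agree."""
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def select_descriptors(matrix, threshold=0.95):
    """Greedily keep columns that are not too similar to any kept column.

    matrix is a list of rows (samples); returns 0-based indices of kept columns.
    """
    columns = list(zip(*matrix))
    kept = []
    for idx, col in enumerate(columns):
        if all(column_similarity(col, columns[k]) < threshold for k in kept):
            kept.append(idx)
    return kept
```

With this greedy order the first of a group of near-identical descriptors is retained and the later ones are dropped.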

Tools for Chemo-informatics: Part C

This part covers the case where you are developing a method for classification (for example, inhibitors and non-inhibitors) using descriptors that have real values. In this situation, we advise the following programs.

Program | Dependency | Purpose
desc_imp_c.pl | i. R; ii. desc_imp_c.R | Gives n most important descriptors for predicting positive and negative examples (n given by user)
desc_sel_c.pl | i. R; ii. desc_sel_c.R; iii. make_selectedfile.R | Selects the final set of descriptors for prediction by removing very similar descriptors
desc_graph_c.pl | i. R; ii. desc_graph_c.R | Creates barplot of importance of descriptors (in terms of IDD) vs important descriptors
desc_clust_c.pl | i. R; ii. chem_desc_clust_c.R | Performs clustering of descriptors (i.e. column wise) with graphical representation.
chem_clust_c.pl | i. R; ii. chem_desc_clust_c.R | Performs clustering of chemicals (i.e. row wise) with graphical representation.
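
The difference between row-wise clustering of chemicals and column-wise clustering of descriptors can be illustrated with a small Python sketch. The clustering method and distance used by the GPSR R scripts are not stated here, so single-linkage agglomerative clustering with Euclidean distance is an assumption, the function names are hypothetical, and the graphical output is omitted.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(items, k):
    """Merge the two closest clusters until k clusters remain.

    Returns a list of clusters, each a sorted list of item indices.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single linkage: distance between the closest members
                d = min(euclidean(items[i], items[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


def cluster_rows(matrix, k):
    """Cluster chemicals (rows)."""
    return single_linkage(matrix, k)


def cluster_columns(matrix, k):
    """Cluster descriptors (columns) by clustering the transposed matrix."""
    return single_linkage([list(col) for col in zip(*matrix)], k)
```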

Tools for Chemo-informatics: Part D

This part covers the case where you are developing a method for regression analysis based on real values of the response (say, IC50 values), using descriptors with real values. In this situation, we advise the following programs.

Program | Dependency | Purpose
desc_imp_d.pl | i. R; ii. desc_imp_d.R | Gives n most important descriptors based upon correlation with response (n given by user). An additional file with all descriptors with correlation values is also given as output
desc_sel_d.pl | i. R; ii. desc_sel_d.R; iii. make_selectedfile.R | Selects the final set of descriptors for prediction by removing very similar descriptors
desc_graph_d.pl | i. R; ii. desc_graph_d.R | Creates barplot of importance of descriptors (in terms of IDD) vs important descriptors
desc_clust_d.pl | i. R; ii. chem_desc_clust_d.R | Performs clustering of descriptors (i.e. column wise) with graphical representation.
chem_clust_d.pl | i. R; ii. chem_desc_clust_d.R | Performs clustering of chemicals (i.e. row wise) with graphical representation.
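
Ranking descriptors by their correlation with the response, as described for desc_imp_d.pl, might look like the Python sketch below. It assumes Pearson correlation and ranking by absolute value; neither detail is confirmed on this page, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_by_correlation(matrix, response, n):
    """Return the top n (1-based column, r) pairs, ordered by |r|.

    matrix is a list of rows (chemicals); response holds one value per row,
    for example an IC50 value.
    """
    columns = list(zip(*matrix))
    scored = [(i + 1, pearson(col, response)) for i, col in enumerate(columns)]
    scored.sort(key=lambda item: (-abs(item[1]), item[0]))
    return scored[:n]
```

Calling rank_by_correlation with n equal to the number of columns gives the full list of descriptors with their correlation values, similar in spirit to the additional output file mentioned above.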

Miscellaneous

These programs are used for file preparation and manipulation, which can be helpful in any bioinformatics or chemoinformatics work.

Program | Dependency | Purpose
make_selectedfile.R | R | Extracts specific columns from the input file and writes them to an output file
shiftcol.pl | perl | Shifts the 2 columns in a file and writes them to an output file
rem_identicalcol.R | R | Removes identical columns in a file and writes the unique columns to an output file
matrix_optimization.pl | R | For a given positive and negative dataset of protein sequences, this program optimizes the substitution matrix, which can be used in classification of positive and negative examples
randomizefile.pl | perl | Shuffles the rows of a file randomly and writes them to an output file (can also extract a user-defined number of lines randomly from the input file and write them to an output file)
mean.pl | R | Calculates row-wise or column-wise mean of a file in csv format.
median.pl | R | Calculates row-wise or column-wise median of a file in csv format.
stdev.pl | R | Calculates row-wise or column-wise standard deviation of a file in csv format.
stderr.pl | R | Calculates row-wise or column-wise standard error of a file in csv format.
correlation.pl | R | Calculates correlation of all columns of a file or between 2 columns.
barplot.pl | R | Draws a barplot between 2 properties.
roc.pl | R and R-libraries | Draws a ROC plot.
PSSM-pattern.pl | gpsr_1.0, blastpgp (for psiblast) | Makes PSSM profile of positive and negative patterns for prediction at residue level (see gpsr_1.0 manual for residue-level prediction).
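
The row-wise and column-wise statistics offered by mean.pl, median.pl, stdev.pl and stderr.pl can be illustrated with a short Python sketch. It is not the GPSR implementation: the function names are hypothetical, and the use of the sample standard deviation (n - 1 denominator) and of standard error as stdev divided by the square root of n are assumptions.

```python
import csv
import io
import math
import statistics


def read_csv_numbers(text):
    """Parse CSV text into a list of rows of floats."""
    return [[float(v) for v in row] for row in csv.reader(io.StringIO(text)) if row]


def summarize(matrix, stat="mean", axis="column"):
    """Compute one statistic per column (default) or per row."""
    vectors = list(zip(*matrix)) if axis == "column" else matrix
    funcs = {
        "mean": statistics.mean,
        "median": statistics.median,
        "stdev": statistics.stdev,  # sample standard deviation (n - 1)
        "stderr": lambda v: statistics.stdev(v) / math.sqrt(len(v)),
    }
    return [funcs[stat](list(v)) for v in vectors]
```

For example, summarize(read_csv_numbers(text), "median", "row") returns one median per line of the CSV input.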