Genomics, proteomics and System Biology Resources (GPSR) at OSDDlinux


This page briefly describes the GPSR.1.0 and GPSR.2.0 packages. Both packages provide programs that can be used as building blocks to develop complex prediction modules. GPSR.1.0 mainly contains small Perl programs for bioinformatics problems, whereas GPSR.2.0 contains small Perl and R based programs for biostatistics and chemoinformatics.
The following are the important programs included in the GPSR.1.0 package:

Program	Purpose	Usage
fasta2sfasta	Convert fasta format to single fasta format	fasta2sfasta -i seq.fa -o seq.sfa
pro2aac	To calculate amino acid composition of protein	pro2aac -i seq.sfa -o seq.out
pro2aac_nt	To calculate amino acid composition of N-terminal (nt) residues of a protein	pro2aac_nt -i seq.sfa -o seq.out -n 5
pro2aac_ct	To calculate amino acid composition of C-terminal (ct) residues of a protein	pro2aac_ct -i seq.sfa -o seq.out -n 5
pro2aac_rest.pl	To calculate amino acid composition of a protein after removing N- and C-terminal residues	pro2aac_rest -i seq.sfa -o seq.out -n 5 -c 5
pro2aac_split	To calculate split amino acid composition (SAAC) of a protein	pro2aac_split -i seq.sfa -o seq.out -n 3
pro2dpc	To calculate dipeptide composition of protein	pro2dpc -i seq.sfa -o seq.out
pro2dpc_nt	To calculate dipeptide composition of N-terminal (nt) residues of a protein	pro2dpc_nt -i seq.sfa -o seq.out -n 5
pro2dpc_ct	To calculate dipeptide composition of C-terminal (ct) residues of a protein	pro2dpc_ct -i seq.sfa -o seq.out -n 5
pro2tpc	To calculate tripeptide composition of protein	pro2tpc -i seq.sfa -o seq.out
add_cols	To add columns of two files	add_cols -i se1.out -c se2.out -o seq.out
col2svm	To generate SVM_light input format	col2svm -i se1.out -o svm.out -s +1
col_mult	To multiply each column of the input file by a number	col_mult -i se1.out -o se1_mult -n 0.1
col_mult_sel	To multiply selected columns by a number	col_mult_sel -i se1.out -o se1_mult -n 10 -a 1 -b 3
col_rem	To remove selective columns from a file	perl col_rem -i seq.out -o seq.rm -a 1 -b 2
col_ext	To extract selective columns from a file	col_ext -i seq.out -o seq.ext -a 5 -b 10
col_corr	To compute the correlation coefficient between two columns	col_corr -i pos -a 1 -b 6
col_avg	To calculate average column of two files	col_avg -a pos1 -b pos2 -o out
seq2pssm_imp	To calculate PSSM matrix in column format without any normalization	seq2pssm_imp -i seq1.fa -o pssm.out -d nr
pssm_n1	To normalize PSSM profile based on the 1/(1+e^-x) formula	pssm_n1 -i pssm.out -o pssm_n1
pssm_n2	To normalize PSSM profile based on the (x - min)/(max - min) formula	pssm_n2 -i pssm.out -o pssm_n2
pssm_n3	To normalize PSSM profile based on the (x - min)*100/(max - min) formula	pssm_n3 -i pssm.out -o pssm_n3
pssm_n4	To normalize PSSM profile based on the 1/(1+e^-(x/100)) formula	pssm_n4 -i pssm.out -o pssm_n4
pssm_comp	To compute PSSM composition (400 points)	pssm_comp -i pssm_n4 -o pssm_n4.out
col_sig	Significance of columns in two column files	col_sig -i file1 -j file2 >out
pssm2pat	To generate patterns of given size from PSSM matrix	pssm2pat -i pssm.out -o pssm_pat -w 5
pssm_smooth	To generate a smoothed PSSM profile for plotting	pssm_smooth -i pssm.out -o pssm_pat -w 5
seq2motif	To create motifs by sliding window of user defined length with option of adding terminal X	seq2motif -i seq1.fa -o motif.out -w 5 -x y
motif2bin	To make binary input from the multifasta motif file	motif2bin -i motif_1.out -o bin.out -x y
blast_similarity	To perform blast	blast_similarity -i fasta -d nr -j 3 -e 1 -o blast.out
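
To illustrate what the composition and normalization programs in the table compute, here is a minimal Python sketch. It is not the GPSR code: the function names are hypothetical, the percentage scale of the composition values is an assumption, and because the table does not say over which range min and max are taken, both are passed in explicitly.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Percentage of each standard amino acid (the quantity pro2aac reports)."""
    seq = seq.upper()
    counts = {a: seq.count(a) for a in AMINO_ACIDS}
    total = sum(counts.values())
    return {a: (100.0 * c / total if total else 0.0) for a, c in counts.items()}


def dipeptide_composition(seq):
    """Percentage of each of the 400 possible dipeptides (cf. pro2dpc)."""
    seq = seq.upper()
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # ignore pairs containing non-standard residues
            counts[pair] += 1
    total = sum(counts.values())
    return {p: (100.0 * c / total if total else 0.0) for p, c in counts.items()}


def sigmoid_norm(x, scale=1.0):
    """1/(1+e^-(x/scale)): scale=1 is the pssm_n1 formula, scale=100 the pssm_n4 one."""
    return 1.0 / (1.0 + math.exp(-x / scale))


def minmax_norm(x, lo, hi, factor=1.0):
    """(x-min)*factor/(max-min): factor=1 is the pssm_n2 formula, factor=100 the pssm_n3 one."""
    return (x - lo) * factor / (hi - lo)
```

For the terminal variants (for example pro2aac_nt with -n 5), the same functions would presumably be applied to a slice of the sequence such as seq[:5]; that reading of the -n option is an assumption.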

The GPSR.2.0 package contains the following Perl and R based programs:

Installation of gpsR version 2.0:
gpsR version 2.0 is a collection of programs written in Perl and R. Before using it, ensure that Perl and R are installed on your operating system. (Perl is installed by default on Unix-based operating systems.)

To check whether Perl is installed on your system, type the following command: perl -v

If it is installed, it will give you the details about the version number of Perl.

To check whether R is installed on your system, type the following command: R --version

If it is installed, it will give you the details about the version number of R.
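
The two checks above can also be scripted. The following is a minimal Python sketch that runs the same perl -v and R --version commands; the helper name tool_version is hypothetical and is not part of GPSR.

```python
import shutil
import subprocess


def tool_version(command, flag):
    """Return the first line of the tool's version output, or None if it is not on PATH."""
    if shutil.which(command) is None:
        return None
    result = subprocess.run([command, flag], capture_output=True, text=True)
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


if __name__ == "__main__":
    # The same commands as in the text: perl -v and R --version
    for command, flag in (("perl", "-v"), ("R", "--version")):
        version = tool_version(command, flag)
        print(f"{command}: {version if version is not None else 'not found'}")
```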

GPSR.2.0 is divided into five parts:

Tools for Chemo-informatics: Part A

This part covers the case where you are developing a method for classifying molecules, such as inhibitors and non-inhibitors, using binary descriptors, i.e. descriptors that take the value 0 or 1 (for example, fingerprints/descriptors from PADEL). In this situation, we recommend the following programs.

Program	Dependency	Purpose
desc_imp_a.pl	i. R ii. desc_imp_a.R	Gives n most important descriptors for predicting positive and negative examples (n given by user)
desc_sel_a.pl	i. R ii. desc_sel_a.R iii. make_selectedfile.R	Selects the final set of descriptors for prediction by removing very similar descriptors.
desc_graph_a.pl	i. R ii. desc_graph_a.R	Creates barplot of importance of descriptors (in terms of IDD) vs important Descriptors
desc_mod_a.pl	i. R ii. desc_mod_a.R	Modifies the binary descriptors based on relative frequency in positive and negative datasets
desc_clust_a.pl	i. R ii. chem_desc_clust_a.R	Performs clustering of descriptors (i.e. column wise) with graphical representation.
chem_clust_a.pl	i. R ii. chem_desc_clust_a.R	Performs clustering of chemicals (i.e. row wise) with graphical representation.
sim_chem_a.pl	i. R ii. fingerprint package in R iii. sim_chem_a.R	Finds the most similar chemical from the database of chemicals based on distance between descriptors of chemicals.
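
As an illustration of the kind of search sim_chem_a.pl performs, here is a minimal Python sketch of a nearest-neighbour lookup over binary fingerprints. The page does not say which distance measure the program uses; Tanimoto similarity is assumed here purely for illustration, and the function names and the tiny database are hypothetical.

```python
def tanimoto_similarity(a, b):
    """Shared on-bits divided by bits set in either fingerprint (assumed measure)."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 1.0


def most_similar(query, database):
    """database maps a chemical name to its fingerprint; returns (name, similarity)."""
    scored = ((name, tanimoto_similarity(query, fp)) for name, fp in database.items())
    return max(scored, key=lambda item: item[1])


# Hypothetical four-bit fingerprints
database = {"chem1": [1, 0, 1, 0], "chem2": [1, 1, 0, 0]}
name, score = most_similar([1, 0, 1, 1], database)
print(name, round(score, 3))
```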

example:

Example	Description
desc_imp_a.pl	Calculates the importance of each descriptor (in terms of IDD/IDR/IDL) for classifying positive and negative examples and writes a file listing the top n descriptors (the user selects the value of n)
Usage	desc_imp_a.pl -i file.pos -j file.neg -n 10 -s 4
file.pos	Comma-separated file in which rows represent samples to classify and columns represent descriptors, e.g. 0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0 0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0 0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0 0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0
file.neg	Comma-separated file in which rows represent samples to classify and columns represent descriptors, e.g. 0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0 0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0 0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0 0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0
-n = 10	Number of top descriptors to select (here n is 10)
-s = 4	Selects the measure used to calculate descriptor importance: 4 = IDD, 5 = IDR, 6 = IDL
Output	out.desc_imp_a, a tab-separated file giving the top n descriptors with their names and their IDD/IDR/IDL values
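
To make the input format concrete, the following Python sketch parses the file.pos and file.neg example rows shown above and ranks descriptors by how differently often they are set in the two classes. This page does not define IDD, IDR or IDL, so the frequency gap used here is only an illustrative stand-in and not the measure desc_imp_a.pl computes; all function names are hypothetical.

```python
def read_descriptors(text):
    """Parse comma-separated rows (one sample per line) into lists of ints."""
    return [[int(v) for v in line.split(",")] for line in text.strip().splitlines()]


def column_frequencies(rows):
    """Fraction of samples in which each binary descriptor is set to 1."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def rank_by_frequency_gap(pos_rows, neg_rows, top_n):
    """Stand-in importance: frequency in positives minus frequency in negatives."""
    pos_f = column_frequencies(pos_rows)
    neg_f = column_frequencies(neg_rows)
    gaps = [(i, p - q) for i, (p, q) in enumerate(zip(pos_f, neg_f))]
    gaps.sort(key=lambda item: abs(item[1]), reverse=True)
    return gaps[:top_n]


# The example rows from the table, one sample per line
POS = """0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0
0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0
0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0"""
NEG = """0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0
0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0
0,1,0,0,0,1,0,0,0,0,0,0,0,0,1,0
0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0"""

for index, gap in rank_by_frequency_gap(read_descriptors(POS), read_descriptors(NEG), 4):
    print(index, gap)  # zero-based descriptor index and its frequency gap
```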

Tools for Chemo-informatics: Part B

This part covers the case where you are developing a method for regression (such as predicting the IC50 value of chemicals) using binary descriptors, i.e. descriptors that take the value 0 or 1. In this situation, we recommend the following programs.

Program	Dependency	Purpose
desc_imp_b.pl	i. R ii. desc_imp_b.R	Gives n most important descriptors for predicting positive and negative examples (n given by user)
desc_sel_b.pl	i. R ii. desc_sel_b.R iii. make_selectedfile.R	Selects the final set of descriptors for prediction by removing very similar descriptors
desc_graph_b.pl	i. R ii. desc_graph_b.R	Creates barplot of importance of descriptors (in terms of IDD) vs important Descriptors
desc_clust_b.pl	i. R ii. chem_desc_clust_b.R	Performs clustering of descriptors (i.e. column wise) with graphical representation.
chem_clust_b.pl	i. R ii. chem_desc_clust_b.R	Performs clustering of chemicals (i.e. row wise) with graphical representation.
sim_chem_b.pl	i. R ii. fingerprint package in R iii. sim_chem_b.R	Finds the most similar chemical from the database of chemicals based on distance between descriptors of chemicals.

Tools for Chemo-informatics: Part C

This part covers the case where you are developing a method for classification (such as inhibitors and non-inhibitors) using descriptors with real values. In this situation, we recommend the following programs.

Program	Dependency	Purpose
desc_imp_c.pl	i. R ii. desc_imp_c.R	Gives n most important descriptors for predicting positive and negative examples (n given by user)
desc_sel_c.pl	i. R ii. desc_sel_c.R iii. make_selectedfile.R	Selects the final set of descriptors for prediction by removing very similar descriptors
desc_graph_c.pl	i. R ii. desc_graph_c.R	Creates barplot of importance of descriptors (in terms of IDD) vs important Descriptors
desc_clust_c.pl	i. R ii. chem_desc_clust_c.R	Performs clustering of descriptors (i.e. column wise) with graphical representation.
chem_clust_c.pl	i. R ii. chem_desc_clust_c.R	Performs clustering of chemicals (i.e. row wise) with graphical representation.
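
The idea of "removing very similar descriptors" can be sketched as follows. This Python example is not the desc_sel_c.pl implementation: the use of Pearson correlation, the greedy left-to-right pass, and the 0.95 threshold are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def drop_similar_columns(rows, threshold=0.95):
    """Keep a column only if it is not strongly correlated with an already kept one."""
    columns = list(zip(*rows))
    kept = []
    for idx, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) < threshold for k in kept):
            kept.append(idx)
    return kept


# Hypothetical data: the second column is exactly twice the first
rows = [[1, 2, 5], [2, 4, 1], [3, 6, 4], [4, 8, 2]]
print(drop_similar_columns(rows))  # zero-based indices of the retained columns
```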

Tools for Chemo-informatics: Part D

This part covers the case where you are developing a method for regression analysis based on real-valued responses (say, IC50 values) using descriptors with real values. In this situation, we recommend the following programs.

Program	Dependency	Purpose
desc_imp_d.pl	i. R ii. desc_imp_d.R	Gives n most important descriptors based upon correlation with response. (n given by user). An additional file with all descriptors with correlation values is also given as output
desc_sel_d.pl	i. R ii. desc_sel_d.R iii. make_selectedfile.R	Selects the final set of descriptors for prediction by removing very similar descriptors
desc_graph_d.pl	i. R ii. desc_graph_d.R	Creates barplot of importance of descriptors (in terms of IDD) vs important Descriptors
desc_clust_d.pl	i. R ii. chem_desc_clust_d.R	Performs clustering of descriptors (i.e. column wise) with graphical representation.
chem_clust_d.pl	i. R ii. chem_desc_clust_d.R	Performs clustering of chemicals (i.e. row wise) with graphical representation.
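
Ranking descriptors by their correlation with the response, as described for desc_imp_d.pl, can be sketched in Python as follows. The choice of Pearson correlation and of ranking by absolute value are assumptions, since the page only says "correlation with response"; the function names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def top_correlated(rows, response, n):
    """Return (column index, correlation) for the n columns most correlated with response."""
    scored = [(i, pearson(col, response)) for i, col in enumerate(zip(*rows))]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:n]


# Hypothetical descriptors (rows are chemicals) and response values
rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
response = [10.0, 20.0, 30.0, 40.0]
for index, r in top_correlated(rows, response, 2):
    print(index, round(r, 3))
```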

Miscellaneous

These programs are used for file preparation and manipulation, which can be helpful in any bioinformatics or chemoinformatics work.

Program	Dependency	Purpose
make_selectedfile.R	R	Extracts specific columns from input file and writes in output file
shiftcol.pl	perl	Shifts the 2 columns in a file and writes in an output file
rem_identicalcol.R	R	Removes identical columns in a file and writes unique columns in output file
matrix_optimization.pl	R	For a given positive and negative dataset of protein sequences this program optimizes the substitution matrix which can be used in classification of positive and negative examples
randomizefile.pl	perl	Shuffles the rows of a file randomly and writes them to an output file (it can also extract a user-defined number of lines at random from the input file and write them to the output file).
mean.pl	R	Calculates row wise or column wise mean of file in csv format.
median.pl	R	Calculates row wise or column wise median of file in csv format.
stdev.pl	R	Calculates row wise or column wise standard deviation of file in csv format.
stderr.pl	R	Calculates row wise or column wise standard error of file in csv format.
correlation.pl	R	Calculates correlation of all columns of a file or between 2 columns.
barplot.pl	R	Draws a barplot between 2 properties.
roc.pl	R and R-libraries	Draws a roc plot.
PSSM-pattern.pl	gpsr_1.0, blastpgp (for psiblast).	Makes PSSM profile of positive and negative patterns for prediction at residue level (see gpsr_1.0 manual for residue level prediction).
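
For reference, the column-wise statistics that mean.pl, median.pl, stdev.pl and stderr.pl report can be sketched in Python as follows. This is not the GPSR code: it assumes a purely numeric CSV without a header row, uses the sample standard deviation (n - 1 denominator), and takes the standard error as stdev/sqrt(n); the function name is hypothetical.

```python
import csv
import io
import math
import statistics


def column_stats(csv_text):
    """Return mean, median, stdev and standard error for each column of a numeric CSV."""
    rows = [[float(v) for v in row] for row in csv.reader(io.StringIO(csv_text)) if row]
    stats = []
    for col in zip(*rows):
        sd = statistics.stdev(col)  # sample standard deviation (assumed)
        stats.append({
            "mean": statistics.mean(col),
            "median": statistics.median(col),
            "stdev": sd,
            "stderr": sd / math.sqrt(len(col)),
        })
    return stats


# Hypothetical three-row, two-column input
for column in column_stats("1,10\n2,20\n3,30\n"):
    print(column)
```

Row-wise statistics follow from the same function by transposing the input first.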