Integration of evolutionary features to identify the functionally important residues in Major Facilitator Superfamily (MFS) transporters.


Released in March 2009


Introduction

This project provides an implementation of the Integration Score (IS), which helps users narrow down candidate functional residues. Based on the hypothesis that functional residues are conserved and have more co-evolutionarily coupled partners than non-functional residues, we developed IS by combining sequence conservation and co-evolutionary information. The code was first applied to identify functional residues in three Major Facilitator Superfamily (MFS) transporters, LacY, GlpT, and EmrD, which have known 3D structures and enough characterized functional residues to validate the performance of our method. Using this method, we found that the conserved cores of evolutionarily coupled residues are responsible for specific substrate recognition and translocation in MFS transporters. The source code is provided here so that the method can be applied to find functionally important residues in other classes of proteins.


Download

Source code (Python) for Linux and Windows is available for download here: [IS.tar.gz]
The source code includes a program for calculating IS (IS.py) and a program for installing the other required tools (IS_install.py).


Running the program

To run this program, Python (version higher than 2.4) and the Java Runtime Environment (1.5 or higher) must be installed.

To calculate IS for each residue, follow these steps.
1. First, install the tools for calculating the co-evolution score [1, 2] (McBasc) and the sequence conservation score [3] (rate4site) by typing:

python IS_install.py

"IS_install.py" installs McBasc and rate4site for you. After installation, you should see the output file "config" and output folders such as "covariance" and "rate4site".

2. Second, calculate the IS of each residue by typing:

python IS.py [argument1] [argument2] [argument3]

argument1 is the name of your MSA file, argument2 is the name of the query protein sequence in the MSA file, and argument3 is the name of the output file.
For example, if the MSA file is named Test.aln, the query sequence is named Test_1, and the desired output file is Test.out, type:

python IS.py Test.aln "Test_1" Test.out
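
When IS has to be calculated for several alignments, the command above can be assembled from a script. The following is a minimal sketch, not part of the distributed package: the helper names build_is_command and run_is are hypothetical, and it assumes that "python" on your PATH is the interpreter you use for IS.py and that IS.py is in the current directory.

```python
import subprocess


def build_is_command(msa_file, query_name, output_file,
                     script="IS.py", python="python"):
    """Return the argument list for: python IS.py <msa> <query> <output>."""
    return [python, script, msa_file, query_name, output_file]


def run_is(msa_file, query_name, output_file,
           script="IS.py", python="python"):
    """Run IS.py once; raises CalledProcessError on a non-zero exit status."""
    command = build_is_command(msa_file, query_name, output_file,
                               script=script, python=python)
    # Passing a list (not a shell string) means the query name needs no quoting.
    subprocess.check_call(command)
```

Calling run_is("Test.aln", "Test_1", "Test.out") is then equivalent to the example command line.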


Input file

CLUSTAL W

Name_1(space)LFSIWLHVIG---REYWLISGLLF
Name_2(space)FFQRWLN-MGWRNPSYLQSSTGIF
Name_3(space)FFPIWLHINHLKNTNFWMFGLFFF

That is, each homologous sequence must be on a single line. The first word on the line is the name of the sequence and the second word is the amino acid sequence. '-' indicates a gap.
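
Before running IS.py, it can be useful to check that an alignment follows this layout. The sketch below is an illustration only and is not part of the package: read_alignment and check_alignment are hypothetical helper names, and it assumes that any line starting with "CLUSTAL" is a header to skip and that all aligned sequences must have the same length.

```python
def read_alignment(lines):
    """Parse 'name sequence' lines into a list of (name, sequence) tuples."""
    records = []
    for line in lines:
        if line.upper().startswith("CLUSTAL"):
            continue  # assumed header line, not a sequence
        fields = line.split()
        if len(fields) != 2:
            continue  # blank or malformed line
        records.append((fields[0], fields[1]))
    return records


def check_alignment(records, query_name):
    """Return the alignment length, or raise ValueError on a problem."""
    if not records:
        raise ValueError("no sequences found")
    names = [name for name, _ in records]
    if query_name not in names:
        raise ValueError("query sequence %r not in alignment" % query_name)
    lengths = set(len(seq) for _, seq in records)
    if len(lengths) != 1:
        raise ValueError("sequences differ in length: %s" % sorted(lengths))
    return lengths.pop()


# The example alignment from this section, with a single space as separator.
example = [
    "CLUSTAL W",
    "",
    "Name_1 LFSIWLHVIG---REYWLISGLLF",
    "Name_2 FFQRWLN-MGWRNPSYLQSSTGIF",
    "Name_3 FFPIWLHINHLKNTNFWMFGLFFF",
]
records = read_alignment(example)
alignment_length = check_alignment(records, "Name_1")
```

To check a file on disk, pass the open file object to read_alignment, since iterating over a file yields its lines.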


Output file

The result is in the format given by:

RES (tab) POS (tab) IS (tab) CS (tab) CS_P (tab) CN (tab) CN_P

RES: the amino acid in the query sequence, in one-letter code.
POS: the residue number in the query sequence.
IS: integration score of the given residue in the query sequence.
CS: sequence conservation score of the given residue in the query sequence.
CS_P: percentile rank of the sequence conservation score of the given residue in the query sequence.
CN: co-evolutionary coupling number of the given residue in the query sequence.
CN_P: percentile rank of the co-evolutionary coupling number of the given residue in the query sequence.
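
The tab-separated output can be loaded for further analysis with a few lines of Python. This is a minimal sketch under stated assumptions, not part of the package: the names Residue, parse_is_output and top_residues are hypothetical, POS is assumed to be an integer, the other score columns are assumed to be numeric, and lines that do not match (for example a header line, if one is present) are skipped. Ranking by descending IS assumes that a higher IS marks a stronger candidate.

```python
from collections import namedtuple

# One record per output line: RES, POS, IS, CS, CS_P, CN, CN_P.
Residue = namedtuple("Residue", "res pos integration cs cs_p cn cn_p")


def parse_is_output(lines):
    """Parse tab-separated IS output lines into Residue records."""
    rows = []
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 7:
            continue  # not a data line
        try:
            pos = int(fields[1])
            values = [float(x) for x in fields[2:]]
        except ValueError:
            continue  # non-numeric line, e.g. a header
        rows.append(Residue(fields[0], pos, *values))
    return rows


def top_residues(rows, n=10):
    """Return the n records with the highest IS (assumed: higher is better)."""
    return sorted(rows, key=lambda r: r.integration, reverse=True)[:n]


# Made-up values, for illustration only; they are not real results.
sample = [
    "L\t1\t0.5\t1.2\t60.0\t3\t40.0",
    "F\t2\t0.9\t2.0\t95.0\t7\t90.0",
]
best = top_residues(parse_is_output(sample), n=1)
```

For a real run, replace sample with the open output file, for example open("Test.out").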


Citation

1. Fodor A. and Aldrich R. Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins: Structure, Function and Genetics. 2004 Aug 1;56(2):211-21.

2. Dekker J., Fodor A., Aldrich R. and Yellen G. A perturbation-based method for calculating explicit likelihood of evolutionary covariance in multiple sequence alignments. Bioinformatics. 2004 Jul 10;20(10):1565-72.

3. Mayrose I., Graur D., Ben-Tal N. and Pupko T. Comparison of site-specific rate-inference methods: Bayesian methods are superior. Mol Biol Evol. 2004;21:1781-1791.