Saturday 9 December 2017

bioinformatics - How to do multiple sequence alignment?


I have a DNA sequence that makes protein 1. but now I have asked to:




  1. compare the amino acid sequence of protein 1 with nine homologous proteins and make a multi sequence alignment (MSA) of the sequences.




  2. Determine a consensus sequence for the proteins based on the MSA.





  3. Find any specific parts of the proteins that are conserved, then explain why these parts are conserved.





Answer





compare the amino acid sequence of protein 1 with nine homologous proteins and make a multi-alignment of the sequences.



EBI have a portal for many MSA tools and there are also other MSA tools available elsewhere.



In research, it's good practice to use several alignment techniques and look at which generates sensible indels. Usually, this is the lowest number of indel events.


Clustal Omega is probably the most sophisticated MSA tool hosted on the EBI site, however, it is relatively new and isn't as established as T-coffee or MUSCLE.


Note that these tools are updated fairly regularly. This question and top answer about "cutting edge" MSA tools from 2014 refer to a paper from 2011 that attempts to benchmark MSA tools. As you can imagine the state-of-the-art tools change rapidly (for example, clustal-w2 was released, and now clustal omega since that benchmarking paper). For most researchers though, it's a personal preference, and different MSA tools are "better" for different situations (speed of computation, number of alignments, similarity of sequences, complexity of secondary structure, local vs global alignments etc.).




Determine a consensus sequence for the proteins based on the multialignment.



This depends entirely on the information from your alignment.


A common way to make a consensus sequence is to just take the most abundant residue at each position in the MSA. I don't like this approach because it heightens the significance of abundant residues, and diminishes the appearance of less abundant residues. This distorts the biochemistry and can easily lead to sequences that would be impossible or useless in biology. I would always rather see a raw alignment.





Find any specific parts of the proteins that are conserved, then explain why these parts are conserved.



Conservation usually implies function, and hopefully you'll have a homologue that has a categorised function. Perhaps in your case one could attribute this function to a conserved sequence in the other homologues.


One brilliant tool is called consurf.


image of structural and sequence alignments coloured according to conservation


You can upload your MSA file to it and it will colour code regions of conservation from purple to blue. A purple region implies an evolutionary selection to not change that region, implying that it is a "functional" region.


Consurf works better with more sequences, so perhaps it isn't appropriate for this project. Instead try loading the alignment in Jalview and show "sequence conservation".


Image of bar chart showing different conservation scores at different sequence positions


Caution.



When performing an MSA remember that the algorithm assumes that the sequence is homologous and that assumption can produce errors. If it looks wrong, it probably is!


No comments:

Post a Comment

evolution - Are there any multicellular forms of life which exist without consuming other forms of life in some manner?

The title is the question. If additional specificity is needed I will add clarification here. Are there any multicellular forms of life whic...