bioinformatics - Scoring matrices (BLOSUM & PAM) in BLAST and other sequence-comparison programs

Tuesday, 9 October 2018

The Wikipedia page on BLAST states that:

The scores are created by comparing the word in the list in step 2* with all the 3-letter words. By using the scoring matrix (substitution matrix) to score the comparison of each residue pair, there are 20³ possible match scores for a 3-letter word. For example, the score obtained by comparing PQG with PEG and PQA is 15 and 12, respectively with the BLOSUM62 weighting scheme.

I do not understand how they got a score of 15 for PQG v. PEG. What is a scoring matrix, how is it computed, and how is it used?

Step 2 on that page is Make a k-letter word list of the query sequence: “Take k=3 for example, we list the words of length 3 in the query protein sequence (k is usually 11 for a DNA sequence) sequentially, until the last letter of the query sequence is included”.

Answer

The two most common families of scoring matrices are BLOSUM and PAM. Each of them has a score for every possible alignment combination between the 20 standard amino acids¹. They both do more or less the same job but have been derived using different approaches.

BLOSUM matrices

(image taken from Wikipedia)

The BLOSUM matrices are built from actual alignments between highly conserved regions of protein families. The score for a given pair of amino acids aligning with each other is a log-odds score: it compares how often this particular alignment is actually found with how often it would be expected to be found by chance:

$S_{ij} = \left( \frac{1}{\lambda} \right) \log \left( \frac{p_{ij}}{q_i \, q_j} \right)$

Here $p_{ij}$ is the probability of finding the two amino acids $i$ and $j$ aligned to each other in homologous sequences, and $q_i$ and $q_j$ are the background probabilities of finding the amino acids $i$ and $j$ in any protein sequence. $\lambda$ is a scaling factor chosen so that, after rounding, the matrix contains convenient integer values.

In simple words, the BLOSUM matrices give you a score based on how often the alignment you observe in your sequences is found in alignments of similar sequences.
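The log-odds idea can be sketched in a few lines of Python. This is only an illustration: the frequencies below are made up, the function name `log_odds_score` is hypothetical, and the choice of $\lambda = \ln(2)/2$ (scores in half-bit units) is an assumption made here to get round numbers.

```python
import math


def log_odds_score(p_ij, q_i, q_j, lam):
    """Rounded log-odds score: (1/lam) * ln(p_ij / (q_i * q_j))."""
    odds = p_ij / (q_i * q_j)  # observed frequency vs. chance expectation
    return round(math.log(odds) / lam)


# Assumed scaling factor: scores come out in half-bit units.
HALF_BIT = math.log(2) / 2

# Hypothetical frequencies, chosen only for illustration.
# Pair observed four times more often than expected by chance.
enriched = log_odds_score(p_ij=0.02, q_i=0.05, q_j=0.1, lam=HALF_BIT)
# Pair observed four times less often than expected by chance.
depleted = log_odds_score(p_ij=0.00125, q_i=0.05, q_j=0.1, lam=HALF_BIT)

print(enriched, depleted)
```

A pair that turns up more often than chance predicts gets a positive score, a pair that turns up less often gets a negative one, and a pair seen exactly as often as chance predicts scores zero.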

There are several BLOSUM matrices, computed from sequences of greater or lesser similarity. The most commonly used one is BLOSUM62. The number is a clustering threshold: sequences sharing $\ge 62\%$ identity were merged into a single cluster before substitutions were counted, so the scores reflect sequences that are at most about 62% identical. This is a nice middle ground and will do for most cases. If you are comparing proteins that are very closely related, you might want to use a matrix based on more similar sequences such as BLOSUM90 (90% clustering threshold), and for less conserved proteins you might use something like BLOSUM45 (45% threshold).

PAM matrices

The PAM matrices are built using a different approach. First, a global alignment (as opposed to the local ones used in BLOSUM) of a set of closely related sequences sharing at least 85% sequence identity is computed. Next, a score for the alignment of every possible pair of amino acids is calculated from its observed frequency in the aligned proteins. Finally, the scores are extrapolated to alignments of different sequence similarity. The PAM matrices assume a model of protein evolution and score the alignments based on that model.

The PAM1 matrix is the only one that was actually built from real alignments. The rest were obtained by multiplying PAM1 by itself: PAM-N is the PAM1 mutation probability matrix raised to the Nth power. In PAM, unlike in BLOSUM, higher numbers correspond to greater evolutionary distances between proteins.
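The extrapolation step can be illustrated with a toy example. The sketch below uses a made-up 3-letter alphabet and an invented one-step mutation probability matrix (`TOY_PAM1` is not real PAM data); it only shows how repeated multiplication spreads probability away from the diagonal as the evolutionary distance grows. The published PAM scoring matrices are then obtained by converting such probabilities to log-odds scores, a step omitted here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]  # start from the identity matrix
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical one-step matrix for a 3-letter alphabet: entry [i][j] is the
# probability that residue i is replaced by residue j after one unit of
# evolutionary distance. Each row sums to 1.
TOY_PAM1 = [
    [0.990, 0.008, 0.002],
    [0.008, 0.990, 0.002],
    [0.002, 0.002, 0.996],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)

for row in toy_pam250:
    print([round(x, 3) for x in row])
```

After 250 steps each row still sums to 1, but the probability of a residue remaining unchanged is much lower than in the one-step matrix, which is why higher PAM numbers suit more distantly related proteins.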

With all that in mind, the scoring matrices are used by BLAST to calculate the score of the alignment. You can think of this score as the sum of the scores of each individual pair of aligned amino acids in your BLAST result.
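This also answers the specific question about the 3-letter words: the word score is just the sum of the three pairwise matrix entries. Here is a minimal sketch that hard-codes only the five BLOSUM62 entries needed for the two comparisons in the Wikipedia quote; the helper names `pair_score` and `word_score` are made up for this example.

```python
# Only the BLOSUM62 entries needed for this example (the matrix is symmetric).
BLOSUM62_SUBSET = {
    ("P", "P"): 7,
    ("Q", "Q"): 5,
    ("G", "G"): 6,
    ("Q", "E"): 2,
    ("G", "A"): 0,
}


def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up the score for one aligned residue pair, in either order."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def word_score(word1, word2, matrix=BLOSUM62_SUBSET):
    """Sum the pairwise scores of two equal-length words, position by position."""
    if len(word1) != len(word2):
        raise ValueError("words must have the same length")
    return sum(pair_score(a, b, matrix) for a, b in zip(word1, word2))


print(word_score("PQG", "PEG"))  # P/P + Q/E + G/G = 7 + 2 + 6 = 15
print(word_score("PQG", "PQA"))  # P/P + Q/Q + G/A = 7 + 5 + 0 = 12
```

The score for a full BLAST alignment is built the same way, position by position; gap penalties are left out of this sketch.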

¹ The non-standard amino acids selenocysteine and pyrrolysine are not included in any matrix I know of, probably because they are too rare for any reasonable conclusion to be drawn from statistical analyses.
