Substitution Matrix

In bioinformatics, a substitution matrix estimates the rate at which each possible residue in a sequence changes to each other residue over time. Substitution matrices are usually seen in the context of amino acid or DNA sequence alignment, where the similarity between sequences depends on the mutation rates as represented in the matrix.

Background

In the process of evolution, from one generation to the next the amino acid sequences of an organism's proteins are gradually altered through the action of DNA mutations. For example, the sequence
  ALEIRYLRD 
could mutate into the sequence
  ALEINYLRD 
in one generation and possibly
  AQEINYQRD 
over a longer period of evolutionary time. Each amino acid is more or less likely to mutate into various other amino acids. For instance, a hydrophobic residue such as valine is more likely to be replaced by another hydrophobic residue than by a hydrophilic one, since a hydrophilic replacement could disrupt the folding or activity of the protein.

Given two amino acid sequences, we would like to say how likely they are to be homologous, that is, derived from a common ancestor. If a sequence alignment algorithm can line up the two sequences so that the mutations required to transform a hypothetical ancestor sequence into both of the current sequences are evolutionarily plausible, the comparison should receive a high score. To this end, we construct a 20x20 matrix whose (i,j)th entry is equal to the probability of the ith amino acid being transformed into the jth amino acid in a given amount of evolutionary time. Such a matrix is called a substitution matrix, and there are many ways to construct one. The most commonly used are described below.

Identity Matrix

The simplest possible substitution matrix is one in which each amino acid is considered maximally similar to itself, but not able to transform into any other amino acid. This matrix looks like:
\begin{bmatrix} 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & & 1 & 0 \\ 0 & 0 & \cdots & 0 & 1 \end{bmatrix}
This matrix succeeds in aligning very similar amino acid sequences but performs poorly on distantly related sequences, because every mismatch is treated as equally implausible. The probabilities need to be determined in a more rigorous fashion, and an empirical examination of previously aligned sequences turns out to work best.
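As a rough illustration of how such a matrix is used, the sketch below scores two equal-length sequences position by position, with no gaps, under the identity matrix. The function names `identity_matrix` and `score_ungapped` are made up for this example, and real alignment programs also handle gaps, which this sketch does not.

```python
# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def identity_matrix(alphabet):
    """Return a dict-of-dicts matrix: 1 on the diagonal, 0 everywhere else."""
    return {a: {b: 1 if a == b else 0 for b in alphabet} for a in alphabet}


def score_ungapped(seq_a, seq_b, matrix):
    """Sum the matrix entries over the aligned positions of two sequences.

    Illustrative helper (not from any library); assumes no gaps.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(matrix[a][b] for a, b in zip(seq_a, seq_b))


IDENTITY = identity_matrix(AMINO_ACIDS)

# The example sequences from the Background section.
print(score_ungapped("ALEIRYLRD", "ALEINYLRD", IDENTITY))  # 8 of 9 positions match
print(score_ungapped("ALEIRYLRD", "AQEINYQRD", IDENTITY))  # 6 of 9 positions match
```

Under the identity matrix the score is simply the number of matching positions, so the three mismatches in the second comparison are all penalised equally, whatever the chemistry of the residues involved.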

Log-odds matrices

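We express the probabilities of transformation as log-odds scores. For our matrix M we define
M_{i,j}=\log_{2} \frac {q_{i,j}} {p_i \cdot p_j}=\log_{2} \frac {observed\;frequency} {expected\;frequency}
where q_{i,j} is the probability that amino acid i transforms into amino acid j (the observed frequency of the pair) and p_i is the frequency of amino acid i, so that p_i \cdot p_j is the frequency expected if the two residues were paired by chance. The base of the logarithm is not important: changing it only rescales every score by the same constant factor, and the same substitution matrix is often expressed in different bases.

What does our simple substitution matrix look like with log-odds scoring? Assuming all 20 amino acids are equally frequent, p_i=p_j=\frac {1} {20}. Since no amino acid ever changes, q_{i,i}=\frac {1} {20} and q_{i,j}=0 \ \forall i \ne j. The diagonal entries are therefore \log_{2} \frac {1/20} {1/400} = \log_{2} 20, while the off-diagonal entries are \log_{2} 0, which diverges to -\infty, so every substitution is scored as impossible. Hence, the matrix is:
\begin{bmatrix} \log_{2} 20 & -\infty & \cdots & -\infty & -\infty \\ -\infty & \log_{2} 20 & & -\infty & -\infty \\ \vdots & & \ddots & & \vdots \\ -\infty & -\infty & & \log_{2} 20 & -\infty \\ -\infty & -\infty & \cdots & -\infty & \log_{2} 20 \end{bmatrix}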
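The log-odds definition can be checked numerically with a minimal sketch. The helper name `log_odds` is an assumption made for this example. A pair with q_{i,j}=0 has no finite log-odds score, so the sketch returns negative infinity for it rather than calling the logarithm on zero.

```python
import math


def log_odds(q_ij, p_i, p_j, base=2.0):
    """Return log_base(q_ij / (p_i * p_j)), or -inf if the pair is never observed.

    Illustrative helper (not from any library).
    """
    if q_ij == 0:
        return float("-inf")
    return math.log(q_ij / (p_i * p_j), base)


# Uniform background: each of the 20 amino acids has frequency 1/20.
p = 1 / 20

# Simple "nothing ever changes" model: q_ii = 1/20 and q_ij = 0 for i != j.
diagonal = log_odds(1 / 20, p, p)
off_diagonal = log_odds(0.0, p, p)
print(diagonal, off_diagonal)

# Changing the base rescales every finite score by the same constant factor.
natural = log_odds(1 / 20, p, p, base=math.e)
print(natural / diagonal)  # the constant factor, ln(2)
```

Because the ratio between the two bases is the same for every entry, the choice of base does not change which alignment scores highest.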

PAM

One of the first amino acid substitution matrices, the PAM (point accepted mutation) matrix, was developed by Dayhoff in the 1970s. It is calculated by observing the differences in closely related proteins. The PAM1 matrix gives the substitution rates for the amount of evolution over which 1% of the amino acids have changed. The version that Dayhoff published was the PAM250 matrix. PAM numbers above 100 are possible because every change is counted: if one residue changes five times, that counts as five changes, so 250 changes per 100 residues does not mean that every residue differs. A matrix for divergent sequences can be calculated from a matrix for closely related sequences by raising the latter to a power (the power is taken of the probability matrix, before any conversion to log-odds scores). For instance, we can roughly approximate a hypothetical WIKI2 matrix from a WIKI1 matrix by saying W_2 = W_1^2, where W_1 is WIKI1 and W_2 is WIKI2. This is how the PAM250 matrix is calculated: it is the PAM1 matrix raised to the 250th power.
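The extrapolation step can be sketched with a toy example. The 3-letter alphabet and the numbers in `W1` below are invented purely for illustration and are not Dayhoff's data; a real PAM1 matrix is 20x20. Each row holds the probabilities that one residue stays the same or changes into each of the others in one step, so each row sums to 1.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = [row[:] for row in m]
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Made-up one-step mutation probabilities for a toy 3-letter alphabet.
# Every residue has a 1% chance of changing in one step.
W1 = [
    [0.990, 0.007, 0.003],
    [0.007, 0.990, 0.003],
    [0.005, 0.005, 0.990],
]

W2 = mat_pow(W1, 2)      # two steps of evolution
W250 = mat_pow(W1, 250)  # the analogue of going from PAM1 to PAM250

print(W2[0][0])    # chance the first residue is unchanged after 2 steps
print(W250[0][0])  # much lower after 250 steps, but still well above zero
```

The diagonal of `W250` stays well above zero even though 250 changes per 100 residues have accumulated, because repeated and reverse changes at the same position are all counted.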

BLOSUM

Dayhoff's methodology of comparing closely related species turned out not to work very well for aligning evolutionarily divergent sequences, because sequence changes over long evolutionary time scales are not well approximated by compounding small changes that occur over short time scales. The BLOSUM series of matrices addresses this problem. Henikoff and Henikoff constructed these matrices from multiple alignments of evolutionarily divergent proteins, using ungapped aligned regions called blocks. The number in the name is a clustering threshold: for BLOSUM62, sequences within a block that share at least 62% identity are clustered and counted as a single sequence, so the matrix reflects substitutions between sequences that are less than 62% identical, and likewise for the other matrices in the series. One would therefore use a higher-numbered BLOSUM matrix for aligning two closely related sequences and a lower-numbered one for more divergent sequences. The BLOSUM62 matrix performs well at detecting similarities in distant sequences, and it is the default matrix in alignment applications such as BLAST.
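The core idea of deriving scores from observed alignments can be sketched as follows. This is a simplified illustration, not the Henikoffs' procedure: it counts residue pairs in the columns of one ungapped block and applies the log-odds formula from the section above, but it omits the identity-based clustering and the scaling and rounding used for the published matrices. The function name `pair_log_odds` is made up for this example, and the toy block reuses the three sequences from the Background section, which are not real divergent proteins.

```python
import math
from collections import Counter
from itertools import combinations


def pair_log_odds(block):
    """Base-2 log-odds scores from the residue pairs in an ungapped block.

    Illustrative helper (not from any library). `block` is a list of
    equal-length sequences; pairs that never occur get no entry.
    """
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            # Count each pair in both orders so that q is symmetric.
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    total = sum(pair_counts.values())
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency of each residue, implied by the pair frequencies.
    p = Counter()
    for (a, _), freq in q.items():
        p[a] += freq

    return {(a, b): math.log2(freq / (p[a] * p[b])) for (a, b), freq in q.items()}


toy_block = ["ALEIRYLRD", "ALEINYLRD", "AQEINYQRD"]
scores = pair_log_odds(toy_block)
print(scores[("A", "A")])
print(scores[("R", "N")])
```

With so little data the numbers are not meaningful in themselves; the point is only that the scores come from counting what is actually observed in aligned columns rather than from extrapolating a model.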

Current research

Current innovative approaches include incorporating secondary structure information into the sequences and substitution matrices. See this paper for an example of this direction of research.

 
