"Comparing sequences without using alignment scores :
Application to HIV/SIV subtypings."
G. Didier, L. Debomy, M. Pupin, M. Zhang, A. Grossmann, C. Devauchelle and I. Laprevotte
Under submission
Previous analyses of retroviral nucleotide sequences (I.
Laprevotte et al. 2001) suggest that the reverse transcriptase plays a
major evolutionary role through duplications and deletions. This would
explain why the relevance of a multiple alignment of LTR nucleotide
sequences is difficult to assess with the four-letter alphabet of DNA.
It led us to use the "local decoding of order N" of sequences, which
consists in rewriting a set of observed biological sequences with a
larger alphabet that accounts for the neighborhood of each letter in
the sequences (the integer N is the only required parameter).
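The rewriting step can be illustrated with a minimal Python sketch. It is
not the SCeNE implementation and rests on an assumption about the
definition: two positions receive the same new letter when they sit at the
same offset inside two occurrences of an identical word of length N, and
this relation is extended by transitivity. The function name
local_decoding and the use of integer identifiers as new letters are
hypothetical choices made for illustration.

```python
def local_decoding(seqs, n):
    """Rename every position of every sequence with a class identifier.

    Assumption (not taken from the SCeNE source): positions are merged
    when they occupy the same offset in two occurrences of the same word
    of length n; merging is transitive (union-find).
    """
    # Give each position of each sequence a flat integer id.
    starts, total = [], 0
    for s in seqs:
        starts.append(total)
        total += len(s)
    parent = list(range(total))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smallest id as the class representative.
            parent[max(ra, rb)] = min(ra, rb)

    # Merge positions covered at the same offset by identical words.
    first_seen = {}
    for si, s in enumerate(seqs):
        for i in range(len(s) - n + 1):
            word = s[i:i + n]
            here = starts[si] + i
            if word in first_seen:
                there = first_seen[word]
                for k in range(n):
                    union(here + k, there + k)
            else:
                first_seen[word] = here

    # Rewrite each sequence as a list of class identifiers.
    return [[find(starts[si] + i) for i in range(len(s))]
            for si, s in enumerate(seqs)]
```

In this sketch a position that is not covered by any repeated word of
length N keeps an identifier of its own, so larger values of N give a
larger alphabet with fewer shared letters.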
The renamed letters
conserved between aligned sequences tend to cluster into homology
blocks. Building on this, we describe a rapid and interactive method to
produce trees from sets of HIV/SIV nucleotide sequences: pairwise
comparisons of the sequence compositions in encoded letters yield
distance matrices from which trees can be computed without the
preliminary step of aligning the nucleotide sequences.
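The comparison of compositions can be sketched as follows in Python. The
exact score used in the paper is not reproduced here: a Bray-Curtis-style
dissimilarity on letter counts is used as a stand-in, and the names
composition_distance and distance_matrix are hypothetical. Each encoded
sequence is assumed to be a list of hashable letters (for instance
integer identifiers).

```python
from collections import Counter


def composition_distance(a, b):
    """Dissimilarity between two encoded sequences, in [0, 1].

    Stand-in measure (an assumption, not the score of the paper):
    1 - 2 * (number of shared letters, with multiplicity) / (len(a) + len(b)).
    """
    ca, cb = Counter(a), Counter(b)
    shared = sum((ca & cb).values())  # multiset intersection
    total = len(a) + len(b)
    return 1.0 - 2.0 * shared / total if total else 0.0


def distance_matrix(encoded):
    """Symmetric matrix of pairwise composition distances."""
    n = len(encoded)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = composition_distance(encoded[i], encoded[j])
    return d
```

A matrix built this way can be passed to any neighbor-joining
implementation; no alignment of the original nucleotide sequences is
involved at any point.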
We present here the
prototype program that was used in this work. A revised and
documented version of this program is to be published. Also included
here are the sequence files (in FASTA format) that were used to
compute the trees displayed in Figures 2 and 3 and the other trees
mentioned in the text.
SEQUENCE DATA and TREES (these are the sequence data files used in the article. You can download them to rerun the program below and to see the results mentioned in the paper.)
Comp_2000 : 46 HIV/SIV complete nucleotide
sequences compiled in the Los
Alamos Compendium 2000. The NJ tree computed with N=13 is shown in Comp2000_13_tree.ps or Comp2000_13_tree.pdf.
hiv.fsa : 70 HIV/SIV nucleotide sequences (66
complete). The NJ tree computed with N=15 is shown in Figure 2 of the paper.
gag.fsa : 70 HIV/SIV gag nucleotide sequences (66
complete). The NJ tree computed with N=18 is shown in gag18_tree.ps or
gag18_tree.pdf.
pol.fsa : 66 HIV/SIV pol nucleotide sequences. The
NJ tree computed with N=22 is shown in
pol22_tree.ps or pol22_tree.pdf.
env.fsa : 66 HIV/SIV env nucleotide sequences. The
NJ tree computed with N=15 is shown in
env15_tree.ps or env15_tree.pdf.
nef.fsa : 66 HIV/SIV nef nucleotide sequences. The
NJ tree computed with N=13 is shown in
nef13_tree.ps or nef13_tree.pdf.
hivltr.fsa : 43 HIV/SIV non-coding parts of LTR
nucleotide sequences. The NJ tree computed with N=11 is shown in
Figure 3 of the paper.
PROTOTYPE PROGRAM
install.exe : self-extracting application
for Windows (SCeNE has to be installed on the C:\ drive)
Installation steps :
1 - Save install.exe on C:\ (this program is a prototype, and the
self-extracting archive must be located in C:\)
2 - Click on install.exe (caution : do not change the default
directory). The self-extracting file creates a directory SCeNE at the
root of C:\.
Execution steps :
1 - Click on SCeNE.exe in the directory C:\SCeNE
2 - In the window "Load Sequences" select the sequence file with the
Browser and press the button "Load Sequences".
3 - In the window "Compute N-ecriture" select the N-block presentation
rank in the text field "Level" and press the button "Compute". Wait
for the next window.
4 - Select the phylogeny window. Select
"score based on sequences" (which corresponds to the
calculation method used in the
associated paper). Press the button "GO".
5 - Wait a few seconds and press NJPlot to display the tree.
A formal presentation of the local decoding, and a comparison of the
distance used to compute trees with
an alignment distance, can be found in:
G. Didier, I. Laprevotte, M. Pupin and A. Hénaut, Local
Decoding of Sequences and Alignment-Free Comparison, Journal of Computational Biology 13(8) (2006).