"Comparing sequences without using alignment scores :
Application to HIV/SIV subtypings."

G. Didier, L. Debomy, M. Pupin, M. Zhang, A. Grossmann, C. Devauchelle and I. Laprevotte

Under submission

        Previous analyses of retroviral nucleotide sequences (I. Laprevotte et al. 2001) suggest that the reverse transcriptase plays a major evolutionary role through duplications and deletions. This explains why the relevance of a multiple alignment of LTR nucleotide sequences is difficult to estimate with the four-letter alphabet of DNA. It led us to use the "local decoding of order N" of sequences, which consists in rewriting a set of observed biological sequences with a larger alphabet that accounts for the neighborhood of each letter in the sequences (the integer N is the only required parameter).
        The renamed letters conserved between aligned sequences tend to cluster into homology blocks. Building on this, we describe a rapid and interactive method to produce trees from sets of HIV/SIV nucleotide sequences: pairwise comparisons of the sequence compositions in encoded letters produce distance matrices, from which trees can be computed without the preliminary step of aligning the nucleotide sequences.
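
The idea of comparing sequences by their composition in encoded letters can be illustrated with a minimal Python sketch. It is a deliberately simplified stand-in and not the local decoding defined in the paper: here each position is simply renamed by the N-letter word that starts there, and the dissimilarity (one minus the shared fraction of encoded letters) is an assumption chosen for illustration. The function names `encode`, `composition_distance` and `distance_matrix` are hypothetical and do not come from the SCeNE program.

```python
from collections import Counter
from itertools import combinations


def encode(seq, n):
    """Rename each position by the n-letter word starting there.

    Simplified stand-in for an enlarged alphabet; NOT the paper's local decoding.
    """
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def composition_distance(seq_a, seq_b, n):
    """Illustrative dissimilarity in [0, 1] based on shared encoded letters."""
    a = Counter(encode(seq_a, n))
    b = Counter(encode(seq_b, n))
    shared = sum((a & b).values())  # size of the multiset intersection
    smaller = min(sum(a.values()), sum(b.values()))
    if smaller == 0:  # at least one sequence is shorter than n
        return 1.0
    return 1.0 - shared / smaller


def distance_matrix(seqs, n):
    """Pairwise distances for a dict {name: sequence}; no alignment involved."""
    names = list(seqs)
    d = {x: {y: 0.0 for y in names} for x in names}
    for x, y in combinations(names, 2):
        d[x][y] = d[y][x] = composition_distance(seqs[x], seqs[y], n)
    return names, d
```

The resulting matrix could then be handed to any neighbor-joining implementation. The trees in the paper were produced with the prototype program presented on this page, not with this sketch.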

        We present here the prototype program used in this work. A revised and documented version of this program is to be published. Also included here are the sequence files (in FASTA format) used to compute the trees displayed in Figures 2 and 3 and the other trees mentioned in the text.

SEQUENCE DATA and TREES (these are the sequence data files used in the article. You can download them to rerun the program below and reproduce the results mentioned in the paper.)

Comp_2000 : 46 HIV/SIV complete nucleotide sequences compiled in the Los Alamos Compendium 2000. The NJ tree computed with N=13 is shown in Comp2000_13_tree.pdf.
hiv.fsa : 70 HIV/SIV nucleotide sequences (66 complete). The NJ tree computed with N=15 is shown in Figure 2 of the paper.
gag.fsa : 70 HIV/SIV gag nucleotide sequences (66 complete). The NJ tree computed with N=18 is shown in gag18_tree.pdf.
pol.fsa : 66 HIV/SIV pol nucleotide sequences. The NJ tree computed with N=22 is shown in pol22_tree.pdf.
env.fsa : 66 HIV/SIV env nucleotide sequences. The NJ tree computed with N=15 is shown in env15_tree.pdf.
nef.fsa : 66 HIV/SIV nef nucleotide sequences. The NJ tree computed with N=13 is shown in nef13_tree.pdf.
hivltr.fsa : 43 HIV/SIV non-coding parts of LTR nucleotide sequences. The NJ tree computed with N=11 is shown in Figure 3 of the paper.
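
Before loading one of these files into the program, it can be useful to check that it contains the expected number of sequences. The following is a minimal sketch of a FASTA reader in Python; the function name `parse_fasta` is hypothetical, and the sketch assumes plain FASTA files in which each record starts with a ">" header line and header lines are unique.

```python
def parse_fasta(lines):
    """Return {header: sequence} from an iterable of FASTA lines.

    Assumes unique headers; a repeated header would overwrite the earlier record.
    """
    records = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].strip()  # header text without the leading ">"
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())  # sequence may span several lines
    return {key: "".join(parts) for key, parts in records.items()}
```

For example, opening hiv.fsa, passing the file handle to `parse_fasta` and taking the length of the result should give 70 if the file matches the description above.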


install.exe : self-extracting application for Windows (SCeNE has to be installed on the C:\ drive)

Installation steps :
1 - Save install.exe in C:\ (this program is a prototype, and the self-extracting archive must be located in C:\)
2 - Click on install.exe (caution : do not change the default directory). The self-extracting file creates a directory SCeNE at the root of C:\ .
Execution steps :
1 - Click on SCeNE.exe in the directory C:\SCeNE
2 - In the window "Load Sequences", select the sequence file with the browser and press the button "Load Sequences".
3 - In the window "Compute N-ecriture", enter the N-block presentation rank in the text field "Level" and press the button "Compute". Wait for the next window.
4 - Select the phylogeny window. Select "score based on sequences" (which corresponds to the calculation method used in the associated paper). Press the button "GO".
5 - Wait a few seconds and press NJPlot to display the tree.

A formal presentation of the local decoding, and a comparison between the distance used to compute the trees and an alignment distance, can be found in:
G. Didier, I. Laprevotte, M. Pupin and A. Hénaut, Local Decoding of Sequences and Alignment-Free Comparison, Journal of Computational Biology 13(8) (2006)

Ivan Laprevotte
Laboratoire Statistique et Génome
Tour Evry2 - 523 Place des Terrasses - 91034 Evry Cedex - France