|Algorithms on non-rooted trees|
Circularity is a characteristic for a part of RNA molecules (in particular RNA molecules composing the viroïdes). The underlying secondary structures of such molecules can be modeled by trees. However it is not always possible to determine a root for these trees. In order to make comparison of this kind of secondary structures in the most general possible way, it is necessary to compare non-rooted trees. Philip N. Klein considered this kind of problem (cf. Klein 98).
|Alternative comparisons of trees|
It would be interesting to define heuristic comparison methods of trees based on a preprocessing of the trees. This preprocessing could make it possible to know various global properties of the trees (like their Strahler number, height, width, number of leaves, etc.). We can then do a comparison based partly on such global properties. This method, allied to a topological comparison, would make it possible to obtain algorithms of better complexity since it would be possible to avoid the structural comparison of two overall different sub-trees.
|Tree pattern matching|
During this thesis, we were interested in the problem of tree pattern matching. In this field, an important number of projections were carried out within the framework of exact pattern matching as well as in the case of approximate pattern matching. In this last case, two great families of specialized deterministic algorithms exist, according to whether one seeks a single pattern in several different trees or several patterns in only one tree. There is also a family of probabilistic algorithms having a better complexity than the deterministic algorithms. Within the biological framework, pattern matching would make possible to identify a subset of RNA molecules having a common substructure. Moreover, successive refinements of the pattern would make it possible to find a "consensus" pattern within a set of secondary structures belonging to the same family.
|Tree pattern recognition|
Another topic of my research is the problem of pattern recognition in trees. This problem consists, starting from a collection of trees, to produce a "consensus" tree pattern, i.e. a pattern very close to this collection, relative with a certain distance measurement. In the problems of pattern recognition, two objects are given to us, as well as a distance targets d. Then, the aim is to find the greatest parts of the objects which differ by, at most, this distance. By specializing the problem of the pattern recognition to a pair of labeled trees, the goal would be then to find the largest component related of each tree so that the distance between them is lower than the target distance d (cf. Zhang-Shasha 97). From a biological point of view, the tree patterns preserved within a collection of trees can characterize a biological structure having a particular function. It was shown that certain divergent sequences can have a similar structure and consequently an identical function, as it is the case for example for the IRES (Internal Ribosome Entry Site).
In the direct prolongation of our work on the various scales of representation of the same object by trees, we can extend our model to multi-scales comparison. A multi-scales comparison is a comparison that takes into account several scales of representation, each model being a refinement of a higher scale model. Our first results give rise to think that this method of comparison multi-scales would make it possible to obtain more effective algorithms in terms of complexity.
|Application to segmented images|
The work previously carried out on rooted, non-rooted trees and directed acyclic graphs can be extended to do pattern matching in images, to compare segmented images, and be applied to image recognition.
In the prolongation of my algorithmic work, I am currently interested in the parallelization of tree comparison algorithms (cf. Shasha-Zhang 89, Zhang-Shasha-Wang 92, Zhang 98), and, in the strict continuity of my research, I intend to study complexity of the parallel versions of the algorithms which I developed during my thesis.
From an algorithmic point of view, we can extend the results of this thesis to other algorithms of the same family. Philip Klein proposed an algorithm similar to Zhang and Shasha's, using another decomposition strategy of the trees. They use an asymmetrical pruning of the first tree according to "the heaviest branch". The asymmetry of this algorithm induces a better complexity if the two trees are very different. Dulucq and Touzet proposed another decomposition strategy `` optimal left-right-hand side '' on the two trees . The complexity of their algorithm can be worth analyzing.
|DNA secondary structure|
Some mycobacterial genomes show the characteristic to contain a great number of approximate repetitions, like the transposon IS6110 in Direct Repeat locus of Mycobacterium tuberculosis. (cf. Sebban and al. - Bioinformatics 2002). Though the repeated sequences diverge (due to transpositions, replication slippages, homologous recombinations through Holliday junction), they seem to share similar secondary structures. It would be interesting to focus on the characterization of such secondary structures of repeated DNA fragments.
|Graphs and classification|
At present, few studies take into account pseudo-nodes and informations relating to the tertiary structure. It would be interesting to characterize a class of graph making it possible to capture this kind of information, even only partially. A classification of the various types of pseudo-nodes would be necessary to know their influence on the complexity of the induced graph (planar graph, outerplanar graph, 2-pages graph, etc).
|Contribution to the prediction of spatial structures|
Concerning the prediction of spatial conformation (or tertiary structure) of RNA molecules, it would be interesting to carry out prediction of tertiary structures not only based on thermodynamic criteria (as it is currently the case), but also using comparison of secondary structures. Indeed, as two similar sequences have a very strong probability to show the same secondary structure, two similar secondary structures should be folded up in the same way in space.
|Geometric substructure of proteins and nucleic acids|
Following the approach of L. Chew et al., I consider the problem of identifying common three-dimensional substructures between proteins. My method is based on comparing the shape of the α-carbon backbone structures of the proteins in order to find 3D rigid motions that bring portions of the geometric structures into correspondence. I am interested in geometric representations of protein backbone chains that are compact yet allow for similarity measures that are robust against noise and outliers. The structure of the backbone are commonly represented as a sequence of unit vectors, defined by each adjacent pair of α-carbons. L. Chew et al. defined a measure of the similarity of two protein structures based on the RMS (root mean squared) distance between corresponding orientation vectors of the two proteins. As an extension of the work I carried out during my PhD thesis, I focus my work on improving this measure of the similarity between proteins and find new ways of comparing other organic macromolecules like ribonucleic acids.
|Consensus shape of a protein or a ribonucleic acid family|
As an extension of the previous problem, I am also interested in finding the consensus shape for a family of proteins, a protein-like structure that provides a compact summary of the significant structural information for a protein family. If all members of a protein family exhibit a geometric relationship between corresponding α-carbons then that relationship is preserved in the consensus shape. In particular, distances and angles that are consistent across family members are preserved. For the consensus shape, the spacing between successive α-carbons is variable, with small distances in regions where the members of the protein family exhibit significant variation and large distances (up to the standard spacing of about 4 angstrom) in regions where the family members agree. Despite this non-protein-like characteristic, the consensus shape preserves and highlights important structural information. I am interested in finding an iterative algorithm for computing the consensus shape of a protein. As an extension to my PhD thesis research, I am also interested in applying these results to RNA structures.