Estimation of causal direction from time series in the presence of mixed and colored noise
Many methods erroneously interpret asymmetries, e.g. in signal-to-noise ratio or spectral content, as an indication of a specific direction of information flow. To estimate the direction, we have constructed a "Phase Slope Index" (PSI) from the real and imaginary parts of coherency. PSI is strictly insensitive to pure mixtures of independent sources, regardless of the number of noise sources and their spectral content, but is highly sensitive to true information flow. Application to EEG data of the human alpha rhythm (the 10 Hz rhythm most prominent with eyes closed) indicates a clear front-to-back causal interaction in most subjects. In this talk I will illustrate the method, including a comparison with Granger causality. I will put special emphasis on limitations, addressing the questions:
a) when can we expect a result,
b) when is the result wrong,
c) what does the result really mean, and
d) what questions cannot be answered?
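The core PSI computation can be conveyed in a few lines. The sketch below is a minimal numpy illustration of the idea, assuming the usual form of the index (the imaginary part of the sum over frequencies of conj(C(f))·C(f+δf), with coherency C estimated by Welch-style segment averaging); the simulation and segment length are arbitrary, and this is not the exact implementation discussed in the talk.

```python
import numpy as np

def phase_slope_index(x, y, seg_len=256):
    """Phase Slope Index: sign indicates estimated direction of flow (x->y if > 0)."""
    n_seg = len(x) // seg_len
    win = np.hanning(seg_len)
    Sxx = Syy = Sxy = 0
    for k in range(n_seg):
        xs = np.fft.rfft(win * x[k*seg_len:(k+1)*seg_len])
        ys = np.fft.rfft(win * y[k*seg_len:(k+1)*seg_len])
        Sxx += np.abs(xs)**2            # auto-spectra, averaged over segments
        Syy += np.abs(ys)**2
        Sxy += xs * np.conj(ys)         # cross-spectrum
    C = Sxy / np.sqrt(Sxx * Syy)        # complex coherency
    # slope of the coherency phase across frequency, summed over all bins
    return float(np.imag(np.sum(np.conj(C[:-1]) * C[1:])))

# toy example: y is a noisy, delayed copy of x, so information flows x -> y
rng = np.random.default_rng(0)
s = rng.standard_normal(20000)
x = s + 0.1 * rng.standard_normal(20000)
y = np.roll(s, 5) + 0.1 * rng.standard_normal(20000)   # y lags x by 5 samples
psi = phase_slope_index(x, y)
```

A positive value supports a flow from the first to the second signal; by construction the index changes sign when the arguments are swapped, and it vanishes for instantaneous mixtures of independent sources.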
Towards robust feature selection for high-dimensional, small sample settings
The selection of relevant features is an important issue in many biological and biomedical settings. A typical application is the identification of biomarkers, such as signature genes from microarray data or SNP selection from SNP-chip or high-throughput sequencing data.
In most of these settings, the number of features (dimensionality) greatly outnumbers the number of samples, often by one or more orders of magnitude.
Given the limited number of training samples, it can be questioned how sensitive feature selection methods are to small variations in the training data. Because the selected features are inspected by domain experts in subsequent analyses, robust feature selection methods are preferred, as these increase the confidence of the domain expert.
Quantifying and increasing the robustness of feature selection methods is a topic of recent interest, and in this talk I will present recent breakthroughs, current developments, and promising future directions in this area. In particular, I will elaborate on the concept of ensemble feature selection methods, and present a new comparative study between these methods and traditional SVM-based feature selection techniques, such as RFE.
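As a rough illustration of the ensemble idea, the sketch below aggregates feature ranks over bootstrap resamples of a high-dimensional, small-sample toy dataset. The univariate t-type score is only a stand-in for any base selector (e.g. SVM-RFE weights); all sizes and effect strengths are illustrative.

```python
import numpy as np

def t_scores(X, y):
    """Two-sample t-like score per feature (pooled variances; illustration only)."""
    a, b = X[y == 0], X[y == 1]
    num = a.mean(0) - b.mean(0)
    den = np.sqrt(a.var(0) / len(a) + b.var(0) / len(b)) + 1e-12
    return np.abs(num / den)

def ensemble_select(X, y, k=10, n_boot=50, seed=0):
    """Ensemble feature selection: aggregate feature ranks over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    rank_sum = np.zeros(p)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)                  # bootstrap resample
        scores = t_scores(X[idx], y[idx])
        rank_sum += scores.argsort().argsort()       # higher score -> higher rank
    return np.argsort(-rank_sum)[:k]                 # top-k by aggregated rank

# toy "p >> n" data: 40 samples, 500 features, only the first 5 informative
rng = np.random.default_rng(1)
n, p = 40, 500
y = np.repeat([0, 1], n // 2)
X = rng.standard_normal((n, p))
X[y == 1, :5] += 2.0
selected = ensemble_select(X, y, k=5)
```

Rank aggregation over resamples is what buys the robustness: a feature must score well on many perturbed training sets, not just once.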
Meriem El Karoui
Some biological questions in comparative genomics
Since the publication of the first complete genome in 1995, the number of complete genome sequences for eukaryotic and prokaryotic organisms has grown exponentially. This has opened up the possibility to make comparisons of the full genetic complement of organisms at different evolutionary scales. Comparative genomics allows the identification of evolutionary constrained regions and has proved to be an important tool to better annotate genomes, identify new functional motifs, build phylogenies and more generally understand the molecular mechanisms that shape genome evolution. Yet, comparing genomes is a difficult task with many methodological challenges. In this presentation, I will describe examples of some of the most common biological questions in comparative genomics and of the methods used to address them.
Local network motifs
Biological networks exhibit small over-represented subgraphs, called motifs, some of which are known to have a biological function. Several algorithms exist to detect motifs, most of them based on time-consuming simulations and on a global count of the subgraphs. We propose a new definition of motif, based on a local over-representation of a small pattern, which allows us to consider the over-representation of a pattern with respect to its subpatterns. We develop an efficient procedure to detect them and apply it to the yeast interaction network.
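The classical global approach that this work refines can be sketched compactly: count a pattern (here triangles) and compare the count with degree-preserving randomizations of the graph. The toy graph, swap counts, and z-score threshold below are all illustrative.

```python
import numpy as np

def triangle_count(A):
    # number of triangles in a simple undirected graph = trace(A^3)/6
    return int(np.trace(A @ A @ A)) // 6

def degree_preserving_randomize(A, n_swaps, rng):
    """Randomize an undirected graph by double edge swaps, keeping every degree."""
    A = A.copy()
    n = len(A)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n) if A[i, j]]
    for _ in range(n_swaps):
        i1, i2 = rng.integers(len(edges)), rng.integers(len(edges))
        (a, b), (c, d) = edges[i1], edges[i2]
        # swap (a,b),(c,d) -> (a,d),(c,b) if it creates no self-loop or multi-edge
        if len({a, b, c, d}) == 4 and not A[a, d] and not A[c, b]:
            A[a, b] = A[b, a] = A[c, d] = A[d, c] = 0
            A[a, d] = A[d, a] = A[c, b] = A[b, c] = 1
            edges[i1] = (min(a, d), max(a, d))
            edges[i2] = (min(c, b), max(c, b))
    return A

# toy network: 10 disjoint triangles -- the triangle pattern is over-represented
n = 30
A = np.zeros((n, n), dtype=int)
for t in range(10):
    i, j, k = 3 * t, 3 * t + 1, 3 * t + 2
    A[i, j] = A[j, i] = A[j, k] = A[k, j] = A[i, k] = A[k, i] = 1

rng = np.random.default_rng(8)
observed = triangle_count(A)
null = [triangle_count(degree_preserving_randomize(A, 200, rng)) for _ in range(50)]
z = (observed - np.mean(null)) / (np.std(null) + 1e-9)
```

This is exactly the simulation-based global test whose cost motivates the talk's local, simulation-free definition.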
Learning gene regulation networks with Gaussian graphical model
A current challenge in Systems Biology is to learn the regulation network of a family of $p$ genes from an $n$-sample of microarrays, with $n$ (much) smaller than $p$. Gaussian graphical models (GGM) are simple models to describe these regulation networks. The goal of the statistician is then to learn the minimal graph of a GGM from an $n$-sample.
We propose a two-step procedure, GGMselect, which first builds a collection of candidate graphs and then selects one of them according to a dedicated selection criterion. The good theoretical properties of the procedure are confirmed on numerical examples.
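GGMselect itself is distributed as an R package. As a minimal numpy illustration of the underlying problem (learning a sparse GGM graph from few samples), the sketch below estimates the neighborhood of each gene by a lasso regression on all the others — the Meinshausen-Bühlmann approach, not GGMselect's own candidate-and-criterion procedure. The chain graph and penalty level are illustrative.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain coordinate-descent lasso (illustration only)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X**2).sum(0)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]     # partial residual
            z = X[:, j] @ r
            beta[j] = np.sign(z) * max(abs(z) - lam, 0) / col_sq[j]
    return beta

def neighborhood_graph(X, lam):
    """GGM edges via neighborhood selection: regress each gene on all others."""
    p = X.shape[1]
    adj = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = [k for k in range(p) if k != j]
        b = lasso_cd(X[:, others], X[:, j], lam)
        adj[j, others] = b != 0
    return adj | adj.T                                # "or" rule for symmetrization

# toy chain network: gene k+1 is regulated by gene k
rng = np.random.default_rng(2)
n, p = 100, 8
X = np.zeros((n, p))
X[:, 0] = rng.standard_normal(n)
for k in range(1, p):
    X[:, k] = 0.7 * X[:, k - 1] + rng.standard_normal(n)
X -= X.mean(0)
A = neighborhood_graph(X, lam=30.0)
```

The recovered adjacency matrix should contain the chain edges; choosing the penalty (here fixed by hand) is precisely the model-selection question that GGMselect's criterion addresses.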
Global alignment of protein–protein interaction networks by graph matching methods
Aligning protein-protein interaction (PPI) networks of different species has drawn considerable interest recently. This problem is important to investigate evolutionarily conserved pathways or protein complexes across species, and to help in the identification of functional orthologs through the detection of conserved interactions. It is, however, a difficult combinatorial problem, for which only heuristic methods have been proposed so far. We reformulate PPI alignment as a graph matching problem, and investigate how state-of-the-art graph matching algorithms can be used for that purpose. We differentiate between two alignment problems, depending on whether strict constraints on protein matches are given, based on sequence similarity, or whether the goal is instead to find an optimal compromise between sequence similarity and interaction conservation in the alignment. We propose new methods for both cases, and assess their performance on the alignment of the yeast and fly PPI networks. The new methods consistently outperform state-of-the-art algorithms, retrieving in particular 78% more conserved interactions than IsoRank for a given level of sequence similarity. All data and code are freely available upon request.
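On toy-sized networks the trade-off between sequence similarity and interaction conservation can be made explicit by brute force. The sketch below scores every permutation of a 5-node network; real PPI networks of course require the relaxation and heuristic algorithms discussed in the abstract, and all matrices and scores here are illustrative.

```python
import itertools
import numpy as np

def align_networks(A1, A2, S, lam=1.0):
    """Exhaustive graph matching (toy sizes only): find the node mapping pi that
    maximizes sequence similarity + lam * number of conserved interactions."""
    n = len(A1)
    best_score, best_pi = -np.inf, None
    for pi in itertools.permutations(range(n)):
        pi = np.asarray(pi)
        sim = S[np.arange(n), pi].sum()                       # similarity of matched pairs
        conserved = (A1 * A2[np.ix_(pi, pi)]).sum() / 2       # edges preserved by pi
        score = sim + lam * conserved
        if score > best_score:
            best_score, best_pi = score, pi
    return best_pi, best_score

# toy "yeast" network and a relabelled copy standing in for the "fly" network
A1 = np.array([[0, 1, 1, 0, 0],
               [1, 0, 1, 0, 0],
               [1, 1, 0, 1, 0],
               [0, 0, 1, 0, 1],
               [0, 0, 0, 1, 0]])
perm = [2, 0, 3, 1, 4]                    # hidden "true" correspondence
A2 = A1[np.ix_(perm, perm)]
S = np.zeros((5, 5))
for j in range(5):
    S[perm[j], j] = 0.5                   # sequence similarity hints at the truth
best_pi, best_score = align_networks(A1, A2, S)
```

The objective makes the compromise explicit: with lam = 0 only sequence similarity matters (the constrained setting); increasing lam trades similarity for conserved interactions.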
Increasing stability and interpretability of gene expression signatures (prediction of breast cancer outcome)
Predicting the outcome of breast cancer in terms of metastatic status is meant to have direct bearing on therapy choice. Molecular models, as opposed to clinical ones, attempt to make this prognosis from gene expression levels measured on the tumor, and several microarray-based studies, among which van de Vijver et al. [2002], proposed sets of genes whose expression profiles predict the outcome with some success. However, the signatures proposed in different studies generally have very few genes in common, which suggests non-uniqueness of the solution. Moreover, maximizing the predictive accuracy seems not to be a sufficient criterion to extract a biologically relevant set of genes, as many non-overlapping signatures can have similar predictive accuracy [Ein-Dor et al., 2005]. Hence, stabilizing the signature in order to improve its biological interpretability arises as a main challenge.
One method to estimate a signature from a training set of expression data is to use shrinkage estimators such as the LASSO [Tibshirani, 1996], which has proved efficient in sparse model selection. Such methods consider all genes as features without any prior knowledge about the biological relationships between them. However, in the case of genetic signatures, it may seem more relevant to exploit the prior knowledge we have about the genes, such as the fact that many groups of genes are involved together in diverse biological functions. For example, pathway databases are available (e.g. MSigDB) that contain genes known to be co-activated in several biological functions. If we look for the predefined sets of genes which best discriminate between metastatic and non-metastatic tumors, instead of simply looking for the genes, we can expect (i) more robustness in the signature, (ii) better interpretability in terms of biological functions and processes, and (iii) potentially better accuracy. Indeed, it can be easy to spuriously select a gene with the LASSO because of the noise, but it should be more difficult to spuriously select a large number of genes simultaneously. Moreover, correlated genes will often belong to the same gene set, which reduces the issue of selecting only one of several correlated variables. In addition, defining the signature in terms of gene sets makes it much more interpretable from a biological point of view.
Here we investigate the use of shrinkage methods to select predefined groups of variables. A variant of the LASSO known as the group LASSO performs a selection directly on groups [Yuan and Lin, 2006] using a block-L1 penalty. Since the groups of genes (pathways) we consider may overlap, we focus on another variant called the overlapping group LASSO proposed in [Jacob et al., 2009]. This algorithm was shown to be consistent for model selection when some of the groups share variables, and we investigate the possibility of increasing the stability of the selected groups using randomization techniques.
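A minimal sketch of the overlap trick of Jacob et al.: duplicate each variable once per group it belongs to, then solve an ordinary (non-overlapping) group lasso on the expanded design — here by proximal gradient with block soft-thresholding. The groups, penalty level, and data below are illustrative, not taken from the breast cancer study.

```python
import numpy as np

def group_lasso_overlap(X, y, groups, lam, n_iter=500):
    """Overlapping group lasso via variable duplication (Jacob et al. style)."""
    n, p = X.shape
    cols = [j for g in groups for j in g]              # expanded (duplicated) columns
    Xe = X[:, cols]
    slices, start = [], 0
    for g in groups:
        slices.append(slice(start, start + len(g)))
        start += len(g)
    w = np.zeros(Xe.shape[1])
    step = 1.0 / np.linalg.norm(Xe, 2) ** 2            # 1/L for the quadratic loss
    for _ in range(n_iter):
        w = w - step * (Xe.T @ (Xe @ w - y))           # gradient step
        for s in slices:                               # block soft-thresholding
            nrm = np.linalg.norm(w[s])
            w[s] = 0.0 if nrm <= step * lam else w[s] * (1 - step * lam / nrm)
    beta = np.zeros(p)                                 # fold copies back to variables
    for s, g in zip(slices, groups):
        beta[list(g)] += w[s]
    selected = [i for i, s in enumerate(slices) if np.any(w[s])]
    return beta, selected

# toy: 3 overlapping "pathways"; only the first one drives the response
rng = np.random.default_rng(3)
X = rng.standard_normal((60, 9))
groups = [(0, 1, 2, 3), (3, 4, 5), (5, 6, 7, 8)]
y = X[:, :4] @ np.array([1.0, 1.0, -1.0, 1.0]) + 0.1 * rng.standard_normal(60)
beta, sel = group_lasso_overlap(X, y, groups, lam=15.0)
```

Because the penalty acts on whole groups, the support of the solution is a union of the predefined gene sets, which is exactly what makes the signature interpretable in terms of pathways.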
L. Ein-Dor, I. Kela, G. Getz, D. Givol, and E. Domany. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21(2):171–178, 2005.
L. Jacob, G. Obozinski, and JP Vert. Group Lasso with Overlap and Graph Lasso. In Proc. ICML, 2009.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
M.J. van de Vijver, Y.D. He, L.J. van 't Veer, H. Dai, A.A.M. Hart, D.W. Voskuil, G.J. Schreiber, et al. A gene-expression signature as a predictor of survival in breast cancer. New England Journal of Medicine, 347(25):1999–2009, 2002.
M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68(1):49–67, 2006.
Douglas N. Rutledge
Feature selection using multi-block analysis
The quality of multivariate predictive models is increased by eliminating uninformative variables. Several methods have been developed to detect such variables. In the case of discriminant models, ANOVA is often used to test each variable separately to determine whether it varies more between groups than within groups. Many methods, such as Uninformative Variable Elimination-PLS [1], Genetic Algorithm-PLS [2] or iPLS [3], have been proposed for variable selection prior to regression analysis. Here a novel method based on multi-block analysis will be proposed to detect informative variables.
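The per-variable ANOVA screen mentioned above is straightforward to write down; here is a numpy sketch for a three-class toy problem (the multi-block method of the talk is not reproduced here, and the data are simulated).

```python
import numpy as np

def anova_f(X, y):
    """One-way ANOVA F statistic per variable: between- vs within-group variance."""
    classes = np.unique(y)
    n, p = X.shape
    grand = X.mean(0)
    ss_between = np.zeros(p)
    ss_within = np.zeros(p)
    for c in classes:
        Xc = X[y == c]
        ss_between += len(Xc) * (Xc.mean(0) - grand) ** 2
        ss_within += ((Xc - Xc.mean(0)) ** 2).sum(0)
    df_b, df_w = len(classes) - 1, n - len(classes)
    return (ss_between / df_b) / (ss_within / df_w)

# toy: 90 samples in 3 groups, 200 variables, only variables 0 and 1 informative
rng = np.random.default_rng(4)
y = np.repeat([0, 1, 2], 30)
X = rng.standard_normal((90, 200))
X[y == 1, 0] += 1.5
X[y == 2, 1] -= 1.5
F = anova_f(X, y)
top2 = np.argsort(-F)[:2]
```

Such univariate screens treat each variable in isolation; the point of a multi-block approach is to exploit the joint structure that this ignores.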
[1] V. Centner, D. L. Massart, O. E. de Noord, S. de Jong, B. M. Vandeginste, C. Sterna. Elimination of uninformative variables for multivariate calibration. Analytical Chemistry 1996, 68, 3851-3858.
[2] A.S. Bangalore, R.E. Shaffer, G.W. Small, M.A. Arnold. Genetic algorithm-based method for selecting wavelengths and model size for use with partial least-squares regression: Application to near-infrared spectroscopy. Analytical Chemistry 1996, 68, 4200-4212.
[3] L. Norgaard, A. Saudland, J. Wagner, J.P. Nielsen, L. Munck, S.B. Engelsen. Interval partial least-squares regression (iPLS): A comparative chemometric study with an example from near-infrared spectroscopy. Applied Spectroscopy 2000, 54, 413-419.
Bayesian variable selection for probit mixed models
An important issue in building a regression model is to select the most pertinent covariates. Several approaches have been proposed to handle this problem. However, it is often useful in a regression framework to take into account some random effects. In genetics, it is appealing to merge datasets because this results in more observations and diversifies the data, allowing a more reliable selection of gene fragments (known as "probesets"). But it is then necessary to introduce the dataset as a random effect. We propose a method to select relevant variables among tens of thousands in a probit mixed regression model, which extends a method developed by (4). This model is considered as part of a larger hierarchical Bayesian model, and latent variables are used to identify subsets of selected variables. We combine the collapsing technique of (6) with a Metropolis-within-Gibbs algorithm (7). The proposed algorithm is efficient and feasible, even for very large datasets with around 20000 variables. We illustrate our method with an application to breast cancer, selecting probesets that characterize the Estrogen Receptor (ER) hormonal status of patients coming from three different merged datasets.
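The flavour of such samplers can be conveyed by a stripped-down sketch: an Albert-Chib latent-variable probit combined with George-McCulloch-type spike-and-slab indicators. This omits the random effects, the collapsing, and the Metropolis-within-Gibbs steps of the actual method, and all hyperparameters below are illustrative.

```python
import numpy as np
from statistics import NormalDist

PHI = NormalDist()

def ssvs_probit_gibbs(X, y, n_iter=600, burn=200, v0=0.01, v1=10.0,
                      prior_pi=0.2, seed=0):
    """Minimal spike-and-slab Gibbs sampler for a probit model (no random effects).
    Returns posterior inclusion probabilities for each variable."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    gamma = np.zeros(p, dtype=bool)
    incl = np.zeros(p)
    for it in range(n_iter):
        # 1) latent utilities z_i ~ N(x_i' beta, 1), truncated by the observed label
        m = X @ beta
        z = np.empty(n)
        for i in range(n):
            lo, hi = (PHI.cdf(-m[i]), 1.0) if y[i] == 1 else (0.0, PHI.cdf(-m[i]))
            u = rng.uniform(lo + 1e-12, hi - 1e-12)
            z[i] = m[i] + PHI.inv_cdf(u)
        # 2) beta | z, gamma: conjugate Gaussian update, prior N(0, diag(d))
        d = np.where(gamma, v1, v0)
        cov = np.linalg.inv(X.T @ X + np.diag(1.0 / d))
        cov = (cov + cov.T) / 2
        beta = rng.multivariate_normal(cov @ (X.T @ z), cov)
        # 3) gamma_j | beta_j: Bernoulli from spike vs slab densities
        log_slab = -0.5 * beta**2 / v1 - 0.5 * np.log(v1) + np.log(prior_pi)
        log_spike = -0.5 * beta**2 / v0 - 0.5 * np.log(v0) + np.log(1 - prior_pi)
        prob = 1.0 / (1.0 + np.exp(np.clip(log_spike - log_slab, -500, 500)))
        gamma = rng.uniform(size=p) < prob
        if it >= burn:
            incl += gamma
    return incl / (n_iter - burn)

# toy: 2 truly relevant variables out of 20
rng = np.random.default_rng(5)
n, p = 150, 20
X = rng.standard_normal((n, p))
true_beta = np.zeros(p)
true_beta[:2] = 1.5
y = (X @ true_beta + rng.standard_normal(n) > 0).astype(int)
pip = ssvs_probit_gibbs(X, y)
```

The posterior inclusion frequencies of the indicators play the role of the variable selection output; scaling this to ~20000 variables is what motivates the collapsing and Metropolis-within-Gibbs machinery of the talk.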
(1) J.H. Albert and S. Chib. Bayesian analysis of binary and polychotomous response data. JASA, 88(422), 669-679, 1993.
(2) H. Chipman, E.I. George and R.E. McCulloch. The practical implementation of Bayesian model selection. IMS Lecture Notes, P. Lahiri (ed.). Institute of Mathematical Statistics, 2001.
(3) E.I. George and R.E. McCulloch. Variable selection via Gibbs sampling. JASA, 88(423), 881-889, 1993.
(4) K.E. Lee, N. Sha, E.R. Dougherty, M. Vannucci and B.K. Mallick. Gene selection: a Bayesian variable selection approach. Bioinformatics, 19(1), 90-97, 2003.
(5) J.S. Liu, W.H. Wong and A. Kong. Covariance structure and convergence rate of the Gibbs sampler with applications to the comparisons of estimators and augmentation schemes. Biometrika, 81, 27-40, 1994.
(6) J.S. Liu. The collapsed Gibbs sampler in Bayesian computations with application to a gene regulation problem. JASA, 89(427), 958-966, 1994.
(7) C. Robert and G. Casella. Monte Carlo Statistical Methods, second edition. Springer, 2004.
A model of signal shift and drift for tiling array expression data
The tiling design for oligonucleotide microarrays consists of overlapping probes that provide uniform coverage of the genomic sequence. Hybridizing them with RNA samples makes it possible to assess the transcriptional activity of the whole genome of organisms such as bacteria and yeasts at high resolution.
The problem of the analysis of these data is naturally stated in terms of finding segments where the hybridization signal is relatively constant, delimited by breakpoints that are expected to correspond to biological features such as promoters, terminators or splicing sites. A variety of tools including local non-parametric smoothing and simple iterative hypothesis testing have been proposed to answer this question. Today the most popular and best statistically grounded model is the piecewise constant model with Gaussian noise. The simplicity of this approach is appealing but its use presents a number of specific difficulties, the two most obvious being the choice of the number of segments and the high time complexity of the algorithm.
In principle, embedding the segmentation model in a probabilistic setting that includes not only the noise but also the evolution of the signal can alleviate the need to choose a fixed number of breakpoints. In this context, the problem is stated as the estimation of a parameter, and the reconstruction of the underlying signal trajectory can integrate the uncertainty on the exact number of breakpoints. This idea stimulated the development of Hidden Markov models (HMMs). However, transcript level is a continuous quantity and none of the proposed models is satisfactory when the underlying signal is continuous. An HMM achieving this aim at a computationally affordable cost will be presented here.
The proposed model is also markedly richer than the piecewise constant model. First, it automatically accounts for differential affinity between probes via the introduction of covariates. This makes it possible to achieve segmentation and within-array normalization in one step. Second, our model relaxes the assumption of a strictly constant underlying signal between abrupt "shifts" by also allowing progressive "drift".
P. Nicolas, A. Leduc, S. Robin, S. Rasmussen, H. Jarmer and P. Bessières. (2009) Transcriptional landscape estimation from tiling array data using a model of signal shift and drift. Bioinformatics. 25. 2341-2347
Inference of gene regulatory networks using genetical genomics data
Jansen and Nap (Trends Genet., 2001) proposed a strategy to decipher biological mechanisms underlying gene regulation. Both gene expression data and genetic marker data are collected in tissue samples from genetically related individuals. Each of the (tens of) thousands of gene expression levels is treated as a quantitative phenotype, just like morphological, physiological or molecular traits. Chromosomal regions that affect steady-state levels of transcripts are determined by Quantitative Trait Locus (QTL) mapping.
The simultaneous use of genetic and genomic data is a breakthrough in inferring causal relationships between biological entities (e.g. genes) working together to achieve many diverse functions and explaining the observed phenotypic variability. Recently, Bayesian Networks (BNs; Zhu et al., PLoS Comput. Biol., 2007) and Structural Equation Modelling (SEM; Liu et al., Genetics, 2008) have been proposed to infer the topology of a network encoding such relationships. We will briefly review these models and present developments in progress to accurately assess and refine these approaches.
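The elementary building block on which these network-inference approaches rest — a single-marker QTL scan for one expression trait — can be sketched as follows. The LOD scores come from least-squares fits on a simulated backcross-like population; marker spacing, recombination rate, and effect size are all illustrative.

```python
import numpy as np

def eqtl_scan(expr, geno):
    """Single-marker eQTL scan for one transcript: LOD = (n/2) log10(RSS0/RSS1)."""
    n, m = geno.shape
    ec = expr - expr.mean()
    rss0 = (ec ** 2).sum()                       # null model: mean only
    lod = np.empty(m)
    for j in range(m):
        g = geno[:, j] - geno[:, j].mean()
        b = (g @ ec) / (g @ g)                   # regression on marker genotype
        resid = ec - b * g
        lod[j] = 0.5 * n * np.log10(rss0 / (resid ** 2).sum())
    return lod

# toy population: 100 individuals, 50 linked biallelic markers on one chromosome
rng = np.random.default_rng(6)
n, m = 100, 50
geno = (rng.random((n, m)) < 0.5).astype(float)
for j in range(1, m):                            # neighbouring markers co-inherited
    keep = rng.random(n) < 0.9
    geno[keep, j] = geno[keep, j - 1]
# the transcript is controlled by the locus at marker 20
expr = 1.2 * geno[:, 20] + rng.standard_normal(n)
lod = eqtl_scan(expr, geno)
peak = int(np.argmax(lod))
```

Running such a scan for every transcript yields the eQTL map; BN and SEM approaches then use these marker-transcript anchors to orient edges in the regulatory network.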
A method for high-throughput sequencing data analysis:
application to mapping genome-wide protein-DNA binding sites (ChIP-seq)
Our main topic of interest is the study of basal and activated transcription in mammalian T-cells. To this end, we extensively analyze the profiling of RNA polymerase II (the enzyme responsible for transcription) across the genome of developing T-cells. Chromatin immunoprecipitation (ChIP) allows the mapping of genome-wide protein-DNA binding sites. This method, like others, has been greatly improved by the recent development of high-throughput sequencing technologies. Specific methods adapted to sequencing, developed in our team to exploit and analyze this type of genome-wide data, will be presented.
A factor model to analyze heterogeneity in gene expression
Microarray technology allows the simultaneous analysis of thousands of genes within a single experiment. Classical approaches to the analysis of transcriptomic data ignore the gene dependence structure. This leads to correlation among test statistics, which compromises strict control of the false discovery proportion.
We focus our study on a method called FAMT (Friguet et al, 2009) which captures the components of expression heterogeneity into factors. The relevance of factor modeling is first shown on illustrative gene expression data sets in simple situations of heterogeneity. We also use a real expression data set, primarily generated to map QTL for abdominal fatness in chickens (Le Mignon et al, 2009). FAMT provides functional information about a QTL region through a gene (DHCR7) related to the fatness trait and controlled by this region, which is not detected by a classical approach. Then we interpret the independent factors extracted from this biological data set using known information about both the experimental design and the genes. We show that some factors may have different and complex origins, which can be related to particular metabolisms.
As we extract biological information from what was previously considered mere statistical noise, analyzing heterogeneity in gene expression yields a new point of view on transcriptomic data (Blum et al, 2009).
Joint work with Sandrine Lagarrigue and David Causeur.
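The core idea — estimating heterogeneity factors from the residuals of the gene-wise fits and recomputing the test statistics after removing them — can be sketched with a plain SVD. This is an illustration of the principle behind FAMT, not the package itself; the simulated hidden factor and all dimensions are illustrative.

```python
import numpy as np

def tstats(X, y):
    """Gene-wise t statistics via the correlation of each gene with y."""
    Xc = X - X.mean(0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return r * np.sqrt(len(y) - 2) / np.sqrt(1 - r**2)

def factor_adjusted_tstats(X, y, n_factors=1):
    """Estimate latent factors from the residuals of gene-wise fits (SVD),
    then recompute the t statistics after projecting the factors out."""
    Xc = X - X.mean(0)
    yc = y - y.mean()
    b = Xc.T @ yc / (yc @ yc)
    resid = Xc - np.outer(yc, b)             # columns orthogonal to yc
    U, s, Vt = np.linalg.svd(resid, full_matrices=False)
    Z = U[:, :n_factors]                     # estimated heterogeneity factors
    Xa = Xc - Z @ (Z.T @ Xc)                 # remove factor effects from expression
    return tstats(Xa, y)

# toy: 60 arrays, 300 genes, 10 truly differential; one hidden technical factor
rng = np.random.default_rng(9)
n, p, k = 60, 300, 10
y = np.repeat([0.0, 1.0], n // 2)
h = rng.standard_normal(n)                   # hidden factor, independent of y
load = 0.8 * rng.standard_normal(p)          # gene-specific factor loadings
signal = np.zeros(p)
signal[:k] = 1.0
X = np.outer(y - y.mean(), signal) + np.outer(h, load) \
    + 0.5 * rng.standard_normal((n, p))
t_raw = tstats(X, y)
t_adj = factor_adjusted_tstats(X, y)
```

Because the factors are estimated on residuals, they are orthogonal to the effect of interest; removing them reduces the shared "noise" and, as in the abstract, turns heterogeneity into usable information.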
1. Friguet C, Kloareg M, Causeur D: A factor model approach to multiple testing under dependence, JASA, in press.
2. Le Mignon G, Désert C, Pitel F, Leroux S, Demeure O, Guernec G, Douaire M, Le Roy P, Lagarrigue S: Using transcriptome profiling to refine QTL regions on chicken chromosome 5. BMC Genomics, in press.
3. Blum Y, Le Mignon G, Lagarrigue S, Causeur D: A factor model to analyze heterogeneity in gene expression. BMC Bioinformatics, submitted.
Recent challenges in Flow Cytometry
Flow Cytometry (FC) characterises millions of biological cells in a few minutes. FC measures the binding of designed fluorescent molecules to each cell. Recent instruments allow up to twenty parameters to be measured simultaneously for each cell. This huge amount of data leads to new methodological challenges. At the same time, FC makes it possible to work with less biological material or with cohorts of hundreds of patients.
Exact and fast segmentation of large SNP/CGH profiles
Genomic DNA regions are frequently lost or gained during tumour progression. SNP- and CGH-array experiments are used to detect and map these chromosomal imbalances. A CGH profile can be viewed as a succession of segments representing regions in the genome that share the same copy number. Segmentation methods constitute a natural framework for their analysis. From an algorithmic point of view, dynamic programming allows the recovery of the best segmentation, i.e. the global minimizer of the residual sum of squares. However, its time complexity is in O(n²), where n is the number of observations. Therefore these algorithms cannot be applied to data from new technologies, which can contain up to 1,000,000 observations. Here we address the problem of finding the best segmentation for large values of n and propose a new efficient algorithm to recover this segmentation. We empirically show that its mean time complexity is in O(n), and we applied the algorithm to large SNP profiles. Moreover, we demonstrate that at worst the time complexity is in O(n²) and the space complexity is in O(n).
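For reference, the exact dynamic-programming baseline that the new algorithm accelerates can be written compactly: the classical O(Kn²) recursion, with O(1) segment costs obtained from cumulative sums. The pruning that yields the near-linear behaviour described above is not shown, and the toy profile is simulated.

```python
import numpy as np

def best_segmentation(y, K):
    """Exact least-squares segmentation of y into K segments (O(K n^2) DP)."""
    n = len(y)
    c1 = np.concatenate([[0.0], np.cumsum(y)])
    c2 = np.concatenate([[0.0], np.cumsum(y ** 2)])

    def cost(i, j):                      # RSS of y[i:j] around its own mean, O(1)
        s, s2, m = c1[j] - c1[i], c2[j] - c2[i], j - i
        return s2 - s * s / m

    D = np.full((K + 1, n + 1), np.inf)  # D[k, j] = best cost of y[:j] with k segments
    back = np.zeros((K + 1, n + 1), dtype=int)
    D[0, 0] = 0.0
    for k in range(1, K + 1):
        for j in range(k, n + 1):
            cands = [D[k - 1, i] + cost(i, j) for i in range(k - 1, j)]
            i_best = int(np.argmin(cands)) + (k - 1)
            D[k, j] = cands[i_best - (k - 1)]
            back[k, j] = i_best          # start of the last segment
    bps, j = [], n                        # backtrack the breakpoints
    for k in range(K, 0, -1):
        j = back[k, j]
        bps.append(j)
    return sorted(bps[:-1]), D[K, n]

# toy piecewise-constant profile with breakpoints at 100 and 180
rng = np.random.default_rng(7)
y = np.concatenate([np.zeros(100), 2 * np.ones(80), -np.ones(120)]) \
    + 0.3 * rng.standard_normal(300)
bps, rss = best_segmentation(y, K=3)
```

The quadratic inner loop over candidate segment starts is exactly where the pruning operates; on profiles of 1,000,000 points it is this loop that makes the naive recursion infeasible.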
Bai, J. and Perron, P. (2003). Computation and analysis of multiple structural change models. J. Appl. Econ., 18:1–22.
Guthery, S. B. (1974). Partition regression. Journal of the American Statistical Association, 69:945-947.
Lavielle, M. (2005). Using penalized contrasts for the change-point problem. Signal Processing, 85(8):1501–1510.
Picard, F., Robin, S., Lavielle, M., Vaisse, C., and Daudin, J.-J. (2005). A statistical approach for array CGH data analysis. BMC Bioinformatics, 6:27.