TITLE: Model Selection and Estimation in High Dimensional Settings
STUDENT: Rodrigue Ngueyep Tzoumpe
Several statistical problems can be described as estimation problems, where the goal is to learn a set of parameters from data by maximizing a criterion. These types of problems are typically encountered in a supervised learning setting, where we want to relate one or more outputs to multiple inputs. The relationship between these outputs and inputs can be complex, and this complexity can be attributed to the high dimensionality of the space containing the inputs and the outputs; the existence of structural prior knowledge within the inputs or the outputs that, if ignored, may lead to inefficient estimates of the parameters; and the presence of a non-trivial noise structure in the data.
In Chapter 2, we study one of the most commonly used multivariate time series models, the Vector Autoregressive Model (VAR). VAR is generally used to identify lead, lag, and contemporaneous relationships describing Granger causality within and between time series. In this chapter, we investigate VAR methodology for analyzing data consisting of multi-layer time series which are spatially interdependent. When modeling VAR relationships for such data, the dependence between time series is both a curse and a blessing: a curse because it requires modeling the between-series correlation and the contemporaneous relationships, which may be challenging when using likelihood-based methods; a blessing because the spatial correlation structure can be used to specify the lead-lag relationships within and between time series, within and between layers. To address these challenges, we propose an L1/L2 regularized likelihood estimation method. The lead, lag, and contemporaneous relationships are estimated using a new coordinate descent algorithm that exploits sparsity in the VAR structure, accounts for the spatial dependence, and models the error dependence. We assess the performance of the proposed VAR model and compare it with existing methods in a simulation study. We also apply the proposed methodology to a large number of state-level US economic time series.
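As a minimal illustration of how an L1/L2 (group-lasso) penalty induces sparsity in VAR estimation, the following Python sketch fits a VAR(1) coefficient matrix by proximal gradient descent with column-wise group soft-thresholding. This is a simplified stand-in, not the coordinate descent algorithm of Chapter 2: it ignores the spatial structure and assumes independent errors, and all function names and tuning values are illustrative.

```python
import numpy as np

def group_soft_threshold(v, t):
    """Shrink the Euclidean norm of the vector v by t (zero if norm <= t)."""
    nv = np.linalg.norm(v)
    return np.zeros_like(v) if nv <= t else (1.0 - t / nv) * v

def var1_group_lasso(Y, lam=0.15, n_iter=300):
    """Fit y_t = A y_{t-1} + e_t with an L1/L2 penalty on the columns of A
    (all lag coefficients attached to one predictor series form a group),
    via proximal gradient descent (ISTA)."""
    X, Z = Y[1:], Y[:-1]                   # responses and lagged predictors, T x p
    T, p = X.shape
    step = T / np.linalg.norm(Z, 2) ** 2   # 1 / Lipschitz constant of the gradient
    A = np.zeros((p, p))
    for _ in range(n_iter):
        grad = (A @ Z.T - X.T) @ Z / T     # gradient of (1/2T)||X - Z A'||_F^2
        A = A - step * grad
        for j in range(p):                 # prox step: group soft-thresholding
            A[:, j] = group_soft_threshold(A[:, j], step * lam)
    return A
```

On a simulated VAR(1) in which only the first series drives the system, the penalty zeroes out entire columns of the estimated matrix, recovering which series Granger-cause the others.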
In Chapter 3, we propose a new methodology to tackle the problem of high-dimensional nonparametric learning in the multi-response, or multitask, learning setting. We impose sparsity constraints that allow the recovery of the additive functions that are the most influential across tasks and responses. The methodology applies a functional L1/L2 norm to each group of additive functions, where each group contains all the additive functions associated with a specific predictor. We derive a novel thresholding condition for union support recovery in the nonparametric setting. A new functional block coordinate descent algorithm is developed to solve for all the additive functions. By applying the methodology to a benchmark cancer data set, we are able to perfectly classify 83 cancer patients into 4 distinct cancer categories using only 12 out of 2308 genes. The method is also used to uncover the determinants of health that drive the county-level cost of care in the state of North Carolina from 2005 to 2009.
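To make the functional L1/L2 grouping concrete, here is an illustrative Python sketch (not the functional block coordinate descent algorithm of Chapter 3): each predictor is expanded in a small polynomial basis, and a group-lasso penalty ties together all basis coefficients of that predictor across all responses, so a predictor's additive function is selected or dropped jointly for every task. The basis choice and tuning values are assumptions for the example.

```python
import numpy as np

def multiresponse_sparse_additive(X, Y, n_basis=3, lam=0.25, n_iter=500):
    """Group-sparse additive model sketch: one group per predictor, containing
    the basis coefficients of its additive function for ALL responses, so the
    L1/L2 penalty selects predictors jointly across tasks."""
    n, p = X.shape
    Y = Y - Y.mean(axis=0)                       # center responses
    # polynomial basis [x_j, x_j^2, ..., x_j^n_basis] for each predictor
    Phi = np.hstack([np.column_stack([X[:, j] ** d for d in range(1, n_basis + 1)])
                     for j in range(p)])
    Phi = (Phi - Phi.mean(axis=0)) / Phi.std(axis=0)
    step = n / np.linalg.norm(Phi, 2) ** 2       # 1 / Lipschitz constant
    B = np.zeros((Phi.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        B = B - step * Phi.T @ (Phi @ B - Y) / n # gradient step on squared loss
        for j in range(p):                       # prox: functional group threshold
            blk = B[j * n_basis:(j + 1) * n_basis]
            nb = np.linalg.norm(blk)
            B[j * n_basis:(j + 1) * n_basis] = (
                np.zeros_like(blk) if nb <= step * lam else (1 - step * lam / nb) * blk)
    return B
```

When two responses depend (through different functions) on the same predictor, the whole group of that predictor's coefficients survives the threshold, while irrelevant predictors are zeroed out for both responses at once.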
Motivated by the analysis of Positron Emission Tomography (PET) imaging data considered in Bowen et al. (2012), we introduce in Chapter 4 a semiparametric topographical mixture model able to capture the characteristics of dichotomous shifted response-type experiments. We propose a pointwise estimation procedure for the proportion and location functions involved in our model. Our estimation procedure relies only on the symmetry of the local noise and does not require any finite moments on the errors (e.g., Cauchy-type errors). We establish, under mild conditions, minimax properties and asymptotic normality of our estimators. Moreover, Monte Carlo simulations are conducted to examine their finite sample performance. Finally, the proposed method is illustrated through a statistical analysis of the PET imaging data in Bowen et al. (2012).
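As an illustrative sketch (an assumed simulation design, not the chapter's data analysis), the following Python code simulates the kind of dichotomous shifted-response model described above: with probability p(x) the response at location x is shifted by mu(x), and the noise is symmetric Cauchy, which has no finite moments.

```python
import numpy as np

def simulate_shifted_mixture(x, p_fn, mu_fn, rng):
    """Draw Y = B * mu(x) + eps with B ~ Bernoulli(p(x)) and eps standard
    Cauchy: a symmetric error with no finite moments, as allowed by the model."""
    x = np.asarray(x, dtype=float)
    shifted = rng.random(x.shape) < p_fn(x)    # which observations respond
    eps = rng.standard_cauchy(x.shape)         # symmetric, heavy-tailed noise
    return np.where(shifted, mu_fn(x), 0.0) + eps
```

Because the errors are Cauchy, moment-based estimators break down here, while symmetry-based quantities such as the local median remain informative about mu(x).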