TITLE: Model-Based Data Mining Methods for Identifying Patterns in Medical and Health Data

ABSTRACT:

Every day, humans and machines create massive amounts of data. Alongside the growth of these data banks, a new field of study, data science, has emerged. The central role of data science is to infer knowledge from data in the form of models and estimates, employing methods at the intersection of computer science, data mining, mathematics, and statistics. In this thesis we provide statistical and model-based data mining methods for pattern detection, with applications to biomedical and healthcare data sets. In particular, we examine applications in the management of costly acute or chronic diseases. Health data are extremely varied: at the macro level, one can examine the healthcare utilization of millions of patients in insurance systems such as Medicare and Medicaid, while at the micro level, a single snapshot from a medical imaging device may be used to detect cancerous cells in the body. At every scale, statisticians can contribute methods that extract structure from large, noisy data.

In Chapter II, we consider nuclear magnetic resonance (NMR) experiments in which we seek to locate and de-mix smooth yet highly localized components in a noisy two-dimensional signal. Using wavelet-based methods, we separate these components from the noisy background as well as from neighboring components. In Chapter III, we pilot methods for identifying profiles of patient utilization of the healthcare system from large, highly sensitive, patient-level data. We combine model-based data mining methods with cluster analysis to extract longitudinal utilization profiles. We transform these profiles into simple visual displays that can inform policy decisions and quantify the potential cost savings of interventions that improve adherence to recommended care guidelines. In Chapter IV, we propose new methods that integrate survival analysis models and cluster analysis to profile patient-level utilization behaviors. These methods control for variations in the population’s demographic and healthcare characteristics and explain variations in utilization attributable to different state-based Medicaid programs and to access and urbanicity measures.