TITLE: Extracting Meaningful Statistics for the Characterization and Classification of Biological, Medical, and Financial Data
ABSTRACT:
This thesis is focused on extracting meaningful statistics for the characterization and classification of biological, medical, and financial data and contains four chapters. The first chapter contains theoretical background on scaling and wavelets, which supports the work in chapters two and three.
In the second chapter, we outline a methodology for representing sequences of DNA nucleotides as numeric matrices in order to analytically investigate important structural characteristics of DNA. This methodology involves assigning unit vectors to nucleotides, placing the vectors into columns of a matrix, and accumulating across the rows of this matrix. Transcribing the DNA in this way allows us to compute the 2-D wavelet transformation and assess regularity characteristics of the sequence via the slope of the wavelet spectra. In addition to computing a global slope measure for a sequence, we can apply our methodology for overlapping sections of nucleotides to obtain an evolutionary slope.
In the third chapter, we describe various ways wavelet-based scaling may be used for cancer diagnostics. There were nearly half of a million new cases of ovarian, breast, and lung cancer in the United States last year. Breast and lung cancer have highest prevalence, while ovarian cancer has the lowest survival rate of the three. Early detection is critical for all of these diseases, but substantial obstacles to early detection exist in each case. In this work, we use wavelet-based scaling on metabolic data and radiography images in order to produce meaningful features to be used in classifying cases and controls. Computer-aided detection (CAD) algorithms for detecting lung and breast cancer often focus on select features in an image and make a priori assumptions about the nature of a nodule or a mass. In contrast, our approach to analyzing breast and lung images captures information contained in the background tissue of images as well as information about specific featu res and makes no such a priori assumptions.
In the fourth chapter, we investigate the value of social media data in building commercial default and activity credit models. We use random forest modeling, which has been shown in many instances to achieve better predictive accuracy than logistic regression in modeling credit data. This result is of interest, as some entities are beginning to build credit scores based on this type of publicly available online data alone. Our work has shown that the addition of social media data does not provide any improvement in model accuracy over the bureau only models. However, the social media data on its own does have some limited predictive power.