Date on Master's Thesis/Doctoral Dissertation


Document Type

Doctoral Dissertation

Degree Name

Ph. D.



Committee Chair

Cerrito, Patricia

Author's Keywords

Data mining; Patient data; Statistics; Health care; AMI; Utilization


Medical records--Data processing; Data mining; Medical care--Data processing; Coronary heart disease--Treatment


The goal of this study is to use a data mining framework to assess the three main treatments for acute myocardial infarction: thrombolytic therapy, percutaneous coronary intervention (percutaneous angioplasty), and coronary artery bypass surgery. A data mining framework is needed because the study uses real-world data rather than the highly clean and homogeneous data found in most clinical trials and epidemiological studies. The assessment is based on determining a profile of patients undergoing an episode of acute myocardial infarction, determining resource utilization by treatment, and creating a model that predicts each treatment's resource utilization and cost. Text mining is used to find a subset of input attributes that characterize subjects who undergo the different treatments for acute myocardial infarction, as well as distinct resource utilization profiles. Classical statistical methods are used to evaluate the results of text clustering. The features selected by supervised learning are used to build predictive models for resource utilization and are compared with the features selected by traditional statistical methods for a predictive model with the same outcome. Sequence analysis is used to determine the sequence of treatments for acute myocardial infarction. The resulting sequence is used to construct a probability tree that forms the basis for a cost-effectiveness analysis comparing acute myocardial infarction treatments. To determine effectiveness, survival analysis is used to assess the occurrence of death during hospitalization, the likelihood of a repeated episode of acute myocardial infarction, and the length of time until a recurrent episode of acute myocardial infarction or death. The complexity of this study stemmed mainly from the data source used: administrative data from insurance claims.
Such a data source was not originally designed for the study of health outcomes or health resource utilization. However, once the record tables were transformed from many-to-many relations to one-to-one relations, they became useful for tracking the evolution of disease and disease outcomes. Likewise, once the tables were transformed from wide format to long format, the records became analyzable by many data mining algorithms. Moreover, this study contributed to the fields of applied mathematics and public health by implementing a sequence analysis on consecutive procedures to determine the sequence of events that describes the evolution of a hospitalization for acute myocardial infarction. The same data transformation and algorithm can be used in the study of rare diseases whose evolution is not well understood.
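
The wide-to-long transformation and the ordering of consecutive procedures described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: the column names (`patient_id`, `claim_date`, `proc1` to `proc3`), the sample rows, and the helper functions `wide_to_long` and `procedure_sequences` are all hypothetical, chosen only to show the general shape of the idea on claims-like records.

```python
from collections import defaultdict

# Hypothetical wide-format claims: one row per claim, one column per procedure slot.
wide_claims = [
    {"patient_id": "A", "claim_date": "2005-01-03",
     "proc1": "thrombolysis", "proc2": "PCI", "proc3": None},
    {"patient_id": "A", "claim_date": "2005-01-05",
     "proc1": "CABG", "proc2": None, "proc3": None},
    {"patient_id": "B", "claim_date": "2005-02-10",
     "proc1": "PCI", "proc2": None, "proc3": None},
]


def wide_to_long(rows, slot_columns):
    """Return one record per non-empty procedure slot (long format)."""
    long_rows = []
    for row in rows:
        for slot, column in enumerate(slot_columns, start=1):
            procedure = row.get(column)
            if procedure is not None:
                long_rows.append({
                    "patient_id": row["patient_id"],
                    "claim_date": row["claim_date"],
                    "slot": slot,
                    "procedure": procedure,
                })
    return long_rows


def procedure_sequences(long_rows):
    """Order each patient's procedures by claim date, then by slot position."""
    # ISO-formatted dates (YYYY-MM-DD) sort correctly as plain strings.
    ordered = sorted(
        long_rows,
        key=lambda r: (r["patient_id"], r["claim_date"], r["slot"]),
    )
    sequences = defaultdict(list)
    for record in ordered:
        sequences[record["patient_id"]].append(record["procedure"])
    return dict(sequences)


long_claims = wide_to_long(wide_claims, ["proc1", "proc2", "proc3"])
sequences = procedure_sequences(long_claims)
```

In this sketch each patient ends up with a single ordered list of procedures, which is the kind of input a sequence analysis or a probability tree of treatment paths would start from. Real claims data would also need de-duplication and handling of missing or tied dates, which are omitted here.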