Date on Master's Thesis/Doctoral Dissertation


Document Type

Doctoral Dissertation

Degree Name

Ph. D.



Committee Chair

Cerrito, Patricia


Bioinformatics; Breast--Cancer--Treatment; Medical care--Data processing


Statistical models have been the first choice for comparative effectiveness analysis in clinical research. Though effective, these models are limited when the data to be analyzed do not fit the assumed distributions, which is mostly the case when the study is not a clinical trial. In this project, data mining, decision analysis, and cost-effectiveness analysis methods were used to supplement statistical models in comparing lumpectomy to mastectomy for the surgical treatment of breast cancer. Mastectomy has been the gold standard for breast cancer treatment since the 1800s. In the 20th century, the equivalence of mastectomy and lumpectomy was established in terms of long-term survival and disease-free survival. However, short-term comparative effectiveness in post-operative outcomes has not been fully explored. Studies using administrative data are lacking, and no study has used new technologies of self-expression, particularly the internet discussion board.
In this study, the data came from the Nationwide Inpatient Sample (NIS) 2005, the Thomson Reuters MarketScan 2000-2001, the medical literature on clinical trials, and individuals' posts on online discussion boards. The NIS was used to compare lumpectomy to mastectomy in terms of hospital length of stay, total charges, and in-hospital death at the time of surgery. The MarketScan data were used to evaluate follow-up outcomes in terms of risk of repeat hospitalization, risk of repeat operation, number of outpatient services, number of prescribed medications, and length of stay and total charges per post-operative hospital admission, over an average follow-up period of eight months. The MarketScan data were also used to construct a simple predictive model of post-operative hospital admission and to perform a short-term cost-effectiveness analysis. The medical literature was used to analyze long-term (10-year) mortality and recurrence for both treatments.
The web postings were used to evaluate the comparative cost of improving quality of life, measured as patient satisfaction. In the NIS and MarketScan data, International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) diagnosis codes were used to extract cases of breast cancer, and ICD-9-CM procedure codes and Current Procedural Terminology, 4th Edition procedure codes were used to form treatment groups. Data were pre-processed and prepared for analysis using data mining techniques such as clustering, sampling, and text mining. To clean the data for statistical models, some continuous variables were normalized using methods such as logarithmic transformation. Statistical models such as linear regression, generalized linear models, logistic regression, and proportional hazards (Cox) regression were used to compare post-operative outcomes of lumpectomy versus mastectomy. Neural network, decision tree, and logistic regression techniques were compared to build a simple model predicting 90-day post-operative hospital re-admission. Cost and effectiveness were compared with the Incremental Cost Effectiveness Ratio (ICER). A simple method to process and analyze online postings was created and used to bring patients' input into the comparison of lumpectomy to mastectomy. All statistical analyses were performed in SAS 9.2. Data mining was performed in SAS Enterprise Miner (EM) 6.1 and SAS Text Miner. Decision analysis and cost-effectiveness analysis were performed in TreeAge Pro 2011.
A simple comparison of the two procedures using the NIS 2005, a discharge-level data set, showed that, in general, lumpectomy is associated with a significantly longer stay and higher charges on average. From the MarketScan data, a person-level data set in which a patient can be followed longitudinally, it was found that for the initial hospitalization, patients who underwent mastectomy had a non-significantly longer hospital stay and significantly lower charges.
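For reference, the Incremental Cost Effectiveness Ratio named in the methods is conventionally defined as the difference in mean cost between two strategies divided by the difference in their mean effectiveness. The Python sketch below illustrates only that definition. The `Strategy` class, the `icer` function, and every number in it are hypothetical and are not taken from this study, whose analyses were run in TreeAge Pro 2011.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    """One treatment option (hypothetical container, not from the study)."""
    name: str
    cost: float           # mean cost per patient
    effectiveness: float  # mean effect per patient, e.g. proportion satisfied


def icer(new: Strategy, reference: Strategy) -> float:
    """Incremental cost per additional unit of effect of `new` vs `reference`."""
    delta_cost = new.cost - reference.cost
    delta_effect = new.effectiveness - reference.effectiveness
    if delta_effect == 0:
        # The ratio is undefined when both strategies are equally effective.
        raise ValueError("ICER is undefined when effectiveness is equal")
    return delta_cost / delta_effect


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
reference = Strategy("treatment A", cost=12_000.0, effectiveness=0.50)
new = Strategy("treatment B", cost=11_000.0, effectiveness=0.75)

ratio = icer(new, reference)
print(f"ICER of {new.name} vs {reference.name}: {ratio:.2f} per unit of effect")
```

With these made-up inputs the ratio is negative because the new strategy is both cheaper and more effective. In that situation the new strategy is usually described as dominant, and the sign of the ratio matters more than its magnitude.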
The post-operative number of outpatient services and prescribed medications, as well as the length of stay and charges for post-operative hospital admissions, did not differ significantly between the two procedures. Using the MarketScan data, it was also found that the best model to predict 90-day post-operative hospital admission was logistic regression. The logistic regression revealed that the risk of a hospital re-admission within 90 days after surgery was 65% for a patient who underwent lumpectomy and 48% for a patient who underwent mastectomy. A cost-effectiveness analysis using Markov models for up to 100 days after surgery showed that lumpectomy saved hospital-related costs every day, with a minimum saving of $33 on day 10. In terms of long-term outcomes, decision analysis applied to the literature review data revealed that, 10 years after surgery, 739 recurrences and 84 deaths were prevented among 10,000 women who had mastectomy instead of lumpectomy. Factoring patients' preferences into the comparison of the two procedures, it was found that patients who undergo lumpectomy are non-significantly more satisfied than their peers who undergo mastectomy. In terms of cost, it was found that lumpectomy saves $517 for each satisfied individual in comparison to mastectomy.
In conclusion, the current project showed how to use data mining, decision analysis, and cost-effectiveness methods to supplement statistical analysis for a more complete analysis of real-world, non-clinical-trial data. Applying this combination of methods to the comparative effectiveness of lumpectomy and mastectomy showed that lumpectomy was the better choice in terms of cost and patients' quality of life, measured as satisfaction.
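The figures reported above can be restated as standard effect measures with simple arithmetic. The Python sketch below is a reader's aid, not part of the study's analysis: it converts the events prevented per 10,000 women into an absolute risk reduction and its reciprocal, and converts the two 90-day re-admission risks into an unadjusted odds ratio. The function names are hypothetical, and the odds ratio assumes the two reported risks can be compared directly, which may differ from the coefficient of the study's fitted logistic regression.

```python
def absolute_risk_reduction(events_prevented: int, population: int) -> float:
    """Events prevented expressed as a proportion of the population."""
    return events_prevented / population


def number_needed_to_treat(arr: float) -> float:
    """Patients treated per one event prevented (reciprocal of the ARR)."""
    if arr <= 0:
        raise ValueError("absolute risk reduction must be positive")
    return 1.0 / arr


def odds(probability: float) -> float:
    """Convert a probability strictly between 0 and 1 to odds."""
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return probability / (1.0 - probability)


def odds_ratio(p_exposed: float, p_reference: float) -> float:
    """Unadjusted odds ratio from two risks."""
    return odds(p_exposed) / odds(p_reference)


# Figures quoted in the abstract: per 10,000 women, 10 years after surgery.
arr_recurrence = absolute_risk_reduction(739, 10_000)
arr_death = absolute_risk_reduction(84, 10_000)
nnt_recurrence = number_needed_to_treat(arr_recurrence)
nnt_death = number_needed_to_treat(arr_death)

# 90-day re-admission risk: 65% after lumpectomy, 48% after mastectomy.
readmission_or = odds_ratio(0.65, 0.48)

print(f"Recurrence: ARR {arr_recurrence:.2%}, about {nnt_recurrence:.1f} per event prevented")
print(f"Death: ARR {arr_death:.2%}, about {nnt_death:.1f} per event prevented")
print(f"Re-admission odds ratio, lumpectomy vs mastectomy: {readmission_or:.2f}")
```

Read this way, roughly 14 women would need to have mastectomy rather than lumpectomy to prevent one recurrence over 10 years, and roughly 119 to prevent one death, under the assumption that the literature-based estimates apply uniformly.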