Artificial intelligence in dentistry: Harnessing big data to predict oral cancer survival
2020-04-12
Man Hung, Jungweon Park, Eric S Hon, Jerry Bounsanga, Sara Moazzami, Bianca Ruiz-Negrón, Dawei Wang
Man Hung, Jungweon Park, Sara Moazzami, College of Dental Medicine, Roseman University of Health Sciences, South Jordan, UT 84095, United States
Man Hung, Department of Orthopaedic Surgery Operations, University of Utah, Salt Lake City, UT 84108, United States
Man Hung, College of Social Work, University of Utah, Salt Lake City, UT 84112, United States
Man Hung, Division of Public Health, University of Utah, Salt Lake City, UT 84108, United States
Man Hung, Department of Educational Psychology, University of Utah, Salt Lake City, UT 84109, United States
Eric S Hon, Department of Economics, University of Chicago, Chicago, IL 60637, United States
Jerry Bounsanga, Research Section, Utah Medical Education Council, Salt Lake City, UT 84102, United States
Bianca Ruiz-Negrón, College of Social and Behavioral Sciences, University of Utah, Salt Lake City, UT 84112, United States
Dawei Wang, Data Analytics Unit, Walmart Inc., Bentonville, AR 72716, United States
Abstract BACKGROUND Oral cancer is the sixth most prevalent cancer worldwide. Public knowledge of oral cancer risk factors and survival is limited. AIM To develop machine learning (ML) algorithms to predict the length of survival for individuals diagnosed with oral cancer, and to explore the most important factors responsible for shortening or lengthening oral cancer survival. METHODS We used the Surveillance, Epidemiology, and End Results database from the years 1975 to 2016, which consisted of a total of 257880 cases and 94 variables. Four ML techniques in the area of artificial intelligence were applied for model training and validation. Model accuracy was evaluated using mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), R² and adjusted R². RESULTS The most important factors predictive of oral cancer survival time were age at diagnosis, primary cancer site, tumor size and year of diagnosis. Year of diagnosis referred to the year when the tumor was first diagnosed, implying that individuals with tumors diagnosed in the modern era tend to have longer survival than those diagnosed in the past. The extreme gradient boosting ML algorithm showed the best performance, with an MAE of 13.55, MSE of 486.55 and RMSE of 22.06. CONCLUSION Using artificial intelligence, we developed a tool that can be used for oral cancer survival prediction and for medical decision-making. The finding relating to the year of diagnosis represents an important new discovery in the literature. The results of this study have implications for cancer prevention and education for the public.
Key Words: Oral cancer survival; Machine learning; Artificial intelligence; Dental medicine; Public health; Surveillance, Epidemiology, and End Results; Quality of life
INTRODUCTION
To minimize the occurrence of oral cancer and improve one's quality of life, it is imperative to conduct screenings for early detection of head and neck carcinomas (HNC) on all high-risk dental patients. HNC, the umbrella term that includes oral cancer, is often located within the oral and nasal cavities, upper/lower pharynx, larynx, and the maxillary sinus[1-3]. Early screenings for identification of dysplastic tissue in the head and neck region are within the scope of care of dental health providers. Oral cancer may be curable if detected early[4]. However, more than one-half of all oral and pharyngeal cancers in the United States were detected at late stages[4,5]; thus the overall United States five-year survival rate for oral cancer was only 52 percent[6]. In 2012, there were 145000 deaths in the United States attributed to oral cancers[2]. Throughout the world, approximately 563826 diagnoses of oral cancer were reported, rendering it the sixth most common type of cancer in the world[7-10]. Although there is a downward trend in oral cancer incidence in the United States due to rising awareness of the risks associated with tobacco use and alcohol consumption, a general lack of public awareness of the symptoms and other risk factors of oral cancer remains[4]. In 2016, a total of 48330 oral cavity and oropharyngeal cancer incidents were reported[11], and an increase of 225% in human papillomavirus-related oropharynx cancer was recorded[11]. Altogether, addressing additional risk factors and increasing early screening for oral cancers are key to improving cancer survival[12].
Among those with an oral cancer diagnosis, stage of tumor at time of diagnosis and treatment have been associated with survival[10,13]. Specifically, in a study by Sargeran et al[13], the survival rates were higher in patients with stage I or II cancer than in those with stage III cancer at the time of the diagnosis. They further concluded that patients who had undergone radiotherapy alone had a lower survival rate than patients with a combination of surgery and radiotherapy, and that age and sex were not associated with survival. However, Warnakulasuriya et al[10] found that younger age was associated with a higher 5-year relative survival rate.
Additionally, race has been associated with varying levels of survival rates. A study by Shiboski et al[15] using 1973-2002 data from Surveillance, Epidemiology, and End Results (SEER-18)[14] revealed that the stage at diagnosis was related to 5-year relative survival rate among Whites and Blacks. The results indicated that Blacks had a significantly higher rate of cancer, mainly located on the tongue, with tumors larger than 4 cm in diameter at the time of diagnosis. Black men experienced lower 5-year relative survival rates compared to White men, especially for tongue cancer. Shiboski et al[15] explained that the differences in survival across races may be due to differences in access to, and utilization of, healthcare services.
Due to the limited understanding of the disparities seen across cancer survivors and limited public knowledge of risk factors and symptoms, investigators in the past have suggested that primary care providers place greater weight on initial screening and comprehensive soft-tissue exams[15]. Having a tool to accurately predict the survival time of oral cancer patients could help regulate the effects of psychological distress on physical and mental health outcomes after diagnosis. Medical decision-making tools based on fuzzy and soft set theories and artificial intelligence are effective for determining cancer survival and enhancing disease awareness[16]. Awareness of the disease can lessen its burden on survivors and their caretakers, and assist with medical and dental decision-making moving forward. The main purpose of this study was to apply artificial intelligence to build a model to predict the length of survival for those diagnosed with oral cancer as accurately and precisely as possible, based on more than 40 years' worth of data representative of the United States population. The secondary purpose was to explore the most important factors influencing the length of oral cancer survival.
MATERIALS AND METHODS
Data
Data from the SEER-18 database[14] were used to conduct this study. The SEER-18 database is a population-based registry that contains cancer-related data on individuals diagnosed with cancer from hospitals and laboratories in the United States[14]. The SEER-18 database does not contain data from Louisiana during hurricanes Katrina and Rita from July to December in 2005[14]. Institutional review board approval was not required for this study since the SEER-18 data were deidentified and publicly available online. The data that support the findings of this study are openly available at https://seer.cancer.gov/.
Oral cancer cases from the years 1975 to 2016 in the SEER-18 database[17] were identified by the International Classification of Diseases for Oncology, 3rd Edition (ICD-O-3) site codes (https://training.seer.cancer.gov/head-neck/abstract-codestage/codes.html)[18]. Table 1 contains a list of all ICD-O-3 site codes that were identified for the oral cancer cases utilized in this study[18-20].
Analytical approach
The outcome of interest for this study was oral cancer survival time. Survival time represented the time of survival in months from the date of cancer diagnosis to the date of last contact[21,22].
Descriptive statistics of demographics and cancer characteristics (such as primary site, tumor size, laterality, etc.) were analyzed. Prediction of oral cancer survival time was modeled using four machine learning (ML) algorithms: linear regression, decision tree, random forest, and extreme gradient boosting (XGBoost). ML is a computer algorithm-based method that can efficiently detect relationships between variables with unrecognizable trends in large and complex data. The process takes historical trends into account to build models that predict the outcome of interest (e.g., oral cancer survival time), and then validates the models with actual or current data. The performance of the various models in the validation process is compared, and the more parsimonious model with better performance is generally preferred. The ML techniques included in this study were chosen for their ability to prevent over-fitting, their common use in similar studies, and their ease of interpretation in medical settings. To compare the different techniques, model accuracy was evaluated using mean absolute error (MAE)[1,13], mean squared error (MSE), root mean squared error (RMSE), R² and adjusted R². All analyses were conducted using Python 3.7.4 (Python Software Foundation).
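The five accuracy measures named above can be written out in a few lines. The following is a minimal sketch using only the Python standard library; the function name `regression_metrics` and the toy survival times are illustrative assumptions, not the authors' code or SEER data.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, MSE, RMSE, R2 and adjusted R2 for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # mean absolute error
    mse = sum(e * e for e in errors) / n           # mean squared error
    rmse = math.sqrt(mse)                          # root mean squared error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot
    # Adjusted R2 penalizes R2 for the number of features used.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2, "adj_R2": adj_r2}

# Toy survival times in months (made-up values for illustration only).
actual = [12.0, 30.0, 45.0, 60.0, 90.0, 120.0]
predicted = [20.0, 28.0, 50.0, 55.0, 80.0, 100.0]
metrics = regression_metrics(actual, predicted, n_features=2)
print(metrics)
```

Because the adjustment factor (n - 1) / (n - p - 1) approaches 1 when the number of cases n is far larger than the number of features p, R² and adjusted R² are expected to be nearly identical on a large validation set.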
Table 1 Number of oral cancer cases from various anatomical sites
ICD-O-3 codes: International Classification of Diseases for Oncology, 3rd ed; NOS: Not otherwise specified.
There was a total of 257880 oral cancer cases and 94 variables (i.e., features) in the dataset. Cases with missing data on the outcome variable (i.e., oral cancer survival time) were dropped, and responses that were marked as not applicable were excluded. All variables with more than 40% missing values were also excluded. Further data processing was conducted to remove null features, constant features (i.e., features with the same value for all cases), quasi-constant features (i.e., features with variance less than 0.01), and highly correlated features (i.e., features with correlation higher than 0.9). These features were removed prior to data analysis as they would not contribute to the prediction of the outcome and can often cause errors in the prediction. Outliers were detected by plotting the distribution of each variable and were replaced by the mean, mode, or quantile as appropriate. Features in which more than 90% of the values were identical were dropped.
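The filtering rules above (more than 40% missing, variance below 0.01, correlation above 0.9) can be sketched as follows. This is a standard-library illustration under stated assumptions: the helper names, the toy columns, and the choice to keep the first feature of a correlated pair are hypothetical, since the paper does not say which member of a pair was dropped.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # correlation is undefined for a constant column
    return cov / (sx * sy)

def filter_features(columns, max_missing=0.40, min_variance=0.01, max_corr=0.90):
    """Return names of columns that pass the missing, variance and correlation filters.

    `columns` maps a feature name to a list of numbers, with None for missing.
    """
    kept = []
    for name, values in columns.items():
        present = [v for v in values if v is not None]
        if not present or 1 - len(present) / len(values) > max_missing:
            continue  # null feature or too many missing values
        if statistics.pvariance(present) < min_variance:
            continue  # constant or quasi-constant feature
        kept.append(name)
    selected = []
    for name in kept:
        redundant = False
        for other in selected:
            # Compare only rows where both features are observed.
            pairs = [(a, b) for a, b in zip(columns[name], columns[other])
                     if a is not None and b is not None]
            if len(pairs) < 2:
                continue
            xs, ys = zip(*pairs)
            if abs(pearson(xs, ys)) > max_corr:
                redundant = True  # assumption: keep the earlier feature of the pair
                break
        if not redundant:
            selected.append(name)
    return selected

# Toy columns (made-up values, not SEER data).
toy = {
    "age": [50, 61, 47, 72, 66, 58, 39, 80],
    "age_months": [600, 732, 564, 864, 792, 696, 468, 960],  # duplicates age
    "mostly_missing": [1, None, None, None, None, 2, None, None],
    "constant": [3, 3, 3, 3, 3, 3, 3, 3],
    "tumor_size": [12, 30, 8, 22, 41, 15, 27, 10],
}
selected_features = filter_features(toy)
print(selected_features)
```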
To avoid the impracticality of including too many variables, further feature selection was performed using random forest. We aimed to narrow down the variables as much as possible without losing prediction accuracy. The random forest model showed that many features were of little importance (Figure 1). We dropped the 7 features with the lowest importance scores, and a step backward feature selection method with random forest was then applied to select the best number of features. The cross-validation scores were then plotted (Figure 2) and the 10 most important features were kept to create a parsimonious model. The cross-validation scores did not change much even after deleting the less significant features. The selected 10 features were: Year of diagnosis; primary site; age at diagnosis; CS tumor size; CS extension; CS lymph nodes eval; RX Summ-surg prim site; derived AJCC stage group; site recode ICD-O-3/WHO 2008; and month of diagnosis.
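The step backward selection described above can be illustrated generically. In this sketch, `score_fn` is a hypothetical stand-in for a cross-validated random forest score, and the feature names and values in `TOY_VALUE` are invented so that the example runs without any modeling library; it is not the authors' procedure.

```python
def backward_elimination(features, score_fn, min_features=1):
    """Greedy step backward selection.

    At each step, drop the feature whose removal leaves the highest score.
    Returns a list of (feature_subset, score) pairs from the full set down
    to `min_features`, so scores can be plotted against subset size.
    """
    current = list(features)
    history = [(tuple(current), score_fn(current))]
    while len(current) > min_features:
        candidates = []
        for f in current:
            subset = [g for g in current if g != f]
            candidates.append((score_fn(subset), subset))
        best_score, best_subset = max(candidates, key=lambda c: c[0])
        current = best_subset
        history.append((tuple(current), best_score))
    return history

# Hypothetical per-feature contributions to a validation score (made up).
TOY_VALUE = {"strong": 0.40, "moderate": 0.20, "weak": 0.08, "tiny": 0.01, "noise": 0.0}

def toy_score(subset):
    # Stand-in for a cross-validated model score on the given feature subset.
    return sum(TOY_VALUE[f] for f in subset)

path = backward_elimination(list(TOY_VALUE), toy_score)
for subset, score in path:
    print(len(subset), round(score, 2), subset)
```

Plotting the score against the number of remaining features gives the kind of curve from which a cut-off can be chosen: features are removed as long as the score stays roughly flat.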
Figure 1 Feature selection using random forest. CS: Coding system; ICD-O-3: International Classification of Diseases for Oncology, 3rd ed; WHO: World Health Organization; AJCC: American Joint Committee on Cancer; SEER: Surveillance, Epidemiology, and End Results; LN: Lymph node.
The final dataset used for model prediction from linear regression, decision tree, random forest, and XGBoost had an effective sample size of 177714 cases with a total of 10 variables. Most of the values were categorized and given numerical code values. Table 2 lists all of the variables. Data were randomly split into a training set and a testing set. The training set contained 75% of the data and was used to build models. The testing set contained 25% of the data and was used to validate the models built from the training data. Detailed model parameter tuning setup is available upon request from the authors.
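A random 75%/25% split of this kind can be sketched with the standard library alone. The function name and the fixed seed below are assumptions for reproducibility of the example; the paper does not report a seed or the splitting routine used.

```python
import random

def train_test_split_indices(n_rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and return (train_indices, test_indices)."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded shuffle (seed is an assumption)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

# 177714 is the effective sample size reported in the text.
train_idx, test_idx = train_test_split_indices(177714)
print(len(train_idx), len(test_idx))
```

Models are fitted on the rows in `train_idx` only, and the accuracy measures are then computed on the rows in `test_idx`, which the models have not seen.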
RESULTS
There was a total of 177714 oral cancer cases included in the study, of which 63111 were oropharyngeal cancer cases and 114603 were laryngeal cancer cases. The nasopharyngeal cancer cases were not included in the final sample since there were very few of these cases and all of them had a large number of missing values. Oropharynx cancer included anatomical positions at the base of tongue, lingual tonsil, soft palate, uvula, tonsil, oropharynx, Waldeyer ring, and histology sites[23]. Laryngeal cancer included areas of the larynx, which comprises the epiglottis, supraglottis, vocal cord, glottis, and subglottis[24]. The sample consisted of 40.62% (n = 72179) males. The average age at diagnosis was 54.6 years (range: 0-109) (Figure 3). Nearly 40% of the sample were 60 years or older at the time of oral cancer diagnosis (Table 3).
Among the 10 features, several showed a strong linear relation with survival time (Figure 4). Hence a linear regression model was used to predict the outcome. The feature importance can be visualized in Figure 5, showing year of diagnosis as the most important variable. The performance of linear regression was MSE = 647.49, RMSE = 25.45, MAE = 18.21, R² = 0.620 and adjusted R² = 0.620 (Table 4).
Table 2 List of all 10 variables included in the final machine learning model building and validation
Decision tree regression, an ML method, was used to determine the top features (i.e., variables) that were predictive of oral cancer survival time. Relative variable importance scores were computed to identify the top predictors. Decision tree regression was well suited to this task as it does not require a linear relationship between the features and the target variable. Year of diagnosis was found to be the most important variable (Figure 5). The performance of the decision tree was MSE = 538.30, RMSE = 23.20, MAE[1] = 14.45, R² = 0.681 and adjusted R² = 0.681 (Table 4).
A random forest model was also fitted to develop a predictive model. It was appropriate for data with one strong predictor and some moderate predictors. The feature importance for random forest is shown in Figure 5, with year of diagnosis as the most important variable. The performance of the random forest was MSE = 489.58, RMSE = 22.13, MAE = 13.63, R² = 0.709 and adjusted R² = 0.709 (Table 4).
Finally, the XGBoost model was used. The performance of XGBoost was MSE = 486.55, RMSE = 22.06, MAE = 13.55, R² = 0.711 and adjusted R² = 0.711 (Table 4). The feature importance for the XGBoost model is presented in Figure 5, showing primary cancer site and year of diagnosis as the top two most important variables for prediction of oral cancer survival. Figure 6 presents a comparison of the predicted oral cancer survival time from all models against the actual survival time. All model predictions were very similar and close to the actual outcomes. When the survival time was between 40 mo and 60 mo, the predictions were on target with the actual survival time. When it was under 40 mo, the predicted survival times for all models were slightly higher than the actual survival time. However, when it was over 60 mo, the predicted survival times for all models were slightly lower than the actual survival time.
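The pattern described above, over-prediction for short survival and under-prediction for long survival, can be quantified as the mean signed error within each survival range. The sketch below is illustrative only: the function name is hypothetical and the toy values are invented to mimic the described pattern, not taken from the study.

```python
def mean_error_by_bin(actual, predicted, edges=(40, 60)):
    """Mean (predicted - actual) in months for actual survival under, between and over the edges."""
    low, high = edges
    bins = {"under": [], "between": [], "over": []}
    for a, p in zip(actual, predicted):
        if a < low:
            key = "under"
        elif a > high:
            key = "over"
        else:
            key = "between"
        bins[key].append(p - a)
    # None marks a bin with no cases.
    return {k: (sum(v) / len(v) if v else None) for k, v in bins.items()}

# Made-up survival times in months, constructed to show the described pattern.
actual_months = [5, 20, 35, 45, 55, 80, 120, 200]
predicted_months = [15, 28, 38, 46, 54, 72, 100, 150]
bias = mean_error_by_bin(actual_months, predicted_months)
print(bias)
```

A positive mean error in the lowest bin and a negative one in the highest bin is the typical signature of predictions being pulled toward the overall average survival time.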
Table 3 Demographic characteristics of the sample (n = 177714)
Table 4 Machine learning model performance
DISCUSSION
The goal of this study was two-fold: (1) To build an ML model predictive of the length of survival for those diagnosed with oral cancer, and (2) To establish the most important factors that predict oral cancer survival. Our results showed that XGBoost was the best model in terms of accuracy. XGBoost's performance exceeded that of all other ML methods, with linear regression's performance slightly trailing behind all models. The average length of survival for all patients was 60.35 mo. Furthermore, age at diagnosis, primary cancer site, tumor size and year of diagnosis were the most important factors related to oral cancer survival. Year of diagnosis consistently ranked as a top feature across all models. Year of diagnosis was not the number of years nor the amount of time since the tumor was initially diagnosed. Rather, year of diagnosis referred to the year when the tumor was first diagnosed, implying that individuals with tumors diagnosed in the modern era tend to have longer survival than those diagnosed in the past.
Figure 2 Cross-validation score change for selecting optimal number of features. LN: Lymph node; SEER: Surveillance, Epidemiology, and End Results; AJCC: American Joint Committee on Cancer; CS: Coding system; ICD-O-3: International Classification of Diseases for Oncology, 3rd ed; WHO: World Health Organization.
To our knowledge, this study is the first of its kind to use ML techniques to predict length of survival for those diagnosed with oral cancer. Previous research is consistent with some of our findings. Tumor size, specifically thickness among other tumor size parameters, has been found to be a significant predictor of oral tongue carcinoma survival[25,26]. Younger patients with oral cavity squamous cell cancer[27] and squamous cell carcinoma of the oral tongue[28] have been found to have a higher survival rate in the past, which is also consistent with our findings. For cases of squamous cell carcinoma of the oral tongue, a ten-year increase in age was associated with an 18% increase in risk of death[27,28]. However, year of diagnosis was a unique and novel predictive factor that has not been reported in the literature. That our study included more than 40 years' worth of data and incorporated ML for precise prediction perhaps made it possible to discover new knowledge. It is possible that a more recent year of diagnosis leads to better survival outcomes due to improved oral cancer treatments and public awareness.
This study also revealed some conflicting findings that need further exploration. Although race and ethnicity have been identified as predictors of oral cancer survival in past literature[15], our study using recent data showed low importance of these features, so race and ethnicity were eventually dropped from the model. Given that our study included more than 40 years' worth of data, including recent data, it may be that race and ethnicity are no longer associated with oral cancer survival over time. Improvements in access to, and utilization of, healthcare services across racial groups could also be reasons for no or low racial disparities in oral care in the 21st century. Additional large-scale studies using recent data are needed to evaluate these findings.
A primary limitation of this study was that the data did not include psychological factors that could explain survivors' quality of life. In a future study, we can explore other databases and incorporate surveys to explain the psychological state of oral cancer survivors and their overall perspective on the disease. More than 50% of diagnosed oral cancer cases remain lethal[10]; early detection and access to regular head and neck examinations are key.
Figure 3 Boxplots of sample characteristics. CS: Coding system.
CONCLUSION
This study is particularly important and appropriate for the field of dentistry, as the prediction of oral cancer survival can assist dentists, patients and caregivers in disease management and treatment plan development. Identifying oral cancer, gaining a more in-depth understanding of the length of survival for those diagnosed with it, and establishing the important factors that predict survival will better equip health care providers to manage such diagnoses. This study serves as a stepping stone for future exploration using ML and artificial intelligence to uncover the full potential for the management of oral cancers and to reduce healthcare disparities around the globe.
Figure 4 Survival months show a strong linear relation with several variables: Age at diagnosis, year of diagnosis, month of diagnosis, and site recode ICD-O-3/WHO 2008. ICD-O-3: International Classification of Diseases for Oncology, 3rd ed; WHO: World Health Organization.
Figure 5 Machine learning model feature importance. ICD-O-3: International Classification of Diseases for Oncology, 3rd ed; WHO: World Health Organization; CS: Coding system; AJCC: American Joint Committee on Cancer.
Figure 6 Prediction comparison among different models. Patient index refers to the rank after sorting by survival months. Actual: The actual survival outcome; XGB: Extreme gradient boosting.
ARTICLE HIGHLIGHTS
Research background
Oral cancer is highly prevalent in the world, yet there is a limited understanding of oral cancer risk factors and survival.
Research motivation
To increase one's quality of life, it is important to be able to predict oral cancer survival.
Research objectives
The objectives of this study were to build an accurate model to precisely predict the length of oral cancer survival and to explore the most important factors that determine the longevity of oral cancer survivors.
Research methods
Oral cancer data were obtained from the years 1975 to 2016 in the Surveillance, Epidemiology, and End Results database. Methods from the field of artificial intelligence were applied to build and validate prediction models from 40+ years of oral cancer data representative of the United States population.
Research results
Age at diagnosis, primary cancer site, tumor size and year of diagnosis were the most important factors related to oral cancer survival. Individuals with tumors that were diagnosed in the modern era tend to have longer survival than those diagnosed in the past, which was a novel finding that had not been reported in the literature.
Research conclusions
Machine learning algorithms that can be readily deployed to clinical settings were developed in this study to predict the length of oral cancer survival.
Research perspectives
This study was the first of its kind to use methods from artificial intelligence to examine the length of survival for individuals diagnosed with oral cancer. The outcome of this study has the potential to reduce healthcare disparities and improve the quality of life for oral cancer survivors and their friends and families around the world.
ACKNOWLEDGEMENTS
The authors sincerely thank the Clinical Outcomes Research and Education at College of Dental Medicine, Roseman University of Health Sciences for supporting this study.