Abstract: | Background: Many institutions and patients have had radiation therapy interrupted because of COVID-19 isolation or other factors. Radiation therapy must take radiobiological effects into account: once a course has started it should not be interrupted lightly, because a break may give cancer cells time to recover. Objective: To build a predictive model that, for breast cancer patients with different tumor status and treatments, predicts whether the number of days by which an interruption prolongs the radiation therapy course leads to recurrence within five years. Methods: This was a retrospective study that used data mining to construct the predictive model. Long-form cancer registry data on breast cancer cases from 2004 to 2021 were obtained from two major medical institutions; one institution's data served as the training set and the other's as the test set. The training set was used to train and optimize the model, and the test set was used to evaluate it. A model was considered good if its area under the curve (AUC) reached 0.8 or higher; sensitivity and accuracy were then compared.
Results: The Cancer Center of Taipei City Hospital, Renai Branch provided 386 records (44 recurrences, 342 non-recurrences) as the training set, and the Taipei Medical University Data Center provided 891 records (87 recurrences, 804 non-recurrences) as the test set. The 112 cancer registry fields were merged and fields with excessive missing values were excluded, leaving 20 features. Model screening selected an ensemble of Random Forest (RF) and CHAID models. Further feature selection showed that no feature could be removed; at this stage the AUC already exceeded 0.8. To increase sensitivity, the training data were balanced by randomly sampling 20% of the non-recurrence records and keeping 100% of the recurrence records, giving a recurrence to non-recurrence ratio of 44:68 for model training. This was repeated five times and the resulting models were combined into an ensemble. On the test set, the best balanced RF ensemble reached an AUC of 0.783, and the best CHAID result came from an ensemble of three runs (AUC 0.809). Combining the 5 RF models and the 3 CHAID models gave the final best result: AUC 0.801, sensitivity 0.632, and accuracy 0.801. The roles of the two sets were then swapped, with the original test set used for training by the same procedure and the original training set used for testing, which gave an AUC of 0.83, sensitivity of 0.614, and accuracy of 0.782. Conclusion: Using only long-form cancer registry fields as features, a model with an AUC of 0.8 can be built. The best model was an ensemble of RF and CHAID models, and balancing the data raised sensitivity from 0.414 to 0.632. In the predictor importance reported by the Random Forest model, age at diagnosis ranked first and the number of days by which the RT course was prolonged ranked second, with similar scores. Prolongation of the radiation therapy course is therefore also a factor in breast cancer recurrence within five years. |
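The balancing and ensembling steps described in the Results can be sketched as follows. This is a minimal illustration, not the study's implementation: the function names (`balanced_subsample`, `ensemble_scores`, `auc`) are hypothetical, the models themselves (RF, CHAID) are omitted, and averaging the per-model scores is an assumption, since the abstract does not say how the models were combined. Only the class counts (44 recurrences, 342 non-recurrences), the 20% sampling rate, and the five repetitions come from the abstract.

```python
import random
from statistics import mean


def balanced_subsample(labels, neg_fraction=0.2, rng=None):
    """Return indices of all positives plus a random fraction of negatives."""
    rng = rng or random.Random()
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_neg = int(len(neg) * neg_fraction)  # 20% of 342 is 68.4, truncated to 68
    return pos + rng.sample(neg, n_neg)


def ensemble_scores(score_lists):
    """Average per-record scores across models (assumed combination rule)."""
    return [mean(column) for column in zip(*score_lists)]


def auc(labels, scores):
    """AUC as the chance that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Training-set class counts reported in the abstract: 1 = recurrence.
train_labels = [1] * 44 + [0] * 342

# Five balanced subsets, one per repetition; each holds 44 + 68 = 112 records.
subsets = [
    balanced_subsample(train_labels, 0.2, random.Random(seed))
    for seed in range(5)
]
```

In a full pipeline, one model would be trained on each subset, each model would score the test records, and `auc(test_labels, ensemble_scores(per_model_scores))` would give the ensemble's test AUC.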