Abstract: | The rapid development of electronic health records (EHR) and the maturing of machine learning techniques have made it possible to process massive amounts of data. However, approximately 80% of medical data remains unstructured after it is created; this largely untapped resource contains information that the structured fields of an EHR may miss. In recent years, natural language processing (NLP) classification techniques have contributed substantially to clinical medicine: they help physicians classify, manage, and understand medical data quickly and automatically, which in turn supports better diagnosis and treatment decisions. This study uses the clinical text of discharge summaries from the MIMIC-III dataset and a pre-trained language model, Bidirectional Encoder Representations from Transformers (BERT), to predict patients' risk of death after discharge. Because the base BERT model handles long texts poorly, we propose the Crucial Clinical Description Extractor (CCDE), which condenses lengthy clinical texts into summaries (from an average of 1,800 words to fewer than 510 words) so that the model can learn the important information in the full clinical text. Experiments show that our model substantially improves prediction performance on death cases while maintaining its original predictive ability on survival cases. We also conducted a cross-hospital validation experiment, applying the parameters of the model trained on MIMIC-III to prediction on the TMUCRD dataset. The results show that our model can be applied effectively to clinical data from a different hospital. We further examine the differences in clinical department composition between the two datasets and use them to explain the differences in performance. |
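The abstract does not describe how CCDE selects text, so the following is only a minimal sketch of the general idea of condensing a long note to fit a fixed word budget before it reaches BERT. It assumes a simple keyword-count score per sentence; the keyword list, the `condense` function, and the whitespace-based word count (standing in for a real BERT tokenizer) are all hypothetical and are not taken from the study.

```python
import re

# Hypothetical keyword list; the source does not say how CCDE scores text.
KEYWORDS = {"sepsis", "failure", "metastatic", "intubated", "hospice"}


def condense(text, budget=510, keywords=KEYWORDS):
    """Keep the highest-scoring sentences, in original order, within a word budget."""
    # Naive sentence split on whitespace that follows ., ! or ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    scored = []
    for idx, sent in enumerate(sentences):
        words = re.findall(r"[A-Za-z0-9']+", sent.lower())
        score = sum(w in keywords for w in words)  # number of keyword hits
        scored.append((score, idx, len(sent.split())))

    # Highest score first; earlier sentences win ties.
    scored.sort(key=lambda t: (-t[0], t[1]))

    kept, used = [], 0
    for _score, idx, n_words in scored:
        if used + n_words <= budget:  # greedily add sentences that still fit
            kept.append(idx)
            used += n_words

    # Restore document order so the condensed note still reads chronologically.
    return " ".join(sentences[i] for i in sorted(kept))
```

With a budget of 510 words, a typical 1,800-word note would keep its keyword-dense sentences and drop low-scoring ones; a note already under the budget passes through unchanged. A real pipeline would count subword tokens with the model's own tokenizer rather than whitespace-separated words.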