Early Identification of Childhood Growth Disorders Using Natural Language Processing and Machine Learning


Please use this permanent URL to cite or link to this item: http://libir.tmu.edu.tw/handle/987654321/62457

Title:	Early identification and diagnosis of growth disorders using NLP and machine learning (利用自然語言處理及機器學習早期識別兒童生長障礙)
Author:	程春燕 (CHENG, CHUN-YEN)
Contributors:	College of Medicine, In-Service Master Program in Artificial Intelligence in Medicine; 許明暉; 黎阮國慶
Keywords:	Growth Disorder; Short Stature; Puberty; Growth Curve; Pediatrics; Artificial Intelligence; Electronic Medical Record; Machine Learning; Random Forest; Text Mining; Feature Selection; Imbalanced Data
Date:	2022-06-23
Upload time:	2023-01-17 14:52:43 (UTC+8)
Abstract:	Objectives: Abnormal growth is a key clinical condition that pediatricians watch closely, and the main reason to study childhood growth disorders is to identify conditions that may threaten a child's future health. Pathological short stature occurs in about 5% of children and should be identified, diagnosed, and treated promptly, so monitoring for growth disorders is essential in pediatric care. Artificial intelligence is already widely applied to medical imaging and diagnosis as an aid to precision medicine; this study uses machine learning to help primary care physicians diagnose childhood growth disorders early and accurately.
Methods: In this retrospective study, clinical growth data of pediatric outpatients were obtained by application to the Taipei Medical University Clinical Research Database (TMUCRD). A total of 112267 records were analyzed: 85743 from Taipei Medical University Hospital as the training and test set, and 26514 from Wan Fang Medical Center as the external validation set. Python and natural language processing were applied to the electronic medical records for text mining and data preprocessing. Machine learning classifiers were then compared for predicting growth disorders in children followed for one year after their first visit: Decision Tree, k-nearest neighbors, Random Forest, Logistic Regression, Support Vector Machine, Multilayer Perceptron, Adaptive Boosting, Gradient Boosting Machine, and Extreme Gradient Boosting. Feature selection and imbalanced-data methods were used to find the optimal feature set and to balance the results. In addition, growth indicators tracked on an electronic growth chart were added to improve diagnostic accuracy: height and weight percentiles, a gap from mid-parental (target) height of ≧1 SDS and ≧2 SDS, a gap between bone age and chronological age of ≧1 SDS and ≧2 SDS, and a growth rate of ≦5 cm/year.
Results: In both the first-12-outpatient-records module and the hybrid feature selection module, Random Forest, Gradient Boosting Machine, and Extreme Gradient Boosting performed comparably and stably on the training and test set and on the external validation set. Random Forest ran faster than the other algorithms in the hybrid feature selection module; in validation for classifying short stature or precocious puberty it reached an accuracy of 0.88, sensitivity of 0.91, specificity of 0.86, F1-score of 0.88, and AUC of 0.89. Classification by growth indicators (bone age gap ≧2 SDS, target height gap ≧2 SDS, or growth rate ≦5 cm/year) performed better still in validation: accuracy of 0.90, sensitivity of 0.92, specificity of 0.87, F1-score of 0.91, and AUC of 0.89.
Conclusion: The machine learning algorithms tested showed stable and excellent performance in classifying childhood growth disorders, and among them Random Forest was a fast and convenient algorithm for precision diagnosis. In addition, information text-mined from medication and diagnosis records showed 47.15% agreement with the hospital's structured ICD10 diagnosis codes and 86.03% agreement with the structured medication field, and an additional 11.23% of medication information (on average) was extracted, improving the completeness of the hospital's structured medication field. These figures are provided for future researchers' reference.
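The abstract reports accuracy, sensitivity, specificity, and F1-score for the validation set. As a reading aid, the following is a minimal sketch of how these four figures are conventionally derived from a binary confusion matrix. It is not the thesis code: the names `ConfusionCounts` and `classification_metrics` are hypothetical, and the counts are invented for illustration only and are not results from the study. AUC is omitted because it needs per-patient predicted scores rather than counts.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ConfusionCounts:
    """Binary confusion matrix; 'positive' means a growth disorder is flagged."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def _ratio(numerator: float, denominator: float) -> float:
    # Return 0.0 instead of raising when a class is absent (assumed convention).
    return numerator / denominator if denominator else 0.0


def classification_metrics(c: ConfusionCounts) -> Dict[str, float]:
    """Compute accuracy, sensitivity, specificity, and F1 from counts."""
    sensitivity = _ratio(c.tp, c.tp + c.fn)  # recall on the positive class
    specificity = _ratio(c.tn, c.tn + c.fp)  # recall on the negative class
    precision = _ratio(c.tp, c.tp + c.fp)
    return {
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": _ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = ConfusionCounts(tp=90, fp=15, tn=85, fn=10)
metrics = classification_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity are reported alongside accuracy because, with imbalanced classes such as those the abstract mentions, accuracy alone can look high even when most true cases are missed.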
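Description:	Master's thesis. Advisor: 許明暉. Co-advisor: 黎阮國慶. Committee members: 張詠淳, 陳中明, 侯家瑋, 許明暉, 黎阮國慶.
Type:	thesis
Appears in collections:	[In-Service Master Program in Artificial Intelligence in Medicine] Theses and Dissertations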


All items in TMUIR are protected by original copyright.
