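Abstract: | Purpose: Growth abnormality is a key clinical condition that pediatricians take seriously, and the main reason for studying childhood growth disorders is to identify conditions that may threaten a child's future health. The incidence of pathological short stature in children is about 5%, and short stature should be recognized, diagnosed, and appropriately treated in a timely manner, so monitoring growth disorders is essential in pediatric health care. Because artificial intelligence is widely applied in medical imaging and diagnosis to support precision medicine, this study aims to use machine learning to help primary care physicians diagnose childhood growth disorders early and accurately. Methods: In this retrospective pilot study, a clinical study application was filed through the Taipei Medical University Clinical Research Database, and the clinical growth data of pediatric outpatients in that database were analyzed, 112267 records in total (a training and test set of 85743 records from Taipei Medical University Hospital and an external validation set of 26514 records from Wan Fang Medical Center). Python and natural language processing were applied to the electronic medical records for text mining and data preprocessing, and machine learning algorithms were used to assess growth disorders. Several classifiers were compared, including decision tree, k-nearest neighbors, random forest, logistic regression, support vector machine, multilayer perceptron, adaptive boosting, gradient boosting machine, and extreme gradient boosting, to predict growth disorders in children followed for one year after the first visit. To obtain the best predictive model, feature selection and imbalance-handling methods were both used to find the optimal feature set and to balance the results. In addition, growth indicators were added to improve the accuracy of growth disorder diagnosis: height and weight percentiles tracked on an electronic growth chart, a gap from mid-parental height of ≧1 SDS and ≧2 SDS, a gap between bone age and chronological age of ≧1 SDS and ≧2 SDS, and a growth rate of ≦5 cm/year. Results: In the analysis of both the first-12-outpatient-records module and the hybrid feature selection module, random forest, gradient boosting machine, and extreme gradient boosting performed comparably and stably on the training and test set and on the external validation set. Among them, random forest in the hybrid feature selection module ran faster than the other algorithms, and its validation performance in classifying short stature or precocious puberty was: accuracy 0.88, sensitivity 0.91, specificity 0.86, F-score 0.88, AUC 0.89. Classification by the growth indicators (bone age gap ≧2 SDS, target height gap ≧2 SDS, or growth rate ≦5 cm/year) performed even better in validation: accuracy 0.90, sensitivity 0.92, specificity 0.87, F-score 0.91, AUC 0.89. Discussion: The machine learning algorithms used in this study showed stable and excellent performance in the classification and diagnosis of childhood growth disorders, and among them random forest is a fast and convenient algorithm for precision medical diagnosis. In addition, the text-mined medication treatment records and disease diagnosis information agreed with the hospital's structured ICD10 diagnosis codes in 47.15% of cases and with the structured medication data in 86.03% of cases, and text mining extracted an additional 11.23% of medication information that improves the completeness of the hospital's structured medication column, which is provided for future researchers' reference. Objectives: The purpose of this study was to use machine learning to assist primary care physicians in the early and accurate diagnosis of childhood growth disorders. Methods: In this retrospective study, we retrieved the clinical growth data of outpatients from the Taipei Medical University Clinical Research Database (TMUCRD). A total of 112267 records were included in the analysis. Text mining and data preprocessing were applied to extract and clean the raw data. Subsequently, we implemented different machine learning algorithms to predict growth disorders in outpatients after one year of follow-up. To find the optimal model, we assessed the performance of the models (Support Vector Machine, Multilayer Perceptron, k-Nearest Neighbors, Decision Tree, Logistic Regression, Random Forest, Adaptive Boosting, Gradient Boosting Machine, and Extreme Gradient Boosting) using several evaluation metrics.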
Feature selection and imbalance-handling approaches were employed to find the optimal feature set and to balance the results. In addition, growth indicators tracked on an electronic growth chart were added to improve the diagnosis of growth disorders: height and weight percentiles, a target height gap of ≧1 SDS and ≧2 SDS, a gap between skeletal age and chronological age of ≧1 SDS and ≧2 SDS, and a growth rate of ≦5 cm/year. Results: In both the first-12-records module and the hybrid feature selection module, Random Forest, Gradient Boosting Machine, and Extreme Gradient Boosting performed comparably and stably on the training and test set and on the external validation set. Among them, Random Forest was faster than the others in the hybrid feature selection module, and its validation performance for the diagnosis of short stature or precocious puberty reached an accuracy of 0.88, sensitivity of 0.91, specificity of 0.86, F1-score of 0.88, and AUC of 0.89. Classification by the growth indicators (bone age gap ≧2 SDS, target height gap ≧2 SDS, or growth rate ≦5 cm/year) performed even better in validation, with an accuracy of 0.90, sensitivity of 0.92, specificity of 0.87, F1-score of 0.91, and AUC of 0.89.
Conclusion: In this study, different machine learning algorithms achieved stable and excellent performance in the classification and diagnosis of children's growth disorders. Among them, Random Forest was a fast, convenient, and accurate algorithm for precision medical diagnosis. In addition, the text-mined medicine records and disease diagnosis information were consistent with the structured ICD10 diagnosis column of the EMR in 47.15% of cases and with the medicine column in 86.03% of cases, and an additional 11.23% of medicine information was extracted to supplement the completeness of the original medicine column, which is provided for future researchers' reference. |
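The growth-indicator rule and the reported metrics (accuracy, sensitivity, specificity, F1-score) can be illustrated with a minimal Python sketch. This is not the study's code: the `Visit` record, its field names, and the function names are hypothetical, and treating each SDS gap as an absolute value is an assumption, since the abstract does not state a direction for the gap.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    """Hypothetical per-child summary; field names are illustrative only."""
    bone_age_gap_sds: float       # gap between bone age and chronological age, in SDS
    target_height_gap_sds: float  # gap from the parental target height, in SDS
    growth_cm_per_year: float     # annual growth rate


def flag_growth_indicator(v: Visit, sds_cutoff: float = 2.0,
                          rate_cutoff: float = 5.0) -> bool:
    """Return True if any one of the three indicator criteria is met.

    Assumption: gaps are compared by magnitude (abs), which the abstract
    does not specify.
    """
    return (
        abs(v.bone_age_gap_sds) >= sds_cutoff
        or abs(v.target_height_gap_sds) >= sds_cutoff
        or v.growth_cm_per_year <= rate_cutoff
    )


def binary_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity and F1 from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if precision + sensitivity else 0.0)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy labels, invented purely to exercise the functions.
    print(flag_growth_indicator(Visit(2.1, 0.3, 6.0)))
    print(binary_metrics([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1]))
```

Sensitivity and specificity are reported separately because, with imbalanced classes, accuracy alone can look high even when one class is rarely detected.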