Developing artificial intelligence-assisted models to predict preeclampsia using clinical datasets

Please use this permanent URL to cite or link to this item: http://libir.tmu.edu.tw/handle/987654321/65000

Title:	Developing artificial intelligence-assisted models to predict preeclampsia using clinical datasets (利用臨床資料庫建立人工智慧子癲前症預測模型)
Author:	HUNG, SHIH-CHUN (洪士鈞)
Contributors:	Master's Program, Graduate Institute of Medical Informatics; 蘇家玉
Keywords:	feature selection; preeclampsia; pregnancy-induced hypertension; prediction model; clinical data
Date:	2024-01-15
Upload time:	2025-01-06 09:13:49 (UTC+8)
Abstract:	Background: Preeclampsia, also known as "toxemia of pregnancy", is a subtype of pregnancy-induced hypertension that develops after 20 weeks of gestation and is characterized by high blood pressure accompanied by proteinuria or edema. Its global incidence is 2-8%. It can cause serious sequelae and complications for both mother and fetus, and severe cases can be life-threatening. Although mortality has gradually declined in recent years, preeclampsia still accounts for 18% of maternal deaths, with a heavier burden in developing countries. The disease is hard to detect in its early stages and ranks among the top three causes of maternal death domestically. Its cause has not been established; clinical and pathological studies indicate only a close association with the placenta. No effective treatment exists, and the only preventive option is aspirin started before week 16 of pregnancy to lower the incidence, which makes prediction crucial. This study integrates clinical data to identify the features best suited to assessing the risk of preeclampsia in pregnant women for three targets (all cases regardless of subtype, early-onset, and intermediate- to late-onset) and to develop the most suitable prediction model for each, so that the disease can be prevented. The study may also surface variables not previously reported in the literature, offering new clinical perspectives.
Materials and Methods: The study population was women aged 18-55 who received pregnancy care between 2013 and 2019 at three hospitals: Taipei Medical University Hospital, Wan Fang Hospital, and Shuang Ho Hospital. Women with a recorded delivery during hospitalization and a confirmed diagnosis of preeclampsia formed the experimental (case) group (n = 243), and healthy women formed the control group (n = 843). Laboratory records from weeks 8-20 of pregnancy and medication records from 6 weeks before pregnancy through week 20 of pregnancy were collected. Variables with more than 50% missing values were excluded first. The data were then split into an internal (training) set of 80% and an independent test set of 20%. Five algorithms were used: logistic regression, XGBoost, decision tree, random forest, and support vector machine (SVM). Because preeclampsia is relatively uncommon, the data were imbalanced, so two resampling methods, SMOTE-ENN and SMOTE-Tomek, were each tried. For feature selection, four approaches were each tried: chi-square and t-test p values, the built-in importance score in scikit-learn, permutation importance, and SHAP values. The model that performed best in internal validation was then evaluated on the independent test set.
Results: For preeclampsia overall (no subtype distinction), the final model used 26 variables selected by chi-square and t-test p values, SMOTE-Tomek resampling, and a random forest. On the independent test set it achieved accuracy 0.90 (95% CI, 0.85-0.96), AUC 0.91 (95% CI, 0.84-0.98), precision 0.86 (95% CI, 0.76-0.95), and weighted F1 0.90 (95% CI, 0.84-0.96). For early-onset preeclampsia, the final model used the top 3 variables ranked by permutation importance, SMOTE-ENN resampling, and XGBoost. On the independent test set it achieved accuracy 0.87 (95% CI, 0.80-0.93), AUC 0.98 (95% CI, 0.97-0.99), precision 0.65 (95% CI, 0.36-0.94), and weighted F1 0.85 (95% CI, 0.77-0.93). For intermediate- and late-onset preeclampsia, the final model used the top 19 variables ranked by the built-in importance score, SMOTE-Tomek resampling, and a random forest. It achieved accuracy 0.88 (95% CI, 0.87-0.90), AUC 0.87 (95% CI, 0.83-0.92), precision 0.76 (95% CI, 0.67-0.85), and weighted F1 0.87 (95% CI, 0.85-0.90).
Conclusion: Beyond commonly used indicators, this study brought in laboratory values that earlier studies have rarely examined: total white blood cell count, abnormally high red blood cell width, mean hemoglobin concentration per liter of blood, and hepatitis B antigen testing. These values give a fuller and more detailed physiological picture and may help clarify the physiological mechanisms associated with preeclampsia. In the experiments, the added variables improved model performance, and performance remained strong on unseen data, which suggests that they may carry distinct predictive value and could improve the accuracy of preeclampsia prediction. Future work could investigate the biological basis of these variables to understand their role in women's health and in physiological processes during pregnancy.
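Illustrative note on the resampling step: the abstract names SMOTE combined with Tomek links but does not show how the combination works. The sketch below is a minimal pure-Python illustration of the idea, not the thesis's implementation. The function names (`nearest`, `smote`, `remove_tomek_links`), the toy data, the neighbour count `k`, and the choice to drop both members of each Tomek link are all assumptions made for this example. In practice, libraries such as imbalanced-learn provide `SMOTETomek` and `SMOTEENN` classes for this purpose.

```python
import math
import random


def nearest(idx, points, candidates, k=1):
    """Indices of the k candidates closest (Euclidean) to points[idx]."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def smote(X, y, minority=1, k=3, seed=0):
    """Add synthetic minority samples until the two classes are the same size.

    Each synthetic sample lies on the segment between an original minority
    sample and one of its k nearest original minority neighbours.
    """
    rng = random.Random(seed)
    X, y = [list(p) for p in X], list(y)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    n_needed = len(y) - 2 * len(minority_idx)  # majority count minus minority count
    for _ in range(max(n_needed, 0)):
        i = rng.choice(minority_idx)
        j = rng.choice(nearest(i, X, minority_idx, k))
        gap = rng.random()  # position along the segment from X[i] to X[j]
        X.append([a + gap * (b - a) for a, b in zip(X[i], X[j])])
        y.append(minority)
    return X, y


def remove_tomek_links(X, y):
    """Drop both members of every Tomek link.

    A Tomek link is a pair of samples from different classes that are
    each other's nearest neighbour.
    """
    everyone = range(len(X))
    nn = [nearest(i, X, everyone)[0] for i in everyone]
    linked = {i for i in everyone if y[i] != y[nn[i]] and nn[nn[i]] == i}
    keep = [i for i in everyone if i not in linked]
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (hypothetical): six controls (label 0) and three cases (label 1).
X = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.3], [0.2, 0.4], [0.4, 0.1],
     [1.0, 1.0], [1.1, 0.9], [0.9, 1.2]]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

X_bal, y_bal = smote(X, y, minority=1, k=2)       # oversample the cases
X_clean, y_clean = remove_tomek_links(X_bal, y_bal)  # then clean the class boundary
print(y_bal.count(0), y_bal.count(1))  # 6 6
print(len(X_clean))
```

Resampling of this kind is normally applied to the training portion only, so that the held-out test set keeps the original class balance. SMOTE-ENN follows the same pattern but replaces the Tomek step with edited nearest neighbours, which removes samples whose class label conflicts with that of their nearest neighbours.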
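Description:	Master's thesis. Advisor: 蘇家玉. Oral defense committee members: 張資昊, 陳俊璋, 蘇家玉.
Note:	Thesis release date: 2024-01-30
Type:	thesis
Appears in collections:	[Graduate Institute of Medical Informatics] Theses and dissertations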

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	57	View/Open

All items in TMUIR are protected by original copyright.