Abstract:

Research Background: Satisfactory healthcare services not only elevate patients' trust in their physicians but also improve their compliance with medical advice. Amid the persistently high rate of medical disputes, research underscores the importance of a strong doctor-patient relationship in achieving favorable clinical outcomes, and a physician's emotional intelligence correlates positively with the quality of that relationship. Emotional communication between individuals is carried by facial expressions, vocal tone, and textual messages. However, the mask-wearing habits established during the COVID-19 pandemic and the rise of diverse digital medical services have made facial expressions difficult to read clearly, so conveying empathy through vocal tone has become a central element of physician-patient communication. Improving physicians' expression based on the tone of medical dialogues has therefore emerged as an important issue.

Objective: In light of this, our study aims to develop an automatic speech emotion recognition model dedicated to physician-patient interactions. Automated non-verbal speech emotion recognition can be used to train physicians to recognize and respond to patients' emotions, and to analyze the emotional states of both parties, thereby improving the doctor-patient relationship and patient care.

Materials and Methods: We collected audiovisual recordings of dermatology outpatient consultations from two hospitals in Northern Taiwan. Voice activity detection was used to segment the recordings into utterances, and annotators were recruited to label each utterance along three emotional dimensions (valence, arousal, and dominance). The deep learning model takes Mel-spectrograms computed from the utterances as input and outputs emotion categories; it is based on the DenseNet121 neural network architecture and was evaluated on held-out test sets.
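To make the processing pipeline concrete, the sketch below shows one plausible way to convert a segmented utterance into a log-Mel spectrogram and classify it with a DenseNet121 backbone, following the Materials and Methods at a high level. The 16 kHz sampling rate, 128 Mel bands, single-channel input adaptation, three-class output head, and file name are illustrative assumptions; the abstract specifies only Mel-spectrogram inputs, emotion-category outputs, and the DenseNet121 architecture.

```python
# Sketch only: the hyperparameters below are assumptions, not the
# study's published configuration.
import librosa
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import densenet121

def utterance_to_melspec(path: str, sr: int = 16000, n_mels: int = 128) -> torch.Tensor:
    """Convert one segmented utterance into a log-Mel spectrogram tensor."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Shape (1, 1, n_mels, frames): batch and channel axes for the CNN.
    return torch.tensor(log_mel, dtype=torch.float32)[None, None, :, :]

def build_model(num_classes: int = 3) -> nn.Module:
    """DenseNet121 adapted to one-channel spectrograms and, e.g., the
    low/middle/high levels of a single emotion dimension."""
    model = densenet121(weights=None)
    model.features.conv0 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                     padding=3, bias=False)
    model.classifier = nn.Linear(model.classifier.in_features, num_classes)
    return model

model = build_model()
model.eval()
with torch.no_grad():
    x = utterance_to_melspec("utterance_0001.wav")  # hypothetical file
    logits = model(x)                               # (1, 3) class scores
```

Training one such model per dimension is consistent with the abstract's separate dominance, arousal, and valence results.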
Results: After data preprocessing, 8850 audio segments were obtained for the dominance dimension, 9139 for arousal, and 9137 for valence. Trained with three-fold cross-validation, the dominance model achieved 82% accuracy, excelling in particular at recognizing high dominance (F1 score 0.91), though it was weaker at identifying low dominance. The arousal model reached 68% accuracy, performing best on middle arousal (F1 score 0.77). The valence model reached 84% accuracy, classifying neutral emotion best (F1 score 0.9). Together these results suggest that the dominance and arousal models can reliably distinguish high-dominance and more intense vocal tones. Analysis of the human annotations further showed that, because emotions are expressed subtly in consultation settings, neutral emotional data predominated, yet the proportion of high-dominance speech was notably high, underscoring the distinctive character of emotional expression in medical consultations. (A minimal sketch of this evaluation protocol follows the Conclusion.)

Conclusion: This study developed three-dimensional emotion models dedicated to doctor-patient interactions; they accurately recognize high-dominance tones and non-low-arousal non-verbal speech, and show cross-language capability. The models can be applied in medical communication training or in real clinical settings to strengthen empathetic communication. Future work includes expanding data collection and integrating multimodal recognition to improve the accuracy and robustness of non-verbal emotion recognition.
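For illustration only, the sketch below shows how the kind of metrics reported in the Results (overall accuracy and per-class F1) can be produced under three-fold cross-validation. The features, labels, and predictions are random placeholders so the script runs end to end; the study's actual figures come from the trained DenseNet121 models on the annotated segments.

```python
# Placeholder data only; nothing here reproduces the study's results.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=9000)    # stand-in low/middle/high labels
X = rng.normal(size=(9000, 16))      # stand-in per-utterance features

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    # In the study, a DenseNet121 trained on the train folds would
    # predict the test fold; labels are shuffled here as a stand-in.
    y_pred = rng.permutation(y[test_idx])
    print(f"--- fold {fold} ---")
    print(classification_report(y[test_idx], y_pred,
                                target_names=["low", "middle", "high"]))
```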