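Abstract: | Background
Human immunodeficiency virus type 1 (HIV-1) is the causative agent of acquired immune deficiency syndrome (AIDS), and its aspartic protease is an important enzyme because it plays an essential part in viral development. Developing HIV-1 protease inhibitors can help in understanding the specificity of HIV-1 inhibition and, in turn, in developing drugs against AIDS. However, experimental methods for identifying HIV-1 protease cleavage sites are mostly time-consuming and labor-intensive. Computational prediction of cleavage sites has therefore become a method for fast preliminary screening.
Methods
In this study, we propose using three classes of biological features (sequence, structural, and physicochemical) in combination with different machine-learning algorithms to predict cleavage sites. Stepwise logistic regression is then used to select discriminative features. For feature representation, the selected features are scored with different encoding schemes and fed into decision tree, logistic regression, and artificial neural network models. In addition, the data are split three ways to evaluate cleavage site prediction, and four datasets proposed in previous studies are used to assess the predictive results.
Results and conclusions
The experimental results show that combining sequence, structural, and physicochemical features predicts HIV-1 protease cleavage sites more accurately than using a single feature type. Adding stepwise feature selection also effectively identifies the features that characterize specificity. The artificial neural network performed significantly better than the decision tree and logistic regression models. Finally, under the three-way data split evaluation, the method reached an AUC of 0.815 to 0.995 and an accuracy of 80.0% to 97.4%.
Background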
The aspartic protease of human immunodeficiency virus type 1 (HIV-1) is an important enzyme owing to its essential part in viral development; HIV-1 is the causative agent of acquired immune deficiency syndrome (AIDS). Developing HIV-1 protease inhibitors can help in understanding the substrate specificity of the enzyme; such inhibitors restrain the replication of HIV-1 and thereby counteract AIDS. However, experimental methods for identifying HIV-1 protease cleavage sites are generally time-consuming and labor-intensive. Therefore, computational methods for predicting cleavage sites have become highly desirable.
Results
In this study, we propose a prediction method that incorporates sequence, structural, and physicochemical features into various machine-learning algorithms. A bidirectional stepwise selection algorithm is used for feature selection to identify discriminative features. Only the selected features are then computed with various encoding schemes and used as input for decision trees, logistic regression, and artificial neural networks. Moreover, a more rigorous three-way data split procedure is applied to obtain an objective estimate of cleavage site prediction performance. Four benchmark datasets from previous studies are used to evaluate the predictive performance.
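The three-way data split can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a 60/20/20 partition, a fixed random seed, and a hypothetical helper named `three_way_split`; the abstract does not state the actual proportions. The idea is that models are fitted on the training part, feature subsets and model settings are chosen on the validation part, and the test part is touched only once for the final estimate.

```python
import random


def three_way_split(samples, train_frac=0.6, valid_frac=0.2, seed=0):
    """Shuffle samples and cut them into train/validation/test lists.

    The fractions are illustrative assumptions, not values from the study.
    """
    if not 0 < train_frac < 1 or not 0 < valid_frac < 1:
        raise ValueError("fractions must lie strictly between 0 and 1")
    if train_frac + valid_frac >= 1:
        raise ValueError("train_frac + valid_frac must leave room for a test set")

    # Copy before shuffling so the caller's data is left untouched.
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_frac)
    n_valid = int(len(shuffled) * valid_frac)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # everything left over
    return train, valid, test


# Toy data: placeholder strings (not real peptides) with made-up labels.
toy = [("SAMPLE%02d" % i, i % 2) for i in range(10)]
train, valid, test = three_way_split(toy)
```

Because the test part plays no role in feature selection or model choice, the performance measured on it is not biased by those choices.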
Conclusion
Experimental results show that combinations of sequence, structural, and physicochemical features perform better than any single feature type in identifying HIV-1 protease cleavage sites. In addition, stepwise feature selection is effective in identifying interpretable biological features that depict the specificity of the substrates. Moreover, artificial neural networks perform significantly better than the other two classifiers. Finally, the proposed method achieves 80.0% to 97.4% accuracy and an AUC of 0.815 to 0.995 when evaluated on independent test sets in a three-way data split procedure.
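The two reported metrics can be computed from classifier scores as in the sketch below. It is an illustration on made-up scores, with hypothetical function names `accuracy` and `auc`; the 0.5 decision threshold is an assumption. AUC is computed through its rank interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counting one half.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)


def auc(labels, scores):
    """Area under the ROC curve via pairwise positive/negative comparison."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count one half
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = cleaved, 0 = not cleaved) and classifier scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
acc = accuracy(labels, scores)
area = auc(labels, scores)
```

On this toy input the accuracy is 4/6 and the AUC is 8/9, which shows that the two metrics can differ: accuracy depends on the chosen threshold, whereas AUC depends only on how the scores rank the samples.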
Keywords: HIV-1 protease, cleavage sites, sequence features, structural features, physicochemical properties, pseudo amino acid composition, machine learning. |