Automated Large-scale Extraction of Relations between Proteins and Phenotypes from Biomedical Litera
- Thursday, November 7, 2019 from 12:00pm to 1:00pm
- Barnard Hall, 258
Identifying protein-phenotype relations is of paramount importance for applications such as uncovering rare and complex diseases. The Human Phenotype Ontology (HPO) is a recently introduced standard vocabulary for describing disease-related phenotypic abnormalities in humans. Because experimentally determining HPO categories for human proteins is highly resource-intensive, automated tools that can accurately predict HPO categories have recently gained interest. One of the best resources capturing protein-phenotype relationships is the biomedical literature. However, no method currently exists for automatically extracting relations between human proteins and phenotypes from biomedical text. As a solution, we propose HPcurator, an automated curation pipeline that works in two steps: (1) extracting protein-phenotype co-mentions from all biomedical literature, and (2) classifying the extracted co-mentions to identify valid relations.
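As a rough illustration of step (1), the sketch below pairs every protein name with every phenotype name found in the same sentence. It is a minimal sketch, not the HPcurator implementation: the toy lexicons `PROTEINS` and `PHENOTYPES`, the regex-based sentence splitter, and the function names are all assumptions made for this example, and a real system would use proper named-entity recognition.

```python
import re
from itertools import product

# Hypothetical toy lexicons; a real pipeline would use full protein and
# HPO term dictionaries together with a named-entity recognizer.
PROTEINS = {"FBN1", "BRCA1"}
PHENOTYPES = {"scoliosis", "arachnodactyly"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_comentions(text):
    """Return (protein, phenotype, sentence) for each sentence-level co-mention."""
    comentions = []
    for sentence in split_sentences(text):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        lowered = {t.lower() for t in tokens}
        # Protein symbols are matched case-sensitively, phenotypes case-insensitively.
        proteins = sorted(PROTEINS & tokens)
        phenotypes = sorted(p for p in PHENOTYPES if p in lowered)
        for protein, phenotype in product(proteins, phenotypes):
            comentions.append((protein, phenotype, sentence))
    return comentions


if __name__ == "__main__":
    sample = ("Mutations in FBN1 cause arachnodactyly and scoliosis. "
              "BRCA1 was also studied.")
    for comention in extract_comentions(sample):
        print(comention)
```

A co-mention found this way only says that the two names share a sentence; deciding whether the sentence actually asserts a relation is the job of step (2).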
In our first preliminary study, we developed ProPheno 1.0, a comprehensive online dataset of human protein and phenotype mentions extracted from the entire set of biomedical articles. In our second preliminary study, we used the co-mentions obtained from ProPheno as a complementary source of input features for automated phenotype prediction and observed that co-mention features improve overall performance. In our third preliminary study, we presented PPPred (Protein-Phenotype Predictor), a supervised machine learning approach for classifying the validity of a given sentence-level co-mention. Using a gold-standard dataset of manually annotated co-mentions, we demonstrated that PPPred significantly outperforms several baseline methods.
While manually annotating co-mentions is prohibitively expensive, we have access to millions of unlabeled protein-phenotype co-mentions through ProPheno. In our current work, we propose HMILE (Hierarchical Multiple Instance Learning Ensemble model) for classifying protein-phenotype co-mentions. Using distant supervision, HMILE starts by assuming that any co-mention in the text of a protein-phenotype pair that is also present in the HPO database is a potentially valid relation. Then, following the multiple instance learning concept, it creates bags of co-mentions and assigns bag labels using the annotations from the HPO database. Next, it trains three separate classifiers at the pair, phenotype, and protein levels and combines them into an ensemble classifier. The ProPheno dataset and the PPPred/HMILE classifiers constitute the complete HPcurator pipeline. The findings and insights gained through this work have implications for biocurators, researchers, and developers of biomedical text-mining tools.
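The distant-supervision and bagging steps described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the HMILE implementation: the function names, the tuple layout of a co-mention, and the `known_pairs` set standing in for HPO annotations are hypothetical, and only pair-level bags are shown (the phenotype-level and protein-level classifiers and the ensemble step are omitted).

```python
from collections import defaultdict


def build_bags(comentions):
    """Group co-mention sentences into one bag per (protein, phenotype) pair."""
    bags = defaultdict(list)
    for protein, phenotype, sentence in comentions:
        bags[(protein, phenotype)].append(sentence)
    return dict(bags)


def label_bags(bags, known_pairs):
    """Distant supervision: a bag is labeled 1 if its pair is annotated in the
    reference database (here the hypothetical set known_pairs), else 0."""
    return {pair: int(pair in known_pairs) for pair in bags}


if __name__ == "__main__":
    # Made-up example sentences, for illustration only.
    comentions = [
        ("FBN1", "scoliosis", "FBN1 variants were found in patients with scoliosis."),
        ("FBN1", "scoliosis", "We screened FBN1 in a scoliosis cohort."),
        ("BRCA1", "scoliosis", "BRCA1 status was recorded; one patient had scoliosis."),
    ]
    known_pairs = {("FBN1", "scoliosis")}  # assumed stand-in for HPO annotations

    bags = build_bags(comentions)
    print(label_bags(bags, known_pairs))
```

The label applies to the bag as a whole, not to each sentence in it, which is what lets noisy, unlabeled co-mentions serve as training data without sentence-level annotation.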
- Gianforte School of Computing