Dictionary and Gene Ontology Based Similarity for Named Entity Relationship Protein-protein Interaction Prediction from Biotext Corpus

Smt K. Prabavathy; P. Sumathi

doi:10.19026/rjaset.8.1230

Research Journal of Applied Sciences, Engineering and Technology

Research Article | OPEN ACCESS

¹Smt K. Prabavathy and ²P. Sumathi

¹Department of Computer Science, Manonmaniam Sundaranar University, Tirunelveli
²Department of Computer Science, Government Arts College, Coimbatore, Tamil Nadu 627012, India

Research Journal of Applied Sciences, Engineering and Technology 2014 22:2282-2289

http://dx.doi.org/10.19026/rjaset.8.1230 | © The Author(s) 2014

Received: September 13, 2014 | Accepted: September 20, 2014 | Published: December 15, 2014

Abstract

Protein-protein interactions (PPIs) play a key role in many biological systems. They are involved in complex formation and in many pathways that carry out biological processes. Accurate identification of the set of interacting proteins can shed new light on the functional roles of various proteins in the complex environment of the cell. The ability to construct biologically meaningful gene networks and to identify the exact relationships within them is critical for present-day systems biology. Earlier research studied the power of gene modules to shed light on the functioning of complex biological systems. Most modules in these networks have shown only a weak link with meaningful biological function, because these methods do not accurately calculate the semantic relationship between the entities. To overcome these problems and improve PPI results on the biotext corpus, this research proposes a new method that directly incorporates Gene Ontology (GO) annotation in the construction of gene modules and uses dictionary-based text extraction to obtain biotext information. The Dictionary-Based Text and Gene Ontology (DBTGO) approach integrates various gene-gene pairwise similarity values with protein-protein interaction relationships obtained from gene expression in order to achieve better biotext information retrieval results. The approach was evaluated on the corpus of the Biotext Project at UC Berkeley. Tests indicate that the DBTGO algorithm improves PPI relationship identification compared with the previously suggested methods in terms of precision, recall, F-measure and Normalized Discounted Cumulative Gain (NDCG). The proposed DBTGO algorithm can facilitate comprehensive and in-depth analysis of high-throughput experimental data at the gene network level.

Keywords:

Biotext corpus, gene network, gene ontology, Information Extraction (IE), Named Entity Relationship (NER), preprocessing, Protein-Protein Interaction (PPI), word-sense disambiguator
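
The evaluation measures named in the abstract (precision, recall, F-measure and NDCG) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the placeholder protein names P1 to P4 and the toy data are assumptions made only for illustration. The NDCG variant shown uses the common log2 position discount; the paper may use a different formulation.

```python
import math


def precision_recall_f1(predicted, gold):
    """Compare a set of predicted interaction pairs against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def ndcg(relevances):
    """NDCG for a ranked list of relevance grades (log2 position discount)."""
    def dcg(scores):
        # Rank 1 has discount log2(2) = 1, rank 2 has log2(3), and so on.
        return sum(r / math.log2(i + 2) for i, r in enumerate(scores))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal else 0.0


# Hypothetical toy data: unordered protein pairs stored as frozensets.
predicted_pairs = {
    frozenset(("P1", "P2")),
    frozenset(("P1", "P3")),
    frozenset(("P2", "P4")),
}
gold_pairs = {
    frozenset(("P1", "P2")),
    frozenset(("P2", "P4")),
    frozenset(("P3", "P4")),
    frozenset(("P1", "P4")),
}

p, r, f = precision_recall_f1(predicted_pairs, gold_pairs)
ranking_quality = ndcg([1, 0, 1])  # relevant, irrelevant, relevant
```

Storing each pair as a `frozenset` makes the comparison order-independent, so (P1, P2) and (P2, P1) count as the same interaction.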

Competing interests

The authors have no competing interests.

Open Access Policy

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

ISSN (Online): 2040-7467
ISSN (Print): 2040-7459
