Research Article | OPEN ACCESS
Dictionary and Gene Ontology Based Similarity for Named Entity Relationship Protein-protein Interaction Prediction from Biotext Corpus
1Smt K. Prabavathy and 2P. Sumathi
1Department of Computer Science, Manonmanium Sundaranar University, Tirunelveli
2Department of Computer Science, Government Arts College, Coimbatore, Tamil Nadu 627012, India
Research Journal of Applied Sciences, Engineering and Technology 2014 22:2282-2289
Received: September 13, 2014 | Accepted: September 20, 2014 | Published: December 15, 2014
Abstract
Protein-protein interactions (PPIs) play a key role in many biological systems. They are involved in complex formation and in many pathways that carry out biological processes. Accurate identification of the set of interacting proteins can shed new light on the functional roles of various proteins in the complex environment of the cell. The ability to construct biologically meaningful gene networks and to identify the exact relationships within them is critical for present-day systems biology. Earlier research has studied the power of gene modules to shed light on the functioning of complex biological systems. Most modules in these networks have shown only a weak link to meaningful biological function, because these methods do not accurately calculate the semantic relationship between entities. To overcome these problems and improve PPI results on the biotext corpus, a new method is proposed in this research. The proposed method directly incorporates Gene Ontology (GO) annotation into the construction of gene modules and uses dictionary-based text processing to extract biotext information. The resulting Dictionary-Based Text and Gene Ontology (DBTGO) approach integrates various gene-gene pairwise similarity values with protein-protein interaction relationships obtained from gene expression in order to achieve better biotext information retrieval results. The results were analyzed on the corpus of the Biotext Project at UC Berkeley. Testing indicates that the DBTGO algorithm improves PPI relationship identification compared with the previously suggested methods in terms of precision, recall, F-measure and Normalized Discounted Cumulative Gain (NDCG). The proposed DBTGO algorithm can facilitate comprehensive and in-depth analysis of high-throughput experimental data at the gene network level.
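The abstract names four evaluation measures without giving their formulas. The sketch below uses the standard textbook definitions and is not taken from the paper: the function names, the linear-gain DCG with a log2 rank discount, and the protein identifiers in the usage note are all illustrative assumptions.

```python
import math


def precision_recall_f(predicted, gold):
    """Precision, recall and balanced F-measure over sets of interaction pairs.

    Assumption: each pair is a hashable item, e.g. a tuple of two protein names.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

As a usage example with hypothetical protein names, `precision_recall_f({("A", "B"), ("A", "C"), ("B", "D")}, {("A", "B"), ("B", "D"), ("C", "D")})` counts two correct pairs out of three predicted and three annotated, and `ndcg([1, 0, 1])` scores a ranked list whose first and third retrieved relations are relevant.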
Keywords:
Biotext corpus, gene network, gene ontology, Information Extraction (IE), Named Entity Relationship (NER), preprocessing, Protein-Protein Interaction (PPI), word-sense disambiguator
Competing interests
The authors have no competing interests.
Open Access Policy
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
ISSN (Online): 2040-7467
ISSN (Print): 2040-7459