Research Article | OPEN ACCESS
Utilizing WordNet and Regular Expressions for Instance-based Schema Matching
Ahmed Mounaf Mahdi and Sabrina Tiun
Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Selangor, Malaysia
Research Journal of Applied Sciences, Engineering and Technology 2014 4:460-470
Received: January 20, 2014 | Accepted: February 06, 2014 | Published: July 25, 2014
Abstract
Instance-based matching is the process of finding correspondences between schema elements by comparing the data stored in different data sources. It serves as an alternative when matching on the schema elements themselves fails. Instance-based matching is applied in many areas, such as website creation and management, schema evolution and migration, data warehousing, database design and data integration. Schema information (element name, description, data type, etc.) is sometimes unavailable, or is insufficient to produce a correct match, especially when element names are abbreviations; when schema-level matching fails, the next step is to examine the values stored in the schemas. For these reasons, many recent approaches focus on instance-based matching. In this study, we propose an approach that combines pattern recognition using regular expressions for the numerical domain with WordNet-based measures for the string domain, yielding a similarity coefficient in the range [0,1]. In the previous approach, regular expressions achieved good accuracy for numerical instances only and were not applied to string instances, because the meaning of a string must be known to decide whether there is a match. Using WordNet-based measures for string instances is expected to improve effectiveness in terms of Precision (P), Recall (R) and F-measure (F). The approach was evaluated on a real dataset, and the results were better than those obtained using only an equality measure for strings, especially when the schemas are disjoint. The approach achieved an F-measure of 95.3%.
Keywords:
Instance-based matching, regular expression, schema matching, WordNet
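The regular-expression side of the approach and the evaluation measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the pattern set, the function names and the scoring rule (the best pattern coverage shared by two columns) are assumptions made for illustration, and the WordNet-based string measure is omitted because it depends on an external lexical database.

```python
import re

# Hypothetical pattern set for illustration only; the expressions used in
# the paper are not given in this abstract.
PATTERNS = {
    "integer": re.compile(r"\d+"),
    "decimal": re.compile(r"\d+\.\d+"),
    "iso_date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "phone": re.compile(r"\d{3}-\d{3}-\d{4}"),
}


def pattern_profile(values):
    """Return, per pattern, the fraction of instances that fully match it."""
    total = len(values)
    if total == 0:
        return {name: 0.0 for name in PATTERNS}
    return {
        name: sum(1 for v in values if rx.fullmatch(v)) / total
        for name, rx in PATTERNS.items()
    }


def regex_similarity(col_a, col_b):
    """Similarity coefficient in [0, 1] between two columns of instances.

    Assumed scoring rule: for each pattern take the smaller of the two
    coverage fractions, then keep the best value over all patterns.
    """
    profile_a = pattern_profile(col_a)
    profile_b = pattern_profile(col_b)
    return max(min(profile_a[name], profile_b[name]) for name in PATTERNS)


def precision_recall_f(predicted, gold):
    """Precision, Recall and F-measure of predicted matches against gold."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f
```

Under these assumptions, two columns that both contain only phone-like values score 1.0, while a column of integers compared with a column of dates scores 0.0. A threshold on the coefficient would then decide which attribute pairs are reported as matches and passed to `precision_recall_f`.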
Competing interests
The authors have no competing interests.
Open Access Policy
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
ISSN (Online): 2040-7467
ISSN (Print): 2040-7459