Research Article | OPEN ACCESS
Building Words Dictionary List Using Symbol Enumeration and Hashing Methodology
Safa S. Abdul-Jabbar and Dr. Loay E. George
Computer Science Department, College of Science, Baghdad University, Baghdad, Iraq
Research Journal of Applied Sciences, Engineering and Technology 2016 12:885-894
Received: September 23, 2016 | Accepted: November 15, 2016 | Published: December 15, 2016
Abstract
This study introduces a new method for reducing the time needed by text retrieval systems. The method builds a word dictionary that takes advantage of string enumeration, a multi-hashing methodology, stop-word extraction and word stemming. Dictionary-based text mining plays an important role in understanding and analyzing the large text datasets used in searching, matching and information retrieval systems. All of these systems mainly deal with strings (i.e., an undefined number of alphabet characters in each word and an undefined number of words in each sentence) and with text processing operations. This has a significant effect on their execution time because of hidden overhead operations (such as symbol matching calculations and character conversion operations). Some of the experimental results obtained for these operations are provided, together with a comparison between the results of the proposed method and those of the traditional method, which deals directly with strings only. The comparisons are given for each step to show the advantage of the proposed approach. The results demonstrate that the proposed approach reduces the execution time of each step, which in turn improves the overall execution time of the whole system while maintaining the accuracy of the operations.
Keywords:
Data editors, hashing methodology, stemming, stop-words, string enumeration, string hashing, string matching operation, word dictionary
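The abstract does not give the details of the enumeration or multi-hashing scheme, so the following Python sketch only illustrates the general idea under stated assumptions: stop-words are dropped, each remaining word is reduced to a stem, and each distinct stem is assigned a small integer ID so that later matching compares integers instead of strings. The stop-word list, the `crude_stem` helper and the function names are hypothetical placeholders, and Python's built-in `dict` stands in for the paper's hashing methodology.

```python
from typing import Dict, List

# Assumption: a tiny illustrative stop-word list, not the list used in the paper.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

PUNCTUATION = ".,;:!?\"'()"


def crude_stem(word: str) -> str:
    """Strip a few common suffixes.

    A deliberately simple stand-in for a real stemmer such as Porter's.
    """
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation, drop stop-words, and stem each token."""
    stems = []
    for token in text.lower().split():
        token = token.strip(PUNCTUATION)
        if token and token not in STOP_WORDS:
            stems.append(crude_stem(token))
    return stems


def build_dictionary(text: str) -> Dict[str, int]:
    """Enumerate distinct stems: each one gets the next free integer ID."""
    word_ids: Dict[str, int] = {}
    for stem in normalize(text):
        # dict is a hash table, so this membership test is hash-based.
        if stem not in word_ids:
            word_ids[stem] = len(word_ids)
    return word_ids


def encode(text: str, word_ids: Dict[str, int]) -> List[int]:
    """Replace each known word by its ID; unknown words are skipped."""
    return [word_ids[s] for s in normalize(text) if s in word_ids]


if __name__ == "__main__":
    ids = build_dictionary("The cats and the dogs are running in the garden.")
    print(ids)
    print(encode("dogs running", ids))
```

Once a query has been encoded this way, searching and matching operate on integer IDs, which is the kind of saving over direct string comparison that the abstract attributes to the proposed dictionary.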
Competing interests
The authors have no competing interests.
Open Access Policy
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
ISSN (Online): 2040-7467
ISSN (Print): 2040-7459