Research Article | OPEN ACCESS
Document Similarity Measure Based on the Earth Mover's Distance Utilizing Latent Dirichlet Allocation
1Min-Hee Jang, 1Tae-Hwan Eom, 1Sang-Wook Kim and 2Young-Sup Hwang
1Department of Computer and Software, Hanyang University, 17 Haengdang-Dong, Seongdong-Gu, Seoul 133-791
2Department of Computer Science and Engineering, Sun Moon University, Sunmoonro 221-70 Tangjoong-Myoon, Asan, Chungnam, 336-708, Korea
Research Journal of Applied Sciences, Engineering and Technology 2016 2:214-222
Received: July 2, 2015 | Accepted: August 2, 2015 | Published: January 20, 2016
Abstract
Document similarity measures are used to retrieve documents similar to a given query document. Text-based document similarity is computed by comparing the words in the documents. Cosine similarity, the most popular text-based measure, computes the similarity of two documents from the frequencies of the words they share. Because it counts only exactly matching words, it cannot reflect the semantic similarity between different words that have the same or similar meaning. We propose a new document similarity measure that addresses this problem by using the Earth Mover’s Distance (EMD), which makes it possible to compute the semantic similarity of documents. Applying the EMD to document similarity requires addressing two issues: its high computational complexity and the definition of the distance between attributes. The high computational complexity stems from the large number of words in documents. We therefore extract topics from the documents using Latent Dirichlet Allocation (LDA), a generative model of documents. Since the number of topics is much smaller than the number of words, LDA helps reduce the computational complexity. We define the distance between topics using the cosine similarity. Experimental results on real-world document databases show that the proposed measure finds similar documents more accurately than the cosine similarity because it reflects semantic similarity.
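The idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the topic-word vectors and document topic distributions are made-up values standing in for LDA output, the ground distance is assumed to be 1 minus the cosine similarity between topic-word vectors (the abstract only states that cosine similarity is used), and the EMD is solved with a generic successive-shortest-path routine chosen here for brevity. The helper names `cosine` and `emd` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def emd(p, q, dist, eps=1e-12):
    """EMD between equal-mass histograms p and q with ground distances dist[i][j].

    Solved as a min-cost flow by successive shortest paths (Bellman-Ford);
    adequate for a handful of topics, not tuned for large inputs.
    """
    m, n = len(p), len(q)
    s, t, size = m + n, m + n + 1, m + n + 2
    graph = [[] for _ in range(size)]
    edges = []  # each edge is [to, residual capacity, cost]; edge k pairs with k ^ 1

    def add(u, v, cap, cost):
        graph[u].append(len(edges)); edges.append([v, cap, cost])
        graph[v].append(len(edges)); edges.append([u, 0.0, -cost])

    for i in range(m):
        add(s, i, p[i], 0.0)            # source supplies mass p[i] to bin i of p
    for j in range(n):
        add(m + j, t, q[j], 0.0)        # bin j of q demands mass q[j]
    for i in range(m):
        for j in range(n):
            add(i, m + j, math.inf, dist[i][j])

    total = 0.0
    while True:
        d = [math.inf] * size
        prev = [-1] * size
        d[s] = 0.0
        for _ in range(size - 1):       # Bellman-Ford over the residual graph
            changed = False
            for u in range(size):
                if d[u] == math.inf:
                    continue
                for k in graph[u]:
                    v, cap, cost = edges[k]
                    if cap > eps and d[u] + cost < d[v] - 1e-12:
                        d[v], prev[v], changed = d[u] + cost, k, True
            if not changed:
                break
        if d[t] == math.inf:            # no augmenting path left: all mass is moved
            return total
        push, v = math.inf, t
        while v != s:                   # bottleneck capacity along the path
            push = min(push, edges[prev[v]][1])
            v = edges[prev[v] ^ 1][0]
        v = t
        while v != s:                   # send the flow and update residuals
            k = prev[v]
            edges[k][1] -= push
            edges[k ^ 1][1] += push
            v = edges[k ^ 1][0]
        total += push * d[t]

# Assumed topic-word weights (illustrative only): topics 0 and 1 use nearly the
# same words, topic 2 uses a different word entirely.
topics = [[3.0, 4.0, 0.0, 0.0], [4.0, 3.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]]
# Assumed ground distance between topics: 1 - cosine similarity.
ground = [[max(0.0, 1.0 - cosine(a, b)) for b in topics] for a in topics]

# Assumed per-document topic distributions (each sums to 1).
doc_a, doc_b, doc_c = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

print("cosine:", cosine(doc_a, doc_b), cosine(doc_a, doc_c))
print("EMD:   ", emd(doc_a, doc_b, ground), emd(doc_a, doc_c, ground))
```

In this toy setting the cosine similarity of the topic histograms is 0 for both pairs, so it cannot tell that document B is about a closely related topic while document C is not. The EMD is about 0.04 for A and B and 1.0 for A and C, because moving mass between the two similar topics is cheap under the assumed ground distance.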
Keywords:
Cosine similarity, document similarity, earth mover’s distance, latent Dirichlet allocation, semantic similarity
Competing interests
The authors have no competing interests.
Open Access Policy
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
ISSN (Online): 2040-7467
ISSN (Print): 2040-7459