     Research Journal of Applied Sciences, Engineering and Technology


Document Similarity Measure Based on the Earth Mover's Distance Utilizing Latent Dirichlet Allocation

1Min-Hee Jang, 1Tae-Hwan Eom, 1Sang-Wook Kim and 2Young-Sup Hwang
1Department of Computer and Software, Hanyang University, 17 Haengdang-Dong, Seongdong-Gu, Seoul 133-791
2Department of Computer Science and Engineering, Sun Moon University, Sunmoonro 221-70 Tangjoong-Myoon, Asan, Chungnam, 336-708, Korea
Research Journal of Applied Sciences, Engineering and Technology  2016  2:214-222
http://dx.doi.org/10.19026/rjaset.12.2323  |  © The Author(s) 2016
Received: July 2, 2015  |  Accepted: August 2, 2015  |  Published: January 20, 2016

Abstract

Document similarity measures are used to retrieve documents similar to a given query document. Text-based document similarity is computed by comparing the words in the documents. Cosine similarity, the most popular text-based measure, computes the similarity of two documents from the frequencies of the words they share. Because it counts only exactly matching words, it cannot reflect the semantic similarity between different words that have the same meaning. To solve this problem, we propose a new document similarity measure based on the Earth Mover’s Distance (EMD), which makes it possible to compute the semantic similarity of documents. Applying the EMD to document similarity raises two issues: its high computational complexity and the need to define a distance between attributes. The high computational complexity stems from the large number of words in documents. We therefore extract topics from the documents using Latent Dirichlet Allocation (LDA), a generative model of documents. Since the number of topics is much smaller than the number of words, LDA reduces the computational complexity. We define the distance between topics using cosine similarity. Experimental results on real-world document databases show that the proposed measure finds similar documents more accurately than cosine similarity because it reflects semantic similarity.
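
The idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each document is represented by its topic proportions, that each topic is a probability distribution over a shared vocabulary, and that the ground distance between two topics is one minus their cosine similarity. The topic vectors below are hand-written stand-ins for LDA output, and the names `cosine_similarity`, `emd`, `topics`, `dist`, `doc_a`, `doc_b`, and `doc_c` are hypothetical. The EMD is solved here with a small successive-shortest-path min-cost flow, chosen only to keep the example dependency-free.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def emd(p, q, dist, eps=1e-12):
    """EMD between distributions p and q (equal total mass) under ground distance dist.

    Solved as a min-cost flow with successive shortest paths (Bellman-Ford).
    Adequate for a handful of topics; not tuned for large inputs.
    """
    m, n = len(p), len(q)
    source, sink = m + n, m + n + 1
    # Each edge is [target, residual capacity, cost, index of its reverse edge].
    graph = [[] for _ in range(m + n + 2)]

    def add_edge(u, v, cap, cost):
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0.0, -cost, len(graph[u]) - 1])

    for i in range(m):
        add_edge(source, i, p[i], 0.0)
        for j in range(n):
            add_edge(i, m + j, math.inf, dist[i][j])
    for j in range(n):
        add_edge(m + j, sink, q[j], 0.0)

    total_cost = total_flow = 0.0
    while True:
        # Bellman-Ford: cheapest path from source to sink in the residual graph.
        best = [math.inf] * len(graph)
        prev = [None] * len(graph)
        best[source] = 0.0
        for _ in range(len(graph) - 1):
            changed = False
            for u in range(len(graph)):
                if best[u] == math.inf:
                    continue
                for k, (v, cap, cost, _rev) in enumerate(graph[u]):
                    if cap > eps and best[u] + cost < best[v] - eps:
                        best[v], prev[v], changed = best[u] + cost, (u, k), True
            if not changed:
                break
        if best[sink] == math.inf:
            break  # no remaining capacity: all mass has been moved
        # Walk back along the path, then push as much mass as it allows.
        path, v = [], sink
        while v != source:
            u, k = prev[v]
            path.append((u, k))
            v = u
        push = min(graph[u][k][1] for u, k in path)
        for u, k in path:
            edge = graph[u][k]
            edge[1] -= push
            graph[edge[0]][edge[3]][1] += push
        total_flow += push
        total_cost += push * best[sink]
    return total_cost / total_flow if total_flow else 0.0


# Hypothetical topics over the vocabulary
# ["car", "automobile", "engine", "apple", "fruit", "banana"].
topics = [
    [0.5, 0.1, 0.4, 0.0, 0.0, 0.0],  # vehicle topic that favors "car"
    [0.1, 0.5, 0.4, 0.0, 0.0, 0.0],  # vehicle topic that favors "automobile"
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # food topic
]
# Assumed ground distance between topics: 1 - cosine similarity, clamped at 0.
dist = [[max(0.0, 1.0 - cosine_similarity(s, t)) for t in topics] for s in topics]

doc_a = [1.0, 0.0, 0.0]  # topic proportions of three toy documents
doc_b = [0.0, 1.0, 0.0]
doc_c = [0.0, 0.0, 1.0]

print("cosine(a, b):", cosine_similarity(doc_a, doc_b))
print("cosine(a, c):", cosine_similarity(doc_a, doc_c))
print("EMD(a, b):", emd(doc_a, doc_b, dist))
print("EMD(a, c):", emd(doc_a, doc_c, dist))
```

In this toy setting, cosine similarity on the topic proportions treats `doc_b` and `doc_c` as equally unrelated to `doc_a`, whereas the EMD places `doc_b` closer because the two vehicle topics share most of their word mass. A smaller EMD means more similar documents; how the paper converts the distance into a similarity score is not stated in the abstract.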

Keywords:

Cosine similarity, document similarity, earth mover, latent dirichlet allocation, semantic similarity


Competing interests

The authors have no competing interests.

Open Access Policy

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

ISSN (Online):  2040-7467
ISSN (Print):   2040-7459
Copyright © 2024. MAXWELL Scientific Publication Corp., All rights reserved