     Research Journal of Applied Sciences, Engineering and Technology


A Precise Distance Metric for Mixed Data Clustering using Chi-square Statistics

S. Mohanavalli and S.M. Jaisakthi
SSN College of Engineering, Chennai, Tamil Nadu-603110, India
Research Journal of Applied Sciences, Engineering and Technology   2015  12:1441-1444
http://dx.doi.org/10.19026/rjaset.10.1846  |  © The Author(s) 2015
Received: May 1, 2015  |  Accepted: May 10, 2015  |  Published: August 25, 2015

Abstract

Real-world data often mix numerical and categorical attributes. Traditional clustering algorithms perform well on numerical data but produce poor results on mixed data. To partition such data well, the distance metric must discriminate between data points with mixed attributes and balance the categorical and numerical distances appropriately. In this study we propose a chi-square based statistical approach to determine the weights of the attributes. The resulting weight vector is used to derive the distance matrix of the mixed dataset, and the data points are then clustered from that matrix using traditional clustering algorithms. Experiments were carried out on the UCI benchmark datasets heart, credit and vote, and the proposed method was also tested on a real-time bank dataset. The clustering accuracy obtained is better than that of existing works.
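
The pipeline described above (chi-square attribute weights, then a weighted distance matrix, then clustering) can be illustrated with a minimal Python sketch. The abstract does not state what each attribute is tested against or how the weights enter the distance, so everything specific here is an assumption rather than the authors' method: the chi-square statistic is taken between each attribute and a reference labeling (`reference`, for example class labels or a preliminary partition), numeric columns are discretized into equal-width bins, weights are normalized to sum to 1, and the distance is a weighted Euclidean form over min-max scaled numeric differences and 0/1 categorical mismatches. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def chi_square(xs, ys):
    """Pearson chi-square statistic of the contingency table of xs against ys."""
    n = len(xs)
    joint, row, col = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    stat = 0.0
    for r, n_r in row.items():
        for c, n_c in col.items():
            expected = n_r * n_c / n
            stat += (joint.get((r, c), 0) - expected) ** 2 / expected
    return stat


def bin_numeric(values, bins=3):
    """Equal-width binning so a numeric column can enter a contingency table."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # constant column: avoid division by zero
    return [min(int((v - lo) / width), bins - 1) for v in values]


def attribute_weights(rows, is_numeric, reference):
    """Assumed weighting: chi-square score per attribute, normalized to sum to 1."""
    scores = []
    for j, numeric in enumerate(is_numeric):
        column = [r[j] for r in rows]
        scores.append(chi_square(bin_numeric(column) if numeric else column, reference))
    total = sum(scores) or 1.0
    return [s / total for s in scores]


def distance_matrix(rows, is_numeric, weights):
    """Assumed distance: weighted Euclidean over per-attribute differences in [0, 1]."""
    n = len(rows)
    spans = []  # value range of each numeric column, None for categorical ones
    for j, numeric in enumerate(is_numeric):
        column = [r[j] for r in rows]
        spans.append((max(column) - min(column)) or 1.0 if numeric else None)
    dist = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            total = 0.0
            for j, numeric in enumerate(is_numeric):
                if numeric:
                    d = abs(rows[a][j] - rows[b][j]) / spans[j]
                else:
                    d = 0.0 if rows[a][j] == rows[b][j] else 1.0
                total += weights[j] * d * d
            dist[a][b] = dist[b][a] = sqrt(total)
    return dist


# Toy data (invented for illustration): one numeric and two categorical attributes.
rows = [(25, "yes", "x"), (30, "yes", "y"), (60, "no", "x"), (65, "no", "y")]
is_numeric = [True, False, False]
reference = ["a", "a", "b", "b"]  # assumed reference labeling

weights = attribute_weights(rows, is_numeric, reference)  # [0.5, 0.5, 0.0]
dist = distance_matrix(rows, is_numeric, weights)
```

In this toy example the third attribute is independent of the reference labeling, so it receives zero weight and does not affect the distances. The resulting `dist` matrix could be handed to any clustering routine that accepts precomputed distances.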

Keywords:

Chi-square statistics, clustering, mixed data attributes



Competing interests

The authors have no competing interests.

Open Access Policy

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

ISSN (Online):  2040-7467
ISSN (Print):   2040-7459
Copyright © 2024. MAXWELL Scientific Publication Corp., All rights reserved