Unsupervised Discretization: An Analysis of Classification Approaches for Clinical Datasets

M. Shanmugapriya; H. Khanna Nehemiah; R.S. Bhuvaneswaran; Kannan Arputharaj; J. Jabez Christopher

doi:10.19026/rjaset.14.3991

Research Journal of Applied Sciences, Engineering and Technology

Research Article | OPEN ACCESS

¹M. Shanmugapriya, ¹H. Khanna Nehemiah, ¹R.S. Bhuvaneswaran, ²Kannan Arputharaj and ¹J. Jabez Christopher

¹Ramanujan Computing Centre
²Department of Information Science and Technology, Anna University, Chennai-600025, India

Research Journal of Applied Sciences, Engineering and Technology 2017 2:67-72

http://dx.doi.org/10.19026/rjaset.14.3991 | © The Author(s) 2017

Received: June 28, 2016 | Accepted: August 9, 2016 | Published: February 15, 2017

Abstract

Discretization is a frequently used data preprocessing technique for improving the performance of data mining tasks in knowledge discovery from clinical data. It transforms real-world quantitative data into qualitative data. This study presents an experimental analysis of how the performance of two simple unsupervised discretization methods varies across classification approaches. Equal width discretization and equal frequency discretization were applied to four benchmark clinical datasets obtained from the University of California, Irvine (UCI) machine learning repository. Both methods were used to transform quantitative attributes into qualitative attributes with three, five, seven and ten intervals. Six classification approaches were evaluated using four evaluation measures. The results show that the performance of the classification algorithms varies: classification accuracy depends on both the discretization method used and the number of discretization intervals. Moreover, it can be inferred that different classification approaches require different discretization methods. No single method can be deemed the best suited to all applications; hence the choice of an appropriate discretization method depends on data distribution, data interpretability, correlation, classification performance and domain of application.
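To make the two methods concrete, the following is a minimal sketch of equal width and equal frequency discretization using only the Python standard library. It is an illustration under stated assumptions, not the implementation used in the study: the function names (`equal_width_edges`, `equal_frequency_edges`, `discretize`), the convention for which side of a cut point is closed, and the sample values are all hypothetical.

```python
from bisect import bisect_right


def equal_width_edges(values, k):
    """Return the k-1 interior cut points that split [min, max] into k
    intervals of equal width (hypothetical helper, not from the paper)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Return k-1 interior cut points chosen so that each interval holds
    roughly len(values) / k observations. Heavily tied data can yield
    repeated cut points, which leaves some intervals empty."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def discretize(values, edges):
    """Map each value to an interval index in 0..len(edges).
    A value equal to a cut point is placed in the interval above it
    (an assumed convention)."""
    return [bisect_right(edges, v) for v in values]


# Illustrative data with one extreme value; not taken from any UCI dataset.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 100]

width_bins = discretize(sample, equal_width_edges(sample, 3))
freq_bins = discretize(sample, equal_frequency_edges(sample, 3))
```

With three intervals, the equal width cut points for this sample are 34.0 and 67.0, so the single extreme value pushes the eight small values into one interval. The equal frequency cut points are 4 and 7, which gives three intervals of three values each. This sensitivity to the data distribution is one reason the choice of method and interval count (three, five, seven or ten in the study) can affect classification performance.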

Keywords:

Classification, clinical knowledge-mining, equal frequency discretization, equal width discretization, qualitative data, quantitative data

Competing interests

The authors have no competing interests.

Open Access Policy

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

ISSN (Online): 2040-7467
ISSN (Print): 2040-7459
