Research Article | OPEN ACCESS
Unsupervised Discretization: An Analysis of Classification Approaches for Clinical Datasets
1M. Shanmugapriya, 1H. Khanna Nehemiah, 1R.S. Bhuvaneswaran, 2Kannan Arputharaj and 1J. Jabez Christopher
1Ramanujan Computing Centre
2Department of Information Science and Technology, Anna University, Chennai-600025, India
Research Journal of Applied Sciences, Engineering and Technology 2017 2:67-72
Received: June 28, 2016 | Accepted: August 9, 2016 | Published: February 15, 2017
Abstract
Discretization is a frequently used data preprocessing technique for improving the performance of data mining tasks in knowledge discovery from clinical data. It transforms real-world quantitative data into qualitative data. The aim of this study is to present an experimental analysis of how the performance of two simple unsupervised discretization methods varies across classification approaches. Equal width discretization and equal frequency discretization were applied to four benchmark clinical datasets obtained from the University of California, Irvine (UCI) Machine Learning Repository. Both methods were used to transform quantitative attributes into qualitative attributes with three, five, seven and ten intervals. Six classification approaches were evaluated using four evaluation measures. The results show that the performance of the classification algorithms varies: classification accuracy depends on both the discretization method used and the number of discretization intervals. Moreover, it can be inferred that different classification approaches require different discretization methods. No method can be deemed the most suitable for all applications; hence, the choice of an appropriate discretization method depends on data distribution, data interpretability, correlation, classification performance and domain of application.
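The two methods named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the sample values, and the choice to assign a value that falls exactly on a cut point to the upper interval are all assumptions made for illustration. Equal width discretization splits the observed range into k intervals of the same width, while equal frequency discretization places the cut points so that each interval holds roughly the same number of values.

```python
from bisect import bisect_right


def equal_width_edges(values, k):
    """Interior cut points splitting [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Interior cut points chosen so each interval holds about len(values)/k items.

    Heavily tied data can yield repeated cut points and therefore empty intervals.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def discretize(values, edges):
    """Map each value to an interval index in 0..len(edges).

    Assumption: a value equal to a cut point goes to the upper interval.
    """
    return [bisect_right(edges, v) for v in values]


if __name__ == "__main__":
    # Hypothetical attribute values with one outlier (not taken from the paper).
    sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
    # Interval counts used in the study: three, five, seven and ten.
    for k in (3, 5, 7, 10):
        print(k, "equal width:    ", discretize(sample, equal_width_edges(sample, k)))
        print(k, "equal frequency:", discretize(sample, equal_frequency_edges(sample, k)))
```

With three intervals and the skewed sample above, equal width discretization puts the nine small values into the first interval and the outlier alone into the last, whereas equal frequency discretization spreads the values almost evenly. This difference in how the two methods react to the data distribution is one reason the choice of method can affect a downstream classifier.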
Keywords:
Classification, clinical knowledge-mining, equal frequency discretization, equal width discretization, qualitative data, quantitative data
Competing interests
The authors have no competing interests.
Open Access Policy
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
ISSN (Online): 2040-7467
ISSN (Print): 2040-7459