Text Categorization using Reduced Training Set

Mohamed Goudjil; Mouloud Koudil; Mouldi Bedda

doi:10.19026/rjaset.10.1835

Abstract

Machine learning approaches to text categorization teach the system how to classify by means of labeled samples. In real application scenarios, collecting the labeled training samples needed to design a classifier is rarely trivial, because the labeling process is complex and costly. One possible solution is to exploit the large number of unlabeled samples that are accessible at zero cost from the web. Active learning strives to reduce the required labeling effort while retaining accuracy by intelligently selecting the samples to be labeled. This study presents a novel active learning method for text classification that selects a batch of informative samples for manual labeling by an expert. The proposed method uses the posterior probability output of a multi-class SVM. Experiments are performed on two well-known datasets, and the results show that the proposed active learning method can significantly reduce the need for labeled training data.
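As a rough illustration of the batch selection step described above, the sketch below ranks unlabeled pool samples by how uncertain their class posteriors are and returns the most uncertain ones for labeling. The abstract does not state the exact selection criterion, so the margin between the two largest posteriors is an assumption made here for illustration, and the names `margin`, `select_batch`, and `pool_posteriors` are hypothetical. In practice the posteriors would come from a multi-class SVM with probability outputs; here they are hard-coded.

```python
from typing import List, Sequence


def margin(probs: Sequence[float]) -> float:
    """Difference between the two largest class posteriors (small = uncertain)."""
    top_two = sorted(probs, reverse=True)[:2]
    return top_two[0] - top_two[1]


def select_batch(posteriors: Sequence[Sequence[float]], batch_size: int) -> List[int]:
    """Return indices of the `batch_size` pool samples with the smallest margin.

    Assumed criterion (not taken from the paper): a small gap between the two
    most probable classes marks a sample as informative.
    """
    ranked = sorted(range(len(posteriors)), key=lambda i: margin(posteriors[i]))
    return ranked[:batch_size]


# Hypothetical posteriors for four unlabeled documents over three classes.
pool_posteriors = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.30, 0.30],  # moderately uncertain
    [0.50, 0.45, 0.05],  # two classes compete
    [0.34, 0.33, 0.33],  # almost uniform
]

# Indices of the two documents to hand to the expert for labeling.
selected = select_batch(pool_posteriors, batch_size=2)
print(selected)
```

In a full pool-based loop, the selected documents would be labeled by the expert, moved from the pool to the training set, and the classifier retrained before the next round of selection.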

Keywords:

Active learning, pairwise coupling, pool-based active learning, support vector machine, text classification


Research Journal of Applied Sciences, Engineering and Technology
