Research Article | OPEN ACCESS
Text Categorization using Reduced Training Set
1Mohamed Goudjil, 1Mouloud Koudil, 2Mouldi Bedda and 3Noureddine Ghoggali
1Ecole Nationale Supérieure d’Informatique (ESI), Oued Smar, Algiers, Algeria
2AL JOUF University, Sakaka, Kingdom of Saudi Arabia
3University of Batna, Batna, Algeria
Research Journal of Applied Sciences, Engineering and Technology 2015 12:1363-1369
Received: February 26, 2015 | Accepted: March 25, 2015 | Published: August 25, 2015
Abstract
Machine learning approaches to text categorization teach a system to classify documents from labeled samples. In real application scenarios, collecting labeled training samples to design a classifier is rarely trivial because of the complexity and cost of the labeling process. One possible solution is to exploit the large number of unlabeled samples that are accessible at zero cost from the web. Active learning aims to reduce the required labeling effort while retaining accuracy by intelligently selecting the samples to be labeled. This study presents a novel active learning method for text classification that selects a batch of informative samples for manual labeling by an expert. The proposed method uses the posterior probability output of a multi-class SVM. Experiments on two well-known datasets show that the proposed active learning method can significantly reduce the need for labeled training data.
Keywords:
Active learning, pairwise coupling, pool-based active learning, support vector machine, text classification
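The batch selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact selection criterion, so this example assumes a common stand-in: rank each unlabeled pool sample by the margin between its two largest class posteriors and pick the samples with the smallest margin. The names `top_two_margin`, `select_batch`, and `pool` are hypothetical, and the posterior values are made up for illustration; in practice they would come from a multi-class SVM with probability estimates, such as those obtained by pairwise coupling.

```python
from typing import List, Sequence


def top_two_margin(posteriors: Sequence[float]) -> float:
    """Gap between the two largest class probabilities (small gap = uncertain)."""
    ranked = sorted(posteriors, reverse=True)
    return ranked[0] - ranked[1]


def select_batch(pool_posteriors: Sequence[Sequence[float]],
                 batch_size: int) -> List[int]:
    """Return indices of the `batch_size` pool samples with the smallest margin.

    Assumption: smallest top-two margin is used as the informativeness
    criterion; the paper's actual criterion may differ.
    """
    order = sorted(range(len(pool_posteriors)),
                   key=lambda i: top_two_margin(pool_posteriors[i]))
    return order[:batch_size]


# Illustrative posteriors for four unlabeled documents over three classes.
pool = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # fairly uncertain
    [0.34, 0.33, 0.33],  # most uncertain
    [0.70, 0.20, 0.10],  # moderately confident
]

# Indices of the two documents an expert would be asked to label next.
batch = select_batch(pool, 2)
print(batch)
```

In a full pool-based loop, the selected documents would be labeled by the expert, moved from the pool to the training set, and the classifier retrained before the next selection round.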
Competing interests
The authors have no competing interests.
Open Access Policy
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
ISSN (Online): 2040-7467
ISSN (Print): 2040-7459