
     Research Journal of Applied Sciences, Engineering and Technology


Text Categorization using Reduced Training Set

1Mohamed Goudjil, 1Mouloud Koudil, 2Mouldi Bedda and 3Noureddine Ghoggali
1Ecole Nationale Supérieure d’Informatique (ESI), Oued Smar, Algiers, Algeria
2AL JOUF University, Sakaka, Kingdom of Saudi Arabia
3University of Batna, Batna, Algeria
Research Journal of Applied Sciences, Engineering and Technology   2015  12:1363-1369
http://dx.doi.org/10.19026/rjaset.10.1835  |  © The Author(s) 2015
Received: February 26, 2015  |  Accepted: March 25, 2015  |  Published: August 25, 2015

Abstract

Machine learning approaches to text categorization teach a system to classify documents from labeled samples. In real applications, collecting labeled training samples to design a classifier is often difficult because the labeling process is complex and costly. One possible solution is to exploit the large number of unlabeled samples that are available from the web at no cost. Active learning aims to reduce the labeling effort while retaining accuracy by intelligently selecting the samples to be labeled. This study presents a novel active learning method for text classification that selects a batch of informative samples for manual labeling by an expert. The proposed method uses the posterior probability output of a multi-class SVM. Experiments on two well-known datasets show that the proposed active learning method can significantly reduce the need for labeled training data.
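The batch selection step can be illustrated with a minimal sketch. The abstract does not state the exact selection criterion, so the code below assumes a common one: query the pool samples whose two highest class posterior probabilities are closest (smallest margin). The function name `select_batch`, the parameter `batch_size`, and the toy probabilities are hypothetical and are not taken from the paper; in practice the posteriors would come from a multi-class SVM that outputs probability estimates.

```python
from typing import List, Sequence


def select_batch(posteriors: Sequence[Sequence[float]], batch_size: int) -> List[int]:
    """Return indices of the pool samples with the smallest margin between
    their two highest class posterior probabilities.

    Assumption (not from the paper): a small margin marks an informative
    sample. Each row must hold at least two class probabilities.
    """
    margins = []
    for idx, probs in enumerate(posteriors):
        # Two largest posteriors for this sample, in descending order.
        top_two = sorted(probs, reverse=True)[:2]
        margins.append((top_two[0] - top_two[1], idx))
    # Smallest margin first: these are the most ambiguous samples.
    margins.sort()
    return [idx for _, idx in margins[:batch_size]]


# Toy posteriors for four unlabeled documents and three classes (made up).
pool_posteriors = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # ambiguous
    [0.34, 0.33, 0.33],  # most ambiguous
    [0.70, 0.20, 0.10],
]

# Indices of the documents that would be sent to the expert for labeling.
queried = select_batch(pool_posteriors, batch_size=2)
print(queried)
```

In a pool-based loop, the queried documents would be labeled by the expert, moved from the pool to the training set, and the classifier retrained before the next batch is selected.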

Keywords:

Active learning, pairwise coupling, pool-based active learning, support vector machine, text classification


Competing interests

The authors have no competing interests.

Open Access Policy

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

ISSN (Online):  2040-7467
ISSN (Print):   2040-7459
Copyright © 2024. MAXWELL Scientific Publication Corp., All rights reserved