Research Article | OPEN ACCESS
Linear Reranking Model for Chinese Pinyin-to-Character Conversion
1Xinxin Li, 1Xuan Wang, 1Lin Yao and 1, 2Muhammad Waqas Anwar
1Computer Application Research Center, Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China
2Department of Computer Science, COMSATS Institute of Information Technology, Abbottabad, Pakistan
Research Journal of Applied Sciences, Engineering and Technology 2014 5:975-980
Received: January 31, 2013 | Accepted: February 25, 2013 | Published: February 05, 2014
Pinyin-to-character conversion is an important task in Chinese natural language processing. Previous work has mainly focused on n-gram language models and machine learning approaches, sometimes with additional hand-crafted or automatically derived rule-based post-processing. Word n-gram language models leave two problems unsolved: out-of-vocabulary word recognition and long-distance grammatical constraints. In this study, we propose a linear reranking model that attempts to address these problems. Our model uses the minimum error learning method to combine different sub-models, which include word and character n-gram language models (LMs), a part-of-speech tagging model and a dependency model. The impact of the different sub-models on conversion is evaluated experimentally and analyzed. Results on the Lancaster Corpus of Mandarin Chinese show that our new model outperforms the word n-gram language model.
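The combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sub-model names (`word_lm`, `char_lm`, `pos`), the scores, and the toy n-best list are all hypothetical, and the coarse grid search in `tune_weights` is only a simplified stand-in for minimum error learning. Each candidate conversion carries one score per sub-model, the reranker picks the candidate with the highest weighted sum, and the weights are chosen to minimize character errors on a development set.

```python
from itertools import product


def rerank(candidates, weights):
    """Return the candidate with the highest weighted sum of sub-model scores."""
    def total(cand):
        return sum(weights[name] * score for name, score in cand["scores"].items())
    return max(candidates, key=total)


def char_errors(hyp, ref):
    """Count position-wise character mismatches (the pinyin input fixes the length)."""
    return sum(h != r for h, r in zip(hyp, ref))


def tune_weights(dev_set, names, grid=(0.0, 0.5, 1.0)):
    """Pick the grid point that minimizes total character errors on dev_set.

    Simplified stand-in for minimum error learning; an exhaustive grid is
    only practical for a handful of sub-models.
    """
    best_weights, best_err = None, float("inf")
    for values in product(grid, repeat=len(names)):
        weights = dict(zip(names, values))
        err = sum(
            char_errors(rerank(cands, weights)["text"], ref)
            for cands, ref in dev_set
        )
        if err < best_err:
            best_weights, best_err = weights, err
    return best_weights, best_err


# Hypothetical n-best list for the pinyin "shi shi"; scores are made-up log-probabilities.
nbest = [
    {"text": "实施", "scores": {"word_lm": -2.0, "char_lm": -3.0, "pos": -1.5}},
    {"text": "事实", "scores": {"word_lm": -2.2, "char_lm": -2.0, "pos": -1.0}},
]
dev_set = [(nbest, "事实")]

word_only = rerank(nbest, {"word_lm": 1.0, "char_lm": 0.0, "pos": 0.0})
combined = rerank(nbest, {"word_lm": 1.0, "char_lm": 1.0, "pos": 1.0})
tuned_weights, tuned_err = tune_weights(dev_set, ["word_lm", "char_lm", "pos"])
```

In this toy case the word n-gram score alone prefers the wrong candidate, while the weighted combination selects the reference, which is the effect the reranking model is designed to exploit.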
Dependency model, minimum error learning method, part-of-speech tagging, word n-gram model
Competing interests
The authors have no competing interests.
Open Access Policy
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
ISSN (Online): 2040-7467
ISSN (Print): 2040-7459