Journal of Information Science and Engineering, Vol. 16, No. 6, pp. 903-922 (November 2000)

OCR Error Correction of an Inflectional Indian Language
Using Morphological Parsing

U. Pal, P. K. Kundu and B. B. Chaudhuri
Computer Vision and Pattern Recognition Unit
Indian Statistical Institute
203, B. T. Rd., Calcutta,
700035 India
E-mail: {umapada, bbc}

This paper presents an OCR (Optical Character Recognition) error detection and correction technique for Bangla, a highly inflectional Indian language that is the second-most popular language in India and the fifth-most popular in the world. The technique is based on morphological parsing. Using two separate lexicons, one of root words and one of suffixes, the candidate root-suffix pairs of each input string are detected, their grammatical agreement is tested, and the part (root or suffix) in which the error has occurred is noted. The correction is then applied to the erroneous part of the input string by means of a fast dictionary access technique. To do so, information about the error patterns generated by the OCR system is examined, and alternative strings are generated for each erroneous word. Among these alternatives, those whose root and suffix satisfy grammatical agreement are retained as suggested words. The desired word appears in the list of suggested words generated by the system in 84.22% of cases.

Keywords: OCR (Optical Character Recognition), error detection, error correction, Indian language, morphological parsing, suffix, inflectional language

Received November 13, 1998; revised October 18, 1999; accepted March 14, 2000.
Communicated by Zen Chen.
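
The pipeline summarized in the abstract (split into root and suffix, test agreement, locate the erroneous part, generate alternatives from OCR error patterns, keep those that parse) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the lexicon entries are invented Latin-script placeholders rather than real Bangla data, the grammatical categories and the confusion table are hypothetical, only single-character substitutions are tried, and plain dictionary lookups stand in for the fast dictionary access technique. It is not the authors' implementation.

```python
# Toy lexicons; entries are invented Latin-script placeholders, not real Bangla data.
ROOTS = {"kar": "verb", "bal": "verb", "ghar": "noun"}
# Each suffix lists the root categories it may attach to ("" means no suffix).
SUFFIXES = {"": {"verb", "noun"}, "chi": {"verb"}, "be": {"verb"}, "er": {"noun"}}
# Hypothetical OCR confusion pairs: recognized character -> characters it may stand for.
CONFUSIONS = {"h": ["b"], "b": ["h"], "c": ["e"], "e": ["c"]}


def parse(word):
    """Return every (root, suffix) split found in the lexicons that agrees grammatically."""
    splits = []
    for i in range(1, len(word) + 1):
        root, suffix = word[:i], word[i:]
        if root in ROOTS and ROOTS[root] in SUFFIXES.get(suffix, set()):
            splits.append((root, suffix))
    return splits


def error_spans(word):
    """Guess which character positions may hold the error."""
    spans = []
    for i in range(1, len(word)):
        if word[:i] in ROOTS:
            spans.append(range(i, len(word)))  # root is valid, so suspect the suffix
        if word[i:] in SUFFIXES:
            spans.append(range(0, i))          # suffix is valid, so suspect the root
    return spans or [range(len(word))]         # nothing matched: suspect the whole word


def suggest(word):
    """Return the word itself if it parses, else the alternatives that parse."""
    if parse(word):
        return [word]
    candidates = set()
    for span in error_spans(word):
        for pos in span:
            for alt in CONFUSIONS.get(word[pos], []):
                candidate = word[:pos] + alt + word[pos + 1:]
                if parse(candidate):  # keep only grammatically valid alternatives
                    candidates.add(candidate)
    return sorted(candidates)
```

With these toy lexicons, suggest("karhe") finds the valid root "kar", restricts substitutions to the suffix part, and returns ["karbe"], whereas suggest("gharbe") returns an empty list because the noun root "ghar" does not agree with the verb suffix "be" and no single substitution repairs it.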