Retrieving Information from Document Images:
problems and solutions


Fu Chang

TR-IIS-00-006


Abstract


     An information retrieval system that captures both the visual and the textual content of paper documents
can derive maximal benefit from document analysis and recognition (DAR) techniques while demanding little
human assistance. This article discusses the technical problems, the solution methods, and their integration
into a well-performing system. The discussion focuses on particularly hard applications, such as Chinese and
Japanese documents.

     Besides having a large group of potential readers, these types of documents raise many technical issues
that deserve experts' attention. The complicated Chinese or Kanji characters, for example, pose serious
problems for image binarization. The coexistence of vertical and horizontal text lines on the same page makes
document segmentation difficult. The large number of characters also challenges the way textual content is
recognized and retrieved.

     The problems discussed in this article center on these issues. Solution methods are also highlighted,
with emphasis on some new ideas, including window-based binarization using scale measures, document layout
analysis treated as a multiple constraint problem, and a full-text searching technique capable of evading
machine recognition errors.


Keywords:

information retrieval, document images, binarization, global threshold binarization, window-based binarization,
scale, document layout analysis, multiple constraint problem, split and merge, OCR-error-tolerant full-text search