Retrieving Information from Document Images:
Fu Chang
Abstract
An information retrieval system that captures both the visual and the textual content of paper documents
can derive maximal benefit from document analysis and recognition (DAR) techniques while demanding little
human assistance. This article discusses the technical problems, the solution methods, and their integration
into a well-performing system. The discussion focuses on particularly hard applications, such as Chinese and
Japanese documents.
In addition to having a large potential readership, these documents raise many technical issues
that deserve experts' attention. The complicated Chinese or Kanji characters, for example, pose a serious
problem for image binarization. The coexistence of vertical and horizontal text lines on the same page makes
document segmentation difficult. The large number of characters also challenges the way textual content is
recognized and retrieved.
The problems discussed in this article center on these issues. Solution methods are also highlighted,
with emphasis placed on some new ideas, including window-based binarization using scale measures,
document layout analysis treated as a multiple-constraint problem, and a full-text search technique that
tolerates machine recognition errors.
Keywords:
information retrieval, document images, binarization, global threshold binarization, window-based binarization,
scale, document layout analysis, multiple constraint problem, split and merge,
OCR-error-tolerant full search