Automatic cataloguing and searching for retrospective data by use of OCR text

Yuen Hsien Tseng*

*Corresponding author for this work

Research output: Contribution to journal, journal article, peer-reviewed

8 Citations (Scopus)

Abstract

This article describes our efforts to support information retrieval from OCR-degraded text. In particular, we report our approach to an automatic cataloging and searching contest for books in multiple languages. In this contest, 500 books in English, German, French, and Italian, published between the 1770s and the 1970s, were scanned into images and OCRed into digital text. The goal was to use only automatic means to extract information for sophisticated searching. We adopted the vector space retrieval model, an n-gram indexing method, and a special weighting scheme to tackle this problem. Although the performance of this approach is slightly inferior to that of the best approach, which is based mainly on regular expression matching, our approach has the advantage of being less language dependent and less layout sensitive, and is thus readily applicable to other languages and document collections. Problems of OCR text retrieval for some Asian languages are also discussed in this article, and solutions are suggested.
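
The combination of n-gram indexing and vector space retrieval mentioned in the abstract can be illustrated with a minimal sketch. It is not the article's implementation: the abstract does not specify the n-gram length or the "special weighting scheme", so character trigrams and a plain smoothed TF-IDF weight with cosine similarity are used here as stand-in assumptions, and the function names and sample titles are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of lower-cased text (n=3 is an assumption)."""
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def build_index(docs, n=3):
    """Build one sparse TF-IDF vector per document over character n-grams."""
    term_counts = [Counter(char_ngrams(d, n)) for d in docs]
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    num_docs = len(docs)
    # Smoothed IDF (an assumed stand-in for the article's weighting scheme).
    idf = {g: math.log((1 + num_docs) / (1 + df)) + 1.0 for g, df in doc_freq.items()}
    vectors = [{g: tf * idf[g] for g, tf in counts.items()} for counts in term_counts]
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, vectors, idf, n=3):
    """Rank documents by cosine similarity to the query; returns (score, doc_index) pairs."""
    q_counts = Counter(char_ngrams(query, n))
    q_vec = {g: tf * idf[g] for g, tf in q_counts.items() if g in idf}
    scores = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)


# Hypothetical OCR output: "Histcry" simulates a misrecognized character.
docs = [
    "The Histcry of England, Vol. II",
    "Grammaire francaise elementaire",
    "Storia della letteratura italiana",
]
vectors, idf = build_index(docs)
ranking = search("history of england", vectors, idf)
print(ranking[0][1])  # index of the best-matching document
```

Because a single misrecognized character only corrupts the few n-grams that overlap it, the remaining n-grams of "Histcry of England" still match the query, which is one reason n-gram indexing tends to tolerate OCR noise better than whole-word matching. No language-specific tokenizer or page-layout rule is involved, in line with the language and layout independence claimed in the abstract.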

Original language: English
Pages (from-to): 378-390
Number of pages: 13
Journal: Journal of the American Society for Information Science and Technology
Volume: 52
Issue number: 5
Publication status: Published - March 2001
Externally published

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Human-Computer Interaction
  • Computer Networks and Communications
  • Artificial Intelligence
