Automatic cataloguing and searching for retrospective data by use of OCR text

Research output: Contribution to journal › Article

7 Citations (Scopus)

Abstract

This article describes our efforts to support information retrieval from OCR-degraded text. In particular, we report our approach to an automatic cataloging and searching contest for books in multiple languages. In this contest, 500 books in English, German, French, and Italian, published between the 1770s and the 1970s, were scanned into images and converted to digital text by OCR. The goal was to extract information for sophisticated searching using only automatic methods. We adopted the vector space retrieval model, an n-gram indexing method, and a special weighting scheme to tackle this problem. Although the performance of this approach is slightly inferior to that of the best approach, which is based mainly on regular-expression matching, our approach has the advantage of being less language-dependent and less layout-sensitive, and is therefore readily applicable to other languages and document collections. Problems of OCR text retrieval for some Asian languages are also discussed, and solutions are suggested.
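As a rough illustration of why n-gram indexing in a vector space model tolerates OCR errors, the following minimal Python sketch indexes overlapping character trigrams and ranks documents by cosine similarity. It is not the article's implementation: the trigram length, the smoothed TF-IDF weighting (a generic stand-in for the "special weighting scheme", which the abstract does not specify), the function names, and the sample titles are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of lowercased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def build_index(docs, n=3):
    """Build one TF-IDF weighted n-gram vector per document (assumed weighting)."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(g for c in counts for g in c)
    num_docs = len(docs)
    # Smoothed inverse document frequency; always positive.
    idf = {g: math.log((1 + num_docs) / (1 + f)) + 1.0 for g, f in df.items()}
    vectors = [{g: tf * idf[g] for g, tf in c.items()} for c in counts]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(g, 0.0) for g, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query, vectors, idf, n=3):
    """Return (score, document index) pairs sorted by descending similarity."""
    q_counts = Counter(char_ngrams(query, n))
    # N-grams never seen in the collection are ignored.
    q_vec = {g: tf * idf[g] for g, tf in q_counts.items() if g in idf}
    scores = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, reverse=True)


# Hypothetical sample titles; the second simulates OCR character errors.
docs = [
    "A Treatise on Natural Philosophy",
    "Trcatise on Natnral Phi1osophy",
    "Histoire de la Revolution francaise",
]
vectors, idf = build_index(docs)
ranked = search("natural philosophy", vectors, idf)
for score, idx in ranked:
    print(f"{score:.3f}  {docs[idx]}")
```

Because a single misrecognized character corrupts only the few trigrams that overlap it, the OCR-damaged title still shares most of its trigrams with the query and receives a nonzero score, whereas a whole-word index would miss it entirely. Nothing in the sketch depends on a word list or page layout, which is the kind of language and layout independence the abstract claims for the n-gram approach.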

Original language: English
Pages (from-to): 378-390
Number of pages: 13
Journal: Journal of the American Society for Information Science and Technology
Volume: 52
Issue number: 5
DOI: 10.1002/1532-2890(2001)9999:9999<::AID-ASI1080>3.0.CO;2-A
Publication status: Published - 2001 Mar 1

Fingerprint

Optical character recognition
Vector spaces
language
Information retrieval
weighting
indexing
layout
Retrospective data
performance
Contests

ASJC Scopus subject areas

  • Software
  • Information Systems
  • Human-Computer Interaction
  • Computer Networks and Communications
  • Artificial Intelligence

Cite this

@article{da24000ecf0647b887ab728ebd184ef1,
title = "Automatic cataloguing and searching for retrospective data by use of OCR text",
abstract = "This article describes our efforts in supporting information retrieval from OCR degraded text. In particular, we report our approach to an automatic cataloging and searching contest for books in multiple languages. In this contest, 500 books in English, German, French, and Italian published during the 1770s to 1970s are scanned into images and OCRed to digital text. The goal is to use only automatic ways to extract information for sophisticated searching. We adopted the vector space retrieval model, an n-gram indexing method, and a special weighting scheme to tackle this problem. Although the performance by this approach is slightly inferior to the best approach, which is mainly based on regular expression match, one advantage of our approach is that it is less language dependent and less layout sensitive, thus is readily applicable to other languages and document collections. Problems of OCR text retrieval for some Asian languages are also discussed in this article, and solutions are suggested.",
author = "Yuen-Hsien Tseng",
year = "2001",
month = "3",
day = "1",
doi = "10.1002/1532-2890(2001)9999:9999<::AID-ASI1080>3.0.CO;2-A",
language = "English",
volume = "52",
pages = "378--390",
journal = "Journal of the Association for Information Science and Technology",
issn = "2330-1635",
publisher = "John Wiley and Sons Ltd",
number = "5",

}

TY - JOUR

T1 - Automatic cataloguing and searching for retrospective data by use of OCR text

AU - Tseng, Yuen-Hsien

PY - 2001/3/1

Y1 - 2001/3/1

AB - This article describes our efforts in supporting information retrieval from OCR degraded text. In particular, we report our approach to an automatic cataloging and searching contest for books in multiple languages. In this contest, 500 books in English, German, French, and Italian published during the 1770s to 1970s are scanned into images and OCRed to digital text. The goal is to use only automatic ways to extract information for sophisticated searching. We adopted the vector space retrieval model, an n-gram indexing method, and a special weighting scheme to tackle this problem. Although the performance by this approach is slightly inferior to the best approach, which is mainly based on regular expression match, one advantage of our approach is that it is less language dependent and less layout sensitive, thus is readily applicable to other languages and document collections. Problems of OCR text retrieval for some Asian languages are also discussed in this article, and solutions are suggested.

UR - http://www.scopus.com/inward/record.url?scp=0035281882&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0035281882&partnerID=8YFLogxK

U2 - 10.1002/1532-2890(2001)9999:9999<::AID-ASI1080>3.0.CO;2-A

DO - 10.1002/1532-2890(2001)9999:9999<::AID-ASI1080>3.0.CO;2-A

M3 - Article

VL - 52

SP - 378

EP - 390

JO - Journal of the Association for Information Science and Technology

JF - Journal of the Association for Information Science and Technology

SN - 2330-1635

IS - 5

ER -