TY - JOUR
T1 - An Approach to Retrieval of OCR Degraded Text
AU - 曾, 元顯(Yuen-Hsien Tseng)
PY - 1998
Y1 - 1998
N2 - The major problem in retrieving OCR text is the unpredictable distortion of characters caused by recognition errors. Because users have no idea how the characters have been distorted, their query terms can rarely match the terms stored in the OCR text exactly. Retrieval effectiveness is therefore significantly reduced, especially for low-quality input. To reduce the losses in retrieving such noisy OCR text, a fault-tolerant retrieval strategy based on automatic keyword extraction and fuzzy matching is proposed. In this strategy, terms, whether correct or not, and their term frequencies are extracted from the noisy text and presented for browsing and selection in response to users' initial queries. Knowing the actual terms stored in the noisy text and their estimated frequency distributions, users can then choose appropriate terms for more effective searching. A text retrieval system based on this strategy has been built, and examples demonstrating its effectiveness are presented. Finally, some OCR issues bearing on further improvement of retrieval effectiveness are discussed.
AB - The major problem in retrieving OCR text is the unpredictable distortion of characters caused by recognition errors. Because users have no idea how the characters have been distorted, their query terms can rarely match the terms stored in the OCR text exactly. Retrieval effectiveness is therefore significantly reduced, especially for low-quality input. To reduce the losses in retrieving such noisy OCR text, a fault-tolerant retrieval strategy based on automatic keyword extraction and fuzzy matching is proposed. In this strategy, terms, whether correct or not, and their term frequencies are extracted from the noisy text and presented for browsing and selection in response to users' initial queries. Knowing the actual terms stored in the noisy text and their estimated frequency distributions, users can then choose appropriate terms for more effective searching. A text retrieval system based on this strategy has been built, and examples demonstrating its effectiveness are presented. Finally, some OCR issues bearing on further improvement of retrieval effectiveness are discussed.
KW - optical character recognition
KW - information retrieval
KW - fault-tolerant retrieval
KW - keyword extraction
KW - fuzzy matching
U2 - 10.6182/jls.1998.13.153
DO - 10.6182/jls.1998.13.153
M3 - Article
SN - 1018-3817
SP - 153
EP - 168
JO - 圖書館學刊
JF - 圖書館學刊
IS - 13
ER -