TY - JOUR
T1 - Enhancing performance of protein and gene name recognizers with filtering and integration strategies
AU - Hou, Wen Juan
AU - Chen, Hsin Hsi
N1 - Funding Information:
Part of this research was supported by the National Science Council under contracts NSC-91-2213-E-002-088, NSC-92-2213-E-002-022, and NSC-93-2752-E-001-001-PAE. We thank Dr. Lorrie Tanabe and Dr. W. John Wilbur of NCBI, NLM, NIH, and Dr. George Demetriou of the Department of Computer Science, University of Sheffield, who kindly provided the resources used in this work.
PY - 2004/12
Y1 - 2004/12
N2 - Named entity (NE) recognition is a fundamental task in biological relationship mining. This paper uses protein/gene collocates extracted from biological corpora as restrictions to improve the precision of protein/gene name recognition. In addition, we integrate the results of multiple NE recognizers to improve recall. Yapex and KeX are taken as examples of protein name recognizers, and ABGene and Idgene as examples of gene name recognizers. When collocates are incorporated, the precision of Yapex increases from 70.90 to 85.84% at a small cost in recall (a decrease of only 2.44%). When the filtering and integration strategies are employed together, the Yapex-based integration with KeX performs well: the F-score increases by 7.83% compared to the pure Yapex method. The results of gene recognition show the same tendency: the ABGene-based integration with Idgene shows a 10.18% F-score increase compared to the pure ABGene method. These methodologies can be easily extended to other name finders in biological documents.
AB - Named entity (NE) recognition is a fundamental task in biological relationship mining. This paper uses protein/gene collocates extracted from biological corpora as restrictions to improve the precision of protein/gene name recognition. In addition, we integrate the results of multiple NE recognizers to improve recall. Yapex and KeX are taken as examples of protein name recognizers, and ABGene and Idgene as examples of gene name recognizers. When collocates are incorporated, the precision of Yapex increases from 70.90 to 85.84% at a small cost in recall (a decrease of only 2.44%). When the filtering and integration strategies are employed together, the Yapex-based integration with KeX performs well: the F-score increases by 7.83% compared to the pure Yapex method. The results of gene recognition show the same tendency: the ABGene-based integration with Idgene shows a 10.18% F-score increase compared to the pure ABGene method. These methodologies can be easily extended to other name finders in biological documents.
KW - Biological keywords
KW - Collocation model
KW - Gene name recognition
KW - Protein name recognition
KW - t test
UR - http://www.scopus.com/inward/record.url?scp=8444241489&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=8444241489&partnerID=8YFLogxK
U2 - 10.1016/j.jbi.2004.08.006
DO - 10.1016/j.jbi.2004.08.006
M3 - Article
C2 - 15542018
AN - SCOPUS:8444241489
SN - 1532-0464
VL - 37
SP - 448
EP - 460
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
IS - 6
ER -