Abstract
This study evaluates the performance of five entity extraction methods for identifying entities in scientific publications: two vocabulary-based methods (one keyword-based and one Wikipedia-based) and three model-based methods (conditional random fields (CRF), CRF with a keyword-based dictionary, and CRF with a Wikipedia-based dictionary). These methods are applied to an annotated test set of publications in computer science. Precision, recall, accuracy, area under the ROC curve, and area under the precision-recall curve are employed as the evaluative indicators. Results show that the model-based methods outperform the vocabulary-based ones; among the model-based methods, CRF with a keyword-based dictionary performs best. Between the two vocabulary-based methods, the keyword-based one has higher recall and the Wikipedia-based one has higher precision. The findings of this study help inform the understanding of informetric research at a more granular level.
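As an illustration of three of the evaluative indicators named in the abstract, the following minimal sketch scores a toy keyword-based extractor against gold annotations. It assumes token-level binary labels, which the abstract does not specify; the function names, sentence, vocabulary, and gold labels are all hypothetical and are not taken from the study.

```python
def keyword_extract(tokens, vocabulary):
    """Label a token 1 if its lowercase form is in the vocabulary, else 0."""
    return [1 if tok.lower() in vocabulary else 0 for tok in tokens]


def precision_recall_accuracy(gold, predicted):
    """Compute precision, recall, and accuracy from binary token labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, accuracy


# Hypothetical toy data: one tokenized sentence with gold entity labels.
tokens = ["We", "train", "a", "CRF", "on", "Wikipedia", "anchors"]
gold = [0, 0, 0, 1, 0, 1, 0]

# Hypothetical vocabulary: it misses "wikipedia" and wrongly includes "anchors".
vocabulary = {"crf", "anchors"}

predicted = keyword_extract(tokens, vocabulary)
precision, recall, accuracy = precision_recall_accuracy(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

In this toy case the vocabulary produces one correct match, one false match, and one missed entity, so precision and recall are both 0.5. The two area-under-curve indicators need ranked confidence scores rather than hard labels, so they are left out of this sketch.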
Original language | English |
---|---|
Pages (from-to) | 455-465 |
Number of pages | 11 |
Journal | Journal of Informetrics |
Volume | 9 |
Issue number | 3 |
Publication status | Published - 2015 Jul 1 |
Bibliographical note
Publisher Copyright: © 2015 Elsevier Ltd.
All Science Journal Classification (ASJC) codes
- Computer Science Applications
- Library and Information Sciences