Identifying entities from scientific publications: A comparison of vocabulary- and model-based methods

Erjia Yan, Yongjun Zhu

Research output: Contribution to journal › Article › peer-review

16 Citations (Scopus)


This study evaluates five entity extraction methods for identifying entities in scientific publications: two vocabulary-based methods (a keyword-based method and a Wikipedia-based method) and three model-based methods (conditional random fields (CRF), CRF with a keyword-based dictionary, and CRF with a Wikipedia-based dictionary). The methods are applied to an annotated test set of publications in computer science. Precision, recall, accuracy, area under the ROC curve, and area under the precision-recall curve serve as the evaluative indicators. Results show that the model-based methods outperform the vocabulary-based ones; among the model-based methods, CRF with a keyword-based dictionary performs best. Between the two vocabulary-based methods, the keyword-based one has higher recall and the Wikipedia-based one has higher precision. The findings help inform the understanding of informetric research at a more granular level.
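As a rough illustration of what a vocabulary-based method and its precision/recall evaluation might look like, the sketch below matches token n-grams against a small term list and scores the result against hand-labelled spans. It is not the authors' implementation: the function names, the greedy longest-match strategy, the toy sentence, the vocabulary, and the gold spans are all assumptions made for this example.

```python
def extract_entities(tokens, vocabulary, max_len=3):
    """Greedy longest-match lookup of token n-grams against a vocabulary.

    Returns a set of (start, end) token spans, end exclusive.
    Hypothetical helper, not taken from the paper.
    """
    spans = set()
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in vocabulary:
                spans.add((i, i + n))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return spans


def precision_recall(predicted, gold):
    """Exact-span precision and recall over two sets of spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data (assumed for illustration only).
tokens = "We train conditional random fields for entity extraction on text".split()
vocabulary = {"conditional random fields", "entity extraction", "text"}
gold = {(2, 5), (6, 8)}  # annotated entity spans

predicted = extract_entities(tokens, vocabulary)
precision, recall = precision_recall(predicted, gold)
print(sorted(predicted), precision, recall)
```

In this toy case the overly generic vocabulary entry "text" produces a false positive, which lowers precision while recall stays high. This mirrors the kind of precision/recall trade-off between vocabularies that the abstract reports, though the numbers here carry no empirical meaning.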

Original language: English
Pages (from-to): 455-465
Number of pages: 11
Journal: Journal of Informetrics
Issue number: 3
Publication status: Published - 2015 Jul 1

Bibliographical note

Publisher Copyright:
© 2015 Elsevier Ltd.

All Science Journal Classification (ASJC) codes

  • Computer Science Applications
  • Library and Information Sciences

