An identification framework for print-scan books in a large database

Sang Hoon Lee, Jongyoo Kim, Sanghoon Lee

Research output: Contribution to journal › Article › peer-review

6 Citations (Scopus)


In this paper, we propose an identification framework to detect copyright infringement in the form of illegally distributed print-scan books in a large database. The framework contains the following main stages: image pre-processing, feature vector extraction, clustering and indexing, and hierarchical search. The image pre-processing stage provides methods for alleviating the distortions induced by a scanner or digital camera. From the pre-processed image, we generate feature vectors that are robust against distortion. To enhance the clustering performance in a large database, we use a clustering method based on the parallel-distributed computing of Hadoop MapReduce. In addition, to store the clustered feature vectors efficiently and minimize the search time, we investigate an inverted index for feature vectors. Finally, we implement a two-step hierarchical search to achieve fast and accurate on-line identification. In a simulation, the proposed identification framework is shown to be accurate and robust in the presence of print-scan distortions. The processing time analysis in a parallel computing environment demonstrates the extensibility of the proposed framework to massive data. In the matching performance analysis, we empirically and theoretically find that, in terms of query time, the optimal number of clusters scales with O(N) for N print-scan books.
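The clustering, inverted index, and two-step search described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the feature vectors, the distance measure, or the index layout, so the sketch assumes short numeric vectors, squared Euclidean distance, and cluster centroids that are already given (standing in for the output of the MapReduce clustering stage). All names (`build_inverted_index`, `hierarchical_search`, `n_probe`, the book ids) are hypothetical.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance (an assumed metric, not from the paper)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_inverted_index(features, centroids):
    """Map each cluster id to the list of book ids assigned to that cluster."""
    index = defaultdict(list)
    for book_id, vec in features.items():
        nearest = min(
            range(len(centroids)),
            key=lambda c: squared_distance(vec, centroids[c]),
        )
        index[nearest].append(book_id)
    return index


def hierarchical_search(query, features, centroids, index, n_probe=1):
    """Two-step search: pick the closest cluster(s), then scan only their members."""
    # Step 1 (coarse): rank clusters by distance from the query to each centroid.
    ranked = sorted(
        range(len(centroids)),
        key=lambda c: squared_distance(query, centroids[c]),
    )
    candidates = [b for c in ranked[:n_probe] for b in index.get(c, [])]
    if not candidates:
        return None
    # Step 2 (fine): exact comparison against the candidates only.
    return min(candidates, key=lambda b: squared_distance(query, features[b]))


# Toy data: two assumed centroids and four hypothetical book feature vectors.
centroids = [(0.0, 0.0), (10.0, 10.0)]
features = {
    "book_a": (0.5, 0.2),
    "book_b": (1.0, -0.5),
    "book_c": (9.5, 10.2),
    "book_d": (11.0, 9.0),
}
index = build_inverted_index(features, centroids)
match = hierarchical_search((9.7, 10.0), features, centroids, index)
print(match)
```

In this sketch the query is compared against every centroid and then only against the members of the selected cluster, which is the general idea behind pairing an inverted index with a coarse-to-fine search; the trade-off between the number of clusters and the cluster sizes is what the matching performance analysis in the paper addresses.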

Original language: English
Pages (from-to): 33-54
Number of pages: 22
Journal: Information Sciences
Publication status: Published - 2017 Aug 1

Bibliographical note

Publisher Copyright:
© 2017 Elsevier Inc.

All Science Journal Classification (ASJC) codes

  • Theoretical Computer Science
  • Software
  • Control and Systems Engineering
  • Computer Science Applications
  • Information Systems and Management
  • Artificial Intelligence

