Cylindrical convolutional networks for joint object detection and viewpoint estimation

Sunghun Joung, Seungryong Kim, Hanjae Kim, Minsu Kim, Ig Jae Kim, Junghyun Cho, Kwanghoon Sohn

Research output: Contribution to journal › Conference article › peer-review

12 Citations (Scopus)


Existing techniques to encode spatial invariance within deep convolutional neural networks model only 2D transformation fields. This does not account for the fact that objects in a 2D image are projections of 3D ones, and thus these techniques have limited ability to handle severe object viewpoint changes. To overcome this limitation, we introduce a learnable module, cylindrical convolutional networks (CCNs), that exploits a cylindrical representation of a convolutional kernel defined in 3D space. CCNs extract a view-specific feature through a view-specific convolutional kernel to predict object category scores at each viewpoint. With the view-specific features, we simultaneously determine the object category and viewpoint using the proposed sinusoidal soft-argmax module. Our experiments demonstrate the effectiveness of cylindrical convolutional networks on joint object detection and viewpoint estimation.
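The abstract's "sinusoidal soft-argmax" can be understood as a way to turn per-viewpoint scores into one continuous angle. A plain soft-argmax fails for angles because they wrap around (0° and 360° are the same viewpoint), so a sinusoidal variant instead takes the softmax-weighted mean of the sine and cosine of each bin's angle and recovers the angle with `atan2`. The sketch below is illustrative only, written from that general idea; the function name, bin layout, and NumPy implementation are assumptions, not the authors' code.

```python
# Illustrative sketch (assumed implementation, not the authors' code):
# a sinusoidal soft-argmax converting per-viewpoint logits into a
# continuous angle, avoiding the 0/360 degree wraparound problem.
import numpy as np

def sinusoidal_soft_argmax(scores):
    """scores: (V,) array of logits, one per discretized viewpoint bin."""
    v = len(scores)
    angles = 2 * np.pi * np.arange(v) / v      # assumed evenly spaced bin centers
    w = np.exp(scores - scores.max())
    w /= w.sum()                               # softmax weights over viewpoints
    s = np.dot(w, np.sin(angles))              # expected sine of the viewpoint
    c = np.dot(w, np.cos(angles))              # expected cosine of the viewpoint
    return np.arctan2(s, c) % (2 * np.pi)      # continuous viewpoint in [0, 2*pi)

# A score peaked at bin 3 of 12 (bin center 90 degrees) recovers ~90 degrees.
logits = np.full(12, -5.0)
logits[3] = 5.0
print(np.degrees(sinusoidal_soft_argmax(logits)))  # → ~90.0
```

Because the output is differentiable in the scores, such a module can be trained end-to-end alongside the category branch, which matches the joint detection-and-viewpoint formulation the abstract describes.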

Original language: English
Article number: 9156810
Pages (from-to): 14151-14160
Number of pages: 10
Journal: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Publication status: Published - 2020
Event: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020 - Virtual, Online, United States
Duration: 2020 Jun 14 - 2020 Jun 19

Bibliographical note

Funding Information:
This research was supported by the R&D program for Advanced Integrated-intelligence for Identification (AIID) through the National Research Foundation of Korea (NRF), funded by the Ministry of Science and ICT (NRF-2018M3E3A1057289).

Publisher Copyright:
© 2020 IEEE.

All Science Journal Classification (ASJC) codes

  • Software
  • Computer Vision and Pattern Recognition
