Improving cross-platform binary analysis using representation learning via graph alignment

Geunwoo Kim, Sanghyun Hong, Michael Franz, Dokyung Song

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

7 Citations (Scopus)

Abstract

Cross-platform binary analysis requires a common representation of binaries across platforms, on which a specific analysis can be performed. Recent work proposed to learn low-dimensional, numeric vector representations (i.e., embeddings) of disassembled binary code, and to perform binary analysis in the embedding space. However, existing techniques fall short in that they are either (i) specific to a single platform, producing embeddings that are not aligned across platforms, or (ii) not designed to capture the rich contextual information available in a disassembled binary. We present a novel deep learning-based method, XBA, which addresses both problems. To this end, we first abstract binaries as typed graphs, dubbed binary disassembly graphs (BDGs), which encode control-flow and other rich contextual information of different entities found in a disassembled binary, including basic blocks, external functions called, and string literals referenced. We then formulate binary code representation learning as a graph alignment problem, i.e., finding the node correspondences between BDGs extracted from two binaries compiled for different platforms. XBA uses graph convolutional networks to learn the semantics of each node by (i) using its rich contextual information encoded in the BDG, and (ii) aligning its embeddings across platforms. Our formulation allows XBA to learn semantic alignments between two BDGs in a semi-supervised manner, requiring only a limited number of node pairs to be aligned across platforms for training. Our evaluation shows that XBA can learn semantically rich embeddings of binaries aligned across platforms without a priori platform-specific knowledge. By training our model with only 50% of the oracle alignments, XBA was able to predict, on average, 75% of the rest. Our case studies further show that the learned embeddings encode knowledge useful for cross-platform binary analysis.
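
The pipeline described in the abstract (typed graph, neighborhood-based node embeddings, alignment by nearest neighbor, evaluation on held-out pairs) can be illustrated with a toy sketch. This is not XBA's implementation: the class `TypedGraph`, the helpers `toy_graph`, `propagate`, and `hits_at_k`, the type labels, and the one-hot features are all hypothetical names chosen for illustration. The sketch also assumes a single unweighted averaging step in place of a trained graph convolutional network, and it omits training and the alignment loss entirely.

```python
from dataclasses import dataclass, field

# Hypothetical one-hot features per node type (an assumption, not XBA's features).
TYPE_FEATURES = {
    "basic_block": [1.0, 0.0, 0.0],
    "string_literal": [0.0, 1.0, 0.0],
    "external_function": [0.0, 0.0, 1.0],
}


@dataclass
class TypedGraph:
    """Toy stand-in for a binary disassembly graph: typed nodes, undirected edges."""
    node_types: dict = field(default_factory=dict)  # node id -> type label
    edges: dict = field(default_factory=dict)       # node id -> set of neighbor ids

    def add_node(self, node, node_type):
        self.node_types[node] = node_type
        self.edges.setdefault(node, set())

    def add_edge(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)


def toy_graph(prefix):
    """Build a four-node graph: two blocks, one string literal, one external call."""
    g = TypedGraph()
    g.add_node(f"{prefix}_b0", "basic_block")
    g.add_node(f"{prefix}_b1", "basic_block")
    g.add_node(f"{prefix}_str", "string_literal")
    g.add_node(f"{prefix}_ext", "external_function")
    g.add_edge(f"{prefix}_b0", f"{prefix}_b1")   # control flow between blocks
    g.add_edge(f"{prefix}_b0", f"{prefix}_ext")  # block b0 calls an external function
    g.add_edge(f"{prefix}_b1", f"{prefix}_str")  # block b1 references a string
    features = {n: list(TYPE_FEATURES[t]) for n, t in g.node_types.items()}
    return g, features


def propagate(graph, features):
    """One unweighted graph-convolution-style step: average each node's
    feature vector with those of its neighbors (no learned weights)."""
    out = {}
    for node, vec in features.items():
        group = [vec] + [features[n] for n in sorted(graph.edges[node])]
        out[node] = [sum(col) / len(group) for col in zip(*group)]
    return out


def l1(u, v):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def hits_at_k(src_emb, dst_emb, held_out_pairs, k=1):
    """Fraction of held-out (src, dst) pairs whose true match is among the
    k nearest destination nodes by L1 distance."""
    hits = 0
    for src, dst in held_out_pairs:
        ranked = sorted(dst_emb, key=lambda n: l1(src_emb[src], dst_emb[n]))
        hits += dst in ranked[:k]
    return hits / len(held_out_pairs)


# Two structurally identical toy graphs stand in for two platforms.
graph_x, feats_x = toy_graph("x")
graph_y, feats_y = toy_graph("y")
emb_x = propagate(graph_x, feats_x)
emb_y = propagate(graph_y, feats_y)

# Pretend these pairs were withheld from training and must be predicted.
held_out = [("x_b1", "y_b1"), ("x_str", "y_str")]
print(hits_at_k(emb_x, emb_y, held_out, k=1))
```

Before the averaging step, the two basic blocks in each graph carry identical type features and cannot be told apart; after it, each block's vector reflects whether it neighbors a string literal or an external function. That is the role contextual information plays in this toy. Because the two toy graphs are structurally identical, every held-out pair is recovered here, which says nothing about accuracy on real binaries.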

Original language: English
Title of host publication: ISSTA 2022 - Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis
Editors: Sukyoung Ryu, Yannis Smaragdakis
Publisher: Association for Computing Machinery, Inc
Pages: 151-163
Number of pages: 13
ISBN (Electronic): 9781450393799
Publication status: Published - 2022 Jul 18
Event: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2022 - Virtual, Online, Korea, Republic of
Duration: 2022 Jul 18 to 2022 Jul 22

Publication series

Name: ISSTA 2022 - Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis

Conference

Conference: 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2022
Country/Territory: Korea, Republic of
City: Virtual, Online
Period: 22/7/18 to 22/7/22

Bibliographical note

Publisher Copyright:
© 2022 ACM.

All Science Journal Classification (ASJC) codes

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software
