Doctor of Philosophy, The Ohio State University, 2024, Computer Science and Engineering
Binary code comprehension, particularly for stripped binaries,
is a key task in binary analysis and software security, with applications
ranging from malware analysis to vulnerability discovery and binary reverse engineering. Understanding stripped binary code is challenging due to the absence of symbols such as variable names, data types, and function names. This complexity is further
exacerbated by the variety of application binary interfaces, instruction sets, computer
architectures, compiler optimizations, and obfuscations.
This dissertation systematically explores the problem of binary code comprehension
using binary analysis, deep learning, and large language models. We first present
an exploratory study, BinSum, of how well machine learning models, particularly
state-of-the-art generative large language models, understand binary code, supported by
a comprehensive benchmark and dataset encompassing over 557K binary functions.
Subsequently, motivated by BinSum's finding of the semantic significance of function
names in binary code, we introduce SymLM, a novel binary function name prediction
framework, employing a unique neural architecture that captures comprehensive
function semantics by modeling both the execution behavior of functions and their
calling contexts. The third contribution of this dissertation addresses the evaluation
of code summary quality: we introduce SimLLM, a novel LLM-based metric for assessing
the semantic similarity of code summaries. This method significantly surpasses traditional metrics and exhibits a high correlation with human
judgment, addressing the shortcomings of those metrics in understanding domain-specific terminology
prevalent in code summaries. Finally, we explore the generalizability of function
name prediction by presenting BinSymn, a novel model architecture, trained on
domain-adapted generative LLMs.
Together, BinSum, SymLM, SimLLM, and BinSymn provide a comprehensiv...
Committee: Zhiqiang Lin (Advisor); Atanas Rountev (Committee Member); Srinivasan Parthasarathy (Committee Member); Carter Yagemann (Committee Member)
Subjects: Computer Science