Learning a Portfolio-Based Checker for Provenance-Similarity of Binaries
This is an ongoing Independent Research & Development (IRAD)
project at the Software Engineering Institute, Carnegie Mellon
University. The goal of this project is to explore the use of
supervised learning (a.k.a. classification) in detecting
provenance-similarity between binaries, or executables. Broadly,
two binaries are provenance-similar if they have been compiled
from similar source code with similar compilers. Detecting
provenance-similarity is a challenging area of research, with
important applications including code clone detection,
understanding the impact of software updates, judging the
provenance of untrusted software, and fighting malware.
The project is being led by Sagar
Chaki, Arie Gurfinkel, and Cory Cohen. Our current focus is on
detecting similarity between functions. Intuitively, a
function is a fragment of a binary derived by compiling a
source-level procedure or method. We believe that functions are an
ideal basis for judging binary similarity: they are the
fundamental units of a binary's behavior. If two binaries have
many functions in common, then they are very likely to be
similar. The greater the share of common functions, the higher the
degree of similarity. We have recently blogged
about our work.
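
As a rough illustration of the "share of common functions" idea, the sketch below scores two binaries by the fraction of their functions that have a similar counterpart in the other binary. It is only a sketch under stated assumptions, not the project's implementation: the function name common_function_share, the representation of functions as plain strings, and the is_similar predicate are all hypothetical. In the project the similarity judgment would come from a learned classifier; here exact equality stands in for it.

```python
from typing import Callable, Iterable


def common_function_share(
    funcs_a: Iterable[str],
    funcs_b: Iterable[str],
    is_similar: Callable[[str, str], bool],
) -> float:
    """Fraction of functions (across both binaries) that have a
    similar counterpart in the other binary.

    Hypothetical helper: `is_similar` is a placeholder for whatever
    function-level similarity checker is available.
    """
    a, b = list(funcs_a), list(funcs_b)
    if not a or not b:
        return 0.0
    # Count functions in each binary that match something in the other.
    matched_a = sum(1 for fa in a if any(is_similar(fa, fb) for fb in b))
    matched_b = sum(1 for fb in b if any(is_similar(fa, fb) for fa in a))
    return (matched_a + matched_b) / (len(a) + len(b))


# Toy data: each string stands for one function (e.g. a normalized hash).
binary_a = ["f1", "f2", "f3", "f4"]
binary_b = ["f2", "f3", "f4", "f5"]

# Assumption: exact equality replaces the learned similarity checker.
score = common_function_share(binary_a, binary_b, lambda x, y: x == y)
print(f"share of common functions: {score:.2f}")
```

A higher score means a larger share of functions are common to both binaries, which the text above treats as evidence of a higher degree of similarity.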
Benchmark
Please contact Sagar Chaki.
Publications
Contact