Binary Function Clustering using Semantic Hashes
Wesley
Jin, Sagar Chaki, Cory Cohen, Arie Gurfinkel, Jeffrey Havrilla,
Charles Hines, Priya Narasimhan,
Proceedings of the 11th International Conference on Machine
Learning and Applications (ICMLA), page 386--391, December 12--15,
2012.
Abstract:
The ability to identify semantically-related functions, in large
collections of binary executables, is important for malware
detection. Intuitively, two pieces of code are similar if they have
the same effect on a machine's state. Current state-of-the-art tools
employ a variety of pairwise comparisons (e.g., template matching
using SMT solvers, Value-Set analysis at critical program points, API
call matching, etc.) However, these methods are unscalable for
clustering large datasets, of size N, since they require O(N^2)
comparisons. In this paper, we present an alternative approach based
upon "hashing". We propose a scheme that captures the semantics of
functions as semantic hashes. Our approach treats a function as
a set of features, each of which represent the input-output behavior
of a basic block. Using a form of locality-sensitive hashing known as
MinHashing, functions with many common features can be quickly
identified, and the complexity of clustering is reduced to
O(N). Experiments on functions extracted from the CERT malware
catalog indicate that we are able to cluster closely-related code with
a low false positive rate.
PDF