Supervised Learning for Provenance-Similarity of Binaries
Sagar Chaki, Cory Cohen, Arie Gurfinkel,
Proceedings of the 17th ACM SIGKDD Conference on Knowledge
Discovery and Data Mining (KDD), pages 15-23, August 21-24,
2011.
Abstract:
Understanding, measuring, and leveraging the similarity of binaries
(executable code) is a foundational challenge in software
engineering. We present a notion of similarity based on provenance --
two binaries are similar if they are compiled from the same (or very
similar) source code with the same (or similar) compilers. Empirical
evidence suggests that provenance-similarity accounts for a
significant portion of variation in existing binaries, particularly in
malware. We propose and evaluate the applicability of classification
to detect provenance-similarity. We evaluate a variety of classifiers,
and different types of attributes and similarity labeling schemes, on
two benchmarks derived from open-source software and malware
respectively. We present encouraging results indicating that
classification is a viable approach for automated
provenance-similarity detection, and as an aid for malware analysts in
particular.