Supervised Learning for Provenance-Similarity of Binaries

Sagar Chaki, Cory Cohen, Arie Gurfinkel, Proceedings of the 17th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pages 15-23, August 21-24, 2011.

Abstract: Understanding, measuring, and leveraging the similarity of binaries (executable code) is a foundational challenge in software engineering. We present a notion of similarity based on provenance -- two binaries are similar if they are compiled from the same (or very similar) source code with the same (or similar) compilers. Empirical evidence suggests that provenance-similarity accounts for a significant portion of variation in existing binaries, particularly in malware. We propose and evaluate the applicability of classification to detect provenance-similarity. We evaluate a variety of classifiers, and different types of attributes and similarity labeling schemes, on two benchmarks derived from open-source software and malware respectively. We present encouraging results indicating that classification is a viable approach for automated provenance-similarity detection, and as an aid for malware analysts in particular.