
Code Similarity

Dynamic Code Similarity: This is a multi-disciplinary project joint with Profs. Simha Sethumadhavan and Tony Jebara.
“Code clones” are statically similar code fragments that usually arise via copy/paste or independently writing
lookalike code; best practice removes clones (refactoring) or tracks them (e.g., to ensure bugs fixed in one clone are
also fixed in others). This part of the project instead studies dynamically similar code, under two different similarity models.

One model is functional similarity: finding code fragments that exhibit similar input/output behavior during execution.
The other is the novel notion of behavioral similarity, which we call “code relatives”. Two or more code fragments are
deemed code relatives if their executions are similar. We model this as finding similarities among the dynamic data
dependency graphs representing instruction-level execution traces, and we used machine learning techniques to devise
a (relatively) fast inexact subgraph isomorphism algorithm to cluster these execution-level similarities.

Our experiments show that both of our tools find most of the same “similar” code as the best static code clone detectors,
but also find many others that static detectors cannot, because the code looks very different even though it is
functionally and/or behaviorally similar. Conversely, dynamic detection will not necessarily find all static code
clones, because lookalike code involving polymorphism need not exhibit the same function or behavior. Our behavioral
and functional similarity detectors do not always find the same similarities, because two or more code fragments may
compute the same function using very different algorithms. Thus these kinds of techniques complement each other.
Beyond the conventional applications of static code clone detection, dynamic similarity detection also addresses
malware detection, program understanding, re-engineering legacy software to use modern APIs, and informing the
design of hardware accelerators and compiler optimizations.
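As a toy illustration of the two similarity models (not the project's actual tooling; all fragment names and the similarity measures here are hypothetical simplifications), the sketch below checks functional similarity by probing two statically dissimilar fragments with random inputs, and approximates behavioral similarity by comparing tiny data dependency graphs, using Jaccard overlap of edge sets as a crude stand-in for inexact subgraph matching:

```python
import random

# --- Functional similarity: similar input/output behavior ---
# Two hypothetical fragments that look nothing alike statically,
# yet compute the same function (the sum 1 + 2 + ... + n).
def sum_loop(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_formula(n):
    return n * (n + 1) // 2

def io_similarity(f, g, probes):
    """Fraction of probe inputs on which the two fragments agree."""
    return sum(1 for x in probes if f(x) == g(x)) / len(probes)

random.seed(0)
probes = [random.randint(0, 1000) for _ in range(100)]
print(io_similarity(sum_loop, sum_formula, probes))  # 1.0 -> candidate functional clones

# --- Behavioral similarity: compare dynamic data dependency graphs ---
# A trace's dependency graph is modeled here as a set of edges
# (producer instruction -> consumer instruction); Jaccard overlap is
# a simplified, illustrative substitute for inexact subgraph isomorphism.
def edge_jaccard(g1, g2):
    return len(g1 & g2) / len(g1 | g2)

trace_a = {("load", "add"), ("add", "store"), ("load", "mul")}
trace_b = {("load", "add"), ("add", "store"), ("load", "sub")}
print(edge_jaccard(trace_a, trace_b))  # 0.5 -> partially similar executions
```

The I/O probe illustrates why functionally similar fragments can evade static detectors (no syntactic overlap), while the graph comparison illustrates why behavioral and functional detectors can disagree: two fragments computing the same function via different algorithms produce different dependency graphs.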

Static Code Similarity: We also investigate static similarity detection to augment our similarity detection toolkit.
This work is joint with Prof. Baishakhi Ray of the University of Virginia and Prof. Jonathan Bell of George Mason University.
Unlike most other static code clone research, we look for similarities at the instruction level
rather than in the source code, so our techniques can work even on obfuscated executables where no source code is
available and thus conventional static detectors cannot be applied. This situation arises for both malware and
misappropriated intellectual property. We exploit the increasingly popular notion of “big code”, i.e., training from
open-source repositories, using features that combine instruction-level call graph analysis and topic modeling (an
NLP-based machine learning technique). We believe we can effectively deobfuscate most suspect code by finding similarities
within a corpus consisting of known code and its obfuscated counterparts. Our approach handles control flow transformations
and introduction of extraneous methods, not just renaming of methods.

Contact Gail Kaiser (kaiser@cs.columbia.edu)

Team Members

Gail Kaiser

Graduate Students
Fang-Hsiang (“Mike”) Su

Former Graduate Students
Jonathan Bell
Kenny Harvey   
Apoorv Patwardhan


Publications
Fang-Hsiang Su, Jonathan Bell, Kenneth Harvey, Simha Sethumadhavan, Gail Kaiser and Tony Jebara. Code Relatives: Detecting Similarly Behaving Software. 24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE), November 2016. Artifact accepted as platinum.

Fang-Hsiang Su, Jonathan Bell, Gail Kaiser and Simha Sethumadhavan. Identifying Functionally Similar Code in Complex Codebases. 24th IEEE International Conference on Program Comprehension (ICPC), May 2016, pp. 1-10. (ACM SIGSOFT Distinguished Paper Award)

Fang-Hsiang Su, Jonathan Bell, and Gail Kaiser. Challenges in Behavioral Code Clone Detection (Position Paper). 10th International Workshop on Software Clones (IWSC), affiliated with IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), March 2016, volume 3, pp. 21-22. (People’s Choice Award for Best Position Paper)


Download DyCLink from GitHub.

Download HitoshiIO from GitHub.

Download the Code Similarity Experiments toolkit from GitHub.