Home » Projects » Active Projects » Dynamic Code Similarity

Dynamic Code Similarity

“Code clones” are statically similar code fragments dispersed via copy/paste or independently writing lookalike code; best practice removes clones (refactoring) or tracks them (e.g., to ensure bugs fixed in one clone are also fixed in others). We instead study dynamically similar code, for two different similarity models.

One model is functional similarity, finding code fragments that exhibit similar input/output behavior. Other researchers reported success with C but failure for object-oriented languages, presenting the many challenges to executing OO code fragments in isolation. We side-stepped the isolation problem by adapting in-vivo testing, previously developed by our lab for another purpose, where unit test cases are run in the application states constructed during full system test cases.

Our other dynamic similarity model is the novel notion of behavioral similarity: two code fragments are deemed similar if matching subgraphs can be found in the dynamic data dependency graphs representing the instruction-level execution traces. We developed a fast subgraph isomorphism algorithm to recognize and cluster these execution-level similarities, using PageRank to prioritize the “most important” instructions on which to pivot subgraph comparisons.

Our empirical studies show that our tools find most of the same “similar” code as the best static code clone detectors but also find many others they can’t, because the code looks very different even though functionally and/or behaviorally similar (dynamic detection will not necessarily find all static code clones because lookalike code involving polymorphism need not exhibit the same function/behavior).

We are investigating various applications of statically and dynamically similar code detection, including code search, program understanding, detecting malware, and re-engineering legacy software to use modern APIs.

All project student slots have already been filled for Fall 2016, but there may be openings for Spring 2017.

Contact Mike Su (mikefhsu@cs.columbia.edu)

Team Members

Gail Kaiser
Simha Sethumadhavan
Tony Jebara

Graduate Students
Fang-Hsiang (“Mike”) Su
Apoorv Patwardhan

Former Graduate Students
Jonathan Bell
Kenny Harvey



Fang-Hsiang Su, Jonathan Bell, Kenneth Harvey, Simha Sethumadhavan, Gail Kaiser and Tony Jebara. Code Relatives: Detecting Similarly Behaving Software. 24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE), November 2016. Artifact accepted as platinum. (To appear, earlier technical report available.)

Fang-Hsiang Su, Jonathan Bell, Gail Kaiser and Simha Sethumadhavan. Identifying Functionally Similar Code in Complex Codebases. 24th IEEE International Conference on Program Comprehension (ICPC), May 2016, pp. 1-10. (ACM SIGSOFT Distinguished Paper Award)

Fang-Hsiang Su, Jonathan Bell, and Gail Kaiser. Challenges in Behavioral Code Clone Detection (Position Paper). 10th International Workshop on Software Clones (IWSC), affiliated with IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), March 2016, volume 3, pp. 21-22. (People’s Choice Award for Best Position Paper)


Download DyCLink from github.

Download HitoshiIO from github.

Download Code Similarity Experiments toolkit from github.