This project investigates dynamic analysis approaches to identifying behavioral similarities among code elements within the same program or across different programs, particularly code that behaves similarly during execution but does not look similar, and so would be difficult or impossible to detect with static analysis (which targets syntactically similar code clones). While code clone detection technology is fairly mature, detecting behavioral similarities remains challenging.
Dynamic analysis involves executing the candidate programs, or portions of programs such as classes and methods, which means we need to identify program and method inputs that are useful for identifying code that expert humans would deem similar, without producing numerous false positives. For instance, many methods given null or zero inputs produce null or zero outputs and/or execute similar instruction sequences, even though they behave entirely differently on other inputs. Some methods might coincidentally behave similarly for one or a small number of more complex inputs, but still behave quite differently on the majority of the input space.
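As a concrete illustration (using hypothetical methods invented for this page, not code from the project), two semantically unrelated methods can agree on a trivial input such as zero while diverging almost everywhere else:

```java
// Hypothetical example: two unrelated methods that coincide on the trivial input 0.
public class TrivialAgreement {
    // Sums the integers 1..n (returns 0 for non-positive n).
    static int sumToN(int n) {
        int total = 0;
        for (int i = 1; i <= n; i++) total += i;
        return total;
    }

    // Counts the set bits in n.
    static int popCount(int n) {
        return Integer.bitCount(n);
    }

    public static void main(String[] args) {
        // Both return 0 for input 0, a spurious "similarity".
        System.out.println(sumToN(0) + " == " + popCount(0));   // 0 == 0
        // On a richer input they diverge immediately.
        System.out.println(sumToN(10) + " vs " + popCount(10)); // 55 vs 2
    }
}
```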
We plan to adapt the “preponderance of the evidence” metaphor to decide whether a given piece of code is sufficiently similar to the code at hand. This requires constructing test suites (collections of inputs) suitable for weighing similarities and differences, and devising machine learning or analytical approaches to classifying the corresponding collections of executions as similar or different. We will seek to combine metrics from both functional input/output vectors and instruction trace (e.g., data dependency graph) representations of executions.
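A minimal sketch of the input/output side of this idea, assuming a shared test suite of integer inputs and a simple agreement-fraction threshold (the functional interface, the suite, and the threshold below are placeholders for illustration, not the project's actual metrics):

```java
import java.util.List;
import java.util.function.IntUnaryOperator;

// Minimal sketch: weigh evidence of behavioral similarity by running two
// methods on the same test suite and measuring how often their outputs agree.
public class IoVectorSimilarity {
    // Fraction of test inputs on which the two methods produce the same output.
    static double agreement(IntUnaryOperator a, IntUnaryOperator b, List<Integer> suite) {
        long matches = suite.stream()
                .filter(x -> a.applyAsInt(x) == b.applyAsInt(x))
                .count();
        return (double) matches / suite.size();
    }

    public static void main(String[] args) {
        List<Integer> suite = List.of(-7, -1, 0, 1, 2, 3, 10, 100);
        IntUnaryOperator abs1 = Math::abs;
        IntUnaryOperator abs2 = x -> x < 0 ? -x : x;   // behaviorally similar to abs1
        IntUnaryOperator square = x -> x * x;          // agrees with abs1 only on a few inputs

        // A "preponderance of the evidence" rule might require agreement on,
        // say, 90% of a diverse suite (the threshold here is illustrative only).
        System.out.println("abs1 vs abs2:   " + agreement(abs1, abs2, suite));
        System.out.println("abs1 vs square: " + agreement(abs1, square, suite));
    }
}
```

An instruction-trace metric (e.g., comparing data dependency graphs of the two executions) would be combined with this functional agreement rather than relying on input/output behavior alone.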
Our initial use case is program understanding: Most software is developed by teams of engineers, not a single individual working alone. A software product is rarely developed once and for all, with no bug fixes or features added after the original deployment. Original team members leave the company or the project and new members join. Thus most engineers need to understand code originally written by others; indeed, studies have found that engineers spend more of their time trying to understand unfamiliar code than creating or modifying code. The engineer might not know what a piece of code is supposed to do, how it works, how it is supposed to work, or how to fix it to make it work; documentation might be poor, outdated, or non-existent, and knowledgeable co-workers are not always available. We hypothesize that showing developers code that behaves similarly to the code at hand may be helpful, particularly if the similar code is better documented, better structured, or simply easier to understand.
We are seeking undergraduate and MS project students interested in program understanding, program analysis, compilers and language runtime environments, software testing, and machine learning. Possibly also security, since we may apply behavioral similarity detection to finding vulnerabilities and identifying malware.
Contact: Professor Gail Kaiser, kaiser@cs.columbia.edu. Please put “saber” in the subject of your email.