Home » Projects » Active Projects

Category Archives: Active Projects

Improving rr record/replay support for Java

rr (see http://rr-project.org/) is a widely-used open-source C/C++ debugging tool for Linux that enhances gdb with record/replay capabilities.  PSL is seeking several project students (to work in a team) for spring 2018 to adapt rr to record/replay Java applications on top of the JVM.  rr already works with Java/JVM recording/replaying all system calls.  The goal is to modify rr to record/replay only those system calls specific to the Java application recorded/replayed, not the system calls due to the JVM’s internal mechanisms, to improve performance of Java application recording and replaying. (Eventually we plan to modify rr further, to support mutable replay, but that will be a later semester.) Prospective project students should have strong Java/JVM, C/C++ and Linux skills, and preferably have completed 4115 and 4118.

Contact Prof. Kaiser at kaiser@cs.columbia.edu.

Gameful Computational Thinking

Inspired by CS for All?  The Programming Systems Lab, led by Professor Gail Kaiser, is building a collaborative game-based learning and assessment system that infuses computational thinking in grade 6-8 curricula.  For the Spring 2018 semester, we’re recruiting students interested in collaborating as a Scrum team to operationalize an MVP which includes several features:
  • A gameful direct instruction system that embeds Parson’s Programming Puzzles in Scratch
  • A gameful constructionism mode that integrates scoring systems for play and automated formative assessment through Blockly
  • A gameful intelligent tutor that provides hints on-demand and just-in-time via clustering, sequential pattern mining, and behavior detection algorithms
  • A gameful affinity space that enables curricula assimilation with quest management and learning metrics using Node.js and Angular
Our projects with students from previous semesters have resulted in the development of a rich architecture and we anticipate a productive period ahead with opportunities for co-authorship of publications that document results from field studies with grade 6-8 students.  Participants will register for 3 credits of COMS E6901, section 028.  CVN students are welcome.  To learn more, please contact Jeff Bender, jeffrey.bender@columbia.edu.
SAGE | Social Addictive Gameful Engineering

Mutable Replay

This project is heavily influenced by previous work from Prof. Jason Nieh in this area, in his case for
x86 for Linux whereas we work with JVM-based managed languages. We are also collaborating with
Prof. Jonathan Bell of George Mason University.  Record/replay records all non-deterministic
inputs received during deployed execution (in a customer environment) so the execution can later be replayed
deterministically in the development lab by reusing those inputs, e.g., to reproduce a reported bug or examine
anomalous behavior for intrusions and other security exploits. Non-deterministic “inputs” here include responses from
system and library calls, e.g., random number generators, network traffic, file system and database accesses, etc., not
just end-user inputs. Conventional record/replay only works when the code replayed in the lab is exactly the same as
the code that had been recorded in the field, or adds only very limited debugging instrumentation that does not effect
the non-deterministic inputs or their dependencies. But it is desirable to replay candidate bug fixes to check that the
code change (mutation) indeed fixes the bug, even though a correct bug fix will almost always require some
new/different inputs and/or produce some new/different outputs, and replay may need to skip over or replace some
previously recorded inputs that are invalidated by changed dependencies. We are investigating static and dynamic
program analysis techniques aimed to determine how to transform the recorded log to enable replay to proceed past
changed portions of the code. This problem is undecidable in the general case (imagine inserting an infinite loop), but
preliminary results show is feasible in many practical cases when the developer’s version repository is available for
analysis (which was not leveraged for Prof. Nieh’s earlier system) and the managed language platform hides
operating system level complexities such as memory mapping and interrupts that may reduce performance.

Contact Gail Kaiser (kaiser@cs.columbia.edu)

Team Members

Gail Kaiser

Graduate Students
Anthony Saeiva Narin

Former Graduate Students
Jonathan Bell
Kenny Harvey   


Code Similarity

Dynamic Code Similarity: This is a multi-disciplinary project joint with Profs. Simha Sethumadhavan and Tony Jebara. “Code clones” are statically similar code fragments that usually arise via copy/paste or independently writing lookalike code; best practice removes clones (refactoring) or tracks them (e.g., to ensure bugs fixed in one clone are also fixed in others). This part of the project instead studies dynamically similar code for two different similarity models. One model is functional similarity, finding code fragments that exhibit similar input/output behavior during execution. Our other dynamic similarity model is the novel notion of behavioral similarity, which we call “code relatives”. Two or more code fragments are deemed code relatives if their executions are similar. We model this as finding similarities among the dynamic data dependency graphs representing instruction-level execution traces. We used machine learning techniques to devise a (relatively) fast inexact subgraph isomorphism algorithm to cluster these execution-level similarities. Our experiments show that both of our tools find most of the same “similar” code as the best static code clone detectors but also find many others they can’t, because the code looks very different even though functionally and/or behaviorally similar; however, dynamic detection will not necessarily find all static code clones because lookalike code involving polymorphism need not exhibit the same function/behavior. Our behavioral and functional similarity detectors do not always find the same similarities, because two or more code fragments may compute the same function using very different algorithms. Thus these kinds of techniques complement each other. Beyond the conventional applications of static code clone detection, dynamic similarity detection also addresses malware detection, program understanding, re-engineering legacy software to use modern APIs, and informing design of hardware accelerators and compiler optimizations.

Static Code Similarity: We also investigate of static similarity detection to augment our similarity detection toolkit. This work is joint with Prof. Baishakhi Ray of the University of Virginia and Prof. Jonathan Bell of George Mason University. Unlike most other static code clone research, we look for similarities at the instruction level rather than in the source code, so our techniques can work even on obfuscated executables where no source code is available and thus conventional static detectors cannot be applied. This situation arises for both malware and misappropriated intellectual property. We exploit the increasingly popular notion of “big code”, i.e., training from open-source repositories, using features that combine instruction-level call graph analysis and topic modeling (an NLP-based machine learning technique). We believe we can effectively deobfuscate most suspect code by finding similarities within a corpus consisting of known code and its obfuscated counterparts. Our approach handles control flow transformations and introduction of extraneous methods, not just method names.

Contact Gail Kaiser (kaiser@cs.columbia.edu)

Team Members

Gail Kaiser

Former Graduate Students
Fang-Hsiang (“Mike”) Su
Jonathan Bell
Kenny Harvey   
Apoorv Patwardhan



Fang-Hsiang Su, Jonathan Bell, Kenneth Harvey, Simha Sethumadhavan, Gail Kaiser and Tony Jebara. Code Relatives: Detecting Similarly Behaving Software. 24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE), November 2016. Artifact accepted as platinum.

Fang-Hsiang Su, Jonathan Bell, Gail Kaiser and Simha Sethumadhavan. Identifying Functionally Similar Code in Complex Codebases. 24th IEEE International Conference on Program Comprehension (ICPC), May 2016, pp. 1-10. (ACM SIGSOFT Distinguished Paper Award)

Fang-Hsiang Su, Jonathan Bell, and Gail Kaiser. Challenges in Behavioral Code Clone Detection (Position Paper). 10th International Workshop on Software Clones (IWSC), affiliated with IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), March 2016, volume 3, pp. 21-22. (People’s Choice Award for Best Position Paper)


Download DyCLink from github.

Download HitoshiIO from github.

Download Code Similarity Experiments toolkit from github.