Active Projects

Gameful Computational Thinking

Inspired by CS for All?  The Programming Systems Lab, led by Professor Gail Kaiser, is building a game-based learning and assessment system that infuses computational thinking in grade 6-8 curricula.  For the Fall 2017 semester, we’re recruiting students interested in collaborating as a Scrum team to complete an MVP expected to include several features:
  • A gameful direct instruction system that embeds Parsons Programming Puzzles in Scratch
  • A gameful constructionism mode that integrates scoring systems for play and automated assessment through Blockly
  • A gameful intelligent tutor that provides hints on-demand and just-in-time via clustering and sequential pattern mining algorithms
  • A gameful affinity space that enables curriculum assimilation with quest management and learning metrics using Node.js
Our projects with students from previous semesters have produced a rich architecture, and we anticipate a productive period ahead, with opportunities for co-authorship of publications documenting results from experiments with grade 6-8 students.  Participants will register for 3 credits of COMS E6901, section 014.  CVN students are welcome.  To learn more, please contact Jeff Bender, jeffrey.bender@columbia.edu.
SAGE | Social Addictive Gameful Engineering

Mutable Replay

This project is heavily influenced by earlier work by Prof. Jason Nieh in this area; his system targeted
x86 Linux, whereas we work with JVM-based managed languages. We are also collaborating with
Prof. Jonathan Bell of George Mason University.  Record/replay records all non-deterministic
inputs received during deployed execution (in a customer environment) so the execution can later be replayed
deterministically in the development lab by reusing those inputs, e.g., to reproduce a reported bug or examine
anomalous behavior for intrusions and other security exploits. Non-deterministic “inputs” here include responses from
system and library calls, e.g., random number generators, network traffic, file system and database accesses, etc., not
just end-user inputs. Conventional record/replay only works when the code replayed in the lab is exactly the same as
the code that had been recorded in the field, or adds only very limited debugging instrumentation that does not affect
the non-deterministic inputs or their dependencies. But it is desirable to replay candidate bug fixes to check that the
code change (mutation) indeed fixes the bug, even though a correct bug fix will almost always require some
new/different inputs and/or produce some new/different outputs, and replay may need to skip over or replace some
previously recorded inputs that are invalidated by changed dependencies. We are investigating static and dynamic
program analysis techniques aimed to determine how to transform the recorded log to enable replay to proceed past
changed portions of the code. This problem is undecidable in the general case (imagine inserting an infinite loop), but
preliminary results show it is feasible in many practical cases when the developer’s version repository is available for
analysis (which was not leveraged by Prof. Nieh’s earlier system) and the managed language platform hides
operating-system-level complexities, such as memory mapping and interrupts, that may reduce performance.
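The record/replay idea above can be sketched in miniature. The sketch below is in Python rather than the JVM setting, and the names (`Recorder`, `MutableReplayer`, the `site:` call-site labels) are invented for illustration; it is a simplified sketch of the concept, not the project's implementation. Non-deterministic results are logged at record time, and during replay, log entries whose call sites were removed by a code change are skipped:

```python
import random

class Recorder:
    """Log each non-deterministic result together with the call site that produced it."""
    def __init__(self):
        self.log = []  # list of (call_site, value)

    def record(self, call_site, fn):
        value = fn()
        self.log.append((call_site, value))
        return value

class MutableReplayer:
    """Replay logged values; skip entries invalidated by a code change."""
    def __init__(self, log):
        self.log = list(log)

    def replay(self, call_site, fn):
        # Skip recorded entries whose call sites no longer occur in the mutated code
        while self.log and self.log[0][0] != call_site:
            self.log.pop(0)  # this input was invalidated by the mutation
        if self.log:
            _, value = self.log.pop(0)
            return value     # deterministic: reuse the recorded input
        return fn()          # past the end of the log: fall back to live execution

# --- Recording in the "field" ---
rec = Recorder()
a = rec.record("site:rand1", lambda: random.randint(0, 9))
b = rec.record("site:rand2", lambda: random.randint(0, 9))

# --- Replaying in the "lab" after a bug fix removed site:rand1 ---
rep = MutableReplayer(rec.log)
b2 = rep.replay("site:rand2", lambda: random.randint(0, 9))
```

Here the surviving input (`site:rand2`) still replays deterministically (`b2 == b`) even though the code change invalidated the first recorded input.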

Contact Gail Kaiser (kaiser@cs.columbia.edu)

Team Members

Gail Kaiser

Graduate Students
Anthony Saeiva Narin

Former Graduate Students
Jonathan Bell
Kenny Harvey   


Code Similarity

Dynamic Code Similarity: This is a multi-disciplinary project joint with Profs. Simha Sethumadhavan and Tony Jebara.
“Code clones” are statically similar code fragments that usually arise via copy/paste or independently writing
lookalike code; best practice removes clones (refactoring) or tracks them (e.g., to ensure bugs fixed in one clone are
also fixed in others). This part of the project instead studies dynamically similar code for two different similarity models. One
model is functional similarity, finding code fragments that exhibit similar input/output behavior during execution. Our
other dynamic similarity model is the novel notion of behavioral similarity, which we call “code relatives”.
Two or more code fragments are deemed code relatives if their executions are similar. We model this
as finding similarities among the dynamic data dependency graphs representing instruction-level execution traces.
We used machine learning techniques to devise a (relatively) fast inexact subgraph isomorphism algorithm to cluster
these execution-level similarities. Our experiments show that both of our tools find most of the same “similar” code as
the best static code clone detectors but also find many others they can’t, because the code looks very different even
though functionally and/or behaviorally similar; however, dynamic detection will not necessarily find all static code
clones because lookalike code involving polymorphism need not exhibit the same function/behavior. Our behavioral
and functional similarity detectors do not always find the same similarities, because two or more code fragments may
compute the same function using very different algorithms. Thus these kinds of techniques complement each other.
Beyond the conventional applications of static code clone detection, dynamic similarity detection also addresses
malware detection, program understanding, re-engineering legacy software to use modern APIs, and informing
design of hardware accelerators and compiler optimizations.
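To illustrate the idea of comparing executions rather than source text, here is a minimal Python sketch. It builds labelled data-dependency edges from toy instruction traces and compares the edge multisets with Jaccard overlap. This is only a crude stand-in for the inexact subgraph isomorphism clustering described above, and the trace format and all names are invented for illustration:

```python
from collections import Counter

def dep_edges(trace):
    """Build labelled data-dependency edges from an instruction trace.
    Each step is (opcode, dest, sources); an edge links the opcode that
    last wrote a value to the opcode that reads it."""
    last_writer = {}
    edges = Counter()
    for opcode, dest, sources in trace:
        for s in sources:
            if s in last_writer:
                edges[(last_writer[s], opcode)] += 1
        if dest is not None:
            last_writer[dest] = opcode
    return edges

def similarity(e1, e2):
    """Jaccard overlap of edge multisets, a cheap proxy for comparing
    the full dynamic dependency graphs."""
    inter = sum((e1 & e2).values())
    union = sum((e1 | e2).values())
    return inter / union if union else 1.0

# Two fragments that look different in source but behave alike when traced
t1 = [("load", "x", ["a"]), ("add", "s", ["s", "x"]), ("store", None, ["s"])]
t2 = [("load", "y", ["a"]), ("add", "t", ["t", "y"]), ("store", None, ["t"])]
sim = similarity(dep_edges(t1), dep_edges(t2))
```

Because the comparison sees only opcode-to-opcode dependencies, the renamed variables in the second trace do not affect the score, which is the intuition behind detecting "code relatives" that static clone detectors miss.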

Static Code Similarity: We also investigate static similarity detection to augment our similarity detection toolkit.
This work is joint with Prof. Baishakhi Ray of the University of Virginia and Prof. Jonathan Bell of George Mason University.
Unlike most other static code clone research, we look for similarities at the instruction level
rather than in the source code, so our techniques can work even on obfuscated executables where no source code is
available and thus conventional static detectors cannot be applied. This situation arises for both malware and
misappropriated intellectual property. We exploit the increasingly popular notion of “big code”, i.e., training from
open-source repositories, using features that combine instruction-level call graph analysis and topic modeling (an
NLP-based machine learning technique). We believe we can effectively deobfuscate most suspect code by finding similarities
within a corpus consisting of known code and its obfuscated counterparts. Our approach handles control flow transformations
and introduction of extraneous methods, not just method names.
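A much-simplified sketch of identifier-insensitive static matching follows: it compares bags of opcodes with cosine similarity, ignoring names entirely so that renaming obfuscation has no effect on the score. This toy example omits the call-graph and topic-modeling features described above, and the instruction tuples are invented for illustration:

```python
import math
from collections import Counter

def opcode_vector(instructions):
    """Bag of opcodes: operand names are dropped, so renaming has no effect."""
    return Counter(op for op, *_ in instructions)

def cosine(v1, v2):
    """Cosine similarity between two opcode frequency vectors."""
    keys = set(v1) | set(v2)
    dot = sum(v1[k] * v2[k] for k in keys)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# The same fragment before and after identifier obfuscation
original   = [("iload", "i"), ("iadd",), ("istore", "sum")]
obfuscated = [("iload", "a"), ("iadd",), ("istore", "b")]
score = cosine(opcode_vector(original), opcode_vector(obfuscated))
```

The two fragments score as identical despite the renamed operands; robustness to control-flow transformations and inserted junk methods, as described above, requires the richer features and is not captured by this sketch.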

Contact Gail Kaiser (kaiser@cs.columbia.edu)

Team Members

Gail Kaiser

Graduate Students
Fang-Hsiang (“Mike”) Su

Former Graduate Students
Jonathan Bell
Kenny Harvey   
Apoorv Patwardhan



Fang-Hsiang Su, Jonathan Bell, Kenneth Harvey, Simha Sethumadhavan, Gail Kaiser and Tony Jebara. Code Relatives: Detecting Similarly Behaving Software. 24th ACM SIGSOFT International Symposium on the Foundations of Software Engineering (FSE), November 2016. Artifact accepted as platinum.

Fang-Hsiang Su, Jonathan Bell, Gail Kaiser and Simha Sethumadhavan. Identifying Functionally Similar Code in Complex Codebases. 24th IEEE International Conference on Program Comprehension (ICPC), May 2016, pp. 1-10. (ACM SIGSOFT Distinguished Paper Award)

Fang-Hsiang Su, Jonathan Bell, and Gail Kaiser. Challenges in Behavioral Code Clone Detection (Position Paper). 10th International Workshop on Software Clones (IWSC), affiliated with IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), March 2016, volume 3, pp. 21-22. (People’s Choice Award for Best Position Paper)


Download DyCLink from GitHub.

Download HitoshiIO from GitHub.

Download the Code Similarity Experiments toolkit from GitHub.