Ad hoc Test Generation Through Binary Rewriting
Anthony Saieva, Shirish Singh and Gail Kaiser. Ad hoc Test Generation Through Binary Rewriting. 20th IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM), Virtual, September 2020, pp. 115-126. https://doi.org/10.1109/SCAM51674.2020.00018.
Software available at https://github.com/Programming-Systems-Lab/ATTUNE.
When a security vulnerability or other critical bug is not detected by the developers’ test suite, and is discovered post-deployment, developers must quickly devise a new test that reproduces the buggy behavior. Then the developers need to test whether their candidate patch indeed fixes the bug, without breaking other functionality, while racing to deploy before attackers pounce on exposed user installations. This can be challenging when factors in a specific user environment triggered the bug. If enabled, however, record-replay technology faithfully replays the execution in the developer environment as if the program were executing in that user environment under the same conditions as the bug manifested. This includes intermediate program states dependent on system calls, memory layout, etc. as well as any externally-visible behavior. Many modern record-replay tools integrate interactive debuggers, to help locate the root cause, but don’t help the developers test whether their patch indeed eliminates the bug under those same conditions. In particular, modern record-replay tools that reproduce intermediate program state cannot replay recordings made with one version of a program using a different version of the program where the differences affect program state. This work builds on record-replay and binary rewriting to automatically generate and run targeted tests for candidate patches significantly faster and more efficiently than traditional test suite generation techniques like symbolic execution. These tests reflect the arbitrary (ad hoc) user and system circumstances that uncovered the bug, enabling developers to check whether a patch indeed fixes that bug. The tests essentially replay recordings made with one version of a program using a different version of the program, even when the differences impact program state, by manipulating both the binary executable and the recorded log to result in an execution consistent with what would have happened had the patched version executed in the user environment under the same conditions where the bug manifested with the original version. Our approach also enables users to make new recordings of their own workloads with the original version of the program, and automatically generate and run the corresponding ad hoc tests on the patched version, to validate that the patch does not break functionality they rely on.
@INPROCEEDINGS{9252025,
author={Anthony {Saieva} and Shirish {Singh} and Gail {Kaiser}},
booktitle={IEEE 20th International Working Conference on Source Code Analysis and Manipulation (SCAM)},
title={{Ad hoc Test Generation Through Binary Rewriting}},
month = {September},
year={2020},
location = {Virtual},
volume={},
number={},
pages={115--126},
url = {https://doi.org/10.1109/SCAM51674.2020.00018},
}
Replay without Recording of Production Bugs for Service Oriented Applications
Nipun Arora, Jonathan Bell, Franjo Ivančić, Gail Kaiser and Baishakhi Ray. Replay without Recording of Production Bugs for Service Oriented Applications. 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE), Montpellier, France, September 2018, pp. 452-463. https://doi.org/10.1145/3238147.3238186.
Software available at https://github.com/Programming-Systems-Lab/Parikshan. The software is not maintained.
Short time-to-localize and time-to-fix for production bugs are extremely important for any 24×7 service-oriented application (SOA). Debugging buggy behavior in deployed applications is hard, as it requires careful reproduction of a similar environment and workload. Prior approaches for automatically reproducing production failures do not scale to large SOA systems. Our key insight is that for many failures in SOA systems (e.g., many semantic and performance bugs), a failure can automatically be reproduced solely by relaying network packets to replicas of suspect services, an insight that we validated through a manual study of 16 real bugs across five different systems. This paper presents Parikshan, an application monitoring framework that leverages user-space virtualization and network proxy technologies to provide a sandbox “debug” environment. In this “debug” environment, developers are free to attach debuggers and analysis tools without impacting performance or correctness of the production environment. In comparison to existing monitoring solutions that can slow down production applications, Parikshan allows application monitoring at significantly lower overhead.
@inproceedings{Arora:2018:RWR:3238147.3238186,
author = {Arora, Nipun and Bell, Jonathan and Ivan\v{c}i\'{c}, Franjo and Kaiser, Gail and Ray, Baishakhi},
title = {Replay Without Recording of Production Bugs for Service Oriented Applications},
booktitle = {Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering},
series = {ASE 2018},
year = {2018},
isbn = {978-1-4503-5937-5},
location = {Montpellier, France},
pages = {452--463},
numpages = {12},
url = {http://doi.acm.org/10.1145/3238147.3238186},
doi = {10.1145/3238147.3238186},
acmid = {3238186},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {Fault reproduction, live debugging},
}
SEmantic Security bug detection with Pseudo-Oracles (SESPO)
Semantic security bugs cause serious vulnerabilities across a wide range of software, e.g., bypassing security checks, gaining access to sensitive information, and escalating privileges. Automatically detecting these bugs is hard because unlike crashing bugs they may not show any obvious side-effects. The safety and security specifications violated by semantic security bugs are rarely written in a form easy to express as test oracles that define precisely what the behavior should be for every possible input sequence.
Pseudo-oracle testing is one popular technique to identify non-crashing buggy behaviors when true test oracles are unavailable: Two or more executions are compared, as oracles for each other, to find discrepancies. The most common pseudo-oracle approach used in security research is differential testing, where multiple independent implementations of the same functionality are compared to see if they agree on, e.g., whether a given untrusted input is valid. But that only works when multiple independent implementations are available to the developers.
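To make the idea concrete, here is a minimal differential-testing sketch in C (our own toy example, not SESPO code): two independent, hypothetical implementations of the same input-validity check are run on random inputs, and any disagreement is flagged for inspection.

/* Minimal differential-testing sketch (hypothetical stand-ins, not SESPO code):
 * two independent implementations of "is this string a decimal integer?"
 * are run on random inputs; any disagreement is reported as suspicious. */
#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Implementation A: hand-rolled character scan. */
static int valid_a(const char *s) {
    if (*s == '+' || *s == '-') s++;
    if (!*s) return 0;
    for (; *s; s++)
        if (!isdigit((unsigned char)*s)) return 0;
    return 1;
}

/* Implementation B: defer to strtol and require full consumption. */
static int valid_b(const char *s) {
    char *end;
    errno = 0;
    strtol(s, &end, 10);
    return *s && *end == '\0' && errno != ERANGE;
}

int main(void) {
    const char alphabet[] = "0123456789+- x";
    char buf[8];
    srand(1);
    for (int trial = 0; trial < 100000; trial++) {
        int len = rand() % 7;
        for (int i = 0; i < len; i++)
            buf[i] = alphabet[rand() % (sizeof alphabet - 1)];
        buf[len] = '\0';
        int a = valid_a(buf), b = valid_b(buf);
        if (a != b)  /* the pseudo-oracle: the implementations disagree */
            printf("disagreement on \"%s\": a=%d b=%d\n", buf, a, b);
    }
    return 0;
}

Running this sketch actually surfaces a semantic discrepancy: strtol skips leading whitespace, so the two validators disagree on inputs such as " 1", which is exactly the kind of inconsistency a differential pseudo-oracle flags.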
An alternative approach that requires only the one implementation at hand is metamorphic testing, which checks domain-specific metamorphic relations that are defined across sets of input/output pairs from multiple executions of the same test program. For example, a sample metamorphic property for program p adding two inputs a and b can be p(a, b) = p(a, 0) + p(b, 0). For a program r that sorts an array of inputs, r( [original array] ) should produce the same sorted result as r( [permutation of original array] ). A machine learning classifier should also produce the same results when the order of the training set is shuffled.
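As a toy illustration of checking such a relation (our own example, not project code), the additivity property above can be tested directly against a stand-in program p:

/* Metamorphic-relation sketch: check p(a, b) == p(a, 0) + p(b, 0)
 * over many random input pairs (toy code; p is a stand-in). */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static int p(int a, int b) { return a + b; }   /* toy program under test */

int main(void) {
    srand(42);
    for (int i = 0; i < 1000; i++) {
        int a = rand() % 1000, b = rand() % 1000;
        /* the metamorphic relation from the text */
        assert(p(a, b) == p(a, 0) + p(b, 0));
    }
    puts("metamorphic relation held on 1000 input pairs");
    return 0;
}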
However, it is challenging to apply metamorphic relations to detecting security vulnerabilities, as security properties require richer semantics and often cannot be expressed as simple relationships among input-output pairs. This project investigates whether there is a comprehensive range of semantically expressive metamorphic relations that can successfully detect semantic security vulnerabilities.
We are seeking undergraduate and MS project students to participate in a pilot study:
1. Create a vulnerability database of known semantic security bugs, their corresponding patches, and the test cases needed to reproduce those bugs.
2. Manually classify the bugs into different sub-categories, e.g., incorrect error handling, missing input validation, API abuse, etc.
3. Determine which properties change across the buggy and fixed versions that can be expressed as violations of metamorphic relations.
4. Inspect other execution paths to check whether these relations are satisfied during benign executions.
5. Find out whether these metamorphic relations are also applicable to similar bugs in other programs.
Students should be interested in security, software testing and program analysis.
Contact: Professor Gail Kaiser, kaiser@cs.columbia.edu. Please put “sespo” in the subject of your email.
Preponderance of the Evidence for SimilAr BEhavioR (SABER)
This project investigates dynamic analysis approaches to identifying behavioral similarities among code elements in the same or different programs, particularly for code that behaves similarly during execution but does not look similar and so would be difficult or impossible to detect using static analysis (code clones). While code clone technology is fairly mature, detecting behavioral similarities remains challenging.
Dynamic analysis involves executing the candidate programs, or portions of programs such as classes and methods, which means we need to identify program and method inputs that will be useful for identifying code that expert humans would deem similar, without numerous false positives. For instance, many methods given null or zero inputs produce null or zero outputs and/or execute similar instruction sequences, even though they behave entirely differently with other inputs. Some methods might coincidentally happen to behave similarly for one or a small number of more complex inputs, but still behave quite differently on the majority of the input space.
We plan to adapt the “preponderance of the evidence” metaphor to judging whether a given piece of code is sufficiently similar to the code at hand. This requires constructing test suites (collections of inputs) suitable for weighing similarities and differences, and devising machine learning or analytical approaches to classifying the corresponding collections of executions as similar or different. We will seek to combine metrics from both functional input/output vectors and instruction trace (e.g., data dependency graph) representations of executions.
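A minimal sketch of the input/output-vector side of this idea (illustrative only; the method names are hypothetical): execute two candidate methods on the same diverse inputs and score the fraction of agreeing outputs, so that agreement on many varied inputs, not a few trivial ones, constitutes the preponderance of the evidence.

/* Sketch of input/output-based behavioral similarity (illustrative only):
 * run two candidate methods on the same inputs and score the fraction of
 * matching outputs, weighing many diverse inputs rather than a few trivial
 * ones such as zero. */
#include <stdio.h>
#include <stdlib.h>

static int method_a(int x) { return x * 2; }   /* doubling via multiply */
static int method_b(int x) { return x + x; }   /* doubling via addition */
static int method_c(int x) { return x * x; }   /* squaring: agrees with
                                                  doubling only on 0 and 2 */

static double similarity(int (*f)(int), int (*g)(int), int trials) {
    int agree = 0;
    for (int i = 0; i < trials; i++) {
        int x = rand() % 2001 - 1000;   /* diverse inputs, not just zero */
        if (f(x) == g(x)) agree++;
    }
    return (double)agree / trials;
}

int main(void) {
    srand(7);
    printf("a~b: %.2f\n", similarity(method_a, method_b, 10000)); /* ~1.00 */
    printf("a~c: %.2f\n", similarity(method_a, method_c, 10000)); /* ~0.00 */
    return 0;
}

Note that scoring only the trivial input 0 would report all three methods as identical, which is the false-positive pitfall described above.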
Our initial use case is program understanding: Most software is developed by teams of engineers, not a single individual working alone. A software product is rarely developed once and for all with no bug fixes or features added after original deployment. Original team members leave the company or the project and new members join. Thus most engineers need to understand code originally written by other persons; indeed, studies have found that engineers spend more of their time trying to understand unfamiliar code than creating or modifying code. The engineer might not know what a piece of code is supposed to do, how it works, how it is supposed to work, or how to fix it to make it work; documentation might be poor, outdated, or non-existent, and knowledgeable co-workers are not always available. We hypothesize that showing developers code that behaves similarly to the code at hand may be helpful, particularly if the similar code is better documented, better structured or simply easier to understand.
We are seeking undergraduate and MS project students interested in program understanding, program analysis, compilers and language runtime environments, software testing and machine learning. Possibly also security, since we may apply behavioral similarities to vulnerability detection and malware identification.
Contact: Professor Gail Kaiser, kaiser@cs.columbia.edu. Please put “saber” in the subject of your email.
Toward Trustworthy MUtable RePLAY for Security Patches (MUPLAY)
Society is increasingly reliant on software, but deployed software contains security vulnerabilities and other bugs that can threaten privacy, property and even human lives. When a security vulnerability or critical error is discovered, a software patch is issued to attempt to fix the problem, but patches themselves can be incorrect or inadequate, and can break necessary functionality. This project investigates the full workflow: helping the developer rapidly diagnose the root cause of the vulnerability or error, helping the developer test that a prospective patch indeed completely removes the defect, and helping users check the issued patch on their own configurations and workloads before adopting it.
Record/replay is an emerging technique for low-overhead recording of deployed software, in case a failure or exploit occurs in the field, so the user can transmit the recorded log to the developer to deterministically reproduce the bug. But existing record/replay systems can replay only with the exact same version of the software as recorded, possibly with insertion of debugging instrumentation that does not affect application state. This project investigates “mutable replay” to help reproduce, diagnose, and fix software bugs. Mutable replay will enable logs recorded with the buggy version to be replayed after the modest code changes typical of critical patches, which do impact application state, to show that patches work correctly (or not!) to resolve detected problems.
This project leverages semantic information readily available to the developer (e.g., from their version control repository) to conduct static and dynamic analyses that correctly transform the recorded log to enable mutable replay. For example, the replayer could skip over no-longer-valid log entries and temporarily go live when new data is needed.
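The following is a conceptual sketch of such a transformed-replay loop (the log format and field names are hypothetical, not rr’s or any particular system’s): entries invalidated by the patch are skipped and their calls re-executed live, while unaffected entries are fed from the recording.

/* Conceptual mutable-replay loop (hypothetical log format, not rr's):
 * entries invalidated by the patch are skipped and the corresponding calls
 * are re-executed "live"; everything else is fed back from the recording. */
#include <stdio.h>

struct log_entry {
    const char *call;      /* e.g., name of the recorded library/system call */
    long recorded_result;  /* value captured in the original recording       */
    int invalidated;       /* marked by patch analysis (e.g., changed callee) */
};

/* Stand-in for actually performing the call in the current environment. */
static long execute_live(const char *call) {
    printf("  go live: executing %s\n", call);
    return 0;  /* placeholder result */
}

static void replay(const struct log_entry *log, int n) {
    for (int i = 0; i < n; i++) {
        if (log[i].invalidated) {
            /* Entry no longer matches the patched code: skip the recorded
             * value and temporarily go live for this call. */
            execute_live(log[i].call);
        } else {
            printf("  replay:  %s -> %ld (from recording)\n",
                   log[i].call, log[i].recorded_result);
        }
    }
}

int main(void) {
    struct log_entry log[] = {
        { "open",  3, 0 },
        { "read", 42, 1 },   /* the patch changed the code around this read */
        { "close", 0, 0 },
    };
    replay(log, 3);
    return 0;
}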
We are currently modifying rr (https://rr-project.org/) to support mutable replay, since rr is an actively supported open-source record/replay system in wide use and integrated with gdb. Its recorded log includes instruction locations in memory, register contents, and other low-level details, rather than just the non-deterministic inputs to the application that some other record/replay systems record. This makes it impossible to use rr’s replayer to run the new version of the executable while feeding from a log recorded with a previous version of the executable. Instead, we need to patch the old executable to jump into the new executable where they diverge, run that modified code “live”, then jump back if/when they converge.
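To make the jump-and-return idea concrete, here is a minimal, self-contained sketch of the underlying binary-rewriting primitive on x86-64 Linux (our own toy example; the function names are stand-ins, and a real tool would locate divergence points by diffing the two binaries rather than hard-coding them):

/* Minimal sketch of the binary-rewriting primitive behind mutable replay on
 * x86-64 Linux: overwrite the entry of the old code with a relative jump
 * ("trampoline") into the new code. Function names are illustrative.
 * Build: cc -O0 trampoline.c && ./a.out */
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

__attribute__((noinline)) static int old_version(int x) { return x + 1; } /* "buggy"   */
__attribute__((noinline)) static int new_version(int x) { return x + 2; } /* "patched" */

/* Overwrite the first 5 bytes of `from` with `jmp rel32` to `to`.
 * Assumes `to` is within +/-2 GB of `from` (true here: same .text section). */
static int install_jump(void *from, void *to) {
    long page = sysconf(_SC_PAGESIZE);
    uintptr_t start = (uintptr_t)from & ~(uintptr_t)(page - 1);
    size_t len = (uintptr_t)from + 5 - start;   /* pages covering the patch */
    if (mprotect((void *)start, len, PROT_READ | PROT_WRITE | PROT_EXEC))
        return -1;           /* may be denied by hardened kernels/policies */
    unsigned char code[5] = { 0xE9 };           /* opcode for jmp rel32 */
    int32_t rel = (int32_t)((intptr_t)to - ((intptr_t)from + 5));
    memcpy(code + 1, &rel, sizeof rel);
    memcpy(from, code, sizeof code);
    return mprotect((void *)start, len, PROT_READ | PROT_EXEC);
}

int main(void) {
    printf("before patch: %d\n", old_version(41));   /* prints 42 */
    if (install_jump((void *)old_version, (void *)new_version)) {
        perror("mprotect");
        return 1;
    }
    printf("after patch:  %d\n", old_version(41));   /* prints 43 via new_version */
    return 0;
}

This only shows the one-way jump into the new code; the project also needs the return jump at the convergence point and the log transformations described above.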
We would also like to investigate mutable replay for Java/JVM, but first need to find a practical record/replay system to leverage — or build our own.
We are seeking undergraduate and MS project students interested in compilers and language runtime environments, debuggers, program analysis, software testing, software reliability, security and systems (Linux kernel and JVM).
Contact: Anthony Narin, ant@cs.columbia.edu, cc Prof. Gail Kaiser, kaiser@cs.columbia.edu. Please put “muplay” in the subject of your email.
SESPO (SEmantic Security bug detection with Pseudo-Oracles): targeted subproject #1
Semantic security bugs resulting from logic flaws severely undermine the security of critical software systems. For example, attackers exploited a recent semantic security bug (CVE-2017-5638) in Apache Struts to steal sensitive personal data of 143 million customers of Equifax. Despite the severity of such semantic security bugs, there is no generic method for detecting these bugs. The key challenge is that semantic bugs, unlike memory corruption bugs, do not have any program-independent side-effects like crashing. Therefore, semantic security bugs are hard to detect without formal specifications of the program under test. Unfortunately, full formal specifications (test oracles) are usually not available even for security critical software as these specifications are difficult to create and keep in sync with functionality changes.
Our longer-term goal is to develop SESPO, a generic framework for automatically detecting semantic security bugs using pseudo-oracle testing—a way to create test oracles by comparing behaviors of related executions even when complete specifications are not available. SESPO will use two main classes of pseudo-oracles: (i) Differential Testing to compare outputs of different independent programs with similar functionality and (ii) Metamorphic Testing to check the relationship between outputs from multiple executions of the same program with different inputs.
We are currently looking for one or more undergraduate/MS project students to perform a relatively short-term project for academic credit: devising metamorphic relations corresponding to some security properties of Java and/or C/C++ software. A “metamorphic relation” specifies a relationship between input/output pairs. In particular, if you have a given input i and its already-known output o (already known because you ran the software with input i), the relation defines 1. a constructor, 2. a predictor and 3. a checker. The constructor crafts a new input i’ from i; the predictor determines what the output o’ ought to be from the already-known i, o and i’; and the checker compares the real output o’’ (when you run the software with input i’) to the predicted output o’. If o’’ is not consistent with o’, we know there’s a bug in the software (the checker is often, but not necessarily, equality). For example, if you ran a sort program on a list of elements as input to produce sorted output, permuting the original input list to derive the new input should result in the identical output except for “ties”.
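A minimal sketch of this constructor/predictor/checker decomposition for the sorting example (toy code; actual SESPO relations would target security properties of real-world software):

/* Constructor/predictor/checker sketch for the sorting example (toy code):
 * i   = original array,      o   = sut(i)  (already known)
 * i'  = constructor(i)       = a permutation of i
 * o'  = predictor(i, o, i')  = o itself (sorting is permutation-invariant)
 * checker: is o'' = sut(i') consistent with o'? */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b) {
    return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
}

/* The software under test. */
static void sut_sort(int *a, int n) { qsort(a, n, sizeof(int), cmp_int); }

/* Constructor: derive i' from i by shuffling (Fisher-Yates). */
static void construct(const int *i, int *i2, int n) {
    memcpy(i2, i, n * sizeof(int));
    for (int k = n - 1; k > 0; k--) {
        int j = rand() % (k + 1), t = i2[k];
        i2[k] = i2[j]; i2[j] = t;
    }
}

int main(void) {
    srand(11);
    int i[12], i2[12], o[12], o2[12];
    for (int k = 0; k < 12; k++) i[k] = rand() % 50;

    memcpy(o, i, sizeof i);    sut_sort(o, 12);   /* o   = sut(i)           */
    construct(i, i2, 12);                         /* constructor: i' from i */
    memcpy(o2, i2, sizeof i2); sut_sort(o2, 12);  /* o'' = sut(i')          */

    /* Predictor says o' == o; the checker here is element-wise equality. */
    puts(memcmp(o, o2, sizeof o) == 0 ? "relation holds" : "bug detected");
    return 0;
}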
We will train you in the concepts of metamorphic testing. Your job will be to find a reasonable number of different metamorphic relations relevant to security properties of real-world software, where inputs come from untrusted users.
To learn more, please contact Prof. Kaiser, kaiser@cs.columbia.edu.
MuPlay (MUtable rePLAY) for patching security vulnerabilities
Seeking new PhD student as well as MS thesis and graduate/undergraduate project students.
The project investigates the full patching workflow for both Java and C/C++, from when the security vulnerability is detected and a reproducible exploit is demonstrated, through testing candidate patches, to when users have validated and applied the released patch in production. Developer productivity will be improved since constructing, executing and checking exploit test cases will be automated; the productivity of the application’s administrators will be improved since they will spend less time painstakingly validating patches by hand and more time directly benefiting their business. The application’s users will more rapidly be protected from a vulnerable application threatening their own security and privacy.
We build on record/replay technology, which already supports deterministic bug reproduction and replay of the originally recorded log after very restricted changes, such as inserting print statements, recompiling with instrumentation and reconfiguring to increase verbosity, that do not affect application state. However, unrestricted changes constituting a patch – which typically do affect application state – are often sufficiently modest that the modified application can still be replayed with small mutations to the original record/replay log. In particular, static and dynamic analyses, such as slicing, data dependency graphs, and taint tracking, can rule out potential log mutations that would be unsound and suggest other prospective log mutations consistent with the code changes. Then the replayer can skip over no longer relevant log entries and “go live” for new/replaced library and system calls, while reusing unaffected data from the recorded log.
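As a toy sketch of how dependency information can rule out an unsound log mutation (the log representation below is hypothetical, not any real system’s): dropping an entry is unsound if some entry that will still be replayed consumes its recorded result.

/* Toy sketch of ruling out unsound log mutations via data dependencies
 * (hypothetical log representation): dropping entry i is unsound if some
 * entry that will still be replayed consumes i's recorded result. */
#include <stdio.h>

#define MAX_DEPS 4

struct log_entry {
    const char *call;
    int skip;             /* proposed mutation: 1 = drop this entry        */
    int deps[MAX_DEPS];   /* indices of earlier entries whose results flow */
    int ndeps;            /* into this entry's arguments                   */
};

/* Returns 1 if the proposed set of skips is sound, 0 otherwise. */
static int mutation_is_sound(const struct log_entry *log, int n) {
    for (int i = 0; i < n; i++) {
        if (log[i].skip) continue;   /* skipped entries are re-executed live */
        for (int d = 0; d < log[i].ndeps; d++)
            if (log[log[i].deps[d]].skip) {
                printf("unsound: %s depends on skipped %s\n",
                       log[i].call, log[log[i].deps[d]].call);
                return 0;
            }
    }
    return 1;
}

int main(void) {
    struct log_entry log[] = {
        { "open",  0, {0}, 0 },
        { "read",  1, {0}, 1 },   /* proposed skip                        */
        { "write", 0, {1}, 1 },   /* but write consumes read's result ... */
    };
    printf("sound? %d\n", mutation_is_sound(log, 3));   /* reports unsound */
    return 0;
}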
In addition to longer-term participants, we are currently looking for one or more undergraduate/MS project students to perform a relatively short-term project for academic credit: We would like to convert the open-source rr record/replay system (https://rr-project.org/) to directly support Java programs. rr already supports Java applications running on top of the JVM, but it records and replays all the activities of the underlying JVM rather than just the Java application, which is inefficient and makes it difficult to reason about the Java application state, which is needed to enable mutable replay. We will train you in the concepts of record/replay. Your job will be to study how rr works internally, to evaluate the feasibility of such a conversion, and to determine what, specifically, would need to be changed.
Other one-semester projects can be defined for interested project students.
To learn more, please contact Prof. Kaiser, kaiser@cs.columbia.edu.
Leveraging SABER (SimilAr BEhavioR) to aid program understanding
Seeking new PhD student as well as MS thesis and graduate/undergraduate project students.
Most software is developed by teams of engineers, not a single individual working alone. Further, a software product is rarely developed once and for all with no bug fixes or features added after original deployment. Original team members leave the company or the project and new members join. Thus most engineers need to understand software code originally written by other persons; indeed, studies have found that engineers spend more of their time trying to understand unfamiliar code than creating or modifying code. The engineer might not know what a piece of code is supposed to do, how it works, how it is supposed to work, or how to fix it to make it work; documentation might be poor, outdated, or non-existent, and knowledgeable co-workers are not always available.
This project investigates automated tools and techniques to help engineers more quickly and more completely understand the code they are tasked to work with, to improve their productivity and the quality of their work. With less time spent understanding existing code, engineers have more time to spend on modifying that code, fixing bugs and adding features desired by their users, and creating new software benefiting the public.
More specifically, this project investigates dynamic analysis approaches to identifying behavioral similarities among code elements in the same or different programs, particularly for code that behaves similarly during execution but does not look similar and so would be difficult or impossible to detect using static analysis (code clones). While code clone technology is fairly mature, tools for detecting behavioral similarities are relatively primitive. The primary objective is to improve and shape behavioral similarity analysis for practical use cases, concentrating on finding similar code in the same or other codebases that might help developers understand, debug, and add features to unfamiliar code they are tasked to work with.
The project seeks to advance knowledge about what it means for code to be behaviorally similar, how dynamic analyses can identify behavioral code similarities, how to drive the executions necessary for these analyses, and how to leverage code whose behavior is reported as highly similar to the code at hand to achieve common software engineering tasks that may be ill-suited to representational code similarities (code clones). The research investigates the utility and scalability of dynamic analyses seeking behavioral similarities in corresponding representations of code executions; guiding input case generation techniques to produce test executions useful for comparing/contrasting code behaviors for particular use cases; and filtering and weighting schemes for adapting the preponderance of the evidence metaphor to choosing the most convincing similarities for the software engineering task.
To learn more, please contact Prof. Kaiser, kaiser@cs.columbia.edu.
Gameful Computational Thinking (SAGE)
Seeking undergraduate and graduate students in Computer Science and related fields. Also see separate project description for students with formal background in education.
Inspired by CS for All? The Programming Systems Lab, led by Professor Gail Kaiser, is building a collaborative game-based learning and assessment system intended to help teachers infuse computational thinking in grade 6-8 curricula. We’re recruiting students interested in collaborating as a Scrum team to continue development of several features:
- A gameful direct instruction system that embeds Parsons Programming Puzzles in Scratch
- A gameful constructionism mode that integrates scoring systems for play and automated formative assessment through Blockly
- A gameful intelligent tutor that provides hints on-demand and just-in-time via clustering, sequential pattern mining, and behavior detection algorithms
- A gameful affinity space that enables curricula assimilation with quest management and learning metrics using Node.js and Angular
Participants should register for 3 credits of COMS E6901 (graduate students) or COMS W3998 (undergraduates). CVN students are welcome.
To learn more, please contact Jeff Bender, jeffrey.bender@columbia.edu.