Obfuscation Resilient Search through Executable Classification
Android applications are usually obfuscated before release,
making it difficult to analyze them for malware presence or
intellectual property violations. Obfuscators might hide the
true intent of code by renaming variables and/or modifying
program structures. It is challenging to search for executables
relevant to an obfuscated application for developers to analyze
efficiently. Prior approaches toward obfuscation resilient
search have relied on certain structural parts of apps remaining
as landmarks, untouched by obfuscation. For instance,
some prior approaches have assumed that the structural relationships
between identifiers are not broken by obfuscators;
others have assumed that control flow graphs maintain their
structures. Both approaches can be easily defeated by a motivated
obfuscator. We present a new approach, MACNETO,
to search for programs relevant to obfuscated executables
leveraging deep learning and principal features on instructions.
MACNETO makes few assumptions about the kinds of
modifications that an obfuscator might perform. We show
that it has high search precision for executables obfuscated
by a state-of-the-art obfuscator that changes control flow. Further,
we also demonstrate the potential of MACNETO to help
developers understand executables, where MACNETO infers
keywords (drawn from relevant unobfuscated programs)
for obfuscated executables.
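
To make the idea of searching on instruction-level features concrete, here is a minimal sketch. It is not MACNETO's method: it swaps the deep-learning classifier for a plain cosine similarity over opcode histograms, and the class name `InstructionSearch`, its methods, and the opcode lists are all hypothetical. It only illustrates why features computed from instructions are unaffected by identifier renaming.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; a simple stand-in, not the approach from the paper.
final class InstructionSearch {

    // Count how often each opcode mnemonic occurs. Identifier names never
    // enter this vector, so renaming variables or methods cannot change it.
    static Map<String, Integer> histogram(List<String> opcodes) {
        Map<String, Integer> counts = new HashMap<>();
        for (String op : opcodes) {
            counts.merge(op, 1, Integer::sum);
        }
        return counts;
    }

    // Cosine similarity between two opcode histograms (0.0 if either is empty).
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            normA += e.getValue() * e.getValue();
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    // Return the name of the corpus entry whose histogram is closest to the query.
    static String nearest(List<String> query, Map<String, List<String>> corpus) {
        Map<String, Integer> q = histogram(query);
        String best = null;
        double bestScore = -1;
        for (Map.Entry<String, List<String>> e : corpus.entrySet()) {
            double score = cosine(q, histogram(e.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```

An obfuscator that rewrites control flow also changes the instruction mix, which is one reason a learned model rather than a raw histogram comparison would be needed in practice.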
@inproceedings{Su:2018:ORS:3211346.3211352,
author = {Su, Fang-Hsiang and Bell, Jonathan and Kaiser, Gail and Ray, Baishakhi},
title = {{Obfuscation Resilient Search Through Executable Classification}},
booktitle = {{Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages (MAPL)}},
series = {MAPL 2018},
year = {2018},
isbn = {978-1-4503-5834-7},
location = {Philadelphia, PA, USA},
pages = {20--30},
numpages = {11},
url = {http://doi.acm.org/10.1145/3211346.3211352},
doi = {10.1145/3211346.3211352},
acmid = {3211352},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {bytecode analysis, bytecode search, deep learning, executable search, obfuscation resilience}
}
Gameful Computational Thinking: seeking students from Teachers College
Inspired by CS for All? The Programming Systems Lab, led by Professor Gail Kaiser, is building a collaborative game-based learning and assessment system intended to help teachers infuse computational thinking into grade 6-8 curricula. We are seeking students with backgrounds in teaching (anything) to middle-school-age children, to help us transition to field studies with our partner school and possibly other educational programs. Software development skills are not needed for these positions, but an interest in applying theories from the learning sciences within educational technology is. Appropriate academic credit can be arranged on a case-by-case basis.
To learn more, please contact Jeff Bender, jeffrey.bender@columbia.edu.
SAGE | Social Addictive Gameful Engineering
https://github.com/cu-sage/About
Improving rr record/replay support for Java
rr (see http://rr-project.org/) is a widely-used open-source C/C++ debugging tool for Linux that enhances gdb with record/replay capabilities. PSL is seeking several project students (to work in a team) for spring 2018 to adapt rr to record/replay Java applications on top of the JVM. rr already works with Java/JVM recording/replaying all system calls. The goal is to modify rr to record/replay only those system calls specific to the Java application recorded/replayed, not the system calls due to the JVM’s internal mechanisms, to improve performance of Java application recording and replaying. (Eventually we plan to modify rr further, to support mutable replay, but that will be a later semester.) Prospective project students should have strong Java/JVM, C/C++ and Linux skills, and preferably have completed 4115 and 4118.
Contact Prof. Kaiser at kaiser@cs.columbia.edu.
Record/Replay Bug Reproduction for Java
There will inevitably continue to be bugs that are not detected by any testing approach, but eventually impact users who then file bug reports. Reproducing field failures in the development environment can be difficult, however, especially in the case of software that behaves non-deterministically, relies on remote resources, or has complex reproduction steps (the users may not even know what led up to triggering the flaw, particularly in the case of software interacting with external devices, databases, etc. in addition to human users). So a record/replay approach is used to capture the state of the system just before a bug is encountered, so the steps leading up to this state can be replayed later in the lab. The naive approach of constant logging in anticipation of a defect tends to produce unacceptably high overheads (reaching 2,000+ %) in the deployed application. Novel solutions that lower this overhead typically limit the depth of information recorded (e.g., to use only a stack trace, rather than a complete state history) or the breadth of information recorded (e.g., to only log information during execution of a particular subsystem that a developer identifies as potentially buggy). But limiting the depth of information gathered may fail to reproduce an error if the defect does not present itself immediately and limiting logging to a specific subcomponent of an application makes it only possible to reproduce the bug if it occurred within that subcomponent.
Our new technique, called “Chronicler”, instead captures program execution in a manner that allows for deterministic replay in the lab with very low overhead. The key insight is to log sources of non-determinism only at the library level – allowing for a lightweight recording process while still supporting a complete replay for debugging purposes (programs with no sources of non-determinism, e.g., no user interactions, are trivial to replay – just provide the same inputs). When a failure occurs, Chronicler automatically generates a test case that consists of the inputs (e.g., file or network I/O, user inputs, random numbers, etc.) that caused the system to fail. This general approach can be applied to any “managed” language that runs in a language virtual machine (for instance, JVM or Microsoft’s .NET CLR), requiring no modifications to the interpreter or environment, and thus addresses a different class of programs than related work for non-managed languages like C and C++.
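
The record-at-the-library-boundary idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not Chronicler's actual implementation or API: the classes `NondeterminismLog` and `DiceGame` are hypothetical, the log holds only `long` values, and interception is done by hand rather than by instrumenting library calls automatically.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Random;
import java.util.function.LongSupplier;

// Hypothetical illustration only; not Chronicler's real design.
final class NondeterminismLog {
    enum Mode { RECORD, REPLAY }

    private final Mode mode;
    private final List<Long> recorded = new ArrayList<>();
    private final Deque<Long> pending = new ArrayDeque<>();

    // Start a recording session with an empty log.
    static NondeterminismLog recording() {
        return new NondeterminismLog(Mode.RECORD, List.of());
    }

    // Start a replay session fed by a previously recorded log.
    static NondeterminismLog replaying(List<Long> log) {
        return new NondeterminismLog(Mode.REPLAY, log);
    }

    private NondeterminismLog(Mode mode, List<Long> log) {
        this.mode = mode;
        this.pending.addAll(log);
    }

    // Wrap one non-deterministic library call (clock, RNG, bytes read, ...).
    // In RECORD mode the real call runs and its result is logged; in REPLAY
    // mode the real call is skipped and the logged value is returned instead.
    long intercept(LongSupplier libraryCall) {
        if (mode == Mode.RECORD) {
            long value = libraryCall.getAsLong();
            recorded.add(value);
            return value;
        }
        if (pending.isEmpty()) {
            throw new IllegalStateException("replay diverged: log exhausted");
        }
        return pending.removeFirst();
    }

    // Immutable copy of everything recorded so far; this is the "test case".
    List<Long> snapshot() {
        return List.copyOf(recorded);
    }
}

final class DiceGame {
    // Application logic is deterministic; its only outside input is the RNG,
    // so logging those two values is enough to reproduce the whole run.
    static long play(NondeterminismLog log, Random rng) {
        long first = log.intercept(() -> rng.nextInt(6) + 1);
        long second = log.intercept(() -> rng.nextInt(6) + 1);
        return first * 10 + second;
    }
}
```

Running `DiceGame.play` once with a recording log and again with `NondeterminismLog.replaying(log.snapshot())` yields the same result even though the second run uses a fresh `Random`, because only the values crossing the library boundary were logged, not the application's internal state.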
We expect to extend and use this tool as part of the Mutable Replay project, and are seeking new project students in tandem with that effort.
Contact Professor Gail Kaiser (kaiser@cs.columbia.edu)
Links
Publications
Jonathan Bell, Nikhil Sarda and Gail Kaiser. Chronicler: Lightweight Recording to Reproduce Field Failures. 35th International Conference on Software Engineering, May 2013, pp. 362-371. See teaser video at https://www.youtube.com/watch?v=4IYGfdDnAJg.
Software
Download: http://ChroniclerJ.
Code Relatives: Detecting Similarly Behaving Software
@inproceedings{Su:2016:CRD:2950290.2950321,
author = {Su, Fang-Hsiang and Bell, Jonathan and Harvey, Kenneth and Sethumadhavan, Simha and Kaiser, Gail and Jebara, Tony},
title = "{Code Relatives: Detecting Similarly Behaving Software}",
booktitle = "{24th ACM SIGSOFT International Symposium on Foundations of Software Engineering}",
series = {FSE 2016},
year = {2016},
isbn = {978-1-4503-4218-6},
location = {Seattle, WA, USA},
pages = {702--714},
numpages = {13},
url = {http://doi.acm.org/10.1145/2950290.2950321},
doi = {10.1145/2950290.2950321},
acmid = {2950321},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {Code relatives, code clones, link analysis, runtime behavior, subgraph matching},
note = "Artifact accepted as platinum."
}
Mutable Replay
Society is increasingly reliant on software, but deployed software contains security vulnerabilities and other bugs that can threaten privacy, property and even human lives. When a security vulnerability or critical error is discovered, a software patch is issued to attempt to fix the problem, but patches themselves can be incorrect or inadequate, and can break necessary functionality. This project investigates the full workflow for the developer to rapidly diagnose the root cause of the vulnerability or error, for the developer to test that a prospective patch indeed completely removes the defect, and for users to check the issued patch on their own configurations and workloads before adopting the patch.
This project explores the use of mutable replay to help reproduce, diagnose, and fix software bugs. A low-overhead recorder records the execution of software in case a failure or exploit occurs, allowing the developer to replay the recorded log to reproduce the problem. Mutable replay allows logs recorded with the buggy version to be replayed after the modest code changes typical of critical patches to show that patches work correctly to resolve detected problems. This project leverages semantic information readily available to the developer to conduct well-understood static and dynamic analyses to correctly transform the recorded log to enable mutable replay. The results of this research will benefit society and individuals by simplifying and hastening both generation and validation of patches, ultimately making software more reliable and secure.
Contact Gail Kaiser (kaiser@cs.columbia.edu)
Team Members
Faculty
Gail Kaiser
Graduate Students
Anthony Saeiva Narin
Former Graduate Students
Jonathan Bell
Kenny Harvey
Identifying Functionally Similar Code in Complex Codebases
@inproceedings{hitoshiio,
author = "Fang-Hsiang Su and Jonathan Bell and Gail Kaiser and Simha Sethumadhavan",
title = "{Identifying Functionally Similar Code in Complex Codebases}",
booktitle = "{24th IEEE International Conference on Program Comprehension (ICPC)}",
month = "May",
year = "2016",
pages = "1--10",
url = "http://dx.doi.org/10.1109/ICPC.2016.7503720",
note = "ACM SIGSOFT Distinguished Paper Award"
}
Challenges in Behavioral Code Clone Detection
@inproceedings{CodeRelatives.position,
author = "Fang-Hsiang Su and Jonathan Bell and Gail Kaiser",
title = "{Challenges in Behavioral Code Clone Detection (Position Paper)}",
booktitle = "{10th International Workshop on Software Clones (IWSC), affiliated with IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER)}",
month = "March",
year = "2016",
volume = "3",
pages = "21--22",
url = "http://dx.doi.org/10.1109/SANER.2016.75",
note = "People's Choice Award for Best Position Paper."
}