Neural Network Guided Evolutionary Fuzzing for Finding Traffic Violations of Autonomous Vehicles
Ziyuan Zhong, Gail Kaiser and Baishakhi Ray. Neural Network Guided Evolutionary Fuzzing for Finding Traffic Violations of Autonomous Vehicles. IEEE Transactions on Software Engineering (TSE), 49(4):1860-1875, April 2023. https://doi.org/10.1109/TSE.2022.3195640. (Also appeared at the 45th International Conference on Software Engineering (ICSE) as a journal-first paper, Melbourne, Australia, May 2023.)
Self-driving cars and trucks, autonomous vehicles (AVs), should not be accepted by regulatory bodies and the public until they have much higher confidence in their safety and reliability — which can most practically and convincingly be achieved by testing. But existing testing methods are inadequate for checking the end-to-end behaviors of AV controllers against complex, real-world corner cases involving interactions with multiple independent agents such as pedestrians and human-driven vehicles. While test-driving AVs on streets and highways fails to capture many rare events, existing simulation-based testing methods mainly focus on simple scenarios and do not scale well for complex driving situations that require sophisticated awareness of the surroundings. To address these limitations, we propose a new fuzz testing technique, called AutoFuzz, which can leverage widely-used AV simulators’ API grammars to generate semantically and temporally valid complex driving scenarios (sequences of scenes). To efficiently search for traffic-violation-inducing scenarios in a large search space, we propose a constrained neural network (NN) evolutionary search method to optimize AutoFuzz. Evaluation of our prototype on one state-of-the-art learning-based controller, two rule-based controllers, and one industrial-grade controller in five scenarios shows that AutoFuzz efficiently finds hundreds of traffic violations in high-fidelity simulation environments. For each scenario, AutoFuzz can find on average 10-39% more unique traffic violations than the best-performing baseline method. Further, fine-tuning the learning-based controller with the traffic violations found by AutoFuzz successfully reduced the traffic violations found in the new version of the AV controller software.
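As a rough illustration of the constrained NN-guided evolutionary search the abstract describes, the sketch below evolves bounded scenario parameter vectors and uses a cheap surrogate scoring function standing in for the trained neural network to pre-rank mutants before the expensive simulator check. The three-parameter encoding, the bounds, and all function names are hypothetical, not AutoFuzz's actual interface.

```python
import random

# Hypothetical scenario encoding: each gene is a bounded numeric parameter
# (e.g., pedestrian spawn time, lead-vehicle speed, lateral offset). The
# bounds stand in for the simulator API grammar's validity constraints.
BOUNDS = [(0.0, 10.0), (0.0, 30.0), (-3.5, 3.5)]

def random_scenario():
    return [random.uniform(lo, hi) for lo, hi in BOUNDS]

def mutate(scenario, rate=0.3):
    # Clamp each mutated gene back into its bounds (the "constrained" part).
    out = []
    for (lo, hi), gene in zip(BOUNDS, scenario):
        if random.random() < rate:
            gene = min(hi, max(lo, gene + random.gauss(0, (hi - lo) * 0.1)))
        out.append(gene)
    return out

def evolve(run_simulation, surrogate_score, pop_size=20, generations=10):
    """Evolutionary search: `surrogate_score` (a stand-in for the trained NN)
    pre-ranks mutants so only promising ones reach the costly simulator,
    where `run_simulation` returns True if a traffic violation occurred."""
    population = [random_scenario() for _ in range(pop_size)]
    violations = []
    for _ in range(generations):
        candidates = [mutate(s) for s in population for _ in range(3)]
        candidates.sort(key=surrogate_score, reverse=True)  # NN filter
        population = candidates[:pop_size]
        for scenario in population:
            if run_simulation(scenario):
                violations.append(scenario)
    return violations
```

In the actual system the surrogate would be retrained as simulation results accumulate; here it is fixed purely to keep the loop structure visible.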
@article{AutoCop,
author = {Ziyuan Zhong and Gail Kaiser and Baishakhi Ray},
title = {{Neural Network Guided Evolutionary Fuzzing for Finding Traffic Violations of Autonomous Vehicles}},
journal = {IEEE Transactions on Software Engineering (TSE)},
year = {2023},
month = {April},
volume = {49},
number = {4},
pages = {1860-1875},
url = {https://doi.org/10.1109/TSE.2022.3195640},
}
VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements
Yangruibo Ding, Sahil Suneja, Yunhui Zheng, Jim Laredo, Alessandro Morari, Gail Kaiser and Baishakhi Ray. VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements. 29th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Virtual, March 2022. 36.2% accepted. https://doi.org/10.1109/SANER53432.2022.00114. Video at https://www.youtube.com/watch?v=caoQkTaxyYc
Automatically locating vulnerable statements in source code is crucial to assure software security and alleviate developers’ debugging efforts. This becomes even more important in today’s software ecosystem, where vulnerable code can flow easily and unwittingly within and across software repositories like GitHub. Across such millions of lines of code, traditional static and dynamic approaches struggle to scale. Although existing machine-learning-based approaches look promising in such a setting, most work detects vulnerable code at a higher granularity — at the method or file level. Thus, developers still need to inspect a significant amount of code to locate the vulnerable statement(s) that need to be fixed.
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements. Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph and effectively understand code semantics and vulnerable patterns. To study VELVET’s effectiveness, we use an off-the-shelf synthetic dataset and a recently published real-world dataset. In the static analysis setting, where vulnerable functions are not detected in advance, VELVET achieves 4.5× better performance than the baseline static analyzers on the real-world data. For the isolated vulnerability localization task, where we assume the vulnerability of a function is known while the specific vulnerable statement is unknown, we compare VELVET with several neural networks that also attend to local and global context of code. VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively, outperforming the baseline deep learning models by 5.3-29.0%.
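One simple way to realize the ensemble idea in the abstract is score averaging: each trained model assigns every statement a vulnerability score, and the statement with the highest combined score is the top-1 localization. The sketch below shows only that combination step; the graph-based and sequence-based model internals, and VELVET's actual combination strategy, are not reproduced here, and the score dictionaries are hypothetical stand-ins.

```python
def ensemble_localize(graph_scores, seq_scores):
    """Return the statement index with the highest averaged vulnerability
    score across the two (stand-in) model outputs. Both inputs map
    statement index -> score and must cover the same statements."""
    assert graph_scores.keys() == seq_scores.keys()
    combined = {stmt: (graph_scores[stmt] + seq_scores[stmt]) / 2
                for stmt in graph_scores}
    return max(combined, key=combined.get)

# Toy example: the graph model strongly favors statement 7, the sequence
# model mildly favors statement 3; the average settles on statement 7.
graph = {3: 0.2, 7: 0.9, 12: 0.1}
seq = {3: 0.7, 7: 0.6, 12: 0.2}
```

Averaging is just one plausible combiner; weighted voting or a learned meta-classifier would slot into the same interface.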
@INPROCEEDINGS{velvet,
author={Ding, Yangruibo and Suneja, Sahil and Zheng, Yunhui and Laredo, Jim and Morari, Alessandro and Kaiser, Gail and Ray, Baishakhi},
booktitle={2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)},
title={{VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements}},
year={2022},
month={March},
volume={},
number={},
pages={959-970},
doi={10.1109/SANER53432.2022.00114},
url={https://doi.org/10.1109/SANER53432.2022.00114},
}
Update with Care: Testing Candidate Bug Fixes and Integrating Selective Updates through Binary Rewriting
Anthony Saieva and Gail Kaiser. Update with Care: Testing Candidate Bug Fixes and Integrating Selective Updates through Binary Rewriting. The Journal of Systems & Software (JSS), 191:111381, September 2022. https://doi.org/10.1016/j.jss.2022.111381.
Enterprise software updates depend on the interaction between user and developer organizations. This interaction becomes especially complex when a single developer organization writes software that services hundreds of different user organizations. Miscommunication during patching and deployment efforts leads to insecure or malfunctioning software installations. While developers oversee the code, the update process starts and ends outside their control. Since developer test suites may fail to capture buggy behavior, finding and fixing these bugs starts with user-generated bug reports and third-party disclosures. The process ends when the fixed code is deployed in production. Any friction between user and developer results in a delay in patching critical bugs.
Two common causes for friction are a failure to replicate user-specific circumstances that cause buggy behavior and incompatible software releases that break critical functionality. Existing test generation techniques are insufficient: they fail to test candidate patches for post-deployment bugs and to test whether a new release adversely affects customer workloads. With existing test generation and deployment techniques, users cannot choose (nor validate) compatible portions of new versions and retain their previous version’s functionality.
We present two new technologies to alleviate this friction. First, Test Generation for Ad Hoc Circumstances transforms buggy executions into test cases. Second, Binary Patch Decomposition allows users to select the compatible pieces of update releases. By sharing specific context around buggy behavior, developers can create specific test cases that demonstrate whether their fixes are appropriate. When fixes are distributed with extra context, users can incorporate only the updates that guarantee compatibility between buggy and fixed versions.
We use change analysis in combination with binary rewriting to transform the old executable and buggy execution into a test case including the developer’s prospective changes that let us generate and run targeted tests for the candidate patch. We also provide analogous support to users, to selectively validate and patch their production environments with only the desired bug-fixes from new version releases.
This paper presents a new patching workflow that allows developers to validate prospective patches and users to select which updates they would like to apply, along with two new technologies that make it possible. We demonstrate that our technique constructs test cases more effectively and more efficiently than traditional test generation techniques on a collection of real-world bugs, and that it provides the ability for flexible updates in real-world scenarios.
@article{attune-journal,
title = {{Update with Care: Testing Candidate Bug Fixes and Integrating Selective Updates through Binary Rewriting}},
journal = {Journal of Systems and Software (JSS)},
volume = {191},
pages = {111381},
year = {2022},
month = {September},
issn = {0164-1212},
doi = {10.1016/j.jss.2022.111381},
url = {https://www.sciencedirect.com/science/article/pii/S0164121222001054},
author = {Anthony Saieva and Gail Kaiser},
keywords = {Test generation, Change analysis, Binary analysis, Binary rewriting},
}
DIRECT : A Transformer-based Model for Decompiled Identifier Renaming
Vikram Nitin, Anthony Saieva, Baishakhi Ray and Gail Kaiser. DIRECT : A Transformer-based Model for Decompiled Identifier Renaming. 1st Workshop on Natural Language Processing for Programming (NLP4Prog), co-located with ACL-IJCNLP, Virtual, August 2021, pp. 48-57. http://dx.doi.org/10.18653/v1/2021.nlp4prog-1.6.
Decompiling binary executables to high-level code is an important step in reverse engineering scenarios, such as malware analysis and legacy code maintenance. However, the generated high-level code is difficult to understand since the original variable names are lost. In this paper, we leverage transformer models to reconstruct the original variable names from decompiled code. Inherent differences between code and natural language present certain challenges in applying conventional transformer-based architectures to variable name recovery. We propose DIRECT, a novel transformer-based architecture customized specifically for the task at hand. We evaluate our model on a dataset of decompiled functions and find that DIRECT outperforms the previous state-of-the-art model by up to 20%. We also present ablation studies evaluating the impact of each of our modifications. We make the source code of DIRECT available to encourage reproducible research.
@inproceedings{direct,
author = {Vikram Nitin and Anthony Saieva and Baishakhi Ray and Gail Kaiser},
title = {{DIRECT : A Transformer-based Model for Decompiled Identifier Renaming}},
booktitle = {{1st Workshop on Natural Language Processing for Programming (NLP4Prog), co-located with ACL-IJCNLP}},
month = {August},
year = {2021},
pages = {48-57},
location={Virtual},
url = {http://dx.doi.org/10.18653/v1/2021.nlp4prog-1.6},
}
Metamorphic Detection of Repackaged Malware
Shirish Singh and Gail Kaiser. Metamorphic Detection of Repackaged Malware. 6th International Workshop on Metamorphic Testing (MET), co-located with ICSE, Virtual, June 2021, pp. 9-16. https://doi.org/10.1109/MET52542.2021.00009.
Machine learning-based malware detection systems are often vulnerable to evasion attacks, in which a malware developer manipulates their malicious software such that it is misclassified as benign. Such software hides some properties of the real class or adopts some properties of a different class by applying small perturbations. A special case of evasive malware hides by repackaging a bona fide benign mobile app to contain malware in addition to the original functionality of the app, thus retaining most of the benign properties of the original app. We present a novel malware detection system based on metamorphic testing principles that can detect such benign-seeming malware apps. We apply metamorphic testing to the feature representation of the mobile app, rather than to the app itself. That is, the source input is the original feature vector for the app and the derived input is that vector with selected features removed. If the app was originally classified benign, and is indeed benign, the output for the source and derived inputs should be the same class, i.e., benign, but if they differ, then the app is exposed as (likely) malware. Malware apps originally classified as malware should retain that classification, since only features prevalent in benign apps are removed. This approach enables the machine learning model to classify repackaged malware with reasonably few false negatives and false positives. Our training pipeline is simpler than many existing ML-based malware detection methods, as the network is trained end-to-end to jointly learn appropriate features and to perform classification. We pre-trained our classifier model on 3 million apps collected from the widely-used AndroZoo dataset. We perform an extensive study on other publicly available datasets to show our approach’s effectiveness in detecting repackaged malware with more than 94% accuracy, 0.98 precision, 0.95 recall, and 0.96 F1 score.
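The metamorphic relation the abstract describes can be sketched directly: classify the original feature vector (source input), classify the vector with benign-prevalent features removed (derived input), and flag a flipped "benign" verdict as likely repackaged malware. The classifier, feature names, and verdict labels below are toy stand-ins, not the paper's model or API.

```python
def metamorphic_check(classify, features, benign_features):
    """Apply the metamorphic relation to one app's feature vector.
    `classify` maps a feature dict to "benign" or "malware";
    `benign_features` names the benign-prevalent features to remove."""
    if classify(features) == "malware":
        return "malware"  # malware verdicts stand; only benign ones are probed
    # Derived input: zero out the selected benign-prevalent features.
    derived = {f: (0 if f in benign_features else v)
               for f, v in features.items()}
    if classify(derived) == "benign":
        return "benign"  # verdict stable under the relation: truly benign
    return "repackaged-malware"  # verdict flipped: benign features were masking

# Toy stand-in classifier: benign-looking signals must outweigh suspicious ones.
def toy_classifier(f):
    benign_signal = f["has_ui"] + f["uses_ads"]
    return "benign" if benign_signal >= 2 * f["sends_premium_sms"] else "malware"
```

A repackaged app such as `{"has_ui": 1, "uses_ads": 1, "sends_premium_sms": 1}` squeaks past the toy classifier as benign, but loses that verdict once the benign-prevalent features are removed, exposing it.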
@INPROCEEDINGS{9477656,
author={Singh, Shirish and Kaiser, Gail},
booktitle={IEEE/ACM 6th International Workshop on Metamorphic Testing (MET), co-located with ICSE},
title={{Metamorphic Detection of Repackaged Malware}},
year={2021},
month = {June},
volume={},
number={},
pages={9-16},
location={Virtual},
url = {https://doi.org/10.1109/MET52542.2021.00009},
}
Outreach Service
Are you an unpaid volunteer for a non-profit organization in the New York City area? Have you completed COMS W4156 Advanced Software Engineering (or equivalent) with grade B+ or higher? Does your work at the non-profit, or could your work there, leverage your software engineering skills to develop and maintain software important for the non-profit’s mission, not just their website? Would you like to receive academic credit for your work in 3998, 4901, or 6901?
Contact Professor Gail Kaiser to discuss opportunities: kaiser@cs.columbia.edu.
Replay without Recording of Production Bugs for Service Oriented Applications
Presented by Jonathan Bell at 33rd ACM/IEEE International Conference on Automated Software Engineering (ASE) in September 2018. https://doi.org/10.1145/3238147.3238186.
Software available at https://github.com/Programming-Systems-Lab/Parikshan. The software is not maintained.
Short time-to-localize and time-to-fix for production bugs is extremely important for any 24×7 service-oriented application (SOA). Debugging buggy behavior in deployed applications is hard, as it requires careful reproduction of a similar environment and workload. Prior approaches for automatically reproducing production failures do not scale to large SOA systems. Our key insight is that many failures in SOA systems (e.g., many semantic and performance bugs) can be reproduced automatically solely by relaying network packets to replicas of suspect services, an insight that we validated through a manual study of 16 real bugs across five different systems. This paper presents Parikshan, an application monitoring framework that leverages user-space virtualization and network proxy technologies to provide a sandbox “debug” environment. In this “debug” environment, developers are free to attach debuggers and analysis tools without impacting performance or correctness of the production environment. In comparison to existing monitoring solutions that can slow down production applications, Parikshan allows application monitoring at significantly lower overhead.
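The core relaying idea — duplicate each incoming request to both the production service and a sandboxed debug replica, while only the production response reaches the client — can be sketched in a few lines. The callable interface below is a hypothetical stand-in; the real system proxies live network traffic to containerized replicas.

```python
def relay(packets, production, debug_replica):
    """Forward each packet to production and mirror it to a sandboxed
    replica. Only production responses are returned to the client, and
    any failure inside the replica never propagates to the client path."""
    responses = []
    for pkt in packets:
        responses.append(production(pkt))  # client-visible path
        try:
            debug_replica(pkt)             # best-effort mirror; developers
        except Exception:                  # may attach debuggers here without
            pass                           # affecting production correctness
    return responses
```

The try/except around the replica call is the crux: a crash or slowdown inside the debug sandbox (say, under a heavyweight analysis tool) leaves the production responses untouched.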
Binary Quilting to Generate Patched Executables without Compilation
Anthony Saieva and Gail Kaiser. Binary Quilting to Generate Patched Executables without Compilation. ACM Workshop on Forming an Ecosystem Around Software Transformation (FEAST), Virtual, November 2020, pp. 3-8. https://doi.org/10.1145/3411502.3418424
When applying patches, or dealing with legacy software, users are often reluctant to change the production executables for fear of unwanted side effects. This results in many active systems running vulnerable or buggy code even though the problems have already been identified and resolved by developers. Furthermore when dealing with old or proprietary software, users can’t view or compile source code so any attempts to change the application after distribution requires binary level manipulation. We present a new technique we call binary quilting that allows users to apply the designated minimum patch that preserves core semantics without fear of unwanted side effects introduced either by the build process or by additional code changes. Unlike hot patching, binary quilting is a one-time procedure that creates an entirely new reusable binary. Our case studies show the efficacy of this technique on real software in real patching scenarios.
@inproceedings{10.1145/3411502.3418424,
author = {Saieva, Anthony and Kaiser, Gail},
title = {{Binary Quilting to Generate Patched Executables without Compilation}},
year = {2020},
month = {November},
url = {https://doi.org/10.1145/3411502.3418424},
doi = {10.1145/3411502.3418424},
booktitle = {ACM Workshop on Forming an Ecosystem Around Software Transformation (FEAST)},
pages = {3-8},
location = {Virtual},
}
Binary Quilting to Generate Patched Executables without Compilation
Presented by Anthony Saieva at ACM Workshop on Forming an Ecosystem Around Software Transformation (FEAST) on November 13, 2020. https://doi.org/10.1145/3411502.3418424
Ad hoc Test Generation Through Binary Rewriting
Presented by Anthony Saieva at 20th IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM) on September 27, 2020. https://doi.org/10.1109/SCAM51674.2020.00018
Software available at https://github.com/Programming-Systems-Lab/ATTUNE.
When a security vulnerability or other critical bug is not detected by the developers’ test suite, and is discovered post-deployment, developers must quickly devise a new test that reproduces the buggy behavior. Then the developers need to test whether their candidate patch indeed fixes the bug, without breaking other functionality, while racing to deploy before attackers pounce on exposed user installations. This can be challenging when factors in a specific user environment triggered the bug. If enabled, however, record-replay technology faithfully replays the execution in the developer environment as if the program were executing in that user environment under the same conditions as the bug manifested. This includes intermediate program states dependent on system calls, memory layout, etc. as well as any externally-visible behavior. Many modern record-replay tools integrate interactive debuggers, to help locate the root cause, but don’t help the developers test whether their patch indeed eliminates the bug under those same conditions. In particular, modern record-replay tools that reproduce intermediate program state cannot replay recordings made with one version of a program using a different version of the program where the differences affect program state. This work builds on record-replay and binary rewriting to automatically generate and run targeted tests for candidate patches significantly faster and more efficiently than traditional test suite generation techniques like symbolic execution. These tests reflect the arbitrary (ad hoc) user and system circumstances that uncovered the bug, enabling developers to check whether a patch indeed fixes that bug. 
The tests essentially replay recordings made with one version of a program using a different version of the program, even when the differences impact program state, by manipulating both the binary executable and the recorded log to result in an execution consistent with what would have happened had the patched version executed in the user environment under the same conditions where the bug manifested with the original version. Our approach also enables users to make new recordings of their own workloads with the original version of the program, and automatically generate and run the corresponding ad hoc tests on the patched version, to validate that the patch does not break functionality they rely on.
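At a very high level, the workflow turns a recorded buggy execution into a targeted test: replay the recorded inputs against the candidate patch and pass only if the recorded failure no longer occurs. The sketch below uses plain Python callables as hypothetical stand-ins; the real system rewrites binary executables and replays system-call-level logs, which this deliberately does not attempt to model.

```python
def make_adhoc_test(recording, patched_program):
    """Build a targeted test from a recorded buggy execution.
    `recording` holds the inputs that triggered the bug and the observed
    failure; `patched_program` is the candidate fix under test. The test
    returns True iff the recorded failure no longer reproduces."""
    def run_test():
        outcome = patched_program(*recording["inputs"])
        return outcome != recording["failure"]  # True = patch fixed the bug
    return run_test

# Hypothetical example: the recording captured a division-by-zero failure,
# and the candidate patch guards the divisor.
def patched(a, b):
    return a / b if b else None

recording = {"inputs": (2, 0), "failure": "ZeroDivisionError"}
```

Running `make_adhoc_test(recording, patched)()` replays the failing inputs against the patch, which is the same check the binary-level tooling performs, just without the record-replay machinery.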