There will inevitably continue to be bugs that are not detected by any testing approach, but eventually impact users who then file bug reports. Reproducing field failures in the development environment can be difficult, however, especially in the case of software that behaves non-deterministically, relies on remote resources, or has complex reproduction steps (the users may not even know what led up to triggering the flaw, particularly in the case of software interacting with external devices, databases, etc. in addition to human users). So a record/replay approach is used to capture the state of the system just before a bug is encountered, so the steps leading up to this state can be replayed later in the lab. The naive approach of constant logging in anticipation of a defect tends to produce unacceptably high overheads (reaching 2,000+ %) in the deployed application. Novel solutions that lower this overhead typically limit the depth of information recorded (e.g., to use only a stack trace, rather than a complete state history) or the breadth of information recorded (e.g., to only log information during execution of a particular subsystem that a developer identifies as potentially buggy). But limiting the depth of information gathered may fail to reproduce an error if the defect does not present itself immediately and limiting logging to a specific subcomponent of an application makes it only possible to reproduce the bug if it occurred within that subcomponent.
Our new technique, called “Chronicler”, instead captures program execution in a manner that allows for deterministic replay in the lab with very low overhead. The key insight is to log sources of non-determinism only at the library level – allowing for a lightweight recording process while still supporting a complete replay for debugging purposes (programs with no sources of non-determinism, e.g., no user interactions, are trivial to replay – just provide the same inputs). When a failure occurs, Chronicler automatically generates a test case that consists of the inputs (e.g., file or network I/O, user inputs, random numbers, etc.) that caused the system to fail. This general approach can be applied to any “managed” language that runs in a language virtual machine (for instance, JVM or Microsoft’s .NET CLR), requiring no modifications to the interpreter or environment, and thus addresses a different class of programs than related work for non-managed languages like C and C++.
We expect to extend and use this tool as part of the Mutable Replay project, and are seeking new project students in tandem with that effort.
Contact Professor Gail Kaiser (firstname.lastname@example.org)
Jonathan Bell, Nikhil Sarda and Gail Kaiser. Chronicler: Lightweight Recording to Reproduce Field Failures. 35th International Conference on Software Engineering, May 2013, pp. 362-371. See teaser video at https://www.youtube.com/watch?v=4IYGfdDnAJg.