DIRECT : A Transformer-based Model for Decompiled Identifier Renaming

Vikram Nitin, Anthony Saieva, Baishakhi Ray and Gail Kaiser. DIRECT : A Transformer-based Model for Decompiled Identifier Renaming. 1st Workshop on Natural Language Processing for Programming (NLP4Prog), co-located with ACL-IJCNLP, Virtual, August 2021, pp. 48-57. http://dx.doi.org/10.18653/v1/2021.nlp4prog-1.6.

Decompiling binary executables to high-level code is an important step in reverse engineering scenarios, such as malware analysis and legacy code maintenance. However, the generated high-level code is difficult to understand since the original variable names are lost. In this paper, we leverage transformer models to reconstruct the original variable names from decompiled code. Inherent differences between code and natural language present certain challenges in applying conventional transformer-based architectures to variable name recovery. We propose DIRECT, a novel transformer-based architecture customized specifically for the task at hand. We evaluate our model on a dataset of decompiled functions and find that DIRECT outperforms the previous state-of-the-art model by up to 20%. We also present ablation studies evaluating the impact of each of our modifications. We make the source code of DIRECT available to encourage reproducible research.

@inproceedings{direct,
author = {Vikram Nitin and Anthony Saieva and Baishakhi Ray and Gail Kaiser},
title = {{DIRECT : A Transformer-based Model for Decompiled Identifier Renaming}},
booktitle = {{1st Workshop on Natural Language Processing for Programming (NLP4Prog), co-located with ACL-IJCNLP}},
month = {August},
year = {2021},
pages = {48--57},
location = {Virtual},
url = {http://dx.doi.org/10.18653/v1/2021.nlp4prog-1.6},
}