Data Augmentation for Supervised Code Translation Learning

Authored by: Binger Chen, Jacek Golebiowski, Ziawasch Abedjan
Abstract: Data-driven program translation has recently been the focus of several lines of research. A common and robust strategy is supervised learning. However, there is typically a lack of parallel training data, i.e., pairs of code snippets in the source and target language. While many data augmentation techniques exist in the domain of natural language processing, they cannot be easily adapted to tackle code translation due to the unique restrictions of programming languages. In this paper, we develop a novel rule-based augmentation approach tailored for code translation data, and a novel retrieval-based approach that combines code samples from unorganized big code repositories to obtain new training data. Both approaches are language-independent. We perform an extensive empirical evaluation on existing Java-C# benchmarks showing that our method improves the accuracy of state-of-the-art supervised translation techniques by up to 35%.
Organisation(s): Data Base and Information Systems Section; L3S Research Centre
External Organisation(s): Technische Universität Berlin; Amazon.com, Inc.
Type: Conference contribution
Pages: 444-456
No. of pages: 13
Publication date: 02.07.2024
Publication status: Published
Peer reviewed: Yes
ASJC Scopus subject areas: Computer Science Applications; Software; Safety, Risk, Reliability and Quality
Electronic version(s): https://doi.org/10.1145/3643991.3644923 (Access: Open)

BibTeX

@inproceedings{0cb5630ba0ca435c94674d2060e90623,
title = "Data Augmentation for Supervised Code Translation Learning",
abstract = "Data-driven program translation has recently been the focus of several lines of research. A common and robust strategy is supervised learning. However, there is typically a lack of parallel training data, i.e., pairs of code snippets in the source and target language. While many data augmentation techniques exist in the domain of natural language processing, they cannot be easily adapted to tackle code translation due to the unique restrictions of programming languages. In this paper, we develop a novel rule-based augmentation approach tailored for code translation data, and a novel retrieval-based approach that combines code samples from unorganized big code repositories to obtain new training data. Both approaches are language-independent. We perform an extensive empirical evaluation on existing Java-C\# benchmarks showing that our method improves the accuracy of state-of-the-art supervised translation techniques by up to 35\%.",
author = "Binger Chen and Jacek Golebiowski and Ziawasch Abedjan",
note = "Publisher Copyright: {\textcopyright} 2024 ACM. 21st IEEE/ACM International Conference on Mining Software Repositories, MSR 2024; Conference date: 15-04-2024 through 16-04-2024",
year = "2024",
month = jul,
day = "2",
doi = "10.1145/3643991.3644923",
language = "English",
pages = "444--456",
booktitle = "2024 IEEE/ACM 21st International Conference on Mining Software Repositories",
publisher = "Association for Computing Machinery",
}