Abstract
The confidentiality threat against training data has become a significant security problem in neural language models. Recent studies have shown that memorized training data can be extracted by injecting well-chosen prompts into generative language models. While these attacks have achieved remarkable success in the English-based Transformer architecture, it is unclear whether they are still effective in other language domains. This paper studies the effectiveness of attacks against Korean models and the potential for attack improvements that might be beneficial for future defense studies. The contribution of this study is two-fold. First, we perform a membership inference attack against the state-of-the-art Korean GPT model. We found approximate training data with 20% to 90% precision in the top-100 samples and confirmed that the proposed attack technique for naive GPT is valid across the language domains. Second, in this process, we observed that the redundancy of the selected sentences could hardly be detected with the existing attack method. Since the information appearing in a few documents is more likely to be meaningful, it is desirable to increase the uniqueness of the sentences to improve the effectiveness of the attack. Thus, we propose a deduplication strategy to replace the traditional word-level similarity metric with the BPE token level. Our proposed strategy reduces 6% to 22% of the underestimated samples among selected ones, thereby improving precision by up to 7%p. As a result, we show that considering both language- and model-specific characteristics is essential to improve the effectiveness of attack strategies. We also discuss possible mitigations against the MI attacks on the general language models.
Original language | English |
---|---|
Pages (from-to) | 10207-10217 |
Number of pages | 11 |
Journal | IEEE Access |
Volume | 11 |
DOIs | |
Publication status | Published - 2023 |
Bibliographical note
Funding Information:This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2019R1A2C1088802).
Publisher Copyright:
© 2013 IEEE.
All Science Journal Classification (ASJC) codes
- Computer Science(all)
- Materials Science(all)
- Engineering(all)
- Electrical and Electronic Engineering