| Issue | Security and Safety, Volume 4, 2025 |
|---|---|
| Article Number | 2025014 |
| Number of page(s) | 24 |
| Section | Other Fields |
| DOI | https://doi.org/10.1051/sands/2025014 |
| Published online | 23 October 2025 |
Research Article
PUMA: Secure inference of LLaMA-7B in five minutes
1 Ant Group, Beijing, 100081, China
2 National University of Singapore, Singapore, 119260, Singapore
3 Singapore University of Technology and Design, Singapore, 487372, Singapore
* Corresponding authors
Received: 15 July 2025
Revised: 14 October 2025
Accepted: 16 October 2025
Abstract
Transformer models (e.g., BERT and GPT) have come to dominate machine learning tasks. Many cloud companies now provide services based on Transformer models, such as translation and text-to-speech conversion. However, such services inevitably require access to the client’s data, which may contain sensitive information. In principle, running these services under secure multi-party computation (MPC) could protect clients’ privacy. However, current MPC frameworks remain limited in model performance, efficiency, deployment, and functionality, especially when facing complex Transformer models. To this end, we propose an MPC framework, PUMA, to enable secure and efficient Transformer model inference. We first design high-quality approximations for the bottleneck functions in Transformers, such as GELU and Softmax, reducing computation and communication costs by about 20–76% compared with state-of-the-art works, without any performance drop. We then provide concrete instantiations of secure Embedding and LayerNorm; these implementations produce correct results and are compatible with the system architectures of cleartext Transformer models. Finally, we conducted extensive experiments on six popular benchmarks: text classification/generation/summarization/translation, audio-to-text, and image-to-text. The results show that PUMA finishes most tasks within several minutes, with model performance (e.g., accuracy) comparable to cleartext, and can even evaluate LLaMA-7B in less than 5 minutes to generate one token.
Key words: Privacy / Security / Secure Three-Party Computation / Privacy-Preserving Machine Learning / Large Language Models
Citation: Dong Y, Lu WJ, Zheng Y, Wu H, Zhao D, Tan J, Huang Z, Hong C, Wei T, Chen WG and Zhou J. Puma: Secure inference of LLaMA-7B in five minutes. Security and Safety 2025; 4: 2025014. https://doi.org/10.1051/sands/2025014
© The Author(s) 2025. Published by EDP Sciences and China Science Publishing & Media Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.