| Issue | Security and Safety, Volume 4, 2025 |
|---|---|
| Article Number | 2025014 |
| Number of page(s) | 24 |
| Section | Other Fields |
| DOI | https://doi.org/10.1051/sands/2025014 |
| Published online | 23 October 2025 |
Research Article
PUMA: Secure inference of LLaMA-7B in five minutes
1 Ant Group, Beijing, 100081, China
2 National University of Singapore, Singapore, 119260, Singapore
3 Singapore University of Technology and Design, Singapore, 487372, Singapore
* Corresponding authors
Received: 15 July 2025
Revised: 14 October 2025
Accepted: 16 October 2025
Abstract
Transformer models (e.g., BERT and GPT) have come to dominate machine learning tasks. Many cloud companies now provide services based on Transformer models, such as translation and text-to-speech conversion. However, such services inevitably require access to the client’s data, which may contain sensitive information. In principle, running these services under secure multi-party computation (MPC) could protect clients’ privacy. However, current MPC frameworks remain limited in model performance, efficiency, deployment, and functionality, especially when facing complex Transformer models. To this end, we propose an MPC framework, PUMA, to enable secure and efficient Transformer model inference. We first design high-quality approximations for the bottleneck functions in Transformers, such as GELU and Softmax, reducing computation and communication costs by about 20–76% compared with state-of-the-art works, without any performance drop. We then provide concrete instantiations of secure Embedding and LayerNorm; these implementations produce correct results and are compatible with the system architectures of cleartext Transformer models. Finally, we conducted extensive experiments on six popular benchmarks: text classification/generation/summarization/translation, audio-to-text, and image-to-text. The results show that PUMA finishes most tasks within several minutes, with model performance (e.g., accuracy) comparable to cleartext, and can even evaluate LLaMA-7B in less than 5 minutes to generate one token.
Key words: Privacy / Security / Secure Three-Party Computation / Privacy-Preserving Machine Learning / Large Language Models
Citation: Dong Y, Lu WJ, Zheng Y, Wu H, Zhao D, Tan J, Huang Z, Hong C, Wei T, Chen WG and Zhou J. Puma: Secure inference of LLaMA-7B in five minutes. Security and Safety 2025; 4: 2025014. https://doi.org/10.1051/sands/2025014
© The Author(s) 2025. Published by EDP Sciences and China Science Publishing & Media Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.