Open Access

Security and Safety, Volume 4, 2025

| Article Number | 2025014 |
|---|---|
| Number of page(s) | 24 |
| Section | Other Fields |
| DOI | https://doi.org/10.1051/sands/2025014 |
| Published online | 23 October 2025 |
- Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, et al. (eds.). Advances in Neural Information Processing Systems, Vol. 30. Curran Associates, 2017.
- Radford A and Narasimhan K. Improving language understanding by generative pre-training, 2018.
- Zhuge M, Gao D, Fan D-P, et al. Kaleido-BERT: Vision-language pre-training on fashion domain. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, 12647–57.
- Appalaraju S, Jasani B, Kota BU, et al. DocFormer: End-to-end transformer for document understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, 993–1003.
- Ding S, Shang J, Wang S, et al. ERNIE-Doc: A retrospective long-document modeling transformer. In: Zong C, Xia F, Li W, et al. (eds.). Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Vol. 1: Long Papers, 2021, 2914–27. [Online]. Available: https://aclanthology.org/2021.acl-long.227
- Kim G, Hong T, Yim M, et al. OCR-free document understanding transformer. In: European Conference on Computer Vision. Springer, 2022, 498–517.
- Soifer J, Li J, Li M, et al. Deep learning inference service at Microsoft. In: 2019 USENIX Conference on Operational Machine Learning (OpML 19), 2019, 15–7.
- Shamir A. How to share a secret. Commun ACM 1979; 22: 612–3.
- Yao AC-C. How to generate and exchange secrets. In: 27th Annual Symposium on Foundations of Computer Science (SFCS 1986). IEEE, 1986, 162–7.
- Goldreich O, Micali S and Wigderson A. How to play any mental game. In: Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing (STOC '87). New York, NY, USA: Association for Computing Machinery, 1987, 218–29.
- Li D, Wang H, Shao R, et al. MPCFormer: Fast, performant and private transformer inference with MPC. In: The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=CWmvjOEhgH-
- Lu W-j, Huang Z, Gu Z, et al. BumbleBee: Secure two-party inference framework for large transformers. Cryptology ePrint Archive, 2023.
- Gupta K, Jawalkar N, Mukherjee A, et al. SIGMA: Secure GPT inference with function secret sharing. Cryptology ePrint Archive, 2023.
- Hao M, Li H, Chen H, et al. Iron: Private inference on transformers. In: Oh AH, Agarwal A, Belgrave D, et al. (eds.). Advances in Neural Information Processing Systems, 2022. [Online]. Available: https://openreview.net/forum?id=deyqjpcTfsG
- Hou X, Liu J, Li J, et al. CipherGPT: Secure two-party GPT inference. Cryptology ePrint Archive, 2023.
- Pang Q, Zhu J, Möllering H, et al. BOLT: Privacy-preserving, accurate and efficient inference for transformers. Cryptology ePrint Archive, 2023.
- Akimoto Y, Fukuchi K, Akimoto Y, et al. Privformer: Privacy-preserving transformer with MPC. In: 2023 IEEE 8th European Symposium on Security and Privacy (EuroS&P). Los Alamitos, CA, USA: IEEE Computer Society, 2023, 392–410. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/EuroSP57164.2023.00031
- Liu X and Liu Z. LLMs can understand encrypted prompt: Towards privacy-computing friendly transformers, 2023.
- Kim S, Gholami A, Yao Z, et al. I-BERT: Integer-only BERT quantization. In: International Conference on Machine Learning. PMLR, 2021, 5506–18.
- Wu H, Fang W, Zheng Y, et al. Ditto: Quantization-aware secure inference of transformers upon MPC. arXiv preprint arXiv:2405.05525, 2024.
- Kumar A, Raghunathan A, Jones R, et al. Fine-tuning can distort pretrained features and underperform out-of-distribution. arXiv preprint arXiv:2202.10054, 2022.
- Liang Z, Wang P, Zhang R, et al. MERGE: Fast private text generation, 2023.
- Zhang J, Yang X, He L, et al. Secure transformer inference made non-interactive. In: NDSS, 2025. [Online]. Available: https://www.ndss-symposium.org/ndss-paper/secure-transformer-inference-made-non-interactive/
- Knott B, Venkataraman S, Hannun A, et al. CrypTen: Secure multi-party computation meets machine learning. arXiv preprint arXiv:2109.00984, 2021.
- Dong X, Bao J, Zhang T, et al. Bootstrapped masked autoencoders for vision BERT pretraining. In: European Conference on Computer Vision. Springer, 2022, 247–64.
- Araki T, Furukawa J, Lindell Y, et al. High-throughput semi-honest secure three-party computation with an honest majority. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, 805–17.
- Mohassel P and Rindal P. ABY3: A mixed protocol framework for machine learning. In: Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. New York, NY, USA: Association for Computing Machinery, 2018, 35–52. [Online]. Available: https://doi.org/10.1145/3243734.3243760
- Devlin J, Chang M-W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2019.
- Yang Z, Dai Z, Yang Y, et al. XLNet: Generalized autoregressive pretraining for language understanding. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems. Red Hook, NY, USA: Curran Associates Inc., 2019.
- Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Chen H, Wang Y, Guo T, et al. Pre-trained image processing transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, 12299–310.
- Hendrycks D and Gimpel K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
- Fusing convolution and batch norm using custom function, 2022. [Online]. Available: https://pytorch.org/tutorials/intermediate/custom_function_conv_bn_tutorial.html
- Lu W-j, Fang Y, Huang Z, et al. Faster secure multiparty computation of adaptive gradient descent. In: Proceedings of the 2020 Workshop on Privacy-Preserving Machine Learning in Practice (PPMLP '20). New York, NY, USA: Association for Computing Machinery, 2020, 47–9.
- Keller M. MP-SPDZ: A versatile framework for multi-party computation. In: Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security, 2020, 1575–90.
- Canetti R. Security and composition of multiparty cryptographic protocols. J Cryptol 2000; 13: 143–202.
- Dalskov A, Escudero D and Keller M. Secure evaluation of quantized neural networks. Proc Priv Enhanc Technol 2020; 2020: 355–75.
- Wagh S, Tople S, Benhamouda F, et al. Falcon: Honest-majority maliciously secure framework for private deep learning. arXiv preprint arXiv:2004.02229, 2020.
- Abadi M, Chu A, Goodfellow I, et al. Deep learning with differential privacy. In: Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, 2016, 308–18.
- Ma J, Zheng Y, Feng J, et al. SecretFlow-SPU: A performant and user-friendly framework for privacy-preserving machine learning. In: 2023 USENIX Annual Technical Conference (USENIX ATC 23). Boston, MA: USENIX Association, 2023, 17–33.
- Wolf T, Debut L, Sanh V, et al. Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, 38–45. [Online]. Available: https://www.aclweb.org/anthology/2020.emnlp-demos.6
- Varda K. Protocol buffers: Google's data interchange format. Google Open Source Blog, 2008.
- van Oortmerssen W. FlatBuffers: A memory efficient serialization library, 2014. [Online]. Available: https://android-developers.googleblog.com/2014/06/flatbuffers-memory-efficient.html
- Wang Y, Suh GE, Xiong W, et al. Characterization of MPC-based private inference for transformer-based models. In: 2022 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2022, 187–97.
- Fan X, Chen K, Wang G, et al. NFGen: Automatic non-linear function evaluation code generator for general-purpose MPC platforms. In: Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, 2022, 995–1008.
- Tan S, Knott B, Tian Y, et al. CryptGPU: Fast privacy-preserving machine learning on the GPU. arXiv preprint arXiv:2104.10949, 2021.
- Knott B, Venkataraman S, Hannun A, et al. CrypTen: Secure multi-party computation meets machine learning. Adv Neural Inf Process Syst 2021; 34: 4961–73.
- Heek J, Levskaya A, Oliver A, et al. Flax: A neural network library and ecosystem for JAX, 2023. [Online]. Available: http://github.com/google/flax
- Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res 2020; 21: 5485–551.
- Radford A, Kim JW, Xu T, et al. Robust speech recognition via large-scale weak supervision. In: International Conference on Machine Learning. PMLR, 2023, 28492–518.
- Li M, Lv T, Chen J, et al. TrOCR: Transformer-based optical character recognition with pre-trained models. In: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37, 2023, 13094–102.
- Wang A, Singh A, Michael J, et al. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In: International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=rJ4km2R5t7
- Zhu Q. On the performance of Matthews correlation coefficient (MCC) for imbalanced dataset. Pattern Recognit Lett 2020; 136: 71–80.
- Jelinek F, Mercer RL, Bahl LR, et al. Perplexity – a measure of the difficulty of speech recognition tasks. J Acoust Soc Am 1977; 62: S63.
- Merity S, Xiong C, Bradbury J, et al. Pointer sentinel mixture models, 2016.
- See A, Liu PJ and Manning CD. Get to the point: Summarization with pointer-generator networks. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vol. 1: Long Papers. Vancouver, Canada: Association for Computational Linguistics, 2017, 1073–83. [Online]. Available: https://www.aclweb.org/anthology/P17-1099
- Lin C-Y. ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out, 2004, 74–81.
- Bojar O, Chatterjee R, Federmann C, et al. Findings of the 2016 conference on machine translation. In: Proceedings of the First Conference on Machine Translation. Berlin, Germany: Association for Computational Linguistics, 2016, 131–98. [Online]. Available: http://www.aclweb.org/anthology/W/W16/W16-2301
- Panayotov V, Chen G, Povey D, et al. Librispeech: An ASR corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, 5206–10.
- Wang Y-Y, Acero A and Chelba C. Is word error rate a good indicator for spoken language understanding accuracy? In: 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No. 03EX721). IEEE, 2003, 577–82.
- Lin T-Y, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context. In: Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V. Springer, 2014, 740–55.
- Reiter E. A structured review of the validity of BLEU. Comput Linguist 2018; 44: 393–401.
- Microsoft SEAL (release 4.1). Microsoft Research, Redmond, WA, Jan. 2023. [Online]. Available: https://github.com/Microsoft/SEAL
- Mohassel P and Zhang Y. SecureML: A system for scalable privacy-preserving machine learning. In: 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, 19–38.
- Liu J, Juuti M, Lu Y, et al. Oblivious neural network predictions via MiniONN transformations. In: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, 619–31.
- Mishra P, Lehmkuhl R, Srinivasan A, et al. Delphi: A cryptographic inference service for neural networks. In: 29th USENIX Security Symposium (USENIX Security 20), 2020, 2505–22.
- Patra A, Schneider T, Suresh A, et al. ABY2.0: Improved mixed-protocol secure two-party computation. In: 30th USENIX Security Symposium (USENIX Security 21), 2021, 2165–82.
- Rathee D, Rathee M, Kumar N, et al. CrypTFlow2: Practical 2-party secure inference. New York, NY, USA: Association for Computing Machinery, 2020. [Online]. Available: https://doi.org/10.1145/3372297.3417274
- Rathee D, Rathee M, Goli RKK, et al. SiRNN: A math library for secure RNN inference. arXiv preprint arXiv:2105.04236, 2021.
- Huang Z, Lu W-j, Hong C, et al. Cheetah: Lean and fast secure two-party deep neural network inference. In: 31st USENIX Security Symposium (USENIX Security 22). Boston, MA: USENIX Association, 2022, 809–26.
- Wagh S, Gupta D and Chandran N. SecureNN: 3-party secure computation for neural network training. Proc Priv Enhanc Technol 2019; 2019: 26–49.
- Kumar N, Rathee M, Chandran N, et al. CrypTFlow: Secure TensorFlow inference. arXiv preprint arXiv:1909.07814, 2019.
- Patra A and Suresh A. BLAZE: Blazing fast privacy-preserving machine learning. arXiv preprint arXiv:2005.09042, 2020.
- Dong Y, Xiaojun C, Jing W, et al. Meteor: Improved secure 3-party neural network inference with reducing online communication costs. In: Proceedings of the ACM Web Conference 2023 (WWW '23). New York, NY, USA: Association for Computing Machinery, 2023, 2087–98.
- Byali M, Chaudhari H, Patra A, et al. FLASH: Fast and robust framework for privacy-preserving machine learning. Proc Priv Enhanc Technol 2020; 2020: 459–80.
- Dalskov A, Escudero D and Keller M. Fantastic four: Honest-majority four-party secure computation with malicious security. In: 30th USENIX Security Symposium (USENIX Security 21), 2021.
- Braun L, Demmler D, Schneider T, et al. MOTION – a framework for mixed-protocol multi-party computation. ACM Trans Priv Secur 2022; 25: 1–35.
- Chen T, Bao H, Huang S, et al. THE-X: Privacy-preserving transformer inference with homomorphic encryption. In: Findings of the Association for Computational Linguistics: ACL 2022, 2022, 3510–20.
- Li Z, Yang K, Tan J, et al. Nimbus: Secure and efficient two-party inference for transformers. In: Advances in Neural Information Processing Systems, Vol. 37. Curran Associates, Inc., 2024, 21572–600.
- Yan G, Zhang Y, Guo Z, et al. Comet: Accelerating private inference for large language model by predicting activation sparsity. In: 2025 IEEE Symposium on Security and Privacy (SP). Los Alamitos, CA, USA: IEEE Computer Society, 2025, 2827–45.