TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

📄 Abstract - TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

Large language models (LLMs) exhibit exceptional general language processing capabilities, but their memory and compute costs hinder deployment. Ternarization has emerged as a promising compression technique, offering significant reductions in model size and inference complexity. However, existing methods struggle with heavy-tailed activation distributions and therefore keep activations in high precision, fundamentally limiting end-to-end inference acceleration. To overcome this limitation, we propose TWLA, a post-training quantization (PTQ) framework that achieves 1.58-bit weight compression and 4-bit activation quantization while maintaining high accuracy. TWLA comprises three components: (1) Euclidean-to-Manifold Asymmetric Ternary Quantizer (E2M-ATQ) minimizes layer-output error under weight ternarization via a two-stage optimization from Euclidean initialization to manifold relocation; (2) Kronecker Orthogonal Tri-Modal Shaping (KOTMS) applies a Kronecker-structured orthogonal rotation to reshape weights into ternary-friendly tri-modal distributions, while the shared rotation statistically suppresses activation outliers; and (3) Inter-Layer Aware Activation Mixed Precision (ILA-AMP) explicitly introduces adjacent-layer second-order interaction costs in bit allocation and jointly optimizes for the layer-wise disparity of activation quantization gains induced by the shared orthogonal transform, preventing cascades triggered by a few weak layers. Extensive experiments demonstrate that TWLA maintains high accuracy under W1.58A4, while delivering significant inference acceleration. The code is available at this https URL.
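As a rough illustration of what "1.58-bit" ternary weights mean, the sketch below computes the information content of a three-valued symbol (log2(3) ≈ 1.585 bits) and applies a generic threshold-based ternarizer in the style of Ternary Weight Networks. This is not E2M-ATQ: it is symmetric, uses a single scale, and minimizes nothing about the layer output. The function names, the `threshold_ratio` of 0.7, and the sample weights are assumptions chosen for illustration only.

```python
import math


def bits_per_ternary_weight() -> float:
    """Information content of one symbol drawn from {-1, 0, +1}."""
    return math.log2(3)


def ternarize(weights, threshold_ratio=0.7):
    """Generic threshold-based ternarization (illustrative, not E2M-ATQ).

    Returns (codes, scale) with codes in {-1, 0, +1}, so that
    scale * codes[i] approximates weights[i].
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = threshold_ratio * mean_abs  # magnitudes at or below delta map to 0
    codes = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    # The scale is the mean magnitude of the weights that were kept.
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale


# Hypothetical weight row, for illustration only.
weights = [0.9, -0.05, 0.4, -1.1, 0.02, -0.6]
codes, scale = ternarize(weights)
bits = bits_per_ternary_weight()
print(codes, round(scale, 3), round(bits, 3))
```

The small-magnitude weights collapse to zero and the rest share one scale, which is why a weight distribution with three well-separated modes is easier to ternarize than a single bell-shaped one.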

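1️⃣ One-Sentence Summary

This paper proposes TWLA, a new post-training quantization framework that uses mathematical transformations and optimization strategies to compress LLM weights to 1.58 bits and quantize activations to 4 bits. It substantially speeds up inference while maintaining high accuracy, addressing the difficulty of compressing activations that earlier extremely low-bit quantization methods faced.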
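To make the activation-outlier problem concrete, the following sketch quantizes a vector with one large outlier to 4 bits, once directly and once after an orthogonal rotation. The rotation here is a fixed normalized Hadamard matrix, which is a Kronecker power of the 2×2 Hadamard matrix. It is only a stand-in: the abstract does not specify how the KOTMS rotation is constructed, and the quantizer, helper names, and sample vector are assumptions for illustration.

```python
import math


def hadamard(n):
    """Normalized n x n Hadamard matrix (n a power of two), via Sylvester's
    construction H_2n = H_2 (Kronecker) H_n. Orthogonal and symmetric."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def quantize_int4(x):
    """Symmetric 4-bit quantization: integers in [-7, 7] times one shared scale."""
    scale = max(abs(v) for v in x) / 7 or 1.0
    return [round(v / scale) * scale for v in x]


def mse(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


# Hypothetical activation vector with a single outlier channel.
x = [1.0, -0.9, 1.1, 20.0, -1.0, 0.8, 1.2, -1.1]
h = hadamard(len(x))

# Direct quantization: the outlier sets the scale, so small entries round to 0.
err_plain = mse(x, quantize_int4(x))

# Rotate, quantize, rotate back (H is its own inverse here).
x_rot = matvec(h, x)
x_back = matvec(h, quantize_int4(x_rot))
err_rot = mse(x, x_back)

print(max(abs(v) for v in x), max(abs(v) for v in x_rot))
print(err_plain, err_rot)
```

The rotation spreads the outlier's energy across all channels, so the quantization scale shrinks and the reconstruction error drops. In a full pipeline the same rotation would also be folded into the weights, since (xQ)(QᵀW) = xW for any orthogonal Q, leaving the layer output unchanged before quantization.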
