arXiv submission date: 2026-06-17
📄 Abstract - Generalised Eigenvalue Geometry of Semantic Adversarial Attacks

Recent empirical work shows that semantically equivalent paraphrases can fool financial sentiment classifiers: although a paraphrase remains close to the original under a strong reference embedding, it may shift the target model's representation enough to change the predicted class. Existing robustness theory either assumes a single-model threat model or focuses mainly on empirical attack algorithms. We develop a continuous local model of semantic paraphrase perturbations that captures this two-model structure. We show that the worst-case local displacement of the target representation, subject to a proxy-model budget, is governed by the largest generalised eigenvalue of a matrix pencil $(A,B)$ constructed from the Jacobians of the two embedding maps. The resulting attackability index $\lambda^*(x)$ is intrinsic to the local paraphrase geometry and the chosen embedders, yields a closed-form prediction-flip condition for affine readouts, and supports conservative population and finite-sample attackability certificates. For uniform control over classes of affine readouts, we derive a distribution-free VC bound for binary attackability indicators and a scale-sensitive margin bound based on an attackability-adjusted margin that subtracts a local geometric penalty from the standard classifier margin. We also connect the continuous theory to discrete paraphrase search, identify an asymmetry between successful and unsuccessful finite searches, and give a covering condition under which the discrete and continuous settings agree. Finally, we propose an empirical verification framework using soft-token relaxations and generated paraphrase sets to assess the local eigenvalue geometry, prediction-flip condition, and finite-search approximation on a deployed financial-text classifier.
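
The abstract does not spell out how the pencil $(A,B)$ is assembled, so the following is only a minimal sketch under one natural assumption: $A = J_T^\top J_T$ and $B = J_P^\top J_P$, where $J_T$ and $J_P$ are the Jacobians of the target and proxy embedding maps at $x$. Under that assumption, maximising $\|J_T\delta\|^2$ subject to $\|J_P\delta\|^2 \le \varepsilon^2$ gives a worst-case target displacement of $\varepsilon\sqrt{\lambda^*(x)}$. The names `attackability_index`, `worst_case_displacement`, `ridge`, and `epsilon` are hypothetical and chosen for illustration; the small ridge term is an added assumption that keeps $B$ positive definite.

```python
import numpy as np


def attackability_index(J_target, J_proxy, ridge=1e-8):
    """Return (lambda_star, delta) for the pencil (A, B).

    Assumed construction (not confirmed by the abstract):
        A = J_target^T J_target
        B = J_proxy^T J_proxy + ridge * I
    delta is the top generalised eigenvector, scaled so delta^T B delta = 1.
    """
    J_target = np.asarray(J_target, dtype=float)
    J_proxy = np.asarray(J_proxy, dtype=float)
    d = J_proxy.shape[1]
    A = J_target.T @ J_target
    B = J_proxy.T @ J_proxy + ridge * np.eye(d)

    # Reduce A v = lambda B v to a standard symmetric problem via B = L L^T:
    # C = L^{-1} A L^{-T} has the same eigenvalues as the pencil.
    L = np.linalg.cholesky(B)
    M = np.linalg.solve(L, A)        # L^{-1} A
    C = np.linalg.solve(L, M.T)      # L^{-1} A L^{-T}
    C = 0.5 * (C + C.T)              # remove rounding asymmetry

    eigvals, eigvecs = np.linalg.eigh(C)   # eigenvalues in ascending order
    lambda_star = float(eigvals[-1])
    delta = np.linalg.solve(L.T, eigvecs[:, -1])  # map back: delta = L^{-T} y
    return lambda_star, delta


def worst_case_displacement(lambda_star, epsilon):
    """Largest local target displacement for a proxy budget epsilon."""
    return epsilon * np.sqrt(max(lambda_star, 0.0))


# Toy Jacobians standing in for the two embedding maps (synthetic data).
rng = np.random.default_rng(0)
J_T = rng.normal(size=(6, 4))   # target embedder Jacobian
J_P = rng.normal(size=(5, 4))   # proxy (reference) embedder Jacobian

lam, delta = attackability_index(J_T, J_P)
eps = 0.1
step = eps * delta              # perturbation that saturates the proxy budget

print("lambda* =", lam)
print("target displacement =", np.linalg.norm(J_T @ step))
print("predicted eps * sqrt(lambda*) =", worst_case_displacement(lam, eps))
```

In this sketch the perturbation `step` uses the full proxy budget while moving the target representation by `eps * sqrt(lam)`, which is the sense in which the largest generalised eigenvalue governs local attackability.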

Top-level tags: llm machine learning financial
Detailed tags: adversarial attacks robustness theory eigenvalue geometry sentiment classification paraphrase robustness

Generalised Eigenvalue Geometry of Semantic Adversarial Attacks


1️⃣ One-sentence summary

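This paper proposes a geometric framework based on generalised eigenvalues for understanding and quantifying how semantically equivalent paraphrases fool sentiment classifiers: by analysing the local geometric relationship between a proxy model and a target model, the authors derive an attackability index that predicts when a small semantic change flips the classification, and they support it with theoretical guarantees and a proposed empirical verification framework.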

Source: arXiv: 2606.19212