Expanding functional protein sequence space using high entropy generative models
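1️⃣ One-sentence summary
By comparing Boltzmann Machine models of different architectures, this study finds that high-entropy models keep the functional success rate of artificial enzymes high while exploring a sequence space more than fifteen orders of magnitude larger than that of low-entropy models; they are also less prone to overfitting and therefore reflect the evolutionary fitness landscape of proteins more faithfully.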
Boltzmann Machines trained on evolutionary sequence data have emerged as a powerful paradigm for the data-driven design of artificial proteins. However, the relationship between model architecture, specifically parameter density, and experimental performance remains poorly understood. Here, we investigate this relationship using the Chorismate Mutase enzyme family as a model system. We compare standard fully connected Boltzmann Machines for Direct Coupling Analysis (bmDCA) with sparse models generated via progressive edge activation (eaDCA) and edge decimation (edDCA). We identify a maximum-entropy model (meDCA) along the decimation trajectory that represents an optimal balance between constraint satisfaction and the flexibility of the probability distribution. We synthesized and tested artificial sequences from all models using an in vivo complementation assay, finding that all architectures, regardless of sparsity, generate functional enzymes with high success rates, even at significant divergence from natural sequences. Despite this functional equivalence, we demonstrate that the meDCA model samples a viable sequence space that is more than fifteen orders of magnitude larger than its low-entropy counterparts. Furthermore, comparative analyses reveal that high-entropy models systematically minimize overfitting and better capture the local neutral spaces surrounding natural proteins. These findings suggest that while various models satisfying coevolutionary statistics can generate functional sequences, high-entropy Boltzmann Machines provide a superior representation of the underlying evolutionary fitness landscape.
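To give a feel for the "fifteen orders of magnitude" claim, the following is a minimal sketch, assuming that the size of the sampled sequence space is approximated by the exponential of the model entropy (in nats). That assumption is a common convention for generative sequence models, but the abstract does not state how the authors estimate the size, so the function names and the baseline entropy value below are purely illustrative.

```python
import math


def orders_of_magnitude_gap(s_high_nats: float, s_low_nats: float) -> float:
    """Return log10 of the ratio exp(s_high) / exp(s_low).

    Assumption (not taken from the paper): the effective number of
    sequences a model samples scales as exp(entropy in nats).
    """
    return (s_high_nats - s_low_nats) / math.log(10)


def entropy_gap_for_orders(orders: float) -> float:
    """Entropy difference in nats that corresponds to `orders` decades."""
    return orders * math.log(10)


# Entropy difference implied by a 15-decade gap in effective space size.
delta_s = entropy_gap_for_orders(15)

# Hypothetical baseline entropy for a low-entropy model (illustrative only).
s_low = 100.0
gap = orders_of_magnitude_gap(s_low + delta_s, s_low)

print(f"entropy gap: {delta_s:.2f} nats, size gap: 10^{gap:.1f}")
```

Under this assumption, a gap of fifteen orders of magnitude corresponds to an entropy difference of roughly 34.5 nats between the high-entropy and low-entropy models, whatever their absolute entropies are.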
Source: arXiv: 2605.03578