arXiv submission date: 2026-06-29
📄 Abstract - Atompack: A Storage and Distribution Layer for Read-Heavy Atomistic ML Training Datasets

Atomistic machine learning datasets are increasingly used for training: large immutable snapshots are read repeatedly, shuffled across epochs, staged across clusters' storage systems, and republished as reusable scientific artifacts. This workload differs from interactive scientific curation, where mutable records and ad hoc inspection are often more important than random indexed throughput. We present Atompack, an append-oriented storage format and distribution layer designed around a simple workload: training pipelines usually consume complete molecular records, while the order of records is randomized by the learning algorithm. Atompack appends records efficiently during dataset construction, then commits an immutable index and serves records through a memory-mapped read path optimized for training. We compare Atompack with HDF5, LMDB, and ASE baselines representing array stores, key-value records, serialized records, and object-oriented databases. The benchmarks measure sequential reads, shuffled reads, shared-filesystem behavior, write throughput, and artifact size. On a representative 64-atom workload, Atompack is 96x faster than ASE LMDB on shuffled training-style reads while producing artifacts about 79% smaller. The results indicate that serving complete molecule records, rather than field chunks or reconstructed objects, improves shuffled training throughput while keeping artifacts compact enough for public distribution.
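
The append, commit, and memory-mapped read cycle described in the abstract can be illustrated with a small sketch. This is not Atompack's actual file format or API, neither of which is given here. The class names `ToyRecordWriter` and `ToyRecordReader`, the record layout (atom count, then atomic numbers, then xyz positions), and the `.idx` sidecar file are all hypothetical, chosen only to show why an offset index over whole records makes shuffled access cheap.

```python
import mmap
import os
import random
import struct
import tempfile


class ToyRecordWriter:
    """Hypothetical writer: appends whole records, then commits an offset index."""

    def __init__(self, path):
        self.path = path
        self._data = open(path, "wb")
        self._offsets = [0]  # byte offset where each record starts, plus the end

    def append(self, numbers, positions):
        n = len(numbers)
        if len(positions) != n:
            raise ValueError("one xyz position is required per atom")
        flat = [c for xyz in positions for c in xyz]
        # Assumed layout: uint32 atom count, n int32 numbers, 3n float32 coords.
        payload = struct.pack(f"<I{n}i{3 * n}f", n, *numbers, *flat)
        self._data.write(payload)
        self._offsets.append(self._offsets[-1] + len(payload))

    def commit(self):
        # After commit the data file and its index are treated as immutable.
        self._data.close()
        with open(self.path + ".idx", "wb") as f:
            f.write(struct.pack(f"<{len(self._offsets)}Q", *self._offsets))


class ToyRecordReader:
    """Hypothetical reader: memory-maps the data file and decodes one record per lookup."""

    def __init__(self, path):
        with open(path + ".idx", "rb") as f:
            raw = f.read()
        self._offsets = struct.unpack(f"<{len(raw) // 8}Q", raw)
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __len__(self):
        return len(self._offsets) - 1

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        start = self._offsets[i]
        (n,) = struct.unpack_from("<I", self._map, start)
        numbers = struct.unpack_from(f"<{n}i", self._map, start + 4)
        flat = struct.unpack_from(f"<{3 * n}f", self._map, start + 4 + 4 * n)
        positions = [flat[k:k + 3] for k in range(0, 3 * n, 3)]
        return list(numbers), positions

    def close(self):
        self._map.close()
        self._file.close()


def demo():
    """Write four records, then read them back in a shuffled order."""
    rng = random.Random(0)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.bin")
        writer = ToyRecordWriter(path)
        for n in (3, 64, 5, 12):
            numbers = [rng.randint(1, 9) for _ in range(n)]
            positions = [(rng.random(), rng.random(), rng.random()) for _ in range(n)]
            writer.append(numbers, positions)
        writer.commit()

        reader = ToyRecordReader(path)
        order = list(range(len(reader)))
        rng.shuffle(order)  # training-style random access order
        atom_counts = {i: len(reader[i][0]) for i in order}
        reader.close()
    return atom_counts
```

In this sketch every lookup is one index read plus one contiguous decode, regardless of access order, which is the property the abstract attributes to serving complete records rather than field chunks. A production format would also need checksums, versioning, and compression, none of which are modeled here.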

Top-level tags: machine learning data
Detailed tags: dataset storage, training throughput, molecular records, benchmark

Atompack: A Storage and Distribution Layer for Read-Heavy Atomistic ML Training Datasets


1️⃣ One-sentence summary

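The paper proposes Atompack, a new storage format that combines append-only writes, an immutable index, and memory-mapped reads to speed up random reads of complete molecular records in atomistic machine learning training data; on a representative 64-atom workload it is 96x faster than ASE LMDB on shuffled training-style reads and produces artifacts about 79% smaller.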

Source: arXiv: 2606.29975