Dataset

MSR-ACC-TAE25_Train




Species content of dataset


Name :
MSR-ACC-TAE25_Train
Authors :
Sebastian Ehlert, Jan Hermann, Thijs Vogels, Victor Garcia Satorras, Stephanie Lanius, Marwin Segler, Klaas J.H. Giesbertz, Derk P. Kooi, Kenji Takeda, Chin-Wei Huang, Giulia Luise, Rianne van den Berg, Paola Gori-Giorgi, Amir Karton
Description :
MSR-ACC/TAE25 (Microsoft Research Accurate Chemistry Collection, Total Atomization Energies 2025) provides 73,040 total atomization energies (TAEs) at the CCSD(T)/CBS level obtained with the W1-F12 composite wavefunction protocol implemented in Molpro 2024.1. This is the canonical training split comprising 71,871 molecules (99% of molecules remaining after removing overlap with the W4-17 and GMTKN55 benchmark sets).
The dataset covers the chemical space of closed-shell, charge-neutral, covalently bound equilibrium molecular structures containing up to 5 non-hydrogen atoms drawn from elements H through Ar, excluding rare gases. Molecular structures were generated by exhaustive graph enumeration and degree-sequence sampling, then optimized through a cascade of GFN2-xTB, r2SCAN-3c, and B3LYP-D3(BJ)/def2-TZVPP levels of theory (ORCA). Structures were filtered to exclude those with significant multireference character (%TAE[(T)] > 6% at CCSD(T)/6-31G*), triplet electronic ground states, or dissociated fragments.
The W1-F12 protocol includes Hartree-Fock extrapolation to the complete basis set limit (cc-pVDZ-F12 and cc-pVTZ-F12, alpha=5), CCSD-F12b correlation, perturbative triples delta(T) using jul-cc-pV(D+d)Z and jul-cc-pV(T+d)Z basis sets (alpha=3.22), and a core-valence correction using cc-pwCVTZ.
The dataset spans 45.1% organic and 54.9% inorganic molecules and provides broader chemical diversity than comparable datasets such as GDB-9 or VQM24/DMC. Additional data available in the source files, including DFT atomization energies at approximately 90 levels of theory, singlet-triplet gaps, %TAE[(T)] multireference diagnostics, and W1-F12 energy components, can be downloaded from ColabFit Exchange.
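The basis-set extrapolations named in the description (alpha=5 for Hartree-Fock, alpha=3.22 for the perturbative triples) can be illustrated with a minimal sketch. It assumes the two-point form E(L) = E_CBS + A / L**alpha that is commonly used in W1-type protocols, with L = 2 for the double-zeta and L = 3 for the triple-zeta basis set. The function name `extrapolate_cbs` and the numbers in the usage lines are hypothetical and are not taken from the dataset or from Molpro.

```python
def extrapolate_cbs(e_small: float, e_large: float,
                    l_small: int, l_large: int, alpha: float) -> float:
    """Two-point extrapolation, assuming E(L) = E_CBS + A / L**alpha.

    e_small, e_large: energies in the smaller and larger basis set.
    l_small, l_large: cardinal numbers of the two basis sets (e.g. 2 and 3).
    alpha: extrapolation exponent (e.g. 5 for HF, 3.22 for (T) in the description).
    """
    if l_large <= l_small:
        raise ValueError("l_large must be greater than l_small")
    ratio = (l_large / l_small) ** alpha
    # Solving the two equations E(L) = E_CBS + A / L**alpha for E_CBS gives:
    return e_large + (e_large - e_small) / (ratio - 1.0)


# Illustrative, made-up Hartree-Fock energies in Hartree (not dataset values).
e_hf_dz = -76.0500
e_hf_tz = -76.0650
e_hf_cbs = extrapolate_cbs(e_hf_dz, e_hf_tz, l_small=2, l_large=3, alpha=5.0)
print(f"Extrapolated HF/CBS energy: {e_hf_cbs:.6f} Eh")
```

A larger alpha means faster assumed convergence, so the extrapolated value stays closer to the triple-zeta result.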
Cite As :
Ehlert, S., Hermann, J., Vogels, T., Satorras, V. G., Lanius, S., Segler, M., Giesbertz, K. J. H., Kooi, D. P., Takeda, K., Huang, C.-W., Luise, G., van den Berg, R., Gori-Giorgi, P., and Karton, A. "MSR-ACC-TAE25 Train." ColabFit, 2026.
ColabFit ID :
Date Added :
2026-05-11
License :
CDLA-Permissive-2.0
Downloads :
0
Num. Configurations :
71,871
Num. Atoms :
532,242
Calculated Property Types :
atomization_energy
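As a rough illustration of the atomization_energy property and of the %TAE[(T)] filter mentioned in the description, the sketch below assumes the usual definitions: the total atomization energy is the sum of the isolated-atom energies minus the molecular energy, and %TAE[(T)] is the percentage of the CCSD(T) atomization energy contributed by the perturbative triples. The function names and the example numbers are hypothetical. The 6% threshold is the one quoted in the description.

```python
from typing import Iterable


def atomization_energy(atom_energies: Iterable[float], molecule_energy: float) -> float:
    """TAE = sum of isolated-atom energies minus the molecular energy (same units)."""
    return sum(atom_energies) - molecule_energy


def percent_tae_triples(tae_ccsd: float, tae_ccsd_t: float) -> float:
    """Assumed definition: 100 * (TAE[CCSD(T)] - TAE[CCSD]) / TAE[CCSD(T)]."""
    return 100.0 * (tae_ccsd_t - tae_ccsd) / tae_ccsd_t


def passes_multireference_filter(tae_ccsd: float, tae_ccsd_t: float,
                                 threshold: float = 6.0) -> bool:
    """Keep a structure only if %TAE[(T)] does not exceed the threshold (in percent)."""
    return percent_tae_triples(tae_ccsd, tae_ccsd_t) <= threshold


# Illustrative, made-up atomization energies in kcal/mol (not dataset values).
print(passes_multireference_filter(tae_ccsd=97.0, tae_ccsd_t=100.0))
print(passes_multireference_filter(tae_ccsd=90.0, tae_ccsd_t=100.0))
```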
Elements :
Al (4.52%), B (5.61%), Be (3.08%), C (8.18%), Cl (1.03%), F (1.30%), H (44.58%), Li (2.18%), Mg (2.34%), N (6.34%), Na (1.41%), O (3.55%), P (5.34%), S (4.29%), Si (6.24%)
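The element percentages can be combined with the atom and configuration counts listed above for a quick sanity check. This sketch assumes each percentage is that element's share of all atoms in the dataset, which the page does not state explicitly; the values sum to roughly 100%, which is consistent with that reading. The variable names are illustrative.

```python
# Values copied from the Elements, Num. Atoms and Num. Configurations fields.
ELEMENT_PERCENT = {
    "Al": 4.52, "B": 5.61, "Be": 3.08, "C": 8.18, "Cl": 1.03,
    "F": 1.30, "H": 44.58, "Li": 2.18, "Mg": 2.34, "N": 6.34,
    "Na": 1.41, "O": 3.55, "P": 5.34, "S": 4.29, "Si": 6.24,
}
NUM_ATOMS = 532_242
NUM_CONFIGS = 71_871

# Sum of the listed shares; rounding on the page keeps this slightly off 100.
total_percent = sum(ELEMENT_PERCENT.values())

# Approximate atom count per element, assuming the shares are atom fractions.
approx_counts = {el: pct / 100.0 * NUM_ATOMS for el, pct in ELEMENT_PERCENT.items()}

# Average number of atoms (including hydrogen) per configuration.
atoms_per_config = NUM_ATOMS / NUM_CONFIGS

print(f"Sum of element shares: {total_percent:.2f}%")
print(f"Average atoms per configuration: {atoms_per_config:.2f}")
for el, count in sorted(approx_counts.items(), key=lambda kv: -kv[1])[:3]:
    print(f"{el}: about {count:,.0f} atoms")
```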
Methods :
W1-F12/CCSD(T)-CBS
Software :
Molpro 2024.1
Spec File :

No uploaded content is transferred in ownership from the original creators to ColabFit. All content is distributed under the license specified by its contributor who has stated that he or she has the authority to share it under the specified license.