Publication

Evaluating the Use of Foundational Chemical Language Models in Multimodal Graph Fusion

by Collin Francel, Massimiliano Lupo Pasini, Zachary R Fox
Publication Type
Conference Paper
Book Title
AI for Accelerated Materials Design - NeurIPS 2024
Publication Date
Publisher Location
United States of America
Conference Name
AI4Mat Workshop at NeurIPS: AI for Accelerated Materials Design
Conference Location
Vancouver, Canada
Conference Sponsor
NeurIPS
Conference Date
-

Rapid and accurate prediction of the physicochemical properties of molecules from their structures remains a key challenge in cheminformatics. Machine learning approaches offer high-throughput options, but the optimal inductive biases and data representations remain up for debate. For example, BERT-based masked language models (MLMs) can be trained in a self-supervised way on hundreds of millions to billions of readily available SMILES strings. Another option is graph neural networks (GNNs), which can operate directly on molecular structures. However, generating accurate molecular geometries is computationally expensive, so geometric data are scarce relative to SMILES strings. It is attractive to combine these two paradigms by pre-training a language model (LM) on a large corpus of SMILES strings and embedding its representations into a geometric graph neural network. Despite the promise of such an approach, and contrary to previous studies, we find mixed results when combining LMs and GNNs on several molecular datasets. In particular, we find evidence of improvement on the FreeSolv and QM7 benchmarks, but degraded performance on the ESOL, LIPO, and QM9 datasets compared to a GNN baseline.
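
As a purely illustrative sketch of what combining the two paradigms could look like, the Python below appends a fixed molecule-level embedding to every atom's feature vector and then runs one mean-aggregation message-passing step. The abstract does not specify the fusion mechanism, so the function names (`fuse_node_features`, `mean_aggregate`), the concatenation strategy, the toy graph, and the embedding values are all assumptions made for illustration, not the authors' method. In practice the embedding would come from a pretrained SMILES language model rather than a hand-written list.

```python
# Hypothetical sketch: fuse a molecule-level LM embedding into per-atom
# features, then apply one simple message-passing step. All names and
# values are illustrative assumptions, not taken from the paper.

def fuse_node_features(node_features, lm_embedding):
    """Append the same molecule-level embedding to every atom's features."""
    return [list(atom) + list(lm_embedding) for atom in node_features]


def mean_aggregate(features, edges):
    """Average each atom's features with those of its bonded neighbours."""
    n = len(features)
    neighbours = {i: [i] for i in range(n)}  # each atom includes itself
    for a, b in edges:  # undirected bonds
        neighbours[a].append(b)
        neighbours[b].append(a)
    dim = len(features[0])
    return [
        [sum(features[j][k] for j in neighbours[i]) / len(neighbours[i])
         for k in range(dim)]
        for i in range(n)
    ]


# Toy three-atom chain 0-1-2 with two placeholder features per atom.
node_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
edges = [(0, 1), (1, 2)]

# Stand-in for the output of a pretrained SMILES language model.
lm_embedding = [0.5, -0.5, 0.25]

fused = fuse_node_features(node_features, lm_embedding)
updated = mean_aggregate(fused, edges)
```

Because the same embedding is attached to every atom, the appended entries pass through mean aggregation unchanged. In this simplified setup the language-model signal acts as a global context vector, while the atom-level features are mixed along bonds.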