Abstract
The Atmospheric Radiation Measurement (ARM) Data Center is responsible for the timely collection, archival, and curation of science data products, which are freely available through an online data repository. Metadata is essential for scientific users to find and access over seven petabytes of atmospheric science data, and its hierarchical structure allows users to search for information at both broad and narrow levels. This project leverages 30 years’ worth of manually created metadata to enable machine predictions of broad-term classifications from narrow-term descriptions. These predictions are intended to assist metadata coordinators with their term selections. This paper discusses the cleaning and preprocessing of the training data, the pipeline developed to determine the best model for this task, and the creation of an API metadata classifier for ARM measurement metadata. Our results show that the Linear Support Vector Classification (LinearSVC) algorithm, combined with the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer, is well suited to our multi-class classification task. Training on lengthier input text led to better results, and artificial balancing of the classes was unnecessary for this particular use case. This predictive classifier improves the efficiency of metadata creation and supports greater consistency and accuracy in metadata tagging.
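As a rough illustration of the approach summarized above, the following minimal sketch pairs a TF-IDF vectorizer with LinearSVC using scikit-learn. The descriptions, the broad-term labels, and the helper name `predict_broad_term` are hypothetical and chosen only for illustration; they are not ARM metadata, and the sketch does not reproduce the preprocessing or model-selection pipeline described in this paper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data (not ARM metadata): narrow-term descriptions
# paired with the broad-term category each one belongs to.
descriptions = [
    "air temperature measured at two meters above the surface",
    "soil temperature at ten centimeters depth",
    "sea surface temperature from infrared radiometer",
    "downwelling shortwave radiation at the surface",
    "upwelling longwave radiation from pyrgeometer",
    "net radiation flux at the surface",
    "horizontal wind speed from sonic anemometer",
    "wind direction at ten meters",
    "vertical wind velocity from doppler lidar",
]
labels = ["temperature"] * 3 + ["radiation"] * 3 + ["wind"] * 3

# TF-IDF converts each description into a weighted term vector;
# LinearSVC then fits one linear decision boundary per class.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
classifier.fit(descriptions, labels)


def predict_broad_term(description):
    """Return the predicted broad-term label for one narrow-term description."""
    return classifier.predict([description])[0]
```

Calling `predict_broad_term("wind speed")` returns one of the three toy labels. In practice, such a model would be trained on the curated metadata and exposed through an API for coordinators to query.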