Publication

Machine Learning (ML) Classifier to Assist Metadata Creation

Publication Type
Conference Paper
Book Title
2024 IEEE International Conference on Big Data (BigData)
Publication Date
Page Numbers
2072 to 2079
Publisher Location
New Jersey, United States of America
Conference Name
2024 IEEE International Conference on Big Data (BigData)
Conference Location
Washington DC, District of Columbia, United States of America
Conference Sponsor
IEEE Computer Society
Conference Date
-

The Atmospheric Radiation Measurement (ARM) Data Center is responsible for the timely collection, archival, and curation of science data products. These products are freely available through an online data repository. Metadata is essential for scientific users to find and access more than seven petabytes of atmospheric science data, and its hierarchical structure allows users to search at both broad and narrow levels. This project aims to leverage 30 years' worth of manually created metadata to enable machine predictions of broad-term classifications from narrow-term descriptions. These predictions would assist metadata coordinators with their term selections. This paper discusses the cleaning and preprocessing of the training data, the pipeline developed to determine the best model for this task, and the creation of an API metadata classifier for ARM measurement metadata. Our results show that the Linear Support Vector Classification (LinearSVC) algorithm, combined with the Term Frequency – Inverse Document Frequency (TF-IDF) vectorizer, is well suited to our multi-class classification task. Longer training inputs led to better results, and artificial balancing was unnecessary for this particular use case. The predictive classifier improves efficiency in metadata creation and supports greater consistency and accuracy in metadata tagging.
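As a rough illustration of the approach the abstract describes, the sketch below pairs a TF-IDF vectorizer with LinearSVC in a scikit-learn pipeline to predict a broad term from a narrow-term description. It is a minimal sketch under stated assumptions, not the authors' pipeline: the training descriptions, the broad-term labels, the helper names `build_classifier` and `predict_broad_term`, and the hyperparameters are all invented for illustration and are not taken from the paper or from ARM metadata.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical narrow-term descriptions, invented for illustration only.
TRAIN_TEXTS = [
    "air temperature measured at two meters above the surface",
    "relative humidity of ambient air near the surface",
    "horizontal wind speed and wind direction",
    "downwelling shortwave broadband irradiance",
    "upwelling longwave broadband irradiance at the surface",
    "direct normal solar irradiance",
    "cloud base height from ceilometer backscatter",
    "cloud fraction of the total sky",
    "liquid water path of the cloud layer",
]

# Hypothetical broad-term labels, one per description above.
TRAIN_LABELS = [
    "Atmospheric State",
    "Atmospheric State",
    "Atmospheric State",
    "Radiometric",
    "Radiometric",
    "Radiometric",
    "Cloud Properties",
    "Cloud Properties",
    "Cloud Properties",
]


def build_classifier():
    """Return an unfitted TF-IDF + LinearSVC pipeline.

    The settings are assumed defaults chosen for the sketch, not tuned
    values reported in the paper.
    """
    return make_pipeline(
        TfidfVectorizer(lowercase=True, stop_words="english"),
        LinearSVC(C=1.0, random_state=0),
    )


# Fit on the toy data; LinearSVC handles the multi-class case one-vs-rest.
model = build_classifier()
model.fit(TRAIN_TEXTS, TRAIN_LABELS)


def predict_broad_term(description):
    """Predict a broad-term label for a single narrow-term description."""
    return str(model.predict([description])[0])


if __name__ == "__main__":
    print(predict_broad_term("broadband shortwave irradiance"))
```

No class weighting or resampling is applied in this sketch, in line with the abstract's observation that artificial balancing was unnecessary for this use case. A real deployment would train on the cleaned historical metadata and could expose `predict_broad_term` behind an API endpoint, as the abstract mentions.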