Machine Learning (ML) Classifier to Assist Metadata Creation

Publication Type

Conference Paper

Book Title

2024 IEEE International Conference on Big Data (BigData)

Publication Date

December, 2024

Page Numbers

2072 to 2079

Publisher Location

New Jersey, United States of America

Conference Name

2024 IEEE International Conference on Big Data (BigData)

Conference Location

Washington DC, District of Columbia, United States of America

Conference Sponsor

IEEE Computer Society

Conference Date

Dec 15, 2024 - Dec 18, 2024

Abstract

The Atmospheric Radiation Measurement (ARM) Data Center is responsible for the timely collection, archival, and curation of science data products. These products are freely available through an online data repository. Metadata creation is paramount for scientific users to find and access over seven petabytes of atmospheric science data. The hierarchical metadata structure allows users to search for information at both broad and narrow levels. This project aims to leverage 30 years’ worth of manually created metadata to enable machine predictions of broad-term classifications from narrow-term descriptions. These classification predictions would assist metadata coordinators with their term selections. This paper discusses the cleaning and preprocessing of the training data, the pipeline developed to determine the best model for this task, and the creation of an API metadata classifier for ARM measurement metadata. Our results show that the Linear Support Vector Classification (LinearSVC) algorithm, paired with the Term Frequency – Inverse Document Frequency (TF-IDF) vectorizer, is well-suited for our multi-class classification task. Lengthier input training data led to better results, and artificial balancing was unnecessary for this particular use case. This predictive classifier improves the efficiency of metadata creation and supports greater consistency and accuracy in metadata tagging.
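
As a rough illustration of the approach named in the abstract, the sketch below pairs scikit-learn's TfidfVectorizer with LinearSVC to predict a broad-term category from a narrow-term description. It is not the authors' pipeline: the descriptions, category labels, and default hyperparameters are invented for this example, and the paper's own cleaning, preprocessing, and model-selection steps are not reproduced.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical narrow-term descriptions (assumed for illustration only).
descriptions = [
    "air temperature measured near the surface",
    "relative humidity of ambient air",
    "downwelling shortwave irradiance at the surface",
    "upwelling longwave irradiance from the ground",
    "aerosol optical depth at several wavelengths",
    "aerosol particle number concentration",
    "cloud base height from a ceilometer",
    "cloud fraction from a sky imager",
]

# Hypothetical broad-term categories, one per description above.
labels = [
    "atmospheric state",
    "atmospheric state",
    "radiometric",
    "radiometric",
    "aerosols",
    "aerosols",
    "cloud properties",
    "cloud properties",
]

# TF-IDF turns each description into a sparse weighted term vector;
# LinearSVC then fits a linear multi-class classifier on those vectors.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(descriptions, labels)

# Suggest a broad term for a new, unseen narrow-term description.
predicted = classifier.predict(["surface air temperature and humidity"])[0]
print(predicted)
```

In a workflow like the one the abstract describes, a prediction of this kind would serve as a suggestion for a metadata coordinator to review, not as an automatic assignment.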
