

Automating the Analysis of Large Language Models Responses through Zero-Shot Question Answering

by Hilda B Klasky
Publication Type: ORNL Report
Recent advancements in Large Language Models (LLMs) have shown significant potential across a wide range of applications, yet their evaluation, particularly in zero-shot question answering, remains challenging. In this study, our objective was to explore precision metrics for LLMs and to design and implement a software pipeline that automatically evaluates LLM outputs under zero-shot question answering. In zero-shot question answering, a model answers questions about topics it has not seen during training, leveraging the principles of zero-shot learning: semantic understanding and generalization from related knowledge. The data used were metadata from medical databases on congenital heart disease.

We explored eleven LLM metrics and selected three for our evaluation: BLEU, BERTScore, and MoverScore. BLEU computes a score from the overlap of n-grams (contiguous sequences of n items, typically words) between machine-generated text and the reference texts; higher BLEU scores indicate closer correspondence to the human-generated references. BERTScore evaluates the quality of machine-generated text by measuring the similarity of token embeddings produced by BERT (Bidirectional Encoder Representations from Transformers) between the generated text and the reference text. MoverScore quantifies the dissimilarity between the distributions of word embeddings from machine-generated and reference texts, emphasizing semantic similarity over exact token overlap. We also introduced HBKI, a composite metric summarizing these approaches.

We tested five models: GPT-3, Llama-2, Gemini 1.5 Pro, Solar 10.7B, and Mixtral-8x7b. Our software pipeline, designed and implemented using Object-Oriented Programming principles, allows users to customize the selection and extraction of features for topics of interest in their own research.
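To make the n-gram overlap idea behind BLEU concrete, the following is a minimal illustrative sketch, not the pipeline described in this report; production work would use an established implementation such as NLTK's `sentence_bleu` or sacrebleu, which add smoothing and multi-reference support.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Return a Counter of the contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Toy single-reference BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        # Clipped counts: each candidate n-gram is credited at most as
        # many times as it appears in the reference.
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # any zero precision drives the geometric mean to 0
        precisions.append(overlap / total)
    # Brevity penalty discourages candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else exp(1 - len(ref) / len(cand))
    return bp * exp(sum(log(p) for p in precisions) / max_n)
```

An identical candidate and reference score 1.0; a candidate sharing no words with the reference scores 0.0; partial overlap falls in between.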
Our results show that MoverScore delivered the most precise evaluation of the LLMs' outputs, while Mixtral-8x7b achieved the best overall performance in extracting metadata from the databases.