Date
Sep 12, 2024

Details

Event Description

Video Link: Enhancing Evaluation Coverage and Validation with Language Models

Abstract: Two major challenges in evaluation are coverage and validation: (1) existing benchmarks struggle to cover the vast range of potential use cases of language models, and (2) current methods for answer validation rely on costly ground-truth labels or potentially unreliable LLM-as-judge approaches. In this talk, I will present approaches that leverage language models to construct reliable benchmarks and judge model responses.
First, we introduce a declarative framework for benchmark construction: we define a set of desiderata (e.g., salience, difficulty, novelty) and formulate benchmark construction as a search problem that optimizes for these criteria. To solve this, we propose a language-model-driven approach called AutoBencher. AutoBencher’s scalability allows it to test fine-grained categories and tail knowledge, discovering knowledge gaps in state-of-the-art models.

Second, we introduce a consistency criterion for reliably judging model responses. This criterion, which we refer to as GV-consistency (Generation and Validation consistency), enables evaluation without ground-truth labels and provides an upper bound on model accuracy. We find that even GPT-4 (0613) exhibits only 76% GV-consistency, and we show that this consistency serves as an effective self-improvement signal, improving generator and validator accuracy by 16% and 6.3%, respectively, without using labeled data.
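To make the two ideas concrete, below are two minimal Python sketches. They are not from the talk: every function name, prompt, and scoring rule is an illustrative assumption, and a real language model call would replace each placeholder.

A benchmark-construction loop in the spirit of AutoBencher might propose candidate categories, score each against the desiderata, and keep the best:

import random

def propose_categories(seed_topic: str, n: int) -> list[str]:
    # Placeholder: in a real system, an LM would propose candidate categories.
    return [f"{seed_topic} subtopic {i}" for i in range(n)]

def score(category: str) -> float:
    # Placeholder desiderata scores; a real system would estimate salience,
    # difficulty, and novelty with LM queries rather than random values.
    salience, difficulty, novelty = (random.random() for _ in range(3))
    return salience + difficulty + novelty

def search_benchmark(seed_topic: str, n: int = 50) -> str:
    # Treat construction as search: return the candidate that best
    # satisfies the combined desiderata.
    return max(propose_categories(seed_topic, n), key=score)

print(search_benchmark("world history"))

Similarly, GV-consistency could be measured by asking the same model to validate its own generations and recording how often the two roles agree, with no ground-truth labels involved:

from typing import Callable

def gv_consistency(questions: list[str], ask_model: Callable[[str], str]) -> float:
    # Fraction of questions where the model's validator role endorses the
    # answer produced by its own generator role.
    consistent = 0
    for q in questions:
        # Generator role: produce an answer.
        answer = ask_model(f"Question: {q}\nAnswer:")
        # Validator role: judge the generated answer with a yes/no prompt.
        verdict = ask_model(
            f"Question: {q}\nProposed answer: {answer}\n"
            "Is this answer correct? Answer yes or no:"
        )
        consistent += verdict.strip().lower().startswith("yes")
    return consistent / len(questions)

# Example with a trivial stand-in model that always answers "yes":
print(gv_consistency(["What is 2 + 2?"], lambda prompt: "yes"))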

Bio: Xiang Lisa Li is a fifth-year PhD student in computer science at Stanford University, advised by Percy Liang and Tatsunori Hashimoto. She works on designing methods for controllable and steerable language models and on constructing new evaluation frameworks for them. Lisa is supported by the Two Sigma PhD Fellowship and the Stanford Graduate Fellowship, and she is the recipient of an EMNLP Best Paper Award.

Special Event