May 8, 2025

HELMET: 🌐 Website · 📜 Paper (ICLR 2025)
LongProc: 🌐 Website · 📜 Paper

TLDR:

The rapid advancement of language models (LMs) has opened exciting new frontiers in artificial intelligence. One particularly promising development is the ability of these models to handle increasingly long pieces of text, both for reading and for generating. This capability unlocks applications like analyzing entire research papers or generating complete code repositories. But with this expanded capability comes an important question: how do we properly evaluate whether these models truly understand and can effectively work with long texts?

In this blog, we introduce HELMET and LongProc, two benchmarks from our recent efforts to build a holistic test suite for evaluating long-context LMs.
HELMET assesses the processing of long inputs (up to 128K tokens), while LongProc evaluates the generation of extended outputs (up to 8K tokens). These benchmarks improve upon existing evaluation methods by featuring: 1) diverse tasks with strong connections to downstream applications, 2) difficulty levels that challenge even frontier models, and 3) reliable evaluation metrics. Together, they create a complementary testbed for comprehensive long-context evaluation.

With HELMET and LongProc, we evaluated over 50 language models, including base models, instruction-tuned models, and recent reasoning models. Our evaluation highlights important limitations in current models: even frontier proprietary models have significant room for improvement on challenging real-world long-context tasks. Interestingly, reasoning models achieve stronger overall performance on long-output tasks, but they still show limitations on long-input tasks. To help address these limitations, HELMET and LongProc enable researchers to evaluate their models thoroughly and to quickly compare performance across the rapidly moving landscape.

Background: Long-Context Evaluation Is Important but Challenging

With the development of long-context LMs (LCLMs) across both industry and the open-source community, it is crucial to have a reliable testbed for evaluating and comparing these models. A common practice for long-context evaluation is to use perplexity or synthetic tasks, such as needle-in-a-haystack (NIAH). However, recent work has shown that perplexity does not correlate well with downstream performance (Fang et al., 2024), and synthetic tasks like NIAH capture only a narrow slice of the abilities needed for real applications.

Existing benchmarks with realistic applications, such as ZeroSCROLLS (Shaham et al., 2023), LongBench (Bai et al., 2024), and InfiniteBench (Zhang et al., 2024), still have crucial limitations:

  • Insufficient coverage of downstream tasks
  • Inadequate lengths for testing frontier LCLMs: older QA datasets are often limited to <32K tokens
  • Unreliable metrics: N-gram matching metrics like ROUGE are noisy—they do not correlate with human judgments (Goyal et al., 2023) and do not distinguish between models

Thus, we propose HELMET to address these limitations by providing a comprehensive evaluation framework for long-input tasks. Additionally, we introduce LongProc as a reliable testbed specifically designed for long-output capabilities, filling an important gap in current evaluation approaches that focus on long-input processing rather than generation quality.

Figure 1: Overview of HELMET datasets

HELMET: Effective and Thorough Evaluation of Long-Input Tasks

We design HELMET with the following desiderata:

  •  Diverse coverage of downstream tasks
  • Controllable length and complexity
  • Reliable evaluation for base and instruction-tuned models

Figure 1 shows an overview of the benchmark. In our experiments, we evaluate input lengths from 8K to 128K tokens, but HELMET can be easily extended to even longer context lengths.

Key Improvements Over Existing Benchmarks

Diverse coverage: HELMET includes a diverse set of tasks, such as retrieval-augmented generation with real retrieval passages, generation with citations, and summarization. We carefully select datasets with naturally long contexts that reflect real-world applications.

Controllable length and difficulty: An important dimension to consider when evaluating LCLMs is the input length, as longer inputs can provide more information while challenging the model's ability to process noisy contexts. In our tasks, we can control the input length by changing the number of retrieved passages (RAG, Cite, Re-rank), the number of demonstrations (ICL), or the length of the input document (LongQA, Summ).
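
To make this concrete, below is a minimal sketch (not HELMET's actual code) of how a RAG-style input can be grown to a target length by packing more retrieved passages; the token counter and passage source are placeholders.

```
# Minimal sketch (not HELMET's implementation): build a RAG-style prompt
# at a target context length by adding retrieved passages until a token
# budget is reached. `count_tokens` is a stand-in for a real tokenizer.

def count_tokens(text: str) -> int:
    # Placeholder: approximate tokens by whitespace-separated words.
    return len(text.split())

def build_rag_prompt(question: str, passages: list[str], budget: int) -> str:
    """Pack as many passages as fit within `budget` tokens, then append the question."""
    selected, used = [], count_tokens(question)
    for p in passages:
        cost = count_tokens(p)
        if used + cost > budget:
            break
        selected.append(p)
        used += cost
    context = "\n\n".join(f"Passage {i + 1}: {p}" for i, p in enumerate(selected))
    return f"{context}\n\nQuestion: {question}\nAnswer:"

# Example: sweep budgets to create 8K-, 64K-, and 128K-token variants of the same task.
if __name__ == "__main__":
    passages = [f"(passage {i} text ...)" for i in range(1000)]
    for budget in (8_000, 64_000, 128_000):
        prompt = build_rag_prompt("Who wrote the paper?", passages, budget)
        print(budget, count_tokens(prompt))
```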

Reliable evaluation: Many existing benchmarks still use n-gram-based metrics, such as ROUGE, despite their poor correlation with human judgments. We employ model-based evaluations that show better distinguishability between models and different input lengths. Furthermore, our human studies show that our metrics have a high agreement with human judgments.
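
As a rough illustration of what such a model-based metric can look like, here is a small sketch in which a judge LM grades a candidate summary against a reference; the judge prompt and the `judge` callable are illustrative stand-ins, not HELMET's actual rubric.

```
# Illustrative sketch of a model-based metric (not HELMET's actual judge prompt).
# `judge` is any callable that sends a prompt to a strong LM and returns its text.
from typing import Callable

JUDGE_TEMPLATE = """You are grading a summary against a reference.
Reference: {reference}
Candidate: {candidate}
Does the candidate cover the key points of the reference without
introducing unsupported claims? Answer "yes" or "no"."""

def model_based_score(candidate: str, reference: str, judge: Callable[[str], str]) -> float:
    """Return 1.0 if the judge model accepts the candidate, else 0.0."""
    prompt = JUDGE_TEMPLATE.format(reference=reference, candidate=candidate)
    verdict = judge(prompt).strip().lower()
    return 1.0 if verdict.startswith("yes") else 0.0
```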

LongProc: Benchmarking LMs on Long-Output Tasks

Complementary to HELMET, we introduce LongProc, a new benchmark that focuses on long-form generation. LongProc evaluates LCLMs through long procedural generation, which requires models to follow complex instructions, aggregate substantial information, and generate long outputs. LongProc includes six diverse tasks (see Figure 2 for examples), such as extracting target information from lengthy unstructured HTML documents into structured TSV files, or creating trip plans by executing prolonged search procedures.

Figure 2: Example tasks in LongProc. Tasks in LongProc require LMs to follow a given procedure detailed by instructions and generate long-form outputs (up to 8K tokens).

These tasks present various challenges in information access, multi-step reasoning, and complex search procedures, with direct relevance to practical applications like web agents and real-world planning tasks. The required capability for long-context reasoning also serves as a foundation for recent advances demonstrated by OpenAI o1/o3 and DeepSeek-R1 models. Furthermore, since all tasks follow deterministic procedures and produce structured outputs, we can reliably evaluate long outputs using rule-based metrics.
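
To illustrate why structured outputs make rule-based scoring possible, here is a simplified sketch of row-level scoring for a TSV-extraction task; the exact matching rules used in LongProc may differ.

```
# Simplified sketch of rule-based evaluation for a TSV-extraction task
# (the exact scoring rules in LongProc may differ). Because the target
# procedure is deterministic, correctness reduces to comparing rows.

def parse_tsv(text: str) -> set[tuple[str, ...]]:
    """Parse TSV output into a set of normalized rows."""
    rows = set()
    for line in text.strip().splitlines():
        fields = tuple(f.strip() for f in line.split("\t"))
        if any(fields):
            rows.add(fields)
    return rows

def tsv_f1(prediction: str, reference: str) -> float:
    """Row-level F1 between predicted and reference TSV tables."""
    pred, ref = parse_tsv(prediction), parse_tsv(reference)
    if not pred or not ref:
        return 0.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```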

Long-context LMs Still Have a Long Way to Go 

Our experiments and analyses evaluate over 50 LCLMs on HELMET and over 20 LCLMs on LongProc. To our knowledge, this is the most thorough and controlled comparison of long-context models on diverse applications, covering both leading proprietary and open-source models. We also consider models with different architectures (e.g., full-attention transformers, hybrid architectures) and different post-training methods (e.g., instruction-tuned models and reasoning models). In this section, we highlight a few key findings from our experiments.

Models Degrade with Increasing Input and Output Lengths

We present evaluation results on HELMET and LongProc. Overall, we find that current models are limited both in processing long inputs and in generating long outputs. In this blog post, we show results from several representative frontier proprietary models and open-source models; additional results can be found in the paper and on the website.

Figure 3: HELMET results on selected instruction-tuned models across tasks and input lengths.

First, we assess models' capabilities of processing long inputs on HELMET. We observe performance degradation with increasing input lengths: even the most advanced models, such as GPT-4o and Gemini, experience a significant decrease in performance, especially on more challenging tasks like re-ranking. Open-source models also lag behind closed-source models; although the gap appears small on simpler tasks, such as Recall, it widens on more complex ones, such as Cite.

Figure 4: LongProc results on selected instruction-tuned models across tasks and output lengths.

We also observe performance degradation with increasing output length in our LongProc evaluation. We configure each task in LongProc at three difficulty levels with maximum output requirements of 500, 2K, and 8K tokens. While all tested models claim a context window size above 32K tokens, open-weight models typically falter on 2K-token tasks, and closed-source models show significant degradation on 8K-token tasks. Together with results on HELMET, we find current long-context LMs still have substantial room for improvement.

Diverse Evaluation is Needed for Assessing Long-Context Abilities

Long-context benchmarks are often constructed with specific applications in mind, such as summarization or question answering, which limits our understanding of LCLMs in a broader context. We examine model performance over a wide range of real tasks and find that different task categories do not always correlate with each other (Figure 5).

Figure 5: Different categories do not correlate well with each other.

While some tasks moderately correlate with each other (e.g., RAG and MS-MARCO) due to their retrieval-based nature, others show little correlation (e.g., LongQA and Cite). Notably, ICL has the lowest correlation with other tasks, which suggests that it is a unique task that requires different capabilities from the model. Therefore, model developers should evaluate across these distinct axes to draw a more holistic picture of the model's capabilities.
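
For readers who want to run a similar analysis on their own results, the sketch below computes pairwise Spearman correlations between task categories across models (it requires scipy); the scores shown are placeholders, not our actual numbers.

```
# Sketch of the kind of analysis behind Figure 5: pairwise Spearman
# correlation of per-category scores across models. Scores are placeholders.
from itertools import combinations
from scipy.stats import spearmanr

scores = {  # category -> score per model (same model order in every list)
    "Recall": [0.95, 0.90, 0.70, 0.60],
    "RAG":    [0.80, 0.75, 0.55, 0.50],
    "ICL":    [0.85, 0.40, 0.65, 0.30],
}

for a, b in combinations(scores, 2):
    rho, _ = spearmanr(scores[a], scores[b])
    print(f"{a} vs {b}: Spearman rho = {rho:.2f}")
```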

How Reasoning Models Perform on Long-Context Tasks

The recent emergence of large reasoning models has dramatically extended the length of chain-of-thought (CoT) reasoning that language models can generate. This capability closely aligns with long-context abilities, especially the skills tested in LongProc. When testing various reasoning models against their base or instruction-tuned counterparts, we find that reasoning models outperform instruction-tuned models on LongProc tasks. This performance gap suggests potential benefits of long reasoning training for enhancing general long-form generation capabilities. However, reasoning models are not universally better, as they sometimes underperform on long-input tasks in HELMET. This highlights the need for further research to better adapt reasoning models for processing extensive inputs.

Supporting Future Development with HELMET and LongProc

Both HELMET and LongProc are available in a unified HELMET evaluation platform. Getting started is easy: simply clone our GitHub repository and set up the environment, and everything is ready to go. You can run our evaluation tasks with just:

```
python eval.py --config configs/rag.yaml --model_name_or_path <model_name>
```

For faster iterations during model development, we recommend using the Recall and RAG tasks. These tasks achieve a good balance between fast evaluation and correlation with other realistic tasks.
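
For example, a quick development loop might run only these two configurations for a candidate model; the `recall.yaml` file name below is an assumption based on the command above, so check the repository's configs/ directory for the exact names.

```
# Example of a quick development loop: run only the fast, well-correlated
# tasks for a candidate model. The config file names are assumptions based
# on the command above; check the repository's configs/ directory.
import subprocess

MODEL = "<model_name>"
for config in ("configs/recall.yaml", "configs/rag.yaml"):
    subprocess.run(
        ["python", "eval.py", "--config", config, "--model_name_or_path", MODEL],
        check=True,
    )
```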

By evaluating on HELMET and LongProc, researchers can directly compare their models to existing ones simply by referencing our results, which cover 59 models of different sizes and architectures. You can find more details on the HELMET leaderboard and LongProc leaderboard.

References

@inproceedings{helmet,
    title={HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly}, 
    author={Howard Yen and Tianyu Gao and Minmin Hou and Ke Ding and Daniel Fleischer and Peter Izsak and Moshe Wasserblat and Danqi Chen},
    year={2025},
    booktitle={International Conference on Learning Representations (ICLR)},
}

@article{longproc,
   title={LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation},
   author={Ye, Xi and Yin, Fangcong and He, Yinghui and Zhang, Joie and Yen, Howard and Gao, Tianyu and Durrett, Greg and Chen, Danqi},
   journal={arXiv preprint},
   year={2025}
}