Jan. 4, 2024

Black-Box Detection of Pretraining Data


2023 has witnessed the proliferation of strong large language models (LLMs) like GPT-4, LLaMA-2, Mistral, etc. These models are typically trained using corpora consisting of trillions of tokens scraped from internet web pages. As the scale of the data used to pretrain LLMs has risen to unprecedented levels over recent months, model developers have often chosen not to release their data or even fully disclose its composition. Although these decisions could be motivated by retaining a competitive advantage over other developers or thwarting potential misuse, this lack of transparency can be problematic. We list a few major concerns:

  • Copyright infringement: There has been much debate over whether training LLMs using copyrighted materials constitutes fair use or represents copyright infringement. Anecdotal observations that LLMs can regurgitate text from copyrighted works they are trained on have led to several recent high-profile lawsuits against these developers. For instance, a lawsuit against GitHub in early 2023 alleged that its LLM-based code-generation service Copilot reproduced code copyrighted by the plaintiffs. In July 2023, multiple novelists sued OpenAI for infringing on copyrights to their works. An open letter signed by over 10,000 writers accused OpenAI of using “millions of copyrighted books, articles, essays, and poetry” to train their AI systems.
  • Benchmark contamination: Benchmarks such as MMLU, ARC, etc., consisting of downstream tasks, remain the chief means of extrinsically evaluating the quality of an LLM. There have been increasing concerns over recent months about the leakage of test data from these benchmarks into LLM pretraining data. Researchers have found that it is possible to prompt ChatGPT to generate memorized datapoints from multiple popular downstream benchmarks (e.g., SQuAD, MNLI), indicating this type of contamination. Additionally, non-negligible amounts of test data from benchmark datasets have also been found in commonly used pretraining corpora (Elazar et al., 2023; Dodge et al., 2021; Brown et al., 2020; Touvron et al., 2023). The use of standard benchmarks to measure LLMs’ understanding and generalization abilities poses problems since any leakage of such a benchmark’s test set into an LLM’s pretraining data would render its benchmark performance an overestimate of the true extent of its abilities. Although initial evaluations found GPT-4’s performance on code-generation tasks extremely impressive, it was later found that these results were likely influenced by contamination: GPT-4 could solve 100% of a set of pre-2021 Codeforces problems but none from a set of more recent problems of the same difficulty. In a somewhat satirical recent article (Schaeffer, 2023), it was also shown that pretraining on test sets is “all you need” to achieve SOTA performance on benchmarks.
  • Privacy concerns: LLMs trained on large amounts of data scraped from the internet are likely to have encountered personally identifiable information, including private emails, phone numbers, addresses, etc. It has been found that LLMs can be prompted to generate memorized (Carlini et al., 2018) private email conversations (Mozes et al., 2023) and can even be used to extract personally identifiable information (Carlini et al., 2020; Lukas et al., 2023). Nasr et al. (2023) found that simple prompting techniques can extract gigabytes of sensitive training data from both open and closed-source LLMs. They also show that simple attacks can prove effective even on RLHF-tuned models like ChatGPT.

It has been argued that the techniques used by developers to detect contamination in practice are often superficial and brittle (Narayan & Kapoor, 2023). Even when efforts are taken to filter out these types of undesirable content from pretraining corpora, they may not be entirely successful since the enormous scale of the data can make studying it difficult (Kreutzer et al., 2022; Elazar et al., 2023). There have been some proposals (Marone & Van Durme, 2023) for developers to use lightweight tools that allow testing for the membership of arbitrary strings in pretraining data, but these haven’t yet been widely adopted.

In response to these concerns, we investigate the following question in our work:

Given a piece of text and black-box access only to an LLM’s logits, is it possible to determine if the text occurred in the model’s pretraining data?

There has been a rich body of prior work exploring similar questions in the space of membership inference attacks (MIAs) (Shokri et al., 2016; Song & Shmatikov, 2019; Shejwalkar et al., 2021; Mahloujifar et al., 2021), which involve determining whether an arbitrary data point x existed in a model’s training set (is ‘member’ data) or did not (is ‘non-member’ data). However, most existing state-of-the-art MIA approaches assume implicit access to the distribution that the model’s pretraining data was sampled from, relying on a reference model (often of the same model family) trained on data from this distribution (Carlini et al., 2022; Watson et al., 2022). This assumption is untenable in our setting: we do not know the details of these distributions for contemporary SOTA LLMs like GPT-4 and LLaMA-2, let alone have the ability to draw enough samples from them to train reference models.

Another factor that makes it hard to apply existing MIA methods to this setting is the disparity in the scale of data and compute used during pretraining and finetuning. Most existing MIA literature is geared towards determining if a given data point was encountered by a model during its finetuning phase. Since MIA intuitively becomes more difficult with larger datasets, fewer training epochs, and lower learning rates, existing MIA methods are difficult to apply to the pretraining data detection problem.

Additionally, the fact that the pretraining corpora of most contemporary LLMs are unknown makes it challenging to even objectively evaluate the effectiveness of MIA methods on this problem. Noting the lack of standardized evaluations in this space, we introduce the WikiMIA benchmark.

WikiMIA: A Benchmark for Pretraining Data Detection Methods

Noticing that Wikipedia is a very commonly used data source for contemporary LLMs, we created and released a benchmark called WikiMIA for evaluating pretraining data detection methods by leveraging Wikipedia timestamp data. In particular, Wikipedia event data created before the year 2017 is very likely to have been a part of contemporary LLMs’ pretraining corpora. On the other hand, articles about 2023 events are guaranteed to be absent from the pretraining corpora of models released before January 1, 2023. Consequently, we can confidently assume data from pre-2017 articles to be member data in these LLMs’ corpora and 2023 articles to be non-member data.

Member and Non-Member Data

The definitions of Member and Non-Member Data.

WikiMIA includes not only verbatim snippets from these articles but also ChatGPT-paraphrased semantically equivalent ones. These settings, along with offering snippets truncated to various lengths, allow for more systematic and grounded evaluations of MIA methods than previous such benchmarks.
Min-K% Prob: A Novel Pretraining Data Detection Method

Besides introducing a benchmark, we also present a novel reference-free MIA method aimed at the pretraining data detection problem. Unlike many existing methods, Min-K% Prob does not rely on any knowledge of the pretraining corpus, does not use reference models, and does not require any additional training. Instead, Min-K% Prob only requires black-box access to an LLM’s logits and works based on the following hypothesis:

When a pretrained LLM is used to score the tokens of a piece of text, the text is much more likely to contain a few outlier tokens with very low probabilities if it did not occur in the model’s pretraining data than if it did.

Outlier Tokens in the context of Pretraining Data

Outlier tokens are less common when the text is drawn from the pretraining data.

Given a sequence of tokens, we devise a method called Min-K% Prob that selects the k% of tokens to which the model assigns the lowest probabilities and computes the average log-likelihood only over these selected tokens. If the hypothesis is correct, text that was encountered during pretraining will elicit higher Min-K% Prob scores than text that wasn’t. Hence, it may be possible to distinguish between member and non-member text by choosing an appropriate threshold over these scores.
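The scoring rule itself is only a few lines, assuming we already have per-token log-probabilities of the text under the target model (obtainable from its logits alone). The function name and default k below are illustrative:

```python
import math

def min_k_prob(token_log_probs: list[float], k: float = 0.2) -> float:
    """Average log-likelihood over the k% of tokens with the lowest
    probabilities under the target LLM (a sketch of the scoring rule)."""
    n = max(1, int(len(token_log_probs) * k))
    outliers = sorted(token_log_probs)[:n]  # the k% least-probable tokens
    return sum(outliers) / n

# Member-like text: uniformly high token probabilities.
member_lps = [math.log(0.9)] * 10
# Non-member-like text: mostly high, but with a few low-probability outliers.
non_member_lps = [math.log(0.9)] * 8 + [math.log(0.01), math.log(0.02)]
```

A higher score suggests member text; thresholding these scores then yields a binary detector.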

Min-K% Prob

How Min-K% Prob works.

We show that Min-K% Prob outperforms baselines from previous work on the pretraining data detection problem by conducting evaluations using WikiMIA. These baselines utilize diverse techniques for performing MIA, including assessing losses (Yeom et al., 2017), probability curvature (Mattern et al., 2023), lower-cased perplexities, zlib compression entropies, and using perplexity judgments from smaller reference models pretrained on the same data (Carlini et al., 2021). We find that Min-K% Prob consistently yields higher AUC detection scores than these baselines, over both verbatim and paraphrased Wikipedia snippets, for every LLM we study.
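For concreteness, two of these reference-free baselines can be sketched as follows. The exact normalizations in the cited papers may differ slightly, so treat these formulas as illustrative rather than definitive:

```python
import zlib

def loss_score(token_log_probs: list[float]) -> float:
    """LOSS baseline: mean negative log-likelihood of the text under the
    target model; lower values suggest member text."""
    return -sum(token_log_probs) / len(token_log_probs)

def zlib_ratio(text: str, token_log_probs: list[float]) -> float:
    """zlib baseline: model loss normalized by the text's zlib-compressed
    length, a cheap proxy for its intrinsic entropy."""
    compressed_len = len(zlib.compress(text.encode("utf-8")))
    return loss_score(token_log_probs) / compressed_len
```

Both are, like Min-K% Prob, computable from the model's logits alone, which is what makes them fair points of comparison in the black-box setting.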

Min-K% Prob outperforms other techniques

Min-K% Prob consistently outperforms other techniques at detecting pretraining data.

We also demonstrate the utility of Min-K% Prob for more directly addressing the concerns motivating our work. Specifically, we show that Min-K% Prob can be used for detecting the presence of copyrighted material and leaked benchmark data in pretraining corpora. We also demonstrate the ability to audit “machine unlearning” approaches for LLMs using Min-K% Prob.

Case Study: Copyrighted Content Detection

We construct a validation set using two collections of books: those found by prior work (Chang et al., 2023) to very likely exist in ChatGPT’s training data, and those whose first editions were published after ChatGPT’s training cutoff, representing member and non-member examples, respectively. Using a Min-K% Prob-based classifier calibrated on this set, we identify 20 copyrighted books from the Books3 corpus that very likely occurred in ChatGPT’s pretraining data: nearly 100% of the snippets we extracted from random locations in these books were classified as having been a part of its pretraining data.
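The book-level decision rule can be sketched as follows, where the threshold is calibrated on the member/non-member validation set (the function name and numbers are our own, for illustration):

```python
def contamination_rate(snippet_scores: list[float], threshold: float) -> float:
    """Fraction of a book's snippets whose Min-K% Prob score exceeds a
    threshold calibrated on the member/non-member validation set."""
    flagged = sum(1 for score in snippet_scores if score > threshold)
    return flagged / len(snippet_scores)
```

A book whose snippets are almost all flagged, i.e. a rate near 1.0, is then deemed highly likely to have been in the pretraining corpus.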

Books that likely formed a part of ChatGPT's pretraining data

Books that likely formed a part of ChatGPT's pretraining data.

Case Study: Detecting Downstream Dataset Contamination

We simulate the leakage of test examples from downstream benchmarks into LLM pretraining data by continually finetuning LLaMA on a corpus of text sampled from the RedPajama corpus containing inserted instances of downstream test examples. We introduce these formatted test examples (that constitute member examples) in contiguous segments at random positions within the sampled RedPajama text and also maintain another set of held-out formatted examples as non-member instances. We show once more that Min-K% Prob outperforms previous baselines at detecting downstream dataset contamination.
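The contamination setup can be sketched as follows: each formatted test example is inserted as a contiguous segment at a random position in the sampled corpus, while held-out examples are never inserted. This is a simplification of the protocol described above, with names of our own choosing:

```python
import random

def contaminate(corpus_docs: list[str], test_examples: list[str],
                seed: int = 0) -> list[str]:
    """Insert formatted benchmark test examples (the member instances) at
    random positions within a sampled pretraining corpus."""
    rng = random.Random(seed)  # seeded for reproducibility
    docs = list(corpus_docs)
    for example in test_examples:
        docs.insert(rng.randrange(len(docs) + 1), example)
    return docs
```

Continued finetuning on the contaminated corpus then lets us test whether a detection method can separate the inserted examples from the held-out ones.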

Min-K% Prob is effective at detecting downstream contamination

Min-K% Prob is also effective at detecting downstream contamination.

We also perform empirical analyses confirming theoretical results from prior work that (1) Outlier contaminants become easier to detect with increasing dataset size (Feldman, 2020; Zhang et al., 2021), (2) In-distribution contaminants become harder to detect with increasing dataset size, and that (3) More frequently occurring contaminants are easier to detect (Kandpal et al., 2022).

Case Study: Auditing “Machine Unlearning”

Our audit of Machine Unlearning

Our audit of Machine Unlearning.

Eldan & Russinovich (2023) recently proposed a novel approach for performing machine unlearning in LLMs: inducing them to selectively unlearn all knowledge relevant to a specific domain (without impacting their knowledge of unrelated domains). Specifically, the authors demonstrated that they could finetune a LLaMA2-7B-Chat model to eliminate memorized content from the Harry Potter books, thus resulting in a publicly released LLaMA2-7B-WhoIsHarryPotter model. However, we showed through our experiments that this unlearning effort was not entirely successful. By computing Min-K% Prob scores using LLaMA2-7B-WhoIsHarryPotter for multiple snippets from the Harry Potter books, we isolated suspicious snippets that the unlearned model may not have forgotten. By posing these snippets as story-completion inputs or even as direct questions, we showed that one can elicit outputs from LLaMA2-7B-WhoIsHarryPotter that clearly demonstrate persistent knowledge of the Harry Potter books.
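The audit's ranking step can be sketched as follows; this is an illustrative helper, not the authors' actual code:

```python
def suspicious_snippets(scored_snippets: list[tuple[str, float]],
                        top_n: int = 5) -> list[str]:
    """Rank snippets by their Min-K% Prob score under the 'unlearned' model
    and return the highest-scoring ones, i.e. those the model most likely
    failed to forget."""
    ranked = sorted(scored_snippets, key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:top_n]]
```

The returned snippets are then posed to the model as story-completion prompts or direct questions to check whether the knowledge truly persists.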

Snippets that were not successfully unlearned

Snippets from the Harry Potter books that were not successfully unlearned.