Authors
Oscar Sainz, Iker García-Ferrero, Eneko Agirre, Jon Ander Campos, Alon Jacovi, Yanai Elazar, Yoav Goldberg
Publication date
August 2024
Conference
Proceedings of the 1st Workshop on Data Contamination (CONDA)
Description
Welcome to the Proceedings of the first iteration of the Workshop on Data Contamination (CONDA). The workshop is hosted at ACL 2024 in Thailand on August 16, 2024.
Data contamination in NLP, where evaluation data is inadvertently included in pre-training corpora, has become a growing concern. The increasing scale of both models and training data, coupled with unsupervised web crawling, has led to segments of evaluation benchmarks being included in the pre-training datasets of large language models (LLMs). The noisy nature of internet data makes it difficult to prevent this contamination from happening, or even to detect when it has happened. Crucially, when evaluation data becomes part of the pre-training data, it introduces biases and can artificially inflate the performance of LLMs on specific tasks or benchmarks. This poses a challenge for the fair and unbiased evaluation of NLP models, as their measured performance may not accurately reflect their generalization capabilities.
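To illustrate why detection is non-trivial, a common baseline (not a method from these proceedings) is to flag a benchmark example when any of its word n-grams appears verbatim in a pre-training document. The minimal sketch below, with hypothetical helper names (`ngrams`, `is_contaminated`) and simple whitespace tokenization assumed, shows such a check; note that it already misses paraphrased or reformatted copies of the same example.

```python
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    # Lowercase, whitespace-tokenize, and collect word n-grams.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(benchmark_example: str,
                    pretraining_docs: Iterable[str],
                    n: int = 13) -> bool:
    """Flag a benchmark example if any of its n-grams occurs verbatim
    in a pre-training document (simple overlap heuristic)."""
    example_grams = ngrams(benchmark_example, n)
    if not example_grams:
        return False
    return any(example_grams & ngrams(doc, n) for doc in pretraining_docs)

# A verbatim copy is caught, a paraphrase slips through.
example = "The quick brown fox jumps over the lazy dog near the river bank today"
corpus = [
    "scraped page text: the quick brown fox jumps over the lazy dog near the river bank today",
    "a fast brown fox leaps over a sleepy dog by the riverside this morning",
]
print(is_contaminated(example, corpus, n=8))      # True: verbatim overlap found
print(is_contaminated(example, corpus[1:], n=8))  # False: paraphrase is missed
```

The choice of n trades precision against recall: short n-grams over-flag common phrases, while long ones miss lightly edited copies, which is part of why contamination detection remains an open problem.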