Evaluating model performance is essential in the rapidly advancing fields of Artificial Intelligence and Machine Learning, especially with the rise of Large Language Models (LLMs). Careful evaluation helps researchers understand what these models can do and build dependable systems on top of them. However, what are referred to as Questionable Research Practices (QRPs) frequently jeopardize the integrity of these assessments. Such practices can greatly exaggerate published results, misleading the scientific community and the general public about the actual effectiveness of ML models.
The primary driving force behind QRPs is the ambition to publish in esteemed journals or to attract funding and users. Because ML research spans pre-training, post-training, and evaluation stages, there are many opportunities for QRPs to creep in. These practices fall into three basic categories: contamination, cherrypicking, and misreporting.
Contamination
Contamination occurs when data from the test set is used for training, evaluation, or even in model prompts. High-capacity models such as LLMs can memorize test data they are exposed to during training. Researchers have documented this problem extensively, detailing cases in which models were trained on test data either deliberately or unintentionally. Contamination can occur in several ways:
- Training on the Test Set: Test data unintentionally included in the training set yields unduly optimistic performance estimates (a simple overlap check is sketched after this list).
- Prompt Contamination: During few-shot evaluations, using test data in the prompt gives the model an unfair advantage.
- Retrieval Augmented Generation (RAG) Contamination: Test data leaking into model outputs through retrieval systems whose indexes contain benchmark material.
- Dirty Paraphrases and Contaminated Models: Training on rephrased test data, or on data generated by models that were themselves contaminated.
- Over-hyping and Meta-contamination: Tuning hyperparameters or design choices after seeing test results, or recycling designs from contaminated models.
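Training-on-the-test-set contamination can often be surfaced with a simple overlap check between the training corpus and the benchmark's test split. The sketch below is illustrative only; the function names and the 50-character n-gram heuristic are assumptions, not a method from the paper.

```python
from typing import Iterable, Set


def char_ngrams(text: str, n: int = 50) -> Set[str]:
    """Overlapping character n-grams of a normalized string (whole string if shorter than n)."""
    text = " ".join(text.lower().split())  # lowercase and collapse whitespace
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def contamination_report(train_docs: Iterable[str],
                         test_examples: Iterable[str],
                         n: int = 50) -> float:
    """Fraction of test examples sharing at least one long n-gram with the training data."""
    train_grams: Set[str] = set()
    for doc in train_docs:
        train_grams |= char_ngrams(doc, n)

    test_examples = list(test_examples)
    flagged = sum(1 for ex in test_examples if char_ngrams(ex, n) & train_grams)
    return flagged / max(len(test_examples), 1)


if __name__ == "__main__":
    train = ["The quick brown fox jumps over the lazy dog. " * 3]
    test = ["The quick brown fox jumps over the lazy dog. " * 3,
            "A completely unrelated sentence."]
    print(f"Flagged fraction of test set: {contamination_report(train, test):.2f}")
```

In practice, contamination checks on web-scale corpora use more scalable machinery (hashing, suffix arrays), but the underlying idea of flagging long shared substrings is the same.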
Cherrypicking
Cherrypicking is the practice of adjusting experimental conditions to support the intended result. Researchers may test their models several times under different scenarios and publish only the best outcomes. This comprises the following:
- Baseline Nerfing: Deliberately under-optimizing baseline models to give the impression that the new model is better.
- Runtime Hacking: Modifying inference parameters after the fact to improve performance metrics.
- Benchmark Hacking: Choosing easier benchmarks, or subsets of benchmarks, on which the model is known to perform well.
- Golden Seed: Training with several random seeds and reporting only the top-performing one (see the sketch after this list).
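The golden-seed effect is easy to see numerically: the maximum over several seeds sits systematically above the mean, so reporting only the best seed inflates the result. A minimal simulation follows; the score distribution is made up purely for illustration.

```python
import random
import statistics

random.seed(0)

# Pretend each training seed yields a benchmark accuracy drawn from the same distribution.
NUM_SEEDS = 10
scores = [random.gauss(mu=0.72, sigma=0.02) for _ in range(NUM_SEEDS)]

mean = statistics.mean(scores)
std = statistics.stdev(scores)

print(f"Honest report : {mean:.3f} +/- {std:.3f} over {NUM_SEEDS} seeds")
print(f"'Golden seed' : {max(scores):.3f} (best single seed only)")
```

The gap between the two printed numbers is exactly the optimism a reader absorbs when only the best seed is published.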
Misreporting
Misreporting covers a variety of practices in which researchers draw broad claims from skewed or limited evidence. Examples include the following:
- Superfluous Cog: Claiming originality by adding unnecessary modules.
- Whack-a-mole: Monitoring for specific malfunctions and patching them one at a time as they appear, rather than addressing the underlying problem.
- P-hacking: The selective presentation of statistically significant findings.
- Point Scores: Reporting results from a single run without error bars, ignoring variability (see the sketch after this list).
- Outright Lies and Over/Underclaiming: Fabricating results or making inaccurate claims about the model's capabilities.
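The point-scores problem is addressed by reporting uncertainty alongside the score, for example a bootstrap confidence interval over per-example correctness. The sketch below is a generic illustration with placeholder data, not a procedure prescribed by the paper.

```python
import random
from typing import List, Tuple


def bootstrap_ci(per_example_correct: List[int],
                 n_resamples: int = 2000,
                 alpha: float = 0.05) -> Tuple[float, float, float]:
    """Mean accuracy with a percentile bootstrap confidence interval."""
    n = len(per_example_correct)
    point = sum(per_example_correct) / n
    means = []
    for _ in range(n_resamples):
        resample = [random.choice(per_example_correct) for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return point, low, high


if __name__ == "__main__":
    random.seed(0)
    # Placeholder outcomes: 1 = benchmark item answered correctly, 0 = incorrect.
    outcomes = [1] * 140 + [0] * 60
    acc, low, high = bootstrap_ci(outcomes)
    print(f"Accuracy {acc:.3f} (95% CI [{low:.3f}, {high:.3f}]) rather than a bare point score")
```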
Irreproducible Research Practices (IRPs), in addition to QRPs, add to the complexity of the ML evaluation landscape. IRPs make it difficult for subsequent researchers to replicate, build upon, or scrutinize earlier work. A common example is dataset concealing, in which researchers withhold details about the training datasets they use, including metadata. This practice is often motivated by the competitive nature of ML research and by concerns about copyright infringement. The resulting lack of transparency in dataset sharing hampers the validation and replication of findings, which are essential to scientific progress.
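Dataset concealing can be mitigated by releasing even lightweight metadata alongside a model. The field names below are hypothetical and meant only to illustrate the kind of information whose absence makes replication hard; they are not a standard schema.

```python
import json

# Hypothetical dataset card: illustrative fields, not an established standard.
dataset_card = {
    "name": "example-pretraining-corpus",
    "sources": ["public web crawl (snapshot date unspecified here)"],
    "size_tokens": 1_000_000_000,
    "license": "mixed; see per-source notes",
    "preprocessing": ["deduplication", "language filtering"],
    "decontamination": "exact-match removal of known benchmark test sets",
    "known_overlaps_with_benchmarks": [],
}

print(json.dumps(dataset_card, indent=2))
```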
In conclusion, the integrity of ML research and evaluation is critical. Although QRPs and IRPs may benefit companies and researchers in the short term, they erode the field's credibility and dependability over the long run. As ML models are deployed more widely and exert a greater influence on society, establishing and upholding strict standards for research practice becomes essential. The full potential of ML models can only be realized through openness, accountability, and a commitment to ethical research. The community must work together to recognize and address these practices, ensuring that progress in ML is grounded in honesty and fairness.
Check out the Paper. All credit for this research goes to the researchers of this project.
Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading teams, and managing work in an organized manner.