Machine Learning Challenges — 2 — Reproducibility | by Emre Koçyiğit | May, 2024

In the previous article, I introduced and discussed the first common Machine Learning (ML) challenge, “Data Quality”. Now, let’s proceed to another common ML challenge, “Reproducibility”.

Reproducibility matters throughout the research world, and particularly in ML. After explaining the concept and underlining its crucial points, I will conclude with some best-practice suggestions.

What and Why?

First of all, let’s answer the questions “what is this” and “what is not this” to prevent any misuse or term confusion from the beginning.

Reproducibility is the process of obtaining the exact results reported with the same experimental setup. It is often confused with “replicability,” which is the process of obtaining the same results with a different experimental setup [1], [2]. When we refer to “reproducibility,” we mean generating the same study results using identical methods. This concept serves various purposes, such as preventing unnecessary duplication, drawing inspiration or lessons from the work of others, and validating findings [3].

Reproducibility is crucial for both ML researchers and practitioners working in production environments, since ML projects need to be inspected and tested by other stakeholders, and the ML lifecycle is an experimental process. It is necessary for building trustworthy ML models, carrying out verification and debugging, promoting collaboration, and addressing fairness issues [4]. Reproducible ML research also appears to have a better chance of being accepted at top conferences [5].

Tatman et al. presented three levels of reproducibility (low, medium and high); the highest level involves sharing the code, data and environment so that others can obtain the same results [2]. Semmelrock et al. also presented three degrees: experiment, data and method reproducibility. They also pointed out four reproducibility challenges [1]:

  • Computational: Inherent nondeterminism causes different results even if you use the same data and code. Environmental differences, such as different GPUs, CPUs, compilers or libraries, can also change the results. To mitigate this issue, you should fix the random number seeds.
  • Missing data or code: You cannot get the same results from an ML model trained on different data. To solve this issue, you should ensure that the data is complete and exactly the same.
  • Methodological: Even if you use the same data and computational resources with the same random seed values, you will get different results if you fail to prevent data leakage or use different train-test splits. Make sure that you are using the same methodology, splits and data sets.
  • Structural: Academia and industry may be less motivated to ensure reproducibility for reasons such as keeping a competitive advantage or privacy concerns. These issues usually need case-specific solutions.
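
As a minimal sketch of the "exactly the same data" point above, one option is to record a checksum of the data set file when an experiment is run and to compare it before any rerun. The helper names file_sha256 and data_matches are hypothetical and chosen only for illustration; the sketch assumes the data lives in a single file and uses nothing beyond Python's standard hashlib.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size chunks so large data sets do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def data_matches(path, expected_digest):
    """True only if the file is byte-for-byte identical to the fingerprinted one."""
    return file_sha256(path) == expected_digest
```

The digest can be stored next to the reported results, so that anyone rerunning the experiment can first confirm they are working with the same bytes.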


Be aware that wherever randomness is present, it can cause problems for reproducibility, for example when the weights of a neural network are randomly initialized. Therefore, it’s advisable to fix the seed values [6], for example with TensorFlow:

import tensorflow as tf
tf.random.set_seed(42)  # fix TensorFlow's global random seed

You can shuffle the data with scikit-learn’s shuffle utility, passing the same random_state value on every run.

from sklearn.utils import shuffle
# value: a fixed integer seed, kept identical across runs
data = shuffle(data, random_state=value)

Scikit-learn’s commonly used train_test_split function takes the same parameter:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

and many more:

  • NumPy → numpy.random.RandomState
  • PyTorch → torch.manual_seed()
  • …
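
The seeding calls above can be gathered into a single helper, sketched here under a few assumptions: the name seed_everything is hypothetical, NumPy, PyTorch and TensorFlow are treated as optional and skipped when they are not installed, and only the global generators are seeded. Fixing seeds in this way does not by itself guarantee bit-identical results on every GPU or library version.

```python
import random


def seed_everything(seed):
    """Seed the global random generators of whichever libraries are installed."""
    random.seed(seed)  # Python's built-in generator
    seeded = ["random"]

    try:
        import numpy as np
        np.random.seed(seed)  # NumPy's legacy global generator
        seeded.append("numpy")
    except ImportError:
        pass

    try:
        import torch
        torch.manual_seed(seed)  # PyTorch's default generators
        seeded.append("torch")
    except ImportError:
        pass

    try:
        import tensorflow as tf
        tf.random.set_seed(seed)  # TensorFlow's global seed
        seeded.append("tensorflow")
    except ImportError:
        pass

    return seeded  # names of the libraries that were actually seeded
```

Calling seed_everything(42) once at the top of a training script keeps the seed in one visible place, which also makes it easy to record in a checklist or info sheet.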

The first step is to be aware of reproducibility. Beyond that, three aspects are important [1]:

  1. Checklists: You can create a detailed checklist for others to follow. While doing so, you can look at existing checklists such as [7] and [8] for inspiration and adapt them for your team.
  2. Standardised environments: Container software such as Docker can be used to maintain a standardised environment. Even if you use the same library and seed value, a different library version that adds one more random step will change the results. So, ensure that you are using the same packages with the same versions, and add this to the checklist.
  3. Model documents (info sheets): You can create model info sheets that describe the data and the method, for example the train_test_split settings.
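
To illustrate the last two points, here is a rough sketch of a machine-readable info sheet that records package versions together with the seed and split settings. The function name build_info_sheet and the field names are assumptions made for this example, not a standard format; package versions are read with Python's standard importlib.metadata.

```python
import json
import platform
from importlib import metadata


def build_info_sheet(packages, seed, test_size):
    """Collect environment and experiment settings into a plain dictionary."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
        "seed": seed,            # value passed as random_state or to the seed calls
        "test_size": test_size,  # split setting passed to train_test_split
    }


if __name__ == "__main__":
    sheet = build_info_sheet(["numpy", "scikit-learn"], seed=42, test_size=0.25)
    print(json.dumps(sheet, indent=2))
```

Saving this JSON alongside the trained model gives reviewers a concrete list of versions and settings to compare against their own environment.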

With these recommendations, you can address the reproducibility challenge. These are general, high-level suggestions, so you should still develop your own reproducibility strategy and practical action plan.

The next article will be about the “data drift” ML challenge.

  1. Semmelrock, H., Kopeinik, S., Theiler, D., Ross-Hellauer, T., & Kowald, D. (2023). Reproducibility in Machine Learning-Driven Research. arXiv preprint arXiv:2307.10320.
  2. Tatman, R., VanderPlas, J., & Dane, S. (2018). A practical taxonomy of reproducibility for machine learning research.
  6. Lakshmanan, V., Robinson, S., & Munn, M. (2020). Machine learning design patterns. O’Reilly Media.
