Machine Learning Challenges - 2 - Reproducibility | by Emre Koçyiğit | May, 2024

In the previous article, I introduced and discussed the first common Machine Learning (ML) challenge, "Data Quality". Now, let's proceed to another common ML challenge, "Reproducibility".

After elaborating on reproducibility, a topic that matters across the research world and particularly in ML, and underlining the crucial points about it, I will conclude with some best-practice suggestions.

What and Why?

First of all, let's answer the questions "what is it?" and "what is it not?" to prevent any misuse or confusion of terms from the beginning.

Reproducibility is the process of obtaining the exact results reported with the same experimental setup. It is often confused with "replicability," which is the process of obtaining the same results with a different experimental setup [1], [2]. When we refer to "reproducibility," we're discussing the need to generate the same study results using identical methods. This concept serves various purposes, such as preventing unnecessary duplication, drawing inspiration or lessons from others' work, and validating findings [3].

Reproducibility is crucial for both ML researchers and practitioners working in production environments, since ML projects need to be inspected and tested by other stakeholders and the ML lifecycle is an experimental process. It is necessary for building trustworthy ML models, carrying out verification and debugging, promoting collaboration, and addressing fairness issues [4]. Reproducible ML research also appears to have a better chance of being accepted at top conferences [5].

Tatman et al. presented three levels of reproducibility (low, medium and high); the highest includes sharing code, data and environment so that others can obtain the same results [2]. Semmelrock et al. also presented three degrees: experiment, data and method reproducibility. They also pointed out four different reproducibility challenges [1]:

Computational: Inherent nondeterminism causes different results even if you use the same data and code. Environmental differences, such as a different GPU, CPU, compiler or set of libraries, can also change the results during computation. To mitigate this, fix the random number seeds and keep the environment identical.
Missing data or code: You cannot get the same results with an ML model trained on different data. To solve this issue, ensure that the data is complete and exactly the same.
Methodological: Even if you use the same data and computational resources with the same random seed values, you will get different results if you don't prevent data leakage or if you use different train-test splits. Make sure you are using the same methodology, splits and data sets.
Structural: Academia and industry may be less motivated to ensure reproducibility for reasons such as keeping a competitive advantage or privacy concerns. These issues usually need case-specific solutions.
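
For the missing-data challenge, one way to check that the data is exactly the same is to compare a checksum of the data set. The sketch below is a minimal illustration using Python's standard hashlib module; the function name dataset_fingerprint and the sample rows are hypothetical and are not taken from [1].

```python
import hashlib


def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest over the rows, in order.

    `rows` is a hypothetical list of records (here: tuples of values).
    Any change in content or order changes the digest.
    """
    digest = hashlib.sha256()
    for row in rows:
        # repr() gives a stable text form for simple Python values
        digest.update(repr(row).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


original = [(5.1, 3.5, "a"), (4.9, 3.0, "b"), (6.2, 2.9, "c")]
same_copy = list(original)
reordered = [original[1], original[0], original[2]]

print(dataset_fingerprint(original) == dataset_fingerprint(same_copy))  # True
print(dataset_fingerprint(original) == dataset_fingerprint(reordered))  # False
```

Publishing such a digest alongside the results lets others confirm they are working with the same data before they try to reproduce anything.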

Solutions?

You should be aware that wherever randomness is present, it can cause problems for reproducibility. This is the case, for example, when initializing the weights of a neural network with random values. Therefore, it's advisable to fix the seed values [6], such as:

import tensorflow as tf
value = 42  # any fixed integer seed
tf.random.set_seed(value)
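
The effect of a fixed seed can be seen without any ML framework. The following minimal sketch uses Python's built-in random module as a stand-in for weight initialization; init_weights is a hypothetical helper, not a TensorFlow API.

```python
import random


def init_weights(n, seed=None):
    """Draw n pseudo-random 'weights'; a hypothetical stand-in for network initialization."""
    rng = random.Random(seed)  # seed=None seeds from system entropy or the clock
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


# Same seed: identical values on every run
run_a = init_weights(5, seed=42)
run_b = init_weights(5, seed=42)
print(run_a == run_b)  # True

# No seed: each call draws from a freshly, unpredictably seeded generator
run_c = init_weights(5)
run_d = init_weights(5)
print(run_c == run_d)
```

Using a dedicated generator object rather than the module-level functions keeps the seeded stream from being disturbed by other code that also draws random numbers.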

You can shuffle the data using sklearn's shuffle; to get the same order again, use the same random_state value each time.

from sklearn.utils import shuffle 
data = shuffle(data, random_state=value)

Scikit-learn's commonly used train_test_split function also accepts a random_state:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
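
To show why the same random_state matters, here is a rough pure-Python sketch of a seeded split. It is not scikit-learn's implementation; simple_split is a hypothetical helper that only mimics the idea of shuffling with a fixed seed before cutting off the test portion.

```python
import random


def simple_split(items, test_size=0.25, seed=42):
    """Shuffle a copy of items with a fixed seed, then split into train/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]


samples = list(range(20))

train_1, test_1 = simple_split(samples, seed=42)
train_2, test_2 = simple_split(samples, seed=42)

print(test_1 == test_2)            # True: same seed, same split
print(len(train_1), len(test_1))   # 15 5
```

Changing the seed would generally move different samples into the test set, which is one reason reported metrics can shift between otherwise identical runs.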

and many more:

NumPy: numpy.random.RandomState
PyTorch: torch.manual_seed()
…

The first step is to be aware of reproducibility. Beyond that, we can list three important aspects [1]:

Checklists: You can create a detailed checklist for others. While doing so, you can look at existing ones such as [7] and [8] for inspiration and adapt them for your teams.
Standardised environments: Container software such as Docker can be used to maintain this standardisation. Even if you use the same library and seed value, a different library version may add one more random step and change the results. So, ensure that you are using the same packages with the same versions, and add this to the checklist.
Model documents (info sheets): You can create model info sheets that describe the data and the methods used, such as the train_test_split settings.
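
As a starting point for such an info sheet, the sketch below collects the Python version, platform, selected package versions, and the seed and split settings into a JSON document. The field names, the build_info_sheet helper, and the package list are assumptions for illustration, not a standard format from [1].

```python
import json
import platform
import sys
from importlib import metadata


def build_info_sheet(packages, seed, test_size, data_description):
    """Collect environment and experiment settings into a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
        "package_versions": versions,
        "random_seed": seed,
        "split": {"method": "train_test_split", "test_size": test_size},
        "data": data_description,
    }


sheet = build_info_sheet(
    packages=["numpy", "scikit-learn", "tensorflow"],
    seed=42,
    test_size=0.25,
    data_description="hypothetical example data set, version 1",
)
print(json.dumps(sheet, indent=2))
```

Storing a file like this next to the trained model gives reviewers one place to check versions and settings against the checklist.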

With these recommendations, you can address the reproducibility challenge. While these are general high-level suggestions, you should develop your own reproducibility strategy and practical action plan.

The next article will be about the "data drift" ML challenge.

[1] Semmelrock, H., Kopeinik, S., Theiler, D., Ross-Hellauer, T., & Kowald, D. (2023). Reproducibility in Machine Learning-Driven Research. arXiv preprint arXiv:2307.10320.
[2] Tatman, R., VanderPlas, J., & Dane, S. (2018). A practical taxonomy of reproducibility for machine learning research.
[3] https://blog.ml.cmu.edu/2020/08/31/5-reproducibility/
[4] https://www.linkedin.com/pulse/ensuring-reproducibility-ml-why-how-srishti-sawla?trk=article-ssr-frontend-pulse_more-articles_related-content-card
[5] https://ai.meta.com/blog/new-code-completeness-checklist-and-reproducibility-updates/
[6] Lakshmanan, V., Robinson, S., & Munn, M. (2020). Machine learning design patterns. O'Reilly Media.
[7] https://aaai.org/conference/aaai/aaai-23/reproducibility-checklist/
[8] https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf
