Distributed Parallel Computing Made Easy with Ray

By Betty LD, January 2025


Illustrated with an example of multimodal offline batch inference with CLIP


This is a technical post summarizing my experience with the Ray library for distributed data processing, illustrated with an example of using Ray for scalable offline batch inference.
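To give a concrete picture before diving in, here is a minimal sketch of what such a pipeline can look like with Ray Data and Hugging Face's CLIP. It is not the exact setup from this post: the dataset path, checkpoint name, batch size, and actor count are placeholders you would adapt to your own data and cluster.

```python
import ray
import torch
from transformers import CLIPModel, CLIPProcessor

# Placeholder path: read a directory (or bucket) of images into a Ray Dataset.
ds = ray.data.read_images("/path/to/images/")

class CLIPEmbedder:
    """Stateful worker: loads CLIP once per actor, then embeds batches."""

    def __init__(self):
        checkpoint = "openai/clip-vit-base-patch32"  # placeholder checkpoint
        self.processor = CLIPProcessor.from_pretrained(checkpoint)
        self.model = CLIPModel.from_pretrained(checkpoint)
        self.model.eval()

    def __call__(self, batch):
        # `batch["image"]` is a batch of image arrays produced by read_images.
        inputs = self.processor(images=list(batch["image"]), return_tensors="pt")
        with torch.no_grad():
            embeddings = self.model.get_image_features(**inputs)
        batch["embedding"] = embeddings.numpy()
        return batch

# Class-based UDFs run as a pool of actors; `concurrency` sets the pool size.
results = ds.map_batches(CLIPEmbedder, batch_size=32, concurrency=4)
results.write_parquet("/tmp/clip_embeddings/")
```

The key idea is that `map_batches` with a callable class keeps the model loaded in long-lived actors instead of reloading it for every batch, which is what makes offline inference over large datasets tractable.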

Recently, I had to prepare a dataset for Vision LLM training. The quality of the training dataset is critical to the success of training, so we needed to develop tools for processing large amounts of data. The goal is to ensure the data fed to the model is controlled and of high quality.

Why so much effort to create a dataset? Isn't quantity the secret of LLMs?

Tons of data. Thanks to https://unsplash.com/@jjying for the picture.

It is not. First, let me share why engineering effort should go into constructing and filtering a good dataset.

In the current race to develop foundation models, new models emerge every month at the top of the SOTA benchmarks. Some companies and laboratories share their weights with the open-source community; some even share checkpoints and training scripts.

However, the steps for creating and curating the training datasets are rarely shared. For…
