How to Use Hugging Face’s Datasets Library for Efficient Data Loading



Image by Editor | Midjourney

 

This tutorial demonstrates how to use Hugging Face’s Datasets library for loading datasets from different sources with just a few lines of code.

The Hugging Face Datasets library simplifies the process of loading and processing datasets. It provides a unified interface for thousands of datasets hosted on the Hugging Face Hub. The library also implements various performance metrics for transformer-based model evaluation.

 

Initial Setup

 
Certain Python development environments may require installing the Datasets library before importing it.

!pip install datasets
import datasets

 

Loading a Hugging Face Hub Dataset by Name

 
Hugging Face hosts a wealth of datasets on its Hub. In recent versions of the Datasets library the old list_datasets() helper has been removed, so listing the available datasets by name is done through the companion huggingface_hub package:

from huggingface_hub import list_datasets

for dataset_info in list_datasets(limit=10):
    print(dataset_info.id)

 

Let’s load one of them, namely the emotions dataset for classifying emotions in tweets, by specifying its name:

from datasets import load_dataset
data = load_dataset("jeffnyman/emotions")

 

If you want to load a dataset you came across while browsing Hugging Face’s website and are unsure of its exact name, click on the “copy” icon beside the dataset name, as shown below:

 


 

The dataset is loaded into a DatasetDict object that contains three subsets or folds: train, validation, and test.

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 16000
    })
    validation: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 2000
    })
})

 

Each fold is in turn a Dataset object. Using dictionary operations, we can retrieve the training data fold:

train_data = data["train"]

 

The length of this Dataset object indicates the number of training instances (tweets):
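
len(train_data)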

 

Leading to this output:
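
16000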

 

Getting a single instance by index (e.g. the 4th one, at index 3) is as easy as indexing into a list:
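
train_data[3]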

 

which returns a Python dictionary whose keys are the dataset’s two attributes: the input tweet text and the label indicating the emotion it has been classified with.

{'text': 'i am ever feeling nostalgic about the fireplace i will know that it is still on the property',
 'label': 2}

 

We can also retrieve several consecutive instances at once by slicing:
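
train_data[:3]   # slice bound chosen for illustration; any range works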

 

This operation returns a single dictionary as before, but now each key maps to a list of values instead of a single value.

{'text': ['i didnt feel humiliated', ...],
 'label': [0, ...]}

 

Lastly, to access a single attribute value, we specify two indices: one for the instance’s position and one for the attribute name or key:
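
train_data[3]["text"]   # text attribute of the 4th training instance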

 

Loading Your Own Data

 
If, instead of resorting to the Hugging Face Hub, you want to use your own dataset, the Datasets library allows you to do so with the same ‘load_dataset()’ function, this time passing two arguments: the file format of the dataset to be loaded (such as “csv”, “text”, or “json”) and the path or URL where it is located.

This example loads the Palmer Archipelago Penguins dataset from a public GitHub repository:

url = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv"
dataset = load_dataset('csv', data_files=url)
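
The same call works for files on disk as well; the path below is just a placeholder for wherever your own copy of the data lives:

local_dataset = load_dataset("csv", data_files="path/to/penguins.csv")  # placeholder path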

 

Turning the Dataset Into a Pandas DataFrame

 
Last but not least, it is sometimes convenient to convert your loaded data into a Pandas DataFrame object, which facilitates data manipulation, analysis, and visualization with the extensive functionality of the Pandas library.

penguins = dataset["train"].to_pandas()
penguins.head()
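
From here, the usual Pandas workflow applies. As a quick sketch, assuming the species column included in this CSV, we could count the penguins of each species:

penguins["species"].value_counts()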

 


 

Now that you have learned how to efficiently load datasets using Hugging Face’s dedicated library, the next step is to leverage them by using Large Language Models (LLMs).

 
 

Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.
