How to Log Your Data with MLflow. Mastering data logging in MLOps for… | by Jack Chang | Jan, 2025


Setting up an MLflow server locally is straightforward. Use the following command:

mlflow server --host 127.0.0.1 --port 8080

Then set the tracking URI.

mlflow.set_tracking_uri("http://127.0.0.1:8080")

For more advanced configurations, refer to the MLflow documentation.


For this article, we are using the California housing dataset (CC BY license). However, you can apply the same principles to log and track any dataset of your choice.

For more information on the California housing dataset, refer to the scikit-learn documentation for fetch_california_housing.

mlflow.data.dataset.Dataset

Before diving into dataset logging, evaluation, and retrieval, it’s important to understand the concept of datasets in MLflow. MLflow provides the mlflow.data.dataset.Dataset object, which represents datasets used with MLflow Tracking.

class mlflow.data.dataset.Dataset(source: mlflow.data.dataset_source.DatasetSource, name: Optional[str] = None, digest: Optional[str] = None)

This object comes with key properties:

  • A required parameter, source (the data source of your dataset, as an mlflow.data.dataset_source.DatasetSource object)
  • digest (a fingerprint of your dataset) and name (a name for your dataset), which can be set via parameters.
  • schema and profile to describe the dataset’s structure and statistical properties.
  • Information about the dataset’s source, such as its storage location.

You can easily convert the dataset into a dictionary using to_dict() or a JSON string using to_json().

Support for Popular Dataset Formats

MLflow makes it easy to work with various types of datasets through specialized classes that extend the core mlflow.data.dataset.Dataset. At the time of writing this article, here are some of the notable dataset classes supported by MLflow:

  • pandas: mlflow.data.pandas_dataset.PandasDataset
  • NumPy: mlflow.data.numpy_dataset.NumpyDataset
  • Spark: mlflow.data.spark_dataset.SparkDataset
  • Hugging Face: mlflow.data.huggingface_dataset.HuggingFaceDataset
  • TensorFlow: mlflow.data.tensorflow_dataset.TensorFlowDataset
  • Evaluation Datasets: mlflow.data.evaluation_dataset.EvaluationDataset

All these classes come with a convenient mlflow.data.from_* API for loading datasets directly into MLflow. This makes it easy to construct and manage datasets, regardless of their underlying format.

mlflow.data.dataset_source.DatasetSource

The mlflow.data.dataset_source.DatasetSource class is used to represent the origin of the dataset in MLflow. When creating a mlflow.data.dataset.Dataset object, the source parameter can be specified either as a string (e.g., a file path or URL) or as an instance of the mlflow.data.dataset_source.DatasetSource class.

class mlflow.data.dataset_source.DatasetSource

If a string is provided as the source, MLflow internally calls the resolve_dataset_source function. This function iterates through a predefined list of data sources and DatasetSource classes to determine the most appropriate source type. However, MLflow’s ability to accurately resolve the dataset’s source is limited, especially when the candidate_sources argument (a list of potential sources) is set to None, which is the default.

In cases where the DatasetSource class cannot resolve the raw source, an MLflow exception is raised. As a best practice, I recommend explicitly creating and using an instance of the mlflow.data.dataset_source.DatasetSource class when defining the dataset’s origin. MLflow provides several built-in subclasses:

  • class HTTPDatasetSource(DatasetSource)
  • class DeltaDatasetSource(DatasetSource)
  • class FileSystemDatasetSource(DatasetSource)
  • class HuggingFaceDatasetSource(DatasetSource)
  • class SparkDatasetSource(DatasetSource)

One of the most straightforward ways to log datasets in MLflow is through the mlflow.log_input() API. This allows you to log datasets in any format that is compatible with mlflow.data.dataset.Dataset, which can be extremely helpful when managing large-scale experiments.

Step-by-Step Guide

First, let’s fetch the California Housing dataset and convert it into a pandas.DataFrame for easier manipulation. Here, we create a dataframe that combines both the feature data (california_data) and the target data (california_target).

from sklearn.datasets import fetch_california_housing
import pandas as pd

california_housing = fetch_california_housing()
california_data: pd.DataFrame = pd.DataFrame(california_housing.data, columns=california_housing.feature_names)
california_target: pd.DataFrame = pd.DataFrame(california_housing.target, columns=['Target'])

california_housing_df: pd.DataFrame = pd.concat([california_data, california_target], axis=1)

To log the dataset with meaningful metadata, we define a few parameters like the data source URL, dataset name, and target column. These will provide helpful context when retrieving the dataset later.

If we look deeper into the fetch_california_housing source code, we can see the data originated from https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz.

from mlflow.data.dataset_source import DatasetSource
from mlflow.data.http_dataset_source import HTTPDatasetSource

dataset_source_url: str = 'https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz'
dataset_source: DatasetSource = HTTPDatasetSource(url=dataset_source_url)
dataset_name: str = 'California Housing Dataset'
dataset_target: str = 'Target'
dataset_tags = {
    'description': california_housing.DESCR,
}

Once the data and metadata are defined, we can convert the pandas.DataFrame into an mlflow.data.pandas_dataset.PandasDataset object.

import mlflow
from mlflow.data.pandas_dataset import PandasDataset

dataset: PandasDataset = mlflow.data.from_pandas(
    df=california_housing_df, source=dataset_source, targets=dataset_target, name=dataset_name
)

print(f'Dataset name: {dataset.name}')
print(f'Dataset digest: {dataset.digest}')
print(f'Dataset source: {dataset.source}')
print(f'Dataset schema: {dataset.schema}')
print(f'Dataset profile: {dataset.profile}')
print(f'Dataset targets: {dataset.targets}')
print(f'Dataset predictions: {dataset.predictions}')
print(dataset.df.head())

Example Output:

Dataset name: California Housing Dataset
Dataset digest: 55270605
Dataset source: <mlflow.data.http_dataset_source.HTTPDatasetSource object at 0x101153a90>
Dataset schema: ['MedInc': double (required), 'HouseAge': double (required), 'AveRooms': double (required), 'AveBedrms': double (required), 'Population': double (required), 'AveOccup': double (required), 'Latitude': double (required), 'Longitude': double (required), 'Target': double (required)]
Dataset profile: {'num_rows': 20640, 'num_elements': 185760}
Dataset targets: Target
Dataset predictions: None
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude Target
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23 4.526
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22 3.585
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24 3.521
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25 3.413
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25 3.422

Note that you can even convert the dataset to a dictionary to access additional properties like source_type:

for k, v in dataset.to_dict().items():
    print(f'{k}: {v}')

Example Output:

name: California Housing Dataset
digest: 55270605
source: {"url": "https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz"}
source_type: http
schema: {"mlflow_colspec": [{"type": "double", "name": "MedInc", "required": true}, {"type": "double", "name": "HouseAge", "required": true}, {"type": "double", "name": "AveRooms", "required": true}, {"type": "double", "name": "AveBedrms", "required": true}, {"type": "double", "name": "Population", "required": true}, {"type": "double", "name": "AveOccup", "required": true}, {"type": "double", "name": "Latitude", "required": true}, {"type": "double", "name": "Longitude", "required": true}, {"type": "double", "name": "Target", "required": true}]}
profile: {"num_rows": 20640, "num_elements": 185760}

Now that we have our dataset ready, it’s time to log it in an MLflow run. This allows us to capture the dataset’s metadata, making it part of the experiment for future reference.

with mlflow.start_run():
    mlflow.log_input(dataset=dataset, context='training', tags=dataset_tags)

Example Output:

🏃 View run sassy-jay-279 at: http://127.0.0.1:8080/#/experiments/0/runs/5ef16e2e81bf40068c68ce536121538c
🧪 View experiment at: http://127.0.0.1:8080/#/experiments/0

Let’s explore the dataset in the MLflow UI (http://127.0.0.1:8080). You’ll find your dataset listed under the default experiment. In the Datasets Used section, you can view the context of the dataset, which in this case is marked as being used for training. Additionally, all the relevant fields and properties of the dataset will be displayed.

Training dataset in the MLflow UI; Source: Me

Congrats! You have logged your first dataset!
