Introducing the AI Lakehouse – KDnuggets


Sponsored Content

 

The Lakehouse is an open data analytics architecture that decouples data storage from query engines. The Lakehouse is now the dominant platform for storing analytical data in the enterprise, but it lacks the capabilities needed to build and operate AI systems.

In order for the Lakehouse to become a unified data layer for both analytics and AI, it needs to be extended with new capabilities, as shown in Figure 1, for training and running batch, real-time, and large language model (LLM) AI applications.

 


Figure 1: The AI Lakehouse requires AI pipelines, an AI query engine, catalog(s) for AI assets and metadata (feature/model registry, lineage, reproducibility), and AI infrastructure services (model serving, a database for feature serving, a vector index for RAG, and governed datasets with unstructured data).

 

The new capabilities include:

  • Real-Time Data Processing: The AI Lakehouse should be capable of supporting real-time AI systems, such as TikTok’s video recommendation engine. This requires “fresh” features created by streaming feature pipelines and delivered by a low-latency feature-serving database.
  • Native Python Support: Python is a second-class citizen in the Lakehouse, with poor read/write performance. The AI Lakehouse should provide a Python (AI) query engine that offers high-performance reads from and writes to Lakehouse tables, along with temporal joins that produce point-in-time correct training data (no data leakage). Netflix implemented a fast Python client using Arrow for their Apache Iceberg Lakehouse, resulting in significant productivity gains.
  • Integration Challenges: MLOps platforms connect data to models but are not fully integrated with Lakehouse systems. This disconnect results in almost half of all AI models failing to reach production due to the siloed nature of data engineering and data science workflows.
  • Unified Monitoring: The AI Lakehouse supports unified data and model monitoring by storing inference logs, so that data quality and model performance can be tracked in one place and drift and other issues can be detected and addressed early.
  • More Features for Real-Time AI Systems: The snowflake schema data model enables both the reuse of features across different AI models and the retrieval of more precomputed features using fewer entity IDs, as foreign keys enable the retrieval of features for many entities with only a single entity ID.
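The first bullet above can be sketched in a few lines. This is an illustrative toy, not any product’s API: a plain Python dict stands in for the low-latency feature-serving database, and the event shape and feature names are assumptions for the example.

```python
# Minimal sketch of "fresh" feature serving: a streaming pipeline folds each
# event into precomputed per-user features, and the online lookup reads them
# back with low latency. The dict stands in for a real feature-serving database.
feature_store = {}

def update_features(event):
    """Streaming feature pipeline step: fold one click event into features."""
    feats = feature_store.setdefault(
        event["user_id"], {"clicks_total": 0, "last_item": None}
    )
    feats["clicks_total"] += 1
    feats["last_item"] = event["item_id"]

def get_features(user_id):
    """Online lookup at inference time (must be low latency)."""
    return feature_store.get(user_id, {"clicks_total": 0, "last_item": None})

for ev in [{"user_id": "u1", "item_id": "v9"},
           {"user_id": "u1", "item_id": "v3"}]:
    update_features(ev)

print(get_features("u1"))  # {'clicks_total': 2, 'last_item': 'v3'}
```

In a production system the pipeline would consume from a stream (e.g. Kafka) and the lookup would hit a dedicated online store; the shape of the read path stays the same.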
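The temporal join mentioned under Native Python Support can be illustrated with a small point-in-time lookup. The data layout and names below are assumptions for the example, not a specific query engine’s API:

```python
from bisect import bisect_right

# Point-in-time (temporal) join sketch: for each labelled training event, take
# the latest feature value whose timestamp is <= the event's timestamp, so no
# future information leaks into the training set.

# (timestamp, value) feature history for one entity, sorted by timestamp
feature_history = [(1, 0.2), (5, 0.7), (9, 0.4)]
timestamps = [t for t, _ in feature_history]

def point_in_time_value(event_ts):
    """Latest feature value observed at or before event_ts (None if none)."""
    i = bisect_right(timestamps, event_ts)
    return feature_history[i - 1][1] if i > 0 else None

# Label events at t=4 and t=10 pick up the values written at t=1 and t=9.
training_rows = [(ts, point_in_time_value(ts), label)
                 for ts, label in [(4, 1), (10, 0)]]
print(training_rows)  # [(4, 0.2, 1), (10, 0.4, 0)]
```

A naive equi-join on entity ID would instead attach the most recent value (0.4) to every event, leaking future data into rows labelled before t=9.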
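The unified-monitoring idea can be made concrete with a toy drift check over inference logs. The threshold and the mean-shift test are illustrative choices, not a prescribed method:

```python
from statistics import mean, stdev

# Drift-detection sketch: compare a feature's live distribution (from inference
# logs) against its training-time baseline, flagging drift when the live mean
# moves more than k baseline standard deviations.

baseline = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]  # feature values logged at training time
live = [1.6, 1.7, 1.55, 1.65, 1.7, 1.6]      # recent values from inference logs

def drifted(baseline, live, k=3.0):
    shift = abs(mean(live) - mean(baseline))
    return shift > k * stdev(baseline)

print(drifted(baseline, live))  # True: the live mean has shifted well outside baseline
```

Because the same logs carry both inputs and predictions, the identical check can be run on model outputs to catch performance degradation, which is the point of unifying the two.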
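Finally, the snowflake-schema point can be sketched as follows; the tables, columns, and values are invented for illustration:

```python
# Why a snowflake schema lets one entity ID fan out into many precomputed
# features: the user row carries foreign keys, which are followed into
# dimension tables at serving time.

users = {"u1": {"city_id": "c7", "device_id": "d2", "clicks_7d": 42}}
cities = {"c7": {"city_pop": 870_000, "city_ctr": 0.031}}
devices = {"d2": {"device_type": "mobile", "device_ctr": 0.045}}

def feature_vector(user_id):
    """Retrieve all precomputed features reachable from a single user_id."""
    row = dict(users[user_id])
    row.update(cities[row.pop("city_id")])     # follow FK into city dimension
    row.update(devices[row.pop("device_id")])  # follow FK into device dimension
    return row

print(feature_vector("u1"))
```

The client passed only `"u1"`, yet the serving layer returned user, city, and device features; the city and device rows are also shared by every other user referencing them, which is the reuse the bullet describes.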

The AI Lakehouse is the evolution of the Lakehouse to meet the demands of batch, real-time, and LLM AI applications. By addressing real-time processing, enhancing Python support, improving monitoring, and adopting snowflake schema data models, the AI Lakehouse will become the foundation for the next generation of intelligent applications.

This article is an abridged highlight of the main article.

Try the Hopsworks AI Lakehouse for free on Serverless or on Kubernetes.

 
 

