Understanding the Tech Stack Behind Generative AI


When ChatGPT reached one million users within just five days – a faster uptake than almost any consumer technology before it – the world began to pay attention to artificial intelligence and AI applications.

And the pace has not slowed since. Many different terms have been buzzing around – from ChatGPT and Nvidia H100 chips to Ollama, LangChain, and Explainable AI. But what do these terms actually mean?

That’s exactly what you’ll find in this article: A structured overview of the technology ecosystem around generative AI and LLMs.

Let’s dive in!

Table of Contents
1 What makes generative AI work – at its core
2 Scaling AI: Infrastructure and Compute Power
3 The Social Layer of AI: Explainability, Fairness and Governance
4 Emerging Abilities: When AI Starts to Interact and Act
Final Thoughts
Where Can You Continue Learning?

1 What makes generative AI work – at its core

New terms and tools in the field of artificial intelligence seem to emerge almost daily. At the core of it all are the foundational models, frameworks and the infrastructure required to run generative AI in the first place.

Foundation Models

Do you know the Swiss Army Knife? Foundation models are like such a multifunctional knife – with a single tool, you can perform many different tasks.

Foundation models are large AI models that have been pre-trained on huge amounts of data (text, code, images, etc.). What is special about these models is that they can not only solve a single task but can also be used flexibly for many different applications. They can write texts, correct code, generate images or even compose music. And they are the basis for many generative AI applications.

The following three aspects are key to understanding foundation models:

  • Pre-trained
    These models were trained on huge data sets. This means that the model has ‘read’ a huge amount of text or other data. This phase is very costly and time-consuming.
  • Multitask-capable
    These foundation models can solve many different tasks. GPT-4o, for example, can be used for everyday knowledge questions, text improvement and code generation.
  • Transferable
    Through fine-tuning or Retrieval Augmented Generation (RAG), we can adapt such foundation models to specific domains or specialise them for specific application areas. I have written about RAG and fine-tuning in detail in How to Make Your LLM More Accurate with RAG & Fine-Tuning. But the core of it is that you have two options to make your LLM more accurate: With RAG, the model itself remains unchanged, but you improve the input by providing the model with additional sources. For example, the model can access past support tickets or legal texts during a query – the model parameters and weights remain unchanged. With fine-tuning, you retrain the pre-trained model on additional data – the model stores this knowledge permanently in its weights.

To get a feel for the amount of data we are talking about, let’s look at FineWeb. FineWeb is a massive dataset developed by Hugging Face to support the pre-training phase of LLMs. The dataset was created from 96 common-crawl snapshots and comprises 15 trillion tokens – which takes up about 44 terabytes of storage space.

Most foundation models are based on the Transformer architecture. In this article, I won’t go into this in more detail as it’s about the high-level components around AI. The most important thing to understand is that these models can look at the entire context of a sentence at the same time, for example – and not just read word by word from left to right. The foundational paper introducing this architecture was Attention is All You Need (2017).

All major players in the AI field have released foundation models — each with different strengths, use cases, and licensing conditions (open-source or closed-source).

GPT-4 from OpenAI, Claude from Anthropic and Gemini from Google, for example, are powerful but closed models. This means that neither the model weights nor the training data are accessible to the public.

There are also high-performing open-source models from Meta, such as LLaMA 2 and LLaMA 3, as well as from Mistral and DeepSeek.

A great resource for comparing these models is the LLM Arena on Hugging Face. It provides an overview of various language models, ranks them and allows for direct comparisons of their performance.

Screenshot taken by the author: A comparison of different LLMs in the LLM Arena.

Multimodal models

A model like GPT-3 can only process pure text. Multimodal models go one step further: they can process and generate not only text but also images, audio and video – several types of data at the same time.

What does this mean in concrete terms?

Multimodal models process different types of input (e.g. an image and a question about it) and combine this information to provide more intelligent answers. With Gemini 1.5, for example, you can upload a photo of a plate with different ingredients and ask which ingredients you can see on it.

How does this work technically?

Multimodal models understand not only language but also visual or auditory information. Like pure text models, they are usually based on the Transformer architecture. An important difference, however, is that not only words are processed as ‘tokens’: images are also broken down into so-called patches – small image sections that are converted into vectors and can then be processed by the model.
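To make the idea of patches more concrete, here is a minimal sketch. The image size, patch size and the random dummy image are purely illustrative; real models additionally apply a learned projection to each patch vector:

import numpy as np

image = np.random.rand(224, 224, 3)    # dummy RGB image, 224 x 224 pixels
patch_size = 16                        # ViT-style 16 x 16 patches

patches = []
for y in range(0, image.shape[0], patch_size):
    for x in range(0, image.shape[1], patch_size):
        patch = image[y:y + patch_size, x:x + patch_size, :]
        patches.append(patch.flatten())    # one vector per image patch

patches = np.stack(patches)
print(patches.shape)    # (196, 768): 196 "image tokens", each a 768-dimensional vector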

Let’s have a look at some examples:

  • GPT-4-Vision
    This model from OpenAI can process text and images. It recognises content in images and combines it with language.
  • Gemini 1.5
    Google’s model can process text, images, audio and video. It is particularly strong at retaining context across modalities.
  • Claude 3
    Anthropic’s model can process text and images and is very good at visual reasoning. It is good at recognising diagrams, graphics and handwriting.

Other examples are Flamingo from DeepMind, Kosmos-2 from Microsoft and Grok from Elon Musk’s xAI, which is integrated into X (formerly Twitter).

GPU & Compute Providers

Training generative AI models requires enormous computing capacity – especially during pre-training, but also during inference, the subsequent application of the model to new inputs.

Imagine a musician practising for months to prepare for a concert – that’s what pre-training is like. During pre-training, a model such as GPT-4, Claude 3, LLaMA 3 or DeepSeek-VL learns from trillions of tokens that come from texts, code, images and other sources. These data volumes are processed with GPUs (Graphics Processing Units) or TPUs (Tensor Processing Units), because this hardware can perform massively parallel computations – unlike CPUs. Many companies rent computing power in the cloud (e.g. via AWS, Google Cloud, Azure) instead of operating their own servers.

When a pre-trained model is adapted to specific tasks through fine-tuning, this in turn requires a lot of computing power – one of the major differences compared to customising the model with RAG. One way to make fine-tuning more resource-efficient is low-rank adaptation (LoRA): instead of retraining the entire model, only small parts of it are specifically retrained.
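To illustrate the idea, here is a minimal sketch using Hugging Face’s peft library; the model name and hyperparameters are illustrative assumptions, not a recommendation:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a (small) pre-trained causal language model – the model name is illustrative
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank adapter matrices
    lora_alpha=16,                        # scaling factor for the adapters
    target_modules=["q_proj", "v_proj"],  # only adapt the attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the model: only the small adapter weights remain trainable
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well below 1% of all parameters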

To stay with the music example: inference is the actual live concert, which has to be played again and again – and this, too, requires resources. Inference is the process of applying an AI model to a new input (e.g. when you ask ChatGPT a question) in order to generate an answer or a prediction.

Some examples:

Specialised hardware components that are optimised for parallel computing are used for this. NVIDIA’s A100 and H100 GPUs, for example, are standard in many data centres, while AMD’s Instinct MI300X is catching up as a high-performance alternative. Google TPUs are also used for certain workloads – especially within the Google ecosystem.

ML Frameworks & Libraries

Just as there are frameworks for web development, there are frameworks for AI tasks. They provide ready-made functions for building neural networks, for example, so that not everything has to be programmed from scratch, and they make training more efficient by parallelising calculations and making efficient use of GPUs.

The most important ML frameworks for generative AI:

  • PyTorch was developed by Meta and is open source. It is very flexible and popular in research & open source.
  • TensorFlow was developed by Google and is very powerful for large AI models. It supports distributed training and is often used in cloud environments.
  • Keras is a part of TensorFlow and is mainly used by beginners and for prototyping.
  • JAX is also from Google and was specially developed for high-performance AI calculations. It is often used for advanced research and Google DeepMind projects. For example, it is used for the latest Google AI models such as Gemini and Flamingo.

PyTorch and TensorFlow can easily be combined with other tools such as Hugging Face Transformers or ONNX Runtime.
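As a small illustration of what such ready-made building blocks look like, here is a minimal PyTorch sketch; the layer sizes are arbitrary and purely illustrative:

import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """A small feed-forward network built entirely from framework components."""
    def __init__(self, n_features: int, n_classes: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64),
            nn.ReLU(),
            nn.Linear(64, n_classes),
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier(n_features=20, n_classes=3)
dummy_batch = torch.randn(8, 20)    # batch of 8 samples with 20 features each
logits = model(dummy_batch)
print(logits.shape)                 # torch.Size([8, 3])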

AI Application Frameworks

These frameworks enable us to integrate the Foundation Models into specific applications. They simplify access to the Foundation Models, the management of prompts and the efficient administration of AI-supported workflows.

Three tools, as examples:

  1. LangChain enables the orchestration of LLMs for applications such as chatbots, document processing and automated analyses. It supports access to APIs, databases and external storage. And it can be connected to vector databases – which I explain in the next section – to perform contextual queries.

    Let’s look at an example: A company wants to build an internal AI assistant that searches through documents. With LangChain, it can now connect GPT-4 to the internal database and the user can search company documents using natural language.

  2. LlamaIndex was specifically designed to make large amounts of unstructured data efficiently accessible to LLMs and is therefore important for Retrieval Augmented Generation (RAG). Since LLMs only have a limited knowledge base derived from their training data, RAG retrieves additional information before generating an answer. And this is where LlamaIndex comes into play: it can be used to convert unstructured data, e.g. from PDFs, websites or databases, into searchable indices.

    Let’s take a look at a concrete example:

    A lawyer needs a legal AI assistant to search laws. LlamaIndex organises thousands of legal texts and can therefore provide precise answers quickly.

  3. Ollama makes it possible to run large language models on your own laptop or server without having to rely on the cloud. No API access is required as the models run directly on the device.

    For example, you can run a model such as Mistral, LLaMA 3 or DeepSeek locally on your device.
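As a small, hedged sketch of what this looks like in practice, the following uses Ollama’s Python client. It assumes Ollama is installed and the model has already been pulled (e.g. with "ollama pull mistral"); the model name and prompt are purely illustrative:

import ollama

# Send a chat request to a locally running model – no cloud API involved
response = ollama.chat(
    model="mistral",
    messages=[{"role": "user", "content": "Explain in two sentences what a vector database is."}],
)
print(response["message"]["content"])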

Databases & Vector Stores

In traditional data processing, relational databases (SQL databases) store structured data in tables, while NoSQL databases such as MongoDB or Cassandra are used to store unstructured or semi-structured data.

With LLMs, however, we now also need a way to store and search semantic information.

This requires vector databases: a foundation model does not process input as raw text but converts it into numerical vectors – so-called embeddings. Vector databases make it possible to store these embeddings and to search them quickly by similarity, and can thus provide relevant contextual information.

How does this work, for example, with Retrieval Augmented Generation?

  1. Each text (e.g. a paragraph from a PDF) is translated into a vector.
  2. You pass a query to the model as a prompt. For example, you ask a question. This question is now also translated into a vector.
  3. The database now calculates which vectors are closest to the input vector.
  4. These top results are made available to the LLM before it answers, and the model uses this additional information to generate its response.

Examples of this are Pinecone, FAISS, Weaviate, Milvus, and Qdrant.
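To make these four steps more tangible, here is a minimal retrieval sketch using sentence-transformers for the embeddings and FAISS as the vector index; the documents, model name and question are illustrative assumptions:

import faiss
from sentence_transformers import SentenceTransformer

documents = [
    "Our support hotline is available Monday to Friday from 9 am to 5 pm.",
    "Refunds are processed within 14 days of receiving the returned item.",
    "The warranty covers manufacturing defects for two years.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vectors = embedder.encode(documents)             # step 1: texts -> vectors

index = faiss.IndexFlatL2(doc_vectors.shape[1])      # simple exact-search index
index.add(doc_vectors)

question = "How long do refunds take?"
query_vector = embedder.encode([question])           # step 2: question -> vector

distances, ids = index.search(query_vector, k=2)     # step 3: nearest neighbours
context = [documents[i] for i in ids[0]]             # step 4: context for the LLM
print(context)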

Programming Languages

Generative AI development also needs a programming language.

Of course, Python is probably the first choice for almost all AI applications. Python has established itself as the main language for AI & ML and is one of the most popular and widely used languages. It is flexible and offers a large AI ecosystem with all the previously mentioned frameworks such as TensorFlow, PyTorch, LangChain or LlamaIndex.

Why isn’t Python used for everything?

Python itself is not particularly fast. Thanks to CUDA backends, however, frameworks such as TensorFlow and PyTorch are still very performant. When raw performance really matters, Rust, C++ or Go are more likely to be used.

Another language that must be mentioned is Rust: This language is used when it comes to fast, secure and memory-efficient AI infrastructures. For example, for efficient databases for vector searches or high-performance network communication. It is primarily used in the infrastructure and deployment area.

Julia is a language that is close to Python, but much faster – this makes it perfect for numerical calculations and tensor operations.

TypeScript or JavaScript are not directly relevant for AI applications but are often used in the front end of LLM applications (e.g., React or Next.js).

Own visualization — Illustrations from unDraw.co

2 Scaling AI: Infrastructure and Compute Power

Apart from the core components, we also need ways to scale and train the models.

Containers & Orchestration

Not only traditional applications but also AI applications need to be deployed and scaled. I wrote about containerisation in detail in the article Why Data Scientists Should Care about Containers – and Stand Out with This Knowledge. At its core, the point is that with containers we can run an AI model (or any other application) on any server and it behaves the same. This allows us to provide consistent, portable and scalable AI workloads.

Docker is the standard for containerisation. Generative AI is no different. We can use it to develop AI applications as isolated, repeatable units. Docker is used to deploy LLMs in the cloud or on edge devices. Edge means that the AI does not run in the cloud, but locally on your device. The Docker images contain everything you need: Python, ML frameworks such as PyTorch, CUDA for GPUs and AI APIs.

Let’s take a look at an example: A developer trains a model locally with PyTorch and saves it as a Docker container. This allows it to be easily deployed to AWS or Google Cloud.

Kubernetes is there to manage and scale container workloads. It can manage GPUs as resources. This makes it possible to run multiple models efficiently on a cluster – and to scale automatically when demand is high.

Kubeflow is less well-known outside of the AI world. It allows ML models to be orchestrated as a workflow, from data processing to deployment. It is specifically designed for machine learning in production environments and supports automated model training & hyperparameter tuning.

Chip manufacturers & AI hardware

The immense computing power that is required has to come from somewhere – this is where chip manufacturers come in. Powerful hardware reduces training times and improves model inference.

There are now also some models that achieve comparable performance with fewer parameters or fewer resources. When DeepSeek-R1 was released at the beginning of 2025, it was questioned how many resources are actually necessary. It is becoming increasingly clear that huge models and extremely expensive hardware are not always needed.

Probably the best-known chip manufacturer in the field of AI is Nvidia, one of the most valuable companies. With its specialised A100 and H100 GPUs, the company has become the de facto standard for training and inferencing large AI models. In addition to Nvidia, however, there are other important players such as AMD with its Instinct MI300X series, Google, Amazon and Cerebras.

API Providers for Foundation Models

Foundation models are pre-trained models, and we use APIs to access them as quickly as possible without having to host them ourselves. API providers such as the OpenAI API, Hugging Face Inference Endpoints or the Google Gemini API offer quick access to the models: you send a text via the API and receive the response back. However, APIs such as the OpenAI API are subject to a fee.
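As a minimal sketch of such an API call, using the official OpenAI Python SDK (it requires an API key in the OPENAI_API_KEY environment variable; the model name and prompt are illustrative):

from openai import OpenAI

client = OpenAI()    # reads the API key from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain Retrieval Augmented Generation in one sentence."}],
)
print(response.choices[0].message.content)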

The best-known provider is OpenAI, whose API provides access to GPT-3.5, GPT-4, DALL-E for image generation and Whisper for speech-to-text. Anthropic also offers a powerful alternative with Claude 2 and 3. Google provides access to multimodal models such as Gemini 1.5 via the Gemini API.

Hugging Face is a central hub for open-source models: its Inference Endpoints allow us to call models such as Mistral 7B, Mixtral or Meta’s models directly.

Another exciting provider is Cohere, which offers Command R+, a model designed specifically for Retrieval Augmented Generation (RAG) – including powerful embedding APIs.

Serverless AI architectures

Serverless computing does not mean that there is no server, but that you do not need to manage your own server. You only define what is to be executed – not how or where. The cloud environment then automatically starts an instance, executes the code and shuts the instance down again. AWS Lambda functions are a well-known example of this.

Something similar is also available specifically for AI. Serverless AI reduces the administrative effort and scales automatically. This is ideal, for example, for AI tasks that are used irregularly.

Let’s take a look at an example: A chatbot on a website that answers customer questions doesn’t have to run all the time. However, when a visitor comes to the website and asks a question, resources must be available. The chatbot is therefore only called up when needed.

Serverless AI can save costs and reduce complexity. However, it is not useful for continuous, latency-critical tasks.

Examples: AWS Bedrock, Azure OpenAI Service, Google Cloud Vertex AI
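To sketch the pattern, here is a hedged example of an AWS Lambda handler that forwards a question to a model hosted on Amazon Bedrock; the model ID and the request body schema depend on the chosen model and are assumptions here:

import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def lambda_handler(event, context):
    # The function only runs when it is invoked – there is no permanently running server
    question = event.get("question", "")
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",   # illustrative model ID
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",       # body schema is model-specific
            "max_tokens": 300,
            "messages": [{"role": "user", "content": question}],
        }),
    )
    payload = json.loads(response["body"].read())
    return {"statusCode": 200, "body": json.dumps(payload)}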

3 The Social Layer of AI: Explainability, Fairness and Governance

With great power and capability comes responsibility. The more we integrate AI into our everyday applications, the more important it becomes to engage with the principles of Responsible AI.

So…Generative AI raises many questions:

  • Does the model explain how it arrives at its answers?
    -> Question about Transparency
  • Are certain groups favoured?
    -> Question about Fairness
  • How is it ensured that the model is not misused?
    -> Question about Security
  • Who is liable for errors?
    -> Question about Accountability
  • Who controls how and where AI is used?
    -> Question about Governance
  • Which data available on the web (e.g. images from artists) may be used?
    -> Question about Copyright / data ethics

While we have comprehensive regulations for many areas of the physical world — such as noise control, light pollution, vehicles, buildings, and alcohol sales — similar regulatory efforts in the IT sector are still rare and often avoided.

I’m not making a generalisation or a value judgment about whether this is good or bad. Less regulation can accelerate innovation – new technologies reach the market faster. At the same time, there is a risk that important aspects such as ethical responsibility, bias detection or energy consumption by large models will receive too little attention.

With the AI Act, the EU is focusing more on a regulated approach that is intended to create clear framework conditions – but this, in turn, can reduce the speed of innovation. The USA tends to pursue a market-driven, liberal approach with voluntary guidelines. This promotes rapid development but often leaves ethical and social issues in the background.

Let’s take a look at three concepts:

Explainability

Many large LLMs such as GPT-4 or Claude 3 are considered so-called black boxes: they provide impressive answers, but we do not know exactly how they arrive at these results. The more we entrust them with – especially in sensitive areas such as education, medicine or justice – the more important it becomes to understand their decision-making processes.

Tools such as LIME, SHAP or attention maps are ways of mitigating this problem: they analyse model decisions and present them visually. In addition, model cards (standardised documentation) help to make a model’s capabilities, training data, limitations and potential risks transparent.
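As a small illustration of how such a tool is used, here is a hedged SHAP sketch for a simple tabular classifier (explaining large LLMs directly is considerably harder); the dataset and model are illustrative:

import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

# Model-agnostic explainer: estimates how much each feature contributes to a prediction
explainer = shap.Explainer(model.predict, data.data)
shap_values = explainer(data.data[:50])

shap.plots.bar(shap_values)    # which features drive the model's predictions overall?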

Fairness

If a model has been trained on data that contains biases or skewed representations, it will inherit these distortions. This can lead to certain population groups being systematically disadvantaged or stereotyped. There are methods for detecting bias, as well as clear standards for how training data should be selected and tested.

Governance

Finally, the question of governance arises: Who actually determines how AI may be used? Who checks whether a model is being operated responsibly?

4 Emerging Abilities: When AI Starts to Interact and Act

This is about the new capabilities that go beyond the classic prompt-response model. AI is becoming more active, more dynamic and more autonomous.

Let’s take a look at a concrete example:

A classic LLM like GPT-3 follows the typical process: you ask a question such as ‘Please show me how to create a button with rounded corners using HTML & CSS’, and the model provides you with the appropriate code, including a brief explanation. The model returns a pure text output – it does not actively execute anything or take any further steps.

Screenshot taken by the author: ChatGPT’s answer when we ask it to create a button with rounded corners.

AI agents go much further. They not only analyse the prompt but also develop plans independently, access external tools or APIs and can complete tasks in several steps.

A simple example:

Instead of just writing the template for an email, an agent can monitor a data source and independently send an email as soon as a certain event occurs. For example, an email could go out when a sales target has been exceeded.

AI agents

AI agents are application logic built on top of foundation models. They orchestrate decisions and execute steps independently. Agents such as AutoGPT carry out multi-step tasks autonomously: they think in loops and try to improve and reach a goal step by step.

Some examples:

  • Your AI agent analyzes new market reports daily, summarizes them, stores them in a database, and notifies the user in case of deviations.
  • An agent initiates a job application process: It scans submitted profiles and matches them with job offers.
  • In an e-commerce shop, the agent monitors inventory levels and customer demand. If a product is running low, it automatically reorders it – including price comparisons between suppliers.

What typically makes up an AI agent?

An AI agent consists of several specialised components that allow it to plan, execute and learn tasks autonomously (a simplified sketch follows the list below):

  • Large Language Model
    The LLM is the core or thinking engine. Typical models include GPT-4, Claude 3, Gemini 1.5, or Mistral 7B.
  • Planning unit
    The planner transforms a higher-level goal into a concrete plan or sequence of steps. Often based on methods like Chain-of-Thought or ReAct.
  • Tool access
    This component enables the agent to use external tools. For example, using a browser for extended search, a Python environment for code execution or enabling access to APIs and databases.
  • Memory
    This component stores information about previous interactions, intermediate results, or contextual knowledge. This is necessary so that the agent can act consistently across multiple steps.
  • Executor
    This component executes the planned steps in the correct order, monitors progress, and replans in case of errors.
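To make these components more concrete, here is a strongly simplified, self-contained sketch of an agent loop. The tools are stubs and the planner is a stand-in for a real LLM-based planning step, so all names and the logic are illustrative assumptions:

def search_web(query: str) -> str:
    """Tool 1 (stub): pretend to search the web."""
    return f"Top search result for '{query}'"

def run_python(code: str) -> str:
    """Tool 2 (stub): pretend to execute code."""
    return f"Executed: {code}"

TOOLS = {"search_web": search_web, "run_python": run_python}

def plan_next_step(goal, memory):
    """Stand-in for the LLM-based planner (e.g. Chain-of-Thought or ReAct)."""
    if not memory:
        return {"tool": "search_web", "input": goal}
    if len(memory) == 1:
        return {"tool": "run_python", "input": "summarise(search_results)"}
    return None    # goal considered reached

def run_agent(goal):
    memory = []    # stores intermediate results across steps
    step = plan_next_step(goal, memory)
    while step is not None:                              # executor loop
        result = TOOLS[step["tool"]](step["input"])      # tool access
        memory.append({"step": step, "result": result})  # memory
        step = plan_next_step(goal, memory)              # re-plan
    return memory

for entry in run_agent("Find today's market reports and summarise them"):
    print(entry)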

There are also tools such as Make or n8n (low-code / no-code automation platforms) that let you implement ‘agent-like’ logic. They execute workflows with conditions, triggers and actions – for example, formulating an automated reply when a new email arrives in the inbox. There are plenty of templates for such use cases.

Screenshot taken by the author: Templates on n8n as an example of low-code / no-code platforms.

Reinforcement Learning

With reinforcement learning, the models are made more “human-friendly.” In this training method, the model learns through reward. This is especially important for tasks where there is no clear “right” or “wrong,” but rather gradual quality.

An example of this is when you use ChatGPT, receive two different responses and are asked to rate which one you prefer.

The reward can come either from human feedback (Reinforcement Learning from Human Feedback – RLHF) or from another model (Reinforcement Learning from AI Feedback – RLAIF). In RLHF, a human rates several responses from a model, allowing the LLM to learn what ‘good’ responses look like and to align better with human expectations. In RLAIF, an AI model provides the feedback instead, and it does not have to be binary (e.g. good vs. bad) but can consist of differentiated, context-dependent rewards (e.g. a variable reward scale from -1 to +3). This is especially useful where there are many possible ‘good’ responses, but some match the user’s intent much better.

On my Substack, I regularly write summaries about the published articles in the fields of Tech, Python, Data Science, Machine Learning and AI. If you’re interested, take a look or subscribe.

Final Thoughts

It would probably be possible to write an entire book about generative AI right now – not just a single article. Artificial intelligence has been researched and applied for many years. But we are currently at a moment where an explosion of tools, applications and frameworks is taking place – AI, and especially generative AI, has truly arrived in our everyday lives. Let’s see where this takes us and end with a quote from Alan Kay:

The best way to predict the future is to invent it.

Where Can You Continue Learning?
