Pre-training, Fine-tuning & Instruction Tuning: What’s the Difference?


In recent years, Artificial Intelligence has made astonishing progress, powering everything from chatbots and virtual assistants to medical diagnostics and creative content generation. This rapid evolution has been driven by increasingly sophisticated machine learning techniques, allowing models to understand and generate human-like text, recognize images, and even make complex decisions. However, building highly capable AI models comes with significant challenges. Models must not only learn from vast amounts of data but also adapt to specific tasks and follow human instructions effectively. To address these needs, models undergo different training phases: pre-training, fine-tuning, and instruction tuning. Each plays a crucial role in shaping a model’s capabilities, from general knowledge acquisition to specialized task execution and alignment with human intent.

Data is the fuel of AI, and how it is presented during training plays an important role. The growing adoption of large language models (LLMs) like ChatGPT, as well as the emergence of multimodal LLMs, has rapidly increased the use of terms such as pre-training, fine-tuning, and instruction tuning in the model development process. This post aims to explain these three training phases and highlight their key differences, drawing from my experience as a Machine Learning Scientist.

Pre-training is the initial phase in training an AI model, where it learns general patterns, structures, and representations from vast amounts of data before being specialized for specific tasks. In this phase the model learns from raw data without needing explicit labels (e.g., self-supervised or unsupervised learning). For example, large language models (LLMs) like ChatGPT, DeepSeek, Phi, and others are pre-trained on massive text datasets from books, articles, and websites, learning grammar, facts, and language structure by predicting missing words in sentences.

The easiest way to think about it is that we show the model millions of documents, from which it learns which words are most likely in a sentence given the other words. For instance, if I give you the sentence “the ___ ate my homework”, you might first think of ‘dog’ as the most likely missing word. However, you also know that other words are possible candidates, such as horse, cat, or neighbor (hopefully not). During the pre-training phase, the model builds a better understanding of the likelihood of words in a sentence. Similarly, computer vision models are pre-trained on datasets like ImageNet, enabling them to recognize general features such as edges, shapes, and textures.
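
Returning to the fill-in-the-blank example, here is a minimal sketch of this idea using the Hugging Face transformers library (my choice for illustration; the post itself doesn’t prescribe any library, and bert-base-uncased is just one publicly available pre-trained model):

```python
from transformers import pipeline

# A pre-trained masked language model assigns probabilities to
# candidate words for the blank, just like the example above.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill_mask("The [MASK] ate my homework."):
    print(f"{candidate['token_str']}: {candidate['score']:.3f}")
```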

While pre-training provides models with a strong foundation of general knowledge, pre-trained-only models face several challenges when applied to specific tasks. Since they learn from vast and diverse datasets without direct supervision, they may generate biased, incorrect, or contextually irrelevant outputs when handling domain-specific queries. Additionally, pre-trained models often lack task-specific expertise, meaning they may struggle with legal, medical, or technical subjects that require precise and specialized knowledge. Another limitation is their inability to follow complex human instructions effectively, as they are not explicitly trained to align responses with user intent. Without further adaptation, these models can also exhibit hallucinations, generating plausible but false information. To overcome these challenges, pre-trained models typically undergo fine-tuning and instruction tuning, refining their responses for specific applications and improving alignment with human needs.

To illustrate the main challenge, let’s go over an example. Say you only pre-trained your model, and then you ask it “What is Machine Learning?”. The model has learned that this text is a question and that similar questions are highly likely to follow. So instead of answering the question, the model repeats it or produces new questions similar to the user’s input.

User: What is Machine Learning?
Pre-trained-only AI: What is Artificial Intelligence? What is Computer Vision? What is Natural Language Processing? …

We start to see the big challenge here. The model has very important information embedded in its weights; however, it does not understand the instruction!
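
You can reproduce this failure mode yourself with a small pre-trained-only model such as GPT-2 (a sketch; outputs vary from run to run, since the model simply continues the text):

```python
from transformers import pipeline

# GPT-2 was only pre-trained (no instruction tuning), so it tends to
# continue the text instead of answering the question.
generator = pipeline("text-generation", model="gpt2")

result = generator("What is Machine Learning?", max_new_tokens=40, do_sample=True)
print(result[0]["generated_text"])
```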

Instruction tuning is like teaching a well-read but socially awkward AI model how to follow directions properly. A pre-trained model may know a lot, but without instruction tuning, it can misunderstand requests. This phase involves training the model with explicit examples of instructions and desired responses, helping it align with human intent. For instance, models like ChatGPT/GPT-4 are instruction-tuned using datasets where prompts (e.g., “Explain Machine Learning in simple terms”) are paired with ideal responses. Similarly, multimodal models like GPT-4V are tuned to understand image-based instructions, such as “Describe this photo.” Essentially, instruction tuning turns a knowledgeable but unpredictable model into a more helpful, concise, and user-friendly AI assistant.

OK, but how do people structure the data for instruction tuning? Here is an example from a Hugging Face dataset that could be used to instruction-tune your model. Figure 1 shows an example of a record in an instruction dataset.

Figure 1: Example of a record in an instruction-tuning dataset
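
Concretely, an instruction-tuning record is essentially a dictionary of fields. The sketch below uses the Alpaca-style schema as an illustration (field names vary across datasets, and this record is made up rather than copied from Figure 1):

```python
from datasets import load_dataset

# An Alpaca-style instruction record: an instruction, optional extra
# input, and the desired response the model should learn to produce.
record = {
    "instruction": "Explain Machine Learning in simple terms.",
    "input": "",
    "output": "Machine Learning is a way for computers to learn patterns "
              "from examples instead of being explicitly programmed.",
}

# Many such datasets are hosted on Hugging Face, e.g. the Alpaca dataset.
dataset = load_dataset("tatsu-lab/alpaca", split="train")
print(dataset[0])
```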

While instruction tuning significantly improves a model’s ability to follow human guidance, it still has limitations. One major challenge is that it relies on high-quality, well-structured datasets containing diverse instructions and responses. Otherwise, the model may learn to follow instructions poorly or develop unintended biases. Additionally, instruction tuning enhances general alignment with user intent, but it doesn’t always make a model an expert in specialized fields like law, medicine, or finance. A model trained to answer general questions may still struggle with industry-specific tasks that require deep, precise knowledge. This is where fine-tuning comes in: it allows models to be further trained on domain-specific data, tailoring their expertise for specialized applications while refining accuracy and reliability.

Fine-tuning is like sending the AI model to grad school; it takes a generally knowledgeable model and trains it on specialized data to make it an expert in a specific field. While a pre-trained and instruction-tuned model can hold a decent conversation, fine-tuning ensures it truly understands domain-specific details. For example, a legal AI model might be fine-tuned on case law and contracts to provide accurate legal insights, while a medical AI could be trained on clinical research and patient data to assist doctors. This phase helps reduce errors, improve reliability, and make the model more useful in professional settings.

One caveat is that current models are huge: they have billions of parameters, so if you were to fine-tune all the parameters of these big models, you would need a vast amount of computational resources.

Smart scientists came up with ways of training only a small subset of parameters, which significantly decreases the computational resources needed for fine-tuning. Even using less than 5% of the model’s total parameters has shown amazing results, though of course this depends on the complexity of the task. To learn more about this, I suggest reading “LoRA: Low-Rank Adaptation of Large Language Models” by Hu et al.
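
As a rough sketch of what this looks like in practice, here is how you might attach LoRA adapters with Hugging Face’s peft library (the base model and hyperparameters here are illustrative choices of mine, not recommendations from the paper):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a pre-trained base model; its original weights stay frozen.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA adds small trainable low-rank matrices next to selected layers.
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # scaling factor for the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 5% trainable
```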

In all phases, we performed model training, but the objectives and data formats were quite different.

You might find fine-tuning useful if:

  1. You have annotated data that can be used for training your model in a specific domain. For instance, records of the form (image, instruction, desired output) for multimodal models, or simply (instruction, desired output) for text-to-text models. How many? It depends on the complexity of the task, but the literature typically uses more than 1k samples.
  2. The open-source model or the on-demand API (e.g., GPT-4) does not perform well for your objective. For instance, you want the model to take an input image and a prompt and score the quality of the image, but even when you provide context, GPT-4 or open-source models such as Phi-3.5 do not perform well.
  3. You would like to migrate to an in-house model so that you save money (instead of paying third-party providers for every single request).
  4. You have computing resources available with GPUs. In my experience, a typical instance used for fine-tuning is AWS’s g5.48xlarge (see the AWS EC2 G5 pricing page for estimates).

Step 2 is typically used for prototyping, to quickly assess whether your use case can be developed using existing models. If their performance is not what you are looking for, then data can be collected (step 1) to fine-tune your in-house model. The cost will depend on how long you run that training process; the pricing page mentioned in step 4 gives estimates for this compute.
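
For step 2, a quick prototyping loop can be as simple as running an off-the-shelf instruction-tuned model on a handful of your own examples before committing to fine-tuning. Here is a sketch (the model name and prompt are illustrative, and this assumes a recent transformers version that supports the Phi-3.5 architecture and chat-formatted inputs):

```python
from transformers import pipeline

# Try an existing instruction-tuned model on your task first.
chat = pipeline("text-generation", model="microsoft/Phi-3.5-mini-instruct")

messages = [{
    "role": "user",
    "content": "Score the quality of this image description from 1 to 5 "
               "and explain briefly: 'A blurry photo of a red sneaker.'",
}]

result = chat(messages, max_new_tokens=100)
print(result[0]["generated_text"][-1]["content"])  # the model's reply
```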

While the pre-training phase provides a vast amount of information to models, they struggle to follow instructions that would exploit that knowledge. Thus, instruction tuning trains the model further to make it follow instructions. This is quite helpful and one of the main reasons ChatGPT had such a big impact on the world. One caveat is that the model is still based on general information. Fine-tuning then comes in handy, allowing people to adapt these models to their internal data and align them with their particular needs.

Fine-tuning is widely used; however, in recent years new systems and techniques have evolved and shown great results. In a future post, I will discuss Retrieval-Augmented Generation (RAG), Agents, and Agentic RAG.

I hope this post has helped clarify the development journey of AI models and answered some of your questions! 🙂

Please leave a comment with any questions or topics you’d like to learn more about!
