With DeepSeek making waves, reinforcement learning has taken center stage in the AI community. Now, Moonshot AI steps up with Kimi k1.5, a proprietary model that not only matches DeepSeek's capabilities but also brings a fresh perspective to RL implementation.
Let’s explore how Kimi k1.5 is redefining AI’s potential.
Imagine teaching a child to ride a bicycle. You don’t just explain the theory — you let them try, fall, adjust, and improve through practice and feedback. This is the essence of Reinforcement Learning (RL), a concept that has evolved from training computers in chess to powering today’s most sophisticated AI models.
While RL has been fundamental in game AI (exemplified by AlphaGo and OpenAI Five), its application to language models marks a paradigm shift. Instead of relying solely on static datasets, RL enables models to learn dynamically through experience and feedback, mirroring human learning processes.
Why RL Matters for Language Models
Traditional language models operate through next-token prediction: essentially, they predict the most likely word to follow a given sequence based on training data. This approach, while powerful, has inherent limitations:
1. Static Learning Limitations:
- Confined to learning from historical data
- Lacks dynamic improvement capabilities
- Cannot adapt to new patterns without retraining
2. Reasoning Constraints:
- Struggles with long-term coherence
- Limited by local token probability focus
- Difficulty maintaining consistent context
RL transforms this dynamic by introducing an interactive learning process. The model develops through trial and error, receiving feedback on:
- Response accuracy
- Logical coherence
- Reasoning quality
- Contextual relevance
- Output consistency
The Challenge
Traditional language models face a significant limitation: they can’t generalize beyond their training data. It’s analogous to trying to become a master chef by only reading cookbooks without practical experience.
Kimi k1.5’s approach differs fundamentally:
- Active exploration through controlled experimentation
- Real-time feedback integration
- Dynamic adjustment of responses
- Continuous refinement of output quality
Current RL Framework
While many RL implementations (like AlphaZero) rely on complex techniques, Kimi k1.5 adopts a streamlined approach:
Traditional Complex Methods:
1. Monte Carlo Tree Search (MCTS):
- Used in game AI for move evaluation
- Requires extensive computational resources
- Complex implementation requirements
2. Value Functions:
- Estimates long-term rewards
- Requires sophisticated modeling
- High computational overhead
3. Process Reward Models:
- Evaluates intermediate steps
- Complex implementation
- Resource-intensive
Kimi k1.5’s Simplified Approach:
1. Long-context Scaling (128k tokens):
- Regular AI: If you gave it a long research paper, it would have to read it in chunks, often forgetting earlier parts when reading later sections — like trying to understand a movie by watching 10-minute segments with breaks in between.
- Kimi k1.5: Can read the entire research paper at once and understand how page 1 connects to page 300 — like watching the whole movie in one sitting.
2. Enhanced Policy Optimization:
- Traditional Method (Complex): Like having three different cooking teachers each giving you different instructions about how to make pasta, and you have to figure out which combination of their advice works best.
- Kimi’s Method (Simplified): Like having one experienced chef who directly shows you what works and what doesn’t, giving clear feedback on each step. Instead of processing multiple different opinions, you learn directly from success and failure.
Example: When learning to answer questions:
- Old Way: The AI would try many different approaches simultaneously, using complex calculations to figure out which one might work best
- Kimi’s Way: It learns directly from what worked well before, like a student who remembers “when I explained it this way, people understood better”
3. Multimodal Integration: Kimi k1.5 can process both images and text together.
Example: If you show it a chart with text:
- Regular AI: Might need to process the image and text separately, like reading a textbook description of a graph and then looking at the graph separately
- Kimi k1.5: Can understand both simultaneously, like a doctor looking at an X-ray while reading the patient’s symptoms — both pieces of information work together to form a complete understanding
Key Components:
1. Policy Optimization
A training mechanism that adjusts how the model makes decisions (its “policy”). Using online mirror descent means the model updates its behavior in real time based on feedback, while relative entropy regularization ensures it doesn’t drift too far from its original training. It’s the core decision-making system of the model.
Think of a GPS system that learns from traffic patterns. It starts with basic route planning and gradually learns better routes based on actual travel times, but won’t suddenly suggest completely unreasonable routes.
This prevents the model from learning harmful behaviors or drastically changing its output style while still allowing for continuous improvement.
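To make this concrete, here is a minimal sketch (not Moonshot’s actual code) of one mirror-descent-style update on a toy discrete policy: rewards tilt the distribution toward better responses, while the temperature-like parameter tau plays the role of the relative-entropy brake that keeps the new policy close to the old one.

```python
# Minimal sketch: one online-mirror-descent step for a toy discrete policy.
# The relative-entropy (KL) term keeps the updated policy close to the previous
# one; tau controls how conservative the update is.
import math

def mirror_descent_step(old_policy, rewards, tau=1.0):
    """Closed-form update: new_pi(a) is proportional to old_pi(a) * exp(reward(a) / tau)."""
    unnormalized = {a: p * math.exp(rewards[a] / tau) for a, p in old_policy.items()}
    z = sum(unnormalized.values())
    return {a: w / z for a, w in unnormalized.items()}

# Toy example: three candidate responses with feedback scores.
old_policy = {"answer_a": 0.5, "answer_b": 0.3, "answer_c": 0.2}
rewards    = {"answer_a": 0.1, "answer_b": 1.0, "answer_c": -0.5}

new_policy = mirror_descent_step(old_policy, rewards, tau=0.5)
print(new_policy)  # probability mass shifts toward answer_b, but not all at once
```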
2. Length Penalty System
A scoring formula, len_reward(i) = λ if the answer is correct and min(0, λ) if it is not, that assigns a reward based on response length. The λ value is computed from where the response length falls between the minimum and maximum acceptable lengths. This is an actual scoring system that rewards or penalizes the model based on its output length.
Like a scoring system for public speaking where you get:
– Full points (λ) if you give a correct answer within the 2–5 minute limit
– Reduced points if you go over/under time
– Zero points if you’re way off the time limit
Ensures model responses are both accurate and concise, preventing unnecessary verbosity while maintaining answer quality.
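Here is a hedged sketch of how such a length reward could be computed over a batch of sampled answers. The linear λ schedule from +0.5 (shortest response) down to −0.5 (longest) is an assumption for illustration, not a confirmed detail of Kimi’s implementation.

```python
# Sketch of the length-penalty rule described above. The constants are
# illustrative: lambda falls linearly from +0.5 (shortest answer in the batch)
# to -0.5 (longest), so long answers only score well when nothing shorter works.
def length_rewards(lengths, correct, low=-0.5, high=0.5):
    min_len, max_len = min(lengths), max(lengths)
    span = max(max_len - min_len, 1)  # avoid division by zero
    rewards = []
    for length, is_correct in zip(lengths, correct):
        lam = high - (high - low) * (length - min_len) / span
        # Correct answers keep lambda as-is; incorrect ones can only be penalized.
        rewards.append(lam if is_correct else min(0.0, lam))
    return rewards

# Three sampled answers to the same question: short-correct, long-correct, long-wrong.
print(length_rewards([120, 800, 900], [True, True, False]))
```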
3. Smart Sampling Strategies
A two-part system for choosing training examples:
a) Curriculum Sampling: Organizes training data from easy to hard
b) Prioritized Sampling: Uses a weighting rule (∝ (1 − sᵢ)) to determine how often to practice each problem, where sᵢ is how well the model currently performs on that problem (a small code sketch follows below)
Like a personalized study plan that:
a) Starts with basic multiplication before moving to calculus
b) Makes you practice more on problems you get wrong more often
Maximizes learning efficiency by focusing on areas where improvement is most needed while maintaining a manageable difficulty progression.
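The following toy sketch shows both ideas side by side; the problem list, difficulty scores, and success rates are made up for illustration.

```python
# Illustrative sketch (not Moonshot's code): curriculum sampling orders problems
# by difficulty, prioritized sampling draws problems with probability
# proportional to (1 - s_i), where s_i is the model's current success rate.
import random

problems = [
    {"id": "add_2_2",      "difficulty": 1, "success_rate": 0.95},
    {"id": "word_problem", "difficulty": 2, "success_rate": 0.60},
    {"id": "algebra",      "difficulty": 3, "success_rate": 0.20},
]

# a) Curriculum sampling: easy problems first.
curriculum = sorted(problems, key=lambda p: p["difficulty"])

# b) Prioritized sampling: weight each problem by (1 - success_rate).
weights = [1.0 - p["success_rate"] for p in problems]
batch = random.choices(problems, weights=weights, k=5)
print([p["id"] for p in batch])  # hard, frequently-missed problems appear more often
```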
Let’s dig deeper:
Stage 1: Pretraining (The Learning Foundation)
Pretraining is the initial training phase where a model learns general patterns and representations from a large unlabeled dataset through self-supervised learning objectives (like predicting masked tokens or next-token prediction). This creates a foundation of learned parameters that can later be fine-tuned for specific downstream tasks.
Example:
- Phase 1: Model learns “A cat is a small furry pet” (text only)
- Phase 2: Starts seeing cat pictures with descriptions
- Phase 3: Can understand both “cat” in text and images of cats together
Cooldown Phase:
The Cooldown Phase is a specialized post-pretraining optimization stage where the model undergoes controlled parameter adjustment through targeted dataset exposure. Think of it as a gradual ramp-up in difficulty:
- Day 1: Simple math (2+2)
- Week 1: Word problems (If John has 2 apples…)
- Month 1: Complex problems (algebra)
Long-context Activation:
Long-context Activation refers to the model’s capability to process and maintain coherent attention spans across extended token sequences.
Like training someone to read an entire book and remember all details:
- Start: Reading paragraphs
- Middle: Reading chapters
- End: Understanding entire books and connecting all information
Stage 2: Supervised Fine-Tuning
SFT is a training phase where the model learns from a curated dataset of high-quality input-output pairs, optimized through cross-entropy loss with specialized hyperparameters (learning rate: 1e-5 to 1e-6, batch size: 32–128). The training data is carefully balanced across different task categories (500K general QA, 200K coding, 200K math/science samples) with strict quality control mechanisms requiring 85% expert validation.
Think of teaching a medical diagnosis system:
- Input: Patient symptoms (fever, cough, fatigue)
- Ground Truth: Doctor’s correct diagnosis
- Training: Model learns to match doctor’s diagnosis
- Validation: Check against other doctors’ opinions
- Quality Control: Only keep high-agreement cases
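As a rough illustration of what one such fine-tuning step looks like in code, here is a minimal sketch: a toy model and random tokens stand in for Kimi’s actual architecture and curated data, and only the learning-rate range matches the description above.

```python
# Minimal SFT sketch (illustrative, not Kimi's training code): a tiny next-token
# model is fine-tuned with cross-entropy loss at a learning rate in the
# 1e-5 to 1e-6 range mentioned above. Vocabulary and data are toys.
import torch
import torch.nn as nn

vocab_size, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss_fn = nn.CrossEntropyLoss()

# Toy "curated pairs": each row is a token sequence; the target is the next token.
tokens = torch.randint(0, vocab_size, (32, 16))   # batch of 32 sequences
inputs, targets = tokens[:, :-1], tokens[:, 1:]

for step in range(3):
    logits = model(inputs)                         # (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.3f}")
```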
Stage 3: Chain-of-Thought Training
Phase where the model learns to decompose complex problems into explicit reasoning steps using intermediate state validation and backpropagation through each reasoning stage (using step-specific loss functions and attention masking). The architecture employs recursive processing with validation gates between steps to ensure logical consistency.
When solving “35 × 25”, instead of direct output “875”, the model learns to think:
- “Let me break this down: 35 × 25”
- “First: 35 × 20 = 700”
- “Then: 35 × 5 = 175”
- “Finally: 700 + 175 = 875”
Each step is validated before proceeding to the next, similar to a math teacher checking each step of a student’s work.
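A toy sketch of that step-validation idea, using the worked example above; this only illustrates the concept, not the model’s internal mechanism.

```python
# Each intermediate claim in the chain is checked before the final answer is
# accepted, loosely mirroring the "validation gates" idea described above.
steps = [
    ("35 * 20", 700),
    ("35 * 5", 175),
    ("700 + 175", 875),
]

for expression, claimed in steps:
    actual = eval(expression)  # safe here: expressions are hard-coded, trusted strings
    assert actual == claimed, f"step failed: {expression} = {actual}, not {claimed}"
    print(f"validated: {expression} = {claimed}")

print("final answer accepted: 875")
```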
Stage 4: The Smart Reward System
The Smart Reward System (Reinforcement Learning Implementation) employs a dual reward architecture where two parallel evaluation systems work simultaneously: one assessing final output accuracy (reward_final) and another evaluating the quality of reasoning steps (reward_process), with dynamic weighting (λ=0.3~0.7) between them. The system uses Policy Gradient optimization with a KL-divergence constraint to prevent deviation from pretrained behaviors.
For the math problem “What is 15% of 80?”:
1. Classic Reward:
- Correct answer “12” → High reward
- Wrong answer → Low reward
2. Process Reward:
- Good process: “First convert 15% to 0.15, then multiply by 80” → High reward
- Poor process: “Random guessing” → Low reward
Even if the final answer is correct, the model gets higher total reward for showing proper reasoning steps, similar to a teacher giving partial credit for showing work even if the final answer is wrong.
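A minimal sketch of how the two reward signals could be blended; the stand-in scoring values and the default λ = 0.5 are assumptions for illustration, not Moonshot’s reward models.

```python
# Hedged sketch of the dual-reward idea: total reward blends answer correctness
# with reasoning quality via a weight lam in the 0.3-0.7 range described above.
def total_reward(final_correct: bool, process_score: float, lam: float = 0.5) -> float:
    """lam weights the process reward; (1 - lam) weights the final-answer reward."""
    reward_final = 1.0 if final_correct else 0.0
    reward_process = max(0.0, min(1.0, process_score))  # clip to [0, 1]
    return (1 - lam) * reward_final + lam * reward_process

# "What is 15% of 80?"
print(total_reward(final_correct=True,  process_score=0.9))   # correct + good reasoning
print(total_reward(final_correct=True,  process_score=0.1))   # correct but poor reasoning
print(total_reward(final_correct=False, process_score=0.8))   # wrong answer, sound steps
```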
1. System Architecture
The architecture employs a dual-phase system:
a. Training phase using Megatron-LM for distributed training, and
b. Inference phase using vLLM for optimized response generation.
Memory management includes three stages: initial weight loading, cleanup/offloading, and inference preparation, with dynamic memory allocation based on batch size and sequence length requirements.
Example:
Like a restaurant with a training kitchen (where chefs learn and practice) and a service kitchen (where orders are quickly prepared), each with its own optimized setup and workflow.
This dual-system approach maximizes efficiency by separating resource-intensive training from fast inference, allowing the model to both learn effectively and respond quickly when deployed.
2. Checkpoint Engine & Parallelism Types
The system implements three-way parallelism:
– Pipeline: Sequential layer processing across GPUs
– Expert: Task-specific GPU specialization
– Tensor: Matrix operations distributed across multiple GPUs
Each managed by a centralized checkpoint system for synchronization and state management.
Example:
Think of an assembly line for building a car:
– Pipeline: Different stations handle specific parts (engine, body, interior)
– Expert: Specialized teams for specific tasks (electronics, welding, painting)
– Tensor: Large tasks split among multiple workers (like four people assembling one large component together)
This triple parallelism approach is crucial for handling the massive computational requirements of a 128k-context model, enabling efficient processing of large datasets while maintaining training stability and preventing memory bottlenecks.
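To make the tensor-parallel part concrete, here is a toy NumPy sketch of splitting one large matrix multiply across four workers. Real systems such as Megatron-LM do this across GPUs with communication collectives, which this illustration omits.

```python
# Conceptual sketch of tensor parallelism: a large matrix multiply is split
# column-wise across "workers" and the partial results are concatenated.
import numpy as np

x = np.random.randn(4, 512)          # activations: a batch of 4
w = np.random.randn(512, 1024)       # one large weight matrix

shards = np.split(w, 4, axis=1)      # 4 "GPUs", each holding 256 output columns
partials = [x @ shard for shard in shards]   # each worker computes its slice
y_parallel = np.concatenate(partials, axis=1)

assert np.allclose(y_parallel, x @ w)        # same result as the unsharded multiply
```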
1. Long2Short System
A response optimization system that combines model merging, rejection sampling, and preference-based learning to generate concise yet complete responses. It employs multiple parallel approaches to achieve optimal length-to-information ratio in model outputs.
Like having an expert editor who can take a long academic paper and turn it into a clear abstract while keeping all key points.
Critical for making AI responses more user-friendly and efficient, addressing the common problem of AI models being unnecessarily verbose while ensuring no important information is lost.
2. Model Merging
A parameter averaging technique that combines weights from two specialized models (verbose and concise) into a single model, using weighted averaging of neural network parameters to preserve the strengths of both models.
Like combining recipes from two chefs — one who writes detailed 20-step instructions and another who writes quick 5-step versions — to create a perfectly balanced recipe.
This approach is essential for creating a balanced model that can maintain the detailed understanding of the verbose model while delivering the efficiency of the concise model, without training a new model from scratch.
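A minimal sketch of parameter-wise weight averaging, assuming the two checkpoints share an architecture; the toy linear layers and the 50/50 mix are illustrative choices, not the paper’s settings.

```python
# Sketch of weight averaging between a "verbose" and a "concise" checkpoint.
import torch
import torch.nn as nn

def merge_state_dicts(verbose_sd, concise_sd, alpha=0.5):
    """Return a parameter-wise weighted average of two compatible state dicts."""
    return {k: alpha * verbose_sd[k] + (1 - alpha) * concise_sd[k] for k in verbose_sd}

# Toy models standing in for the two specialized checkpoints.
verbose_model, concise_model = nn.Linear(8, 8), nn.Linear(8, 8)
merged_model = nn.Linear(8, 8)
merged_model.load_state_dict(
    merge_state_dicts(verbose_model.state_dict(), concise_model.state_dict())
)
```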
3. Shortest Rejection Sampling
A multi-candidate generation system that produces multiple response variations for the same input, then selects the optimal response based on both accuracy and brevity metrics using comparative scoring.
Like asking eight different people to explain something, then picking the explanation that’s both correct and shortest.
Ensures the model consistently selects the most efficient way to communicate information by generating and comparing multiple possibilities, rather than settling for the first acceptable answer.
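A sketch of the selection logic; `generate` and `is_correct` are hypothetical stand-ins for the model call and the answer checker, and the canned outputs exist only to make the example runnable.

```python
# Shortest rejection sampling: generate several candidates, keep the correct
# ones, return the shortest.
def shortest_rejection_sampling(prompt, generate, is_correct, k=8):
    candidates = [generate(prompt) for _ in range(k)]
    correct = [c for c in candidates if is_correct(c)]
    if not correct:
        return min(candidates, key=len)   # fallback: no candidate passed the check
    return min(correct, key=len)          # shortest correct answer wins

# Toy usage with canned outputs.
canned = iter(["12, because 0.15 * 80 = 12", "12", "fifteen percent of 80 is 12", "13"])
pick = shortest_rejection_sampling(
    "What is 15% of 80?",
    generate=lambda _: next(canned),
    is_correct=lambda ans: "12" in ans,
    k=4,
)
print(pick)  # "12"
```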
4. Direct Preference Optimization (DPO)
A training approach that uses paired examples (preferred vs. non-preferred responses) to directly teach the model to favor concise outputs while maintaining information completeness.
Like training a student by showing them two essays — one concise and one verbose — and consistently rewarding them for matching the style of the concise one.
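A minimal sketch of the standard DPO loss (not a Kimi-specific formulation), where each input is the summed log-probability of a full response under the current policy or the frozen reference model; the numbers below are toy values.

```python
# DPO pushes the policy to prefer the chosen (concise) response over the
# rejected (verbose) one, relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """All inputs are summed log-probabilities of full responses under each model."""
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy numbers: the policy already slightly prefers the concise response.
loss = dpo_loss(
    logp_chosen=torch.tensor([-20.0]), logp_rejected=torch.tensor([-35.0]),
    ref_logp_chosen=torch.tensor([-22.0]), ref_logp_rejected=torch.tensor([-33.0]),
)
print(loss.item())
```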
Moonshot AI claims state-of-the-art reasoning performance
Achieves top-tier results across multiple benchmarks and modalities:
- AIME: 77.5
- MATH 500: 96.2
- Codeforces: 94th percentile
- MathVista: 74.9
Matches OpenAI’s o1 model in reasoning capabilities.
Long2Short Optimization for Short-CoT Models:
- Implements long-CoT techniques to enhance short-CoT performance.
Delivers best-in-class short-CoT reasoning results:
- AIME: 60.8
- MATH 500: 94.6
- LiveCodeBench: 47.3
Outperforms existing short-CoT models like GPT-4o and Claude Sonnet 3.5 by a huge margin (up to +550%).
As reinforcement learning continues to evolve, models like Kimi k1.5 set the stage for more dynamic and human-like AI systems. By combining efficiency with depth, Moonshot AI has introduced a model that not only competes with the best but also redefines how AI learns, adapts, and interacts. The future of AI isn’t just about predicting the next word — it’s about learning, reasoning, and improving in real time.
- Kimi-k1.5 Paper: https://arxiv.org/pdf/2501.12599
- Notebook LLM podcast: https://notebooklm.google.com/notebook/a9c1a187-7d53-4115-a452-b533af660892/audio