Beyond Causal Language Modeling: A Deep Dive into “Not All Tokens Are What You Need”


Contributions of This Work

This paper provides both an illuminating analysis of token-level training dynamics and a new technique, Selective Language Modeling (SLM):

Token Loss Analysis:
The authors demonstrate that a majority of tokens contribute little beyond the initial training phase, while a small subset retains persistently high loss throughout training.

SLM for Focused Learning:
By leveraging a reference model to gauge how “useful” each token is, they drastically reduce the number of training tokens without sacrificing quality, and in many cases even boost downstream performance (a minimal sketch of the selection step follows this list).

Broad Demonstration of Effectiveness:
SLM works not only on math-specific tasks but also in more general domains, whether the reference model is trained on a meticulously curated dataset or drawn from the same large corpus as the training data.
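
To make the mechanism concrete, here is a minimal PyTorch sketch of the selection step: score each token by its excess loss (the training model’s loss minus the reference model’s) and backpropagate only through the top-scoring fraction. The function name and the `keep_ratio` default are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def slm_loss(train_logits, ref_logits, labels, keep_ratio=0.6):
    """Selective Language Modeling loss (minimal sketch).

    Each token is scored by its excess loss: the training model's
    cross-entropy minus the reference model's. Only the top
    `keep_ratio` fraction of tokens contributes to the loss.
    `labels` are assumed to be already shifted for next-token
    prediction.
    """
    vocab = train_logits.size(-1)
    train_ce = F.cross_entropy(
        train_logits.view(-1, vocab), labels.view(-1), reduction="none"
    )
    with torch.no_grad():
        ref_ce = F.cross_entropy(
            ref_logits.view(-1, vocab), labels.view(-1), reduction="none"
        )

    # Excess loss: high where the training model lags the reference.
    excess = (train_ce - ref_ce).detach()

    # Keep the top-k% highest-scoring tokens in the batch.
    k = max(1, int(keep_ratio * excess.numel()))
    threshold = excess.topk(k).values.min()
    mask = (excess >= threshold).float()

    # Average the training loss over the selected tokens only.
    return (train_ce * mask).sum() / mask.sum()
```

Since the reference model never needs gradients, its per-token losses could also be precomputed once over the whole corpus rather than recomputed at every step.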

Where Could This Go Next?

SLM opens up several directions for future research. For example:

Scaling Up Further:
Though the paper primarily focuses on models around 1B to 7B parameters, there remains the open question of how SLM performs at the 30B, 70B, or 100B+ scale. If the token-level approach generalizes well, the cost savings could be enormous for truly massive LLMs.

Reference Models via API:
If you can’t gather curated data, you could perhaps use an API-based language model as your reference. That might make SLM more practical for smaller research teams that lack the resources to train their own reference model.
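
As a rough illustration, suppose you had a wrapper around such an API that returns per-token log-probabilities; the `get_token_logprobs` callable below is hypothetical, not any particular vendor’s API.

```python
def api_reference_losses(tokens, get_token_logprobs):
    """Per-token reference losses from an API model (sketch).

    `get_token_logprobs` is a hypothetical wrapper around whatever
    endpoint you use; it should return one log-probability for each
    token in `tokens`. A token's reference loss is its negative
    log-probability, which can feed directly into the excess-loss
    scoring sketched earlier.
    """
    return [-lp for lp in get_token_logprobs(tokens)]
```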

Reinforcement Learning Extensions:
Imagine coupling SLM with reinforcement learning. The reference model could act as a “reward model,” and token selection might then be optimized through something akin to policy gradients.
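
This pairing is purely speculative, but a minimal sketch might treat the reference model’s per-token scores as rewards in a REINFORCE-style objective; all names below are illustrative.

```python
import torch

def reinforce_token_loss(train_logprobs: torch.Tensor,
                         ref_scores: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style token weighting (speculative sketch).

    `train_logprobs` are the training model's log-probabilities of
    the observed tokens; `ref_scores` are rewards from the reference
    model acting as a reward model. Centering the rewards acts as a
    simple variance-reduction baseline.
    """
    advantages = (ref_scores - ref_scores.mean()).detach()
    # Push up the log-probabilities of tokens the reference rewards.
    return -(advantages * train_logprobs).mean()
```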

Multiple Reference Models:
Instead of a single reference model, you could train or gather several, each focusing on a different domain or style, then combine their token scores into a more robust multi-domain filtering system (sketched below).
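
A sketch of what combining scores could look like, assuming each reference model yields a per-token loss tensor; taking the element-wise minimum is just one plausible combination rule.

```python
import torch

def multi_reference_excess(train_ce, ref_ces):
    """Combine several reference models' token losses (sketch).

    `train_ce` is the training model's per-token loss; `ref_ces`
    holds one per-token loss tensor per reference model. Taking the
    element-wise minimum treats a token as "easy" if any single
    domain expert already predicts it well.
    """
    ref = torch.stack(ref_ces).min(dim=0).values
    # Excess loss: large where the training model lags every expert.
    return train_ce - ref
```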

Alignment and Safety:
There’s a growing trend toward factoring alignment and truthfulness into training objectives. One might train a reference model that assigns higher scores to well-supported statements and zeroes out tokens that look factually incorrect or harmful.
