An alternative approach to the transformer model for text generation
Since the release of ChatGPT at the end of November 2022, LLMs (Large Language Models) have become almost a household name.
There is good reason for this: their success lies in their architecture, particularly the attention mechanism, which allows the model to compare every word it processes with every other word.
This is what gives LLMs the extraordinary ability to understand and generate human-like text that we are all familiar with.
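To make the idea concrete, here is a minimal NumPy sketch of scaled dot-product self-attention, the core operation inside a transformer. It is an illustrative simplification rather than the implementation of any particular LLM, but the (sequence length × sequence length) score matrix is exactly where every token gets compared with every other token, and it is also where the quadratic cost of attention comes from.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compare every token's query with every token's key, then mix the values."""
    d_k = Q.shape[-1]
    # Pairwise similarity of every token with every other token: (seq_len, seq_len).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over each row turns similarities into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted average of all value vectors.
    return weights @ V

# Toy example: 4 tokens, embedding dimension 8 (values are illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # (4, 8)
```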
However, these models are not without flaws. They demand immense computational resources to train. For example, Meta’s Llama 3 model took 7.7 million GPU hours of training[1]. Moreover, their reliance on enormous datasets — spanning trillions of tokens — raises questions about scalability, accessibility, and environmental impact.
Despite these challenges, ever since the publication of ‘Attention Is All You Need’ in mid-2017, much of the progress in AI has focused on scaling the attention mechanism further rather than on exploring fundamentally new architectures.