Part 2/2 — Scaling Thomson Reuters’ Language Model Research

By John Duprey, Thomson Reuters Labs (July 2024)



Finally! With our HyperPod cluster set up, our capacity plan shared with the HyperPod team, and a Labs custom command line interface (CLI) in place to ease training job management, we were ready to experiment with training LLMs. This was a journey!

By the numbers: Over the course of ~5 months, we successfully ran ~20 training jobs. We scaled our cluster up to 16 p4ds, and our largest job utilized the entire cluster. We trained a 70b-parameter model on 400b input tokens, a run that took 36 days to complete.

The most amazing aspect of this was that we had zero hardware failures! This is perhaps a testament to HyperPod’s pre-flight health checks, which are performed on instances before they are made available in the cluster.

Initial Training Runs and Findings

While our experimentation is far from complete, we do have some positive preliminary findings. What I’m sharing here is an informal summary. More detailed analysis and results will be published in the future by the Labs Foundational Research team.

Continuous Pre-Training (CPT)

In continuous pre-training (CPT), you train from an existing open-source (OS) LLM checkpoint. This is more than a time-saver; it is a strategic decision that allows for the nuanced growth of the model’s capabilities over time.
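To make the mechanics concrete, here is a minimal CPT sketch using Hugging Face Transformers: load an existing open-source checkpoint and simply continue next-token-prediction training on a domain corpus. The checkpoint name, corpus file, and hyperparameters below are illustrative placeholders, not our actual Labs configuration.

```python
# Minimal continuous pre-training (CPT) sketch with Hugging Face Transformers.
# Checkpoint, data file, and hyperparameters are placeholders for illustration only.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "mosaicml/mpt-7b"  # start from an existing open-source checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, trust_remote_code=True)

# Domain corpus (placeholder path); CPT continues causal-LM training on it.
raw = load_dataset("text", data_files={"train": "legal_corpus.txt"})["train"]
train_ds = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="cpt-legal",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,   # typically lower than from-scratch pre-training
    num_train_epochs=1,
    bf16=True,
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```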

The preliminary results of our experimentation showed that we were able to train models on the legal domain without losing general knowledge.

CPT Legal vs. General Perplexity

We used a measure called perplexity, which quantifies how well the model predicts a sample of text. In essence, perplexity measures the confidence a model has in its predictions; lower perplexity indicates that the model is more certain about them. The graphs above show that as we added batches of training, legal perplexity decreased while general perplexity increased somewhat and then quickly leveled off.
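Concretely, perplexity is the exponential of the average per-token cross-entropy (negative log-likelihood). A small sketch, where the model name and sample text are placeholders:

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any causal LM checkpoint works the same way.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels supplied, the model returns the mean per-token cross-entropy.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

# Tracking this on held-out legal text shows domain gains; tracking it on
# general text guards against forgetting general knowledge.
print(perplexity("The appellate court reversed the lower court's ruling."))
```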

Part of our experimentation was determining the right split of domain-specific (legal) and general data to train with.
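One simple way to express such a split, shown purely as an illustration (the file names and the 80/20 ratio are placeholders, not the mix we settled on), is to interleave the two corpora with sampling probabilities:

```python
from datasets import load_dataset, interleave_datasets

# Placeholder corpora for the domain-specific and general portions of the mix.
legal = load_dataset("text", data_files="legal_corpus.txt")["train"]
general = load_dataset("text", data_files="general_corpus.txt")["train"]

# Sample 80% of batches from legal text and 20% from general text.
mixed = interleave_datasets(
    [legal, general],
    probabilities=[0.8, 0.2],
    seed=42,
    stopping_strategy="all_exhausted",
)
```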

Instruct fine-tuning (IFT)

Instruct fine-tuned LLMs are tuned to respond to specific instructions, enabling tasks such as question answering, summarization, and brainstorming. For instance, human-written instruction datasets include prompts like “summarize this article” or “list fun weekend activities.” Our hypothesis is that Legal LLMs can benefit from diverse legal instructions.
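To illustrate the data format (these records are hypothetical examples, not our actual instruction data), each instruction-tuning record pairs an instruction, optional input text, and a target response that the model learns to produce:

```python
# Hypothetical instruction-tuning records, for illustration only.
ift_examples = [
    {
        "instruction": "Summarize this article.",
        "input": "<article text>",
        "output": "<reference summary>",
    },
    {
        "instruction": "Draft a headnote for the following court opinion.",
        "input": "<opinion text>",
        "output": "<reference headnote>",
    },
    {
        "instruction": "List fun weekend activities.",
        "input": "",
        "output": "Hiking, visiting a museum, trying a new recipe...",
    },
]

def to_prompt(example: dict) -> str:
    # One common way to serialize a record into a single training string.
    prompt = f"### Instruction:\n{example['instruction']}\n"
    if example["input"]:
        prompt += f"### Input:\n{example['input']}\n"
    return prompt + f"### Response:\n{example['output']}"
```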

We have discovered that our Legal LLM benefits greatly from a vast array of diverse instructions. By compiling legal instructions, such as drafting legal headnotes, and combining them with publicly available instructions, our MPT-TR-7b model, derived from MPT-7b, has shown improvements that correlate with the number of instruction datasets provided.

We used an automatic measure called ROUGE to determine how well our domain-adapted models performed compared to GPT-4. This measure, based on term overlap, is not the same as human preference judgment, but it gives us some degree of confidence that we are headed in the right direction.
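As an illustration of this kind of term-overlap scoring, here is a small sketch using the open-source rouge_score package; the reference and candidate texts are placeholders:

```python
from rouge_score import rouge_scorer

# Score a candidate summary against a reference summary with ROUGE-1/2/L.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The court granted the motion to dismiss for lack of jurisdiction."
candidate = "The motion to dismiss was granted because the court lacked jurisdiction."

scores = scorer.score(reference, candidate)
for name, s in scores.items():
    print(f"{name}: precision={s.precision:.2f} recall={s.recall:.2f} f1={s.fmeasure:.2f}")
```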

Legal Summarization

Our MPT-TR-7b model has demonstrated proficiency in legal summarization tasks, rivaling GPT-4’s performance when evaluated with automatic metrics assessing word overlap with reference summaries. While a human-based evaluation would offer deeper insights, the initial results are compelling evidence of the model’s capabilities.

As IFT collections were added, ROUGE evaluations were on par with or better than GPT-4

Legal Classification

In other legal tasks, such as classification, which we measured with accuracy and precision/recall, there is still room for improvement compared to GPT-4. Nonetheless, performance clearly improves as the instruction datasets expand. Even more exciting is the leap in performance observed with larger base models like MPT-30b.

Even for larger models, ROUGE scores did not reach GPT-4 levels on this task

NOTE: Results for the third task, legal question answering, are not available at this time.

Next Steps


With the advent of even more capable models like Mistral-7b, which matches MPT-30b’s performance, we are eager to explore the potential of more recently released models. As a next step, we have been training on Mixtral-8x7b and LLaMa-3-70b, which seem to give us even better performance than the smaller models we trained previously.

Looking ahead, the integration of new alignment methods, such as DPO (Direct Preference Optimization), could further narrow the performance gap, paving the way for the next generation of specialized LLMs that could revolutionize the legal tech landscape. DPO also has a favorable impact on training-scale requirements, since it simplifies the alignment process and reduces computational overhead.
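For readers unfamiliar with DPO, its objective can be written compactly. The sketch below is a toy illustration of that loss (not our training pipeline): it takes the summed log-probabilities of a preferred and a dispreferred response under the policy being trained and under a frozen reference model, and pushes the policy to widen the margin between them.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit "reward" of each response: log-ratio between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with made-up log-probabilities:
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)
```

Because the loss needs only the policy and a frozen reference model, there is no separate reward model or reinforcement-learning loop to train, which is where the reduced overhead comes from.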
