When translating English sentences into Spanish, maintaining control over verb conjugation can be challenging. Tools like Google Translate often provide grammatically correct translations but may not always align with the specific tense or verb form desired. This can be particularly frustrating when precision is required.
In this work, we explore fine-tuning a MarianMT model to translate English present-tense sentences into Spanish while enforcing a specific verb conjugation in the past, present, or future tense. By training the model with carefully curated data, we achieve greater control over tense consistency, ensuring translations align exactly with the intended grammatical structure.
The goal of this project is to create a dataset of English sentences with randomly selected present-tense verbs and nouns, paired with Spanish translations where the verbs are conjugated into past, present, and future tenses. The model is trained to take an English sentence as input and generate a Spanish translation with the verb correctly conjugated to the specified tense.
To accomplish this, we introduce Verb Tense Conjugation (VTC) tokens, which are appended to each English sentence to indicate the desired verb tense. These tokens, [PRESENT], [PAST], and [FUTURE], guide the model in generating the correct verb form in the translated Spanish sentence.
We leverage pretrained weights to preserve the quality of the underlying English-to-Spanish translation while training the model to associate each VTC token with the correct Spanish verb conjugation.
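For example, the same English sentence tagged three ways yields three different Spanish targets (these pairs come from the dataset constructed below):
You continue the board. [PRESENT] → Continúa con la junta.
You continue the board. [PAST] → Tú continuaste con la junta.
You continue the board. [FUTURE] → Continuará con la junta.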
MarianMT is an efficient, open-source neural machine translation model built on the Transformer architecture and optimized for speed and scalability. It is based on the Marian NMT framework, developed mainly by the Microsoft Translator team with contributions from the University of Edinburgh and the Adam Mickiewicz University; the Helsinki-NLP checkpoints used here were trained by the University of Helsinki on the OPUS dataset, a large collection of parallel multilingual corpora.
The model follows the encoder-decoder structure of Transformers, where the encoder processes input text into contextualized embeddings, and the decoder generates translated output by attending to these embeddings with self-attention and cross-attention mechanisms.
Unlike traditional recurrent models, MarianMT processes entire sequences in parallel through self-attention, which significantly improves both translation quality and training efficiency.
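For orientation, the size of the pretrained checkpoint used throughout this post can be read directly off its configuration (a quick inspection sketch; the exact values depend on the checkpoint):
from transformers import MarianMTModel

model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-es")

# Encoder-decoder Transformer hyperparameters
print(model.config.encoder_layers, model.config.decoder_layers)  # encoder/decoder depth
print(model.config.d_model)                                      # hidden size
print(model.config.encoder_attention_heads)                      # attention heads per layer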
To fine-tune the MarianMT model for verb conjugation, we created a dataset that pairs each English present-tense sentence with its Spanish translations in the present, past, and future tenses. We used the WordNet lexical database to sample random verbs and nouns (specifically, concrete "thing" nouns):
from nltk.corpus import wordnet
import nltk

nltk.download('wordnet')
# Get a list of verb synsets
verb_synsets = list(wordnet.all_synsets(pos=wordnet.VERB))
# Get a list of noun synsets
noun_synsets = list(wordnet.all_synsets(pos=wordnet.NOUN))
# Filter for concrete things (this may not be perfect)
thing_nouns = [syn.lemmas()[0].name() for syn in noun_synsets if syn.lexname() == 'noun.artifact']
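A quick sanity check (illustrative; the exact counts depend on your WordNet version) shows the size of each pool and what the sampled lemmas look like:
print(len(verb_synsets), "verb synsets")
print(len(thing_nouns), "artifact nouns")
print(thing_nouns[:5])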
We used the pattern.en Python package to conjugate random English verbs for various pronouns (I, he, she, we, they).
from pattern.en import lexeme
import random

random_verb = random.choice(verb_synsets).lemmas()[0].name()
random_noun = random.choice(thing_nouns).replace("_"," ")
conjugated_verb = lexeme(random_verb)
print(conjugated_verb)
print("He " + conjugated_verb[1] + " the " + random_noun)
To translate the English present tense sentence to the conjugated Spanish sentences, the “Helsinki-NLP/opus-mt-en-es” pre-trained MarianMT model and tokenizer from HuggingFace were used.
from transformers import MarianMTModel, MarianTokenizer

# Load pre-trained model and tokenizer
model_name = "Helsinki-NLP/opus-mt-en-es"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)
def translate(text):
    # Tokenize input text
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    # Generate translation
    translated = model.generate(**inputs)
    # Decode translation
    return tokenizer.decode(translated[0], skip_special_tokens=True)
random_verb = random.choice(verb_synsets).lemmas()[0].name()
random_noun = random.choice(thing_nouns).replace("_"," ")
conjugated_verb = lexeme(random_verb)
english_text = "He " + conjugated_verb[1] + " the " + random_noun
spanish_translation = translate(english_text)
print(f"English: {english_text}")
print(f"Spanish: {spanish_translation}")
# OUTPUT
# English: He makes the monocle
# Spanish: Él hace el monóculo
The dataset consists of 10,000 rows of English-to-Spanish translations covering the present, past, and future tense verb conjugations. Given a random verb, a random noun, and a random choice of pronoun, English present, past, and future tense sentences are constructed. These sentences are then translated into Spanish using the translate() function above. The original English present-tense sentence and the three corresponding Spanish sentences are written to a CSV file.
import csv

numRows = 10000
pronouns = ["I", "You", "He", "She", "We", "They"]
addedRows = 0
# Open the CSV file in append mode ("a")
with open("data10000.csv", mode="a", newline="") as file:
    writer = csv.writer(file)
    while addedRows < numRows:
        random_verb = random.choice(verb_synsets).lemmas()[0].name()
        random_noun = random.choice(thing_nouns).replace("_", " ")
        conjugated_verb = lexeme(random_verb)
        # Skip multi-word verbs
        if "_" in conjugated_verb[0]:
            continue
        # Skip verbs that lack a distinct past-tense form at index 3
        if len(conjugated_verb) < 4:
            continue
        current_pronoun = random.choice(pronouns)
        # He/She take the third-person singular form (index 1)
        if current_pronoun == "He" or current_pronoun == "She":
            presentVerb = conjugated_verb[1]
        else:
            presentVerb = conjugated_verb[0]
        pastSentence = current_pronoun + " " + conjugated_verb[3] + " the " + random_noun + "."
        presentSentence = current_pronoun + " " + presentVerb + " the " + random_noun + "."
        futureSentence = current_pronoun + " will " + conjugated_verb[0] + " the " + random_noun + "."
        pastSentenceSpanish = translate(pastSentence)
        presentSentenceSpanish = translate(presentSentence)
        futureSentenceSpanish = translate(futureSentence)
        row = [presentSentence, presentSentenceSpanish, pastSentenceSpanish, futureSentenceSpanish]
        # Write the list as a single row
        writer.writerow(row)
        addedRows += 1
        if addedRows % 1000 == 0:
            print("Finished " + str(addedRows) + " rows")
The CSV is then read into a pandas DataFrame, and a tense token is appended to each English sentence to indicate which tense the corresponding Spanish sentence is in. The present, past, and future tense pairs are then split into separate rows of a new dataframe.
import pandas as pd

# Define column names
column_names = ["English Present", "Spanish Present", "Spanish Past", "Spanish Future"]
# Read CSV with custom column names
df = pd.read_csv("data10000.csv", names=column_names, header=None)
# Modify the sentences to include the tense
df['English Present'] = df['English Present'] + ' [PRESENT]'
# Use regex=False so the bracketed token is matched literally, not as a regex character class
df['English Past'] = df['English Present'].str.replace('[PRESENT]', '[PAST]', regex=False)
df['English Future'] = df['English Present'].str.replace('[PRESENT]', '[FUTURE]', regex=False)
# Example of the modified sentences
print(df.iloc[0]['English Present'])
print(df.iloc[0]['English Past'])
print(df.iloc[0]['English Future'])
# Create a new dataframe with the desired structure
split_df = pd.DataFrame(columns=["English Sentence", "Spanish Sentence"])
# List to hold the rows to be added
rows = []
# Iterate through the original dataframe and populate the new dataframe
for index, row in df.iterrows():
    rows.append({"English Sentence": row["English Present"], "Spanish Sentence": row["Spanish Present"]})
    rows.append({"English Sentence": row["English Past"], "Spanish Sentence": row["Spanish Past"]})
    rows.append({"English Sentence": row["English Future"], "Spanish Sentence": row["Spanish Future"]})
# Convert the list of rows to a DataFrame and concatenate with split_df
split_df = pd.concat([split_df, pd.DataFrame(rows)], ignore_index=True)
print(split_df.shape)
print(split_df.head())
# OUTPUT:
# You continue the board. [PRESENT]
# You continue the board. [PAST]
# You continue the board. [FUTURE]
# (30000, 2)
# English Sentence Spanish Sentence
# 0 You continue the board. [PRESENT] Continúa con la junta.
# 1 You continue the board. [PAST] Tú continuaste con la junta.
# 2 You continue the board. [FUTURE] Continuará con la junta.
Once the data is formatted correctly, the next step is transforming the training data from strings to tokens. To do this, the VTC tokens ([PRESENT], [PAST], and [FUTURE]) must be added to the tokenizer. These tokens should not be treated as part of the sentence; instead, they tell the model which tense to conjugate the verb into during translation. Once the tense tokens are added, the train, validation, and test datasets are created.
import pandas as pd
from transformers import MarianTokenizer
from sklearn.model_selection import train_test_split

# Load MarianMT tokenizer
tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-es')
# Add custom tokens for verb tenses
new_tokens = ['[PAST]', '[PRESENT]', '[FUTURE]']
tokenizer.add_tokens(new_tokens)
train_df, temp_df = train_test_split(split_df, test_size=0.2, random_state=42)
val_df, test_df = train_test_split(temp_df, test_size=0.5, random_state=42)
train_encodings = tokenizer(train_df['English Sentence'].tolist(), padding=True, truncation=True, max_length=512)
train_labels = tokenizer(train_df['Spanish Sentence'].tolist(), padding=True, truncation=True, max_length=512)
val_encodings = tokenizer(val_df['English Sentence'].tolist(), padding=True, truncation=True, max_length=512)
val_labels = tokenizer(val_df['Spanish Sentence'].tolist(), padding=True, truncation=True, max_length=512)
test_encodings = tokenizer(test_df['English Sentence'].tolist(), padding=True, truncation=True, max_length=512)
test_labels = tokenizer(test_df['Spanish Sentence'].tolist(), padding=True, truncation=True, max_length=512)
# Convert to pytorch dataset format
import torch
class TranslationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.encodings['input_ids'])

    def __getitem__(self, idx):
        return {
            'input_ids': torch.tensor(self.encodings['input_ids'][idx]),
            'attention_mask': torch.tensor(self.encodings['attention_mask'][idx]),
            'labels': torch.tensor(self.labels['input_ids'][idx])
        }
train_dataset = TranslationDataset(train_encodings, train_labels)
val_dataset = TranslationDataset(val_encodings, val_labels)
test_dataset = TranslationDataset(test_encodings, test_labels)
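A quick check (illustrative) confirms the tense markers are encoded as single tokens rather than split into subword pieces:
# Each marker should appear as one token with its own new vocabulary ID
print(tokenizer.tokenize("They swim in the pool. [FUTURE]"))
print(tokenizer.convert_tokens_to_ids(['[PAST]', '[PRESENT]', '[FUTURE]']))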
We then fine-tune the same MarianMT encoder-decoder and tokenizer on the created conjugation dataset.
from transformers import MarianMTModel
from torch.optim import AdamW
from torch.utils.data import DataLoader

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
# Load pre-trained MarianMT model for English to Spanish translation
model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-en-es').to(device)
# Resize the model's token embeddings to include the new tokens
model.resize_token_embeddings(len(tokenizer))
# Set up the optimizer
optimizer = AdamW(model.parameters(), lr=1e-5)
# DataLoader for batching
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=8, shuffle=False)
# Set up learning rate scheduler (optional)
from torch.optim.lr_scheduler import StepLR
scheduler = StepLR(optimizer, step_size=1, gamma=0.9)
best_val_loss = float('inf')
# Training loop
epochs = 20
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch in train_dataloader:
        # Move inputs and labels to GPU if available
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        # Zero gradients
        optimizer.zero_grad()
        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        # Backward pass and optimization
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    model.eval()
    val_loss = 0
    for batch in val_dataloader:
        with torch.no_grad():
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            val_loss += loss.item()

    # Print the average loss for this epoch
    print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss / len(train_dataloader)}, Val Loss: {val_loss / len(val_dataloader)}")

    # Save the model with the best validation loss
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        model.save_pretrained('./best_model')
        tokenizer.save_pretrained('./best_model')

    # Step the learning rate scheduler
    scheduler.step()
# Save the final model after the last epoch
model.save_pretrained('./final_model')
tokenizer.save_pretrained('./final_model')
We trained the model for 20 epochs using the AdamW optimizer with an initial learning rate of 1e-5 and a step learning rate scheduler. The model with the best validation loss is saved during training, along with the final model. The training and validation losses per epoch are shown below:
The fine-tuned model can conjugate simple sentences whose structure resembles the training data. The following example conjugates the sentence "They swim in the pool" into the present, past, and future tenses; the Spanish outputs are cross-checked by translating them back to English with Google Translate.
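Inference follows the same pattern as the translate() function above, with the VTC token appended to the input. A minimal sketch, assuming the ./best_model checkpoint saved during training:
from transformers import MarianMTModel, MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained('./best_model')
model = MarianMTModel.from_pretrained('./best_model').to(device)
model.eval()

def conjugate(text, tense):
    # Append the VTC token so the model knows which tense to produce
    inputs = tokenizer(text + " [" + tense + "]", return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(**inputs)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

for tense in ["PAST", "PRESENT", "FUTURE"]:
    print(tense + ":", conjugate("They swim in the pool.", tense))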
Original sentence: They swim in the pool.
Conjugation Model:
Spanish translated past sentence: Nadaron en la piscina.
Spanish translated present sentence: Nadan en la piscina.
Spanish translated future sentence: Nadarán en la piscina.
Google Translate:
English translated past sentence: They swam in the pool.
English translated present sentence: They swim in the pool.
English translated future sentence: They will swim in the pool.
Surprisingly, the model can also handle sentences with multiple verbs.
Original sentence: They swim at the pool and eat at the restaurant.
Conjugation Model:
Spanish translated past sentence: Nadaron en la piscina y comieron en el restaurante.
Spanish translated present sentence: Nadan en la piscina y comen en el restaurante.
Spanish translated future sentence: Nadarán en la piscina y comerán en el restaurante.
Google Translate:
English translated past sentence: They swam in the pool and ate in the restaurant.
English translated present sentence: They swim in the pool and eat in the restaurant.
English translated future sentence: They will swim in the pool and eat in the restaurant.
This demonstrates that the model relies on the pretrained machine translation system to produce semantically accurate sentences while successfully learning that the newly introduced VTC tokens ([PRESENT], [PAST], and [FUTURE]) map to the correct verb conjugations within each sentence!