Further Applications with Context Vectors


Context vectors are powerful representations generated by transformer models that capture the meaning of words in their specific contexts. In our previous tutorials, we explored how to generate these vectors and some basic applications. Now, we’ll focus on building practical applications that leverage context vectors to solve real-world problems.

In this tutorial, we’ll implement several applications to demonstrate the power and versatility of context vectors. We’ll use the Hugging Face transformers library to extract context vectors from pre-trained models and build applications around them. Specifically, you will learn:

  • Building a semantic search engine with context vectors
  • Creating a document clustering and topic modeling application
  • Developing a document classification system

Let’s get started.

Photo by Matheus Bertelli. Some rights reserved.

Overview

This post is divided into three parts; they are:

  • Building a Semantic Search Engine
  • Document Clustering
  • Document Classification

Building a Semantic Search Engine

If you want to find a specific document within a collection, you might use a simple keyword search. However, this approach is limited by the precision of keyword matching. You might not remember the exact wording used in the document, only what it was about. In such cases, semantic search is more effective.

Semantic search allows you to search by meaning rather than by keywords. Each document is represented by a context vector that captures its meaning, and the query is also represented as a context vector. The search engine then finds the documents most similar to the query, using a similarity measure such as L2 distance or cosine similarity.

Since you’ve already learned how to generate context vectors using a transformer model, let’s implement a simple semantic search engine:
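The following is a minimal sketch of such an engine, assuming bert-base-uncased as the pre-trained model and a small set of illustrative documents; the helper names get_context_vector() and semantic_search() follow the description below:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load a pre-trained model and tokenizer (bert-base-uncased is an assumed choice)
MODEL_NAME = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def get_context_vector(texts):
    """Convert a string or a list of strings into mean-pooled context vectors."""
    if isinstance(texts, str):
        texts = [texts]
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # last_hidden_state has shape (batch size, sequence length, hidden size)
    hidden = outputs.last_hidden_state
    # Use the attention mask to average only over valid (non-padding) tokens
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (hidden * mask).sum(dim=1)
    counts = mask.sum(dim=1)
    return summed / counts  # shape (batch size, hidden size)

def semantic_search(query, documents, top_k=3):
    """Return the top-k documents most similar to the query by cosine similarity."""
    doc_vectors = get_context_vector(documents)
    query_vector = get_context_vector(query)
    scores = torch.nn.functional.cosine_similarity(query_vector, doc_vectors, dim=1)
    ranked = torch.argsort(scores, descending=True)[:top_k]
    return [(documents[i], float(scores[i])) for i in ranked]

# Illustrative documents and query (not the original post's data)
documents = [
    "Transformers use self-attention to capture long-range dependencies.",
    "K-means is a classic unsupervised clustering algorithm.",
    "The stock market rallied after the quarterly earnings report.",
]
query = "How do attention-based language models work?"
for doc, score in semantic_search(query, documents, top_k=2):
    print(f"{score:.3f}  {doc}")
```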

In this example, the context vector is created using the get_context_vector() function. You pass in the text as a string or a list of strings, and the tokenizer and model produce a tensor of shape (batch size, sequence length, hidden size). Not every token in the sequence is valid (padding tokens are added so that all sequences in a batch have the same length), so you use the attention mask produced by the tokenizer to identify the valid tokens.

Each input string’s context vector is computed as the mean of all valid token embeddings. Note that other methods to create context vectors are possible, such as using the [CLS] token or different pooling strategies.
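For example, a [CLS]-based variant would simply take the hidden state of the first token instead of averaging; a sketch, assuming the same outputs object as in the code above:

```python
# Alternative pooling: use the hidden state of the first ([CLS]) token
cls_vector = outputs.last_hidden_state[:, 0, :]  # shape (batch size, hidden size)
```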

In this example, you begin with a collection of documents and a query string. You generate context vectors for both, and in semantic_search(), compare the query vector with all document vectors using cosine similarity to find the top-k most similar documents.

The output of the above code is:

You can see that the semantic search engine understands the meaning behind queries, rather than just matching keywords. However, the quality of results depends on how well the context vectors represent the documents and queries, as well as the similarity metric used.

Document Clustering

Document clustering groups similar documents together. It is useful when organizing a large collection of documents. While you could classify documents manually, that approach is time-consuming. Clustering is an automatic, unsupervised process—you don’t need to provide any labels. The algorithm groups documents into clusters based on their similarity.

With context vectors for each document, you can use any standard clustering algorithm. Below, we use K-means clustering:
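A minimal sketch, assuming the get_context_vector() helper defined earlier and a small illustrative corpus (not the original post's documents), might look like this:

```python
from sklearn.cluster import KMeans

# Illustrative corpus of short machine-learning-related documents
corpus = [
    "Neural networks learn hierarchical feature representations.",
    "Gradient descent iteratively minimizes a loss function.",
    "Decision trees split data on feature thresholds.",
    "Random forests combine many decision trees into an ensemble.",
    "Transformers rely on self-attention mechanisms.",
    "BERT is a pre-trained transformer encoder.",
]

# Reuse get_context_vector() from the semantic search example
doc_vectors = get_context_vector(corpus).numpy()

# Group the documents into 3 clusters
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(doc_vectors)

for doc, label in zip(corpus, labels):
    print(f"Cluster {label}: {doc}")
```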

In this example, the same get_context_vector() function is used to generate context vectors for a corpus of documents. Each document is transformed into a fixed-size context vector. Then, the K-means clustering algorithm groups the documents. The number of clusters is set to 3, but you can experiment with other values to see what makes the most sense.

The output of the above code is:

The quality of clustering depends on the context vectors and the clustering algorithm. To evaluate the results, you can visualize the clusters in 2D using Principal Component Analysis (PCA). PCA reduces the vectors to their first two principal components, which can be plotted in a scatter plot:
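A sketch of that visualization, reusing the doc_vectors, labels, and corpus from the clustering example above and assuming matplotlib is available:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the context vectors onto their first two principal components
pca = PCA(n_components=2)
points = pca.fit_transform(doc_vectors)

# Scatter plot colored by K-means cluster assignment
plt.figure(figsize=(6, 5))
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="viridis")
for (x, y), doc in zip(points, corpus):
    plt.annotate(doc[:20] + "...", (x, y), fontsize=8)
plt.title("Document clusters in PCA space")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.show()
```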

If you don’t see clear clusters—as in this case—it suggests the clustering isn’t ideal. You may need to adjust how you generate context vectors. However, the issue might also be that all the documents are related to machine learning, so forcing them into three distinct clusters may not be meaningful.

In general, document clustering helps automatically discover topics in a collection. For good results, you need a moderately large and diverse corpus with clear topic distinctions.

Document Classification

If you happen to have labels for the documents, you can use them to train a classifier. This goes one step beyond clustering. With labels, you control how documents are grouped.

You may need more data to train a reliable classifier. Below, we’ll use a logistic regression classifier to categorize documents.
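A minimal sketch, assuming the get_context_vector() helper from earlier and a small illustrative labeled dataset (the original post's data is not shown), might look like this:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Illustrative labeled documents, three per category
texts = [
    "The team won the championship after a dramatic final.",
    "The striker scored twice in the second half.",
    "The coach praised the defense after the shutout win.",
    "The central bank raised interest rates again.",
    "Stock prices fell sharply after the announcement.",
    "Investors are worried about rising inflation.",
    "The new phone features a faster processor.",
    "The software update fixes several security bugs.",
    "The startup released an open-source machine learning library.",
]
labels = ["sports"] * 3 + ["finance"] * 3 + ["tech"] * 3

# Convert each document to a fixed-size context vector
X = get_context_vector(texts).numpy()

# Hold out a test set, then train the classifier
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.33, random_state=42, stratify=labels
)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
```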

The context vectors are generated the same way as in the previous example. Instead of clustering or manually comparing similarities, you provide a list of labels (one per document) to a logistic regression classifier. Using the implementation from scikit-learn, we train the model on the training set and evaluate it on the test set.

The classification_report() function from scikit-learn provides metrics like precision, recall, F1 score, and accuracy. The result looks like this:

To use the trained classifier, follow the same workflow: use the get_context_vector() function to convert new text into context vectors, then pass them to the classifier to predict categories, as sketched below.
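A sketch of that prediction step, reusing the clf and get_context_vector() defined above with a couple of made-up inputs:

```python
# Hypothetical new documents to classify
new_texts = [
    "The goalkeeper made an incredible save in extra time.",
    "Quarterly revenue beat analyst expectations.",
]
new_vectors = get_context_vector(new_texts).numpy()
for text, prediction in zip(new_texts, clf.predict(new_vectors)):
    print(f"{prediction}: {text}")
```

Running it prints one predicted category per input sentence.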

Note that the classifier is trained on context vectors, which ideally capture the meaning of the text rather than just surface keywords. As a result, it should more effectively generalize to new inputs, even those with unseen keywords.

Summary

In this post, you’ve explored how to build practical applications using context vectors generated by transformer models. Specifically, you’ve implemented:

  • A semantic search engine to find documents most similar to a query
  • A document clustering application to group documents into meaningful categories
  • A document classification system to categorize documents into predefined categories

These applications highlight the power and versatility of context vectors for understanding and processing text. By leveraging the semantic capabilities of transformer models, you can build sophisticated NLP systems that go beyond simple keyword matching or rule-based methods.
