Document datasets already have structure. Take advantage of it.
There are layered challenges in building retrieval-augmented generation (RAG) applications. Document retrieval, a huge part of the RAG workflow, is itself a complex set of steps that can be approached in different ways depending on the use case.
It is difficult for RAG systems to find the best set of documents relevant to a nuanced input prompt, especially when relying entirely on vector search to find the best candidates. Yet our documents themselves often tell us where to look for more information on a given topic, via citations, cross-references, footnotes, hyperlinks, and so on. In this article, we'll show how a new data model, linked documents, unlocks performance improvements by letting us parse and preserve these direct references to other texts and retrieve the targets alongside the initial results, even when vector search overlooks them.
When answering complex or nuanced questions requiring supporting details from disparate documents, RAG systems often struggle to locate all of the relevant documents needed for a well-informed and complete response. Yet we keep relying almost exclusively on text embeddings and vector similarity to locate and retrieve relevant documents.
One often-understated fact: a lot of document information is lost during the process of parsing, chunking, and embedding text. Document structure, including section hierarchy, headings, footnotes, cross-references, citations, and hyperlinks, is almost entirely lost in a typical text-to-vector workflow unless we take specific action to preserve it. When the structure and metadata are telling us what other documents are directly related to what we are reading, why shouldn't we preserve this information?
In particular, links and references are ignored in a typical chunking and embedding process, which means they can't be used by the AI to help answer queries. But links and references are valuable pieces of information that often point to more useful documents and text; why wouldn't we want to check those target documents at query time, in case they're useful?
Parsing and following links and references programmatically is not difficult, and in this article we present a simple yet powerful implementation designed for RAG systems. We show how to use document linking to preserve known connections between document chunks, connections which typical vector embedding and retrieval might fail to make.
Documents in a vector store are pieces of knowledge embedded into a high-dimensional vector space. These vectors are essentially the internal "language" of LLMs: given an LLM and all of its internal parameter values, including previous context and state, a vector is the starting point from which a model generates text. So, all of the vectors in a vector store are embedded documents that an LLM might use to generate a response, and, similarly, we embed prompts into vectors that we then use to search for nearest neighbors in semantic vector space. These nearest neighbors correspond to documents that are likely to contain information that can address the prompt.
In a vector store, the closeness of vectors indicates document similarity in a semantic sense, but there is no real concept of connectedness beyond similarity. However, documents that are close to each other (and typically retrieved together) can be viewed as connected pieces of knowledge, forming an implicit knowledge graph in which each chunk of text is connected to its nearest neighbors. A graph built in this sense would not be static or rigid like most knowledge graphs; it would change as new documents are added or search parameters are adjusted. The comparison is not perfect, but this implicit graph is a useful conceptual framework for thinking about how document retrieval works within RAG systems.
In terms of real-world knowledge, in contrast to vector representations, semantic similarity is just one of many ways that pieces of text may be related. We have been connecting knowledge for centuries, since long before computers and digital representations of data: glossaries, indexes, catalogs, tables of contents, dictionaries, and cross-references are all ways to connect pieces of knowledge with each other. Implementing these in software is quite simple, but they typically haven't been included in vector stores, RAG systems, and other gen AI applications. Our documents are telling us what other knowledge is important and relevant; we just need to give our knowledge stores the capability to understand and follow the connections.
We've developed document linking for cases in which our documents are telling us what other knowledge is relevant, but our vector store isn't capturing that and the document retrieval process is falling short. Document linking is a straightforward yet potent method for representing directed connections between documents. It encapsulates all the traditional ways we navigate and discover knowledge, whether through a table of contents, a glossary, keywords, or, easiest of all for a programmatic parser to follow, hyperlinks. Links can represent relationships that are asymmetric or tagged with qualitative metadata for filtering and other purposes. They are not only easy to conceptualize and work with, but also scale efficiently to large, dynamic datasets, supporting robust retrieval.
As a data type, document links are quite simple. Link information is stored alongside document vectors as metadata, which means that retrieving a document automatically retrieves information about the links that lead to and from it. Outbound links point to more information that's likely to be useful in the context of the document, inbound links show which other documents may be supported by the given document, and bi-directional (or undirected) links can represent other types of connections. Links can also be tagged with further metadata that provides qualitative information for link or document filtering, ranking, and graph traversal algorithms.
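For illustration, here is roughly what attaching links to a document looks like, assuming the `Link` helper from the same `langchain_core.graph_vectorstores.links` module we use later in this article (a minimal sketch; the `kind` and `tag` values are illustrative):

from langchain_core.documents import Document
from langchain_core.graph_vectorstores.links import Link, add_links

doc = Document(page_content="The Space Needle is an observation tower in Seattle...")

# An outbound link: this document points toward another document.
add_links(doc, Link.outgoing(kind="hyperlink", tag="https://en.wikipedia.org/wiki/Lower_Queen_Anne,_Seattle"))

# A bi-directional link: an undirected connection, here tagged with a keyword.
add_links(doc, Link.bidir(kind="kw", tag="seattle"))

The links ride along in the document's metadata, so they are retrieved together with the document itself.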
As described in more detail in the article “Scaling Knowledge Graphs by Eliminating Edges,” rather than storing every link individually, as in typical graph database implementations, our efficient and scalable implementation uses link types and link groups as intermediate data types that greatly reduce storage and compute needs during graph traversal. This implementation has a big advantage when, for example, two groups of documents are closely related.
Let's say that we have a group of documents on the topic of the City of Seattle (call it Group A) and another group of documents that mention Seattle (Group B). We would like to make sure that documents mentioning Seattle can find all of the documents about the City of Seattle, so we would like to link them. We could create a link from every document in Group B to every document in Group A, but unless the two groups are small, this is a lot of edges! The way we handle this is to create one link type object representing the keyword "Seattle" (`kw:seattle`), then create directed links from the documents in Group B to this `kw:seattle` object, as well as links from the `kw:seattle` object to the documents in Group A. This results in far fewer links to store with each document (just one each), and no information is lost.
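Here is roughly how that keyword hub could be expressed with the same `Link` helpers: an outgoing link and an incoming link that share a kind and tag are connected through that shared tag, so the `kw:seattle` object never needs to be stored as a separate document. A minimal sketch, assuming `group_a_docs` and `group_b_docs` are lists of `Document` objects:

from langchain_core.graph_vectorstores.links import Link, add_links

# Group B: documents that merely mention Seattle link out to the keyword hub.
for doc in group_b_docs:
    add_links(doc, Link.outgoing(kind="kw", tag="seattle"))

# Group A: documents about the City of Seattle are reachable from the hub.
for doc in group_a_docs:
    add_links(doc, Link.incoming(kind="kw", tag="seattle"))

Traversal from any Group B document can now reach every Group A document through the shared tag, while each document stores only a single link.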
The main goal of the retrieval process in RAG systems is to find a set of documents that is sufficient to answer a given query. Standard vector search and retrieval finds documents that are most “relevant” to the query in a semantic sense, but might miss some supporting documents if their overall content doesn’t closely match the content of the query.
For example, let’s say we have a large document set that includes the documents related to Seattle as described above. We have the following prompt about the Space Needle, a prominent landmark in Seattle:
“What is close to the Space Needle?”
A vector search starting with this prompt would retrieve documents mentioning the Space Needle directly, because that is the most prominent feature of the prompt text from a semantic content perspective. Documents mentioning the Space Needle are likely to mention its location in Seattle as well. Without using any document linking, a RAG system would have to try to answer the prompt using mainly documents mentioning the Space Needle, without any guarantee that other helpful documents that don’t mention the Space Needle directly would also be retrieved and used.
Below, we construct a practical example (with code!) based on this Space Needle dataset and query. Keep reading to understand how a RAG system might miss helpful documents when links are not used, and then “find” helpful documents again by simply following link information contained within the original documents themselves.
In order to illustrate how document linking works, and how it can make connections between documents and knowledge that might be missed otherwise, let’s look at a simple example.
We'll start with two related documents containing some text from Wikipedia pages: one document from the page for the Space Needle, and one for the neighborhood where the Space Needle is located, Lower Queen Anne. The Space Needle document has an HTML link to the Lower Queen Anne document, but not the other way around. The document on the Space Needle begins as follows:
'url': 'https://en.wikipedia.org/wiki/Space_Needle'

The Space Needle is an observation tower in Seattle, Washington,
United States. Considered to be an icon of the city, it has been
designated a Seattle landmark. Located in the Lower Queen Anne
neighborhood, it was built in the Seattle Center for the 1962
World's Fair, which drew over 2.3 million visitors...
In addition to these two documents derived from real, informative sources, we have also added four very short, uninformative documents: two that mention the Space Needle and two that don't. These documents (and their fake URLs) are designed to mimic irrelevant or uninformative content, such as social media posts that merely comment on the Space Needle and Seattle, for example:
“The Space Needle is TALL.”
and
“Queen Anne was a person.”
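To make the toy dataset concrete, the four short posts might be defined roughly like this (a sketch; the wording of the two posts not quoted above is illustrative):

# Sketch of the four short, uninformative documents; the fake URLs
# double as document IDs in the retrieval output shown later.
short_docs = {
    "https://TheSpaceNeedleisGreat": "<html><body>The Space Needle is great!</body></html>",
    "https://TheSpaceNeedleisTALL": "<html><body>The Space Needle is TALL.</body></html>",
    "https://SeattleIsOutWest": "<html><body>Seattle is out west.</body></html>",
    "https://QueenAnneWasAPerson": "<html><body>Queen Anne was a person.</body></html>",
}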
The full document set is included in the Colab notebook. They are HTML documents that we then process using BeautifulSoup4 as well as the `HtmlLinkExtractor` from LangChain, adding those links back to the `Document` objects with the `add_links` function, specifically so we can make use of them in the `GraphVectorStore`, a relatively new addition to the LangChain codebase, contributed by my colleagues at DataStax. All of this is open-source.
Each document is processed as follows:
from bs4 import BeautifulSoup
from langchain_core.documents import Document
from langchain_core.graph_vectorstores.links import add_links
from langchain_community.graph_vectorstores.extractors.html_link_extractor import HtmlInput, HtmlLinkExtractor

# Parse the raw HTML and build a Document from its plain text.
soup_doc = BeautifulSoup(html_doc, 'html.parser')
doc = Document(
    page_content=soup_doc.get_text(),
    metadata={"source": url}
)
doc.metadata['content_id'] = url  # the ID for Links to point to this document

# Extract hyperlinks from the HTML and attach them to the Document as links.
html_link_extractor = HtmlLinkExtractor()
add_links(doc, html_link_extractor.extract_one(HtmlInput(soup_doc, url)))
Using `cassio`, we initialize the `GraphVectorStore` as below:
from langchain_openai import OpenAIEmbeddings
from langchain_community.graph_vectorstores.cassandra import CassandraGraphVectorStore

# Create a GraphVectorStore, combining Vector nodes and Graph edges.
EMBEDDING = 'text-embedding-3-small'
gvstore = CassandraGraphVectorStore(OpenAIEmbeddings(model=EMBEDDING))
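For completeness, initializing `cassio` (which must happen before the store is created) and loading the processed documents look roughly like this. This is a sketch: the environment variable names are placeholders for your own Astra credentials, and `docs` stands for the list of processed `Document` objects from above.

import os
import cassio

# Connect cassio to Astra DB (credential variable names are placeholders).
cassio.init(
    database_id=os.environ["ASTRA_DB_ID"],
    token=os.environ["ASTRA_DB_APPLICATION_TOKEN"],
)

# Insert the documents: embeddings and link metadata are stored together.
gvstore.add_documents(docs)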
We set up the LLM and other helpers for the RAG chain in the standard way — see the notebook for details. Note that, while almost everything used here is open-source, in the notebook we are using two SaaS products, OpenAI and DataStax’s Astra — LLM and vector data store, respectively — both of which have free usage tiers. See the LangChain documentation for alternatives.
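If you don't want to open the notebook, a standard LCEL chain along these lines would work. This is a sketch of one common way to wire it up, not the notebook's exact setup; the prompt wording and model name (`gpt-4o-mini`) are assumptions, and `retriever` is configured in the snippets that follow.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini")  # model choice is an assumption

def format_docs(docs):
    # Concatenate retrieved documents into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Each retriever configuration below can then be swapped in and the chain invoked with the question.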
We can run the RAG system end-to-end using a graph retriever with `depth=0` (no graph traversal at all) and other default parameters, as below:
retriever = gvstore.as_retriever(
search_kwargs={
"depth": 0, # depth of graph traversal; 0 is no traversal at all
}
)
This gives an output such as:
Question:
What is close to the Space Needle?

Retrieved documents:
['https://TheSpaceNeedleisGreat',
'https://TheSpaceNeedleisTALL',
'https://en.wikipedia.org/wiki/Space_Needle',
'https://SeattleIsOutWest',
'https://en.wikipedia.org/wiki/Lower_Queen_Anne,_Seattle',
'https://QueenAnneWasAPerson']
LLM response:
('The Space Needle is close to several locations in the Lower Queen Anne '
'neighborhood, including Climate Pledge Arena, the Exhibition Hall, McCaw '
'Hall, Cornish Playhouse, Bagley Wright Theater, the studios for KEXP radio, '
'SIFF Cinema Uptown, and On the Boards.')
Of course, in realistic scenarios, a RAG system would not retrieve the full document set as we are doing here.
Retrieving all documents for each query is impractical or even impossible in some cases. It also defeats the purpose of using vector search in the first place. For all realistic scenarios, only a small fraction of documents can be retrieved for each query, which is why it is so important to get the most relevant and helpful documents near the top of the list.
To make things a little more realistic for our example with our tiny dataset, let's change the settings of the retriever so that `k=3`, meaning that a maximum of three documents are returned by each vector search. This means that three of the six total documents (the least similar or relevant according to vector similarity) will be left out of the returned document set. We can change the settings of the retriever like this:
retriever = gvstore.as_retriever(
search_kwargs={
"depth": 0, # depth of graph traversal; 0 is no traversal at all
"k": 3 # number of docs returned by initial vector search---not including graph Links
}
)
Querying the system with these settings gives the output:
Question:
What is close to the Space Needle?

Retrieved documents:
['https://TheSpaceNeedleisGreat',
'https://TheSpaceNeedleisTALL',
'https://en.wikipedia.org/wiki/Space_Needle']
LLM response:
('The context does not provide specific information about what is close to the '
'Space Needle. It only mentions that it is located in the Lower Queen Anne '
"neighborhood and built for the Seattle Center for the 1962 World's Fair.")
We can see that this final response is much less informative than the previous one, now that we have access to only half of the document set, instead of having all six documents available for response generation.
There are some important points to note here.
- One document that was left out was the document on Lower Queen Anne, which is the only document that describes some significant places in the neighborhood where the Space Needle is located.
- The Lower Queen Anne document does not specifically mention the Space Needle, whereas three other documents do. So it makes sense that the initial query “What is close to the Space Needle?” returns those three.
- The main document about the Space Needle has an HTML link directly to Lower Queen Anne, and any curious human would probably click on that link to learn about the area.
- Without any sense of linking or graph traversal, this RAG system retrieves the most semantically similar documents — including two uninformative ones — and misses the one article that has the most information for answering the query.
Now, let’s look at how document linking affects results.
A simple change to our retriever setup, setting `depth=1`, enables the retriever to follow any document links from the documents that are initially retrieved by vector search. (For reference, note that setting `depth=2` would not only follow links in the initial document set, but would also follow the next set of links in the resulting document set; we won't go that far yet.)
We change the retriever `depth` parameter like this:
retriever = gvstore.as_retriever(
search_kwargs={
"depth": 1, # depth of graph traversal; 0 is no traversal at all
"k": 3 # number of docs returned by initial vector search---not including graph Links
}
)
which gives the following output:
Question:
What is close to the Space Needle?

Retrieved documents:
['https://TheSpaceNeedleisGreat',
'https://TheSpaceNeedleisTALL',
'https://en.wikipedia.org/wiki/Space_Needle',
'https://en.wikipedia.org/wiki/Lower_Queen_Anne,_Seattle']
LLM response:
('The Space Needle is located in the Lower Queen Anne neighborhood, which '
'includes Climate Pledge Arena, Exhibition Hall, McCaw Hall, Cornish '
'Playhouse, Bagley Wright Theater, the studios for KEXP radio, a three-screen '
'movie theater (SIFF Cinema Uptown), and On the Boards, a center for '
'avant-garde theater and music.')
We can see that the first `k` documents retrieved by vector search are the same three as before, but setting `depth=1` instructed the system to follow links from those three documents and include the linked documents as well. The direct link from the Space Needle document to Lower Queen Anne pulled in that document, giving the LLM access to the neighborhood information it needed to answer the query properly.
This hybrid approach of vector and graph retrieval can significantly enhance the context relevance and diversity of results in RAG applications. It can lead to fewer hallucinations and higher-quality outcomes by ensuring that the system retrieves the most contextually appropriate and diverse content.
Beyond improving the quality of responses of RAG systems, document linking has some advantages for implementation in a production system. Some beneficial properties include:
- Lossless — The original content remains intact within the nodes, ensuring that no information is discarded during the graph creation process. This preserves the integrity of the data, reducing the need for frequent re-indexing as needs evolve and leveraging the LLM’s strength in extracting answers from contextual clues.
- Hands-off — This method does not require expert intervention to refine knowledge extraction. Instead, adding some edge extraction capabilities based on keywords, hyperlinks, or other document properties to the existing vector-search pipeline allows for the automatic addition of links (see the sketch after this list).
- Scalable — The graph creation process involves straightforward operations on the content without necessitating the use of an LLM to generate the knowledge graph.
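As an example of that hands-off extraction, here is a minimal sketch of a keyword-based link step built from the same `Link` helpers used above; the keyword list and helper function are hypothetical, not part of the library:

from langchain_core.graph_vectorstores.links import Link, add_links

KEYWORDS = {"seattle", "space needle"}  # hypothetical keyword list

def add_keyword_links(doc):
    # Attach a bi-directional keyword link for each keyword found in the text.
    text = doc.page_content.lower()
    for kw in KEYWORDS:
        if kw in text:
            add_links(doc, Link.bidir(kind="kw", tag=kw))

Running a step like this over every document during ingestion links all documents that share a keyword, with no LLM involved and only one stored link per keyword per document.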
Performance benchmarks and a more detailed analysis of scaling document linking are included in the article mentioned earlier.
As always, there are some limitations. If your document set truly doesn't have links or other structure, the strategies presented here won't accomplish much. And while building and traversing graph connections can be powerful, it adds complexity to the retrieval process that might be challenging to debug and optimize, especially when traversing the graph to depths of 2 or greater.
Overall, incorporating document linking into RAG systems combines the strengths of traditional, deterministic software methodologies, graph algorithms, and modern AI techniques. By explicitly defining links between documents, we enhance the AI’s ability to navigate knowledge as a human researcher might, improving not only retrieval accuracy but also the contextual depth of responses. This approach creates more robust, capable systems that align with the complex ways humans seek and use knowledge.
Complete code from this article can be found in this Colab notebook. And check out this introductory blog post by my colleague at DataStax, or see the documentation for `GraphVectorStore` in LangChain for detailed API information and how to use document linking to enhance your RAG applications and push the boundaries of what your knowledge systems can achieve.