As the LLM revolution begins to take shape, the hype has given way to commercial development. As the initial wave of excitement dies down, generative AI is no longer seen as an omniscient black box, but more as a constituent, if extremely powerful, tool in an engineer’s arsenal. As a result, entrepreneurs and technologists now have an increasingly mature set of tools and techniques with which to develop LLM applications.
One of the most interesting use cases for LLMs has been in the field of knowledge management. Specialized LLMs, based either on OpenAI’s GPT technology or open-source models like LLaMa 2 and Flan-T5, are being used in clever ways to manage large quantities of data. Where previously organizations with large text datasets had to rely on text search techniques like fuzzy matching or full-text indexing, they now have access to a powerful system that can not only find the information but summarize it in a time-efficient and reader-friendly fashion.
Within this use case, retrieval-augmented generation, or RAG, has emerged as a standout architecture with enormous flexibility and performance. With this architecture, organizations can quickly index a body of work, perform semantic queries on it, and generate informative and cogent answers to user-defined queries based on the corpus. Several companies and services have sprung up to support implementations of the RAG architecture, highlighting its staying power.
As effective as RAG can be, this architecture also has several real limitations. In this article, we will explore the RAG architecture, identify its limitations, and propose an improved architecture to solve these limitations.
As with all my other articles, I am looking to connect with other technologists and AI enthusiasts. If you have thoughts on how this architecture can be improved, or have ideas about AI that you would like to discuss, please do not hesitate to reach out! You can find me on GitHub or LinkedIn; the links are in my profile as well as at the bottom of this article.
Content Overview
- Retrieval Augmented Generation (RAG) Architecture
- Limitations of the RAG Architecture
- Proposing QE-RAG, or Question-Enhanced RAG
- Conclusion
Retrieval Augmented Generation (RAG) Architecture
With names like RAG, Flan, and LLaMa, the AI community is likely not to win awards for futuristic and stylish names anytime soon. However, the RAG architecture certainly deserves an award for its combination of two extremely powerful techniques made available by the development of LLMs - contextual document embedding and prompt engineering.
At its simplest, the RAG architecture is a system that uses embedding vector search to find the part(s) of the corpus most relevant to a question, inserts the part(s) into a prompt, and then uses prompt engineering to ensure that the answer is based on the excerpts given in the prompt. If this all sounds a little confusing, please read on because I will explain each component in turn. I will also include example code so you can follow along.
The Embedding Model
First and foremost, an effective RAG system requires a powerful embedding model. The embedding model transforms a natural text document into a series of numbers, or a “vector”, that roughly represents the semantic content of the document. Assuming the embedding model is a good one, you will be able to compare the semantic values of two different documents and determine if the two documents are semantically similar using vector arithmetic.
To see this in action, paste the following code into a Python file and run it:
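# Note: this code targets the pre-1.0 openai Python package; openai.Embedding and
# openai.embeddings_utils were removed in version 1.0.
import openai
from openai.embeddings_utils import cosine_similarity
openai.api_key = "YOUR_API_KEY"  # replace with your own OpenAI API key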
EMBEDDING_MODEL = "text-embedding-ada-002" 
def get_cos_sim(input_1, input_2):
    embeds = openai.Embedding.create(model=EMBEDDING_MODEL, input=[input_1, input_2])
    return cosine_similarity(embeds['data'][0]['embedding'], embeds['data'][1]['embedding'])
print(get_cos_sim('Driving a car', 'William Shakespeare'))
print(get_cos_sim('Driving a car', 'Riding a horse'))
The above code generates the embeddings for the phrases “Driving a car”, “William Shakespeare”, and “Riding a horse” before comparing them with each other using the cosine similarity algorithm. We would expect the cosine similarity to be higher when the phrases are similar semantically, so “Driving a car” and “Riding a horse” should be much closer, whereas “Driving a car” and “William Shakespeare” should be dissimilar.
You should see that, according to OpenAI’s embedding model, ada-002, the phrase “driving a car” is 88% similar to the phrase “riding a horse” and 76% similar to the phrase “William Shakespeare”. This means the embedding model is performing as we expect. This determination of semantic similarity is the foundation of the RAG system.
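To make the comparison less of a black box, here is a minimal sketch of what a cosine similarity function computes, written in plain Python. The function name `cosine_similarity_plain` and the toy three-dimensional vectors are illustrative assumptions; real ada-002 embeddings have 1,536 dimensions, and the library helper used above may be implemented differently.

```python
import math

def cosine_similarity_plain(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors for illustration only; these are not real embeddings.
same_direction = cosine_similarity_plain([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity_plain([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, while unrelated (orthogonal) vectors score close to 0.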
The cosine similarity idea is remarkably robust when you extend it to comparisons of much larger documents. For example, take the powerful monologue from Shakespeare’s Macbeth, “Tomorrow, and tomorrow, and tomorrow”:
monologue = '''Tomorrow, and tomorrow, and tomorrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
And all our yesterdays have lighted fools
The way to dusty death. Out, out, brief candle!
Life's but a walking shadow, a poor player,
That struts and frets his hour upon the stage,
And then is heard no more. It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing.'''
print(get_cos_sim(monologue, 'Riding a car'))
print(get_cos_sim(monologue, 'The contemplation of mortality'))
You should see that the monologue is only 75% similar to the idea of “riding a car” and 82% similar to the idea of “The contemplation of mortality”.
But we don’t just have to compare monologues with ideas; we can also compare them with questions. For example:
get_cos_sim('''Tomorrow, and tomorrow, and tomorrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
And all our yesterdays have lighted fools
The way to dusty death. Out, out, brief candle!
Life's but a walking shadow, a poor player,
That struts and frets his hour upon the stage,
And then is heard no more. It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing.''', 'Which Shakespearean monologue contemplates mortality?')
get_cos_sim('''Full of vexation come I, with complaint
Against my child, my daughter Hermia.
Stand forth, Demetrius. My noble lord,
This man hath my consent to marry her.
Stand forth, Lysander. And my gracious Duke,
This man hath bewitch’d the bosom of my child.
Thou, thou, Lysander, thou hast given her rhymes,
And interchanged love-tokens with my child:
Thou hast by moonlight at her window sung
With feigning voice verses of feigning love,
And stol’n the impression of her fantasy
With bracelets of thy hair, rings, gauds, conceits,
Knacks, trifles, nosegays, sweetmeats (messengers
Of strong prevailment in unharden’d youth):
With cunning hast thou filch’d my daughter’s heart,
Turn’d her obedience, which is due to me,
To stubborn harshness. And, my gracious Duke,
Be it so she will not here, before your Grace,
Consent to marry with Demetrius,
I beg the ancient privilege of Athens:
As she is mine, I may dispose of her;
Which shall be either to this gentleman,
Or to her death, according to our law
Immediately provided in that case.''', 'Which Shakespearean monologue contemplates mortality?')
You should see that the embedding shows the Macbeth monologue is much closer, contextually, to the question “Which Shakespearean monologue contemplates mortality?” than the Egeus monologue, which does mention death but does not grapple directly with the concept of mortality.
The Vector Lookup
Now that we have the embedding, how do we use it in our RAG system? Well, suppose we wanted to give our RAG system the knowledge of all Shakespeare monologues so that it can answer questions about Shakespeare. In this case, we would download all of Shakespeare’s monologues, and generate the embeddings for them. If you are following along, you can generate the embedding like this:
embedding = openai.Embedding.create(model=EMBEDDING_MODEL, input=[monologue])['data'][0]['embedding']
Once we have the embeddings, we will want to store them in a way that allows us to query and compare them with a new embedding. Normally we would put them into what’s called a Vector Database, which is a specialized data store that allows for fast comparisons of two vectors. However, unless your corpus is extremely large, brute-force comparisons are surprisingly tolerable for most non-production, experimental use cases where performance is not critical.
Whether or not you choose to use a database, you will want to build a system that can find the item(s) in your corpus that best fit the question. In our example, we will want to have the ability to find the monologue that is the most relevant to the user question at hand. You might want to do something like:
monologues_embeddings = [
    ['Tomorrow, and tomorrow, and tomorrow...', [...]], # text in the left position, embedding in the right position
    ['Full of vexation come I...', [...]],
    # ... more monologues and their embeddings as you see fit.
]
def lookup_most_relevant(question):
    embed = openai.Embedding.create(model=EMBEDDING_MODEL, input=[question])['data'][0]['embedding']
    top_monologue = sorted(monologues_embeddings, key=lambda x: cosine_similarity(embed, x[1]), reverse=True)[0]
    return top_monologue
lookup_most_relevant("How does Macbeth evaluate his life when he is confronted with his mortality?")
If you run this example, you should see the Macbeth monologue being selected, with a roughly 82% similarity to the question.
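Since a question is often best answered by more than one passage, the lookup can be generalized to return the top few matches along with their scores. The following is a rough sketch of such a brute-force search; `top_k_relevant`, `cosine`, and the toy two-dimensional corpus are hypothetical names and data used only for illustration, and the embeddings are passed in rather than fetched from the API so the example runs offline.

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_relevant(query_embedding, corpus, k=3):
    """Return up to k (similarity, text) pairs, most similar first.

    corpus is a list of [text, embedding] pairs, laid out like monologues_embeddings.
    """
    scored = ((cosine(query_embedding, emb), text) for text, emb in corpus)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])

# Toy 2-dimensional "embeddings" for illustration only.
toy_corpus = [
    ["passage A", [1.0, 0.0]],
    ["passage B", [0.7, 0.7]],
    ["passage C", [0.0, 1.0]],
]
results = top_k_relevant([1.0, 0.1], toy_corpus, k=2)
for score, text in results:
    print(round(score, 3), text)
```

Every query scans the whole corpus, which is the brute-force trade-off described earlier: simple and adequate for experiments, but worth replacing with a vector database as the corpus grows.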
The Prompt Engineering
The last step in the RAG model is prompt engineering. In our case, it is not too difficult. Now that we have the monologue at hand, we can construct a prompt by simply embedding the monologue in our query, then asking the question:
completion = openai.ChatCompletion.create(
  model="gpt-4",
  messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": """
Consider the following monologue:
Tomorrow, and tomorrow, and tomorrow,
Creeps in this petty pace from day to day,
To the last syllable of recorded time;
And all our yesterdays have lighted fools
The way to dusty death. Out, out, brief candle!
Life's but a walking shadow, a poor player,
That struts and frets his hour upon the stage,
And then is heard no more. It is a tale
Told by an idiot, full of sound and fury,
Signifying nothing.
How does Macbeth evaluate his life when he is confronted with his mortality?
         
         """},
    ]
)
You should see an answer like:
In the well-known "Tomorrow and tomorrow and tomorrow" monologue from Shakespeare's Macbeth, the titular character plunges into existential despair when confronted with his impending death. 
Macbeth evaluates his life as hollow, futile, and meaningless when facing his mortality. He sees it as a "walking shadow" and himself as a "poor player," highlighting his view of life as a fleeting, contrived performance with no lasting substance or impact.
The "brief candle" is a metaphor for his own mortality, suggesting that life's flame can be extinguished suddenly and unceremoniously. By stating "It is a tale told by an idiot, full of sound and fury, signifying nothing," Macbeth is expressing his belief that life, despite all its noise and action, is ultimately empty, absurd, and void of meaning. 
Overall, Macbeth's examination of his life is profoundly pessimistic, revealing his deep despair and cynicism.
Of course, this particular example is not the most powerful demonstration of the RAG architecture, since most GPT models are already aware of Shakespeare’s monologues and have been trained on the large body of Shakespeare analysis publicly available on the internet. In fact, if you ask GPT-4 this exact question without the monologue embedded, you will likely get a very good answer, though it will likely not make as many quotation references to the soliloquy. However, it should be apparent that, in a commercial setting, this technique can be cross-applied to proprietary or esoteric datasets that are not accessible to existing GPT implementations.
In fact, readers who are familiar with my previous article, Building a Document Analyzer with ChatGPT, Google Cloud, and Python, may recognize that the last part of the technique is very similar to the prompt engineering that I did in that article. Extending from that idea, we can very easily imagine a RAG system built on top of publications from the Japanese government (the sample data from that article), which would allow users to search for and ask questions about Japanese economic policy. The system would quickly retrieve the most relevant documents, summarize them, and produce an answer based on deep domain-specific knowledge not available to the base GPT models. This power and simplicity is precisely why the RAG architecture is getting a lot of traction among LLM developers.
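To apply the prompt engineering step to whatever the lookup retrieves, rather than a single hard-coded monologue, the prompt can be assembled by a small helper. This is only a sketch: `build_rag_prompt`, the separator, and the exact instruction wording are assumptions chosen for illustration, and the resulting `messages` list would be passed to the chat completion call shown earlier.

```python
def build_rag_prompt(excerpts, question, separator="\n===\n"):
    """Join retrieved excerpts and the user question into one user message."""
    if not excerpts:
        raise ValueError("at least one excerpt is required")
    context = separator.join(excerpt.strip() for excerpt in excerpts)
    return (
        "Consider the following passages:\n"
        f"{context}\n"
        "====\n"
        f"Answer this question: {question}\n"
        "Answer only based on the passages provided."
    )

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": build_rag_prompt(
        ["Life's but a walking shadow, a poor player,"],
        "How does Macbeth evaluate his life?",
    )},
]
print(messages[1]["content"])
```

Keeping the instruction to answer only from the passages at the end of the prompt is one reasonable choice; where it works best may depend on the model.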
Now that we have gone over the RAG architecture, let’s explore some of the shortcomings of this architecture.
Limitations of the RAG Architecture
Embedding Debuggability
Because many RAG systems rely on document embedding and vector search to connect the question and the relevant documents, the whole system is often only as good as the embedding model used. The OpenAI embedding model is incredibly flexible, and there are many techniques to tune the embeddings. LLaMa, Meta’s open-source competitor to GPT, offers fine-tunable embedding models. However, there is an inescapable black-box aspect to the embedding model. This is somewhat manageable when comparing short text strings, but becomes difficult to validate and debug when it comes to comparing short strings with much longer documents. In our previous example, we have to take a slight leap of faith that the embedding lookup is able to connect “mortality” to the “tomorrow, and tomorrow, and tomorrow” monologue. This can be quite uncomfortable for workloads where transparency and debuggability are critical.
Context Overload
Another limitation of the RAG model is the relatively limited amount of context that can be passed to it. Because the embedding model requires document-level context to work well, we need to be careful when chopping up the corpus for embedding. The Macbeth monologue may have an 82% similarity to the question about mortality, but that number goes down to 78% when you compare the question to the embedding for just the first three lines of the monologue, that is, “Tomorrow, and tomorrow, and tomorrow, Creeps in this petty pace from day to day, To the last syllable of recorded time”.
As a result, the context that is passed to the RAG prompt needs to be rather large. At the time of writing, the highest-context GPT models are still limited to about 16,000 tokens, which is quite a lot of text, but when you are working with long interview transcripts or context-rich articles, you will be limited in how much context you can give in the final generation prompt.
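One common way to live within this limit is to add retrieved chunks to the prompt in order of relevance until a token budget is reached. The sketch below illustrates the idea; `pack_within_budget` and `approx_token_count` are hypothetical helpers, and counting whitespace-separated words is a deliberate simplification standing in for a real tokenizer.

```python
def approx_token_count(text):
    # Assumption: whitespace-separated words stand in for tokens here.
    # A real tokenizer would generally give different counts.
    return len(text.split())

def pack_within_budget(ranked_chunks, max_tokens, count_tokens=approx_token_count):
    """Keep chunks in ranked order until the next one would exceed max_tokens."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        selected.append(chunk)
        used += cost
    return selected, used

# Chunks are assumed to be sorted from most to least relevant already.
chunks = ["one two three", "four five", "six seven eight nine", "ten"]
selected, used = pack_within_budget(chunks, max_tokens=6)
print(selected, used)
```

The loop stops at the first chunk that does not fit instead of skipping ahead to smaller ones, which keeps the selection strictly in relevance order.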
Novel Terminology
The final limitation of the RAG model is its inability to work with novel terminology. People working in specific fields tend to develop terminologies and ways of speaking that are unique to that field. When these terminologies are not present in the training data of the embedding model, the lookup process will suffer.
For example, the ada-002 embedding model may not know that the “Rust Programming Language” is related to “LLVM”. In fact, it returns a relatively low cosine similarity of 78%. This means documents talking about LLVM may not show a strong similarity in a query about Rust, even though the two ideas are closely related in real life.
Usually, the problem of novel terminology can be overcome with some prompt engineering, but in the context of an embedding search, that is relatively difficult to do. Fine-tuning an embedding model is, as mentioned earlier, possible, but teaching the embedding model the novel terminology in all contexts can be error-prone and time-consuming.
Proposing QE-RAG, or Question-Enhanced RAG
Given these limitations, I would like to propose a modified architecture for a new class of RAG systems that sidesteps many of the limitations described above. The idea is based on doing vector searches on frequently asked questions, in addition to the corpus, and using an LLM to preprocess the corpus in the context of the questions. If that process sounds complicated, don’t worry: we will go over the implementation details in this section along with code examples you can use to follow along.
One thing to note is that QE-RAG should be run alongside a vanilla RAG implementation so that it can fall back on another implementation if needed. As the implementation matures, it should need the fallback less and less, but QE-RAG is still intended to be an enhancement to, rather than a replacement of, the vanilla RAG architecture.
The Architecture
The broad strokes of the QE-RAG architecture are as follows:
- Create a vector database of questions that can be or likely will be asked about the corpus.
- Preprocess and summarize the corpus against the questions in the vector database.
- When a user query comes in, compare the user query with the questions in the vector database.
- If a question in the database is highly similar to the user query, retrieve the version of the corpus that is summarized to answer the question.
- Use the summarized corpus to answer the user question.
- If no question in the DB is highly similar to the user query, fall back to a vanilla RAG implementation.
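The routing decision in the steps above can be sketched as a small function that either returns the best-matching stored question or signals a fallback. Everything here is an illustrative assumption rather than a fixed part of the architecture: the name `route_query`, the toy two-dimensional vectors, and especially the threshold of 0.9, which would need tuning against your own embedding model and questions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def route_query(query_embedding, question_embeddings, threshold=0.9):
    """Return (index, score) of the closest stored question.

    The index is None when no stored question clears the threshold,
    which signals that the caller should fall back to vanilla RAG.
    """
    best_index, best_score = None, -1.0
    for index, embedding in enumerate(question_embeddings):
        score = cosine(query_embedding, embedding)
        if score > best_score:
            best_index, best_score = index, score
    if best_index is None or best_score < threshold:
        return None, best_score
    return best_index, best_score

# Toy 2-dimensional "embeddings" for two stored questions.
stored = [[1.0, 0.0], [0.0, 1.0]]
match, score = route_query([0.9, 0.1], stored)         # close to stored question 0
fallback, low_score = route_query([0.7, 0.7], stored)  # close to neither
print(match, fallback)
```

When the returned index is `None`, the caller would run the vanilla RAG lookup instead of retrieving question-specific summaries.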
Let’s go through each part in turn.
Question Embeddings
The architecture begins, much like the vanilla RAG, with an embedding and a vector database. However, instead of embedding the documents, we will embed a series of questions.
To illustrate this, suppose we are trying to build an LLM that is an expert on Shakespeare. We might want it to answer questions like:
questions = [
    "How does political power shape the way characters interact in Shakespeare's plays?",
    "How does Shakespeare use supernatural elements in his plays?",
    "How does Shakespeare explore the ideas of death and mortality in his plays?",
    "How does Shakespeare explore the idea of free will in his plays?"
]
We will want to create embeddings for them like so, and save them for later use:
questions_embed = openai.Embedding.create(model=EMBEDDING_MODEL, input=questions)
Preprocessing and Summarization
Now that we have the questions, we will want to download and summarize the corpus. For this example, we will download the HTML versions of Macbeth and Hamlet:
import openai
import os
import requests
from bs4 import BeautifulSoup
plays = {
    'shakespeare_macbeth': 'https://www.gutenberg.org/cache/epub/1533/pg1533-images.html',
    'shakespeare_hamlet': 'https://www.gutenberg.org/cache/epub/1524/pg1524-images.html',   
}
if not os.path.exists('training_plays'):
    os.mkdir('training_plays')
for name, url in plays.items():
    print(name)
    file_path = os.path.join('training_plays', '%s.txt' % name)
    if not os.path.exists(file_path):
        res = requests.get(url)
        with open(file_path, 'w') as fp_write:
            fp_write.write(res.text)
Then we process the plays into scenes, using the HTML tags as a guide:
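with open(os.path.join('training_plays', 'shakespeare_hamlet.txt')) as fp_file:
    # Name the parser explicitly so BeautifulSoup does not have to guess one.
    soup = BeautifulSoup(fp_file.read(), 'html.parser')
headers = soup.find_all('div', {'class': 'chapter'})[1:]
scenes = []
for header in headers:
    cur_act = None
    cur_scene = None
    lines = []
    for i in header.find_all('h2')[0].parent.find_all():
        if i.name == 'h2':
            # An h2 heading marks the start of an act.
            print(i.text)
            cur_act = i.text
        elif i.name == 'h3':
            # An h3 heading marks the start of a scene; store the previous one first.
            print('\t', i.text.replace('\n', ' '))
            if cur_scene is not None:
                scenes.append({
                    'act': cur_act, 'scene': cur_scene,
                    'lines': lines
                })
                lines = []
            cur_scene = i.text
        elif (i.text != '' and
              not i.text.strip('\n').startswith('ACT') and 
              not i.text.strip('\n').startswith('SCENE')
             ):
            lines.append(i.text)
    # Store the final scene of this act; otherwise it would be silently dropped.
    if cur_scene is not None:
        scenes.append({
            'act': cur_act, 'scene': cur_scene,
            'lines': lines
        })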
And here is the part that makes QE-RAG unique: instead of creating embeddings for the specific scenes, we create summaries for them, targeted towards each of the questions:
def summarize_for_question(text, question, location):
    completion = openai.ChatCompletion.create(
      model="gpt-3.5-turbo-16k",
      messages=[
            {"role": "system", "content": "You are a literature assistant that provides helpful summaries."},
            {"role": "user",
             "content": """Is the following excerpt from %s relevant to the following question? %s
===
%s
===
If so, summarize the sections that are relevant. Include references to specific passages that would be useful.
If not, simply say: \"nothing is relevant\" without additional explanation""" % (
                 location, question, text
             )},
        ]
    )
    return completion
This function asks ChatGPT to do 2 things: 1) identify if the passage is actually useful to answering the question at hand, and 2) summarize the parts of the scene that are useful for answering the question.
If you try this function with a few pivotal scenes from Macbeth or Hamlet, you will see that GPT-3.5 is quite good at identifying if a scene is relevant to the question, and the summary will be quite a bit shorter than the scene itself. This makes it much easier to embed later at the prompt engineering step.
Now we can do this for all of the scenes.
for scene in scenes:
    scene_text = ''.join(scene['lines'])
    question_summaries = {}
    for question in questions:
        completion = summarize_for_question(scene_text, question, "Shakespeare's Hamlet")
        question_summaries[question] = completion.choices[0].message['content']
    scene['question_summaries'] = question_summaries
In production workloads, we would put the summaries into a database, but in our case, we will just write it as a JSON file to disk.
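Writing the summaries to disk and reading them back might look like the sketch below. The helper names `save_scenes` and `load_scenes`, the file name, and the one-item example list are hypothetical; the dictionary layout mirrors the `scenes` list built above, and a temporary directory is used only so the example can run anywhere.

```python
import json
import os
import tempfile

def save_scenes(scenes, path):
    """Write the list of scene dictionaries (including summaries) to a JSON file."""
    with open(path, "w", encoding="utf-8") as fp:
        json.dump(scenes, fp, ensure_ascii=False, indent=2)

def load_scenes(path):
    """Read the scene dictionaries back from disk."""
    with open(path, encoding="utf-8") as fp:
        return json.load(fp)

# A single made-up scene in the same shape as the entries in `scenes`.
example_scenes = [{
    "act": "ACT I",
    "scene": "SCENE I",
    "lines": ["Who's there?"],
    "question_summaries": {"example question": "nothing is relevant"},
}]
path = os.path.join(tempfile.gettempdir(), "hamlet_scenes_example.json")
save_scenes(example_scenes, path)
reloaded = load_scenes(path)
print(reloaded[0]["scene"])
```

Because each summary costs an API call, caching them this way means the expensive preprocessing step only has to run once per play and question set.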
Two-Stage Vector Search
Now suppose we get a user question like the below:
user_question = "How do Shakespearean characters deal with the concept of death?"
As in vanilla RAG, we will want to create an embedding for the question:
uq_embed = openai.Embedding.create(model=EMBEDDING_MODEL, input=[user_question])['data'][0]['embedding']
In a vanilla RAG, we would compare the user question embedding with the embeddings for the scenes in Shakespeare, but in QE-RAG, we compare it with the embeddings of the questions:
print([cosine_similarity(uq_embed, q['embedding']) for q in questions_embed['data']])
We see that the vector search has (correctly) identified question 3 as the most relevant question. Now, we retrieve the summary data for question 3:
# Phrases GPT uses to flag a scene as irrelevant; extend this tuple as new variants appear.
NOT_RELEVANT_MARKERS = (
    "NOTHING IS RELEVANT",
    "NOTHING IN THIS EXCERPT",
    "NOTHING FROM THIS EXCERPT",
    "NOT DIRECTLY ADDRESSED",
)
relevant_texts = []
for scene in hamlet + macbeth: # hamlet and macbeth are the scene lists from the above code
    summary = scene['question_summaries'][questions[2]]
    if not any(marker in summary.upper() for marker in NOT_RELEVANT_MARKERS):
        relevant_texts.append(summary)
Please note that, because GPT summarization is not deterministic, you may get several different strings to indicate that a scene is not relevant to the question at hand. The key is to only push the relevant excerpts into the list of relevant summaries.
At this stage, we can do a second-level vector search to only include the most relevant summaries in our prompt, but given the size of our corpus, we can simply use the entire relevant_texts list in our prompt.
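For a larger corpus, the second-level search could be sketched as follows. This assumes you have already generated one embedding per summary; `select_top_summaries` and the toy vectors are hypothetical, and returning the survivors in their original order (so the prompt still follows the order of the plays) is a design choice rather than a requirement.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_top_summaries(query_embedding, summaries, summary_embeddings, k=5):
    """Keep the k summaries closest to the query, preserving their original order."""
    if len(summaries) != len(summary_embeddings):
        raise ValueError("each summary needs exactly one embedding")
    scores = [cosine(query_embedding, emb) for emb in summary_embeddings]
    top = sorted(range(len(summaries)), key=lambda i: scores[i], reverse=True)[:k]
    return [summaries[i] for i in sorted(top)]

# Toy data: four summaries with made-up 2-dimensional embeddings.
toy_summaries = ["summary 0", "summary 1", "summary 2", "summary 3"]
toy_embeddings = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
kept = select_top_summaries([1.0, 0.0], toy_summaries, toy_embeddings, k=2)
print(kept)
```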
Prompt Engineering
Now that we have all the excerpts, we can build the prompt like so:
completion = openai.ChatCompletion.create(
      model="gpt-3.5-turbo-16k",
      messages=[
            {"role": "system", "content": "You are a literature assistant that answers questions based on excerpts."},
            {"role": "user",
             "content": """Consider the following passages:
    %s
====
Answer this question: %s
Be detailed and complete in your analysis. Make references to specific passages where helpful. 
Answer only based on passages provided.
""" % (
                 '\n===\n'.join(relevant_texts), user_question
             )},
        ]
    ) 
Let the API call run, and you should see something like the below:
Based on the passages provided, Shakespearean characters deal with the concept of death in various ways. In Hamlet, for example, the characters contemplate the nature of death and its implications. Hamlet himself reflects on the brevity of life and the corrupting influence of vices. He questions the purpose of existence and contemplates suicide as a release from the burdens of life. In the passage, Hamlet says, "To die, to sleep— / To sleep, perchance to dream. Ay, there's the rub, / For in that sleep of death what dreams may come" (Lines 65-67). This shows his contemplation of the afterlife and the uncertainty surrounding it.
In Macbeth, characters are confronted with the consequences of their actions and the fragility of life. Lady Macbeth, for instance, experiences guilt and remorse for her involvement in the murder of King Duncan. She is haunted by visions of blood on her hands and expresses the desire to escape the consequences of her actions. In the passage, Lady Macbeth says, "Out, damned spot! Out, I say!" (Line 30). This demonstrates her struggle to cope with the guilt and the psychological toll it takes on her.
Furthermore, both Macbeth and Lady Macbeth reflect on the inevitability and the consequences of their actions. In the passage, Macbeth laments the decay of life and the loss of purpose and meaning. He reflects on the transitory nature of existence and the insignificance of worldly achievements. Macbeth states, "Life's but a walking shadow, a poor player / That struts and frets his hour upon the stage" (Lines 217-218). This shows his acceptance of mortality and his realization that material possessions and power do not bring true fulfillment.
In summary, Shakespearean characters deal with the concept of death in various ways. They contemplate the nature of life and the afterlife, struggle with guilt and remorse, and question the meaning and purpose of their existence. Shakespeare's exploration of death and mortality through his characters provides insight into the complexities of human nature and the universal human experience of confronting one's mortality.
This is a very good summary that draws from many different parts of Hamlet and Macbeth.
For comparison, I created embeddings for entire scenes and ran a vanilla RAG lookup using the same prompt, finding the top N most relevant scenes until I reached 10,000 tokens:
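import tiktoken

# Assumption: `encoding` is a tiktoken tokenizer; cl100k_base is the encoding used by gpt-3.5-turbo.
encoding = tiktoken.get_encoding("cl100k_base")

def rank_scenes(scene_embeds):
    # Scene indices sorted from most to least similar to the user question.
    return sorted(range(len(scene_embeds)),
                  key=lambda idx: cosine_similarity(uq_embed, scene_embeds[idx]), reverse=True)

# macbeth_embed and hamlet_embed hold one embedding per scene, in the same order as the scene lists.
macbeth_ranked = rank_scenes(macbeth_embed)
hamlet_ranked = rank_scenes(hamlet_embed)

relevant_texts = []
for i in range(10):
    # Stop adding scenes once the combined text passes the 10,000-token budget.
    if len(encoding.encode('\n===\n'.join(relevant_texts))) > 10000:
        break
    relevant_texts.append(''.join(macbeth[macbeth_ranked[i]]['lines']))
    relevant_texts.append(''.join(hamlet[hamlet_ranked[i]]['lines']))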
If you feed this to the prompt, you should see something like the below:
Based on the provided passages, Shakespearean characters deal with the concept of death in various ways. In Hamlet, the characters reflect on death and mortality extensively. Hamlet himself contemplates suicide, expressing his desire to end his suffering and questioning the purpose of life. He says, "To be, or not to be: that is the question" (Hamlet, Act III, Scene I, Lines 56). This soliloquy reveals his struggle with the concept of death and his contemplation of the consequences of choosing either life or death.
Additionally, in the same play, the appearance of the Ghost prompts discussions about death. The characters question the meaning and significance of encountering the dead, and Hamlet grapples with the idea of mortality in relation to his father's death. He reflects, "Oh God! a beast that wants discourse of reason / Would have mourn'd longer" (Hamlet, Act I, Scene II, Lines 152-153). This passage highlights Hamlet's grief and his pondering of the brevity of life and the transitory nature of existence.
In Macbeth, the characters also confront death and its consequences. Macbeth, after killing King Duncan, reflects on the significance of the act and its implications. He says, "Will all great Neptune’s ocean wash this blood Clean from my hand? No, this my hand will rather The multitudinous seas incarnadine" (Macbeth, Act II, Scene II, Lines 77-79). This passage showcases Macbeth's awareness of the guilt associated with murder and his realization that the blood on his hands cannot be cleansed.
Lady Macbeth, on the other hand, shows a different perspective on death. She suppresses her guilt and tries to detach herself from the consequences of their actions. She states, "A little water clears us of this deed" (Macbeth, Act II, Scene II, Line 69). This passage reveals her denial and determination to move forward without remorse.
In conclusion, Shakespearean characters deal with the concept of death in various ways. Some contemplate the meaning of life and question the purpose of existence, while others grapple with the guilt and consequences that come with taking a life. Shakespeare explores the complexities of death and mortality through his characters' introspection and their different responses to the inevitability of death.
This is a very cogent analysis, but it does not engage with many of the most important passages from Hamlet and Macbeth. As you can see, QE-RAG has a distinct advantage in being able to embed more relevant context than a standard RAG system.
The above example, however, does not demonstrate another advantage of the QE-RAG, which is the ability to give the developer better control of the embedding process. To see how QE-RAG achieves this, let’s look at an extension of this problem - dealing with new terminology.
Extending QE-RAG to New Terminology
Where QE-RAG really shines is when you are introducing new terminology. For example, suppose you are introducing a new concept, like the Japanese word “zetsubou”, which is a term that sits between despair and hopelessness, specifically conveying a surrender to one’s circumstances. It is not as immediately catastrophic as the English concept of despair, but much more about the acquiescence to unpleasant things that are happening.
Suppose we want to answer a question like:
user_question = "How do Shakespearean characters cope with Zetsubou?"
With vanilla RAG we would do an embeddings search, then add an explainer in the final prompt engineering step:
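# Re-embed the new user question before running the search.
uq_embed = openai.Embedding.create(model=EMBEDDING_MODEL, input=[user_question])['data'][0]['embedding']
relevant_texts = []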
for i in range(10):
    if len(encoding.encode('\n===\n'.join(relevant_texts))) > 10000:
        break
    relevant_texts.append(''.join(macbeth[sorted(
        [(idx, cosine_similarity(uq_embed, q)) for idx, q in enumerate(macbeth_embed)], 
        key=lambda x: x[1], reverse=True
    )[i][0]]['lines']))
    relevant_texts.append(''.join(hamlet[sorted(
        [(idx, cosine_similarity(uq_embed, q)) for idx, q in enumerate(hamlet_embed)], 
        key=lambda x: x[1], reverse=True
    )[i][0]]['lines']))
completion = openai.ChatCompletion.create(
      model="gpt-3.5-turbo-16k",
      messages=[
            {"role": "system", "content": "You are a literature assistant that answers questions based on excerpts."},
            {"role": "user",
             "content": """Zetsubou is the concept of hopelessness and despair, combined with a surrender to the whim of one's circumstances.
Consider the following passages:
    %s
====
Answer this question: %s
Be detailed and complete in your analysis. Make references to specific passages where helpful. 
Answer only based on passages provided.
""" % (
                 '\n===\n'.join(relevant_texts), user_question
             )},
        ]
    ) 
The result is a very well-written and cogent but slightly overstretched answer focusing on a few scenes from Hamlet. Macbeth is not mentioned at all in this answer, because none of the scenes passed the embedding search. When looking at the embeddings, it is very clear that the semantic meaning of “zetsubou” was not captured properly and therefore relevant texts could not be retrieved from it.
In QE-RAG, we can inject the definition of the new term at the summarization stage, dramatically improving the quality of the text accessible to the system:
def summarize_for_question(text, question, location, context=''):
    completion = openai.ChatCompletion.create(
      model="gpt-3.5-turbo-16k",
      messages=[
            {"role": "system", "content": "You are a literature assistant that provides helpful summaries."},
            {"role": "user",
             "content": """%s
Is the following excerpt from %s relevant to the following question? %s
===
%s
===
If so, summarize the sections that are relevant. Include references to specific passages that would be useful.
If not, simply say: \"nothing is relevant\" without additional explanation""" % (
                 context, location, question, text
             )},
        ]
    )
    return completion
questions = [
    "How do characters deal with Zetsubou in Shakespearean plays?"
]
summarize_for_question(''.join(scene['lines']), questions[0], "Shakespeare's Macbeth", 
        "Zetsubou is the concept of hopelessness and despair, combined with a surrender to whim of one's circumstances."
    )
Run this summarization prompt over the passages, and you will see summaries that are extremely accurate and contextual. These summaries can then be used in the subsequent QE-RAG steps.
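As a minimal sketch of that step, the loop below runs a summarizer over every scene and keeps only the summaries judged relevant. The names `collect_relevant_summaries` and `fake_summarizer` are hypothetical and not part of the code in this article. In practice, `summarize` would be a thin wrapper around `summarize_for_question` that returns the text of the model's reply. The sketch assumes each scene is a dict with a 'lines' list, as in the code above.

```python
def collect_relevant_summaries(scenes, question, location, summarize, context=''):
    """Summarize every scene for one question and drop the irrelevant ones.

    `summarize` is any callable taking (text, question, location, context)
    and returning the reply text as a string. This is an assumption made so
    the sketch can run without an API call.
    """
    kept = []
    for idx, scene in enumerate(scenes):
        text = ''.join(scene['lines'])
        summary = summarize(text, question, location, context).strip()
        # The prompt asks the model to reply "nothing is relevant" for
        # unrelated excerpts, so those replies are filtered out here.
        if 'nothing is relevant' in summary.lower():
            continue
        kept.append({'scene_index': idx, 'summary': summary})
    return kept


# Stand-in summarizer for illustration only; it does not call an LLM.
def fake_summarizer(text, question, location, context=''):
    if 'despair' in text:
        return 'The speaker resigns himself to his fate.'
    return 'Nothing is relevant.'


toy_scenes = [
    {'lines': ['A feast is prepared. ', 'Guests arrive.']},
    {'lines': ['I am sunk in despair ', 'and yield to what must come.']},
]
summaries = collect_relevant_summaries(
    toy_scenes,
    'How do characters deal with Zetsubou?',
    "Shakespeare's Macbeth",
    fake_summarizer,
)
print(summaries)
```

Keeping the scene index next to each summary makes it possible to trace an answer back to the passage it came from.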
Additionally, the question embeddings will be extremely similar: even though ada-002 does not know the precise meaning of “zetsubou”, the presence of this word in both the stored question and the user’s question pulls their embeddings close together.
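To make the matching step concrete, here is a small sketch of looking up the closest stored question by cosine similarity. The vectors are made-up three-dimensional stand-ins for real ada-002 embeddings, and `closest_question` is a hypothetical helper rather than code from this article.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_question(user_embed, stored):
    """Return (question, score) for the stored question nearest to user_embed.

    `stored` maps question text to its embedding vector.
    """
    scored = [(q, cosine(user_embed, emb)) for q, emb in stored.items()]
    return max(scored, key=lambda pair: pair[1])


# Toy embeddings, invented for illustration only.
stored_questions = {
    "How do characters deal with Zetsubou in Shakespearean plays?": [0.9, 0.1, 0.2],
    "What role does the supernatural play in Macbeth?": [0.1, 0.9, 0.3],
}
user_embed = [0.85, 0.15, 0.25]  # stands in for the embedded user question

match, score = closest_question(user_embed, stored_questions)
print(match, round(score, 3))
```

With real embeddings, the summaries stored under the matched question would then be passed to the final answering prompt.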
Therefore, using QE-RAG to answer the original question about Zetsubou, we get the following answer:
Based on the passages provided, Shakespearean characters cope with Zetsubou, or hopelessness and despair, in different ways. In "Macbeth," Lady Macbeth copes with Zetsubou through guilt and anguish. She is consumed by the guilt of her and her husband's actions, and this guilt manifests in her sleepwalking and attempts to wash her hands of imagined blood stains. This can be seen in her utterances of "Out, damned spot!" and her preoccupation with the blood on her hands. Lady Macbeth's coping mechanism is to face the consequences of her actions head-on, even if it means descending into madness. The Doctor and Gentlewoman observe her behavior and acknowledge that her condition requires divine intervention.
In "Hamlet," the excerpt does not provide specific passages or examples of characters coping with Zetsubou. However, it does address Hamlet's inner turmoil and struggle to cope with his circumstances. Hamlet's state of mind, confusion, and madness are a reflection of despair. The characters in this scene are attempting to understand Hamlet's condition and find ways to deal with it, but the excerpt does not provide details on how they cope with Zetsubou.
In "Macbeth," Macbeth himself copes with Zetsubou by surrendering to his dire circumstances. He becomes desensitized to fear and horror, indicating a kind of resignation to his grim circumstances. He acknowledges that there is no escape from his fate and chooses to face it head-on, even though he knows it will lead to his downfall. Macbeth's reaction to the news of his wife's death and his subsequent soliloquy reflect a deep despair and hopelessness. He sees life as meaningless, a "tale Told by an idiot, full of sound and fury, Signifying nothing." Macbeth's surrender to his circumstances and his belief in the futility of life demonstrate his coping mechanism for Zetsubou.
In "Macbeth," Lady Macduff copes with Zetsubou by expressing her feelings of despair and hopelessness. She feels betrayed and abandoned by her husband, who has fled and left their family behind. She questions his actions and believes that they show a lack of love and wisdom. Lady Macduff experiences a sense of abandonment, which aligns with the concept of Zetsubou. She tries to make her son understand the severity of their circumstances, but he questions the definition of a traitor. Lady Macduff ultimately meets a tragic end, indicating that her coping mechanism for Zetsubou is not successful.
In summary, Shakespearean characters cope with Zetsubou in different ways. Lady Macbeth copes through guilt and faces the consequences of her actions, even if it means descending into madness. Macbeth copes through surrendering to his dire circumstances and embracing his fate. Lady Macduff copes through expressing her feelings of despair and hopelessness but ultimately meets a tragic end. These characters' coping mechanisms reflect different responses to Zetsubou in Shakespearean plays
The output still struggles a bit with this concept, but it should be apparent that QE-RAG is much more able to engage with the meat of the question and draw from a variety of sources.
And with that, we have implemented QE-RAG and demonstrated its use case!
Conclusion
In today’s article, we examined the increasingly popular RAG architecture and its limitations. We then extended the RAG architecture with a new architecture called QE-RAG, which looks to more fully use the capabilities of large language models. In addition to improved accuracy and contextual access, QE-RAG allows the entire system to grow as it interacts with users and becomes more familiar with the types of questions that are being asked, allowing firms to develop unique intellectual property on top of open-source or commercially available LLMs.
Of course, as an experimental idea, QE-RAG is not perfect. If you have ideas on how this architecture can be improved, or simply want to have a discussion about LLM technologies, please don’t hesitate to drop me a line through my Github or LinkedIn.
