Communities, chats, and forums are an endless source of information on a multitude of topics. Slack often replaces technical documentation, and Telegram and Discord communities help with gaming, startups, crypto, and travel questions. Despite the relevance of firsthand information, it is frequently highly unstructured, making it difficult to search through. In this article, we will explore the complexities of implementing a Telegram bot that will find answers to questions by extracting information from the history of chat messages.

Here is the basic chatbot user flow we are going to implement:

  1. The user asks the bot a question
  2. The bot finds the closest answers in the history of messages
  3. The bot summarises the search results with the help of LLM
  4. The bot returns the final answer to the user with links to the relevant messages

Let's walk through the main stages of this user flow and highlight the main challenges we will face.

Data preparation

To prepare the message history for search, we need to create embeddings of these messages - vectorized text representations. If we were dealing with a wiki article or a PDF document, we would split the text into paragraphs and compute a sentence embedding for each.
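For such a document, the baseline pipeline is simple. Here is a minimal sketch of it, assuming LangChain's RecursiveCharacterTextSplitter and the OpenAI embeddings API (the chunk size and model name are illustrative choices, not the only options):

import { OpenAI } from 'openai';
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

const openai = new OpenAI();

// Baseline approach for well-structured text: split it into overlapping chunks
// and compute an embedding for each chunk.
export async function embedDocument(text: string): Promise<number[][]> {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
  });
  const chunks = await splitter.splitText(text);
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-large',
    input: chunks,
  });
  return res.data.map((d) => d.embedding);
}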

However, we should take into account the peculiarities that are typical for chats and not for well-structured text: messages are short and conversational, a single thought is often split across several consecutive messages from the same user, and related messages are linked through replies and threads rather than through layout.

Next, we should choose an embedding model. There are many models for building embeddings, and several factors must be considered when choosing the right one.

To improve the quality of search results, we can categorize messages by topic. For example, in a chat dedicated to frontend development, users may discuss topics such as CSS, tooling, React, Vue, etc. You can use an LLM (more expensive) or classic topic-modeling methods from libraries like BERTopic to classify messages by topic.
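As a rough illustration of the LLM route, here is a minimal sketch of classifying a single message into one of a few hand-picked categories (the category list, the prompt, and the model are illustrative assumptions, not a recommendation):

import { OpenAI } from 'openai';

const openai = new OpenAI();

// Hypothetical example: ask the LLM to pick exactly one category for a message.
const CATEGORIES = ['css', 'tooling', 'react', 'vue', 'other'];

export async function classifyMessage(text: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4-turbo',
    temperature: 0,
    messages: [
      {
        role: 'system',
        content: `Classify the chat message into exactly one of these categories: ${CATEGORIES.join(', ')}. Reply with the category name only.`,
      },
      { role: 'user', content: text },
    ],
  });
  const answer = completion.choices[0].message.content?.trim().toLowerCase() ?? 'other';
  return CATEGORIES.includes(answer) ? answer : 'other';
}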

We will also need a vector database to store the embeddings and meta-information (links to the original posts, categories, dates). Many vector stores, such as FAISS, Milvus, or Pinecone, exist for this purpose. A regular PostgreSQL instance with the pgvector extension will also work.
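If you go with PostgreSQL, the pgvector extension and a vector index have to be set up once. Here is a sketch of how that could look as a MikroORM migration; it assumes the content_chunks table and embeddings column defined later in this article:

import { Migration } from '@mikro-orm/migrations';

export class AddPgvector extends Migration {
  async up(): Promise<void> {
    // enable the pgvector extension and add an approximate-nearest-neighbour index
    this.addSql('CREATE EXTENSION IF NOT EXISTS vector;');
    this.addSql(
      'CREATE INDEX content_chunks_embeddings_idx ON content_chunks USING hnsw (embeddings vector_l2_ops);',
    );
  }

  async down(): Promise<void> {
    this.addSql('DROP INDEX IF EXISTS content_chunks_embeddings_idx;');
  }
}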

Processing a user's question

In order to answer a user's question, we need to convert it into a searchable form: compute the question's embedding and determine its intent.

The result of a semantic search on a question could be similar questions from the chat history but not the answers to them.

To improve this, we can use a popular optimization technique called HyDE (hypothetical document embeddings). The idea is to generate a hypothetical answer to the question using an LLM and then compute the embedding of that answer. In some cases, this allows a more accurate and efficient search for relevant messages among answers rather than questions.
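Here is a minimal sketch of HyDE, assuming the OpenAI API is used both for generating the hypothetical answer and for the embedding (the models and the prompt are illustrative):

import { OpenAI } from 'openai';

const openai = new OpenAI();

// HyDE: embed a hypothetical answer instead of the question itself.
export async function hydeEmbedding(question: string): Promise<number[]> {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4-turbo',
    temperature: 0.7,
    messages: [
      {
        role: 'system',
        content: 'Write a short, plausible chat message that answers the question below. Do not mention that it is hypothetical.',
      },
      { role: 'user', content: question },
    ],
  });
  const hypotheticalAnswer = completion.choices[0].message.content ?? question;
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-large',
    input: hypotheticalAnswer,
    // keep the dimensionality consistent with the stored message embeddings
    dimensions: 1536,
  });
  return res.data[0].embedding;
}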

Finding the most relevant messages

Once we have the question embedding, we can search for the closest messages in the database. The LLM has a limited context window, so we may not be able to include all the search results if there are too many. The question is how to prioritize the candidates. There are several approaches for this, such as reranking them with a dedicated reranking model (as we will do with Cohere below) or weighting results by metadata such as date or category.

Generating the final response

After searching and sorting in the previous step, we can keep the 50-100 most relevant posts that will fit into the LLM context.
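Since each chunk will carry its token count (see the ContentChunk entity in the implementation below), a simple way to enforce this limit is to keep adding the highest-ranked chunks until a token budget is exhausted. A sketch, with an arbitrary budget value:

// ContentChunk is the entity defined in the implementation section below.
// Keep the top-ranked chunks that fit into a fixed token budget for the prompt.
export function fitToContext(chunks: ContentChunk[], maxTokens = 8000): ContentChunk[] {
  const selected: ContentChunk[] = [];
  let used = 0;
  for (const chunk of chunks) {
    if (used + chunk.tokens > maxTokens) break;
    selected.push(chunk);
    used += chunk.tokens;
  }
  return selected;
}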

The next step is to create a clear and concise prompt for the LLM from the user's original query and the search results. The prompt should tell the model how to answer, include the user's query, and provide the context: the relevant messages we found. For this purpose, it is essential to consider these aspects:

Implementation

Now let's try to implement these steps with NodeJS. Here is the tech stack I'm going to use: TypeScript on NodeJS, PostgreSQL with the pgvector extension accessed through MikroORM, LangChain's text splitters, and the OpenAI and Cohere APIs.

Let's skip the basic steps of installing dependencies and setting up the Telegram bot, and move straight to the most important features. First, the database schema, which we will need later:

import {
  ArrayType,
  DateTimeType,
  Entity,
  ManyToOne,
  PrimaryKey,
  Property,
  TextType,
} from '@mikro-orm/core';
// VectorType and the distance helpers come from the pgvector package ('pgvector/mikro-orm');
// BaseEntity is our own shared base class
import { VectorType } from 'pgvector/mikro-orm';
import { BaseEntity } from './base.entity';

@Entity({ tableName: 'groups' })
export class Group extends BaseEntity {
  @PrimaryKey()
  id!: number;

  @Property({ type: 'bigint' })
  channelId!: number;

  @Property({ type: 'text', nullable: true })
  title?: string;

  @Property({ type: 'json' })
  attributes!: Record<string, unknown>;
}

@Entity({ tableName: 'messages' })
export class Message extends BaseEntity {
  @PrimaryKey()
  id!: number;

  @Property({ type: 'bigint' })
  messageId!: number;

  @Property({ type: TextType })
  text!: string;

  @Property({ type: DateTimeType })
  date!: Date;

  @ManyToOne(() => Group, { onDelete: 'cascade' })
  group!: Group;

  @Property({ type: 'string', nullable: true })
  fromUserName?: string;

  @Property({ type: 'bigint', nullable: true })
  replyToMessageId?: number;

  @Property({ type: 'bigint', nullable: true })
  threadId?: number;

  @Property({ type: 'json' })
  attributes!: {
    raw: Record<any, any>;
  };
}

@Entity({ tableName: 'content_chunks' })
export class ContentChunk extends BaseEntity {
  @PrimaryKey()
  id!: number;

  @ManyToOne(() => Group, { onDelete: 'cascade' })
  group!: Group;

  @Property({ type: TextType })
  text!: string;

  @Property({ type: VectorType, length: 1536, nullable: true })
  embeddings?: number[];

  @Property({ type: 'int' })
  tokens!: number;

  @Property({ type: new ArrayType<number>((i: string) => +i), nullable: true })
  messageIds?: number[];

  @Property({ persist: false, nullable: true })
  distance?: number;
}

Split user dialogs into chunks

Splitting long dialogs between multiple users into chunks is not the most trivial task.

Unfortunately, default approaches such as the RecursiveCharacterTextSplitter available in the LangChain library do not account for the peculiarities specific to chats. However, in the case of Telegram, we can take advantage of Telegram threads, which group related messages together with the replies sent by users.

Every time a new batch of messages arrives from the chat room, our bot needs to perform a few steps: filter out noise, merge consecutive messages from the same user, group the messages into threads, and split each thread into chunks:

import { EntityDTO } from '@mikro-orm/core';
// in older LangChain versions this import was 'langchain/text_splitter'
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';
import { differenceInMinutes } from 'date-fns';

class ChatContentSplitter {
  constructor(
    private readonly splitter: RecursiveCharacterTextSplitter,
    private readonly longMessageLength = 200,
  ) {}

  public async split(messages: EntityDTO<Message>[]): Promise<ContentChunk[]> {
    const filtered = this.filterMessages(messages);
    const merged = this.mergeMessageSeries(filtered);
    const threads = this.toThreads(merged);
    const chunks = await this.threadsToChunks(threads);
    return chunks;
  }
  
  toThreads(messages: EntityDTO<Message>[]): EntityDTO<Message>[][] {
    const threads = new Map<number, EntityDTO<Message>[]>();
    const orphans: EntityDTO<Message>[][] = [];
    for (const message of messages) {
      if (message.threadId) {
        let thread = threads.get(message.threadId);
        if (!thread) {
          thread = [];
          threads.set(message.threadId, thread);
        }
        thread.push(message);
      } else {
        orphans.push([message]);
      }
    }
    return [...threads.values(), ...orphans];
  }
  
  private async threadsToChunks(
    threads: EntityDTO<Message>[][],
  ): Promise<ContentChunk[]> {
    const result: ContentChunk[] = [];
    for (const thread of threads) {
      const content = thread
        .map((m) => this.dtoToString(m))
        .join('\n');
      const texts = await this.splitter.splitText(content);
      const messageIds = thread.map((m) => m.id);
      const chunks = texts.map((text) =>
        new ContentChunk(text, messageIds)
      );
      result.push(...chunks);
    }
    return result;
  }
    
  mergeMessageSeries(messages: EntityDTO<Message>[]): EntityDTO<Message>[] {
    if (!messages.length) return [];
    const result: EntityDTO<Message>[] = [];
    let current = messages[0];
    for (const message of messages.slice(1)) {
      const short = message.text.length < this.longMessageLength;
      const sameUser = current.fromUserName === message.fromUserName;
      const subsequent = differenceInMinutes(message.date, current.date) < 10;
      if (sameUser && subsequent && short) {
        // append short follow-up messages from the same user to the current one
        current.text += `\n${message.text}`;
      } else {
        result.push(current);
        current = message;
      }
    }
    result.push(current);
    return result;
  }
  // filterMessages(), dtoToString() and other helpers are omitted for brevity
}
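Using the splitter could then look like this; the chunk size and overlap are illustrative values:

import { EntityDTO } from '@mikro-orm/core';
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

export async function splitBatch(messages: EntityDTO<Message>[]): Promise<ContentChunk[]> {
  const splitter = new ChatContentSplitter(
    new RecursiveCharacterTextSplitter({ chunkSize: 1000, chunkOverlap: 200 }),
  );
  return splitter.split(messages);
}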

Embeddings

Next, we need to calculate the embeddings for each of the chunks. For this, we can use the OpenAI text-embedding-3-large model:

 public async getEmbeddings(chunks: ContentChunk[]) {
    // the embeddings endpoint accepts batched input, so we send up to 100 chunks per request
    // (groupArray is a small helper that splits an array into batches of the given size)
    const batches = groupArray(chunks, 100);
    for (const batch of batches) {
      const res = await this.openai.embeddings.create({
        input: batch.map((c) => c.text),
        model: 'text-embedding-3-large',
        // request 1536-dimensional vectors to match the length of the pgvector column
        dimensions: 1536,
        encoding_format: 'float',
      });
      for (const { index, embedding } of res.data) {
        batch[index].embeddings = embedding;
      }
    }
    await this.orm.em.flush();
  }

Answering user questions

To answer a user's question, we first compute the embedding of the question and then find the most relevant messages in the chat history:

  // l2Distance comes from 'pgvector/mikro-orm'
  public async similaritySearch(embeddings: number[], groupId: number): Promise<ContentChunk[]> {
    return this.orm.em.qb(ContentChunk)
      .where({ embeddings: { $ne: null }, group: this.orm.em.getReference(Group, groupId) })
      .orderBy({ [l2Distance('embeddings', embeddings)]: 'ASC' })
      .limit(100)
      .getResultList();
  }

Then we rerank the search results with the help of Cohere's reranking model:

  public async rerank(query: string, chunks: ContentChunk[]): Promise<ContentChunk[]> {
    const { results } = await cohere.v2.rerank({
      documents: chunks.map((c) => c.text),
      query,
      model: 'rerank-v3.5',
    });
    // results come back ordered by relevance, each referencing the original document index
    return results.map(({ index }) => chunks[index]);
  }

Next, we ask the LLM to answer the user's question by summarising the search results. A simplified version of processing a search query looks like this:

 public async search(query: string, group: Group) {
    const queryEmbeddings = await this.getEmbeddings(query);
    const chunks = await this.chunkService.similaritySearch(queryEmbeddings, group.id);
    const reranked = await this.cohereService.rerank(query, chunks);
    const completion = await this.openai.chat.completions.create({
      model: 'gpt-4-turbo',
      temperature: 0,
      messages: [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: this.userPromptTemplate(query, reranked) },
      ],
    });
    return completion.choices[0].message;
  }
  
  // naive prompt
  public userPromptTemplate(query: string, chunks: ContentChunk[]) {
    const history = chunks
      .map((c) => c.text)
      .join('\n----------------------------\n');
    return `
      Answer the user's question:
      ${query}
      By summarizing the following content:
      ${history}
      Keep your answer direct and concise. Provide references to the corresponding messages.
    `;
  }

Further improvements

Even after all these optimizations, we may still feel that the answers of our LLM-powered bot are incomplete and less than ideal. What else could be improved?