Communities, chats, and forums are an endless source of information on a multitude of topics. Slack often replaces technical documentation, and Telegram and Discord communities help with gaming, startups, crypto, and travel questions. Despite the relevance of firsthand information, it is frequently highly unstructured, making it difficult to search through. In this article, we will explore the complexities of implementing a Telegram bot that will find answers to questions by extracting information from the history of chat messages.

Here is the basic chatbot user flow we are going to implement:

  1. The user asks the bot a question
  2. The bot finds the closest answers in the history of messages
  3. The bot summarises the search results with the help of LLM
  4. The bot returns the final answer to the user with links to the relevant messages

Let's walk through the main stages of this user flow and highlight the main challenges we will face.

Data preparation

To prepare the message history for search, we need to create embeddings of these messages - vectorized text representations. If we were dealing with a wiki article or a PDF document, we would split the text into paragraphs and compute a sentence embedding for each.
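For such a document, the baseline pipeline is simple. Here is a minimal sketch of it, assuming LangChain's RecursiveCharacterTextSplitter and the OpenAI embeddings API (the chunk size and model name are illustrative choices, not the only options):

import { OpenAI } from 'openai';
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

const openai = new OpenAI();

// Baseline approach for well-structured text: split it into overlapping chunks
// and compute an embedding for each chunk.
export async function embedDocument(text: string): Promise<number[][]> {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 200,
  });
  const chunks = await splitter.splitText(text);
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-large',
    input: chunks,
  });
  return res.data.map((d) => d.embedding);
}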

However, we should take into account the peculiarities that are typical for chats and not for well-structured text: messages are short and conversational, a single thought is often split across several consecutive messages from the same user, and related messages are linked through replies and threads rather than through layout.

Next, we should choose an embedding model. There are many models for building embeddings, and several factors must be considered when choosing the right one.

To improve the quality of search results, we can categorize messages by topic. For example, in a chat dedicated to frontend development, users may discuss topics such as CSS, tooling, React, Vue, etc. You can use an LLM (more expensive) or classic topic-modeling methods from libraries like BERTopic to classify messages by topic.
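As a rough illustration of the LLM route, here is a minimal sketch of classifying a single message into one of a few hand-picked categories (the category list, the prompt, and the model are illustrative assumptions, not a recommendation):

import { OpenAI } from 'openai';

const openai = new OpenAI();

// Hypothetical example: ask the LLM to pick exactly one category for a message.
const CATEGORIES = ['css', 'tooling', 'react', 'vue', 'other'];

export async function classifyMessage(text: string): Promise<string> {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4-turbo',
    temperature: 0,
    messages: [
      {
        role: 'system',
        content: `Classify the chat message into exactly one of these categories: ${CATEGORIES.join(', ')}. Reply with the category name only.`,
      },
      { role: 'user', content: text },
    ],
  });
  const answer = completion.choices[0].message.content?.trim().toLowerCase() ?? 'other';
  return CATEGORIES.includes(answer) ? answer : 'other';
}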

We will also need a vector database to store the embeddings and meta-information (links to the original posts, categories, dates). Many vector stores, such as FAISS, Milvus, or Pinecone, exist for this purpose. A regular PostgreSQL instance with the pgvector extension will also work.
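If you go with PostgreSQL, the pgvector extension and a vector index have to be set up once. Here is a sketch of how that could look as a MikroORM migration; it assumes the content_chunks table and embeddings column defined later in this article:

import { Migration } from '@mikro-orm/migrations';

export class AddPgvector extends Migration {
  async up(): Promise<void> {
    // enable the pgvector extension and add an approximate-nearest-neighbour index
    this.addSql('CREATE EXTENSION IF NOT EXISTS vector;');
    this.addSql(
      'CREATE INDEX content_chunks_embeddings_idx ON content_chunks USING hnsw (embeddings vector_l2_ops);',
    );
  }

  async down(): Promise<void> {
    this.addSql('DROP INDEX IF EXISTS content_chunks_embeddings_idx;');
  }
}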

Processing a user's question

In order to answer a user's question, we need to convert it into a searchable form: compute the question's embedding and determine its intent.

The result of a semantic search on a question could be similar questions from the chat history but not the answers to them.

To improve this, we can use a popular optimization technique called HyDE (hypothetical document embeddings). The idea is to generate a hypothetical answer to the question using an LLM and then compute the embedding of that answer. In some cases, this allows a more accurate and efficient search for relevant messages among answers rather than questions.
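Here is a minimal sketch of HyDE, assuming the OpenAI API is used both for generating the hypothetical answer and for the embedding (the models and the prompt are illustrative):

import { OpenAI } from 'openai';

const openai = new OpenAI();

// HyDE: embed a hypothetical answer instead of the question itself.
export async function hydeEmbedding(question: string): Promise<number[]> {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4-turbo',
    temperature: 0.7,
    messages: [
      {
        role: 'system',
        content: 'Write a short, plausible chat message that answers the question below. Do not mention that it is hypothetical.',
      },
      { role: 'user', content: question },
    ],
  });
  const hypotheticalAnswer = completion.choices[0].message.content ?? question;
  const res = await openai.embeddings.create({
    model: 'text-embedding-3-large',
    input: hypotheticalAnswer,
    // keep the dimensionality consistent with the stored message embeddings
    dimensions: 1536,
  });
  return res.data[0].embedding;
}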

Finding the most relevant messages

Once we have the question embedding, we can search for the closest messages in the database. The LLM has a limited context window, so we may not be able to include all the search results if there are too many. The question is how to prioritize the candidates. There are several approaches for this, such as reranking them with a dedicated reranking model (as we will do with Cohere below) or weighting results by metadata such as date or category.

Generating the final response

After searching and sorting in the previous step, we can keep the 50-100 most relevant posts that will fit into the LLM context.
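Since each chunk will carry its token count (see the ContentChunk entity in the implementation below), a simple way to enforce this limit is to keep adding the highest-ranked chunks until a token budget is exhausted. A sketch, with an arbitrary budget value:

// ContentChunk is the entity defined in the implementation section below.
// Keep the top-ranked chunks that fit into a fixed token budget for the prompt.
export function fitToContext(chunks: ContentChunk[], maxTokens = 8000): ContentChunk[] {
  const selected: ContentChunk[] = [];
  let used = 0;
  for (const chunk of chunks) {
    if (used + chunk.tokens > maxTokens) break;
    selected.push(chunk);
    used += chunk.tokens;
  }
  return selected;
}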

The next step is to create a clear and concise prompt for the LLM from the user's original query and the search results. The prompt should tell the model how to answer, include the user's query, and provide the context: the relevant messages we found. For this purpose, it is essential to consider these aspects:

Implementation

Now let's try to implement these steps with NodeJS. Here is the tech stack I'm going to use: TypeScript on NodeJS, PostgreSQL with the pgvector extension accessed through MikroORM, LangChain's text splitters, and the OpenAI and Cohere APIs.

Let's skip the basic steps of installing dependencies and setting up the Telegram bot, and move straight to the most important features. First, the database schema, which we will need later:

import {
  ArrayType,
  DateTimeType,
  Entity,
  ManyToOne,
  PrimaryKey,
  Property,
  TextType,
} from '@mikro-orm/core';
// VectorType and the distance helpers come from the pgvector package ('pgvector/mikro-orm');
// BaseEntity is our own shared base class
import { VectorType } from 'pgvector/mikro-orm';
import { BaseEntity } from './base.entity';

@Entity({ tableName: 'groups' })
export class Group extends BaseEntity {
  @PrimaryKey()
  id!: number;

  @Property({ type: 'bigint' })
  channelId!: number;

  @Property({ type: 'text', nullable: true })
  title?: string;

  @Property({ type: 'json' })
  attributes!: Record<string, unknown>;
}

@Entity({ tableName: 'messages' })
export class Message extends BaseEntity {
  @PrimaryKey()
  id!: number;

  @Property({ type: 'bigint' })
  messageId!: number;

  @Property({ type: TextType })
  text!: string;

  @Property({ type: DateTimeType })
  date!: Date;

  @ManyToOne(() => Group, { onDelete: 'cascade' })
  group!: Group;

  @Property({ type: 'string', nullable: true })
  fromUserName?: string;

  @Property({ type: 'bigint', nullable: true })
  replyToMessageId?: number;

  @Property({ type: 'bigint', nullable: true })
  threadId?: number;

  @Property({ type: 'json' })
  attributes!: {
    raw: Record<any, any>;
  };
}

@Entity({ tableName: 'content_chunks' })
export class ContentChunk extends BaseEntity {
  @PrimaryKey()
  id!: number;

  @ManyToOne(() => Group, { onDelete: 'cascade' })
  group!: Group;

  @Property({ type: TextType })
  text!: string;

  @Property({ type: VectorType, length: 1536, nullable: true })
  embeddings?: number[];

  @Property({ type: 'int' })
  tokens!: number;

  @Property({ type: new ArrayType<number>((i: string) => +i), nullable: true })
  messageIds?: number[];

  @Property({ persist: false, nullable: true })
  distance?: number;
}

Split user dialogs into chunks

Splitting long dialogs between multiple users into chunks is not the most trivial task.

Unfortunately, default approaches such as the RecursiveCharacterTextSplitter available in the LangChain library do not account for the peculiarities specific to chats. However, in the case of Telegram, we can take advantage of Telegram threads, which group related messages together with the replies sent by users.

Every time a new batch of messages arrives from the chat room, our bot needs to perform a few steps: filter out noise, merge consecutive messages from the same user, group the messages into threads, and split each thread into chunks:

import { EntityDTO } from '@mikro-orm/core';
// in older LangChain versions this import was 'langchain/text_splitter'
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';
import { differenceInMinutes } from 'date-fns';

class ChatContentSplitter {
  constructor(
    private readonly splitter: RecursiveCharacterTextSplitter,
    private readonly longMessageLength = 200,
  ) {}

  public async split(messages: EntityDTO<Message>[]): Promise<ContentChunk[]> {
    const filtered = this.filterMessages(messages);
    const merged = this.mergeMessageSeries(filtered);
    const threads = this.toThreads(merged);
    const chunks = await this.threadsToChunks(threads);
    return chunks;
  }
  
  toThreads(messages: EntityDTO<Message>[]): EntityDTO<Message>[][] {
    const threads = new Map<number, EntityDTO<Message>[]>();
    const orphans: EntityDTO<Message>[][] = [];
    for (const message of messages) {
      if (message.threadId) {
        let thread = threads.get(message.threadId);
        if (!thread) {
          thread = [];
          threads.set(message.threadId, thread);
        }
        thread.push(message);
      } else {
        orphans.push([message]);
      }
    }
    return [...threads.values(), ...orphans];
  }
  
  private async threadsToChunks(
    threads: EntityDTO<Message>[][],
  ): Promise<ContentChunk[]> {
    const result: ContentChunk[] = [];
    for (const thread of threads) {
      const content = thread
        .map((m) => this.dtoToString(m))
        .join('\n');
      const texts = await this.splitter.splitText(content);
      const messageIds = thread.map((m) => m.id);
      const chunks = texts.map((text) =>
        new ContentChunk(text, messageIds)
      );
      result.push(...chunks);
    }
    return result;
  }
    
  mergeMessageSeries(messages: EntityDTO<Message>[]): EntityDTO<Message>[] {
    if (!messages.length) return [];
    const result: EntityDTO<Message>[] = [];
    let current = messages[0];
    for (const message of messages.slice(1)) {
      const short = message.text.length < this.longMessageLength;
      const sameUser = current.fromUserName === message.fromUserName;
      const subsequent = differenceInMinutes(message.date, current.date) < 10;
      if (sameUser && subsequent && short) {
        // append short follow-up messages from the same user to the current one
        current.text += `\n${message.text}`;
      } else {
        result.push(current);
        current = message;
      }
    }
    result.push(current);
    return result;
  }
  // filterMessages(), dtoToString() and other helpers are omitted for brevity
}
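Using the splitter could then look like this; the chunk size and overlap are illustrative values:

import { EntityDTO } from '@mikro-orm/core';
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

export async function splitBatch(messages: EntityDTO<Message>[]): Promise<ContentChunk[]> {
  const splitter = new ChatContentSplitter(
    new RecursiveCharacterTextSplitter({ chunkSize: 1000, chunkOverlap: 200 }),
  );
  return splitter.split(messages);
}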

Embeddings

Next, we need to calculate the embeddings for each of the chunks. For this, we can use the OpenAI text-embedding-3-large model:

 public async getEmbeddings(chunks: ContentChunk[]) {
    // the embeddings endpoint accepts batched input, so we send up to 100 chunks per request
    // (groupArray is a small helper that splits an array into batches of the given size)
    const batches = groupArray(chunks, 100);
    for (const batch of batches) {
      const res = await this.openai.embeddings.create({
        input: batch.map((c) => c.text),
        model: 'text-embedding-3-large',
        // request 1536-dimensional vectors to match the length of the pgvector column
        dimensions: 1536,
        encoding_format: 'float',
      });
      for (const { index, embedding } of res.data) {
        batch[index].embeddings = embedding;
      }
    }
    await this.orm.em.flush();
  }

Answering user questions

To answer a user's question, we first compute the embedding of the question and then find the most relevant messages in the chat history:

  // l2Distance comes from 'pgvector/mikro-orm'
  public async similaritySearch(embeddings: number[], groupId: number): Promise<ContentChunk[]> {
    return this.orm.em.qb(ContentChunk)
      .where({ embeddings: { $ne: null }, group: this.orm.em.getReference(Group, groupId) })
      .orderBy({ [l2Distance('embeddings', embeddings)]: 'ASC' })
      .limit(100)
      .getResultList();
  }

Then we rerank the search results with the help of Cohere's reranking model:

  public async rerank(query: string, chunks: ContentChunk[]): Promise<ContentChunk[]> {
    const { results } = await cohere.v2.rerank({
      documents: chunks.map((c) => c.text),
      query,
      model: 'rerank-v3.5',
    });
    // results come back ordered by relevance, each referencing the original document index
    return results.map(({ index }) => chunks[index]);
  }

Next, we ask the LLM to answer the user's question by summarising the search results. A simplified version of processing a search query looks like this:

 public async search(query: string, group: Group) {
    const queryEmbeddings = await this.getEmbeddings(query);
    const chunks = await this.chunkService.similaritySearch(queryEmbeddings, group.id);
    const reranked = await this.cohereService.rerank(query, chunks);
    const completion = await this.openai.chat.completions.create({
      model: 'gpt-4-turbo',
      temperature: 0,
      messages: [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: this.userPromptTemplate(query, reranked) },
      ],
    });
    return completion.choices[0].message;
  }
  
  // naive prompt
  public userPromptTemplate(query: string, chunks: ContentChunk[]) {
    const history = chunks
      .map((c) => c.text)
      .join('\n----------------------------\n');
    return `
      Answer the user's question:
      ${query}
      By summarizing the following content:
      ${history}
      Keep your answer direct and concise. Provide references to the corresponding messages.
    `;
  }

Further improvements

Even after all these optimizations, we may still feel that the answers of our LLM-powered bot are incomplete and less than ideal. What else could be improved?