This story on HackerNoon has a decentralized backup on Sia.
Transaction ID: 9_k3Xk8YB85liEuizfZloAx4gVpXVTFd2Le-B2A3cb0

Improving Text Embeddings with Large Language Models: Instructions for Training and Evaluation

Written by @autoencoder | Published on 2024/10/10

TL;DR
This paper introduces a novel method for generating high-quality text embeddings from synthetic data, achieving state-of-the-art results with minimal training.

Authors:

(1) Liang Wang, Microsoft Corporation (correspondence to wangliang@microsoft.com);

(2) Nan Yang, Microsoft Corporation (correspondence to nanya@microsoft.com);

(3) Xiaolong Huang, Microsoft Corporation;

(4) Linjun Yang, Microsoft Corporation;

(5) Rangan Majumder, Microsoft Corporation;

(6) Furu Wei, Microsoft Corporation (correspondence to fuwei@microsoft.com).

Abstract and 1 Introduction

2 Related Work

3 Method

3.1 Synthetic Data Generation

3.2 Training

4 Experiments

4.1 Statistics of the Synthetic Data

4.2 Model Fine-tuning and Evaluation

4.3 Main Results

4.4 Multilingual Retrieval

5 Analysis

5.1 Is Contrastive Pre-training Necessary?

5.2 Extending to Long Text Embeddings and 5.3 Analysis of Training Hyperparameters

6 Conclusion and References

A Implementation Details

B Test Set Contamination Analysis

C Prompts for Synthetic Data Generation

D Instructions for Training and Evaluation

D Instructions for Training and Evaluation

We manually write instructions for training datasets, as listed in Table 13. For evaluation datasets, the instructions are listed in Table 14.
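As a rough illustration of how a per-dataset instruction might be attached to a query before embedding, here is a minimal sketch. The `Instruct: ... / Query: ...` layout follows the format described in the paper's Method section as best we recall it; the helper name `build_instructed_query`, the dataset key, and the instruction string are hypothetical placeholders and are not copied from Table 13 or Table 14.

```python
# Minimal sketch: prepend a task instruction to a query before embedding.
# The dataset key and instruction text below are illustrative assumptions,
# not entries copied from Table 13 or Table 14.
TASK_INSTRUCTIONS = {
    "example_retrieval_dataset": (
        "Given a question, retrieve passages that answer the question"
    ),
}


def build_instructed_query(instruction: str, query: str) -> str:
    """Combine a natural-language task instruction with the raw query text."""
    return f"Instruct: {instruction}\nQuery: {query}"


instructed = build_instructed_query(
    TASK_INSTRUCTIONS["example_retrieval_dataset"],
    "how do text embeddings work",
)
print(instructed)
```

In this sketch only the query side carries the instruction; documents would be embedded as-is, which keeps a document index reusable across tasks.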

Table 8: Prompt template for the long-short matching subgroup. For placeholders, “{num_words}” ∈ {"less than 10", "at least 10", "at least 50", "at least 100", "at least 200"}, “{difficulty}” ∈ {high school, college, PhD}, “{clarity}” ∈ {clear, understandable with some effort, ambiguous}.

Table 9: Prompt template for the short-short matching subgroup. We do not generate negative documents as the matching task is already reasonably difficult.

Table 10: Prompt template for the long-long matching subgroup. We do not generate negative documents for latency reasons.

Table 11: Prompt template for monolingual STS. For placeholders, “{high_score}” ∈ {4, 4.5, 5}, “{low_score}” ∈ {2.5, 3, 3.5}, “{unit}” ∈ {sentence, phrase, passage}, “{difficulty}” ∈ {elementary school, high school, college}.

Table 12: Prompt template for bitext retrieval. For placeholders, “{high_score}” ∈ {4, 4.5, 5}, “{low_score}” ∈ {1.5, 2, 2.5}, “{unit}” ∈ {sentence, phrase, passage}, “{difficulty}” ∈ {elementary school, high school, college}.
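The placeholder sets in the captions above suggest that each prompt is built by drawing one value per placeholder and substituting it into the template. The sketch below shows one way this could be done for the monolingual STS placeholders of Table 11. The value sets come from the caption; the template string, the function name `sample_placeholders`, and the uniform random sampling are assumptions for illustration, since the actual template text is not reproduced here.

```python
import random

# Placeholder value sets as listed in the Table 11 caption (monolingual STS).
STS_PLACEHOLDERS = {
    "high_score": [4, 4.5, 5],
    "low_score": [2.5, 3, 3.5],
    "unit": ["sentence", "phrase", "passage"],
    "difficulty": ["elementary school", "high school", "college"],
}

# Hypothetical stand-in template; the real prompt text is in Table 11.
EXAMPLE_TEMPLATE = (
    "Write a {unit} at {difficulty} level, then one paraphrase with "
    "similarity score {high_score} and one with score {low_score}."
)


def sample_placeholders(options: dict, rng: random.Random) -> dict:
    """Draw one value per placeholder (uniform sampling is an assumption)."""
    return {name: rng.choice(values) for name, values in options.items()}


rng = random.Random(0)  # seeded so the sketch is reproducible
values = sample_placeholders(STS_PLACEHOLDERS, rng)
prompt = EXAMPLE_TEMPLATE.format(**values)
print(prompt)
```

The same helper would apply to the bitext retrieval placeholders of Table 12 by swapping in its `low_score` set of {1.5, 2, 2.5}.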

Table 13: Instructions for each training dataset.

Table 14: Instructions used for evaluation on the MTEB benchmark. “STS*” indicates we use the same instructions for all the STS tasks.

Table 15: Results for each dataset in the MTEB benchmark. The evaluation metrics and detailed baseline results are available in the original paper [28].

This paper is available on arXiv under the CC0 1.0 DEED license.
