This story on HackerNoon has a decentralized backup on Sia.
Transaction ID: yPg3Z--5hmVpLibFwq17R136m3xobTr7I48sae4xwtw

JINA EMBEDDINGS 2: 8192-Token General-Purpose Text Embeddings for Long Documents: Training Process

Written by @escholar | Published on 2024/2/23

TL;DR
Text embedding models have emerged as powerful tools for transforming sentences into fixed-size feature vectors that encapsulate semantic information.

This paper is available on arxiv under CC 4.0 license.

Authors:

(1) Michael Günther, michael.guenther;

(2) Jackmin Ong, jackmin.ong;

(3) Isabelle Mohr, isabelle.mohr;

(4) Alaeddine Abdessalem, alaeddine.abdessalem;

(5) Tanguy Abel, tanguy.abel;

(6) Mohammad Kalim Akram, kalim.akram;

(7) Susana Guzman, susana.guzman;

(8) Georgios Mastrapas, georgios.mastrapas;

(9) Saba Sturua, saba.sturua;

(10) Bo Wang, bo.wang;

(11) Maximilian Werk, maximilian.werk;

(12) Nan Wang, nan.wang;

(13) Han Xiao, han.xiao@jina.ai.

3 Training Process Overview

The training process for Jina Embeddings v2 is divided into three stages:

I Pre-training the Backbone: For the backbone pre-training, we design a modified BERT model capable of encoding documents with up to 8192 tokens. This model is trained on a full-text corpus using a masked language modeling objective.
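The masked language modeling objective used in stage I can be sketched as follows. This is a minimal illustrative implementation of BERT-style token masking, not the authors' code; the mask-token id, vocabulary size, and 15%/80-10-10 masking rates are the standard BERT conventions, not values taken from this paper:

```python
import random

MASK_ID = 103       # [MASK] token id by BERT convention; illustrative
VOCAB_SIZE = 30522  # illustrative vocabulary size

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """BERT-style masking: select ~15% of positions as prediction targets.

    Of the selected positions, 80% become [MASK], 10% a random token,
    and 10% stay unchanged. Labels are -100 (ignored) elsewhere, so the
    loss is computed only on the masked-out positions.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok  # the model must recover the original token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID
            elif r < 0.9:
                inputs[i] = rng.randrange(VOCAB_SIZE)
            # else: keep the original token as input

    return inputs, labels

ids = list(range(1000, 1032))  # a toy 32-token "document"
masked, labels = mask_tokens(ids)
```

The long-document aspect of stage I lies not in the objective itself but in the backbone's ability to attend over sequences of up to 8192 such tokens.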

II First Fine-tuning with Text Pairs: To encode a text passage into a single vector representation, the model is fine-tuned in an unsupervised manner on text pairs.
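Unsupervised fine-tuning on text pairs is commonly driven by a contrastive loss over in-batch negatives: each text's paired counterpart is its positive, and every other text in the batch serves as a negative. A minimal sketch, assuming an InfoNCE-style loss over cosine similarities (the function names and the temperature value here are illustrative assumptions, not taken from the paper):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(queries, passages, temperature=0.05):
    """InfoNCE with in-batch negatives.

    For each query i, passage i is the positive and every other passage
    in the batch is treated as a negative; the loss is the mean negative
    log-softmax of the positive's (temperature-scaled) similarity.
    """
    total = 0.0
    for i, q in enumerate(queries):
        sims = [cosine(q, p) / temperature for p in passages]
        log_denom = math.log(sum(math.exp(s) for s in sims))
        total += -(sims[i] - log_denom)
    return total / len(queries)
```

Intuitively, the loss is small when each text's embedding is closer to its own pair than to any other text in the batch, which is exactly what collapses a passage into a single useful vector.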

III Second Fine-tuning with Hard Negatives: The model is further fine-tuned using text pairs complemented with hard negatives. This stage is crucial for enabling the model to better distinguish between relevant passages and related, but irrelevant, text passages.

Table 1: Architecture specifications for the Jina BERT models of varying sizes. The number of attention heads is selected to ensure a consistent head dimension of 64.

While both stages II and III are geared towards training the models for embedding tasks, the latter is especially critical for improving the model’s performance in retrieval and classification tasks (refer to Section 6.2).
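A hard negative in stage III is a passage that is highly similar to the query yet is not a correct answer, which is what forces the model to learn fine-grained distinctions. As a toy sketch of how such a negative might be selected from a candidate pool (the function name and vectors are hypothetical, not the paper's mining procedure):

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def mine_hard_negative(query, positive_idx, candidates):
    """Return the index of the candidate most similar to the query
    that is NOT the positive -- a 'hard' negative: related text that
    is nevertheless the wrong answer."""
    best_i, best_s = None, -2.0
    for i, c in enumerate(candidates):
        if i == positive_idx:
            continue
        s = cosine(query, c)
        if s > best_s:
            best_i, best_s = i, s
    return best_i
```

Training against such near-miss passages, rather than random ones, is what sharpens the ranking behavior that retrieval and classification benchmarks reward.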


