Quickstart

Sentence Transformer

Characteristics of Sentence Transformer (a.k.a. bi-encoder) models:

  1. Calculate fixed-size vector representations (embeddings) given texts or images.

  2. Embedding calculation is often efficient, and embedding similarity calculation is very fast.

  3. Applicable for a wide range of tasks, such as semantic textual similarity, semantic search, clustering, classification, paraphrase mining, and more.

  4. Often used as the first step in a two-step retrieval process, where a Cross-Encoder (a.k.a. reranker) model is used to re-rank the top-k results from the bi-encoder.

Once you have installed Sentence Transformers, you can easily use Sentence Transformer models:

from sentence_transformers import SentenceTransformer

# 1. Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# 3. Calculate the embedding similarities
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
#         [0.6660, 1.0000, 0.1411],
#         [0.1046, 0.1411, 1.0000]])

With SentenceTransformer("all-MiniLM-L6-v2") we pick which Sentence Transformer model to load. In this example, we load all-MiniLM-L6-v2, a MiniLM model finetuned on a large dataset of over 1 billion training pairs. Using SentenceTransformer.similarity(), we compute the similarity between all pairs of sentences. As expected, the similarity between the first two sentences (0.6660) is higher than the similarity between the first and third sentences (0.1046) or between the second and third sentences (0.1411).
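
By default, this model compares embeddings with cosine similarity. As a minimal sketch of how to change that, the similarity function can also be selected at load time via the similarity_fn_name parameter; the choice of dot product below is purely illustrative:

from sentence_transformers import SentenceTransformer, SimilarityFunction

# Load the same model, but score embeddings with dot product instead of cosine
model = SentenceTransformer(
    "all-MiniLM-L6-v2",
    similarity_fn_name=SimilarityFunction.DOT_PRODUCT,
)

embeddings = model.encode(["The weather is lovely today.", "It's so sunny outside!"])
print(model.similarity(embeddings, embeddings))  # dot-product scores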

Finetuning Sentence Transformer models is easy and requires only a few lines of code. For more information, see the Training Overview section.
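
As a rough illustration of that, here is a minimal finetuning sketch built on the SentenceTransformerTrainer API; the dataset, split size, and output path are illustrative placeholders rather than recommendations:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss

# 1. Load a base model and a dataset of (anchor, positive) pairs
model = SentenceTransformer("all-MiniLM-L6-v2")
train_dataset = load_dataset("sentence-transformers/all-nli", "pair", split="train[:10000]")

# 2. In-batch negatives loss: other positives in the batch act as negatives
loss = MultipleNegativesRankingLoss(model)

# 3. Train and save the finetuned model
trainer = SentenceTransformerTrainer(model=model, train_dataset=train_dataset, loss=loss)
trainer.train()
model.save_pretrained("models/all-MiniLM-L6-v2-nli")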

Tip

Read Sentence Transformer > Usage > Speeding up Inference for tips on how to speed up model inference by 2-3x.
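
One of the options described there is switching the inference backend, available in recent library versions (3.2+). A brief sketch, assuming the optional ONNX dependencies are installed:

# Requires: pip install sentence-transformers[onnx]  (or [onnx-gpu] for GPUs)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
embeddings = model.encode(["The weather is lovely today."])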

Cross Encoder

Characteristics of Cross Encoder (a.k.a. reranker) models:

  1. Calculate a similarity score given pairs of texts.

  2. Generally provide superior performance compared to Sentence Transformer (a.k.a. bi-encoder) models.

  3. Often slower than Sentence Transformer models, as computation is required for each text pair rather than for each individual text.

  4. Due to the previous two characteristics, Cross Encoders are often used to re-rank the top-k results of a Sentence Transformer model (a minimal sketch of this two-step pipeline follows at the end of this section).

Usage of Cross Encoder (a.k.a. reranker) models is similar to that of Sentence Transformers:

from sentence_transformers.cross_encoder import CrossEncoder

# 1. Load a pretrained CrossEncoder model
model = CrossEncoder("cross-encoder/stsb-distilroberta-base")

# We want to compute the similarity between the query sentence...
query = "A man is eating pasta."

# ... and all sentences in the corpus
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]

# 2. We rank all sentences in the corpus for the query
ranks = model.rank(query, corpus)

# Print the scores
print("Query: ", query)
for rank in ranks:
    print(f"{rank['score']:.2f}\t{corpus[rank['corpus_id']]}")
"""
Query:  A man is eating pasta.
0.67    A man is eating food.
0.34    A man is eating a piece of bread.
0.08    A man is riding a horse.
0.07    A man is riding a white horse on an enclosed ground.
0.01    The girl is carrying a baby.
0.01    Two men pushed carts through the woods.
0.01    A monkey is playing drums.
0.01    A woman is playing violin.
0.01    A cheetah is running behind its prey.
"""

# 3. Alternatively, you can also manually compute the score between two sentences
import numpy as np

sentence_combinations = [[query, sentence] for sentence in corpus]
scores = model.predict(sentence_combinations)

# Sort the scores in decreasing order to get the corpus indices
ranked_indices = np.argsort(scores)[::-1]
print("Scores:", scores)
print("Indices:", ranked_indices)
"""
Scores: [0.6732372, 0.34102544, 0.00542465, 0.07569341, 0.00525378, 0.00536814, 0.06676237, 0.00534825, 0.00516717]
Indices: [0 1 3 6 2 5 7 4 8]
"""

With CrossEncoder("cross-encoder/stsb-distilroberta-base") we pick which CrossEncoder model to load. In this example, we load cross-encoder/stsb-distilroberta-base, a DistilRoBERTa model finetuned on the STS Benchmark dataset.
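
To make the two-step retrieval process from characteristic 4 concrete, below is a minimal retrieve-and-rerank sketch that chains both model types; the models, corpus, and top-k value are illustrative:

from sentence_transformers import CrossEncoder, SentenceTransformer

query = "A man is eating pasta."
corpus = [
    "A man is eating food.",
    "A man is riding a horse.",
    "A woman is playing violin.",
]

# Step 1: fast bi-encoder retrieval of the top-k candidates
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
similarities = bi_encoder.similarity(bi_encoder.encode([query]), bi_encoder.encode(corpus))[0]
top_k = similarities.argsort(descending=True)[:2].tolist()

# Step 2: slower but more accurate cross-encoder re-ranking of those candidates
reranker = CrossEncoder("cross-encoder/stsb-distilroberta-base")
scores = reranker.predict([[query, corpus[i]] for i in top_k])
for score, i in sorted(zip(scores, top_k), reverse=True):
    print(f"{score:.2f}\t{corpus[i]}")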

Sparse Encoder

Characteristics of Sparse Encoder models:

  1. Calculate sparse vector representations where most dimensions are zero.

  2. Provide efficiency benefits for large-scale retrieval systems, thanks to the sparsity of the embeddings.

  3. Often more interpretable than dense embeddings, with non-zero dimensions corresponding to specific tokens.

  4. Complementary to dense embeddings, enabling hybrid search systems that combine the strengths of both approaches.

Usage of Sparse Encoder models is similar to that of Sentence Transformers:

from sentence_transformers import SparseEncoder

# 1. Load a pretrained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate sparse embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 30522] - sparse representation with vocabulary size dimensions

# 3. Calculate the embedding similarities (using dot product by default)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[   35.629,     9.154,     0.098],
#         [    9.154,    27.478,     0.019],
#         [    0.098,     0.019,    29.553]])

# 4. Check sparsity statistics
stats = SparseEncoder.sparsity(embeddings)
print(f"Sparsity: {stats['sparsity_ratio']:.2%}")  # Typically >99% zeros
print(f"Avg non-zero dimensions per embedding: {stats['active_dims']:.2f}")

With SparseEncoder("naver/splade-cocondenser-ensembledistil") we load a pretrained SPLADE model that produces sparse embeddings. SPLADE (SParse Lexical AnD Expansion) models leverage the MLM prediction mechanism to create sparse representations, which makes them particularly effective for information retrieval tasks.
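
Because the non-zero dimensions correspond to vocabulary tokens, you can inspect which tokens dominate a given embedding. Continuing the example above, the following is a minimal sketch using plain torch; it assumes encode() returned torch tensors (possibly in sparse layout) and that the model exposes its Hugging Face tokenizer as model.tokenizer:

import torch

# Convert to dense layout for easy indexing (encode() may return sparse tensors)
dense_embeddings = embeddings.to_dense() if embeddings.is_sparse else embeddings

# Top 10 highest-weighted dimensions of the first sentence and their tokens
values, indices = torch.topk(dense_embeddings[0], k=10)
tokens = model.tokenizer.convert_ids_to_tokens(indices.tolist())
for token, value in zip(tokens, values.tolist()):
    print(f"{token}: {value:.3f}")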

Next Steps

Consider reading one of the following sections next: