Information Retrieval
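Sparse retriever models are commonly SPLADE models, which map queries and documents to high-dimensional sparse vectors. Given a query, relevant documents are retrieved by computing the dot product (or cosine similarity) between the query's sparse vector and the sparse vector of every document in the collection. In practice, inference is made efficient with inverted indexes and associated retrieval algorithms.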
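To make the scoring step concrete, here is a minimal sketch. It assumes sparse vectors are stored as Python dicts mapping a vocabulary term to its weight; the `sparse_dot` helper and the toy vectors are hypothetical illustrations, not part of any library.

```python
def sparse_dot(query_vec: dict[str, float], doc_vec: dict[str, float]) -> float:
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the smaller vector; only terms present in both contribute.
    if len(query_vec) > len(doc_vec):
        query_vec, doc_vec = doc_vec, query_vec
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)


# Hypothetical toy vectors (real SPLADE vectors have vocabulary-sized dimensions).
query = {"man": 2.0, "eating": 1.5, "pasta": 1.0}
docs = {
    "d1": {"man": 1.0, "eating": 2.0, "food": 0.5},
    "d2": {"girl": 1.5, "baby": 2.0},
}
scores = {doc_id: sparse_dot(query, vec) for doc_id, vec in docs.items()}
print(scores)
```

Because only terms shared by the query and the document contribute to the score, documents with no overlapping terms never need to be visited, which is what makes an inverted index a natural fit.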
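Many sparse retriever models are trained on MS MARCO; an example of this training approach can be found here.
However, you will most likely get the best results by training on your own domain-specific dataset. This page outlines example training scripts, focused on sparse retrieval, that you can adapt to your own data.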
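Example scripts include:
- This example uses SpladeLoss (which internally uses SparseMultipleNegativesRankingLoss) to train on (query, positive passage) pairs mined from datasets such as GooAQ. The goal is to train a SPLADE model so that a query and its positive passage have high similarity, while the query stays dissimilar to the other passages in the batch (in-batch negatives). Retrieval performance is evaluated on datasets such as MS MARCO, NFCorpus, or NQ with an evaluator such as SparseNanoBEIREvaluator, using appropriate retrieval metrics (e.g., nDCG@k, MRR@k).
- This example also uses SpladeLoss (again with SparseMultipleNegativesRankingLoss) and trains on the NQ (Natural Questions) dataset. It demonstrates an alternative configuration for training a SPLADE model on question-answering data for sparse retrieval.
- This example uses CSRLoss (which internally uses SparseMultipleNegativesRankingLoss) for a sparse retriever. It trains on data from datasets such as NQ (Natural Questions). The script demonstrates how to train a sparse model with a SparseAutoEncoder head on top of a SentenceTransformer model for retrieval tasks.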
SparseMultipleNegativesRankingLoss (MNRL)
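SparseMultipleNegativesRankingLoss is a widely used and effective loss function for training sparse models for retrieval. It takes (query, positive document) pairs. For each query in the batch, its paired document is treated as the positive, and the documents from all other pairs in the batch are treated as negatives (in-batch negatives). The loss maximizes the similarity score (e.g., dot product) between the query and its positive document while minimizing the scores between the query and all negative documents.
For sparse models, the output embeddings are sparse and the similarity is usually the dot product. A sufficiently large batch size is essential, because the batch size determines how many informative negatives each query sees.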
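The in-batch negatives objective described above can be sketched as a cross-entropy over a query-by-document score matrix in which the diagonal holds the positive pairs. This is a simplified illustration in plain Python; the function name `in_batch_negatives_loss` and the toy score matrices are hypothetical, and the library implementation may differ in details such as score scaling.

```python
import math


def in_batch_negatives_loss(scores: list[list[float]]) -> float:
    """Mean cross-entropy where scores[i][j] = sim(query_i, doc_j) and doc_i is the positive."""
    total = 0.0
    for i, row in enumerate(scores):
        # Numerically stable log-sum-exp over every document in the batch.
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        # Negative log-probability assigned to the positive document (column i).
        total += log_sum_exp - row[i]
    return total / len(scores)


# Hypothetical 2x2 batches: positives score high (good) vs. negatives score high (bad).
good = [[10.0, 0.0], [0.0, 10.0]]
bad = [[0.0, 10.0], [10.0, 0.0]]
print(in_batch_negatives_loss(good), in_batch_negatives_loss(bad))
```

With a batch of size B, each query is contrasted against B - 1 negatives, so larger batches give the loss more chances to include hard, informative negatives.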
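Inference and Evaluation
Once the sparse retriever is trained, you typically encode the entire document corpus into sparse vectors and store them in an efficient index (e.g., an inverted index).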
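Given a new query:
- Encode the query into its sparse vector using the trained sparse retriever.
- Search the indexed document vectors with this query vector to find the top-k most similar documents (those with the highest dot-product scores).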
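The indexing and search steps above can be sketched without any library. This is a minimal illustration that assumes documents and queries are already encoded as {term: weight} dicts; `build_inverted_index`, `search`, and the toy vectors are hypothetical names and data, not an API of the library discussed here.

```python
import heapq
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each term to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def search(index, query_vec, top_k=2):
    """Accumulate dot-product scores by visiting only the postings of the query's terms."""
    scores = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    # Return the top_k (doc_id, score) pairs, highest score first.
    return heapq.nlargest(top_k, scores.items(), key=lambda item: item[1])


# Hypothetical pre-encoded corpus.
doc_vectors = {
    "d1": {"man": 1.0, "eating": 2.0, "food": 0.5},
    "d2": {"man": 1.0, "horse": 2.0},
    "d3": {"monkey": 1.5, "drums": 2.0},
}
index = build_inverted_index(doc_vectors)
results = search(index, {"man": 2.0, "eating": 1.5}, top_k=2)
print(results)
```

Documents that share no term with the query (here "d3") are never touched, so query cost scales with the lengths of the visited postings lists and not with the corpus size.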
A conceptual example of what inference might look like:
from sentence_transformers import SparseEncoder, util
# 1. Load a trained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# 2. Encode a corpus of texts using the SparseEncoder model
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]
# Use "convert_to_tensor=True" to keep the tensors on GPU (if available)
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
# 3. Encode the user queries using the same SparseEncoder model
queries = [
    "A man is eating pasta.",
    "Someone in a gorilla costume is playing a set of drums.",
    "A cheetah chases prey on across a field.",
]
query_embeddings = model.encode(queries, convert_to_tensor=True)
# 4. Use the similarity function to compute the similarity scores between the query and corpus embeddings
top_k = min(5, len(corpus))  # Find at most 5 sentences of the corpus for each query sentence
results = util.semantic_search(query_embeddings, corpus_embeddings, top_k=top_k, score_function=model.similarity)
# 5. Print the top results for each query, with the tokens that contribute most to each score
for query_id, query in enumerate(queries):
    # Element-wise overlap between this query's embedding and every corpus embedding
    pointwise_scores = model.intersection(query_embeddings[query_id], corpus_embeddings)
    print(f"Query: {query}")
    for res in results[query_id]:
        corpus_id, score = res.values()
        sentence = corpus[corpus_id]
        pointwise_score = model.decode(pointwise_scores[corpus_id], top_k=10)
        token_scores = ", ".join([f'("{token.strip()}", {value:.2f})' for token, value in pointwise_score])
        print(f"Score: {score:.4f} - Sentence: {sentence} - Top influential tokens: {token_scores}")
    print("")
"""
Query: A man is eating pasta.
Score: 21.0064 - Sentence: A man is eating food. - Top influential tokens: ("man", 5.48), ("eating", 3.83), ("eat", 3.15), ("men", 3.12), ("food", 1.78), ("male", 0.87), ("person", 0.62), ("a", 0.39), ("hunger", 0.28), ("meat", 0.27)
Score: 18.2966 - Sentence: A man is eating a piece of bread. - Top influential tokens: ("man", 4.85), ("eating", 3.49), ("eat", 3.02), ("men", 2.74), ("male", 0.68), ("food", 0.66), ("person", 0.58), ("a", 0.51), ("meat", 0.36), ("culture", 0.27)
Score: 10.1537 - Sentence: A man is riding a horse. - Top influential tokens: ("man", 4.85), ("men", 3.11), ("male", 0.68), ("a", 0.60), ("person", 0.59), ("animal", 0.21), ("adam", 0.04), ("sex", 0.03), ("god", 0.02), ("who", 0.01)
Score: 6.5993 - Sentence: A man is riding a white horse on an enclosed ground. - Top influential tokens: ("man", 3.31), ("men", 1.58), ("a", 0.51), ("male", 0.41), ("person", 0.34), ("on", 0.17), ("animal", 0.16), ("wearing", 0.04), ("god", 0.04), ("culture", 0.02)
Score: 5.2185 - Sentence: Two men pushed carts through the woods. - Top influential tokens: ("men", 2.60), ("man", 2.51), ("a", 0.09), ("murder", 0.01), ("said", 0.00)
Query: Someone in a gorilla costume is playing a set of drums.
Score: 16.4688 - Sentence: A monkey is playing drums. - Top influential tokens: ("drums", 4.38), ("drum", 2.27), ("play", 2.16), ("playing", 1.77), ("drummer", 0.80), ("dance", 0.63), ("monkey", 0.55), ("music", 0.48), ("a", 0.40), ("sound", 0.39)
Score: 8.6239 - Sentence: A woman is playing violin. - Top influential tokens: ("play", 2.12), ("playing", 1.79), ("person", 0.67), ("dance", 0.58), ("music", 0.55), ("instrument", 0.52), ("guitar", 0.39), ("a", 0.35), ("wearing", 0.32), ("player", 0.21)
Score: 2.7615 - Sentence: A man is riding a horse. - Top influential tokens: ("person", 0.91), ("a", 0.49), ("man", 0.45), ("animal", 0.37), ("sport", 0.32), ("savage", 0.10), ("billy", 0.06), ("dance", 0.02), ("god", 0.01), ("hunting", 0.01)
Score: 2.4471 - Sentence: A man is eating a piece of bread. - Top influential tokens: ("person", 0.90), ("man", 0.45), ("a", 0.42), ("someone", 0.29), ("animal", 0.08), ("god", 0.07), ("ritual", 0.07), ("culture", 0.07), ("something", 0.05), ("who", 0.03)
Score: 2.3295 - Sentence: A man is riding a white horse on an enclosed ground. - Top influential tokens: ("person", 0.53), ("a", 0.42), ("man", 0.31), ("sport", 0.27), ("animal", 0.27), ("savage", 0.09), ("character", 0.09), ("wearing", 0.07), ("symbol", 0.07), ("hunting", 0.05)
Query: A cheetah chases prey on across a field.
Score: 16.3185 - Sentence: A cheetah is running behind its prey. - Top influential tokens: ("che", 3.80), ("##eta", 3.72), ("prey", 2.77), ("hunting", 0.75), ("behavior", 0.70), ("##h", 0.62), ("movement", 0.45), ("animal", 0.33), ("predator", 0.30), ("chasing", 0.29)
Score: 1.9917 - Sentence: A monkey is playing drums. - Top influential tokens: ("animal", 0.43), ("a", 0.41), ("behavior", 0.28), ("movement", 0.18), ("bird", 0.17), ("dance", 0.16), ("species", 0.07), ("dog", 0.06), ("game", 0.05), ("they", 0.05)
Score: 1.4335 - Sentence: A man is riding a white horse on an enclosed ground. - Top influential tokens: ("a", 0.43), ("animal", 0.35), ("hunting", 0.21), ("movement", 0.17), ("breed", 0.12), ("sport", 0.08), ("bird", 0.04), ("dog", 0.02)
Score: 1.4071 - Sentence: A man is riding a horse. - Top influential tokens: ("a", 0.51), ("animal", 0.48), ("movement", 0.27), ("sport", 0.10), ("hunting", 0.04), ("dance", 0.01)
Score: 1.3531 - Sentence: Two men pushed carts through the woods. - Top influential tokens: ("hunting", 0.49), ("cross", 0.41), ("move", 0.22), ("escape", 0.08), ("a", 0.07), ("across", 0.05), ("obstacle", 0.01), ("deer", 0.01), ("they", 0.01)
"""
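Evaluation is typically performed on benchmark datasets using standard information retrieval metrics such as nDCG@k, MRR@k, Recall@k, and Precision@k. SparseInformationRetrievalEvaluator can be used for this purpose.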
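For intuition about what these metrics measure, the following sketch computes MRR@k, Recall@k, and nDCG@k for a single query, assuming binary relevance labels. The helper functions and the toy ranking are hypothetical; an evaluator would typically average such values over all queries, and its exact conventions may differ.

```python
import math


def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant document within the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear within the top k."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """nDCG with binary relevance: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(k, len(relevant)) + 1))
    return dcg / ideal if ideal > 0 else 0.0


# Hypothetical ranking returned for one query, and its set of relevant documents.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(mrr_at_k(ranked, relevant, 3), recall_at_k(ranked, relevant, 3), ndcg_at_k(ranked, relevant, 3))
```

MRR@k rewards placing the first relevant document early, Recall@k measures coverage of the relevant set, and nDCG@k discounts each relevant hit by the logarithm of its rank.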