Information Retrieval

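Sparse retriever models, typically SPLADE models, map queries and documents to high-dimensional sparse vectors. Given a query, relevant documents are retrieved by computing the dot product (or cosine similarity) between the query's sparse vector and the sparse vector of every document in the collection. In practice, inverted indexes and algorithms that accelerate inference make this process very efficient.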
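To make the scoring step concrete, here is a minimal sketch in plain Python. It assumes a sparse vector can be represented as a dictionary from term to weight; the helper names `sparse_dot` and `sparse_cosine` and the toy weights are hypothetical and are not part of any library.

```python
import math


def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {term: weight} dicts."""
    # Iterate over the shorter vector; only shared terms contribute.
    if len(a) > len(b):
        a, b = b, a
    return sum(weight * b[term] for term, weight in a.items() if term in b)


def sparse_cosine(a, b):
    """Cosine similarity: the dot product divided by both vector norms."""
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sparse_dot(a, b) / (norm_a * norm_b)


# Toy vectors with made-up weights (illustrative only).
query = {"man": 2.0, "eat": 1.5, "pasta": 1.0}
doc_a = {"man": 1.0, "eat": 2.0, "food": 0.5}
doc_b = {"girl": 1.5, "baby": 2.0}

score_a = sparse_dot(query, doc_a)  # shared terms: "man" and "eat"
score_b = sparse_dot(query, doc_b)  # no shared terms, so the score is zero
cosine_a = sparse_cosine(query, doc_a)
```

Because only terms present in both vectors contribute, the cost of a comparison depends on the number of non-zero entries rather than on the vocabulary size.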

Many sparse retriever models are trained on MS MARCO. An example of this training approach is available here:

Sparse Encoder > Training Examples > MS MARCO

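However, you will likely get the best results by training on your own dataset. This page outlines several example training scripts that you can adapt to your data, with a focus on sparse retrieval.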

The example scripts are:

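train_splade_gooaq.py:

This example uses SpladeLoss (which internally uses SparseMultipleNegativesRankingLoss) on (query, positive passage) pairs mined from a dataset such as GooAQ. The goal is to train a SPLADE model so that each query is highly similar to its positive passage and dissimilar to the other passages in the batch (in-batch negatives).

The model's retrieval performance is evaluated on datasets such as MS MARCO, NFCorpus, or NQ with an evaluator such as SparseNanoBEIREvaluator, using appropriate retrieval metrics (e.g., nDCG@k, MRR@k).

train_splade_nq.py:

This example also uses SpladeLoss (again built on SparseMultipleNegativesRankingLoss) and trains on the NQ (Natural Questions) dataset. It shows another configuration for training a SPLADE model for sparse retrieval on question-answering data.

train_csr_nq.py:

This example uses CSRLoss (which internally uses SparseMultipleNegativesRankingLoss) to train a sparse retriever on data from a dataset such as NQ (Natural Questions). The script demonstrates how to train a sparse model for retrieval by placing a SparseAutoEncoder head on top of a SentenceTransformer model.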

SparseMultipleNegativesRankingLoss (MNRL)

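SparseMultipleNegativesRankingLoss is a very common and effective loss function for training sparse retrieval models. It takes (query, positive document) pairs as input. For each query in the batch, its paired document is treated as the positive, and all other documents (from the other pairs in the batch) are treated as negatives (in-batch negatives). The loss aims to maximize the similarity score (e.g., the dot product) between a query and its positive document while minimizing the similarity scores between the query and all negative documents.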

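For sparse models, the output embeddings are sparse and the similarity is usually the dot product. A sufficiently large batch size is crucial, because it determines how many informative negatives each query is compared against.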
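The in-batch negatives idea can be illustrated with a small plain-Python sketch. It assumes the usual cross-entropy formulation of this kind of ranking loss, where row i of a score matrix holds the similarities between query i and every document in the batch, and the diagonal entry is the positive. The function name `in_batch_negatives_loss` and the `scale` parameter are assumptions made for illustration; this is not the library's implementation.

```python
import math


def in_batch_negatives_loss(scores, scale=1.0):
    """Mean cross-entropy over rows, with the diagonal entry as the target.

    scores[i][j] is the similarity between query i and document j.
    Document i is the positive for query i; every other j is a negative.
    """
    total = 0.0
    for i, row in enumerate(scores):
        scaled = [scale * s for s in row]
        # Numerically stable log-sum-exp over all documents in the batch.
        max_score = max(scaled)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scaled))
        # Negative log-probability assigned to the positive document.
        total += log_sum_exp - scaled[i]
    return total / len(scores)


# Positives clearly outscore the in-batch negatives: low loss.
separated = [
    [10.0, 0.0, 0.0],
    [0.0, 10.0, 0.0],
    [0.0, 0.0, 10.0],
]
# Every document scores the same: the model cannot tell positives apart.
uniform = [[1.0, 1.0, 1.0] for _ in range(3)]

loss_separated = in_batch_negatives_loss(separated)
loss_uniform = in_batch_negatives_loss(uniform)  # equals log(batch size)
```

With a batch of size B, each query is contrasted against B - 1 negatives, which is why larger batches tend to provide a stronger training signal.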

Inference and Evaluation

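Once the sparse retriever is trained, you typically encode the entire document corpus into sparse vectors and store them in an efficient index (e.g., an inverted index).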

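Given a new query:

1. Encode the query into its sparse vector using the trained sparse retriever.
2. Use this query vector to search the indexed document vectors for the top k most similar documents (those with the highest dot-product scores).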
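The index-then-search procedure above can be sketched in plain Python. The sketch assumes sparse vectors are available as `{term: weight}` dictionaries (for example, after converting a model's output); `build_inverted_index` and `search` are hypothetical helper names, and production systems would use a dedicated search engine or vector database instead.

```python
import heapq
from collections import defaultdict


def build_inverted_index(doc_vectors):
    """Map each term to a postings list of (doc_id, weight) pairs."""
    index = defaultdict(list)
    for doc_id, vector in enumerate(doc_vectors):
        for term, weight in vector.items():
            index[term].append((doc_id, weight))
    return index


def search(index, query_vector, top_k):
    """Return the top_k (doc_id, dot-product score) pairs for a query."""
    scores = defaultdict(float)
    # Only postings lists of terms active in the query are visited, so
    # documents that share no term with the query are never scored.
    for term, query_weight in query_vector.items():
        for doc_id, doc_weight in index.get(term, ()):
            scores[doc_id] += query_weight * doc_weight
    return heapq.nlargest(top_k, scores.items(), key=lambda item: item[1])


# Toy document vectors with made-up weights (illustrative only).
docs = [
    {"man": 1.0, "eat": 2.0, "food": 0.5},
    {"man": 1.0, "ride": 2.0, "horse": 2.0},
    {"monkey": 1.5, "drum": 2.0},
]
index = build_inverted_index(docs)
hits = search(index, {"man": 2.0, "eat": 1.5}, top_k=2)
```

The accumulated score for each document equals the dot product between the query vector and that document's vector, restricted to the terms they share.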

Example of the inference process (conceptual):

from sentence_transformers import SparseEncoder, util

# 1. Load a pretrained SparseEncoder model (replace with your own trained model)
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# 2. Encode a corpus of texts using the SparseEncoder model
corpus = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
    "A man is riding a horse.",
    "A woman is playing violin.",
    "Two men pushed carts through the woods.",
    "A man is riding a white horse on an enclosed ground.",
    "A monkey is playing drums.",
    "A cheetah is running behind its prey.",
]

# Use "convert_to_tensor=True" to keep the tensors on GPU (if available)
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# 3. Encode the user queries using the same SparseEncoder model
queries = [
    "A man is eating pasta.",
    "Someone in a gorilla costume is playing a set of drums.",
    "A cheetah chases prey on across a field.",
]
query_embeddings = model.encode(queries, convert_to_tensor=True)

# 4. Use the similarity function to compute the similarity scores between the query and corpus embeddings
top_k = min(5, len(corpus))  # Find at most 5 sentences of the corpus for each query sentence
results = util.semantic_search(query_embeddings, corpus_embeddings, top_k=top_k, score_function=model.similarity)

# 5. Print the top results for each query (already sorted by score), with the tokens that contribute most to each score
for query_id, query in enumerate(queries):
    pointwise_scores = model.intersection(query_embeddings[query_id], corpus_embeddings)

    print(f"Query: {query}")
    for res in results[query_id]:
        corpus_id, score = res.values()
        sentence = corpus[corpus_id]

        pointwise_score = model.decode(pointwise_scores[corpus_id], top_k=10)

        token_scores = ", ".join([f'("{token.strip()}", {value:.2f})' for token, value in pointwise_score])

        print(f"Score: {score:.4f} - Sentence: {sentence} - Top influential tokens: {token_scores}")
    print("")

"""
Query: A man is eating pasta.
Score: 21.0064 - Sentence: A man is eating food. - Top influential tokens: ("man", 5.48), ("eating", 3.83), ("eat", 3.15), ("men", 3.12), ("food", 1.78), ("male", 0.87), ("person", 0.62), ("a", 0.39), ("hunger", 0.28), ("meat", 0.27)
Score: 18.2966 - Sentence: A man is eating a piece of bread. - Top influential tokens: ("man", 4.85), ("eating", 3.49), ("eat", 3.02), ("men", 2.74), ("male", 0.68), ("food", 0.66), ("person", 0.58), ("a", 0.51), ("meat", 0.36), ("culture", 0.27)
Score: 10.1537 - Sentence: A man is riding a horse. - Top influential tokens: ("man", 4.85), ("men", 3.11), ("male", 0.68), ("a", 0.60), ("person", 0.59), ("animal", 0.21), ("adam", 0.04), ("sex", 0.03), ("god", 0.02), ("who", 0.01)
Score: 6.5993 - Sentence: A man is riding a white horse on an enclosed ground. - Top influential tokens: ("man", 3.31), ("men", 1.58), ("a", 0.51), ("male", 0.41), ("person", 0.34), ("on", 0.17), ("animal", 0.16), ("wearing", 0.04), ("god", 0.04), ("culture", 0.02)
Score: 5.2185 - Sentence: Two men pushed carts through the woods. - Top influential tokens: ("men", 2.60), ("man", 2.51), ("a", 0.09), ("murder", 0.01), ("said", 0.00)

Query: Someone in a gorilla costume is playing a set of drums.
Score: 16.4688 - Sentence: A monkey is playing drums. - Top influential tokens: ("drums", 4.38), ("drum", 2.27), ("play", 2.16), ("playing", 1.77), ("drummer", 0.80), ("dance", 0.63), ("monkey", 0.55), ("music", 0.48), ("a", 0.40), ("sound", 0.39)
Score: 8.6239 - Sentence: A woman is playing violin. - Top influential tokens: ("play", 2.12), ("playing", 1.79), ("person", 0.67), ("dance", 0.58), ("music", 0.55), ("instrument", 0.52), ("guitar", 0.39), ("a", 0.35), ("wearing", 0.32), ("player", 0.21)
Score: 2.7615 - Sentence: A man is riding a horse. - Top influential tokens: ("person", 0.91), ("a", 0.49), ("man", 0.45), ("animal", 0.37), ("sport", 0.32), ("savage", 0.10), ("billy", 0.06), ("dance", 0.02), ("god", 0.01), ("hunting", 0.01)
Score: 2.4471 - Sentence: A man is eating a piece of bread. - Top influential tokens: ("person", 0.90), ("man", 0.45), ("a", 0.42), ("someone", 0.29), ("animal", 0.08), ("god", 0.07), ("ritual", 0.07), ("culture", 0.07), ("something", 0.05), ("who", 0.03)
Score: 2.3295 - Sentence: A man is riding a white horse on an enclosed ground. - Top influential tokens: ("person", 0.53), ("a", 0.42), ("man", 0.31), ("sport", 0.27), ("animal", 0.27), ("savage", 0.09), ("character", 0.09), ("wearing", 0.07), ("symbol", 0.07), ("hunting", 0.05)

Query: A cheetah chases prey on across a field.
Score: 16.3185 - Sentence: A cheetah is running behind its prey. - Top influential tokens: ("che", 3.80), ("##eta", 3.72), ("prey", 2.77), ("hunting", 0.75), ("behavior", 0.70), ("##h", 0.62), ("movement", 0.45), ("animal", 0.33), ("predator", 0.30), ("chasing", 0.29)
Score: 1.9917 - Sentence: A monkey is playing drums. - Top influential tokens: ("animal", 0.43), ("a", 0.41), ("behavior", 0.28), ("movement", 0.18), ("bird", 0.17), ("dance", 0.16), ("species", 0.07), ("dog", 0.06), ("game", 0.05), ("they", 0.05)
Score: 1.4335 - Sentence: A man is riding a white horse on an enclosed ground. - Top influential tokens: ("a", 0.43), ("animal", 0.35), ("hunting", 0.21), ("movement", 0.17), ("breed", 0.12), ("sport", 0.08), ("bird", 0.04), ("dog", 0.02)
Score: 1.4071 - Sentence: A man is riding a horse. - Top influential tokens: ("a", 0.51), ("animal", 0.48), ("movement", 0.27), ("sport", 0.10), ("hunting", 0.04), ("dance", 0.01)
Score: 1.3531 - Sentence: Two men pushed carts through the woods. - Top influential tokens: ("hunting", 0.49), ("cross", 0.41), ("move", 0.22), ("escape", 0.08), ("a", 0.07), ("across", 0.05), ("obstacle", 0.01), ("deer", 0.01), ("they", 0.01)
"""

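Evaluation is typically performed on benchmark datasets using standard information retrieval metrics such as nDCG@k, MRR@k, Recall@k, and Precision@k. The SparseInformationRetrievalEvaluator can be used for this purpose.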
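For reference, the metrics named above can be computed for a single query as in the following plain-Python sketch. It assumes binary relevance (a document is either relevant or not) and uses hypothetical function names; a full evaluation would average these values over all queries, and an evaluator class may differ in details such as tie handling.

```python
import math


def mrr_at_k(ranked, relevant, k):
    """Reciprocal rank of the first relevant document within the top k."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked, relevant, k):
    """Fraction of the top k positions filled by relevant documents."""
    return len(set(ranked[:k]) & relevant) / k


def ndcg_at_k(ranked, relevant, k):
    """DCG of the ranking divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal if ideal > 0 else 0.0


# Hypothetical ranking for one query; "d1" and "d2" are the relevant documents.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}

mrr = mrr_at_k(ranked, relevant, k=3)
recall = recall_at_k(ranked, relevant, k=3)
precision = precision_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
```

In this toy ranking the first relevant document sits at position 2 and only one of the two relevant documents is in the top 3, so the rank-sensitive metrics (MRR, nDCG) and the set-based ones (Recall, Precision) each penalize a different aspect of the result.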