Computing Sparse Embeddings

Once you have installed Sentence Transformers, you can easily use Sparse Encoder models:

from sentence_transformers import SparseEncoder

# 1. Load a pretrained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# The sentences to encode
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]

# 2. Calculate sparse embeddings by calling model.encode()
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 30522] - sparse representation with vocabulary size dimensions

# 3. Calculate the embedding similarities (using dot product by default)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[   35.629,     9.154,     0.098],
#         [    9.154,    27.478,     0.019],
#         [    0.098,     0.019,    29.553]])

# 4. Check sparsity statistics
stats = SparseEncoder.sparsity(embeddings)
print(f"Sparsity: {stats['sparsity_ratio']:.2%}")  # Typically >99% zeros
print(f"Avg non-zero dimensions per embedding: {stats['active_dims']:.2f}")

Note

Even though we talk about sentence embeddings, you can use Sparse Encoder models for shorter phrases as well as for longer texts with multiple sentences. See Input Sequence Length for notes on embedding longer texts.

Initializing a Sparse Encoder Model

The first step is to load a pretrained Sparse Encoder model. You can use any of the pretrained models or a local model. See also SparseEncoder for information on its parameters.

from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# Alternatively, you can pass a path to a local model directory:
model = SparseEncoder("output/models/sparse-distilbert-nq-finetuned")

The model is automatically placed on the best-performing available device, e.g. cuda or mps if available. You can also specify the device explicitly:

model = SparseEncoder("naver/splade-cocondenser-ensembledistil", device="cuda")

Computing Embeddings

The method to compute embeddings is SparseEncoder.encode.
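
It accepts a number of optional parameters. As a minimal sketch (the values below are illustrative; batch_size controls how many texts are encoded per forward pass, and show_progress_bar displays a progress bar for long inputs):

from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

sentences = ["The weather is lovely today.", "It's so sunny outside!"]

# Encode in batches of 32 and show a progress bar
embeddings = model.encode(sentences, batch_size=32, show_progress_bar=True)
print(embeddings.shape)
# [2, 30522]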

Input Sequence Length

For transformer models like BERT, RoBERTa, DistilBERT, etc., the runtime and memory requirements grow quadratically with the input length. This limits transformers to inputs of a certain length. A common value for BERT models is 512 tokens, which corresponds to roughly 300-400 words (for English).

Each model has a maximum sequence length under model.max_seq_length, which is the maximal number of tokens that can be processed. Longer texts are truncated to the first model.max_seq_length tokens.

from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
print("Max Sequence Length:", model.max_seq_length)
# => Max Sequence Length: 256

# Change the length to 200
model.max_seq_length = 200

print("Max Sequence Length:", model.max_seq_length)
# => Max Sequence Length: 200

Note

You cannot increase the length beyond what is maximally supported by the respective transformer model. Also note that if a model was trained on short texts, the representations for long texts might not be that good.

Controlling Sparsity

For sparse models, you can control the maximum number of active dimensions (non-zero values) in the output embeddings using the max_active_dims parameter. This is particularly useful for reducing memory usage and storage requirements, and for tuning the trade-off between accuracy and retrieval latency.

You can specify max_active_dims either when initializing the model or during encoding:

from sentence_transformers import SparseEncoder

# Initialize the SPLADE model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Embed a list of sentences
sentences = [
   "This framework generates embeddings for each input sentence",
   "Sentences are passed as a list of string.",
   "The quick brown fox jumps over the lazy dog.",
]

# Generate embeddings
embeddings = model.encode(sentences)

# Print embedding dimensionality and sparsity
print(f"Embedding dim: {model.get_sentence_embedding_dimension()}")

stats = model.sparsity(embeddings)
print(f"Embedding sparsity: {stats}")
print(f"Average non-zero dimensions: {stats['active_dims']:.2f}")
print(f"Sparsity percentage: {stats['sparsity_ratio']:.2%}")
"""
Embedding dim: 30522
Embedding sparsity: {'active_dims': 56.333335876464844, 'sparsity_ratio': 0.9981543366792325}
Average non-zero dimensions: 56.33
Sparsity percentage: 99.82%
"""

# Example of using max_active_dims during encoding to limit the active dimensions
embeddings_limited = model.encode(sentences, max_active_dims=32)
stats_limited = model.sparsity(embeddings_limited)
print(f"Limited embedding sparsity: {stats_limited}")
print(f"Average non-zero dimensions: {stats_limited['active_dims']:.2f}")
print(f"Sparsity percentage: {stats_limited['sparsity_ratio']:.2%}")
"""
Limited embedding sparsity: {'active_dims': 32.0, 'sparsity_ratio': 0.9989515759124565}
Average non-zero dimensions: 32.00
Sparsity percentage: 99.90%
"""

When you set max_active_dims, the model keeps only the K dimensions with the highest values and sets all others to zero. This ensures your embeddings retain the most important semantic information while maintaining a controlled level of sparsity.
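
Conceptually, this truncation keeps the K largest entries per row and zeroes out the rest. A minimal sketch of the idea on a dense tensor (illustrative only, not the library's internal implementation):

import torch

def keep_top_k(embeddings: torch.Tensor, k: int) -> torch.Tensor:
    # Keep the k largest values per row, zero out everything else
    top_values, top_indices = torch.topk(embeddings, k=k, dim=-1)
    truncated = torch.zeros_like(embeddings)
    truncated.scatter_(-1, top_indices, top_values)
    return truncated

dense = torch.tensor([[0.0, 2.5, 0.1, 1.7, 0.0, 0.3]])
print(keep_top_k(dense, k=2))
# tensor([[0.0000, 2.5000, 0.0000, 1.7000, 0.0000, 0.0000]])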

Note

Setting max_active_dims too low may degrade the quality of your search results. The optimal value depends on your specific use case and dataset.

One of the key benefits of controlling sparsity with max_active_dims is reduced memory usage. Here is an example showing the memory savings:

def get_sparse_embedding_memory_size(tensor):
    # For sparse tensors, only count non-zero elements
    return (tensor._values().element_size() * tensor._values().nelement() +
           tensor._indices().element_size() * tensor._indices().nelement())

print(f"Original embeddings memory: {get_sparse_embedding_memory_size(embeddings) / 1024:.2f} KB")
print(f"Embeddings with max_active_dims=32 memory: {get_sparse_embedding_memory_size(embeddings_limited) / 1024:.2f} KB")
"""
Original embeddings memory: 3.32 KB
Embeddings with max_active_dims=32 memory: 1.88 KB
"""

As the example shows, limiting the active dimensions to 32 reduces memory usage by about 43%. This efficiency becomes even more significant when processing large document collections, but it must be weighed against a potential loss of quality in the embedding representations. Note that every evaluator class has a max_active_dims parameter that can be set to control the number of active dimensions during evaluation, so you can easily compare performance across different settings; see the sketch below.
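
For example, a sketch of such a comparison, assuming the SparseInformationRetrievalEvaluator class from the evaluation module and toy data (adapt the queries, corpus, and relevance judgments to your own dataset):

from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.evaluation import SparseInformationRetrievalEvaluator

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Toy retrieval data, for illustration only
queries = {"q1": "what is the capital of france"}
corpus = {"d1": "Paris is the capital of France.", "d2": "Berlin is the capital of Germany."}
relevant_docs = {"q1": {"d1"}}

# max_active_dims limits the active dimensions used during evaluation,
# so the same setup can be rerun with different values to compare
# retrieval quality at different sparsity levels
evaluator = SparseInformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    max_active_dims=64,
)
results = evaluator(model)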

Interpretability with SPLADE Models

A key advantage of working with SPLADE models is their interpretability. You can easily visualize which tokens contribute most to an embedding, giving insight into what the model considers important in your text.

from sentence_transformers import SparseEncoder

# Initialize the SPLADE model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Embed a list of sentences
sentences = [
   "This framework generates embeddings for each input sentence",
   "Sentences are passed as a list of string.",
   "The quick brown fox jumps over the lazy dog.",
]

# Generate embeddings
embeddings = model.encode(sentences)

# Visualize top tokens for each text
top_k = 10

token_weights = model.decode(embeddings, top_k=top_k)

print(f"\nTop tokens {top_k} for each text:")
# The result is a list of sentence embeddings as numpy arrays
for i, sentence in enumerate(sentences):
   token_scores = ", ".join([f'("{token.strip()}", {value:.2f})' for token, value in token_weights[i]])
   print(f"{i}: {sentence} -> Top tokens:  {token_scores}")

"""
Top 10 tokens for each text:
   0: This framework generates embeddings for each input sentence -> Top tokens:  ("framework", 2.19), ("##bed", 2.12), ("input", 1.99), ("each", 1.60), ("em", 1.58), ("sentence", 1.49), ("generate", 1.42), ("##ding", 1.33), ("sentences", 1.10), ("create", 0.93)
   1: Sentences are passed as a list of string. -> Top tokens:  ("string", 2.72), ("pass", 2.24), ("sentences", 2.15), ("passed", 2.07), ("sentence", 1.90), ("strings", 1.86), ("list", 1.84), ("lists", 1.49), ("as", 1.18), ("passing", 0.73)
   2: The quick brown fox jumps over the lazy dog. -> Top tokens:  ("lazy", 2.18), ("fox", 1.67), ("brown", 1.56), ("over", 1.52), ("dog", 1.50), ("quick", 1.49), ("jump", 1.39), ("dogs", 1.25), ("foxes", 0.99), ("jumping", 0.84)
"""

This interpretability helps you understand why certain documents match or don't match in search applications, and provides transparency into the model's behavior.
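
For instance, you can decode both a query and a document embedding and look at the tokens they share; the overlapping tokens and their weights show what drives the match. A sketch reusing the decode API from above (assuming decode on a single embedding returns a list of (token, weight) tuples):

from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Encode a query and a candidate document
query_embedding = model.encode_query("capital of france")
document_embedding = model.encode_document("Paris is the capital and largest city of France.")

# Decode both embeddings into (token, weight) pairs
query_tokens = dict(model.decode(query_embedding, top_k=20))
document_tokens = dict(model.decode(document_embedding, top_k=20))

# Tokens active in both embeddings explain why the document matches
for token in sorted(set(query_tokens) & set(document_tokens), key=query_tokens.get, reverse=True):
    print(f"{token!r}: query={query_tokens[token]:.2f}, document={document_tokens[token]:.2f}")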

Multi-Process / Multi-GPU Encoding

You can encode input texts with more than one GPU (or with multiple processes on a CPU machine). This usually helps considerably with large datasets, but for small datasets the overhead of starting multiple processes can be significant.

You can use SparseEncoder.encode() (or SparseEncoder.encode_query() and SparseEncoder.encode_document()) in combination with either:

  • The device parameter, which can be set to e.g. "cuda:0" or "cpu" for single-process computation, but also to a list of devices for multi-process or multi-GPU computation, e.g. ["cuda:0", "cuda:1"] or ["cpu", "cpu", "cpu", "cpu"]:

    from sentence_transformers import SparseEncoder
    
    def main():
        model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
        # The texts to encode; replace with your own (large) corpus
        inputs = [
            "The weather is lovely today.",
            "It's so sunny outside!",
        ]
        # Encode with multiple GPUs
        embeddings = model.encode(
            inputs,
            device=["cuda:0", "cuda:1"]  # or ["cpu", "cpu", "cpu", "cpu"]
        )
    
    if __name__ == "__main__":
        main()
    
  • The pool parameter, which can be provided after calling SparseEncoder.start_multi_process_pool() with a list of devices, e.g. ["cuda:0", "cuda:1"] or ["cpu", "cpu", "cpu", "cpu"]. The benefit is that the pool can be reused across multiple calls to SparseEncoder.encode(), which is considerably more efficient than starting a new pool for each call:

    from sentence_transformers import SparseEncoder
    
    def main():
        model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
        # The texts to encode; replace with your own (large) corpus
        inputs = [
            "The weather is lovely today.",
            "It's so sunny outside!",
        ]
        # Start a multi-process pool with multiple GPUs
        pool = model.start_multi_process_pool(devices=["cuda:0", "cuda:1"])
        # Encode with multiple GPUs
        embeddings = model.encode(inputs, pool=pool)
        # Don't forget to stop the pool after usage
        model.stop_multi_process_pool(pool)
    
    if __name__ == "__main__":
        main()
    

此外,您可以使用 chunk_size 参数来控制发送到每个进程的块大小。这与 batch_size 参数不同。例如,当 chunk_size=1000batch_size=32 时,输入文本将被分成 1000 个文本的块,每个块将被发送到一个进程,并每次以 32 个文本的批次进行嵌入。这有助于内存管理和性能,特别是对于大型数据集。