Semantic Textual Similarity

For Semantic Textual Similarity (STS), we want to produce sparse embeddings for all texts involved and compute the similarities between them. The text pairs with the highest similarity scores are the most semantically similar.

from sentence_transformers import SparseEncoder

# Initialize the SPLADE model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Two lists of sentences
sentences1 = [
    "The new movie is awesome",
    "The cat sits outside",
    "A man is playing guitar",
]

sentences2 = [
    "The dog plays in the garden",
    "The new movie is so great",
    "A woman watches TV",
]

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1)
embeddings2 = model.encode(sentences2)

# Compute similarities (for a SparseEncoder, the default is the dot product)
similarities = model.similarity(embeddings1, embeddings2)

# Output the pairs with their score
for idx_i, sentence1 in enumerate(sentences1):
    print(sentence1)
    for idx_j, sentence2 in enumerate(sentences2):
        print(f" - {sentence2: <30}: {similarities[idx_i][idx_j]:.4f}")
The new movie is awesome
 - The dog plays in the garden   : 1.1750
 - The new movie is so great     : 24.0100
 - A woman watches TV            : 0.1358
The cat sits outside
 - The dog plays in the garden   : 2.7264
 - The new movie is so great     : 0.6256
 - A woman watches TV            : 0.2129
A man is playing guitar
 - The dog plays in the garden   : 7.5841
 - The new movie is so great     : 0.0316
 - A woman watches TV            : 1.5672

In this example, the SparseEncoder.similarity method returns a 3x3 matrix with the respective similarity scores (dot product, the default for sparse encoders) for all possible pairs of embeddings1 and embeddings2.
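Because the result is a regular tensor, finding each sentence's best match is a one-liner with argmax. A small sketch reusing the dot-product scores printed above (the matrix is hard-coded here so it runs without loading the model):

```python
import torch

# Hypothetical: the 3x3 dot-product score matrix printed above.
similarities = torch.tensor([
    [1.1750, 24.0100, 0.1358],
    [2.7264,  0.6256, 0.2129],
    [7.5841,  0.0316, 1.5672],
])

# For each sentence in sentences1, the index of its best match in sentences2
best = similarities.argmax(dim=1)
print(best)  # tensor([1, 0, 0])
```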

Similarity Calculation

The similarity metric used is stored under SparseEncoder.similarity_fn_name on the SparseEncoder instance. Valid options are:

  • SimilarityFunction.DOT_PRODUCT (a.k.a. “dot”): Dot product (default)

  • SimilarityFunction.COSINE (a.k.a. “cosine”): Cosine similarity

  • SimilarityFunction.EUCLIDEAN (a.k.a. “euclidean”): Negative Euclidean distance

  • SimilarityFunction.MANHATTAN (a.k.a. “manhattan”): Negative Manhattan distance
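To make the score scales concrete, here is a minimal plain-torch sketch (illustrative only, not the library's internal implementation) computing all four metrics on one pair of small vectors:

```python
import torch
import torch.nn.functional as F

# Illustrative only: two tiny dense vectors standing in for sparse embeddings.
a = torch.tensor([[1.0, 2.0, 0.0]])
b = torch.tensor([[2.0, 2.0, 1.0]])

dot = a @ b.T                     # dot product: 6.0
cos = F.cosine_similarity(a, b)   # cosine similarity: ~0.894
euc = -torch.cdist(a, b, p=2)     # negative Euclidean distance: ~-1.414
man = -torch.cdist(a, b, p=1)     # negative Manhattan distance: -2.0
```

Note that the Euclidean and Manhattan options are negated distances, so for every option a larger score still means "more similar".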

This value can be changed in a few ways:

  1. By initializing the SparseEncoder instance with the desired similarity function

    from sentence_transformers import SparseEncoder, SimilarityFunction
    
    model = SparseEncoder(
        "naver/splade-cocondenser-ensembledistil",
        similarity_fn_name=SimilarityFunction.COSINE,
    )
    
  2. By setting the value directly on the SparseEncoder instance

    from sentence_transformers import SparseEncoder, SimilarityFunction
    
    model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
    model.similarity_fn_name = SimilarityFunction.COSINE
    
  3. By setting the value under the "similarity_fn_name" key in the config_sentence_transformers.json file of a saved model. When you save a sparse encoder model, this value is automatically saved as well.
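For option 3, the relevant part of config_sentence_transformers.json might look like the following (illustrative fragment; the file's other keys are omitted here):

```json
{
    "similarity_fn_name": "cosine"
}
```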

The SparseEncoder class implements two methods to compute the similarity between embeddings:

from sentence_transformers import SparseEncoder, SimilarityFunction

# Load a pretrained Sparse Encoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# Embed some sentences
sentences = [
    "The weather is lovely today.",
    "It's so sunny outside!",
    "He drove to the stadium.",
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(model.similarity_fn_name)
# => "dot"
print(similarities)
# tensor([[   35.629,     9.154,     0.098],
#         [    9.154,    27.478,     0.019],
#         [    0.098,     0.019,    29.553]])

# Change the similarity function to cosine similarity
model.similarity_fn_name = SimilarityFunction.COSINE
print(model.similarity_fn_name)
# => "cosine"

similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[    1.000,     0.293,     0.003],
#         [    0.293,     1.000,     0.001],
#         [    0.003,     0.001,     1.000]])
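Besides similarity, Sentence Transformers also exposes a similarity_pairwise method that scores aligned pairs (the i-th embedding of one list against the i-th of the other) instead of every combination; verify the exact name against your installed version. For equally sized lists, its result corresponds to the diagonal of the full similarity matrix. A plain-torch sketch of that relationship, reusing the dot-product matrix shown above:

```python
import torch

# The 3x3 dot-product similarity matrix printed above.
matrix = torch.tensor([
    [35.629,  9.154,  0.098],
    [ 9.154, 27.478,  0.019],
    [ 0.098,  0.019, 29.553],
])

# Pairwise scores: embedding i of one list against embedding i of the other,
# i.e. the diagonal of the full matrix.
pairwise = matrix.diagonal()
print(pairwise)  # tensor([35.6290, 27.4780, 29.5530])
```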