Semantic Textual Similarity
For Semantic Textual Similarity (STS), we want to produce sparse embeddings for all texts involved and compute the similarities between them. The text pairs with the highest similarity scores are the most semantically similar.
from sentence_transformers import SparseEncoder
# Initialize the SPLADE model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# Two lists of sentences
sentences1 = [
"The new movie is awesome",
"The cat sits outside",
"A man is playing guitar",
]
sentences2 = [
"The dog plays in the garden",
"The new movie is so great",
"A woman watches TV",
]
# Compute embeddings for both lists
embeddings1 = model.encode(sentences1)
embeddings2 = model.encode(sentences2)
# Compute similarities (dot product by default for sparse encoders)
similarities = model.similarity(embeddings1, embeddings2)
# Output the pairs with their score
for idx_i, sentence1 in enumerate(sentences1):
print(sentence1)
for idx_j, sentence2 in enumerate(sentences2):
print(f" - {sentence2: <30}: {similarities[idx_i][idx_j]:.4f}")
The new movie is awesome
- The dog plays in the garden : 1.1750
- The new movie is so great : 24.0100
- A woman watches TV : 0.1358
The cat sits outside
- The dog plays in the garden : 2.7264
- The new movie is so great : 0.6256
- A woman watches TV : 0.2129
A man is playing guitar
- The dog plays in the garden : 7.5841
- The new movie is so great : 0.0316
- A woman watches TV : 1.5672
In this example, the SparseEncoder.similarity method returns a 3x3 matrix with the respective similarity scores (dot product, the default for sparse encoders) for all possible pairs between embeddings1 and embeddings2.
Similarity Calculation
The similarity metric that is used is stored under SparseEncoder.similarity_fn_name on the SparseEncoder instance. Valid options are:
SimilarityFunction.DOT_PRODUCT (a.k.a. "dot"): dot product (default)
SimilarityFunction.COSINE (a.k.a. "cosine"): cosine similarity
SimilarityFunction.EUCLIDEAN (a.k.a. "euclidean"): negative Euclidean distance
SimilarityFunction.MANHATTAN (a.k.a. "manhattan"): negative Manhattan distance
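To make these options concrete, here is a minimal pure-Python sketch (toy vectors, not the library's implementation) of what each metric computes for a pair of dense vectors. Note that the distance-based options are negated, so that "higher score" always means "more similar":

```python
import math

def dot(a, b):
    # Dot product: sum of element-wise products
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Cosine similarity: dot product of the L2-normalized vectors
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot(a, b) / (norm_a * norm_b)

def neg_euclidean(a, b):
    # Negative Euclidean distance: higher means more similar
    return -math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def neg_manhattan(a, b):
    # Negative Manhattan distance: higher means more similar
    return -sum(abs(x - y) for x, y in zip(a, b))

a, b = [1.0, 2.0, 0.0], [1.0, 2.0, 2.0]
print(dot(a, b))            # 5.0
print(round(cosine(a, b), 4))  # 0.7454
print(neg_euclidean(a, b))  # -2.0
print(neg_manhattan(a, b))  # -2.0
```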
This value can be changed in several ways:
1. By initializing the SparseEncoder instance with the desired similarity function:

from sentence_transformers import SparseEncoder, SimilarityFunction

model = SparseEncoder(
    "naver/splade-cocondenser-ensembledistil",
    similarity_fn_name=SimilarityFunction.COSINE,
)
2. By setting the value directly on the SparseEncoder instance:

from sentence_transformers import SparseEncoder, SimilarityFunction

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
model.similarity_fn_name = SimilarityFunction.COSINE
3. By setting the value under the "similarity_fn_name" key in the config_sentence_transformers.json file of a saved model. When you save a sparse encoder model, this value is saved automatically as well.
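For reference, the relevant part of config_sentence_transformers.json might look roughly like the following; this is a minimal sketch showing only the key discussed here, and a real config file contains additional keys:

```json
{
  "similarity_fn_name": "cosine"
}
```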
The SparseEncoder class implements two methods to calculate the similarity between embeddings:

SparseEncoder.similarity: calculates the similarity between all pairs of embeddings.
SparseEncoder.similarity_pairwise: calculates the similarity between embeddings in a pairwise fashion.
from sentence_transformers import SparseEncoder, SimilarityFunction
# Load a pretrained Sparse Encoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# Embed some sentences
sentences = [
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium.",
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(model.similarity_fn_name)
# => "dot"
print(similarities)
# tensor([[ 35.629, 9.154, 0.098],
# [ 9.154, 27.478, 0.019],
# [ 0.098, 0.019, 29.553]])
# Change the similarity function to cosine similarity
model.similarity_fn_name = SimilarityFunction.COSINE
print(model.similarity_fn_name)
# => "cosine"
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ 1.000, 0.293, 0.003],
# [ 0.293, 1.000, 0.001],
# [ 0.003, 0.001, 1.000]])
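The difference between the two methods can be sketched without loading a model: with the default dot product, similarity scores every combination of the two lists and returns an N x M matrix, while similarity_pairwise scores only matching positions and returns a length-N vector (the matrix diagonal when both inputs line up). A minimal pure-Python illustration with toy dense vectors (not the library API itself):

```python
def dot(a, b):
    # Dot product between two vectors
    return sum(x * y for x, y in zip(a, b))

emb1 = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
emb2 = [[2.0, 1.0, 0.0], [1.0, 1.0, 1.0]]

# similarity-style: all pairs -> 2 x 2 matrix
matrix = [[dot(a, b) for b in emb2] for a in emb1]

# similarity_pairwise-style: matching positions only -> length-2 vector
pairwise = [dot(a, b) for a, b in zip(emb1, emb2)]

print(matrix)    # [[2.0, 3.0], [3.0, 4.0]]
print(pairwise)  # [2.0, 4.0]
```

Use similarity_pairwise when your two lists are already aligned (e.g. gold sentence pairs in an STS evaluation), and similarity when you need every combination scored.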