Semantic Textual Similarity
For Semantic Textual Similarity (STS), we want to produce embeddings for all texts involved and calculate the similarities between them. The text pairs with the highest similarity scores are the most semantically similar. See also the Computing Embeddings documentation for more advanced details on getting embedding scores.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Two lists of sentences
sentences1 = [
    "The new movie is awesome",
    "The cat sits outside",
    "A man is playing guitar",
]
sentences2 = [
    "The dog plays in the garden",
    "The new movie is so great",
    "A woman watches TV",
]

# Compute embeddings for both lists
embeddings1 = model.encode(sentences1)
embeddings2 = model.encode(sentences2)

# Compute cosine similarities
similarities = model.similarity(embeddings1, embeddings2)

# Output the pairs with their score
for idx_i, sentence1 in enumerate(sentences1):
    print(sentence1)
    for idx_j, sentence2 in enumerate(sentences2):
        print(f" - {sentence2: <30}: {similarities[idx_i][idx_j]:.4f}")
The new movie is awesome
- The dog plays in the garden : 0.0543
- The new movie is so great : 0.8939
- A woman watches TV : -0.0502
The cat sits outside
- The dog plays in the garden : 0.2838
- The new movie is so great : -0.0029
- A woman watches TV : 0.1310
A man is playing guitar
- The dog plays in the garden : 0.2277
- The new movie is so great : -0.0136
- A woman watches TV : -0.0327
In this example, the SentenceTransformer.similarity method returns a 3x3 matrix with the cosine similarity scores for all possible pairs between embeddings1 and embeddings2.
Similarity Calculation
The similarity metric that is used is stored on the SentenceTransformer instance under SentenceTransformer.similarity_fn_name. Valid options are:

- SimilarityFunction.COSINE (a.k.a. "cosine"): Cosine Similarity (default)
- SimilarityFunction.DOT_PRODUCT (a.k.a. "dot"): Dot Product
- SimilarityFunction.EUCLIDEAN (a.k.a. "euclidean"): Negative Euclidean Distance
- SimilarityFunction.MANHATTAN (a.k.a. "manhattan"): Negative Manhattan Distance
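The four metrics above can be sketched in plain NumPy. This is an illustrative approximation of what the library computes internally, not its actual implementation:

```python
# Minimal NumPy sketch of the four similarity metrics (illustrative only).
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 1.0])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # in [-1, 1]
dot = a @ b                                               # unbounded
neg_euclidean = -np.linalg.norm(a - b)                    # higher = closer
neg_manhattan = -np.abs(a - b).sum()                      # higher = closer

print(cosine, dot, neg_euclidean, neg_manhattan)
```

Note that the two distance metrics are negated so that, like cosine and dot product, a higher score always means more similar.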
This value can be changed in a few ways:

1. By initializing the SentenceTransformer instance with the desired similarity function:

from sentence_transformers import SentenceTransformer, SimilarityFunction

model = SentenceTransformer("all-MiniLM-L6-v2", similarity_fn_name=SimilarityFunction.DOT_PRODUCT)

2. By setting the value directly on the SentenceTransformer instance:

from sentence_transformers import SentenceTransformer, SimilarityFunction

model = SentenceTransformer("all-MiniLM-L6-v2")
model.similarity_fn_name = SimilarityFunction.DOT_PRODUCT

3. By setting the value under the "similarity_fn_name" key in the config_sentence_transformers.json file of a saved model. When you save a Sentence Transformer model, this value will be automatically saved as well.
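As a sketch of the third option, the relevant key in config_sentence_transformers.json might look like the snippet below. This is a hedged illustration showing only that one key; the real file contains additional fields (such as library version information):

```python
# Hedged sketch: the "similarity_fn_name" key as it could appear inside
# config_sentence_transformers.json (the real file holds more fields).
import json

config = {"similarity_fn_name": "dot"}
text = json.dumps(config, indent=2)
print(text)
```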
Sentence Transformers implements two methods to calculate the similarity between embeddings:

- SentenceTransformer.similarity: Calculates the similarity between all pairs of embeddings.
- SentenceTransformer.similarity_pairwise: Calculates the similarity between embeddings in a pairwise fashion.
from sentence_transformers import SentenceTransformer, SimilarityFunction
# Load a pretrained Sentence Transformer model
model = SentenceTransformer("all-MiniLM-L6-v2")
# Embed some sentences
sentences = [
"The weather is lovely today.",
"It's so sunny outside!",
"He drove to the stadium.",
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6660, 0.1046],
# [0.6660, 1.0000, 0.1411],
# [0.1046, 0.1411, 1.0000]])
# Change the similarity function to Manhattan distance
model.similarity_fn_name = SimilarityFunction.MANHATTAN
print(model.similarity_fn_name)
# => "manhattan"
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[ -0.0000, -12.6269, -20.2167],
# [-12.6269, -0.0000, -20.1288],
# [-20.2167, -20.1288, -0.0000]])
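Independent of the library, the distinction between the two methods can be sketched with plain NumPy using cosine similarity. The function names here are illustrative stand-ins, not library APIs: the "matrix" variant scores every (i, j) pair, while the "pairwise" variant only scores matched rows (i, i):

```python
# Sketch of "all pairs" vs. "pairwise" similarity (illustrative only).
import numpy as np

def cos_matrix(x, y):
    # Normalize rows, then one matmul yields every (i, j) pair: shape (n, m)
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    return xn @ yn.T

def cos_pairwise(x, y):
    # Only matched rows (i, i): shape (n,)
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    yn = y / np.linalg.norm(y, axis=1, keepdims=True)
    return (xn * yn).sum(axis=1)

x = np.random.rand(3, 4)
y = np.random.rand(3, 4)
print(cos_matrix(x, y).shape)    # (3, 3)
print(cos_pairwise(x, y).shape)  # (3,)
```

The pairwise scores are exactly the diagonal of the full matrix, computed without the wasted off-diagonal work.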
Note

If a Sentence Transformer instance ends with a Normalize module, then it is sensible to choose the "dot" metric instead of "cosine". The dot product of normalized embeddings is equivalent to cosine similarity, but "cosine" will re-normalize the embeddings again. As a result, the "dot" metric will be faster than "cosine".
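The equivalence stated in the note can be checked numerically with a small NumPy sketch, using random vectors in place of real embeddings:

```python
# Sketch: after L2-normalization, dot product and cosine similarity coincide,
# so the "dot" metric can skip a redundant renormalization step.
import numpy as np

rng = np.random.default_rng(0)
emb = rng.random((4, 8))
normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # mimic a Normalize module

dot_scores = normed @ normed.T  # plain dot product on normalized vectors
# Cosine similarity renormalizes (a no-op here) before the dot product:
renormed = normed / np.linalg.norm(normed, axis=1, keepdims=True)
cos_scores = renormed @ renormed.T

print(np.allclose(dot_scores, cos_scores))  # True
```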
If you want to find the highest scoring pairs in a long list of sentences, have a look at Paraphrase Mining.