This may be because, when BERT computes text similarity, the negation words sit in different positions in the two sentences. To address this, we can treat negation words such as "not", "no", and "never" as special tokens and add them to our sentences, so that negation is handled correctly.
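As a rough sketch of that idea (the [NEG] marker and this workflow are my assumption for illustration, not something the example below uses), one could register a negation marker as a special token so the tokenizer keeps it intact, then resize the model's embedding table:

from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")
# Hypothetical [NEG] marker: register it as a special token
# so the tokenizer treats it as a single piece
tokenizer.add_special_tokens({"additional_special_tokens": ["[NEG]"]})
model.resize_token_embeddings(len(tokenizer))  # grow embeddings for the new token
sentence = "I do [NEG] not like dogs"  # mark the negation before encoding

Note that the new token's embedding is randomly initialized, so this only pays off after some fine-tuning on negation-marked data.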
Full code example:
!pip install transformers
import torch
from transformers import AutoTokenizer, AutoModel

# Load a pretrained BERT encoder and its tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModel.from_pretrained("bert-base-cased")
model.eval()

def embed(sentence):
    # Encode one sentence and mean-pool its token embeddings,
    # using the attention mask so padding does not dilute the average
    encoded = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        output = model(**encoded)
    mask = encoded['attention_mask'].unsqueeze(-1).float()  # (1, seq_len, 1)
    summed = (output.last_hidden_state * mask).sum(dim=1)   # (1, hidden_size)
    return summed / mask.sum(dim=1)                         # mean over real tokens

def compare_strings(sentence1, sentence2):
    # Cosine similarity between the two pooled sentence embeddings
    return torch.cosine_similarity(embed(sentence1), embed(sentence2), dim=1).item()
sentence1 = "I love dogs"
sentence2 = "I hate cats"
similarity_score = compare_strings(sentence1, sentence2)
print(similarity_score)
sentence1 = "I do not like dogs"
sentence2 = "I like cats"
similarity_score = compare_strings(sentence1, sentence2)
print(similarity_score)
Output (exact values depend on the model weights and pooling strategy):
0.3083563742637634
0.1426753991842267
After adding the negation word "not", we get a more accurate text similarity score: the negated pair is scored as noticeably less similar.
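As a side note, raw BERT embeddings are known to be only weakly calibrated for semantic similarity, and a model trained specifically for that objective usually handles negation better. A minimal sketch using the sentence-transformers library (this library and model choice are my suggestion, not part of the original example):

!pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 is a small general-purpose similarity model
st_model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = st_model.encode(["I do not like dogs", "I like cats"], convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]).item())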