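To compute TF-IDF (term frequency-inverse document frequency) weights for a collection of documents, you can use Python's scikit-learn library. Here is an example: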
from sklearn.feature_extraction.text import TfidfVectorizer
# Define the collection of documents
documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]
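# Create a TfidfVectorizer object
vectorizer = TfidfVectorizer()
# Fit the vectorizer on the documents and transform them into a sparse TF-IDF matrix
X = vectorizer.fit_transform(documents)
# Get the feature names (the words in the vocabulary)
feature_names = vectorizer.get_feature_names_out()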
# Print the TF-IDF weight of every vocabulary word for each document
for i in range(len(documents)):
    print("Document:", i + 1)
    for j in range(len(feature_names)):
        print("Word:", feature_names[j])
        print(" TF-IDF:", X[i, j])
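Running the code above prints, for each document, the TF-IDF weight of every word in the vocabulary. Note that this is a single combined weight per word, not separate term-frequency and inverse-document-frequency values. Each weight is a float; with TfidfVectorizer's default settings each document's row is L2-normalized, and a word that does not appear in a document gets a weight of 0.0.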
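If you want to see where those weights come from, the following is a minimal pure-Python sketch of the same calculation. It assumes TfidfVectorizer's default settings (lowercasing, tokens of two or more word characters, smoothed IDF computed as ln((1 + n) / (1 + df)) + 1, raw counts as term frequency, and L2 normalization of each row). The helper names `tokenize` and `tf_idf` are made up for this illustration and are not part of scikit-learn.

```python
import math
import re
from collections import Counter

documents = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]


def tokenize(text):
    # Assumption: mimic the default tokenizer by lowercasing and
    # keeping only tokens made of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


def tf_idf(docs):
    tokenized = [tokenize(d) for d in docs]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs
    df = Counter(t for tokens in tokenized for t in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {t: tf[t] * idf[t] for t in tf}
        # L2-normalize the row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return vocabulary, idf, rows


vocabulary, idf, rows = tf_idf(documents)
for i, row in enumerate(rows, start=1):
    print("Document:", i)
    for word in vocabulary:
        if word in row:
            print(f"  {word}: idf={idf[word]:.4f}, tf-idf={row[word]:.4f}")
```

Unlike the loop above, this sketch prints only the words that actually occur in each document, and it shows the IDF next to the final weight so the two can be compared.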