Using the tokenize module in NLTK, it is easy to convert text between different tokenization schemes, for example between word-level and sentence-level tokens. Example:
from nltk.tokenize import TreebankWordTokenizer, PunktSentenceTokenizer
# The original text
text = "This is a sentence. Another sentence! And yet another..."
# Word-tokenize the text with TreebankWordTokenizer
tokens = TreebankWordTokenizer().tokenize(text)
# Re-tokenize the same text at sentence level with PunktSentenceTokenizer
# (tokenize_sents expects a list of texts, so the text is wrapped in a list)
new_tokens = PunktSentenceTokenizer().tokenize_sents([text])
# Convert the sentence-level tokens back into TreebankWordTokenizer's word-level format
original_tokens = [TreebankWordTokenizer().tokenize(" ".join(sents)) for sents in new_tokens]
print("原始分词:", tokens)
print("新分词:", new_tokens)
print("转换回原始分词:", original_tokens)
输出:
原始分词: ['This', 'is', 'a', 'sentence.', 'Another', 'sentence', '!', 'And', 'yet', 'another', '...']
新分词: [['This is a sentence.', 'Another sentence!', 'And yet another...']]
转换回原始分词: [['This', 'is', 'a', 'sentence.', 'Another', 'sentence', '!', 'And', 'yet', 'another', '...']]
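Note that rejoining tokens with " ".join only approximates the original text once the tokens are individual words rather than whole sentences, since it cannot restore the spacing around punctuation such as '!'. For a more faithful inverse of Treebank word tokenization, NLTK provides TreebankWordDetokenizer. Below is a minimal sketch of using it on the word tokens from the example above (the variable name restored is illustrative):

from nltk.tokenize.treebank import TreebankWordDetokenizer
# Reattach punctuation to the surrounding words,
# e.g. ['Another', 'sentence', '!'] -> 'Another sentence!'
restored = TreebankWordDetokenizer().detokenize(tokens)
print(restored)  # close to the original text, though not guaranteed byte-identical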