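BeautifulSoup 4 does not resolve external DTD entities by default. Named HTML entities such as &copy; are a different case: they are predefined character references rather than external DTD entities, and BeautifulSoup 4 converts them to Unicode characters automatically while parsing. The convert_charrefs parameter of Python's html.parser.HTMLParser class controls the same kind of conversion when that class is used directly; it does not enable external DTD processing.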
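To make the role of convert_charrefs concrete, here is a minimal sketch that uses only the standard library. TextCollector and collect are hypothetical names introduced for illustration; they are not part of BeautifulSoup or html.parser.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: records text chunks and unconverted entity names."""

    def __init__(self, convert_charrefs):
        super().__init__(convert_charrefs=convert_charrefs)
        self.chunks = []
        self.entity_names = []

    def handle_data(self, data):
        # Receives text; with convert_charrefs=True, entities arrive already converted.
        self.chunks.append(data)

    def handle_entityref(self, name):
        # Only called for named entities when convert_charrefs is False.
        self.entity_names.append(name)


def collect(markup, convert_charrefs):
    """Parse markup and return (collected text, list of unconverted entity names)."""
    parser = TextCollector(convert_charrefs)
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks), parser.entity_names


if __name__ == "__main__":
    sample = "<p>Here is an entity: &copy;</p>"
    print(collect(sample, True))
    print(collect(sample, False))
```

With convert_charrefs=True the entity reaches handle_data as the © character. With convert_charrefs=False the parser reports the name copy through handle_entityref and leaves the conversion to the caller.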
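Here is an example:
from bs4 import BeautifulSoup
html = '''
Example
This is an example.
Here is an entity: &copy;
'''
# BeautifulSoup 4 converts named entities such as &copy; while parsing.
# convertEntities and BeautifulSoup.HTML_ENTITIES are BeautifulSoup 3 names, so they are not passed here.
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify())
In the example above, no entity-related argument is passed to the constructor. The convertEntities argument and the BeautifulSoup.HTML_ENTITIES constant come from BeautifulSoup 3; BeautifulSoup 4 does not honor convertEntities and always converts entities to Unicode characters. The &copy; entity is therefore parsed as the © symbol.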
The output is:
Example
This is an example.
Here is an entity: ©
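If the goal is the reverse, that is, writing entities back out instead of literal characters, BeautifulSoup 4 handles that on output through its formatter argument. The sketch below assumes the bs4 package is installed; roundtrip is a hypothetical helper name, and the sample markup wraps the text in a <p> tag purely for illustration.

```python
from bs4 import BeautifulSoup


def roundtrip(markup):
    """Parse markup, then return (text, Unicode serialization, entity serialization)."""
    soup = BeautifulSoup(markup, "html.parser")
    text = soup.get_text()                       # entities already converted to Unicode
    as_unicode = soup.decode()                   # default "minimal" formatter keeps the character
    as_entities = soup.decode(formatter="html")  # "html" formatter emits named entities where it can
    return text, as_unicode, as_entities


if __name__ == "__main__":
    for part in roundtrip("<p>Here is an entity: &copy;</p>"):
        print(part)
```

The default formatter escapes only what is needed to keep the markup valid, so the © character is written as-is; formatter="html" replaces it with a named entity on output.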