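The problem is most likely an encoding mismatch rather than a limit in BeautifulSoup itself, which is not restricted to ASCII when parsing XML. One thing to try is the lxml XML parser, which handles a wide range of encodings, including UTF-8 and ISO-8859-1.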
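Example code:
from bs4 import BeautifulSoup
import requests

url = "https://example.com/feed.xml"  # placeholder URL; replace with your own

# Use lxml's XML parser ('xml'); the plain 'lxml' feature selects the HTML parser.
# Passing the raw bytes (response.content) lets the parser honor the encoding
# declared in the document instead of the one requests guessed for response.text.
response = requests.get(url)
soup = BeautifulSoup(response.content, 'xml')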
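To see why handing the parser raw bytes can matter, here is a minimal sketch using only the standard library. The payload is made up for illustration, and `xml.etree.ElementTree` stands in for BeautifulSoup so the snippet runs without extra packages; it shows the general effect, not what BeautifulSoup does internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical payload: an XML document declared and encoded as UTF-8.
raw = '<?xml version="1.0" encoding="UTF-8"?><title>café</title>'.encode("utf-8")

# Decoding the bytes with the wrong codec, as a bad guess would, gives mojibake.
wrong_text = raw.decode("iso-8859-1")

# Reversing the wrong decode and applying the right codec recovers the text.
repaired_text = wrong_text.encode("iso-8859-1").decode("utf-8")

# Given the raw bytes, the parser reads the declared encoding itself.
title_from_bytes = ET.fromstring(raw).text

print(repr(wrong_text))
print(title_from_bytes)
```

If the parsed text shows sequences like `Ã©` where an accented letter should be, a wrong decode of this kind is a plausible cause.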
Another approach is to subclass BeautifulSoup so that its encoding detection never settles on Latin-1, for example:
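from bs4 import BeautifulSoup
import requests

url = "https://example.com/feed.xml"  # placeholder URL; replace with your own


# Subclass that always excludes Latin-1 from encoding detection
class MyBeautifulSoup(BeautifulSoup):
    def __init__(self, markup="", features=None, builder=None,
                 parse_only=None, from_encoding=None, exclude_encodings=None,
                 **kwargs):
        # Copy the list so the caller's argument is not modified in place.
        # Both spellings are listed on the assumption that encoding names
        # are matched as plain strings.
        exclude_encodings = list(exclude_encodings or []) + ['latin1', 'iso-8859-1']
        super().__init__(markup, features, builder, parse_only,
                         from_encoding=from_encoding,
                         exclude_encodings=exclude_encodings,
                         **kwargs)


# Use the subclass. exclude_encodings only affects byte input, so pass
# response.content rather than the already-decoded response.text.
response = requests.get(url)
soup = MyBeautifulSoup(response.content, 'xml')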
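The idea behind excluding an encoding can be sketched with the standard library alone. The helper `decode_excluding` below is hypothetical and is not BeautifulSoup's detection algorithm; it only illustrates why a permissive codec such as ISO-8859-1, which accepts any byte sequence, can win the guess and produce garbage unless it is skipped.

```python
def decode_excluding(raw, candidates, excluded=()):
    """Decode raw with the first candidate codec that works, skipping excluded names.

    Hypothetical helper for illustration; returns (text, encoding_name).
    """
    skip = {name.lower() for name in excluded}
    for name in candidates:
        if name.lower() in skip:
            continue
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("no usable encoding among the candidates")


raw = "naïve café".encode("utf-8")

# ISO-8859-1 never raises, so trying it first "succeeds" with the wrong text.
text_bad, enc_bad = decode_excluding(raw, ["iso-8859-1", "utf-8"])

# With ISO-8859-1 excluded, the next candidate (UTF-8) decodes correctly.
text_ok, enc_ok = decode_excluding(raw, ["iso-8859-1", "utf-8"],
                                   excluded=["iso-8859-1"])

print(enc_bad, repr(text_bad))
print(enc_ok, repr(text_ok))
```

Note that exclusion only helps when the real encoding is among the remaining candidates; if the document genuinely is Latin-1, excluding it would make things worse.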