fix(document_parser): fix parsing failures caused by invalid Unicode characters in documents
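- Added the "replace" error handler when encoding document content as UTF-8
- This prevents certain special characters from raising encoding errors
- Updated both affected call sites so that content is uploaded to MinIO correctly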
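As a minimal sketch of the behaviour this fix relies on: a string containing a lone surrogate cannot be encoded as UTF-8 under the default strict handler, while errors="replace" substitutes "?" for it. The helper name encode_for_upload and the sample string are hypothetical, not taken from the repository.

```python
def encode_for_upload(content: str) -> bytes:
    """Hypothetical helper: encode text as UTF-8, replacing unencodable characters with '?'."""
    return content.encode("utf-8", errors="replace")


# A lone surrogate (U+D800) is one kind of character that strict UTF-8 encoding rejects.
bad = "chunk \ud800 text"

try:
    bad.encode("utf-8")  # default handler is "strict"
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With errors="replace" the call succeeds and the surrogate becomes "?".
safe = encode_for_upload(bad)
```

Note that the replacement is lossy: the stored object contains "?" where the invalid character was.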
parent 2bc8d5b818
commit 2249ef3083
@@ -550,8 +550,8 @@ def perform_parse(doc_id, doc_info, file_info, embedding_config, kb_info):
 minio_client.put_object(
     bucket_name=output_bucket,
     object_name=chunk_id,
-    data=BytesIO(content.encode("utf-8")),
-    length=len(content.encode("utf-8")),  # use the byte length
+    data=BytesIO(content.encode("utf-8", errors="replace")),
+    length=len(content.encode("utf-8", errors="replace")),
 )
 
 # Prepare the ES document
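The change encodes the content twice, once for data and once for length. One possible variant, sketched here under assumptions, encodes once and reuses the bytes so the two arguments cannot drift apart. The helper build_upload_payload is hypothetical and is not part of this commit.

```python
from io import BytesIO
from typing import Tuple


def build_upload_payload(content: str) -> Tuple[BytesIO, int]:
    """Hypothetical helper: encode once, return the stream and its byte length."""
    payload = content.encode("utf-8", errors="replace")
    return BytesIO(payload), len(payload)


data, length = build_upload_payload("doc \ud800 chunk")

# Assumed usage, with the names from the diff above:
# minio_client.put_object(
#     bucket_name=output_bucket,
#     object_name=chunk_id,
#     data=data,
#     length=length,
# )
```

The length is taken from the encoded bytes rather than from the string, which matters because multi-byte characters make the two differ.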