fix(document_parser): Fix parsing failure caused by invalid Unicode characters in documents

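- Pass the error-handling argument errors="replace" when encoding document content as UTF-8
- This avoids the encoding errors raised by characters that cannot be encoded as UTF-8 (in Python, lone surrogate code points)
- Update both affected lines so that the content is uploaded to MinIO correctly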
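
As a minimal sketch of the failure this commit addresses: the sample string below is hypothetical, not taken from a real document. A lone surrogate is a legal character in a Python `str` but has no UTF-8 encoding, so the default strict mode raises, while `errors="replace"` substitutes `?`.

```python
# Hypothetical sample content containing a lone surrogate (U+D800).
content = "chunk text \ud800 more text"

try:
    content.encode("utf-8")  # default is errors="strict"
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# errors="replace" swaps each unencodable character for "?".
safe_bytes = content.encode("utf-8", errors="replace")
print(strict_failed, safe_bytes)
```

Note that the replacement is lossy: the offending character is stored as `?` rather than preserved.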
zstar 2025-06-10 11:37:03 +08:00
parent 2bc8d5b818
commit 2249ef3083
1 changed file with 2 additions and 2 deletions

@@ -550,8 +550,8 @@ def perform_parse(doc_id, doc_info, file_info, embedding_config, kb_info):
 minio_client.put_object(
     bucket_name=output_bucket,
     object_name=chunk_id,
-    data=BytesIO(content.encode("utf-8")),
-    length=len(content.encode("utf-8")),  # use the byte length
+    data=BytesIO(content.encode("utf-8", errors="replace")),
+    length=len(content.encode("utf-8", errors="replace")),
 )
 # Prepare the ES document
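
The changed lines encode `content` twice, once for `data` and once for `length`. A possible follow-up, sketched here under the assumption that nothing else depends on the current form, is to encode once and reuse the bytes so the stream and its declared length cannot drift apart. The helper name `encode_for_upload` is hypothetical and not part of the repository.

```python
from io import BytesIO


def encode_for_upload(content: str) -> tuple[BytesIO, int]:
    """Encode text once and return a stream plus its byte length.

    Hypothetical helper: unencodable characters are replaced with "?".
    """
    payload = content.encode("utf-8", errors="replace")
    return BytesIO(payload), len(payload)


# The returned pair maps onto the data= and length= arguments of put_object.
stream, length = encode_for_upload("abc\ud800")
```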