fix(document_parser): fix parsing failures caused by invalid Unicode characters in documents
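- Added the "replace" error-handling argument when encoding document content as UTF-8
- This avoids encoding errors triggered by certain special characters
- Updated two related code locations so that content is uploaded to MinIO correctly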
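To illustrate the failure mode this commit addresses, here is a minimal sketch. It assumes the offending input is an unpaired surrogate code point, which is one common kind of "invalid Unicode" in a Python `str`; the actual characters that triggered the bug are not stated in the commit. The sample string and the helper name `strict_encode_fails` are hypothetical.

```python
# Hypothetical sample: a string containing an unpaired (lone) surrogate.
# Python allows this in a str, but the strict UTF-8 encoder rejects it.
text = "chunk \ud83d text"


def strict_encode_fails(s: str) -> bool:
    """Return True if encoding with the default (strict) handler raises."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


strict_failed = strict_encode_fails(text)

# With errors="replace", the unencodable character becomes "?" instead of raising.
replaced = text.encode("utf-8", errors="replace")
print(strict_failed, replaced)
```

Note that `errors="replace"` is lossy: the original character cannot be recovered from the stored bytes, which is usually acceptable for text that was already malformed.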
parent 2bc8d5b818
commit 2249ef3083
@@ -550,8 +550,8 @@ def perform_parse(doc_id, doc_info, file_info, embedding_config, kb_info):
 minio_client.put_object(
     bucket_name=output_bucket,
     object_name=chunk_id,
-    data=BytesIO(content.encode("utf-8")),
-    length=len(content.encode("utf-8")),  # use the byte length
+    data=BytesIO(content.encode("utf-8", errors="replace")),
+    length=len(content.encode("utf-8", errors="replace")),
 )

 # Prepare the ES document
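The change encodes `content` twice with identical arguments, once for `data` and once for `length`. A possible refinement, sketched below under the assumption that nothing else depends on the two separate calls, is to encode once and reuse the bytes so the declared length always matches the stream. The helper name `to_upload_payload` and the sample string are hypothetical and not part of this commit.

```python
from io import BytesIO
from typing import Tuple


def to_upload_payload(content: str) -> Tuple[BytesIO, int]:
    """Hypothetical helper: encode once, return (stream, byte length).

    Uses errors="replace" so unencodable characters become "?" rather
    than raising UnicodeEncodeError.
    """
    payload = content.encode("utf-8", errors="replace")
    return BytesIO(payload), len(payload)


# Sample content with a lone surrogate (assumed shape of the bad input).
stream, length = to_upload_payload("文档 \udc80 chunk")
print(length)
```

The returned pair could then be passed as `data=stream, length=length` to the `minio_client.put_object(...)` call shown in the diff, keeping the byte length and the uploaded bytes derived from a single encoding step.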