Fix csv reader encoding issue #53
… personal/ranxia/csv_reader
… personal/ranxia/csv_reader # Conflicts: # poetry.lock
    ),
    ".xlsx": PaiPandasExcelReader(
        concat_rows=self.reader_config.get("concat_rows", False),
        pandas_config={
            "encoding": self.reader_config.get("encoding", "GB18030")
Would loading a UTF-8-encoded Chinese file with GB18030 cause problems?
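Yes, it can. Although GB18030 covers the full Unicode range, its byte-level encoding scheme is different from UTF-8. It would be best to expose a field on the front-end page where users can specify the encoding.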
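A minimal sketch of the mismatch discussed above, using only the standard library. The sample string and the helper name `decodes_cleanly` are illustrative assumptions and are not part of this PR. It checks whether bytes decode back to the expected text under a given encoding, treating both a decode error and silently wrong characters as failure.

```python
def decodes_cleanly(raw: bytes, encoding: str, expected: str) -> bool:
    """Return True only if `raw` decodes under `encoding` to exactly `expected`."""
    try:
        return raw.decode(encoding) == expected
    except UnicodeDecodeError:
        # Some byte sequences are simply invalid in the wrong encoding.
        return False


sample = "中文数据"  # hypothetical cell content
utf8_bytes = sample.encode("utf-8")
gb_bytes = sample.encode("gb18030")

# Matching encodings round-trip; reading UTF-8 bytes as GB18030 does not.
print(decodes_cleanly(utf8_bytes, "utf-8", sample))
print(decodes_cleanly(gb_bytes, "gb18030", sample))
print(decodes_cleanly(utf8_bytes, "gb18030", sample))
```

Because a wrong guess can fail quietly rather than raise, a hard-coded default is risky for either encoding, which is the argument for letting the user supply it.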
@@ -82,7 +83,10 @@ def load_data(
         if self._concat_rows:
             return [Document(text="\n".join(text_list), metadata=metadata)]
         else:
-            return [Document(text=text, metadata=metadata) for text in text_list]
+            return [
+                Document(text=text, metadata={**metadata, **{"row_number": i}})
Why not just metadata["row_number"] = i? Also, shouldn't this use i+1?
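A minimal sketch of the two points raised here, under stated assumptions: the `Document` dataclass is a hypothetical stand-in for the reader's real class, and `rows_to_documents` is an illustrative name. One caveat with assigning `metadata["row_number"] = i` directly: if every `Document` keeps a reference to the same dict, all rows would end up sharing one `row_number`. Building a fresh dict per row avoids that, and `enumerate(..., start=1)` gives 1-based numbering without writing `i + 1`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for the reader's Document class."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def rows_to_documents(text_list: List[str], metadata: Dict[str, Any]) -> List[Document]:
    # start=1 yields 1-based row numbers; the dict literal creates a new
    # metadata dict per row instead of mutating the shared one.
    return [
        Document(text=text, metadata={**metadata, "row_number": i})
        for i, text in enumerate(text_list, start=1)
    ]


base = {"file_name": "example.csv"}  # hypothetical metadata
docs = rows_to_documents(["a,1", "b,2", "c,3"], base)
print([d.metadata["row_number"] for d in docs])
```

The simpler `{**metadata, "row_number": i}` is equivalent to the nested `{**metadata, **{"row_number": i}}` in the diff.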
    df = pd.read_csv(f, **self._pandas_config)
    encoding = chardet.detect(f.read(100000))["encoding"]
    f.seek(0)
    if encoding.upper() in ["GB18030", "GBK"]:
GB2312 probably needs to be included here as well.
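A minimal sketch of how the check could cover the whole GB family. The body of the `if` branch is not visible in this hunk, so mapping these names to GB18030 is an assumption, as are the helper name `normalize_encoding` and the `utf-8` fallback. It also guards against a detail of the hunk above: `chardet.detect` can report `None` as the encoding when it cannot decide, in which case `encoding.upper()` would raise `AttributeError`.

```python
from typing import Optional

# GB18030 is designed as a backward-compatible extension of GB2312 and GBK,
# so decoding any of the three with GB18030 is a widening, not a guess.
_GB_FAMILY = {"GB2312", "GBK", "GB18030"}


def normalize_encoding(detected: Optional[str], default: str = "utf-8") -> str:
    """Map a detected encoding name to the one passed on to the reader."""
    if not detected:
        # Detection failed (None or empty string): fall back to the default.
        return default
    if detected.upper() in _GB_FAMILY:
        return "GB18030"
    return detected


print(normalize_encoding("GB2312"))
print(normalize_encoding(None))
```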
    try:
        df = pd.read_csv(f, **self._pandas_config)
    except UnicodeDecodeError:
        print(f"Error: The file {file} encoding could not be decoded.")
The exception needs to be re-raised here.
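A minimal sketch of the log-then-re-raise pattern being requested. To stay self-contained it uses a plain `open()` call as a stand-in for `pd.read_csv`, and the `read_text` helper, the logger, and the temporary file are all illustrative assumptions. Without the `raise`, the error would be swallowed and the code after the `try` block would continue without a successfully loaded result.

```python
import logging
import os
import tempfile

logger = logging.getLogger(__name__)


def read_text(path: str, encoding: str) -> str:
    """Stand-in for the pd.read_csv call: decode a file or fail loudly."""
    try:
        with open(path, encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError:
        logger.error("The file %s could not be decoded as %s.", path, encoding)
        raise  # bare raise re-raises the original exception with its traceback


# Demo: write GB18030 bytes, then try to read them as UTF-8.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write("中文".encode("gb18030"))
demo_path = tmp.name

try:
    read_text(demo_path, "utf-8")
    propagated = False
except UnicodeDecodeError:
    propagated = True
finally:
    os.remove(demo_path)

print(propagated)
```

Logging through `logging` rather than `print` is a separate design choice. The key line for this review comment is the bare `raise`.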
    try:
        df = pd.read_csv(file, **self._pandas_config)
    except UnicodeDecodeError:
        print(f"Error: The file {file} encoding could not be decoded.")
Same here: add a raise line.
… personal/ranxia/csv_reader
* tabular reader
* tabular reader
* tabular reader
* tabular reader
* tabular reader
* tabular reader
* BugFix:add an encoding parameter
* BugFix:add an encoding parameter
* BugFix:add an encoding parameter
* BugFix:add an encoding parameter
* BugFix: add row number to metadata
* BugFix: add row number to metadata
---------
Co-authored-by: Yue Fei <[email protected]>
No description provided.