Fix csv reader encoding issue #53

Merged
merged 20 commits into from
Jun 6, 2024

Conversation

Ceceliachenen
Collaborator

No description provided.

@Ceceliachenen Ceceliachenen requested review from wwxxzz and moria97 June 6, 2024 08:14
@moria97 moria97 changed the title Personal/ranxia/csv reader Fix csv reader encoding issue Jun 6, 2024
),
".xlsx": PaiPandasExcelReader(
concat_rows=self.reader_config.get("concat_rows", False),
pandas_config={
"encoding": self.reader_config.get("encoding", "GB18030")
Collaborator

Will loading a UTF-8 encoded Chinese file with GB18030 cause problems?

Collaborator Author

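Yes, it can cause problems. Although GB18030 covers the Unicode character set, its byte encoding is different from UTF-8. Ideally the front-end page should expose a field where the user can specify the encoding.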
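To illustrate the concern in this thread, here is a minimal sketch using only the Python standard library. It shows that UTF-8 bytes decoded as GB18030 do not give back the original text. It also shows one possible mitigation: trying a strict UTF-8 decode before falling back to GB18030. The helper `decode_with_fallback` is hypothetical and is not part of this PR.

```python
text = "中文"
utf8_bytes = text.encode("utf-8")

# GB18030 and UTF-8 agree on ASCII but use different byte sequences for
# Chinese characters, so decoding UTF-8 bytes as GB18030 gives wrong text.
wrong = utf8_bytes.decode("gb18030", errors="replace")
print(wrong == text)  # False


def decode_with_fallback(data: bytes, encodings=("utf-8", "gb18030")) -> str:
    """Hypothetical helper: try each encoding strictly, in the given order."""
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for name in encodings:
        try:
            return data.decode(name)
        except UnicodeDecodeError as exc:
            last_error = exc
    # Nothing worked: surface the last decoding error to the caller.
    raise last_error
```

UTF-8 is tried first because strict UTF-8 decoding usually rejects GB-encoded Chinese text, while GB18030 often accepts arbitrary bytes without complaint. A user-supplied encoding, as suggested above, would still take priority over any such guess.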


github-actions bot commented Jun 6, 2024

☂️ Python Coverage

current status: ✅

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
|-------|---------|----------|-----------|--------|
| 2664  | 1614    | 61%      | 50%       | 🟢     |

New Files

No new covered files...

Modified Files

| File | Coverage | Status |
|------|----------|--------|
| src/pai_rag/integrations/readers/pai_csv_reader.py | 74% | 🟢 |
| src/pai_rag/integrations/readers/pai_excel_reader.py | 89% | 🟢 |
| TOTAL | 82% | 🟢 |

updated for commit: 3adcb39 by action🐍

poetry.lock
@@ -82,7 +83,10 @@ def load_data(
if self._concat_rows:
return [Document(text="\n".join(text_list), metadata=metadata)]
else:
return [Document(text=text, metadata=metadata) for text in text_list]
return [
Document(text=text, metadata={**metadata, **{"row_number": i}})
Collaborator

Why not just set metadata["row_number"] = i directly? Also, shouldn't this be i + 1?
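A small sketch of the trade-off raised here, using plain dicts rather than the `Document` class. Assigning into one shared `metadata` dict leaves every row pointing at the same object, while copying per row keeps the row numbers distinct. Both function names are hypothetical. The 1-based numbering via `enumerate(..., start=1)` is an assumption that follows the reviewer's `i + 1` suggestion.

```python
def row_metadata_shared(metadata: dict, rows: list) -> list:
    """Assigns into one dict; every list entry is the same object."""
    out = []
    for i, _ in enumerate(rows, start=1):
        metadata["row_number"] = i  # also mutates the caller's dict
        out.append(metadata)
    return out


def row_metadata_copied(metadata: dict, rows: list) -> list:
    """Builds a fresh dict per row, leaving the input untouched."""
    return [{**metadata, "row_number": i} for i, _ in enumerate(rows, start=1)]


rows = ["a,1", "b,2", "c,3"]
shared = row_metadata_shared({"file_name": "example.csv"}, rows)
copied = row_metadata_copied({"file_name": "example.csv"}, rows)
print([m["row_number"] for m in shared])  # [3, 3, 3]
print([m["row_number"] for m in copied])  # [1, 2, 3]
```

So a direct assignment is only safe if a new dict is created for each row first. The `{**metadata, "row_number": i}` form does the copy and the assignment in one step.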

df = pd.read_csv(f, **self._pandas_config)
encoding = chardet.detect(f.read(100000))["encoding"]
f.seek(0)
if encoding.upper() in ["GB18030", "GBK"]:
Collaborator

GB2312 may need to be included here as well.
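One way to cover the whole GB family is sketched below. It assumes the intent of the branch is to read such files with a single superset codec; the body of the `if` is not shown in this diff, so that is a guess. The helper `normalize_encoding` and its `default` parameter are hypothetical. The sketch also guards against a detector returning no encoding at all, in which case calling `.upper()` on the result would fail.

```python
# GB18030 is a superset of GBK, which in turn extends GB2312, so one codec
# can read files detected as any of the three.
GB_FAMILY = {"GB2312", "GBK", "GB18030"}


def normalize_encoding(detected, default="utf-8"):
    """Hypothetical helper: map a detected encoding name to the codec to use."""
    if not detected:
        # Detection can fail and yield None; fall back to the default.
        return default
    if detected.upper() in GB_FAMILY:
        return "gb18030"
    return detected.lower()


print(normalize_encoding("GB2312"))  # gb18030
print(normalize_encoding(None))      # utf-8
```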

try:
df = pd.read_csv(f, **self._pandas_config)
except UnicodeDecodeError:
print(f"Error: The file {file} encoding could not be decoded.")
Collaborator

This exception needs to be raised here.
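The pattern being requested can be sketched as follows: report the failure, then re-raise so the caller sees it. Without the re-raise, execution continues past the `except` block with `df` never assigned. To stay self-contained, this example reads a plain text file with the standard library instead of calling `pd.read_csv`. The function name `read_text_strict` is hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def read_text_strict(path, encoding="utf-8"):
    """Hypothetical helper: read a file, logging and re-raising decode errors."""
    try:
        with open(path, encoding=encoding) as f:
            return f.read()
    except UnicodeDecodeError:
        logger.error("The file %s could not be decoded as %s.", path, encoding)
        # A bare `raise` re-raises the active exception with its traceback.
        raise
```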

try:
df = pd.read_csv(file, **self._pandas_config)
except UnicodeDecodeError:
print(f"Error: The file {file} encoding could not be decoded.")
Collaborator

Same here: add a raise line.

@moria97 moria97 merged commit 2756bfb into feature Jun 6, 2024
1 check passed
@moria97 moria97 deleted the personal/ranxia/csv_reader branch June 7, 2024 07:09
moria97 added a commit that referenced this pull request Jun 14, 2024
* tabular reader

* tabular reader

* tabular reader

* tabular reader

* tabular reader

* tabular reader

* BugFix:add an encoding parameter

* BugFix:add an encoding parameter

* BugFix:add an encoding parameter

* BugFix:add an encoding parameter

* BugFix: add row number to metadata

* BugFix: add row number to metadata

---------

Co-authored-by: Yue Fei <[email protected]>

2 participants