feat: Add Chunking Strategies: Regex and Substring Methods #735
Hi @NailaRais, I'll convert this to a draft PR for now.
Updated chunk_text.go and chunk_text_test.go as per the tasks.json file. Thank you.
I have added it now.
Hi @NailaRais, thanks for your contribution. There are two things you should read before working on this PR.
After you read those documents, I think you will know how to approach this PR.
@NailaRais, in this ticket we do not need to modify the original code. You need to add a new execution function called data cleansing.
Got it.
I have a quick question: should I also revert the other changes? I updated those files because tasks.json defines the tasks that the chunk_text.go component executes, such as the text chunking methods. That file informs main.go, which initializes the application and integrates the components. The chunk_text_test.go and main_test.go files contain tests that verify the chunk_text.go logic and the overall application behavior. Together, the task definitions, implementation, and tests are interdependent, so they need to change together for the application to work as intended. Am I correct, @chuang8511? If so, is this what you need?
Hi @NailaRais,
Yes, I think this Go code is what I meant. I will review it carefully later.
Before that, could you clean up your PR based on the guideline? Thank you again!
0ab782c to 8043dc5
Because
This PR adds Regex and Substring chunking methods to our text processing capabilities. Regex chunking lets users define custom splitting rules with regular expressions, and Substring chunking lets them extract text between predefined indices. Both give users more flexibility when handling complex text formats. The update also aligns with the project requirement to support diverse text processing strategies and lays the groundwork for future enhancements based on user feedback.
This commit
- Implemented the Regex chunking method, allowing users to specify custom regular expression patterns for text splitting.
  - Introduced properties for `chunk-size`, `chunk-overlap`, `model-name`, and `pattern` to configure the chunking behavior.
- Implemented the Substring chunking method, enabling users to define start and end indices for chunking the text.
  - Introduced properties for `chunk-size`, `chunk-overlap`, `model-name`, `start-index`, and `end-index` for detailed configuration.
- Enhanced the existing JSON schema to include the new chunking strategies within the `strategy` properties.
- Ensured all new properties are properly documented and formatted for clarity and usability.
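For reviewers, here is a minimal sketch of what the two strategies could look like in Go. It is not the code in this PR: the function names `regexChunks` and `substringChunk` are hypothetical, the indices are assumed to be rune offsets with an exclusive end, and `chunk-size`, `chunk-overlap`, and `model-name` are not modeled here.

```go
package main

import (
	"fmt"
	"regexp"
)

// regexChunks splits text at every match of pattern and drops empty pieces.
// Hypothetical helper for illustration; not the function added in this PR.
func regexChunks(text, pattern string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid pattern %q: %w", pattern, err)
	}
	var chunks []string
	for _, piece := range re.Split(text, -1) {
		if piece != "" {
			chunks = append(chunks, piece)
		}
	}
	return chunks, nil
}

// substringChunk returns the runes in [start, end), clamping both bounds to
// the text length. Assumption: indices count runes, not bytes.
func substringChunk(text string, start, end int) (string, error) {
	if start < 0 || end < start {
		return "", fmt.Errorf("invalid range [%d, %d)", start, end)
	}
	runes := []rune(text)
	if start > len(runes) {
		start = len(runes)
	}
	if end > len(runes) {
		end = len(runes)
	}
	return string(runes[start:end]), nil
}

func main() {
	chunks, err := regexChunks("alpha, beta;gamma", `[,;]\s*`)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", chunks) // prints ["alpha" "beta" "gamma"]

	sub, err := substringChunk("héllo world", 0, 5)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", sub) // prints "héllo"
}
```

Working on runes rather than bytes keeps multi-byte characters intact when slicing. Whether the real implementation should clamp out-of-range indices or return an error is a design choice left to the review.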