feat: Add Chunking Strategies: Regex and Substring Methods #735

Closed
wants to merge 0 commits into from

NailaRais

Because

This PR adds Regex and Substring chunking methods to our text processing capabilities so that users can customize text handling to their needs. Defining chunking rules through regular expressions or predefined indices offers more flexibility, especially when dealing with complex text formats. The update also aligns with our project requirements for diverse text processing strategies and lays the groundwork for future enhancements based on user feedback.

This commit

  1. Added Regex Chunking Method:

Implemented the Regex chunking method, allowing users to specify custom regular expression patterns for text splitting.
Introduced properties for chunk-size, chunk-overlap, model-name, and pattern to configure the chunking behavior.

  2. Added Substring Chunking Method:

Implemented the Substring chunking method, enabling users to define start and end indices for chunking the text.
Introduced properties for chunk-size, chunk-overlap, model-name, start-index, and end-index for detailed configuration.

  3. Updated JSON Schema:

Enhanced the existing JSON schema to include the new chunking strategies within the strategy properties.
Ensured that all new properties are properly documented and formatted for clarity and usability.
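
To make the two strategies concrete, here is a minimal sketch in Go of what regex splitting and index-based substring extraction could look like. The function names splitByRegex and substringChunk are hypothetical, and the sketch ignores chunk-size, chunk-overlap, and model-name; it is an illustration under those assumptions, not the code from this PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// splitByRegex is a hypothetical helper: it splits text wherever pattern
// matches and drops empty pieces. An invalid pattern is returned as an error.
func splitByRegex(text, pattern string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid pattern %q: %w", pattern, err)
	}
	var chunks []string
	for _, piece := range re.Split(text, -1) {
		if piece != "" {
			chunks = append(chunks, piece)
		}
	}
	return chunks, nil
}

// substringChunk is a hypothetical helper: it returns the runes in
// [start, end), clamping both indices to the bounds of the text.
func substringChunk(text string, start, end int) string {
	runes := []rune(text)
	if start < 0 {
		start = 0
	}
	if end > len(runes) {
		end = len(runes)
	}
	if start >= end {
		return ""
	}
	return string(runes[start:end])
}

func main() {
	// Split on runs of two or more newlines (paragraph breaks).
	chunks, err := splitByRegex("alpha\n\nbeta\n\n\ngamma", `\n{2,}`)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(chunks), chunks) // prints "3 [alpha beta gamma]"

	// Indices count runes, not bytes, so multi-byte characters stay intact.
	fmt.Println(substringChunk("héllo world", 0, 5)) // prints "héllo"
}
```

Indexing by runes rather than bytes is a design choice of this sketch; whether start-index and end-index are meant to count bytes, runes, or tokens is not stated in the PR description.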

@NailaRais NailaRais changed the title Add Chunking Strategies: Regex and Substring Methods Improvement by add Chunking Strategies: Regex and Substring Methods Oct 14, 2024
@NailaRais NailaRais changed the title Improvement by add Chunking Strategies: Regex and Substring Methods feat: Add Chunking Strategies: Regex and Substring Methods Oct 14, 2024
@kuroxx kuroxx linked an issue Oct 14, 2024 that may be closed by this pull request
@chuang8511
Contributor

Hi @NailaRais
Is this PR ready for review?
It seems there is no Golang code to process the function defined in the JSON schema.

So, let me convert it to a draft PR.

@chuang8511 chuang8511 marked this pull request as draft October 16, 2024 14:47
@NailaRais
Author

Updated:

Chunk_text.go and Chunk_text_test.go as per task.json file.

Thank you

Hi @NailaRais Is this PR ready for review? It seems there is no Golang code to process the function defined in the JSON schema.

So, let me convert it to a draft PR.

I have added it now.

@chuang8511
Contributor

Hi @NailaRais ,

Thanks for your contribution.

There are 2 things you need to read before continuing with this PR.

  1. Please read this part. It describes what I mean by writing the Golang code.
  2. Please read this part to follow the git guideline.

After you read those documents, I think you will know how to proceed with this PR.
Thank you again!

@chuang8511
Contributor

@NailaRais
Please refer to this JSON schema to implement the Golang execution logic.

In this ticket, we do not have to modify the original code. You need to add a new execution function called data cleansing.

@NailaRais
Author

@NailaRais Please refer to this JSON schema to implement the Golang execution logic.

In this ticket, we do not have to modify the original code. You need to add a new execution function called data cleansing.

Got it

@NailaRais
Author

NailaRais commented Oct 16, 2024

@NailaRais Please refer to this JSON schema to implement the Golang execution logic.

In this ticket, we do not have to modify the original code. You need to add a new execution function called data cleansing.

I have a quick question - should I also revert the other changes? I updated those because the tasks.json file defines the tasks that the chunk_text.go component will execute, such as text chunking methods. This file informs main.go, which initializes the application and integrates the components. The chunk_text_test.go and main_test.go files contain tests that ensure the functionality of the chunk_text.go logic and the overall application behavior. Together, these files create a workflow where the task definitions, implementation, and testing are interdependent, ensuring the application works as intended. Am I correct? @chuang8511

If I'm correct you need this?

package main

import (
	"regexp"
	"strings"
)

// DataCleaningSetting defines the configuration for data cleansing
type DataCleaningSetting struct {
	CleanMethod     string   `json:"clean-method"` // "Regex" or "Substring"
	ExcludePatterns []string `json:"exclude-patterns,omitempty"`
	IncludePatterns []string `json:"include-patterns,omitempty"`
	ExcludeSubstrs  []string `json:"exclude-substrings,omitempty"`
	IncludeSubstrs  []string `json:"include-substrings,omitempty"`
	CaseSensitive   bool     `json:"case-sensitive,omitempty"`
}

// CleanDataInput defines the input structure for the data cleansing task
type CleanDataInput struct {
	Texts   []string            `json:"texts"`   // Array of text to be cleaned
	Setting DataCleaningSetting `json:"setting"` // Cleansing configuration
}

// CleanDataOutput defines the output structure for the data cleansing task
type CleanDataOutput struct {
	CleanedTexts []string `json:"texts"` // Array of cleaned text
}

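// cleanTextUsingRegex filters the input texts using regular expressions based on the given settings.
// A text is dropped if it matches any exclude pattern; if include patterns are set,
// a text is kept only if it matches at least one of them.
func cleanTextUsingRegex(inputTexts []string, settings DataCleaningSetting) []string {
	// Compile each pattern once up front rather than once per text.
	// Note: regexp.MustCompile panics on an invalid pattern.
	excludes := make([]*regexp.Regexp, 0, len(settings.ExcludePatterns))
	for _, pattern := range settings.ExcludePatterns {
		excludes = append(excludes, regexp.MustCompile(pattern))
	}
	includes := make([]*regexp.Regexp, 0, len(settings.IncludePatterns))
	for _, pattern := range settings.IncludePatterns {
		includes = append(includes, regexp.MustCompile(pattern))
	}

	var cleanedTexts []string
	for _, text := range inputTexts {
		include := true

		// Exclude patterns
		for _, re := range excludes {
			if re.MatchString(text) {
				include = false
				break
			}
		}

		// Include patterns
		if include && len(includes) > 0 {
			include = false
			for _, re := range includes {
				if re.MatchString(text) {
					include = true
					break
				}
			}
		}

		if include {
			cleanedTexts = append(cleanedTexts, text)
		}
	}
	return cleanedTexts
}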

// cleanTextUsingSubstring cleans the input texts using substrings based on the given settings
func cleanTextUsingSubstring(inputTexts []string, settings DataCleaningSetting) []string {
	var cleanedTexts []string

	for _, text := range inputTexts {
		include := true
		compareText := text
		if !settings.CaseSensitive {
			compareText = strings.ToLower(text)
		}

		// Exclude substrings
		for _, substr := range settings.ExcludeSubstrs {
			if !settings.CaseSensitive {
				substr = strings.ToLower(substr)
			}
			if strings.Contains(compareText, substr) {
				include = false
				break
			}
		}

		// Include substrings
		if include && len(settings.IncludeSubstrs) > 0 {
			include = false
			for _, substr := range settings.IncludeSubstrs {
				if !settings.CaseSensitive {
					substr = strings.ToLower(substr)
				}
				if strings.Contains(compareText, substr) {
					include = true
					break
				}
			}
		}

		if include {
			cleanedTexts = append(cleanedTexts, text)
		}
	}
	return cleanedTexts
}

// CleanData is the main function to perform data cleansing based on the input and settings
func CleanData(input CleanDataInput) CleanDataOutput {
	var cleanedTexts []string

	switch input.Setting.CleanMethod {
	case "Regex":
		cleanedTexts = cleanTextUsingRegex(input.Texts, input.Setting)
	case "Substring":
		cleanedTexts = cleanTextUsingSubstring(input.Texts, input.Setting)
	default:
		// If no valid method is provided, return the original texts
		cleanedTexts = input.Texts
	}

	return CleanDataOutput{CleanedTexts: cleanedTexts}
}
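
One note on the Regex path above: regexp.MustCompile panics when a pattern is invalid, and the case-sensitive setting is only applied in the Substring path. A minimal sketch of one way to handle both, assuming the patterns come from user input, is shown below. The helpers compilePatterns and matchesAny are hypothetical names and are not part of this PR; applying case-sensitive to regex patterns is also an assumption about the intended behavior.

```go
package main

import (
	"fmt"
	"regexp"
)

// compilePatterns is a hypothetical helper. It compiles every pattern once and
// reports the first invalid one as an error instead of panicking. When
// caseSensitive is false, it prepends the (?i) flag to each pattern.
func compilePatterns(patterns []string, caseSensitive bool) ([]*regexp.Regexp, error) {
	compiled := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		if !caseSensitive {
			p = "(?i)" + p
		}
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("invalid pattern %q: %w", p, err)
		}
		compiled = append(compiled, re)
	}
	return compiled, nil
}

// matchesAny reports whether text matches at least one compiled pattern.
func matchesAny(text string, res []*regexp.Regexp) bool {
	for _, re := range res {
		if re.MatchString(text) {
			return true
		}
	}
	return false
}

func main() {
	// Case-insensitive match: "^draft" also matches "DRAFT".
	res, err := compilePatterns([]string{`^draft`}, false)
	if err != nil {
		panic(err)
	}
	fmt.Println(matchesAny("DRAFT: notes", res)) // prints "true"

	// An unbalanced parenthesis is reported as an error, not a panic.
	_, err = compilePatterns([]string{`(`}, true)
	fmt.Println(err != nil) // prints "true"
}
```

With a helper like this, the cleansing function could return an error to the caller when a pattern is invalid, which would require changing the signature of CleanData.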

@chuang8511
Contributor

Hi @NailaRais ,
Thanks for your contribution!

If I'm correct you need this?

Yes, I think this Golang code is what I meant. I will review it carefully later.

Please read this part to follow the git guideline.

Before that, could you clean up your PR based on the guideline?
Also, I would prefer not to touch the other code in this ticket.
So, please create a new PR that only includes what we have to do in this ticket.

Thank you again!
