Need to specify tokenization for content #109

Open
mpgreg opened this issue Nov 13, 2023 · 1 comment

mpgreg commented Nov 13, 2023

"name": "content",

Without an explicit tokenization scheme, ingest defaults to word, as per https://weaviate.io/developers/weaviate/config-refs/schema#property-tokenization. Word tokenization keeps only alphanumeric characters and lowercases them, so snake-case configuration parameters and environment variables are split at each underscore as if it were whitespace.

Example as per https://github.com/weaviate/weaviate/blob/764935fe4b576c87750d6a16ea20fd6c349b20b8/adapters/repos/db/helpers/tokenizer.go#L67

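package main

import "fmt"

// The tokenize* functions are assumed to be copied from the tokenizer.go file linked above.
func main() {
	in := "THIS is my_env_variable"

	// Print each scheme's name followed by the tokens it produces.
	fmt.Print("\nwhitespace")
	fmt.Print(tokenizeWhitespace(in))
	fmt.Print("\nlowercase")
	fmt.Print(tokenizeLowercase(in))
	fmt.Print("\nword")
	fmt.Print(tokenizeWord(in))
	fmt.Print("\nwildcards")
	fmt.Print(tokenizeWordWithWildcards(in))
}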

Results in...

whitespace[THIS is my_env_variable]
lowercase[this is my_env_variable]
word[this is my env variable]
wildcards[this is my env variable]

To avoid splitting snake-case words and lowercasing camel-case params, we need to switch the content property to whitespace tokenization.

mpgreg added a commit to mpgreg/ask-astro-upstream that referenced this issue Nov 13, 2023
sunank200 pushed a commit that referenced this issue Nov 17, 2023
sunank200 pushed a commit that referenced this issue Nov 20, 2023
sunank200 pushed a commit that referenced this issue Nov 23, 2023
@shillion
Collaborator

@sunank200 — is this issue still relevant?
