Index is larger than image set? #22

Open
johncave opened this issue May 3, 2017 · 3 comments

Comments

johncave commented May 3, 2017

I'm trying to index 120,000 images (around 50GB) from a spinning HDD onto a 128GB SSD using SCALABLE_COLOR for testing purposes. To my surprise, after just 20,000 images the index has swelled to 60GB. At this rate it would reach roughly 360GB, around seven times the size of my source images, by the time it finishes.

Is this expected behaviour? Am I accidentally storing the entire image in Elasticsearch?

For the record, my mapping is:

curl -XPUT 'localhost:9201/images/art/_mapping' -d '{ "my_image_item": { "properties": { "img": { "type": "image", "feature": { "SCALABLE_COLOR": { "hash": ["LSH"] } } } } } }'

johncave commented May 3, 2017

I think the problem here is that Elasticsearch is writing my data out to disk before indexing it, and the disk is filling up because my application writes images more quickly than Elasticsearch can handle them. When I stop my application from sending more data, Elasticsearch takes a few minutes, then the index shrinks to <200MB. Elasticsearch then unassigns itself from my index and appears to lose a lot of the image information.

Do you know how I can force Elasticsearch to perform the indexing via an API call, so that my application waits for it to complete?
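
One possible answer to the question above, as a minimal sketch rather than a confirmed fix: Elasticsearch exposes a `_refresh` endpoint, and the HTTP call returns only once the refresh has finished, so a client can block on it between batches. The host, port, and index name are taken from the curl command earlier in the thread. The helper names `refresh_url` and `refresh_index` are hypothetical, and whether a refresh actually prevents the disk growth described here is an untested assumption.

```python
import json
import urllib.request

ES_URL = "http://localhost:9201"  # host and port from the curl command in this thread
INDEX = "images"                  # index name from the curl command in this thread


def refresh_url(base_url, index):
    """Build the URL of the refresh endpoint for a single index."""
    return "{}/{}/_refresh".format(base_url.rstrip("/"), index)


def refresh_index(base_url=ES_URL, index=INDEX, timeout=300):
    """POST to _refresh and block until Elasticsearch responds.

    Returns the parsed JSON response. Raises urllib.error.URLError or
    HTTPError if the node is unreachable or rejects the request.
    """
    request = urllib.request.Request(
        refresh_url(base_url, index),
        data=b"",
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Index and bulk requests are themselves synchronous, so an application that waits for each response before sending the next request already applies some back-pressure. Calling `refresh_index()` after every few batches is an additional, optional checkpoint.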

johncave commented May 3, 2017

I guess my question is mainly: how do I perform the initial indexing of a large image set?

@kiwionly (Owner)

The way I index is basically to use the bulk API, sending around 10 images per request, so that indexing at least stays stable.

I have not indexed a large image set yet, though, so I am not sure what will happen.
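
The batching approach described above could look roughly like the following sketch. It is an illustration under stated assumptions, not the plugin author's actual code. The index `images`, type `art`, field `img`, and port 9201 come from the mapping earlier in the thread. The helper names are hypothetical, the assumption that `img` carries a base64-encoded image string is not confirmed here, and the `_type` key in the action line only applies to older Elasticsearch versions of that era.

```python
import json
import urllib.request


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_bulk_body(docs, index, doc_type):
    """Build a newline-delimited bulk body from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def send_bulk(body, base_url="http://localhost:9201", timeout=300):
    """POST one bulk body and wait for the response before returning."""
    request = urllib.request.Request(
        base_url + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        result = json.loads(response.read().decode("utf-8"))
    if result.get("errors"):
        raise RuntimeError("bulk request reported item failures")
    return result


def index_all(docs, index="images", doc_type="art", batch_size=10):
    """Send documents in small batches, one request at a time."""
    for batch in chunked(docs, batch_size):
        send_bulk(build_bulk_body(batch, index, doc_type))
```

Here `docs` would be a list such as `[("1", {"img": "<base64 string>"}), ...]`. Because `index_all` waits for each bulk response before sending the next batch, the application cannot run ahead of Elasticsearch by more than one batch of 10 images.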
