Full text search #757

Open. vdimir wants to merge 4 commits into master.

vdimir
Contributor

@vdimir vdimir commented Aug 2, 2020

Resolves #734

Uses bleve.

@paskal paskal requested review from umputun and paskal August 2, 2020 21:00
@paskal
Collaborator

paskal commented Aug 2, 2020

I have trouble reviewing this PR, I suppose because of vendoring: my IDE just refuses to show the list of files. Would you be so kind as to split your work into a branch with vendoring and a branch on top of it with just the code changes, then create a PR with just the code changes and ping me for review there?

@vdimir
Contributor Author

vdimir commented Aug 4, 2020

@paskal I removed vendor from this PR and pushed it to a separate branch, full-text-search-vendor. I haven't created a PR for that branch yet, because maybe it is better to keep one PR and merge the changes back after review? Otherwise I'll create another PR later if we decide to do it.

@paskal
Collaborator

paskal commented Aug 4, 2020

Sorry for not being clear: I wanted a PR like the one we have here to be a separate one on top of the vendoring branch in your fork, so I could review it there. I'll try to look into this by tomorrow evening.

@paskal
Collaborator

paskal commented Aug 8, 2020

I expect to have the ability to review it this weekend.

Collaborator

@paskal paskal left a comment

Sorry, didn't have time for a proper review yet. The comments are small stuff about code style rather than architecture; I'll look into the idea behind the code shortly after.

Also if you don't mind I'll recommend rebasing against current master as it will simplify the review a little.

Dockerfile (outdated)
README.md (outdated)
backend/_example/memory_store/accessor/data.go (outdated)
backend/app/store/search/service.go (outdated)
backend/_example/memory_store/accessor/data.go (outdated)
backend/app/cmd/server_test.go (outdated)
backend/app/cmd/server.go (outdated)
backend/app/cmd/server.go (outdated)
backend/app/cmd/server.go (outdated)
backend/app/cmd/server.go (outdated)
Collaborator

@paskal paskal left a comment

Another round: I reviewed everything but the search module, as it's a beast of its own. @vdimir please let me know whether what I wrote makes sense, or whether I'm looking into the wrong things and that's not what you were looking for.

backend/remark.rest (outdated)
backend/remark.rest (outdated)
backend/app/rest/api/rest_public_test.go (outdated)
backend/app/rest/api/rest_test.go (outdated)
backend/app/rest/api/rest_public_test.go (outdated)
backend/app/cmd/server.go (outdated)
backend/app/store/search/service.go (outdated)
backend/app/store/search/search_test.go (outdated)
backend/app/store/search/search_test.go (outdated)
backend/app/store/search/search_test.go (outdated)
@vdimir
Contributor Author

vdimir commented Aug 11, 2020

> Another round: I reviewed everything but the search module, as it's a beast of its own. @vdimir please let me know whether what I wrote makes sense, or whether I'm looking into the wrong things and that's not what you were looking for.

Thanks for the feedback!
I have read the comments and understand how to fix most of them. I'll fix them or reply in the conversation.

@vdimir vdimir requested a review from paskal August 18, 2020 19:08
@paskal
Collaborator

paskal commented Aug 24, 2020

I should have time tonight to review it thoughtfully.

@vdimir
Contributor Author

vdimir commented Aug 24, 2020

> I should have time tonight to review it thoughtfully.

Thanks!

I think the main issue I haven't figured out is wrapping store.Interface. Wrapping the store with a proxy is a simple way to handle all store calls and perform the additional work needed to support search without modifying the store. I understand that this solution is not so clear, but I don't know how to do it better.
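
For readers following the thread, the proxy idea described above can be sketched roughly as follows. This is an illustration only, not code from the PR: `Comment`, `Store`, `Indexer`, and `IndexingStore` are hypothetical, heavily reduced stand-ins, and the real store.Interface has many more methods.

```go
package main

import "fmt"

// Comment is a simplified, hypothetical stand-in for the real comment type.
type Comment struct {
	ID   string
	Text string
}

// Store is a hypothetical, reduced version of a store interface.
type Store interface {
	Create(c Comment) (string, error)
	Delete(id string) error
}

// Indexer is a hypothetical search index abstraction.
type Indexer interface {
	Index(c Comment) error
	Remove(id string) error
}

// IndexingStore wraps a Store and mirrors every successful write into the index.
type IndexingStore struct {
	Store // embedded so that methods not overridden here pass through
	index Indexer
}

// Create stores the comment first and indexes it only if the store call succeeded.
func (s *IndexingStore) Create(c Comment) (string, error) {
	id, err := s.Store.Create(c)
	if err != nil {
		return "", err
	}
	c.ID = id
	if err := s.index.Index(c); err != nil {
		return id, fmt.Errorf("comment %s stored but not indexed: %w", id, err)
	}
	return id, nil
}

// Delete removes the comment from the store and then from the index.
func (s *IndexingStore) Delete(id string) error {
	if err := s.Store.Delete(id); err != nil {
		return err
	}
	return s.index.Remove(id)
}

// memStore and memIndex are in-memory implementations used only for the demo.
type memStore struct{ data map[string]Comment }

func (m *memStore) Create(c Comment) (string, error) { m.data[c.ID] = c; return c.ID, nil }
func (m *memStore) Delete(id string) error           { delete(m.data, id); return nil }

type memIndex struct{ docs map[string]string }

func (m *memIndex) Index(c Comment) error  { m.docs[c.ID] = c.Text; return nil }
func (m *memIndex) Remove(id string) error { delete(m.docs, id); return nil }

func main() {
	idx := &memIndex{docs: map[string]string{}}
	var st Store = &IndexingStore{Store: &memStore{data: map[string]Comment{}}, index: idx}

	id, _ := st.Create(Comment{ID: "c1", Text: "hello search"})
	fmt.Println(id, len(idx.docs)) // c1 1

	_ = st.Delete(id)
	fmt.Println(len(idx.docs)) // 0
}
```

The trade-off raised in the comment is visible here: callers keep using the plain `Store` interface and never learn about search, which is convenient but also hides the indexing side effect.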

Collaborator

@paskal paskal left a comment

I'll look into search module and the architecture tomorrow. It would be great if you would fix small things and move unrelated changes out of PR, that would really make things simpler for me.

@paskal
Collaborator

paskal commented Aug 27, 2020

Very busy work week, I hope I'll have time for the review either on Friday or Sunday.

Collaborator

@paskal paskal left a comment

Generally looks good, aside from the TODO stuff, which I didn't check. Also, I didn't check test coverage as the code is not 100% complete at the moment, and I didn't run it yet as I want to run it once the code review from Umputun has passed.

I propose that you fix the review comments, ping me directly if you need any clarifications, and after these comments are resolved ping Umputun for the final review. It's unlikely we'll jump on your TODO items, so I propose adding them as comment threads in the review and pinging me/Umputun directly on them if you want our reaction.

backend/app/store/search/elastic.go (outdated)
backend/app/store/search/buffered_engine.go (outdated)
backend/app/store/search/buffered_engine.go (outdated)
backend/app/store/search/buffered_engine.go (outdated)
backend/app/store/search/bleve.go (outdated)
backend/app/store/search/buffered_engine.go (outdated)
backend/app/store/search/bleve.go (outdated)
backend/app/store/search/elastic.go (outdated)
backend/app/store/search/multiplexer.go (outdated)
backend/app/cmd/server_test.go (outdated)
Contributor Author

@vdimir vdimir left a comment

I've added the issue to my TODOs

backend/app/store/search/internal/elastic.go (outdated)
backend/app/store/search/internal/multiplexer.go (outdated)
backend/app/store/search/internal/buffered_engine.go (outdated)
Collaborator

@paskal paskal left a comment

Excellent job! A few very minor things to fix plus a few tests to write, and also please check that it compiles, as it doesn't now. After fixing these review notes, ping Umputun directly; I don't think I have any more things I'm concerned about other than the ones written in these review comments.

backend/app/store/search/internal/bleve.go (outdated)
backend/app/store/engine/engine_mock.go (outdated)
backend/app/store/search/service/search_service.go (outdated)
backend/app/store/search/internal/bleve.go (outdated)
backend/app/store/search/internal/multiplexer.go (outdated)
backend/app/rest/api/rest_public.go (outdated)
backend/app/store/search/internal/bleve.go (outdated)
backend/app/store/search/internal/buffered_engine.go (outdated)
backend/app/store/search/internal/elastic.go (outdated)
backend/app/store/search/store_engine_decorator.go (outdated)
@vdimir
Contributor Author

vdimir commented Oct 8, 2020

> Excellent job! A few very minor things to fix plus a few tests to write, and also please check that it compiles, as it doesn't now. After fixing these review notes, ping Umputun directly; I don't think I have any more things I'm concerned about other than the ones written in these review comments.

Thanks!
A lot of the comments are about typos, which is not something a reviewer should have to check; sorry for that. I'll be more careful and will try to set up a spellchecker or something like that so as not to disturb reviewers with such minor things.

I hope to finish fixing the latest issues soon.

@vdimir
Contributor Author

vdimir commented Oct 8, 2020

I finally ran the service after all the modifications and loaded a bunch of comments like in the first post in this PR; it seems to work.
Should I merge the vendor files? They are now in a separate branch, full-text-search-vendor.

@vdimir vdimir marked this pull request as ready for review October 8, 2020 09:04
@vdimir
Contributor Author

vdimir commented Oct 8, 2020

@umputun could you review the code, please?

paskal
paskal previously approved these changes Oct 11, 2020
@umputun
Owner

umputun commented Nov 1, 2020

Sorry for the long delay and thx for the impressive work and detailed reviews.

I have tried to understand the general structure/flow before digging deeper into implementation details, and it got me a little confused on multiple levels. Somehow the overall structure doesn't look intuitive to me, and I found myself looking at things implementing unexpected logic in unexpected places. Part of my confusion is represented in the comments I've made, but I'm not really sure my comments make much sense, given the general confusion.

Fundamentally all of this feels to me like two levels of abstraction, Service and Engine:

  • search.Service - this is a struct (interfaces could be defined by consumers) providing all high-level API for the search functionality the end consumer (i.e. REST) needs. I'd expect some kind of Search call here. Probably nothing else needed, at least logically.

  • both engines (bleve and elastic) are some implementations of search.Engine interface defined near search.Service

  • the logic currently done by the multiplexer, i.e. support for different sites, should probably be a core part of search.Service

  • the service should be used directly by REST, at least to perform the search. It shouldn't be embedded into service.DataStore

  • the other part of search.Service (an extracted interface), responsible for create/update/delete, can be integrated into service.DataStore so that the current Create/Edit/Delete calls affect the search indexes properly

  • The other package structure could represent this as a couple of sub-packages - engine and service or search_engine and search_service to make imports clear and avoid conflicts. On the top level (i.e search.go) we can keep all shared things used by both packages and this way import loops could be prevented.

and just to reiterate the general flow from the consumer point of view: the only thing consumers care about and use are interfaces extracted from the service struct. Some consumers may need a different subset of those methods.

let me know what you think
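
The structure proposed in this comment can be sketched as follows. This is a minimal illustration under assumptions, not the PR's code: `Document`, `Engine`, `Service`, `Searcher`, `DocIndexer`, and `naiveEngine` are hypothetical names, and a real engine would be backed by bleve or Elasticsearch rather than a substring scan.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Document is a hypothetical minimal searchable unit.
type Document struct {
	ID   string
	Text string
}

// Engine is the low-level abstraction; a bleve- or elastic-backed type would implement it.
type Engine interface {
	Index(doc Document) error
	Search(query string) ([]string, error)
}

// Service is the high-level struct. Per-site routing (the multiplexer's job) lives here.
type Service struct {
	engines map[string]Engine // site ID -> engine
}

// NewService builds a Service from a ready-made site-to-engine mapping.
func NewService(engines map[string]Engine) *Service { return &Service{engines: engines} }

func (s *Service) engine(site string) (Engine, error) {
	e, ok := s.engines[site]
	if !ok {
		return nil, errors.New("no search engine for site " + site)
	}
	return e, nil
}

// Search is what a REST handler would call directly.
func (s *Service) Search(site, query string) ([]string, error) {
	e, err := s.engine(site)
	if err != nil {
		return nil, err
	}
	return e.Search(query)
}

// Index is what the data store would call on create/edit.
func (s *Service) Index(site string, doc Document) error {
	e, err := s.engine(site)
	if err != nil {
		return err
	}
	return e.Index(doc)
}

// Consumer-defined interfaces: each consumer declares only the subset it needs.
type Searcher interface {
	Search(site, query string) ([]string, error)
}

type DocIndexer interface {
	Index(site string, doc Document) error
}

// naiveEngine is an in-memory substring matcher used only for the demo.
type naiveEngine struct{ docs []Document }

func (n *naiveEngine) Index(doc Document) error { n.docs = append(n.docs, doc); return nil }

func (n *naiveEngine) Search(query string) ([]string, error) {
	var ids []string
	for _, d := range n.docs {
		if strings.Contains(strings.ToLower(d.Text), strings.ToLower(query)) {
			ids = append(ids, d.ID)
		}
	}
	return ids, nil
}

func main() {
	svc := NewService(map[string]Engine{"remark": &naiveEngine{}})
	var writer DocIndexer = svc // the view a data store would hold
	var reader Searcher = svc   // the view a REST layer would hold

	_ = writer.Index("remark", Document{ID: "1", Text: "Full text search"})
	ids, err := reader.Search("remark", "text")
	fmt.Println(ids, err) // [1] <nil>
}
```

Because `Searcher` and `DocIndexer` are declared on the consumer side, the REST layer and the data store each depend only on the methods they use, which is the point made in the closing paragraph of the comment.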

backend/app/cmd/server.go (outdated)
backend/app/store/search/internal/elastic.go (outdated)
backend/app/store/search/search.go (outdated)
backend/app/store/search/types/types.go (outdated)
backend/app/store/search/internal/noop.go (outdated)
backend/app/store/search/search.go (outdated)
backend/app/store/service/service.go (outdated)
backend/app/store/search/internal/bleve.go (outdated)
backend/app/store/search/store_engine_decorator.go (outdated)
@vdimir
Contributor Author

vdimir commented Nov 16, 2020

@umputun, thank you for the review!

I think I've got the general idea, and I'm going to rewrite the PR, reusing my code, with respect to your notes. If additional questions come up during implementation, I'll ask here. I hope I'll start soon :)

@paskal
Collaborator

paskal commented Nov 30, 2020

@vdimir I would love to have your work incorporated into the next release (1.7.0), please let us know if there is any way to assist you.

@vdimir
Contributor Author

vdimir commented Dec 1, 2020

> @vdimir I would love to have your work incorporated into the next release (1.7.0), please let us know if there is any way to assist you.

Thanks for the concern. I'm going to work on it this weekend; I think after that we can discuss it in more detail.
Also, when is the new release planned?

@paskal
Collaborator

paskal commented Dec 1, 2020

After these changes and the documentation are ready, plus some UI changes if we find someone to make them in time.

paskal
paskal previously approved these changes Dec 1, 2022
paskal
paskal previously approved these changes Feb 28, 2023
@vdimir vdimir force-pushed the full-text-search branch 2 times, most recently from 29d260c to f9e7345 Compare April 2, 2023 19:26
@vdimir
Contributor Author

vdimir commented Apr 2, 2023

@paskal I updated the code to have synchronous indexing for a cold start in (serverApp).run. Internally it's still executed with several workers that process topics in parallel in a sized group. Performance is sufficient to block once on update here.

Tested on an 8 CPU / 16 GB DigitalOcean droplet with a dataset of 70k comments; it took 12 sec:

Logs
remark42-dev  | 2023/04/02 14:30:20.571 [INFO]  {cmd/server.go:1224 cmd.(*ServerCommand).makeSearchService} creating search service
remark42-dev  | 2023/04/02 14:30:20 [INFO] creating new search index var/search_index/72656d61726b811c9dc5
remark42-dev  | 2023/04/02 14:30:20.587 [WARN]  {cmd/server.go:654 cmd.(*serverApp).run} admin basic auth enabled
remark42-dev  | 2023/04/02 14:30:20.587 [INFO]  {service/service.go:962 service.(*DataStore).IndexSites} start building search index for 1 sites
remark42-dev  | 2023/04/02 14:30:20.587 [INFO]  {migrator/backup.go:26 migrator.AutoBackup.Do} activate auto-backup for remark under ./var/backup, duration 24h0m0s
remark42-dev  | 2023/04/02 14:30:20.587 [DEBUG] {migrator/backup.go:29 migrator.AutoBackup.Do} first backup for remark at 2023-04-03 14:30:20.587640815 -0500 CDT m=+86400.821362338
remark42-dev  | 2023/04/02 14:30:20.588 [INFO]  {provider/dev_provider.go:49 provider.(*DevAuthServer).Run} run local oauth2 dev server on 8084, redirect url=http://127.0.0.1:8080/auth/dev/callback
remark42-dev  | 2023/04/02 14:30:23 [INFO] 3669 documents indexed from topic {remark https://remark42.com/demo/}
remark42-dev  | 2023/04/02 14:30:23 [INFO] 3667 documents indexed from topic {remark https://remark42.com/demo9/}
remark42-dev  | 2023/04/02 14:30:23 [INFO] 3652 documents indexed from topic {remark https://remark42.com/demo1/}
remark42-dev  | 2023/04/02 14:30:23 [INFO] 3736 documents indexed from topic {remark https://remark42.com/demo14/}
remark42-dev  | 2023/04/02 14:30:23 [INFO] 3648 documents indexed from topic {remark https://remark42.com/demo11/}
remark42-dev  | 2023/04/02 14:30:24 [INFO] 3720 documents indexed from topic {remark https://remark42.com/demo17/}
remark42-dev  | 2023/04/02 14:30:24 [INFO] 3702 documents indexed from topic {remark https://remark42.com/demo10/}
remark42-dev  | 2023/04/02 14:30:24 [INFO] 3729 documents indexed from topic {remark https://remark42.com/demo16/}
remark42-dev  | 2023/04/02 14:30:26 [INFO] 3653 documents indexed from topic {remark https://remark42.com/demo19/}
remark42-dev  | 2023/04/02 14:30:26 [INFO] 3702 documents indexed from topic {remark https://remark42.com/demo18/}
remark42-dev  | 2023/04/02 14:30:26 [INFO] 3750 documents indexed from topic {remark https://remark42.com/demo15/}
remark42-dev  | 2023/04/02 14:30:26 [INFO] 3724 documents indexed from topic {remark https://remark42.com/demo3/}
remark42-dev  | 2023/04/02 14:30:29 [INFO] 3686 documents indexed from topic {remark https://remark42.com/demo2/}
remark42-dev  | 2023/04/02 14:30:29 [INFO] 3782 documents indexed from topic {remark https://remark42.com/demo20/}
remark42-dev  | 2023/04/02 14:30:29 [INFO] 3774 documents indexed from topic {remark https://remark42.com/demo12/}
remark42-dev  | 2023/04/02 14:30:33 [INFO] 3623 documents indexed from topic {remark https://remark42.com/demo6/}
remark42-dev  | 2023/04/02 14:30:33 [INFO] 3688 documents indexed from topic {remark https://remark42.com/demo8/}
remark42-dev  | 2023/04/02 14:30:33 [INFO] 3715 documents indexed from topic {remark https://remark42.com/demo13/}
remark42-dev  | 2023/04/02 14:30:33 [INFO] 3710 documents indexed from topic {remark https://remark42.com/demo4/}
remark42-dev  | 2023/04/02 14:30:33 [INFO] 3728 documents indexed from topic {remark https://remark42.com/demo7/}
remark42-dev  | 2023/04/02 14:30:33 [INFO] 3664 documents indexed from topic {remark https://remark42.com/demo5/}
remark42-dev  | 2023/04/02 14:30:33.062 [INFO]  {service/service.go:981 service.(*DataStore).IndexSites} finish building search index in 12.475010151s
remark42-dev  | 2023/04/02 14:30:33.062 [INFO]  {api/rest.go:116 api.(*Rest).Run} activate http rest server on :8080
remark42-dev  | 2023/04/02 14:30:33.062 [INFO]  {image/image.go:182 image.(*Service).Cleanup} start pictures cleanup, staging ttl=7m30s
remark42-dev  | 2023/04/02 14:30:33.064 [INFO]  {api/rest.go:483 api.addFileServer} run file server from ./web from the disk
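
The cold-start approach described in this comment (block until indexing is done, but process topics with a bounded number of workers) can be sketched with the standard library alone. This is an assumption-laden illustration, not the PR's implementation: `indexTopics` and its parameters are hypothetical, and a buffered channel is used here as a semaphore in place of whatever sized-group helper the PR actually uses.

```go
package main

import (
	"fmt"
	"sync"
)

// indexTopics blocks until every topic has been processed, running at most
// `workers` calls to indexOne concurrently. It returns the first error seen.
func indexTopics(topics []string, workers int, indexOne func(topic string) error) error {
	if workers < 1 {
		workers = 1
	}
	sem := make(chan struct{}, workers) // counting semaphore
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for _, t := range topics {
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` goroutines are busy
		go func(topic string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot
			if err := indexOne(topic); err != nil {
				once.Do(func() { firstErr = err })
			}
		}(t)
	}
	wg.Wait() // synchronous: the caller resumes only after all topics are done
	return firstErr
}

func main() {
	topics := []string{"demo1", "demo2", "demo3", "demo4"}
	var mu sync.Mutex
	indexed := 0
	err := indexTopics(topics, 2, func(topic string) error {
		mu.Lock()
		defer mu.Unlock()
		indexed++
		return nil
	})
	fmt.Println(indexed, err) // 4 <nil>
}
```

Setting `workers` to 1 degrades this to plain sequential indexing, which matches the single-worker configuration discussed later in the thread for small machines.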

@paskal
Collaborator

paskal commented Apr 2, 2023

Can you please test on the cheapest droplet?

@vdimir
Contributor Author

vdimir commented Apr 3, 2023

I tried on a small droplet (1 GB RAM) and faced memory issues; I will investigate.

UPD: on a 2 GB RAM droplet it succeeded in 2 minutes (100k comments). But I set the number of workers to 1, so I need to make it configurable or determine it automatically.
Memory usage is about 600-800 MB.

Details
remark42-dev  | 2023/04/03 16:05:11.616 [INFO]  {service/service.go:962 service.(*DataStore).IndexSites} start building search index for 1 sites
remark42-dev  | 2023/04/03 16:05:11.627 [INFO]  {migrator/backup.go:26 migrator.AutoBackup.Do} activate auto-backup for remark under ./var/backup, duration 24h0m0s
remark42-dev  | 2023/04/03 16:05:11.627 [DEBUG] {migrator/backup.go:29 migrator.AutoBackup.Do} first backup for remark at 2023-04-04 16:05:11.627861637 -0500 CDT m=+86400.935999044
remark42-dev  | 2023/04/03 16:05:11.628 [INFO]  {provider/dev_provider.go:49 provider.(*DevAuthServer).Run} run local oauth2 dev server on 8084, redirect url=http://127.0.0.1:8080/auth/dev/callback
remark42-dev  | 2023/04/03 16:05:16 [INFO] 4738 documents indexed from topic {remark https://remark42.com/demo9/}
remark42-dev  | 2023/04/03 16:05:21 [INFO] 4665 documents indexed from topic {remark https://remark42.com/demo/}
remark42-dev  | 2023/04/03 16:05:25 [INFO] 4708 documents indexed from topic {remark https://remark42.com/demo1/}
remark42-dev  | 2023/04/03 16:05:29 [INFO] 4755 documents indexed from topic {remark https://remark42.com/demo10/}
remark42-dev  | 2023/04/03 16:05:33 [INFO] 4720 documents indexed from topic {remark https://remark42.com/demo11/}
remark42-dev  | 2023/04/03 16:05:38 [INFO] 4829 documents indexed from topic {remark https://remark42.com/demo12/}
remark42-dev  | 2023/04/03 16:05:42 [INFO] 4767 documents indexed from topic {remark https://remark42.com/demo13/}
remark42-dev  | 2023/04/03 16:05:47 [INFO] 4780 documents indexed from topic {remark https://remark42.com/demo14/}
remark42-dev  | 2023/04/03 16:05:51 [INFO] 4880 documents indexed from topic {remark https://remark42.com/demo15/}
remark42-dev  | 2023/04/03 16:05:57 [INFO] 4774 documents indexed from topic {remark https://remark42.com/demo16/}
remark42-dev  | 2023/04/03 16:06:04 [INFO] 4801 documents indexed from topic {remark https://remark42.com/demo17/}
remark42-dev  | 2023/04/03 16:06:09 [INFO] 4734 documents indexed from topic {remark https://remark42.com/demo18/}
remark42-dev  | 2023/04/03 16:06:18 [INFO] 4696 documents indexed from topic {remark https://remark42.com/demo19/}
remark42-dev  | 2023/04/03 16:06:26 [INFO] 4756 documents indexed from topic {remark https://remark42.com/demo2/}
remark42-dev  | 2023/04/03 16:06:34 [INFO] 4823 documents indexed from topic {remark https://remark42.com/demo20/}
remark42-dev  | 2023/04/03 16:06:39 [INFO] 4778 documents indexed from topic {remark https://remark42.com/demo3/}
remark42-dev  | 2023/04/03 16:06:43 [INFO] 4800 documents indexed from topic {remark https://remark42.com/demo4/}
remark42-dev  | 2023/04/03 16:06:46 [INFO] 4752 documents indexed from topic {remark https://remark42.com/demo5/}
remark42-dev  | 2023/04/03 16:06:49 [INFO] 4680 documents indexed from topic {remark https://remark42.com/demo6/}
remark42-dev  | 2023/04/03 16:06:54 [INFO] 4805 documents indexed from topic {remark https://remark42.com/demo7/}
remark42-dev  | 2023/04/03 16:06:57 [INFO] 4760 documents indexed from topic {remark https://remark42.com/demo8/}
remark42-dev  | 2023/04/03 16:06:57.631 [INFO]  {service/service.go:981 service.(*DataStore).IndexSites} finish building search index in 1m46.013316679s
remark42-dev  | 2023/04/03 16:06:57.631 [INFO]  {api/rest.go:116 api.(*Rest).Run} activate http rest server on :8080

@vdimir
Contributor Author

vdimir commented Apr 8, 2023

@paskal

Tested on latest commit on cheapest 512ram droplet 100k comments in 20 topics:

remark42-dev  | 2023/04/08 15:25:45.520 [INFO]  {service/service.go:974 service.(*DataStore).IndexSites} finish building search index in 1m56.786843857s

It seems the issue was that the batches passed to the bleve.Index method were too big (the whole topic). Now they're limited to 1024.
Also, I made this method single-threaded because on multicore machines it's fast enough even with one thread, but on slow ones it may get stuck if it runs several threads.
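
The batch-size fix mentioned above amounts to splitting each topic's documents into chunks before handing them to the index. A minimal sketch, assuming a hypothetical `BatchIndexer` interface rather than the actual bleve calls used in the PR (`Doc`, `indexInBatches`, and `countingIndexer` are illustrative names too):

```go
package main

import "fmt"

// maxBatchSize mirrors the 1024 limit mentioned in the comment.
const maxBatchSize = 1024

// Doc is a hypothetical minimal document.
type Doc struct{ ID, Text string }

// BatchIndexer is a hypothetical abstraction over an index that accepts batches.
type BatchIndexer interface {
	IndexBatch(docs []Doc) error
}

// indexInBatches feeds docs to idx in chunks of at most `size` documents,
// so that memory use is bounded by the chunk size instead of the topic size.
func indexInBatches(idx BatchIndexer, docs []Doc, size int) error {
	if size < 1 {
		size = maxBatchSize
	}
	for start := 0; start < len(docs); start += size {
		end := start + size
		if end > len(docs) {
			end = len(docs)
		}
		if err := idx.IndexBatch(docs[start:end]); err != nil {
			return fmt.Errorf("indexing batch %d-%d: %w", start, end, err)
		}
	}
	return nil
}

// countingIndexer records the size of every batch it receives (demo only).
type countingIndexer struct{ batchSizes []int }

func (c *countingIndexer) IndexBatch(docs []Doc) error {
	c.batchSizes = append(c.batchSizes, len(docs))
	return nil
}

func main() {
	docs := make([]Doc, 3700) // roughly one topic from the logs above
	idx := &countingIndexer{}
	if err := indexInBatches(idx, docs, maxBatchSize); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(idx.batchSizes) // [1024 1024 1024 628]
}
```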

@paskal
Collaborator

paskal commented Apr 10, 2023

Are default concurrency settings good enough not to overuse the machine's memory?

@vdimir
Contributor Author

vdimir commented Apr 10, 2023

> Are default concurrency settings good enough not to overuse the machine's memory?

I think so; peak usage was about 300-400 MB.

I also tested another workload with 1000 topics of ~100 comments each. It took more time: 4 min on a 512 MB RAM single-core droplet.

...
remark42-dev  | 2023/04/10 14:16:03 [INFO] 95 documents indexed from topic {remark https://remark42.com/demo997/}
remark42-dev  | 2023/04/10 14:16:03 [INFO] 96 documents indexed from topic {remark https://remark42.com/demo998/}
remark42-dev  | 2023/04/10 14:16:03 [INFO] 94 documents indexed from topic {remark https://remark42.com/demo999/}
remark42-dev  | 2023/04/10 14:16:03.777 [INFO]  {service/service.go:974 service.(*DataStore).IndexSites} finish building search index in 4m3.937382438s

@paskal
Collaborator

paskal commented Apr 10, 2023

Sounds acceptable to me for a single-time index build. @umputun please take a final look.

itzomen pushed a commit to traleor/comments that referenced this pull request Apr 16, 2023
I haven't found a linter for these, so I had to catch these manually.
I found umputun#757 to fix one of these, and I thought it would be good
to fix everything at once.
@paskal paskal removed this from the v1.12.0 milestone Jul 24, 2023
Development

Successfully merging this pull request may close these issues.

Full text search
4 participants