-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Horizontal scaling across multiple nodes #1916
Comments
i'm also interested by this question... |
I'm pretty sure that this would be compatible with the idea of a Federated search, such as Elasticsearch. The biggest challenge to this approach is developing a protocol to share the results of crawling without having to essentially do the work of crawling again. There are a couple of protocols out there that fail to do this effectively, or fail to assign weights to different aspects of a page, losing much of the information in HTML. |
Another approach to this issue would be to separate servers based on their functionality. The part of the system that is absolutely critical to keep all together is the web site metadata, so keeping a separate database server would be the first part to this solution. Another server or multiple servers could do crawling and feed the database via network access. And another server could perform web functions, such as supply a web interface for users (possibly a shared server), or access the database via API. |
|
|
Sharedobjects? This is the first I’ve heard of that. It’s generally better to use simpler, or more efficient, or more mainstream software rather than the more novel idea unless there is some new feature of the newer idea that adds measurable value. I’m not familiar with this structure so I don’t have a reason to use Sharedobjects. |
A cluster? Databases can be clustered, and copy data between nodes synchronously or asynchronously, sort of how I understand Sharedobjects work. However, the amount of data involved would make keeping a copy on each search or web server impractical. Besides, a single dedicated database server would easily be able to handle the transaction load by itself for a sizable cluster of web servers. Existing clustering configurations for a dedicated database cluster can further expand scalability to dozens or hundreds of web servers. |
well, nothing is impossible in a digital world. How do you think FB or else can manage their DB amon hundreds of thousands of servers? |
I also forgot the torrent protocol, can also be interesting to explore |
Is it possible to scale the crawler module and/or search module across multiple computers, all concurrently operating on the same data set? (similar to Elasticsearch, for example). If not, a work-around would be to mount a networked file-system, and set that as the data-path, but would this cause any problems with the software that you're aware of (besides the obvious increase in read/write latency)?
The text was updated successfully, but these errors were encountered: