Horizontal scaling across multiple nodes #1916

lausycampari · 2018-12-20T03:38:16Z

Is it possible to scale the crawler module and/or search module across multiple computers, all concurrently operating on the same data set (similar to Elasticsearch, for example)? If not, a workaround would be to mount a network file system and set that as the data path, but would this cause any problems with the software that you're aware of (besides the obvious increase in read/write latency)?

ROBERT-MCDOWELL · 2019-06-16T11:18:19Z

I'm also interested in this question...

jelutz77 · 2019-07-20T21:21:00Z

I'm pretty sure that this would be compatible with the idea of federated search, as in Elasticsearch. The biggest challenge to this approach is developing a protocol to share the results of crawling without having to essentially do the work of crawling again. There are a couple of protocols out there that fail to do this effectively, or fail to assign weights to different aspects of a page, losing much of the information in the HTML.

jelutz77 · 2019-07-20T21:24:06Z

Another approach to this issue would be to separate servers based on their functionality. The part of the system that absolutely must be kept together is the web site metadata, so a separate database server would be the first part of this solution. Another server, or multiple servers, could do the crawling and feed the database over the network. And another server could perform web functions, such as supplying a web interface for users (possibly on a shared server), or accessing the database via an API.
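As an illustration of the split described above, here is a minimal sketch, not anything taken from the project itself. It assumes Python, and it uses an in-memory `sqlite3` database purely as a stand-in for the dedicated database server; the function names (`open_metadata_store`, `crawler_submit`, `web_search`) and the table layout are hypothetical.

```python
import sqlite3


def open_metadata_store(path=":memory:"):
    """Open the central metadata store (stand-in for a networked database server)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, title TEXT, crawled_by TEXT)"
    )
    return conn


def crawler_submit(conn, worker_id, url, title):
    """Called by a crawler node: store or overwrite the metadata for one URL."""
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, title, crawled_by) VALUES (?, ?, ?)",
        (url, title, worker_id),
    )
    conn.commit()


def web_search(conn, term):
    """Called by a web/search node: read-only lookup against the shared store."""
    rows = conn.execute(
        "SELECT url FROM pages WHERE title LIKE ? ORDER BY url",
        (f"%{term}%",),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    store = open_metadata_store()
    # Two hypothetical crawler nodes feed the same store.
    crawler_submit(store, "crawler-1", "http://a.example/", "Scaling search")
    crawler_submit(store, "crawler-2", "http://b.example/", "Cooking")
    print(web_search(store, "Scaling"))
```

Because the URL is the primary key, two crawlers that fetch the same page simply overwrite one row instead of creating duplicates. In a real deployment the store would be a database reachable over the network rather than a local file.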

ROBERT-MCDOWELL · 2019-07-21T07:01:25Z

> The biggest challenge to this approach is developing a protocol to share the results of crawling without having to essentially do the work of crawling again.

Why not SharedObjects?

ROBERT-MCDOWELL · 2019-07-21T07:03:45Z

> The biggest challenge to this approach is developing a protocol to share the results of crawling without having to essentially do the work of crawling again.

Why not SharedObjects?

> And another server could perform web functions, such as supply a web interface for users (possibly a shared server), or access the database via API.

Maybe the concept of a cluster would be more efficient: use a UDP-based protocol (like a DNS server does) to instantly share everything new or modified. The SharedObjects would work out which part has changed, so that only the new or modified bytes are passed to the stream.
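To make the "send only the new or modified bytes" idea concrete, here is a rough sketch in Python. It is not a SharedObjects API and it leaves out the UDP transport entirely; `make_delta` and `apply_delta` are hypothetical helpers, and the common-prefix/common-suffix trimming is just one simple way to find a single changed region.

```python
def make_delta(old: bytes, new: bytes):
    """Return (offset, removed_length, inserted_bytes) describing one changed region."""
    limit = min(len(old), len(new))

    # Length of the common prefix.
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1

    # Length of the common suffix, not overlapping the prefix.
    suffix = 0
    while (suffix < limit - prefix
           and old[len(old) - 1 - suffix] == new[len(new) - 1 - suffix]):
        suffix += 1

    removed = len(old) - prefix - suffix
    inserted = new[prefix:len(new) - suffix]
    return prefix, removed, inserted


def apply_delta(old: bytes, delta):
    """Rebuild the new value on the receiving node from its old copy plus the delta."""
    offset, removed, inserted = delta
    return old[:offset] + inserted + old[offset + removed:]


if __name__ == "__main__":
    old = b"hello world"
    new = b"hello brave world"
    delta = make_delta(old, new)
    print(delta)
    assert apply_delta(old, delta) == new
```

Only the tuple returned by `make_delta` would need to travel between nodes. Because UDP gives no delivery or ordering guarantees, a real protocol would also need versioning or acknowledgements so that a lost delta does not leave a node with a corrupted copy.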

jelutz77 · 2019-07-21T07:24:30Z

SharedObjects? This is the first I’ve heard of that. It’s generally better to use simpler, more efficient, or more mainstream software rather than a novel idea, unless the newer idea has some feature that adds measurable value. I’m not familiar with this structure, so I don’t have a reason to use SharedObjects.

jelutz77 · 2019-07-21T07:32:04Z

A cluster? Databases can be clustered, and can copy data between nodes synchronously or asynchronously, which is roughly how I understand SharedObjects to work. However, the amount of data involved would make keeping a copy on each search or web server impractical. Besides, a single dedicated database server could easily handle the transaction load for a sizable cluster of web servers by itself. Existing clustering configurations for a dedicated database cluster can further expand scalability to dozens or hundreds of web servers.

ROBERT-MCDOWELL · 2019-07-21T08:42:27Z

Well, nothing is impossible in a digital world. How do you think FB and others manage their DBs among hundreds of thousands of servers?
Shared Objects is a 10-year-old feature; it is more recent in JS, but it exists in Java, ActionScript, etc.

ROBERT-MCDOWELL · 2019-07-21T08:43:43Z

I also forgot the torrent protocol, which could also be interesting to explore.
