The next problem that needed to be solved was: how to make an inverted index updatable without losing the benefits of immutability? The answer turned out to be: use more than one index.
Instead of rewriting the whole inverted index, add new supplementary indices to reflect more recent changes. Each inverted index can be queried in turn — starting with the oldest — and the results combined.
Lucene, the Java libraries on which Elasticsearch is based, introduced the concept of per-segment search. A segment is an inverted index in its own right, but now the word index in Lucene came to mean a ``collection of segments'' plus a commit point — a file which lists all known segments.
To add to the confusion, a Lucene index is what we call a shard in Elasticsearch, while an index in Elasticsearch is a collection of shards. When Elasticsearch searches an ``index'', it sends the query out to a copy of every shard (Lucene index) that belongs to the index, then reduces the per-shards results to a global result set, as described in [distributed-search].
The way that per-segment search works is as follows:
-
New documents are collected in an in-memory indexing buffer. See A Lucene index with new documents in the in-memory buffer, ready to commit.
-
Every so often, the buffer is commited:
-
A new segment — a supplementary inverted index — is written to disk.
-
A new commit point is written to disk, which includes the name of the new segment.
-
The disk is fsync’ed — all writes waiting in the file-system cache are flushed to disk, to ensure that they have been physically written.
-
-
The new segment is opened, making the documents it contains visible to search.
-
The in-memory buffer is cleared, and is ready to accept new documents.
When a query is issued, all known segments are queried in turn. Term statistics are aggregated across all segments to ensure that the relevance of each term and each document is calculated accurately. In this way, new documents can be added to the index relatively cheaply.
Segments are immutable, so documents cannot be removed from older segments,
nor can older segments be updated to reflect a newer version of a document.
Instead, every commit point includes a .del
file which lists which documents
in which segments have been deleted.
When a document is `deleted'', it is actually just marked as deleted in the
`.del
file. A document which has been marked as deleted can still match a
query, but it is removed from the results list before the final query results
are returned.
Document updates work in a similar way: when a document is updated, the old version of the document is marked as deleted, and the new version of the document is indexed in a new segement. Perhaps both versions of the document will match a query, but the older deleted version is removed before the query results are returned.
In [merge-process], we will talk about how deleted documents are purged from the file system.