How to use Elasticsearch

Connect to the server with a read-only account

from elasticsearch import Elasticsearch
es = Elasticsearch(
    cloud_id="TODO",
    api_key="TODO",
    retry_on_timeout=True,
    http_compress=True,
)
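
The `TODO` placeholders above need real credentials. One way to keep them out of scripts is to read them from the environment. This is only a sketch: the variable names `ES_CLOUD_ID` and `ES_API_KEY` and the helper `load_es_credentials` are made up for illustration; neither the server nor the client library defines them.

```python
import os


def load_es_credentials(env=None):
    """Return keyword arguments for Elasticsearch() read from environment variables.

    ES_CLOUD_ID and ES_API_KEY are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("ES_CLOUD_ID", "ES_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {
        "cloud_id": env["ES_CLOUD_ID"],
        "api_key": env["ES_API_KEY"],
        "retry_on_timeout": True,
        "http_compress": True,
    }
```

With those variables set, the connection would become `es = Elasticsearch(**load_es_credentials())`.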

Find out which indices exist

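# Older versions of the elasticsearch client accept the index pattern as a positional argument
indices = [name for name in es.indices.get("*").keys() if not name.startswith(".")]

# Newer versions of the client require it as a keyword argument
indices = [name for name in es.indices.get(index="*").keys() if not name.startswith(".")]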

At the moment, this will give the following indices:

c4: All of C4, including the "en", "en.noblocklist", and "en.noclean" subsets.
laion1b-nolang: Portion of LAION that doesn't have a detected language
laion2b-multi-1: Part one of LAION with a detected language other than English
laion2b-multi-2: Part two of LAION with a detected language other than English
metrics-endpoint.metadata_current_default: Internal ES stuff
openwebtext: Is what it says.
re_laion2b-en-1: Part one of LAION with a detected language of English
re_laion2b-en-2: Part two of LAION with a detected language of English
re_oscar: OSCAR.
re_pile: PILE
s2orc-abstracts: Abstracts from S2ORC. The URLs in this are Semantic Scholar URLs generated from the paper IDs.
search-test: Test index that's empty. I keep this around to look at the default mappings from time to time.
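
To get a feel for how large each of these indices is, the client's count API can be called once per index. A minimal sketch, assuming `es` is the client created above; the helper name `count_documents` is made up for this example.

```python
def count_documents(es, index_names):
    """Map each index name to its document count, using the client's count API."""
    return {name: es.count(index=name)["count"] for name in index_names}
```

For example, `count_documents(es, indices)` would return a dictionary keyed by index name.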

Search over one index

This searches for the word "water" in the OSCAR dataset.

es.search(index="re_oscar", body={
    "query": {
        "match": {
            "text": "water"
        }
    }
})
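
The search call returns a nested response; the matching documents sit under `hits.hits`, each with an `_id`, a `_score`, and the stored fields under `_source`. The sketch below pulls those out into plain tuples. It assumes the documents keep their content in a `text` field, as the query above suggests; the helper name `summarize_hits` is made up for this example.

```python
def summarize_hits(response):
    """Return (document ID, score, text) for each hit in a search response."""
    return [
        (hit["_id"], hit["_score"], hit["_source"].get("text"))
        for hit in response["hits"]["hits"]
    ]
```

By default Elasticsearch returns only the top 10 hits; pass `size=` to `es.search` to ask for more.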

Search over multiple indices

Because LAION has more documents than fit into one Elasticsearch index, it is split over multiple indices. Fortunately, you can query more than one index at a time. Here is an example that searches for the word "water" in all of LAION-2B-en:

es.search(index="re_laion2b-en-*", body={
    "query": {
        "match": {
            "text": "water"
        }
    }
})
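
When a query spans several indices, every hit carries an `_index` field naming the index it came from. A small sketch that tallies the hits per index; the helper name `hits_per_index` is made up for this example.

```python
from collections import Counter


def hits_per_index(response):
    """Count how many of the returned hits came from each index."""
    return Counter(hit["_index"] for hit in response["hits"]["hits"])
```

This counts only the hits contained in the response, not all matching documents in each index.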

Search over C4 subsets

Because C4 is so big, and the subsets have considerable overlap, we didn't want to make one index per subset. Instead, every document has a "subset" field that can be queried. Here is how you search for "water" only in the "en" subset:

es.search(index="c4", body={
    "query": {
        "bool": {
            "must": {
                "match": {"text": "water"}
            },
            "filter": {
                "term": {"subset": "en"}
            }
        }
    }
})
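
The same pattern extends to several subsets at once by swapping the `term` filter for a `terms` filter, which matches any value in a list. This is a sketch rather than a tested recipe: it assumes the `subset` field is indexed so that exact-value filters work on it, as the `term` filter above implies. The helper name `c4_subset_query` is made up for this example.

```python
def c4_subset_query(text, subsets):
    """Build a query body matching `text` in any of the given C4 subsets."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"text": text}},
                "filter": {"terms": {"subset": list(subsets)}},
            }
        }
    }
```

It would be used as `es.search(index="c4", body=c4_subset_query("water", ["en", "en.noblocklist"]))`.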

Getting documents when you already have a document ID

When you already have a document ID, you don't need to search. You can just retrieve the document. It's much faster. This is how you do it:

es.get(index="re_laion2b-en-2", id=doc_id)
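
If you have many IDs, the multi-get API fetches them in one round trip instead of one `get` call each. A sketch under the assumption that `es` is the client from the connection step; the helper name `get_many` is made up for this example, and IDs the server reports as not found are silently skipped.

```python
def get_many(es, index, doc_ids):
    """Fetch several documents by ID in one request; skip IDs that were not found."""
    response = es.mget(index=index, body={"ids": list(doc_ids)})
    return {doc["_id"]: doc["_source"] for doc in response["docs"] if doc.get("found")}
```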

Document IDs are usually derived from the source data, if the source data has a reasonable ID that we can use. If they can't be derived from the source data, they are derived from the JSON file and line number that the document came from.

LAION has an "ID" field in the source data, but it's not a unique identifier. Many LAION documents share the same ID, so I didn't use that field.
C4 document IDs are hashes of the URL. See below.

C4 document IDs are hashes of the URL

I wish we could just use the URLs as document IDs, but they are too long. So instead, we use a hash of the URL as the document ID. This is how you turn a URL into a document ID:

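import hashlib

# strip surrounding whitespace before hashing, so " https://..." and "https://..." give the same ID
encoded_url = url.strip().encode("UTF8")
# blake2b's default hex digest is 128 characters long, so the [:512] slice leaves it unchanged
doc_id = hashlib.blake2b(encoded_url).hexdigest()[:512]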
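
Wrapped in a function, the same recipe can be reused wherever a C4 URL needs to be looked up. The function name `c4_doc_id` is made up for this example; the hashing steps are exactly the ones above.

```python
import hashlib


def c4_doc_id(url):
    """Turn a C4 URL into its document ID, following the recipe above."""
    encoded_url = url.strip().encode("UTF8")
    return hashlib.blake2b(encoded_url).hexdigest()[:512]
```

A lookup would then read `es.get(index="c4", id=c4_doc_id(url))`.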

Access random documents

import random

import h5py

# ids.h5 stores the document IDs in a dataset named "ids"
f = h5py.File("ids.h5", "r")
ds = f["ids"]

# pick one ID at random; it comes back as bytes, so decode it to a string
random_doc_id = random.choice(ds).decode()


es.get(index="re_laion2b-en-2", id=random_doc_id)
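
To draw several random documents instead of one, you can sample positions in the ID dataset and read only those entries. A sketch, assuming the IDs live in something that supports `len()` and integer indexing (an h5py dataset does, and so does a plain list); the helper name `sample_doc_ids` is made up for this example.

```python
import random


def sample_doc_ids(ids, k, seed=None):
    """Pick k document IDs from distinct random positions of an indexable collection."""
    rng = random.Random(seed)
    positions = rng.sample(range(len(ids)), k)
    picked = (ids[i] for i in positions)
    # entries may come back as bytes (as in the snippet above) or as str
    return [x.decode() if isinstance(x, bytes) else str(x) for x in picked]
```

Passing a `seed` makes the sample reproducible, which helps when you want to share a fixed set of example documents.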
