
Sprucfluo: Streaming Datasets for Large Language Models

This is a library to support streaming datasets for large language models. It's centered around TorchData, a library of combinators for torch Datasets. It uses HuggingFace for tokenization and fsspec for reading files from local filesystems, HTTP, and cloud storage.

Here's a quick example of how to use it. This script streams The Pile and a PubMed abstracts corpus, tokenizes them, and mixes samples from the two in equal proportions.

from torchdata.datapipes.iter.util.samplemultiplexer import SampleMultiplexerDataPipe
from transformers import AutoTokenizer

import sprucfluo as sf

# Stream compressed JSONL shards straight from HTTP and Google Cloud Storage;
# fsspec handles the transport and the decompression.
pile_data = sf.corpus.load_corpus("https://mystic.the-eye.eu/public/AI/pile/train/{00..29}.jsonl.zst")
pubmed_data = sf.corpus.load_corpus("gcs://pubmed-mosaic/pubmed-sharded/pubmedAbs_train.{1..128}-of-128.jsonl.gz")

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Tokenize each document and pack the tokens into fixed-length sequences.
pubmed_data = pubmed_data.then(sf.tokenize_and_group_texts, tokenizer=tokenizer, seq_len=1024)
pile_data = pile_data.then(sf.tokenize_and_group_texts, tokenizer=tokenizer, seq_len=1024)

# Interleave the two pipes, sampling from them with equal weights.
data = SampleMultiplexerDataPipe({pile_data: 5, pubmed_data: 5}, seed=0)

for encoded in data:  # an IterDataPipe of BatchEncoding
    print(encoded)

One somewhat sneaky thing sprucfluo does is register the then method on IterDataPipe using TorchData's existing registration mechanism. This means you can call then on any IterDataPipe, passing in a function that maps an IterDataPipe to an IterDataPipe, along with any additional arguments, à la functools.partial.
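
For illustration, here is a minimal sketch of chaining a custom transform with then. The drop_short_docs function and its min_len argument are made up for this example; they are not part of sprucfluo:

from torchdata.datapipes.iter import IterDataPipe, IterableWrapper

import sprucfluo  # importing sprucfluo registers then on IterDataPipe

def drop_short_docs(pipe: IterDataPipe, min_len: int = 10) -> IterDataPipe:
    # A hypothetical IterDataPipe -> IterDataPipe transform.
    return pipe.filter(lambda doc: len(doc["text"]) >= min_len)

docs = IterableWrapper([{"text": "tiny"}, {"text": "a considerably longer document"}])

# Extra keyword arguments are forwarded to the transform, functools.partial-style.
long_docs = docs.then(drop_short_docs, min_len=10)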

It also registers a few additional methods:

  • shard_by_rank: splits the data into shards based on the rank of the process, so each distributed worker sees a disjoint slice of the data.
  • load_from_fsspec_fancy: similar to the built-in load_from_fsspec, but accepts additional arguments to pass to fsspec.open, including compression, compression_opts, and any authentication information. It also decompresses automatically using fsspec's built-in decompression mechanisms. (Both methods are sketched below.)
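
A rough sketch of the two methods together; the bucket path is invented, and the exact keyword arguments are assumptions based on the descriptions above:

from torchdata.datapipes.iter import IterableWrapper

import sprucfluo  # registers shard_by_rank and load_from_fsspec_fancy

# Hypothetical shard URLs; any fsspec-supported scheme works here.
urls = IterableWrapper([f"gs://my-bucket/shard-{i:02d}.jsonl.gz" for i in range(8)])

# Each distributed process keeps only the shards for its rank, then opens them
# with fsspec; compression="infer" lets fsspec detect gzip from the extension.
files = urls.shard_by_rank().load_from_fsspec_fancy(mode="rb", compression="infer")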

At the moment it doesn't support caching, though that's in progress.

Open Tasks

  • Add support for caching
  • Think about loading from HF datasets?
  • Support the following datasets:
    • openwebtext (weird archive format)
    • wikitext-103 (needs detokenization)
    • lambada (needs detokenization)
