---
layout: post
title: OVERVIEW
permalink: /docs/overview
redirect_from:
---
Training deep learning (DL) models on petascale datasets is essential for achieving competitive and state-of-the-art performance in applications such as speech, video analytics, and object recognition. However, existing distributed filesystems were not developed for the access patterns and usability requirements of DL jobs.
In this white paper we describe AIStore (AIS) and its components, and then compare system performance experimentally using image classification workloads, with training data stored on a variety of backends.
The rest of this document is structured as follows:
- At a glance
- Terminology
- Design Philosophy
- Key Concepts and Diagrams
- AIStore API
- Traffic Patterns
- Read-after-write consistency
- Open Format
- Existing Datasets
- Data Protection
- Scale-Out
- Networking
- HA
- Other Services
- dSort
- CLI
- ETL
- No limitations principle
Following is a high-level block diagram with an emphasis on supported frontend and backend APIs, and the capability to scale out horizontally. The picture also tries to make the point that AIS aggregates an arbitrary number of storage servers ("targets") with local or locally accessible drives, whereby each drive is formatted with a local filesystem of choice (e.g., xfs or zfs).
In any aistore cluster, there are two kinds of nodes: proxies (a.k.a. gateways) and storage nodes (targets):
Proxies provide access points ("endpoints") for the frontend API. At any point in time there is a single primary proxy that also controls versioning and distribution of the current cluster map. When and if the primary fails, another proxy is majority-elected to perform the (primary) role.
All user data is equally distributed (or balanced) across all storage nodes ("targets"). This, combined with zero overhead for I/O routing and metadata processing, provides linear scaling with no limitation on the total number of aggregated storage drives.
- Target - a storage node. To store user data, targets utilize mountpaths (see next). In the docs and the code, instead of saying something like "storage node in an aistore cluster" we simply say: "target."
- Proxy - a gateway providing an API access point. Proxies are diskless - they do not have direct access to user data, and do not "see" user data in-flight. One of the proxies is elected, or designated, as the primary (or leader) of the cluster. There may be any number of ais proxies/gateways (but only one primary at any given time).
  AIS proxy/gateway implements RESTful APIs, both native and S3-compatible. Upon primary failure, the remaining proxies collaborate with each other to perform a majority-voted HA failover. The terms "proxy" and "gateway" are used interchangeably.
  In an AIS cluster, there is no correlation between the numbers of proxies and targets, although for symmetry we usually deploy one proxy for each target (storage) node.
- Mountpath - a single disk or a volume (a RAID) formatted with a local filesystem of choice, and a local directory that AIS can fully own and utilize (to store user data and system metadata). Note that any given disk (or RAID) can have (at most) one mountpath - meaning no disk sharing. Secondly, mountpath directories cannot be nested. Further:
  - a mountpath can be temporarily disabled and (re)enabled;
  - a mountpath can also be detached and (re)attached, thus effectively supporting growth and "shrinkage" of local capacity;
  - it is safe to execute the 4 listed operations (enable, disable, attach, detach) at any point during runtime;
  - in a typical deployment, the total number of mountpaths would compute as a direct product of (number of storage targets) x (number of disks in each target).
- Backend Provider - an abstraction, and simultaneously an API-supported option, that makes it possible to distinguish between "remote" and "local" buckets with respect to a given AIS cluster.
- Unified Global Namespace - AIS clusters attached to each other, effectively, form a super-cluster providing unified global namespace whereby all buckets and all objects of all included clusters are uniformly accessible via any and all individual access points (of those clusters).
- Xaction - asynchronous batch operations that may take many seconds (minutes, hours, etc.) to execute are called eXtended actions or, simply, xactions. The CLI and CLI documentation refer to such operations as jobs - a more familiar term that can be used interchangeably. Examples include erasure coding or n-way mirroring a dataset, resharding and reshuffling a dataset, archiving multiple objects, copying buckets, and many more. All eXtended actions support a generic API and CLI to show both common counters (byte and object numbers) and operation-specific extended statistics.
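For a quick feel of how this looks in practice, here is a hedged CLI sketch (subcommand spelling may vary slightly across CLI versions):

```console
# list all currently running jobs (xactions) with their common counters
$ ais show job

# narrow the listing down to a specific job kind, e.g. rebalance
$ ais show job rebalance
```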
It is often better to let applications control how (and whether) the stored content is split into chunks. That's a simple truth that holds, in particular, for AI datasets that are often pre-sharded, with the content and boundaries of those shards based on application-specific optimization criteria. More exactly, the datasets could be pre-sharded, post-sharded, and otherwise transformed to facilitate training, inference, and simulation by the AI apps.
The corollary of this statement is two-fold:
- Breaking objects into pieces (often called chunks but also slices, segments, fragments, and blocks) and the related functionality does not necessarily belong to an AI-optimized storage system per se;
- Instead of chunking the objects and then reassembling them with the help of cluster-wide metadata (that must be maintained with extreme care), the storage system could alternatively focus on providing assistance to simplify and accelerate dataset transformations.
Notice that the same exact approach works for the other side of the spectrum - the proverbial small-file problem. Here again, instead of optimizing small-size IOPS, we focus on application-specific (re)sharding, whereby each shard would have a desirable size, contain a batch of the original (small) files, and where the files (aka samples) would be sorted to optimize subsequent computation.
In this section: high-level diagrams that introduce key concepts and architecture, as well as possible deployment options.
AIS cluster comprises arbitrary (and not necessarily equal) numbers of gateways and storage targets. Targets utilize local disks while gateways are HTTP proxies that provide most of the control plane and never touch the data.
The terms gateway and proxy are used interchangeably throughout this README and other sources in the repository.
Both gateways and targets are userspace daemons that join (and, by joining, form) a storage cluster at their respective startup times, or upon user request. AIStore can be deployed on any commodity hardware with pretty much any Linux distribution (although we do recommend 4.x kernel). There are no designed-in size/scale type limitations. There are no dependencies on special hardware capabilities. The code itself is free, open, and MIT-licensed.
The diagram below depicts an AIS clustered node and makes the point that gateways and storage targets can be, but do not have to be, colocated in a single machine (or a VM):
AIS can be deployed as a self-contained standalone persistent storage cluster or a fast tier in front of any of the supported backends including Amazon S3 and Google Cloud (GCP). The built-in caching mechanism provides LRU replacement policy on a per-bucket basis while taking into account configurable high and low capacity watermarks (see LRU for details). AWS/GCP integration is turnkey and boils down to provisioning AIS targets with credentials to access Cloud-based buckets.
If (compute + storage) rack is a unit of deployment, it may as well look as follows:
Finally, AIS target provides a number of storage services with S3-like RESTful API on top and a MapReduce layer that we call dSort.
In addition to industry-standard S3, AIS provides its own (value-added) native API that can be (conveniently) called directly from Go and Python programs.
In AIS, all inter- and intra-cluster networking is based on HTTP/1.1 (with HTTP/2 option currently under development). HTTP(S) clients execute RESTful operations vis-à-vis AIS gateways and data then moves directly between the clients and storage targets with no metadata servers and no extra processing in-between:
MDS in the diagram above stands for the metadata server(s) or service(s).
In the picture, a client on the left side makes an I/O request which is then fully serviced by the left target - one of the nodes in the AIS cluster (not shown). Symmetrically, the right client engages with the right AIS target for its own GET or PUT object transaction. In each case, the entire transaction is executed via a single TCP session that connects the requesting client directly to one of the clustered nodes. As far as the datapath is concerned, there are no extra hops in the line of communications.
For detailed traffic patterns diagrams, please refer to this readme.
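To illustrate the datapath, here is a hedged sketch using the native REST API over plain HTTP (bucket and object names are hypothetical; `-L` makes curl follow the gateway's redirect to the designated target):

```console
# the gateway answers the GET with an HTTP redirect to the target that owns the object;
# curl then reads the data directly from that target - no extra hops in between
$ curl -L -X GET 'http://ais-gateway:port/v1/objects/mybucket/myobject' -o myobject
```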
Distribution of objects across AIS cluster is done via (lightning fast) two-dimensional consistent-hash whereby objects get distributed across all storage targets and, within each target, all local disks.
`PUT(object)` is a transaction. New object (or new version of the object) becomes visible/accessible only when aistore finishes writing the first replica and its metadata.
For S3 or any other remote backend, the transaction includes:
- remote PUT via vendor's SDK library;
- local write under a temp name;
- getting successful remote response that carries remote metadata;
- simultaneously, computing checksum (per bucket config);
- optionally, checksum validation, if configured;
- finally, writing combined object metadata, at which point the object becomes visible and accessible.
But not prior to that point!
If configured, additional copies and EC slices are added asynchronously. E.g., given a bucket with 3-way replication, you may already read the first replica while the other two copies are still pending.
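N-way mirroring is a per-bucket property; a minimal CLI sketch (bucket name is hypothetical, and property names may differ slightly between releases):

```console
# enable 3-way mirroring; the additional copies are created asynchronously
$ ais bucket props set ais://mybucket mirror.enabled=true mirror.copies=3
```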
It is worth emphasizing that the same rules of data protection and consistency are universally enforced across the board for all data writing scenarios, including (but not limited to):
- RESTful PUT (above);
- cold GET (as in: `ais get s3://abc/xyz /dev/null` when S3 has `abc/xyz` while aistore doesn't);
- copy bucket; transform bucket;
- multi-object copy; multi-object transform; multi-object archive;
- prefetch remote bucket;
- download very large remote objects (blobs);
- rename bucket;
- promote NFS share
and more.
AIS targets utilize local Linux filesystems including (but not limited to) xfs, ext4, and openzfs. User data is checksummed and stored as is without any alteration (that also allows us to support direct client <=> disk datapath). AIS on-disk format is, therefore, largely defined by local filesystem(s) chosen at deployment time.
Notwithstanding, AIS stores and then maintains object replicas, erasure-coded slices, bucket metadata - in short, a variety of local and global-scope (persistent) structures; their on-disk layout is documented separately.
You can access your data with and without AIS, and without any need to convert or export/import, etc. - at any time! Your data is stored in its original native format using user-given object names. Your data can be migrated out of AIS at any time as well, and, again, without any dependency whatsoever on AIS itself.
Your own data is unlocked and immediately available at all times.
Common ways to use AIStore include the most fundamental and, often, the very first step: populating an AIS cluster with an existing dataset, or datasets. Those datasets can come from remote buckets (AWS, Google Cloud, Azure), HDFS directories, NFS shares, local files, or any vanilla HTTP(S) locations.
To this end, AIS provides several easy ways, ranging from conventional on-demand caching to promoting colocated files and directories.
Related references and examples include this technical blog that shows how to copy a file-based dataset in two easy steps.
- Cold GET
- Prefetch
- Internet Downloader
- HTTP(S) Datasets
- Promote local or shared files
- Backend Bucket
- Download very large objects (BLOBs)
- Copy remote bucket
- Copy multiple remote objects
In particular:
If the dataset in question is accessible via S3-like object API, start working with it via GET primitive of the AIS API. Just make sure to provision AIS with the corresponding credentials to access the dataset's bucket in the Cloud.
As far as supported backends go, AIS currently supports Amazon S3, Google Cloud, and Azure.
AIS executes cold GET from the Cloud if and only if the object is not stored (by AIS), or the object has a bad checksum, or the object's version is outdated.
In all other cases, AIS will service the GET request without going to Cloud.
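For example, assuming the cluster is provisioned with the corresponding credentials (bucket and object names are hypothetical):

```console
# the first GET is a cold GET: the object is read from S3, stored by AIS, and returned
$ ais get s3://abc/xyz ./xyz

# subsequent GETs of the same (unchanged) object are served by AIS without going to S3
$ ais get s3://abc/xyz ./xyz
```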
Alternatively or in parallel, you can also prefetch a flexibly-defined list or range of objects from any given remote bucket, as described in this readme.
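A hedged CLI sketch (bucket and prefix are placeholders; depending on the CLI version, the command may also be spelled `ais start prefetch`):

```console
# prefetch, in the background, all remote objects under the given prefix
$ ais prefetch s3://abc --prefix train/
```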
But what if the dataset in question exists in the form of (vanilla) HTTP/HTTPS URL(s)? What if there's a popular bucket in, say, Google Cloud that contains images that you'd like to bring over into your Data Center and make available locally for AI researchers?
For these and similar use cases we have AIS Downloader - an integrated tool that can execute massive download requests, track their progress, and populate AIStore directly from the Internet.
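A minimal sketch of a download job (URL, template, and destination bucket are hypothetical; see the downloader documentation for the authoritative syntax):

```console
# download a range of shards from a vanilla HTTPS location into an AIS bucket
$ ais start download "https://example.com/dataset/shard-{0000..0999}.tar" ais://mydataset/
```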
AIS can also promote files and directories to objects. The operation entails synchronous or asynchronous massively-parallel downloading of any accessible file source, including:
- a local directory (or directories) of any target node (or nodes);
- a file share mounted on one or several (or all) target nodes in the cluster.
You can now use `promote` (CLI, API) to populate AIS datasets with any external file source.
Originally (experimentally) introduced in v3.0 to handle "files and directories colocated within AIS storage target machines", `promote` has been redefined, extended (in terms of supported options and permutations), and completely reworked in v3.9.
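A hedged CLI sketch (paths and bucket are hypothetical; exact options vary by release):

```console
# promote a directory accessible on the target node(s) - e.g., an NFS mount - into objects of the given bucket
$ ais object promote /mnt/nfs/imagenet ais://imagenet
```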
AIS supports end-to-end checksumming and two distinct storage services - N-way mirroring and erasure coding - providing for data redundancy.
The functionality that we denote as end-to-end checksumming further entails:
- self-healing upon detecting corruption,
- optimizing-out redundant writes upon detecting existence of the destination object,
- utilizing client-provided checksum (iff provided) to perform end-to-end checksum validation,
- utilizing Cloud checksum of an object that originated in a Cloud bucket, and
- utilizing the object's Cloud version to determine whether a so-called "cold" GET is needed when the object exists both in AIS and in the Cloud,
and more.
Needless to say, each of these sub-topics may require additional discussion of:
- configurable options,
- default settings, and
- the corresponding performance tradeoffs.
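For instance, checksumming is configured per bucket; a minimal sketch (bucket name is hypothetical, property names per the bucket-properties reference):

```console
# switch the bucket's checksum type (xxhash is the default) and validate checksums on cold GET
$ ais bucket props set ais://mybucket checksum.type=sha256 checksum.validate_cold_get=true
```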
When an AIS bucket is EC-configured as (D, P), where D is the number of data slices and P - the number of parity slices, the corresponding space utilization ratio is not `(D + P)/D`, as one would assume. It is, actually, `1 + (D + P)/D`.
This is because AIS was created to perform and scale in the first place. AIS always keeps one full replica at its HRW location.
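For example, with D = 6 and P = 2, each object is kept as one full replica plus 8 erasure-coded slices that together occupy roughly (6 + 2)/6 of the object's size - for a total space utilization of about 1 + 8/6 ≈ 2.33x, versus the ≈ 1.33x of conventional erasure coding without the extra full replica.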
AIS will utilize EC to automatically self-heal upon detecting corruption (of the full replica). When a client reads an object that turns out to be missing (not found) at its home location, AIS will attempt to restore it from its EC slices - assuming, obviously, that the bucket is erasure coded.
EC-related philosophy can be summarized as one word: recovery. EC plays no part in the fast path.
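Erasure coding, too, is enabled per bucket; a hedged CLI sketch (bucket name is hypothetical):

```console
# configure the bucket for EC with 6 data slices and 2 parity slices
$ ais bucket props set ais://mybucket ec.enabled=true ec.data_slices=6 ec.parity_slices=2
```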
The scale-out category includes balanced and fair distribution of objects, where each storage target will store (via a variant of consistent hashing) 1/Nth of the entire namespace - with the total number of objects unlimited by design.
AIS cluster capability to scale-out is truly unlimited. The real-life limitations can only be imposed by the environment - capacity of a given Data Center, for instance.
Similar to the AIS gateways, AIS storage targets can join and leave at any moment causing the cluster to rebalance itself in the background and without downtime.
Architecture-wise, aistore is built to support 3 (three) logical networks:
- user-facing public (and, possibly, multi-home) network interface
- intra-cluster control, and
- intra-cluster data
Here is how the corresponding configuration may look in production:
```console
$ ais config node t[nKfooBE] local h... <TAB-TAB>
host_net.hostname               host_net.port_intra_control     host_net.hostname_intra_control
host_net.port                   host_net.port_intra_data        host_net.hostname_intra_data

$ ais config node t[nKfooBE] local host_net --json

    "host_net": {
        "hostname": "10.50.56.205",
        "hostname_intra_control": "ais-target-27.ais.svc.cluster.local",
        "hostname_intra_data": "ais-target-27.ais.svc.cluster.local",
        "port": "51081",
        "port_intra_control": "51082",
        "port_intra_data": "51083"
    }
```
The fact that there are 3 logical networks is not a limitation - i.e., not a requirement to have exactly 3 (networks).
Using the example above, here's a small deployment-time change to run a single one:
"host_net": {
"hostname": "10.50.56.205",
"hostname_intra_control": "ais-target-27.ais.svc.cluster.local",
"hostname_intra_data": "ais-target-27.ais.svc.cluster.local",
"port": "51081",
"port_intra_control": "51081, # <<<<<< notice the same port
"port_intra_data": "51081" # <<<<<< ditto
}
Ideally though, production clusters are deployed over 3 physically different and isolated networks, whereby intense data traffic, for instance, does not introduce additional latency for the control one, etc.
Separately, there's a multi-homing capability motivated by the fact that today's server systems may often have, say, two 50Gbps network adapters. To deliver the entire 100Gbps without LACP trunking and (static) teaming, we could simply have something like:
"host_net": {
"hostname": "10.50.56.205, 10.50.56.206",
"hostname_intra_control": "ais-target-27.ais.svc.cluster.local",
"hostname_intra_data": "ais-target-27.ais.svc.cluster.local",
"port": "51081",
"port_intra_control": "51082",
"port_intra_data": "51083"
}
And that's all: add the second NIC (second IPv4 addr `10.50.56.206` above) with no other changes.
AIS features a highly-available control plane where all gateways are absolutely identical in terms of their (client-accessible) data and control plane APIs.
Gateways can be ad hoc added and removed, deployed remotely and/or locally to the compute clients (the latter option will eliminate one network roundtrip to resolve object locations).
AIS can be deployed as a fast tier in front of any of the multiple supported backends.
As a fast tier, AIS populates itself on demand (via cold GETs) and/or via its own prefetch API (see List/Range Operations) that runs in the background to download batches of objects.
The (quickly growing) list of services includes (but is not limited to):
- health monitoring and recovery
- range read
- dry-run (to measure raw network and disk performance)
- performance and capacity monitoring with full observability via StatsD/Grafana
- load balancing
Load balancing here means optimal selection of a local object replica and, therefore, requires buckets configured for local mirroring.
Most notably, AIStore provides dSort - a MapReduce layer that performs a wide variety of user-defined merge/sort transformations on large datasets used for/by deep learning applications.
Dsort "views" AIS objects as named shards that comprise archived key/value data. In its 1.0 realization, dSort supports tar, zip, and tar-gzip formats and a variety of built-in sorting algorithms; it is designed, though, to incorporate other popular archival formats including `tf.Record` and `tf.Example` (TensorFlow) and MessagePack. The user runs dSort by specifying an input dataset, a by-key or by-value (i.e., by content) sorting algorithm, and a desired size of the resulting shards. The rest is done automatically and in parallel by the AIS storage targets, with no part of the processing involving single-host centralization, and with dSort stages and progress-within-stage that can be monitored via user-friendly statistics.
By design, dSort tightly integrates with AIS object storage to take full advantage of the combined clustered CPU and IOPS. Each dSort job (note that multiple jobs can execute in parallel) generates a massively-parallel intra-cluster workload where each AIS target communicates with all other targets and executes a proportional "piece" of the job. This ultimately results in a transformed dataset optimized for subsequent training and inference by deep learning apps.
AIStore includes an easy-to-use management-and-monitoring facility called AIS CLI. Once installed, to start using it, simply execute:
```console
$ export AIS_ENDPOINT=http://ais-gateway:port
$ ais --help
```
where `ais-gateway:port` (above) denotes a `hostname:port` address of any AIS gateway (for developers it'll often be `localhost:8080`). Needless to say, the "exporting" must be done only once.
One salient feature of AIS CLI is its Bash style auto-completions that allow users to easily navigate supported operations and options by simply pressing the TAB key:
AIS CLI is under active development. For more information, please see the project's own README.
AIStore is a hyper-converged architecture tailored specifically to run extract-transform-load workloads - run them close to data and on (and by) all storage nodes in parallel:
There are no designed-in limitations on the:
- object sizes
- total number of objects and buckets in AIS cluster
- number of objects in a single AIS bucket
- numbers of gateways (proxies) and storage targets in AIS cluster
Ultimately, the limit on object size may be imposed by the local filesystem of choice and the physical disk capacity, while the limit on cluster size is imposed by the capacity of the hosting data center. But AIS itself does not impose any limitations whatsoever.