Hierarchical bandwidth and operations limits.
Introduce six new properties: iolimit_{bw,op}_{read,write,total}.

The iolimit_bw_* properties limit the read, write, or combined bandwidth,
respectively, that a dataset and its descendants can consume.
Limits are applied to both file systems and ZFS volumes.

The configured limits are hierarchical, just like quotas; i.e., even if
a higher limit is configured on the child dataset, the parent's lower
limit will be enforced.
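
The hierarchical rule above can be illustrated with a minimal sketch. It assumes a hypothetical `struct ds_node` (not the real `dsl_dir_t`) and treats a limit of 0 as "none", which matches how the libzfs change in this commit prints a zero value. It only computes the ceiling that applies to one dataset; because a limit covers a dataset and its descendants, siblings would still share an ancestor's budget.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a dataset node; not the real dsl_dir_t. */
struct ds_node {
	const struct ds_node *parent;	/* NULL at the top of the hierarchy */
	uint64_t limit;			/* configured limit, 0 means "none" */
};

/*
 * Return the lowest configured limit on the path from ds up to the root,
 * or 0 if no limit is configured anywhere on that path.
 */
static uint64_t
effective_limit(const struct ds_node *ds)
{
	uint64_t lowest = 0;

	for (; ds != NULL; ds = ds->parent) {
		if (ds->limit != 0 && (lowest == 0 || ds->limit < lowest))
			lowest = ds->limit;
	}
	return (lowest);
}
```

With a parent limited to 100 and a child configured with 500, the child's effective ceiling in this sketch is 100.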

The limits are applied at the VFS level, not at the disk level.
The dataset is charged for each operation even if no disk access is
required (e.g., due to caching, compression, deduplication,
or NOP writes) or if the operation will cause more traffic (due to
the copies property, mirroring, or RAIDZ).

Read bandwidth consumption is based on:

- read-like syscalls, e.g., aio_read(2), pread(2), preadv(2), read(2),
  readv(2), sendfile(2)

- syscalls like getdents(2) and getdirentries(2)

- reading via mmapped files

- zfs send

Write bandwidth consumption is based on:

- write-like syscalls, e.g., aio_write(2), pwrite(2), pwritev(2),
  write(2), writev(2)

- writing via mmapped files

- zfs receive

The iolimit_op_* properties limit the read, write, or combined metadata
operations, respectively, that a dataset and its descendants can generate.

Read operations consumption is based on:

- read-like syscalls where the number of operations is equal to the
  number of blocks being read (never less than 1)

- reading via mmapped files, where the number of operations is equal
  to the number of pages being read (never less than 1)

- syscalls accessing metadata: readlink(2), stat(2)

Write operations consumption is based on:

- write-like syscalls where the number of operations is equal to the
  number of blocks being written (never less than 1)

- writing via mmapped files, where the number of operations is equal
  to the number of pages being written (never less than 1)

- syscalls modifying a directory's content: bind(2) (UNIX-domain
  sockets), link(2), mkdir(2), mkfifo(2), mknod(2), open(2) (file
  creation), rename(2), rmdir(2), symlink(2), unlink(2)

- syscalls modifying metadata: chflags(2), chmod(2), chown(2),
  utimes(2)

- updating the access time of a file when reading it

Just like iolimit_bw_* limits, the iolimit_op_* limits are also
hierarchical and applied at the VFS level.
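
The per-block and per-page operation counts described above can be sketched as follows. The helper name `iolimit_op_count` is hypothetical, and rounding a partial block or page up to a full operation is an assumption; the message only states that the count equals the number of blocks or pages and is never less than 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of operations charged for a transfer of
 * `bytes` bytes, where `unitsize` is the block size (syscalls) or the
 * page size (mmapped access).
 */
static uint64_t
iolimit_op_count(size_t unitsize, size_t bytes)
{
	uint64_t ops;

	assert(unitsize > 0);
	/* Assumption: round up; done this way to avoid overflow. */
	ops = bytes / unitsize + (bytes % unitsize != 0 ? 1 : 0);
	/* Even a zero-length request is charged one operation. */
	return (ops == 0 ? 1 : ops);
}
```

Under these assumptions, a 1 MiB read from a dataset with 128 KiB blocks would be charged 8 operations, and a 1-byte read would be charged 1.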

Signed-off-by: Pawel Jakub Dawidek <[email protected]>
pjd committed Nov 19, 2024
1 parent ff3df12 commit f20f80c
Showing 62 changed files with 4,747 additions and 70 deletions.
18 changes: 14 additions & 4 deletions cmd/zfs/zfs_main.c
@@ -2490,15 +2490,25 @@ zfs_do_inherit(int argc, char **argv)
 if (!zfs_prop_inheritable(prop) && !received) {
 	(void) fprintf(stderr, gettext("'%s' property cannot "
 	    "be inherited\n"), propname);
-	if (prop == ZFS_PROP_QUOTA ||
-	    prop == ZFS_PROP_RESERVATION ||
-	    prop == ZFS_PROP_REFQUOTA ||
-	    prop == ZFS_PROP_REFRESERVATION) {
+	switch (prop) {
+	case ZFS_PROP_QUOTA:
+	case ZFS_PROP_RESERVATION:
+	case ZFS_PROP_REFQUOTA:
+	case ZFS_PROP_REFRESERVATION:
+	case ZFS_PROP_IOLIMIT_BW_READ:
+	case ZFS_PROP_IOLIMIT_BW_WRITE:
+	case ZFS_PROP_IOLIMIT_BW_TOTAL:
+	case ZFS_PROP_IOLIMIT_OP_READ:
+	case ZFS_PROP_IOLIMIT_OP_WRITE:
+	case ZFS_PROP_IOLIMIT_OP_TOTAL:
 		(void) fprintf(stderr, gettext("use 'zfs set "
 		    "%s=none' to clear\n"), propname);
 		(void) fprintf(stderr, gettext("use 'zfs "
 		    "inherit -S %s' to revert to received "
 		    "value\n"), propname);
+		break;
+	default:
+		break;
 	}
 	return (1);
 }
1 change: 1 addition & 0 deletions include/Makefile.am
@@ -131,6 +131,7 @@ COMMON_H = \
 	sys/zfs_file.h \
 	sys/zfs_fuid.h \
 	sys/zfs_impl.h \
+	sys/zfs_iolimit.h \
 	sys/zfs_project.h \
 	sys/zfs_quota.h \
 	sys/zfs_racct.h \
1 change: 1 addition & 0 deletions include/os/freebsd/spl/sys/systm.h
@@ -39,5 +39,6 @@
 #define PAGEMASK (~PAGEOFFSET)

 #define delay(x) pause("soldelay", (x))
+#define delay_sig(x) (pause_sig("soldelay", (x)) != EAGAIN)

 #endif /* _OPENSOLARIS_SYS_SYSTM_H_ */
9 changes: 6 additions & 3 deletions include/os/freebsd/zfs/sys/zfs_znode_impl.h
@@ -168,9 +168,12 @@ zfs_exit(zfsvfs_t *zfsvfs, const char *tag)
 	(tp)->tv_sec = (time_t)(stmp)[0]; \
 	(tp)->tv_nsec = (long)(stmp)[1]; \
 }
-#define ZFS_ACCESSTIME_STAMP(zfsvfs, zp) \
-	if ((zfsvfs)->z_atime && !((zfsvfs)->z_vfs->vfs_flag & VFS_RDONLY)) \
-		zfs_tstamp_update_setup_ext(zp, ACCESSED, NULL, NULL, B_FALSE);
+#define ZFS_ACCESSTIME_STAMP(zfsvfs, zp) do { \
+	if ((zfsvfs)->z_atime && !((zfsvfs)->z_vfs->vfs_flag & VFS_RDONLY)) { \
+		zfs_iolimit_metadata_write((zfsvfs)->z_os); \
+		zfs_tstamp_update_setup_ext(zp, ACCESSED, NULL, NULL, B_FALSE);\
+	} \
+} while (0)

 extern void zfs_tstamp_update_setup_ext(struct znode *,
     uint_t, uint64_t [2], uint64_t [2], boolean_t have_tx);
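
The hunk above wraps the macro body in `do { ... } while (0)`. The sketch below shows, on a simplified and hypothetical macro (`STAMP_IF`, with a plain counter standing in for the real timestamp update), why that wrapper matters once a macro expands to more than a single expression: the expansion then behaves as exactly one statement, so it can sit in an unbraced `if`/`else`.

```c
#include <stdbool.h>

/* Counts updates; stands in for the real timestamp update call. */
static int stamps;

/* Hypothetical simplified macro in the wrapped form used by the new code. */
#define	STAMP_IF(enabled) do {						\
	if (enabled)							\
		stamps++;						\
} while (0)

/*
 * Because the macro expands to a single statement, the else below binds
 * to the outer if. With a bare "if (enabled) stamps++;" expansion, the
 * caller's trailing semicolon would terminate the outer if and the else
 * would no longer compile.
 */
static int
read_path(bool is_read, bool atime_enabled)
{
	if (is_read)
		STAMP_IF(atime_enabled);
	else
		return (-1);
	return (0);
}
```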
3 changes: 2 additions & 1 deletion include/os/linux/spl/sys/timer.h
@@ -50,7 +50,8 @@
 #define ddi_time_before_eq64(a, b) (!ddi_time_after64(a, b))
 #define ddi_time_after_eq64(a, b) ddi_time_before_eq64(b, a)

-#define delay(ticks) schedule_timeout_uninterruptible(ticks)
+#define delay(ticks) schedule_timeout_uninterruptible(ticks)
+#define delay_sig(ticks) (schedule_timeout_interruptible(ticks) > 0)

 #define SEC_TO_TICK(sec) ((sec) * HZ)
 #define MSEC_TO_TICK(ms) msecs_to_jiffies(ms)
6 changes: 6 additions & 0 deletions include/sys/dsl_dir.h
@@ -42,6 +42,7 @@ extern "C" {
 #endif

 struct dsl_dataset;
+struct zfs_iolimit;
 struct zthr;
 /*
  * DD_FIELD_* are strings that are used in the "extensified" dsl_dir zap object.
@@ -127,6 +128,10 @@ struct dsl_dir {
 	boolean_t dd_activity_cancelled;
 	uint64_t dd_activity_waiters;

+	/* protected by spa_iolimit_lock */
+	struct zfs_iolimit *dd_iolimit;
+	dsl_dir_t *dd_iolimit_root;
+
 	/* protected by dd_lock; keep at end of struct for better locality */
 	char dd_myname[ZFS_MAX_DATASET_NAME_LEN];
 };
@@ -182,6 +187,7 @@ int dsl_dir_set_quota(const char *ddname, zprop_source_t source,
     uint64_t quota);
 int dsl_dir_set_reservation(const char *ddname, zprop_source_t source,
     uint64_t reservation);
+int dsl_dir_set_iolimit(const char *dsname, zfs_prop_t prop, uint64_t value);
 int dsl_dir_activate_fs_ss_limit(const char *);
 int dsl_fs_ss_limit_check(dsl_dir_t *, uint64_t, zfs_prop_t, dsl_dir_t *,
     cred_t *, proc_t *);
6 changes: 6 additions & 0 deletions include/sys/fs/zfs.h
@@ -197,6 +197,12 @@ typedef enum {
 	ZFS_PROP_VOLTHREADING,
 	ZFS_PROP_DIRECT,
 	ZFS_PROP_LONGNAME,
+	ZFS_PROP_IOLIMIT_BW_READ,
+	ZFS_PROP_IOLIMIT_BW_WRITE,
+	ZFS_PROP_IOLIMIT_BW_TOTAL,
+	ZFS_PROP_IOLIMIT_OP_READ,
+	ZFS_PROP_IOLIMIT_OP_WRITE,
+	ZFS_PROP_IOLIMIT_OP_TOTAL,
 	ZFS_NUM_PROPS
 } zfs_prop_t;

2 changes: 2 additions & 0 deletions include/sys/spa_impl.h
@@ -463,6 +463,8 @@ struct spa {
 	uint64_t spa_leaf_list_gen; /* track leaf_list changes */
 	uint32_t spa_hostid; /* cached system hostid */

+	rrmlock_t spa_iolimit_lock;
+
 	/* synchronization for threads in spa_wait */
 	kmutex_t spa_activities_lock;
 	kcondvar_t spa_activities_cv;
73 changes: 73 additions & 0 deletions include/sys/zfs_iolimit.h
@@ -0,0 +1,73 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/

/*
* Copyright (c) 2024 The FreeBSD Foundation
*
* This software was developed by Pawel Dawidek <[email protected]>
* under sponsorship from the FreeBSD Foundation.
*/

#ifndef _SYS_ZFS_IOLIMIT_H
#define _SYS_ZFS_IOLIMIT_H

#include <sys/dmu_objset.h>

#ifdef __cplusplus
extern "C" {
#endif

struct zfs_iolimit;

#define ZFS_IOLIMIT_BW_READ 0
#define ZFS_IOLIMIT_BW_WRITE 1
#define ZFS_IOLIMIT_BW_TOTAL 2
#define ZFS_IOLIMIT_OP_READ 3
#define ZFS_IOLIMIT_OP_WRITE 4
#define ZFS_IOLIMIT_OP_TOTAL 5
#define ZFS_IOLIMIT_FIRST ZFS_IOLIMIT_BW_READ
#define ZFS_IOLIMIT_LAST ZFS_IOLIMIT_OP_TOTAL
#define ZFS_IOLIMIT_NTYPES (ZFS_IOLIMIT_LAST + 1)

int zfs_iolimit_prop_to_type(zfs_prop_t prop);
zfs_prop_t zfs_iolimit_type_to_prop(int type);

struct zfs_iolimit *zfs_iolimit_alloc(const uint64_t *limits);
void zfs_iolimit_free(struct zfs_iolimit *iol);
struct zfs_iolimit *zfs_iolimit_set(struct zfs_iolimit *iol, zfs_prop_t prop,
uint64_t limit);

int zfs_iolimit_data_read(objset_t *os, size_t blocksize, size_t bytes);
int zfs_iolimit_data_write(objset_t *os, size_t blocksize, size_t bytes);
int zfs_iolimit_data_copy(objset_t *srcos, objset_t *dstos, size_t blocksize,
size_t bytes);
int zfs_iolimit_metadata_read(objset_t *os);
int zfs_iolimit_metadata_write(objset_t *os);

void zfs_iolimit_data_read_spin(objset_t *os, size_t blocksize, size_t bytes);
void zfs_iolimit_data_write_spin(objset_t *os, size_t blocksize,
size_t bytes);

#ifdef __cplusplus
}
#endif

#endif /* _SYS_ZFS_IOLIMIT_H */
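
The header declares `zfs_iolimit_prop_to_type()` and `zfs_iolimit_type_to_prop()` but this page does not show their bodies. One way such a mapping could work, given that the six properties and the six type indices are listed in the same order, is plain offset arithmetic. The sketch below is an assumption, not the commit's implementation; `sketch_prop_t` and the `sketch_*` functions are hypothetical stand-ins so the example compiles on its own.

```c
#include <assert.h>

/* Type indices, copied from the new header. */
#define	ZFS_IOLIMIT_BW_READ	0
#define	ZFS_IOLIMIT_BW_WRITE	1
#define	ZFS_IOLIMIT_BW_TOTAL	2
#define	ZFS_IOLIMIT_OP_READ	3
#define	ZFS_IOLIMIT_OP_WRITE	4
#define	ZFS_IOLIMIT_OP_TOTAL	5
#define	ZFS_IOLIMIT_FIRST	ZFS_IOLIMIT_BW_READ
#define	ZFS_IOLIMIT_LAST	ZFS_IOLIMIT_OP_TOTAL

/*
 * Hypothetical stand-in for the tail of zfs_prop_t. Only the relative
 * order matters; 100 is the value libzfs.abi records in this commit.
 */
typedef enum {
	PROP_IOLIMIT_BW_READ = 100,
	PROP_IOLIMIT_BW_WRITE,
	PROP_IOLIMIT_BW_TOTAL,
	PROP_IOLIMIT_OP_READ,
	PROP_IOLIMIT_OP_WRITE,
	PROP_IOLIMIT_OP_TOTAL
} sketch_prop_t;

/* Map a property to its type index by offset from the first property. */
static int
sketch_prop_to_type(sketch_prop_t prop)
{
	assert(prop >= PROP_IOLIMIT_BW_READ && prop <= PROP_IOLIMIT_OP_TOTAL);
	return ((int)prop - (int)PROP_IOLIMIT_BW_READ + ZFS_IOLIMIT_FIRST);
}

/* Inverse mapping: type index back to the property. */
static sketch_prop_t
sketch_type_to_prop(int type)
{
	assert(type >= ZFS_IOLIMIT_FIRST && type <= ZFS_IOLIMIT_LAST);
	return ((sketch_prop_t)(type - ZFS_IOLIMIT_FIRST +
	    (int)PROP_IOLIMIT_BW_READ));
}
```

Offset arithmetic only stays correct while both lists keep the same order; an explicit `switch` would be the more defensive choice if the two could drift apart.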
8 changes: 7 additions & 1 deletion lib/libzfs/libzfs.abi
@@ -2049,7 +2049,13 @@
   <enumerator name='ZFS_PROP_VOLTHREADING' value='97'/>
   <enumerator name='ZFS_PROP_DIRECT' value='98'/>
   <enumerator name='ZFS_PROP_LONGNAME' value='99'/>
-  <enumerator name='ZFS_NUM_PROPS' value='100'/>
+  <enumerator name='ZFS_PROP_IOLIMIT_BW_READ' value='100'/>
+  <enumerator name='ZFS_PROP_IOLIMIT_BW_WRITE' value='101'/>
+  <enumerator name='ZFS_PROP_IOLIMIT_BW_TOTAL' value='102'/>
+  <enumerator name='ZFS_PROP_IOLIMIT_OP_READ' value='103'/>
+  <enumerator name='ZFS_PROP_IOLIMIT_OP_WRITE' value='104'/>
+  <enumerator name='ZFS_PROP_IOLIMIT_OP_TOTAL' value='105'/>
+  <enumerator name='ZFS_NUM_PROPS' value='106'/>
 </enum-decl>
 <typedef-decl name='zfs_prop_t' type-id='4b000d60' id='58603c44'/>
 <enum-decl name='zprop_source_t' naming-typedef-id='a2256d42' id='5903f80e'>
33 changes: 31 additions & 2 deletions lib/libzfs/libzfs_dataset.c
@@ -59,6 +59,7 @@
 #include <sys/spa.h>
 #include <sys/zap.h>
 #include <sys/dsl_crypt.h>
+#include <sys/zfs_iolimit.h>
 #include <libzfs.h>
 #include <libzutil.h>

@@ -2287,6 +2288,12 @@ get_numeric_property(zfs_handle_t *zhp, zfs_prop_t prop, zprop_source_t *src,
 case ZFS_PROP_SNAPSHOT_LIMIT:
 case ZFS_PROP_FILESYSTEM_COUNT:
 case ZFS_PROP_SNAPSHOT_COUNT:
+case ZFS_PROP_IOLIMIT_BW_READ:
+case ZFS_PROP_IOLIMIT_BW_WRITE:
+case ZFS_PROP_IOLIMIT_BW_TOTAL:
+case ZFS_PROP_IOLIMIT_OP_READ:
+case ZFS_PROP_IOLIMIT_OP_WRITE:
+case ZFS_PROP_IOLIMIT_OP_TOTAL:
 	*val = getprop_uint64(zhp, prop, source);

 	if (*source == NULL) {
@@ -2811,12 +2818,15 @@ zfs_prop_get(zfs_handle_t *zhp, zfs_prop_t prop, char *propbuf, size_t proplen,
 case ZFS_PROP_REFQUOTA:
 case ZFS_PROP_RESERVATION:
 case ZFS_PROP_REFRESERVATION:
+case ZFS_PROP_IOLIMIT_BW_READ:
+case ZFS_PROP_IOLIMIT_BW_WRITE:
+case ZFS_PROP_IOLIMIT_BW_TOTAL:

 	if (get_numeric_property(zhp, prop, src, &source, &val) != 0)
 		return (-1);
 	/*
-	 * If quota or reservation is 0, we translate this into 'none'
-	 * (unless literal is set), and indicate that it's the default
+	 * If the value is 0, we translate this into 'none' (unless
+	 * literal is set), and indicate that it's the default
 	 * value. Otherwise, we print the number nicely and indicate
 	 * that its set locally.
 	 */
@@ -2835,6 +2845,25 @@ zfs_prop_get(zfs_handle_t *zhp, zfs_prop_t prop, char *propbuf, size_t proplen,
 	zcp_check(zhp, prop, val, NULL);
 	break;

+case ZFS_PROP_IOLIMIT_OP_READ:
+case ZFS_PROP_IOLIMIT_OP_WRITE:
+case ZFS_PROP_IOLIMIT_OP_TOTAL:
+
+	if (get_numeric_property(zhp, prop, src, &source, &val) != 0)
+		return (-1);
+	/*
+	 * If the value is 0, we translate this into 'none', unless
+	 * literal is set.
+	 */
+	if (val == 0 && !literal) {
+		(void) strlcpy(propbuf, "none", proplen);
+	} else {
+		(void) snprintf(propbuf, proplen, "%llu",
+		    (u_longlong_t)val);
+	}
+	zcp_check(zhp, prop, val, NULL);
+	break;
+
 case ZFS_PROP_FILESYSTEM_LIMIT:
 case ZFS_PROP_SNAPSHOT_LIMIT:
 case ZFS_PROP_FILESYSTEM_COUNT:
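
The display rule added for the iolimit_op_* properties in the hunk above can be tried in isolation with a standalone sketch. The function name `format_op_limit` is hypothetical, and `snprintf` replaces `strlcpy` here only so the example builds on platforms that lack `strlcpy`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper mirroring the new case: a value of 0 is shown as
 * "none" unless the caller asked for literal output; any other value is
 * printed as a plain decimal count.
 */
static void
format_op_limit(uint64_t val, bool literal, char *buf, size_t buflen)
{
	assert(buf != NULL && buflen > 0);
	if (val == 0 && !literal)
		(void) snprintf(buf, buflen, "%s", "none");
	else
		(void) snprintf(buf, buflen, "%llu", (unsigned long long)val);
}
```

Unlike the bandwidth properties, which share the quota-style branch that prints sizes "nicely", the operation counts in this branch are always printed as plain integers.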
1 change: 1 addition & 0 deletions lib/libzpool/Makefile.am
@@ -180,6 +180,7 @@ nodist_libzpool_la_SOURCES = \
 	module/zfs/zfs_chksum.c \
 	module/zfs/zfs_fm.c \
 	module/zfs/zfs_fuid.c \
+	module/zfs/zfs_iolimit.c \
 	module/zfs/zfs_ratelimit.c \
 	module/zfs/zfs_rlock.c \
 	module/zfs/zfs_sa.c \
111 changes: 111 additions & 0 deletions man/man7/zfsprops.7
@@ -1236,6 +1236,117 @@ and the minimum is
.Sy 100000 .
This property may be changed with
.Nm zfs Cm change-key .
.It Sy iolimit_bw_read Ns = Ns Ar size Ns | Ns Sy none
.It Sy iolimit_bw_write Ns = Ns Ar size Ns | Ns Sy none
.It Sy iolimit_bw_total Ns = Ns Ar size Ns | Ns Sy none
Limits the read, write, or combined bandwidth, respectively, that a dataset and
its descendants can consume.
Limits are applied to file systems, volumes and their snapshots.
Bandwidth limits are in bytes per second.
.Pp
The configured limits are hierarchical, just like quotas; i.e., even if a
higher limit is configured on the child dataset, the parent's lower limit will
be enforced.
.Pp
The limits are applied at the VFS level, not at the disk level.
The dataset is charged for each operation even if no disk access is required
(e.g., due to caching, compression, deduplication, or NOP writes) or if the
operation will cause more traffic (due to the copies property, mirroring,
or RAIDZ).
.Pp
Read bandwidth consumption is based on:
.Bl -bullet
.It
read-like syscalls, e.g.,
.Xr aio_read 2 ,
.Xr copy_file_range 2 ,
.Xr pread 2 ,
.Xr preadv 2 ,
.Xr read 2 ,
.Xr readv 2 ,
.Xr sendfile 2
.It
syscalls like
.Xr getdents 2
and
.Xr getdirentries 2
.It
reading via mmapped files
.It
.Nm zfs Cm send
.El
.Pp
Write bandwidth consumption is based on:
.Bl -bullet
.It
write-like syscalls, e.g.,
.Xr aio_write 2 ,
.Xr copy_file_range 2 ,
.Xr pwrite 2 ,
.Xr pwritev 2 ,
.Xr write 2 ,
.Xr writev 2
.It
writing via mmapped files
.It
.Nm zfs Cm receive
.El
.It Sy iolimit_op_read Ns = Ns Ar count Ns | Ns Sy none
.It Sy iolimit_op_write Ns = Ns Ar count Ns | Ns Sy none
.It Sy iolimit_op_total Ns = Ns Ar count Ns | Ns Sy none
Limits the read, write, or combined metadata operations, respectively, that a
dataset and its descendants can generate.
Limits are expressed as a number of operations per second.
.Pp
Read operations consumption is based on:
.Bl -bullet
.It
read-like syscalls where the number of operations is equal to the number of
blocks being read (never less than 1)
.It
reading via mmapped files, where the number of operations is equal to the
number of pages being read (never less than 1)
.It
syscalls accessing metadata:
.Xr readlink 2 ,
.Xr stat 2
.El
.Pp
Write operations consumption is based on:
.Bl -bullet
.It
write-like syscalls where the number of operations is equal to the number of
blocks being written (never less than 1)
.It
writing via mmapped files, where the number of operations is equal to the
number of pages being written (never less than 1)
.It
syscalls modifying a directory's content:
.Xr bind 2 (UNIX-domain sockets) ,
.Xr link 2 ,
.Xr mkdir 2 ,
.Xr mkfifo 2 ,
.Xr mknod 2 ,
.Xr open 2 (file creation) ,
.Xr rename 2 ,
.Xr rmdir 2 ,
.Xr symlink 2 ,
.Xr unlink 2
.It
syscalls modifying metadata:
.Xr chflags 2 ,
.Xr chmod 2 ,
.Xr chown 2 ,
.Xr utimes 2
.It
updating the access time of a file when reading it
.El
.Pp
Just like
.Sy iolimit_bw
limits, the
.Sy iolimit_op
limits are also hierarchical and applied at the VFS level.
.It Sy exec Ns = Ns Sy on Ns | Ns Sy off
Controls whether processes can be executed from within this file system.
The default value is