From 6caac143bf2b8d29a4cacedfacdeff692289d7d4 Mon Sep 17 00:00:00 2001
From: Iris Wagner

Deployment Overview
Download¶
You can download the bundle from here: 935553.probe.bundle.tar.gz and scp or sftp to the Linux client that you wish to run the probe.
-Or you may choose to use wget to download the bundle directly on the Linux client that you wish to run the probe:
-Americas:
+Download using sftp to the Linux client that you wish to run the probe.
-
wget "https://vastdatasupport.blob.core.windows.net/probe/935553.probe.bundle.tar.gz?sp=r&st=2023-08-08T14:31:50Z&se=2024-08-01T22:31:50Z&spr=https&sv=2022-11-02&sr=b&sig=5zlspUUlKFfwHr%2BpDaE1UCWWeJMZLpCFHq5VQpujbHY%3D" -O 935553.probe.bundle.tar.gz
+
+% sftp gl4f_probe@halo.storagelr5.ext.hpe.com:/935553.probe.bundle.tar.gz .
+The authenticity of host 'halo.storagelr5.ext.hpe.com (63.215.98.146)' can't be established.
-Europe:
-
wget "https://vastdatasupporteuwest.blob.core.windows.net/probe/935553.probe.bundle.tar.gz?sp=r&st=2023-04-24T15:46:51Z&se=2026-04-24T23:46:51Z&spr=https&sv=2021-12-02&sr=b&sig=Nska0jbs3Fz%2BrX7IavgLBq8lZeoLyo3n2sQ%2Bz3CrdOM%3D" -O 935553.probe.bundle.tar.gz
-
-Asia/Pacific:
-
wget "https://vastsupportjapanwest.blob.core.windows.net/probe/935553.probe.bundle.tar.gz?sp=r&st=2023-04-24T15:47:22Z&se=2026-04-24T23:47:22Z&spr=https&sv=2021-12-02&sr=b&sig=EURsk5b4LHKM%2Bk32qyVMiab%2FZXnfIodpDiTCm5wB%2F1w%3D" -O 935553.probe.bundle.tar.gz
-
-South Africa:
-
wget "https://vastsupportsanorth.blob.core.windows.net/probe/935553.probe.bundle.tar.gz?sp=r&st=2023-04-24T15:46:13Z&se=2026-04-24T23:46:13Z&spr=https&sv=2021-12-02&sr=b&sig=mw%2BBoG6YYy7TTA%2B8ga5zzfWZDdOwjJ9al6ot2Z7b6wQ%3D" -O 935553.probe.bundle.tar.gz
+Type in password: HPE@cc3$$4SFTP
+
+gl4f_probe@halo.storagelr5.ext.hpe.com's password:
+Connected to halo.storagelr5.ext.hpe.com.
+Fetching /935553.probe.bundle.tar.gz to ./935553.probe.bundle.tar.gz
Expand & Verify Download¶
Now that you've downloaded the probe, you'll need to untar it and then verify the download is correct.
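For example, the expand-and-verify steps described in this section come down to the following commands (build number 935553 as shipped with this bundle; adjust PROBE_BUILD if yours differs):

export PROBE_BUILD=935553
tar -xzf ${PROBE_BUILD}.probe.bundle.tar.gz
ls -l

The listing should show the bundle itself, the probe image (${PROBE_BUILD}.probe.image.gz), and probe_launcher.py.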
@@ -275,6 +266,49 @@ Re-Running The Probe Troubleshooting¶
Refer to the Troubleshooting document and contact HPE Support.
+
+
diff --git a/search/search_index.json b/search/search_index.json
index 1d495b7..acef846 100644
--- a/search/search_index.json
+++ b/search/search_index.json
@@ -1 +1 @@
-{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"index.html","text":"HPE GreenLake for File Storage Data Reduction Estimation Probe \u00b6 This tool is designed to run against an existing one or more File Storage mounts and provide an accurate estimation of how much data reduction you should expect to see when moving your data set to an HPE GreenLake for File Storage solution. Synopsis \u00b6 This documentation shows how to check the prerequisites , deploy the probe, and understand the output . Support \u00b6 Typically you would work with your HPE Sales engineer to deploy and use The HPE GreenLake for File Storage Data Reduction Estimation Probe. Should HPE Sales engineers have issues or additional questions please contact us at Slack channel #ask-greenlake-for-filestorage.","title":"HPE GreenLake for File Storage Data Reduction Estimation Probe"},{"location":"index.html#hpe_greenlake_for_file_storage_data_reduction_estimation_probe","text":"This tool is designed to run against an existing one or more File Storage mounts and provide an accurate estimation of how much data reduction you should expect to see when moving your data set to an HPE GreenLake for File Storage solution.","title":"HPE GreenLake for File Storage Data Reduction Estimation Probe"},{"location":"index.html#synopsis","text":"This documentation shows how to check the prerequisites , deploy the probe, and understand the output .","title":"Synopsis"},{"location":"index.html#support","text":"Typically you would work with your HPE Sales engineer to deploy and use The HPE GreenLake for File Storage Data Reduction Estimation Probe. Should HPE Sales engineers have issues or additional questions please contact us at Slack channel #ask-greenlake-for-filestorage.","title":"Support"},{"location":"deployment/index.html","text":"Deployment Overview \u00b6 The HPE GreenLake for File Storage Data Reduction Estimation Probe provides estimated data reduction rate achieable based on an example data set. Make sure to review the prerequisites to understand the hardware and software requirements to successfully run the probe. This article will guide you through the process of deployment and execution of the probe. Deployment Overview Download Expand & Verify Download Mount Filesystems Selected to Be Probed Create Probe Directories Size of the Data Set Running The Probe Other Probe Flags Understanding the Results Re-Running The Probe Troubleshooting Download \u00b6 You can download the bundle from here: 935553.probe.bundle.tar.gz and scp or sftp to the Linux client that you wish to run the probe. 
Or you may choose to use wget to download the bundle directly on the Linux client that you wish to run the probe: Americas: wget \"https://vastdatasupport.blob.core.windows.net/probe/935553.probe.bundle.tar.gz?sp=r&st=2023-08-08T14:31:50Z&se=2024-08-01T22:31:50Z&spr=https&sv=2022-11-02&sr=b&sig=5zlspUUlKFfwHr%2BpDaE1UCWWeJMZLpCFHq5VQpujbHY%3D\" -O 935553.probe.bundle.tar.gz Europe: wget \"https://vastdatasupporteuwest.blob.core.windows.net/probe/935553.probe.bundle.tar.gz?sp=r&st=2023-04-24T15:46:51Z&se=2026-04-24T23:46:51Z&spr=https&sv=2021-12-02&sr=b&sig=Nska0jbs3Fz%2BrX7IavgLBq8lZeoLyo3n2sQ%2Bz3CrdOM%3D\" -O 935553.probe.bundle.tar.gz Asia/Pacific: wget \"https://vastsupportjapanwest.blob.core.windows.net/probe/935553.probe.bundle.tar.gz?sp=r&st=2023-04-24T15:47:22Z&se=2026-04-24T23:47:22Z&spr=https&sv=2021-12-02&sr=b&sig=EURsk5b4LHKM%2Bk32qyVMiab%2FZXnfIodpDiTCm5wB%2F1w%3D\" -O 935553.probe.bundle.tar.gz South Africa: wget \"https://vastsupportsanorth.blob.core.windows.net/probe/935553.probe.bundle.tar.gz?sp=r&st=2023-04-24T15:46:13Z&se=2026-04-24T23:46:13Z&spr=https&sv=2021-12-02&sr=b&sig=mw%2BBoG6YYy7TTA%2B8ga5zzfWZDdOwjJ9al6ot2Z7b6wQ%3D\" -O 935553.probe.bundle.tar.gz Expand & Verify Download \u00b6 Now that you've downloaded the probe, you'll need to untar it and then verify the download is correct. export PROBE_BUILD=935553 tar -xzf ${PROBE_BUILD}.probe.bundle.tar.gz ls -l Note: example may not show current build numbers. [root@iris-centos-workloadclient-22 probe]# ls -l total 1840344 -rw-r--r--. 1 root root 937920831 Jul 12 12:44 935553.probe.bundle.tar.gz -rw-r--r--. 1 root root 946565338 Jul 12 12:44 935553.probe.image.gz -rwxr-xr-x. 1 root root 19579 Jul 12 12:44 probe_launcher.py Mount Filesystems Selected to Be Probed \u00b6 Validated Filesystems Include, But Are Not Limited To: NFS Lustre GPFS S3 with goofys CIFS/SMB For the most accurate results, do not use root-squash. It's recommended to set read-only access on the mounted filesystem Create Probe Directories \u00b6 Change /mnt/ to the SSD-backed local disk to be used by the probe for the hash database and logging directories sudo mkdir -p /mnt/probe/db sudo mkdir -p /mnt/probe/out sudo chmod -Rf 777 /mnt/probe Size of the Data Set \u00b6 The input to the probe is a defined directory ( --input-dir ) The probe will automatically query the input filesystem about space consumed and file count (inodes) and use that in its calculations Depending on the method of mounting and underlying storage, this can often provide an inaccurate query response It's highly recommended that manual estimated entries be defined for space consumed ( --data-size-gb ) and file count ( --number-of-files ) These estimates do not have to be accurate, round up reasonably Running The Probe \u00b6 The probe runs as a foreground application. This means that if your session is closed for whatever reason, the probe will stop. It's recommended running the probe as a screen session. Here is an example of a command line. 
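Before the launcher command shown next, a minimal screen sketch (the session name probe is arbitrary; the same screen -R form is used in the manual-execution instructions later in this guide):

screen -R probe        # create or reattach the probe session
# ... run the probe inside the session ...
# detach with Ctrl-a d, reattach later with: screen -r probe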
Edit the bold variables for the environment: NOTE: Use underscores instead of spaces in COMPANY_NAME and WORKLOAD export DB_DIR=/mnt/probe/db export OUTPUT_DIR=/mnt/probe/out export INPUT_DIR=/mnt/filesystem_to_be_probed/sub_directory export INPUT_SIZE_GB=10000 export QTY_FILES=1000000 export COMPANY_NAME=Your_Amazing_Company export WORKLOAD=Describe_Your_Workload Start the probe: (This may take up to five minutes to start displaying output) sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir $INPUT_DIR \\ --metadata-dir $DB_DIR \\ --output-dir $OUTPUT_DIR \\ --data-size-gb $INPUT_SIZE_GB \\ --number-of-files $QTY_FILES \\ --customer-name ${COMPANY_NAME}---${WORKLOAD} Example One: Small Data Sets To probe the directory interesting_data of 15 TB in-use and 5,000,000 files at the company ACME, the command would be: sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir /mnt/acme_filer/interesting_data \\ --metadata-dir /mnt/data/probe/db \\ --output-dir /mnt/data/probe/out \\ --data-size-gb 15000 \\ --number-of-files 5000000 \\ --customer-name ACME---Interesting_Data Example Two: Larger Data Sets To probe the directory fascinating_data of 60 TB in-use and 750,000,000 files at the company FOO, and are using defined parameters for RAM and SSD-backed local disk the command would be: sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir /mnt/foo_filer/fascinating_data \\ --metadata-dir /mnt/data/probe/db \\ --output-dir /mnt/data/probe/out \\ --data-size-gb 60000 \\ --number-of-files 750000000 \\ --customer-name FOO---Facinating_Data Example Three: Performance Throttling To probe the directory riviting_data of 250 TB in-use and 1,250,000,000 files at the company Initech, using defined parameters for RAM and SSD-backed local disk, but wish to have a lower performance impact on the filesystem, the command would be: sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir /mnt/initech_filer/riviting_data \\ --metadata-dir /mnt/data/probe/db \\ --output-dir /mnt/data/probe/out \\ --data-size-gb 250000 \\ --number-of-files 1250000000 \\ --number-of-threads 4 --customer-name Initech---Riviting_Data Note the --number-of-threads flag. By default the probe will use all CPU cores in the system but this can be used to throttle performance and reduce potential impact of the scanned filesystem. Other Probe Flags \u00b6 While the probe is running and after completion, telemetry logs are automatically uploaded to HPE. To prevent this, add the following flag: --dont-send-logs \\ If you wish to send file names with the default telemetry logs, add the following flag: --send-logs-with-file-names \\ Probing filesystems which contain snapshots can often cause recursion issues and inaccurate results. As a result the probe automatically ignores directories named .snapshot. If your file system uses another convention, use the --regexp-filter command. If for some reason you want the probe to read the .snapshot directories, specify false rather than true for --filter-snapshots . --filter-snapshots \\ (this is the default) Under most circumstances the probe should be run with adaptive chunking. However you can disable that feature by specifying this flag: --disable-adaptive-chunking \\ Understanding the Results \u00b6 Once started, the probe will display the current projection of potential data reduction. 
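For illustration only, the in-progress projection resembles the monitoring output quoted in the manual-execution section of this guide, for example:

336.386 GB/3.1718 TB (10.4%) process_rate = 850.45 MB/sec factor = 1.54

where factor is the running estimate of the overall data reduction.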
Once completed, the probe will display output and is further described in Understanding Output Re-Running The Probe \u00b6 The hash database must be empty before running the probe again: sudo rm -r /mnt/probe/db/* Troubleshooting \u00b6 Refer to the Troubleshooting document and contact HPE Support.","title":"Deployment"},{"location":"deployment/index.html#deployment_overview","text":"The HPE GreenLake for File Storage Data Reduction Estimation Probe provides estimated data reduction rate achieable based on an example data set. Make sure to review the prerequisites to understand the hardware and software requirements to successfully run the probe. This article will guide you through the process of deployment and execution of the probe. Deployment Overview Download Expand & Verify Download Mount Filesystems Selected to Be Probed Create Probe Directories Size of the Data Set Running The Probe Other Probe Flags Understanding the Results Re-Running The Probe Troubleshooting","title":"Deployment Overview"},{"location":"deployment/index.html#download","text":"You can download the bundle from here: 935553.probe.bundle.tar.gz and scp or sftp to the Linux client that you wish to run the probe. Or you may choose to use wget to download the bundle directly on the Linux client that you wish to run the probe: Americas: wget \"https://vastdatasupport.blob.core.windows.net/probe/935553.probe.bundle.tar.gz?sp=r&st=2023-08-08T14:31:50Z&se=2024-08-01T22:31:50Z&spr=https&sv=2022-11-02&sr=b&sig=5zlspUUlKFfwHr%2BpDaE1UCWWeJMZLpCFHq5VQpujbHY%3D\" -O 935553.probe.bundle.tar.gz Europe: wget \"https://vastdatasupporteuwest.blob.core.windows.net/probe/935553.probe.bundle.tar.gz?sp=r&st=2023-04-24T15:46:51Z&se=2026-04-24T23:46:51Z&spr=https&sv=2021-12-02&sr=b&sig=Nska0jbs3Fz%2BrX7IavgLBq8lZeoLyo3n2sQ%2Bz3CrdOM%3D\" -O 935553.probe.bundle.tar.gz Asia/Pacific: wget \"https://vastsupportjapanwest.blob.core.windows.net/probe/935553.probe.bundle.tar.gz?sp=r&st=2023-04-24T15:47:22Z&se=2026-04-24T23:47:22Z&spr=https&sv=2021-12-02&sr=b&sig=EURsk5b4LHKM%2Bk32qyVMiab%2FZXnfIodpDiTCm5wB%2F1w%3D\" -O 935553.probe.bundle.tar.gz South Africa: wget \"https://vastsupportsanorth.blob.core.windows.net/probe/935553.probe.bundle.tar.gz?sp=r&st=2023-04-24T15:46:13Z&se=2026-04-24T23:46:13Z&spr=https&sv=2021-12-02&sr=b&sig=mw%2BBoG6YYy7TTA%2B8ga5zzfWZDdOwjJ9al6ot2Z7b6wQ%3D\" -O 935553.probe.bundle.tar.gz","title":"Download"},{"location":"deployment/index.html#expand_verify_download","text":"Now that you've downloaded the probe, you'll need to untar it and then verify the download is correct. export PROBE_BUILD=935553 tar -xzf ${PROBE_BUILD}.probe.bundle.tar.gz ls -l Note: example may not show current build numbers. [root@iris-centos-workloadclient-22 probe]# ls -l total 1840344 -rw-r--r--. 1 root root 937920831 Jul 12 12:44 935553.probe.bundle.tar.gz -rw-r--r--. 1 root root 946565338 Jul 12 12:44 935553.probe.image.gz -rwxr-xr-x. 1 root root 19579 Jul 12 12:44 probe_launcher.py","title":"Expand & Verify Download"},{"location":"deployment/index.html#mount_filesystems_selected_to_be_probed","text":"Validated Filesystems Include, But Are Not Limited To: NFS Lustre GPFS S3 with goofys CIFS/SMB For the most accurate results, do not use root-squash. 
It's recommended to set read-only access on the mounted filesystem","title":"Mount Filesystems Selected to Be Probed"},{"location":"deployment/index.html#create_probe_directories","text":"Change /mnt/ to the SSD-backed local disk to be used by the probe for the hash database and logging directories sudo mkdir -p /mnt/probe/db sudo mkdir -p /mnt/probe/out sudo chmod -Rf 777 /mnt/probe","title":"Create Probe Directories"},{"location":"deployment/index.html#size_of_the_data_set","text":"The input to the probe is a defined directory ( --input-dir ) The probe will automatically query the input filesystem about space consumed and file count (inodes) and use that in its calculations Depending on the method of mounting and underlying storage, this can often provide an inaccurate query response It's highly recommended that manual estimated entries be defined for space consumed ( --data-size-gb ) and file count ( --number-of-files ) These estimates do not have to be accurate, round up reasonably","title":"Size of the Data Set"},{"location":"deployment/index.html#running_the_probe","text":"The probe runs as a foreground application. This means that if your session is closed for whatever reason, the probe will stop. It's recommended running the probe as a screen session. Here is an example of a command line. Edit the bold variables for the environment: NOTE: Use underscores instead of spaces in COMPANY_NAME and WORKLOAD export DB_DIR=/mnt/probe/db export OUTPUT_DIR=/mnt/probe/out export INPUT_DIR=/mnt/filesystem_to_be_probed/sub_directory export INPUT_SIZE_GB=10000 export QTY_FILES=1000000 export COMPANY_NAME=Your_Amazing_Company export WORKLOAD=Describe_Your_Workload Start the probe: (This may take up to five minutes to start displaying output) sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir $INPUT_DIR \\ --metadata-dir $DB_DIR \\ --output-dir $OUTPUT_DIR \\ --data-size-gb $INPUT_SIZE_GB \\ --number-of-files $QTY_FILES \\ --customer-name ${COMPANY_NAME}---${WORKLOAD} Example One: Small Data Sets To probe the directory interesting_data of 15 TB in-use and 5,000,000 files at the company ACME, the command would be: sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir /mnt/acme_filer/interesting_data \\ --metadata-dir /mnt/data/probe/db \\ --output-dir /mnt/data/probe/out \\ --data-size-gb 15000 \\ --number-of-files 5000000 \\ --customer-name ACME---Interesting_Data Example Two: Larger Data Sets To probe the directory fascinating_data of 60 TB in-use and 750,000,000 files at the company FOO, and are using defined parameters for RAM and SSD-backed local disk the command would be: sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir /mnt/foo_filer/fascinating_data \\ --metadata-dir /mnt/data/probe/db \\ --output-dir /mnt/data/probe/out \\ --data-size-gb 60000 \\ --number-of-files 750000000 \\ --customer-name FOO---Facinating_Data Example Three: Performance Throttling To probe the directory riviting_data of 250 TB in-use and 1,250,000,000 files at the company Initech, using defined parameters for RAM and SSD-backed local disk, but wish to have a lower performance impact on the filesystem, the command would be: sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir /mnt/initech_filer/riviting_data \\ --metadata-dir /mnt/data/probe/db \\ --output-dir /mnt/data/probe/out \\ --data-size-gb 250000 \\ --number-of-files 
1250000000 \\ --number-of-threads 4 --customer-name Initech---Riviting_Data Note the --number-of-threads flag. By default the probe will use all CPU cores in the system but this can be used to throttle performance and reduce potential impact of the scanned filesystem.","title":"Running The Probe"},{"location":"deployment/index.html#other_probe_flags","text":"While the probe is running and after completion, telemetry logs are automatically uploaded to HPE. To prevent this, add the following flag: --dont-send-logs \\ If you wish to send file names with the default telemetry logs, add the following flag: --send-logs-with-file-names \\ Probing filesystems which contain snapshots can often cause recursion issues and inaccurate results. As a result the probe automatically ignores directories named .snapshot. If your file system uses another convention, use the --regexp-filter command. If for some reason you want the probe to read the .snapshot directories, specify false rather than true for --filter-snapshots . --filter-snapshots \\ (this is the default) Under most circumstances the probe should be run with adaptive chunking. However you can disable that feature by specifying this flag: --disable-adaptive-chunking \\","title":"Other Probe Flags"},{"location":"deployment/index.html#understanding_the_results","text":"Once started, the probe will display the current projection of potential data reduction. Once completed, the probe will display output and is further described in Understanding Output","title":"Understanding the Results"},{"location":"deployment/index.html#re-running_the_probe","text":"The hash database must be empty before running the probe again: sudo rm -r /mnt/probe/db/*","title":"Re-Running The Probe"},{"location":"deployment/index.html#troubleshooting","text":"Refer to the Troubleshooting document and contact HPE Support.","title":"Troubleshooting"},{"location":"faq/index.html","text":"General FAQ \u00b6 Q: How does the probe handle symbolic links? A : The probe ignores symbolic links. Thus if it is scanning a directory tree and encounters a symbolic link to some other area in the file system, it will not follow it. Q: How does the probe handle hard links? A : The probe attempts to detect if two files in the tree it is scanning point to the same data and automatically ignores the duplication. Q: How does the probe handle sparse files? A : By default the probe is not aware of sparse files. This means that it will read zero values for the sparse regions of the files, which can result in artificially high data reduction. The probe reports zero chunks to hint at this potential issue. Refer to Understanding VAST Probe Output for more details. Note that the probe can be run to recognize sparse files on some files systems as described in the document just referenced. Q: Can the probe scan multiple unrelated directory trees? A : Yes it can. This is done by providing multiple --input-dir values. Security FAQ \u00b6 The HPE GreenLake for File Storage Data Reduction Estimation Probe software is provided at zero cost with zero warranty to HPE\u2019s current and prospective customers in order to accurately estimate Data Reduction Rates of specific data not yet on HPE Storage systems. The probe software is run on physical or virtualized customer-maintained hardware and analyzes data that the customer allows access to through traditional filesystem based access. 
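For reference, opting out of that telemetry is done with the --dont-send-logs flag described under Other Probe Flags; a hedged sketch reusing the variables from Running The Probe:

sudo python3 ./probe_launcher.py \
  --probe-image-path ${PROBE_BUILD}.probe.image.gz \
  --input-dir $INPUT_DIR \
  --metadata-dir $DB_DIR \
  --output-dir $OUTPUT_DIR \
  --data-size-gb $INPUT_SIZE_GB \
  --number-of-files $QTY_FILES \
  --dont-send-logs \
  --customer-name ${COMPANY_NAME}---${WORKLOAD}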
The results of the probe are used to determine a Data Reduction Rate which will often be used to project an aggregate financial savings for HPE\u2019s current and prospective customers. Q: Where does the VAST Probe originate? A: The HPE GreenLake for File Storage Data Reduction Estimation Probe is a Docker container of scripts and libraries maintained and assembled solely by HPE and VAST Data engineering which is updated frequently, usually quarterly. The links to download the probe are posted on this GitHub repository. Q: Where does the VAST Probe run? A: The HPE GreenLake for File Storage Data Reduction Estimation Probe is designed to be run within a customer environment on physical or virtualized customer-maintained equipment. The provided container requires a base Linux operating system which is expected to be installed and updated by the customer before the probe is launched. Q: What information does the VAST Probe collect? A: The HPE GreenLake for File Storage Data Reduction Estimation Probe generates a series of logs for each iteration of data scanning. These logs are by default saved on the same physical or virtualized customer-maintained equipment that the probe runs. These logs contain references to paths which have been provided as inputs, and can refer to any path within that directory structure when making declarative statements about data reduction results. The analysis log file that is generated upon completion of the Data Reduction Probe prints each full path with figures about data reduction rate for that path. In addition, a secondary section of same analysis log file prints aggregate information about specific file extensions with figures about data reduction rate for that file extension. Q: What information does the HPE GreenLake for File Storage Data Reduction Estimation Probe send back to HPE? A: The HPE GreenLake for File Storage Data Reduction Estimation Probe as built-in call home telemetry which is on by default when executed assuming the probe has access to specific HPE endpoints via the internet. While the probe is running, telemetry logs will be sent approximately every 5 minutes. These telemetry logs, by default, omit references to full paths with the exception of the of the root input path and simply upload a percentage-based status of the probe as well as any error messages. The final telemetry log is similar to the local analysis log file but, by default, removes full paths with the exception of the of the root input path. The final telemetry log will send the aggregated data reduction rates based on file extensions as illustrated below: file extension statistics: file type .xlsx, original_size=143.7GB, global_compression_reduced_size=126.6GB, global_compression_factor=1.14, dedup_percentage=10.34%, similarity_match_percentage=15.12%, similarity_gain=310.9MB, local_compression_only_size=126.9GB file type .tsv, original_size=291.5GB, global_compression_reduced_size=30.8GB, global_compression_factor=9.47, dedup_percentage=1.95%, similarity_match_percentage=84.83%, similarity_gain=9.6GB, local_compression_only_size=40.4GB Q: Who can access the logs sent to VAST Data? A: Anyone at HPE engineering or sales has access to the call home backend that is used as the telemetry destination for the HPE GreenLake for File Storage Data Reduction Estimation Probe. Q: What actions are performed with the logs sent to HPE? 
A: The telemetry logs are primarily used by sales to determine a Data Reduction Rate which will often be used to project an aggregate financial savings for HPE\u2019s current and prospective customers. Alternatively, any telemetry logs can be used to determine an expected Data Reduction Rate for a given industry or use case which may be similar to a sales team\u2019s customer which has not run the probe. HPE engineering also uses the telemetry data for bug fixes and over all improvements to the software and user experience. Q: How do I control what the VAST Probe sends back to VAST Data? A: This call home telemetry feature can be disabled at runtime with the added flag: --dont-send-logs If you wish to send file names with the default telemetry logs, add the following flag: --send-logs-with-file-names","title":"FAQ"},{"location":"faq/index.html#general_faq","text":"Q: How does the probe handle symbolic links? A : The probe ignores symbolic links. Thus if it is scanning a directory tree and encounters a symbolic link to some other area in the file system, it will not follow it. Q: How does the probe handle hard links? A : The probe attempts to detect if two files in the tree it is scanning point to the same data and automatically ignores the duplication. Q: How does the probe handle sparse files? A : By default the probe is not aware of sparse files. This means that it will read zero values for the sparse regions of the files, which can result in artificially high data reduction. The probe reports zero chunks to hint at this potential issue. Refer to Understanding VAST Probe Output for more details. Note that the probe can be run to recognize sparse files on some files systems as described in the document just referenced. Q: Can the probe scan multiple unrelated directory trees? A : Yes it can. This is done by providing multiple --input-dir values.","title":"General FAQ"},{"location":"faq/index.html#security_faq","text":"The HPE GreenLake for File Storage Data Reduction Estimation Probe software is provided at zero cost with zero warranty to HPE\u2019s current and prospective customers in order to accurately estimate Data Reduction Rates of specific data not yet on HPE Storage systems. The probe software is run on physical or virtualized customer-maintained hardware and analyzes data that the customer allows access to through traditional filesystem based access. The results of the probe are used to determine a Data Reduction Rate which will often be used to project an aggregate financial savings for HPE\u2019s current and prospective customers. Q: Where does the VAST Probe originate? A: The HPE GreenLake for File Storage Data Reduction Estimation Probe is a Docker container of scripts and libraries maintained and assembled solely by HPE and VAST Data engineering which is updated frequently, usually quarterly. The links to download the probe are posted on this GitHub repository. Q: Where does the VAST Probe run? A: The HPE GreenLake for File Storage Data Reduction Estimation Probe is designed to be run within a customer environment on physical or virtualized customer-maintained equipment. The provided container requires a base Linux operating system which is expected to be installed and updated by the customer before the probe is launched. Q: What information does the VAST Probe collect? A: The HPE GreenLake for File Storage Data Reduction Estimation Probe generates a series of logs for each iteration of data scanning. 
These logs are by default saved on the same physical or virtualized customer-maintained equipment that the probe runs. These logs contain references to paths which have been provided as inputs, and can refer to any path within that directory structure when making declarative statements about data reduction results. The analysis log file that is generated upon completion of the Data Reduction Probe prints each full path with figures about data reduction rate for that path. In addition, a secondary section of same analysis log file prints aggregate information about specific file extensions with figures about data reduction rate for that file extension. Q: What information does the HPE GreenLake for File Storage Data Reduction Estimation Probe send back to HPE? A: The HPE GreenLake for File Storage Data Reduction Estimation Probe as built-in call home telemetry which is on by default when executed assuming the probe has access to specific HPE endpoints via the internet. While the probe is running, telemetry logs will be sent approximately every 5 minutes. These telemetry logs, by default, omit references to full paths with the exception of the of the root input path and simply upload a percentage-based status of the probe as well as any error messages. The final telemetry log is similar to the local analysis log file but, by default, removes full paths with the exception of the of the root input path. The final telemetry log will send the aggregated data reduction rates based on file extensions as illustrated below: file extension statistics: file type .xlsx, original_size=143.7GB, global_compression_reduced_size=126.6GB, global_compression_factor=1.14, dedup_percentage=10.34%, similarity_match_percentage=15.12%, similarity_gain=310.9MB, local_compression_only_size=126.9GB file type .tsv, original_size=291.5GB, global_compression_reduced_size=30.8GB, global_compression_factor=9.47, dedup_percentage=1.95%, similarity_match_percentage=84.83%, similarity_gain=9.6GB, local_compression_only_size=40.4GB Q: Who can access the logs sent to VAST Data? A: Anyone at HPE engineering or sales has access to the call home backend that is used as the telemetry destination for the HPE GreenLake for File Storage Data Reduction Estimation Probe. Q: What actions are performed with the logs sent to HPE? A: The telemetry logs are primarily used by sales to determine a Data Reduction Rate which will often be used to project an aggregate financial savings for HPE\u2019s current and prospective customers. Alternatively, any telemetry logs can be used to determine an expected Data Reduction Rate for a given industry or use case which may be similar to a sales team\u2019s customer which has not run the probe. HPE engineering also uses the telemetry data for bug fixes and over all improvements to the software and user experience. Q: How do I control what the VAST Probe sends back to VAST Data? A: This call home telemetry feature can be disabled at runtime with the added flag: --dont-send-logs If you wish to send file names with the default telemetry logs, add the following flag: --send-logs-with-file-names","title":"Security FAQ"},{"location":"legal/eula/index.html","text":"This software is provided according to HPE license restrictions . 
The deployment documentation describes how to indicate your acceptance of these terms.","title":"End User License Agreement"},{"location":"legal/notices/index.html","text":"","title":"Notices"},{"location":"legal/support/index.html","text":"Typically you would work with your HPE Sales engineer to deploy and use The HPE GreenLake for File Storage Data Reduction Estimation Probe. Should HPE Sales engineers have issues or additional questions please contact us at Slack channel #ask-greenlake-for-filestorage.","title":"Support"},{"location":"manual/index.html","text":"Manual Execution Overview \u00b6 The HPE GreenLake for File Storage Data Reduction Estimation Probe is a long running process in a docker container. The docker container needs to run on a linux system that has read only access to the files you want to examine for data reduction as well as reasonable memory and substantial fast local disk. When having issues with the probe_launcher.py script or you need more experimental features, you should use this page. Manual Execution Overview Manual Execution Procedure Download the bundle Configure the run Launch the probe run Probe Stages Treewalk Phase DB Initialization Phase DataScan Phase Monitoring progress Understanding the ouput Low Level Output Probe Analyze I/O Behavior Manual Execution Procedure \u00b6 Follow the steps in Prerequisites to verify requirements are met to properly run the probe. Get docker container image via links in Deployment . When the probe docker container is launched you'll then be able to connect into it and then run the probe itself with key configuration information. The probe will then run until completion and report results. Download the bundle \u00b6 Download the docker image Follow the download links in Deployment to download the bundle. Then set the variable for the build number: export PROBE_BUILD=[PROBE BUILD NUMBER] Untar the bundle: tar -xzf ${PROBE_BUILD}.probe.bundle.tar.gz Load the docker image: docker load -i ${PROBE_BUILD}.probe.image.gz This step will take a few minutes without meaningful output. Tag the loaded image by doing a docker images and noting the new image, and it's ID. Recall that images are identified by unique image IDs and human readable tags. Tag it as shown below - get the ID from the docker images output, and the value of the name is by convention the probe build. docker images Notice the Image IDs in the output list docker tag vast-probe-${PROBE_BUILD} Configure the run \u00b6 Launch a 'screen' session (or tmux). We recommend some kind of long lived session tool since the probe can take a very long time to run and we do not want it to terminate if there is an issue with the client system. screen -R probe Run the container while mapping the required directories run with the image tag/name you set earlier. The -v specifies mounts from the real operating system that should be made available to docker. These are directories that the probe can use and scan. Include as many -v 's as needed, just ensuring that at least one is the actual probe scratch directory ( /mnt/probe in this case). 
docker run -v /mnt/fileserver1:/mnt/fileserver1 -v /mnt/probe:/mnt/probe -it vast-probe-${PROBE_BUILD} from within the docker container, Create relevant output directories, eg: sudo mkdir -p /mnt/probe/vast-probe/output sudo mkdir -p /mnt/probe/vast-probe/db sudo chmod -R 777 /mnt/probe/vast-probe #note: If you get permission denied then disable selinux on your host Edit probe config file: vim /vast/install/probe/sim_init_file.yml See example config below, but also some items to note: input_dir : you can specify more than one. Just prefix a newline with '-'. This will allow the probe to scan multiple mountpoints/filesystems. Each input directory is scanned in a parallel thread which can slightly improve probe scan times. output_dir : this is where the summary files and some stats files will go. this is relatively small (< GB, although could get larger if you are scanning a lot of paths) metadata_dir : if using disk based indexes the space here needs to be pretty large (1% of total dataset to be very safe). match_disable : if you set to '1' , it will do 'local-only' compression/dedup. This completes much more quickly, but will not do any similarity hashing. max_number_of_files : This effectively pre-allocates some RAM to hold for file pointers. Set this value to somewhat higher than the total number of files you expect the probe to scan. Every 1-million files takes up 50MB of RAM. 1-billion is 50GB. Make sure not to set a value that causes the file pointer cache to exceed 50% of system RAM. disk_size_gb : set this to use disk based index. If you set to 0 it will instead use a RAM based index (see next variable) Index is ~80% of the probe metadata so rule of the thumb here so if you have a dedicated SSD-based file system for probe md the rule of thumb is to put here 80% of the disk size. And remember the free disk space size needs to be 0.6% of the total dataset size (this has a safety margin). ram_size_gb : if disk index is not used the probe will use RAM for indexing. This is faster but may produce inaccurate results for large data sets. If this value is left unset the probe will use 80% of the available system memory. IOPS_limit : can be used to limit the read rate from the target system. The IO size is the chunk size (default 32K), e.g. IOPS_limit: 1000 \u2192 ~320 MB/sec. Example config : input_dir: - '/mnt/fileserver/data/stuff' filter: '*' output_dir: '/mnt/probe/vast-probe/output' # dir for log files metdata_dir: '/mnt/probe/vast-probe/db' # dir for probe metadata regexp_filter: '' # files/directories matching the filter will NOT be scanned by the probe send_from: 'andy@vastdata.com' send_to: - 'andy@vastdata.com' - 'probe.callhome@vastdata.com' remote_monitoring_freq: 100 # sending mail with stats line, every remote_monitoring_freq seconds SMTP_host: 'localhost' # put an SMTP relay here remove_db_dir: 0 #remove db dir after each run? 1 for yes, 0 for no ignore_links: 1 #1 for yes, 0 for no IOPS_limit: 0 #for no limit, put 0 number_of_threads: 0 # for one thread per core, put 0 printing_frequency: 1 # in seconds open_files_limit: 0 #for no limit, put 0 obfuscate_files_names: 0 #1 for yes, 0 for no match_disable: 0 #1 to disable matches, 0 to enable ram_size_gb: 0 # RAM for indexes (in GB), 0 will make the probe us ~80% of the available system memory disk_size_gb: 100 # if set will use disk to store the similarity index. pause: #'7:15' #hh:mm or leave blank resume: #'17:16' #hh:mm or leave blank split: ... 
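To scan several mountpoints in one run, the input_dir key takes a list, as noted above; a short sketch with placeholder paths:

input_dir:
 - '/mnt/fileserver1/projects'
 - '/mnt/fileserver2/archive'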
Once you are satisfied, copy the .yml file to somewhere outside of the container (NFS mount or via SCP), since it will not survive container restart cp /vast/install/probe/sim_init_file.yml /mnt/probe/ Launch the probe run \u00b6 While still connected to the probe's docker container, go to the probe's home directory. Note: if you need to run the probe a second time you can copy the save sim_init_file.yml file from /mnt/probe into the container at /vast/install/probe . cd /vast/install/probe Run it: sudo is required if root is needed in order to access one of the directories configured in the init file. sudo python3 ./probe.py Probe Stages \u00b6 Some of these stages run concurrently (eg: Treewalk can run in the background throughout) Treewalk Phase \u00b6 When the probe is first kicked off, it builds a list of all files, along with the size-in-bytes for each file. This process has recently been parallelized to try and use more threads to perform this treeewalk, however depending on the source filesystem, this may still take a significant amount of time. Note that this runs in the background, such that the probe can make progress with other stages even while the treewalk phase is active. As an alternative, you can specify the --csv option to point to a CSV file which looks like this: /path/to/file.file,1234 where 1234 = sizeInBytes DB Initialization Phase \u00b6 The probe needs to initialize the Dictionary/Database which is used for storing matches. Depending on the speed of the storage which is hosting the database (specified via 'metadata_dir' ) , this can take some time. Also note that the 'disk_size_gb' parameter is directly related to how large the DB will be. During this phase, the probe will pre-allocate the DB by writing XX-GB to the metadata_dir. DataScan Phase \u00b6 Once initialization has occurred, this is when the actual probe-scanning happens. During this time, multiple threads are walking through the generated list of files and reading them to generate the various hashes which are then inserted into the DB. You can monitor progress during this phase as described below. Monitoring progress \u00b6 While the probe is running, there are 2 ways to get progress: Watch the screen sudo python3 ./probe.py mail sending off Scanning input directories, this might take a while... Scanned 144932 files, size 3.2TB File scan completed open file limit is 65536, it is recommended to allow as many open files as possible Initializing probe. 336.386 GB/3.1718 TB (10.4%) process_rate = 850.45 MB/sec factor = 1.54 Check the log file tail -f /mnt/probe/vast-probe/output/probe_Mon_Jan_21_11_57_52_2019.log The log will give you information like this: n_chunks = 482991, n_matched_chunks = 392628, dedups = 918, match_percent = 81.291% , sum_of_gain = 403245081, gain = 64.877, avg_gain_per_match = 1027.04, avg_match_hashes_per_match = 9.28467, decompressed_sum = 1.212 GB, compressed_sum = 204.86 MB, factor = 6.05886, ratio = 0.165048, sum_of_self_compress = 592.76 MB size_of_data_processed = 1.212 GB/1.324 GB 91.5786% number_of_inaccessible_files = 0/517401 size_of_inaccessible = 0 B/1.324 GB READ = 1.212 GB, RE-READ = 940.70 MB, Total READ = 2.131 GB process_rate = 34.33 MB/sec Understanding the ouput \u00b6 The summary probe output is described in Un derstanding Output Low Level Output \u00b6 In addition to the previous output, the probe will also output lower level information periodically. These days that information is not typically useful, but here is an explanation just in case. 
n_chunks - amount of chunks processed by the probe (default size is 32K) avg_chunk_size - average chunk size n_matched_chunks - amount of chunks identified as similar to pre existing chunks by similarity search match_percent - percentage of chunks identified as similar to pre existing chunks by similarity search sum_of_gain - total space saved by similarity compression gain - percentage of space saved by similarity compression avg_gain_per_match - average amount of space saved per chunk from similarity compression avg_match_hashes_per_match - average amount of matching hashes found during similarity seach n_duplicate_chunks - amount of identical chunks found dedup_percent - percentage of identical chunks found original_size - amount of data processed by the probe compressed_sum - estimated size of data post compression, dedup and similarity compression factor - compression factor (original_size / compressed_sum) ratio - 1 / factor sum_of_self_compress - data size if only local compression (with the given chunk size) was applied size_of_data_processed - progress indication number_of_inaccessible_files - number of files that were found in the initial scan but the probe didn't manage to read from when trying to process them size_of_inaccessible - amount of data that were found in the initial scan but the probe didn't manage to read from when trying to process them READ - amount of scanned data RE-READ - amount of data that was re-read in order to perform global compression Total READ - READ + RE-READ I thought it would be helpful to share results from a test run and an interpretation of those results for the benefit of others: Here\u2019s the last line of output with a summary: n_chunks = 1120059762, avg_chunk_size = 32754.2, n_matched_chunks = 507860164, match_percent = 45.3422% , sum_of_gain = 99.367 GB, gain = 0.603369, avg_gain_per_match = 210.087, avg_match_hashes_per_match = 3.4, n_duplicate_chunks = 529286922, dedup_percent=47.2552, original_size = 33.3664 TB, compressed_sum = 2.7112 TB, factor = 12.3069, ratio = 0.0812552, sum_of_self_compress = 16.0828 TB, size_of_data_processed = 33.3664 TB/33.7412 TB 98.8889%, number_of_inaccessible_files = 17233/880015, size_of_inaccessible = 384.025 GB/33.7412 TB, READ = 33.3664 TB, RE-READ = 15.1350 TB, Total READ = 48.5014 TB And the definitions that I think are most pertinent: * match_percent - percentage of chunks identified as similar to pre existing chunks by similarity search * sum_of_gain - total space saved by similarity compression * gain - percentage of space saved by similarity compression * dedup_percent - percentage of identical chunks found * original_size - amount of data processed by the probe * compressed_sum - estimated size of data post compression, dedup and similarity compression * factor - compression factor (original_size / compressed_sum) * sum_of_self_compress - data size if only local compression (with the given chunk size) was applied * size_of_data_processed - progress indication This means the probe processed 33TB of data. The \u201cnative\u201d compressed size would have been 16TB the actual compressed size including compression, dedup, and similarity compression was 2.7TB, thus the total factor of savings was 12 (33/2.7). Digging a little deeper we see that the majority of the savings came from dedup (47% of the chunks were identical) and compression as it looks like similarity compression saved 0.6% for a total of 99GB. 
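As a quick sanity check of those figures: factor = original_size / compressed_sum = 33.3664 TB / 2.7112 TB ≈ 12.3, matching the reported factor of 12.3069, while sum_of_self_compress = 16.0828 TB means local compression alone would have given only about a 2x reduction.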
Probe Analyze \u00b6 After the probe completes a run it will automatically analyze its own output from the log files and generate an analysis log (still quite long) with a breakdown by directory and file extension of the data reduction achieved. In rare cases you may need to run this manually, here's how: cd /vast/install/probe python3 ./probe.py --analyze_log .... output about processing files .... Processed 1967860 files Writing probe run analysis to ..../probe_Date.log.analysis I/O Behavior \u00b6 Speed : From a scan-speed perspective, what we've found is that on average we see approximately ~60 MByte/sec per physical CPU core when running the probe in full \"similarity hash\" mode (default value for match_disable ). Thus, a 20-core system would net approximately 1.2 GByte/sec. Having that said performance is also highly dependant on the disk latency of the target system being scanned and is often delayed by doing random reads on that system. Read amplification : The way our similarity hasher works, if it discovers any matches, it will need to re-read a portion of the dataset again to look for additional opportunities for dataReduction. In the case where your data has a lot of similarity, this can result in significant read-amplification. Therefore, when determining the amount of time it will take to scan a file-system, it is necessary to allow the probe to run for a period of time to determine the approximate 'Re-Read' ratio. look at the /mnt/probe/db/*.stats output to see. match_disable=1 : If you choose this setting (non-default), the probe will bypass similarity hashing, and instead only look for local compression opportunity, and full-chunk matches (for dedup). This is much less CPU intensive, and we've found that the bottleneck will typically be either networking or the filesystem which it is scanning, up to a point. In my testing on a system with 25gigE, using this mode saw an average of 1.3GByte/sec (about 66MB/sec/physCore). At times the network throughput got close to line-rate (2+GByte/sec). If you have a subset of data which is representative of a larger set: it would be advisable to run against the smaller set in this mode first, to determine the local compression & dedup rates. Once that rate is established, running the probe again in similarity-hash mode against the full dataset is recommended.","title":"Manual Deployment"},{"location":"manual/index.html#manual_execution_overview","text":"The HPE GreenLake for File Storage Data Reduction Estimation Probe is a long running process in a docker container. The docker container needs to run on a linux system that has read only access to the files you want to examine for data reduction as well as reasonable memory and substantial fast local disk. When having issues with the probe_launcher.py script or you need more experimental features, you should use this page. Manual Execution Overview Manual Execution Procedure Download the bundle Configure the run Launch the probe run Probe Stages Treewalk Phase DB Initialization Phase DataScan Phase Monitoring progress Understanding the ouput Low Level Output Probe Analyze I/O Behavior","title":"Manual Execution Overview"},{"location":"manual/index.html#manual_execution_procedure","text":"Follow the steps in Prerequisites to verify requirements are met to properly run the probe. Get docker container image via links in Deployment . When the probe docker container is launched you'll then be able to connect into it and then run the probe itself with key configuration information. 
The probe will then run until completion and report results.","title":"Manual Execution Procedure"},{"location":"manual/index.html#download_the_bundle","text":"Download the docker image Follow the download links in Deployment to download the bundle. Then set the variable for the build number: export PROBE_BUILD=[PROBE BUILD NUMBER] Untar the bundle: tar -xzf ${PROBE_BUILD}.probe.bundle.tar.gz Load the docker image: docker load -i ${PROBE_BUILD}.probe.image.gz This step will take a few minutes without meaningful output. Tag the loaded image by doing a docker images and noting the new image, and it's ID. Recall that images are identified by unique image IDs and human readable tags. Tag it as shown below - get the ID from the docker images output, and the value of the name is by convention the probe build. docker images Notice the Image IDs in the output list docker tag vast-probe-${PROBE_BUILD}","title":"Download the bundle"},{"location":"manual/index.html#configure_the_run","text":"Launch a 'screen' session (or tmux). We recommend some kind of long lived session tool since the probe can take a very long time to run and we do not want it to terminate if there is an issue with the client system. screen -R probe Run the container while mapping the required directories run with the image tag/name you set earlier. The -v specifies mounts from the real operating system that should be made available to docker. These are directories that the probe can use and scan. Include as many -v 's as needed, just ensuring that at least one is the actual probe scratch directory ( /mnt/probe in this case). docker run -v /mnt/fileserver1:/mnt/fileserver1 -v /mnt/probe:/mnt/probe -it vast-probe-${PROBE_BUILD} from within the docker container, Create relevant output directories, eg: sudo mkdir -p /mnt/probe/vast-probe/output sudo mkdir -p /mnt/probe/vast-probe/db sudo chmod -R 777 /mnt/probe/vast-probe #note: If you get permission denied then disable selinux on your host Edit probe config file: vim /vast/install/probe/sim_init_file.yml See example config below, but also some items to note: input_dir : you can specify more than one. Just prefix a newline with '-'. This will allow the probe to scan multiple mountpoints/filesystems. Each input directory is scanned in a parallel thread which can slightly improve probe scan times. output_dir : this is where the summary files and some stats files will go. this is relatively small (< GB, although could get larger if you are scanning a lot of paths) metadata_dir : if using disk based indexes the space here needs to be pretty large (1% of total dataset to be very safe). match_disable : if you set to '1' , it will do 'local-only' compression/dedup. This completes much more quickly, but will not do any similarity hashing. max_number_of_files : This effectively pre-allocates some RAM to hold for file pointers. Set this value to somewhat higher than the total number of files you expect the probe to scan. Every 1-million files takes up 50MB of RAM. 1-billion is 50GB. Make sure not to set a value that causes the file pointer cache to exceed 50% of system RAM. disk_size_gb : set this to use disk based index. If you set to 0 it will instead use a RAM based index (see next variable) Index is ~80% of the probe metadata so rule of the thumb here so if you have a dedicated SSD-based file system for probe md the rule of thumb is to put here 80% of the disk size. And remember the free disk space size needs to be 0.6% of the total dataset size (this has a safety margin). 
ram_size_gb : if disk index is not used the probe will use RAM for indexing. This is faster but may produce inaccurate results for large data sets. If this value is left unset the probe will use 80% of the available system memory. IOPS_limit : can be used to limit the read rate from the target system. The IO size is the chunk size (default 32K), e.g. IOPS_limit: 1000 \u2192 ~320 MB/sec. Example config : input_dir: - '/mnt/fileserver/data/stuff' filter: '*' output_dir: '/mnt/probe/vast-probe/output' # dir for log files metdata_dir: '/mnt/probe/vast-probe/db' # dir for probe metadata regexp_filter: '' # files/directories matching the filter will NOT be scanned by the probe send_from: 'andy@vastdata.com' send_to: - 'andy@vastdata.com' - 'probe.callhome@vastdata.com' remote_monitoring_freq: 100 # sending mail with stats line, every remote_monitoring_freq seconds SMTP_host: 'localhost' # put an SMTP relay here remove_db_dir: 0 #remove db dir after each run? 1 for yes, 0 for no ignore_links: 1 #1 for yes, 0 for no IOPS_limit: 0 #for no limit, put 0 number_of_threads: 0 # for one thread per core, put 0 printing_frequency: 1 # in seconds open_files_limit: 0 #for no limit, put 0 obfuscate_files_names: 0 #1 for yes, 0 for no match_disable: 0 #1 to disable matches, 0 to enable ram_size_gb: 0 # RAM for indexes (in GB), 0 will make the probe us ~80% of the available system memory disk_size_gb: 100 # if set will use disk to store the similarity index. pause: #'7:15' #hh:mm or leave blank resume: #'17:16' #hh:mm or leave blank split: ... Once you are satisfied, copy the .yml file to somewhere outside of the container (NFS mount or via SCP), since it will not survive container restart cp /vast/install/probe/sim_init_file.yml /mnt/probe/","title":"Configure the run"},{"location":"manual/index.html#launch_the_probe_run","text":"While still connected to the probe's docker container, go to the probe's home directory. Note: if you need to run the probe a second time you can copy the save sim_init_file.yml file from /mnt/probe into the container at /vast/install/probe . cd /vast/install/probe Run it: sudo is required if root is needed in order to access one of the directories configured in the init file. sudo python3 ./probe.py","title":"Launch the probe run"},{"location":"manual/index.html#probe_stages","text":"Some of these stages run concurrently (eg: Treewalk can run in the background throughout)","title":"Probe Stages"},{"location":"manual/index.html#treewalk_phase","text":"When the probe is first kicked off, it builds a list of all files, along with the size-in-bytes for each file. This process has recently been parallelized to try and use more threads to perform this treeewalk, however depending on the source filesystem, this may still take a significant amount of time. Note that this runs in the background, such that the probe can make progress with other stages even while the treewalk phase is active. As an alternative, you can specify the --csv option to point to a CSV file which looks like this: /path/to/file.file,1234 where 1234 = sizeInBytes","title":"Treewalk Phase"},{"location":"manual/index.html#db_initialization_phase","text":"The probe needs to initialize the Dictionary/Database which is used for storing matches. Depending on the speed of the storage which is hosting the database (specified via 'metadata_dir' ) , this can take some time. Also note that the 'disk_size_gb' parameter is directly related to how large the DB will be. 
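As a worked sizing example using the rules of thumb above: for a 100 TB dataset, the safe free-space guideline of 0.6% works out to roughly 600 GB of SSD-backed space for the probe metadata, and if that space is a dedicated filesystem, disk_size_gb would be set to about 80% of it, i.e. roughly 480.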
During this phase, the probe will pre-allocate the DB by writing XX-GB to the metadata_dir.","title":"DB Initialization Phase"},{"location":"manual/index.html#datascan_phase","text":"Once initialization has occurred, this is when the actual probe-scanning happens. During this time, multiple threads are walking through the generated list of files and reading them to generate the various hashes which are then inserted into the DB. You can monitor progress during this phase as described below.","title":"DataScan Phase"},{"location":"manual/index.html#monitoring_progress","text":"While the probe is running, there are 2 ways to get progress: Watch the screen sudo python3 ./probe.py mail sending off Scanning input directories, this might take a while... Scanned 144932 files, size 3.2TB File scan completed open file limit is 65536, it is recommended to allow as many open files as possible Initializing probe. 336.386 GB/3.1718 TB (10.4%) process_rate = 850.45 MB/sec factor = 1.54 Check the log file tail -f /mnt/probe/vast-probe/output/probe_Mon_Jan_21_11_57_52_2019.log The log will give you information like this: n_chunks = 482991, n_matched_chunks = 392628, dedups = 918, match_percent = 81.291% , sum_of_gain = 403245081, gain = 64.877, avg_gain_per_match = 1027.04, avg_match_hashes_per_match = 9.28467, decompressed_sum = 1.212 GB, compressed_sum = 204.86 MB, factor = 6.05886, ratio = 0.165048, sum_of_self_compress = 592.76 MB size_of_data_processed = 1.212 GB/1.324 GB 91.5786% number_of_inaccessible_files = 0/517401 size_of_inaccessible = 0 B/1.324 GB READ = 1.212 GB, RE-READ = 940.70 MB, Total READ = 2.131 GB process_rate = 34.33 MB/sec","title":"Monitoring progress"},{"location":"manual/index.html#understanding_the_ouput","text":"The summary probe output is described in Un derstanding Output","title":"Understanding the ouput"},{"location":"manual/index.html#low_level_output","text":"In addition to the previous output, the probe will also output lower level information periodically. These days that information is not typically useful, but here is an explanation just in case. 
n_chunks - amount of chunks processed by the probe (default size is 32K) avg_chunk_size - average chunk size n_matched_chunks - amount of chunks identified as similar to pre existing chunks by similarity search match_percent - percentage of chunks identified as similar to pre existing chunks by similarity search sum_of_gain - total space saved by similarity compression gain - percentage of space saved by similarity compression avg_gain_per_match - average amount of space saved per chunk from similarity compression avg_match_hashes_per_match - average amount of matching hashes found during similarity seach n_duplicate_chunks - amount of identical chunks found dedup_percent - percentage of identical chunks found original_size - amount of data processed by the probe compressed_sum - estimated size of data post compression, dedup and similarity compression factor - compression factor (original_size / compressed_sum) ratio - 1 / factor sum_of_self_compress - data size if only local compression (with the given chunk size) was applied size_of_data_processed - progress indication number_of_inaccessible_files - number of files that were found in the initial scan but the probe didn't manage to read from when trying to process them size_of_inaccessible - amount of data that were found in the initial scan but the probe didn't manage to read from when trying to process them READ - amount of scanned data RE-READ - amount of data that was re-read in order to perform global compression Total READ - READ + RE-READ I thought it would be helpful to share results from a test run and an interpretation of those results for the benefit of others: Here\u2019s the last line of output with a summary: n_chunks = 1120059762, avg_chunk_size = 32754.2, n_matched_chunks = 507860164, match_percent = 45.3422% , sum_of_gain = 99.367 GB, gain = 0.603369, avg_gain_per_match = 210.087, avg_match_hashes_per_match = 3.4, n_duplicate_chunks = 529286922, dedup_percent=47.2552, original_size = 33.3664 TB, compressed_sum = 2.7112 TB, factor = 12.3069, ratio = 0.0812552, sum_of_self_compress = 16.0828 TB, size_of_data_processed = 33.3664 TB/33.7412 TB 98.8889%, number_of_inaccessible_files = 17233/880015, size_of_inaccessible = 384.025 GB/33.7412 TB, READ = 33.3664 TB, RE-READ = 15.1350 TB, Total READ = 48.5014 TB And the definitions that I think are most pertinent: * match_percent - percentage of chunks identified as similar to pre existing chunks by similarity search * sum_of_gain - total space saved by similarity compression * gain - percentage of space saved by similarity compression * dedup_percent - percentage of identical chunks found * original_size - amount of data processed by the probe * compressed_sum - estimated size of data post compression, dedup and similarity compression * factor - compression factor (original_size / compressed_sum) * sum_of_self_compress - data size if only local compression (with the given chunk size) was applied * size_of_data_processed - progress indication This means the probe processed 33TB of data. The \u201cnative\u201d compressed size would have been 16TB the actual compressed size including compression, dedup, and similarity compression was 2.7TB, thus the total factor of savings was 12 (33/2.7). 
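If you would rather pull these figures out of a log programmatically than read them off the screen, the low-level line can be split into fields with a few lines of Python. This is only a minimal sketch against the comma-separated "name = value" format shown above; the exact field names can vary between probe versions, and the example line is abridged from the output quoted earlier.

```python
import re

def parse_probe_stats(line: str) -> dict:
    """Split a probe low-level stats line of 'name = value' pairs into a dict of strings."""
    return {name: value.strip()
            for name, value in re.findall(r'([A-Za-z_]+)\s*=\s*([^,]+)', line)}

# Abridged version of the summary line quoted above.
line = ("n_chunks = 1120059762, avg_chunk_size = 32754.2, "
        "original_size = 33.3664 TB, compressed_sum = 2.7112 TB, factor = 12.3069")
stats = parse_probe_stats(line)
print(stats["factor"])          # '12.3069'
print(stats["original_size"])   # '33.3664 TB'
```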
Digging a little deeper into that same summary line, we see that the majority of the savings came from dedup (47% of the chunks were identical) and compression; similarity compression saved a further 0.6%, for a total of 99GB.","title":"Low Level Output"},{"location":"manual/index.html#probe_analyze","text":"After the probe completes a run, it will automatically analyze its own output from the log files and generate an analysis log (still quite long) with a breakdown by directory and file extension of the data reduction achieved. In rare cases you may need to run this manually; here's how: cd /vast/install/probe python3 ./probe.py --analyze_log .... output about processing files .... Processed 1967860 files Writing probe run analysis to ..../probe_Date.log.analysis","title":"Probe Analyze"},{"location":"manual/index.html#io_behavior","text":"Speed : From a scan-speed perspective, what we've found is that on average we see approximately ~60 MByte/sec per physical CPU core when running the probe in full \"similarity hash\" mode (the default value for match_disable ). Thus, a 20-core system would net approximately 1.2 GByte/sec. That said, performance is also highly dependent on the disk latency of the target system being scanned and is often delayed by random reads on that system. Read amplification : The way our similarity hasher works, if it discovers any matches, it will need to re-read a portion of the dataset again to look for additional data reduction opportunities. In the case where your data has a lot of similarity, this can result in significant read amplification. Therefore, when determining the amount of time it will take to scan a filesystem, it is necessary to allow the probe to run for a period of time to determine the approximate 'Re-Read' ratio. Look at the /mnt/probe/db/*.stats output to see this. match_disable=1 : If you choose this setting (non-default), the probe will bypass similarity hashing, and instead only look for local compression opportunity and full-chunk matches (for dedup). This is much less CPU intensive, and we've found that the bottleneck will typically be either networking or the filesystem which it is scanning, up to a point. In our testing on a system with 25GbE networking, this mode averaged 1.3GByte/sec (about 66MB/sec per physical core). At times the network throughput got close to line rate (2+GByte/sec). If you have a subset of data which is representative of a larger set, it is advisable to run against the smaller set in this mode first, to determine the local compression & dedup rates. Once that rate is established, running the probe again in similarity-hash mode against the full dataset is recommended.","title":"I/O Behavior"},{"location":"output/index.html","text":"Understanding Output Overview \u00b6 Periodically while running and at the end of a run, the probe will output data reduction results to the probe log file. These results are very helpful for understanding the data reduction that is expected when the data is placed on HPE GreenLake for File Storage as well as for helping to understand why that level of data reduction was achieved.
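If you only want the headline number from a long run without reading the whole block, a short script can pull the most recent reduction factor out of the log. This is a minimal sketch: it assumes the "Total Global Data Reduction Factor = N:1" wording shown in the next section, and the log path passed on the command line is a placeholder for your own output directory.

```python
import re
import sys

def latest_reduction_factor(log_path: str):
    """Return the most recent 'Total Global Data Reduction Factor' reported in a probe log."""
    factor = None
    with open(log_path, errors="ignore") as log:
        for line in log:
            match = re.search(r'Total Global Data Reduction Factor\s*=\s*([\d.]+):1', line)
            if match:
                factor = float(match.group(1))  # keep the last value seen, i.e. the latest
    return factor

if __name__ == "__main__":
    # e.g. python3 drr_from_log.py /mnt/probe/out/<probe_log_file>
    print(latest_reduction_factor(sys.argv[1]))
```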
Output format \u00b6 The output will look something like this: --------------------------------current-probe-stats-------------------------------- Probe version: probe-version-4-4-703050 Scanned: 258.14GB out of 258.13GB (100.00%) Files Scanned: 22481 files out of 22481 files (100.00%) ============= Main Results: ============= Total Global Data Reduction Factor = 5.32:1 (81.20% reduction) Sparse Size = 258.14GB Reduced Size = 48.54GB Number of Inaccessible Files = 3 out of 22481 files (0.01% of scan) Size of Inaccessible Files = 0.00B out of 258.13GB (0.00% of scan) - Duplicate Block Elimination Gain: 0.61% (1.56GB) Zero Block Elimination Gain: 0.00% (1.80MB) Number of Duplicate Chunks: 58917 Number of Zero Chunks: 35 - Similarity Reduction Global DAC vs. Local DAC Gain: 1.69% (4.37GB out of total bytes using Similarity: 233.62GB) Number of Similar Chunks: 4572414 out of 5414721 total unique chunks Average Chunk Size: 49.99KB Similarity Percentage: 84.44% Average Size of Chunks Using Similarity: 53.58KB Average Gain post DAC Per Similarity Match: 1.00KB Vast Array Performance Impact: green - Local Compression Gain including DAC: 79.03% (204.01GB out of a total Compression scan of 252.21GB) Compression ratio for local compress only: 4.88:1 ================== Adaptive Chunking: ================== ... ======================= Data Aware Compression: ======================= ... ====================== Experimental Features: ====================== ... There are two types of output above: normal or routine information relevant to most, and more advanced information that is more internal in nature (shown here with ...). In this article we will consider both types of information in the output, but please focus on the routine information as that is almost always more relevant. Routine Considerations (Main Results) \u00b6 The intent of this output is to summarize what the probe has found so far. The interesting results are: Scanned shows the space before reduction Files Scanned shows the number of files in the entire data set that were scanned Total Global Data Reduction Factor shows how effectively data reduction was done overall. This value includes compression, deduplication, and similarity reduction. Reduced Size is the space after reduction Sparse Size should be ignored unless the probe is run with --sparse-mode as described below. Number/Size of Inaccessible Files indicates data the probe tried to read but couldn't. If this number is large, the probe results are not valid. This almost always happens due to permission issues or files being deleted while the probe was running. Duplicate Block Elimination Gain shows how much space is saved just by removal of duplicate blocks. Number of Duplicate Chunks shows literally how many blocks were identical to other existing blocks. Zero Block Elimination Gain tells you how much of the gain from deduplication was due to zero blocks. That helpful for understanding the implications of the next item. Number of Zero Chunks is a count of number of chunks that are all zeros. That often indicates sparse files. If the number of such chunks is high relative to the number of chunks (exceeding say 10%), the probe estimates may be misleading. Use tools such as du and df to determine the actual space used and compare that to the probe's report of the space scanned. If there is a large difference, sparse files are likely to blame. 
If your file system supports the advanced ioctl for sparse file reporting (Lustre and XFS do), you can try running the probe again with --sparse-mode . Similarity Reduction Global DAC vs. Local DAC Gain is the gain from similarity with data aware compression vs. the gain without similarity. This is just a more verbose way of saying \"this is how much gain similarity provided.\" Number of Similar Chunks / Similarity Percentage is the number of data chunks that benefited from similarity matching. The percentage is simply the number of chunks that benefited from similarity divided by the total number of chunks. A high value for the similarity match percentage (significantly over 10%) and a low value of Average Gain Post DAC Per Similarity Match relative to Average Size of Chunks Using Similarity is a potential problem. This indicates a high similarity match rate, but a low gain from those matches. The amount reported is bytes per chunk. Average Chunk Size is the average size (before reduction) of all chunks Average Size of Chunks Using Similarity is the average size (before reduction) of a chunk that benefited from similarity Array Performance Impact should be ignored for now. Local Compression Gain shows how much space would be saved just by transparent compression as files are saved. This is also helpfully expressed at the end via Compression ratio for local compress only. Essentially that ratio vs. the reported Total Global Data Reduction factor shows how much better DRR was thanks to global deduplication and similarity reduction. In the above example we can see that we scanned 22481 files that consumed 258GB of space before any data reduction. After data reduction the probe predicts the files will consume 48GB of space for a reduction of 81%. Of that simple compression gains 79% (204GB), deduplication 1% (1GB), and similarity 2% (4GB). Please keep in mind these aren't typical results as actual data reduction varies widely for different data sets. Advanced Considerations \u00b6 In addition to the common and most relevant output described above, there are more advanced bits of information shared by the probe. Most of this information is only relevant to VAST engineering (we hope you can share it with us) but we document it here for the curious. 
Here is an example of the more advanced outputs: ================== Adaptive Chunking: ================== min_chunk_size=16384 max_chunk_size=65043 desired_chunk_size=29950 inverse_probability=13999 split_threshold=17871601040105585914 Theoretical Average Chunk Size: 29.25KB (error: -70.92%) Number of chunks split via hash: 2423353 (44.75%) Number of chunks split via buffer end: 44620 (0.82%) Number of chunks split via max size reached: 2969226 (54.84%) ======================= Data Aware Compression: ======================= Total Number of Predictions: 5414686 Predictions Per Encoder Type: {ENCODER_NONE=5402314, ENCODER_SHUFFLE=11164, ENCODER_DELTA_ENCODE=681, ENCODER_DELTA_ENCODE_4_SHUFFLE=527} Percentage of Chunks Per Encoder: - Encoder ENCODER_NONE: 99.77% - Encoder ENCODER_SHUFFLE: 0.21% - Encoder ENCODER_DELTA_ENCODE: 0.01% - Encoder ENCODER_DELTA_ENCODE_4_SHUFFLE: 0.01% Encoding Sampling Reduction Summary (sampling 1.99%): ---------------------------------------------------------------------------------------------------------------------------------------------------- Encoders | None | Shuffle | Delta Shuffle | Delta ---------------------------------------------------------------------------------------------------------------------------------------------------- DRR (Global) | 5.33 | 3.60 | 3.07 | 4.83 Compressed Size | 48.44GB | 71.73GB | 84.06GB | 53.44GB Num Chunks Improved Percentage | 98.87% | 13.43% | 13.25% | 13.36% Num Chunks Improved | 5353433 | 727142 | 717668 | 723182 Total Chunks Num | 5414721 | 5414721 | 5414721 | 5414721 Similarity Reduction Percentage | 1.68% | 1.77% | 2.62% | 2.01% Similarity Reduction | 4.34GB | 4.59GB | 6.78GB | 5.19GB Total Bytes Using Similarity | 233.62GB | 233.62GB | 233.62GB | 233.62GB Similarity Reduction Gain if ref chain Percentage | 86.88% | 0.00% | 0.00% | 0.00% Similarity Reduction Gain if ref chain | 224.86GB | 0.00B | 0.00B | 0.00B Data Aware Compression Accuracy: Total Chunks Compared for Discovering Optimal Encoding: 108046 Total Correct Optimal Encoding Predictions: 107781 Total Wrong Optimal Encoding Predictions: 265 Correct Predictions Percentage: 99.75% Predictions Per Encoder Type: {ENCODER_NONE=107825, ENCODER_SHUFFLE=195, ENCODER_DELTA_ENCODE=14, ENCODER_DELTA_ENCODE_4_SHUFFLE=12} Wrong Predictions Per Encoder Type: {ENCODER_NONE=98, ENCODER_SHUFFLE=160, ENCODER_DELTA_ENCODE=1, ENCODER_DELTA_ENCODE_4_SHUFFLE=6} Wrong Predictions Percentage Per Encoder: - Encoder ENCODER_NONE = 0.09% - Encoder ENCODER_SHUFFLE = 82.05% - Encoder ENCODER_DELTA_ENCODE = 7.14% - Encoder ENCODER_DELTA_ENCODE_4_SHUFFLE = 50.00% * Note: Wrong predictions does not mean that there is no gain from the encoder, but rather that there is a better one. 
Total Pre-Encoding Compressed Size of Chunks Used in Predictions: 1.07GB Total Post-Encoding Compression Size of Chunks Used in Predictions: 1.07GB Total Optimal Compression Size of Chunks Used in Predictions: 1.07GB Total Size Difference Between Predicted and Optimal Encoded Compression: 274.36KB (Optimal compression size is smaller than the predicted compression size by 0.02%) Approximate Total Local Data-Reduction Factor Without Data Aware Compression: 4.83:1 (79.30% reduction) Actual Total Global Data-Reduction Factor Without Data Aware Compression (available at 100% sampling): N/a ====================== Experimental Features: ====================== Similarity Reduction Gain if ref chain: 1.74% (4.49GB out of total bytes using Similarity: 233.62GB) Extra space gain in optimal compression: 47.57GB - Extra local compression space gain in case of using compression_level 8: 4.35GB - Extra local compression space gain in case of using compression_level 8: 3.16GB Adaptive Chunking min_chunk_size=AAA max_chunk_size=BBB desired_chunk_size=CCC are all internal settings that we may change from probe version to probe version. Otherwise they should be ignored. Theoretical Average Chunk Size should be ignored Number of chunks split via XXXX : adaptive chunking automatically adjusts the size of data chunks to improve deduplication and similarity matching. These three metrics tell us a bit about how we are doing. via hash : the count of chunks that were split using the automated data sensitive splitting. Typically this will be a high value. via buffer end : the count of chunks that were split simply because we reached the end of the relevant data stream. A likely cause is simply the end of a file. via max size reached : the count of chunks that were split because the chunks would have otherwise been too large. Data Aware Compression Encoding Sampling Reduction Summary summarizes the various different data aware compression (DAC) encodings and how well they worked for all of the data chunks. The probe randomly selects some number of chunks (sampling) and tries all encoding schemes. This is not what VAST or the probe does for all chunks as it is too expensive. Instead, the system examines a bit of each data chunk and decides on the DAC encoding scheme to use and then uses it - we call this prediction. This table show how the different schemes fared and helps us understand if our predictions are accurate. In general this table can be ignored. Correct Predictions Percentage tells us how often our predictions where correct. This calculation is based upon these values: Total Chunks Compared for Discovering Optimal Encoding : how many chunks were sampled for checking purposes Total Correct Optimal Encoding Predictions : how often the predictor was correct Total Wrong Optimal Encoding Predictions : how often the predictor was wrong Total Size Difference Between Predicted and Optimal Encoded Compression indicates how well our predictor selected the optimal DAC encoding scheme in terms of space used. If the number here is small (less than 5%) then the predictor is doing well. If it is larger, please let us know. 
These are the inputs to this calculation: Total Pre-Encoding Compressed Size of Chunks Used in Predictions : size of chunks before reduction Total Post-Encoding Compression Size of Chunks Used in Predictions : size of chunks after reduction Total Optimal Compression Size of Chunks Used in Predictions - the optimal reduction (basically trying all possible encodings based upon sampling) Approximate Total Local Data-Reduction Factor Without Data Aware Compression - our estimate (based upon sampling) of the data reduction without DAC. Basically if the value here is smaller than the value reported in the first part of the summary, DAC was a win. Experimental Features Extra space gain in optimal compression - this considers advanced data reduction algorithms that are under consider for future versions but have not yet implemented in actual released products. If you see a very large value here relative to the total data, let us know. That's very interesting to us! Extra local compression space gain in case of using compression_level 8 - this indicates how much space could be saved in local compression if the most expensive ZSTD compression setting. This isn't done on real clusters as it impact performance, but it's a useful metric for our engineering. Typically the additional savings is minimal which is good.","title":"Understanding Output"},{"location":"output/index.html#understanding_output_overview","text":"Periodically while running and at the end of a run, the probe will output data reduction results to the probe log file. These results are very helpful for understanding the data reduction that is expected when the data is placed on HPE GreenLake for File Storage as well as for helping to understand why that level of data reduction was achieved.","title":"Understanding Output Overview"},{"location":"output/index.html#output_format","text":"The output will look something like this: --------------------------------current-probe-stats-------------------------------- Probe version: probe-version-4-4-703050 Scanned: 258.14GB out of 258.13GB (100.00%) Files Scanned: 22481 files out of 22481 files (100.00%) ============= Main Results: ============= Total Global Data Reduction Factor = 5.32:1 (81.20% reduction) Sparse Size = 258.14GB Reduced Size = 48.54GB Number of Inaccessible Files = 3 out of 22481 files (0.01% of scan) Size of Inaccessible Files = 0.00B out of 258.13GB (0.00% of scan) - Duplicate Block Elimination Gain: 0.61% (1.56GB) Zero Block Elimination Gain: 0.00% (1.80MB) Number of Duplicate Chunks: 58917 Number of Zero Chunks: 35 - Similarity Reduction Global DAC vs. Local DAC Gain: 1.69% (4.37GB out of total bytes using Similarity: 233.62GB) Number of Similar Chunks: 4572414 out of 5414721 total unique chunks Average Chunk Size: 49.99KB Similarity Percentage: 84.44% Average Size of Chunks Using Similarity: 53.58KB Average Gain post DAC Per Similarity Match: 1.00KB Vast Array Performance Impact: green - Local Compression Gain including DAC: 79.03% (204.01GB out of a total Compression scan of 252.21GB) Compression ratio for local compress only: 4.88:1 ================== Adaptive Chunking: ================== ... ======================= Data Aware Compression: ======================= ... ====================== Experimental Features: ====================== ... There are two types of output above: normal or routine information relevant to most, and more advanced information that is more internal in nature (shown here with ...). 
In this article we will consider both types of information in the output, but please focus on the routine information as that is almost always more relevant.","title":"Output format"},{"location":"output/index.html#routine_considerations_main_results","text":"The intent of this output is to summarize what the probe has found so far. The interesting results are: Scanned shows the space before reduction Files Scanned shows the number of files in the entire data set that were scanned Total Global Data Reduction Factor shows how effectively data reduction was done overall. This value includes compression, deduplication, and similarity reduction. Reduced Size is the space after reduction Sparse Size should be ignored unless the probe is run with --sparse-mode as described below. Number/Size of Inaccessible Files indicates data the probe tried to read but couldn't. If this number is large, the probe results are not valid. This almost always happens due to permission issues or files being deleted while the probe was running. Duplicate Block Elimination Gain shows how much space is saved just by removal of duplicate blocks. Number of Duplicate Chunks shows literally how many blocks were identical to other existing blocks. Zero Block Elimination Gain tells you how much of the gain from deduplication was due to zero blocks. That helpful for understanding the implications of the next item. Number of Zero Chunks is a count of number of chunks that are all zeros. That often indicates sparse files. If the number of such chunks is high relative to the number of chunks (exceeding say 10%), the probe estimates may be misleading. Use tools such as du and df to determine the actual space used and compare that to the probe's report of the space scanned. If there is a large difference, sparse files are likely to blame. If your file system supports the advanced ioctl for sparse file reporting (Lustre and XFS do), you can try running the probe again with --sparse-mode . Similarity Reduction Global DAC vs. Local DAC Gain is the gain from similarity with data aware compression vs. the gain without similarity. This is just a more verbose way of saying \"this is how much gain similarity provided.\" Number of Similar Chunks / Similarity Percentage is the number of data chunks that benefited from similarity matching. The percentage is simply the number of chunks that benefited from similarity divided by the total number of chunks. A high value for the similarity match percentage (significantly over 10%) and a low value of Average Gain Post DAC Per Similarity Match relative to Average Size of Chunks Using Similarity is a potential problem. This indicates a high similarity match rate, but a low gain from those matches. The amount reported is bytes per chunk. Average Chunk Size is the average size (before reduction) of all chunks Average Size of Chunks Using Similarity is the average size (before reduction) of a chunk that benefited from similarity Array Performance Impact should be ignored for now. Local Compression Gain shows how much space would be saved just by transparent compression as files are saved. This is also helpfully expressed at the end via Compression ratio for local compress only. Essentially that ratio vs. the reported Total Global Data Reduction factor shows how much better DRR was thanks to global deduplication and similarity reduction. In the above example we can see that we scanned 22481 files that consumed 258GB of space before any data reduction. 
After data reduction the probe predicts the files will consume 48GB of space for a reduction of 81%. Of that simple compression gains 79% (204GB), deduplication 1% (1GB), and similarity 2% (4GB). Please keep in mind these aren't typical results as actual data reduction varies widely for different data sets.","title":"Routine Considerations (Main Results)"},{"location":"output/index.html#advanced_considerations","text":"In addition to the common and most relevant output described above, there are more advanced bits of information shared by the probe. Most of this information is only relevant to VAST engineering (we hope you can share it with us) but we document it here for the curious. Here is an example of the more advanced outputs: ================== Adaptive Chunking: ================== min_chunk_size=16384 max_chunk_size=65043 desired_chunk_size=29950 inverse_probability=13999 split_threshold=17871601040105585914 Theoretical Average Chunk Size: 29.25KB (error: -70.92%) Number of chunks split via hash: 2423353 (44.75%) Number of chunks split via buffer end: 44620 (0.82%) Number of chunks split via max size reached: 2969226 (54.84%) ======================= Data Aware Compression: ======================= Total Number of Predictions: 5414686 Predictions Per Encoder Type: {ENCODER_NONE=5402314, ENCODER_SHUFFLE=11164, ENCODER_DELTA_ENCODE=681, ENCODER_DELTA_ENCODE_4_SHUFFLE=527} Percentage of Chunks Per Encoder: - Encoder ENCODER_NONE: 99.77% - Encoder ENCODER_SHUFFLE: 0.21% - Encoder ENCODER_DELTA_ENCODE: 0.01% - Encoder ENCODER_DELTA_ENCODE_4_SHUFFLE: 0.01% Encoding Sampling Reduction Summary (sampling 1.99%): ---------------------------------------------------------------------------------------------------------------------------------------------------- Encoders | None | Shuffle | Delta Shuffle | Delta ---------------------------------------------------------------------------------------------------------------------------------------------------- DRR (Global) | 5.33 | 3.60 | 3.07 | 4.83 Compressed Size | 48.44GB | 71.73GB | 84.06GB | 53.44GB Num Chunks Improved Percentage | 98.87% | 13.43% | 13.25% | 13.36% Num Chunks Improved | 5353433 | 727142 | 717668 | 723182 Total Chunks Num | 5414721 | 5414721 | 5414721 | 5414721 Similarity Reduction Percentage | 1.68% | 1.77% | 2.62% | 2.01% Similarity Reduction | 4.34GB | 4.59GB | 6.78GB | 5.19GB Total Bytes Using Similarity | 233.62GB | 233.62GB | 233.62GB | 233.62GB Similarity Reduction Gain if ref chain Percentage | 86.88% | 0.00% | 0.00% | 0.00% Similarity Reduction Gain if ref chain | 224.86GB | 0.00B | 0.00B | 0.00B Data Aware Compression Accuracy: Total Chunks Compared for Discovering Optimal Encoding: 108046 Total Correct Optimal Encoding Predictions: 107781 Total Wrong Optimal Encoding Predictions: 265 Correct Predictions Percentage: 99.75% Predictions Per Encoder Type: {ENCODER_NONE=107825, ENCODER_SHUFFLE=195, ENCODER_DELTA_ENCODE=14, ENCODER_DELTA_ENCODE_4_SHUFFLE=12} Wrong Predictions Per Encoder Type: {ENCODER_NONE=98, ENCODER_SHUFFLE=160, ENCODER_DELTA_ENCODE=1, ENCODER_DELTA_ENCODE_4_SHUFFLE=6} Wrong Predictions Percentage Per Encoder: - Encoder ENCODER_NONE = 0.09% - Encoder ENCODER_SHUFFLE = 82.05% - Encoder ENCODER_DELTA_ENCODE = 7.14% - Encoder ENCODER_DELTA_ENCODE_4_SHUFFLE = 50.00% * Note: Wrong predictions does not mean that there is no gain from the encoder, but rather that there is a better one. 
Total Pre-Encoding Compressed Size of Chunks Used in Predictions: 1.07GB Total Post-Encoding Compression Size of Chunks Used in Predictions: 1.07GB Total Optimal Compression Size of Chunks Used in Predictions: 1.07GB Total Size Difference Between Predicted and Optimal Encoded Compression: 274.36KB (Optimal compression size is smaller than the predicted compression size by 0.02%) Approximate Total Local Data-Reduction Factor Without Data Aware Compression: 4.83:1 (79.30% reduction) Actual Total Global Data-Reduction Factor Without Data Aware Compression (available at 100% sampling): N/a ====================== Experimental Features: ====================== Similarity Reduction Gain if ref chain: 1.74% (4.49GB out of total bytes using Similarity: 233.62GB) Extra space gain in optimal compression: 47.57GB - Extra local compression space gain in case of using compression_level 8: 4.35GB - Extra local compression space gain in case of using compression_level 8: 3.16GB Adaptive Chunking min_chunk_size=AAA max_chunk_size=BBB desired_chunk_size=CCC are all internal settings that we may change from probe version to probe version. Otherwise they should be ignored. Theoretical Average Chunk Size should be ignored Number of chunks split via XXXX : adaptive chunking automatically adjusts the size of data chunks to improve deduplication and similarity matching. These three metrics tell us a bit about how we are doing. via hash : the count of chunks that were split using the automated data sensitive splitting. Typically this will be a high value. via buffer end : the count of chunks that were split simply because we reached the end of the relevant data stream. A likely cause is simply the end of a file. via max size reached : the count of chunks that were split because the chunks would have otherwise been too large. Data Aware Compression Encoding Sampling Reduction Summary summarizes the various different data aware compression (DAC) encodings and how well they worked for all of the data chunks. The probe randomly selects some number of chunks (sampling) and tries all encoding schemes. This is not what VAST or the probe does for all chunks as it is too expensive. Instead, the system examines a bit of each data chunk and decides on the DAC encoding scheme to use and then uses it - we call this prediction. This table show how the different schemes fared and helps us understand if our predictions are accurate. In general this table can be ignored. Correct Predictions Percentage tells us how often our predictions where correct. This calculation is based upon these values: Total Chunks Compared for Discovering Optimal Encoding : how many chunks were sampled for checking purposes Total Correct Optimal Encoding Predictions : how often the predictor was correct Total Wrong Optimal Encoding Predictions : how often the predictor was wrong Total Size Difference Between Predicted and Optimal Encoded Compression indicates how well our predictor selected the optimal DAC encoding scheme in terms of space used. If the number here is small (less than 5%) then the predictor is doing well. If it is larger, please let us know. 
These are the inputs to this calculation: Total Pre-Encoding Compressed Size of Chunks Used in Predictions : size of chunks before reduction Total Post-Encoding Compression Size of Chunks Used in Predictions : size of chunks after reduction Total Optimal Compression Size of Chunks Used in Predictions - the optimal reduction (basically trying all possible encodings based upon sampling) Approximate Total Local Data-Reduction Factor Without Data Aware Compression - our estimate (based upon sampling) of the data reduction without DAC. Basically if the value here is smaller than the value reported in the first part of the summary, DAC was a win. Experimental Features Extra space gain in optimal compression - this considers advanced data reduction algorithms that are under consider for future versions but have not yet implemented in actual released products. If you see a very large value here relative to the total data, let us know. That's very interesting to us! Extra local compression space gain in case of using compression_level 8 - this indicates how much space could be saved in local compression if the most expensive ZSTD compression setting. This isn't done on real clusters as it impact performance, but it's a useful metric for our engineering. Typically the additional savings is minimal which is good.","title":"Advanced Considerations"},{"location":"prerequisites/index.html","text":"Prerequisites Overview \u00b6 Before we can start deploying the HPE GreenLake for File Storage Data Reduction Estimation Probe, take a moment to review the prerequisites to understand the hardware and software requirements to successfully run the probe. This is intended for customers that are running the probe on their own infrastructure. Prerequisites Overview Hardware Minimum Requirements Operating System Minimum Requirements Software Requirements Sample Data Set Filesystem Requirements Hardware Requirement Examples Hardware Minimum Requirements \u00b6 Actual hardware requirements depend on the amount of data to be scanned. Examples on how to scope hardware based on dataset size are provided at the end of this page. 16 CPU cores or higher Intel Broadwell-compatible or later CPUs The Probe requires CPU instructions that are not available on older CPUs The Probe will run virtually on Intel based hardware that has a Virtual Cluster vMotion minimum compatibility of Intel Broadwell-compatible or later The Probe has not been evaluated on AMD CPUs 128 GB RAM or higher The probe consumes almost 100GB of RAM upon launch The more RAM, the better the Probe will perform and the more data can be scanned 10 GbE Networking or higher 50 GB SSD-backed local storage or higher (NVMe or FC/iSCISI LUNs) This local SSD capacity is needed for the database the probe builds and logging Must be equivalent to 0.6% of the data to be scanned Disk storage must have very high sustained IOPs The larger the local SSD allocated, the more data can be scanned Local SSD filesystem should be ext4 or xfs Operating System Minimum Requirements \u00b6 We've tested the following, but most modern Linux distributions should be fine: Ubuntu 18.04, 20.04 Centos/RHEL 7.4+ Rocky/RHEL 8.3+ Software Requirements \u00b6 Docker: 17.05 + python3 (for launching the probe) screen (for running the probe in the background) wget (for downloading the probe image) Sample Data Set Filesystem Requirements \u00b6 Be aware that if the filesystem has atime enabled, any scanning, even while mounted as read-only will update the atime clock. 
NFS : The Probe host has to be provided root-squash and read-only access For faster scanning, use an operating system that has nconnect support: Ubuntu 20.04+ RHEL/Rocky 8.4+ Lustre : The Probe host and container must be able to read as a root user GPFS : The Probe host and container must be able to read as a root user SMB : The Probe host should be mounted with a user in the BUILTIN\\Backup Operators group to avoid file access issues. S3/Object : We have tested internally with goofys as a method of imitating a filesystem It is not recommended to scan anything in AWS Glacier or equivalent Hardware Requirement Examples \u00b6 Example A : You have a server with 768GB of RAM: 154GB is for the Operating System, leaving 614GB of RAM... There are 100 million files to scan, that will occupy ~5GB of RAM, leaving 609GB of RAM... 50-bytes per 'filename' This leaves 609GB of RAM available for the RAM index --ram-index-size-gb 609 This can scan up to 99TB of data using just RAM and no significant local SSD space is needed This calculation is based on a 0.6% rule to accommodate similarity and deduplication hashes Using a disk index you can scan far more data and the file count could exceed 10 billion with a 500GB file name cache Example B : You have a server with 128GB of RAM and a Local SSD: 26GB is for the Operating System, leaving 102GB of RAM... There are 100 million files to scan, that will occupy ~5GB of RAM, leaving 97GB of RAM... 50-bytes per 'filename' This leaves 97GB of RAM available for the RAM index --ram-index-size-gb 97 This can scan up to 15TB of data using just RAM and no significant local SSD space is needed This calculation is based on a 0.6% rule to accommodate similarity and deduplication hashes Using a disk index you can scan far more data and the file count could be as high as 2 billion with a 100GB file name cache 15TB of data requires 90GB of local SSD disk 100TB of data requires 600GB of local SSD disk","title":"Prerequisites"},{"location":"prerequisites/index.html#prerequisites_overview","text":"Before we can start deploying the HPE GreenLake for File Storage Data Reduction Estimation Probe, take a moment to review the prerequisites to understand the hardware and software requirements to successfully run the probe. This is intended for customers that are running the probe on their own infrastructure. Prerequisites Overview Hardware Minimum Requirements Operating System Minimum Requirements Software Requirements Sample Data Set Filesystem Requirements Hardware Requirement Examples","title":"Prerequisites Overview"},{"location":"prerequisites/index.html#hardware_minimum_requirements","text":"Actual hardware requirements depend on the amount of data to be scanned. Examples on how to scope hardware based on dataset size are provided at the end of this page.
16 CPU cores or higher Intel Broadwell-compatible or later CPUs The Probe requires CPU instructions that are not available on older CPUs The Probe will run virtually on Intel based hardware that has a Virtual Cluster vMotion minimum compatibility of Intel Broadwell-compatible or later The Probe has not been evaluated on AMD CPUs 128 GB RAM or higher The probe consumes almost 100GB of RAM upon launch The more RAM, the better the Probe will perform and the more data can be scanned 10 GbE Networking or higher 50 GB SSD-backed local storage or higher (NVMe or FC/iSCISI LUNs) This local SSD capacity is needed for the database the probe builds and logging Must be equivalent to 0.6% of the data to be scanned Disk storage must have very high sustained IOPs The larger the local SSD allocated, the more data can be scanned Local SSD filesystem should be ext4 or xfs","title":"Hardware Minimum Requirements"},{"location":"prerequisites/index.html#operating_system_minimum_requirements","text":"We've tested the following, but most modern Linux distributions should be fine: Ubuntu 18.04, 20.04 Centos/RHEL 7.4+ Rocky/RHEL 8.3+","title":"Operating System Minimum Requirements"},{"location":"prerequisites/index.html#software_requirements","text":"Docker: 17.05 + python3 (for launching the probe) screen (for running the probe in the background) wget (for downloading the probe image)","title":"Software Requirements"},{"location":"prerequisites/index.html#sample_data_set_filesystem_requirements","text":"Be aware that if the filesystem has atime enabled, any scanning, even while mounted as read-only will update the atime clock. NFS : The Probe host has be provided root-squash and read only access For faster scanning, use an operating system that has nconenct support: Ubuntu 20.04+ RHEL/Rocky 8.4+ Lustre : The Probe host and container must be able to read as a root user GPFS : The Probe host and container must be able to read as a root user SMB : The Probe host should be mounted with a user in the BUILTIN\\Backup Operators group to avoid file access issues. S3/Object : We have tested internally with goofys as a method of imitating a filesystem It is not recommend to scan anything in AWS Glacier or equivalent","title":"Sample Data Set Filesystem Requirements"},{"location":"prerequisites/index.html#hardware_requirement_examples","text":"Example A : You have a server with 768GB of RAM: 154GB is for the Operating System, leaving 614GB of RAM... There are 100 million files to scan, that will occupy ~5GB of RAM, leaving 609GB of RAM... 50-bytes per 'filename' This leaves 609GB of RAM available for the RAM index --ram-index-size-gb 609 This can scan up to 99TB of data using just RAM and no significant local SSD space is needed This calculation is based on a 0.6% rule to accommodate similarity and deduplication hashes Use of a disk index you can scan far more data and the file count could exceed 10 billion with a 500GB file name cache Example B : You have a server with 128GB of RAM and a Local SSD: 26GB is for the Operating System, leaving 102GB of RAM... There are 100 million files to scan, that will occupy ~5GB of RAM, leaving 97GB of RAM... 
50-bytes per 'filename' This leaves 97GB of RAM available for the RAM index --ram-index-size-gb 97 This can scan up to 15TB of data using just RAM and no significant local SSD space is needed This calculation is based on a 0.6% rule to accommodate similarity and deduplication hashes Using a disk index you can scan far more data and the file count could be as high as 2 billion with a 100GB file name cache 15TB of data requires 90GB of local SSD disk 100TB of data requires 600GB of local SSD disk","title":"Hardware Requirement Examples"},{"location":"troubleshooting/index.html","text":"Troubleshooting Overview \u00b6 In general you can monitor the probe's behavior by watching the log file it generates as well as the standard output it generates to the console. Here we document common errors with the probe. tail -f /mnt/probe/log/XXXX.log Troubleshooting Overview Launcher Hang CPU Compatibility CGroup Error Privilege Error Illegal Instruction Launcher Hang \u00b6 In rare cases the probe will complete its successfully but the python launcher script will hang. If this happens, simply control-C the launcher if you are still attached to the terminal. Or use ps -ef to find the probe process and kill it. CPU Compatibility \u00b6 Occasionally, the probe will launch and finish without scanning any files and may produce an error related to log files. When viewing logs you may see: probe terminated by signal: SIGILL Check the CPU Compatibility: cat /sys/devices/cpu/caps/pmu_name Review the CPU Requirements . CGroup Error \u00b6 You may get the following error on older Linux builds: Loading docker image Starting probe docker container 783735ffbf1a722ebbcd43622476ea2364a8873dc2a6f95a4d006778636ed513 /usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:235: starting container process caused \"process_linux.go:258: applying cgroup configuration for process caused \\\"Cannot set property TasksAccounting, or unknown property.\\\"\". Failed starting the probe If you do you need to update the systemd related packages. Here are the versions we use: rpm -qa |grep -i systemd systemd-libs-219-67.el7_7.2.x86_64 systemd-sysv-219-67.el7_7.2.x86_64 systemd-219-67.el7_7.2.x86_64 oci-systemd-hook-0.2.0-1.git05e6923.el7_6.x86_64 Privilege Error \u00b6 When the container is launched it may fail with this error message: Starting probe docker container docker: Error response from daemon: privileged mode is incompatible with user namespaces. You must run the container in the host namespace when running privileged mode. This message is a warning that your docker environment will not allow docker to run with heightened permissions. This probably caused by a more secure docker configuration, such as placing the following text into the /etc/docker/daemon.json file which basically prevents a container from running as root with privileges: { \"userns-remap\": \"dockremap:dockremap\" } The easiest way to address that is to hand edit the VAST provided probe_launcher.py and look for the 'docker run' line. It will look something like this: cmd = f'docker run --privileged -v {args.metadata_dir}:/probe_mnt/{args.metadata_dir} -v {args.output_dir}:/probe_mnt/{args.output_dir} ' Notice the --privileged . Remove it, save the file, and try again. If that doesn't fix the issue, we've found that removing the userns-remap line and restarting the docker service can be more effective. 
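Related to the CPU Compatibility guidance above and the Illegal Instruction item below, a quick pre-flight check can save a failed run. This is only a sketch: the set of "pre-Broadwell" PMU names below is an assumption, and /sys/devices/cpu/caps/pmu_name may not be exposed on every kernel or on non-Intel systems, so treat the result as a prompt to verify the CPU model against the prerequisites rather than a definitive answer.

```python
from pathlib import Path

# Assumed list of PMU names older than Broadwell; adjust as needed.
TOO_OLD = {"nehalem", "westmere", "sandybridge", "ivybridge", "haswell"}

def check_cpu() -> None:
    pmu = Path("/sys/devices/cpu/caps/pmu_name")
    if not pmu.exists():
        print("pmu_name not exposed; check the CPU model manually against the prerequisites")
        return
    name = pmu.read_text().strip()
    if name in TOO_OLD:
        print(f"{name}: older than Broadwell, expect SIGILL from the probe")
    else:
        print(f"{name}: appears to meet the Broadwell-or-later requirement")

if __name__ == "__main__":
    check_cpu()
```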
Illegal Instruction \u00b6 If the probe fails with a core dump and it shows an illegal instruction, you may be using an old CPU type which is not compatible with our compiled code. We require Intel Broadwell or newer compatible CPUs. If you use GDB to debug the core you can confirm by looking for something like this: GDB output from core: $ gdb -c core.97296 For help, type \"help\". Type \"apropos word\" to search for commands related to \"word\". Core was generated by `/vast/install/probe/sim_estimator --similarity-function fast_hash_8 --split-win'. Program terminated with signal SIGILL, Illegal instruction. #0 0x00007fffec9deea7 in ?? () (gdb)","title":"Troubleshooting"},{"location":"troubleshooting/index.html#troubleshooting_overview","text":"In general you can monitor the probe's behavior by watching the log file it generates as well as the standard output it generates to the console. Here we document common errors with the probe. tail -f /mnt/probe/log/XXXX.log Troubleshooting Overview Launcher Hang CPU Compatibility CGroup Error Privilege Error Illegal Instruction","title":"Troubleshooting Overview"},{"location":"troubleshooting/index.html#launcher_hang","text":"In rare cases the probe will complete its successfully but the python launcher script will hang. If this happens, simply control-C the launcher if you are still attached to the terminal. Or use ps -ef to find the probe process and kill it.","title":"Launcher Hang"},{"location":"troubleshooting/index.html#cpu_compatibility","text":"Occasionally, the probe will launch and finish without scanning any files and may produce an error related to log files. When viewing logs you may see: probe terminated by signal: SIGILL Check the CPU Compatibility: cat /sys/devices/cpu/caps/pmu_name Review the CPU Requirements .","title":"CPU Compatibility"},{"location":"troubleshooting/index.html#cgroup_error","text":"You may get the following error on older Linux builds: Loading docker image Starting probe docker container 783735ffbf1a722ebbcd43622476ea2364a8873dc2a6f95a4d006778636ed513 /usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:235: starting container process caused \"process_linux.go:258: applying cgroup configuration for process caused \\\"Cannot set property TasksAccounting, or unknown property.\\\"\". Failed starting the probe If you do you need to update the systemd related packages. Here are the versions we use: rpm -qa |grep -i systemd systemd-libs-219-67.el7_7.2.x86_64 systemd-sysv-219-67.el7_7.2.x86_64 systemd-219-67.el7_7.2.x86_64 oci-systemd-hook-0.2.0-1.git05e6923.el7_6.x86_64","title":"CGroup Error"},{"location":"troubleshooting/index.html#privilege_error","text":"When the container is launched it may fail with this error message: Starting probe docker container docker: Error response from daemon: privileged mode is incompatible with user namespaces. You must run the container in the host namespace when running privileged mode. This message is a warning that your docker environment will not allow docker to run with heightened permissions. This probably caused by a more secure docker configuration, such as placing the following text into the /etc/docker/daemon.json file which basically prevents a container from running as root with privileges: { \"userns-remap\": \"dockremap:dockremap\" } The easiest way to address that is to hand edit the VAST provided probe_launcher.py and look for the 'docker run' line. 
It will look something like this: cmd = f'docker run --privileged -v {args.metadata_dir}:/probe_mnt/{args.metadata_dir} -v {args.output_dir}:/probe_mnt/{args.output_dir} ' Notice the --privileged . Remove it, save the file, and try again. If that doesn't fix the issue, we've found that removing the userns-remap line and restarting the docker service can be more effective.","title":"Privilege Error"},{"location":"troubleshooting/index.html#illegal_instruction","text":"If the probe fails with a core dump and it shows an illegal instruction, you may be using an old CPU type which is not compatible with our compiled code. We require Intel Broadwell or newer compatible CPUs. If you use GDB to debug the core you can confirm by looking for something like this: GDB output from core: $ gdb -c core.97296 For help, type \"help\". Type \"apropos word\" to search for commands related to \"word\". Core was generated by `/vast/install/probe/sim_estimator --similarity-function fast_hash_8 --split-win'. Program terminated with signal SIGILL, Illegal instruction. #0 0x00007fffec9deea7 in ?? () (gdb)","title":"Illegal Instruction"}]}
\ No newline at end of file
+{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"index.html","text":"HPE GreenLake for File Storage Data Reduction Estimation Probe \u00b6 This tool is designed to run against an existing one or more File Storage mounts and provide an accurate estimation of how much data reduction you should expect to see when moving your data set to an HPE GreenLake for File Storage solution. Synopsis \u00b6 This documentation shows how to check the prerequisites , deploy the probe, and understand the output . Support \u00b6 Typically you would work with your HPE Sales engineer to deploy and use The HPE GreenLake for File Storage Data Reduction Estimation Probe. Should HPE Sales engineers have issues or additional questions please contact us at Slack channel #ask-greenlake-for-filestorage.","title":"HPE GreenLake for File Storage Data Reduction Estimation Probe"},{"location":"index.html#hpe_greenlake_for_file_storage_data_reduction_estimation_probe","text":"This tool is designed to run against an existing one or more File Storage mounts and provide an accurate estimation of how much data reduction you should expect to see when moving your data set to an HPE GreenLake for File Storage solution.","title":"HPE GreenLake for File Storage Data Reduction Estimation Probe"},{"location":"index.html#synopsis","text":"This documentation shows how to check the prerequisites , deploy the probe, and understand the output .","title":"Synopsis"},{"location":"index.html#support","text":"Typically you would work with your HPE Sales engineer to deploy and use The HPE GreenLake for File Storage Data Reduction Estimation Probe. Should HPE Sales engineers have issues or additional questions please contact us at Slack channel #ask-greenlake-for-filestorage.","title":"Support"},{"location":"deployment/index.html","text":"Deployment Overview \u00b6 The HPE GreenLake for File Storage Data Reduction Estimation Probe provides estimated data reduction rate achieable based on an example data set. Make sure to review the prerequisites to understand the hardware and software requirements to successfully run the probe. This article will guide you through the process of deployment and execution of the probe. Deployment Overview Download Expand & Verify Download Mount Filesystems Selected to Be Probed Create Probe Directories Size of the Data Set Running The Probe Other Probe Flags Understanding the Results Re-Running The Probe Troubleshooting Download \u00b6 Download using sftp to the Linux client that you wish to run the probe. % sftp gl4f_probe@halo.storagelr5.ext.hpe.com:/935553.probe.bundle.tar.gz . The authenticity of host 'halo.storagelr5.ext.hpe.com (63.215.98.146)' can't be established. Type in password: HPE@cc3$$4SFTP gl4f_probe@halo.storagelr5.ext.hpe.com's password: Connected to halo.storagelr5.ext.hpe.com. Fetching /935553.probe.bundle.tar.gz to ./935553.probe.bundle.tar.gz Expand & Verify Download \u00b6 Now that you've downloaded the probe, you'll need to untar it and then verify the download is correct. export PROBE_BUILD=935553 tar -xzf ${PROBE_BUILD}.probe.bundle.tar.gz ls -l Note: example may not show current build numbers. [root@iris-centos-workloadclient-22 probe]# ls -l total 1840344 -rw-r--r--. 1 root root 937920831 Jul 12 12:44 935553.probe.bundle.tar.gz -rw-r--r--. 1 root root 946565338 Jul 12 12:44 935553.probe.image.gz -rwxr-xr-x. 
1 root root 19579 Jul 12 12:44 probe_launcher.py Mount Filesystems Selected to Be Probed \u00b6 Validated Filesystems Include, But Are Not Limited To: NFS Lustre GPFS S3 with goofys CIFS/SMB For the most accurate results, do not use root-squash. It's recommended to set read-only access on the mounted filesystem Create Probe Directories \u00b6 Change /mnt/ to the SSD-backed local disk to be used by the probe for the hash database and logging directories sudo mkdir -p /mnt/probe/db sudo mkdir -p /mnt/probe/out sudo chmod -Rf 777 /mnt/probe Size of the Data Set \u00b6 The input to the probe is a defined directory ( --input-dir ) The probe will automatically query the input filesystem about space consumed and file count (inodes) and use that in its calculations Depending on the method of mounting and underlying storage, this can often provide an inaccurate query response It's highly recommended that manual estimated entries be defined for space consumed ( --data-size-gb ) and file count ( --number-of-files ) These estimates do not have to be accurate, round up reasonably Running The Probe \u00b6 The probe runs as a foreground application. This means that if your session is closed for whatever reason, the probe will stop. It's recommended running the probe as a screen session. Here is an example of a command line. Edit the bold variables for the environment: NOTE: Use underscores instead of spaces in COMPANY_NAME and WORKLOAD export DB_DIR=/mnt/probe/db export OUTPUT_DIR=/mnt/probe/out export INPUT_DIR=/mnt/filesystem_to_be_probed/sub_directory export INPUT_SIZE_GB=10000 export QTY_FILES=1000000 export COMPANY_NAME=Your_Amazing_Company export WORKLOAD=Describe_Your_Workload Start the probe: (This may take up to five minutes to start displaying output) sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir $INPUT_DIR \\ --metadata-dir $DB_DIR \\ --output-dir $OUTPUT_DIR \\ --data-size-gb $INPUT_SIZE_GB \\ --number-of-files $QTY_FILES \\ --customer-name ${COMPANY_NAME}---${WORKLOAD} Example One: Small Data Sets To probe the directory interesting_data of 15 TB in-use and 5,000,000 files at the company ACME, the command would be: sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir /mnt/acme_filer/interesting_data \\ --metadata-dir /mnt/data/probe/db \\ --output-dir /mnt/data/probe/out \\ --data-size-gb 15000 \\ --number-of-files 5000000 \\ --customer-name ACME---Interesting_Data Example Two: Larger Data Sets To probe the directory fascinating_data of 60 TB in-use and 750,000,000 files at the company FOO, and are using defined parameters for RAM and SSD-backed local disk the command would be: sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir /mnt/foo_filer/fascinating_data \\ --metadata-dir /mnt/data/probe/db \\ --output-dir /mnt/data/probe/out \\ --data-size-gb 60000 \\ --number-of-files 750000000 \\ --customer-name FOO---Facinating_Data Example Three: Performance Throttling To probe the directory riviting_data of 250 TB in-use and 1,250,000,000 files at the company Initech, using defined parameters for RAM and SSD-backed local disk, but wish to have a lower performance impact on the filesystem, the command would be: sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir /mnt/initech_filer/riviting_data \\ --metadata-dir /mnt/data/probe/db \\ --output-dir /mnt/data/probe/out \\ --data-size-gb 250000 \\ 
--number-of-files 1250000000 \\ --number-of-threads 4 --customer-name Initech---Riviting_Data Note the --number-of-threads flag. By default the probe will use all CPU cores in the system but this can be used to throttle performance and reduce potential impact of the scanned filesystem. Other Probe Flags \u00b6 While the probe is running and after completion, telemetry logs are automatically uploaded to HPE. To prevent this, add the following flag: --dont-send-logs \\ If you wish to send file names with the default telemetry logs, add the following flag: --send-logs-with-file-names \\ Probing filesystems which contain snapshots can often cause recursion issues and inaccurate results. As a result the probe automatically ignores directories named .snapshot. If your file system uses another convention, use the --regexp-filter command. If for some reason you want the probe to read the .snapshot directories, specify false rather than true for --filter-snapshots . --filter-snapshots \\ (this is the default) Under most circumstances the probe should be run with adaptive chunking. However you can disable that feature by specifying this flag: --disable-adaptive-chunking \\ Understanding the Results \u00b6 Once started, the probe will display the current projection of potential data reduction. Once completed, the probe will display output and is further described in Understanding Output Re-Running The Probe \u00b6 The hash database must be empty before running the probe again: sudo rm -r /mnt/probe/db/* Troubleshooting \u00b6 Refer to the Troubleshooting document and contact HPE Support.","title":"Deployment"},{"location":"deployment/index.html#deployment_overview","text":"The HPE GreenLake for File Storage Data Reduction Estimation Probe provides estimated data reduction rate achieable based on an example data set. Make sure to review the prerequisites to understand the hardware and software requirements to successfully run the probe. This article will guide you through the process of deployment and execution of the probe. Deployment Overview Download Expand & Verify Download Mount Filesystems Selected to Be Probed Create Probe Directories Size of the Data Set Running The Probe Other Probe Flags Understanding the Results Re-Running The Probe Troubleshooting","title":"Deployment Overview"},{"location":"deployment/index.html#download","text":"Download using sftp to the Linux client that you wish to run the probe. % sftp gl4f_probe@halo.storagelr5.ext.hpe.com:/935553.probe.bundle.tar.gz . The authenticity of host 'halo.storagelr5.ext.hpe.com (63.215.98.146)' can't be established. Type in password: HPE@cc3$$4SFTP gl4f_probe@halo.storagelr5.ext.hpe.com's password: Connected to halo.storagelr5.ext.hpe.com. Fetching /935553.probe.bundle.tar.gz to ./935553.probe.bundle.tar.gz","title":"Download"},{"location":"deployment/index.html#expand_verify_download","text":"Now that you've downloaded the probe, you'll need to untar it and then verify the download is correct. export PROBE_BUILD=935553 tar -xzf ${PROBE_BUILD}.probe.bundle.tar.gz ls -l Note: example may not show current build numbers. [root@iris-centos-workloadclient-22 probe]# ls -l total 1840344 -rw-r--r--. 1 root root 937920831 Jul 12 12:44 935553.probe.bundle.tar.gz -rw-r--r--. 1 root root 946565338 Jul 12 12:44 935553.probe.image.gz -rwxr-xr-x. 
1 root root 19579 Jul 12 12:44 probe_launcher.py","title":"Expand & Verify Download"},{"location":"deployment/index.html#mount_filesystems_selected_to_be_probed","text":"Validated Filesystems Include, But Are Not Limited To: NFS Lustre GPFS S3 with goofys CIFS/SMB For the most accurate results, do not use root-squash. It's recommended to set read-only access on the mounted filesystem","title":"Mount Filesystems Selected to Be Probed"},{"location":"deployment/index.html#create_probe_directories","text":"Change /mnt/ to the SSD-backed local disk to be used by the probe for the hash database and logging directories sudo mkdir -p /mnt/probe/db sudo mkdir -p /mnt/probe/out sudo chmod -Rf 777 /mnt/probe","title":"Create Probe Directories"},{"location":"deployment/index.html#size_of_the_data_set","text":"The input to the probe is a defined directory ( --input-dir ) The probe will automatically query the input filesystem about space consumed and file count (inodes) and use that in its calculations Depending on the method of mounting and underlying storage, this can often provide an inaccurate query response It's highly recommended that manual estimated entries be defined for space consumed ( --data-size-gb ) and file count ( --number-of-files ) These estimates do not have to be accurate, round up reasonably","title":"Size of the Data Set"},{"location":"deployment/index.html#running_the_probe","text":"The probe runs as a foreground application. This means that if your session is closed for whatever reason, the probe will stop. It's recommended running the probe as a screen session. Here is an example of a command line. Edit the bold variables for the environment: NOTE: Use underscores instead of spaces in COMPANY_NAME and WORKLOAD export DB_DIR=/mnt/probe/db export OUTPUT_DIR=/mnt/probe/out export INPUT_DIR=/mnt/filesystem_to_be_probed/sub_directory export INPUT_SIZE_GB=10000 export QTY_FILES=1000000 export COMPANY_NAME=Your_Amazing_Company export WORKLOAD=Describe_Your_Workload Start the probe: (This may take up to five minutes to start displaying output) sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir $INPUT_DIR \\ --metadata-dir $DB_DIR \\ --output-dir $OUTPUT_DIR \\ --data-size-gb $INPUT_SIZE_GB \\ --number-of-files $QTY_FILES \\ --customer-name ${COMPANY_NAME}---${WORKLOAD} Example One: Small Data Sets To probe the directory interesting_data of 15 TB in-use and 5,000,000 files at the company ACME, the command would be: sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir /mnt/acme_filer/interesting_data \\ --metadata-dir /mnt/data/probe/db \\ --output-dir /mnt/data/probe/out \\ --data-size-gb 15000 \\ --number-of-files 5000000 \\ --customer-name ACME---Interesting_Data Example Two: Larger Data Sets To probe the directory fascinating_data of 60 TB in-use and 750,000,000 files at the company FOO, and are using defined parameters for RAM and SSD-backed local disk the command would be: sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir /mnt/foo_filer/fascinating_data \\ --metadata-dir /mnt/data/probe/db \\ --output-dir /mnt/data/probe/out \\ --data-size-gb 60000 \\ --number-of-files 750000000 \\ --customer-name FOO---Facinating_Data Example Three: Performance Throttling To probe the directory riviting_data of 250 TB in-use and 1,250,000,000 files at the company Initech, using defined parameters for RAM and SSD-backed local disk, but wish to have a 
lower performance impact on the filesystem, the command would be: sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir /mnt/initech_filer/riviting_data \\ --metadata-dir /mnt/data/probe/db \\ --output-dir /mnt/data/probe/out \\ --data-size-gb 250000 \\ --number-of-files 1250000000 \\ --number-of-threads 4 \\ --customer-name Initech---Riviting_Data Note the --number-of-threads flag. By default the probe will use all CPU cores in the system but this can be used to throttle performance and reduce potential impact of the scanned filesystem.","title":"Running The Probe"},{"location":"deployment/index.html#other_probe_flags","text":"While the probe is running and after completion, telemetry logs are automatically uploaded to HPE. To prevent this, add the following flag: --dont-send-logs \\ If you wish to send file names with the default telemetry logs, add the following flag: --send-logs-with-file-names \\ Probing filesystems which contain snapshots can often cause recursion issues and inaccurate results. As a result the probe automatically ignores directories named .snapshot. If your file system uses another convention, use the --regexp-filter command. If for some reason you want the probe to read the .snapshot directories, specify false rather than true for --filter-snapshots . --filter-snapshots \\ (this is the default) Under most circumstances the probe should be run with adaptive chunking. However you can disable that feature by specifying this flag: --disable-adaptive-chunking \\","title":"Other Probe Flags"},{"location":"deployment/index.html#understanding_the_results","text":"Once started, the probe will display the current projection of potential data reduction. Once completed, the probe will display output that is further described in Understanding Output","title":"Understanding the Results"},{"location":"deployment/index.html#re-running_the_probe","text":"The hash database must be empty before running the probe again: sudo rm -r /mnt/probe/db/*","title":"Re-Running The Probe"},{"location":"deployment/index.html#troubleshooting","text":"Refer to the Troubleshooting document and contact HPE Support.","title":"Troubleshooting"},{"location":"faq/index.html","text":"General FAQ \u00b6 Q: How does the probe handle symbolic links? A : The probe ignores symbolic links. Thus if it is scanning a directory tree and encounters a symbolic link to some other area in the file system, it will not follow it. Q: How does the probe handle hard links? A : The probe attempts to detect if two files in the tree it is scanning point to the same data and automatically ignores the duplication. Q: How does the probe handle sparse files? A : By default the probe is not aware of sparse files. This means that it will read zero values for the sparse regions of the files, which can result in artificially high data reduction. The probe reports zero chunks to hint at this potential issue. Refer to Understanding VAST Probe Output for more details. Note that the probe can be run to recognize sparse files on some file systems as described in the document just referenced. Q: Can the probe scan multiple unrelated directory trees? A : Yes it can. This is done by providing multiple --input-dir values.
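For example, assuming two hypothetical mount points /mnt/filer_a/projects and /mnt/filer_b/archive (adjust the paths, sizes, and names for your environment), a single run covering both trees could be invoked as: sudo python3 ./probe_launcher.py \\ --probe-image-path ${PROBE_BUILD}.probe.image.gz \\ --input-dir /mnt/filer_a/projects \\ --input-dir /mnt/filer_b/archive \\ --metadata-dir /mnt/probe/db \\ --output-dir /mnt/probe/out \\ --data-size-gb 20000 \\ --number-of-files 10000000 \\ --customer-name ACME---Two_Trees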
Security FAQ \u00b6 The HPE GreenLake for File Storage Data Reduction Estimation Probe software is provided at zero cost with zero warranty to HPE\u2019s current and prospective customers in order to accurately estimate Data Reduction Rates of specific data not yet on HPE Storage systems. The probe software is run on physical or virtualized customer-maintained hardware and analyzes data that the customer allows access to through traditional filesystem based access. The results of the probe are used to determine a Data Reduction Rate which will often be used to project an aggregate financial savings for HPE\u2019s current and prospective customers. Q: Where does the VAST Probe originate? A: The HPE GreenLake for File Storage Data Reduction Estimation Probe is a Docker container of scripts and libraries maintained and assembled solely by HPE and VAST Data engineering which is updated frequently, usually quarterly. The links to download the probe are posted on this GitHub repository. Q: Where does the VAST Probe run? A: The HPE GreenLake for File Storage Data Reduction Estimation Probe is designed to be run within a customer environment on physical or virtualized customer-maintained equipment. The provided container requires a base Linux operating system which is expected to be installed and updated by the customer before the probe is launched. Q: What information does the VAST Probe collect? A: The HPE GreenLake for File Storage Data Reduction Estimation Probe generates a series of logs for each iteration of data scanning. These logs are by default saved on the same physical or virtualized customer-maintained equipment that the probe runs on. These logs contain references to paths which have been provided as inputs, and can refer to any path within that directory structure when making declarative statements about data reduction results. The analysis log file that is generated upon completion of the Data Reduction Probe prints each full path with figures about data reduction rate for that path. In addition, a secondary section of the same analysis log file prints aggregate information about specific file extensions with figures about data reduction rate for that file extension. Q: What information does the HPE GreenLake for File Storage Data Reduction Estimation Probe send back to HPE? A: The HPE GreenLake for File Storage Data Reduction Estimation Probe has built-in call home telemetry which is on by default when executed, assuming the probe has access to specific HPE endpoints via the internet. While the probe is running, telemetry logs will be sent approximately every 5 minutes. These telemetry logs, by default, omit references to full paths with the exception of the root input path and simply upload a percentage-based status of the probe as well as any error messages. The final telemetry log is similar to the local analysis log file but, by default, removes full paths with the exception of the root input path.
The final telemetry log will send the aggregated data reduction rates based on file extensions as illustrated below: file extension statistics: file type .xlsx, original_size=143.7GB, global_compression_reduced_size=126.6GB, global_compression_factor=1.14, dedup_percentage=10.34%, similarity_match_percentage=15.12%, similarity_gain=310.9MB, local_compression_only_size=126.9GB file type .tsv, original_size=291.5GB, global_compression_reduced_size=30.8GB, global_compression_factor=9.47, dedup_percentage=1.95%, similarity_match_percentage=84.83%, similarity_gain=9.6GB, local_compression_only_size=40.4GB Q: Who can access the logs sent to VAST Data? A: Anyone at HPE engineering or sales has access to the call home backend that is used as the telemetry destination for the HPE GreenLake for File Storage Data Reduction Estimation Probe. Q: What actions are performed with the logs sent to HPE? A: The telemetry logs are primarily used by sales to determine a Data Reduction Rate which will often be used to project an aggregate financial savings for HPE\u2019s current and prospective customers. Alternatively, any telemetry logs can be used to determine an expected Data Reduction Rate for a given industry or use case which may be similar to a sales team\u2019s customer which has not run the probe. HPE engineering also uses the telemetry data for bug fixes and over all improvements to the software and user experience. Q: How do I control what the VAST Probe sends back to VAST Data? A: This call home telemetry feature can be disabled at runtime with the added flag: --dont-send-logs If you wish to send file names with the default telemetry logs, add the following flag: --send-logs-with-file-names","title":"FAQ"},{"location":"faq/index.html#general_faq","text":"Q: How does the probe handle symbolic links? A : The probe ignores symbolic links. Thus if it is scanning a directory tree and encounters a symbolic link to some other area in the file system, it will not follow it. Q: How does the probe handle hard links? A : The probe attempts to detect if two files in the tree it is scanning point to the same data and automatically ignores the duplication. Q: How does the probe handle sparse files? A : By default the probe is not aware of sparse files. This means that it will read zero values for the sparse regions of the files, which can result in artificially high data reduction. The probe reports zero chunks to hint at this potential issue. Refer to Understanding VAST Probe Output for more details. Note that the probe can be run to recognize sparse files on some files systems as described in the document just referenced. Q: Can the probe scan multiple unrelated directory trees? A : Yes it can. This is done by providing multiple --input-dir values.","title":"General FAQ"},{"location":"faq/index.html#security_faq","text":"The HPE GreenLake for File Storage Data Reduction Estimation Probe software is provided at zero cost with zero warranty to HPE\u2019s current and prospective customers in order to accurately estimate Data Reduction Rates of specific data not yet on HPE Storage systems. The probe software is run on physical or virtualized customer-maintained hardware and analyzes data that the customer allows access to through traditional filesystem based access. The results of the probe are used to determine a Data Reduction Rate which will often be used to project an aggregate financial savings for HPE\u2019s current and prospective customers. Q: Where does the VAST Probe originate? 
A: The HPE GreenLake for File Storage Data Reduction Estimation Probe is a Docker container of scripts and libraries maintained and assembled solely by HPE and VAST Data engineering which is updated frequently, usually quarterly. The links to download the probe are posted on this GitHub repository. Q: Where does the VAST Probe run? A: The HPE GreenLake for File Storage Data Reduction Estimation Probe is designed to be run within a customer environment on physical or virtualized customer-maintained equipment. The provided container requires a base Linux operating system which is expected to be installed and updated by the customer before the probe is launched. Q: What information does the VAST Probe collect? A: The HPE GreenLake for File Storage Data Reduction Estimation Probe generates a series of logs for each iteration of data scanning. These logs are by default saved on the same physical or virtualized customer-maintained equipment that the probe runs on. These logs contain references to paths which have been provided as inputs, and can refer to any path within that directory structure when making declarative statements about data reduction results. The analysis log file that is generated upon completion of the Data Reduction Probe prints each full path with figures about data reduction rate for that path. In addition, a secondary section of the same analysis log file prints aggregate information about specific file extensions with figures about data reduction rate for that file extension. Q: What information does the HPE GreenLake for File Storage Data Reduction Estimation Probe send back to HPE? A: The HPE GreenLake for File Storage Data Reduction Estimation Probe has built-in call home telemetry which is on by default when executed, assuming the probe has access to specific HPE endpoints via the internet. While the probe is running, telemetry logs will be sent approximately every 5 minutes. These telemetry logs, by default, omit references to full paths with the exception of the root input path and simply upload a percentage-based status of the probe as well as any error messages. The final telemetry log is similar to the local analysis log file but, by default, removes full paths with the exception of the root input path. The final telemetry log will send the aggregated data reduction rates based on file extensions as illustrated below: file extension statistics: file type .xlsx, original_size=143.7GB, global_compression_reduced_size=126.6GB, global_compression_factor=1.14, dedup_percentage=10.34%, similarity_match_percentage=15.12%, similarity_gain=310.9MB, local_compression_only_size=126.9GB file type .tsv, original_size=291.5GB, global_compression_reduced_size=30.8GB, global_compression_factor=9.47, dedup_percentage=1.95%, similarity_match_percentage=84.83%, similarity_gain=9.6GB, local_compression_only_size=40.4GB Q: Who can access the logs sent to VAST Data? A: Anyone at HPE engineering or sales has access to the call home backend that is used as the telemetry destination for the HPE GreenLake for File Storage Data Reduction Estimation Probe. Q: What actions are performed with the logs sent to HPE? A: The telemetry logs are primarily used by sales to determine a Data Reduction Rate which will often be used to project an aggregate financial savings for HPE\u2019s current and prospective customers.
Alternatively, any telemetry logs can be used to determine an expected Data Reduction Rate for a given industry or use case which may be similar to a sales team\u2019s customer which has not run the probe. HPE engineering also uses the telemetry data for bug fixes and overall improvements to the software and user experience. Q: How do I control what the VAST Probe sends back to VAST Data? A: This call home telemetry feature can be disabled at runtime with the added flag: --dont-send-logs If you wish to send file names with the default telemetry logs, add the following flag: --send-logs-with-file-names","title":"Security FAQ"},{"location":"legal/eula/index.html","text":"This software is provided according to HPE license restrictions . The deployment documentation describes how to indicate your acceptance of these terms.","title":"End User License Agreement"},{"location":"legal/notices/index.html","text":"","title":"Notices"},{"location":"legal/support/index.html","text":"Typically you would work with your HPE Sales engineer to deploy and use The HPE GreenLake for File Storage Data Reduction Estimation Probe. Should HPE Sales engineers have issues or additional questions please contact us at Slack channel #ask-greenlake-for-filestorage.","title":"Support"},{"location":"manual/index.html","text":"Manual Execution Overview \u00b6 The HPE GreenLake for File Storage Data Reduction Estimation Probe is a long running process in a docker container. The docker container needs to run on a linux system that has read only access to the files you want to examine for data reduction as well as reasonable memory and substantial fast local disk. When having issues with the probe_launcher.py script or you need more experimental features, you should use this page. Manual Execution Overview Manual Execution Procedure Download the bundle Configure the run Launch the probe run Probe Stages Treewalk Phase DB Initialization Phase DataScan Phase Monitoring progress Understanding the output Low Level Output Probe Analyze I/O Behavior Manual Execution Procedure \u00b6 Follow the steps in Prerequisites to verify requirements are met to properly run the probe. Get docker container image via links in Deployment . When the probe docker container is launched you'll then be able to connect into it and then run the probe itself with key configuration information. The probe will then run until completion and report results. Download the bundle \u00b6 Download the docker image Follow the download links in Deployment to download the bundle. Then set the variable for the build number: export PROBE_BUILD=[PROBE BUILD NUMBER] Untar the bundle: tar -xzf ${PROBE_BUILD}.probe.bundle.tar.gz Load the docker image: docker load -i ${PROBE_BUILD}.probe.image.gz This step will take a few minutes without meaningful output. Tag the loaded image by doing a docker images and noting the new image, and its ID. Recall that images are identified by unique image IDs and human readable tags. Tag it as shown below - get the ID from the docker images output, and the value of the name is by convention the probe build. docker images Notice the Image IDs in the output list docker tag <IMAGE ID> vast-probe-${PROBE_BUILD} Configure the run \u00b6 Launch a 'screen' session (or tmux). We recommend some kind of long lived session tool since the probe can take a very long time to run and we do not want it to terminate if there is an issue with the client system.
screen -R probe Run the container while mapping the required directories run with the image tag/name you set earlier. The -v specifies mounts from the real operating system that should be made available to docker. These are directories that the probe can use and scan. Include as many -v 's as needed, just ensuring that at least one is the actual probe scratch directory ( /mnt/probe in this case). docker run -v /mnt/fileserver1:/mnt/fileserver1 -v /mnt/probe:/mnt/probe -it vast-probe-${PROBE_BUILD} from within the docker container, Create relevant output directories, eg: sudo mkdir -p /mnt/probe/vast-probe/output sudo mkdir -p /mnt/probe/vast-probe/db sudo chmod -R 777 /mnt/probe/vast-probe #note: If you get permission denied then disable selinux on your host Edit probe config file: vim /vast/install/probe/sim_init_file.yml See example config below, but also some items to note: input_dir : you can specify more than one. Just prefix a newline with '-'. This will allow the probe to scan multiple mountpoints/filesystems. Each input directory is scanned in a parallel thread which can slightly improve probe scan times. output_dir : this is where the summary files and some stats files will go. this is relatively small (< GB, although could get larger if you are scanning a lot of paths) metadata_dir : if using disk based indexes the space here needs to be pretty large (1% of total dataset to be very safe). match_disable : if you set to '1' , it will do 'local-only' compression/dedup. This completes much more quickly, but will not do any similarity hashing. max_number_of_files : This effectively pre-allocates some RAM to hold for file pointers. Set this value to somewhat higher than the total number of files you expect the probe to scan. Every 1-million files takes up 50MB of RAM. 1-billion is 50GB. Make sure not to set a value that causes the file pointer cache to exceed 50% of system RAM. disk_size_gb : set this to use disk based index. If you set to 0 it will instead use a RAM based index (see next variable) Index is ~80% of the probe metadata so rule of the thumb here so if you have a dedicated SSD-based file system for probe md the rule of thumb is to put here 80% of the disk size. And remember the free disk space size needs to be 0.6% of the total dataset size (this has a safety margin). ram_size_gb : if disk index is not used the probe will use RAM for indexing. This is faster but may produce inaccurate results for large data sets. If this value is left unset the probe will use 80% of the available system memory. IOPS_limit : can be used to limit the read rate from the target system. The IO size is the chunk size (default 32K), e.g. IOPS_limit: 1000 \u2192 ~320 MB/sec. Example config : input_dir: - '/mnt/fileserver/data/stuff' filter: '*' output_dir: '/mnt/probe/vast-probe/output' # dir for log files metdata_dir: '/mnt/probe/vast-probe/db' # dir for probe metadata regexp_filter: '' # files/directories matching the filter will NOT be scanned by the probe send_from: 'andy@vastdata.com' send_to: - 'andy@vastdata.com' - 'probe.callhome@vastdata.com' remote_monitoring_freq: 100 # sending mail with stats line, every remote_monitoring_freq seconds SMTP_host: 'localhost' # put an SMTP relay here remove_db_dir: 0 #remove db dir after each run? 
1 for yes, 0 for no ignore_links: 1 #1 for yes, 0 for no IOPS_limit: 0 #for no limit, put 0 number_of_threads: 0 # for one thread per core, put 0 printing_frequency: 1 # in seconds open_files_limit: 0 #for no limit, put 0 obfuscate_files_names: 0 #1 for yes, 0 for no match_disable: 0 #1 to disable matches, 0 to enable ram_size_gb: 0 # RAM for indexes (in GB), 0 will make the probe us ~80% of the available system memory disk_size_gb: 100 # if set will use disk to store the similarity index. pause: #'7:15' #hh:mm or leave blank resume: #'17:16' #hh:mm or leave blank split: ... Once you are satisfied, copy the .yml file to somewhere outside of the container (NFS mount or via SCP), since it will not survive container restart cp /vast/install/probe/sim_init_file.yml /mnt/probe/ Launch the probe run \u00b6 While still connected to the probe's docker container, go to the probe's home directory. Note: if you need to run the probe a second time you can copy the save sim_init_file.yml file from /mnt/probe into the container at /vast/install/probe . cd /vast/install/probe Run it: sudo is required if root is needed in order to access one of the directories configured in the init file. sudo python3 ./probe.py Probe Stages \u00b6 Some of these stages run concurrently (eg: Treewalk can run in the background throughout) Treewalk Phase \u00b6 When the probe is first kicked off, it builds a list of all files, along with the size-in-bytes for each file. This process has recently been parallelized to try and use more threads to perform this treeewalk, however depending on the source filesystem, this may still take a significant amount of time. Note that this runs in the background, such that the probe can make progress with other stages even while the treewalk phase is active. As an alternative, you can specify the --csv option to point to a CSV file which looks like this: /path/to/file.file,1234 where 1234 = sizeInBytes DB Initialization Phase \u00b6 The probe needs to initialize the Dictionary/Database which is used for storing matches. Depending on the speed of the storage which is hosting the database (specified via 'metadata_dir' ) , this can take some time. Also note that the 'disk_size_gb' parameter is directly related to how large the DB will be. During this phase, the probe will pre-allocate the DB by writing XX-GB to the metadata_dir. DataScan Phase \u00b6 Once initialization has occurred, this is when the actual probe-scanning happens. During this time, multiple threads are walking through the generated list of files and reading them to generate the various hashes which are then inserted into the DB. You can monitor progress during this phase as described below. Monitoring progress \u00b6 While the probe is running, there are 2 ways to get progress: Watch the screen sudo python3 ./probe.py mail sending off Scanning input directories, this might take a while... Scanned 144932 files, size 3.2TB File scan completed open file limit is 65536, it is recommended to allow as many open files as possible Initializing probe. 
336.386 GB/3.1718 TB (10.4%) process_rate = 850.45 MB/sec factor = 1.54 Check the log file tail -f /mnt/probe/vast-probe/output/probe_Mon_Jan_21_11_57_52_2019.log The log will give you information like this: n_chunks = 482991, n_matched_chunks = 392628, dedups = 918, match_percent = 81.291% , sum_of_gain = 403245081, gain = 64.877, avg_gain_per_match = 1027.04, avg_match_hashes_per_match = 9.28467, decompressed_sum = 1.212 GB, compressed_sum = 204.86 MB, factor = 6.05886, ratio = 0.165048, sum_of_self_compress = 592.76 MB size_of_data_processed = 1.212 GB/1.324 GB 91.5786% number_of_inaccessible_files = 0/517401 size_of_inaccessible = 0 B/1.324 GB READ = 1.212 GB, RE-READ = 940.70 MB, Total READ = 2.131 GB process_rate = 34.33 MB/sec Understanding the output \u00b6 The summary probe output is described in Understanding Output Low Level Output \u00b6 In addition to the previous output, the probe will also output lower level information periodically. These days that information is not typically useful, but here is an explanation just in case. n_chunks - amount of chunks processed by the probe (default size is 32K) avg_chunk_size - average chunk size n_matched_chunks - amount of chunks identified as similar to pre existing chunks by similarity search match_percent - percentage of chunks identified as similar to pre existing chunks by similarity search sum_of_gain - total space saved by similarity compression gain - percentage of space saved by similarity compression avg_gain_per_match - average amount of space saved per chunk from similarity compression avg_match_hashes_per_match - average amount of matching hashes found during similarity search n_duplicate_chunks - amount of identical chunks found dedup_percent - percentage of identical chunks found original_size - amount of data processed by the probe compressed_sum - estimated size of data post compression, dedup and similarity compression factor - compression factor (original_size / compressed_sum) ratio - 1 / factor sum_of_self_compress - data size if only local compression (with the given chunk size) was applied size_of_data_processed - progress indication number_of_inaccessible_files - number of files that were found in the initial scan but the probe didn't manage to read from when trying to process them size_of_inaccessible - amount of data that were found in the initial scan but the probe didn't manage to read from when trying to process them READ - amount of scanned data RE-READ - amount of data that was re-read in order to perform global compression Total READ - READ + RE-READ I thought it would be helpful to share results from a test run and an interpretation of those results for the benefit of others: Here\u2019s the last line of output with a summary: n_chunks = 1120059762, avg_chunk_size = 32754.2, n_matched_chunks = 507860164, match_percent = 45.3422% , sum_of_gain = 99.367 GB, gain = 0.603369, avg_gain_per_match = 210.087, avg_match_hashes_per_match = 3.4, n_duplicate_chunks = 529286922, dedup_percent=47.2552, original_size = 33.3664 TB, compressed_sum = 2.7112 TB, factor = 12.3069, ratio = 0.0812552, sum_of_self_compress = 16.0828 TB, size_of_data_processed = 33.3664 TB/33.7412 TB 98.8889%, number_of_inaccessible_files = 17233/880015, size_of_inaccessible = 384.025 GB/33.7412 TB, READ = 33.3664 TB, RE-READ = 15.1350 TB, Total READ = 48.5014 TB And the definitions that I think are most pertinent: * match_percent - percentage of chunks identified as similar to pre existing chunks by similarity search * sum_of_gain - total space
saved by similarity compression * gain - percentage of space saved by similarity compression * dedup_percent - percentage of identical chunks found * original_size - amount of data processed by the probe * compressed_sum - estimated size of data post compression, dedup and similarity compression * factor - compression factor (original_size / compressed_sum) * sum_of_self_compress - data size if only local compression (with the given chunk size) was applied * size_of_data_processed - progress indication This means the probe processed 33TB of data. The \u201cnative\u201d compressed size would have been 16TB; the actual compressed size including compression, dedup, and similarity compression was 2.7TB, thus the total factor of savings was 12 (33/2.7). Digging a little deeper we see that the majority of the savings came from dedup (47% of the chunks were identical) and compression as it looks like similarity compression saved 0.6% for a total of 99GB. Probe Analyze \u00b6 After the probe completes a run it will automatically analyze its own output from the log files and generate an analysis log (still quite long) with a breakdown by directory and file extension of the data reduction achieved. In rare cases you may need to run this manually, here's how: cd /vast/install/probe python3 ./probe.py --analyze_log .... output about processing files .... Processed 1967860 files Writing probe run analysis to ..../probe_Date.log.analysis I/O Behavior \u00b6 Speed : From a scan-speed perspective, what we've found is that on average we see approximately ~60 MByte/sec per physical CPU core when running the probe in full \"similarity hash\" mode (default value for match_disable ). Thus, a 20-core system would net approximately 1.2 GByte/sec. Having said that, performance is also highly dependent on the disk latency of the target system being scanned and is often delayed by doing random reads on that system. Read amplification : The way our similarity hasher works, if it discovers any matches, it will need to re-read a portion of the dataset again to look for additional opportunities for data reduction. In the case where your data has a lot of similarity, this can result in significant read-amplification. Therefore, when determining the amount of time it will take to scan a file-system, it is necessary to allow the probe to run for a period of time to determine the approximate 'Re-Read' ratio. Look at the /mnt/probe/db/*.stats output to see. match_disable=1 : If you choose this setting (non-default), the probe will bypass similarity hashing, and instead only look for local compression opportunity, and full-chunk matches (for dedup). This is much less CPU intensive, and we've found that the bottleneck will typically be either networking or the filesystem which it is scanning, up to a point. In my testing on a system with 25gigE, using this mode saw an average of 1.3GByte/sec (about 66MB/sec/physCore). At times the network throughput got close to line-rate (2+GByte/sec). If you have a subset of data which is representative of a larger set: it would be advisable to run against the smaller set in this mode first, to determine the local compression & dedup rates. Once that rate is established, running the probe again in similarity-hash mode against the full dataset is recommended.","title":"Manual Deployment"},{"location":"manual/index.html#manual_execution_overview","text":"The HPE GreenLake for File Storage Data Reduction Estimation Probe is a long running process in a docker container.
The docker container needs to run on a linux system that has read only access to the files you want to examine for data reduction as well as reasonable memory and substantial fast local disk. When having issues with the probe_launcher.py script or you need more experimental features, you should use this page. Manual Execution Overview Manual Execution Procedure Download the bundle Configure the run Launch the probe run Probe Stages Treewalk Phase DB Initialization Phase DataScan Phase Monitoring progress Understanding the ouput Low Level Output Probe Analyze I/O Behavior","title":"Manual Execution Overview"},{"location":"manual/index.html#manual_execution_procedure","text":"Follow the steps in Prerequisites to verify requirements are met to properly run the probe. Get docker container image via links in Deployment . When the probe docker container is launched you'll then be able to connect into it and then run the probe itself with key configuration information. The probe will then run until completion and report results.","title":"Manual Execution Procedure"},{"location":"manual/index.html#download_the_bundle","text":"Download the docker image Follow the download links in Deployment to download the bundle. Then set the variable for the build number: export PROBE_BUILD=[PROBE BUILD NUMBER] Untar the bundle: tar -xzf ${PROBE_BUILD}.probe.bundle.tar.gz Load the docker image: docker load -i ${PROBE_BUILD}.probe.image.gz This step will take a few minutes without meaningful output. Tag the loaded image by doing a docker images and noting the new image, and it's ID. Recall that images are identified by unique image IDs and human readable tags. Tag it as shown below - get the ID from the docker images output, and the value of the name is by convention the probe build. docker images Notice the Image IDs in the output list docker tag vast-probe-${PROBE_BUILD}","title":"Download the bundle"},{"location":"manual/index.html#configure_the_run","text":"Launch a 'screen' session (or tmux). We recommend some kind of long lived session tool since the probe can take a very long time to run and we do not want it to terminate if there is an issue with the client system. screen -R probe Run the container while mapping the required directories run with the image tag/name you set earlier. The -v specifies mounts from the real operating system that should be made available to docker. These are directories that the probe can use and scan. Include as many -v 's as needed, just ensuring that at least one is the actual probe scratch directory ( /mnt/probe in this case). docker run -v /mnt/fileserver1:/mnt/fileserver1 -v /mnt/probe:/mnt/probe -it vast-probe-${PROBE_BUILD} from within the docker container, Create relevant output directories, eg: sudo mkdir -p /mnt/probe/vast-probe/output sudo mkdir -p /mnt/probe/vast-probe/db sudo chmod -R 777 /mnt/probe/vast-probe #note: If you get permission denied then disable selinux on your host Edit probe config file: vim /vast/install/probe/sim_init_file.yml See example config below, but also some items to note: input_dir : you can specify more than one. Just prefix a newline with '-'. This will allow the probe to scan multiple mountpoints/filesystems. Each input directory is scanned in a parallel thread which can slightly improve probe scan times. output_dir : this is where the summary files and some stats files will go. 
this is relatively small (< GB, although could get larger if you are scanning a lot of paths) metadata_dir : if using disk based indexes the space here needs to be pretty large (1% of total dataset to be very safe). match_disable : if you set to '1' , it will do 'local-only' compression/dedup. This completes much more quickly, but will not do any similarity hashing. max_number_of_files : This effectively pre-allocates some RAM to hold for file pointers. Set this value to somewhat higher than the total number of files you expect the probe to scan. Every 1-million files takes up 50MB of RAM. 1-billion is 50GB. Make sure not to set a value that causes the file pointer cache to exceed 50% of system RAM. disk_size_gb : set this to use disk based index. If you set to 0 it will instead use a RAM based index (see next variable) Index is ~80% of the probe metadata so rule of the thumb here so if you have a dedicated SSD-based file system for probe md the rule of thumb is to put here 80% of the disk size. And remember the free disk space size needs to be 0.6% of the total dataset size (this has a safety margin). ram_size_gb : if disk index is not used the probe will use RAM for indexing. This is faster but may produce inaccurate results for large data sets. If this value is left unset the probe will use 80% of the available system memory. IOPS_limit : can be used to limit the read rate from the target system. The IO size is the chunk size (default 32K), e.g. IOPS_limit: 1000 \u2192 ~320 MB/sec. Example config : input_dir: - '/mnt/fileserver/data/stuff' filter: '*' output_dir: '/mnt/probe/vast-probe/output' # dir for log files metdata_dir: '/mnt/probe/vast-probe/db' # dir for probe metadata regexp_filter: '' # files/directories matching the filter will NOT be scanned by the probe send_from: 'andy@vastdata.com' send_to: - 'andy@vastdata.com' - 'probe.callhome@vastdata.com' remote_monitoring_freq: 100 # sending mail with stats line, every remote_monitoring_freq seconds SMTP_host: 'localhost' # put an SMTP relay here remove_db_dir: 0 #remove db dir after each run? 1 for yes, 0 for no ignore_links: 1 #1 for yes, 0 for no IOPS_limit: 0 #for no limit, put 0 number_of_threads: 0 # for one thread per core, put 0 printing_frequency: 1 # in seconds open_files_limit: 0 #for no limit, put 0 obfuscate_files_names: 0 #1 for yes, 0 for no match_disable: 0 #1 to disable matches, 0 to enable ram_size_gb: 0 # RAM for indexes (in GB), 0 will make the probe us ~80% of the available system memory disk_size_gb: 100 # if set will use disk to store the similarity index. pause: #'7:15' #hh:mm or leave blank resume: #'17:16' #hh:mm or leave blank split: ... Once you are satisfied, copy the .yml file to somewhere outside of the container (NFS mount or via SCP), since it will not survive container restart cp /vast/install/probe/sim_init_file.yml /mnt/probe/","title":"Configure the run"},{"location":"manual/index.html#launch_the_probe_run","text":"While still connected to the probe's docker container, go to the probe's home directory. Note: if you need to run the probe a second time you can copy the save sim_init_file.yml file from /mnt/probe into the container at /vast/install/probe . cd /vast/install/probe Run it: sudo is required if root is needed in order to access one of the directories configured in the init file. 
sudo python3 ./probe.py","title":"Launch the probe run"},{"location":"manual/index.html#probe_stages","text":"Some of these stages run concurrently (eg: Treewalk can run in the background throughout)","title":"Probe Stages"},{"location":"manual/index.html#treewalk_phase","text":"When the probe is first kicked off, it builds a list of all files, along with the size-in-bytes for each file. This process has recently been parallelized to try and use more threads to perform this treeewalk, however depending on the source filesystem, this may still take a significant amount of time. Note that this runs in the background, such that the probe can make progress with other stages even while the treewalk phase is active. As an alternative, you can specify the --csv option to point to a CSV file which looks like this: /path/to/file.file,1234 where 1234 = sizeInBytes","title":"Treewalk Phase"},{"location":"manual/index.html#db_initialization_phase","text":"The probe needs to initialize the Dictionary/Database which is used for storing matches. Depending on the speed of the storage which is hosting the database (specified via 'metadata_dir' ) , this can take some time. Also note that the 'disk_size_gb' parameter is directly related to how large the DB will be. During this phase, the probe will pre-allocate the DB by writing XX-GB to the metadata_dir.","title":"DB Initialization Phase"},{"location":"manual/index.html#datascan_phase","text":"Once initialization has occurred, this is when the actual probe-scanning happens. During this time, multiple threads are walking through the generated list of files and reading them to generate the various hashes which are then inserted into the DB. You can monitor progress during this phase as described below.","title":"DataScan Phase"},{"location":"manual/index.html#monitoring_progress","text":"While the probe is running, there are 2 ways to get progress: Watch the screen sudo python3 ./probe.py mail sending off Scanning input directories, this might take a while... Scanned 144932 files, size 3.2TB File scan completed open file limit is 65536, it is recommended to allow as many open files as possible Initializing probe. 336.386 GB/3.1718 TB (10.4%) process_rate = 850.45 MB/sec factor = 1.54 Check the log file tail -f /mnt/probe/vast-probe/output/probe_Mon_Jan_21_11_57_52_2019.log The log will give you information like this: n_chunks = 482991, n_matched_chunks = 392628, dedups = 918, match_percent = 81.291% , sum_of_gain = 403245081, gain = 64.877, avg_gain_per_match = 1027.04, avg_match_hashes_per_match = 9.28467, decompressed_sum = 1.212 GB, compressed_sum = 204.86 MB, factor = 6.05886, ratio = 0.165048, sum_of_self_compress = 592.76 MB size_of_data_processed = 1.212 GB/1.324 GB 91.5786% number_of_inaccessible_files = 0/517401 size_of_inaccessible = 0 B/1.324 GB READ = 1.212 GB, RE-READ = 940.70 MB, Total READ = 2.131 GB process_rate = 34.33 MB/sec","title":"Monitoring progress"},{"location":"manual/index.html#understanding_the_ouput","text":"The summary probe output is described in Un derstanding Output","title":"Understanding the ouput"},{"location":"manual/index.html#low_level_output","text":"In addition to the previous output, the probe will also output lower level information periodically. These days that information is not typically useful, but here is an explanation just in case. 
n_chunks - amount of chunks processed by the probe (default size is 32K) avg_chunk_size - average chunk size n_matched_chunks - amount of chunks identified as similar to pre existing chunks by similarity search match_percent - percentage of chunks identified as similar to pre existing chunks by similarity search sum_of_gain - total space saved by similarity compression gain - percentage of space saved by similarity compression avg_gain_per_match - average amount of space saved per chunk from similarity compression avg_match_hashes_per_match - average amount of matching hashes found during similarity seach n_duplicate_chunks - amount of identical chunks found dedup_percent - percentage of identical chunks found original_size - amount of data processed by the probe compressed_sum - estimated size of data post compression, dedup and similarity compression factor - compression factor (original_size / compressed_sum) ratio - 1 / factor sum_of_self_compress - data size if only local compression (with the given chunk size) was applied size_of_data_processed - progress indication number_of_inaccessible_files - number of files that were found in the initial scan but the probe didn't manage to read from when trying to process them size_of_inaccessible - amount of data that were found in the initial scan but the probe didn't manage to read from when trying to process them READ - amount of scanned data RE-READ - amount of data that was re-read in order to perform global compression Total READ - READ + RE-READ I thought it would be helpful to share results from a test run and an interpretation of those results for the benefit of others: Here\u2019s the last line of output with a summary: n_chunks = 1120059762, avg_chunk_size = 32754.2, n_matched_chunks = 507860164, match_percent = 45.3422% , sum_of_gain = 99.367 GB, gain = 0.603369, avg_gain_per_match = 210.087, avg_match_hashes_per_match = 3.4, n_duplicate_chunks = 529286922, dedup_percent=47.2552, original_size = 33.3664 TB, compressed_sum = 2.7112 TB, factor = 12.3069, ratio = 0.0812552, sum_of_self_compress = 16.0828 TB, size_of_data_processed = 33.3664 TB/33.7412 TB 98.8889%, number_of_inaccessible_files = 17233/880015, size_of_inaccessible = 384.025 GB/33.7412 TB, READ = 33.3664 TB, RE-READ = 15.1350 TB, Total READ = 48.5014 TB And the definitions that I think are most pertinent: * match_percent - percentage of chunks identified as similar to pre existing chunks by similarity search * sum_of_gain - total space saved by similarity compression * gain - percentage of space saved by similarity compression * dedup_percent - percentage of identical chunks found * original_size - amount of data processed by the probe * compressed_sum - estimated size of data post compression, dedup and similarity compression * factor - compression factor (original_size / compressed_sum) * sum_of_self_compress - data size if only local compression (with the given chunk size) was applied * size_of_data_processed - progress indication This means the probe processed 33TB of data. The \u201cnative\u201d compressed size would have been 16TB the actual compressed size including compression, dedup, and similarity compression was 2.7TB, thus the total factor of savings was 12 (33/2.7). 
Digging a little deeper we see that the majority of the savings came from dedup (47% of the chunks were identical) and compression as it looks like similarity compression saved 0.6% for a total of 99GB.","title":"Low Level Output"},{"location":"manual/index.html#probe_analyze","text":"After the probe completes a run it will automatically analyze its own output from the log files and generate an analysis log (still quite long) with a breakdown by directory and file extension of the data reduction achieved. In rare cases you may need to run this manually, here's how: cd /vast/install/probe python3 ./probe.py --analyze_log .... output about processing files .... Processed 1967860 files Writing probe run analysis to ..../probe_Date.log.analysis","title":"Probe Analyze"},{"location":"manual/index.html#io_behavior","text":"Speed : From a scan-speed perspective, what we've found is that on average we see approximately ~60 MByte/sec per physical CPU core when running the probe in full \"similarity hash\" mode (default value for match_disable ). Thus, a 20-core system would net approximately 1.2 GByte/sec. Having that said performance is also highly dependant on the disk latency of the target system being scanned and is often delayed by doing random reads on that system. Read amplification : The way our similarity hasher works, if it discovers any matches, it will need to re-read a portion of the dataset again to look for additional opportunities for dataReduction. In the case where your data has a lot of similarity, this can result in significant read-amplification. Therefore, when determining the amount of time it will take to scan a file-system, it is necessary to allow the probe to run for a period of time to determine the approximate 'Re-Read' ratio. look at the /mnt/probe/db/*.stats output to see. match_disable=1 : If you choose this setting (non-default), the probe will bypass similarity hashing, and instead only look for local compression opportunity, and full-chunk matches (for dedup). This is much less CPU intensive, and we've found that the bottleneck will typically be either networking or the filesystem which it is scanning, up to a point. In my testing on a system with 25gigE, using this mode saw an average of 1.3GByte/sec (about 66MB/sec/physCore). At times the network throughput got close to line-rate (2+GByte/sec). If you have a subset of data which is representative of a larger set: it would be advisable to run against the smaller set in this mode first, to determine the local compression & dedup rates. Once that rate is established, running the probe again in similarity-hash mode against the full dataset is recommended.","title":"I/O Behavior"},{"location":"output/index.html","text":"Understanding Output Overview \u00b6 Periodically while running and at the end of a run, the probe will output data reduction results to the probe log file. These results are very helpful for understanding the data reduction that is expected when the data is placed on HPE GreenLake for File Storage as well as for helping to understand why that level of data reduction was achieved. 
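For example, in the sample output below, a scan of 258.14GB that reduces to 48.54GB corresponds to a data reduction factor of roughly 5.32:1 (258.14 / 48.54), which is the same result expressed as an 81.20% reduction (1 - 48.54 / 258.14).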
Output format \u00b6 The output will look something like this: --------------------------------current-probe-stats-------------------------------- Probe version: probe-version-4-4-703050 Scanned: 258.14GB out of 258.13GB (100.00%) Files Scanned: 22481 files out of 22481 files (100.00%) ============= Main Results: ============= Total Global Data Reduction Factor = 5.32:1 (81.20% reduction) Sparse Size = 258.14GB Reduced Size = 48.54GB Number of Inaccessible Files = 3 out of 22481 files (0.01% of scan) Size of Inaccessible Files = 0.00B out of 258.13GB (0.00% of scan) - Duplicate Block Elimination Gain: 0.61% (1.56GB) Zero Block Elimination Gain: 0.00% (1.80MB) Number of Duplicate Chunks: 58917 Number of Zero Chunks: 35 - Similarity Reduction Global DAC vs. Local DAC Gain: 1.69% (4.37GB out of total bytes using Similarity: 233.62GB) Number of Similar Chunks: 4572414 out of 5414721 total unique chunks Average Chunk Size: 49.99KB Similarity Percentage: 84.44% Average Size of Chunks Using Similarity: 53.58KB Average Gain post DAC Per Similarity Match: 1.00KB Vast Array Performance Impact: green - Local Compression Gain including DAC: 79.03% (204.01GB out of a total Compression scan of 252.21GB) Compression ratio for local compress only: 4.88:1 ================== Adaptive Chunking: ================== ... ======================= Data Aware Compression: ======================= ... ====================== Experimental Features: ====================== ... There are two types of output above: normal or routine information relevant to most, and more advanced information that is more internal in nature (shown here with ...). In this article we will consider both types of information in the output, but please focus on the routine information as that is almost always more relevant. Routine Considerations (Main Results) \u00b6 The intent of this output is to summarize what the probe has found so far. The interesting results are: Scanned shows the space before reduction Files Scanned shows the number of files in the entire data set that were scanned Total Global Data Reduction Factor shows how effectively data reduction was done overall. This value includes compression, deduplication, and similarity reduction. Reduced Size is the space after reduction Sparse Size should be ignored unless the probe is run with --sparse-mode as described below. Number/Size of Inaccessible Files indicates data the probe tried to read but couldn't. If this number is large, the probe results are not valid. This almost always happens due to permission issues or files being deleted while the probe was running. Duplicate Block Elimination Gain shows how much space is saved just by removal of duplicate blocks. Number of Duplicate Chunks shows literally how many blocks were identical to other existing blocks. Zero Block Elimination Gain tells you how much of the gain from deduplication was due to zero blocks. That helpful for understanding the implications of the next item. Number of Zero Chunks is a count of number of chunks that are all zeros. That often indicates sparse files. If the number of such chunks is high relative to the number of chunks (exceeding say 10%), the probe estimates may be misleading. Use tools such as du and df to determine the actual space used and compare that to the probe's report of the space scanned. If there is a large difference, sparse files are likely to blame. 
If your file system supports the advanced ioctl for sparse file reporting (Lustre and XFS do), you can try running the probe again with --sparse-mode . Similarity Reduction Global DAC vs. Local DAC Gain is the gain from similarity with data aware compression vs. the gain without similarity. This is just a more verbose way of saying \"this is how much gain similarity provided.\" Number of Similar Chunks / Similarity Percentage is the number of data chunks that benefited from similarity matching. The percentage is simply the number of chunks that benefited from similarity divided by the total number of chunks. A high value for the similarity match percentage (significantly over 10%) and a low value of Average Gain Post DAC Per Similarity Match relative to Average Size of Chunks Using Similarity is a potential problem. This indicates a high similarity match rate, but a low gain from those matches. The amount reported is bytes per chunk. Average Chunk Size is the average size (before reduction) of all chunks Average Size of Chunks Using Similarity is the average size (before reduction) of a chunk that benefited from similarity Array Performance Impact should be ignored for now. Local Compression Gain shows how much space would be saved just by transparent compression as files are saved. This is also helpfully expressed at the end via Compression ratio for local compress only. Essentially that ratio vs. the reported Total Global Data Reduction factor shows how much better DRR was thanks to global deduplication and similarity reduction. In the above example we can see that we scanned 22481 files that consumed 258GB of space before any data reduction. After data reduction the probe predicts the files will consume 48GB of space for a reduction of 81%. Of that simple compression gains 79% (204GB), deduplication 1% (1GB), and similarity 2% (4GB). Please keep in mind these aren't typical results as actual data reduction varies widely for different data sets. Advanced Considerations \u00b6 In addition to the common and most relevant output described above, there are more advanced bits of information shared by the probe. Most of this information is only relevant to VAST engineering (we hope you can share it with us) but we document it here for the curious. 
Here is an example of the more advanced outputs: ================== Adaptive Chunking: ================== min_chunk_size=16384 max_chunk_size=65043 desired_chunk_size=29950 inverse_probability=13999 split_threshold=17871601040105585914 Theoretical Average Chunk Size: 29.25KB (error: -70.92%) Number of chunks split via hash: 2423353 (44.75%) Number of chunks split via buffer end: 44620 (0.82%) Number of chunks split via max size reached: 2969226 (54.84%) ======================= Data Aware Compression: ======================= Total Number of Predictions: 5414686 Predictions Per Encoder Type: {ENCODER_NONE=5402314, ENCODER_SHUFFLE=11164, ENCODER_DELTA_ENCODE=681, ENCODER_DELTA_ENCODE_4_SHUFFLE=527} Percentage of Chunks Per Encoder: - Encoder ENCODER_NONE: 99.77% - Encoder ENCODER_SHUFFLE: 0.21% - Encoder ENCODER_DELTA_ENCODE: 0.01% - Encoder ENCODER_DELTA_ENCODE_4_SHUFFLE: 0.01% Encoding Sampling Reduction Summary (sampling 1.99%): ---------------------------------------------------------------------------------------------------------------------------------------------------- Encoders | None | Shuffle | Delta Shuffle | Delta ---------------------------------------------------------------------------------------------------------------------------------------------------- DRR (Global) | 5.33 | 3.60 | 3.07 | 4.83 Compressed Size | 48.44GB | 71.73GB | 84.06GB | 53.44GB Num Chunks Improved Percentage | 98.87% | 13.43% | 13.25% | 13.36% Num Chunks Improved | 5353433 | 727142 | 717668 | 723182 Total Chunks Num | 5414721 | 5414721 | 5414721 | 5414721 Similarity Reduction Percentage | 1.68% | 1.77% | 2.62% | 2.01% Similarity Reduction | 4.34GB | 4.59GB | 6.78GB | 5.19GB Total Bytes Using Similarity | 233.62GB | 233.62GB | 233.62GB | 233.62GB Similarity Reduction Gain if ref chain Percentage | 86.88% | 0.00% | 0.00% | 0.00% Similarity Reduction Gain if ref chain | 224.86GB | 0.00B | 0.00B | 0.00B Data Aware Compression Accuracy: Total Chunks Compared for Discovering Optimal Encoding: 108046 Total Correct Optimal Encoding Predictions: 107781 Total Wrong Optimal Encoding Predictions: 265 Correct Predictions Percentage: 99.75% Predictions Per Encoder Type: {ENCODER_NONE=107825, ENCODER_SHUFFLE=195, ENCODER_DELTA_ENCODE=14, ENCODER_DELTA_ENCODE_4_SHUFFLE=12} Wrong Predictions Per Encoder Type: {ENCODER_NONE=98, ENCODER_SHUFFLE=160, ENCODER_DELTA_ENCODE=1, ENCODER_DELTA_ENCODE_4_SHUFFLE=6} Wrong Predictions Percentage Per Encoder: - Encoder ENCODER_NONE = 0.09% - Encoder ENCODER_SHUFFLE = 82.05% - Encoder ENCODER_DELTA_ENCODE = 7.14% - Encoder ENCODER_DELTA_ENCODE_4_SHUFFLE = 50.00% * Note: Wrong predictions does not mean that there is no gain from the encoder, but rather that there is a better one. 
Total Pre-Encoding Compressed Size of Chunks Used in Predictions: 1.07GB Total Post-Encoding Compression Size of Chunks Used in Predictions: 1.07GB Total Optimal Compression Size of Chunks Used in Predictions: 1.07GB Total Size Difference Between Predicted and Optimal Encoded Compression: 274.36KB (Optimal compression size is smaller than the predicted compression size by 0.02%) Approximate Total Local Data-Reduction Factor Without Data Aware Compression: 4.83:1 (79.30% reduction) Actual Total Global Data-Reduction Factor Without Data Aware Compression (available at 100% sampling): N/a ====================== Experimental Features: ====================== Similarity Reduction Gain if ref chain: 1.74% (4.49GB out of total bytes using Similarity: 233.62GB) Extra space gain in optimal compression: 47.57GB - Extra local compression space gain in case of using compression_level 8: 4.35GB - Extra local compression space gain in case of using compression_level 8: 3.16GB Adaptive Chunking min_chunk_size=AAA max_chunk_size=BBB desired_chunk_size=CCC are all internal settings that we may change from probe version to probe version. Otherwise they should be ignored. Theoretical Average Chunk Size should be ignored Number of chunks split via XXXX : adaptive chunking automatically adjusts the size of data chunks to improve deduplication and similarity matching. These three metrics tell us a bit about how we are doing. via hash : the count of chunks that were split using the automated data sensitive splitting. Typically this will be a high value. via buffer end : the count of chunks that were split simply because we reached the end of the relevant data stream. A likely cause is simply the end of a file. via max size reached : the count of chunks that were split because the chunks would have otherwise been too large. Data Aware Compression Encoding Sampling Reduction Summary summarizes the various different data aware compression (DAC) encodings and how well they worked for all of the data chunks. The probe randomly selects some number of chunks (sampling) and tries all encoding schemes. This is not what VAST or the probe does for all chunks as it is too expensive. Instead, the system examines a bit of each data chunk and decides on the DAC encoding scheme to use and then uses it - we call this prediction. This table show how the different schemes fared and helps us understand if our predictions are accurate. In general this table can be ignored. Correct Predictions Percentage tells us how often our predictions where correct. This calculation is based upon these values: Total Chunks Compared for Discovering Optimal Encoding : how many chunks were sampled for checking purposes Total Correct Optimal Encoding Predictions : how often the predictor was correct Total Wrong Optimal Encoding Predictions : how often the predictor was wrong Total Size Difference Between Predicted and Optimal Encoded Compression indicates how well our predictor selected the optimal DAC encoding scheme in terms of space used. If the number here is small (less than 5%) then the predictor is doing well. If it is larger, please let us know. 
These are the inputs to this calculation: Total Pre-Encoding Compressed Size of Chunks Used in Predictions : size of chunks before reduction Total Post-Encoding Compression Size of Chunks Used in Predictions : size of chunks after reduction Total Optimal Compression Size of Chunks Used in Predictions - the optimal reduction (basically trying all possible encodings based upon sampling) Approximate Total Local Data-Reduction Factor Without Data Aware Compression - our estimate (based upon sampling) of the data reduction without DAC. Basically if the value here is smaller than the value reported in the first part of the summary, DAC was a win. Experimental Features Extra space gain in optimal compression - this considers advanced data reduction algorithms that are under consider for future versions but have not yet implemented in actual released products. If you see a very large value here relative to the total data, let us know. That's very interesting to us! Extra local compression space gain in case of using compression_level 8 - this indicates how much space could be saved in local compression if the most expensive ZSTD compression setting. This isn't done on real clusters as it impact performance, but it's a useful metric for our engineering. Typically the additional savings is minimal which is good.","title":"Understanding Output"},{"location":"output/index.html#understanding_output_overview","text":"Periodically while running and at the end of a run, the probe will output data reduction results to the probe log file. These results are very helpful for understanding the data reduction that is expected when the data is placed on HPE GreenLake for File Storage as well as for helping to understand why that level of data reduction was achieved.","title":"Understanding Output Overview"},{"location":"output/index.html#output_format","text":"The output will look something like this: --------------------------------current-probe-stats-------------------------------- Probe version: probe-version-4-4-703050 Scanned: 258.14GB out of 258.13GB (100.00%) Files Scanned: 22481 files out of 22481 files (100.00%) ============= Main Results: ============= Total Global Data Reduction Factor = 5.32:1 (81.20% reduction) Sparse Size = 258.14GB Reduced Size = 48.54GB Number of Inaccessible Files = 3 out of 22481 files (0.01% of scan) Size of Inaccessible Files = 0.00B out of 258.13GB (0.00% of scan) - Duplicate Block Elimination Gain: 0.61% (1.56GB) Zero Block Elimination Gain: 0.00% (1.80MB) Number of Duplicate Chunks: 58917 Number of Zero Chunks: 35 - Similarity Reduction Global DAC vs. Local DAC Gain: 1.69% (4.37GB out of total bytes using Similarity: 233.62GB) Number of Similar Chunks: 4572414 out of 5414721 total unique chunks Average Chunk Size: 49.99KB Similarity Percentage: 84.44% Average Size of Chunks Using Similarity: 53.58KB Average Gain post DAC Per Similarity Match: 1.00KB Vast Array Performance Impact: green - Local Compression Gain including DAC: 79.03% (204.01GB out of a total Compression scan of 252.21GB) Compression ratio for local compress only: 4.88:1 ================== Adaptive Chunking: ================== ... ======================= Data Aware Compression: ======================= ... ====================== Experimental Features: ====================== ... There are two types of output above: normal or routine information relevant to most, and more advanced information that is more internal in nature (shown here with ...). 
In this article we will consider both types of information in the output, but please focus on the routine information as that is almost always more relevant.","title":"Output format"},{"location":"output/index.html#routine_considerations_main_results","text":"The intent of this output is to summarize what the probe has found so far. The interesting results are: Scanned shows the space before reduction Files Scanned shows the number of files in the entire data set that were scanned Total Global Data Reduction Factor shows how effectively data reduction was done overall. This value includes compression, deduplication, and similarity reduction. Reduced Size is the space after reduction Sparse Size should be ignored unless the probe is run with --sparse-mode as described below. Number/Size of Inaccessible Files indicates data the probe tried to read but couldn't. If this number is large, the probe results are not valid. This almost always happens due to permission issues or files being deleted while the probe was running. Duplicate Block Elimination Gain shows how much space is saved just by removal of duplicate blocks. Number of Duplicate Chunks shows literally how many blocks were identical to other existing blocks. Zero Block Elimination Gain tells you how much of the gain from deduplication was due to zero blocks. That's helpful for understanding the implications of the next item. Number of Zero Chunks is a count of the number of chunks that are all zeros. That often indicates sparse files. If the number of such chunks is high relative to the number of chunks (exceeding, say, 10%), the probe estimates may be misleading. Use tools such as du and df to determine the actual space used and compare that to the probe's report of the space scanned. If there is a large difference, sparse files are likely to blame. If your file system supports the advanced ioctl for sparse file reporting (Lustre and XFS do), you can try running the probe again with --sparse-mode . Similarity Reduction Global DAC vs. Local DAC Gain is the gain from similarity with data aware compression vs. the gain without similarity. This is just a more verbose way of saying \"this is how much gain similarity provided.\" Number of Similar Chunks / Similarity Percentage is the number of data chunks that benefited from similarity matching. The percentage is simply the number of chunks that benefited from similarity divided by the total number of chunks. A high value for the similarity match percentage (significantly over 10%) and a low value of Average Gain Post DAC Per Similarity Match relative to Average Size of Chunks Using Similarity is a potential problem. This indicates a high similarity match rate, but a low gain from those matches. The amount reported is bytes per chunk. Average Chunk Size is the average size (before reduction) of all chunks Average Size of Chunks Using Similarity is the average size (before reduction) of a chunk that benefited from similarity Array Performance Impact should be ignored for now. Local Compression Gain shows how much space would be saved just by transparent compression as files are saved. This is also helpfully expressed at the end via Compression ratio for local compress only. Essentially that ratio vs. the reported Total Global Data Reduction factor shows how much better DRR was thanks to global deduplication and similarity reduction. In the above example we can see that we scanned 22481 files that consumed 258GB of space before any data reduction.
After data reduction the probe predicts the files will consume 48GB of space for a reduction of 81%. Of that simple compression gains 79% (204GB), deduplication 1% (1GB), and similarity 2% (4GB). Please keep in mind these aren't typical results as actual data reduction varies widely for different data sets.","title":"Routine Considerations (Main Results)"},{"location":"output/index.html#advanced_considerations","text":"In addition to the common and most relevant output described above, there are more advanced bits of information shared by the probe. Most of this information is only relevant to VAST engineering (we hope you can share it with us) but we document it here for the curious. Here is an example of the more advanced outputs: ================== Adaptive Chunking: ================== min_chunk_size=16384 max_chunk_size=65043 desired_chunk_size=29950 inverse_probability=13999 split_threshold=17871601040105585914 Theoretical Average Chunk Size: 29.25KB (error: -70.92%) Number of chunks split via hash: 2423353 (44.75%) Number of chunks split via buffer end: 44620 (0.82%) Number of chunks split via max size reached: 2969226 (54.84%) ======================= Data Aware Compression: ======================= Total Number of Predictions: 5414686 Predictions Per Encoder Type: {ENCODER_NONE=5402314, ENCODER_SHUFFLE=11164, ENCODER_DELTA_ENCODE=681, ENCODER_DELTA_ENCODE_4_SHUFFLE=527} Percentage of Chunks Per Encoder: - Encoder ENCODER_NONE: 99.77% - Encoder ENCODER_SHUFFLE: 0.21% - Encoder ENCODER_DELTA_ENCODE: 0.01% - Encoder ENCODER_DELTA_ENCODE_4_SHUFFLE: 0.01% Encoding Sampling Reduction Summary (sampling 1.99%): ---------------------------------------------------------------------------------------------------------------------------------------------------- Encoders | None | Shuffle | Delta Shuffle | Delta ---------------------------------------------------------------------------------------------------------------------------------------------------- DRR (Global) | 5.33 | 3.60 | 3.07 | 4.83 Compressed Size | 48.44GB | 71.73GB | 84.06GB | 53.44GB Num Chunks Improved Percentage | 98.87% | 13.43% | 13.25% | 13.36% Num Chunks Improved | 5353433 | 727142 | 717668 | 723182 Total Chunks Num | 5414721 | 5414721 | 5414721 | 5414721 Similarity Reduction Percentage | 1.68% | 1.77% | 2.62% | 2.01% Similarity Reduction | 4.34GB | 4.59GB | 6.78GB | 5.19GB Total Bytes Using Similarity | 233.62GB | 233.62GB | 233.62GB | 233.62GB Similarity Reduction Gain if ref chain Percentage | 86.88% | 0.00% | 0.00% | 0.00% Similarity Reduction Gain if ref chain | 224.86GB | 0.00B | 0.00B | 0.00B Data Aware Compression Accuracy: Total Chunks Compared for Discovering Optimal Encoding: 108046 Total Correct Optimal Encoding Predictions: 107781 Total Wrong Optimal Encoding Predictions: 265 Correct Predictions Percentage: 99.75% Predictions Per Encoder Type: {ENCODER_NONE=107825, ENCODER_SHUFFLE=195, ENCODER_DELTA_ENCODE=14, ENCODER_DELTA_ENCODE_4_SHUFFLE=12} Wrong Predictions Per Encoder Type: {ENCODER_NONE=98, ENCODER_SHUFFLE=160, ENCODER_DELTA_ENCODE=1, ENCODER_DELTA_ENCODE_4_SHUFFLE=6} Wrong Predictions Percentage Per Encoder: - Encoder ENCODER_NONE = 0.09% - Encoder ENCODER_SHUFFLE = 82.05% - Encoder ENCODER_DELTA_ENCODE = 7.14% - Encoder ENCODER_DELTA_ENCODE_4_SHUFFLE = 50.00% * Note: Wrong predictions does not mean that there is no gain from the encoder, but rather that there is a better one. 
Total Pre-Encoding Compressed Size of Chunks Used in Predictions: 1.07GB Total Post-Encoding Compression Size of Chunks Used in Predictions: 1.07GB Total Optimal Compression Size of Chunks Used in Predictions: 1.07GB Total Size Difference Between Predicted and Optimal Encoded Compression: 274.36KB (Optimal compression size is smaller than the predicted compression size by 0.02%) Approximate Total Local Data-Reduction Factor Without Data Aware Compression: 4.83:1 (79.30% reduction) Actual Total Global Data-Reduction Factor Without Data Aware Compression (available at 100% sampling): N/a ====================== Experimental Features: ====================== Similarity Reduction Gain if ref chain: 1.74% (4.49GB out of total bytes using Similarity: 233.62GB) Extra space gain in optimal compression: 47.57GB - Extra local compression space gain in case of using compression_level 8: 4.35GB - Extra local compression space gain in case of using compression_level 8: 3.16GB Adaptive Chunking min_chunk_size=AAA max_chunk_size=BBB desired_chunk_size=CCC are all internal settings that we may change from probe version to probe version. Otherwise they should be ignored. Theoretical Average Chunk Size should be ignored Number of chunks split via XXXX : adaptive chunking automatically adjusts the size of data chunks to improve deduplication and similarity matching. These three metrics tell us a bit about how we are doing. via hash : the count of chunks that were split using the automated data sensitive splitting. Typically this will be a high value. via buffer end : the count of chunks that were split simply because we reached the end of the relevant data stream. A likely cause is simply the end of a file. via max size reached : the count of chunks that were split because the chunks would have otherwise been too large. Data Aware Compression Encoding Sampling Reduction Summary summarizes the various different data aware compression (DAC) encodings and how well they worked for all of the data chunks. The probe randomly selects some number of chunks (sampling) and tries all encoding schemes. This is not what VAST or the probe does for all chunks as it is too expensive. Instead, the system examines a bit of each data chunk and decides on the DAC encoding scheme to use and then uses it - we call this prediction. This table show how the different schemes fared and helps us understand if our predictions are accurate. In general this table can be ignored. Correct Predictions Percentage tells us how often our predictions where correct. This calculation is based upon these values: Total Chunks Compared for Discovering Optimal Encoding : how many chunks were sampled for checking purposes Total Correct Optimal Encoding Predictions : how often the predictor was correct Total Wrong Optimal Encoding Predictions : how often the predictor was wrong Total Size Difference Between Predicted and Optimal Encoded Compression indicates how well our predictor selected the optimal DAC encoding scheme in terms of space used. If the number here is small (less than 5%) then the predictor is doing well. If it is larger, please let us know. 
These are the inputs to this calculation: Total Pre-Encoding Compressed Size of Chunks Used in Predictions : size of chunks before reduction Total Post-Encoding Compression Size of Chunks Used in Predictions : size of chunks after reduction Total Optimal Compression Size of Chunks Used in Predictions - the optimal reduction (basically trying all possible encodings based upon sampling) Approximate Total Local Data-Reduction Factor Without Data Aware Compression - our estimate (based upon sampling) of the data reduction without DAC. Basically if the value here is smaller than the value reported in the first part of the summary, DAC was a win. Experimental Features Extra space gain in optimal compression - this considers advanced data reduction algorithms that are under consideration for future versions but have not yet been implemented in actual released products. If you see a very large value here relative to the total data, let us know. That's very interesting to us! Extra local compression space gain in case of using compression_level 8 - this indicates how much space could be saved in local compression if the most expensive ZSTD compression setting were used. This isn't done on real clusters as it impacts performance, but it's a useful metric for our engineering. Typically the additional savings are minimal, which is good.","title":"Advanced Considerations"},{"location":"prerequisites/index.html","text":"Prerequisites Overview \u00b6 Before we can start deploying the HPE GreenLake for File Storage Data Reduction Estimation Probe, take a moment to review the prerequisites to understand the hardware and software requirements to successfully run the probe. This is intended for customers that are running the probe on their own infrastructure. Prerequisites Overview Hardware Minimum Requirements Operating System Minimum Requirements Software Requirements Sample Data Set Filesystem Requirements Hardware Requirement Examples Hardware Minimum Requirements \u00b6 Actual hardware requirements depend on the amount of data to be scanned. Examples on how to scope hardware based on dataset size are provided at the end of this page. 16 CPU cores or higher Intel Broadwell-compatible or later CPUs The Probe requires CPU instructions that are not available on older CPUs The Probe will run virtually on Intel-based hardware that has a Virtual Cluster vMotion minimum compatibility of Intel Broadwell-compatible or later The Probe has not been evaluated on AMD CPUs 128 GB RAM or higher The probe consumes almost 100GB of RAM upon launch The more RAM, the better the Probe will perform and the more data can be scanned 10 GbE Networking or higher 50 GB SSD-backed local storage or higher (NVMe or FC/iSCSI LUNs) This local SSD capacity is needed for the database the probe builds and logging Must be equivalent to 0.6% of the data to be scanned Disk storage must have very high sustained IOPS The larger the local SSD allocated, the more data can be scanned Local SSD filesystem should be ext4 or xfs Operating System Minimum Requirements \u00b6 We've tested the following, but most modern Linux distributions should be fine: Ubuntu 18.04, 20.04 CentOS/RHEL 7.4+ Rocky/RHEL 8.3+ Software Requirements \u00b6 Docker: 17.05+ python3 (for launching the probe) screen (for running the probe in the background) wget (for downloading the probe image) Sample Data Set Filesystem Requirements \u00b6 Be aware that if the filesystem has atime enabled, any scanning, even while mounted as read-only, will update the atime clock.
NFS : The Probe host has to be provided root-squash and read-only access For faster scanning, use an operating system that has nconnect support: Ubuntu 20.04+ RHEL/Rocky 8.4+ Lustre : The Probe host and container must be able to read as a root user GPFS : The Probe host and container must be able to read as a root user SMB : The Probe host should be mounted with a user in the BUILTIN\\Backup Operators group to avoid file access issues. S3/Object : We have tested internally with goofys as a method of imitating a filesystem It is not recommended to scan anything in AWS Glacier or equivalent Hardware Requirement Examples \u00b6 Example A : You have a server with 768GB of RAM: 154GB is for the Operating System, leaving 614GB of RAM... There are 100 million files to scan, that will occupy ~5GB of RAM, leaving 609GB of RAM... 50-bytes per 'filename' This leaves 609GB of RAM available for the RAM index --ram-index-size-gb 609 This can scan up to 99TB of data using just RAM and no significant local SSD space is needed This calculation is based on a 0.6% rule to accommodate similarity and deduplication hashes Using a disk index, you can scan far more data and the file count could exceed 10 billion with a 500GB file name cache Example B : You have a server with 128GB of RAM and a Local SSD: 26GB is for the Operating System, leaving 102GB of RAM... There are 100 million files to scan, that will occupy ~5GB of RAM, leaving 97GB of RAM... 50-bytes per 'filename' This leaves 97GB of RAM available for the RAM index --ram-index-size-gb 97 This can scan up to 15TB of data using just RAM and no significant local SSD space is needed This calculation is based on a 0.6% rule to accommodate similarity and deduplication hashes Using a disk index, you can scan far more data and the file count could be as high as 2 billion with a 100GB file name cache 15TB of data requires 90GB of local SSD disk 100TB of data requires 600GB of local SSD disk","title":"Prerequisites"},{"location":"prerequisites/index.html#prerequisites_overview","text":"Before we can start deploying the HPE GreenLake for File Storage Data Reduction Estimation Probe, take a moment to review the prerequisites to understand the hardware and software requirements to successfully run the probe. This is intended for customers that are running the probe on their own infrastructure. Prerequisites Overview Hardware Minimum Requirements Operating System Minimum Requirements Software Requirements Sample Data Set Filesystem Requirements Hardware Requirement Examples","title":"Prerequisites Overview"},{"location":"prerequisites/index.html#hardware_minimum_requirements","text":"Actual hardware requirements depend on the amount of data to be scanned. Examples on how to scope hardware based on dataset size are provided at the end of this page.
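To make the sizing rules of thumb from the hardware examples above easier to reuse, here is a rough Python sketch based on the ~50 bytes of RAM per filename and the 0.6% local-SSD rule quoted there; the function name and its rounding are assumptions for illustration only, not an official sizing tool.

# Rough sizing sketch based on the rules of thumb quoted above:
# ~50 bytes of RAM per filename and local SSD of ~0.6% of the data scanned.
def probe_sizing(dataset_tb: float, files_millions: float,
                 ram_gb: float, os_reserve_gb: float) -> dict:
    filename_cache_gb = files_millions * 1e6 * 50 / 1e9         # 50 bytes per filename
    ram_index_gb = ram_gb - os_reserve_gb - filename_cache_gb   # --ram-index-size-gb value
    local_ssd_gb = dataset_tb * 1000 * 0.006                    # 0.6% rule for the disk index
    return {"ram_index_gb": round(ram_index_gb), "local_ssd_gb": round(local_ssd_gb)}

# Example B above: 128GB RAM, 100 million files, ~26GB reserved for the OS, 15TB of data.
print(probe_sizing(15, 100, 128, 26))  # -> {'ram_index_gb': 97, 'local_ssd_gb': 90}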
16 CPU cores or higher Intel Broadwell-compatible or later CPUs The Probe requires CPU instructions that are not available on older CPUs The Probe will run virtually on Intel-based hardware that has a Virtual Cluster vMotion minimum compatibility of Intel Broadwell-compatible or later The Probe has not been evaluated on AMD CPUs 128 GB RAM or higher The probe consumes almost 100GB of RAM upon launch The more RAM, the better the Probe will perform and the more data can be scanned 10 GbE Networking or higher 50 GB SSD-backed local storage or higher (NVMe or FC/iSCSI LUNs) This local SSD capacity is needed for the database the probe builds and logging Must be equivalent to 0.6% of the data to be scanned Disk storage must have very high sustained IOPS The larger the local SSD allocated, the more data can be scanned Local SSD filesystem should be ext4 or xfs","title":"Hardware Minimum Requirements"},{"location":"prerequisites/index.html#operating_system_minimum_requirements","text":"We've tested the following, but most modern Linux distributions should be fine: Ubuntu 18.04, 20.04 CentOS/RHEL 7.4+ Rocky/RHEL 8.3+","title":"Operating System Minimum Requirements"},{"location":"prerequisites/index.html#software_requirements","text":"Docker: 17.05+ python3 (for launching the probe) screen (for running the probe in the background) wget (for downloading the probe image)","title":"Software Requirements"},{"location":"prerequisites/index.html#sample_data_set_filesystem_requirements","text":"Be aware that if the filesystem has atime enabled, any scanning, even while mounted as read-only, will update the atime clock. NFS : The Probe host has to be provided root-squash and read-only access For faster scanning, use an operating system that has nconnect support: Ubuntu 20.04+ RHEL/Rocky 8.4+ Lustre : The Probe host and container must be able to read as a root user GPFS : The Probe host and container must be able to read as a root user SMB : The Probe host should be mounted with a user in the BUILTIN\\Backup Operators group to avoid file access issues. S3/Object : We have tested internally with goofys as a method of imitating a filesystem It is not recommended to scan anything in AWS Glacier or equivalent","title":"Sample Data Set Filesystem Requirements"},{"location":"prerequisites/index.html#hardware_requirement_examples","text":"Example A : You have a server with 768GB of RAM: 154GB is for the Operating System, leaving 614GB of RAM... There are 100 million files to scan, that will occupy ~5GB of RAM, leaving 609GB of RAM... 50-bytes per 'filename' This leaves 609GB of RAM available for the RAM index --ram-index-size-gb 609 This can scan up to 99TB of data using just RAM and no significant local SSD space is needed This calculation is based on a 0.6% rule to accommodate similarity and deduplication hashes Using a disk index, you can scan far more data and the file count could exceed 10 billion with a 500GB file name cache Example B : You have a server with 128GB of RAM and a Local SSD: 26GB is for the Operating System, leaving 102GB of RAM... There are 100 million files to scan, that will occupy ~5GB of RAM, leaving 97GB of RAM...
50-bytes per 'filename' This leaves 97GB of RAM available for the RAM index --ram-index-size-gb 97 This can scan up to 15TB of data using just RAM and no significant local SSD space is needed This calculation is based on a 0.6% rule to accommodate similarity and deduplication hashes Using a disk index, you can scan far more data and the file count could be as high as 2 billion with a 100GB file name cache 15TB of data requires 90GB of local SSD disk 100TB of data requires 600GB of local SSD disk","title":"Hardware Requirement Examples"},{"location":"troubleshooting/index.html","text":"Troubleshooting Overview \u00b6 In general you can monitor the probe's behavior by watching the log file it generates as well as the standard output it generates to the console. Here we document common errors with the probe. tail -f /mnt/probe/log/XXXX.log Troubleshooting Overview Launcher Hang CPU Compatibility CGroup Error Privilege Error Illegal Instruction Launcher Hang \u00b6 In rare cases the probe will complete its run successfully but the python launcher script will hang. If this happens, simply control-C the launcher if you are still attached to the terminal. Or use ps -ef to find the probe process and kill it. CPU Compatibility \u00b6 Occasionally, the probe will launch and finish without scanning any files and may produce an error related to log files. When viewing logs you may see: probe terminated by signal: SIGILL Check the CPU Compatibility: cat /sys/devices/cpu/caps/pmu_name Review the CPU Requirements . CGroup Error \u00b6 You may get the following error on older Linux builds: Loading docker image Starting probe docker container 783735ffbf1a722ebbcd43622476ea2364a8873dc2a6f95a4d006778636ed513 /usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:235: starting container process caused \"process_linux.go:258: applying cgroup configuration for process caused \\\"Cannot set property TasksAccounting, or unknown property.\\\"\". Failed starting the probe If you do, you need to update the systemd-related packages. Here are the versions we use: rpm -qa |grep -i systemd systemd-libs-219-67.el7_7.2.x86_64 systemd-sysv-219-67.el7_7.2.x86_64 systemd-219-67.el7_7.2.x86_64 oci-systemd-hook-0.2.0-1.git05e6923.el7_6.x86_64 Privilege Error \u00b6 When the container is launched it may fail with this error message: Starting probe docker container docker: Error response from daemon: privileged mode is incompatible with user namespaces. You must run the container in the host namespace when running privileged mode. This message is a warning that your docker environment will not allow docker to run with heightened permissions. This is probably caused by a more secure docker configuration, such as placing the following text into the /etc/docker/daemon.json file, which basically prevents a container from running as root with privileges: { \"userns-remap\": \"dockremap:dockremap\" } The easiest way to address that is to hand edit the VAST-provided probe_launcher.py and look for the 'docker run' line. It will look something like this: cmd = f'docker run --privileged -v {args.metadata_dir}:/probe_mnt/{args.metadata_dir} -v {args.output_dir}:/probe_mnt/{args.output_dir} ' Notice the --privileged . Remove it, save the file, and try again. If that doesn't fix the issue, we've found that removing the userns-remap line and restarting the docker service can be more effective.
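To make that workaround concrete, here is roughly what the edit to the 'docker run' line in probe_launcher.py looks like; this is a fragment of the launcher script as quoted in this guide, and the exact contents may differ between probe versions, so treat it as a sketch rather than a verbatim patch.

# Fragment of probe_launcher.py (sketch of the workaround described above).
# Original line, as quoted in this guide, with --privileged:
cmd = f'docker run --privileged -v {args.metadata_dir}:/probe_mnt/{args.metadata_dir} -v {args.output_dir}:/probe_mnt/{args.output_dir} '
# After removing --privileged so the container can start under userns-remap:
cmd = f'docker run -v {args.metadata_dir}:/probe_mnt/{args.metadata_dir} -v {args.output_dir}:/probe_mnt/{args.output_dir} '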
Illegal Instruction \u00b6 If the probe fails with a core dump and it shows an illegal instruction, you may be using an old CPU type which is not compatible with our compiled code. We require Intel Broadwell or newer compatible CPUs. If you use GDB to debug the core you can confirm by looking for something like this: GDB output from core: $ gdb -c core.97296 For help, type \"help\". Type \"apropos word\" to search for commands related to \"word\". Core was generated by `/vast/install/probe/sim_estimator --similarity-function fast_hash_8 --split-win'. Program terminated with signal SIGILL, Illegal instruction. #0 0x00007fffec9deea7 in ?? () (gdb)","title":"Troubleshooting"},{"location":"troubleshooting/index.html#troubleshooting_overview","text":"In general you can monitor the probe's behavior by watching the log file it generates as well as the standard output it generates to the console. Here we document common errors with the probe. tail -f /mnt/probe/log/XXXX.log Troubleshooting Overview Launcher Hang CPU Compatibility CGroup Error Privilege Error Illegal Instruction","title":"Troubleshooting Overview"},{"location":"troubleshooting/index.html#launcher_hang","text":"In rare cases the probe will complete its successfully but the python launcher script will hang. If this happens, simply control-C the launcher if you are still attached to the terminal. Or use ps -ef to find the probe process and kill it.","title":"Launcher Hang"},{"location":"troubleshooting/index.html#cpu_compatibility","text":"Occasionally, the probe will launch and finish without scanning any files and may produce an error related to log files. When viewing logs you may see: probe terminated by signal: SIGILL Check the CPU Compatibility: cat /sys/devices/cpu/caps/pmu_name Review the CPU Requirements .","title":"CPU Compatibility"},{"location":"troubleshooting/index.html#cgroup_error","text":"You may get the following error on older Linux builds: Loading docker image Starting probe docker container 783735ffbf1a722ebbcd43622476ea2364a8873dc2a6f95a4d006778636ed513 /usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:235: starting container process caused \"process_linux.go:258: applying cgroup configuration for process caused \\\"Cannot set property TasksAccounting, or unknown property.\\\"\". Failed starting the probe If you do you need to update the systemd related packages. Here are the versions we use: rpm -qa |grep -i systemd systemd-libs-219-67.el7_7.2.x86_64 systemd-sysv-219-67.el7_7.2.x86_64 systemd-219-67.el7_7.2.x86_64 oci-systemd-hook-0.2.0-1.git05e6923.el7_6.x86_64","title":"CGroup Error"},{"location":"troubleshooting/index.html#privilege_error","text":"When the container is launched it may fail with this error message: Starting probe docker container docker: Error response from daemon: privileged mode is incompatible with user namespaces. You must run the container in the host namespace when running privileged mode. This message is a warning that your docker environment will not allow docker to run with heightened permissions. This probably caused by a more secure docker configuration, such as placing the following text into the /etc/docker/daemon.json file which basically prevents a container from running as root with privileges: { \"userns-remap\": \"dockremap:dockremap\" } The easiest way to address that is to hand edit the VAST provided probe_launcher.py and look for the 'docker run' line. 
It will look something like this: cmd = f'docker run --privileged -v {args.metadata_dir}:/probe_mnt/{args.metadata_dir} -v {args.output_dir}:/probe_mnt/{args.output_dir} ' Notice the --privileged . Remove it, save the file, and try again. If that doesn't fix the issue, we've found that removing the userns-remap line and restarting the docker service can be more effective.","title":"Privilege Error"},{"location":"troubleshooting/index.html#illegal_instruction","text":"If the probe fails with a core dump and it shows an illegal instruction, you may be using an old CPU type which is not compatible with our compiled code. We require Intel Broadwell or newer compatible CPUs. If you use GDB to debug the core you can confirm by looking for something like this: GDB output from core: $ gdb -c core.97296 For help, type \"help\". Type \"apropos word\" to search for commands related to \"word\". Core was generated by `/vast/install/probe/sim_estimator --similarity-function fast_hash_8 --split-win'. Program terminated with signal SIGILL, Illegal instruction. #0 0x00007fffec9deea7 in ?? () (gdb)","title":"Illegal Instruction"}]}
\ No newline at end of file
diff --git a/sitemap.xml.gz b/sitemap.xml.gz
index 033b340dce464da4929b17b17a1d10a5b38b1d6d..790a3ffaf35df5c02ee79df09c2169839a440d8c 100644
GIT binary patch
delta 15
WcmZ3_w4RAgzMF$%_sflJnv4J@QUs^~
delta 15
WcmZ3_w4RAgzMF$X?Ab;(O-2ABWdr~K