Full end-to-end test harness for the Vector log & metrics router. This is the test framework used to generate the performance and correctness results displayed in the Vector docs. You can learn more about how this test harness works in the How It Works section, and you can begin using this test harness via the Usage section.
Performance tests:
- `disk_buffer_performance` test
- `file_to_tcp_performance` test
- `tcp_to_blackhole_performance` test
- `tcp_to_tcp_performance` test
- `tcp_to_http_performance` test
- `regex_parsing_performance` test

Correctness tests:
- `disk_buffer_persistence_correctness` test
- `file_rotate_create_correctness` test
- `file_rotate_truncate_correctness` test
- `file_truncate_correctness` test
- `sighup_correctness` test
- `wrapped_json_correctness` test
Directory structure:
- `/ansible` - global Ansible resources and tasks
- `/bin` - contains all scripts
- `/cases` - contains all test cases
- `/packer` - Packer script to build the AMIs necessary for tests
- `/terraform` - global Terraform state, resources, and modules
To set up the test harness:
- Ensure you have Ansible (2.7+) and Terraform (0.12.20+) installed.
- This step is optional, but highly recommended: set up a `vector`-specific AWS profile in your `~/.aws/credentials` file. We highly recommend running the Vector test harness in a separate AWS sandbox account if possible.
- Create an Amazon-compatible key pair. This will be used for SSH access to test instances.
- Run `cp .envrc.example .envrc`. Read through the file and update it as necessary (an example sketch follows this list).
- Run `source .envrc` to prepare the environment. Alternatively, install direnv to do this automatically. Note that the `.env` file, if it exists, will be automatically sourced into the scripts' environment, so it's another option for setting the environment variables for the `bin/*` commands of this repo.
- Run `./bin/test -t [tcp_to_tcp_performance]`. This script will take care of running the necessary Terraform and Ansible scripts.
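For reference, a minimal `.envrc` might look like the sketch below. Everything except `VECTOR_TEST_SSH_PRIVATE_KEY` (which is referenced again in the debugging notes later in this README) is a hypothetical example; treat `.envrc.example` as the authoritative list of variables.

```bash
# Sketch only -- .envrc.example is the authoritative template.
# AWS_PROFILE is the standard AWS CLI/SDK variable; "vector" matches the
# optional profile suggested in the setup steps above.
export AWS_PROFILE=vector

# Private key used for SSH access to test instances.
export VECTOR_TEST_SSH_PRIVATE_KEY=~/.ssh/vector_management
```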
The test harness provides the following scripts:
- `bin/build-amis` - builds AMIs for use in test cases
- `bin/compare` - compares test results across all subjects
- `bin/ssh` - utility script to SSH into a test server
- `bin/test` - runs a specific test
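As a hypothetical end-to-end flow using these scripts (each script also prints a usage overview with `--help`, as noted in the How It Works section):

```bash
./bin/build-amis                      # build the AMIs used by the test cases
./bin/test -t tcp_to_tcp_performance  # provision, bootstrap, and run one test
./bin/compare --help                  # see options for comparing results across subjects
./bin/ssh 51.5.210.84                 # SSH into a test instance by public IP
```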
- High-level results can be found in the Vector performance and correctness documentation sections.
- Detailed results can be found within each test case's README.
- Raw performance result data can be found in our public S3 bucket.
- You can run your own queries against the raw data. See the Usage section.
We recommend cloning a similar test, since doing so removes a lot of the boilerplate. If you prefer to start from scratch:
- Create a new folder in the `/cases` directory. The folder name should end with `_performance` or `_correctness` to clarify the type of test this is (a sketch of the resulting layout follows this list).
- Add a `README.md` providing an overview of the test. See the `tcp_to_tcp_performance` test for an example.
- Add a `terraform/main.tf` file for provisioning test resources.
- Add an `ansible/bootstrap.yml` to bootstrap the environment.
- Add an `ansible/run.yml` to run the test against each subject.
- Add any additional files as you see fit for each test.
- Run `bin/test -t <name_of_test>`.
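As a rough sketch of the resulting layout, using a hypothetical test named `my_sink_performance` (the file set mirrors the list above):

```bash
# Hypothetical test name; only the files listed above are required.
mkdir -p cases/my_sink_performance/{terraform,ansible}
touch cases/my_sink_performance/README.md \
      cases/my_sink_performance/terraform/main.tf \
      cases/my_sink_performance/ansible/bootstrap.yml \
      cases/my_sink_performance/ansible/run.yml

# Then execute it:
./bin/test -t my_sink_performance
```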
You should not change tests that have historical test data. You can change test subject versions, since test data is partitioned by version, but you cannot change a test's execution strategy, as this would corrupt historical test data. If you need to change a test in a way that would invalidate its historical data, we recommend creating an entirely new test.
Simply delete the folder and any data in the S3 bucket.
If you encounter an error, it's likely you'll need to SSH onto the server to investigate:

ssh -o 'IdentityFile="~/.ssh/vector_management"' ubuntu@51.5.210.84

Where:
- `~/.ssh/vector_management` = the `VECTOR_TEST_SSH_PRIVATE_KEY` value provided in your `.envrc` file.
- `ubuntu` = the default root username for the instance.
- `51.5.210.84` = the public IP address of the instance.
We provide a command that wraps the system `ssh` and provides the same credentials that Ansible uses when connecting to the VM:

./bin/ssh 51.5.210.84
All services are configured with systemd, and their logs can be accessed with `journalctl`:

sudo journalctl -fu <service>
If you find that the service failed to start, it can be helpful to manually attempt to start the service by inspecting the command in the `.service` file:
cat /etc/systemd/system/<name>.service
Then copy the command specified in `ExecStart` and run it manually. For example:
/usr/bin/vector
Things can go wrong on your end (i.e. on the local system from which you're running the test harness) too.
export ANSIBLE_ENABLE_TASK_DEBUGGER=True
Set the environment variable above, and Ansible will drop you into its debugger on any task failure.
See the Ansible documentation on the Playbook Debugger to learn more.
Some useful commands:
pprint task_vars['hostvars'][str(host)]['last_message']
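Beyond printing variables, the debugger's documented commands let you tweak and re-run the failed task, for example printing the task's arguments, modifying one, and retrying:

```
p task.args
task.args['<key>'] = '<value>'
redo
```

`continue` resumes the play and `quit` aborts it.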
export ANSIBLE_EXTRA_ARGS=-vvv
Set the environment variable above, and Ansible will print verbose debug information for every task it executes.
The Vector test harness is a mix of bash, Terraform, and Ansible scripts. Each test case lives in the `/cases` directory and has full rein over its bootstrap and test process via its own Terraform and Ansible scripts. The location of these scripts is dictated by the `test` script and is outlined in more detail in the Adding a test section. Each test falls into one of two categories: performance tests and correctness tests.
Performance tests measure performance and MUST capture detailed performance data as outlined in the Performance Data and Rules sections.
In addition to the `test` script, there is a `compare` script. This script analyzes the performance data captured when executing a test. More information on this data and how it's captured and analyzed can be found in the Performance Data section. Finally, each script includes a usage overview that you can access with the `--help` flag.
Performance test data is captured via `dstat`, a lightweight utility that captures a variety of system statistics in 1-second snapshot intervals. The final result is a CSV where each row represents a snapshot. You can see the `dstat` command used in the `ansible/roles/profiling/start.yml` file.
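The exact invocation lives in that file; a rough, hypothetical approximation of a `dstat` command producing this kind of CSV looks like:

```bash
# Approximation only -- see ansible/roles/profiling/start.yml for the real command.
# Writes one snapshot per second to a CSV (output path is hypothetical).
dstat --epoch --cpu --disk --io --load --mem --net --proc --sys \
      --socket --tcp --output /tmp/profile.csv 1
```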
The performance data schema is reflected in the Athena table definition as well as the CSV itself. The following is an ordered list of columns:
| Name | Type |
|---|---|
| `epoch` | double |
| `cpu_usr` | double |
| `cpu_sys` | double |
| `cpu_idl` | double |
| `cpu_wai` | double |
| `cpu_hiq` | double |
| `cpu_siq` | double |
| `disk_read` | double |
| `disk_writ` | double |
| `io_read` | double |
| `io_writ` | double |
| `load_avg_1m` | double |
| `load_avg_5m` | double |
| `load_avg_15m` | double |
| `mem_used` | double |
| `mem_buff` | double |
| `mem_cach` | double |
| `mem_free` | double |
| `net_recv` | double |
| `net_send` | double |
| `procs_run` | double |
| `procs_bulk` | double |
| `procs_new` | double |
| `procs_total` | double |
| `sys_init` | double |
| `sys_csw` | double |
| `sock_total` | double |
| `sock_tcp` | double |
| `sock_udp` | double |
| `sock_raw` | double |
| `sock_frg` | double |
| `tcp_lis` | double |
| `tcp_act` | double |
| `tcp_syn` | double |
| `tcp_tim` | double |
| `tcp_clo` | double |
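As a quick, hypothetical local sanity check against this schema, you could average the `cpu_usr` column (field 2) of a downloaded CSV; the filename is illustrative:

```bash
# Only accept numeric values in field 2, which also skips dstat's header lines.
awk -F, '$2 ~ /^[0-9.]+$/ { sum += $2; n++ } END { if (n) printf "avg cpu_usr: %.2f%%\n", sum / n }' results.csv
```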
All performance data is made public via the `vector-tests` S3 bucket in the `us-east-1` region. The partitioning follows the Hive partitioning structure, with variable names in the path. For example:

name=tcp_to_tcp_performance/configuration=default/subject=vector/version=v0.2.0-dev.1-20-gae8eba2/timestamp=1559073720

And the same in tree form:

name=tcp_to_tcp_performance/
  configuration=default/
    subject=vector/
      version=v0.2.0-dev.1-20-gae8eba2/
        timestamp=1559073720
- `name` = the test name.
- `configuration` = the test's specific configuration (tests can have multiple configurations if necessary).
- `subject` = the test subject, such as `vector`.
- `version` = the version of the test subject.
- `timestamp` = when the test was executed.
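Because the bucket is public, you can browse the partitions directly with the AWS CLI; the prefix below reuses the example path above:

```bash
# Add --no-sign-request if you have no AWS credentials configured locally.
aws s3 ls --region us-east-1 \
  "s3://vector-tests/name=tcp_to_tcp_performance/configuration=default/subject=vector/"
```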
Analysis of this data is performed through the AWS Athena service. This allows us to execute complex queries on the performance data stored in S3. You can see the queries run in the `compare` script.
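For a one-off query outside the `compare` script, the Athena CLI also works; the database, table, and results-bucket names below are hypothetical, and the query is only an example:

```bash
# Hypothetical Athena database/table and results bucket -- adjust to your setup.
aws athena start-query-execution \
  --region us-east-1 \
  --query-execution-context Database=vector_tests \
  --result-configuration OutputLocation=s3://my-athena-results/ \
  --query-string "SELECT subject, version, avg(cpu_usr) AS avg_cpu
                  FROM performance
                  WHERE name = 'tcp_to_tcp_performance'
                  GROUP BY subject, version"
```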
Correctness tests simply verify behavior. These tests are not required to capture or to persist any data. The results can be manually verified and placed in the test's README.
Since correctness tests are pass/fail, there is no data to capture other than the successful running of the test.
Generally, correctness tests verify output. Because of the various test subjects, we use a variety of methods to capture output (TCP, HTTP, and file). This is highly dependent on the test subject and the methods available. For example, the Splunk Forwarders only support TCP and Splunk-specific outputs.
To make capturing this data easy, we created a `test_server` Ansible role that spins up various test servers and provides a simple way to capture summary output.
Tests must operate in isolated, reproducible environments; they must never run locally. The obvious benefit is that it removes variables across tests, but it also improves collaboration since remote environments are easily accessible and reproducible by other engineers.
- ALWAYS filter to resources specific to your `test_name`, `test_configuration`, and `user_id` (e.g. Ansible host targeting).
- ALWAYS make sure the initial instance state is identical across test subjects. We recommend explicitly stopping all test subjects to properly handle the case of a preceding failure and the situation where a subject was not cleanly shut down.
- ALWAYS use the `profile` Ansible role to capture data. This ensures a consistent data structure across tests.
- ALWAYS run performance tests for at least 1 minute so that a 1m CPU load average can be calculated.
- Use Ansible roles whenever possible.
- If you are not testing local data collection, we recommend using TCP as a data source, since it is a lightweight source that is more likely to be consistent, performance-wise, across subjects.