performance testing/tuning/refactoring #30

Open
dipinhora opened this issue Sep 19, 2017 · 3 comments

@dipinhora
Contributor

The code is mostly unoptimized except for some performance oriented design decisions to keep things as asynchronous as possible.

This code should be properly performance tested and tuned as required.

@dipinhora
Contributor Author

dipinhora commented Jan 25, 2018

The following is an unscientific and incomplete performance comparison of Pony Kafka with librdkafka to get an idea of how far we have to go. It is by no means meant to be definitive, nor is it a real benchmark.

TLDR:

Pony Kafka sends data to Kafka about 5% - 10% slower than librdkafka, but reads data from Kafka about 75% slower than librdkafka. Pony Kafka also uses more CPU than librdkafka (at least part of this is due to how Pony scheduler threads and work stealing function).


All testing was done on an i3.8xlarge in AWS using the wallaroo orchestration framework, started with the command:

make cluster cluster_name=dh2 mem_required=30 cpus_required=32 num_followers=0 force_instance=i3.8xlarge spot_bid_factor=100 ansible_system_cpus=0,16 no_spot=true cluster_project_name=wallaroo_dev ansible_install_devtools=true

This includes the following:

  • a total of 32 vCPUs (of which 15 are disabled because they are hyperthreads) - 1 hyperthread is purposefully left enabled (the sibling of core 0) to be used for system activity
  • two cpu sets - /system (cores 0 and 16) and /user (cores 1 - 15); a rough sketch of this setup follows after this list
  • all system activity is limited to the /system cpu set
  • RAIDed instance store volumes for temp storage (i.e. kafka logs)
  • a variety of system/kernel tweaks
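
For reference, the cpuset layout described above could be created roughly like this. This is an illustrative sketch only; the orchestration framework's actual commands (and the hyperthread sibling numbering on a given instance) may differ:

# assumed sketch only - not the orchestration framework's actual commands
sudo cset set --cpu=0,16 --set=/system    # system activity on core 0 and its hyperthread sibling
sudo cset set --cpu=1-15 --set=/user      # benchmark work on cores 1 - 15
sudo cset proc --move --fromset=/ --toset=/system --kthread --force    # move existing tasks/kthreads into /system
for cpu in $(seq 17 31); do echo 0 | sudo tee /sys/devices/system/cpu/cpu${cpu}/online; done    # disable hyperthread siblings of cores 1 - 15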

The following steps were taken after ssh'ing in:

Clone pony-kafka:

cd ~
git clone https://github.com/WallarooLabs/pony-kafka
cd ~/pony-kafka
git checkout code_improvements_new

Build pony-kafka performance app:

cd ~/pony-kafka
ponyc examples/performance

Clone librdkafka:

cd ~
git clone https://github.com/edenhill/librdkafka
cd ~/librdkafka
git checkout v0.11.3

Build librdkafka performance app:

cd ~/librdkafka
./configure
make examples

Install java/kafka:

~/pony-kafka/misc/kafka/download_kafka_java.sh

Everything was run using sudo cset proc -s user -e bash to minimize system cpu contention by using the /user cpu set.
Everything was run assigned to dedicated cpu cores to avoid thrashing/context switches. It was also all run with realtime priority (cpu assignments/usage was verified using htop).
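
Concretely, each workload below was launched from a shell confined to the /user cpuset and then pinned and prioritized per command; the general pattern (core ranges vary per component, as the individual commands show) was:

sudo cset proc -s user -e bash            # open a shell confined to the /user cpuset
numactl -C <cores> chrt -f 80 <command>   # pin the workload to specific cores and run it with realtime FIFO priority 80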

Start zookeeper:

numactl -C 15 chrt -f 80 env KAFKA_HEAP_OPTS="-Xmx40960M -Xms40960M" ~/pony-kafka/misc/kafka/start_zookeeper.sh

Start kafka broker 0:

numactl -C 12-14 chrt -f 80 env KAFKA_HEAP_OPTS="-Xmx40960M -Xms40960M" ~/pony-kafka/misc/kafka/start_kafka_0.sh

Start kafka broker 1:

numactl -C 9-11 chrt -f 80 env KAFKA_HEAP_OPTS="-Xmx40960M -Xms40960M" ~/pony-kafka/misc/kafka/start_kafka_1.sh

Start kafka broker 2:

numactl -C 6-8 chrt -f 80 env KAFKA_HEAP_OPTS="-Xmx40960M -Xms40960M" ~/pony-kafka/misc/kafka/start_kafka_2.sh

Create topic:

~/pony-kafka/misc/kafka/create_replicate_topic.sh
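
The script's contents are not reproduced here; a replicated topic on a 3-broker cluster of this era is typically created with something like the following (the partition count, replication factor, and kafka path are assumptions, not the script's actual values):

# assumed example - the script's real settings may differ
~/kafka/bin/kafka-topics.sh --create --zookeeper 127.0.0.1:2181 --topic test --partitions 4 --replication-factor 3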

Producing tests (acks = -1):

Everything was run using sudo cset proc -s user -e bash to minimize system cpu contention by using the /user cpu set.
Everything was run assigned to dedicated cpu cores to avoid thrashing/context switches. It was also all run with realtime priority (cpu assignments/usage was verified using htop).

Each application was run 3 times, alternating between the two.
NOTE: Kafka/zookeeper were not restarted between runs.

Run librdkafka performance app in producer mode with acks=-1:

cd ~/librdkafka
numactl -C 1-5 chrt -f 80 ./examples/rdkafka_performance -P -t test -s 100 -c 1000000 -m "_____________Test2:OneBrokers:500kmsgs:100bytes" -S 1 -a -1 -b 127.0.0.1:9092

Results:

Run 1:

% 416665 backpressures for 1000000 produce calls: 41.666% backpressure rate
% 1000000 messages produced (100000000 bytes), 1000000 delivered (offset 0, 0 failed) in 131967ms: 7577 msgs/s and 0.76 MB/s, 416665 produce failures, 0 in queue, no compression

Run 2:

% 407898 backpressures for 1000000 produce calls: 40.790% backpressure rate
% 1000000 messages produced (100000000 bytes), 1000000 delivered (offset 0, 0 failed) in 129668ms: 7711 msgs/s and 0.77 MB/s, 407898 produce failures, 0 in queue, no compression

Run 3:

% 409888 backpressures for 1000000 produce calls: 40.989% backpressure rate
% 1000000 messages produced (100000000 bytes), 1000000 delivered (offset 0, 0 failed) in 131556ms: 7601 msgs/s and 0.76 MB/s, 409888 produce failures, 0 in queue, no compression

Run pony-kafka performance app in producer mode with acks=-1:

cd ~/pony-kafka
numactl -C 1-5 chrt -f 80 ./performance --client_mode producer --produce_message_size 100 --num_messages 1000000 --brokers 127.0.0.1:9092 --produce_acks -1 --topic test --ponythreads 4 --ponyminthreads 4 --ponypinasio --ponynoblock

Results:

Run 1:

2018-01-29 19:04:37: Received acks for all 1000000 messages produced. num_errors: 0. Time taken: 136.187 seconds. Throughput: 7342.85/sec.

Run 2:

2018-01-29 19:09:58: Received acks for all 1000000 messages produced. num_errors: 0. Time taken: 137.69 seconds. Throughput: 7262.71/sec.

Run 3:

2018-01-29 19:15:26: Received acks for all 1000000 messages produced. num_errors: 0. Time taken: 136.954 seconds. Throughput: 7301.71/sec.


Producing tests (acks = 1):

Everything was run using sudo cset proc -s user -e bash to minimize system cpu contention by using the /user cpu set.
Everything was run assigned to dedicated cpu cores to avoid thrashing/context switches. It was also all run with realtime priority (cpu assignments/usage was verified using htop).

Each application was run 3 times, alternating between the two.
NOTE: Kafka/zookeeper were not restarted between runs or the previous test.

Run librdkafka performance app in producer mode with acks=1:

cd ~/librdkafka
numactl -C 1-5 chrt -f 80 ./examples/rdkafka_performance -P -t test -s 100 -c 1000000 -m "_____________Test2:OneBrokers:500kmsgs:100bytes" -S 1 -a 1 -b 127.0.0.1:9092

Results:

Run 1:

% 365065 backpressures for 1000000 produce calls: 36.506% backpressure rate
% 1000000 messages produced (100000000 bytes), 1000000 delivered (offset 0, 0 failed) in 55392ms: 18052 msgs/s and 1.81 MB/s, 365065 produce failures, 0 in queue, no compression

Run 2:

% 348412 backpressures for 1000000 produce calls: 34.841% backpressure rate
% 1000000 messages produced (100000000 bytes), 1000000 delivered (offset 0, 0 failed) in 54089ms: 18487 msgs/s and 1.85 MB/s, 348412 produce failures, 0 in queue, no compression

Run 3:

% 338523 backpressures for 1000000 produce calls: 33.852% backpressure rate
% 1000000 messages produced (100000000 bytes), 1000000 delivered (offset 0, 0 failed) in 50750ms: 19704 msgs/s and 1.97 MB/s, 338523 produce failures, 0 in queue, no compression

Run pony-kafka performance app in producer mode with acks=1:

cd ~/pony-kafka
numactl -C 1-5 chrt -f 80 ./performance --client_mode producer --produce_message_size 100 --num_messages 1000000 --brokers 127.0.0.1:9092 --produce_acks 1 --topic test --ponythreads 4 --ponyminthreads 4 --ponypinasio --ponynoblock

Results:

Run 1:

2018-01-29 19:19:37: Received acks for all 1000000 messages produced. num_errors: 0. Time taken: 59.0292 seconds. Throughput: 16940.8/sec.

Run 2:

2018-01-29 19:22:19: Received acks for all 1000000 messages produced. num_errors: 0. Time taken: 56.5403 seconds. Throughput: 17686.5/sec.

Run 3:

2018-01-29 19:24:24: Received acks for all 1000000 messages produced. num_errors: 0. Time taken: 57.6605 seconds. Throughput: 17342.9/sec.


Producing tests (acks = 0):

Everything was run using sudo cset proc -s user -e bash to minimize system cpu contention by using the /user cpu set.
Everything was run assigned to dedicated cpu cores to avoid thrashing/context switches. It was also all run with realtime priority (cpu assignments/usage was verified using htop).

Each application was run 3 times, alternating between the two.
NOTE: Kafka/zookeeper were not restarted between runs or the previous test.

Run librdkafka performance app in producer mode with acks=0:

cd ~/librdkafka
numactl -C 1-5 chrt -f 80 ./examples/rdkafka_performance -P -t test -s 100 -c 1000000 -m "_____________Test2:OneBrokers:500kmsgs:100bytes" -S 1 -a 0 -b 127.0.0.1:9092

Results:

Run 1:

% 177 backpressures for 1000000 produce calls: 0.018% backpressure rate
% 1000000 messages produced (100000000 bytes), 1000000 delivered (offset 0, 0 failed) in 33178ms: 30139 msgs/s and 3.01 MB/s, 177 produce failures, 0 in queue, no compression

Run 2:

% 244 backpressures for 1000000 produce calls: 0.024% backpressure rate
% 1000000 messages produced (100000000 bytes), 1000000 delivered (offset 0, 0 failed) in 31846ms: 31400 msgs/s and 3.14 MB/s, 244 produce failures, 0 in queue, no compression

Run 3:

% 2 backpressures for 1000000 produce calls: 0.000% backpressure rate
% 1000000 messages produced (100000000 bytes), 1000000 delivered (offset 0, 0 failed) in 31666ms: 31579 msgs/s and 3.16 MB/s, 2 produce failures, 0 in queue, no compression

Run pony-kafka performance app in producer mode with acks=0:

cd ~/pony-kafka
numactl -C 1-5 chrt -f 80 ./performance --client_mode producer --produce_message_size 100 --num_messages 1000000 --brokers 127.0.0.1:9092 --produce_acks 0 --topic test --ponythreads 4 --ponyminthreads 4 --ponypinasio --ponynoblock

Results:

Run 1:

2018-01-29 19:26:28: Received acks for all 1000000 messages produced. num_errors: 0. Time taken: 34.5447 seconds. Throughput: 28948/sec.

Run 2:

2018-01-29 19:27:50: Received acks for all 1000000 messages produced. num_errors: 0. Time taken: 31.7206 seconds. Throughput: 31525.3/sec.

Run 3:

2018-01-29 19:29:37: Received acks for all 1000000 messages produced. num_errors: 0. Time taken: 35.2084 seconds. Throughput: 28402.3/sec.


Consuming tests:

Everything was run using sudo cset proc -s user -e bash to minimize system cpu contention by using the /user cpu set.
Everything was run assigned to dedicated cpu cores to avoid thrashing/context switches. It was also all run with realtime priority (cpu assignments/usage was verified using htop).

Each application was run 3 times, alternating between the two. Prior to running this, data was loaded into kafka using:

numactl -C 1-5 chrt -f 80 ./examples/rdkafka_performance -P -t test -s 100 -c 10000000 -m "_____________Test2:OneBrokers:500kmsgs:100bytes" -S 1 -a 0 -b 127.0.0.1:9092
NOTE: Kafka/zookeeper were not restarted between runs or the previous test.

Run librdkafka performance app in consumer mode:

cd ~/librdkafka
numactl -C 1-5 chrt -f 80 ./examples/rdkafka_performance -C -t test -b 127.0.0.1:9092 -o beginning -c 10000000 -G test1 # use a unique consumer group each run (test1, test2, test3)

Results:

Run 1:

% 10000000 messages (1000000000 bytes) consumed in 8050ms: 1242185 msgs/s (124.22 MB/s)

Run 2:

% 10000000 messages (1000000000 bytes) consumed in 8452ms: 1183124 msgs/s (118.31 MB/s)

Run 3:

% 10000000 messages (1000000000 bytes) consumed in 8261ms: 1210407 msgs/s (121.04 MB/s)

Run pony-kafka performance app in consumer mode:

cd ~/pony-kafka
numactl -C 1-5 chrt -f 80 ./performance --client_mode consumer --num_messages 10000000 --brokers 127.0.0.1:9092 --topic test --ponythreads 4 --ponyminthreads 4 --ponypinasio --ponynoblock

Results:

Run 1:

2018-01-29 19:45:46: Received 10000000 messages as requested. Time taken: 29.58 seconds. Throughput: 338066/sec.

Run 2:

2018-01-29 19:47:19: Received 10000000 messages as requested. Time taken: 29.944 seconds. Throughput: 333957/sec.

Run 3:

2018-01-29 19:47:56: Received 10000000 messages as requested. Time taken: 29.6287 seconds. Throughput: 337511/sec.

@edenhill

Good work on the client and the blog post - a lot of good insight into early stage client development. 👍

You might want to try rdkafka_performance with -X linger.ms=100 in producer mode. The change of the linger.ms default from 1000ms to 0ms in 0.11.0 gives poor producer performance for the sake of improved latency. A proper fix is scheduled for 0.11.4.
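
For example, applied to the acks=1 producer command from above, the override would simply be appended to the existing flags (shown here only to illustrate placement; results with it were not collected in this comparison):

numactl -C 1-5 chrt -f 80 ./examples/rdkafka_performance -P -t test -s 100 -c 1000000 -m "_____________Test2:OneBrokers:500kmsgs:100bytes" -S 1 -a 1 -b 127.0.0.1:9092 -X linger.ms=100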

@dipinhora
Contributor Author

@edenhill Thank you very much for the kind words. As mentioned in the blog post, librdkafka has been a great source of inspiration for us and we wouldn't be this far along without it.

I'll definitely do another round of testing with -X linger.ms=100 (and a similar change for Pony Kafka). I'll keep an eye out for how you resolve the two competing concerns of latency and throughput. Defaults are hard. 8*/
