Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: consolidate vm gas consumption #1430

Merged
merged 16 commits into from
Apr 25, 2024
Merged

fix: consolidate vm gas consumption #1430

merged 16 commits into from
Apr 25, 2024

Conversation

piux2
Copy link
Contributor

@piux2 piux2 commented Dec 11, 2023

Contributors' checklist...
  • Added new tests
  • Provided an example (e.g. screenshot) to aid review or the PR is self-explanatory
  • Updated the official documentation or not needed
  • No breaking changes were made
  • Added references to related issues and PRs
  • Provided any useful hints for running manual tests
  • Added new benchmarks to generated graphs, if any. More info here.

Ref: #1070 #1067 #649 #1281

Summary

The current gno.land node, optimized for development purposes, has a simplified verification process and gas meter implementation. To transition the gno.land node to a production-ready state, it is necessary to implement a comprehensive gas metering system that accurately accounts for VM gas consumption. This includes refining the gas fee structure to encompass all relevant costs, ensuring robust transaction validation, and calculating gas consumption based on actual computational load.

This PR aims to address these limitations by introducing a complete gas meter and validation flow, laying the groundwork for further gas meter profiling and configuration.

Problem Definition

Current State and Limitations for Production:

  • VM Gas Consumption Not Accounted in Gas Meter: The current gas meter fails to calculate VM gas consumption, potentially allowing heavy contract loads without corresponding gas meter deductions. A refined system should measure and charge for VM gas usage accurately.

  • Gas Fee Structure: Presently, the gas fee structure only includes storage access, transaction size, and signature verification. VM gas fees are levied as a separate, flat fee, which might lead to confusion among users expecting the total fee to match the amount specified in the 'gas-fee' argument. For improved transparency and precision, the gas fee structure should integrate all these aspects.

  • Transaction Validation: The system currently validates basic information for VM msg_addpkg and msg_call. However, gas consumption cannot be determined before fully executing these messages against the VM. Consequently, VM transactions are placed in the mempool and propagated to other nodes, even if they may not meet the gas fee requirements to execute these transactions. This undermines the purpose of using gas fees to prevent VM spamming.

Solution: ( Updated )

This is a high-level description of the implemented features:

Added an anteHandler in VM to monitor gas consumption
Implemented chained VM anteHandler in auth.anteHandler

  • Consume gas to verify account, signature and tx size in CheckTx
  • Consume VM gas in DeliverTx
  • Accumulated VM CPU cycles, memory allocation, store access, transaction size, and signature verification into a single gas meter.
  • Enabled local node checks of VM resource usage. The VM message is only aborted if it runs out of gas in basic CheckTx. However, the message is still propagated to other nodes if execution fails to prevent censorship
  • Introduced a structured format for logging gas consumption for profiling and metrics.
  • Introduced a gas factor linking gas to vm CPU cycles and memory allocation to balance between vm gas consumption with the rest.

Trade-offs and Future Optimization: ( Updated )

The current implementation processes messages against the VM to check gas consumption in abci.CheckTx() before inclusion in the mempool and propagation to other nodes.

Messages lacking sufficient gas-wanted will be dropped, preventing abuse without adequate gas fees. However, the trade-off is that for each message with enough gas, the VM executes the transaction twice: once in CheckTx() and once in DeliverTx(). As these occur in separate execution contexts and are not in synchronized sequence, the performance impact is currently a secondary concern.

We moved the VM gas check from CheckTx to DeliverTx for the following reasons:

  • We only know the VM gas consumption after the messages have been processed.
  • Running VM execution for many CheckTx requests from the peers could overload the mempool that is executing CheckTx.
  • This could slow down the propagation of transactions across the entire network.

By moving the VM gas check from CheckTx to DeliverTx, we are able to reduce the load on the mempool of a node and allow transactions to propagate through the network faster.

In the future, we may use a predicted median value instead of the exact value from transaction execution for efficiency.

What's Next:

  • Add a minimum gas price flag and configuration for node operation.
  • Provide a user-friendly fee input interface, offering 'gas-wanted' and 'gas price' as alternatives to the current 'gas-wanted' and 'gas-fee' inputs.
  • Tune the gas factor based on VM CPU and Memory Profiling. The current factor is 1:1 between gas and VM CPU cycles and memory allocation.

@piux2 piux2 requested review from jaekwon, moul and a team as code owners December 11, 2023 11:18
@github-actions github-actions bot added the 📦 🌐 tendermint v2 Issues or PRs tm2 related label Dec 11, 2023
@github-actions github-actions bot added the 📦 ⛰️ gno.land Issues or PRs gno.land package related label Dec 11, 2023
@piux2 piux2 changed the title consolidate vm gas consumption with the rest fix: consolidate vm gas consumption Dec 11, 2023
Copy link

codecov bot commented Dec 11, 2023

Codecov Report

Attention: Patch coverage is 48.86364% with 45 lines in your changes are missing coverage. Please review.

Project coverage is 46.27%. Comparing base (0ba95bf) to head (0677d5d).

Files Patch % Lines
gno.land/pkg/sdk/vm/keeper.go 45.56% 42 Missing and 1 partial ⚠️
gno.land/pkg/sdk/vm/handler.go 33.33% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1430      +/-   ##
==========================================
+ Coverage   46.25%   46.27%   +0.02%     
==========================================
  Files         483      483              
  Lines       68837    68851      +14     
==========================================
+ Hits        31838    31861      +23     
+ Misses      34368    34357      -11     
- Partials     2631     2633       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@tbruyelle
Copy link
Contributor

A very minimal change for such a big improvement, that is impressive!

Copy link

@MichaelFrazzy MichaelFrazzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The anteHandler will be super helpful, thanks for taking the time to knock all this out @piux2. Really amazing work.

To fill you in on where we're at for the mechanics (would love your feedback/thoughts here, maybe a meeting soon before sims start):

Currently it seems like we'll largely be mirroring EIP 1559. So as you probably already know in that case... we'd have a target block size, actual block size, max amount gas can change per block, and real block size threshold. If the threshold is 200%, then when the actual block size is 200% larger than the target then gas would increase by the max amount, 12.5% in ETH's case.

From there we can feed the resulting gas value into a final fee equation that also takes into account CPU cycles and/or bytes required for a specific transaction. To account for the resources needed we'd need to assign a fixed value to each byte stored and each CPU cycle to multiply by the required amounts. CPU cycles would be hardware specific though, right?

I also still like the idea of a logistic or expo curve that kicks in after a certain level, as a circuit breaker spam protection method.

At the same time the resource parameters present a game-able situation where people can create less efficient/optimized code to earn additional fees. But considering these submissions will often undergo manual review early on and even with us planning to give a % of all fees earned to the realm's builder, we should have some counter measures to play with.

Copy link

@MichaelFrazzy MichaelFrazzy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Handful of small nitpicks, nothing major

gno.land/pkg/sdk/vm/ante_test.go Outdated Show resolved Hide resolved
gno.land/pkg/sdk/vm/ante.go Outdated Show resolved Hide resolved
gno.land/pkg/sdk/vm/ante.go Outdated Show resolved Hide resolved
gno.land/pkg/sdk/vm/ante.go Outdated Show resolved Hide resolved
gno.land/pkg/sdk/vm/ante_test.go Outdated Show resolved Hide resolved
gno.land/pkg/sdk/vm/ante_test.go Outdated Show resolved Hide resolved
gno.land/pkg/sdk/vm/ante_test.go Outdated Show resolved Hide resolved
gno.land/pkg/sdk/vm/ante_test.go Outdated Show resolved Hide resolved
gno.land/pkg/sdk/vm/gas.go Outdated Show resolved Hide resolved
gno.land/pkg/sdk/vm/keeper.go Outdated Show resolved Hide resolved
Copy link
Contributor

@deelawn deelawn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just left a couple comments on style and consolidating code, but the overall approach seems good to me. Great job 🎉

piux2 and others added 2 commits December 13, 2023 10:06
Copy link
Member

@thehowl thehowl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Me & Ray went through the PR to discuss changes and optimisations. From Ray's benchmarking, it seems that a considerable amount of the lost optimisations are due to the overflow checks which are now in place in the gas package.

To address the benchmarking issues, I see two future improvements which can be made:

  1. Changing gnolang/overflow to use the OF/CF flags instead, checking for overflow directly on hardware rather than with a software solution. This is tracked in gnolang/overflow: optimise package for common architectures #1844.
  2. Eventually moving to having one single system to track resource consumption, rather than the double system of Gas meter + MaxCycles. (For now, it makes sense to keep the latter as a "dumb fallback".)

Another thing to note, is that the keeper code is painfully repetitive, but it's not this PR's fault. Some refactor should aim to try to make it not repeat itself, while trying not to harm readability.

After resolving outstanding comments, I'd say good to merge.

@jaekwon
Copy link
Contributor

jaekwon commented Mar 27, 2024

still reviewing plz hold off on merging just yet.

@jaekwon
Copy link
Contributor

jaekwon commented Mar 28, 2024

it seems that a considerable amount of the lost optimisations are due to the overflow checks which are now in place in the gas package.

Are we sure about this?
Try benchmarking with overflow.*() usage removed.
It could be something else, like, maybe if there are too much logic in the switch statement (in the machine for the opcode) it becomes slower... subtle things like that. If this is the case, then it might make sense to first look up the CPU cost per opcode, and then incrCPU() once before the switch statement... (I have no idea if this is true, I'm just spitballing here; my point is fine tuning optimizations like this can be very tricky, and sometimes you have to look at the resulting pseudo-assembly itself to understand what is happening).

Another example of something tricky... struct fields that are later in field order take longer to access. (this is actually true, or was true when I was first developing GNO). If a struct field is used often, such as the gas meter, it is better to declare it higher, rather than how it is how, where it is among the last of fields.

Another fact is that interface methods take much longer to call than concrete implementation methods. We probably want to use the tm2.GasMeter interface, but read the next point:

If it really is the call to incrCPU that is slow, there is something else we can do... we can batch increment CPUs... e.g. run 10, 20, or 100 GNO OpCode operations, accumulate the CPU cycles used, and then call incrCPU just once with the accumulated figure. We just need to remember to call incrCPU at the end of Machine.Run() before the batch call to incrCPU happens again for any remainder CPU cycles. But I would only do this once we are absolutely certain that it is the call to gasMeter that is slow.

We could implement a concrete BatchGasMeter and have the Machine hold a reference to *BatchGasMeter rather than an interface, and BatchGasMeter could "flush" the batch to the underlying tm2.GasMeter every 10/20/100 runs; this way the more expensive call to an interface can be amortized. And then we can call BatchGasMeter.Flush() to flush any remainders to the underlying tm2.GasMeter upon completion of Machine.Run. <-- probably will help.

But how much would this actually help in performance?

@piux2
Copy link
Contributor Author

piux2 commented Apr 2, 2024

it seems that a considerable amount of the lost optimisations are due to the overflow checks which are now in place in the gas package.

Are we sure about this? Try benchmarking with overflow.*() usage removed. It could be something else, like, maybe if there are too much logic in the switch statement (in the machine for the opcode) it becomes slower... subtle things like that. If this is the case, then it might make sense to first look up the CPU cost per opcode, and then incrCPU() once before the switch statement... (I have no idea if this is true, I'm just spitballing here; my point is fine tuning optimizations like this can be very tricky, and sometimes you have to look at the resulting pseudo-assembly itself to understand what is happening).

Another example of something tricky... struct fields that are later in field order take longer to access. (this is actually true, or was true when I was first developing GNO). If a struct field is used often, such as the gas meter, it is better to declare it higher, rather than how it is how, where it is among the last of fields.

Another fact is that interface methods take much longer to call than concrete implementation methods. We probably want to use the tm2.GasMeter interface, but read the next point:

If it really is the call to incrCPU that is slow, there is something else we can do... we can batch increment CPUs... e.g. run 10, 20, or 100 GNO OpCode operations, accumulate the CPU cycles used, and then call incrCPU just once with the accumulated figure. We just need to remember to call incrCPU at the end of Machine.Run() before the batch call to incrCPU happens again for any remainder CPU cycles. But I would only do this once we are absolutely certain that it is the call to gasMeter that is slow.

We could implement a concrete BatchGasMeter and have the Machine hold a reference to *BatchGasMeter rather than an interface, and BatchGasMeter could "flush" the batch to the underlying tm2.GasMeter every 10/20/100 runs; this way the more expensive call to an interface can be amortized. And then we can call BatchGasMeter.Flush() to flush any remainders to the underlying tm2.GasMeter upon completion of Machine.Run. <-- probably will help.

But how much would this actually help in performance?

The phrase 'a considerable amount of' is based solely on limited benchmarking of the incrCPU test. The overflow is accounted for 27% of the gas consumption logic in incrCPU.

The benchmarking method and results discussed here focus on a Gno contract, emphasizing nested for-range loops and two-dimensional slice access, without considering I/O or any other factors. More information can be found here: https://github.com/gnolang/gno/blob/master/gnovm/pkg/gnolang/benchdata/matrix.gno

After conducting several more rounds of benchmark tests using the same method, there was a 7% increase observed, instead of the previously mentioned 20%. All of this increase can be attributed to the incrCPU() function. Of this 7%, 27% is from overflow checking, and the remainder is from the gas consumption function.

ns/op master fix_gasmeter fix_gasmeter.GasConsumption.NoOverFlowCheck fix_gasmeter.NoGasConsumption
BenchmarkBenchdata/fib.gno_param:4-8 6019 6350 6321 6013
BenchmarkBenchdata/fib.gno_param:8-8 43752 46529 46045 43868
BenchmarkBenchdata/fib.gno_param:16-8 2087269 2220203 2188264 2084855
BenchmarkBenchdata/loop.gno-8 156.4 173.1 170.2 156.4
BenchmarkBenchdata/matrix.gno_param:3-8 132421 140119 138349 130790
BenchmarkBenchdata/matrix.gno_param:4-8 369017 393032 387500 366743
BenchmarkBenchdata/matrix.gno_param:5-8 1347503 1440471 1418249 1340457
BenchmarkBenchdata/matrix.gno_param:6-8 6890003 7380312 7248520 6880262
7.12% 5.20% -0.14%

All of these are good points for us to consider during performance optimizations. How about we tackle performance by prioritizing IO > VM > GasMeter after we complete the basic functionalities and fixing critical issues?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📦 🌐 tendermint v2 Issues or PRs tm2 related 📦 ⛰️ gno.land Issues or PRs gno.land package related 📦 🤖 gnovm Issues or PRs gnovm related
Projects
Status: Done
Status: Done
Status: Done
Development

Successfully merging this pull request may close these issues.

7 participants