Ranking for function signatures. #23

Open
pipermerriam opened this issue May 23, 2018 · 18 comments
Labels
enhancement New feature or request

Comments

@pipermerriam

Here's a starting point for how we might objectively rank function signatures.

  • Let C be the set of all contract addresses whose bytecode contains the dispatch pattern that jumps to a JUMPDEST based on the first 4 bytes of the message data.

With just C we can establish a basic ranking for signatures. This ranking is, however, trivial to game.

  • Let T be the set of all transactions whose first 4 bytes of data match the signature.

With len(T) or sum(t.gas_price * t.gas for t in T) we should have a metric that is harder to game. I suspect this will be suitable until we find someone directly attacking the rankings, at which point we can iterate on it.

The question is how to get these metrics easily. I think there is a BigQuery database for most of the chain data that I may be able to get access to; otherwise, maybe someone else knows of a relational database with all the chain data?
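
A minimal sketch of the two metrics, assuming transactions are available as simple records. The `Tx` class, its field names, and the sample values are hypothetical and only illustrate the idea; they do not come from any particular chain-data source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Tx:
    """Hypothetical transaction record; field names are illustrative."""
    data: bytes     # raw calldata
    gas: int        # gas attached to the transaction
    gas_price: int  # price per unit of gas, in wei


def rank_selectors(txs):
    """Return {selector_hex: (tx_count, total_fee)} for txs carrying a selector."""
    counts = defaultdict(int)
    fees = defaultdict(int)
    for t in txs:
        if len(t.data) < 4:
            continue  # plain value transfers have no 4-byte selector
        sel = "0x" + t.data[:4].hex()
        counts[sel] += 1                 # contributes to len(T)
        fees[sel] += t.gas_price * t.gas  # contributes to sum(gas_price * gas)
    return {sel: (counts[sel], fees[sel]) for sel in counts}


# Made-up sample transactions, for illustration only.
txs = [
    Tx(bytes.fromhex("a9059cbb") + bytes(64), gas=60_000, gas_price=2),
    Tx(bytes.fromhex("a9059cbb") + bytes(64), gas=50_000, gas_price=3),
    Tx(bytes.fromhex("313ce567"), gas=30_000, gas_price=1),
    Tx(b"", gas=21_000, gas_price=1),
]
ranking = rank_selectors(txs)
```

Either element of the tuple can serve as the sort key; the fee-weighted one costs an attacker real ether to inflate, which is the point of preferring it over a plain count.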

@pipermerriam
Author

cc @holiman @ligi

@ligi
Member

ligi commented May 23, 2018

Sounds like a good idea! Just asked amberdata.io via twitter - maybe they can run this query.

@holiman
Contributor

holiman commented May 23, 2018

Why rank them? What's the objective? And why not just look at all method calls in the last N (100k) blocks and produce a toplist? That would also show the size of the supplied data, which can be checked against the ABI parameters.

@TrevorJTClarke

Seems like a cool idea! Speaking internally (Amberdata). Thanks for reaching out on twitter!
I could see a couple of useful cases, however, especially if you identified the top methods that cause issues or security flaws.

@ligi
Member

ligi commented May 23, 2018

Why rank them?

I think ranking them is a great way to know which one to show as a default. Perhaps even hide the lower-ranked one when the difference in rank is really big (e.g. in the case of the forced ERC-20 collisions).

@holiman
Contributor

holiman commented May 23, 2018

To weed out collisions, though, you need to validate calldata against the ABI params. And then the collision game just becomes a bit harder, since they need to use identical params.

Or would you use natspec to resolve conflicts?

@holiman
Contributor

holiman commented May 23, 2018

So what I meant by 'why rank' is that it won't solve collisions, not easily at least.

@holiman
Contributor

holiman commented May 23, 2018

The whole thing can be done pretty nicely with geth tracing, though. Except the natspec part, I guess.

@pipermerriam
Author

@holiman my thought is that

  1. ranking can at least partially solve artificial/malicious collisions.
  2. all things being equal (selector and function params), ranking allows for ordering candidates by a metric for which one is most likely to be the desired one.

@holiman
Contributor

holiman commented May 24, 2018

Re 2). How? How can you tell which sig is the one being called?

@pipermerriam
Author

@holiman I don't think we can know, but my thought is that we're just providing some contextual data and it's up to people who use this list to decide what to do with it.

Theoretically EthPM is the real fix for what this list is used for, at least in the case of wallets. If people use packages as their gateway to accessing contracts, then the wallet can just use the ABI data from the package instead of having to use a possibly wrong reverse lookup from this list.

@holiman
Contributor

holiman commented Jun 1, 2018

Oh wow. I remembered having implemented a 4byte-tracer long ago, and thought it may have gotten lost since then. But it's actually sitting right here: https://github.com/ethereum/go-ethereum/blob/master/eth/tracers/internal/tracers/4byte_tracer.js

So it's fairly trivial to just let it run on a few thousand blocks, and we can see how good coverage 4byte has right now, and what the big blind spots are.

EDIT: linked to the wrong file.

@holiman
Contributor

holiman commented Jun 1, 2018

Usage

> debug.traceTransaction( "0x214e597e35da083692f5386141e69f47e973b2c56e7a8073b1ea08fd7571e9de", {tracer: "4byteTracer"})
   {
     0x27dc297e-128: 1,
     0x38cc4831-0: 2,
     0x524f3889-96: 1,
     0xadf59f99-288: 1,
     0xc281d19e-0: 1
   }

The tracer output is a dictionary where each key is <4bytesig>-<length of CALLDATA> and each value is the number of occurrences.
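
A rough sketch of how per-transaction tracer results could be combined into one tally. It assumes each result arrives as a mapping from string keys such as "0x27dc297e-128" to counts (for example after JSON decoding); the helper names `merge_traces` and `split_key` are made up for this example.

```python
from collections import Counter


def merge_traces(traces):
    """Sum the occurrence counts of several tracer results into one Counter."""
    total = Counter()
    for trace in traces:
        total.update(trace)  # Counter.update adds counts instead of replacing
    return total


def split_key(key):
    """Split '0xadf59f99-288' into ('0xadf59f99', 288) at the last dash."""
    selector, _, length = key.rpartition("-")
    return selector, int(length)


# The result shown above, written as a Python dict with string keys.
trace = {
    "0x27dc297e-128": 1,
    "0x38cc4831-0": 2,
    "0x524f3889-96": 1,
    "0xadf59f99-288": 1,
    "0xc281d19e-0": 1,
}
total = merge_traces([trace, trace])
```

Keeping the length in the key, rather than summing per selector, preserves the information needed later to compare the supplied data size with a candidate signature.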

It's also possible to trace a range of blocks, and continually get pushes when new blocks arrive, if you use ipc+subscription.

@holiman
Contributor

holiman commented Jun 9, 2018

Doing a dump now; it's going to take a few more hours. In the meantime, I've downloaded the data for about 500K (top-level) transactions (the data also covers internal CALL-variants) and done some preliminary analysis. Here's the top 50:

 198053 0xa9059cbb-64
  20939 0xef343588-576
  10129 0x23b872dd-96
   9588 0xd0e30db0-0
   7257 0x095ea7b3-64
   6514 0xf2c298be-128
   5859 0x70a08231-32
   5856 0x2295115b-256
   5500 0x8da5cb5b-0
   5284 0x88c2a0bf-32
   5000 0xb9b8af0b-0
   5000 0xb269681d-0
   5000 0x6ea056a9-64
   4994 0x97dc97cb-0
   4994 0x3c18d318-32
   4209 0x338b5dea-64
   4181 0x00000000-16
   4052 0x28090abb-128
   3374 0x1a695230-32
   3182 0x8d12a197-184
   2891 0xf7d8c883-64
   2698 0x16c72721-0
   2633 0x0a19b14a-352
   2591 0x3ccfd60b-0
   2402 0x0f2c9329-64
   2363 0xf088d547-32
   2120 0x688abbf7-32
   1979 0xdd62ed3e-64
   1880 0x40c10f19-64
   1878 0xc31e0547-0
   1728 0x18160ddd-0
   1645 0x0d9f5aed-96
   1624 0x38cc4831-0
   1505 0x00000000-112
   1486 0x1134269a-544
   1395 0x313ce567-0
   1356 0x867904b4-64
   1350 0x4b75f54f-0
   1345 0x5e5144eb-128
   1327 0x29a00e7c-128
   1323 0x49f9b0f7-128
   1303 0xa24835d1-64
   1295 0x14c9035e-512
   1292 0xd0679d34-64
   1276 0xf7654176-0
   1235 0x00000001-32
   1217 0x2e1a7d4d-32
   1188 0x9e281a98-64
   1159 0xcc135813-32
   1060 0x278b8c0e-288

I'll need to download a new dump of the 4byte db, so I can correlate the signatures with known ones. Then we could also get a toplist of unknowns, and a list of mismatches, where the supplied data length does not match any possible length of the signature in the db.
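
One way the correlation step might look, as a sketch only. It assumes the db dump has already been reduced to a mapping from selector to the set of argument lengths its known signatures allow; that `known_lengths` structure and the `classify` helper are hypothetical. The example db deliberately holds only two selectors, so anything else lands in the unknown bucket regardless of what the real directory contains, and the "0xa9059cbb-32" entry is invented to show a mismatch.

```python
def classify(counts, known_lengths):
    """Split tracer counts into matched, unknown-selector and length-mismatch buckets."""
    matched, unknown, mismatched = {}, {}, {}
    for key, n in counts.items():
        selector, _, length = key.rpartition("-")
        if selector not in known_lengths:
            unknown[key] = n      # selector missing from the db
        elif int(length) in known_lengths[selector]:
            matched[key] = n      # length fits a known signature
        else:
            mismatched[key] = n   # selector known, data length fits no signature
    return matched, unknown, mismatched


counts = {
    "0xa9059cbb-64": 198053,  # from the toplist above
    "0xef343588-576": 20939,  # from the toplist above
    "0x313ce567-0": 1395,     # from the toplist above
    "0xa9059cbb-32": 7,       # invented entry, illustrates a mismatch
}
known_lengths = {
    "0xa9059cbb": {64},     # transfer(address,uint256)
    "0x313ce567": {0, 64},  # decimals() and a colliding two-argument signature
}
matched, unknown, mismatched = classify(counts, known_lengths)
top_unknown = sorted(unknown.items(), key=lambda kv: kv[1], reverse=True)
```

Signatures with dynamic arguments do not have a single allowed length, so a real version would need a looser check for those than set membership.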

@holiman
Contributor

holiman commented Jun 9, 2018

@pipermerriam, you wanted to weigh in the gas and gasprice. Do you really think that is needed?
Also, I still don't see how this will solve anything regarding duplicates, but I think these are good stats to have for measuring how good the directory's coverage is.

@ligi
Member

ligi commented Jun 11, 2018

@holiman 👍

I do not think the gas and gas-price is needed.
Can I have the full list somewhere? With the calldata-size I can already resolve some conflicts - e.g. with:
available_assert_time(uint16,uint64);decimals()

so 1395 0x313ce567-0 shows me that decimals() is the most likely candidate
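
The calldata-size check can be sketched roughly as follows. This assumes arguments are ABI-encoded and that each static elementary type occupies one 32-byte word; signatures with dynamic types, arrays or tuples are not ruled out because their size is not fixed. The function names are made up for this example.

```python
import re

# Static elementary types: each occupies exactly one 32-byte word when encoded.
STATIC_TYPE = re.compile(r"^(u?int\d*|address|bool|bytes\d+)$")


def static_arg_size(signature):
    """Return the encoded argument size in bytes, or None if it is not fixed."""
    args = signature[signature.index("(") + 1 : signature.rindex(")")]
    if not args:
        return 0
    types = args.split(",")
    if not all(STATIC_TYPE.match(t) for t in types):
        return None  # dynamic type, array or tuple: size cannot be derived here
    return 32 * len(types)


def candidates_for(observed_size, signatures):
    """Keep signatures whose size matches or cannot be determined."""
    return [s for s in signatures if static_arg_size(s) in (observed_size, None)]


colliding = ["available_assert_time(uint16,uint64)", "decimals()"]
likely = candidates_for(0, colliding)
```

For the observed key 0x313ce567-0 this leaves only decimals(), since the two-argument signature would need 64 bytes of data.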

But I was just thinking: if there is a collision that is really not forced but happened by chance, then this is really problematic. The method will never really have the chance to change its position if we hide it in the dark.
Perhaps @yann300 can show a warning in remix if a function signature resolves to a known 4byte?

In the end we really need natspec - and then shame contracts that did not upload a natspec to swarm by making a big red warning sign on all transactions that interact with such contracts.

@holiman
Contributor

holiman commented Jun 11, 2018

@ligi the full JSON is 105M, the raw list of signatures is 40M, and the sorted and counted list is 6.5M (1.4M zipped). The sorted and counted list is here:
signatures_sorted_counted.lst.gz

ligi mentioned this issue Jun 11, 2018
ligi added the enhancement (New feature or request) label May 7, 2019
@kumavis

kumavis commented May 4, 2022

for comparing 4byte collisions, preferring the one with the simplest signature (shortest name, fewest arguments) is a decent heuristic, though I'm not exactly sure how best to unify those two dimensions (name, arguments). could maybe just measure the amount of entropy in the stringified function signature and take the smallest one (?)
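
One possible reading of the entropy idea, as a sketch: take the per-character Shannon entropy of the signature string and multiply by its length, giving a total in bits. This is only one interpretation of "amount of entropy", and the function names are made up for this example.

```python
import math
from collections import Counter


def total_entropy_bits(text):
    """Per-character Shannon entropy of text, multiplied by its length."""
    counts = Counter(text)
    n = len(text)
    per_char = -sum(c / n * math.log2(c / n) for c in counts.values())
    return per_char * n


def simplest(signatures):
    """Pick the signature with the lowest total entropy; ties fall back to length, then text."""
    return min(signatures, key=lambda s: (total_entropy_bits(s), len(s), s))


choice = simplest(["available_assert_time(uint16,uint64)", "decimals()"])
```

Because the total grows with both the length of the name and the number of argument types, it folds the two dimensions into a single number, at the cost of being easy to game with short, repetitive names.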


5 participants