nip45: add hyperloglog relay response #1561
base: master
Conversation
Well, that's pretty mind-blowing. 🤯 🤔 It weirdly incentivizes PoW in the middle of the ID. The relay is responsible for preventing that. Existing measures (WoT) might be enough for some relays. But the default behavior is to trust not just the relay operator but also the users of that relay not to PoW the middle of the ID. |
Why shouldn't the relay just generate a random number every time instead of inspecting the event itself? Then it's not idempotent but it prevents abuse. |
You can't generate a random number because each distinct item must yield the same number, therefore you need a hash of the item. The event id is already that hash, but yes, people can do PoW to make their reaction count more or something like that. I considered letting each relay pick a different random offset of the id to count from, which would mitigate this, but in my tests that often overshoots the results by a lot when merging from multiple different sources that have used different offsets. One thing we can do is make all relays use the same offset for each query by making the offset be given by a hash of the filter. Another idea, maybe better, is to use a fixed offset like proposed here, but instead of using the event id, use the author pubkey. Although this makes it so some pubkeys will always have more weight than others when liking posts or following people, which is like creating a funny caste system within Nostr. |
OK, I think I got the solution: make the offset dependent on the filter, not the subscription id. Something like
Just checking for the first `#e`, `#p`, `#a` or `#q` tag (in this order) will cover 99% of use cases, and if there is no unambiguous way to determine the offset, use a predefined hardcoded value. |
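A minimal sketch of that tag-checking rule, in Python. The function name and the dict-shaped filter are illustrative, and the reduction used (32nd ASCII character taken as a byte, modulo 24, with a hardcoded fallback of 16) is the one the NIP text in this PR proposes:

```python
def offset_from_filter(filter_: dict) -> int:
    # Check tags in a fixed order so every relay derives the same
    # offset for the same query (required for the merging math to work).
    for tag in ("#e", "#p", "#a", "#q"):
        values = filter_.get(tag)
        if values and len(values) == 1 and len(values[0]) > 32:
            # Reduction proposed in this PR's NIP text: take the 32nd
            # ASCII character of the tag value as a byte, modulo 24.
            return ord(values[0][32]) % 24
    return 16  # predefined hardcoded fallback for ambiguous filters
```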
wow, nice! |
Very cool! But I do think this is highly gameable because Nostr is so open and the ID is not random at all. Even with the latest filter-based offsets, it's possible to run a PoW procedure on expected filters and significantly change the count for that filter. The procedure must use the same offset for all relays for the merging math to work. How about a sha256(id + subID) with the instruction that the subscription ID must be random and the same for all counting relays? |
If you're using the subid there is no need to hash anything, you can just assume it will be random enough. If it's not that's the client's choice and their problem. But I would still like something deterministic to allow for the storageless counter relay. I think the latest procedure plus using the event |
This is very cool. I'm not sure the mod 24 part is doing what you intended. The hex characters are lowercase so 'a'-'f' gives you 1-6 overlapping with what '1'-'6' gives you... which doesn't hurt but is less effective I think. |
Maybe something like this instead:
EDIT: oops, I guess you can't have an offset that far in, or there are no zeroes to count. Anyhow, you get the idea.
True. But if someone manages to get a pubkey with a hell of a lot of zeroes, many things that they react to will seem to be highly reacted to. I think one more step fixes this. Instead of counting zeroes in the pubkey directly, you count zeroes in the XOR of the pubkey and some hash that is generated from the filter query. EDIT: and in this case we no longer need an offset. The first 8 bits of the hash are the bucket, the next 248 bits could be a count of zeroes but we really probably shouldn't bother counting past the next 32 bits... or 64 if we are bold. The hash could be sha256() of the filter somehow made reproducible (e.g. not just JSON), maybe the '#e' tag contents, but maybe other parts of the filter matter? limit should probably be discarded. |
That's why I think the hash is needed (subid may not be random). It would get rid of any PoW made for the event ID and whatever comes in as subID creates enough of a variance that the whole ID changes. |
@vitorpamplona If we take a hash they will just PoW the hash. |
Take a look at HLL-RB (HyperLogLog with random buckets). I think relays may be able to use a random number... even for the same query twice. EDIT: I think I have been fooled by A.I. I can't find any such thing except as the output of AI. ;-( I was thinking we could use randomness, but that the harmonic mean might no longer be the right way to combine. |
We can get the offset from something like a xor between the pubkey and some part of the filter (I'd rather not take the filter JSON verbatim because that would have been preprocessed already in most cases, it would complicate implementations), then read the stuff from the event id based on that offset. |
I think it's impossible to create a deterministic identifier that can't be mined. |
Anonymous Bitcoiners would offer Nostr SEO packages where you zap them 5000 sats and they do proof of work on hyperloglog reactions they publish so you always have thousands of likes and reposts on everything. I suppose they can do that a lot already by just generating thousands of events. |
Yep, that's why I suggested hash(event id + subid). Sub id just needs to be random enough. |
We are talking about mining fresh pubkeys that would like your post. Relays can use their spam prevention logic to reject such fresh pubkeys. I think that adding randomness would not mess up the count (on a given relay) but it would make it impossible to combine counts between multiple relays (if you naively did, you would overshoot). |
Exactly, so we are not adding any new weaknesses. Relying on reaction counts from relays that will accept anything is already a very bad idea.
Agreed. |
@mikedilger I don't fully get this, but the mod 24 was in relation to the bytes, not to the hex characters. It was maximum 24 in order to leave room for 8 bytes at the end of the id/pubkey from which we would read from. I guess we should also skip the first 8 characters since they're so often used for generic PoW, that leaves us 16 possible values for the offset. |
Ok then I think it is worded in a confusing way. It sounded like you were taking the hex characters as ascii, and doing a mod 24 on that. |
I couldn't get this to work when trying to translate the original paper, https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf (see the top of page 140). The original 2007 algorithm uses the position of the leftmost 1-bit, not a plain count of zeroes. Relays could each use their own different estimation algorithms, but the 256 counts will have to be normalized to either be a count of zeroes, or the position of the first 1 (a count of zeroes +1) in order to be interoperable. And it seems to me that a count of zeroes loses data. |
There are different implementations out there, I tried to do the simplest possible way that would work and wrote it on the NIP. We will have to agree on something, we can't have each relay implementing it in a different way.
This is what my implementation does too, it counts zeroes and adds 1. |
This is just a restatement of my prior comments as edits
45.md (Outdated)

> This section describes the steps a relay should take in order to return HLL values to clients.
>
> 1. Upon receiving a filter, if it has a single `#e`, `#p`, `#a` or `#q` item, read its 32nd ASCII character as a byte and take its modulo over 24 to obtain an `offset` -- in the unlikely case that the filter doesn't meet these conditions, set `offset` to the number 16;
- Upon receiving a filter, if it has a `#e`, `#p`, `#a` or `#q` item with a single value (checked in that order in case there are multiple of these), then convert that value into 32 bytes (for the 'a' tag use the pubkey part only) and take the 16th byte modulo 24 as an `offset`. In the unlikely case that the filter doesn't meet these conditions, set `offset` to the number 16;
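The byte-based variant suggested here could be sketched like so (hypothetical helper name; assumes hex-encoded ids/pubkeys as in NIP-01):

```python
def offset_from_tag_value(value: str) -> int:
    # For an 'a' tag ("kind:pubkey:d-tag"), use only the pubkey part.
    if ":" in value:
        value = value.split(":")[1]
    raw = bytes.fromhex(value)  # 32 bytes for an event id or pubkey
    return raw[16] % 24         # the 16th byte, modulo 24
```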
eh, for the 'a' tag I guess it doesn't matter, the kind number isn't too long so chars 32 and 33 are still inside the hex part.
So instead of fixing this I doubled down on my error and took the opportunity to exclude the first 8 bytes of all pubkeys. If you say this is stupid I'll do as you suggested.
Fine by me. The offset isn't that important. The original algorithm uses offset=0 on hash outputs.
Because NIP-45 COUNT counts the number of events, but hll counts the number of distinct pubkeys, I think perhaps the key should be "hll_pubkeys" instead of just "hll" to at least give a hint that it isn't counting the events that match the filter, and to leave room for an "hll" that does. |
I think in all cases so far what we actually want is distinct pubkeys, so we can be pragmatic about it. If we ever make a different thing for the number of distinct ids we can come up with a different name. I would prefer a JSON array over an object with keys because then we wouldn't have these discussions. |
You aren't wrong. |
The usage of registers has nothing to do with IP addresses. Any user will be able to skyrocket the number if they can generate an infinite number of pubkeys and issue likes/follows/etc with all these pubkeys. If you can generate 3000 pubkeys you can make the follow count go up by ~3000. That is true today, will be true with this change, will always be true. We have no way around it except
If HLL values are cached on the relay side or counted at request time is irrelevant for the client and the results will be exactly the same. |
Edit: got triggered by a cat avatar and wrote too much. Don't want to flood the PR discussion. Edit: Lol I thought semisol said I was wrong |
In order to game this system, you have to produce close to 256 pubkeys each with effective PoW of about 24 (to reach 8 million likes, for example) and these newly mined pubkeys would have to be accepted by the target relay and not seen as spam-pubkeys. IMHO this is not too hard to game. It just takes a little effort and ingenuity. But how severe is that? So what if it looks like millions of people upvoted you. Reactions are stupid anyways. It's not like they are forging your events or reading your DMs. And if clients look at the first results from the COUNT commands they can quite easily detect the HLL abuse. Still, this is quite easy to repair, as I've said, by having some kind of randomness (perhaps supplied by the client with the COUNT, which is best to make it consistent across relays without being known to the attacker gaming things). But again the downside is that relays can't cache things anymore (or they can, but probably won't get cache hits anymore) and they have to keep the original reaction events. But at least clients don't need those events anymore. I don't have a strong opinion either way, but I lean towards having the client-supplied random thing, because I have a penchant for reliability and trustability more so than efficiency. I think I've said all I can say about this so I'll stop commenting on it until or unless we start debating particulars after this choice is made. |
Why can't the relay count and ALSO run HLL to make sure values match? Relays MUST guarantee that the HLL is under x% of the actual number and not return an HLL if it goes over. Relays can do whatever they want to do to fix the HLL estimation. This seems like a better protection than these shenanigans to try to block gamification. |
But having some sort of post-processing of the ID with a client supplied value already fixes this. They would need to also target the attack towards 1 client nonce only which they don't know. What you said basically makes it able to force clients to fall back to inefficient methods and deny service to the COUNT HLL functionality. |
Yep, but the client cannot fix the HLL if it's wrong. The relay can automatically find which event IDs are creating the distortion and fix it (ignore them) before sending it back. |
With the client nonce system, it does not need to be fixed at all.
For this reason HLL uses a harmonic mean. For more than a few registers to get an abnormal value would be very hard, unless there were a lot of events (and then the count would be accurate). Combining the event ID or the pubkey of the author with a client nonce means it is not possible to intentionally bias pubkeys or event IDs. |
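The nonce-mixing step being argued for here is tiny; a sketch (not part of the NIP, names illustrative):

```python
import hashlib

def register_input(pubkey: bytes, client_nonce: bytes) -> bytes:
    # Mixing an unpredictable client-supplied nonce into the hash means
    # pre-mined pubkeys lose their bias: the bucket and zero-run they
    # map to change with every nonce the attacker couldn't predict.
    return hashlib.sha256(pubkey + client_nonce).digest()
```

Every relay answering the same COUNT would have to receive and use the same nonce for the merging math to still work.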
But I don't see any real nonce in this spec. Or are we back to using the subscription ID as a random nonce? |
My proposal is to add one. |
I am in favor of the client-based nonce. It's the only way to keep the counter usable. |
Came to the conclusion that both approaches should co-exist. A relay that drops events: A relay that doesn't drop events (or drops with lower frequency): Client decides which key to use when merging counts depending on what it receives from the visited relays, e.g. if client talked to 3 relays and just one of them filled the |
Why not 3 ways? 4 ways? 17 ways? Why have a standard at all? Let people do what they want! |
Yeah, I don't know what the double approach offers. If there are outliers, the whole thing is moot. Clients can't do anything about it. It's ok if relays just fix it. We are already trusting them on the count anyway... |
I'm aware of the principle that there should preferably be only one way of doing things. My reasoning was that some relays can freely discard events and still provide a less reliable (can be gamed, can't subtract items), though cacheable, hll count. |
OK, I tried to build the cache thing and it's too complicated. Nostr's query system is too flexible. It will never work to support open-ended queries; it will end up using more space than storing the actual events. It's tempting to just declare caching a failed idea and standardize hyperloglog without that possibility, but it would be so good to not have to store millions of reactions and all that stuff! Now I wonder if we should standardize just some of the "basic" queries to support cached hyperloglog, like number of reactions, number of quotes, number of comments, number of followers -- so these and only these could be cached, and for all the rest (and for all the subqueries of these or other complexities) relays have to either count at runtime or not fulfill at all? What are the "basic" count queries, besides the ones I listed above? Or am I crazy to insist on caching? |
This seems very similar to negentropy. Caching is just not viable. |
I think the idea of trying to avoid storing of many low-value events on relays is laudable. I think we should spend more time trying to figure that out. If we still need to return a normal COUNT though, we will at least need to store the ID prefixes to avoid counting duplicates. And we will need a normal COUNT result in order to detect gaming of the system. In order for relays to keep these HLL counts without keeping the events, they need to start counting them for well-specified filters before anybody even asks for one. So I agree that we need to specify well-known filters representing things to count (e.g. number of reactions, number of followers). And in that case we cannot hash the event value (pubkey or id) with a user-supplied nonce. And since all parties need to use the same method of generating HLLs for them to combine properly, the system will be well-known and thus gameable. But outliers can be detected and rejected, preventing most gaming. I'm almost in favor of doing this both ways... but no more. Not 17 ways. I could be persuaded to do it 2.718 ways. |
Or instead of keeping all the ID prefixes, you keep a bloom filter and when a "definitely not in set" happens you add 1 but when a "possibly in set" happens you add the false positive rate instead. |
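The Bloom-filter trick suggested here can be sketched as follows (class name and parameters are illustrative, not from the NIP): on a "definitely not in set" answer we add 1, and on a "possibly in set" answer we add the current false-positive rate instead.

```python
import hashlib
import math

class ApproxDistinctCounter:
    """Count distinct ids without storing them, using a Bloom filter."""

    def __init__(self, bits: int = 1 << 20, hashes: int = 4):
        self.bits = bits
        self.hashes = hashes
        self.bitmap = bytearray(bits // 8)
        self.count = 0.0   # the approximate distinct count
        self.items = 0     # distinct items seen, for the fp-rate estimate

    def _positions(self, item: bytes):
        # Derive k bit positions by hashing the item with a counter byte.
        for i in range(self.hashes):
            h = hashlib.sha256(item + bytes([i])).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def _fp_rate(self) -> float:
        # Classic Bloom approximation: (1 - e^(-kn/m))^k
        return (1 - math.exp(-self.hashes * self.items / self.bits)) ** self.hashes

    def add(self, item: bytes) -> None:
        positions = list(self._positions(item))
        is_new = any((self.bitmap[p // 8] >> (p % 8)) & 1 == 0 for p in positions)
        if is_new:               # "definitely not in set"
            self.count += 1.0
            self.items += 1
        else:                    # "possibly in set": probably a duplicate
            self.count += self._fp_rate()
        for p in positions:
            self.bitmap[p // 8] |= 1 << (p % 8)
```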
Kind of. Caching should be done at the filter level (possibly breaking down filters into smaller ones or caching more general filters), not the final result level. |
Storing reactions is not a big issue: dedicated indexes and encoding schemes can be made for high-volume optimizable events, which will mean you can store about 5M reactions in 1GB. This does not include some other methods you could use such as public key lookup tables that can cut the size even further. Relays may also implement sampling and adjust the HLL result accordingly (as it is likely that the relay will know a rough count of the amount of events it may have to explore): you don't need to add every event to the HLL sketch, only some, and add a correction factor to every register. |
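As a quick sanity check on the "5M reactions in 1GB" figure (my arithmetic, not from the thread):

```python
# 1 GiB spread over 5 million reactions:
bytes_per_reaction = (1 * 1024**3) // 5_000_000  # ≈ 214 bytes each
# A compact binary record (32-byte event id + 32-byte pubkey + 64-byte
# signature + 32-byte reacted-to id + timestamp/kind) is ~170 bytes,
# so ~214 bytes/reaction is plausible even before pubkey lookup tables.
```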
HLL is exactly such a thing already and defining what goes into this dedicated scheme and what is ignored is the question I posed above: if you want a limited functionality for specific use cases then HLL caching can be very good, otherwise nothing is possible. |
Anyway, I think one solution is to define HLL to be returned ONLY in the following queries (exact templates), at least for now:
All other queries should not return HLL responses. And then whenever someone has another use case we add it to the list. In the queries above it is declared that |
You have not solved the problem that this is open to manipulation |
A lot of things are possible. |
I am pretty sure I did: do not trust counts from relays that accept events from random keys. It's literally the same problem with or without HLL. |
So, this NIP is entirely useless? Good to know. Should a new user's reaction count? Are they a "random key"? How do you distinguish new users from attempts at manipulation if they are spread over multiple IPs? It is inherently costly to manipulate the reaction count as each event increases the risk of detection and uses up finite resources (IP addresses), but HLL cuts down the cost from |
Hey semisol just repeated what i said. Lol some comments above I got pissed thinking they said I was wrong without any additional argument. But now I see semisol actually said "You aren't wrong." Lol I'm really distracted these days edit: so as not to waste this comment, I will just say I hope any version of this get merged and become popular so that as a side effect most relays implement the good ol nip45 count, which is already useful imo |
If our only tool for fighting spam is IP address detection then I'm sorry for us. I don't even consider IP as a way to identify spam at all. Much better solutions are paid relays, WoT relays, captcha relays, relays that use any other algorithm for checking what is a "valid" key like activity tracking or whatever, or any other form of whitelisted relays. |
Here's a nice colorful video explanation of HyperLogLog: https://www.youtube.com/watch?v=lJYufx0bfpw
And here's a very interesting article with explanations, graphs and other stuff: http://antirez.com/news/75
If relays implement this we can finally get follower counts that do not suck and without having to use a single relay (aka relay.nostr.band) as the global source of truth for the entire network -- at the same time as we save the world by consuming an incomparably small fraction of the bandwidth.
Even if one were to download 2 reaction events in order to display a silly reaction count number in a UI that would already be using more bytes than this HLL value does (actually considering deflate compression the COUNT response with the HLL value is already smaller than a single reaction EVENT response).
This requires trusting relays to not lie about the counts and the HLL values, but this NIP always required that anyway, so no change there.
HyperLogLog can be implemented in multiple ways, with different parameters and whatnot. Luckily most of the customizations (for example, the differences between HyperLogLog++ and HyperLogLog) can be applied at the final step, so it is a client choice. This NIP only describes the part that is needed for interoperability, which is how relays should compute the values and then return them to clients.
Because implementations would have to agree on parameters such as the number of registers to use, this NIP also fixes that number at 256 for simplicity's sake (256 is the number of values one byte can take, which makes it simpler to implement) and also because it is a reasonable amount.
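Assuming every relay used the same offset rule for the query, merging the 256-register sketches returned by several relays is lossless: take the element-wise maximum of the registers before estimating. A sketch of that client-side step (the NIP does not mandate any particular API):

```python
def merge(*register_sets: bytes) -> bytearray:
    # HLL sketches combine by element-wise max; this only yields the
    # right estimate if all relays bucketed with the same offset.
    merged = bytearray(256)
    for regs in register_sets:
        for i, r in enumerate(regs):
            if r > merged[i]:
                merged[i] = r
    return merged
```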
These are some random estimations I did, to showcase how efficient those 256 bytes can be:
As you can see they are almost perfect for small counts, but still pretty good for giant counts.