nip45: add hyperloglog relay response #1561

Open · wants to merge 4 commits into master
Conversation

@fiatjaf (Member) commented Nov 3, 2024

Here's a nice colorful video explanation of HyperLogLog: https://www.youtube.com/watch?v=lJYufx0bfpw
And here's a very interesting article with explanations, graphs and other stuff: http://antirez.com/news/75

If relays implement this we can finally get follower counts that do not suck, without having to use a single relay (aka relay.nostr.band) as the global source of truth for the entire network -- and at the same time we save the world by consuming an incomparably small fraction of the bandwidth.

Even if one were to download just 2 reaction events in order to display a silly reaction count number in a UI, that would already use more bytes than this HLL value does (in fact, considering deflate compression, the COUNT response with the HLL value is already smaller than a single reaction EVENT response).

This requires trusting relays to not lie about the counts and the HLL values, but this NIP always required that anyway, so no change there.


HyperLogLog can be implemented in multiple ways, with different parameters and whatnot. Luckily most of the customizations (for example, the differences between HyperLogLog++ and HyperLogLog) can be applied at the final step, so they are a client choice. This NIP only describes the part that is needed for interoperability, which is how relays should compute the values and then return them to clients.

Because implementations have to agree on parameters such as the number of registers to use, this NIP also fixes that number at 256 for simplicity's sake (it keeps implementations simple, since 256 is the number of values one byte can hold) and also because it is a reasonable amount.
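
For illustration, here is a minimal sketch (in Go, since the implementation discussed in this thread is in Go) of the register-update step a relay could run per matching event; the `Sketch` type, the `offset` parameter and the exact byte layout are assumptions for the example, not normative:

```go
import (
	"encoding/binary"
	"math/bits"
)

// Sketch holds the 256 one-byte registers a relay returns to clients.
type Sketch struct {
	registers [256]uint8
}

// Add feeds one distinct item (a 32-byte event id or pubkey) into the sketch.
// One byte of the item selects the register; the following 8 bytes are read as
// a number whose leading-zero count (+1, i.e. the position of the first 1 bit)
// becomes the candidate register value. offset must be <= 23 so that 8 bytes
// remain to be read.
func (s *Sketch) Add(item [32]byte, offset int) {
	register := item[offset]                      // which of the 256 registers
	x := binary.BigEndian.Uint64(item[offset+1:]) // the following 8 bytes
	rho := uint8(bits.LeadingZeros64(x)) + 1      // position of the first 1 bit
	if rho > s.registers[register] {
		s.registers[register] = rho
	}
}
```

The 256 registers are what would travel (e.g. hex-encoded) in the COUNT response; the final estimation formula remains a client-side choice, as noted above.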


These are some random estimations I did, to showcase how efficient those 256 bytes can be:

real count: 2 estimation: 2
real count: 4 estimation: 4
real count: 6 estimation: 6
real count: 7 estimation: 7
real count: 12 estimation: 12
real count: 15 estimation: 15
real count: 22 estimation: 20
real count: 36 estimation: 36
real count: 44 estimation: 43
real count: 47 estimation: 44
real count: 64 estimation: 65
real count: 77 estimation: 72
real count: 89 estimation: 88
real count: 95 estimation: 93
real count: 104 estimation: 101
real count: 116 estimation: 113
real count: 122 estimation: 131
real count: 144 estimation: 145
real count: 150 estimation: 154
real count: 199 estimation: 196
real count: 300 estimation: 282
real count: 350 estimation: 371
real count: 400 estimation: 428
real count: 500 estimation: 468
real count: 600 estimation: 595
real count: 777 estimation: 848
real count: 922 estimation: 922
real count: 1000 estimation: 978
real count: 1500 estimation: 1599
real count: 2222 estimation: 2361
real count: 9999 estimation: 10650
real count: 13600 estimation: 13528
real count: 80000 estimation: 73439
real count: 133333 estimation: 135973
real count: 200000 estimation: 189470

As you can see they are almost perfect for small counts, but still pretty good for giant counts.

@alexgleason (Member) commented:

Well, that's pretty mind-blowing. 🤯

🤔

It weirdly incentivizes PoW on the middle of the ID. The relay is responsible for preventing that. Existing measures (WoT) might be enough for some relays. But the default behavior is to trust not just the relay operator but also the users of that relay not to PoW the middle of the ID.

@alexgleason (Member) commented:

Why shouldn't the relay just generate a random number every time instead of inspecting the event itself? Then it's not idempotent but it prevents abuse.

@fiatjaf (Member, author) commented Nov 3, 2024

You can't generate a random number because each distinct item must yield the same number, therefore you need a hash of the item. The event id is already that hash, but yes, people can do PoW to make their reaction count more or something like that.

I considered letting each relay pick a different random offset of the id to count from, which would mitigate this, but in my tests that often overshoots the results by a lot when merging from multiple different sources that have used different offsets.

One thing we can do is make all relays use the same offset for each query by deriving the offset from a hash of the <subscription_id>, for example (and then clients have to ensure they use the same subscription id on all relays they query). That is sad because it negates the possibility of having relays that only store an HLL value for each query and discard all events.

Another idea, maybe better, is to use a fixed offset like proposed here, but instead of using the event id, use the author pubkey. Although this makes it so some pubkeys will always have more weight than others when liking posts or following people, which is like creating a funny caste system within Nostr.

@fiatjaf (Member, author) commented Nov 3, 2024

OK, I think I got the solution: make the offset dependent on the filter, not the subscription id. Something like

  • if you are querying for reactions to event "abc123" the query will be {"#e": ["abc123"], "kinds": [7]}, then the offset would be given by "c1", for example.
  • if you are querying for followers of pubkey "mno456" the query will be {"#p": ["mno456"], "kinds": [3]}, then the offset would be given by "o4", for example.
  • and so on.

Just checking for the first "#e", "#p", "#a" or "#q" tag (in this order) will cover 99% of use cases, and if there is no unambiguous way to determine the offset, use a predefined hardcoded value.
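
A rough sketch of that rule, assuming the tag value is a 64-character hex id/pubkey and that two hex characters from its middle are read as one byte (the exact character positions and the fallback value are placeholders, since these details were still being debated):

```go
import "strconv"

// offsetForFilter derives a deterministic offset from the first single-valued
// "#e", "#p", "#a" or "#q" entry of the COUNT filter.
func offsetForFilter(tags map[string][]string) int {
	for _, name := range []string{"#e", "#p", "#a", "#q"} {
		values := tags[name]
		if len(values) != 1 || len(values[0]) < 34 {
			continue
		}
		// two hex characters from the middle of the referenced id/pubkey
		b, err := strconv.ParseUint(values[0][32:34], 16, 8)
		if err != nil {
			continue
		}
		return int(b % 24) // capped so 8 bytes remain at the end of the id
	}
	return 16 // predefined hardcoded value when the filter is ambiguous
}
```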

@v0l (Member) commented Nov 4, 2024

wow, nice!

@vitorpamplona (Collaborator) commented Nov 4, 2024

Very cool! But I do think this is highly gameable because Nostr is so open and the ID is not random at all. Even with the latest filter-based offsets, it's possible to run a PoW procedure on expected filters and significantly change the count for that filter.

The procedure must use the same offset for all relays for the merging math to work.

How about a sha256(id + subID) with the instruction that the subscription ID must be random and the same for all counting relays?

@fiatjaf (Member, author) commented Nov 4, 2024

If you're using the subid there is no need to hash anything, you can just assume it will be random enough. If it's not that's the client's choice and their problem.

But I would still like something deterministic to allow for the storageless counter relay. I think the latest procedure plus using the event pubkey instead of event id is sufficient. People can't change their pubkey after the fact just to game one specific like count. And if the relay is going to accept events from completely new and random pubkeys then the counts from that relay are already borked and meaningless anyway.

@mikedilger (Contributor) commented:

This is very cool.

I'm not sure the mod 24 part is doing what you intended. The hex characters are lowercase so 'a'-'f' gives you 1-6 overlapping with what '1'-'6' gives you... which doesn't hurt but is less effective I think.

@mikedilger (Contributor) commented Nov 4, 2024

Maybe something like this instead:

byte offset in pubkey = filterbytes[16] AND 31   (16th byte of the filter data, mapped to a number from 0-31).

EDIT: oops, I guess you can't have an offset that far in, or there are no zeroes to count. Anyhow, you get the idea.

People can't change their pubkey after the fact just to game one specific like count

True. But if someone manages to get a pubkey with a hell of a lot of zeroes, many things that they react to will seem to be highly reacted to.

I think one more step fixes this. Instead of counting zeroes in the pubkey directly, you count zeroes in the XOR of the pubkey and some hash that is generated from the filter query.

EDIT: and in this case we no longer need an offset. The first 8 bits of the hash are the bucket, the next 248 bits could be a count of zeroes but we really probably shouldn't bother counting past the next 32 bits... or 64 if we are bold. The hash could be sha256() of the filter somehow made reproducible (e.g. not just JSON), maybe the '#e' tag contents, but maybe other parts of the filter matter? limit should probably be discarded.
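
A hedged sketch of that XOR variant, assuming the reproducible hash is a sha256 of some anchor taken from the filter (e.g. the '#e' tag contents); the names and the 64-bit cap are illustrative only:

```go
import (
	"crypto/sha256"
	"encoding/binary"
	"math/bits"
)

// registerAndZeros mixes the author pubkey with a filter-derived hash so that
// pre-mined keys lose their advantage: the first byte of the XOR picks the
// bucket, and leading zeroes are counted over the next 64 bits only.
func registerAndZeros(pubkey [32]byte, anchor string) (reg uint8, zeros int) {
	h := sha256.Sum256([]byte(anchor))
	var x [32]byte
	for i := range x {
		x[i] = pubkey[i] ^ h[i]
	}
	reg = x[0]
	zeros = bits.LeadingZeros64(binary.BigEndian.Uint64(x[1:9]))
	return reg, zeros
}
```

Whether the registers should store the zero count or the zero count plus one is the interoperability detail discussed further down the thread.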

@vitorpamplona (Collaborator) commented Nov 5, 2024

If you're using the subid there is no need to hash anything, you can just assume it will be random enough. If it's not that's the client's choice and their problem.

That's why I think the hash is needed (the subid may not be random). It would get rid of any PoW made for the event ID, and whatever comes in as the subID creates enough variance that the whole ID changes.

@alexgleason (Member) commented:

@vitorpamplona If we take a hash they will just PoW the hash.

@mikedilger (Contributor) commented Nov 5, 2024

Take a look at HLL-RB (HyperLogLog with random buckets). I think relays may be able to use a random number... even for the same query twice.

EDIT: I think I have been fooled by A.I. I can't find any such thing except as the output of AI. ;-( I was thinking we could use randomness, but that the harmonic mean might no longer be the right way to combine.

@fiatjaf (Member, author) commented Nov 5, 2024

True. But if someone manages to get a pubkey with a hell of a lot of zeroes, many things that they react to will seem to be highly reacted to.

We can get the offset from something like an XOR between the pubkey and some part of the filter (I'd rather not take the filter JSON verbatim because that would have been preprocessed already in most cases and it would complicate implementations), then read the stuff from the event id based on that offset.

@alexgleason (Member) commented:

I think it's impossible to create a deterministic identifier that can't be mined.

@alexgleason (Member) commented:

Anonymous Bitcoiners would offer Nostr SEO packages where you zap them 5000 sats and they do proof of work on hyperloglog reactions they publish so you always have thousands of likes and reposts on everything. I suppose they can do that a lot already by just generating thousands of events.

@vitorpamplona (Collaborator) commented Nov 5, 2024

@vitorpamplona If we take a hash they will just PoW the hash.

Yep, that's why I suggested hash(event id + subid). Sub id just needs to be random enough.
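
In code, the suggestion amounts to something like this (assuming the client sends the same random subscription id to every relay whose registers it intends to merge):

```go
import "crypto/sha256"

// hllInput is what would be fed into the register-selection/zero-counting step
// instead of the raw event id.
func hllInput(eventID [32]byte, subID string) [32]byte {
	return sha256.Sum256(append(eventID[:], subID...))
}
```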

@mikedilger (Contributor) commented:

I think it's impossible to create a deterministic identifier that can't be mined.

We are talking about mining fresh pubkeys that would like your post. Relays can use their spam prevention logic to reject such fresh pubkeys.

I think that adding randomness would not mess up the count (on a given relay) but it would make it impossible to combine counts between multiple relays (if you naively did, you would overshoot).

@fiatjaf (Member, author) commented Nov 5, 2024

I suppose they can do that a lot already by just generating thousands of events.

Exactly, so we are not adding any new weaknesses. Relying on reaction counts from relays that will accept anything is already a very bad idea.

We are talking about mining fresh pubkeys that would like your post. Relays can use their spam prevention logic to reject such fresh pubkeys.

Agreed.

@fiatjaf (Member, author) commented Nov 5, 2024

I'm not sure the mod 24 part is doing what you intended. The hex characters are lowercase so 'a'-'f' gives you 1-6 overlapping with what '1'-'6' gives you... which doesn't hurt but is less effective I think.

@mikedilger I don't fully get this, but the mod 24 was in relation to the bytes, not to the hex characters. It was capped at 24 in order to leave room for 8 bytes at the end of the id/pubkey from which we would read. I guess we should also skip the first 8 characters since they're so often used for generic PoW; that leaves us 16 possible values for the offset, which means we can read it from a single hex character. I know this is all made in a very ad-hoc stupid way, but I like it because it's simple.
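
Under that revision the offset derivation could look roughly like this; which single hex character supplies the offset is not pinned down here, so the position below is a placeholder:

```go
import "strconv"

// offsetFromAnchor skips the first 8 hex characters (a common generic-PoW
// target) and maps one hex character to one of the 16 remaining byte offsets
// (8..23), leaving 8 bytes at the end of the id/pubkey to read from.
func offsetFromAnchor(anchor string) int {
	if len(anchor) < 64 {
		return 16 // fallback for malformed anchors
	}
	v, err := strconv.ParseUint(anchor[32:33], 16, 8) // one hex char: 0..15
	if err != nil {
		return 16
	}
	return 8 + int(v)
}
```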

@mikedilger (Contributor) commented:

Ok then I think it is worded in a confusing way. It sounded like you were taking the hex characters as ascii, and doing a mod 24 on that.

@mikedilger (Contributor) commented Nov 5, 2024

I couldn't get this to work when trying to translate the go code to rust (probably my fault). So I tried reading and implementing the 2007 paper and I got it to work. However, these algorithms differ.

https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf (see the top of page 140).

The original 2007 algorithm defines ρ(s) as the position of the first 1 bit, not the count of leading 0s, meaning the minimum value will be 1. This can't just be corrected by a factor of 2 because it changes the count of slots that are still 0 (every slot actually used will be at least 1). (The original LogLog set slots to -∞ to make this distinction.)

Relays could each use their own different estimation algorithms, but the 256 counts will have to be normalized to either be a count of zeroes, or the position of the first 1 (a count of zeroes +1) in order to be interoperable. And it seems to me that a count of zeroes loses data.

@fiatjaf (Member, author) commented Nov 6, 2024

There are different implementations out there; I tried to do it in the simplest possible way that would work and wrote it in the NIP.

We will have to agree on something, we can't have each relay implementing it in a different way.

The original 2007 algorithm uses ρ(s) as being the position of the first 1 bit, not the count of leading 0s, meaning the minimum value will be 1.

This is what my implementation does too, it counts zeroes and adds 1.

@mikedilger (Contributor) left a review:

This is just a restatement of my prior comments as edits

45.md (outdated):

This section describes the steps a relay should take in order to return HLL values to clients.

1. Upon receiving a filter, if it has a single `#e`, `#p`, `#a` or `#q` item, read its 32nd ascii character as a byte and take its modulo over 24 to obtain an `offset` -- in the unlikely case that the filter doesn't meet these conditions, set `offset` to the number 16;
@mikedilger (Contributor):

  1. Upon receiving a filter, if it has a #e, #p, #a or #q item with a single value (checked in that order in case there are multiple of these), then convert that value into 32 bytes (for the 'a' tag use the pubkey part only) and take the 16th byte modulo 24 as an 'offset'. In the unlikely case that the filter doesn't meet these conditions, set offset to the number 16;

@mikedilger (Contributor):

eh, for the 'a' tag I guess it doesn't matter, the kind number isn't too long so chars 32 and 33 are still inside the hex part.

@fiatjaf (Member, author):

So instead of fixing this I doubled down on my error and took the opportunity to exclude the first 8 bytes of all pubkeys. If you say this is stupid I'll do as you suggested.

@mikedilger (Contributor):

Fine by me. The offset isn't that important. The original algorithm uses offset=0 on hash outputs.

@mikedilger (Contributor) commented:

Because NIP-45 COUNT counts the number of events, but hll counts the number of distinct pubkeys, I think perhaps the key should be "hll_pubkeys" instead of just "hll" to at least give a hint that it isn't counting the events that match the filter, and to leave room for an "hll" that does.

@fiatjaf (Member, author) commented Nov 12, 2024

I think in all cases so far what we actually want is distinct pubkeys, so we can be pragmatic about it.

If we ever make a different thing for the number of distinct ids we can come up with a different name, like hlli, I don't know, these keys are all hardcoded anyway.

I would prefer a JSON array over an object with keys because then we wouldn't have these discussions.

@Semisol (Collaborator) commented Nov 14, 2024

You aren't wrong.

@fiatjaf (Member, author) commented Nov 14, 2024

The usage of registers has nothing to do with IP addresses. Any user will be able to skyrocket the number if they can generate an infinite number of pubkeys and issue likes/follows/etc with all these pubkeys. If you can generate 3000 pubkeys you can make the follow count go up by ~3000. That is true today, will be true with this change, and will always be true. We have no way around it except

  1. downloading a bazillion number of events from public relays and filtering locally using WoT or whatever;
  2. relying on relays that are known to do a good job in filtering stuff -- paid relays and WoT relays from reputable humans, for example -- then merging their HLL values locally.

Whether HLL values are cached on the relay side or counted at request time is irrelevant to the client, and the results will be exactly the same.
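
Merging the HLL values received from several trusted relays is just a per-register maximum; a client-side sketch, assuming the 256-register layout from the NIP:

```go
// mergeRegisters combines register arrays from multiple relays; the client
// then runs whatever estimator it prefers over the merged array.
func mergeRegisters(sketches ...[256]uint8) [256]uint8 {
	var merged [256]uint8
	for _, s := range sketches {
		for i, v := range s {
			if v > merged[i] {
				merged[i] = v
			}
		}
	}
	return merged
}
```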

@arthurfranca (Contributor) commented Nov 15, 2024

Edit: got triggered by a cat avatar and wrote too much. Don't want to flood the PR discussion.

Edit: Lol I thought semisol said I was wrong

@mikedilger (Contributor) commented:

In order to game this system, you have to produce close to 256 pubkeys each with effective PoW of about 24 (to reach 8 million likes, for example) and these newly mined pubkeys would have to be accepted by the target relay and not seen as spam-pubkeys. IMHO this is not too hard to game. It just takes a little effort and ingenuity.

But how severe is that? So what if it looks like millions of people upvoted you. Reactions are stupid anyways. It's not like they are forging your events or reading your DMs. And if clients look at the first results from the COUNT commands they can quite easily detect the HLL abuse.

Still... this is quite easy to repair, as I've said, by having some kind of randomness (perhaps supplied by the client with the COUNT, which is best for making it consistent across relays without being known to the attacker gaming things). But again the downside is that relays can't cache things anymore (or they can, but probably won't get cache hits anymore) and they have to keep the original reaction events. But at least clients don't need those events anymore.

I don't have a strong opinion either way, but I lean towards having the client-supplied random thing, because I have a penchant for reliability and trustability more so than efficiency. I think I've said all I can say about this so I'll stop commenting on it until or unless we start debating particulars after this choice is made.

@vitorpamplona (Collaborator) commented Nov 15, 2024

Why can't the relay count and ALSO run HLL to make sure values match?

Relays MUST guarantee that the HLL estimate is within x% of the actual count and not return an HLL if it goes over. Relays can do whatever they want to fix the HLL estimation.

This seems like a better protection than these shenanigans to try to block gaming.

@Semisol (Collaborator) commented Nov 15, 2024

But having some sort of post-processing of the ID with a client-supplied value already fixes this. They would need to target the attack towards 1 client nonce only, which they don't know.

What you propose basically makes it possible to force clients to fall back to inefficient methods, denying service to the COUNT HLL functionality.

@vitorpamplona (Collaborator) commented:

Yep, but the client cannot fix the HLL if it's wrong. The relay can automatically find which event IDs are creating the distortion and fix it (ignore them) before sending it back.

@Semisol (Collaborator) commented Nov 15, 2024

Yep, but the client cannot fix the HLL if it's wrong.

With the client nonce system, it does not need to be fixed at all.

The relay can automatically find which event IDs are creating the distortion and fix it (ignore them) before sending it back.

For this reason HLL uses a harmonic mean. For more than a few registers to get an abnormal value would be very hard, unless there were a lot of events (and then the count would be accurate).

Combining the event ID or the pubkey of the author with a client nonce means it is not possible to intentionally bias pubkeys or event IDs

@vitorpamplona (Collaborator) commented:

Combining the event ID or the pubkey of the author with a client nonce means it is not possible to intentionally bias pubkeys or event IDs

But I don't see any real nonce in this spec. Or are we back to using the subscription ID as a random nonce?

@Semisol (Collaborator) commented Nov 16, 2024

But I don't see any real nonce in this spec. Or are we back to using the subscription ID as a random nonce?

My proposal is to add one.

@vitorpamplona (Collaborator) commented:

I am in favor of the client-based nonce. It's the only way to keep the counter usable.

@arthurfranca (Contributor) commented:

Came to the conclusion that both approaches should co-exist.

A relay that drops events:
["COUNT", <sub_id>, { "count": 123, "hll_fixed": "<fiatjaf_way>" }]

A relay that doesn't drop events (or drops with lower frequency):
["COUNT", <sub_id>, { "count": 456, "hll_fixed": "<fiatjaf_way>", "hll_dyn": "<vitorpamplona_way>" }]

The client decides which key to use when merging counts, depending on what it receives from the visited relays; e.g. if the client talked to 3 relays and just one of them filled the hll_dyn field, use the hll_fixed field instead.

@fiatjaf (Member, author) commented Nov 19, 2024

Why not 3 ways? 4 ways? 17 ways? Why have a standard at all? Let people do what they want!

@vitorpamplona (Collaborator) commented:

Yeah, I don't know what the double approach offers. If there are outliers, the whole thing is moot. Clients can't do anything about it. It's ok if relays just fix it. We are already trusting them on the count anyway...

@arthurfranca (Contributor) commented Nov 21, 2024

I'm aware of the principle that there should preferably be only one way of doing things.
What do 2 ways (not 17 ways heheh) offer?

My reasoning was that some relays can freely discard events and still provide a less reliable (can be gamed, can't subtract items), though cacheable, hll count.
Meanwhile other relays can dynamically generate a much more reliable hll count, especially in the first days after a nostr event's creation, when it most likely won't have discarded reactions etc. yet. Then after some days, given it may have discarded old reactions etc., these relays can choose to only provide the less reliable hll.

@fiatjaf (Member, author) commented Nov 21, 2024

OK, I tried to build the cache thing and it's too complicated. The Nostr query system is too flexible. It will never work for open-ended queries; it would end up using more space than storing the actual events.

It's tempting to just declare caching a failed idea and standardize hyperloglog without that possibility, but it would be so good to not have to store millions of reactions and all that stuff!

Now I wonder if we should standardize just some of the "basic" queries to support cached hyperloglog, like number of reactions, number of quotes, number of comments, number of followers -- so these and only these could be cached, and for all the rest (and for all the subqueries of these or other complexities) relays have to either count at runtime or not fulfill at all?

What are the "basic" count queries, besides the ones I listed above?

Or am I crazy to insist on caching?

@vitorpamplona (Collaborator) commented:

am I crazy to insist in caching?

This seems very similar to negentropy. Caching is just not viable.

@mikedilger (Contributor) commented:

I think the idea of trying to avoid storing of many low-value events on relays is laudable. I think we should spend more time trying to figure that out.

If we still need to return a normal COUNT though, we will at least need to store the ID prefixes to avoid counting duplicates. And we will need a normal COUNT result in order to detect gaming of the system.

In order for relays to keep these HLL counts without keeping the events, they need to start counting them for well-specified filters before anybody even asks for one. So I agree that we need to specify well-known filters representing things to count (e.g. number of reactions, number of followers). And in that case we cannot hash the event value (pubkey or id) with a user-supplied nonce. And since all parties need to use the same method of generating HLLs for them to combine properly, the system will be well-known and thus gameable. But outliers can be detected and rejected, preventing most gaming.

I'm almost in favor of doing this both ways... but no more. Not 17 ways. I could be persuaded to do it 2.718 ways.

@mikedilger (Contributor) commented:

Or instead of keeping all the ID prefixes, you keep a bloom filter and when a "definitely not in set" happens you add 1 but when a "possibly in set" happens you add the false positive rate instead.
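
A rough sketch of that idea; the sizes, the hashing scheme and especially the use of the raw false-positive rate as the correction are illustrative and would need checking:

```go
import (
	"crypto/sha256"
	"encoding/binary"
)

// approxCounter keeps an approximate plain COUNT without storing events or id
// prefixes: a Bloom filter remembers which ids were (probably) already seen.
type approxCounter struct {
	bitset []byte  // Bloom filter bit array
	hashes int     // number of hash functions
	fpr    float64 // assumed false-positive rate of the filter
	count  float64 // approximate number of distinct events
}

func (c *approxCounter) observe(id [32]byte) {
	maybeSeen := true
	for i := 0; i < c.hashes; i++ {
		h := sha256.Sum256(append(id[:], byte(i)))
		pos := binary.BigEndian.Uint32(h[:4]) % uint32(len(c.bitset)*8)
		if c.bitset[pos/8]&(1<<(pos%8)) == 0 {
			maybeSeen = false // at least one bit unset: definitely not in set
			c.bitset[pos/8] |= 1 << (pos % 8)
		}
	}
	if maybeSeen {
		c.count += c.fpr // "possibly in set": add the false-positive rate
	} else {
		c.count++ // "definitely not in set": a new event
	}
}
```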

@Semisol (Collaborator) commented Nov 22, 2024

Or am I crazy to insist in caching?

Kind of. Caching should be done at the filter level (possibly breaking down filters into smaller ones or caching more general filters), not the final result level.

@Semisol (Collaborator) commented Nov 22, 2024

Storing reactions is not a big issue: dedicated indexes and encoding schemes can be made for high-volume optimizable events, which will mean you can store about 5M reactions in 1GB. This does not include some other methods you could use such as public key lookup tables that can cut the size even further.

Relays may also implement sampling and adjust the HLL result accordingly (as it is likely that the relay will know a rough count of the amount of events it may have to explore): you don't need to add every event to the HLL sketch, only some, and add a correction factor to every register.

@fiatjaf (Member, author) commented Nov 22, 2024

dedicated indexes and encoding schemes can be made for high-volume optimizable events

HLL is exactly such a thing already and defining what goes into this dedicated scheme and what is ignored is the question I posed above: if you want a limited functionality for specific use cases then HLL caching can be very good, otherwise nothing is possible.

@fiatjaf (Member, author) commented Nov 22, 2024

Anyway, I think one solution is to define HLL to be returned ONLY in the following queries (exact templates), at least for now:

  • {"#e":["<anchor>"],"kinds":[7]}
  • {"#p":["<anchor>"],"kinds":[3]}

All other queries should not return HLL responses.

And then whenever someone has another use case we add it to the list.

In the queries above it is declared that <anchor> will be used to determine how to produce the HLL value for each related event deterministically.
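
Relay-side, recognizing those two templates could be as simple as the following (using a hypothetical parsed-filter struct; only the two whitelisted shapes produce an anchor):

```go
// Filter is a minimal placeholder for a parsed Nostr filter.
type Filter struct {
	Kinds []int
	Tags  map[string][]string
}

// hllAnchor returns the <anchor> to base the HLL computation on, and whether
// the filter matches one of the two whitelisted templates.
func hllAnchor(f Filter) (anchor string, ok bool) {
	if len(f.Kinds) != 1 {
		return "", false
	}
	switch {
	case f.Kinds[0] == 7 && len(f.Tags["#e"]) == 1: // reactions to an event
		return f.Tags["#e"][0], true
	case f.Kinds[0] == 3 && len(f.Tags["#p"]) == 1: // followers of a pubkey
		return f.Tags["#p"][0], true
	}
	return "", false
}
```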

@Semisol (Collaborator) commented Nov 22, 2024

You have not solved the problem that this is open to manipulation

@Semisol (Collaborator) commented Nov 22, 2024

otherwise nothing is possible.

A lot of things are possible.

@fiatjaf (Member, author) commented Nov 22, 2024

You have not solved the problem that this is open to manipulation

I am pretty sure I did: do not trust counts from relays that accept events from random keys. It's literally the same problem with or without HLL.

@Semisol (Collaborator) commented Nov 22, 2024

I am pretty sure I did: do not trust counts from relays that accept events from random keys. It's literally the same problem with or without HLL.

So, this NIP is entirely useless? Good to know.

Should a new user's reaction count? Are they a "random key"? How do you distinguish new users from attempts at manipulation if they are spread over multiple IPs?

It is inherently costly to manipulate the reaction count, as each event increases the risk of detection and uses up finite resources (IP addresses), but HLL cuts the cost down from n events to only 256 events (or fewer, with more brute-forcing time, which is really easy).

@arthurfranca (Contributor) commented Nov 22, 2024

Hey, semisol just repeated what I said. Lol, some comments above I got pissed thinking they said I was wrong without any additional argument. But now I see semisol actually said "You aren't wrong."

Lol I'm really distracted these days.

edit: so as not to waste this comment, I will just say I hope any version of this gets merged and becomes popular, so that as a side effect most relays implement the good ol' NIP-45 count, which is already useful imo

@fiatjaf (Member, author) commented Nov 22, 2024

Should a new user's reaction count? Are they a "random key"? How do you distinguish new users from attempts at manipulation if they are spread over multiple IPs?

If our only tool for fighting spam is IP address detection then I'm sorry for us. I don't even consider IP as a way to identify spam at all.

Much better solutions are paid relays, WoT relays, captcha relays, relays that use any other algorithm for checking what is a "valid" key like activity tracking or whatever, or any other form of whitelisted relays.
