Spark zipnumcluster job (draft) #8
base: main
Conversation
Also note, @sebastian-nagel: I do have a "more perfect sorting" version which uses reservoir sampling in my local history - if we end up needing it, it's already done.
After thinking longer about it: some types of queries will still work, but others won't. The problem starts when the software reading the CDX index assumes that it is totally sorted. This especially applies to any kind of range query. For example, a range query over a URL prefix returns an estimated number of results; basically, this is obtained by counting the number of lines in the secondary index (cluster.idx) that fall within the queried range.
Because there might also be results in the zipnum block before the first matching one, 1 is added to the number of lines. If the zipnum blocks are non-contiguous, we'd need to add 1 for every contiguous range of blocks. Naturally, the result becomes less precise. In addition, there's more work to do for larger range queries. That's what the statement "Generally, this overhead [of the zipnum index] is negligible when looking up large indexes, and non-existent when doing a range query across many CDX lines." (https://pywb.readthedocs.io/en/latest/manual/indexing.html#zipnum) refers to. On the other hand, queries for single URLs might work the same and with the same performance independent of the partitioning scheme.
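A minimal sketch (not the pywb implementation) of the counting described above, assuming cluster.idx lines have the usual "SURT timestamp<TAB>file<TAB>offset<TAB>length" layout:

def estimate_matching_blocks(cluster_idx_lines, start_key, end_key):
    # Count secondary-index lines whose SURT key falls into the query range.
    count = sum(1 for line in cluster_idx_lines
                if start_key <= line.split('\t', 1)[0].split(' ', 1)[0] < end_key)
    # Add 1: matching records may also sit in the zipnum block whose first
    # key sorts before start_key.
    return count + 1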
What does it mean? Total order sorting?
All kinds of range queries also need to be tested:
Of course, even then we'd need to document the new CDX index sorting for our users and spread this information. We do not know which assumptions are made in any third-party software and whether it relies on total order sorting. This alone might make it less work to just implement total order sorting.
zipnumcluster-ccpyspark.py
Outdated
if len(current_chunk) >= chunk_size:
    # Compress and write chunk
    chunk_data = ''.join(current_chunk).encode('utf-8')
    compressed = z.compress(chunk_data)
Whenever a new chunk is started, the zlib compressobj needs to be reset. You also need to flush. It should be:
z = zlib.compressobj(6, zlib.DEFLATED, zlib.MAX_WBITS + 16)
compressed = z.compress(chunk_data) + z.flush()
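A minimal sketch, assuming a per-partition loop over already-built chunks (chunks and out are illustrative names), showing the suggested fix of creating and flushing a fresh compressobj for every zipnum block, so each block is an independently decodable gzip member:

import zlib

def write_zipnum_blocks(chunks, out):
    offset = 0
    for chunk_lines in chunks:
        z = zlib.compressobj(6, zlib.DEFLATED, zlib.MAX_WBITS + 16)  # new object per block
        compressed = z.compress(''.join(chunk_lines).encode('utf-8')) + z.flush()
        out.write(compressed)
        offset += len(compressed)  # offset/length of each block go into cluster.idx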
zipnumcluster-ccpyspark.py
Outdated
# Handle final chunk
if current_chunk:
    chunk_data = ''.join(current_chunk).encode('utf-8')
    compressed = z.compress(chunk_data) + z.flush()
Here as well: start a new compressobj.
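The same fix sketched for the final, partially filled chunk (variable names follow the diff above):

if current_chunk:
    z = zlib.compressobj(6, zlib.DEFLATED, zlib.MAX_WBITS + 16)  # fresh compressobj
    chunk_data = ''.join(current_chunk).encode('utf-8')
    compressed = z.compress(chunk_data) + z.flush()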
zipnumcluster-ccpyspark.py
Outdated
# Create single index entry per record
for sk, ts in chunk_records:
    index_entries.append((sk, ts, partition_id, current_offset, chunk_length))
The secondary index (cluster.idx) only contains the offset for the first record of every zipnum block, not for all of them.
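A short sketch of the corrected bookkeeping (variable names follow the diff above): one cluster.idx entry per zipnum block, keyed on the block's first record only.

first_sk, first_ts = chunk_records[0]
index_entries.append((first_sk, first_ts, partition_id, current_offset, chunk_length))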
ah - this (and your explanation above) makes more sense to me now, thank you.
(I am still not seeing any of that in the docs I was able to find, but, I trust you, and will adjust)
zipnumcluster-ccpyspark.py
Outdated
f.write(compressed)

for sk, ts in chunk_records:
    index_entries.append((sk, ts, partition_id, current_offset, chunk_length))
Also: only first record.
I mentioned it in the next sentence fragment - I used the same technique as was used in the Hadoop version: reservoir sampling to produce the ranges, then another pass using those ranges to do the shards. I will find that version in my local history and check it when I work on this next.
Maybe it's not necessary to do the sampling step - Spark has a sortBy (or sortByKey) method which does a total order sort into N partitions. We use it to sort the vertices of the host-level webgraph before enumerating them. Same as with reservoir sampling: the partitions are not perfectly balanced, but the balance is acceptable. Note: Spark also has methods to sort the data only within the partitions; they are usually named with "WithinPartitions", see for example repartitionAndSortWithinPartitions.
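A hedged sketch of the sortBy variant: Spark's range partitioner samples the keys internally and yields a total order across the output partitions (cdx_lines and num_shards are illustrative names, not from this PR):

sorted_cdx = cdx_lines.sortBy(lambda line: line, numPartitions=num_shards)
# Every line in partition i now sorts before every line in partition i+1;
# partition sizes are only approximately balanced, much like with reservoir sampling.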
Indeed - I was aware of these, but not all of them, and only some of the nuance. I've done some deep reading, and, by my best assessment, my informal definition of "perfect sort" is that the last record of one partition will sort before the first record of the next partition (so, if I go through the partitions in order, I will never get records out of order).
I'm leaning towards repartitionAndSortWithinPartitions, using a hash of the URL - but I may change my mind after running a few jobs and seeing how uneven the partitions are... IMHO, 5-10% variance seems fine; if it's much more than that, it doesn't feel as good (though that's why I want to read the zipnum code, as I state below - it may not be a practical issue, so it could be fine). Since I'm waiting on/monitoring other jobs anyway, I'm going to take another block of time tomorrow to do a similar deep read of the zipnum code, just so I have a much better understanding of that as well (specifically, I'm going to read the index server's code which USES zipnum, as that's the part that is still murky to me). Thanks again for the input @sebastian-nagel, much appreciated.
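A sketch of the hash-based alternative under consideration (names are illustrative): hash-partition on the SURT key, then sort within each shard. Shards come out well balanced, but their SURT ranges will overlap, so the secondary index has to describe each shard's blocks on its own:

keyed = cdx_lines.map(lambda line: (line.split(' ', 1)[0], line))
shards = keyed.repartitionAndSortWithinPartitions(numPartitions=num_shards)
# The default partitionFunc (portable_hash) hashes the key, so identical SURTs
# always land in the same shard.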
Everyone's expectation is that the SURT ranges of the CDX index shards and the Parquet shards should not overlap. For CDX, that's expected by pywb. For Parquet, it's important for optimization. We do have a few Parquet indexes for which that isn't true, and it's a problem we will fix someday.
Got it - I think with the hash and reservoir-sampled approaches I outlined, they should not overlap at the shard level (the former would not allow it, and the latter would match what we already do today pretty exactly). There MAY be gzip chunks that overlap (by very small amounts with reservoir sampling, and potentially rather larger amounts with hashing) - but as long as the secondary index reflects those properly, I don't think it'll be an issue, based on my read of the index server side of things.
I will get back to this task on Monday, so there's plenty of time to discuss if I'm wrong there... I will bring it up on the eng. call, and if we need to talk, we can do it then.
What Greg means is that there must be zero overlap between the ranges defined by the first and the last SURT in each zipnum block. It's important because the secondary index (cluster.idx) only stores the first SURT but not the last one. But with strict sorting, the last SURT of a block must sort before the first SURT of the next zipnum block. For Parquet, zero overlap is an optimization but not a requirement: every Parquet file and row group has the min and max values in the statistics in the footer.
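Illustrative (made-up) cluster.idx lines, assuming the usual "SURT timestamp<TAB>file<TAB>offset<TAB>length<TAB>seq" layout: each line names only the first SURT of its zipnum block, so lookups only work if no key in a block sorts after the first key of the following block.

cluster_idx = [
    "com,example)/ 20240101000000\tcdx-00000.gz\t0\t3000\t1",
    "com,example,www)/ 20240102000000\tcdx-00000.gz\t3000\t2900\t2",
    "org,example)/ 20240103000000\tcdx-00001.gz\t0\t3100\t3",
]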
I have updates to this task in another branch - for now, I'm going to preserve the existing approach, reservoir sampling, and get this task finished. I have it all working locally and will be testing tonight/tomorrow in S3 on a full crawl, and will then re-do the PR to reflect that (I'll probably merge it into this branch, to keep things simple and preserve the above history).
This is a cc-pyspark version of the zipnum clustering job (without the use of the mrjob framework).