Add unique ID to each entry #199
Comments
(The easiest way to generate these unique IDs is probably to use a large, truly random number. Given that it is truly random, that would avoid the whole issue of detecting collisions with current and previous IDs. Something like 40 hex digits (the length of a SHA-1 hash) should be more than enough in that case.)
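A minimal sketch of the random-ID idea above, assuming Python and its standard `secrets` module. The function name `new_entry_id` is made up for illustration and is not part of boQwI' or any existing tool.

```python
import secrets


def new_entry_id() -> str:
    """Return 40 random hex digits (160 bits), the same length as a SHA-1 digest."""
    # 20 random bytes from the OS random source encode to 40 hex characters.
    return secrets.token_hex(20)
```

With 160 random bits, no collision check against current or retired IDs should be needed in practice, which is the point of the suggestion.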
At the moment, the database format can no longer be changed due to backwards compatibility issues. However, in the future the entire database will be moved to a different format, and a unique ID can be added at that time.
You could also just do what I do in the Flutter port of boQwI' and combine the values of multiple fields. The Flutter port of boQwI' uses a JSON database format, where the key for each entry is the entry name concatenated with the part of speech and homophone number, if present. See: klingon-assistant-data/xml2json.py, line 146 in 64b62e8
Those fields can still change somewhat, though. While it likely won't change from a noun to a verb, I often update a verb's transitivity as we get more information. The homophone number can also change when we receive additional words which are homophones of existing ones, or get additional meanings for an existing word. But in that case, it's fine to treat entries without numbers as if they had the number 1.
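A rough sketch of the composite-key idea from the two comments above, assuming Python. The function name, the `:` separator, and the argument names are illustrative assumptions; this is not the actual code in `xml2json.py`. It uses only the base part of speech (so a transitivity update does not change the key) and treats a missing homophone number as 1.

```python
from typing import Optional


def composite_key(entry_name: str, base_pos: str, homophone: Optional[int] = None) -> str:
    """Build a key from entry name, base part of speech, and homophone number.

    Entries without a homophone number are treated as number 1, so a key
    stays the same if the entry is later numbered 1 explicitly.
    """
    number = 1 if homophone is None else homophone
    return f"{entry_name}:{base_pos}:{number}"
```

For example, `composite_key("Qong", "v")` and `composite_key("Qong", "v", 1)` give the same key. Such a key is still only unique within one version of the database; it changes if an entry is renumbered to something other than 1.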
Yeah, I suppose the problem I'm solving (uniquely identifying a word within a single version of the database) is different from the one @zrajm is asking about (uniquely identifying a word across database changes). It might be interesting to use the same scheme (and IDs) as the KA dictionary, although this would require a bit of coordination as new words come along to ensure that the same ID is used for the new words as they're added in each database. Maybe that would be a good reason not to use the same scheme and IDs, and instead use different schemes, with somebody maintaining a list of ID mappings. Here's the description of the KA ID scheme:
@zrajm are the IDs assigned by any particular predictable means (e.g., they're monotonically increasing serial numbers that conform to the constraints listed above), or do you just randomly generate IDs until a non-colliding one is found?
@dadap The ID numbers are random. I went with human-readable IDs rather than great randomness, but the small size of the numbers means that the database needs to keep track of "abandoned" numbers from entries that have been removed (so far zero). This wouldn't be a problem if there were (much) greater randomness, because that would make the risk of a collision negligible.
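To illustrate the bookkeeping described above, here is a minimal sketch, assuming Python. The 4-digit decimal ID space and the names `allocate_id` and `retire_id` are assumptions chosen for illustration; the actual KA scheme and its constraints are not reproduced here.

```python
import secrets
from typing import Set


def allocate_id(in_use: Set[str], retired: Set[str], digits: int = 4) -> str:
    """Pick a random fixed-width decimal ID that was never used before."""
    taken = in_use | retired
    if len(taken) >= 10 ** digits:
        raise RuntimeError("ID space exhausted")
    while True:
        candidate = f"{secrets.randbelow(10 ** digits):0{digits}d}"
        # Reject IDs of current entries and of removed ("abandoned") entries.
        if candidate not in taken:
            in_use.add(candidate)
            return candidate


def retire_id(entry_id: str, in_use: Set[str], retired: Set[str]) -> None:
    """Mark the ID of a removed entry as abandoned so it is never reused."""
    in_use.remove(entry_id)
    retired.add(entry_id)
```

The `retired` set is exactly the extra state that a much larger random ID would make unnecessary.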
It's been a while, has any thought been given to IDing the words? (Granted, it's mainly because I misused Azure Tables, and having an ID wouldn't necessarily have solved the problem, but it could have lessened the impact. Not that much is lost except a weekend of my time...) Also, fwiw, I've used essentially what dadap suggested, but the problem with that is that it's simply too long. A mere 7-character ID would be ideal, IMO: one letter to satisfy restrictions, and 6 digits. I doubt we'd run out of numbers any time soon. Worst case, change the first letter from W (for word) to S (for sentence) and move sentences to a separate number range. Or have separate number ranges for the types of words. (Should be needless to say, but for the record: this is low priority of course. I mainly wanted to add my vote to having a unique boQwI'-driven ID.)
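The 7-character format proposed above could look roughly like this in Python. The function name `format_id` is hypothetical, and how the numbers themselves are assigned (serial or random) is left open, as it is in the comment.

```python
def format_id(number: int, prefix: str = "W") -> str:
    """Format an ID as one letter followed by six zero-padded digits."""
    if len(prefix) != 1 or not prefix.isalpha():
        raise ValueError("prefix must be a single letter")
    if not 0 <= number <= 999_999:
        raise ValueError("number must fit in six digits")
    return f"{prefix}{number:06d}"
```

Here `format_id(42)` gives `W000042`, and `format_id(42, "S")` gives `S000042` for the separate sentence range.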
The ID field isn't empty. It is populated (see …). Do you require the ID to be stable?
Ah, I see - I'm using the raw XML, so for me the field is always empty :-) I do think it would be useful for others as well, although I don't know how many are using the boQwI' database.
Would it be possible to get a
<column name="unique_id">…</column>
field for each entry? Exactly what the value looks like is not important, as long as it's guaranteed to be unique in the dictionary, and never reused (should a word ever be removed from the dictionary). This would make it easier for me to keep track of changes in your database over time, thereby allowing me to work through all the new stuff piecemeal and add the corresponding info to the Archive of Okrandian Canon and the Klingonska Akademien Dictionary.
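As a sketch of how such a field would help with tracking changes, the following Python compares two versions of the XML and lists the IDs that only appear in the newer one. Only the `<column name="unique_id">` element comes from the request above; the function names and the assumption that both versions parse as well-formed XML with those columns somewhere under a single root are illustrative.

```python
import xml.etree.ElementTree as ET
from typing import Set


def ids_in(xml_text: str) -> Set[str]:
    """Collect the text of every <column name="unique_id"> element."""
    root = ET.fromstring(xml_text)
    return {
        col.text.strip()
        for col in root.iter("column")
        if col.get("name") == "unique_id" and col.text and col.text.strip()
    }


def new_ids(old_xml: str, new_xml: str) -> Set[str]:
    """Return the IDs present in the new version but not in the old one."""
    return ids_in(new_xml) - ids_in(old_xml)
```

Because IDs are never reused, the reverse difference, `ids_in(old_xml) - ids_in(new_xml)`, would list removed entries without any ambiguity.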