Add unique ID to each entry #199

zrajm · 2018-11-27T14:57:26Z

Would it be possible to get a <column name="unique_id">…</column> field for each entry? Exactly what the value looks like is not important, as long as it's guaranteed to be unique in the dictionary, and never reused (should a word ever be removed from the dictionary).

This would make it easier for me to keep track of changes in your database over time, thereby allowing me to work through all the new stuff piecemeal and add the corresponding info to the Archive of Okrandian Canon and the Klingonska Akademien Dictionary.

zrajm · 2018-11-27T15:00:40Z

(The easiest way to generate these unique IDs is probably to use a large, truly random number. Given that it is truly random, that would avoid the whole issue of detecting collisions with current and previous IDs. Something like 40 hex digits (like SHA1) should be more than enough in that case.)
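As a rough illustration of the random-ID idea, here is a minimal Python sketch. The function name `new_entry_id` is made up for this example and is not part of boQwI'; it only assumes that 40 hex digits (160 bits, the length of a SHA-1 digest) drawn from a cryptographically strong source are acceptable as an ID.

```python
import secrets


def new_entry_id() -> str:
    """Return 40 random hex digits (160 bits), the same length as a SHA-1 digest.

    Hypothetical helper for illustration only; not part of the boQwI' codebase.
    """
    # 20 random bytes are rendered as 40 lowercase hexadecimal characters.
    return secrets.token_hex(20)


if __name__ == "__main__":
    print(new_entry_id())
```

With 160 random bits, the chance of two entries ever receiving the same ID is negligible, so no list of previously used IDs would need to be kept.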

De7vID · 2018-11-27T18:05:39Z

At the moment, the database format can no longer be changed due to backwards compatibility issues. However, in the future the entire database will be moved to a different format, and a unique ID can be added at that time.

dadap · 2018-11-27T18:23:11Z

You could also just do what I do in the Flutter port of boQwI' and combine the values of multiple fields. The Flutter port of boQwI' uses a JSON database format, where the key for each entry is the entry name concatenated with the part of speech and homophone number, if present. See:

klingon-assistant-data/xml2json.py

Line 146 in 64b62e8

def normalize(name, pos):

De7vID · 2018-11-28T13:29:57Z

Those fields can still change somewhat, though. While it likely won't change from a noun to a verb, I often update a verb's transitivity as we get more information. The homophone number can also change when we receive additional words which are homophones of existing ones, or get additional meanings for an existing word. But in that case, it's fine to treat entries without numbers as if they had the number 1.
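To make the combined-field key concrete, here is a small Python sketch. It is not the `normalize` function from xml2json.py; `composite_key`, its separator, and its argument names are assumptions made for this example. It treats a missing homophone number as 1, as suggested above.

```python
from typing import Optional


def composite_key(name: str, pos: str, homophone: Optional[int] = None) -> str:
    """Build a key from entry name, part of speech, and homophone number.

    Hypothetical helper: the ":" separator and field order are assumptions.
    Entries without a homophone number are treated as if they had number 1.
    """
    number = 1 if homophone is None else homophone
    return f"{name}:{pos}:{number}"


if __name__ == "__main__":
    print(composite_key("ghoS", "v"))
    print(composite_key("ghoS", "v", 2))
```

A key like this is unique within one version of the database, but it still changes whenever a homophone number is reassigned.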

dadap · 2018-11-28T13:59:20Z

Yeah, I suppose the problem I'm solving (uniquely identifying a word within a single version of the database) is different from the one @zrajm is asking about (uniquely identifying a word across database changes).

It might be interesting to use the same scheme (and IDs) as the KA dictionary, although this would require a bit of coordination as new words come along to ensure that the same ID is used for the new words as they're added in each database. Maybe that would be a good reason not to use the same scheme and IDs, and instead use different schemes, with somebody maintaining a list of ID mappings. Here's the description of the KA ID scheme:

id: unique, permanent entry ID [REQUIRED]

The entry ID uniquely identifies an entry in the dictionary (even across
dictionary updates). IDs are guaranteed to be unique, and will not be
reused (i.e. the ID of a deleted entry will never re-appear in the
dictionary, unless the same dictionary entry is taken back). The ID of an
entry remains the same, even when the entry is otherwise modified.

The ID value is intentionally kept as short as possible, to facilitate
their use for both humans and machines. (We suggest writing id:Qmp when
using an ID in text, as this value can be pasted straight into the online
Klingon Pocket Dictionary to view the entry.)

ID values, written in base58, are always three characters long. Values that
look similar to Klingon words are disallowed (so as to avoid confusion).
This gives us: 58^3-(16*5*16) = 193,832 possible ID values -- far more than
is likely to ever be needed. (Base58 was chosen for clarity, it is also
used for Bitcoin addresses.)

@zrajm are the IDs assigned by any particular predictable means (e.g., they're monotonically increasing serial numbers that conform to the constraints listed above), or do you just randomly generate IDs until a non-colliding one is found?

zrajm · 2018-11-28T14:43:34Z

@dadap The ID numbers are random. I went with human-readable rather than great randomness, but the small size of the numbers means that the database needs to keep track of "abandoned" numbers of entries that have been removed (so far zero). This wouldn't be a problem if there was (much) greater randomness, because that would make the risk of a collision negligible.
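The approach described here (draw a short random ID, reject it if it collides with a live or abandoned one) could look roughly like the Python sketch below. This is not the KA dictionary's actual code. The function `new_short_id` and its parameters are invented for this example, the alphabet is the Bitcoin base58 alphabet, and the rule that excludes values resembling Klingon words is left out because its details are not given in this thread.

```python
import secrets
from typing import Set

# Bitcoin-style base58 alphabet: no 0, O, I, or l.
BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def new_short_id(in_use: Set[str], abandoned: Set[str], length: int = 3) -> str:
    """Draw random base58 IDs until one is neither in use nor abandoned.

    Hypothetical sketch: the Klingon-lookalike filter from the KA scheme
    is intentionally omitted.
    """
    taken = in_use | abandoned
    if len(taken) >= len(BASE58) ** length:
        raise RuntimeError("no free IDs left for this length")
    while True:
        candidate = "".join(secrets.choice(BASE58) for _ in range(length))
        if candidate not in taken:
            return candidate


if __name__ == "__main__":
    print(new_short_id(in_use={"Qmp"}, abandoned=set()))
```

The `abandoned` set is the bookkeeping cost of short IDs: without it, a removed entry's ID could be handed out again.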

Cyberman-tM · 2020-04-27T05:24:41Z

It's been a while, has any thought been given to IDing the words?
The structure has an ID field, but it's always empty.
I've recently run into some minor trouble because there is no ID...

(Granted, it's mainly because I misused Azure Tables, and having an ID wouldn't necessarily have solved the problem, but it could have lessened the impact. Not that much is lost except a weekend of my time...)

Also, fwiw, I've used essentially what dadap suggested, but the problem with that is that it's simply too long. A mere 7-character ID would be ideal, IMO: one letter to satisfy restrictions, and 6 digits. I doubt we'd run out of numbers any time soon. Worst case, change the first letter from W (for word) to S (for sentence) and move sentences to a separate number range. Or have separate number ranges for the types of words.
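For illustration, the 7-character format (one letter plus six digits) might be produced and checked like this in Python. `format_id` and `ID_PATTERN` are hypothetical names for this sketch, and restricting the prefix to a single uppercase ASCII letter is an assumption.

```python
import re

# One uppercase ASCII letter followed by exactly six ASCII digits.
ID_PATTERN = re.compile(r"^[A-Z][0-9]{6}$")


def format_id(serial: int, prefix: str = "W") -> str:
    """Format a serial number as a 7-character ID such as W000042.

    Hypothetical helper: "W" (word) and "S" (sentence) follow the
    suggestion in this comment; other prefixes are an assumption.
    """
    if not (len(prefix) == 1 and prefix.isascii() and prefix.isupper()):
        raise ValueError("prefix must be one uppercase ASCII letter")
    if not 0 <= serial <= 999_999:
        raise ValueError("serial must fit in six digits")
    return f"{prefix}{serial:06d}"


if __name__ == "__main__":
    print(format_id(42))
    print(format_id(7, prefix="S"))
```

Six digits give one million values per prefix letter, and a serial counter that is never reset would keep IDs from being reused.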

(Should be needless to say, but for the record - this is low priority of course. I mainly wanted to add my vote to having a unique boQwI'-driven ID.)

De7vID · 2020-04-30T17:32:27Z

The ID field isn't empty. It is populated (see renumber.py) when the dictionary is compiled to binary, but the numbers are sequentially assigned and hence not stable (i.e., they change if new entries are added).
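The following Python sketch shows why sequentially assigned numbers are not stable. It is not the logic of renumber.py; `assign_sequential_ids` and the alphabetical ordering are assumptions chosen only to demonstrate the effect.

```python
from typing import Dict, Iterable


def assign_sequential_ids(entry_names: Iterable[str]) -> Dict[str, int]:
    """Number entries 1, 2, 3, ... in sorted order.

    Hypothetical illustration; the real ordering used at compile time
    may differ.
    """
    return {name: i for i, name in enumerate(sorted(entry_names), start=1)}


if __name__ == "__main__":
    before = assign_sequential_ids(["bIQ", "targh"])
    after = assign_sequential_ids(["bIQ", "nuH", "targh"])
    # Adding one entry in the middle shifts the number of every later entry.
    print(before["targh"], after["targh"])
```

Any consumer that stored the old number for an entry would point at a different entry after the next compile.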

Do you require the ID to be stable?

Cyberman-tM · 2020-04-30T17:35:37Z

Ah, I see - I'm using the raw XML, so for me the field is always empty :-)
"require" is too strong a word. I would like it to be stable - AND in the XML.
But it's not really a huge problem for me to create my own ID.

I do think it would be useful for others as well, though. I don't know how many are using the boQwI' database, though.

De7vID mentioned this issue Aug 20, 2022

Added copyright notice for Finnish translation #663

Closed
