Add unique ID to each entry #199

Open
zrajm opened this issue Nov 27, 2018 · 9 comments

Comments

@zrajm
Contributor

zrajm commented Nov 27, 2018

Would it be possible to get a <column name="unique_id">…</column> field for each entry? Exactly what the value looks like is not important, as long as it's guaranteed to be unique in the dictionary, and never reused (should a word ever be removed from the dictionary).

This would make it easier for me to keep track of changes in your database over time, thereby allowing me to work through all the new stuff piecemeal and add the corresponding info to Archive of Okrandian Canon, and the Klingonska Akademien Dictionary.

@zrajm
Contributor Author

zrajm commented Nov 27, 2018

(The easiest way to generate these unique IDs is probably to use a large, truly random number. Given that it is truly random, that would avoid the whole issue of detecting collisions with current and previous IDs. Something like 40 hex digits (like SHA1) should be more than enough in that case.)
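A minimal sketch of that idea in Python, assuming the standard library's `secrets` module is an acceptable randomness source. The function name `new_entry_id` is hypothetical and not part of boQwI'.

```python
import secrets


def new_entry_id() -> str:
    """Return 40 random hex digits (160 bits), the same length as a SHA-1 digest."""
    # token_hex(n) returns 2 * n hex digits from the OS's secure random source.
    return secrets.token_hex(20)


print(new_entry_id())
```

With 160 random bits, a collision among dictionary entries is so unlikely that no bookkeeping of current or retired IDs should be needed.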

@De7vID
Owner

De7vID commented Nov 27, 2018

At the moment, the database format can no longer be changed due to backwards compatibility issues. However, in the future the entire database will be moved to a different format, and a unique ID can be added at that time.

@dadap
Collaborator

dadap commented Nov 27, 2018

You could also just do what I do in the Flutter port of boQwI' and combine the values of multiple fields. The Flutter port of boQwI' uses a JSON database format, where the key for each entry is the entry name concatenated with the part of speech and homophone number, if present. See:

def normalize(name, pos):

@De7vID
Owner

De7vID commented Nov 28, 2018

Those fields can still change somewhat, though. While it likely won't change from a noun to a verb, I often update a verb's transitivity as we get more information. The homophone number can also change when we receive additional words which are homophones of existing ones, or get additional meanings for an existing word. But in that case, it's fine to treat entries without numbers as if they had the number 1.
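A rough sketch of such a combined key, written so that the two changes mentioned above do not alter it. This is not the `normalize` function from the Flutter port. The name `make_key`, the key layout, and the `"v:t"`-style part-of-speech strings are all assumptions made for illustration.

```python
from typing import Optional


def make_key(name: str, pos: str, homophone: Optional[int] = None) -> str:
    """Build a key from entry name, base part of speech, and homophone number."""
    # Keep only the base part of speech ("v", "n", ...) so that a change in
    # attributes such as a verb's transitivity does not change the key.
    base_pos = pos.split(":", 1)[0]
    # Treat an entry without a homophone number as if it had the number 1.
    number = 1 if homophone is None else homophone
    return f"{name}:{base_pos}:{number}"


print(make_key("bach", "v:t"))
```

A key like this still changes if an entry is renamed or renumbered, so it identifies an entry within one version of the database rather than across versions.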

@dadap
Collaborator

dadap commented Nov 28, 2018

Yeah, I suppose the problem I'm solving (uniquely identifying a word within a single version of the database) is different from the one @zrajm is asking about (uniquely identifying a word across database changes).

It might be interesting to use the same scheme (and IDs) as the KA dictionary, although this would require a bit of coordination as new words come along to ensure that the same ID is used for the new words as they're added in each database. Maybe that would be a good reason not to use the same scheme and IDs, and instead use different schemes, with somebody maintaining a list of ID mappings. Here's the description of the KA ID scheme:

  • id: unique, permanent entry ID [REQUIRED]

    The entry ID uniquely identifies an entry in the dictionary (even across
    dictionary updates). IDs are guaranteed to be unique, and will not be
    reused (i.e. the ID of a deleted entry will never re-appear in the
    dictionary, unless the same dictionary entry is taken back). The ID of an
    entry remains the same, even when the entry is otherwise modified.

    The ID value is intentionally kept as short as possible, to facilitate
    its use by both humans and machines. (We suggest writing id:Qmp when
    using an ID in text, as this value can be pasted straight into the online
    Klingon Pocket Dictionary to view the entry.)

    ID values, written in base58, are always three characters long. Values that
    look similar to Klingon words are disallowed (so as to avoid confusion).
    This gives us: 58^3-(16516) = 193,832 possible ID values -- far more than
    is likely to ever be needed. (Base58 was chosen for clarity, it is also
    used for Bitcoin addresses.)

@zrajm are the IDs assigned by any particular predictable means (e.g., they're monotonically increasing serial numbers that conform to the constraints listed above), or do you just randomly generate IDs until a non-colliding one is found?

@zrajm
Contributor Author

zrajm commented Nov 28, 2018

@dadap The ID numbers are random. I prioritized human readability over great randomness, but the small size of the IDs means that the database needs to keep track of "abandoned" IDs of entries that have been removed (so far zero). This wouldn't be a problem with (much) greater randomness, because that would make the risk of a collision negligible.
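The approach described here (draw a random short ID, retry on collision, and never reuse an abandoned ID) could be sketched as follows. This is not the KA dictionary's actual code. The function name `new_short_id`, its parameters, and the use of Bitcoin's base58 alphabet are assumptions, and the set of disallowed Klingon-like values has to be supplied by the caller.

```python
import secrets

# Bitcoin's base58 alphabet: digits and letters without 0, O, I and l.
BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def new_short_id(in_use, retired, disallowed=frozenset()):
    """Return a random three-character base58 ID that is not already taken."""
    # Retired ("abandoned") IDs count as taken, so they are never handed out again.
    taken = set(in_use) | set(retired) | set(disallowed)
    while True:
        candidate = "".join(secrets.choice(BASE58) for _ in range(3))
        if candidate not in taken:
            return candidate


print(new_short_id(in_use={"Qmp"}, retired=set()))
```

The retry loop is cheap as long as only a small share of the ID space is taken. The cost of the short format is the `retired` set, which has to be stored for as long as the dictionary exists.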

@Cyberman-tM
Contributor

It's been a while, has any thought been given to IDing the words?
The structure has an ID field, but it's always empty.
I've recently run into some minor trouble because there is no ID...

(Granted, it's mainly because I misused Azure Tables, and having an ID wouldn't necessarily have solved the problem, but it could have lessened the impact. Not that much is lost except a weekend of my time...)

Also, fwiw, I've used essentially what dadap suggested, but the problem with that is that it's simply too long. A mere 7-character ID would be ideal, IMO: one letter to satisfy restrictions, and 6 digits. I doubt we'd run out of numbers any time soon. Worst case, change the first letter from W (for word) to S (for sentence) and move sentences to a separate number range. Or have separate number ranges for the types of words.
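As a rough illustration of that format, assuming IDs are plain serial numbers padded to six digits (the function name `format_id` is hypothetical):

```python
def format_id(number: int, prefix: str = "W") -> str:
    """Format a serial number as one prefix letter followed by six digits."""
    if len(prefix) != 1 or not prefix.isalpha():
        raise ValueError("prefix must be a single letter")
    if not 0 <= number <= 999_999:
        raise ValueError("number must fit in six digits")
    # Zero-pad so that every ID is exactly seven characters long.
    return f"{prefix}{number:06d}"


print(format_id(123))        # W000123
print(format_id(42, "S"))    # S000042
```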

(Should be needless to say, but for the record - this is low priority of course. I mainly wanted to add my vote to having a unique boQwI'-driven ID.)

@De7vID
Owner

De7vID commented Apr 30, 2020

The ID field isn't empty. It is populated (see renumber.py) when the dictionary is compiled to binary, but the numbers are sequentially assigned and hence not stable (i.e., they change if new entries are added).

Do you require the ID to be stable?
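To illustrate why sequentially assigned numbers are not stable, here is a small sketch. It is not the actual `renumber.py`. It assumes entries are simply numbered in list order, and the function name and word list are made up for the example.

```python
def renumber(names):
    """Assign 1-based sequential IDs to entries in list order."""
    return {name: index for index, name in enumerate(names, start=1)}


before = renumber(["bach", "batlh", "bIQ"])
# A new entry is added between existing ones.
after = renumber(["bach", "baH", "batlh", "bIQ"])

print(before["batlh"], after["batlh"])  # 2 3
```

Every entry after the insertion point gets a different number, so an ID recorded against one compiled version may point at a different entry in the next.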

@Cyberman-tM
Contributor

Ah, I see - I'm using the raw XML, so for me the field is always empty :-)
"require" is too strong a word. I would like it to be stable - AND in the XML.
But it's not really a huge problem for me to create my own ID.

I do think it would be useful for others as well, though I don't know how many are using the boQwI' database.
