-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pgsql schema tables to triple store aka graphification #1
Comments
It would be best if you gave examples. I think that what you said was this: Given a table
you want to get
and likewise for the next row, is that right? This is opposed to the current default map of
which preserves the original row structure. Yes, I like that idea! it would use more memory, but I like the "denormalized" nature of this. It does make foreign keys a bit awkward; see next post. I propose adding a "roll your own mapping" to the API, and this would be a good test case for that. Basically, you as a user, could define any kind of mapping you'd like. |
There is a bit of nastiness with foreign keys. Suppose there is a second table
The point here is that the primary key is unique for that table, but the foreign key, there will typically be duplicates. (The foreign key references the primary key of the earlier table.) Then you get the somewhat awkward content
and somehow, outside of the system, you would need to know that |
One can permute things around, but this does not seem to gain anything. Instead of writing
one can write
But this does not feel "better", just "different". It's still the same triplet of three things: |
thanks, @linas! your first example is exactly what i'm thinking, except i don't think the table reference in the predicate name is necessary unless there are tables with different variables with the same name? is that even allowed in a sql schema? i also don't think that the foreign key duplication is a problem. in your example, doesn't the duplication just mean that samples 24 and 25 have the same values (line 42) for the variables that live in the table referenced by that foreign key? i think spivak's category theory representation of sql schema shows that a well formed set of tables is isomorphic to a unique set of triples. |
Hi @mjsduncan
Two different tables can have columns with the same name, holding completely different data. Thus only
They necessarily have different values. That's the whole point of foreign keys -- to allow many-to-one mappings. If it was always one-to-one, then you wouldn't use two tables, you would just use one table.
Heh. Long before Spivak, and before the internet, this was called "table normalization" and it is covered in textbooks, and consistently caused confusion in the poor students studying this stuff. You can see evidence of this even today, with assorted blog posts and stack-exchange questions about "table normalization". You will still see db admins explaining why they can't actually do it, due to reason xyz. You will see plenty of DBs designed by people who "never got the concept", and no one ever came around afterwards to clean it up. And don't get me started on software written by grad students. It is almost the worst, lowest-quality software out there. I say almost, because software written by full professors is even worse. You might believe I'm saying this to be obnoxious or bombastic or gate-keeping, but it's true -- I worked at IBM for a while, and we'd get regular requests from both IBM Research, internally, and from external universities, wishing to license code for resale. We'd look at it, and the code would always be trash, you'd get estimates like "it would take us 30 man-years to rewrite the code, at a cost of $10 million, and the most we could ever make selling it would be $5 million because there are only 1000 users world-wide" type of estimates. This is the hard reality of this kind of stuff, why proprietary software is so freakin expensive, and why open source became really popular. Open-source overcame this crazy cost-vs-maintainability, (re-)design & bug-fixing barrier. In short, I pin no great hopes on software and datasets from the genomics community; I figure its mostly written by grad students and professors. Maybe it's clean ... but the likelihood of that seems low. |
so insuring non-duplication of table names would have to be carried after import or by normalizing the db beforehand. i haven't tried spivak et al's categorical databse software cql but the theory is interesting because among other things it shows that schema map to simplicial sets, and the formulation can handle duplicate primary keys so you could potentially keep track of multiple probabilistically weighted entries. |
anyway, from the agi point of view, sucking an external db into the atomspace via the bridge is like a metabolic modality in cyberspace allowing absorption of dbs found in the environment to harvest for new knowledge! |
Hi @mjsduncan Some short remarks:
The atomspace bridge does not "import" the data. It is a two-way bridge, or a window between two places, or a "view" of the SQL data. The goal of writing the bridge was to avoid all the down-sides of data import/export. It gives you live access to a live database: changes there are reflected in the Atomspace, and vice-versa. It hooks the two together, unifying them into one. A harness.
This is not usually something that is either easy or quick. Whoever designed some particular SQL did it that way for a reason, and it's not going to be easy to do archeology to figure out why they did it that way, or to just "fix it" without risking data loss & inconsistency. But I suppose it depends on the DB.
But this is very easy: just give each DB a name! Like ... one line of code!
I dunno, maybe; but why the heck would you want that? If you've got one database, with a column called "id" and it has peoples phone numbers, and a second database, with a column called "id", and its their social security numbers, why would you want to mash all that into one big ball?
Careful. Theory is always interesting, but making things easy-to-use is much harder. That's why I keep asking for specific, concrete examples. Actually written in actual Atomese, reflecting what actual users would have to go through, day-to-day. The user experience is what matters. |
hi @linas, my use case is actually importing a snapshot of the data in the database as knowledge for the system. i don't need the extra table structure meant to optimize user queries. the theory shows that there is a minimal canonical graph/triple store representation of the database and that is what i want to put into the atomspace. the ultimate end user is the agi! |
Hi @mjsduncan If you can explicitly demonstrate what you want, then maybe I could do that. I cannot make progress without some explicit sketch of how to do this. |
hi @linas ! again, you laid out in your first couple comments exactly what i'm looking for. as the category theory makes clear (and you seem to agree) it's the natural thing to do. you say the mapping from the foreign key value to the primary key is "outside the system" but isn't that connection an explicit part of the schema? |
@linas here is a rough example of a way to deal with keys:
i know below is wrong but it represents keys better than just a number
then there would need to be a pre or post processing step to remove redundant columns |
Hi @mjsduncan
How does this differ from the existing structure? It appears that you took a key that is a number, and just appended the string Please be aware that SQL allows keys to be numbers, strings, dates, timestamps, jpeg images, ... anything at all. The current bridge code does not treat keys differently than other columns: if a column is a number, it will turn into a Yes, I envision allowing a user to specify a translation mapping "convert table X column c1, c2 into Atoms type t1, t2" but just ... haven't gotten around to that yet.
I don't understand what you're trying to accomplish there.
Have you actually tried to use the current code, to import the flybase dataset, and/or to use the browser to explore it? When I tried to import the flybase in full, it got up over 100 GB, and I gave up at that point. The browser only loads as much as needed to browse some particular row or column, and seemed to work well, as long as I didn't tumble over some column with a bazillion rows in it. But I know nothing of drosophilia or genetics, so got bored after a while. |
hi linus! the idea is that each distinct column becomes a link between the primary key in each row and the value of that row for that column. i think sql-ness guarantees that there is a unique way to do this
The text was updated successfully, but these errors were encountered: