Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tackle the dreaded multi-column primary key issue in meta.row_id. #30

Closed
erichanson opened this issue Dec 12, 2023 · 2 comments
Closed

Comments

@erichanson
Copy link
Collaborator

erichanson commented Dec 12, 2023

Right now, meta.relation supports compound primary keys via the redundant relation.primary_key_column_ids and relation.primary_key_column_names columns (though these should exist on meta.table, #9):

dev=# \d meta.relation
                             View "meta.relation"
          Column          |       Type       | Collation | Nullable | Default 
--------------------------+------------------+-----------+----------+---------
 id                       | meta.relation_id |           |          | 
 schema_id                | meta.schema_id   |           |          | 
 schema_name              | text             | C         |          | 
 name                     | text             | C         |          | 
 type                     | text             | C         |          | 
 primary_key_column_ids   | meta.column_id[] |           |          | 
 primary_key_column_names | text[]           | C         |          | 

However, the meta.row_id type only supports single column primary keys:

dev=# dev=# \d meta.row_id
              Composite type "meta.row_id"
     Column     | Type | Collation | Nullable | Default 
----------------+------+-----------+----------+---------
 schema_name    | text |           |          | 
 relation_name  | text |           |          | 
 pk_column_name | text |           |          | 
 pk_value       | text |           |          | 

Brief and ultimately irrelevant rant: This is such a awful and horrible space, because in the opinion of the author, having a single-column primary key on every table makes the world outside the database so much simpler, especially URLs and such, while hurting almost nothing if properly used. Multi-column unique constraints are of course also essential, especially with JOIN tables where say monkey is joined with trainer through the monkey_trainer table which has a monkey_id and a trainer_id. Each monkey has multiple trainers and each trainer trains multiple monkeys. unique(monkey_id, trainer_id), lovely. But, there's nothing about this approach that is orthoganal to still having a unique column to identify the row itself, aka monkey_trainer.id. To me it seems like a best practice, because, as systems evolve, maybe you want to add some columns to the relationship between each monkey and each trainer, like a affinity score, so what was formerly just a JOIN table becomes a thing, a description of their relationship. This happens in schema evolution all the time. Now there is a table originally designed as "just a join table" whose pk is two columns, and it's identity as a row is conflated with it's function as a join table. Yes, you can say that the relationship's identity is the pks monkey_id and trainer_id, but to what gain? Minor space saving, and a confounding and complication of the entire space of row identity. One argument was that when merging databases, serial integer primary key columns collided; so don't use integer for primary key, use UUID.

Really, most tools don't even fully solve this issue. Many ORMs just force you to have a single-column primary key and then take shortcuts to make "pretty" URLs like /stories/23. So pretty. So wrong. I'll pick on PostgREST even though I love it, because they took a different kind of shortcut:

Many APIs distinguish plural and singular resources using a special nested URL convention e.g. /stories vs /stories/1. Why do we use /stories?id=eq.1? The answer is because a singular resource is (for us) a row determined by a primary key, and primary keys can be compound (meaning defined across more than one column). The more familiar nested urls consider only a degenerate case of simple and overwhelmingly numeric primary keys. These so-called artificial keys are often introduced automatically by Object Relational Mapping libraries.

Admittedly PostgREST could detect when there is an equality condition holding on all columns constituting the primary key and automatically convert to singular. However this could lead to a surprising change of format that breaks unwary client code just by filtering on an extra column. Instead we allow manually specifying singular vs plural to decouple that choice from the URL format.

This actually works, and is better than most projects handle it, but it comes at a cost. It means that rows basically don't have URL identifiers. A REST request just says "give me all the rows with an id of 1" without any knowledge of whether or not id is actually a primary key, so results are always an array of rows rather than a single row. You can force the request to just return the first row instead of an array of results, but this isn't really primary key awareness. Primary key awareness lets your client know it is asking for exactly zero or one rows when supplying the primary key(s). Zero rows returns a 404 and one row returns the row.

So, we have to support this pattern completely, even though it complicates the entire space of row-identity, because people use it and sometimes even like it. Aquameta is supposed to be as opinion-free as possible, even when the author thinks it's an anti-pattern. Sigh.

The most straight-forward way to address this is to change meta.row_id to, instead of having pk_column_name and pk_value attributes, make them both arrays, pk_column_names[] and pk_values[]. This is simple and straight-forward, but befuddles so much code with this logic of, ok, how many columns does the pk have, over and over again, everywhere. It might be the best we can do, or...

There might be a way to encapsulate multi-column pk logic in another meta id, meta.primary_key_id, that contains one or more column_ids. It would mean that meta.row_id would have to change from pk_column_name to just pk or primary_key, whose type is primary_key_id. primary_key is singular even though it might contain multiple columns, because even in SQL, they say primary key (a,b,c,d) not primary keys(...). With this encapsulation, we could then have some nice helper functions for casting to where clauses and URLs. Bears some experimentation.

As an aside, this is the last of the Known Wretchedness in meta. Once this is fixed, we might dare to turn this thing loose on the pgxns and dpkgs of the world.

@erichanson
Copy link
Collaborator Author

erichanson commented Dec 13, 2023

-- key type
create type meta.key as (column_names text[], values text[]);                                      

-- single column constructor
create function meta.key (column_name text, value text) returns meta.key as $$                
    select row(array[column_name], array[value])::meta.key;                                   
$$ language sql;

-- multiple column constructor
create function meta.key (column_names text[], "values" text[]) returns meta.key as $$        
    select row(column_names, values)::meta.key;
$$ language sql;

-- change meta.row_id to this:
create type meta.row_id as (schema_name text, relation_name text, pk meta.key);

-- single pk-column constructor
create function meta.row_id(schema_name text, relation_name text, pk_column_name text, pk_value text) as $$
    select row(...)
$$ language sql;

-- multiple pk-coumn constructor
create function meta.row_id(schema_name text, relation_name text, pk_column_names text[], pk_values text[]) as $$
    select row(...)
$$ language sql;

Heck yeah. Starting a new branch for this, definitley a huge breaking change. Won't surface til 0.5.

@erichanson
Copy link
Collaborator Author

erichanson commented Dec 19, 2023

Went a different route on this and decided not to introduce another composite type for key. Instead, keep it simple. row_id is structured as follows:

create type row_id(schema_name text, relation_name text, pk_column_names text[], pk_values text[]);

The big change is that pk_column_names and pk_values are plural, text arrays. This realy isn't too bad to work with. There are constructors for row_id that also just take a single column name and single pk_value, so you're sure a table just has a single pk, they work just the same as things did before. Under the hood though the row_id puts the single value into an array regardless.

Then there's a _pk_stmt() templating function, that is very helpful when working with dynamic queries that need to use compound primary keys. It takes pk_column_names[] and pk_values[] text arrays, along with a template and delimiter, and "hydrate" the template with the values passed in, one for each item in the array:

-- helper function for iterating primary key arrays and generating a stmt fragment.
-- template is rendered by format(), using positional argument notation.
--     1: pk_column_names[i]
--     2: pk_values[i]
--     3: i

select _pk_stmt (
    array['id','other_id'],
    array[public.uuid_generate_v4()::text,public.uuid_generate_v4()::text],
    '(row_id).pk_values[%3$s] = x.%1$I'
);
                              _pk_stmt                               
---------------------------------------------------------------------
 (row_id).pk_values[1] = x.id and (row_id).pk_values[2] = x.other_id

meta.field_id also includes the same pk_column_names and pk_values structural changes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant