Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pure pgSQL implementation #2

Open
shayded-exe opened this issue May 1, 2019 · 11 comments
Open

Pure pgSQL implementation #2

shayded-exe opened this issue May 1, 2019 · 11 comments

Comments

@shayded-exe
Copy link

shayded-exe commented May 1, 2019

Hello, I created a pure pgSQL implementation of the time-based generation method.

Hopefully this is useful for those of us using a service like AWS RDS where we can't install C-based extensions.

It doesn't support customizing interval_count, but that should be trivial to add.

CREATE OR REPLACE FUNCTION generate_sequential_uuid(p_interval_length int DEFAULT 60)
  RETURNS uuid
  LANGUAGE plpgsql
AS $$
DECLARE
  v_i int;
  v_time bigint;
  v_bytes int[16] = '{}';
  v_hex text[16] = '{}';
BEGIN
  v_time := floor(extract(epoch FROM clock_timestamp()) / p_interval_length);
  v_bytes[1] := v_time >> 8 & 255;
  v_bytes[2] := v_time & 255;

  FOR v_i IN 3..16 LOOP
    v_bytes[v_i] := floor(random() * 256);
  END LOOP;

  FOR v_i IN 1..16 LOOP
    v_hex[v_i] := lpad(to_hex(v_bytes[v_i]), 2, '0');
  END LOOP;

  RETURN array_to_string(v_hex, '');
END $$;
@jackturnbull
Copy link

jackturnbull commented May 6, 2019

Thanks for posting this. I decided to go with this implementation. What's worth considering for people attempting to write an equivalent function within their application code is that some languages, or at least Crystal, use a floor-rounded integer value for their float to integer conversion. In the above it appears to be that Postgres rounds the integer value of the float returned from extract(epoch FROM clock_timestamp()) / p_interval_length.

This caught me out for a while because I could see that with a interval length of 60, the UUID was ticking upwards at each half-minute within Postgres, but my Crystal implementation was ticking exactly on the minute.

I decided to round the above snippet rather than in my application code, so I'm using;

v_time := floor(extract(epoch FROM clock_timestamp()) / p_interval_length);

Where my application implementation is;

require "uuid"

struct UUID
  def self.sequential_random(random = Random::Secure, variant = Variant::RFC4122, version = Version::V4)
    new_bytes = random.random_bytes(16)
    v_time = (Time.utc.to_unix_f / 60).to_u
    new_bytes[0] = (v_time >> 8 & 255).to_u8;
    new_bytes[1] = (v_time & 255).to_u8;

    # `new` calls the default UUID constructor, so the byte-array to UUID conversion happens in that function.
    new(new_bytes, variant, version)
  end
end

Now I don't think too many people will have use for the Crystal snippet but this implementation detail might help others implementing it in their own application software.

@shayded-exe
Copy link
Author

@jackturnbull Interesting note about the rounding. I suppose that would only be an issue if you're generating ids both on Postgres and in your application and storing them in the same context.

Glad you found it useful! :)

@dinhduongha
Copy link

I also created another version - next_uuid - compatible with ulid. Hope this help.

CREATE OR REPLACE FUNCTION next_uuid(OUT result uuid) AS $$
DECLARE
    now_millis bigint;
    second_rand bigint;
    hex_value text;
    shard_id int:=1;
BEGIN
    -- Can use clock_timestamp() / statement_timestamp()
    select (extract(epoch from transaction_timestamp())*1000)::BIGINT+(extract(milliseconds from transaction_timestamp()))::BIGINT INTO now_millis;
    select ((random() * 10^18)::BIGINT) INTO second_rand;
    -- Uncomment below line to ignore sharding.
    -- select ((random() * 10^6)::INT) INTO shard_id;
    hex_value := RPAD(TO_HEX(now_millis), 12, '0')||LPAD(TO_HEX(shard_id), 4, '0')||LPAD(TO_HEX(second_rand), 16, '0');
    result := CAST(hex_value AS UUID);
END;
$$ LANGUAGE PLPGSQL;

@tvondra
Copy link
Owner

tvondra commented Sep 24, 2019

Nice! I wonder how to best track those, so that users can find them. Keeping them buried in a github issue is not great, because people are unlikely to find them here. OTOH I can't just add them to the extension as that won't work on managed services (which I think is the main point of those SQL-only variants).

I think two things might work

  1. Creating a separate SQL-only extension (still does not work on RDS-like environments, but it's easier to get working compared to extension with C functions).

  2. Adding them to the docs, and referencing them from the docs as an alternative for cases where installing a C extension is not possible/desirable.

I think (2) is fine, but I'd like to know your opinions as it's your code.

@shayded-exe
Copy link
Author

shayded-exe commented Sep 24, 2019

@tvondra I think just linking them in the readme should suffice. I published mine as a gist so it's easier to link to https://gist.github.com/rshea0/f4c2e26829d82ed8d38eb5e6e6374ec2

I imagine that most people won't bother installing such a small SQL-only extension anyways and will probably just copy the function into their DB scripts.

@daesu
Copy link

daesu commented May 30, 2020

@PachowStudios thanks for the snippet. Maybe I'm missing something but order by id is giving unordered results using the uuid's generated with this. Am I missing something ?

@shayded-exe
Copy link
Author

@daesu Over what period of time were the ids generated? The ids will only be ordered if they were generated within the same interval period. You can increase the interval length if needed. The point isn't to have every id be ordered, but for them to not be completely random.

@tvondra Maybe you can comment on this? Am I correct?

@KnowZero
Copy link

KnowZero commented Sep 14, 2020

I haven't tried the C version, but doing a quick benchmark of 100,000 all that is here

Baseline = EXPLAIN ANALYZE SELECT uuid_generate_v4() FROM generate_series (1,100000) AS g;
Execution Time: 395.623 ms

@PachowStudios = EXPLAIN ANALYZE SELECT generate_sequential_uuid() FROM generate_series (1,100000) AS g;
Execution Time: 3181.545 ms

@dinhduongha = EXPLAIN ANALYZE SELECT next_uuid() FROM generate_series (1,100000) AS g;
Execution Time: 1351.426 ms

pgulid with base32 removed:
EXPLAIN ANALYZE SELECT generate_ulid_uuid() FROM generate_series (1,100000) AS g;
Execution Time: 712.480 ms

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE OR REPLACE FUNCTION generate_ulid_uuid()
RETURNS uuid
AS $$
DECLARE

  timestamp  BYTEA = E'\\000\\000\\000\\000\\000\\000';
  unix_time  BIGINT;

BEGIN
  -- 6 timestamp bytes
  unix_time = (EXTRACT(EPOCH FROM clock_timestamp()) * 1000)::BIGINT;
  timestamp = SET_BYTE(timestamp, 0, (unix_time >> 40)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 1, (unix_time >> 32)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 2, (unix_time >> 24)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 3, (unix_time >> 16)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 4, (unix_time >> 8)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 5, unix_time::BIT(8)::INTEGER);


  RETURN encode( timestamp || gen_random_bytes(10) ,'hex')::uuid;
END
$$
LANGUAGE plpgsql
VOLATILE;

So generate_sequential_uuid non-C version seems to be almost 8X slower than uuid_generate_v4. next_uuid seems to be about 3.416X slower and pgulid implementation is only 1.8X slower.

I think the pgulid implementation is the way to go. Do note it doesn't seem to be compatible with next_uuid since they use different methods of timestamp accuracy?

Edit:
The other downside is you are dependent on pgcrypto extension, which is installed by default with contrib. That said, I wanted to see what happens if we merge both version of next_uuid and pgulid.

EXPLAIN ANALYZE SELECT generate_ulid_uuid2() FROM generate_series (1,100000) AS g;
Execution Time: 615.667 ms

It's actually faster! And only 1.556X slower than uuid_generate_v4. That said, random is postgres and many languages isn't true random, so the pgcrypto version is probably still more reliable I guess?

CREATE OR REPLACE FUNCTION generate_ulid_uuid2()
RETURNS uuid
AS $$
DECLARE

  timestamp  BYTEA = E'\\000\\000\\000\\000\\000\\000';
  unix_time  BIGINT;

BEGIN

  unix_time = (EXTRACT(EPOCH FROM clock_timestamp()) * 1000)::BIGINT;
  timestamp = SET_BYTE(timestamp, 0, (unix_time >> 40)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 1, (unix_time >> 32)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 2, (unix_time >> 24)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 3, (unix_time >> 16)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 4, (unix_time >> 8)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 5, unix_time::BIT(8)::INTEGER);


  RETURN (encode( timestamp ,'hex')||LPAD(TO_HEX((random() * 10^6)::INT), 4, '0')||LPAD(TO_HEX((random() * 10^18)::BIGINT), 16, '0'))::uuid;
END
$$
LANGUAGE plpgsql
VOLATILE;

Edit2:

Also, if someone wants to give up ULID compatibility for larger random numbers pool but doesn't mind that it will wrap around in 84 years. (based on the pgcrypto implementation with same speed). What it does is effectively start epoch at 2020 instead of 1970, which allows 1 less character being used for timestamp without loss of precision, but it is limited to only 84 years after which it wraps around.

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE OR REPLACE FUNCTION generate_uuid_84years(
	)
    RETURNS uuid
    LANGUAGE 'plpgsql'

    COST 100
    VOLATILE 
    
AS $BODY$
DECLARE

  timestamp  BYTEA = E'\\000\\000\\000\\000\\000';
  unix_time  BIGINT;

BEGIN

  unix_time = ((EXTRACT(EPOCH FROM clock_timestamp())-1577836800) * 1000)::BIGINT;

  timestamp = SET_BYTE(timestamp, 0, (unix_time >> 32)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 1, (unix_time >> 24)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 2, (unix_time >> 16)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 3, (unix_time >> 8)::BIT(8)::INTEGER);
  timestamp = SET_BYTE(timestamp, 4, unix_time::BIT(8)::INTEGER);

  RETURN encode( timestamp || gen_random_bytes(11) ,'hex')::uuid;
END
$BODY$;

@Tostino
Copy link

Tostino commented Jan 22, 2021

Here is another implementation of the time based generator in plpgsql, with the same parameter support as the C extension: https://gist.github.com/Tostino/ca104bff40e704b6db70b9af492664ef

@tac-adam-brusselback
Copy link

tac-adam-brusselback commented May 8, 2024

Here is an improved implementation that is ~2.5x faster than the implementation I left above (as Tostino):

CREATE FUNCTION uuid_time_nextval(interval_length int DEFAULT 60, interval_count int DEFAULT 65536)
    RETURNS uuid
    LANGUAGE plpgsql
AS $BODY$
DECLARE
    v_time bigint;
    v_uuid bytea := E'\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000\\000'; -- Zero-filled bytea
    v_prefix_bytes int := 0;
    v_i int;
BEGIN
    -- Validate input parameters
    IF interval_length < 1 THEN
        RAISE EXCEPTION 'length of interval must be a positive integer';
    END IF;
    IF interval_count < 1 THEN
        RAISE EXCEPTION 'number of intervals must be a positive integer';
    END IF;

    -- Compute the time-based prefix
    v_time := floor(extract(epoch FROM clock_timestamp()) / interval_length);

    -- Calculate the number of bytes needed for the prefix
    v_i := interval_count;
    WHILE v_i > 1 LOOP
            v_i := v_i / 256;
            v_prefix_bytes := v_prefix_bytes + 1;
        END LOOP;

    -- Set the time-based prefix bytes
    FOR v_i IN 0..v_prefix_bytes-1 LOOP
            v_uuid := set_byte(v_uuid, v_i, ((v_time >> (8 * (v_prefix_bytes - v_i - 1))) & 255)::integer);
        END LOOP;

    -- Generate random bytes for the rest of the UUID
    FOR v_i IN v_prefix_bytes..15 LOOP
            v_uuid := set_byte(v_uuid, v_i, (floor(random() * 256))::int);
        END LOOP;

    -- Return the hexadecimal representation of the UUID
    RETURN encode(v_uuid, 'hex');
END $BODY$;

@alecmev
Copy link

alecmev commented May 27, 2024

A bigint version, with ~4.5 hour wrap-around:

create function id() returns bigint as $$ begin
  return (
    floor((extract(epoch from clock_timestamp()) / 60) % 256)::int8::bit(8)
    || ('x' || encode(gen_random_bytes(7), 'hex'))::bit(56)
  )::bit(64)::bigint;
end; $$ language plpgsql;

And an even further text-based departure, with log2(alphabet_size ^ number_of_bytes) bits of entropy:

create function id() returns text as $$ declare
  bytes bytea;
  alphabet bytea = '34abcdefhijkmnoprstwxy';
  output text = '';
begin
  bytes = set_byte(
    gen_random_bytes(14),
    0,
    floor((extract(epoch from clock_timestamp()) / 60) % 256)::integer
  );
 
  for i in 0..(length(bytes) - 1) loop
    output = output || chr(get_byte(
      alphabet,
      get_byte(bytes, i) % length(alphabet)
    ));
  end loop;
   
  return output;
end; $$ language plpgsql;

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

9 participants