
When the public key/node ID changes, use a database with a separate namespace #225

Closed
Tracked by #118
richardhuaaa opened this issue Oct 15, 2024 · 14 comments

Comments

richardhuaaa (Contributor) commented Oct 15, 2024

As a starting point (which may or may not be the best approach), our unit tests create namespaced databases like this:

// newCtlDB opens a connection to the control database, which is used to
// create and drop per-test databases.
func newCtlDB(t testing.TB) (*sql.DB, string, func()) {
	return openDB(t, localTestDBDSNPrefix+localTestDBDSNSuffix)
}

// newInstanceDB creates a randomly named database, runs migrations against it,
// and returns a cleanup function that drops it again.
func newInstanceDB(t testing.TB, ctx context.Context, ctlDB *sql.DB) (*sql.DB, string, func()) {
	dbName := "test_" + RandomStringLower(12)
	_, err := ctlDB.Exec("CREATE DATABASE " + dbName)
	require.NoError(t, err)

	db, dsn, cleanup := openDB(t, localTestDBDSNPrefix+"/"+dbName+localTestDBDSNSuffix)
	require.NoError(t, migrations.Migrate(ctx, db))

	return db, dsn, func() {
		cleanup()
		_, err = ctlDB.Exec("DROP DATABASE " + dbName)
		require.NoError(t, err)
	}
}

The easiest option would be to derive the namespace name from the public key, so that during rolling upgrades all new nodes roll onto the same database.

neekolas (Contributor) commented Nov 26, 2024

I think this is a good idea that we should implement. It stops us from needing to "wipe the database" and instead lets us move over to new databases, with the possibility of rolling back if we made a mistake. It also handles rolling upgrades well, since new code will write to the new DB while old code is still running smoothly.

The remaining question is "what are the inputs to the database identifier". We can create an arbitrary set of inputs, then derive a hash to be used in the DB identifier.

Node public key is one input. At least until we decide to add a per-smart-contract originator ID, we can also include the addresses of all the smart contracts. So, if you change the smart contract we start fresh with a new DB.

We could have some "generation ID" in the code that we can manually increment to force upgrades. This can live in the code, allowing us to deploy releases which force a DB upgrade. This should be a "last resort" option, since wiping the DB without changing the smart contract could lead to incompatibilities. But there are cases where it makes sense. For example, wire format changes to messages that don't impact the smart contracts. These should be rare.

So, the database name would look something like xmtpd-${HASH(CONCAT($GENERATION_ID, $NODE_PRIVATE_KEY, $NODES_CONTRACT_ADDRESS, $MESSAGES_CONTRACT_ADDRESS, $IDENTITY_UPDATES_CONTRACT_ADDRESS))}
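A minimal sketch of that derivation (the function name, field order, and encoding here are assumptions, not the shipped implementation):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// namespace derives a deterministic database name from the inputs that define
// a node's identity. Hypothetical: the real implementation may hash different
// fields or encode them differently.
func namespace(generationID, nodePrivateKey, nodesContract, messagesContract, identityUpdatesContract string) string {
	h := sha256.Sum256([]byte(generationID + nodePrivateKey + nodesContract + messagesContract + identityUpdatesContract))
	// Truncate the digest so the name stays well under Postgres's
	// 63-byte identifier limit.
	return "xmtpd_" + hex.EncodeToString(h[:])[:12]
}

func main() {
	fmt.Println(namespace("1", "0xkey", "0xnodes", "0xmsgs", "0xidents"))
}
```

One detail worth deciding up front: this sketch uses an underscore separator rather than the `xmtpd-` hyphen, because a hyphen would force the identifier to be double-quoted in every Postgres statement.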

To implement we'll need to do some gymnastics to connect to the postgres DB on startup, make sure the DB exists, and then update the connection string to point to the right database.
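Those gymnastics might look roughly like this. Everything below is a sketch: `replaceDBName` and `ensureNamespacedDB` are assumed names, the duplicate-database check is simplified, and real code would also run migrations and wait for readiness.

```go
package main

import (
	"database/sql"
	"fmt"
	"net/url"
	"strings"
	// _ "github.com/lib/pq" // a real build would register a Postgres driver
)

// replaceDBName swaps the database path in a URL-style Postgres DSN for the
// derived namespace, leaving credentials, host, and options untouched.
func replaceDBName(dsn, dbName string) (string, error) {
	u, err := url.Parse(dsn)
	if err != nil {
		return "", err
	}
	u.Path = "/" + dbName
	return u.String(), nil
}

// ensureNamespacedDB connects with the bootstrap DSN, creates the namespaced
// database if it is missing, then reconnects to the new database.
func ensureNamespacedDB(bootstrapDSN, dbName string) (*sql.DB, error) {
	admin, err := sql.Open("postgres", bootstrapDSN)
	if err != nil {
		return nil, err
	}
	defer admin.Close()

	// CREATE DATABASE has no IF NOT EXISTS form, so tolerate a concurrent
	// creator. Production code should match SQLSTATE 42P04 rather than the
	// error text.
	if _, err := admin.Exec(fmt.Sprintf("CREATE DATABASE %q", dbName)); err != nil &&
		!strings.Contains(err.Error(), "already exists") {
		return nil, err
	}

	nodeDSN, err := replaceDBName(bootstrapDSN, dbName)
	if err != nil {
		return nil, err
	}
	return sql.Open("postgres", nodeDSN)
}

func main() {
	dsn, _ := replaceDBName("postgres://xmtp:xmtp@localhost:5432/postgres?sslmode=disable", "xmtpd_0a1b2c3d")
	fmt.Println(dsn)
}
```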

richardhuaaa (Contributor, Author)

If changing the generation ID switches to a new database without changing the public key, will that cause weird behavior replicating to other nodes during rolling upgrades?

mkysel removed their assignment Nov 26, 2024
neekolas (Contributor)

will that cause weird behavior replicating to other nodes during rolling upgrades?

Almost certainly. But they should be upgrading soon after and wiping their own databases, since the generation ID is in the code. Once everyone upgrades it'll quiet down, and anything that happened before the upgrade will be thrown away. So it will be noisy but self-resolving.

mkysel (Contributor) commented Nov 26, 2024

I am not a huge fan of the generation ID, mostly because I see no issues with deriving a new ID from the public key/node ID plus the contract addresses: that gives us a clear boundary. A new public key means a new node using the same PSQL server, which is safe, and a new contract means a full reality reset. Allowing a manual reset would really complicate things, as it could lead to inconsistencies.

neekolas (Contributor)

Fair concern. If we didn't have a generation ID, we would need to normalize creating new public/private keypairs as part of upgrades. That would be fine, but it takes a little more hand-holding with partners when they upgrade: we'd need to re-register them in the nodes contract, etc.

richardhuaaa (Contributor, Author)

But they should be upgrading soon after and wiping their own databases, since the generation ID is in the code. Once everyone upgrades it'll quiet down, and anything that happens before the upgrade will be thrown away. So, will be noisy but self-resolve.

I guess what I'm worried about is nodes after upgrading syncing updates from nodes that haven't upgraded yet. If I interpret that situation correctly, it won't self-resolve. And I think the same issue also happens if you update the contract addresses.

neekolas (Contributor)

So, the concern is what would happen if you have Node A (wiped DB) talking to Node B (not yet updated/wiped) and had replicated some of their pre-wipe messages?

Then Node B wipes and their originator sequence ID goes back to 0, causing conflicts?

Hmm. Maybe we do want to keep this simple and force everyone to update their private keys/nodes contract address whenever we do an upgrade. That way they have no way to talk to one another between versions.

mkysel (Contributor) commented Nov 26, 2024

I assumed that this would be hidden behind some test flag that explicitly states that it is hacky. In testnet/mainnet, the only way to wipe an instance should be to create a new nodeId.

neekolas (Contributor)

Ya. This falls into the category of "training wheels" that we'd need to remove before mainnet.

richardhuaaa (Contributor, Author)

Seems like there are two use-cases here:

  1. A specific node wants to start from scratch. In that case, namespacing the DB on the private key, and changing the private key, should solve it.
  2. Wanting to nuke the entire network and start from scratch. I'm wondering if we should just get good at running a script that automates this:
    1. Deploy new nodes contract
    2. Generate however many node keys we want
    3. Insert new node keys into contract
    4. Replace private key of all existing nodes

richardhuaaa (Contributor, Author) commented Nov 26, 2024

Alternatively, we could change the name derivation to:

xmtpd-${HASH(CONCAT($NODE_PRIVATE_KEY, $NODES_CONTRACT_ADDRESS))}

And use the act of setting a new node registry address as the act of resetting the network. Then we don't need to change the node keys - once all nodes have been reset, we can add them all to the registry.

neekolas (Contributor) commented Nov 26, 2024

Then we don't need to change the node keys - once all nodes have been reset, we can add them all to the registry.

If we add the nodes to the registry only after wiping the DBs, all the nodes are going to crash-loop until they appear in the registry, since they can't find themselves. Maybe that's fine, but it leads to more downtime and noise.

If the HTTP addresses and public keys aren't changing, the old nodes will still be able to connect to the new nodes and read messages, since the connection info in the old contract will work. That's not the end of the world, since they will wipe their DBs eventually but is still a little messy.

I'm wondering if we should just get good at running a script that automates this:

If we control all the nodes that script is straightforward enough. But when other people are running the nodes, there's a coordination cost. It feels hacky for us to be handing node operators their private keys. They should generate them locally and send us the public keys.

neekolas (Contributor) commented Nov 26, 2024

Let's make this specific: an upgrade plan for us to use the new namespaced databases to migrate testnet to the new chain/nodes contract.

xmtpd-${HASH(CONCAT($NODE_PRIVATE_KEY, $NODES_CONTRACT_ADDRESS))}

  1. Update Terraform with the new contract addresses
  2. Deploy
  3. Add the nodes to the contract after they have deployed, at which point they will stop crash looping and connect to any other nodes that are in the contract

xmtpd-${HASH(CONCAT($NODE_PRIVATE_KEY))}

  1. Generate new public/private keypairs for all nodes
  2. Deploy new Nodes smart contract
  3. Add nodes to the new smart contract
  4. Update Terraform with new private keys and contract addresses
  5. Deploy

Conclusion

I like the xmtpd-${HASH(CONCAT($NODE_PRIVATE_KEY, $NODES_CONTRACT_ADDRESS))} plan. It leads to some noisy deploys, but guarantees that after the wipe you are only talking to other nodes that have also been wiped (assuming we sequence adding nodes to the contract correctly)

neekolas added a commit that referenced this issue Dec 2, 2024
### TL;DR
Added support for namespaced databases with automatic creation and migration capabilities.

### What changed?
- Introduced `NewNamespacedDB` function to create and manage isolated database instances
- Refactored database connection logic into smaller, focused functions
- Added `createNamespace` function to handle database creation
- Implemented `waitUntilDBReady` for connection readiness checks
- Added unit tests to verify namespaced database functionality
- Made test database DSN constants public for reuse

### How to test?
1. Run the new unit test `TestNamespacedDB`
2. Verify that a new database is created with the specified namespace
3. Confirm that the created database is properly initialized with migrations
4. Check that the database name matches the provided namespace

### Why make this change?
This change enables better isolation between different database instances, allowing for separate namespaces per instance. This is particularly useful for testing scenarios and multi-tenant applications where data separation is crucial.

## Ticket

#225

mkysel (Contributor) commented Dec 3, 2024

@neekolas has this been resolved?

neekolas closed this as completed Dec 3, 2024