Ensure gocql handles zero-token nodes properly #226

Closed
dkropachev opened this issue Aug 4, 2024 · 7 comments

@dkropachev
Collaborator

dkropachev commented Aug 4, 2024

scylladb/scylladb#19684 makes it possible to have coordinator-only nodes (also called zero-token nodes).
Nodes of this type are going to be supported only with Raft.

Such nodes, despite being registered in the cluster, do not own any tokens and should be excluded from query routing.
This feature is already present in Cassandra but is not merged into Scylla yet, so we might want to start testing it on our drivers with Cassandra first.

Difference between cassandra and scylla implementation

The major difference is that these nodes are absent from system.peers and system.peers_v2 in Cassandra, while in the Scylla implementation they are going to be present there.

Because of this, we will need to test the Apache and DataStax drivers against Scylla as well.
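
Since the Scylla implementation keeps zero-token nodes in system.peers, a driver has to tell them apart itself. Below is a minimal sketch of that split, assuming a zero-token node is reported with an empty token list (an assumption, not something confirmed in this issue). `peerRow` and `splitPeers` are hypothetical names for illustration and are not part of gocql.

```go
package main

// peerRow is a hypothetical, simplified view of one row read from
// system.peers; it is not a gocql type.
type peerRow struct {
	Address string
	DC      string
	Tokens  []string
}

// splitPeers separates token-owning nodes from zero-token nodes.
// Assumption: a zero-token node is reported with an empty token list.
func splitPeers(rows []peerRow) (routable, zeroToken []peerRow) {
	for _, r := range rows {
		if len(r.Tokens) == 0 {
			zeroToken = append(zeroToken, r)
			continue
		}
		routable = append(routable, r)
	}
	return routable, zeroToken
}
```

Only the `routable` slice would then be handed to the host selection policy; the `zeroToken` slice can still be kept for topology bookkeeping.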

Approx. Testing plan

Regular cluster

  1. Spin up a cluster with 3 nodes
  2. Join one additional node in zero-token mode by setting join_ring to false in its configuration, or by adding -Dcassandra.join_ring=false to the command line (Cassandra only).
  3. Make sure that the driver works as expected and does not throw any errors while reading the schema with this node in the cluster.
  4. Make sure that the driver works as expected and does not throw any errors while processing topology events (if such events are issued) when such a node joins or leaves the cluster.
  5. Make sure that the zero-token node does not participate in routing.
  6. Test whether the driver works properly if the only contact point provided is the zero-token node.
  7. Ensure that at no point the driver throws an error or logs a warning caused by the presence of the zero-token node.
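
Step 5 can be checked by recording which coordinator served each query and comparing that against the known zero-token nodes. A rough sketch of the comparison follows; how the coordinator addresses are collected (tracing, a query observer, server-side metrics) is left out, and `zeroTokenCoordinators` is a hypothetical helper, not a gocql function.

```go
package main

// zeroTokenCoordinators returns, in first-seen order and without
// duplicates, every observed coordinator that is a known zero-token node.
// An empty result means routing never touched a zero-token node.
func zeroTokenCoordinators(observed []string, zeroToken map[string]bool) []string {
	var offenders []string
	seen := make(map[string]bool)
	for _, addr := range observed {
		if zeroToken[addr] && !seen[addr] {
			seen[addr] = true
			offenders = append(offenders, addr)
		}
	}
	return offenders
}
```

A test would run a batch of queries, pass the recorded addresses to this helper, and fail if the returned slice is not empty.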

Cluster that starts with zero-token node (DROPPED)

  1. Start single node cluster with join_ring=false
  2. Connect to it to make sure that the driver session is created and every query ends in a no host available error.
  3. Populate the cluster with 3 more nodes.
  4. Make sure that the driver can execute queries.
  5. Ensure that at no point the driver throws an error or logs a warning.

Zero-token Datacenter

Repeat this scenario for the following policies:

  1. DCAwareRoundRobinPolicy
  2. TokenAwareHostPolicy(DCAwareRoundRobinPolicy())
  3. TokenAwareHostPolicy(RoundRobinHostPolicy())

For DCAwareRoundRobinPolicy use three variants:

  1. Target the first DC, which has real nodes
  2. Target the second DC, which has only zero-token nodes
  3. (For drivers that support it; gocql does not) Do not target any DC, and make sure that the policy won't pick a datacenter with no real nodes.

Steps:
  1. Start a cluster of 2 nodes in 1 DC
  2. Provision 2 more nodes into a 2nd DC in join_ring=false mode
  3. Connect to the cluster using the policy, and make sure that the driver session is created and every query is scheduled to regular nodes and executed successfully. In cases when the zero-token DC is targeted, queries are supposed to fail with a no host available error
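
To make the expected behaviour concrete, here is a simplified DC-aware round-robin sketch that skips zero-token hosts and reports "no hosts available" when the targeted DC contains nothing else. It is an illustration under assumptions, not gocql's DCAwareRoundRobinPolicy: `host`, `dcRoundRobin`, and `errNoHosts` are hypothetical names, and a real driver may report the failure differently.

```go
package main

import "errors"

// errNoHosts stands in for a driver's "no hosts available" error.
var errNoHosts = errors.New("no hosts available")

// host is a hypothetical, simplified host record; it is not a gocql type.
type host struct {
	Addr   string
	DC     string
	Tokens int // number of tokens owned; 0 means a zero-token node
}

// dcRoundRobin rotates over the token-owning hosts of one datacenter.
type dcRoundRobin struct {
	localDC string
	next    int
}

// pick returns the next eligible host, or errNoHosts when the target
// datacenter has no token-owning node (for example a zero-token DC).
func (p *dcRoundRobin) pick(hosts []host) (host, error) {
	var eligible []host
	for _, h := range hosts {
		if h.DC == p.localDC && h.Tokens > 0 {
			eligible = append(eligible, h)
		}
	}
	if len(eligible) == 0 {
		return host{}, errNoHosts
	}
	h := eligible[p.next%len(eligible)]
	p.next++
	return h, nil
}
```

With two regular nodes in the first DC and two zero-token nodes in the second, a policy targeting the first DC alternates between its two nodes, while a policy targeting the second DC returns `errNoHosts` on every call.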

Links

Original umbrella issue in scylladb/scylladb repo: scylladb/scylladb#19693
Core issue to bring join_ring option into scylla: scylladb/scylladb#6527
PR that brings this feature in scylladb/scylladb#19684

@sylwiaszunejko
Collaborator

sylwiaszunejko commented Oct 30, 2024

@dkropachev I created a PR for the things I discovered when testing the first scenario, but the second scenario is impossible to run: when I try to start a single-node cluster with join_ring=false, I get an error: ERROR 2024-10-30 11:37:46,746 [shard 0:main] init - Startup failed: std::runtime_error (Cannot start the first node in the cluster as zero-token)

@dkropachev
Collaborator Author

> @dkropachev I created a PR for things I discover when testing first scenario, but second scenario is impossible to use because when I try to start single node cluster with join_ring=false I have an error: ERROR 2024-10-30 11:37:46,746 [shard 0:main] init - Startup failed: std::runtime_error (Cannot start the first node in the cluster as zero-token)

Thanks, it looks like it is impossible. Let's focus on the zero-token DC case then.

@sylwiaszunejko
Collaborator

sylwiaszunejko commented Nov 5, 2024

> In cases when zero-token DC is targeted queries suppose to fail with no host available error

@dkropachev Is it OK if it fails with an error like this when using DCAwareRoundRobinPolicy(zero_token_database)?

2024/11/05 12:44:16 Unable to connect to cluster: gocql: unable to create session: gocql: datacenter datacenter2 in the policy was not found in the topology - probable DC aware policy misconfiguration

Apart from that, I didn't find any incorrect behavior related to zero-token nodes.

@sylwiaszunejko
Collaborator

@dkropachev ping

@dkropachev
Collaborator Author

@sylwiaszunejko, it needs some context, but I am looking for the following scenarios, where:
datacenter2 is a zero-token datacenter
target host - the host you feed to NewCluster
target dc - the DC name you feed to DCAwareRoundRobinPolicy

  1. target host = any host from datacenter1, target dc = datacenter1. It should succeed, and you should be able to execute queries
  2. target host = any host from datacenter2, target dc = datacenter1. It should succeed, and you should be able to execute queries
  3. target host = any host from datacenter1, target dc = datacenter2. It should fail with the same error you provided
  4. target host = any host from datacenter2, target dc = datacenter2. It should fail with the same error you provided
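
The four scenarios above share one rule: the outcome depends only on the target dc, never on the target host. A small sketch of that matrix as data a table-driven test could iterate over, with all names (`scenario`, `buildMatrix`) being hypothetical and not part of gocql:

```go
package main

// scenario is one row of the matrix above.
type scenario struct {
	ContactDC string // DC of the host passed to NewCluster
	PolicyDC  string // DC name passed to DCAwareRoundRobinPolicy
	WantFail  bool   // whether the scenario is expected to fail
}

// buildMatrix enumerates every contact-DC / policy-DC combination.
// A scenario is expected to fail exactly when the policy targets a
// zero-token DC; the contact point's DC does not matter.
func buildMatrix(dcs []string, zeroTokenDC map[string]bool) []scenario {
	var out []scenario
	for _, policyDC := range dcs {
		for _, contactDC := range dcs {
			out = append(out, scenario{contactDC, policyDC, zeroTokenDC[policyDC]})
		}
	}
	return out
}
```

Called with datacenter1 and datacenter2, and datacenter2 marked as zero-token, this yields the four rows in the same order as the list above.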

@roydahan
Collaborator

Let's make sure we add a unit test for it.

@dkropachev
Collaborator Author

All scenarios have now been tested manually, and two PRs were merged to address the issues @sylwiaszunejko found:
#333
#324

Unfortunately, due to an architecture straight out of a horror movie, writing unit tests, let alone integration tests, for these scenarios is not simple. These tasks will be addressed separately:
#337
#338

I am closing this issue as fixed by:
#333
#324
