Split brain issue when migration from sentinel to sentinel #846

Open
tasszz2k opened this issue Jul 29, 2024 · 6 comments
Labels
type: question Further information is requested

Comments

@tasszz2k

Issue Description

Redis Sentinel Architecture
When the source Redis is deployed with the sentinel architecture and RedisShake uses sync_reader to connect to the master database, the master treats RedisShake as a slave, and sentinel may elect it as the new master.

To avoid this situation, you should select a standby database as the source.

  • I migrated from the slave node of the source cluster to the master node of the destination cluster. It worked fine.
  • However, if the source cluster fails over during the migration, the migration fails (that is OK, I understand why).
    The problem is that the destination cluster ends up in a split-brain state: the destination sentinels now track nodes from both the source cluster and the destination cluster. (See the sentinel output under Additional Information.)

Environment

  • RedisShake Version: v3
  • Redis Source Version: 6.2.5
  • Redis Destination Version: 6.2.5
  • Redis Deployment (sentinel/cluster/sentinel): sentinel -> sentinel
  • Deployed on Cloud Provider: in-house kubernetes

Logs

If there are any error logs or other relevant logs, please provide them here.

source extra config
2024-07-29 08:56:54 INF GOOS: linux, GOARCH: amd64
2024-07-29 08:56:54 INF Ncpu: 2, GOMAXPROCS: 2
2024-07-29 08:56:54 INF pid: 136
2024-07-29 08:56:54 INF pprof_port: 6060
2024-07-29 08:56:54 INF metrics url: http://localhost:8080
2024-07-29 08:56:54 INF auth successful. address=[rs-host:6379]
2024-07-29 08:56:54 INF redisWriter connected to redis successful. address=[rs-host:6379]
2024-07-29 08:56:54 INF no password. address=[rs-temp4.redis-sentinel-dev:6379]
2024-07-29 08:56:54 INF psyncReader connected to redis successful. address=[rs-temp4.redis-sentinel-dev:6379]
2024-07-29 08:56:54 INF start save RDB. address=[rs-temp4.redis-sentinel-dev:6379]
2024-07-29 08:56:54 INF send [replconf listening-port 10007]
2024-07-29 08:56:54 INF send [PSYNC ? -1]
2024-07-29 08:56:54 INF receive [FULLRESYNC f239bc3e8e9082b1987b4829b8f0d65658f890af 1440733]
2024-07-29 08:56:54 INF source db is doing bgsave. address=[rs-temp4.redis-sentinel-dev:6379]
2024-07-29 08:56:54 INF source db bgsave finished. timeUsed=[0.05]s, address=[rs-temp4.redis-sentinel-dev:6379]
2024-07-29 08:56:54 INF received rdb length. length=[178]
2024-07-29 08:56:54 INF create dump.rdb file. filename_path=[dump.rdb]
2024-07-29 08:56:54 INF save RDB finished. address=[rs-temp4.redis-sentinel-dev:6379], total_bytes=[178]
2024-07-29 08:56:54 INF start send RDB. address=[rs-temp4.redis-sentinel-dev:6379]
2024-07-29 08:56:54 INF start save AOF. address=[rs-temp4.redis-sentinel-dev:6379]
2024-07-29 08:56:54 INF RDB version: 9
2024-07-29 08:56:54 INF AOFWriter open file. filename=[1440733.aof]
2024-07-29 08:56:54 INF RDB AUX fields. key=[redis-ver], value=[6.2.7]
2024-07-29 08:56:54 INF RDB AUX fields. key=[redis-bits], value=[64]
2024-07-29 08:56:54 INF RDB AUX fields. key=[ctime], value=[1722243414]
2024-07-29 08:56:54 INF RDB AUX fields. key=[used-mem], value=[1923136]
2024-07-29 08:56:54 INF RDB repl-stream-db: 0
2024-07-29 08:56:54 INF RDB AUX fields. key=[repl-id], value=[f239bc3e8e9082b1987b4829b8f0d65658f890af]
2024-07-29 08:56:54 INF RDB AUX fields. key=[repl-offset], value=[1440733]
2024-07-29 08:56:54 INF RDB AUX fields. key=[aof-preamble], value=[0]
2024-07-29 08:56:54 INF send RDB finished. address=[rs-temp4.redis-sentinel-dev:6379], repl-stream-db=[0]
2024-07-29 08:56:55 INF AOFReader open file. aof_filename=[1440733.aof]
2024-07-29 08:57:04 INF Detect data sent by reader, stop pinging
2024-07-29 08:57:04 INF goroutine 21 [running]:  [runtime/debug.Stack()]<-runtime/debug/stack.go:24 +0x65  [github.com/alibaba/RedisShake/internal/log.Panicf({0x7b87fc, 0x45}, {0xc00008b778, 0x4, 0x4})]<-github.com/alibaba/RedisShake/internal/log/func.go:27 +0x36  [github.com/alibaba/RedisShake/internal/writer.(*redisWriter).flushInterval(0xc000267480)]<-github.com/alibaba/RedisShake/internal/writer/redis.go:88 +0x369  [created by github.com/alibaba/RedisShake/internal/writer.NewRedisWriter]<-github.com/alibaba/RedisShake/internal/writer/redis.go:37 +0x19c  [
2024-07-29 08:57:04 PNC redisWriter received error. error=[EOF], argv=[ping], slots=], reply=[<nil>]
panic: redisWriter received error. error=[EOF], argv=[ping], slots=], reply=[<nil>]

goroutine 21 [running]:
github.com/rs/zerolog.(*Logger).Panic.func1({0xc0001b00a0, 0x0})
    github.com/rs/[email protected]/log.go:375 +0x2d
github.com/rs/zerolog.(*Event).msg(0xc000112300, {0xc0001b00a0, 0x4d})
    github.com/rs/[email protected]/event.go:156 +0x2b8
github.com/rs/zerolog.(*Event).Msgf(0xc000112300, {0x7b87fc, 0x21d}, {0xc0000d1f78, 0x7a03e6, 0x3})
    github.com/rs/[email protected]/event.go:129 +0x4e
github.com/alibaba/RedisShake/internal/log.Panicf({0x7b87fc, 0x45}, {0xc0000d1f78, 0x4, 0x4})
    github.com/alibaba/RedisShake/internal/log/func.go:32 +0xef
github.com/alibaba/RedisShake/internal/writer.(*redisWriter).flushInterval(0xc000267480)
    github.com/alibaba/RedisShake/internal/writer/redis.go:88 +0x369
created by github.com/alibaba/RedisShake/internal/writer.NewRedisWriter
    github.com/alibaba/RedisShake/internal/writer/redis.go:37 +0x19c
Stream closed EOF for zlpsaas-dev/redis-migration-532a16bf-903b-4e25-97ee-8c7793c0e095-0-zpnkf (redis-shake)

Additional Information

Redis Source Cluster (1 master, 1 slave and 3 sentinel servers)

I have no name!@rfs-source5-75659ff8b7-bp7gw:/data$ redis-cli -p 26379
127.0.0.1:26379> sentinel masters
1)  1) "name"
    2) "mymaster"
    3) "ip"
    4) "172.16.175.242"
    5) "port"
    6) "6379"
    7) "runid"
    8) "94001b2edb7702056817fa3374bda2a3eaee47fe"
    9) "flags"
   10) "master"
127.0.0.1:26379> sentinel slaves mymaster
1)  1) "name"
    2) "172.16.80.247:6379"
    3) "ip"
    4) "172.16.80.247"
    5) "port"
    6) "6379"
    7) "runid"
    8) "54a241b5cb25790ff1970fdeba04bd01685da3ca"
    9) "flags"
   10) "slave"
   
#<== Failover here

127.0.0.1:26379> sentinel failover mymaster 
OK
127.0.0.1:26379> sentinel masters
1)  1) "name"
    2) "mymaster"
    3) "ip"
    4) "172.16.80.247"
    5) "port"
    6) "6379"
    7) "runid"
    8) "54a241b5cb25790ff1970fdeba04bd01685da3ca"
    9) "flags"
   10) "master"

Redis Destination Cluster (1 master, 0 slaves and 3 sentinel servers)

I have no name!@rfs-dest5-5b697974fd-kh6fp:/data$ redis-cli -p 26379
127.0.0.1:26379> sentinel masters
1)  1) "name"
    2) "mymaster"
    3) "ip"
    4) "172.16.175.244"
    5) "port"
    6) "6379"
    7) "runid"
    8) "afd550514d067234f8b6a5cebc9809201ef67014"
    9) "flags"
   10) "master"
   11) "link-pending-commands"
   12) "0"
127.0.0.1:26379> sentinel slaves mymaster
(empty array)

#<== Failover here

127.0.0.1:26379> sentinel masters
1)  1) "name"
    2) "mymaster"
    3) "ip"
    4) "172.16.80.247" # <== the master node of the source cluster
    5) "port"
    6) "6379"
    7) "runid"
    8) ""
    9) "flags"
   10) "master,disconnected"
127.0.0.1:26379> sentinel slaves mymaster
1)  1) "name"
    2) "172.16.175.244:6379" # <== the "old" master node of the destination cluster
    3) "ip"
    4) "172.16.175.244"
    5) "port"
    6) "6379"
    7) "runid"
    8) "afd550514d067234f8b6a5cebc9809201ef67014"
    9) "flags"
   10) "slave"
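
The takeover visible in the output above (the destination sentinels reporting 172.16.80.247, a source node, as master) can be checked mechanically. The following is a minimal sketch, not part of RedisShake: it assumes the `sentinel masters` reply has already been fetched as a flat list of alternating field names and values per master, and the helper names `pairs_to_dict` and `find_foreign_masters` are hypothetical.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple


def pairs_to_dict(flat: Sequence[str]) -> Dict[str, str]:
    """Turn ['name', 'mymaster', 'ip', '1.2.3.4', ...] into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


def find_foreign_masters(
    masters: Iterable[Sequence[str]], expected_ips: Set[str]
) -> List[Tuple[str, str]]:
    """Return (master name, ip) for every master whose ip is not expected."""
    foreign = []
    for flat in masters:
        info = pairs_to_dict(flat)
        if info.get("ip") not in expected_ips:
            foreign.append((info.get("name", ""), info.get("ip", "")))
    return foreign


# Reply modeled on the destination sentinel output after the failover.
after_failover = [[
    "name", "mymaster",
    "ip", "172.16.80.247",
    "port", "6379",
    "runid", "",
    "flags", "master,disconnected",
]]

# Only the destination master 172.16.175.244 is expected here.
suspicious = find_foreign_masters(after_failover, {"172.16.175.244"})
```

Running such a check against each destination sentinel during a migration would flag the problem as soon as the master address leaves the expected set.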
tasszz2k added the "type: question" label on Jul 29, 2024
@suxb201
Member

suxb201 commented Jul 29, 2024

Could you please provide the specific version of your RedisShake? Does it include the fix mentioned in this issue: #656 (513fc62a)?

@tasszz2k
Author

The version we are currently using is 2937df8.

@suxb201
Member

suxb201 commented Jul 30, 2024

Try the latest version, or modify the code to filter out __sentinel__:hello, just like what 513fc62a did.
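
Some background on why that channel matters: sentinels announce themselves by publishing to the __sentinel__:hello channel on the instances they monitor, and each message is a comma-separated string of eight fields (sentinel ip, port, run id, current epoch, master name, master ip, master port, master config epoch). If a replication-based tool relays those PUBLISH commands to the destination, the destination sentinels can presumably read announcements for a master with the same name but a source-cluster address. The sketch below parses such a payload and flags a foreign master; the sentinel address and run id in the example are made up, and `parse_hello` and `announces_foreign_master` are hypothetical helper names.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class SentinelHello:
    sentinel_ip: str
    sentinel_port: int
    sentinel_runid: str
    current_epoch: int
    master_name: str
    master_ip: str
    master_port: int
    master_config_epoch: int


def parse_hello(payload: str) -> SentinelHello:
    """Parse the 8-field, comma-separated hello payload."""
    parts = payload.split(",")
    if len(parts) != 8:
        raise ValueError(f"expected 8 fields, got {len(parts)}")
    return SentinelHello(
        sentinel_ip=parts[0],
        sentinel_port=int(parts[1]),
        sentinel_runid=parts[2],
        current_epoch=int(parts[3]),
        master_name=parts[4],
        master_ip=parts[5],
        master_port=int(parts[6]),
        master_config_epoch=int(parts[7]),
    )


def announces_foreign_master(hello: SentinelHello, own_ips: Set[str]) -> bool:
    """True when the hello advertises a master outside this cluster."""
    return hello.master_ip not in own_ips


# Hypothetical payload: a source sentinel announcing the source master.
example = parse_hello(
    "10.0.0.1,26379,aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa,5,"
    "mymaster,172.16.80.247,6379,5"
)
```

Because both clusters in this issue use the master name mymaster, a relayed hello is indistinguishable by name alone, which is consistent with the destination sentinels adopting the source address.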

@tasszz2k
Author

Try the latest version, or modify the code to filter out __sentinel__:hello, just like what 513fc62a did.

Let me try it.
Thanks.

@tasszz2k
Author

Thank you @suxb201, it saved the day.

@tasszz2k
Author

However, filtering only on cmd_name == "PUBLISH" and keys[1] == "__sentinel__:hello" does not work. After I extended the check to cmd_name == "PUBLISH" and (keys[1] == nil or keys[1] == '' or keys[1] == "__sentinel__:hello"), it works normally.

The final filter.lua is:

function filter(id, is_base, group, cmd_name, keys, slots, db_id, timestamp_ms)
    if cmd_name == "PING" then
        return 1, db_id -- disallow
    end
    if cmd_name == "REPLCONF" then
        return 1, db_id -- disallow
    end
    if cmd_name == "OPINFO" then
        return 1, db_id -- disallow
    end
    if cmd_name == "PUBLISH" and (keys[1] == nil or keys[1] == '' or keys[1] == "__sentinel__:hello") then
        return 1, db_id -- disallow
    end

    return 0, db_id -- always allow and redirect to the same db_id
end
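
To reason about what this filter drops without running RedisShake, the same condition can be mirrored offline. The sketch below is a Python port of the predicate only, not RedisShake code; `should_drop` is a hypothetical name, and Lua's keys[1] on an empty table is modeled as None. One consequence worth noting: whenever keys is empty for a PUBLISH command (which the behaviour described above suggests is the case here), the condition matches every PUBLISH, not only the sentinel hello messages.

```python
from typing import Optional, Sequence

# Commands the filter always drops.
ALWAYS_DROPPED = ("PING", "REPLCONF", "OPINFO")


def first_key(keys: Sequence[str]) -> Optional[str]:
    """Model Lua's keys[1]: None (nil) when the table is empty."""
    return keys[0] if keys else None


def should_drop(cmd_name: str, keys: Sequence[str]) -> bool:
    """Mirror of the filter.lua condition: True means 'disallow'."""
    if cmd_name in ALWAYS_DROPPED:
        return True
    if cmd_name == "PUBLISH":
        key = first_key(keys)
        return key is None or key == "" or key == "__sentinel__:hello"
    return False


# With no keys reported, any PUBLISH is dropped.
drops_all_publish = should_drop("PUBLISH", [])
# With a non-sentinel first key, PUBLISH passes through.
keeps_other_channel = not should_drop("PUBLISH", ["news"])
```

If application Pub/Sub traffic must reach the destination, the filter would need a way to inspect the channel argument rather than rely on keys.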
