[Bug]: Duplicate data when syncing data from Milvus upstream to downstream #145

anhnch30820 opened this issue Oct 18, 2024 · 21 comments

Current Behavior

[screenshot: dc (upstream)]
[screenshot: dr (downstream)]

Expected Behavior

No response

Steps To Reproduce

No response

Environment

No response

Anything else?

No response

@SimFG (Collaborator) commented Oct 18, 2024

Which version of Milvus are you using, and which version of cdc? Is there highly concurrent insert/delete traffic?

@anhnch30820 (Author)

@SimFG I am using Milvus 2.4.13 and cdc v2.0.0-rc2. TPS is 2.
My flow is:

  1. On the source Milvus, create collectionA, insert data, create an index, and flush.
  2. Back up collectionA, and restore collectionA on the target Milvus while inserting/deleting data on the source Milvus.
  3. Set the target Milvus's ttMsgEnabled to false.
  4. Create the cdc task using collectionA's backup positions (see the sketch after this list).
  5. Stop inserting/deleting data on the source Milvus.
  6. Set the target Milvus's ttMsgEnabled back to true.
  7. Check the data in the Attu GUI.
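
For step 4, a minimal sketch of what creating the task with positions might look like against the milvus-cdc HTTP server, assuming it listens on port 8444; the hosts, channel name, and position blob below are placeholders, and the exact request_data fields should be checked against the milvus-cdc README for your version:

```python
import requests

# Hypothetical endpoint and values; adjust to your deployment.
CDC_SERVER = "http://localhost:8444/cdc"

payload = {
    "request_type": "create",
    "request_data": {
        "milvus_connect_param": {
            "host": "target-milvus-host",  # downstream Milvus
            "port": 19530,
            "connect_timeout": 10,
        },
        "collection_infos": [{"name": "collectionA"}],
        # Per-channel start positions taken from the backup;
        # the base64 position blob is elided here.
        "positions": {"by-dev-rootcoord-dml_0": "<base64-position-from-backup>"},
    },
}

resp = requests.post(CDC_SERVER, json=payload)
resp.raise_for_status()
print(resp.json())
```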

@anhnch30820 (Author)

I just tested again with inserts only (no deletes), and I see:
total rows in source Milvus: 987
total rows in the checkpoint backup: 492
total rows in target Milvus: 987 + 492 = 1479

So the rows covered by the backup end up counted twice downstream: once from the restore and once replayed by the cdc task.

@SimFG (Collaborator) commented Oct 18, 2024

@anhnch30820 You can try using the cdc server from the latest main branch.

@anhnch30820 (Author) commented Oct 22, 2024

@SimFG I tried using the cdc server from the latest main branch, and I got an error when creating a task:

[INFO] [reader/etcd_op.go:566] ["get all collection data"] [count=2]
[INFO] [reader/replicate_channel_manager.go:162] ["has added dropped collection"] [ids="[]"]
[2024/10/22 04:54:20.822 +00:00] [INFO] [reader/collection_reader.go:241] ["the collection is not in the watch list"] [task_id=1af9cdba993148c69a6162f49040642b] [name=vdsmm] [collection_id=453395456371199765]
[2024/10/22 04:54:20.822 +00:00] [INFO] [reader/collection_reader.go:241] ["the collection is not in the watch list"] [task_id=1af9cdba993148c69a6162f49040642b] [name=vdsmb] [collection_id=453395456372200058]
[2024/10/22 04:54:20.822 +00:00] [DEBUG] [[email protected]/call.go:35] ["retrying of unary invoker"] [target=etcd-endpoints://0xc000841dc0/milvus-etcd:2379] [attempt=0]
[2024/10/22 04:54:20.824 +00:00] [INFO] [reader/etcd_op.go:710] ["get all partition data"] [partition_num=2]
[2024/10/22 04:54:20.824 +00:00] [INFO] [reader/etcd_op.go:742] ["partition state is not created/dropped or partition name is default"] [partition_name=_default] [state=PartitionCreated]
[2024/10/22 04:54:20.824 +00:00] [INFO] [reader/etcd_op.go:742] ["partition state is not created/dropped or partition name is default"] [partition_name=_default] [state=PartitionCreated]
[2024/10/22 04:54:20.824 +00:00] [INFO] [reader/collection_reader.go:319] ["has started to read collection and partition"] [task_id=1af9cdba993148c69a6162f49040642b]
[2024/10/22 04:54:20.824 +00:00] [INFO] [server/cdc_impl.go:332] ["create request done"]

@SimFG (Collaborator) commented Oct 22, 2024

From the log, it seems the create request was processed correctly.

@anhnch30820 (Author)

@SimFG But when I created a collection, nothing changed in the target cluster:

[2024/10/22 07:06:12.594 +00:00] [INFO] [reader/etcd_op.go:251] ["the collection state is not created"] [key=by-dev/meta/root-coord/database/collection-info/1/453395456372882628] [collection_name=vdsmb] [state=CollectionCreating]
[2024/10/22 07:06:13.680 +00:00] [INFO] [reader/etcd_op.go:389] ["partition state is not created or partition name is default"] [collection_id=453395456372882628] ["partition name"=_default] [state=PartitionCreated]
[2024/10/22 07:06:15.941 +00:00] [DEBUG] [[email protected]/call.go:35] ["retrying of unary invoker"] [target=etcd-endpoints://0xc0009e8700/milvus-etcd:2379] [attempt=0]
[2024/10/22 07:06:15.944 +00:00] [INFO] [reader/collection_reader.go:117] ["has watched to read collection"] [task_id=c72583aafca1470a9d8d04330f77445a] [collection_name=vdsmb] [collection_id=453395456372882628]
[2024/10/22 07:06:15.944 +00:00] [INFO] [reader/collection_reader.go:120] ["the collection should not be read"] [task_id=c72583aafca1470a9d8d04330f77445a] [collection_name=vdsmb] [collection_id=453395456372882628]
[2024/10/22 07:06:15.944 +00:00] [INFO] [reader/etcd_op.go:284] ["the collection is not consumed"] [collection_id=453395456372882628] [collection_name=vdsmb]

@SimFG (Collaborator) commented Oct 22, 2024

From the log, the collection in the source Milvus has not been created yet, because its state is still CollectionCreating. However, I suspect this problem is caused by residual data from previous runs. To rule that out, I suggest first cleaning up all environment data, such as the cdc's meta storage, and then redeploying the two Milvus clusters and the cdc service.

@anhnch30820 (Author)

@SimFG I tried again, and the data is still duplicated.

@SimFG (Collaborator) commented Oct 23, 2024

How are you testing this? Is it the following: insert data first, then delete data, then use Attu to check the number of rows? Do you wait a while before checking the row count? The deleted data may not have been applied yet; if you don't want to wait, you can try a flush first (a sketch follows).
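
For the flush-then-count check, a minimal pymilvus sketch, assuming the collection is named collectionA and Milvus is on localhost (count(*) is supported in Milvus 2.3+ and, unlike num_entities, reflects deletes):

```python
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")  # run once per cluster

coll = Collection("collectionA")
coll.flush()  # seal growing segments so pending inserts/deletes are persisted
coll.load()   # count(*) queries require the collection to be loaded

# Unlike coll.num_entities, count(*) accounts for deleted rows.
print(coll.query(expr="", output_fields=["count(*)"]))
```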

Can you find the diff data and check whether some delete operations have not taken effect? (One way to diff primary keys is sketched below.)
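
A sketch of one way to find the diff, assuming the primary key field is named pk (adjust to your schema) and both clusters are reachable: pull every primary key from each side with pymilvus's query_iterator and compare the multisets.

```python
from collections import Counter
from pymilvus import connections, Collection

def fetch_pks(alias: str, host: str) -> Counter:
    """Collect every primary key from one cluster. 'pk' is an assumed field name."""
    connections.connect(alias=alias, host=host, port="19530")
    coll = Collection("collectionA", using=alias)
    coll.load()
    counts: Counter = Counter()
    it = coll.query_iterator(batch_size=1000, expr="", output_fields=["pk"])
    while True:
        batch = it.next()
        if not batch:
            break
        counts.update(row["pk"] for row in batch)
    it.close()
    return counts

upstream = fetch_pks("up", "source-milvus-host")      # placeholder hosts
downstream = fetch_pks("down", "target-milvus-host")

duplicated = {pk for pk, n in downstream.items() if n > 1}
missing = set(upstream) - set(downstream)
print(f"duplicated downstream pks: {len(duplicated)}, missing downstream: {len(missing)}")
```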

@SimFG (Collaborator) commented Oct 23, 2024

Each PR is covered by integration tests, and the CDC process is tested every day. In theory, such a small amount of data should be unlikely to go wrong.

@anhnch30820 (Author) commented Oct 23, 2024

@SimFG Here is the upstream:
[screenshot: dc (1)]

And here is the downstream:
[screenshot: dr (1)]

@anhnch30820 (Author)

@SimFG Could you provide me with the latest milvus-cdc binary?

@SimFG (Collaborator) commented Oct 23, 2024

You can clone the repo and, in the repo directory, run `make build`.

@SimFG (Collaborator) commented Oct 23, 2024

Can you confirm whether the two Milvus clusters are completely independent? The downstream Milvus looks abnormal: the extra data appears to be one segment's rows being counted again in another segment.

318 = 169 + 149

@anhnch30820 (Author) commented Oct 23, 2024

@SimFG
318 rows came from milvus-cdc and 169 from the milvus-backup restore. It seems cdc reads all the data from the beginning rather than from the checkpoint.

@SimFG (Collaborator) commented Oct 23, 2024

@anhnch30820 Check whether the position is set incorrectly. You could first try creating the task without a position, to see whether cdc works properly at all.

@anhnch30820 (Author)

@SimFG That is not practical with a large amount of data.

@anhnch30820 (Author) commented Oct 24, 2024

In reality most pages have only 8 rows of data, but results 631 to 644 show 14 rows in the downstream:
[screenshot: attu]

I also checked the total row count with both code and Attu; the two agree on the downstream, but the total should be 100999, like the upstream:
[screenshot: total_entities]

@anhnch30820 (Author)

I created a backup of each cluster and compared their total capacity (dc is upstream, dr is downstream). The result shows that the dr cluster has almost twice the capacity:
[screenshot: compare_dc_dr]

@SimFG (Collaborator) commented Oct 24, 2024

This test is to check whether the position parameter is being passed in when the task is created. In addition, the Attu behavior looks like it is caused by duplicate data. I am currently developing a data-difference checking tool.
