Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically handle failed DBS migrations #8244

Closed
belforte opened this issue Feb 16, 2024 · 10 comments
Closed

Automatically handle failed DBS migrations #8244

belforte opened this issue Feb 16, 2024 · 10 comments

Comments

@belforte
Copy link
Member

belforte commented Feb 16, 2024

see #7469 (comment)

I did this [1] in a python shell in the Publisherl, following example in /afs/cern.ch/user/b/belforte/WORK/DBS/migDbg.py [2]
NOTE need to change from cmsweb to cmsweb-prod. migration server does not run on the "for users" cmsweb cluster

[1]
I replayed the migration request that failed in the TaskPublish script, then deleted the exisint migrationId and submitted again, eventually the new migration failed with status 4 which means "block already at destination", which is just fine. So I manually ran TaskPublish for that task and everything went OK

>>> migUrl='https://cmsweb-prod.cern.ch/dbs/prod/phys03/DBSMigrate'
>>> globUrl='https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader'
>>> phy3Url = 'https://cmsweb-prod.cern.ch/dbs/prod/phys03/DBSReader'
>>> apiG   = DbsApi(url=globUrl)
>>> apiP3  = DbsApi(url=phy3Url)
>>> apiMig = DbsApi(url=migUrl)
>>> 
>>> data= {'migration_url': globUrl, 'migration_input': b}
>>> data= {'migration_url': globUrl, 'migration_input': b}
>>> data
{'migration_url': 'https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader', 'migration_input': '/Zto2Nu-2Jets_PTNuNu-40to100_1J_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22EEMiniAODv4-130X_mcRun3_2022_realistic_postEE_v6-v1/MINIAODSIM#1d24b8c6-ef15-4ffc-8de6-ba80cb3a0d6e'}
>>> result = apiMig.submitMigration(data)
DBS Server error: [{'error': {'reason': 'migration request /Zto2Nu-2Jets_PTNuNu-40to100_1J_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22EEMiniAODv4-130X_mcRun3_2022_realistic_postEE_v6-v1/MINIAODSIM#1d24b8c6-ef15-4ffc-8de6-ba80cb3a0d6e is already exist in DB with id=4392537', 'message': 'Migration request /Zto2Nu-2Jets_PTNuNu-40to100_1J_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22EEMiniAODv4-130X_mcRun3_2022_realistic_postEE_v6-v1/MINIAODSIM#1d24b8c6-ef15-4ffc-8de6-ba80cb3a0d6e, id=0', 'function': 'dbs.migrate.SubmitMigration', 'code': 125, 'stacktrace': '\ngoroutine 144286 [running]:\ngithub.com/dmwm/dbs2go/dbs.Error({0xae4bc0?, 0xc0005f8060?}, 0x7d, {0xc00053c000, 0xcc}, {0xa0d471, 0x1b})\n\t/go/src/github.com/vkuznet/dbs2go/dbs/errors.go:185 +0x99\ngithub.com/dmwm/dbs2go/dbs.(*API).SubmitMigration(0xc0002f6000)\n\t/go/src/github.com/vkuznet/dbs2go/dbs/migrate.go:679 +0x5e5\ngithub.com/dmwm/dbs2go/web.DBSPostHandler({0xae81b0, 0xc00075c120}, 0xc000118200, {0x9fb1c9, 0x6})\n\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:561 +0x148a\ngithub.com/dmwm/dbs2go/web.MigrationSubmitHandler({0xae81b0?, 0xc00075c120?}, 0x454134?)\n\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:968 +0x2f\nnet/http.HandlerFunc.ServeHTTP(0xf30000c000476b01?, {0xae81b0?, 0xc00075c120?}, 0x0?)\n\t/usr/local/go/src/net/http/server.go:2109 +0x2f\ngithub.com/dmwm/dbs2go/web.limitMiddleware.func1({0xae81b0?, 0xc00075c120?}, 0x0?)\n\t/go/src/github.com/vkuznet/dbs2go/web/middlewares.go:111 +0x38\nnet/http.HandlerFunc.ServeHTTP(0x94c100?, {0xae81b0?, 0xc00075c120?}, 0x11?)\n\t/usr/local/go/src/net/http/server'}, 'http': {'method': 'POST', 'code': 400, 'timestamp': '2024-02-16 14:50:24.580880652 +0000 UTC m=+117787.938674225', 'path': '/dbs/prod/phys03/DBSMigrate/submit', 'user_agent': 'DBSClient/Unknown/', 'x_forwarded_host': 'dbs-prod.cern.ch', 'x_forwarded_for': '137.138.62.154', 'remote_addr': '10.100.39.0:36384'}, 'exception': 400, 'type': 'HTTPError', 'message': 'DBSError Code:125 Description:DBS Migration error Function:dbs.migrate.SubmitMigration Message:Migration request /Zto2Nu-2Jets_PTNuNu-40to100_1J_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22EEMiniAODv4-130X_mcRun3_2022_realistic_postEE_v6-v1/MINIAODSIM#1d24b8c6-ef15-4ffc-8de6-ba80cb3a0d6e, id=0 Error: migration request /Zto2Nu-2Jets_PTNuNu-40to100_1J_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22EEMiniAODv4-130X_mcRun3_2022_realistic_postEE_v6-v1/MINIAODSIM#1d24b8c6-ef15-4ffc-8de6-ba80cb3a0d6e is already exist in DB with id=4392537'}]
Traceback (most recent call last):
  File "/data/srv/TaskManager/v3.240111-ac252900ca96a6784d9e561ca231c954/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.12/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 474, in __callServer
    self.http_response = method_func(self.url, method, params, data, request_headers)
  File "/data/srv/TaskManager/v3.240111-ac252900ca96a6784d9e561ca231c954/slc7_amd64_gcc630/cms/py3-dbs3-pycurl/3.17.9/lib/python3.8/site-packages/RestClient/RestApi.py", line 42, in post
    return http_request(self._curl)
  File "/data/srv/TaskManager/v3.240111-ac252900ca96a6784d9e561ca231c954/slc7_amd64_gcc630/cms/py3-dbs3-pycurl/3.17.9/lib/python3.8/site-packages/RestClient/RequestHandling/HTTPRequest.py", line 62, in __call__
    raise HTTPError(effective_url, http_code, http_response.msg, http_response.raw_header, http_response.body)
RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 400: Bad Request

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/srv/TaskManager/v3.240111-ac252900ca96a6784d9e561ca231c954/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.12/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 1834, in submitMigration
    return self.__callServer("submit", data=migrationObj, callmethod='POST')
  File "/data/srv/TaskManager/v3.240111-ac252900ca96a6784d9e561ca231c954/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.12/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 477, in __callServer
    self.__parseForException(http_error)
  File "/data/srv/TaskManager/v3.240111-ac252900ca96a6784d9e561ca231c954/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.12/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 524, in __parseForException
    raise exp
  File "/data/srv/TaskManager/v3.240111-ac252900ca96a6784d9e561ca231c954/slc7_amd64_gcc630/cms/py3-dbs3-client/4.0.12/lib/python3.8/site-packages/dbs/apis/dbsClient.py", line 522, in __parseForException
    raise HTTPError(http_error.url, data['exception'], data['message'], http_error.header, http_error.body, serverCode(data))
RestClient.ErrorHandling.RestClientExceptions.HTTPError: HTTP Error 400: DBSError Code:125 Description:DBS Migration error Function:dbs.migrate.SubmitMigration Message:Migration request /Zto2Nu-2Jets_PTNuNu-40to100_1J_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22EEMiniAODv4-130X_mcRun3_2022_realistic_postEE_v6-v1/MINIAODSIM#1d24b8c6-ef15-4ffc-8de6-ba80cb3a0d6e, id=0 Error: migration request /Zto2Nu-2Jets_PTNuNu-40to100_1J_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22EEMiniAODv4-130X_mcRun3_2022_realistic_postEE_v6-v1/MINIAODSIM#1d24b8c6-ef15-4ffc-8de6-ba80cb3a0d6e is already exist in DB with id=4392537
>>> id=4392537
>>> status = apiMig.statusMigration(migration_rqst_id=id)
>>> status
[{'create_by': '/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch', 'creation_date': 1706784889, 'last_modification_date': 1706784889, 'last_modified_by': '/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch', 'migration_input': '/Zto2Nu-2Jets_PTNuNu-40to100_1J_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22EEMiniAODv4-130X_mcRun3_2022_realistic_postEE_v6-v1/MINIAODSIM#1d24b8c6-ef15-4ffc-8de6-ba80cb3a0d6e', 'migration_request_id': 4392537, 'migration_server': 'dbs2go-phys03-migration-58997db5d9-gp6m5', 'migration_status': 9, 'migration_url': 'https://cmsweb-prod.cern.ch:8443/dbs/prod/global/DBSReader', 'retry_count': 4}]
>>> mdic={'migration_rqst_id':id}
>>> apiMig.removeMigration({'migration_rqst_id':mid})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'mid' is not defined
>>> apiMig.removeMigration({'migration_rqst_id':id})
[{'status': 'success', 'migration_request_id': 4392537}]
>>> 
>>> result = apiMig.submitMigration(data)
>>> result
[{'migration_details': {'migration_request_id': 4415897, 'migration_url': 'https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader', 'migration_input': '/Zto2Nu-2Jets_PTNuNu-40to100_1J_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22EEMiniAODv4-130X_mcRun3_2022_realistic_postEE_v6-v1/MINIAODSIM#1d24b8c6-ef15-4ffc-8de6-ba80cb3a0d6e', 'migration_status': 5, 'migration_server': '', 'create_by': '/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch', 'creation_date': 1708095162, 'last_modified_by': '/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch', 'last_modification_date': 1708095162, 'retry_count': 0}, 'migration_report': 'Migration request is started', 'status': 'QUEUED', 'error': None}]
>>> id=4415897
>>> status = apiMig.statusMigration(migration_rqst_id=id)
>>>
[...]
>>>
>>> apiMig.statusMigration(migration_rqst_id=id)
[{'create_by': '/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch', 'creation_date': 1708095162, 'last_modification_date': 1708095162, 'last_modified_by': '/DC=ch/DC=cern/OU=computers/CN=tw/crab-prod-tw01.cern.ch', 'migration_input': '/Zto2Nu-2Jets_PTNuNu-40to100_1J_TuneCP5_13p6TeV_amcatnloFXFX-pythia8/Run3Summer22EEMiniAODv4-130X_mcRun3_2022_realistic_postEE_v6-v1/MINIAODSIM#1d24b8c6-ef15-4ffc-8de6-ba80cb3a0d6e', 'migration_request_id': 4415897, 'migration_server': 'dbs2go-phys03-migration-65dbd99c49-dptf6', 'migration_status': 4, 'migration_url': 'https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader', 'retry_count': 1}]
>>> 

[2] cat migDbg.py

from dbs.apis.dbsClient import DbsApi
migUrl='https://dbs3-test1.cern.ch/dbs/dev/global/DBSMigrate'
phy3Url = 'https://dbs3-test1.cern.ch/dbs/dev/global/DBSReader'
migUrl='https://cmsweb-testbed.cern.ch/dbs/int/phys03/DBSMigrate'
phy3Url = 'https://cmsweb-testbed.cern.ch/dbs/int/phys03/DBSReader'
migUrl='https://cmsweb.cern.ch/dbs/prod/phys03/DBSMigrate'
phy3Url = 'https://cmsweb.cern.ch/dbs/prod/phys03/DBSReader'
globUrl='https://cmsweb.cern.ch/dbs/prod/global/DBSReader'
apiG   = DbsApi(url=globUrl)
apiP3  = DbsApi(url=phy3Url)
apiMig = DbsApi(url=migUrl)

# pick  a dataset
dataset='/NonPrD0_pT-1p2_y-2p4_pp_13TeV_pythia8/RunIILowPUAutumn18DR-102X_upgrade2018_realistic_v15-v1/AODSIM'

#find blocks
bList=[]
for dic in apiG.listBlocks(dataset=dataset):
    bl1.append(dic['block_name'])

#submit migrations and collect their Id's
migrIds=[]
for b in bList:
    data= {'migration_url': globUrl, 'migration_input': b}
    result = apiMig.submitMigration(data)
    id = result.get("migration_details", {}).get("migration_request_id")
    migrIds.append(id)

#check status, to be repeated until OK (state=2) or terminally failed (state=9)
# https://github.com/dmwm/DBS/blob/625525721c3df9a9b842c84ad0ac29f55180524b/Client/src/python/dbs/apis/dbsClient.py#L1489-L1492
for id in migrIds:
        status = apiMig.statusMigration(migration_rqst_id=id)
        state = status[0].get("migration_status")
        retry_count = status[0].get("retry_count")
        print "migration id, state, retries : %d %d %s" % (id, state, retry_count)


# if there are old, failed migrations, need remove them in order to
# submit them again, the first loop will simly collect their ID's
for mig in migrIds:
    mdic={'migration_rqst_id':mid}
    apiMig.removeMigration({'migration_rqst_id':mid})
@belforte
Copy link
Member Author

belforte commented Feb 16, 2024

side task

  • fix cmsweb cluster name in /afs/cern.ch/user/b/belforte/WORK/DBS/migDbg.py and put in GH

@belforte
Copy link
Member Author

this happened again https://cms-talk.web.cern.ch/t/crab-job-pubilcation-failed/39892
and again I lost a couple hours on it.

Better fix the code.

@belforte
Copy link
Member Author

belforte commented Apr 29, 2024

take this change to

@belforte
Copy link
Member Author

dump of most recent error in human format (i.e. pprint)

{'error': {'code': 125,
           'function': 'dbs.migrate.SubmitMigration',
           'message': 'Migration request '
                      '/RelValTTbar_14TeV/CMSSW_14_0_0_pre3-PU_140X_mcRun4_realistic_v1_STD_2026D98_PU-v1/GEN-SIM-DIGI-RAW#c627abb1-b2eb-49a5-9c9f-174fbd82ca24, '
                      'id=0',
           'reason': 'migration request '
                     '/RelValTTbar_14TeV/CMSSW_14_0_0_pre3-PU_140X_mcRun4_realistic_v1_STD_2026D98_PU-v1/GEN-SIM-DIGI-RAW#c627abb1-b2eb-49a5-9c9f-174fbd82ca24 '
                     'is already exist in DB with id=4435739',
           'stacktrace': '\\ngoroutine 1684332 '
                         '[running]:\\ngithub.com/dmwm/dbs2go/dbs.Error({0xae4bc0?, '
                         '0xc00063a9c0?}, 0x7d, {0xc00069a140, 0xa0}, '
                         '{0xa0d471, '
                         '0x1b})\\n\\t/go/src/github.com/vkuznet/dbs2go/dbs/errors.go:185 '
                         '+0x99\\ngithub.com/dmwm/dbs2go/dbs.(*API).SubmitMigration(0xc000196000)\\n\\t/go/src/github.com/vkuznet/dbs2go/dbs/migrate.go:679 '
                         '+0x5e5\\ngithub.com/dmwm/dbs2go/web.DBSPostHandler({0xae81b0, '
                         '0xc000640108}, 0xc000d4aa00, {0x9fb1c9, '
                         '0x6})\\n\\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:561 '
                         '+0x148a\\ngithub.com/dmwm/dbs2go/web.MigrationSubmitHandler({0xae81b0?, '
                         '0xc000640108?}, '
                         '0x454134?)\\n\\t/go/src/github.com/vkuznet/dbs2go/web/handlers.go:968 '
                         '+0x2f\\nnet/http.HandlerFunc.ServeHTTP(0x890000c000c02c01?, '
                         '{0xae81b0?, 0xc000640108?}, '
                         '0x0?)\\n\\t/usr/local/go/src/net/http/server.go:2109 '
                         '+0x2f\\ngithub.com/dmwm/dbs2go/web.limitMiddleware.func1({0xae81b0?, '
                         '0xc000640108?}, '
                         '0x0?)\\n\\t/go/src/github.com/vkuznet/dbs2go/web/middlewares.go:111 '
                         '+0x38\\nnet/http.HandlerFunc.ServeHTTP(0x94c100?, '
                         '{0xae81b0?, 0xc000640108?}, '
                         '0x11?)\\n\\t/usr/local/go/src/net/http/serve'},
 'exception': 400,
 'http': {'code': 400,
          'method': 'POST',
          'path': '/dbs/prod/phys03/DBSMigrate/submit',
          'remote_addr': '10.100.164.0:47826',
          'timestamp': '2024-04-25 08:12:54.165349637 +0000 UTC '
                       'm=+1294416.993904549',
          'user_agent': 'DBSClient/Unknown/',
          'x_forwarded_for': '137.138.157.32',
          'x_forwarded_host': 'dbs-prod2.cern.ch'},
 'message': 'DBSError Code:125 Description:DBS Migration error '
            'Function:dbs.migrate.SubmitMigration Message:Migration request '
            '/RelValTTbar_14TeV/CMSSW_14_0_0_pre3-PU_140X_mcRun4_realistic_v1_STD_2026D98_PU-v1/GEN-SIM-DIGI-RAW#c627abb1-b2eb-49a5-9c9f-174fbd82ca24, '
            'id=0 Error: migration request '
            '/RelValTTbar_14TeV/CMSSW_14_0_0_pre3-PU_140X_mcRun4_realistic_v1_STD_2026D98_PU-v1/GEN-SIM-DIGI-RAW#c627abb1-b2eb-49a5-9c9f-174fbd82ca24 '
            'is already exist in DB with id=4435739',
 'type': 'HTTPError'}

@mapellidario
Copy link
Member

notice that the document["error"]["code"] == 125, which matches the list of error codes from dbs [1]. Maybe this issue is a good place to start with #7469

[1] https://github.com/dmwm/dbs2go/blob/28e02bde209af797af7d59d4f7e3baba25a98605/dbs/errors.go#L64

@belforte
Copy link
Member Author

Thanks Dario,
I am aware. And current code already uses the error dictionary above. But there is no use for the DBS error code here since it is the useless "Migration failure" and I have to parse the reason string :-(

As much as I hate to parse messages, I'd rather not ask for a change in semantic where the API returns success and reports the existing migrationId, as the old server was doing. I tried to "protect us against changes" via
dmwm/dbs2go#112 . Let's hope DBS maintainer(s) can at least merge that.

@belforte
Copy link
Member Author

belforte commented Apr 30, 2024

all of this is now in my branch https://github.com/belforte/CRABServer/tree/deal-with-failed-migrations-8244
and at least Publisher starts (on crab-dev-tw01). I will try some publications. As usual it is not easy to test migrations, and in particular failed migrations !

@belforte
Copy link
Member Author

belforte commented May 1, 2024

my test task got 50 files correctly published. So at least code is not badly broken by the changes to unify DBS access in the common PublisherDbsUtils.

@belforte
Copy link
Member Author

belforte commented May 2, 2024

put my branch on preprod and ran a couple Jenkins ST tests.
publication for FTS ASO (TaskPublish.py) appears all OK (of course no migrations).
But I get unexptected problems with Rucio ASO. That thread is now moved to #8376

In the meanwhile I am declaring changes to TaskPublish.py (the FTS one) tested enought to make a PR.

I still need to find a way to test migrations. But do not know how to quickly find a dataset which nobody used yet !

@belforte
Copy link
Member Author

belforte commented May 6, 2024

Closed via #8378

I could not test failed migrations, but at worst they will be still broken and I will debug when I have an example

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants