From cd719d7302156a67b70af8c545092b1a9c9c60c9 Mon Sep 17 00:00:00 2001
From: Dario Mapelli
Date: Mon, 28 Feb 2022 19:45:26 +0100
Subject: [PATCH] Master python3 manualmerge 3 (#3)

* parent 23707a194a1df38cd8cda928c21b6723001a95ea (#6818) Initial changes for python3. Make it possible to run with python3 on sched.
* use gocurl from CVMFS Fix #6822 (#6824)
* Belforte patch 1 (#6825)
* use gocurl from CVMFS Fix #6822 (#6823)
* add comment about py2/3 compatibility needs
* use status_cache in pickle format. Fix #6820 (#6829)
* Remove most old "Panda" code (#6835)
* remove PandaServerInterface. for #6542
* remove unused taskbuffer. For #6542
* remove useless comment about Panda. For #6542
* remove PanDAExceptions. For #6542
* disallow panda scheduler in regexp. for #6542
* Remove old crab cache code (#6833)
* remove code in UserFileCache. for #6776
* remove reference to UserFileCache in setup.py. For #6776
* remove all code references to UserFileCache. For #6776
* remove all calls to panda stuff in the code (#6836)
* remove panda fields. For #6542
* remove references to pandajobid DB column in code. For #6542
* remove panda-related JobGroup. For #6542
* remove useless calls to JobGroup. For #6542
* remove all references in code to panda, jobset and jobgroups. For #6542
* Move away mysql fix 6837 (#6838)
* add a place for obsolete code
* move MYSQL code to obsolete dir. Fix #6837
* remove Databases/TaskDB/Oracle/JobGroup from build. Fix #6839 (#6840)
* use urllib3 in place of urllib2 (#6841)
* remove couchDb related code. Easy part for #6834 (#6842)
* Proper fix for autom split (#6843)
* py3 fix for hashlib
* proper py3 porting of urllib2.urlopen
* remove old code. For #6845 (#6847)
* Remove couch db code (#6848)
* remove couchDb related code. Easy part for #6834
* remove CouchDB code from DagmanResubmitter. For #6845
* remove CouchDB code from PostJob. For #6845
* remove isCouchDBURL, now unused. For #6845
* one more cleanup in PostJob. For #6845
* one more cleanup in PostJob. For #6845
* restore code deleted by mistake
* [py3] src/python/Databases supports py2 and py3 (#6828)
* src/python/CRABInterface supports py3 (#6831)
* [py3] src/python/CRABInterface - changes suggested by futurize
* removed uses of deprecated panda code
* validate_str instead of validate_ustr, deprecated in WMCore
* a hack to make it run for minimal purposes (#6850)
* complete removal of unused taskbuffer
* stop trying to remove failed migrations from 2019. Fix #6854 (#6856)
* Port to python3 recent small fixes from master (#6858)
* use gocurl from CVMFS Fix #6822 (#6823)
* add comment about py2/3 compatibility needs (#6826)
* add GH remote for Diego
* upload new config version (#6852)
* stop trying to remove failed migrations from 2019. Fix #6854 (#6855) Co-authored-by: Daina <60326190+ddaina@users.noreply.github.com>
* better logging of acquired publication files. Fix #6860 (#6861)
* remove unused/undef variable. fix #6864 (#6865)
* Second batch of fixes for crabserver REST in py3. (#6873)
* HTCondorWorkflow: decode to str before parsing
* HTCondorWorkflow: convert to str output of literal eval
* slight improvement to stefano's `horrible hack`
* updated version of wmcore to 1.5.5 in requirements.txt
* Add more logging (#6877)
* add logging of tmp file removal
* avoid duplicating ids. Fix #6800
* get task (DAG) status from sched. Fix #6869 (#6874)
* get task (DAG) status from sched. Fix #6869
* improve comments
* rename cache_status_jel to cache_status and use it. Fix #6411 (#6878)
* validate both temp and final output LFNs. Fix #6871 (#6879)
* change back to use py3 for cache_status; last commit had changed by mistake to use python2 for cache_status
* make migration dbg Utils work in container. Fix #6853 (#6886)
* Py3 for publisher (#6887)
* ensure tasks is a list
* basestring -> string
* no need to cast to unicode
* use python3 to start TaskPublish
* REST and TW - correctly encode/decode input/outputs of b64encode/b64decode
* stop inserting nose in TW tarball. Fix #6455 (#6888)
* stop inserting nose in TW tarball. Fix #6455
* make sure CRAB3.zip exists, improve comments
* improve log
* port to python3 branch of https://github.com/dmwm/CRABServer/commit/87ada3b82688376fb2c8c90156689a7e05fd2638
* port to python3 branch of https://github.com/dmwm/CRABServer/pull/6891/commits/9a72d9e35066856ee141f99180153cbc9eb18060
* Make new publisher default (#6892)
* make NEW_PUBLISHER the default, fix #6412
* remove code switching NEW_PUBLISHER. Fix #6410
* add comments
* start Publisher in py3 env (#6894)
* stupid typo
* py3 crabserver compatible with tasks submitted by py2 crabserver (#6907) - tm_split_args: convert to unicode the values in the lists: 'lumis' and 'runs'
* crabserver py3 - change tag for build with jenkins (#6908)
* Make tw work in py3 for #6899 (#6901)
* Queue is now lowercase, xrange -> range
* use python3 to start TW
* start TW from python3.8 dir
* workaround ldap currently missing in py3 build
* basestring --> str
* use binary files for pickle
* make sure to handle classAd defined as bytes as well
* remove MonALISA code. Fix #6911 (#6913)
* TW - new tag of WMCore with fix to credential/proxy (#6915)
* TW - remove Logger and ProcInfo from setup.py and from bin/htcondor_make_runtime.sh (#6916)
* TW - remove Logger and ProcInfo from setup.py
* TW - remove Logger and ProcInfo from bin/htcondor_make_runtime.sh
* TW - remove apmon from setup.py
* TW - update tag of WMCore to mapellidario/py3.211214patch1
* setup.py - remove RESTInteractions from CRABClient build (#6919)
* generate Error on bad extconfig format, remove old code, cleanup. Fix #6897 See also https://github.com/dmwm/CRABServer/issues/6897#issuecomment-985317486 (#6910)
* better py3 comp. for authenticatedSubprocess. fix https://github.com/dmwm/CRABServer/issues/6899#issuecomment-998307151 (#6927)
* remove references to asourl/asodb in TW (#6929)
* [py3] apply py3-modernization changes to whole dmwm/CRABServer (#6921)
* [py3] migrated TW/Actions/ to py3
* [py3] fix open() mode: str for json, bytes for pickle
* [py3] fix use of hashlib.sha1(): input must be bytes
* TaskWorker/Actions/StageoutCheck: use execute_command, not executeCommand
* Publish utils for py3 (#6941)
* use python3 to run DebugFailedBlockPublication
* use python3 to run FindFailedBlockPublication
* make py3 compat and improve printout. Fix #6939
* optionally create new publication
* Fix task publish 6940 (#6942)
* avoid using undefined variable. Fix #6940
* make sure all calls to DBS are in try/except for #6940
* use Rucio client py2 for FTS_transfer.py. Fix #6948 (#6949)
* use Rucio client py2 for FTS_transfer.py. Fix #6948
* add comment about python version
* pass $XrdSecGSISRVNAMES to cmsRun. Fix #6953 (#6955) (#6956)
* Pre dag divide by zero fix 6926 (#6959)
* protect against probe jobs returning no events. Fix #6926
* some pylint cleanups
* Cleanup userproxy from rest fix 6931 (#6960)
* remove unused retrieveUserCert for #6931
* cleanup unused userproxy from REST fix #6931
* remove unused imports
* cleanup serverdn/serverproxy/serverkey from REST code. Fix #6961
* correct kill arguments. Fix #6928 (#6964)
* requirements.txt: update wmcore tag (#6966)
* REST-py3 backward compatible with publisher-py2 (#6967)
* Fix mkruntime 6970 (#6971)
* no need for cherrypy in TW tarball. Fix #6970
* place dummyFile in local dir and cleanup
* remove useless encode. fix #6972 (#6973)
* use $STARTDIR for dummyFile. (#6974)
* enable TaskWorker to use IDTOKENS. Fix #6903 (#6975)
* update requirements.txt to dmwm/WMCore 1.5.7 (#6982)
* use different WEB_DIR for token auth. Fix #6905 (#6983)
* correct check for classAd existence. Fix #6986 (#6987)
* define CRAB_UserHN ad for task_wrapper. Fix #6981 (#6988)
* no spaces around = in bash. properly fix #6981
* fix not py3-compatible pycurl.error handling in RESTInteractions (#6996)
* make Pre/Post/RetryJob use existing WEB_DIR. Fix #6994 (#6998)
* remove extra / in API name. Fix #7004 (#7005)
* Remove extra slash fix 7004 (#7006)
* remove extra / in API name. Fix #7004
* remove extra / in API name. Fix #7004
* restore NoAvailableSite exception for TW. Fix #7038 (#7039)
* make sure classAds for matching are ORDERED lists, fix #7043 (#7044)
* make sure eventsThr and eventsSize are not used if not initialized. Fix #7065 (#7066)
* Adjust code to work with new DBS Go based server (#6969) (#7074) Co-authored-by: Valentin Kuznetsov
* use python3 for FTS_transfers. Fix #6909 (#7052)
* adapt to new DBS serverinfo API (#7093)
* use WMCore 2.0.1.pre3 - Fix #7096 (#7097)
* point user feedback to CmsTalk. Fix #7100 (#7101)

Co-authored-by: Stefano Belforte
Co-authored-by: Daina <60326190+ddaina@users.noreply.github.com>
Co-authored-by: Valentin Kuznetsov
---
 bin/htcondor_make_runtime.sh | 36 +- bin/logon_myproxy_openssl.py | 2 +- .../CAFUtilitiesBase.py | 0 .../Databases/FileMetaDataDB/MySQL/Create.py | 0 .../Databases/FileMetaDataDB/MySQL/Destroy.py | 3 +- .../MySQL/FileMetaData/FileMetaData.py | 0 .../MySQL/FileMetaData/__init__.py | 0 .../FileMetaDataDB/MySQL/__init__.py | 0 .../Databases/FileTransfersDB/MySQL/Create.py | 0 .../FileTransfersDB/MySQL/Destroy.py | 3 +- .../FileTransfersDB/MySQL/__init__.py | 0 .../Databases/TaskDB/MySQL/Create.py | 0 .../Databases/TaskDB/MySQL/Destroy.py | 3 +- .../TaskDB/MySQL/JobGroup/JobGroup.py | 0 .../TaskDB/MySQL/JobGroup/__init__.py | 0 .../Databases/TaskDB/MySQL/Task/Task.py | 0 .../Databases/TaskDB/MySQL/Task/__init__.py | 0 .../Databases/TaskDB/MySQL/__init__.py | 0 obsolete/README.MD | 3 + requirements.txt | 3 +- scripts/AdjustSites.py | 48 +- scripts/CMSRunAnalysis.py | 4 +- scripts/Utils/DebugFailedBlockPublication.py | 2 +- scripts/Utils/FindFailedMigrations.py | 16 +- scripts/Utils/RemoveFailedMigration.py | 50 +- scripts/dag_bootstrap.sh | 20 +- scripts/dag_bootstrap_startup.sh | 51 +- scripts/task_process/FTS_Transfers.py | 29 +- scripts/task_process/cache_status.py | 289 ++++++-- scripts/task_process/cache_status_jel.py | 543 --------------- scripts/task_process/task_proc_wrapper.sh | 27 +- setup.py | 21 +- src/python/CRABInterface/Attrib.py | 2 +- src/python/CRABInterface/DataFileMetadata.py | 52 +- src/python/CRABInterface/DataUserWorkflow.py | 32 +- src/python/CRABInterface/DataWorkflow.py | 227 +------ .../CRABInterface/HTCondorDataWorkflow.py | 90 +-- src/python/CRABInterface/RESTBaseAPI.py | 10 +- src/python/CRABInterface/RESTCache.py | 20 +- src/python/CRABInterface/RESTExtensions.py | 2 - src/python/CRABInterface/RESTFileMetadata.py | 5 +- .../CRABInterface/RESTFileUserTransfers.py | 3 - src/python/CRABInterface/RESTServerInfo.py | 4 +-
src/python/CRABInterface/RESTTask.py | 4 +- src/python/CRABInterface/RESTUserWorkflow.py | 25 +- .../CRABInterface/RESTWorkerWorkflow.py | 28 +- src/python/CRABInterface/Regexps.py | 3 +- src/python/CRABInterface/Utilities.py | 137 ++-- .../Databases/FileMetaDataDB/Oracle/Create.py | 1 - .../FileMetaDataDB/Oracle/Destroy.py | 3 +- .../Oracle/FileMetaData/FileMetaData.py | 12 +- .../FileTransfersDB/Oracle/Destroy.py | 3 +- src/python/Databases/TaskDB/Oracle/Create.py | 6 +- src/python/Databases/TaskDB/Oracle/Destroy.py | 3 +- .../TaskDB/Oracle/JobGroup/JobGroup.py | 25 - .../TaskDB/Oracle/JobGroup/__init__.py | 3 - .../Databases/TaskDB/Oracle/Task/Task.py | 28 +- src/python/HTCondorUtils.py | 19 +- src/python/Logger.py | 69 -- src/python/PandaServerInterface.py | 571 ---------------- src/python/ProcInfo.py | 588 ---------------- src/python/Publisher/PublisherMaster.py | 162 +++-- src/python/Publisher/TaskPublish.py | 60 +- src/python/RESTInteractions.py | 2 +- src/python/ServerUtilities.py | 13 +- .../TaskWorker/Actions/DBSDataDiscovery.py | 19 +- .../TaskWorker/Actions/DagmanCreator.py | 50 +- src/python/TaskWorker/Actions/DagmanKiller.py | 13 +- .../TaskWorker/Actions/DagmanResubmitter.py | 99 +-- .../TaskWorker/Actions/DagmanSubmitter.py | 32 +- .../TaskWorker/Actions/DataDiscovery.py | 4 +- .../TaskWorker/Actions/DryRunUploader.py | 33 +- src/python/TaskWorker/Actions/Handler.py | 34 +- src/python/TaskWorker/Actions/MyProxyLogon.py | 4 +- src/python/TaskWorker/Actions/PostJob.py | 637 +++++++----------- src/python/TaskWorker/Actions/PreDAG.py | 46 +- src/python/TaskWorker/Actions/PreJob.py | 18 +- .../Actions/Recurring/BanDestinationSites.py | 4 +- .../Actions/Recurring/FMDCleaner.py | 7 +- .../Actions/Recurring/GenerateXML.py | 4 - .../Actions/Recurring/RenewRemoteProxies.py | 8 +- .../Actions/Recurring/TapeRecallStatus.py | 33 - src/python/TaskWorker/Actions/RetryJob.py | 8 +- src/python/TaskWorker/Actions/Splitter.py | 2 +- .../TaskWorker/Actions/StageoutCheck.py | 4 +- src/python/TaskWorker/Actions/TaskAction.py | 16 +- src/python/TaskWorker/MasterWorker.py | 23 +- src/python/TaskWorker/TaskManagerBootstrap.py | 4 +- src/python/TaskWorker/Worker.py | 21 +- src/python/TaskWorker/WorkerExceptions.py | 17 +- src/python/UserFileCache/RESTBaseAPI.py | 30 - src/python/UserFileCache/RESTExtensions.py | 195 ------ src/python/UserFileCache/RESTFile.py | 279 -------- src/python/UserFileCache/__init__.py | 3 - src/python/WMArchiveUploader.py | 2 +- src/python/taskbuffer/FileSpec.py | 117 ---- src/python/taskbuffer/JobSpec.py | 141 ---- src/python/taskbuffer/__init__.py | 0 src/script/Deployment/Publisher/start.sh | 11 +- .../Deployment/TaskWorker/TaskWorkerConfig.py | 4 + src/script/Deployment/TaskWorker/start.sh | 10 +- .../Deployment/TaskWorker/updateTMRuntime.sh | 15 +- .../Monitor/logstash/crabtaskworker.conf | 25 +- .../RESTFileUserTransfers_t.py | 2 +- 104 files changed, 1220 insertions(+), 4119 deletions(-) rename {src/python/Databases => obsolete}/CAFUtilitiesBase.py (100%) rename {src/python => obsolete}/Databases/FileMetaDataDB/MySQL/Create.py (100%) rename {src/python => obsolete}/Databases/FileMetaDataDB/MySQL/Destroy.py (93%) rename {src/python => obsolete}/Databases/FileMetaDataDB/MySQL/FileMetaData/FileMetaData.py (100%) rename {src/python => obsolete}/Databases/FileMetaDataDB/MySQL/FileMetaData/__init__.py (100%) rename {src/python => obsolete}/Databases/FileMetaDataDB/MySQL/__init__.py (100%) rename {src/python => obsolete}/Databases/FileTransfersDB/MySQL/Create.py (100%) 
rename {src/python => obsolete}/Databases/FileTransfersDB/MySQL/Destroy.py (93%) rename {src/python => obsolete}/Databases/FileTransfersDB/MySQL/__init__.py (100%) rename {src/python => obsolete}/Databases/TaskDB/MySQL/Create.py (100%) rename {src/python => obsolete}/Databases/TaskDB/MySQL/Destroy.py (93%) rename {src/python => obsolete}/Databases/TaskDB/MySQL/JobGroup/JobGroup.py (100%) rename {src/python => obsolete}/Databases/TaskDB/MySQL/JobGroup/__init__.py (100%) rename {src/python => obsolete}/Databases/TaskDB/MySQL/Task/Task.py (100%) rename {src/python => obsolete}/Databases/TaskDB/MySQL/Task/__init__.py (100%) rename {src/python => obsolete}/Databases/TaskDB/MySQL/__init__.py (100%) create mode 100644 obsolete/README.MD mode change 100755 => 100644 scripts/task_process/cache_status.py delete mode 100644 scripts/task_process/cache_status_jel.py delete mode 100644 src/python/Databases/TaskDB/Oracle/JobGroup/JobGroup.py delete mode 100755 src/python/Databases/TaskDB/Oracle/JobGroup/__init__.py delete mode 100644 src/python/Logger.py delete mode 100644 src/python/PandaServerInterface.py delete mode 100644 src/python/ProcInfo.py delete mode 100644 src/python/UserFileCache/RESTBaseAPI.py delete mode 100644 src/python/UserFileCache/RESTExtensions.py delete mode 100644 src/python/UserFileCache/RESTFile.py delete mode 100644 src/python/UserFileCache/__init__.py delete mode 100644 src/python/taskbuffer/FileSpec.py delete mode 100644 src/python/taskbuffer/JobSpec.py delete mode 100644 src/python/taskbuffer/__init__.py diff --git a/bin/htcondor_make_runtime.sh b/bin/htcondor_make_runtime.sh index f3d23a84f6..881f7640cf 100755 --- a/bin/htcondor_make_runtime.sh +++ b/bin/htcondor_make_runtime.sh @@ -34,19 +34,20 @@ else fi pushd $STARTDIR -# -# cleanup, avoid to keep adding to existing tarballs -# +# cleanup, avoid to keep adding to existing tarballs rm -f $STARTDIR/CRAB3.zip rm -f $STARTDIR/WMCore.zip -rm -f $STARTDIR/nose.tar.gz +# make sure there's always a CRAB3.zip to avoid errors in other parts +touch $STARTDIR/dummyFile +zip -r $STARTDIR/CRAB3.zip $STARTDIR/dummyFile +rm -f $STARTDIR/dummyFile # For developers, we download all our dependencies from the various upstream servers. # For actual releases, we take the libraries from the build environment RPMs. if [[ "x$RPM_RELEASE" != "x" ]]; then - + # I am inside a release building pushd $ORIGDIR/../WMCore-$WMCOREVER/build/lib/ zip -r $STARTDIR/WMCore.zip * zip -rq $STARTDIR/CRAB3.zip WMCore PSetTweaks Utils -x \*.pyc || exit 3 @@ -56,16 +57,12 @@ if [[ "x$RPM_RELEASE" != "x" ]]; then zip -rq $STARTDIR/CRAB3.zip RESTInteractions.py HTCondorUtils.py HTCondorLocator.py TaskWorker CRABInterface TransferInterface -x \*.pyc || exit 3 popd - pushd $VO_CMS_SW_DIR/$SCRAM_ARCH/external/cherrypy/*/lib/python2.7/site-packages - zip -rq $STARTDIR/CRAB3.zip cherrypy -x \*.pyc - popd - mkdir -p bin cp -r $ORIGDIR/scripts/{TweakPSet.py,CMSRunAnalysis.py,task_process} . - cp $ORIGDIR/src/python/{Logger.py,ProcInfo.py,ServerUtilities.py,RucioUtils.py,CMSGroupMapper.py,RESTInteractions.py} . + cp $ORIGDIR/src/python/{ServerUtilities.py,RucioUtils.py,CMSGroupMapper.py,RESTInteractions.py} . else - + # building runtime tarballs from development area or GH if [[ -d "$REPLACEMENT_ABSOLUTE/WMCore" ]]; then echo "Using replacement WMCore source at $REPLACEMENT_ABSOLUTE/WMCore" WMCORE_PATH="$REPLACEMENT_ABSOLUTE/WMCore" @@ -85,17 +82,6 @@ else CRABSERVER_PATH="CRABServer-$CRABSERVERVER" fi - if [[ ! 
-e nose.tar.gz ]]; then - curl -L https://github.com/nose-devs/nose/archive/release_1.3.0.tar.gz > nose.tar.gz || exit 2 - fi - - tar xzf nose.tar.gz || exit 2 - - pushd nose-release_1.3.0/ - zip -rq $STARTDIR/CRAB3.zip nose -x \*.pyc || exit 3 - popd - - # up until this point, evertying in CRAB3.zip is an external cp $STARTDIR/CRAB3.zip $ORIGDIR/CRAB3-externals.zip @@ -110,12 +96,12 @@ else mkdir -p bin cp -r $CRABSERVER_PATH/scripts/{TweakPSet.py,CMSRunAnalysis.py,task_process} . - cp $CRABSERVER_PATH/src/python/{Logger.py,ProcInfo.py,ServerUtilities.py,RucioUtils.py,CMSGroupMapper.py,RESTInteractions.py} . + cp $CRABSERVER_PATH/src/python/{ServerUtilities.py,RucioUtils.py,CMSGroupMapper.py,RESTInteractions.py} . fi pwd echo "Making TaskManagerRun tarball" -tar zcf $ORIGDIR/TaskManagerRun-$CRAB3_VERSION.tar.gz CRAB3.zip TweakPSet.py CMSRunAnalysis.py task_process Logger.py ProcInfo.py ServerUtilities.py RucioUtils.py CMSGroupMapper.py RESTInteractions.py || exit 4 +tar zcf $ORIGDIR/TaskManagerRun-$CRAB3_VERSION.tar.gz CRAB3.zip TweakPSet.py CMSRunAnalysis.py task_process ServerUtilities.py RucioUtils.py CMSGroupMapper.py RESTInteractions.py || exit 4 echo "Making CMSRunAnalysis tarball" -tar zcf $ORIGDIR/CMSRunAnalysis-$CRAB3_VERSION.tar.gz WMCore.zip TweakPSet.py CMSRunAnalysis.py Logger.py ProcInfo.py ServerUtilities.py CMSGroupMapper.py RESTInteractions.py || exit 4 +tar zcf $ORIGDIR/CMSRunAnalysis-$CRAB3_VERSION.tar.gz WMCore.zip TweakPSet.py CMSRunAnalysis.py ServerUtilities.py CMSGroupMapper.py RESTInteractions.py || exit 4 popd diff --git a/bin/logon_myproxy_openssl.py b/bin/logon_myproxy_openssl.py index c4dec1ca2d..07b5e15a09 100644 --- a/bin/logon_myproxy_openssl.py +++ b/bin/logon_myproxy_openssl.py @@ -18,5 +18,5 @@ 'server_cert': sys.argv[3],} timeleftthreshold = 60 * 60 * 24 mypclient = SimpleMyProxy(defaultDelegation) -userproxy = mypclient.logonRenewMyProxy(username=sha1(sys.argv[4]+userdn).hexdigest(), myproxyserver=myproxyserver, myproxyport=7512) +userproxy = mypclient.logonRenewMyProxy(username=sha1((sys.argv[4]+userdn).encode("utf8")).hexdigest(), myproxyserver=myproxyserver, myproxyport=7512) print ("Proxy Retrieved with len ", len(userproxy)) diff --git a/src/python/Databases/CAFUtilitiesBase.py b/obsolete/CAFUtilitiesBase.py similarity index 100% rename from src/python/Databases/CAFUtilitiesBase.py rename to obsolete/CAFUtilitiesBase.py diff --git a/src/python/Databases/FileMetaDataDB/MySQL/Create.py b/obsolete/Databases/FileMetaDataDB/MySQL/Create.py similarity index 100% rename from src/python/Databases/FileMetaDataDB/MySQL/Create.py rename to obsolete/Databases/FileMetaDataDB/MySQL/Create.py diff --git a/src/python/Databases/FileMetaDataDB/MySQL/Destroy.py b/obsolete/Databases/FileMetaDataDB/MySQL/Destroy.py similarity index 93% rename from src/python/Databases/FileMetaDataDB/MySQL/Destroy.py rename to obsolete/Databases/FileMetaDataDB/MySQL/Destroy.py index 07975adab4..5149cebe2e 100755 --- a/src/python/Databases/FileMetaDataDB/MySQL/Destroy.py +++ b/obsolete/Databases/FileMetaDataDB/MySQL/Destroy.py @@ -3,7 +3,6 @@ """ import threading -import string from WMCore.Database.DBCreator import DBCreator from Databases.FileMetaDataDB.Oracle.Create import Create @@ -29,6 +28,6 @@ def __init__(self, logger = None, dbi = None, param=None): i = 0 for tableName in orderedTables: i += 1 - prefix = string.zfill(i, 2) + prefix = str(i).zfill(2) self.create[prefix + tableName] = "DROP TABLE %s" % tableName diff --git 
a/src/python/Databases/FileMetaDataDB/MySQL/FileMetaData/FileMetaData.py b/obsolete/Databases/FileMetaDataDB/MySQL/FileMetaData/FileMetaData.py similarity index 100% rename from src/python/Databases/FileMetaDataDB/MySQL/FileMetaData/FileMetaData.py rename to obsolete/Databases/FileMetaDataDB/MySQL/FileMetaData/FileMetaData.py diff --git a/src/python/Databases/FileMetaDataDB/MySQL/FileMetaData/__init__.py b/obsolete/Databases/FileMetaDataDB/MySQL/FileMetaData/__init__.py similarity index 100% rename from src/python/Databases/FileMetaDataDB/MySQL/FileMetaData/__init__.py rename to obsolete/Databases/FileMetaDataDB/MySQL/FileMetaData/__init__.py diff --git a/src/python/Databases/FileMetaDataDB/MySQL/__init__.py b/obsolete/Databases/FileMetaDataDB/MySQL/__init__.py similarity index 100% rename from src/python/Databases/FileMetaDataDB/MySQL/__init__.py rename to obsolete/Databases/FileMetaDataDB/MySQL/__init__.py diff --git a/src/python/Databases/FileTransfersDB/MySQL/Create.py b/obsolete/Databases/FileTransfersDB/MySQL/Create.py similarity index 100% rename from src/python/Databases/FileTransfersDB/MySQL/Create.py rename to obsolete/Databases/FileTransfersDB/MySQL/Create.py diff --git a/src/python/Databases/FileTransfersDB/MySQL/Destroy.py b/obsolete/Databases/FileTransfersDB/MySQL/Destroy.py similarity index 93% rename from src/python/Databases/FileTransfersDB/MySQL/Destroy.py rename to obsolete/Databases/FileTransfersDB/MySQL/Destroy.py index 677ec17535..6ecc9c88e4 100755 --- a/src/python/Databases/FileTransfersDB/MySQL/Destroy.py +++ b/obsolete/Databases/FileTransfersDB/MySQL/Destroy.py @@ -3,7 +3,6 @@ """ import threading -import string from WMCore.Database.DBCreator import DBCreator from Databases.TaskDB.Oracle.Create import Create @@ -29,5 +28,5 @@ def __init__(self, logger = None, dbi = None, param=None): i = 0 for tableName in orderedTables: i += 1 - prefix = string.zfill(i, 2) + prefix = str(i).zfill(2) self.create[prefix + tableName] = "DROP TABLE %s" % tableName diff --git a/src/python/Databases/FileTransfersDB/MySQL/__init__.py b/obsolete/Databases/FileTransfersDB/MySQL/__init__.py similarity index 100% rename from src/python/Databases/FileTransfersDB/MySQL/__init__.py rename to obsolete/Databases/FileTransfersDB/MySQL/__init__.py diff --git a/src/python/Databases/TaskDB/MySQL/Create.py b/obsolete/Databases/TaskDB/MySQL/Create.py similarity index 100% rename from src/python/Databases/TaskDB/MySQL/Create.py rename to obsolete/Databases/TaskDB/MySQL/Create.py diff --git a/src/python/Databases/TaskDB/MySQL/Destroy.py b/obsolete/Databases/TaskDB/MySQL/Destroy.py similarity index 93% rename from src/python/Databases/TaskDB/MySQL/Destroy.py rename to obsolete/Databases/TaskDB/MySQL/Destroy.py index 677ec17535..6ecc9c88e4 100755 --- a/src/python/Databases/TaskDB/MySQL/Destroy.py +++ b/obsolete/Databases/TaskDB/MySQL/Destroy.py @@ -3,7 +3,6 @@ """ import threading -import string from WMCore.Database.DBCreator import DBCreator from Databases.TaskDB.Oracle.Create import Create @@ -29,5 +28,5 @@ def __init__(self, logger = None, dbi = None, param=None): i = 0 for tableName in orderedTables: i += 1 - prefix = string.zfill(i, 2) + prefix = str(i).zfill(2) self.create[prefix + tableName] = "DROP TABLE %s" % tableName diff --git a/src/python/Databases/TaskDB/MySQL/JobGroup/JobGroup.py b/obsolete/Databases/TaskDB/MySQL/JobGroup/JobGroup.py similarity index 100% rename from src/python/Databases/TaskDB/MySQL/JobGroup/JobGroup.py rename to obsolete/Databases/TaskDB/MySQL/JobGroup/JobGroup.py diff 
--git a/src/python/Databases/TaskDB/MySQL/JobGroup/__init__.py b/obsolete/Databases/TaskDB/MySQL/JobGroup/__init__.py similarity index 100% rename from src/python/Databases/TaskDB/MySQL/JobGroup/__init__.py rename to obsolete/Databases/TaskDB/MySQL/JobGroup/__init__.py diff --git a/src/python/Databases/TaskDB/MySQL/Task/Task.py b/obsolete/Databases/TaskDB/MySQL/Task/Task.py similarity index 100% rename from src/python/Databases/TaskDB/MySQL/Task/Task.py rename to obsolete/Databases/TaskDB/MySQL/Task/Task.py diff --git a/src/python/Databases/TaskDB/MySQL/Task/__init__.py b/obsolete/Databases/TaskDB/MySQL/Task/__init__.py similarity index 100% rename from src/python/Databases/TaskDB/MySQL/Task/__init__.py rename to obsolete/Databases/TaskDB/MySQL/Task/__init__.py diff --git a/src/python/Databases/TaskDB/MySQL/__init__.py b/obsolete/Databases/TaskDB/MySQL/__init__.py similarity index 100% rename from src/python/Databases/TaskDB/MySQL/__init__.py rename to obsolete/Databases/TaskDB/MySQL/__init__.py diff --git a/obsolete/README.MD b/obsolete/README.MD new file mode 100644 index 0000000000..382c8578d4 --- /dev/null +++ b/obsolete/README.MD @@ -0,0 +1,3 @@ +### CRAB OBSOLETE +A place where to put files which are not needed in CRABServer repo any more but may be useful as examples, history, or otherwise + diff --git a/requirements.txt b/requirements.txt index 62ac93e270..bdd91d3a6d 100644 --- a/requirements.txt +++ b/requirements.txt @@ -3,4 +3,5 @@ # Format: # Dependency==version -wmcver==1.5.3 +wmcver==2.0.1.pre3 + diff --git a/scripts/AdjustSites.py b/scripts/AdjustSites.py index f54190c82b..bbb81fe300 100644 --- a/scripts/AdjustSites.py +++ b/scripts/AdjustSites.py @@ -14,10 +14,10 @@ import time import glob import shutil -import urllib +from urllib.parse import urlencode import traceback from datetime import datetime -from httplib import HTTPException +from http.client import HTTPException import classad import htcondor @@ -210,7 +210,10 @@ def makeWebDir(ad): """ Need a doc string here. """ - path = os.path.expanduser("~/%s" % ad['CRAB_ReqName']) + if 'AuthTokenId' in ad: + path = os.path.expanduser("/home/grid/%s/%s" % (ad['CRAB_UserHN'], ad['CRAB_ReqName'])) + else: + path = os.path.expanduser("~/%s" % ad['CRAB_ReqName']) try: ## Create the web directory. os.makedirs(path) @@ -238,23 +241,6 @@ def makeWebDir(ad): os.symlink(os.path.abspath(os.path.join(".", ".job.ad")), os.path.join(path, "job_ad.txt")) os.symlink(os.path.abspath(os.path.join(".", "task_process/status_cache.txt")), os.path.join(path, "status_cache")) os.symlink(os.path.abspath(os.path.join(".", "task_process/status_cache.pkl")), os.path.join(path, "status_cache.pkl")) - # prepare a startup cache_info file with time info for client to have something useful to print - # in crab status while waiting for task_process to fill with actual jobs info. 
Do it in two ways - # new way: a pickle file for python3 compatibility - startInfo = {'bootstrapTime': {}} - startInfo['bootstrapTime']['date'] = datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S UTC") - startInfo['bootstrapTime']['fromEpoch'] = int(time.time()) - with open(os.path.abspath(os.path.join(".", "task_process/status_cache.pkl")), 'w') as fp: - pickle.dump(startInfo, fp) - # old way: a file with multiple lines and print-like output - startInfo = "# Task bootstrapped at " + datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S UTC") + "\n" - startInfo += "%d\n" % (int(time.time())) # machines will like seconds from Epoch more - # prepare fake status_cache info to please current (v3.210127) CRAB Client - fakeInfo = startInfo + "{" - fakeInfo += "'DagStatus': {'SubDagStatus': {}, 'Timestamp': 0L, 'NodesTotal': 1L, 'SubDags': {}, 'DagStatus': 1L}" - fakeInfo += "}\n{}\n" - with open(os.path.abspath(os.path.join(".", "task_process/status_cache.txt")), 'w') as fd: - fd.write(fakeInfo) os.symlink(os.path.abspath(os.path.join(".", "prejob_logs/predag.0.txt")), os.path.join(path, "AutomaticSplitting_Log0.txt")) os.symlink(os.path.abspath(os.path.join(".", "prejob_logs/predag.0.txt")), os.path.join(path, "AutomaticSplitting/DagLog0.txt")) os.symlink(os.path.abspath(os.path.join(".", "prejob_logs/predag.1.txt")), os.path.join(path, "AutomaticSplitting/DagLog1.txt")) @@ -266,6 +252,24 @@ def makeWebDir(ad): except Exception as ex: #pylint: disable=broad-except #Should we just catch OSError and IOError? Is that enough? printLog("Failed to copy/symlink files in the user web directory: %s" % str(ex)) + + # prepare a startup cache_info file with time info for client to have something useful to print + # in crab status while waiting for task_process to fill with actual jobs info. 
Do it in two ways + # new way: a pickle file for python3 compatibility + startInfo = {'bootstrapTime': {}} + startInfo['bootstrapTime']['date'] = datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S UTC") + startInfo['bootstrapTime']['fromEpoch'] = int(time.time()) + with open(os.path.abspath(os.path.join(".", "task_process/status_cache.pkl")), 'wb') as fp: + pickle.dump(startInfo, fp) + # old way: a file with multiple lines and print-like output + startInfo = "# Task bootstrapped at " + datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S UTC") + "\n" + startInfo += "%d\n" % (int(time.time())) # machines will like seconds from Epoch more + # prepare fake status_cache info to please current (v3.210127) CRAB Client + fakeInfo = startInfo + "{" + fakeInfo += "'DagStatus': {'SubDagStatus': {}, 'Timestamp': 0L, 'NodesTotal': 1L, 'SubDags': {}, 'DagStatus': 1L}" + fakeInfo += "}\n{}\n" + with open(os.path.abspath(os.path.join(".", "task_process/status_cache.txt")), 'w') as fd: + fd.write(fakeInfo) printLog("WEB_DIR created, sym links in place and status_cache initialized") try: @@ -287,7 +291,7 @@ def uploadWebDir(crabserver, ad): try: printLog("Uploading webdir %s to the REST" % data['webdirurl']) - crabserver.post(api='task', data=urllib.urlencode(data)) + crabserver.post(api='task', data=urlencode(data)) return 0 except HTTPException as hte: printLog(traceback.format_exc()) @@ -314,7 +318,7 @@ def saveProxiedWebdir(crabserver, ad): if proxied_webDir: # Prefer the proxied webDir to the non-proxied one ad[webDir_adName] = str(proxied_webDir) - if ad[webDir_adName]: + if webDir_adName in ad: # This condor_edit is required because in the REST interface we look for the webdir if the DB upload failed (or in general if we use the "old logic") # See https://github.com/dmwm/CRABServer/blob/3.3.1507.rc8/src/python/CRABInterface/HTCondorDataWorkflow.py#L398 dagJobId = '%d.%d' % (ad['ClusterId'], ad['ProcId']) diff --git a/scripts/CMSRunAnalysis.py b/scripts/CMSRunAnalysis.py index 2c8e244da6..12860e7f59 100644 --- a/scripts/CMSRunAnalysis.py +++ b/scripts/CMSRunAnalysis.py @@ -745,7 +745,7 @@ def StripReport(report): print("== Execution site from site-local-config.xml: %s" % slCfg.siteName) with open('jobReport.json', 'w') as of: json.dump(rep, of) - with open('jobReportExtract.pickle', 'w') as of: + with open('jobReportExtract.pickle', 'wb') as of: pickle.dump(rep, of) print("==== Report file creation FINISHED at %s ====" % time.asctime(time.gmtime())) except FwkJobReportException as FJRex: @@ -764,7 +764,7 @@ def StripReport(report): try: oldName = 'UNKNOWN' newName = 'UNKNOWN' - for oldName, newName in literal_eval(options.outFiles).iteritems(): + for oldName, newName in literal_eval(options.outFiles).items(): os.rename(oldName, newName) except Exception as ex: handleException("FAILED", EC_MoveOutErr, "Exception while moving file %s to %s." 
%(oldName, newName)) diff --git a/scripts/Utils/DebugFailedBlockPublication.py b/scripts/Utils/DebugFailedBlockPublication.py index 5347b2c22d..b08de013dc 100755 --- a/scripts/Utils/DebugFailedBlockPublication.py +++ b/scripts/Utils/DebugFailedBlockPublication.py @@ -1,4 +1,4 @@ -#!/usr/bin/env python +#!/usr/bin/env python3 # coding: utf-8 from __future__ import division from __future__ import print_function diff --git a/scripts/Utils/FindFailedMigrations.py b/scripts/Utils/FindFailedMigrations.py index 20f7dc1fbb..2a7bbb8847 100755 --- a/scripts/Utils/FindFailedMigrations.py +++ b/scripts/Utils/FindFailedMigrations.py @@ -1,4 +1,4 @@ -#!/usr/bin/env python +#!/usr/bin/env python3 # coding: utf-8 from __future__ import print_function from __future__ import division @@ -7,8 +7,6 @@ from datetime import datetime import argparse -# this is needed to make it possible for the following import to work -import CRABClient #pylint: disable=unused-import from dbs.apis.dbsClient import DbsApi @@ -60,10 +58,20 @@ def readAndParse(csvFile, apiMig): def main(): parser = argparse.ArgumentParser() parser.add_argument('--file', help='log file of terminally failed migrations in CSV format', - default='TerminallyFailedLog.txt') + default='/data/srv/Publisher/logs/migrations/TerminallyFailedLog.txt') args = parser.parse_args() logFile = os.path.abspath(args.file) + # if X509 vars are not defined, use default Publisher location + userProxy = os.getenv('X509_USER_PROXY') + if userProxy: + os.environ['X509_USER_CERT'] = userProxy + os.environ['X509_USER_KEY'] = userProxy + if not os.getenv('X509_USER_CERT'): + os.environ['X509_USER_CERT'] = '/data/certs/servicecert.pem' + if not os.getenv('X509_USER_KEY'): + os.environ['X509_USER_KEY'] = '/data/certs/servicekey.pem' + migUrl = 'https://cmsweb-prod.cern.ch/dbs/prod/phys03/DBSMigrate' apiMig = DbsApi(url=migUrl) diff --git a/scripts/Utils/RemoveFailedMigration.py b/scripts/Utils/RemoveFailedMigration.py index adf199a11b..71fb8fbb4f 100755 --- a/scripts/Utils/RemoveFailedMigration.py +++ b/scripts/Utils/RemoveFailedMigration.py @@ -1,13 +1,12 @@ -#!/usr/bin/env python +#!/usr/bin/env python3 # coding: utf-8 from __future__ import print_function from __future__ import division +import os from datetime import datetime import argparse -# this is needed to make it possible for the following import to work -import CRABClient #pylint: disable=unused-import from dbs.apis.dbsClient import DbsApi def main(): @@ -16,6 +15,16 @@ def main(): args = parser.parse_args() migrationId = int(args.id) + # if X509 vars are not defined, use default Publisher location + userProxy = os.getenv('X509_USER_PROXY') + if userProxy: + os.environ['X509_USER_CERT'] = userProxy + os.environ['X509_USER_KEY'] = userProxy + if not os.getenv('X509_USER_CERT'): + os.environ['X509_USER_CERT'] = '/data/certs/servicecert.pem' + if not os.getenv('X509_USER_KEY'): + os.environ['X509_USER_KEY'] = '/data/certs/servicekey.pem' + migUrl = 'https://cmsweb-prod.cern.ch/dbs/prod/phys03/DBSMigrate' apiMig = DbsApi(url=migUrl) @@ -39,10 +48,10 @@ def main(): print("migrationId: %d was created on %s by %s for block:" % (migrationId, created, creator)) print(" %s" % block) - answer = raw_input("Do you want to remove it ? Yes/[No]: ") + answer = input("Do you want to remove it ? 
Yes/[No]: ") if answer in ['Yes', 'YES', 'Y', 'y', 'yes']: answer = 'Yes' - if answer is not 'Yes': + if answer != 'Yes': return print("\nRemoving it...") @@ -52,22 +61,21 @@ def main(): print("Migration removal failed with this exception:\n%s" % str(ex)) return print("Migration %d successfully removed\n" % migrationId) - print("CRAB Publisher will issue such a migration request again as/when needed") - print("but if you want to recreated it now, you can do it with this python fragment") - print("\n ===============\n") - print("import CRABClient") - print("from dbs.apis.dbsClient import DbsApi") - print("globUrl='https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader'") - print("migUrl='https://cmsweb-prod.cern.ch/dbs/prod/phys03/DBSMigrate'") - print("apiMig = DbsApi(url=migUrl)") - print("block='%s'" % block) - print("data= {'migration_url': globUrl, 'migration_input': block}") - print("result = apiMig.submitMigration(data)") - print("newId = result.get('migration_details', {}).get('migration_request_id')") - print("print('new migration created: %d' % newId)") - print("status = apiMig.statusMigration(migration_rqst_id=newId)") - print("print(status)") - print("\n ===============\n") + print("CRAB Publisher will issue such a migration request again as/when needed.") + print("But if you want to re-create it now, you can by answering yes here") + answer = input("Do you want to re-create the migration request ? Yes/[No]: ") + if answer in ['Yes', 'YES', 'Y', 'y', 'yes']: + answer = 'Yes' + if answer != 'Yes': + return + print("\nSubmitting new migration request...") + globUrl = 'https://cmsweb-prod.cern.ch/dbs/prod/global/DBSReader' + data = {'migration_url': globUrl, 'migration_input': block} + result = apiMig.submitMigration(data) + newId = result.get('migration_details', {}).get('migration_request_id') + print('new migration created: %d' % newId) + status = apiMig.statusMigration(migration_rqst_id=newId) + print(status) return if __name__ == '__main__': diff --git a/scripts/dag_bootstrap.sh b/scripts/dag_bootstrap.sh index 3a2824ebf3..ae5456fe12 100755 --- a/scripts/dag_bootstrap.sh +++ b/scripts/dag_bootstrap.sh @@ -13,20 +13,21 @@ set -o pipefail set -x echo "Beginning dag_bootstrap.sh (stdout)" echo "Beginning dag_bootstrap.sh (stderr)" 1>&2 -export PYTHONPATH=$PYTHONPATH:/cvmfs/cms.cern.ch/rucio/current/lib/python2.7/site-packages +export PYTHONPATH=$PYTHONPATH:/cvmfs/cms.cern.ch/rucio/x86_64/slc7/py3/current/lib/python3.6/site-packages/ +export PYTHONPATH=$PYTHONPATH:/data/srv/pycurl3/7.44.1 if [ "X$TASKWORKER_ENV" = "X" -a ! -e CRAB3.zip ] then - command -v python > /dev/null + command -v python3 > /dev/null rc=$? if [[ $rc != 0 ]] then - echo "Error: Python isn't available on `hostname`." >&2 - echo "Error: bootstrap execution requires python" >&2 + echo "Error: Python3 isn't available on `hostname`." >&2 + echo "Error: bootstrap execution requires python3" >&2 exit 1 else - echo "I found python at.." - echo `which python` + echo "I found python3 at.." 
+ echo `which python3` fi if [ "x$CRAB3_VERSION" = "x" ]; then @@ -34,10 +35,6 @@ then else TARBALL_NAME=TaskManagerRun-$CRAB3_VERSION.tar.gz fi - - if [[ "X$CRAB_TASKMANAGER_TARBALL" == "X" ]]; then - CRAB_TASKMANAGER_TARBALL="http://hcc-briantest.unl.edu/$TARBALL_NAME" - fi if [[ "X$CRAB_TASKMANAGER_TARBALL" != "Xlocal" ]]; then # pass, we'll just use that value @@ -96,7 +93,6 @@ export LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH os_ver=$(source /etc/os-release;echo $VERSION_ID) curl_path="/cvmfs/cms.cern.ch/slc${os_ver}_amd64_gcc700/external/curl/7.59.0" -libcurl_path="${curl_path}/lib" source ${curl_path}/etc/profile.d/init.sh export PYTHONUNBUFFERED=1 @@ -111,5 +107,5 @@ if [ "X$_CONDOR_JOB_AD" != "X" ]; then cat $_CONDOR_JOB_AD fi echo "Now running the job in `pwd`..." -exec nice -n 19 python -m TaskWorker.TaskManagerBootstrap "$@" +exec nice -n 19 python3 -m TaskWorker.TaskManagerBootstrap "$@" } 2>&1 | tee dag_bootstrap.out diff --git a/scripts/dag_bootstrap_startup.sh b/scripts/dag_bootstrap_startup.sh index 50db15f60e..da8a8f3212 100755 --- a/scripts/dag_bootstrap_startup.sh +++ b/scripts/dag_bootstrap_startup.sh @@ -14,10 +14,13 @@ done export PATH="/usr/local/bin:/bin:/usr/bin:/usr/bin:$PATH" export LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH -os_ver=$(source /etc/os-release;echo $VERSION_ID) -curl_path="/cvmfs/cms.cern.ch/slc${os_ver}_amd64_gcc700/external/curl/7.59.0" -libcurl_path="${curl_path}/lib" -source ${curl_path}/etc/profile.d/init.sh +export PYTHONPATH=$PYTHONPATH:/data/srv/pycurl3/7.44.1 +#os_ver=$(source /etc/os-release;echo $VERSION_ID) +#curl_path="/cvmfs/cms.cern.ch/slc${os_ver}_amd64_gcc700/external/curl/7.59.0" +#libcurl_path="${curl_path}/lib" +#source ${curl_path}/etc/profile.d/init.sh +source /cvmfs/cms.cern.ch/slc7_amd64_gcc900/external/curl/7.59.0/etc/profile.d/init.sh + srcname=$0 env > ${srcname%.sh}.env @@ -85,16 +88,16 @@ fi # Bootstrap the runtime - we want to do this before DAG is submitted # so all the children don't try this at once. if [ "X$TASKWORKER_ENV" = "X" -a ! -e CRAB3.zip ]; then - command -v python > /dev/null + command -v python3 > /dev/null rc=$? if [[ $rc != 0 ]]; then echo "Error: Python isn't available on `hostname`." >&2 - echo "Error: Bootstrap execution requires python" >&2 - condor_qedit $CONDOR_ID DagmanHoldReason "'Error: Bootstrap execution requires python.'" + echo "Error: Bootstrap execution requires python3" >&2 + condor_qedit $CONDOR_ID DagmanHoldReason "'Error: Bootstrap execution requires python3.'" exit 1 else - echo "I found python at.." - echo `which python` + echo "I found python3 at.." + echo `which python3` fi if [[ "X$CRAB_TASKMANAGER_TARBALL" == "X" ]]; then @@ -138,7 +141,7 @@ cp $_CONDOR_JOB_AD ./_CONDOR_JOB_AD if [ -e AdjustSites.py ]; then export schedd_name=`condor_config_val schedd_name` echo "Execute AdjustSites.py ..." - python AdjustSites.py + python3 AdjustSites.py ret=$? if [ $ret -eq 1 ]; then echo "Error: AdjustSites.py failed to update the webdir." 
>&2 @@ -159,16 +162,6 @@ else exit 1 fi -# Decide if this task will use old cache_status.py or the new cache_status_jel.py -# for parsing the condor job_log ( git issues: #5942 #5939 ) via the new JobEventLog API -# Do it here so that decision stays for task lifetime, even if schedd configuratoion -# is changed at some point -if [ -f /etc/use_condor_jel ] ; -then - echo "Set this task to use condor JobEventLog API" - touch USE_JEL -fi - # Decide if this task will use enable REUSE flag for FTS ASO jobs # Do it here so that decision stays for task lifetime, even if schedd configuratoion # is changed at some point @@ -178,22 +171,6 @@ then touch USE_FTS_REUSE fi -# Decide if this task will use old ASO based DBSPublisher, or new standalone Publisher -# Do it here so that decision stays for task lifetime, even if schedd configuratoion -# is changed at some point -if [ -f /etc/use_new_publisher ] ; -then - echo "Found file /etc/use_new_publisher. Set this task to use New Publisher" - touch USE_NEW_PUBLISHER -fi - -UseNewPublisher=`grep '^CRAB_USE_NEW_PUBLISHER =' $_CONDOR_JOB_AD | tr -d '"' | awk '{print $NF;}'` -if [ "$UseNewPublisher" = "True" ]; -then - echo "+CRAB_USE_NEW_PUBLISHER classAd is True. Set this task to use New Publisher" - touch USE_NEW_PUBLISHER -fi - export _CONDOR_DAGMAN_LOG=$PWD/$1.dagman.out export _CONDOR_DAGMAN_GENERATE_SUBDAG_SUBMITS=False export _CONDOR_MAX_DAGMAN_LOG=0 @@ -234,6 +211,7 @@ else then echo "creating and executing task process daemon jdl" TASKNAME=`grep '^CRAB_ReqName =' $_CONDOR_JOB_AD | awk '{print $NF;}'` + USERNAME=`grep '^CRAB_UserHN =' $_CONDOR_JOB_AD | awk '{print $NF;}'` CMSTYPE=`grep '^CMS_Type =' $_CONDOR_JOB_AD | awk '{print $NF;}'` CMSWMTOOL=`grep '^CMS_WMTool =' $_CONDOR_JOB_AD | awk '{print $NF;}'` CMSTTASKYPE=`grep '^CMS_TaskType =' $_CONDOR_JOB_AD | awk '{print $NF;}'` @@ -246,6 +224,7 @@ Log = task_process/daemon.PC.log Output = task_process/daemon.out.\$(Cluster).\$(Process) Error = task_process/daemon.err.\$(Cluster).\$(Process) +CRAB_ReqName = $TASKNAME ++CRAB_UserHN = $USERNAME +CMS_Type = $CMSTYPE +CMS_WMTool = $CMSWMTOOL +CMS_TaskType = $CMSTTASKYPE diff --git a/scripts/task_process/FTS_Transfers.py b/scripts/task_process/FTS_Transfers.py index 04109342f9..ed97343c58 100644 --- a/scripts/task_process/FTS_Transfers.py +++ b/scripts/task_process/FTS_Transfers.py @@ -9,8 +9,11 @@ import logging import os import subprocess -from datetime import timedelta -from httplib import HTTPException +from datetime import datetime, timedelta +try: + from httplib import HTTPException +except: + from http.client import HTTPException import fts3.rest.client.easy as fts3 @@ -45,10 +48,7 @@ # proxy = os.getcwd() + "/" + _rest.readline() # print("Proxy: %s" % proxy) -if os.path.exists('USE_NEW_PUBLISHER'): - asoworker = 'schedd' -else: - asoworker = 'asoless' +asoworker = 'schedd' if os.path.exists('USE_FTS_REUSE'): ftsReuse = True @@ -81,7 +81,7 @@ def mark_transferred(ids, crabserver): data['list_of_ids'] = ids data['list_of_transfer_state'] = ["DONE" for _ in ids] - crabserver.post('/filetransfers', data=encodeRequest(data)) + crabserver.post('filetransfers', data=encodeRequest(data)) logging.info("Marked good %s", ids) except Exception: logging.exception("Error updating documents") @@ -105,7 +105,7 @@ def mark_failed(ids, failures_reasons, crabserver): data['list_of_failure_reason'] = failures_reasons data['list_of_retry_value'] = [0 for _ in ids] - crabserver.post('/filetransfers', data=encodeRequest(data)) + crabserver.post('filetransfers', 
data=encodeRequest(data)) logging.info("Marked failed %s", ids) except Exception: logging.exception("Error updating documents") @@ -199,10 +199,12 @@ def check_FTSJob(logger, ftsContext, jobid, jobsEnded, jobs_ongoing, done_id, fa # xfers have only 3 terminal states: FINISHED, FAILED, and CANCELED see # https://fts3-docs.web.cern.ch/fts3-docs/docs/state_machine.html if tx_state == 'FINISHED': + logger.info('file XFER OK will remove %s', file_status['source_surl']) done_id[jobid].append(_id) files_to_remove.append(file_status['source_surl']) fileIds_to_remove.append(_id) elif tx_state == 'FAILED' or tx_state == 'CANCELED': + logger.info('file XFER FAIL will remove %s', file_status['source_surl']) failed_id[jobid].append(_id) if file_status['reason']: logger.info('Failure reason: ' + file_status['reason']) @@ -231,6 +233,9 @@ def check_FTSJob(logger, ftsContext, jobid, jobsEnded, jobs_ongoing, done_id, fa for f in files_to_remove: list_of_surls += str(f) + ' ' # convert JSON u'srm://....' to plain srm://... removeLogFile = './task_process/transfers/remove_files.log' + msg = str(datetime.now()) + ': Will remove: %s' % list_of_surls + with open(removeLogFile, 'a') as removeLog: + removeLog.write(msg) remove_files_in_bkg(list_of_surls, removeLogFile) # remove those file Id's from the list and update the json disk file fileIds = list(set(fileIds)-set(fileIds_to_remove)) @@ -446,7 +451,7 @@ def submit(rucioClient, ftsContext, toTrans, crabserver): # jobContent.write(str(f[0])) # save the list of src_lfn's in this job for fileDoc in to_update: - _ = crabserver.post('/filetransfers', data=encodeRequest(fileDoc)) + _ = crabserver.post('filetransfers', data=encodeRequest(fileDoc)) logging.info("Marked submitted %s files", fileDoc['list_of_ids']) return jobids @@ -544,17 +549,13 @@ def state_manager(ftsContext, crabserver): if markDone and markFailed: jobs_done.append(jobID) jobs_ongoing.remove(jobID) - else: - jobs_ongoing.append(jobID) # SB is this necessary ? AFAIU check_FTSJob has filled it - else: - jobs_ongoing.append(jobID) # should not be necessary.. but for consistency with above except Exception: logging.exception('Failed to update states') else: logging.warning('No FTS job ID to monitor yet') with open("task_process/transfers/fts_jobids_new.txt", "w+") as _jobids: - for line in list(set(jobs_ongoing)): # SB if we remove the un-necessary append above, non eed for a set here + for line in jobs_ongoing: logging.info("Writing: %s", line) _jobids.write(line+"\n") diff --git a/scripts/task_process/cache_status.py b/scripts/task_process/cache_status.py old mode 100755 new mode 100644 index 7473b174d0..9d3d47c514 --- a/scripts/task_process/cache_status.py +++ b/scripts/task_process/cache_status.py @@ -1,20 +1,21 @@ #!/usr/bin/python +""" +VERSION OF CACHE_STATUS USING HTCONDOR JobEventLog API +THIS REQUIRES HTCONDOR 8.9.3 OR ABOVE +""" from __future__ import print_function, division import re import time import logging import os import ast -import sys -import classad import glob import copy from shutil import move -# Need to import HTCondorUtils from a parent directory, not easy when the files are not in python packages. 
-# Solution by ajay, SO: http://stackoverflow.com/questions/11536764 -# /attempted-relative-import-in-non-package-even-with-init-py/27876800#comment28841658_19190695 -sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..'))) -import HTCondorUtils +import pickle +import json +import htcondor +import classad logging.basicConfig(filename='task_process/cache_status.log', level=logging.DEBUG) @@ -33,6 +34,8 @@ } STATUS_CACHE_FILE = "task_process/status_cache.txt" +PKL_STATUS_CACHE_FILE = "task_process/status_cache.pkl" +LOG_PARSING_POINTERS_DIR = "task_process/jel_pickles/" FJR_PARSE_RES_FILE = "task_process/fjr_parse_results.txt" # @@ -44,6 +47,12 @@ def insertCpu(event, info): + """ + add CPU usage information to the job info record + :param event: an event from HTCondor log + :param info: the structure where information for a job is collected + :return: nothing + """ if 'TotalRemoteUsage' in event: m = cpuRe.match(event['TotalRemoteUsage']) if m: @@ -62,9 +71,19 @@ def insertCpu(event, info): nodeNameRe = re.compile("DAG Node: Job(\d+(?:-\d+)?)") nodeName2Re = re.compile("Job(\d+(?:-\d+)?)") -def parseJobLog(fp, nodes, nodeMap): +# this now takes as input an htcondor.JobEventLog object +# which as of HTCondor 8.9 can be saved/restored with memory of +# where it had reached in processing the job log file +def parseJobLog(jel, nodes, nodeMap): + """ + parses new events in condor job log file and updates nodeMap + :param jel: a condor JobEventLog object which provides an iterator over events + :param nodes: the structure where we collect one job info for cache_status file + :param nodeMap: the structure where collect summary of all events + :return: nothing + """ count = 0 - for event in HTCondorUtils.readEvents(fp): + for event in jel.events(0): count += 1 eventtime = time.mktime(time.strptime(event['EventTime'], "%Y-%m-%dT%H:%M:%S")) if event['MyType'] == 'SubmitEvent': @@ -95,7 +114,7 @@ def parseJobLog(fp, nodes, nodeMap): if nodes[node]['StartTimes'] : nodes[node]['WallDurations'][-1] = nodes[node]['EndTimes'][-1] - nodes[node]['StartTimes'][-1] else: - nodes[node]['WallDurations'][-1] = 0 + nodes[node]['WallDurations'][-1] = 0 insertCpu(event, nodes[node]) if event['TerminatedNormally']: if event['ReturnValue'] == 0: @@ -138,10 +157,6 @@ def parseJobLog(fp, nodes, nodeMap): nodes[node]['StartTimes'].append(-1) if not nodes[node]['RecordedSite']: nodes[node]['SiteHistory'].append("Unknown") - if nodes[node]['State'] == 'running': - nodes[node]['EndTimes'].append(eventtime) - # nodes[node]['State'] can be 'running' only if an ExcuteEvent was found, so StartTime must be defined - nodes[node]['WallDurations'][-1] = nodes[node]['EndTimes'][-1] - nodes[node]['StartTimes'][-1] nodes[node]['State'] = 'killed' insertCpu(event, nodes[node]) elif event['MyType'] == 'JobHeldEvent': @@ -195,13 +210,20 @@ def parseJobLog(fp, nodes, nodeMap): info['SiteHistory'].append("Unknown") def parseErrorReport(data, nodes): - #iterate over the jobs and set the error dict for those which are failed - for jobid, statedict in nodes.iteritems(): + """ + iterate over the jobs and set the error dict for those which are failed + :param data: a dictionary as returned by summarizeFjrParseResults() : {jobid:errdict} + errdict is {crab_retry:error_summary} from PostJob/prepareErrorSummary + which writes one line for PostJoun run: {job_id : {crab_retry : error_summary}} + where crab_retry is a string and error_summary a list [exitcode, errorMsg, {}] + :param nodes: a dictionary with format 
{jobid:statedict} + :return: nothing, modifies nodes in place + """ + for jobid, statedict in nodes.items(): if 'State' in statedict and statedict['State'] == 'failed' and jobid in data: - # data[jobid] is a dictionary with the retry number as a key and error summary information as a value. - # Here we want to get the error summary information, and since values() returns a list - # (even if there's only a single value) it has to be indexed to zero. - statedict['Error'] = data[jobid].values()[0] #data[jobid] contains all retries. take the last one + # pick error info from last retry (SB: AFAICT only last retry is listed anyhow) + for key in data[jobid]: + statedict['Error'] = data[jobid][key] def parseNodeStateV2(fp, nodes, level): """ @@ -271,35 +293,55 @@ def parseNodeStateV2(fp, nodes, level): # observed; STATUS_ERROR is terminal. info['State'] = 'failed' -# --- New code ---- - def storeNodesInfoInFile(): - # Open cache file and get the location until which the jobs_log was parsed last time - try: - if os.path.exists(STATUS_CACHE_FILE) and os.stat(STATUS_CACHE_FILE).st_size > 0: - logging.debug("cache file found, opening and reading") + """ + Open cache file and get the location until which the jobs_log was parsed last time + returns: a dictionary with keys: jobLogCheckpoint, fjrParseResCheckpoint, nodes, nodeMap + """ + jobLogCheckpoint = None + if os.path.exists(STATUS_CACHE_FILE) and os.stat(STATUS_CACHE_FILE).st_size > 0: + logging.debug("cache file found, opening") + try: nodesStorage = open(STATUS_CACHE_FILE, "r") + jobLogCheckpoint = nodesStorage.readline().strip() + if jobLogCheckpoint.startswith('#') : + logging.debug("cache file contains initial comments, skipping") + # comment line indicates a place-holder file created at DAG bootstrap time + jobLogCheckpoint = None + else: + logging.debug("reading cache file") + fjrParseResCheckpoint = int(nodesStorage.readline()) + nodes = ast.literal_eval(nodesStorage.readline()) + nodeMap = ast.literal_eval(nodesStorage.readline()) + nodesStorage.close() + except Exception: + logging.exception("error during status_cache handling") + jobLogCheckpoint = None - jobLogCheckpoint = int(nodesStorage.readline()) - fjrParseResCheckpoint = int(nodesStorage.readline()) - nodes = ast.literal_eval(nodesStorage.readline()) - nodeMap = ast.literal_eval(nodesStorage.readline()) - nodesStorage.close() - else: - logging.debug("cache file not found, creating") - jobLogCheckpoint = 0 - fjrParseResCheckpoint = 0 - nodes = {} - nodeMap = {} - except Exception: - logging.exception("error during status_cache handling") - jobsLog = open("job_log", "r") + if not jobLogCheckpoint: + logging.debug("no usable cache file found, creating") + fjrParseResCheckpoint = 0 + nodes = {} + nodeMap = {} - jobsLog.seek(jobLogCheckpoint) + if jobLogCheckpoint: + # resume log parsing where we left + with open((LOG_PARSING_POINTERS_DIR+jobLogCheckpoint), 'rb') as f: + jel = pickle.load(f) + else: + # parse log from beginning + jel = htcondor.JobEventLog('job_log') + #jobsLog = open("job_log", "r") + #jobsLog.seek(jobLogCheckpoint) - parseJobLog(jobsLog, nodes, nodeMap) - newJobLogCheckpoint = jobsLog.tell() - jobsLog.close() + parseJobLog(jel, nodes, nodeMap) + # save jel object in a pickle file made unique by a timestamp + newJelPickleName = 'jel-%d.pkl' % int(time.time()) + if not os.path.exists(LOG_PARSING_POINTERS_DIR): + os.mkdir(LOG_PARSING_POINTERS_DIR) + with open((LOG_PARSING_POINTERS_DIR+newJelPickleName), 'wb') as f: + pickle.dump(jel, f) + newJobLogCheckpoint = 
newJelPickleName for fn in glob.glob("node_state*"): level = re.match(r'(\w+)(?:.(\w+))?', fn).group(2) @@ -326,6 +368,131 @@ def storeNodesInfoInFile(): move(tempFilename, STATUS_CACHE_FILE) + # collect all cache info in a single dictionary and return it to called + cacheDoc = {} + cacheDoc['jobLogCheckpoint'] = newJobLogCheckpoint + cacheDoc['fjrParseResCheckpoint'] = newFjrParseResCheckpoint + cacheDoc['nodes'] = nodes + cacheDoc['nodeMap'] = nodeMap + return cacheDoc + +def readOldStatusCacheFile(): + """ + it is enough to read the Pickle version, since we want to transition to that + returns: a dictionary with keys: jobLogCheckpoint, fjrParseResCheckpoint, nodes, nodeMap + """ + jobLogCheckpoint = None + if os.path.exists(PKL_STATUS_CACHE_FILE) and os.stat(PKL_STATUS_CACHE_FILE).st_size > 0: + logging.debug("cache file found, opening") + try: + with open(PKL_STATUS_CACHE_FILE, "rb") as fp: + cacheDoc = pickle.load(fp) + # protect against fake file with just bootstrapTime created by AdjustSites.py + jobLogCheckpoint = getattr(cacheDoc, 'jobLogCheckpoint', None) + fjrParseResCheckpoint = getattr(cacheDoc, 'fjrParseResCheckpoint', None) + nodes = getattr(cacheDoc, 'nodes', None) + nodeMap = getattr(cacheDoc, 'nodeMap', None) + except Exception: + logging.exception("error during status_cache handling") + jobLogCheckpoint = None + + if not jobLogCheckpoint: + logging.debug("no usable cache file found, creating") + fjrParseResCheckpoint = 0 + nodes = {} + nodeMap = {} + + # collect all cache info in a single dictionary and return it to called + cacheDoc = {} + cacheDoc['jobLogCheckpoint'] = jobLogCheckpoint + cacheDoc['fjrParseResCheckpoint'] = fjrParseResCheckpoint + cacheDoc['nodes'] = nodes + cacheDoc['nodeMap'] = nodeMap + return cacheDoc + +def parseCondorLog(cacheDoc): + """ + do all real work and update checkpoints, nodes and nodemap dictionaries + takes as input a cacheDoc dictionary with keys + jobLogCheckpoint, fjrParseResCheckpoint, nodes, nodeMap + and returns the same dictionary with updated information + """ + + jobLogCheckpoint = cacheDoc['jobLogCheckpoint'] + fjrParseResCheckpoint = cacheDoc['fjrParseResCheckpoint'] + nodes = cacheDoc['nodes'] + nodeMap = cacheDoc['nodeMap'] + if jobLogCheckpoint: + # resume log parsing where we left + with open((LOG_PARSING_POINTERS_DIR+jobLogCheckpoint), 'rb') as f: + jel = pickle.load(f) + else: + # parse log from beginning + jel = htcondor.JobEventLog('job_log') + + parseJobLog(jel, nodes, nodeMap) + # save jel object in a pickle file made unique by a timestamp + newJelPickleName = 'jel-%d.pkl' % int(time.time()) + if not os.path.exists(LOG_PARSING_POINTERS_DIR): + os.mkdir(LOG_PARSING_POINTERS_DIR) + with open((LOG_PARSING_POINTERS_DIR+newJelPickleName), 'wb') as f: + pickle.dump(jel, f) + newJobLogCheckpoint = newJelPickleName + + for fn in glob.glob("node_state*"): + level = re.match(r'(\w+)(?:.(\w+))?', fn).group(2) + with open(fn, 'r') as nodeState: + parseNodeStateV2(nodeState, nodes, level) + + try: + errorSummary, newFjrParseResCheckpoint = summarizeFjrParseResults(fjrParseResCheckpoint) + if errorSummary and newFjrParseResCheckpoint: + parseErrorReport(errorSummary, nodes) + except IOError: + logging.exception("error during error_summary file handling") + + # collect all cache info in a single dictionary and return it to called + newCacheDoc = {} + newCacheDoc['jobLogCheckpoint'] = newJobLogCheckpoint + newCacheDoc['fjrParseResCheckpoint'] = newFjrParseResCheckpoint + newCacheDoc['nodes'] = nodes + newCacheDoc['nodeMap'] = 
nodeMap + return newCacheDoc + +def storeNodesInfoInPklFile(cacheDoc): + """ + takes as input an cacheDoc dictionary with keys + jobLogCheckpoint, fjrParseResCheckpoint, nodes, nodeMap + """ + # First write the new cache file under a temporary name, so that other processes + # don't get an incomplete result. Then replace the old one with the new one. + tempFilename = (PKL_STATUS_CACHE_FILE + ".%s") % os.getpid() + + # persist cache info in py2-compatible pickle format + with open(tempFilename, "wb") as fp: + pickle.dump(cacheDoc, fp, protocol=2) + move(tempFilename, PKL_STATUS_CACHE_FILE) + +def storeNodesInfoInTxtFile(cacheDoc): + """ + takes as input an cacheDoc dictionary with keys + jobLogCheckpoint, fjrParseResCheckpoint, nodes, nodeMap + """ + jobLogCheckpoint = cacheDoc['jobLogCheckpoint'] + fjrParseResCheckpoint = cacheDoc['fjrParseResCheckpoint'] + nodes = cacheDoc['nodes'] + nodeMap = cacheDoc['nodeMap'] + # First write the new cache file under a temporary name, so that other processes + # don't get an incomplete result. Then replace the old one with the new one. + tempFilename = (STATUS_CACHE_FILE + ".%s") % os.getpid() + + nodesStorage = open(tempFilename, "w") + nodesStorage.write(str(jobLogCheckpoint) + "\n") + nodesStorage.write(str(fjrParseResCheckpoint) + "\n") + nodesStorage.write(str(nodes) + "\n") + nodesStorage.write(str(nodeMap) + "\n") + nodesStorage.close() + def summarizeFjrParseResults(checkpoint): ''' Reads the fjr_parse_results file line by line. The file likely contains multiple @@ -337,6 +504,14 @@ def summarizeFjrParseResults(checkpoint): Return the updated error dictionary and also the location until which the fjr_parse_results file was read so that we can store it and don't have t re-read the same information next time the cache_status.py runs. + + SB: what this does is to convert a JSON file with a list of dictionaries [{job:msg},...] which + may have the same jobId as key, into a single dictionay with for each jobId key contains only the + last value of msg + for d in content: + for k,v in d.items(): + errDict[k] = v + ''' if os.path.exists(FJR_PARSE_RES_FILE): @@ -344,23 +519,35 @@ def summarizeFjrParseResults(checkpoint): f.seek(checkpoint) content = f.readlines() newCheckpoint = f.tell() - errDict = {} for line in content: - fjrResult = ast.literal_eval(line) - jobId = fjrResult.keys()[0] - errDict[jobId] = fjrResult[jobId] + fjrResult = json.loads(line) + for jobId,msg in fjrResult.items(): + errDict[jobId] = msg return errDict, newCheckpoint else: return None, 0 def main(): + """ + parse condor job_log from last checkpoint until now and write summary in status_cache files + :return: + """ try: - storeNodesInfoInFile() + # this is the old part + cacheDoc = storeNodesInfoInFile() + # this is new for the picke file but for the time being stick to using + # cacheDoc information from old way. At some point shoudl carefull check code + # and move on to the more strucutred 3-steps below, most likely when running + # in python3 the old status_cache.txt file will be unusable, as we found in crab client + #infoN = readOldStatusCacheFile() + #infoN = parseCondorLog(info) + storeNodesInfoInPklFile(cacheDoc) + # in case we still need the text file (e.g. 
for the UI) when we remove the old code: + # storeNodesInfoInTxtFile(cacheDoc) except Exception: logging.exception("error during main loop") main() -logging.debug("cache_status.py exiting") - +logging.debug("cache_status_jel.py exiting") diff --git a/scripts/task_process/cache_status_jel.py b/scripts/task_process/cache_status_jel.py deleted file mode 100644 index b716b72c09..0000000000 --- a/scripts/task_process/cache_status_jel.py +++ /dev/null @@ -1,543 +0,0 @@ -#!/usr/bin/python -""" -VERSION OF CACHE_STATUS USING HTCONDOR JobEventLog API -THIS REQUIRES HTCONDOR 8.9.3 OR ABOVE -""" -from __future__ import print_function, division -import re -import time -import logging -import os -import ast -import glob -import copy -from shutil import move -import pickle -import htcondor -import classad - -logging.basicConfig(filename='task_process/cache_status.log', level=logging.DEBUG) - -NODE_DEFAULTS = { - 'Retries': 0, - 'Restarts': 0, - 'SiteHistory': [], - 'ResidentSetSize': [], - 'SubmitTimes': [], - 'StartTimes': [], - 'EndTimes': [], - 'TotalUserCpuTimeHistory': [], - 'TotalSysCpuTimeHistory': [], - 'WallDurations': [], - 'JobIds': [] -} - -STATUS_CACHE_FILE = "task_process/status_cache.txt" -PKL_STATUS_CACHE_FILE = "task_process/status_cache.pkl" -LOG_PARSING_POINTERS_DIR = "task_process/jel_pickles/" -FJR_PARSE_RES_FILE = "task_process/fjr_parse_results.txt" - -# -# insertCpu, parseJobLog, parsNodeStateV2 and parseErrorReport -# code copied from the backend HTCondorDataWorkflow.py with minimal changes. -# - -cpuRe = re.compile(r"Usr \d+ (\d+):(\d+):(\d+), Sys \d+ (\d+):(\d+):(\d+)") - - -def insertCpu(event, info): - """ - add CPU usage information to the job info record - :param event: an event from HTCondor log - :param info: the structure where information for a job is collected - :return: nothing - """ - if 'TotalRemoteUsage' in event: - m = cpuRe.match(event['TotalRemoteUsage']) - if m: - g = [int(i) for i in m.groups()] - user = g[0] * 3600 + g[1] * 60 + g[2] - system = g[3] * 3600 + g[4] * 60 + g[5] - info['TotalUserCpuTimeHistory'][-1] = user - info['TotalSysCpuTimeHistory'][-1] = system - else: - if 'RemoteSysCpu' in event: - info['TotalSysCpuTimeHistory'][-1] = float(event['RemoteSysCpu']) - if 'RemoteUserCpu' in event: - info['TotalUserCpuTimeHistory'][-1] = float(event['RemoteUserCpu']) - - -nodeNameRe = re.compile("DAG Node: Job(\d+(?:-\d+)?)") -nodeName2Re = re.compile("Job(\d+(?:-\d+)?)") - -# this now takes as input an htcondor.JobEventLog object -# which as of HTCondor 8.9 can be saved/restored with memory of -# where it had reached in processing the job log file -def parseJobLog(jel, nodes, nodeMap): - """ - parses new events in condor job log file and updates nodeMap - :param jel: a condor JobEventLog object which provides an iterator over events - :param nodes: the structure where we collect one job info for cache_status file - :param nodeMap: the structure where collect summary of all events - :return: nothing - """ - count = 0 - for event in jel.events(0): - count += 1 - eventtime = time.mktime(time.strptime(event['EventTime'], "%Y-%m-%dT%H:%M:%S")) - if event['MyType'] == 'SubmitEvent': - m = nodeNameRe.match(event['LogNotes']) - if m: - node = m.groups()[0] - proc = event['Cluster'], event['Proc'] - info = nodes.setdefault(node, copy.deepcopy(NODE_DEFAULTS)) - info['State'] = 'idle' - info['JobIds'].append("%d.%d" % proc) - info['RecordedSite'] = False - info['SubmitTimes'].append(eventtime) - info['TotalUserCpuTimeHistory'].append(0) - 
info['TotalSysCpuTimeHistory'].append(0) - info['WallDurations'].append(0) - info['ResidentSetSize'].append(0) - info['Retries'] = len(info['SubmitTimes'])-1 - nodeMap[proc] = node - elif event['MyType'] == 'ExecuteEvent': - node = nodeMap[event['Cluster'], event['Proc']] - nodes[node]['StartTimes'].append(eventtime) - nodes[node]['State'] = 'running' - nodes[node]['RecordedSite'] = False - elif event['MyType'] == 'JobTerminatedEvent': - node = nodeMap[event['Cluster'], event['Proc']] - nodes[node]['EndTimes'].append(eventtime) - # at times HTCondor does not log the ExecuteEvent and there's no StartTime - if nodes[node]['StartTimes'] : - nodes[node]['WallDurations'][-1] = nodes[node]['EndTimes'][-1] - nodes[node]['StartTimes'][-1] - else: - nodes[node]['WallDurations'][-1] = 0 - insertCpu(event, nodes[node]) - if event['TerminatedNormally']: - if event['ReturnValue'] == 0: - nodes[node]['State'] = 'transferring' - else: - nodes[node]['State'] = 'cooloff' - else: - nodes[node]['State'] = 'cooloff' - elif event['MyType'] == 'PostScriptTerminatedEvent': - m = nodeName2Re.match(event['DAGNodeName']) - if m: - node = m.groups()[0] - if event['TerminatedNormally']: - if event['ReturnValue'] == 0: - nodes[node]['State'] = 'finished' - elif event['ReturnValue'] == 2: - nodes[node]['State'] = 'failed' - else: - nodes[node]['State'] = 'cooloff' - else: - nodes[node]['State'] = 'cooloff' - elif event['MyType'] == 'ShadowExceptionEvent' or event["MyType"] == "JobReconnectFailedEvent" or event['MyType'] == 'JobEvictedEvent': - node = nodeMap[event['Cluster'], event['Proc']] - if nodes[node]['State'] != 'idle': - nodes[node]['EndTimes'].append(eventtime) - if nodes[node]['WallDurations'] and nodes[node]['EndTimes'] and nodes[node]['StartTimes']: - nodes[node]['WallDurations'][-1] = nodes[node]['EndTimes'][-1] - nodes[node]['StartTimes'][-1] - nodes[node]['State'] = 'idle' - insertCpu(event, nodes[node]) - nodes[node]['TotalUserCpuTimeHistory'].append(0) - nodes[node]['TotalSysCpuTimeHistory'].append(0) - nodes[node]['WallDurations'].append(0) - nodes[node]['ResidentSetSize'].append(0) - nodes[node]['SubmitTimes'].append(-1) - nodes[node]['JobIds'].append(nodes[node]['JobIds'][-1]) - nodes[node]['Restarts'] += 1 - elif event['MyType'] == 'JobAbortedEvent': - node = nodeMap[event['Cluster'], event['Proc']] - if nodes[node]['State'] == "idle" or nodes[node]['State'] == "held": - nodes[node]['StartTimes'].append(-1) - if not nodes[node]['RecordedSite']: - nodes[node]['SiteHistory'].append("Unknown") - nodes[node]['State'] = 'killed' - insertCpu(event, nodes[node]) - elif event['MyType'] == 'JobHeldEvent': - node = nodeMap[event['Cluster'], event['Proc']] - if nodes[node]['State'] == 'running': - nodes[node]['EndTimes'].append(eventtime) - if nodes[node]['WallDurations'] and nodes[node]['EndTimes'] and nodes[node]['StartTimes']: - nodes[node]['WallDurations'][-1] = nodes[node]['EndTimes'][-1] - nodes[node]['StartTimes'][-1] - insertCpu(event, nodes[node]) - nodes[node]['TotalUserCpuTimeHistory'].append(0) - nodes[node]['TotalSysCpuTimeHistory'].append(0) - nodes[node]['WallDurations'].append(0) - nodes[node]['ResidentSetSize'].append(0) - nodes[node]['SubmitTimes'].append(-1) - nodes[node]['JobIds'].append(nodes[node]['JobIds'][-1]) - nodes[node]['Restarts'] += 1 - nodes[node]['State'] = 'held' - elif event['MyType'] == 'JobReleaseEvent': - node = nodeMap[event['Cluster'], event['Proc']] - nodes[node]['State'] = 'idle' - elif event['MyType'] == 'JobAdInformationEvent': - node = nodeMap[event['Cluster'], 
event['Proc']] - if (not nodes[node]['RecordedSite']) and ('JOBGLIDEIN_CMSSite' in event) and not event['JOBGLIDEIN_CMSSite'].startswith("$$"): - nodes[node]['SiteHistory'].append(event['JOBGLIDEIN_CMSSite']) - nodes[node]['RecordedSite'] = True - insertCpu(event, nodes[node]) - elif event['MyType'] == 'JobImageSizeEvent': - node = nodeMap[event['Cluster'], event['Proc']] - nodes[node]['ResidentSetSize'][-1] = int(event['ResidentSetSize']) - if nodes[node]['StartTimes']: - nodes[node]['WallDurations'][-1] = eventtime - nodes[node]['StartTimes'][-1] - insertCpu(event, nodes[node]) - elif event["MyType"] == "JobDisconnectedEvent" or event["MyType"] == "JobReconnectedEvent": - # These events don't really affect the node status - pass - else: - logging.warning("Unknown event type: %s", event['MyType']) - - logging.debug("There were %d events in the job log.", count) - now = time.time() - for node, info in nodes.items(): - if node == 'DagStatus': - # StartTimes and WallDurations are not present, though crab status2 uses this record to get the DagStatus. - continue - lastStart = now - if info['StartTimes']: - lastStart = info['StartTimes'][-1] - while len(info['WallDurations']) < len(info['SiteHistory']): - info['WallDurations'].append(now - lastStart) - while len(info['WallDurations']) > len(info['SiteHistory']): - info['SiteHistory'].append("Unknown") - -def parseErrorReport(data, nodes): - """ - iterate over the jobs and set the error dict for those which are failed - :param data: - :param nodes: - :return: - """ - for jobid, statedict in nodes.iteritems(): - if 'State' in statedict and statedict['State'] == 'failed' and jobid in data: - # data[jobid] is a dictionary with the retry number as a key and error summary information as a value. - # Here we want to get the error summary information, and since values() returns a list - # (even if there's only a single value) it has to be indexed to zero. - statedict['Error'] = data[jobid].values()[0] #data[jobid] contains all retries. take the last one - -def parseNodeStateV2(fp, nodes, level): - """ - HTCondor 8.1.6 updated the node state file to be classad-based. - This is a more flexible format that allows future extensions but, unfortunately, - also requires a separate parser. 
- """ - dagStatus = nodes.setdefault("DagStatus", {}) - dagStatus.setdefault("SubDagStatus", {}) - subDagStatus = dagStatus.setdefault("SubDags", {}) - for ad in classad.parseAds(fp): - if ad['Type'] == "DagStatus": - if level: - statusDict = subDagStatus.setdefault(int(level), {}) - statusDict['Timestamp'] = ad.get('Timestamp', -1) - statusDict['NodesTotal'] = ad.get('NodesTotal', -1) - statusDict['DagStatus'] = ad.get('DagStatus', -1) - else: - dagStatus['Timestamp'] = ad.get('Timestamp', -1) - dagStatus['NodesTotal'] = ad.get('NodesTotal', -1) - dagStatus['DagStatus'] = ad.get('DagStatus', -1) - continue - if ad['Type'] != "NodeStatus": - continue - node = ad.get("Node", "") - if node.endswith("SubJobs"): - status = ad.get('NodeStatus', -1) - dagname = "RunJobs{0}.subdag".format(nodeName2Re.match(node).group(1)) - # Add special state where we *expect* a submitted DAG for the - # status command on the client - if status == 5 and os.path.exists(dagname) and os.stat(dagname).st_size > 0: - status = 99 - dagStatus["SubDagStatus"][node] = status - continue - if not node.startswith("Job"): - continue - nodeid = node[3:] - status = ad.get('NodeStatus', -1) - retry = ad.get('RetryCount', -1) - msg = ad.get("StatusDetails", "") - info = nodes.setdefault(nodeid, copy.deepcopy(NODE_DEFAULTS)) - if status == 1: # STATUS_READY - if info.get("State") == "transferring": - info["State"] = "cooloff" - elif info.get('State') != "cooloff": - info['State'] = 'unsubmitted' - elif status == 2: # STATUS_PRERUN - if retry == 0: - info['State'] = 'unsubmitted' - else: - info['State'] = 'cooloff' - elif status == 3: # STATUS_SUBMITTED - if msg == 'not_idle': - info.setdefault('State', 'running') - else: - info.setdefault('State', 'idle') - elif status == 4: # STATUS_POSTRUN - if info.get("State") != "cooloff": - info['State'] = 'transferring' - elif status == 5: # STATUS_DONE - info['State'] = 'finished' - elif status == 6: # STATUS_ERROR - # Older versions of HTCondor would put jobs into STATUS_ERROR - # for a short time if the job was to be retried. Hence, we had - # some status parsing logic to try and guess whether the job would - # be tried again in the near future. This behavior is no longer - # observed; STATUS_ERROR is terminal. 
- info['State'] = 'failed' - -def storeNodesInfoInFile(): - """ - Open cache file and get the location until which the jobs_log was parsed last time - returns: a dictionary with keys: jobLogCheckpoint, fjrParseResCheckpoint, nodes, nodeMap - """ - jobLogCheckpoint = None - if os.path.exists(STATUS_CACHE_FILE) and os.stat(STATUS_CACHE_FILE).st_size > 0: - logging.debug("cache file found, opening") - try: - nodesStorage = open(STATUS_CACHE_FILE, "r") - jobLogCheckpoint = nodesStorage.readline().strip() - if jobLogCheckpoint.startswith('#') : - logging.debug("cache file contains initial comments, skipping") - # comment line indicates a place-holder file created at DAG bootstrap time - jobLogCheckpoint = None - else: - logging.debug("reading cache file") - fjrParseResCheckpoint = int(nodesStorage.readline()) - nodes = ast.literal_eval(nodesStorage.readline()) - nodeMap = ast.literal_eval(nodesStorage.readline()) - nodesStorage.close() - except Exception: - logging.exception("error during status_cache handling") - jobLogCheckpoint = None - - if not jobLogCheckpoint: - logging.debug("no usable cache file found, creating") - fjrParseResCheckpoint = 0 - nodes = {} - nodeMap = {} - - if jobLogCheckpoint: - # resume log parsing where we left - with open((LOG_PARSING_POINTERS_DIR+jobLogCheckpoint), 'r') as f: - jel = pickle.load(f) - else: - # parse log from beginning - jel = htcondor.JobEventLog('job_log') - #jobsLog = open("job_log", "r") - #jobsLog.seek(jobLogCheckpoint) - - parseJobLog(jel, nodes, nodeMap) - # save jel object in a pickle file made unique by a timestamp - newJelPickleName = 'jel-%d.pkl' % int(time.time()) - if not os.path.exists(LOG_PARSING_POINTERS_DIR): - os.mkdir(LOG_PARSING_POINTERS_DIR) - with open((LOG_PARSING_POINTERS_DIR+newJelPickleName), 'w') as f: - pickle.dump(jel, f) - newJobLogCheckpoint = newJelPickleName - - for fn in glob.glob("node_state*"): - level = re.match(r'(\w+)(?:.(\w+))?', fn).group(2) - with open(fn, 'r') as nodeState: - parseNodeStateV2(nodeState, nodes, level) - - try: - errorSummary, newFjrParseResCheckpoint = summarizeFjrParseResults(fjrParseResCheckpoint) - if errorSummary and newFjrParseResCheckpoint: - parseErrorReport(errorSummary, nodes) - except IOError: - logging.exception("error during error_summary file handling") - - # First write the new cache file under a temporary name, so that other processes - # don't get an incomplete result. Then replace the old one with the new one. 
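The write-to-a-temporary-name-then-replace pattern described in the comment above is the same one the surviving storeNodesInfoInPklFile relies on. A minimal, self-contained sketch of the idea, with hypothetical payload/finalPath names, assuming the temporary file sits on the same filesystem as the destination so that the final move amounts to an atomic rename:

import os
from shutil import move

def atomicWrite(finalPath, payload):
    # write under a pid-suffixed name so concurrent readers never see a half-written file
    tempPath = "%s.%s" % (finalPath, os.getpid())
    with open(tempPath, "w") as fp:
        fp.write(payload)
    # swap the new file in; on a single filesystem this is one atomic rename
    move(tempPath, finalPath)

# example payload mimicking the four-line status_cache.txt layout (checkpoint, fjr checkpoint, nodes, nodeMap)
atomicWrite("task_process/status_cache.txt", "jel-0.pkl\n0\n{}\n{}\n")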
- tempFilename = (STATUS_CACHE_FILE + ".%s") % os.getpid() - - nodesStorage = open(tempFilename, "w") - nodesStorage.write(str(newJobLogCheckpoint) + "\n") - nodesStorage.write(str(newFjrParseResCheckpoint) + "\n") - nodesStorage.write(str(nodes) + "\n") - nodesStorage.write(str(nodeMap) + "\n") - nodesStorage.close() - - move(tempFilename, STATUS_CACHE_FILE) - - # collect all cache info in a single dictionary and return it to called - cacheDoc = {} - cacheDoc['jobLogCheckpoint'] = newJobLogCheckpoint - cacheDoc['fjrParseResCheckpoint'] = newFjrParseResCheckpoint - cacheDoc['nodes'] = nodes - cacheDoc['nodeMap'] = nodeMap - return cacheDoc - -def readOldStatusCacheFile(): - """ - it is enough to read the Pickle version, since we want to transition to that - returns: a dictionary with keys: jobLogCheckpoint, fjrParseResCheckpoint, nodes, nodeMap - """ - jobLogCheckpoint = None - if os.path.exists(PKL_STATUS_CACHE_FILE) and os.stat(PKL_STATUS_CACHE_FILE).st_size > 0: - logging.debug("cache file found, opening") - try: - with open(PKL_STATUS_CACHE_FILE, "r") as fp: - cacheDoc = pickle.load(fp) - # protect against fake file with just bootstrapTime created by AdjustSites.py - jobLogCheckpoint = getattr(cacheDoc, 'jobLogCheckpoint', None) - fjrParseResCheckpoint = getattr(cacheDoc, 'fjrParseResCheckpoint', None) - nodes = getattr(cacheDoc, 'nodes', None) - nodeMap = getattr(cacheDoc, 'nodeMap', None) - except Exception: - logging.exception("error during status_cache handling") - jobLogCheckpoint = None - - if not jobLogCheckpoint: - logging.debug("no usable cache file found, creating") - fjrParseResCheckpoint = 0 - nodes = {} - nodeMap = {} - - # collect all cache info in a single dictionary and return it to called - cacheDoc = {} - cacheDoc['jobLogCheckpoint'] = jobLogCheckpoint - cacheDoc['fjrParseResCheckpoint'] = fjrParseResCheckpoint - cacheDoc['nodes'] = nodes - cacheDoc['nodeMap'] = nodeMap - return cacheDoc - -def parseCondorLog(cacheDoc): - """ - do all real work and update checkpoints, nodes and nodemap dictionaries - takes as input a cacheDoc dictionary with keys - jobLogCheckpoint, fjrParseResCheckpoint, nodes, nodeMap - and returns the same dictionary with updated information - """ - - jobLogCheckpoint = cacheDoc['jobLogCheckpoint'] - fjrParseResCheckpoint = cacheDoc['fjrParseResCheckpoint'] - nodes = cacheDoc['nodes'] - nodeMap = cacheDoc['nodeMap'] - if jobLogCheckpoint: - # resume log parsing where we left - with open((LOG_PARSING_POINTERS_DIR+jobLogCheckpoint), 'r') as f: - jel = pickle.load(f) - else: - # parse log from beginning - jel = htcondor.JobEventLog('job_log') - - parseJobLog(jel, nodes, nodeMap) - # save jel object in a pickle file made unique by a timestamp - newJelPickleName = 'jel-%d.pkl' % int(time.time()) - if not os.path.exists(LOG_PARSING_POINTERS_DIR): - os.mkdir(LOG_PARSING_POINTERS_DIR) - with open((LOG_PARSING_POINTERS_DIR+newJelPickleName), 'w') as f: - pickle.dump(jel, f) - newJobLogCheckpoint = newJelPickleName - - for fn in glob.glob("node_state*"): - level = re.match(r'(\w+)(?:.(\w+))?', fn).group(2) - with open(fn, 'r') as nodeState: - parseNodeStateV2(nodeState, nodes, level) - - try: - errorSummary, newFjrParseResCheckpoint = summarizeFjrParseResults(fjrParseResCheckpoint) - if errorSummary and newFjrParseResCheckpoint: - parseErrorReport(errorSummary, nodes) - except IOError: - logging.exception("error during error_summary file handling") - - # collect all cache info in a single dictionary and return it to called - newCacheDoc = {} - 
newCacheDoc['jobLogCheckpoint'] = newJobLogCheckpoint - newCacheDoc['fjrParseResCheckpoint'] = newFjrParseResCheckpoint - newCacheDoc['nodes'] = nodes - newCacheDoc['nodeMap'] = nodeMap - return newCacheDoc - -def storeNodesInfoInPklFile(cacheDoc): - """ - takes as input an cacheDoc dictionary with keys - jobLogCheckpoint, fjrParseResCheckpoint, nodes, nodeMap - """ - # First write the new cache file under a temporary name, so that other processes - # don't get an incomplete result. Then replace the old one with the new one. - tempFilename = (PKL_STATUS_CACHE_FILE + ".%s") % os.getpid() - - # persist cache info in py2-compatible pickle format - with open(tempFilename, "w") as fp: - pickle.dump(cacheDoc, fp, protocol=2) - move(tempFilename, PKL_STATUS_CACHE_FILE) - -def storeNodesInfoInTxtFile(cacheDoc): - """ - takes as input an cacheDoc dictionary with keys - jobLogCheckpoint, fjrParseResCheckpoint, nodes, nodeMap - """ - jobLogCheckpoint = cacheDoc['jobLogCheckpoint'] - fjrParseResCheckpoint = cacheDoc['fjrParseResCheckpoint'] - nodes = cacheDoc['nodes'] - nodeMap = cacheDoc['nodeMap'] - # First write the new cache file under a temporary name, so that other processes - # don't get an incomplete result. Then replace the old one with the new one. - tempFilename = (STATUS_CACHE_FILE + ".%s") % os.getpid() - - nodesStorage = open(tempFilename, "w") - nodesStorage.write(str(jobLogCheckpoint) + "\n") - nodesStorage.write(str(fjrParseResCheckpoint) + "\n") - nodesStorage.write(str(nodes) + "\n") - nodesStorage.write(str(nodeMap) + "\n") - nodesStorage.close() - -def summarizeFjrParseResults(checkpoint): - ''' - Reads the fjr_parse_results file line by line. The file likely contains multiple - errors for the same jobId coming from different retries, we only care about - the last error for each jobId. Since each postjob writes this information - sequentially (job retry #2 will be written after job retry #1), overwrite - whatever information there was before for each jobId. - - Return the updated error dictionary and also the location until which the - fjr_parse_results file was read so that we can store it and - don't have t re-read the same information next time the cache_status.py runs. - ''' - - if os.path.exists(FJR_PARSE_RES_FILE): - with open(FJR_PARSE_RES_FILE, "r") as f: - f.seek(checkpoint) - content = f.readlines() - newCheckpoint = f.tell() - - errDict = {} - for line in content: - fjrResult = ast.literal_eval(line) - jobId = fjrResult.keys()[0] - errDict[jobId] = fjrResult[jobId] - return errDict, newCheckpoint - else: - return None, 0 - -def main(): - """ - parse condor job_log from last checkpoint until now and write summary in status_cache files - :return: - """ - try: - # this is the old part - cacheDoc = storeNodesInfoInFile() - # this is new for the picke file but for the time being stick to using - # cacheDoc information from old way. At some point shoudl carefull check code - # and move on to the more strucutred 3-steps below, most likely when running - # in python3 the old status_cache.txt file will be unusable, as we found in crab client - #infoN = readOldStatusCacheFile() - #infoN = parseCondorLog(info) - storeNodesInfoInPklFile(cacheDoc) - # in case we still need the text file (e.g. 
for the UI) when we remove the old code: - # storeNodesInfoInTxtFile(cacheDoc) - except Exception: - logging.exception("error during main loop") - -main() - -logging.debug("cache_status_jel.py exiting") diff --git a/scripts/task_process/task_proc_wrapper.sh b/scripts/task_process/task_proc_wrapper.sh index e235794336..bd3a691ed9 100644 --- a/scripts/task_process/task_proc_wrapper.sh +++ b/scripts/task_process/task_proc_wrapper.sh @@ -6,12 +6,7 @@ function log { function cache_status { log "Running cache_status.py" - python task_process/cache_status.py -} - -function cache_status_jel { - log "Running cache_status_jel.py" - python task_process/cache_status_jel.py + python3 task_process/cache_status.py } function manage_transfers { @@ -21,9 +16,9 @@ function manage_transfers { DEST_LFN=`python -c 'import sys, json; print json.loads( open("task_process/transfers.txt").readlines()[0] )["destination_lfn"]' ` if [[ $DEST_LFN =~ ^/store/user/rucio/* ]]; then - timeout 15m python task_process/RUCIO_Transfers.py + timeout 15m env PYTHONPATH=$PYTHONPATH:$RucioPy3 python3 task_process/RUCIO_Transfers.py else - timeout 15m python task_process/FTS_Transfers.py + timeout 15m env PYTHONPATH=$PYTHONPATH:$RucioPy3 python3 task_process/FTS_Transfers.py fi err=$? @@ -101,9 +96,17 @@ TIME_OF_LAST_QUERY=$(date +"%s") # submission is most likely pointless and relatively expensive, the script will run normally and perform the query later. DAG_INFO="init" +# following two lines are needed to use pycurl on python3 without the full COMP or CMSSW env. +export PYTHONPATH=$PYTHONPATH:/data/srv/pycurl3/7.44.1 +source /cvmfs/cms.cern.ch/slc7_amd64_gcc900/external/curl/7.59.0/etc/profile.d/init.sh + export PYTHONPATH=`pwd`/task_process:`pwd`/CRAB3.zip:`pwd`/WMCore.zip:$PYTHONPATH -export PYTHONPATH=$PYTHONPATH:/cvmfs/cms.cern.ch/rucio/current/lib/python2.7/site-packages +# will use one of the other of following two as appropriate, of course this implies +# that we use python(2 or 3) from the OS and that we have an FTS python client available +# in CVMFS for the same python version +export RucioPy2=/cvmfs/cms.cern.ch/rucio/x86_64/slc7/py2/current/lib/python2.7/site-packages/ +export RucioPy3=/cvmfs/cms.cern.ch/rucio/x86_64/slc7/py3/current/lib/python3.6/site-packages/ log "Starting task daemon wrapper" while true @@ -120,11 +123,7 @@ do fi # Run the parsing script - if [ -f USE_JEL ] ; then - cache_status_jel - else - cache_status - fi + cache_status manage_transfers sleep 300s diff --git a/setup.py b/setup.py index d7328ddd7f..dc5b27a2f0 100644 --- a/setup.py +++ b/setup.py @@ -20,18 +20,17 @@ { 'CRABClient': #Will be used if we moved the CRABClient repository { - 'py_modules': ['RESTInteractions', 'ServerUtilities'], + 'py_modules': ['ServerUtilities'], 'python': [], }, 'CRABInterface': { - 'py_modules': ['PandaServerInterface', 'CRABQuality', 'HTCondorUtils', 'HTCondorLocator', 'ServerUtilities'], + 'py_modules': ['CRABQuality', 'HTCondorUtils', 'HTCondorLocator', 'ServerUtilities'], 'python': ['CRABInterface', 'CRABInterface/Pages', 'Databases', 'Databases/FileMetaDataDB', 'Databases/FileMetaDataDB/Oracle', 'Databases/FileMetaDataDB/Oracle/FileMetaData', 'Databases/TaskDB', 'Databases/TaskDB/Oracle', - 'Databases/TaskDB/Oracle/JobGroup', 'Databases/TaskDB/Oracle/Task', 'Databases/FileTransfersDB', 'Databases/FileTransfersDB/Oracle/', @@ -39,18 +38,12 @@ }, 'TaskWorker': { - 'py_modules': ['PandaServerInterface', 'RESTInteractions', - 'apmon', 'Logger', 'ProcInfo', + 'py_modules': ['RESTInteractions', 'CRABQuality', 
'HTCondorUtils', 'HTCondorLocator', 'ServerUtilities', 'MultiProcessingLog', 'CMSGroupMapper', 'RucioUtils', 'cache_status'], 'python': ['TaskWorker', 'TaskWorker/Actions', 'TaskWorker/DataObjects', - 'TaskWorker/Actions/Recurring', 'taskbuffer', 'Publisher', 'TransferInterface'] - }, - 'UserFileCache': - { - 'py_modules' : ['ServerUtilities'], - 'python': ['UserFileCache'] + 'TaskWorker/Actions/Recurring', 'Publisher', 'TransferInterface'] }, 'Publisher': { @@ -60,7 +53,7 @@ 'All': { 'py_modules': [''], - 'python': ['TaskWorker', 'CRABInterface', 'UserFileCache', 'CRABClient', 'Publisher'] + 'python': ['TaskWorker', 'CRABInterface', 'CRABClient', 'Publisher'] } } @@ -146,8 +139,8 @@ def define_the_build(dist, system_name, patch_x=''): class BuildCommand(Command): """Build python modules for a specific system.""" description = \ - "Build python modules for the specified system. The two supported systems\n" + \ - "\t\t at the moment are 'CRABInterface' and 'UserFileCache'. Use with --force \n" + \ + "Build python modules for the specified system. The supported system(s)\n" + \ + "\t\t at the moment are 'CRABInterface' . Use with --force \n" + \ "\t\t to ensure a clean build of only the requested parts.\n" user_options = build.user_options user_options.append(('system=', 's', 'build the specified system (default: CRABInterface)')) diff --git a/src/python/CRABInterface/Attrib.py b/src/python/CRABInterface/Attrib.py index 779cece1cf..5ee9b26e4c 100644 --- a/src/python/CRABInterface/Attrib.py +++ b/src/python/CRABInterface/Attrib.py @@ -11,7 +11,7 @@ def attr(*args, **kwargs): def wrap_ob(ob): for name in args: setattr(ob, name, True) - for name, value in kwargs.iteritems(): + for name, value in kwargs.items(): setattr(ob, name, value) return ob return wrap_ob diff --git a/src/python/CRABInterface/DataFileMetadata.py b/src/python/CRABInterface/DataFileMetadata.py index a8957c66ee..a8cf915d84 100644 --- a/src/python/CRABInterface/DataFileMetadata.py +++ b/src/python/CRABInterface/DataFileMetadata.py @@ -7,6 +7,8 @@ import logging from ast import literal_eval +from Utils.Utilities import decodeBytesToUnicode + from CRABInterface.Utilities import getDBinstance class DataFileMetadata(object): @@ -33,11 +35,9 @@ def getFiles(self, taskname, filetype, howmany, lfn): for row in rows: row = self.FileMetaData.GetFromTaskAndType_tuple(*row) if lfn==[] or row.lfn in lfn: - yield json.dumps({ + filedict = { 'taskname': taskname, 'filetype': filetype, - #TODO pandajobid should not be used. 
Let's wait a "quiet release" and remove it - 'pandajobid': row.pandajobid, 'jobid': row.jobid, 'outdataset': row.outdataset, 'acquisitionera': row.acquisitionera, @@ -55,9 +55,49 @@ def getFiles(self, taskname, filetype, howmany, lfn): 'filesize': row.filesize, 'parents': literal_eval(row.parents.read()), 'state': row.state, - 'created': str(row.parents), + 'created': literal_eval(row.parents.read()), # postpone conversion to str 'tmplfn': row.tmplfn - }) + } + ## temporary changes for making REST py3 compatible with Publisher py2 - start + ## this block of code can be removed after we complete the + ## deployment in production of the services running in python3 + # we aim at replacing with unicode all the bytes from such a dictionary: + # {'taskname': '220113_142727:dmapelli_crab_20220113_152722', + # 'filetype': 'EDM', + # 'jobid': '7', + # 'outdataset': '/GenericTTbar/dmapelli-[...]-94ba0e06145abd65ccb1d21786dc7e1d/USER', + # 'acquisitionera': 'null', + # 'swversion': 'CMSSW_10_6_29', + # 'inevents': 300, + # 'globaltag': 'None', + # 'publishname': '[...]-94ba0e06145abd65ccb1d21786dc7e1d', + # 'location': 'T2_CH_CERN', + # 'tmplocation': 'T2_UK_London_Brunel', + # 'runlumi': {b'1': {b'2521': b'300'}}, ## THIS CONTAINS BYTES + # 'adler32': '31018715', + # 'cksum': 2091402041, 'md5': 'asda', + # 'lfn': '/store/user/dmapelli/GenericTTbar/[...]/220113_142727/0000/output_7.root', + # 'filesize': 651499, + # 'parents': [b'/store/[...]-0CC47A7C34C8.root'], ## THIS CONTAINS BYTES + # 'state': None, + # 'created': "[b'/store/[...]-0CC47A7C34C8.root']", ## THIS CONTAINS BYTES + # 'tmplfn': '/store/user/dmapelli/GenericTTbar/[...]/220113_142727/0000/output_7.root'} + self.logger.info("converting bytes into unicode in filemetadata - before - %s", filedict) + for key0, val0 in filedict.items(): + if isinstance(val0, list): # 'parents' and 'created' + filedict[key0] = [decodeBytesToUnicode(el) for el in val0] + if isinstance(val0, dict): # 'runlumi' + for key1, val1 in list(val0.items()): + val0.pop(key1) + val0[decodeBytesToUnicode(key1)] = val1 + if isinstance(val1, dict): + for key2, val2 in list(val1.items()): + val1.pop(key2) + val1[decodeBytesToUnicode(key2)] = decodeBytesToUnicode(val2) + self.logger.info("converting bytes into unicode in filemetadata - after - %s", filedict) + ## temporary changes for making REST py3 compatible with Publisher py2 - end + filedict['created'] = str(filedict['created']) # convert to str, after removal of bytes + yield json.dumps(filedict) def inject(self, **kwargs): """ Insert or update a record in the database @@ -118,7 +158,7 @@ def changeState(self, **kwargs): #kwargs are (taskname, outlfn, filestate) """ self.logger.debug("Changing state of file %(outlfn)s in task %(taskname)s to %(filestate)s" % kwargs) - self.api.modify(self.FileMetaData.ChangeFileState_sql, **dict((k, [v]) for k, v in kwargs.iteritems())) + self.api.modify(self.FileMetaData.ChangeFileState_sql, **dict((k, [v]) for k, v in kwargs.items())) def delete(self, taskname, hours): """ UNUSED method that deletes record from the FILEMETADATA table diff --git a/src/python/CRABInterface/DataUserWorkflow.py b/src/python/CRABInterface/DataUserWorkflow.py index d2a5cbe26d..3d8db3c0c3 100644 --- a/src/python/CRABInterface/DataUserWorkflow.py +++ b/src/python/CRABInterface/DataUserWorkflow.py @@ -23,10 +23,6 @@ def getLatests(self, username, timestamp): (this should probably have a default!) 
:arg int limit: limit on the workflow age :return: a list of workflows""" - # convert the workflow age in something eatable by a couch view - # in practice it's convenient that the timestamp is on a fixed format: latest 1 or 3 days, latest 1 week, latest 1 month - # and that it's a list (probably it can be converted into it): [year, month-num, day, hh, mm, ss] - # this will allow to query as it's described here: http://guide.couchdb.org/draft/views.html#many return self.workflow.getLatests(username, timestamp) def errors(self, workflow, shortformat): @@ -49,14 +45,14 @@ def report(self, workflow, userdn, usedbs): def report2(self, workflow, userdn, usedbs): return self.workflow.report2(workflow, userdn) - def logs(self, workflow, howmany, exitcode, jobids, userdn, userproxy=None): + def logs(self, workflow, howmany, exitcode, jobids, userdn): """Returns the workflow logs PFN. It takes care of the LFN - PFN conversion too. :arg str workflow: a workflow name :arg int howmany: the limit on the number of PFN to return :arg int exitcode: the log has to be of a job ended with this exit_code :return: a generator of list of logs pfns""" - return self.workflow.logs(workflow, howmany, exitcode, jobids, userdn, userproxy) + return self.workflow.logs(workflow, howmany, exitcode, jobids, userdn) def logs2(self, workflow, howmany, jobids): """Returns information about the workflow log files. @@ -68,13 +64,13 @@ def logs2(self, workflow, howmany, jobids): :return: a generator of list of logs pfns""" return self.workflow.logs2(workflow, howmany, jobids) - def output(self, workflow, howmany, jobids, userdn, userproxy=None): + def output(self, workflow, howmany, jobids, userdn): """Returns the workflow output PFN. It takes care of the LFN - PFN conversion too. :arg str list workflow: a workflow name :arg int howmany: the limit on the number of PFN to return :return: a generator of list of output pfns""" - return self.workflow.output(workflow, howmany, jobids, userdn, userproxy) + return self.workflow.output(workflow, howmany, jobids, userdn) def output2(self, workflow, howmany, jobids): """Returns information about the workflow output files. @@ -148,38 +144,34 @@ def submit(self, *args, **kwargs): return self.workflow.submit(*args, **kwargs) - def resubmit(self, workflow, publication, jobids, force, siteblacklist, sitewhitelist, maxjobruntime, maxmemory, numcores, priority, userdn, userproxy=None): + def resubmit(self, workflow, publication, jobids, force, siteblacklist, sitewhitelist, maxjobruntime, maxmemory, numcores, priority, userdn): """Request to Resubmit a workflow. :arg str workflow: a workflow name""" - return self.workflow.resubmit(workflow, publication, jobids, force, siteblacklist, sitewhitelist, maxjobruntime, maxmemory, numcores, priority, userdn, userproxy) + return self.workflow.resubmit(workflow, publication, jobids, force, siteblacklist, sitewhitelist, maxjobruntime, maxmemory, numcores, priority, userdn) - def resubmit2(self, workflow, publication, jobids, siteblacklist, sitewhitelist, maxjobruntime, maxmemory, numcores, priority, - userproxy=None): + def resubmit2(self, workflow, publication, jobids, siteblacklist, sitewhitelist, maxjobruntime, maxmemory, numcores, priority): """Request to Resubmit a workflow. 
:arg str workflow: a workflow name""" - return self.workflow.resubmit2(workflow, publication, jobids, siteblacklist, sitewhitelist, maxjobruntime, maxmemory, numcores, priority, - userproxy) + return self.workflow.resubmit2(workflow, publication, jobids, siteblacklist, sitewhitelist, maxjobruntime, maxmemory, numcores, priority) - def status(self, workflow, userdn, userproxy=None, verbose=False): + def status(self, workflow, userdn, verbose=False): """Retrieve the status of the workflow :arg str workflow: a valid workflow name :arg str userdn: the user dn makind the request - :arg str userproxy: the user proxy retrieved by `retrieveUserCert` :return: a generator of workflow states """ - return self.workflow.status(workflow, userdn, userproxy) + return self.workflow.status(workflow, userdn) - def kill(self, workflow, force, killwarning, userdn, userproxy=None): + def kill(self, workflow, killwarning=''): """Request to Abort a workflow. :arg str workflow: a workflow name :arg str force: a flag to know if kill should be brutal - :arg str userproxy: the user proxy retrieved by `retrieveUserCert` :arg int force: force to delete the workflows in any case; 0 no, everything else yes""" - return self.workflow.kill(workflow, force, killwarning, userdn, userproxy) + return self.workflow.kill(workflow, killwarning) def proceed(self, workflow): """Continue a task initialized with 'crab submit --dryrun'. diff --git a/src/python/CRABInterface/DataWorkflow.py b/src/python/CRABInterface/DataWorkflow.py index e6690aa9d8..6bb3f6fb21 100644 --- a/src/python/CRABInterface/DataWorkflow.py +++ b/src/python/CRABInterface/DataWorkflow.py @@ -6,13 +6,12 @@ ## WMCore dependecies from WMCore.REST.Error import ExecutionError -from WMCore.Database.CMSCouch import CouchServer ## CRAB dependencies from ServerUtilities import checkTaskLifetime from ServerUtilities import PUBLICATIONDB_STATUSES from ServerUtilities import NUM_DAYS_FOR_RESUBMITDRAIN -from ServerUtilities import isCouchDBURL, getEpochFromDBTime +from ServerUtilities import getEpochFromDBTime from CRABInterface.Utilities import CMSSitesCache, conn_handler, getDBinstance @@ -44,7 +43,6 @@ def __init__(self, config): "EventAwareLumiBased": "events_per_job"} self.Task = getDBinstance(config, 'TaskDB', 'Task') - self.JobGroup = getDBinstance(config, 'TaskDB', 'JobGroup') self.FileMetaData = getDBinstance(config, 'FileMetaDataDB', 'FileMetaData') self.transferDB = getDBinstance(config, 'FileTransfersDB', 'FileTransfers') @@ -63,17 +61,6 @@ def getLatests(self, username, timestamp): (this should probably have a default!) 
:arg int limit: limit on the workflow age :return: a list of workflows""" - # convert the workflow age in something eatable by a couch view - # in practice it's convenient that the timestamp is on a fixed format: latest 1 or 3 days, latest 1 week, latest 1 month - # and that it's a list (probably it can be converted into it): [year, month-num, day, hh, mm, ss] - # this will allow to query as it's described here: http://guide.couchdb.org/draft/views.html#many - - # example: - # return self.monitordb.conn.loadView('WMStats', 'byUser', - # options = { "startkey": user, - # "endkey": user, - # "limit": limit, }) - #raise NotImplementedError return self.api.query(None, None, self.Task.GetTasksFromUser_sql, username=username, timestamp=timestamp) def report(self, workflow): @@ -107,7 +94,7 @@ def submit(self, workflow, activity, jobtype, jobsw, jobarch, use_parent, second sitewhitelist, splitalgo, algoargs, cachefilename, cacheurl, addoutputfiles, username, userdn, savelogsflag, publication, publishname, publishname2, asyncdest, dbsurl, publishdbsurl, vorole, vogroup, tfileoutfiles, edmoutfiles, runs, lumis, totalunits, adduserfiles, oneEventMode=False, maxjobruntime=None, numcores=None, maxmemory=None, priority=None, lfn=None, - ignorelocality=None, saveoutput=None, faillimit=10, userfiles=None, userproxy=None, asourl=None, asodb=None, scriptexe=None, scriptargs=None, + ignorelocality=None, saveoutput=None, faillimit=10, userfiles=None, asourl=None, asodb=None, scriptexe=None, scriptargs=None, scheddname=None, extrajdl=None, collector=None, dryrun=False, publishgroupname=False, nonvaliddata=False, inputdata=None, primarydataset=None, debugfilename=None, submitipaddr=None, ignoreglobalblacklist=False): """Perform the workflow injection @@ -189,7 +176,6 @@ def submit(self, workflow, activity, jobtype, jobsw, jobarch, use_parent, second self.api.modify(self.Task.New_sql, task_name = [workflow], task_activity = [activity], - jobset_id = [None], task_status = ['NEW'], task_command = ['SUBMIT'], task_failure = [''], @@ -227,7 +213,6 @@ def submit(self, workflow, activity, jobtype, jobsw, jobarch, use_parent, second edm_outfiles = [dbSerializer(edmoutfiles)], job_type = [jobtype], arguments = [dbSerializer(arguments)], - resubmitted_jobs= [dbSerializer([])], save_logs = ['T' if savelogsflag else 'F'], user_infiles = [dbSerializer(adduserfiles)], maxjobruntime = [maxjobruntime], @@ -265,9 +250,9 @@ def publicationStatusWrapper(self, workflow, asourl, asodb, username, publicatio publicationInfo['status'] = {'disabled': []} return publicationInfo - @conn_handler(services=['servercert']) + @conn_handler(services=[]) def resubmit2(self, workflow, publication, jobids, siteblacklist, sitewhitelist, maxjobruntime, maxmemory, - numcores, priority, userproxy): + numcores, priority): """Request to reprocess what the workflow hasn't finished to reprocess. This needs to create a new workflow in the same campaign """ @@ -335,13 +320,10 @@ def resubmit2(self, workflow, publication, jobids, siteblacklist, sitewhitelist, msg = "Cannot resubmit publication." msg += " Error in publication status: %s" % (publicationInfo['error']) raise ExecutionError(msg) - if isCouchDBURL(asourl) and publicationInfo['status'].get('publication_failed', 0) == 0: - msg = "There are no failed publications to resubmit." - raise ExecutionError(msg) ## Here we can add a check on the publication status of the documents ## corresponding to the job ids in resubmitjobids and jobids. 
So far the ## publication resubmission will resubmit all the failed publications. - self.resubmitPublication(asourl, asodb, userproxy, workflow) + self.resubmitPublication(asourl, asodb, workflow) return [{'result': retmsg}] else: self.logger.info("Jobs to resubmit: %s", jobids) @@ -372,8 +354,7 @@ def resubmit2(self, workflow, publication, jobids, siteblacklist, sitewhitelist, 'maxjobruntime' : maxjobruntime, 'maxmemory' : maxmemory, 'numcores' : numcores, - 'priority' : priority, - 'resubmit_publication' : publication + 'priority' : priority } ## Change the 'tm_arguments' column of the Tasks DB for this task to contain the ## above parameters. @@ -389,157 +370,7 @@ def resubmit2(self, workflow, publication, jobids, siteblacklist, sitewhitelist, self.api.modify(self.Task.SetStatusTask_sql, status = newstate, command = newcommand, taskname = [workflow]) return [{'result': retmsg}] - def resubmit(self, workflow, publication, jobids, force, siteblacklist, sitewhitelist, maxjobruntime, maxmemory, numcores, priority, userdn, userproxy): - """Request to reprocess what the workflow hasn't finished to reprocess. - This needs to create a new workflow in the same campaign - - :arg str workflow: a valid workflow name - :arg str list siteblacklist: black list of sites, with CMS name; - :arg str list sitewhitelist: white list of sites, with CMS name; - :arg int whether to resubmit publications or jobs.""" - retmsg = "ok" - resubmitWhat = "publications" if publication else "jobs" - - self.logger.info("About to resubmit %s for workflow: %s. Getting status first.", resubmitWhat, workflow) - - ## Get the status of the task/jobs. - statusRes = self.status(workflow, userdn, userproxy)[0] - - ## Check lifetime of the task and raise ExecutionError if appropriate - self.logger.info("Checking if resubmission is possible: we don't allow resubmission %s days before task expiration date", NUM_DAYS_FOR_RESUBMITDRAIN) - retmsg = checkTaskLifetime(statusRes['submissionTime']) - if retmsg != "ok": - return [{'result': retmsg}] - - ## Ignore the following options if this is a publication resubmission or if the - ## task was never submitted. - if publication or statusRes['status'] == 'SUBMITFAILED': - jobids, force = None, False - siteblacklist, sitewhitelist, maxjobruntime, maxmemory, numcores, priority = None, None, None, None, None, None - - ## We allow resubmission only if the task status is one of these: - allowedTaskStates = ['SUBMITTED', 'KILLED', 'KILLFAILED', 'RESUBMITFAILED', 'FAILED'] - ## We allow resubmission of successfully finished jobs if the user explicitly - ## gave the job ids and the force option (i.e. the user knows what he/she is - ## doing). In that case we have to allow the task to be in COMPLETED status. - ## The same is true for publication resubmission. - if (jobids and force) or publication: - allowedTaskStates += ['COMPLETED'] - ## Allow resubmission of tasks in SUBMITFAILED status if this is not a - ## publication resubmission. - if not publication: - allowedTaskStates += ['SUBMITFAILED'] #NB submitfailed goes to NEW, not RESUBMIT - ## If the task status is not an allowed one, fail the resubmission. - if statusRes['status'] not in allowedTaskStates: - if statusRes['status'] == 'COMPLETED': - msg = "Task status is COMPLETED." - msg += " To resubmit jobs from a task in status COMPLETED, specify the job ids and use the force option." - msg += " To resubmit publications use the publication option." - else: - msg = "You cannot resubmit %s if the task is in status %s." 
% (resubmitWhat, statusRes['status']) - raise ExecutionError(msg) - - if statusRes['status'] != 'SUBMITFAILED': - ## This is the list of job ids that we allow to be resubmitted. - ## Note: This list will be empty if statusRes['jobList'] is empty to begin with. - resubmitjobids = [] - for jobstatus, jobid in statusRes['jobList']: - if (not publication and jobstatus in self.failedList) or \ - (((jobids and force) or publication) and jobstatus in self.successList): - resubmitjobids.append(jobid) - if statusRes['jobList'] and not resubmitjobids: - msg = "There are no %s to resubmit." % (resubmitWhat) - if publication: - msg += " Publications can only be resubmitted for jobs in status %s." % (self.successList) - else: - msg += " Only jobs in status %s can be resubmitted." % (self.failedList) - msg += " Jobs in status %s can also be resubmitted," % (self.successList) - msg += " but only if the jobid is specified and the force option is set." - raise ExecutionError(msg) - ## Checks for publication resubmission. - if publication: - if 'publication' not in statusRes or not statusRes['publication']: - msg = "Cannot resubmit publication." - msg += " Unable to retrieve the publication status." - raise ExecutionError(msg) - if 'disabled' in statusRes['publication']: - msg = "Cannot resubmit publication." - msg += " Publication was disabled in the CRAB configuration." - raise ExecutionError(msg) - if 'error' in statusRes['publication']: - msg = "Cannot resubmit publication." - msg += " Error in publication status: %s" % (statusRes['publication']['error']) - raise ExecutionError(msg) - if isCouchDBURL(statusRes['ASOURL']) and statusRes['publication'].get('publication_failed', 0) == 0: - msg = "There are no failed publications to resubmit." - raise ExecutionError(msg) - ## Here we can add a check on the publication status of the documents - ## corresponding to the job ids in resubmitjobids and jobids. So far the - ## publication resubmission will resubmit all the failed publications. - - ## If the user wants to resubmit a specific set of job ids ... - if jobids: - ## ... make the intersection between the "allowed" and "wanted" job ids. - resubmitjobids = list(set(resubmitjobids) & set(jobids)) - ## Check if all the "wanted" job ids can be resubmitted. If not, fail the resubmission. - if len(resubmitjobids) != len(jobids): - requestedResub = list(set(jobids) - set(resubmitjobids)) - msg = "CRAB server refused to resubmit the following jobs: %s." % (str(requestedResub)) - msg += " Only jobs in status %s can be resubmitted." % (self.failedList) - msg += " Jobs in status %s can also be resubmitted," % (self.successList) - msg += " but only if the jobid is specified and the force option is set." - raise ExecutionError(msg) #return [{'result': msg}] - if publication: - self.logger.info("Publications to resubmit if failed: %s", resubmitjobids) - else: - self.logger.info("Jobs to resubmit: %s", resubmitjobids) - - ## If these parameters are not set, give them the same values they had in the - ## original task submission. 
- if (siteblacklist is None) or (sitewhitelist is None) or (maxjobruntime is None) or (maxmemory is None) or (numcores is None) or (priority is None): - ## origValues = [orig_siteblacklist, orig_sitewhitelist, orig_maxjobruntime, orig_maxmemory, orig_numcores, orig_priority] - origValues = next(self.api.query(None, None, self.Task.GetResubmitParams_sql, taskname = workflow)) - if siteblacklist is None: - siteblacklist = literal_eval(origValues[0]) - if sitewhitelist is None: - sitewhitelist = literal_eval(origValues[1]) - if maxjobruntime is None: - maxjobruntime = origValues[2] - if maxmemory is None: - maxmemory = origValues[3] - if numcores is None: - numcores = origValues[4] - if priority is None: - priority = origValues[5] - ## These are the parameters that we want to writte down in the 'tm_arguments' - ## column of the Tasks DB each time a resubmission is done. - ## DagmanResubmitter will read these parameters and write them into the task ad. - arguments = {'resubmit_jobids' : resubmitjobids, - 'site_blacklist' : siteblacklist, - 'site_whitelist' : sitewhitelist, - 'maxjobruntime' : maxjobruntime, - 'maxmemory' : maxmemory, - 'numcores' : numcores, - 'priority' : priority, - 'resubmit_publication' : publication - } - ## Change the 'tm_arguments' column of the Tasks DB for this task to contain the - ## above parameters. - self.api.modify(self.Task.SetArgumentsTask_sql, taskname = [workflow], arguments = [str(arguments)]) - - #TODO states are changed - ## Change the status of the task in the Tasks DB to RESUBMIT (or NEW). - if statusRes['status'] == 'SUBMITFAILED': - newstate = ["NEW"] - newcommand = ["SUBMIT"] - else: - newstate = ["NEW"] - newcommand = ["RESUBMIT"] - self.api.modify(self.Task.SetStatusTask_sql, status = newstate, command = newcommand, taskname = [workflow]) - return [{'result': retmsg}] - - - def status(self, workflow, userdn, userproxy=None): + def status(self, workflow, userdn): """Retrieve the status of the workflow. :arg str workflow: a valid workflow name @@ -548,7 +379,7 @@ def status(self, workflow, userdn, userproxy=None): @conn_handler(services=['centralconfig']) - def kill(self, workflow, force, jobids, killwarning, userdn, userproxy=None): + def kill(self, workflow, killwarning=''): """Request to Abort a workflow. :arg str workflow: a workflow name""" @@ -569,7 +400,6 @@ def kill(self, workflow, force, jobids, killwarning, userdn, userproxy=None): args = {'ASOURL' : getattr(row, 'asourl', '')} if row.task_status in ['SUBMITTED', 'KILLFAILED', 'RESUBMITFAILED', 'FAILED', 'KILLED', 'TAPERECALL']: - args.update({"killList": jobids}) #Set arguments first so in case of failure we don't do any "damage" self.api.modify(self.Task.SetArgumentsTask_sql, taskname = [workflow], arguments = [str(args)]) self.api.modify(self.Task.SetStatusWarningTask_sql, status = ["NEW"], command = ["KILL"], taskname = [workflow], warnings = [str(warnings)]) @@ -597,46 +427,9 @@ def proceed(self, workflow): return [{'result': 'ok'}] - def resubmitPublication(self, asourl, asodb, proxy, taskname): - if isCouchDBURL(asourl): - return self.resubmitCouchPublication(asourl, asodb, proxy, taskname) - else: - return self.resubmitOraclePublication(taskname) + def resubmitPublication(self, asourl, asodb, taskname): - def resubmitCouchPublication(self, asourl, asodb, proxy, taskname): - """ - Resubmit failed publications by resetting the publication - status in the CouchDB documents. 
- """ - server = CouchServer(dburl=asourl, ckey=proxy, cert=proxy) - try: - database = server.connectDatabase(asodb) - except Exception as ex: - msg = "Error while trying to connect to CouchDB: %s" % (str(ex)) - raise Exception(msg) - try: - failedPublications = database.loadView('DBSPublisher', 'PublicationFailedByWorkflow', {'reduce': False, 'startkey': [taskname], 'endkey': [taskname, {}]})['rows'] - except Exception as ex: - msg = "Error while trying to load view 'DBSPublisher.PublicationFailedByWorkflow' from CouchDB: %s" % (str(ex)) - raise Exception(msg) - msg = "There are %d failed publications to resubmit: %s" % (len(failedPublications), failedPublications) - self.logger.info(msg) - for doc in failedPublications: - docid = doc['id'] - if doc['key'][0] != taskname: # this should never happen... - msg = "Skipping document %s as it seems to correspond to another task: %s" % (docid, doc['key'][0]) - self.logger.warning(msg) - continue - data = {'last_update': time.time(), - 'retry': str(datetime.datetime.now()), - 'publication_state': 'not_published', - } - try: - database.updateDocument(docid, 'DBSPublisher', 'updateFile', data) - self.logger.info("updating document %s ", docid) - except Exception as ex: - self.logger.error("Error updating document %s in CouchDB: %s", docid, str(ex)) - return + return self.resubmitOraclePublication(taskname) def resubmitOraclePublication(self, taskname): binds = {} diff --git a/src/python/CRABInterface/HTCondorDataWorkflow.py b/src/python/CRABInterface/HTCondorDataWorkflow.py index 3870045f2d..d02d223295 100644 --- a/src/python/CRABInterface/HTCondorDataWorkflow.py +++ b/src/python/CRABInterface/HTCondorDataWorkflow.py @@ -5,7 +5,7 @@ import time import copy import tarfile -import StringIO +import io import tempfile import calendar from ast import literal_eval @@ -13,7 +13,6 @@ import pycurl import classad -import WMCore.Database.CMSCouch as CMSCouch from WMCore.WMSpec.WMTask import buildLumiMask from WMCore.DataStructs.LumiList import LumiList from CRABInterface.DataWorkflow import DataWorkflow @@ -25,7 +24,7 @@ throttle = UserThrottle(limit=3) from CRABInterface.Utilities import conn_handler -from ServerUtilities import FEEDBACKMAIL, PUBLICATIONDB_STATES, isCouchDBURL, getEpochFromDBTime +from ServerUtilities import FEEDBACKMAIL, PUBLICATIONDB_STATES, getEpochFromDBTime from Databases.FileMetaDataDB.Oracle.FileMetaData.FileMetaData import GetFromTaskAndType import HTCondorUtils @@ -53,7 +52,7 @@ def taskads(self, workflow): backend_urls['htcondorPool'] = row.collector # need to make sure to pass a simply quoted string, not a byte-array to HTCondor - taskName = str(workflow.decode("utf-8")) if isinstance(workflow, bytes) else workflow + taskName = workflow.decode("utf-8") if isinstance(workflow, bytes) else workflow self.logger.debug("Running condor query for task %s." % taskName) try: locator = HTCondorLocator.HTCondorLocator(backend_urls) @@ -171,8 +170,8 @@ def report2(self, workflow, userdn): yield res @throttle.make_throttled() - @conn_handler(services=['centralconfig', 'servercert']) - def status(self, workflow, userdn, userproxy=None): + @conn_handler(services=['centralconfig']) + def status(self, workflow, userdn): """Retrieve the status of the workflow. :arg str workflow: a valid workflow name @@ -226,7 +225,7 @@ def status(self, workflow, userdn, userproxy=None): ## Apply taskWarning flag to output. 
taskWarnings = literal_eval(row.task_warnings if isinstance(row.task_warnings, str) else row.task_warnings.read()) - result["taskWarningMsg"] = taskWarnings + result["taskWarningMsg"] = taskWarnings.decode("utf8") if isinstance(taskWarnings, bytes) else taskWarnings ## Helper function to add the task status and the failure message (both as taken ## from the Task DB) to the result dictionary. @@ -317,7 +316,7 @@ def addStatusAndFailure(result, status, failure = None): else: self.logger.info("Node state file is too old or does not have an update time. Stale info is shown") except Exception as ee: - addStatusAndFailure(result, status = 'UNKNOWN', failure = ee.info) + addStatusAndFailure(result, status = 'UNKNOWN', failure = str(ee)) return [result] if 'DagStatus' in taskStatus: @@ -422,7 +421,7 @@ def taskWebStatus(self, task_ad, statusResult): curl = self.prepareCurl() fp = tempfile.TemporaryFile() curl.setopt(pycurl.WRITEFUNCTION, fp.write) - hbuf = StringIO.StringIO() + hbuf = io.BytesIO() curl.setopt(pycurl.HEADERFUNCTION, hbuf.write) try: self.logger.debug("Retrieving task status from schedd via http") @@ -478,71 +477,16 @@ def taskWebStatus(self, task_ad, statusResult): fp.close() hbuf.close() - @conn_handler(services=['servercert']) + @conn_handler(services=[]) def publicationStatus(self, workflow, asourl, asodb, user): """Here is what basically the function return, a dict called publicationInfo in the subcalls: publicationInfo['status']: something like {'publishing': 0, 'publication_failed': 0, 'not_published': 0, 'published': 5}. Later on goes into dictresult['publication'] before being returned to the client - publicationInfo['status']['error']: String containing the error message if not able to contact couch or oracle + publicationInfo['status']['error']: String containing the error message if not able to contact oracle Later on goes into dictresult['publication']['error'] publicationInfo['failure_reasons']: errors of single files (not yet implemented for oracle..) """ - if isCouchDBURL(asourl): - return self.publicationStatusCouch(workflow, asourl, asodb) - else: - return self.publicationStatusOracle(workflow, user) - - def publicationStatusCouch(self, workflow, asourl, asodb): - publicationInfo = {'status': {}, 'failure_reasons': {}} - if not asourl: - raise ExecutionError("This CRAB server is not configured to publish; no publication status is available.") - server = CMSCouch.CouchServer(dburl=asourl, ckey=self.serverKey, cert=self.serverCert) - try: - db = server.connectDatabase(asodb) - except Exception: - msg = "Error while connecting to asynctransfer CouchDB for workflow %s " % workflow - msg += "\n asourl=%s asodb=%s" % (asourl, asodb) - self.logger.exception(msg) - publicationInfo['status'] = {'error': msg} - return publicationInfo - # Get the publication status for the given workflow. The next query to the - # CouchDB view returns a list of 1 dictionary (row) with: - # 'key' : workflow, - # 'value' : a dictionary with possible publication statuses as keys and the - # counts as values. 
- query = {'reduce': True, 'key': workflow, 'stale': 'update_after'} - try: - publicationList = db.loadView('AsyncTransfer', 'PublicationStateByWorkflow', query)['rows'] - except Exception: - msg = "Error while querying CouchDB for publication status information for workflow %s " % workflow - self.logger.exception(msg) - publicationInfo['status'] = {'error': msg} - return publicationInfo - if publicationList: - publicationStatusDict = publicationList[0]['value'] - publicationInfo['status'] = publicationStatusDict - # Get the publication failure reasons for the given workflow. The next query to - # the CouchDB view returns a list of N_different_publication_failures - # dictionaries (rows) with: - # 'key' : [workflow, publication failure], - # 'value' : count. - numFailedPublications = publicationStatusDict['publication_failed'] - if numFailedPublications: - query = {'group': True, 'startkey': [workflow], 'endkey': [workflow, {}], 'stale': 'update_after'} - try: - publicationFailedList = db.loadView('DBSPublisher', 'PublicationFailedByWorkflow', query)['rows'] - except Exception: - msg = "Error while querying CouchDB for publication failures information for workflow %s " % workflow - self.logger.exception(msg) - publicationInfo['failure_reasons']['error'] = msg - return publicationInfo - publicationInfo['failure_reasons']['result'] = [] - for publicationFailed in publicationFailedList: - failureReason = publicationFailed['key'][1] - numFailedFiles = publicationFailed['value'] - publicationInfo['failure_reasons']['result'].append((failureReason, numFailedFiles)) - - return publicationInfo + return self.publicationStatusOracle(workflow, user) def publicationStatusOracle(self, workflow, user): publicationInfo = {} @@ -586,12 +530,8 @@ def parseASOState(self, fp, nodes, statusResult): """ transfers = {} data = json.load(fp) - for docid, result in data['results'].iteritems(): - #Oracle has an improved structure in aso_status - if isCouchDBURL(self.asoDBURL): - result = result['value'] - else: - result = result[0] + for docid, result in data['results'].items(): + result = result[0] jobid = str(result['jobid']) if jobid not in nodes: msg = ("It seems one or more jobs are missing from the node_state file." @@ -618,7 +558,7 @@ def last(joberrors): fp.seek(0) data = json.load(fp) #iterate over the jobs and set the error dict for those which are failed - for jobid, statedict in nodes.iteritems(): + for jobid, statedict in nodes.items(): if 'State' in statedict and statedict['State'] == 'failed' and jobid in data: statedict['Error'] = last(data[jobid]) #data[jobid] contains all retries. 
take the last one @@ -631,7 +571,7 @@ def parseNodeState(self, fp, nodes): if first_char == "[": return self.parseNodeStateV2(fp, nodes) for line in fp.readlines(): - m = self.job_re.match(line) + m = self.job_re.match(line.decode("utf8") if isinstance(line, bytes) else line) if not m: continue nodeid, status, msg = m.groups() diff --git a/src/python/CRABInterface/RESTBaseAPI.py b/src/python/CRABInterface/RESTBaseAPI.py index 2d7b45134b..722dc0a5f6 100644 --- a/src/python/CRABInterface/RESTBaseAPI.py +++ b/src/python/CRABInterface/RESTBaseAPI.py @@ -1,7 +1,7 @@ from __future__ import absolute_import import logging import cherrypy -from commands import getstatusoutput +from subprocess import getstatusoutput from time import mktime, gmtime # WMCore dependecies here @@ -37,10 +37,6 @@ def __init__(self, app, config, mount): self.formats = [ ('application/json', JSONFormat()) ] - status, serverdn = getstatusoutput('openssl x509 -noout -subject -in %s | cut -f2- -d\ ' % config.serverhostcert) - if status is not 0: - raise ExecutionError("Internal issue when retrieving crabserver service DN.") - extconfig = ConfigCache(centralconfig=getCentralConfig(extconfigurl=config.extconfigurl, mode=config.mode), cachetime=mktime(gmtime())) @@ -49,12 +45,12 @@ def __init__(self, app, config, mount): DataWorkflow.globalinit(dbapi=self, credpath=config.credpath, centralcfg=extconfig, config=config) DataFileMetadata.globalinit(dbapi=self, config=config) RESTTask.globalinit(centralcfg=extconfig) - globalinit(config.serverhostkey, config.serverhostcert, serverdn, config.credpath) + globalinit(config.credpath) ## TODO need a check to verify the format depending on the resource ## the RESTFileMetadata has the specifc requirement of getting xml reports self._add( {'workflow': RESTUserWorkflow(app, self, config, mount, extconfig), - 'info': RESTServerInfo(app, self, config, mount, serverdn, extconfig), + 'info': RESTServerInfo(app, self, config, mount, extconfig), 'filemetadata': RESTFileMetadata(app, self, config, mount), 'workflowdb': RESTWorkerWorkflow(app, self, config, mount), 'task': RESTTask(app, self, config, mount), diff --git a/src/python/CRABInterface/RESTCache.py b/src/python/CRABInterface/RESTCache.py index 4e554604b6..007de10185 100644 --- a/src/python/CRABInterface/RESTCache.py +++ b/src/python/CRABInterface/RESTCache.py @@ -17,16 +17,6 @@ from ServerUtilities import getUsernameFromTaskname -def fromNewBytesToString(inString): - # since taskname and objecttype come from WMCore validate_str they are newbytes objects - # of type which breaks - # python2 S3 transfer code with a keyError - # here's an awful hack to turn those newbytes into an old-fashioned py2 string - outString = '' # a string - for i in inString: # subscripting a newbytes string gets tha ASCII value of the char ! 
- outString += chr(i) # from ASCII numerical value to python2 string - return outString - class RESTCache(RESTEntity): """ REST entity for accessing CRAB Cache on S3 @@ -132,7 +122,7 @@ def get(self, subresource, objecttype, taskname, username, tarballname): # pyli ownerName = getUsernameFromTaskname(taskname) # task related files go in bucket/username/taskname/ objectPath = ownerName + '/' + taskname + '/' + objecttype - s3_objectKey = fromNewBytesToString(objectPath) + s3_objectKey = objectPath if subresource == 'upload': # returns a dictionary with the information to upload a file with a POST @@ -213,13 +203,12 @@ def get(self, subresource, objecttype, taskname, username, tarballname): # pyli # fileNames = [] paginator = self.s3_client.get_paginator('list_objects_v2') - user = fromNewBytesToString(username) operation_parameters = {'Bucket': self.s3_bucket, - 'Prefix': user} + 'Prefix': username} page_iterator = paginator.paginate(**operation_parameters) for page in page_iterator: # strip the initial "username/" from the S3 key name - namesInPage = [item['Key'].replace(user+'/', '', 1) for item in page['Contents']] + namesInPage = [item['Key'].replace(username+'/', '', 1) for item in page['Contents']] fileNames += namesInPage if objecttype: filteredFileNames = [f for f in fileNames if objecttype in f] @@ -231,9 +220,8 @@ def get(self, subresource, objecttype, taskname, username, tarballname): # pyli if not username: raise MissingParameter('username is missing') paginator = self.s3_client.get_paginator('list_objects_v2') - user = fromNewBytesToString(username) operation_parameters = {'Bucket': self.s3_bucket, - 'Prefix': user} + 'Prefix': username} page_iterator = paginator.paginate(**operation_parameters) # S3 records object size in bytes, see: # https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.list_objects_v2 diff --git a/src/python/CRABInterface/RESTExtensions.py b/src/python/CRABInterface/RESTExtensions.py index e3d9218999..9b65615285 100644 --- a/src/python/CRABInterface/RESTExtensions.py +++ b/src/python/CRABInterface/RESTExtensions.py @@ -3,8 +3,6 @@ These are extensions which are not directly contained in WMCore.REST module and it shouldn't have any other dependencies a part of that and cherrypy. -Currently authz_owner_match uses a WMCore.Database.CMSCouch method -but in next versions it should be dropped, as from the CRABInterface. """ from WMCore.REST.Error import MissingObject diff --git a/src/python/CRABInterface/RESTFileMetadata.py b/src/python/CRABInterface/RESTFileMetadata.py index 5a2e97db46..3673ffbaf2 100644 --- a/src/python/CRABInterface/RESTFileMetadata.py +++ b/src/python/CRABInterface/RESTFileMetadata.py @@ -33,7 +33,6 @@ def validate(self, apiobj, method, api, param, safe): safe.kwargs['inparentlfns'] = str(safe.kwargs['inparentlfns']) validate_str("globalTag", param, safe, RX_GLOBALTAG, optional=True) validate_str("jobid", param, safe, RX_JOBID, optional=True) - safe.kwargs["pandajobid"] = 0 validate_num("outsize", param, safe, optional=False) validate_str("publishdataname", param, safe, RX_PUBLISH, optional=False) validate_str("appver", param, safe, RX_CMSSW, optional=False) @@ -79,11 +78,11 @@ def validate(self, apiobj, method, api, param, safe): ## * The name of the arguments has to be the same as used in the http request, and the same as used in validate(). 
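With WMCore validation returning plain str under python3, the newbytes workaround removed above is no longer needed and the username (or objectPath) can be used directly as the S3 key prefix. A sketch of the paginated listing, assuming an illustrative boto3 client, endpoint and bucket name rather than the real RESTCache configuration:

import boto3

s3_client = boto3.client("s3", endpoint_url="https://s3.example.org")  # illustrative
s3_bucket = "crabcache"                                                # illustrative
username = "jdoe"

fileNames = []
paginator = s3_client.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=s3_bucket, Prefix=username):
    # strip the leading "username/" from each key, as the GET handler does
    fileNames += [item["Key"].replace(username + "/", "", 1)
                  for item in page.get("Contents", [])]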
@restcall - def put(self, taskname, outfilelumis, inparentlfns, globalTag, outfileruns, jobid, pandajobid, outsize, publishdataname, appver, outtype, checksummd5,\ + def put(self, taskname, outfilelumis, inparentlfns, globalTag, outfileruns, jobid, outsize, publishdataname, appver, outtype, checksummd5,\ checksumcksum, checksumadler32, outlocation, outtmplocation, outdatasetname, acquisitionera, outlfn, events, filestate, directstageout, outtmplfn): """Insert a new job metadata information""" return self.jobmetadata.inject(taskname=taskname, outfilelumis=outfilelumis, inparentlfns=inparentlfns, globalTag=globalTag, outfileruns=outfileruns,\ - jobid=jobid, pandajobid=pandajobid, outsize=outsize, publishdataname=publishdataname, appver=appver, outtype=outtype, checksummd5=checksummd5,\ + jobid=jobid, outsize=outsize, publishdataname=publishdataname, appver=appver, outtype=outtype, checksummd5=checksummd5,\ checksumcksum=checksumcksum, checksumadler32=checksumadler32, outlocation=outlocation, outtmplocation=outtmplocation,\ outdatasetname=outdatasetname, acquisitionera=acquisitionera, outlfn=outlfn, outtmplfn=outtmplfn, events=events, filestate=filestate, \ directstageout=directstageout) diff --git a/src/python/CRABInterface/RESTFileUserTransfers.py b/src/python/CRABInterface/RESTFileUserTransfers.py index 86f66b23b8..fd8ec2275c 100644 --- a/src/python/CRABInterface/RESTFileUserTransfers.py +++ b/src/python/CRABInterface/RESTFileUserTransfers.py @@ -32,9 +32,6 @@ def validate(self, apiobj, method, api, param, safe): #pylint: disable=unused-ar # documents authz_login_valid() if method in ['PUT']: - # Do we want to validate everything? - # Now what is put in CouchDB is not validated - # And all put are prepared by us in JOB wrappers, so it should already be correct. # P.S. 
Validation is done in function and it double check if all required keys are available print (param, safe) validate_str("id", param, safe, RX_ANYTHING, optional=False) diff --git a/src/python/CRABInterface/RESTServerInfo.py b/src/python/CRABInterface/RESTServerInfo.py index acade7c95a..2cf8648963 100644 --- a/src/python/CRABInterface/RESTServerInfo.py +++ b/src/python/CRABInterface/RESTServerInfo.py @@ -15,13 +15,11 @@ class RESTServerInfo(RESTEntity): """REST entity for workflows and relative subresources""" - def __init__(self, app, api, config, mount, serverdn, centralcfg): + def __init__(self, app, api, config, mount, centralcfg): RESTEntity.__init__(self, app, api, config, mount) self.centralcfg = centralcfg - self.serverdn = serverdn self.logger = logging.getLogger("CRABLogger:RESTServerInfo") #used by the client to get the url where to update the cache (cacheSSL) - #and by the taskworker Panda plugin to get panda urls def validate(self, apiobj, method, api, param, safe ): """Validating all the input parameter as enforced by the WMCore.REST module""" diff --git a/src/python/CRABInterface/RESTTask.py b/src/python/CRABInterface/RESTTask.py index fbd1272077..b32ea2c98a 100644 --- a/src/python/CRABInterface/RESTTask.py +++ b/src/python/CRABInterface/RESTTask.py @@ -1,4 +1,5 @@ # WMCore dependecies here +from Utils.Utilities import decodeBytesToUnicode from WMCore.REST.Server import RESTEntity, restcall from WMCore.REST.Validation import validate_str, validate_strlist from WMCore.REST.Error import InvalidParameter, ExecutionError, NotAcceptable @@ -25,7 +26,6 @@ def globalinit(centralcfg=None): def __init__(self, app, api, config, mount): RESTEntity.__init__(self, app, api, config, mount) self.Task = getDBinstance(config, 'TaskDB', 'Task') - self.JobGroup = getDBinstance(config, 'TaskDB', 'JobGroup') self.logger = logging.getLogger("CRABLogger.RESTTask") def validate(self, apiobj, method, api, param, safe): @@ -219,7 +219,7 @@ def addwarning(self, **kwargs): workflow = kwargs['workflow'] authz_owner_match(self.api, [workflow], self.Task) #check that I am modifying my own workflow try: - warning = b64decode(kwargs['warning']) + warning = decodeBytesToUnicode(b64decode(kwargs['warning'])) except TypeError: raise InvalidParameter("Failure message is not in the accepted format") diff --git a/src/python/CRABInterface/RESTUserWorkflow.py b/src/python/CRABInterface/RESTUserWorkflow.py index a2a1bc703a..5a51018734 100644 --- a/src/python/CRABInterface/RESTUserWorkflow.py +++ b/src/python/CRABInterface/RESTUserWorkflow.py @@ -8,10 +8,10 @@ import cherrypy from base64 import b64decode # WMCore dependecies here +from Utils.Utilities import decodeBytesToUnicode from WMCore.REST.Server import RESTEntity, restcall from WMCore.REST.Error import ExecutionError, InvalidParameter -from WMCore.REST.Validation import validate_str, validate_strlist, validate_ustr, validate_ustrlist,\ - validate_num, validate_real +from WMCore.REST.Validation import validate_str, validate_strlist, validate_num, validate_real from WMCore.Services.TagCollector.TagCollector import TagCollector from WMCore.Lexicon import userprocdataset, userProcDSParts, primdataset @@ -349,9 +349,9 @@ def validate(self, apiobj, method, api, param, safe): #pylint: disable=unused-ar validate_num("useparent", param, safe, optional=True) validate_str("secondarydata", param, safe, RX_DATASET, optional=True) # site black/white list needs to be cast to unicode for later use in self._expandSites - validate_ustrlist("siteblacklist", param, safe, 
RX_CMSSITE) + validate_strlist("siteblacklist", param, safe, RX_CMSSITE) safe.kwargs['siteblacklist'] = self._expandSites(safe.kwargs['siteblacklist']) - validate_ustrlist("sitewhitelist", param, safe, RX_CMSSITE) + validate_strlist("sitewhitelist", param, safe, RX_CMSSITE) safe.kwargs['sitewhitelist'] = self._expandSites(safe.kwargs['sitewhitelist']) validate_str("splitalgo", param, safe, RX_SPLIT, optional=False) validate_num("algoargs", param, safe, optional=False) @@ -399,7 +399,7 @@ def validate(self, apiobj, method, api, param, safe): #pylint: disable=unused-ar param.kwargs["publishname2"] = safe.kwargs["publishname"] ## 'publishname' was already validated above in _checkPublishDataName(). - ## Calling validate_ustr with a fake regexp to move the param to the + ## Calling validate_str with a fake regexp to move the param to the ## list of validated inputs validate_str("publishname2", param, safe, RX_ANYTHING, optional=True) @@ -452,7 +452,7 @@ def validate(self, apiobj, method, api, param, safe): #pylint: disable=unused-ar validate_num("nonvaliddata", param, safe, optional=True) #if one and only one between outputDatasetTag and publishDbsUrl is set raise an error (we need both or none of them) # asyncdest needs to be cast to unicode for later use in self._checkASODestination - validate_ustr("asyncdest", param, safe, RX_CMSSITE, optional=False) + validate_str("asyncdest", param, safe, RX_CMSSITE, optional=False) self._checkASODestination(safe.kwargs['asyncdest']) # We no longer use this attribute, but keep it around for older client compatibility validate_num("blacklistT1", param, safe, optional=True) @@ -491,14 +491,14 @@ def validate(self, apiobj, method, api, param, safe): #pylint: disable=unused-ar ## or whitelist, set it to None and DataWorkflow will use the corresponding ## list defined in the initial task submission. If the site black- or whitelist ## is equal to the string 'empty', set it to an empty list and don't call - ## validate_ustrlist as it would fail. + ## validate_strlist as it would fail. 
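b64decode() returns bytes under python3, which is why the addwarning and killwarning hunks around here wrap it in decodeBytesToUnicode before the message reaches the database. A minimal sketch of the round trip, using a local stand-in for the WMCore helper so the example stays self-contained:

from base64 import b64encode, b64decode

def decode_bytes_to_unicode(value, encoding="utf-8"):
    # local stand-in for Utils.Utilities.decodeBytesToUnicode, for illustration only
    return value.decode(encoding) if isinstance(value, bytes) else value

# client side: the message travels base64-encoded in the request
killwarning = b64encode("killed by operator".encode("utf-8"))

# server side: b64decode gives bytes under py3, decode before storing or logging
message = decode_bytes_to_unicode(b64decode(killwarning))
assert isinstance(message, str)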
if 'siteblacklist' not in param.kwargs: safe.kwargs['siteblacklist'] = None elif param.kwargs['siteblacklist'] == 'empty': safe.kwargs['siteblacklist'] = [] del param.kwargs['siteblacklist'] else: - validate_ustrlist("siteblacklist", param, safe, RX_CMSSITE) + validate_strlist("siteblacklist", param, safe, RX_CMSSITE) safe.kwargs['siteblacklist'] = self._expandSites(safe.kwargs['siteblacklist']) if 'sitewhitelist' not in param.kwargs: safe.kwargs['sitewhitelist'] = None @@ -506,7 +506,7 @@ def validate(self, apiobj, method, api, param, safe): #pylint: disable=unused-ar safe.kwargs['sitewhitelist'] = [] del param.kwargs['sitewhitelist'] else: - validate_ustrlist("sitewhitelist", param, safe, RX_CMSSITE) + validate_strlist("sitewhitelist", param, safe, RX_CMSSITE) safe.kwargs['sitewhitelist'] = self._expandSites(safe.kwargs['sitewhitelist']) validate_num("maxjobruntime", param, safe, optional=True) validate_num("maxmemory", param, safe, optional=True) @@ -545,12 +545,11 @@ def validate(self, apiobj, method, api, param, safe): #pylint: disable=unused-ar elif method in ['DELETE']: validate_str("workflow", param, safe, RX_TASKNAME, optional=False) - validate_num("force", param, safe, optional=True) validate_str("killwarning", param, safe, RX_TEXT_FAIL, optional=True) #decode killwarning message if present if safe.kwargs['killwarning']: try: - safe.kwargs['killwarning'] = b64decode(safe.kwargs['killwarning']) + safe.kwargs['killwarning'] = decodeBytesToUnicode(b64decode(safe.kwargs['killwarning'])) except TypeError: raise InvalidParameter("Failure message is not in the accepted format") @@ -721,7 +720,7 @@ def get(self, workflow, subresource, username, limit, shortformat, exitcode, job return result @restcall - def delete(self, workflow, force, killwarning): + def delete(self, workflow, killwarning=''): """Aborts a workflow. The user needs to be a CMS owner of the workflow. 
:arg str list workflow: list of unique name identifiers of workflows; @@ -730,4 +729,4 @@ def delete(self, workflow, force, killwarning): # strict check on authz: only the workflow owner can modify it authz_owner_match(self.api, [workflow], self.Task) - return self.userworkflowmgr.kill(workflow, force, killwarning, userdn=cherrypy.request.headers['Cms-Authn-Dn']) + return self.userworkflowmgr.kill(workflow, killwarning) diff --git a/src/python/CRABInterface/RESTWorkerWorkflow.py b/src/python/CRABInterface/RESTWorkerWorkflow.py index 7fa54957f7..4f95e6dcff 100644 --- a/src/python/CRABInterface/RESTWorkerWorkflow.py +++ b/src/python/CRABInterface/RESTWorkerWorkflow.py @@ -1,6 +1,8 @@ """ Interface used by the TaskWorker to ucquire tasks and change their state """ # WMCore dependecies here +from builtins import str +from Utils.Utilities import decodeBytesToUnicode from WMCore.REST.Server import RESTEntity, restcall from WMCore.REST.Validation import validate_str, validate_strlist, validate_num from WMCore.REST.Error import InvalidParameter @@ -9,7 +11,7 @@ from CRABInterface.Utilities import getDBinstance from CRABInterface.RESTExtensions import authz_login_valid from CRABInterface.Regexps import (RX_TASKNAME, RX_WORKER_NAME, RX_STATUS, RX_TEXT_FAIL, RX_SUBPOSTWORKER, - RX_SUBGETWORKER, RX_JOBID) + RX_JOBID) # external dependecies here from ast import literal_eval @@ -22,7 +24,6 @@ class RESTWorkerWorkflow(RESTEntity): def __init__(self, app, api, config, mount): RESTEntity.__init__(self, app, api, config, mount) self.Task = getDBinstance(config, 'TaskDB', 'Task') - self.JobGroup = getDBinstance(config, 'TaskDB', 'JobGroup') @staticmethod def validate(apiobj, method, api, param, safe): #pylint: disable=unused-argument @@ -44,7 +45,7 @@ def validate(apiobj, method, api, param, safe): #pylint: disable=unused-argument # possible combinations to check # 1) taskname + status # 2) taskname + status + failure - # 3) taskname + status + resubmitted + jobsetid + # 3) taskname + status + resubmitted # 4) taskname + status == (1) # 5) status + limit + getstatus + workername # 6) taskname + runs + lumis @@ -52,7 +53,6 @@ def validate(apiobj, method, api, param, safe): #pylint: disable=unused-argument validate_str("workername", param, safe, RX_WORKER_NAME, optional=True) validate_str("getstatus", param, safe, RX_STATUS, optional=True) validate_num("limit", param, safe, optional=True) - validate_str("subresource", param, safe, RX_SUBGETWORKER, optional=True) # possible combinations to check # 1) workername + getstatus + limit # 2) subresource + subjobdef + subuser @@ -63,7 +63,7 @@ def post(self, workflow, status, command, subresource, failure, resubmittedjobs, """ Updates task information """ if failure is not None: try: - failure = b64decode(failure) + failure = decodeBytesToUnicode(b64decode(failure)) except TypeError: raise InvalidParameter("Failure message is not in the accepted format") methodmap = {"state": {"args": (self.Task.SetStatusTask_sql,), "method": self.api.modify, "kwargs": {"status": [status], @@ -75,20 +75,20 @@ def post(self, workflow, status, command, subresource, failure, resubmittedjobs, "failure": [failure], "tm_taskname": [workflow]}}, #Used in DagmanSubmitter? 
"success": {"args": (self.Task.SetInjectedTasks_sql,), "method": self.api.modify, "kwargs": {"tm_task_status": [status], - "tm_taskname": [workflow], "clusterid": [clusterid], "resubmitted_jobs": [str(resubmittedjobs)]}}, + "tm_taskname": [workflow], "clusterid": [clusterid]}}, "process": {"args": (self.Task.UpdateWorker_sql,), "method": self.api.modifynocheck, "kwargs": {"tw_name": [workername], "get_status": [getstatus], "limit": [limit], "set_status": [status]}}, } if subresource is None: subresource = 'state' - if not subresource in methodmap.keys(): + if not subresource in list(methodmap.keys()): raise InvalidParameter("Subresource of workflowdb has not been found") methodmap[subresource]['method'](*methodmap[subresource]['args'], **methodmap[subresource]['kwargs']) return [] @restcall - def get(self, workername, getstatus, limit, subresource): + def get(self, workername, getstatus, limit): """ Retrieve all columns for a specified task or tasks which are in a particular status with particular conditions """ @@ -118,17 +118,25 @@ def fixupTask(task): #fixup CLOBS values by calling read (only for Oracle) for field in ['tm_task_failure', 'tm_split_args', 'tm_outfiles', 'tm_tfile_outfiles', 'tm_edm_outfiles', - 'panda_resubmitted_jobs', 'tm_arguments', 'tm_scriptargs', 'tm_user_files', 'tm_arguments']: + 'tm_arguments', 'tm_scriptargs', 'tm_user_files']: current = result[field] fixedCurr = current if (current is None or isinstance(current, str)) else current.read() result[field] = fixedCurr #liter_evaluate values for field in ['tm_site_whitelist', 'tm_site_blacklist', 'tm_split_args', 'tm_outfiles', 'tm_tfile_outfiles', - 'tm_edm_outfiles', 'panda_resubmitted_jobs', 'tm_user_infiles', 'tm_arguments', 'tm_scriptargs', + 'tm_edm_outfiles', 'tm_user_infiles', 'tm_arguments', 'tm_scriptargs', 'tm_user_files']: current = result[field] result[field] = literal_eval(current) + for idx, value in enumerate(result[field]): + if isinstance(value, bytes): + result[field][idx] = value.decode("utf8") + + # py3 crabserver compatible with tasks submitted with py2 crabserver + for arg in ('lumis', 'runs'): + for idx, val in enumerate(result['tm_split_args'].get(arg)): + result['tm_split_args'][arg][idx] = decodeBytesToUnicode(val) #convert tm_arguments to the desired values extraargs = result['tm_arguments'] diff --git a/src/python/CRABInterface/Regexps.py b/src/python/CRABInterface/Regexps.py index fc0fa1b62c..6d59b5c198 100644 --- a/src/python/CRABInterface/Regexps.py +++ b/src/python/CRABInterface/Regexps.py @@ -103,10 +103,9 @@ RX_DN = re.compile(r"^/(?:C|O|DC)=.*/CN=.") ## worker subresources RX_SUBPOSTWORKER = re.compile(r"^(state|start|failure|success|process|lumimask)$") -RX_SUBGETWORKER = re.compile(r"jobgroup") # Schedulers -RX_SCHEDULER = re.compile(r"^(panda|condor)$") +RX_SCHEDULER = re.compile(r"^(condor)$") diff --git a/src/python/CRABInterface/Utilities.py b/src/python/CRABInterface/Utilities.py index 99d1ca17bc..471d8368c3 100644 --- a/src/python/CRABInterface/Utilities.py +++ b/src/python/CRABInterface/Utilities.py @@ -8,16 +8,16 @@ from hashlib import sha1 import cherrypy import pycurl -import StringIO +import io import json from WMCore.WMFactory import WMFactory from WMCore.REST.Error import ExecutionError, InvalidParameter from WMCore.Services.CRIC.CRIC import CRIC -from WMCore.Credential.SimpleMyProxy import SimpleMyProxy, MyProxyException -from WMCore.Credential.Proxy import Proxy from WMCore.Services.pycurl_manager import ResponseHeader +from Utils.Utilities import 
encodeUnicodeToBytes + from CRABInterface.Regexps import RX_CERT """ The module contains some utility functions used by the various modules of the CRAB REST interface @@ -27,9 +27,6 @@ ConfigCache = namedtuple("ConfigCache", ["cachetime", "centralconfig"]) #These parameters are set in the globalinit (called in RESTBaseAPI) -serverCert = None -serverKey = None -serverDN = None credServerPath = None def getDBinstance(config, namespace, name): @@ -39,13 +36,13 @@ def getDBinstance(config, namespace, name): backend = 'Oracle' #factory = WMFactory(name = 'TaskQuery', namespace = 'Databases.TaskDB.%s.Task' % backend) - factory = WMFactory(name = name, namespace = 'Databases.%s.%s.%s' % (namespace, backend, name)) + factory = WMFactory(name=name, namespace='Databases.%s.%s.%s' % (namespace, backend, name)) return factory.loadObject( name ) -def globalinit(serverkey, servercert, serverdn, credpath): - global serverCert, serverKey, serverDN, credServerPath # pylint: disable=global-statement - serverCert, serverKey, serverDN, credServerPath = servercert, serverkey, serverdn, credpath +def globalinit(credpath): + global credServerPath # pylint: disable=global-statement + credServerPath = credpath def execute_command(command, logger, timeout): """ @@ -96,8 +93,8 @@ def getCentralConfig(extconfigurl, mode): def retrieveConfig(externalLink): - hbuf = StringIO.StringIO() - bbuf = StringIO.StringIO() + hbuf = io.BytesIO() + bbuf = io.BytesIO() curl = pycurl.Curl() curl.setopt(pycurl.URL, externalLink) @@ -121,40 +118,45 @@ def retrieveConfig(externalLink): return jsonConfig - extConfCommon = json.loads(retrieveConfig(extconfigurl)) - - # below 'if' condition is only added for the transition period from the old config file to the new one. It should be removed after some time. 
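The fixupTask changes in RESTWorkerWorkflow above decode byte strings found inside the literal_eval'd CLOB fields, so tasks written by the python2 server keep working with the python3 REST. A small sketch of that compatibility pass; the CLOB content shown is illustrative:

from ast import literal_eval

def decode_bytes_to_unicode(value, encoding="utf-8"):
    # local stand-in for Utils.Utilities.decodeBytesToUnicode
    return value.decode(encoding) if isinstance(value, bytes) else value

# example of split-args content as it could have been stored by the py2 server
clob = "{'runs': [b'277194'], 'lumis': [b'122,126']}"

splitArgs = literal_eval(clob)
for arg in ('runs', 'lumis'):
    splitArgs[arg] = [decode_bytes_to_unicode(v) for v in splitArgs[arg]]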
- if 'modes' in extConfCommon: - extConfSchedds = json.loads(retrieveConfig(extConfCommon['htcondorScheddsLink'])) - - # The code below constructs dict from below provided JSON structure - # { u'htcondorPool': '', u'compatible-version': [''], u'htcondorScheddsLink': '', - # u'modes': [{ - # u'mode': '', u'backend-urls': { - # u'asoConfig': [{ u'couchURL': '', u'couchDBName': ''}], - # u'htcondorSchedds': [''], u'cacheSSL': '', u'baseURL': ''}}], - # u'banned-out-destinations': [], u'delegate-dn': ['']} - # to match expected dict structure which is: - # { u'compatible-version': [''], u'htcondorScheddsLink': '', - # 'backend-urls': { - # u'asoConfig': [{u'couchURL': '', u'couchDBName': ''}], - # u'htcondorSchedds': {u'crab3@vocmsXXXX.cern.ch': {u'proxiedurl': '', u'weightfactor': 1}}, - # u'cacheSSL': '', u'baseURL': '', 'htcondorPool': ''}, - # u'banned-out-destinations': [], u'delegate-dn': ['']} - extConfCommon['backend-urls'] = next((item['backend-urls'] for item in extConfCommon['modes'] if item['mode'] == mode), None) - extConfCommon['backend-urls']['htcondorPool'] = extConfCommon.pop('htcondorPool') - del extConfCommon['modes'] - - # if htcondorSchedds": [] is not empty, it gets populated with the specified list of schedds, - # otherwise it takes default list of schedds - if extConfCommon['backend-urls']['htcondorSchedds']: - extConfCommon['backend-urls']['htcondorSchedds'] = {k: v for k, v in extConfSchedds.items() if - k in extConfCommon['backend-urls']['htcondorSchedds']} - else: - extConfCommon["backend-urls"]["htcondorSchedds"] = extConfSchedds - centralCfgFallback = extConfCommon + extConfigJson = retrieveConfig(extconfigurl) # will raise ExecutionError if http calls fail + try: + extConfCommon = json.loads(extConfigJson) + except Exception as ex: + raise ExecutionError("non-JSON format in external configuration from %s" % externalLink) + + schedListLink = extConfCommon['htcondorScheddsLink'] + schedListJson = retrieveConfig(schedListLink) # will raise ExecutionError if http calls fail + try: + extConfSchedds = json.loads(schedListJson) + except Exception as ex: + raise ExecutionError("non-JSON format in sched list from %s" % schedListLink) + + # The code below constructs dict from below provided JSON structure + # { u'htcondorPool': '', u'compatible-version': [''], u'htcondorScheddsLink': '', + # u'modes': [{ + # u'mode': '', u'backend-urls': { + # u'asoConfig': [{ u'couchURL': '', u'couchDBName': ''}], + # u'htcondorSchedds': [''], u'cacheSSL': '', u'baseURL': ''}}], + # u'banned-out-destinations': [], u'delegate-dn': ['']} + # to match expected dict structure which is: + # { u'compatible-version': [''], u'htcondorScheddsLink': '', + # 'backend-urls': { + # u'asoConfig': [{u'couchURL': '', u'couchDBName': ''}], + # u'htcondorSchedds': {u'crab3@vocmsXXXX.cern.ch': {u'proxiedurl': '', u'weightfactor': 1}}, + # u'cacheSSL': '', u'baseURL': '', 'htcondorPool': ''}, + # u'banned-out-destinations': [], u'delegate-dn': ['']} + extConfCommon['backend-urls'] = next((item['backend-urls'] for item in extConfCommon['modes'] if item['mode'] == mode), None) + extConfCommon['backend-urls']['htcondorPool'] = extConfCommon.pop('htcondorPool') + del extConfCommon['modes'] + + # if htcondorSchedds": [] is not empty, it gets populated with the specified list of schedds, + # otherwise it takes default list of schedds + if extConfCommon['backend-urls']['htcondorSchedds']: + extConfCommon['backend-urls']['htcondorSchedds'] = {k: v for k, v in extConfSchedds.items() if + k in 
extConfCommon['backend-urls']['htcondorSchedds']} else: - centralCfgFallback = extConfCommon[mode] + extConfCommon["backend-urls"]["htcondorSchedds"] = extConfSchedds + centralCfgFallback = extConfCommon return centralCfgFallback @@ -162,11 +164,10 @@ def retrieveConfig(externalLink): def conn_handler(services): """ Decorator to be used among REST resources to optimize connections to other services - as CouchDB and CRIC, WMStats monitoring + as CRIC, WMStats monitoring arg str list services: list of string telling which service connections - should be started; currently availables are - 'monitor' and 'asomonitor'. + should be started """ def wrap(func): def wrapped_func(*args, **kwargs): @@ -175,46 +176,6 @@ def wrapped_func(*args, **kwargs): args[0].allPNNNames = CMSSitesCache(sites=CRIC().getAllPhEDExNodeNames(), cachetime=mktime(gmtime())) if 'centralconfig' in services and (not args[0].centralcfg.centralconfig or (args[0].centralcfg.cachetime+1800 < mktime(gmtime()))): args[0].centralcfg = ConfigCache(centralconfig=getCentralConfig(extconfigurl=args[0].config.extconfigurl, mode=args[0].config.mode), cachetime=mktime(gmtime())) - if 'servercert' in services: - args[0].serverCert = serverCert - args[0].serverKey = serverKey return func(*args, **kwargs) return wrapped_func return wrap - -def retrieveUserCert(func): - def wrapped_func(*args, **kwargs): - logger = logging.getLogger("CRABLogger.Utils") - myproxyserver = "myproxy.cern.ch" - defaultDelegation = {'logger': logger, - 'proxyValidity': '192:00', - 'min_time_left': 36000, - 'server_key': serverKey, - 'server_cert': serverCert,} - mypclient = SimpleMyProxy(defaultDelegation) - userproxy = None - userhash = sha1(kwargs['userdn']).hexdigest() - if serverDN: - try: - userproxy = mypclient.logonRenewMyProxy(username=userhash, myproxyserver=myproxyserver, myproxyport=7512) - except MyProxyException as me: - # Unsure if this works in standalone mode... 
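After this change conn_handler only refreshes the CRIC site list and the central configuration, re-reading the latter when the cached copy is older than 1800 seconds. A compact sketch of that time-stamped cache pattern; the loader function is a placeholder for getCentralConfig:

from collections import namedtuple
from time import gmtime, mktime

ConfigCache = namedtuple("ConfigCache", ["cachetime", "centralconfig"])

def load_central_config():
    # placeholder for getCentralConfig(extconfigurl=..., mode=...)
    return {"backend-urls": {}}

cache = ConfigCache(cachetime=0, centralconfig=None)

def get_config():
    """Return the cached config, refreshing it when older than 30 minutes."""
    global cache
    now = mktime(gmtime())
    if not cache.centralconfig or cache.cachetime + 1800 < now:
        cache = ConfigCache(centralconfig=load_central_config(), cachetime=now)
    return cache.centralconfig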
- cherrypy.log(str(me)) - cherrypy.log(str(serverKey)) - cherrypy.log(str(serverCert)) - invalidp = InvalidParameter("Impossible to retrieve proxy from %s for %s and hash %s" % - (myproxyserver, kwargs['userdn'], userhash)) - setattr(invalidp, 'trace', str(me)) - raise invalidp - - else: - if not re.match(RX_CERT, userproxy): - raise InvalidParameter("Retrieved malformed proxy from %s for %s and hash %s" % - (myproxyserver, kwargs['userdn'], userhash)) - else: - proxy = Proxy(defaultDelegation) - userproxy = proxy.getProxyFilename() - kwargs['userproxy'] = userproxy - out = func(*args, **kwargs) - return out - return wrapped_func diff --git a/src/python/Databases/FileMetaDataDB/Oracle/Create.py b/src/python/Databases/FileMetaDataDB/Oracle/Create.py index c41dab94d4..fe5e398515 100755 --- a/src/python/Databases/FileMetaDataDB/Oracle/Create.py +++ b/src/python/Databases/FileMetaDataDB/Oracle/Create.py @@ -27,7 +27,6 @@ def __init__(self, logger=None, dbi=None, param=None): self.create['b_filemetadata'] = """ CREATE TABLE filemetadata ( tm_taskname VARCHAR(255) NOT NULL, - panda_job_id NUMBER(11) DEFAULT NULL, job_id VARCHAR(20), fmd_outdataset VARCHAR(500) NOT NULL, fmd_acq_era VARCHAR(255) NOT NULL, diff --git a/src/python/Databases/FileMetaDataDB/Oracle/Destroy.py b/src/python/Databases/FileMetaDataDB/Oracle/Destroy.py index 07975adab4..5149cebe2e 100755 --- a/src/python/Databases/FileMetaDataDB/Oracle/Destroy.py +++ b/src/python/Databases/FileMetaDataDB/Oracle/Destroy.py @@ -3,7 +3,6 @@ """ import threading -import string from WMCore.Database.DBCreator import DBCreator from Databases.FileMetaDataDB.Oracle.Create import Create @@ -29,6 +28,6 @@ def __init__(self, logger = None, dbi = None, param=None): i = 0 for tableName in orderedTables: i += 1 - prefix = string.zfill(i, 2) + prefix = str(i).zfill(2) self.create[prefix + tableName] = "DROP TABLE %s" % tableName diff --git a/src/python/Databases/FileMetaDataDB/Oracle/FileMetaData/FileMetaData.py b/src/python/Databases/FileMetaDataDB/Oracle/FileMetaData/FileMetaData.py index eccb1ac153..aeb0d84f95 100644 --- a/src/python/Databases/FileMetaDataDB/Oracle/FileMetaData/FileMetaData.py +++ b/src/python/Databases/FileMetaDataDB/Oracle/FileMetaData/FileMetaData.py @@ -7,8 +7,8 @@ class GetFromTaskAndType(): """ Used for indexing columns retrieved by the GetFromTaskAndType_sql query Order of parameters must be the same as it is in query GetFromTaskAndType_sql """ - PANDAID, JOBID, OUTDS, ACQERA, SWVER, INEVENTS, GLOBALTAG, PUBLISHNAME, LOCATION, TMPLOCATION, RUNLUMI, ADLER32, CKSUM, MD5, LFN, SIZE, PARENTS, STATE,\ - CREATIONTIME, TMPLFN, TYPE, DIRECTSTAGEOUT = range(22) + JOBID, OUTDS, ACQERA, SWVER, INEVENTS, GLOBALTAG, PUBLISHNAME, LOCATION, TMPLOCATION, RUNLUMI, ADLER32, CKSUM, MD5, LFN, SIZE, PARENTS, STATE,\ + CREATIONTIME, TMPLFN, TYPE, DIRECTSTAGEOUT = range(21) class FileMetaData(object): """ @@ -17,7 +17,7 @@ class FileMetaData(object): ChangeFileState_sql = """UPDATE filemetadata SET fmd_filestate=:filestate \ WHERE fmd_lfn=:outlfn and tm_taskname=:taskname """ - GetFromTaskAndType_tuple = namedtuple("GetFromTaskAndType", ["pandajobid", + GetFromTaskAndType_tuple = namedtuple("GetFromTaskAndType", [ "jobid", "outdataset", "acquisitionera", @@ -39,7 +39,7 @@ class FileMetaData(object): "tmplfn", "type", "directstageout"]) - GetFromTaskAndType_sql = """SELECT panda_job_id AS pandajobid, \ + GetFromTaskAndType_sql = """SELECT \ job_id AS jobid, \ fmd_outdataset AS outdataset, \ fmd_acq_era AS acquisitionera, \ @@ -69,10 +69,10 @@ class 
FileMetaData(object): """ New_sql = "INSERT INTO filemetadata ( \ - tm_taskname, panda_job_id, job_id, fmd_outdataset, fmd_acq_era, fmd_sw_ver, fmd_in_events, fmd_global_tag,\ + tm_taskname, job_id, fmd_outdataset, fmd_acq_era, fmd_sw_ver, fmd_in_events, fmd_global_tag,\ fmd_publish_name, fmd_location, fmd_tmp_location, fmd_runlumi, fmd_adler32, fmd_cksum, fmd_md5, fmd_lfn, fmd_size,\ fmd_type, fmd_parent, fmd_creation_time, fmd_filestate, fmd_direct_stageout, fmd_tmplfn) \ - VALUES (:taskname, :pandajobid, :jobid, :outdatasetname, :acquisitionera, :appver, :events, :globalTag,\ + VALUES (:taskname, :jobid, :outdatasetname, :acquisitionera, :appver, :events, :globalTag,\ :publishdataname, :outlocation, :outtmplocation, :runlumi, :checksumadler32, :checksumcksum, :checksummd5, :outlfn, :outsize,\ :outtype, :inparentlfns, SYS_EXTRACT_UTC(SYSTIMESTAMP), :filestate, :directstageout, :outtmplfn)" diff --git a/src/python/Databases/FileTransfersDB/Oracle/Destroy.py b/src/python/Databases/FileTransfersDB/Oracle/Destroy.py index 5234436363..217e0bb66c 100755 --- a/src/python/Databases/FileTransfersDB/Oracle/Destroy.py +++ b/src/python/Databases/FileTransfersDB/Oracle/Destroy.py @@ -3,7 +3,6 @@ """ import threading -import string from WMCore.Database.DBCreator import DBCreator from Databases.TaskDB.Oracle.Create import Create @@ -29,7 +28,7 @@ def __init__(self, logger = None, dbi = None, param=None): i = 0 for tableName in orderedTables: i += 1 - prefix = string.zfill(i, 2) + prefix = str(i).zfill(2) if tableName.endswith("_seq"): self.create[prefix + tableName] = "DROP SEQUENCE %s" % tableName elif tableName.endswith("_trg"): diff --git a/src/python/Databases/TaskDB/Oracle/Create.py b/src/python/Databases/TaskDB/Oracle/Create.py index f4deda417b..7277b95538 100755 --- a/src/python/Databases/TaskDB/Oracle/Create.py +++ b/src/python/Databases/TaskDB/Oracle/Create.py @@ -12,9 +12,7 @@ class Create(DBCreator): """ Implementation of TaskMgr DB for Oracle """ - requiredTables = ['tasks', - 'jobgroups', - 'jobgroups_id_seq' + requiredTables = ['tasks' ] def __init__(self, logger=None, dbi=None, param=None): @@ -34,7 +32,6 @@ def __init__(self, logger=None, dbi=None, param=None): CREATE TABLE tasks( tm_taskname VARCHAR(255) NOT NULL, tm_activity VARCHAR(255), - panda_jobset_id NUMBER(11), tm_task_status VARCHAR(255) NOT NULL, tm_task_command VARCHAR(20), tm_start_time TIMESTAMP, @@ -71,7 +68,6 @@ def __init__(self, logger=None, dbi=None, param=None): tm_generator VARCHAR(255), tm_events_per_lumi NUMBER(38), tm_arguments CLOB, - panda_resubmitted_jobs CLOB, tm_save_logs VARCHAR(1) NOT NULL, tw_name VARCHAR(255), tm_user_infiles VARCHAR(4000), diff --git a/src/python/Databases/TaskDB/Oracle/Destroy.py b/src/python/Databases/TaskDB/Oracle/Destroy.py index 5234436363..217e0bb66c 100755 --- a/src/python/Databases/TaskDB/Oracle/Destroy.py +++ b/src/python/Databases/TaskDB/Oracle/Destroy.py @@ -3,7 +3,6 @@ """ import threading -import string from WMCore.Database.DBCreator import DBCreator from Databases.TaskDB.Oracle.Create import Create @@ -29,7 +28,7 @@ def __init__(self, logger = None, dbi = None, param=None): i = 0 for tableName in orderedTables: i += 1 - prefix = string.zfill(i, 2) + prefix = str(i).zfill(2) if tableName.endswith("_seq"): self.create[prefix + tableName] = "DROP SEQUENCE %s" % tableName elif tableName.endswith("_trg"): diff --git a/src/python/Databases/TaskDB/Oracle/JobGroup/JobGroup.py b/src/python/Databases/TaskDB/Oracle/JobGroup/JobGroup.py deleted file mode 100644 index 
4089e1d505..0000000000 --- a/src/python/Databases/TaskDB/Oracle/JobGroup/JobGroup.py +++ /dev/null @@ -1,25 +0,0 @@ -#!/usr/bin/env python -""" -""" - -class JobGroup(object): - - sql = "INSERT INTO JOBGROUPS ( " - sql += "tm_taskname, panda_jobdef_id, panda_jobdef_status, tm_data_blocks, panda_jobgroup_failure, tm_user_dn)" - sql += " VALUES (:task_name, :jobdef_id, upper(:jobgroup_status), :blocks, :jobgroup_failure, :tm_user_dn) " - - AddJobGroup_sql = sql - - GetFailedJobGroup_sql = "SELECT * FROM jobgroups WHERE panda_jobdef_status = 'FAILED'" - - GetJobGroupFromID_sql = "SELECT panda_jobdef_id, panda_jobdef_status, panda_jobgroup_failure FROM jobgroups WHERE tm_taskname = :taskname" - - GetJobGroupFromJobDef_sql = """SELECT tm_taskname, panda_jobdef_id, panda_jobdef_status, - tm_data_blocks, panda_jobgroup_failure, tm_user_dn - FROM jobgroups - WHERE panda_jobdef_id = :jobdef_id and tm_user_dn = :user_dn""" - - SetJobDefId_sql = "UPDATE jobgroups SET panda_jobdef_id = :panda_jobgroup WHERE tm_jobgroups_id = :tm_jobgroups_id" - - SetStatusJobGroup_sql = "UPDATE jobgroups SET panda_jobdef_status = upper(:status) WHERE tm_jobgroups_id = :tm_jobgroup_id" - diff --git a/src/python/Databases/TaskDB/Oracle/JobGroup/__init__.py b/src/python/Databases/TaskDB/Oracle/JobGroup/__init__.py deleted file mode 100755 index 06ea4e11da..0000000000 --- a/src/python/Databases/TaskDB/Oracle/JobGroup/__init__.py +++ /dev/null @@ -1,3 +0,0 @@ -#!/usr/bin/env python - -__all__ = [] diff --git a/src/python/Databases/TaskDB/Oracle/Task/Task.py b/src/python/Databases/TaskDB/Oracle/Task/Task.py index 067e0a561c..e90e1c4419 100644 --- a/src/python/Databases/TaskDB/Oracle/Task/Task.py +++ b/src/python/Databases/TaskDB/Oracle/Task/Task.py @@ -6,12 +6,12 @@ class Task(object): """ """ #ID - ID_tuple = namedtuple("ID", ["taskname", "panda_jobset_id", "task_status", "task_command", "user_role", "user_group", \ - "task_failure", "split_algo", "split_args", "panda_resubmitted_jobs", "save_logs", "username", \ + ID_tuple = namedtuple("ID", ["taskname", "task_status", "task_command", "user_role", "user_group", \ + "task_failure", "split_algo", "split_args", "save_logs", "username", \ "user_dn", "arguments", "input_dataset", "dbs_url", "task_warnings", "publication", "user_webdir", \ "asourl", "asodb", "output_dataset", "collector", "schedd", "dry_run", "clusterid", "start_time", "twname"]) - ID_sql = "SELECT tm_taskname, panda_jobset_id, tm_task_status, tm_task_command, tm_user_role, tm_user_group, \ - tm_task_failure, tm_split_algo, tm_split_args, panda_resubmitted_jobs, tm_save_logs, tm_username, \ + ID_sql = "SELECT tm_taskname, tm_task_status, tm_task_command, tm_user_role, tm_user_group, \ + tm_task_failure, tm_split_algo, tm_split_args, tm_save_logs, tm_username, \ tm_user_dn, tm_arguments, tm_input_dataset, tm_dbs_url, tm_task_warnings, tm_publication, tm_user_webdir, tm_asourl, \ tm_asodb, tm_output_dataset, tm_collector, tm_schedd, tm_dry_run, clusterid, tm_start_time, tw_name \ FROM tasks WHERE tm_taskname=:taskname" @@ -39,45 +39,45 @@ class Task(object): #New New_sql = "INSERT INTO tasks ( \ - tm_taskname, tm_activity, panda_jobset_id, tm_task_status, tm_task_command, tm_start_time, tm_task_failure, tm_job_sw, \ + tm_taskname, tm_activity, tm_task_status, tm_task_command, tm_start_time, tm_task_failure, tm_job_sw, \ tm_job_arch, tm_input_dataset, tm_primary_dataset, tm_nonvalid_input_dataset, tm_use_parent, tm_secondary_input_dataset, tm_site_whitelist, tm_site_blacklist, \ tm_split_algo, tm_split_args, 
tm_totalunits, tm_user_sandbox, tm_debug_files, tm_cache_url, tm_username, tm_user_dn, \ tm_user_vo, tm_user_role, tm_user_group, tm_publish_name, tm_publish_groupname, tm_asyncdest, tm_dbs_url, tm_publish_dbs_url, \ tm_publication, tm_outfiles, tm_tfile_outfiles, tm_edm_outfiles, tm_job_type, tm_generator, tm_arguments, \ - panda_resubmitted_jobs, tm_save_logs, tm_user_infiles, tm_maxjobruntime, tm_numcores, tm_maxmemory, tm_priority, \ + tm_save_logs, tm_user_infiles, tm_maxjobruntime, tm_numcores, tm_maxmemory, tm_priority, \ tm_scriptexe, tm_scriptargs, tm_extrajdl, tm_asourl, tm_asodb, tm_events_per_lumi, tm_collector, tm_schedd, tm_dry_run, \ tm_user_files, tm_transfer_outputs, tm_output_lfn, tm_ignore_locality, tm_fail_limit, tm_one_event_mode, tm_submitter_ip_addr, tm_ignore_global_blacklist) \ - VALUES (:task_name, :task_activity, :jobset_id, upper(:task_status), upper(:task_command), SYS_EXTRACT_UTC(SYSTIMESTAMP), :task_failure, :job_sw, \ + VALUES (:task_name, :task_activity, upper(:task_status), upper(:task_command), SYS_EXTRACT_UTC(SYSTIMESTAMP), :task_failure, :job_sw, \ :job_arch, :input_dataset, :primary_dataset, :nonvalid_data, :use_parent, :secondary_dataset, :site_whitelist, :site_blacklist, \ :split_algo, :split_args, :total_units, :user_sandbox, :debug_files, :cache_url, :username, :user_dn, \ :user_vo, :user_role, :user_group, :publish_name, :publish_groupname, :asyncdest, :dbs_url, :publish_dbs_url, \ :publication, :outfiles, :tfile_outfiles, :edm_outfiles, :job_type, :generator, :arguments, \ - :resubmitted_jobs, :save_logs, :user_infiles, :maxjobruntime, :numcores, :maxmemory, :priority, \ + :save_logs, :user_infiles, :maxjobruntime, :numcores, :maxmemory, :priority, \ :scriptexe, :scriptargs, :extrajdl, :asourl, :asodb, :events_per_lumi, :collector, :schedd_name, :dry_run, \ :user_files, :transfer_outputs, :output_lfn, :ignore_locality, :fail_limit, :one_event_mode, :submitter_ip_addr, :ignore_global_blacklist)" - GetReadyTasks_tuple = namedtuple("GetReadyTasks", ["tm_taskname", "panda_jobset_id", "tm_task_status", "tm_task_command", \ + GetReadyTasks_tuple = namedtuple("GetReadyTasks", ["tm_taskname", "tm_task_status", "tm_task_command", \ "tm_start_time", "tm_start_injection", "tm_end_injection", \ "tm_task_failure", "tm_job_sw", "tm_job_arch", "tm_input_dataset", "tm_DDM_reqid", \ "tm_site_whitelist", "tm_site_blacklist", "tm_split_algo", "tm_split_args", \ "tm_totalunits", "tm_user_sandbox", "tm_debug_files", "tm_cache_url", "tm_username", "tm_user_dn", "tm_user_vo", \ "tm_user_role", "tm_user_group", "tm_publish_name", "tm_asyncdest", "tm_dbs_url", \ "tm_publish_dbs_url", "tm_publication", "tm_outfiles", "tm_tfile_outfiles", "tm_edm_outfiles", \ - "tm_job_type", "tm_arguments", "panda_resubmitted_jobs", "tm_save_logs", \ + "tm_job_type", "tm_arguments", "tm_save_logs", \ "tm_user_infiles", "tw_name", "tm_maxjobruntime", "tm_numcores", "tm_maxmemory", "tm_priority", "tm_activity", \ "tm_scriptexe", "tm_scriptargs", "tm_extrajdl", "tm_generator", "tm_asourl", "tm_asodb", "tm_events_per_lumi", \ "tm_use_parent", "tm_collector", "tm_schedd", "tm_dry_run", \ "tm_user_files", "tm_transfer_outputs", "tm_output_lfn", "tm_ignore_locality", "tm_fail_limit", "tm_one_event_mode", \ "tm_publish_groupname", "tm_nonvalid_input_dataset", "tm_secondary_input_dataset", "tm_primary_dataset", "tm_submitter_ip_addr", "tm_ignore_global_blacklist"]) #GetReadyTasks - GetReadyTasks_sql = """SELECT tm_taskname, panda_jobset_id, tm_task_status, tm_task_command, \ + GetReadyTasks_sql = 
"""SELECT tm_taskname, tm_task_status, tm_task_command, \ tm_start_time, tm_start_injection, tm_end_injection, \ tm_task_failure, tm_job_sw, tm_job_arch, tm_input_dataset, tm_DDM_reqid, \ tm_site_whitelist, tm_site_blacklist, tm_split_algo, tm_split_args, \ tm_totalunits, tm_user_sandbox, tm_debug_files, tm_cache_url, tm_username, tm_user_dn, tm_user_vo, \ tm_user_role, tm_user_group, tm_publish_name, tm_asyncdest, tm_dbs_url, \ tm_publish_dbs_url, tm_publication, tm_outfiles, tm_tfile_outfiles, tm_edm_outfiles, \ - tm_job_type, tm_arguments, panda_resubmitted_jobs, tm_save_logs, \ + tm_job_type, tm_arguments, tm_save_logs, \ tm_user_infiles, tw_name, tm_maxjobruntime, tm_numcores, tm_maxmemory, tm_priority, tm_activity, \ tm_scriptexe, tm_scriptargs, tm_extrajdl, tm_generator, tm_asourl, tm_asodb, tm_events_per_lumi, \ tm_use_parent, tm_collector, tm_schedd, tm_dry_run, \ @@ -116,13 +116,9 @@ class Task(object): #SetInjectedTasks SetInjectedTasks_sql = "UPDATE tasks SET tm_end_injection = SYS_EXTRACT_UTC(SYSTIMESTAMP), \ tm_task_status = upper(:tm_task_status), \ - panda_resubmitted_jobs = :resubmitted_jobs, \ clusterid = :clusterid \ WHERE tm_taskname = :tm_taskname" - #SetJobSetId - SetJobSetId_sql = "UPDATE tasks SET panda_jobset_id = :jobsetid WHERE tm_taskname = :taskname" - #SetReadyTasks SetReadyTasks_sql = "UPDATE tasks SET tm_start_injection = SYS_EXTRACT_UTC(SYSTIMESTAMP), \ tm_task_status = upper(:tm_task_status) WHERE tm_taskname = :tm_taskname" diff --git a/src/python/HTCondorUtils.py b/src/python/HTCondorUtils.py index bd85388267..d402fca4fd 100644 --- a/src/python/HTCondorUtils.py +++ b/src/python/HTCondorUtils.py @@ -35,27 +35,36 @@ def __init__(self, outputMessage, outputObj): self.outputMessage = outputMessage self.outputObj = outputObj self.environmentStr = "" - for key, val in os.environ.iteritems(): + for key, val in os.environ.items(): self.environmentStr += "%s=%s\n" % (key, val) class AuthenticatedSubprocess(object): - def __init__(self, proxy, pickleOut=False, outputObj = None, logger = logging): + def __init__(self, proxy, tokenDir=None, pickleOut=False, outputObj=None, logger=logging): self.proxy = proxy self.pickleOut = pickleOut self.outputObj = outputObj self.timedout = False self.logger = logger + self.tokenDir = tokenDir def __enter__(self): self.r, self.w = os.pipe() - self.rpipe = os.fdopen(self.r, 'r') - self.wpipe = os.fdopen(self.w, 'w') + if self.pickleOut: + self.rpipe = os.fdopen(self.r, 'rb') + self.wpipe = os.fdopen(self.w, 'wb') + else: + self.rpipe = os.fdopen(self.r, 'r') + self.wpipe = os.fdopen(self.w, 'w') self.pid = os.fork() if self.pid == 0: htcondor.SecMan().invalidateAllSessions() - htcondor.param['SEC_CLIENT_AUTHENTICATION_METHODS'] = 'FS,GSI' + if self.tokenDir: + htcondor.param['SEC_TOKEN_DIRECTORY'] = self.tokenDir + htcondor.param['SEC_CLIENT_AUTHENTICATION_METHODS'] = 'IDTOKENS,FS,GSI' + else: + htcondor.param['SEC_CLIENT_AUTHENTICATION_METHODS'] = 'FS,GSI' htcondor.param['DELEGATE_FULL_JOB_GSI_CREDENTIALS'] = 'true' htcondor.param['DELEGATE_JOB_GSI_CREDENTIALS_LIFETIME'] = '0' os.environ['X509_USER_PROXY'] = self.proxy diff --git a/src/python/Logger.py b/src/python/Logger.py deleted file mode 100644 index 73db356bee..0000000000 --- a/src/python/Logger.py +++ /dev/null @@ -1,69 +0,0 @@ -#pylint: skip-file -""" - * ApMon - Application Monitoring Tool - * Version: 2.2.1 - * - * Copyright (C) 2006 California Institute of Technology - * - * Permission is hereby granted, free of charge, to use, copy and modify - * this software and its 
documentation (the "Software") for any - * purpose, provided that existing copyright notices are retained in - * all copies and that this notice is included verbatim in any distributions - * or substantial portions of the Software. - * This software is a part of the MonALISA framework (http://monalisa.cacr.caltech.edu). - * Users of the Software are asked to feed back problems, benefits, - * and/or suggestions about the software to the MonALISA Development Team - * (developers@monalisa.cern.ch). Support for this software - fixing of bugs, - * incorporation of new features - is done on a best effort basis. All bug - * fixes and enhancements will be made available under the same terms and - * conditions as the original software, - - * IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR - * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT - * OF THE USE OF THIS SOFTWARE, ITS DOCUMENTATION, OR ANY DERIVATIVES THEREOF, - * EVEN IF THE AUTHORS HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. - - * THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES, - * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. THIS SOFTWARE IS - * PROVIDED ON AN "AS IS" BASIS, AND THE AUTHORS AND DISTRIBUTORS HAVE NO - * OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR - * MODIFICATIONS. -""" -from __future__ import print_function - -import time -import threading - -FATAL = 0 # When something very bad happened and we should quit -ERROR = 1 # Tipically when something important fails -WARNING = 2 # Intermediate logging level. -INFO = 3 # Intermediate logging level. -NOTICE = 4 # Logging level with detailed information. -DEBUG = 5 # Logging level for debugging - -LEVELS = ['FATAL', 'ERROR', 'WARNING', 'INFO', 'NOTICE', 'DEBUG'] - -# Simple logging class -class Logger: - # Constructor - def __init__ (this, defaultLevel = INFO): - this.log_lock = threading.Lock(); - this.logLevel = defaultLevel - - # Print the given message if the level is more serious as the existing one - def log(this, level, message): - global LEVELS, FATAL, ERROR, WARNING, INFO, NOTICE, DEBUG - this.log_lock.acquire(); - if(level <= this.logLevel): - print(time.asctime() + ": ApMon["+LEVELS[level]+"]: "+message); - this.log_lock.release(); - - # Set the logging level - def setLogLevel(this, strLevel): - this.log_lock.acquire(); - for l_idx in range(len(LEVELS)): - if strLevel == LEVELS[l_idx]: - this.logLevel = l_idx; - this.log_lock.release(); - diff --git a/src/python/PandaServerInterface.py b/src/python/PandaServerInterface.py deleted file mode 100644 index 946a4a2780..0000000000 --- a/src/python/PandaServerInterface.py +++ /dev/null @@ -1,571 +0,0 @@ -from __future__ import print_function -import os -import re -import sys -import time -import stat -import types -import random -import urllib -import struct -import commands -import cPickle as pickle -import xml.dom.minidom -import socket -import tempfile -import logging -from hashlib import sha1 - -LOGGER = logging.getLogger(__name__) - -# exit code -EC_Failed = 255 - -globalTmpDir = '' -#PandaSites = {} -#PandaClouds = {} - - -class PanDAException(Exception): - """ - Specific errors coming from interaction with PanDa - """ - exitcode = 3100 - - -def _x509(): - try: - return os.environ['X509_USER_PROXY'] - except: - pass - # no valid proxy certificate - # FIXME raise exception or something? 
- - x509 = '/tmp/x509up_u%s' % os.getuid() - if os.access(x509, os.R_OK): - return x509 - # no valid proxy certificate - # FIXME - raise PanDAException("No valid grid proxy certificate found") - - -# look for a CA certificate directory -def _x509_CApath(): - # use X509_CERT_DIR - try: - return os.environ['X509_CERT_DIR'] - except: - pass - # get X509_CERT_DIR - gridSrc = _getGridSrc() - com = "%s echo $X509_CERT_DIR" % gridSrc - tmpOut = commands.getoutput(com) - return tmpOut.split('\n')[-1] - - - -# curl class -class _Curl: - # constructor - def __init__(self): - # path to curl - self.path = 'curl --user-agent "dqcurl" ' - # verification of the host certificate - self.verifyHost = False - # request a compressed response - self.compress = True - # SSL cert/key - self.sslCert = '' - self.sslKey = '' - # verbose - self.verbose = False - - # GET method - def get(self,url,data,rucioAccount=False): - # make command - com = '%s --silent --get' % self.path - if not self.verifyHost: - com += ' --insecure' - else: - com += ' --capath %s' % _x509_CApath() - if self.compress: - com += ' --compressed' - if self.sslCert != '': - com += ' --cert %s' % self.sslCert - if self.sslKey != '': - com += ' --key %s' % self.sslKey - # add rucio account info - if rucioAccount: - if 'RUCIO_ACCOUNT' in os.environ: - data['account'] = os.environ['RUCIO_ACCOUNT'] - if 'RUCIO_APPID' in os.environ: - data['appid'] = os.environ['RUCIO_APPID'] - # data - strData = '' - for key in data.keys(): - strData += 'data="%s"\n' % urllib.urlencode({key:data[key]}) - # write data to temporary config file - if globalTmpDir != '': - tmpFD, tmpName = tempfile.mkstemp(dir=globalTmpDir) - else: - tmpFD, tmpName = tempfile.mkstemp() - os.write(tmpFD, strData) - os.close(tmpFD) - com += ' --config %s' % tmpName - com += ' %s' % url - # execute - if self.verbose: - LOGGER.debug(com) - LOGGER.debug(strData[:-1]) - s, o = commands.getstatusoutput(com) - if o != '\x00': - try: - tmpout = urllib.unquote_plus(o) - o = eval(tmpout) - except: - pass - ret = (s, o) - # remove temporary file - os.remove(tmpName) - ret = self.convRet(ret) - if self.verbose: - LOGGER.debug(ret) - return ret - - def post(self,url,data,rucioAccount=False): - # make command - com = '%s --silent' % self.path - if not self.verifyHost: - com += ' --insecure' - else: - com += ' --capath %s' % _x509_CApath() - if self.compress: - com += ' --compressed' - if self.sslCert != '': - com += ' --cert %s' % self.sslCert - #com += ' --cert %s' % '/tmp/mycredentialtest' - if self.sslKey != '': - com += ' --key %s' % self.sslCert - #com += ' --key %s' % '/tmp/mycredentialtest' - # add rucio account info - if rucioAccount: - if 'RUCIO_ACCOUNT' in os.environ: - data['account'] = os.environ['RUCIO_ACCOUNT'] - if 'RUCIO_APPID' in os.environ: - data['appid'] = os.environ['RUCIO_APPID'] - # data - strData = '' - for key in data.keys(): - strData += 'data="%s"\n' % urllib.urlencode({key:data[key]}) - # write data to temporary config file - if globalTmpDir != '': - tmpFD, tmpName = tempfile.mkstemp(dir=globalTmpDir) - else: - tmpFD, tmpName = tempfile.mkstemp() - os.write(tmpFD, strData) - os.close(tmpFD) - com += ' --config %s' % tmpName - com += ' %s' % url - # execute - if self.verbose: - LOGGER.debug(com) - LOGGER.debug(strData[:-1]) - s, o = commands.getstatusoutput(com) - #print s,o - if o != '\x00': - try: - tmpout = urllib.unquote_plus(o) - o = eval(tmpout) - except: - pass - ret = (s, o) - # remove temporary file - os.remove(tmpName) - ret = self.convRet(ret) - if self.verbose: - 
LOGGER.debug(ret) - return ret - - # PUT method - def put(self, url, data): - # make command - com = '%s --silent' % self.path - if not self.verifyHost: - com += ' --insecure' - if self.compress: - com += ' --compressed' - if self.sslCert != '': - com += ' --cert %s' % self.sslCert - #com += ' --cert %s' % '/data/certs/prova' - if self.sslKey != '': - com += ' --key %s' % self.sslKey - # emulate PUT - for key in data.keys(): - com += ' -F "%s=@%s"' % (key, data[key]) - com += ' %s' % url - if self.verbose: - LOGGER.debug(com) - # execute - ret = commands.getstatusoutput(com) - ret = self.convRet(ret) - if self.verbose: - LOGGER.debug(ret) - return ret - - - # convert return - def convRet(self, ret): - if ret[0] != 0: - ret = (ret[0]%255, ret[1]) - # add messages to silent errors - if ret[0] == 35: - ret = (ret[0], 'SSL connect error. The SSL handshaking failed. Check grid certificate/proxy.') - elif ret[0] == 7: - ret = (ret[0], 'Failed to connect to host.') - elif ret[0] == 55: - ret = (ret[0], 'Failed sending network data.') - elif ret[0] == 56: - ret = (ret[0], 'Failure in receiving network data.') - return ret - - -# get site specs -def getSiteSpecs(baseURL, sslCert, sslKey, siteType=None): - # instantiate curl - curl = _Curl() - curl.sslCert = sslCert - curl.sslKey = sslKey - # execute - url = baseURL + '/getSiteSpecs' - data = {} - if siteType != None: - data['siteType'] = siteType - status, output = curl.get(url, data) - try: - return status, pickle.loads(output) - except: - type, value, traceBack = sys.exc_info() - errStr = "ERROR getSiteSpecs : %s %s" % (type, value) - LOGGER.error(errStr) - return EC_Failed, output+'\n'+errStr - - -# get cloud specs -def getCloudSpecs(baseURL, sslCert, sslKey): - # instantiate curl - curl = _Curl() - curl.sslCert = sslCert - curl.sslKey = sslKey - # execute - url = baseURL + '/getCloudSpecs' - status, output = curl.get(url, {}) - try: - return status, pickle.loads(output) - except: - type, value, traceBack = sys.exc_info() - errStr = "ERROR getCloudSpecs : %s %s" % (type, value) - LOGGER.error(errStr) - return EC_Failed, output+'\n'+errStr - -# refresh specs -def refreshSpecs(baseURL, proxy): - global PandaSites - global PandaClouds - - sslCert = proxy - sslKey = proxy - # get Panda Sites - tmpStat, PandaSites = getSiteSpecs(baseURL, sslCert, sslKey) - if tmpStat != 0: - LOGGER.error("ERROR : cannot get Panda Sites") - sys.exit(EC_Failed) - # get cloud info - tmpStat, PandaClouds = getCloudSpecs(baseURL, sslCert, sslKey) - if tmpStat != 0: - LOGGER.error("ERROR : cannot get Panda Clouds") - sys.exit(EC_Failed) - -def getSite(sitename): - global PandaSites - return PandaSites[sitename]['cloud'] - -# submit jobs -def submitJobs(baseURLSSL, jobs, proxy, verbose=False): - # set hostname - hostname = commands.getoutput('hostname') - for job in jobs: - job.creationHost = hostname - # serialize - strJobs = pickle.dumps(jobs) - # instantiate curl - curl = _Curl() - curl.sslCert = proxy - curl.sslKey = proxy - curl.verbose = True - # execute - url = baseURLSSL + '/submitJobs' - data = {'jobs':strJobs} - status, output = curl.post(url, data) - #print 'SUBMITJOBS CALL --> status: %s - output: %s' % (status, output) - if status!=0: - LOGGER.error('==============================') - LOGGER.error('submitJobs output: %s' % output) - LOGGER.error('submitJobs status: %s' % status) - LOGGER.error('==============================') - return status, None - try: - return status, pickle.loads(output) - except: - type, value, traceBack = sys.exc_info() - 
LOGGER.error("ERROR submitJobs : %s %s" % (type, value)) - return EC_Failed, None - - -# run brokerage -def runBrokerage(baseURLSSL, sites, proxy, - atlasRelease=None, cmtConfig=None, verbose=False, trustIS=False, cacheVer='', - processingType='', loggingFlag=False, memorySize=0, useDirectIO=False, siteGroup=None, - maxCpuCount=-1): - # use only directIO sites - nonDirectSites = [] - if useDirectIO: - tmpNewSites = [] - for tmpSite in sites: - if isDirectAccess(tmpSite): - tmpNewSites.append(tmpSite) - else: - nonDirectSites.append(tmpSite) - sites = tmpNewSites - if sites == []: - if not loggingFlag: - return 0, 'ERROR : no candidate.' - else: - return 0, {'site':'ERROR : no candidate.','logInfo':[]} - ## MATTIA comments this code here below - # choose at most 50 sites randomly to avoid too many lookup - #random.shuffle(sites) - #sites = sites[:50] - # serialize - strSites = pickle.dumps(sites) - # instantiate curl - curl = _Curl() - curl.sslKey = proxy - curl.sslCert = proxy - curl.verbose = verbose - # execute - url = baseURLSSL + '/runBrokerage' - data = {'sites':strSites, - 'atlasRelease':atlasRelease} - if cmtConfig != None: - data['cmtConfig'] = cmtConfig - if trustIS: - data['trustIS'] = True - if maxCpuCount > 0: - data['maxCpuCount'] = maxCpuCount - if cacheVer != '': - # change format if needed - cacheVer = re.sub('^-', '', cacheVer) - match = re.search('^([^_]+)_(\d+\.\d+\.\d+\.\d+\.*\d*)$', cacheVer) - if match != None: - cacheVer = '%s-%s' % (match.group(1), match.group(2)) - else: - # nightlies - match = re.search('_(rel_\d+)$', cacheVer) - if match != None: - # use base release as cache version - cacheVer = '%s:%s' % (atlasRelease, match.group(1)) - # use cache for brokerage - data['atlasRelease'] = cacheVer - if processingType != '': - # set processingType mainly for HC - data['processingType'] = processingType - # enable logging - if loggingFlag: - data['loggingFlag'] = True - # memory size - if not memorySize in [-1, 0, None, 'NULL']: - data['memorySize'] = memorySize - # site group - if not siteGroup in [None, -1]: - data['siteGroup'] = siteGroup - status, output = curl.get(url, data) - try: - if not loggingFlag: - return status, output - else: - outputPK = pickle.loads(output) - # add directIO info - if nonDirectSites != []: - if 'logInfo' not in outputPK: - outputPK['logInfo'] = [] - for tmpSite in nonDirectSites: - msgBody = 'action=skip site=%s reason=nondirect - not directIO site' % tmpSite - outputPK['logInfo'].append(msgBody) - return status, outputPK - except: - type, value, traceBack = sys.exc_info() - LOGGER.error(output) - LOGGER.error("ERROR runBrokerage : %s %s" % (type, value)) - return EC_Failed, None - - -#################################################################################### -# Only the following function -getPandIDsWithJobID- is directly called by the REST # -#################################################################################### -# get PandaIDs for a JobID -def getPandIDsWithJobID(baseURLSSL, jobID, dn=None, nJobs=0, verbose=False, userproxy=None, credpath=None): - # instantiate curl - curl = _Curl() - curl.verbose = verbose - # execute - url = baseURLSSL + '/getPandIDsWithJobID' - data = {'jobID':jobID, 'nJobs':nJobs} - if dn != None: - data['dn'] = dn - - # Temporary solution we cache the proxy file - filehandler, proxyfile = tempfile.mkstemp(dir=credpath) - with open(proxyfile, 'w') as pf: - pf.write(userproxy) - curl.sslCert = proxyfile - curl.sslKey = proxyfile - - status = None - output = None - try: - # call him ... 
- status, output = curl.post(url, data) - except: - type, value, traceBack = sys.exc_info() - LOGGER.error("ERROR getPandIDsWithJobID : %s %s" % (type, value)) - finally: - # Always delete it! - os.close(filehandler) - os.remove(proxyfile) - - if status is not None and status!=0: - LOGGER.debug(str(output)) - return status, None - try: - return status, pickle.loads(output) - except: - type, value, traceBack = sys.exc_info() - LOGGER.error("ERROR getPandIDsWithJobID : %s %s" % (type, value)) - return EC_Failed, None - - -# kill jobs -def killJobs(baseURLSSL, ids, proxy, code=None, verbose=True, useMailAsID=False): - # serialize - strIDs = pickle.dumps(ids) - # instantiate curl - curl = _Curl() - curl.sslCert = proxy - curl.sslKey = proxy - curl.verbose = verbose - # execute - url = baseURLSSL + '/killJobs' - data = {'ids':strIDs,'code':code,'useMailAsID':useMailAsID} - status, output = curl.post(url, data) - try: - return status, pickle.loads(output) - except: - type, value, traceBack = sys.exc_info() - errStr = "ERROR killJobs : %s %s" % (type, value) - print(errStr) - return EC_Failed, output+'\n'+errStr - -# get full job status -def getFullJobStatus(baseURLSSL, ids, proxy, verbose=False): - # serialize - strIDs = pickle.dumps(ids) - # instantiate curl - curl = _Curl() - curl.sslCert = proxy - curl.sslKey = proxy - curl.verbose = verbose - # execute - url = baseURLSSL + '/getFullJobStatus' - data = {'ids':strIDs} - status, output = curl.post(url, data) - try: - return status, pickle.loads(output) - except Exception as ex: - type, value, traceBack = sys.exc_info() - LOGGER.error("ERROR getFullJobStatus : %s %s" % (type, value)) - LOGGER.error(str(traceBack)) - -def putFile(baseURL, baseURLCSRVSSL, file, checksum, verbose=False, reuseSandbox=False): - # size check for noBuild - sizeLimit = 100*1024*1024 - - fileSize = os.stat(file)[stat.ST_SIZE] - if not file.startswith('sources.'): - if fileSize > sizeLimit: - errStr = 'Exceeded size limit (%sB >%sB). ' % (fileSize, sizeLimit) - errStr += 'Your working directory contains too large files which cannot be put on cache area. ' - errStr += 'Please submit job without --noBuild/--libDS so that your files will be uploaded to SE' - # get logger - raise PanDAException(errStr) - # instantiate curl - curl = _Curl() - curl.sslCert = _x509() - curl.sslKey = _x509() - curl.verbose = verbose - # check duplicationn. 
Need to rewrite the reuseSandbox part - if reuseSandbox: - # get CRC - fo = open(file) - fileContent = fo.read() - fo.close() - footer = fileContent[-8:] - checkSum, isize = struct.unpack("II", footer) - # check duplication - url = baseURLSSL + '/checkSandboxFile' - data = {'fileSize':fileSize,'checkSum':checksum} - status, output = curl.post(url, data) - if status != 0: - raise PanDAException('ERROR: Could not check Sandbox duplication with %s' % status) - elif output.startswith('FOUND:'): - # found reusable sandbox - hostName, reuseFileName = output.split(':')[1:] - baseURLCSRV = "http://%s:25080/server/panda" % hostName - baseURLCSRVSSL = "https://%s:25443/server/panda" % hostName - # return reusable filename - return 0, "NewFileName:%s:%s" % (hostName, reuseFileName) - #if no specific cache server is passed through the arguments, then we figure it out by ourselves - if not baseURLCSRVSSL: - url = baseURL + '/getServer' - LOGGER.debug('Contacting %s to figure out panda cache location' % url) - status, pandaCacheInstance = curl.put(url, {}) - baseURLCSRVSSL = 'https://%s//server/panda' % pandaCacheInstance - if status != 0: - raise PanDAException('ERROR: Could not get panda cache address %s' % status) - else: - LOGGER.debug('Using fixed URL (%s) as panda cache' % baseURLCSRVSSL) - # execute - url = baseURLCSRVSSL + '/putFile' - data = {'file':file} - status, output = curl.put(url, data) - if status !=0: - raise PanDAException("Failure uploading input file into PanDa") - else: - matchURL = re.search("(http.*://[^/]+)/", baseURLCSRVSSL) - return 0, "True:%s:%s" % (matchURL.group(1), file.split('/')[-1]) - - -def wrappedUuidGen(): - # check if uuidgen is available - tmpSt, tmpOut = commands.getstatusoutput('which uuidgen') - if tmpSt == 0: - # use uuidgen - st, output = commands.getstatusoutput('uuidgen 2>/dev/null') - if st == 0: - return output - # use python uuidgen - try: - import uuid - except: - raise ImportError('uuidgen and uuid.py are unavailable on your system. Please install one of them') - return str(uuid.uuid4()) - diff --git a/src/python/ProcInfo.py b/src/python/ProcInfo.py deleted file mode 100644 index e5f70918be..0000000000 --- a/src/python/ProcInfo.py +++ /dev/null @@ -1,588 +0,0 @@ -#pylint: skip-file -""" - * ApMon - Application Monitoring Tool - * Version: 2.2.1 - * - * Copyright (C) 2006 California Institute of Technology - * - * Permission is hereby granted, free of charge, to use, copy and modify - * this software and its documentation (the "Software") for any - * purpose, provided that existing copyright notices are retained in - * all copies and that this notice is included verbatim in any distributions - * or substantial portions of the Software. - * This software is a part of the MonALISA framework (http://monalisa.cacr.caltech.edu). - * Users of the Software are asked to feed back problems, benefits, - * and/or suggestions about the software to the MonALISA Development Team - * (developers@monalisa.cern.ch). Support for this software - fixing of bugs, - * incorporation of new features - is done on a best effort basis. All bug - * fixes and enhancements will be made available under the same terms and - * conditions as the original software, - - * IN NO EVENT SHALL THE AUTHORS OR DISTRIBUTORS BE LIABLE TO ANY PARTY FOR - * DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT - * OF THE USE OF THIS SOFTWARE, ITS DOCUMENTATION, OR ANY DERIVATIVES THEREOF, - * EVEN IF THE AUTHORS HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
- - * THE AUTHORS AND DISTRIBUTORS SPECIFICALLY DISCLAIM ANY WARRANTIES, - * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, - * FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. THIS SOFTWARE IS - * PROVIDED ON AN "AS IS" BASIS, AND THE AUTHORS AND DISTRIBUTORS HAVE NO - * OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR - * MODIFICATIONS. -""" -from __future__ import absolute_import - - -import os -import re -import time -import string -import socket -import Logger - -""" -Class ProcInfo -extracts information from the proc/ filesystem for system and job monitoring -""" -class ProcInfo: - # ProcInfo constructor - def __init__ (this, logger): - this.DATA = {}; # monitored data that is going to be reported - this.LAST_UPDATE_TIME = 0; # when the last measurement was done - this.JOBS = {}; # jobs that will be monitored - this.logger = logger # use the given logger - this.OS_TYPE = os.popen('uname -s').readline().replace('\n', ''); - - # This should be called from time to time to update the monitored data, - # but not more often than once a second because of the resolution of time() - def update (this): - if this.LAST_UPDATE_TIME == int(time.time()): - this.logger.log(Logger.NOTICE, "ProcInfo: update() called too often!"); - return; - this.readStat(); - this.readMemInfo(); - if this.OS_TYPE == 'Darwin': - this.darwin_readLoadAvg(); - else: - this.readLoadAvg(); - this.countProcesses(); - this.readGenericInfo(); - this.readNetworkInfo(); - this.readNetStat(); - for pid in this.JOBS.keys(): - this.readJobInfo(pid); - this.readJobDiskUsage(pid); - this.LAST_UPDATE_TIME = int(time.time()); - this.DATA['TIME'] = int(time.time()); - - # Call this to add another PID to be monitored - def addJobToMonitor (this, pid, workDir): - this.JOBS[pid] = {}; - this.JOBS[pid]['WORKDIR'] = workDir; - this.JOBS[pid]['DATA'] = {}; - #print this.JOBS; - - # Call this to stop monitoring a PID - def removeJobToMonitor (this, pid): - if pid in this.JOBS: - del this.JOBS[pid]; - - # Return a filtered hash containting the system-related parameters and values - def getSystemData (this, params, prevDataRef): - return this.getFilteredData(this.DATA, params, prevDataRef); - - # Return a filtered hash containing the job-related parameters and values - def getJobData (this, pid, params): - if pid not in this.JOBS: - return []; - return this.getFilteredData(this.JOBS[pid]['DATA'], params); - - ############################################################################################ - # internal functions for system monitoring - ############################################################################################ - - # this has to be run twice (with the $lastUpdateTime updated) to get some useful results - # the information about pages_in/out and swap_in/out isn't available for 2.6 kernels (yet) - def readStat (this): - try: - FSTAT = open('/proc/stat'); - line = FSTAT.readline(); - while(line != ''): - if(line.startswith("cpu ")): - elem = re.split("\s+", line); - this.DATA['raw_cpu_usr'] = float(elem[1]); - this.DATA['raw_cpu_nice'] = float(elem[2]); - this.DATA['raw_cpu_sys'] = float(elem[3]); - this.DATA['raw_cpu_idle'] = float(elem[4]); - if(line.startswith("page")): - elem = line.split(); - this.DATA['raw_pages_in'] = float(elem[1]); - this.DATA['raw_pages_out'] = float(elem[2]); - if(line.startswith('swap')): - elem = line.split(); - this.DATA['raw_swap_in'] = float(elem[1]); - this.DATA['raw_swap_out'] = float(elem[2]); - line = FSTAT.readline(); - 
FSTAT.close(); - except IOError as ex: - this.logger.log(Logger.ERROR, "ProcInfo: cannot open /proc/stat"); - return; - - # sizes are reported in MB (except _usage that is in percent). - def readMemInfo (this): - try: - FMEM = open('/proc/meminfo'); - line = FMEM.readline(); - while(line != ''): - elem = re.split("\s+", line); - if(line.startswith("MemFree:")): - this.DATA['mem_free'] = float(elem[1]) / 1024.0; - if(line.startswith("MemTotal:")): - this.DATA['total_mem'] = float(elem[1]) / 1024.0; - if(line.startswith("SwapFree:")): - this.DATA['swap_free'] = float(elem[1]) / 1024.0; - if(line.startswith("SwapTotal:")): - this.DATA['total_swap'] = float(elem[1]) / 1024.0; - line = FMEM.readline(); - FMEM.close(); - if 'total_mem' in this.DATA and 'mem_free' in this.DATA: - this.DATA['mem_used'] = this.DATA['total_mem'] - this.DATA['mem_free']; - if 'total_swap' in this.DATA and 'swap_free' in this.DATA: - this.DATA['swap_used'] = this.DATA['total_swap'] - this.DATA['swap_free']; - if 'mem_used' in this.DATA and 'total_mem' in this.DATA and this.DATA['total_mem'] > 0: - this.DATA['mem_usage'] = 100.0 * this.DATA['mem_used'] / this.DATA['total_mem']; - if 'swap_used' in this.DATA and 'total_swap' in this.DATA and this.DATA['total_swap'] > 0: - this.DATA['swap_usage'] = 100.0 * this.DATA['swap_used'] / this.DATA['total_swap']; - except IOError as ex: - this.logger.log(Logger.ERROR, "ProcInfo: cannot open /proc/meminfo"); - return; - - # read system load average - def readLoadAvg (this): - try: - FAVG = open('/proc/loadavg'); - line = FAVG.readline(); - FAVG.close(); - elem = re.split("\s+", line); - this.DATA['load1'] = float(elem[0]); - this.DATA['load5'] = float(elem[1]); - this.DATA['load15'] = float(elem[2]); - except IOError as ex: - this.logger.log(Logger.ERROR, "ProcInfo: cannot open /proc/meminfo"); - return; - - - # read system load average on Darwin - def darwin_readLoadAvg (this): - try: - LOAD_AVG = os.popen('sysctl vm.loadavg'); - line = LOAD_AVG.readline(); - LOAD_AVG.close(); - elem = re.split("\s+", line); - this.DATA['load1'] = float(elem[1]); - this.DATA['load5'] = float(elem[2]); - this.DATA['load15'] = float(elem[3]); - except IOError as ex: - this.logger.log(Logger.ERROR, "ProcInfo: cannot run 'sysctl vm.loadavg"); - return; - - - # read the number of processes currently running on the system - def countProcesses (this): - """ - # old version - nr = 0; - try: - for file in os.listdir("/proc"): - if re.match("\d+", file): - nr += 1; - this.DATA['processes'] = nr; - except IOError, ex: - this.logger.log(Logger.ERROR, "ProcInfo: cannot open /proc to count processes"); - return; - """ - # new version - total = 0; - states = {'D':0, 'R':0, 'S':0, 'T':0, 'Z':0}; - try: - output = os.popen('ps -A -o state'); - line = output.readline(); - while(line != ''): - states[line[0]] = states[line[0]] + 1; - total = total + 1; - line = output.readline(); - output.close(); - this.DATA['processes'] = total; - for key in states.keys(): - this.DATA['processes_'+key] = states[key]; - except IOError as ex: - this.logger.log(Logger.ERROR, "ProcInfo: cannot get output from ps command"); - return; - - # reads the IP, hostname, cpu_MHz, uptime - def readGenericInfo (this): - this.DATA['hostname'] = socket.getfqdn(); - try: - output = os.popen('/sbin/ifconfig -a') - eth, ip = '', ''; - line = output.readline(); - while(line != ''): - line = line.strip(); - if line.startswith("eth"): - elem = line.split(); - eth = elem[0]; - ip = ''; - if len(eth) > 0 and line.startswith("inet addr:"): - ip = 
re.match("inet addr:(\d+\.\d+\.\d+\.\d+)", line).group(1); - this.DATA[eth + '_ip'] = ip; - eth = ''; - line = output.readline(); - output.close(); - except IOError as ex: - this.logger.log(Logger.ERROR, "ProcInfo: cannot get output from /sbin/ifconfig -a"); - return; - try: - no_cpus = 0; - FCPU = open('/proc/cpuinfo'); - line = FCPU.readline(); - while(line != ''): - if line.startswith("cpu MHz"): - this.DATA['cpu_MHz'] = float(re.match("cpu MHz\s+:\s+(\d+\.?\d*)", line).group(1)); - no_cpus += 1; - - if line.startswith("vendor_id"): - this.DATA['cpu_vendor_id'] = re.match("vendor_id\s+:\s+(.+)", line).group(1); - - if line.startswith("cpu family"): - this.DATA['cpu_family'] = re.match("cpu family\s+:\s+(.+)", line).group(1); - - if line.startswith("model") and not line.startswith("model name") : - this.DATA['cpu_model'] = re.match("model\s+:\s+(.+)", line).group(1); - - if line.startswith("model name"): - this.DATA['cpu_model_name'] = re.match("model name\s+:\s+(.+)", line).group(1); - - if line.startswith("bogomips"): - this.DATA['bogomips'] = float(re.match("bogomips\s+:\s+(\d+\.?\d*)", line).group(1)); - - line = FCPU.readline(); - FCPU.close(); - this.DATA['no_CPUs'] = no_cpus; - except IOError as ex: - this.logger.log(Logger.ERROR, "ProcInfo: cannot open /proc/cpuinfo"); - return; - try: - FUPT = open('/proc/uptime'); - line = FUPT.readline(); - FUPT.close(); - elem = line.split(); - this.DATA['uptime'] = float(elem[0]) / (24.0 * 3600); - except IOError as ex: - this.logger.log(Logger.ERROR, "ProcInfo: cannot open /proc/uptime"); - return; - - # do a difference with overflow check and repair - # the counter is unsigned 32 or 64 bit - def diffWithOverflowCheck(this, new, old): - if new >= old: - return new - old; - else: - max = (1 << 31) * 2; # 32 bits - if old >= max: - max = (1 << 63) * 2; # 64 bits - return new - old + max; - - # read network information like transfered kBps and nr. 
of errors on each interface - def readNetworkInfo (this): - try: - FNET = open('/proc/net/dev'); - line = FNET.readline(); - while(line != ''): - m = re.match("\s*eth(\d):(\d+)\s+\d+\s+(\d+)\s+\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+(\d+)\s+\d+\s+(\d+)", line); - if m != None: - this.DATA['raw_eth'+m.group(1)+'_in'] = float(m.group(2)); - this.DATA['raw_eth'+m.group(1)+'_out'] = float(m.group(4)); - this.DATA['raw_eth'+m.group(1)+'_errs'] = int(m.group(3)) + int(m.group(5)); - line = FNET.readline(); - FNET.close(); - except IOError as ex: - this.logger.log(Logger.ERROR, "ProcInfo: cannot open /proc/net/dev"); - return; - - # run nestat and collect sockets info (tcp, udp, unix) and connection states for tcp sockets from netstat - def readNetStat(this): - try: - output = os.popen('netstat -an 2>/dev/null'); - sockets = { 'sockets_tcp':0, 'sockets_udp':0, 'sockets_unix':0, 'sockets_icm':0 }; - tcp_details = { 'sockets_tcp_ESTABLISHED':0, 'sockets_tcp_SYN_SENT':0, - 'sockets_tcp_SYN_RECV':0, 'sockets_tcp_FIN_WAIT1':0, 'sockets_tcp_FIN_WAIT2':0, - 'sockets_tcp_TIME_WAIT':0, 'sockets_tcp_CLOSED':0, 'sockets_tcp_CLOSE_WAIT':0, - 'sockets_tcp_LAST_ACK':0, 'sockets_tcp_LISTEN':0, 'sockets_tcp_CLOSING':0, - 'sockets_tcp_UNKNOWN':0 }; - line = output.readline(); - while(line != ''): - arg = string.split(line); - proto = arg[0]; - if proto.find('tcp') == 0: - sockets['sockets_tcp'] += 1; - state = arg[len(arg)-1]; - key = 'sockets_tcp_'+state; - if key in tcp_details: - tcp_details[key] += 1; - if proto.find('udp') == 0: - sockets['sockets_udp'] += 1; - if proto.find('unix') == 0: - sockets['sockets_unix'] += 1; - if proto.find('icm') == 0: - sockets['sockets_icm'] += 1; - - line = output.readline(); - output.close(); - - for key in sockets.keys(): - this.DATA[key] = sockets[key]; - for key in tcp_details.keys(): - this.DATA[key] = tcp_details[key]; - except IOError as ex: - this.logger.log(Logger.ERROR, "ProcInfo: cannot get output from netstat command"); - return; - - ############################################################################################## - # job monitoring related functions - ############################################################################################## - - # internal function that gets the full list of children (pids) for a process (pid) - def getChildren (this, parent): - pidmap = {}; - try: - output = os.popen('ps -A -o "pid ppid"'); - line = output.readline(); # skip headers - line = output.readline(); - while(line != ''): - line = line.strip(); - elem = re.split("\s+", line); - pidmap[elem[0]] = elem[1]; - line = output.readline(); - output.close(); - except IOError as ex: - this.logger.log(Logger.ERROR, "ProcInfo: cannot execute ps -A -o \"pid ppid\""); - - if parent in pidmap: - this.logger.log(Logger.INFO, 'ProcInfo: No job with pid='+str(parent)); - this.removeJobToMonitor(parent); - return []; - - children = [parent]; - i = 0; - while(i < len(children)): - prnt = children[i]; - for (pid, ppid) in pidmap.items(): - if ppid == prnt: - children.append(pid); - i += 1; - return children; - - # internal function that parses a time formatted like "days-hours:min:sec" and returns the corresponding - # number of seconds. 
- def parsePSTime (this, my_time): - my_time = my_time.strip(); - m = re.match("(\d+)-(\d+):(\d+):(\d+)", my_time); - if m != None: - return int(m.group(1)) * 24 * 3600 + int(m.group(2)) * 3600 + int(m.group(3)) * 60 + int(m.group(4)); - else: - m = re.match("(\d+):(\d+):(\d+)", my_time); - if(m != None): - return int(m.group(1)) * 3600 + int(m.group(2)) * 60 + int(m.group(3)); - else: - m = re.match("(\d+):(\d+)", my_time); - if(m != None): - return int(m.group(1)) * 60 + int(m.group(2)); - else: - return 0; - - # read information about this the JOB_PID process - # memory sizes are given in KB - def readJobInfo (this, pid): - if (pid == '') or pid not in this.JOBS: - return; - children = this.getChildren(pid); - if(len(children) == 0): - this.logger.log(Logger.INFO, "ProcInfo: Job with pid="+str(pid)+" terminated; removing it from monitored jobs."); - #print ":(" - this.removeJobToMonitor(pid); - return; - try: - JSTATUS = os.popen("ps --no-headers --pid " + ",".join([repr(child) for child in children]) + " -o pid,etime,time,%cpu,%mem,rsz,vsz,comm"); - mem_cmd_map = {}; - etime, cputime, pcpu, pmem, rsz, vsz, comm, fd = 0, 0, 0, 0, 0, 0, 0, 0; - line = JSTATUS.readline(); - while(line != ''): - line = line.strip(); - m = re.match("(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(.+)", line); - if m != None: - apid, etime1, cputime1, pcpu1, pmem1, rsz1, vsz1, comm1 = m.group(1), m.group(2), m.group(3), m.group(4), m.group(5), m.group(6), m.group(7), m.group(8); - sec = this.parsePSTime(etime1); - if sec > etime: # the elapsed time is the maximum of all elapsed - etime = sec; - sec = this.parsePSTime(cputime1); # times corespornding to all child processes. - cputime += sec; # total cputime is the sum of cputimes for all processes. - pcpu += float(pcpu1); # total %cpu is the sum of all children %cpu. - if repr(pmem1)+" "+repr(rsz1)+" "+repr(vsz1)+" "+repr(comm1) not in mem_cmd_map: - # it's the first thread/process with this memory footprint; add it. 
- mem_cmd_map[repr(pmem1)+" "+repr(rsz1)+" "+repr(vsz1)+" "+repr(comm1)] = 1; - pmem += float(pmem1); rsz += int(rsz1); vsz += int(vsz1); - fd += this.countOpenFD(apid); - # else not adding memory usage - line = JSTATUS.readline(); - JSTATUS.close(); - this.JOBS[pid]['DATA']['run_time'] = etime; - this.JOBS[pid]['DATA']['cpu_time'] = cputime; - this.JOBS[pid]['DATA']['cpu_usage'] = pcpu; - this.JOBS[pid]['DATA']['mem_usage'] = pmem; - this.JOBS[pid]['DATA']['rss'] = rsz; - this.JOBS[pid]['DATA']['virtualmem'] = vsz; - this.JOBS[pid]['DATA']['open_files'] = fd; - except IOError as ex: - this.logger.log(Logger.ERROR, "ProcInfo: cannot execute ps --no-headers -eo \"pid ppid\""); - - # count the number of open files for the given pid - def countOpenFD (this, pid): - dir = '/proc/'+str(pid)+'/fd'; - if os.access(dir, os.F_OK): - if os.access(dir, os.X_OK): - list = os.listdir(dir); - open_files = len(list); - if pid == os.getpid(): - open_files -= 2; - this.logger.log(Logger.DEBUG, "Counting open_files for "+ repr(pid) +": "+ str(len(list)) +" => " + repr(open_files) + " open_files"); - return open_files; - else: - this.logger.log(Logger.ERROR, "ProcInfo: cannot count the number of opened files for job "+repr(pid)); - else: - this.logger.log(Logger.ERROR, "ProcInfo: job "+repr(pid)+" dosen't exist"); - - - # if there is an work directory defined, then compute the used space in that directory - # and the free disk space on the partition to which that directory belongs - # sizes are given in MB - def readJobDiskUsage (this, pid): - if (pid == '') or pid not in this.JOBS: - return; - workDir = this.JOBS[pid]['WORKDIR']; - if workDir == '': - return; - try: - DU = os.popen("du -Lsck " + workDir + " | tail -1 | cut -f 1"); - line = DU.readline(); - this.JOBS[pid]['DATA']['workdir_size'] = int(line) / 1024.0; - except IOError as ex: - this.logger.log(Logger.ERROR, "ERROR", "ProcInfo: cannot run du to get job's disk usage for job "+repr(pid)); - try: - DF = os.popen("df -k "+workDir+" | tail -1"); - line = DF.readline().strip(); - m = re.match("\S+\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)%", line); - if m != None: - this.JOBS[pid]['DATA']['disk_total'] = float(m.group(1)) / 1024.0; - this.JOBS[pid]['DATA']['disk_used'] = float(m.group(2)) / 1024.0; - this.JOBS[pid]['DATA']['disk_free'] = float(m.group(3)) / 1024.0; - this.JOBS[pid]['DATA']['disk_usage'] = float(m.group(4)) / 1024.0; - DF.close(); - except IOError as ex: - this.logger.log(Logger.ERROR, "ERROR", "ProcInfo: cannot run df to get job's disk usage for job "+repr(pid)); - - # create cummulative parameters based on raw params like cpu_, pages_, swap_, or ethX_ - def computeCummulativeParams(this, dataRef, prevDataRef): - if prevDataRef == {}: - for key in dataRef.keys(): - if key.find('raw_') == 0: - prevDataRef[key] = dataRef[key]; - prevDataRef['TIME'] = dataRef['TIME']; - return; - - # cpu -related params - if ('raw_cpu_usr' in dataRef) and ('raw_cpu_usr' in prevDataRef): - diff={}; - cpu_sum = 0; - for param in ['cpu_usr', 'cpu_nice', 'cpu_sys', 'cpu_idle']: - diff[param] = this.diffWithOverflowCheck(dataRef['raw_'+param], prevDataRef['raw_'+param]); - cpu_sum += diff[param]; - for param in ['cpu_usr', 'cpu_nice', 'cpu_sys', 'cpu_idle']: - if cpu_sum != 0: - dataRef[param] = 100.0 * diff[param] / cpu_sum; - else: - del dataRef[param]; - if cpu_sum != 0: - dataRef['cpu_usage'] = 100.0 * (cpu_sum - diff['cpu_idle']) / cpu_sum; - else: - del dataRef['cpu_usage']; - - # swap & pages -related params - if ('raw_pages_in' in dataRef) and ('raw_pages_in' 
in prevDataRef): - interval = dataRef['TIME'] - prevDataRef['TIME']; - for param in ['pages_in', 'pages_out', 'swap_in', 'swap_out']: - diff = this.diffWithOverflowCheck(dataRef['raw_'+param], prevDataRef['raw_'+param]); - if interval != 0: - dataRef[param] = 1000.0 * diff / interval; - else: - del dataRef[param]; - - # eth - related params - interval = dataRef['TIME'] - prevDataRef['TIME']; - for rawParam in dataRef.keys(): - if (rawParam.find('raw_eth') == 0) and rawParam in prevDataRef: - param = rawParam.split('raw_')[1]; - if interval != 0: - dataRef[param] = this.diffWithOverflowCheck(dataRef[rawParam], prevDataRef[rawParam]); # absolute difference - if param.find('_errs') == -1: - dataRef[param] = dataRef[param] / interval / 1024.0; # if it's _in or _out, compute in KB/sec - else: - del dataRef[param]; - - # copy contents of the current data values to the - for param in dataRef.keys(): - if param.find('raw_') == 0: - prevDataRef[param] = dataRef[param]; - prevDataRef['TIME'] = dataRef['TIME']; - - - # Return a hash containing (param,value) pairs with existing values from the requested ones - def getFilteredData (this, dataHash, paramsList, prevDataHash = None): - - if not prevDataHash is None: - this.computeCummulativeParams(dataHash, prevDataHash); - - result = {}; - for param in paramsList: - if param == 'net_sockets': - for key in dataHash.keys(): - if key.find('sockets') == 0 and key.find('sockets_tcp_') == -1: - result[key] = dataHash[key]; - elif param == 'net_tcp_details': - for key in dataHash.keys(): - if key.find('sockets_tcp_') == 0: - result[key] = dataHash[key]; - - m = re.match("^net_(.*)$", param); - if m == None: - m = re.match("^(ip)$", param); - if m != None: - net_param = m.group(1); - #this.logger.log(Logger.DEBUG, "Querying param "+net_param); - for key, value in dataHash.items(): - m = re.match("eth\d_"+net_param, key); - if m != None: - result[key] = value; - else: - if param == 'processes': - for key in dataHash.keys(): - if key.find('processes') == 0: - result[key] = dataHash[key]; - elif param in dataHash: - result[param] = dataHash[param]; - sorted_result = []; - keys = result.keys(); - keys.sort(); - for key in keys: - sorted_result.append((key, result[key])); - return sorted_result; - diff --git a/src/python/Publisher/PublisherMaster.py b/src/python/Publisher/PublisherMaster.py index 76ea0cd1ed..0ea0bab51f 100644 --- a/src/python/Publisher/PublisherMaster.py +++ b/src/python/Publisher/PublisherMaster.py @@ -19,16 +19,20 @@ import traceback import sys import json +import pickle +import tempfile from datetime import datetime import time from multiprocessing import Process from MultiProcessingLog import MultiProcessingLog from WMCore.Configuration import loadConfigurationFile +from WMCore.Services.Requests import Requests from RESTInteractions import CRABRest from ServerUtilities import getColumn, encodeRequest, oracleOutputMapping, executeCommand from ServerUtilities import SERVICE_INSTANCES +from ServerUtilities import getProxiedWebDir from TaskWorker import __version__ from TaskWorker.WorkerExceptions import ConfigException @@ -43,24 +47,24 @@ def chunks(l, n): for i in range(0, len(l), n): yield l[i:i + n] -def setMasterLogger(name='master'): +def setMasterLogger(logsDir, name='master'): """ Set the logger for the master process. 
The file used for it is logs/processes/proc.name.txt and it can be retrieved with logging.getLogger(name) in other parts of the code """ logger = logging.getLogger(name) - fileName = os.path.join('logs', 'processes', "proc.c3id_%s.pid_%s.txt" % (name, os.getpid())) + fileName = os.path.join(logsDir, 'processes', "proc.c3id_%s.pid_%s.txt" % (name, os.getpid())) handler = TimedRotatingFileHandler(fileName, 'midnight', backupCount=30) formatter = logging.Formatter("%(asctime)s:%(levelname)s:"+name+":%(message)s") handler.setFormatter(formatter) logger.addHandler(handler) return logger -def setSlaveLogger(name): +def setSlaveLogger(logsDir, name): """ Set the logger for a single slave process. The file used for it is logs/processes/proc.name.txt and it can be retrieved with logging.getLogger(name) in other parts of the code """ logger = logging.getLogger(name) - fileName = os.path.join('logs', 'processes', "proc.c3id_%s.txt" % name) + fileName = os.path.join(logsDir, 'processes', "proc.c3id_%s.txt" % name) #handler = TimedRotatingFileHandler(fileName, 'midnight', backupCount=30) # slaves are short lived, use one log file for each handler = FileHandler(fileName) @@ -124,11 +128,11 @@ def setRootLogger(logsDir, quiet=False, debug=True, console=False): if debug: loglevel = logging.DEBUG logging.getLogger().setLevel(loglevel) - logger = setMasterLogger() + logger = setMasterLogger(logsDir) logger.debug("PID %s.", os.getpid()) logger.debug("Logging level initialized to %s.", loglevel) return logger - + def logVersionAndConfig(config=None, logger=None): """ log version number and major config. parameters @@ -206,14 +210,14 @@ def logVersionAndConfig(config=None, logger=None): # CRAB REST API's self.max_files_per_block = self.config.max_files_per_block self.crabServer = CRABRest(hostname=restHost, localcert=self.config.serviceCert, - localkey=self.config.serviceKey, retry=3, - userAgent='CRABPublisher') + localkey=self.config.serviceKey, retry=3, + userAgent='CRABPublisher') self.crabServer.setDbInstance(dbInstance=dbInstance) self.startTime = time.time() - def active_tasks(self, crabserver): + def active_tasks(self, crabServer): """ - :param crabserver: CRABRest object to access proper REST as createdin __init__ method + :param crabServer: CRABRest object to access proper REST as createdin __init__ method TODO detail here the strucutre it returns :return: a list tuples [(task,info)]. One element for each task which has jobs to be published each list element has the format (task, info) where task and info are lists @@ -231,7 +235,7 @@ def active_tasks(self, crabserver): asoworkers = self.config.asoworker # asoworkers can be a string or a list of strings # but if it is a string, do not turn it into a list of chars ! 
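# Editor's note (illustrative sketch, not part of the patch): the
# isinstance(..., str) guard on the next patch lines (basestring -> str for
# Python 3) keeps a single configured asoworker name from being iterated
# character by character. Self-contained illustration; the function name and
# the worker names are made up:
def normalizeAsoworkers(asoworkers):
    """Always return a list, whether a str or a list of str is configured."""
    return [asoworkers] if isinstance(asoworkers, str) else asoworkers

assert normalizeAsoworkers("schedd01") == ["schedd01"]            # not ['s', 'c', 'h', ...]
assert normalizeAsoworkers(["schedd01", "schedd02"]) == ["schedd01", "schedd02"]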
- asoworkers = [asoworkers] if isinstance(asoworkers, basestring) else asoworkers + asoworkers = [asoworkers] if isinstance(asoworkers, str) else asoworkers for asoworker in asoworkers: self.logger.info("Processing publication requests for asoworker: %s", asoworker) fileDoc = {} @@ -239,7 +243,7 @@ def active_tasks(self, crabserver): fileDoc['subresource'] = 'acquirePublication' data = encodeRequest(fileDoc) try: - result = crabserver.post(api='filetransfers', data=data) # pylint: disable=unused-variable + result = crabServer.post(api='filetransfers', data=data) # pylint: disable=unused-variable except Exception as ex: self.logger.error("Failed to acquire publications from crabserver: %s", ex) return [] @@ -252,7 +256,7 @@ def active_tasks(self, crabserver): fileDoc['limit'] = 100000 data = encodeRequest(fileDoc) try: - results = crabserver.get(api='filetransfers', data=data) + results = crabServer.get(api='filetransfers', data=data) except Exception as ex: self.logger.error("Failed to acquire publications from crabserver: %s", ex) return [] @@ -271,7 +275,7 @@ def active_tasks(self, crabserver): info = [] for task in unique_tasks: info.append([x for x in filesToPublish if x['taskname'] == task[3]]) - return zip(unique_tasks, info) + return list(zip(unique_tasks, info)) def getPublDescFiles(self, workflow, lfn_ready, logger): """ @@ -303,6 +307,83 @@ def getPublDescFiles(self, workflow, lfn_ready, logger): logger.info('Got filemetadata for %d LFNs', len(out)) return out + def getTaskStatusFromSched(self, workflow, logger): + + def translateStatus(statusToTr): + """Translate from DAGMan internal integer status to a string. + + Uses parameter `dbstatus` to clarify if the state is due to the + user killing the task. See: + https://htcondor.readthedocs.io/en/latest/users-manual/dagman-workflows.html#capturing-the-status-of-nodes-in-a-file + """ + status = {0:'PENDING', 1: 'SUBMITTED', 2: 'SUBMITTED', 3: 'SUBMITTED', + 4: 'SUBMITTED', 5: 'COMPLETED', 6: 'FAILED'}[statusToTr] + return status + + def collapseDAGStatus(dagInfo): + """Collapse the status of one or several DAGs to a single one. + Take into account that subdags can be submitted to the queue on the + schedd, but not yet started. + """ + status_order = ['PENDING', 'SUBMITTED', 'FAILED', 'COMPLETED'] + + subDagInfos = dagInfo.get('SubDags', {}) + subDagStatus = dagInfo.get('SubDagStatus', {}) + # Regular splitting, return status of DAG + if len(subDagInfos) == 0 and len(subDagStatus) == 0: + return translateStatus(dagInfo['DagStatus']) + + def check_queued(statusOrSUBMITTED): + # 99 is the status for a subDAG still to be submitted. An ad-hoc value + # introduced in cache_status.py. If there are less + # actual DAG status informations than expected DAGs, at least one + # DAG has to be queued. + if len(subDagInfos) < len([k for k in subDagStatus if subDagStatus[k] == 99]): + return 'SUBMITTED' + return statusOrSUBMITTED + + # If the processing DAG is still running, we are 'SUBMITTED', + # still. + if len(subDagInfos) > 0: + state = translateStatus(subDagInfos[0]['DagStatus']) + if state == 'SUBMITTED': + return state + # Tails active: return most active tail status according to + # `status_order` + if len(subDagInfos) > 1: + states = [translateStatus(subDagInfos[k]['DagStatus']) for k in subDagInfos if k > 0] + for iStatus in status_order: + if states.count(iStatus) > 0: + return check_queued(iStatus) + # If no tails are active, return the status of the processing DAG. 
+ if len(subDagInfos) > 0: + return check_queued(translateStatus(subDagInfos[0]['DagStatus'])) + return check_queued(translateStatus(dagInfo['DagStatus'])) + + crabDBInfo, _, _ = self.crabServer.get(api='task', data={'subresource':'search', 'workflow':workflow}) + dbStatus = getColumn(crabDBInfo, 'tm_task_status') + if dbStatus == 'KILLED': + return 'KILLED' + proxiedWebDir = getProxiedWebDir(crabserver=self.crabServer, task=workflow, logFunction=logger) + # Download status_cache file + _, local_status_cache_pkl = tempfile.mkstemp(dir='/tmp', prefix='status-cache-', suffix='.pkl') + url = proxiedWebDir + "/status_cache.pkl" + # this host is dummy since we will pass full url to downloadFile but WMCore.Requests needs it + host = 'https://cmsweb.cern.ch' + cdict = {'cert':self.config.serviceCert, 'key':self.config.serviceKey} + req = Requests(url=host, idict=cdict) + _, ret = req.downloadFile(local_status_cache_pkl, url) + if not ret.status == 200: + raise Exception('download attempt returned HTTP code %d' % ret.status) + with open(local_status_cache_pkl, 'rb') as fp: + statusCache = pickle.load(fp) + # get DAG status from downloaded cache_status file + statusCacheInfo = statusCache['nodes'] + dagInfo = statusCacheInfo['DagStatus'] + dagStatus = collapseDAGStatus(dagInfo) # takes care of possible multiple subDAGs for automatic splitting + status = dagStatus + return status + def algorithm(self): """ 1. Get a list of files to publish from the REST and organize by taskname @@ -323,17 +404,15 @@ def algorithm(self): maxSlaves = self.config.max_slaves self.logger.info('kicking off pool for %s tasks using up to %s concurrent slaves', len(tasks), maxSlaves) - #self.logger.debug('list of tasks %s', [x[0][3] for x in tasks]) - ## print one line per task with the number of files to be published. Allow to find stuck tasks + # print one line per task with the number of files to be published. Allow to find stuck tasks self.logger.debug(' # of acquired files : taskname') for task in tasks: taskName = task[0][3] acquiredFiles = len(task[1]) - flag = '(***)' if acquiredFiles > 1000 else ' ' # mark suspicious tasks - self.logger.debug('%s %5d : %s', flag, acquiredFiles, taskName) + flag = ' OK' if acquiredFiles < 1000 else 'WARN' # mark suspicious tasks + self.logger.debug('acquired_files: %s %5d : %s', flag, acquiredFiles, taskName) processes = [] - try: for task in tasks: taskname = str(task[0][3]) @@ -386,9 +465,8 @@ def startSlave(self, task): :param task: one tupla describing a task as returned by active_tasks() :return: 0 It will always terminate normally, if publication fails it will mark it in the DB """ - # TODO: lock task! # - process logger - logger = setSlaveLogger(str(task[0][3])) + logger = setSlaveLogger(self.config.logsDir, str(task[0][3])) logger.info("Process %s is starting. PID %s", task[0][3], os.getpid()) self.force_publication = False @@ -400,36 +478,18 @@ def startSlave(self, task): msg += " No need to retrieve task status nor last publication time." logger.info(msg) else: - msg = "At least one dataset has less than %s ready files." % (self.max_files_per_block) + msg = "At least one dataset has less than %s ready files. Retrieve task status" % (self.max_files_per_block) logger.info(msg) - # Retrieve the workflow status. If the status can not be retrieved, continue - # with the next workflow. 
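# Editor's note (illustrative sketch, not part of the patch): worked example
# for the translateStatus()/collapseDAGStatus() logic added above, for an
# automatic-splitting task whose processing DAG has completed while the first
# tail DAG is still running. The integers are DAGMan status codes as mapped in
# translateStatus(); the dictionary content is made up for illustration:
exampleDagInfo = {
    'DagStatus': 5,                        # overall DAG status: COMPLETED per the mapping above
    'SubDags': {0: {'DagStatus': 5},       # processing DAG: COMPLETED
                1: {'DagStatus': 3}},      # tail DAG: still SUBMITTED
    'SubDagStatus': {0: 5, 1: 3},
}
# collapseDAGStatus(exampleDagInfo) would return 'SUBMITTED': the most active
# tail status wins over the completed processing DAG, so startSlave() does not
# treat the task as terminal yet.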
- workflow_status = '' - msg = "Retrieving status" - logger.info(msg) - data = encodeRequest({'workflow': workflow}) - try: - res = self.crabServer.get(api='workflow', data=data) - except Exception as ex: - logger.warn('Error retrieving status from crabserver for %s:\n%s', workflow, str(ex)) - return 0 - try: - workflow_status = res[0]['result'][0]['status'] - msg = "Task status is %s." % workflow_status - logger.info(msg) - except ValueError: - msg = "Workflow removed from WM." - logger.error(msg) - workflow_status = 'REMOVED' + workflow_status = self.getTaskStatusFromSched(workflow, logger) except Exception as ex: - msg = "Error loading task status!" - msg += str(ex) - msg += str(traceback.format_exc()) - logger.error(msg) + logger.warn('Error retrieving status cache from sched for %s:\n%s', workflow, str(ex)) + logger.warn('Assuming COMPLETED in order to force pending publications if any') + workflow_status = 'COMPLETED' + logger.info('Task status from DAG info: %s', workflow_status) # If the workflow status is terminal, go ahead and publish all the ready files # in the workflow. - if workflow_status in ['COMPLETED', 'FAILED', 'KILLED', 'REMOVED']: + if workflow_status in ['COMPLETED', 'FAILED', 'KILLED', 'REMOVED', 'FAILED (KILLED)']: self.force_publication = True if workflow_status in ['KILLED', 'REMOVED']: self.force_failure = True @@ -437,9 +497,7 @@ def startSlave(self, task): logger.info(msg) # Otherwise... else: ## TODO put this else in a function like def checkForPublication() - msg = "Task status is not considered terminal." - logger.info(msg) - msg = "Getting last publication time." + msg = "Task status is not considered terminal. Will check last publication time." logger.info(msg) # Get when was the last time a publication was done for this workflow (this # should be more or less independent of the output dataset in case there are @@ -448,10 +506,10 @@ def startSlave(self, task): data = encodeRequest({'workflow':workflow, 'subresource':'search'}) try: result = self.crabServer.get(api='task', data=data) - logger.debug("task: %s ", str(result[0])) + #logger.debug("task: %s ", str(result[0])) last_publication_time = getColumn(result[0], 'tm_last_publication') except Exception as ex: - logger.error("Error during task doc retrieving:\n%s", ex) + logger.error("Error during task info retrieving:\n%s", ex) if last_publication_time: date = last_publication_time # datetime in Oracle format timetuple = datetime.strptime(date, "%Y-%m-%d %H:%M:%S.%f").timetuple() # convert to time tuple @@ -580,7 +638,7 @@ def startSlave(self, task): # find the location in the current environment of the script we want to run import Publisher.TaskPublish as tp taskPublishScript = tp.__file__ - cmd = "python %s " % taskPublishScript + cmd = "python3 %s " % taskPublishScript cmd += " --configFile=%s" % self.configurationFile cmd += " --taskname=%s" % workflow if self.TPconfig.dryRun: diff --git a/src/python/Publisher/TaskPublish.py b/src/python/Publisher/TaskPublish.py index ac28e17fd0..32bb1fcc48 100644 --- a/src/python/Publisher/TaskPublish.py +++ b/src/python/Publisher/TaskPublish.py @@ -29,7 +29,7 @@ def format_file_3(file_): """ nf = {'logical_file_name': file_['lfn'], 'file_type': 'EDM', - 'check_sum': unicode(file_['cksum']), + 'check_sum': file_['cksum'], 'event_count': file_['inevents'], 'file_size': file_['filesize'], 'adler32': file_['adler32'], @@ -280,6 +280,7 @@ def requestBlockMigration(taskname, migrateApi, sourceApi, block, migLogDir): msg += "\nRequest detail: %s" % data msg += "\nDBS3 
exception: %s" % ex logger.error(msg) + return reqid, atDestination, alreadyQueued if not atDestination: msg = "Result of migration request: %s" % str(result) logger.info(msg) @@ -323,8 +324,8 @@ def requestBlockMigration(taskname, migrateApi, sourceApi, block, migLogDir): try: status = migrateApi.statusMigration(block_name=block) reqid = status[0].get('migration_request_id') - except Exception: - msg = "Could not get status for already queued migration of block %s." % (block) + except Exception as ex: + msg = "Could not get status for already queued migration of block %s.\n%s" % (block, ex) logger.error(msg) return reqid, atDestination, alreadyQueued @@ -532,8 +533,8 @@ def saveSummaryJson(logdir, summary): if not sourceURL.endswith("/DBSReader") and not sourceURL.endswith("/DBSReader/"): sourceURL += "/DBSReader" - except Exception: - logger.exception("ERROR") + except Exception as ex: + logger.exception("ERROR: %s", ex) # When looking up parents may need to look in global DBS as well. globalURL = sourceURL @@ -572,8 +573,8 @@ def saveSummaryJson(logdir, summary): destReadApi = dbsClient.DbsApi(url=publish_read_url) logger.info("DBS Migration API URL: %s", publish_migrate_url) migrateApi = dbsClient.DbsApi(url=publish_migrate_url) - except Exception: - logger.exception('Wrong DBS URL %s', publish_dbs_url) + except Exception as ex: + logger.exception('Error creating DBS APIs, likely wrong DBS URL %s\n%s', publish_dbs_url, ex) nothingToDo['result'] = 'FAIL' nothingToDo['reason'] = 'Error contacting DBS' summaryFileName = saveSummaryJson(logdir, nothingToDo) @@ -586,11 +587,17 @@ def saveSummaryJson(logdir, summary): try: existing_datasets = sourceApi.listDatasets(dataset=inputDataset, detail=True, dataset_access_type='*') primary_ds_type = existing_datasets[0]['primary_ds_type'] - # There's little chance this is correct, but it's our best guess for now. - # CRAB2 uses 'crab2_tag' for all cases + except Exception as ex: + logger.exception('Error looking up input dataset in %s\n%s', sourceApi.url, ex) + nothingToDo['result'] = 'FAIL' + nothingToDo['reason'] = 'Error looking up input dataset in DBS' + summaryFileName = saveSummaryJson(logdir, nothingToDo) + return summaryFileName + + try: existing_output = destReadApi.listOutputConfigs(dataset=inputDataset) - except Exception: - logger.exception('Wrong DBS URL %s', publish_dbs_url) + except Exception as ex: + logger.exception('Error from listOutputConfigs in %s\n%s', destReadApi.url, ex) nothingToDo['result'] = 'FAIL' nothingToDo['reason'] = 'Error looking up input dataset in DBS' summaryFileName = saveSummaryJson(logdir, nothingToDo) @@ -621,7 +628,7 @@ def saveSummaryJson(logdir, summary): acquisitionera = str(toPublish[0]['acquisitionera']) else: acquisitionera = acquisition_era_name - except Exception: + except Exception as ex: acquisitionera = acquisition_era_name _, primName, procName, tier = toPublish[0]['outdataset'].split('/') @@ -632,7 +639,14 @@ def saveSummaryJson(logdir, summary): if dryRun: logger.info("DryRun: skip insertPrimaryDataset") else: - destApi.insertPrimaryDataset(primds_config) + try: + destApi.insertPrimaryDataset(primds_config) + except: + logger.exception('Error inserting PrimaryDataset in %s', destApi.url) + nothingToDo['result'] = 'FAIL' + nothingToDo['reason'] = 'Error looking up input dataset in DBS' + summaryFileName = saveSummaryJson(logdir, nothingToDo) + return summaryFileName msg = "Successfully inserted primary dataset %s." 
% (primName) logger.info(msg) @@ -803,8 +817,14 @@ def saveSummaryJson(logdir, summary): if dryRun: logger.info("DryRun: skipping migration request") else: - statusCode, failureMsg = migrateByBlockDBS3(taskname, migrateApi, destReadApi, sourceApi, - inputDataset, localParentBlocks, migrationLogDir, verbose) + try: + statusCode, failureMsg = migrateByBlockDBS3(taskname, migrateApi, destReadApi, sourceApi, + inputDataset, localParentBlocks, migrationLogDir, + verbose) + except Exception as ex: + logger.exception('Exception raised inside migrateByBlockDBS3\n%s', ex) + statusCode = 1 + failureMsg = 'Exception raised inside migrateByBlockDBS3' if statusCode: failureMsg += " Not publishing any files." logger.info(failureMsg) @@ -817,8 +837,14 @@ def saveSummaryJson(logdir, summary): if dryRun: logger.info("DryRun: skipping migration request") else: - statusCode, failureMsg = migrateByBlockDBS3(taskname, migrateApi, destReadApi, globalApi, - inputDataset, globalParentBlocks, migrationLogDir, verbose) + try: + statusCode, failureMsg = migrateByBlockDBS3(taskname, migrateApi, destReadApi, globalApi, + inputDataset, globalParentBlocks, migrationLogDir, + verbose) + except Exception as ex: + logger.exception('Exception raised inside migrateByBlockDBS3\n%s', ex) + statusCode = 1 + failureMsg = 'Exception raised inside migrateByBlockDBS3' if statusCode: failureMsg += " Not publishing any files." logger.info(failureMsg) diff --git a/src/python/RESTInteractions.py b/src/python/RESTInteractions.py index bffed3f9c9..0f065a6b39 100644 --- a/src/python/RESTInteractions.py +++ b/src/python/RESTInteractions.py @@ -44,7 +44,7 @@ def retriableError(ex): if isinstance(ex, pycurl.error): #28 is 'Operation timed out...' #35,is 'Unknown SSL protocol error', see https://github.com/dmwm/CRABServer/issues/5102 - return ex[0] in [28, 35] + return ex.args[0] in [28, 35] return False diff --git a/src/python/ServerUtilities.py b/src/python/ServerUtilities.py index 2d5f4d926e..a3dd6c62a8 100644 --- a/src/python/ServerUtilities.py +++ b/src/python/ServerUtilities.py @@ -33,7 +33,7 @@ from urllib import urlencode, quote BOOTSTRAP_CFGFILE_DUMP = 'PSetDump.py' -FEEDBACKMAIL = 'hn-cms-computing-tools@cern.ch' +FEEDBACKMAIL = 'cmstalk+computing-tools@dovecotmta.cern.ch' # Parameters for User File Cache # 120 MB is the maximum allowed size of a single file @@ -156,12 +156,6 @@ def getTestDataDirectory(): testdirList = __file__.split(os.sep)[:-3] + ["test", "data"] return os.sep.join(testdirList) -def isCouchDBURL(url): - """ Return True if the url proviced is a couchdb one - """ - return 'couchdb' in url - - def truncateError(msg): """Truncate the error message to the first 7400 chars if needed, and add a message if we truncate it. See https://github.com/dmwm/CRABServer/pull/4867#commitcomment-12086393 @@ -407,8 +401,7 @@ def getLock(name): def getHashLfn(lfn): """ Provide a hashed lfn from an lfn. """ - return hashlib.sha224(lfn).hexdigest() - + return hashlib.sha224(lfn.encode('utf-8')).hexdigest() def generateTaskName(username, requestname, timestamp=None): """ Generate a taskName which is saved in database @@ -802,6 +795,7 @@ def uploadToS3ViaPSU (filepath=None, preSignedUrlFields=None, logger=None): # CRAB_useGoCurl env. variable is used to define how upload to S3 command should be executed. 
# If variable is set, then goCurl is used for command execution: https://github.com/vkuznet/gocurl + # The same variable is also used inside CRABClient, we should keep name changes (if any) synchronized if os.getenv('CRAB_useGoCurl'): uploadCommand += '/cvmfs/cms.cern.ch/cmsmon/gocurl -verbose 2 -method POST' uploadCommand += ' -header "User-Agent:%s"' % userAgent @@ -861,6 +855,7 @@ def downloadFromS3ViaPSU(filepath=None, preSignedUrl=None, logger=None): # CRAB_useGoCurl env. variable is used to define how download from S3 command should be executed. # If variable is set, then goCurl is used for command execution: https://github.com/vkuznet/gocurl + # The same variable is also used inside CRABClient, we should keep name changes (if any) synchronized if os.getenv('CRAB_useGoCurl'): downloadCommand += '/cvmfs/cms.cern.ch/cmsmon/gocurl -verbose 2 -method GET' downloadCommand += ' -out "%s"' % filepath diff --git a/src/python/TaskWorker/Actions/DBSDataDiscovery.py b/src/python/TaskWorker/Actions/DBSDataDiscovery.py index d930294f5e..0d500c3334 100644 --- a/src/python/TaskWorker/Actions/DBSDataDiscovery.py +++ b/src/python/TaskWorker/Actions/DBSDataDiscovery.py @@ -3,8 +3,13 @@ import sys import logging import copy -from httplib import HTTPException -import urllib +from http.client import HTTPException + +import sys +if sys.version_info >= (3, 0): + from urllib.parse import urlencode # pylint: disable=no-name-in-module +if sys.version_info < (3, 0): + from urllib import urlencode from WMCore.DataStructs.LumiList import LumiList from WMCore.Services.DBS.DBSReader import DBSReader @@ -60,7 +65,7 @@ def keepOnlyDiskRSEs(self, locationsMap): # get all the RucioStorageElements (RSEs) which are of kind 'Disk' # locationsMap is a dictionary {block1:[locations], block2:[locations],...} diskLocationsMap = {} - for block, locations in locationsMap.iteritems(): + for block, locations in locationsMap.items(): # as of Sept 2020, tape RSEs ends with _Tape, go for the quick hack diskRSEs = [rse for rse in locations if not 'Tape' in rse] if 'T3_CH_CERN_OpenData' in diskRSEs: @@ -197,7 +202,7 @@ def requestTapeRecall(self, blockList=[], system='Dynamo', msgHead=''): # pyli 'subresource': 'addddmreqid', } try: - tapeRecallStatusSet = self.crabserver.post(api='task', data=urllib.urlencode(configreq)) + tapeRecallStatusSet = self.crabserver.post(api='task', data=urlencode(configreq)) except HTTPException as hte: self.logger.exception(hte) msg = "HTTP Error while contacting the REST Interface %s:\n%s" % ( @@ -254,7 +259,11 @@ def executeInternal(self, *args, **kwargs): dbsurl = dbsurl.replace(hostname, self.config.Services.DBSHostName) self.logger.info("will connect to DBS at URL: %s", dbsurl) self.dbs = DBSReader(dbsurl) - self.dbsInstance = self.dbs.dbs.serverinfo()["dbs_instance"] + # with new DBS, we can not get the instance from serverinfo api + # instead, we parse it from the URL + # if url is 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader' + # then self.dbsInstance needs to be 'prod/global' + self.dbsInstance = "{}/{}".format(dbsurl.split("//")[1].split("/")[2], dbsurl.split("//")[1].split("/")[3]) self.taskName = kwargs['task']['tm_taskname'] # pylint: disable=W0201 self.username = kwargs['task']['tm_username'] # pylint: disable=W0201 diff --git a/src/python/TaskWorker/Actions/DagmanCreator.py b/src/python/TaskWorker/Actions/DagmanCreator.py index 6dc5297b1c..90e246b30c 100644 --- a/src/python/TaskWorker/Actions/DagmanCreator.py +++ b/src/python/TaskWorker/Actions/DagmanCreator.py @@ -29,10 +29,6 @@ 
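# Editor's note (illustrative sketch, not part of the patch): the
# DBSDataDiscovery.py change a few lines above derives the DBS instance from
# the URL, since the new DBS server no longer exposes it via serverinfo().
# Stand-alone version of that parsing, using the example URL quoted in the
# patch comment; the variable names are illustrative:
dbsurl = 'https://cmsweb.cern.ch/dbs/prod/global/DBSReader'
urlParts = dbsurl.split("//")[1].split("/")   # ['cmsweb.cern.ch', 'dbs', 'prod', 'global', 'DBSReader']
dbsInstance = "{}/{}".format(urlParts[2], urlParts[3])
assert dbsInstance == 'prod/global'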
import WMCore.WMSpec.WMTask from WMCore.Services.CRIC.CRIC import CRIC -try: - from WMCore.Services.UserFileCache.UserFileCache import UserFileCache -except ImportError: - UserFileCache = None DAG_HEADER = """ @@ -101,8 +97,6 @@ +CRAB_OutTempLFNDir = "%(temp_dest)s" +CRAB_OutLFNDir = "%(output_dest)s" +CRAB_oneEventMode = %(oneEventMode)s -+CRAB_ASOURL = %(tm_asourl)s -+CRAB_ASODB = %(tm_asodb)s +CRAB_PrimaryDataset = %(primarydataset)s +TaskType = "Job" accounting_group = %(accounting_group)s @@ -207,6 +201,7 @@ def makeLFNPrefixes(task): if 'tm_user_role' in task and task['tm_user_role']: hash_input += "," + task['tm_user_role'] lfn = task['tm_output_lfn'] + hash_input = hash_input.encode('utf-8') pset_hash = hashlib.sha1(hash_input).hexdigest() user = task['tm_username'] tmp_user = "%s.%s" % (user, pset_hash) @@ -287,7 +282,7 @@ def transform_strings(data): for var in 'workflow', 'jobtype', 'jobsw', 'jobarch', 'inputdata', 'primarydataset', 'splitalgo', 'algoargs', \ 'cachefilename', 'cacheurl', 'userhn', 'publishname', 'asyncdest', 'dbsurl', 'publishdbsurl', \ 'userdn', 'requestname', 'oneEventMode', 'tm_user_vo', 'tm_user_role', 'tm_user_group', \ - 'tm_maxmemory', 'tm_numcores', 'tm_maxjobruntime', 'tm_priority', 'tm_asourl', 'tm_asodb', \ + 'tm_maxmemory', 'tm_numcores', 'tm_maxjobruntime', 'tm_priority', \ 'stageoutpolicy', 'taskType', 'worker_name', 'cms_wmtool', 'cms_tasktype', 'cms_type', \ 'desired_arch', 'resthost', 'dbinstance', 'submitter_ip_addr', \ 'task_lifetime_days', 'task_endtime', 'maxproberuntime', 'maxtailruntime': @@ -319,7 +314,6 @@ def transform_strings(data): for var in ["cacheurl", "jobsw", "jobarch", "cachefilename", "asyncdest", "requestname"]: info[var+"_flatten"] = data[var] - # TODO: PanDA wrapper wants some sort of dictionary. info["addoutputfiles_flatten"] = '{}' temp_dest, dest = makeLFNPrefixes(data) @@ -469,9 +463,6 @@ def makeJobSubmit(self, task): info['tfileoutfiles'] = task['tm_tfile_outfiles'] info['edmoutfiles'] = task['tm_edm_outfiles'] info['oneEventMode'] = 1 if info['tm_one_event_mode'] == 'T' else 0 - info['ASOURL'] = task['tm_asourl'] - asodb = task.get('tm_asodb', 'asynctransfer') or 'asynctransfer' - info['ASODB'] = asodb info['taskType'] = self.getDashboardTaskType(task) info['worker_name'] = getattr(self.config.TaskWorker, 'name', 'unknown') info['retry_aso'] = 1 if getattr(self.config.TaskWorker, 'retryOnASOFailures', True) else 0 @@ -558,10 +549,12 @@ def makeDagSpecs(self, task, sitead, siteinfo, jobgroup, block, availablesites, i = startjobid temp_dest, dest = makeLFNPrefixes(task) try: - # use temp_dest since it the longest path and overall LFN has a length limit to fit in DataBase + # validate LFN's. Check both dest and temp_dest. 
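A hedged sketch of the makeLFNPrefixes() hashing step touched above: the hash input string is encoded to bytes before sha1, as py3 hashlib requires. The identity values below are made up; the real hash_input is assembled from the task's user identity fields (with the role appended when present).

import hashlib

hash_input = 'jdoe,cms,prodrole'                 # hypothetical identity fields
hash_input = hash_input.encode('utf-8')          # required by py3 hashlib
pset_hash = hashlib.sha1(hash_input).hexdigest()
tmp_user = "%s.%s" % ('jdoe', pset_hash)         # same pattern as makeLFNPrefixes()
print(tmp_user)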
See https://github.com/dmwm/CRABServer/issues/6871 if task['tm_publication'] == 'T': + validateLFNs(dest, outfiles) validateLFNs(temp_dest, outfiles) else: + validateUserLFNs(dest, outfiles) validateUserLFNs(temp_dest, outfiles) except AssertionError as ex: msg = "\nYour task speficies an output LFN which fails validation in" @@ -608,6 +601,8 @@ def makeDagSpecs(self, task, sitead, siteinfo, jobgroup, block, availablesites, localOutputFiles.append("%s=%s" % (origFile, fileName)) remoteOutputFilesStr = " ".join(remoteOutputFiles) localOutputFiles = ", ".join(localOutputFiles) + # no need to use // in the next line, thanks to integer formatting with `%d` + # see: https://docs.python.org/3/library/string.html#formatstrings counter = "%04d" % (i / 1000) tempDest = os.path.join(temp_dest, counter) directDest = os.path.join(dest, counter) @@ -1101,23 +1096,22 @@ def getHighPrioUsers(self, userProxy, workflow, egroups): # Import needed because the DagmanCreator module is also imported in the schedd, # where there is no ldap available. This function however is only called # in the TW (where ldap is installed) during submission. - from ldap import LDAPError highPrioUsers = set() try: + from ldap import LDAPError for egroup in egroups: highPrioUsers.update(get_egroup_users(egroup)) - except LDAPError as le: + except Exception as ex: msg = "Error when getting the high priority users list." \ " Will ignore the high priority list and continue normally." \ - " Error reason: %s" % str(le) + " Error reason: %s" % str(ex) self.uploadWarning(msg, userProxy, workflow) return [] return highPrioUsers def executeInternal(self, *args, **kw): - # FIXME: In PanDA, we provided the executable as a URL. # So, the filename becomes http:// -- and doesn't really work. Hardcoding the analysis wrapper. #transform_location = getLocation(kw['task']['tm_transformation'], 'CAFUtilities/src/python/transformation/CMSRunAnalysis/') transform_location = getLocation('CMSRunAnalysis.sh', 'CRABServer/scripts/') @@ -1138,8 +1132,8 @@ def executeInternal(self, *args, **kw): sandboxTarBall = 'sandbox.tar.gz' debugTarBall = 'debug_files.tar.gz' - # Bootstrap the ISB if we are using S3 and running in the TW - if self.crabserver and 'S3' in kw['task']['tm_cache_url'].upper(): + # Bootstrap the ISB if we are running in the TW + if self.crabserver: username = kw['task']['tm_username'] sandboxName = kw['task']['tm_user_sandbox'] dbgFilesName = kw['task']['tm_debug_files'] @@ -1157,26 +1151,6 @@ def executeInternal(self, *args, **kw): except Exception as ex: self.logger.exception(ex) - # Bootstrap the ISB if we are using UFC - else: - if UserFileCache and kw['task']['tm_cache_url'].find('/crabcache') != -1: - ufc = UserFileCache(mydict={'cert': kw['task']['user_proxy'], 'key': kw['task']['user_proxy'], 'endpoint' : kw['task']['tm_cache_url']}) - try: - ufc.download(hashkey=kw['task']['tm_user_sandbox'].split(".")[0], output=sandboxTarBall) - except Exception as ex: - self.logger.exception(ex) - raise TaskWorkerException("The CRAB3 server backend could not download the input sandbox with your code "+\ - "from the frontend (crabcache component).\nThis could be a temporary glitch; please try to submit a new task later "+\ - "(resubmit will not work) and contact the experts if the error persists.\nError reason: %s" % str(ex)) #TODO url!? - kw['task']['tm_user_sandbox'] = sandboxTarBall - - # For an older client (<3.3.1607) this field will be empty and the file will not exist. 
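On the counter change above: '%d' formatting truncates a float argument, so "%04d" % (i / 1000) still yields the zero-padded group index under py3 true division. A small check, valid for the non-negative job indexes used here:

for i in (0, 999, 1000, 12345):
    assert "%04d" % (i / 1000) == "%04d" % (i // 1000)
    print(i, "->", "%04d" % (i / 1000))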
- if kw['task']['tm_debug_files']: - try: - ufc.download(hashkey=kw['task']['tm_debug_files'].split(".")[0], output=debugTarBall) - except Exception as ex: - self.logger.exception(ex) - # Bootstrap the runtime if it is available. job_runtime = getLocation('CMSRunAnalysis.tar.gz', 'CRABServer/') shutil.copy(job_runtime, '.') diff --git a/src/python/TaskWorker/Actions/DagmanKiller.py b/src/python/TaskWorker/Actions/DagmanKiller.py index 0430b50756..b052e1d88e 100644 --- a/src/python/TaskWorker/Actions/DagmanKiller.py +++ b/src/python/TaskWorker/Actions/DagmanKiller.py @@ -1,8 +1,12 @@ import re import socket -import urllib +import sys +if sys.version_info >= (3, 0): + from urllib.parse import urlencode # pylint: disable=no-name-in-module +if sys.version_info < (3, 0): + from urllib import urlencode -from httplib import HTTPException +from http.client import HTTPException import htcondor @@ -88,7 +92,8 @@ def killAll(self, jobConst): # TODO: Remove jobConst query when htcondor ticket is solved # https://htcondor-wiki.cs.wisc.edu/index.cgi/tktview?tn=5175 - with HTCondorUtils.AuthenticatedSubprocess(self.proxy) as (parent, rpipe): + tokenDir = getattr(self.config.TaskWorker, 'SEC_TOKEN_DIRECTORY', None) + with HTCondorUtils.AuthenticatedSubprocess(self.proxy, tokenDir) as (parent, rpipe): if not parent: with self.schedd.transaction() as dummytsc: self.schedd.act(htcondor.JobAction.Hold, rootConst) @@ -123,7 +128,7 @@ def execute(self, *args, **kwargs): 'workflow': kwargs['task']['tm_taskname'], 'status': 'KILLED'} self.logger.debug("Setting the task as successfully killed with %s", str(configreq)) - self.crabserver.post(api='workflowdb', data=urllib.urlencode(configreq)) + self.crabserver.post(api='workflowdb', data=urlencode(configreq)) except HTTPException as hte: self.logger.error(hte.headers) msg = "The CRAB server successfully killed the task," diff --git a/src/python/TaskWorker/Actions/DagmanResubmitter.py b/src/python/TaskWorker/Actions/DagmanResubmitter.py index 497b96e11d..a703a72eab 100644 --- a/src/python/TaskWorker/Actions/DagmanResubmitter.py +++ b/src/python/TaskWorker/Actions/DagmanResubmitter.py @@ -1,22 +1,25 @@ import time -import urllib import datetime +import sys +if sys.version_info >= (3, 0): + from urllib.parse import urlencode # pylint: disable=no-name-in-module +if sys.version_info < (3, 0): + from urllib import urlencode + import classad import htcondor import HTCondorLocator import HTCondorUtils -from WMCore.Database.CMSCouch import CouchServer - from ServerUtilities import FEEDBACKMAIL from TaskWorker.Actions.TaskAction import TaskAction from TaskWorker.Actions.DagmanSubmitter import checkMemoryWalltime from TaskWorker.WorkerExceptions import TaskWorkerException import TaskWorker.DataObjects.Result as Result -from httplib import HTTPException +from http.client import HTTPException class DagmanResubmitter(TaskAction): @@ -38,30 +41,13 @@ def executeInternal(self, *args, **kwargs): #pylint: disable=unused-argument raise ValueError("No proxy provided") proxy = task['user_proxy'] - if task.get('resubmit_publication', False): - resubmitWhat = "publications" - else: - resubmitWhat = "jobs" - - self.logger.info("About to resubmit %s for workflow: %s.", resubmitWhat, workflow) + self.logger.info("About to resubmit failed jobs for workflow: %s.", workflow) self.logger.debug("Task info: %s", str(task)) - if task.get('resubmit_publication', False): - asourl = task.get('tm_asourl', None) - #Let's not assume the db has been updated (mostly for devs), let's default asodb to 
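The version-gated urlencode import used in DagmanKiller (and repeated in the other modules touched by this patch) can be exercised on its own; the payload below is a hypothetical kill status update.

import sys
if sys.version_info >= (3, 0):
    from urllib.parse import urlencode  # pylint: disable=no-name-in-module
if sys.version_info < (3, 0):
    from urllib import urlencode

configreq = {'workflow': 'some_task_name', 'status': 'KILLED'}   # hypothetical payload
print(urlencode(configreq))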
asynctransfer! - #Also the "or" takes care of the case were the new code is executed on old task - #i.e.: tm_asodb is there but empty. - asodb = task.get('tm_asodb', 'asynctransfer') or 'asynctransfer' - if not asourl: - msg = "ASO URL not set. Can not resubmit publication." - raise TaskWorkerException(msg) - self.logger.info("Will resubmit failed publications") - self.resubmitPublication(asourl, asodb, proxy, workflow) - return - if task['tm_collector']: self.backendurls['htcondorPool'] = task['tm_collector'] loc = HTCondorLocator.HTCondorLocator(self.backendurls) + tokenDir = getattr(self.config.TaskWorker, 'SEC_TOKEN_DIRECTORY', None) schedd = "" dummyAddress = "" @@ -102,31 +88,22 @@ def executeInternal(self, *args, **kwargs): #pylint: disable=unused-argument overwrite = False for taskparam in params.values(): if ('resubmit_'+taskparam in task) and task['resubmit_'+taskparam] != None: - # In case resubmission parameters contain a list of unicode strings, - # convert it to a list of ascii strings because of HTCondor unicode - # incompatibility. - # Note that unicode strings that are not in a list are not handled, - # but so far they don't exist in this part of the code. + # py3: we can revert changes coming from #5317 if isinstance(task['resubmit_'+taskparam], list): - nonUnicodeList = [] - for p in task['resubmit_'+taskparam]: - if isinstance(p, unicode): - nonUnicodeList.append(p.encode('ascii', 'ignore')) - else: - nonUnicodeList.append(p) - ad[taskparam] = nonUnicodeList + ad[taskparam] = task['resubmit_'+taskparam] if taskparam != 'jobids': overwrite = True if ('resubmit_jobids' in task) and task['resubmit_jobids']: - with HTCondorUtils.AuthenticatedSubprocess(proxy, logger=self.logger) as (parent, rpipe): + with HTCondorUtils.AuthenticatedSubprocess(proxy, tokenDir, + logger=self.logger) as (parent, rpipe): if not parent: schedd.edit(rootConst, "HoldKillSig", 'SIGKILL') ## Overwrite parameters in the os.environ[_CONDOR_JOB_AD] file. This will affect ## all the jobs, not only the ones we want to resubmit. That's why the pre-job ## is saving the values of the parameters for each job retry in text files (the ## files are in the directory resubmit_info in the schedd). - for adparam, taskparam in params.iteritems(): + for adparam, taskparam in params.items(): if taskparam in ad: schedd.edit(rootConst, adparam, ad.lookup(taskparam)) elif task['resubmit_'+taskparam] != None: @@ -136,9 +113,10 @@ def executeInternal(self, *args, **kwargs): #pylint: disable=unused-argument schedd.act(htcondor.JobAction.Release, rootConst) elif overwrite: self.logger.debug("Resubmitting under condition overwrite = True") - with HTCondorUtils.AuthenticatedSubprocess(proxy, logger=self.logger) as (parent, rpipe): + with HTCondorUtils.AuthenticatedSubprocess(proxy, tokenDir, + logger=self.logger) as (parent, rpipe): if not parent: - for adparam, taskparam in params.iteritems(): + for adparam, taskparam in params.items(): if taskparam in ad: if taskparam == 'jobids' and len(list(ad[taskparam])) == 0: self.logger.debug("Setting %s = True in the task ad.", adparam) @@ -153,7 +131,8 @@ def executeInternal(self, *args, **kwargs): #pylint: disable=unused-argument ## starting from CRAB 3.3.16 the resubmission parameters are written to the ## Task DB with value != None, so the overwrite variable should never be False. 
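The SEC_TOKEN_DIRECTORY lookup added above reads an optional configuration attribute with getattr() and a None default, so older TaskWorker configurations keep working unchanged. A stand-in config object illustrates the behaviour:

from types import SimpleNamespace

config = SimpleNamespace(TaskWorker=SimpleNamespace(name='tw-test'))   # stand-in config
tokenDir = getattr(config.TaskWorker, 'SEC_TOKEN_DIRECTORY', None)
print(tokenDir)   # -> None when the attribute is not defined in the configuration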
self.logger.debug("Resubmitting under condition overwrite = False") - with HTCondorUtils.AuthenticatedSubprocess(proxy, logger=self.logger) as (parent, rpipe): + with HTCondorUtils.AuthenticatedSubprocess(proxy, tokenDir, + logger=self.logger) as (parent, rpipe): if not parent: schedd.edit(rootConst, "HoldKillSig", 'SIGKILL') schedd.edit(rootConst, "CRAB_ResubmitList", classad.ExprTree("true")) @@ -183,7 +162,7 @@ def execute(self, *args, **kwargs): 'workflow': kwargs['task']['tm_taskname'], 'status': 'SUBMITTED'} self.logger.debug("Setting the task as successfully resubmitted with %s", str(configreq)) - self.crabserver.post(api='workflowdb', data=urllib.urlencode(configreq)) + self.crabserver.post(api='workflowdb', data=urlencode(configreq)) except HTTPException as hte: self.logger.error(hte.headers) msg = "The CRAB server successfully resubmitted the task to the Grid scheduler," @@ -192,44 +171,6 @@ def execute(self, *args, **kwargs): raise TaskWorkerException(msg) return Result.Result(task=kwargs['task'], result='OK') - - def resubmitPublication(self, asourl, asodb, proxy, taskname): - """ - Resubmit failed publications by resetting the publication - status in the CouchDB documents. - """ - server = CouchServer(dburl=asourl, ckey=proxy, cert=proxy) - try: - database = server.connectDatabase(asodb) - except Exception as ex: - msg = "Error while trying to connect to CouchDB: %s" % (str(ex)) - raise TaskWorkerException(msg) - try: - failedPublications = database.loadView('DBSPublisher', 'PublicationFailedByWorkflow',\ - {'reduce': False, 'startkey': [taskname], 'endkey': [taskname, {}]})['rows'] - except Exception as ex: - msg = "Error while trying to load view 'DBSPublisher.PublicationFailedByWorkflow' from CouchDB: %s" % (str(ex)) - raise TaskWorkerException(msg) - msg = "There are %d failed publications to resubmit: %s" % (len(failedPublications), failedPublications) - self.logger.info(msg) - for doc in failedPublications: - docid = doc['id'] - if doc['key'][0] != taskname: # this should never happen... 
- msg = "Skipping document %s as it seems to correspond to another task: %s" % (docid, doc['key'][0]) - self.logger.warning(msg) - continue - data = {'last_update': time.time(), - 'retry': str(datetime.datetime.now()), - 'publication_state': 'not_published', - } - try: - database.updateDocument(docid, 'DBSPublisher', 'updateFile', data) - except Exception as ex: - msg = "Error updating document %s in CouchDB: %s" % (docid, str(ex)) - self.logger.error(msg) - return - - if __name__ == "__main__": import os import logging diff --git a/src/python/TaskWorker/Actions/DagmanSubmitter.py b/src/python/TaskWorker/Actions/DagmanSubmitter.py index caf60ea7a6..be73cc5aa8 100644 --- a/src/python/TaskWorker/Actions/DagmanSubmitter.py +++ b/src/python/TaskWorker/Actions/DagmanSubmitter.py @@ -8,9 +8,14 @@ import json import time import pickle -import urllib -from httplib import HTTPException +import sys +if sys.version_info >= (3, 0): + from urllib.parse import urlencode # pylint: disable=no-name-in-module +if sys.version_info < (3, 0): + from urllib import urlencode + +from http.client import HTTPException import HTCondorUtils import CMSGroupMapper @@ -81,8 +86,6 @@ ('MaxWallTimeMinsProbe', 'maxproberuntime'), ('MaxWallTimeMinsTail', 'maxtailruntime'), ('JobPrio', 'tm_priority'), - ('CRAB_ASOURL', 'tm_asourl'), - ('CRAB_ASODB', 'tm_asodb'), ('CRAB_FailedNodeLimit', 'faillimit'), ('CRAB_DashboardTaskType', 'taskType'), ('CRAB_MaxIdle', 'maxidle'), @@ -203,7 +206,7 @@ def sendScheddToREST(self, task, schedd): configreq = {'workflow':task['tm_taskname'], 'subresource':'updateschedd', 'scheddname':schedd} try: - self.crabserver.post(api='task', data=urllib.urlencode(configreq)) + self.crabserver.post(api='task', data=urlencode(configreq)) except HTTPException as hte: msg = "Unable to contact cmsweb and update scheduler on which task will be submitted. 
Error msg: %s" % hte.headers self.logger.warning(msg) @@ -364,7 +367,7 @@ def duplicateCheck(self, task): 'clusterid': results[0]['ClusterId'] } self.logger.warning("Task %s already submitted to HTCondor; pushing information centrally: %s", workflow, str(configreq)) - data = urllib.urlencode(configreq) + data = urlencode(configreq) self.crabserver.post(api='workflowdb', data=data) # Note that we don't re-send Dashboard jobs; we assume this is a rare occurrance and @@ -455,7 +458,7 @@ def executeInternal(self, info, dashboardParams, inputFiles, **kwargs): 'subresource': 'success', 'clusterid' : self.clusterId} #that's the condor cluster id of the dag_bootstrap.sh self.logger.debug("Pushing information centrally %s", configreq) - data = urllib.urlencode(configreq) + data = urlencode(configreq) self.crabserver.post(api='workflowdb', data=data) return Result.Result(task=kwargs['task'], result='OK') @@ -486,9 +489,9 @@ def submitDirect(self, schedd, cmd, arg, info): #pylint: disable=R0201 dagAd["Environment"] = classad.ExprTree('strcat("PATH=/usr/bin:/bin CRAB3_VERSION=3.3.0-pre1 CONDOR_ID=", ClusterId, ".", ProcId," %s")' % " ".join(info['additional_environment_options'].split(";"))) dagAd["RemoteCondorSetup"] = info['remote_condor_setup'] - dagAd["CRAB_TaskSubmitTime"] = classad.ExprTree("%s" % info["start_time"].encode('ascii', 'ignore')) + dagAd["CRAB_TaskSubmitTime"] = info['start_time'] dagAd['CRAB_TaskLifetimeDays'] = TASKLIFETIME // 24 // 60 // 60 - dagAd['CRAB_TaskEndTime'] = int(info["start_time"]) + TASKLIFETIME + dagAd['CRAB_TaskEndTime'] = int(info['start_time']) + TASKLIFETIME #For task management info see https://github.com/dmwm/CRABServer/issues/4681#issuecomment-302336451 dagAd["LeaveJobInQueue"] = classad.ExprTree("true") dagAd["PeriodicHold"] = classad.ExprTree("time() > CRAB_TaskEndTime") @@ -502,7 +505,11 @@ def submitDirect(self, schedd, cmd, arg, info): #pylint: disable=R0201 for k, v in dagAd.items(): if k == 'X509UserProxy': v = os.path.basename(v) - if isinstance(v, basestring): + if isinstance(v, str): + value = classad.quote(v) + elif isinstance(v, bytes): + # we only expect strings in the code above, but.. just in case, + # be prarped for bytes in case it requires a different handling at some point value = classad.quote(v) elif isinstance(v, classad.ExprTree): value = repr(v) @@ -520,7 +527,10 @@ def submitDirect(self, schedd, cmd, arg, info): #pylint: disable=R0201 dagAd["TransferInput"] = str(info['inputFilesString']) condorIdDict = {} - with HTCondorUtils.AuthenticatedSubprocess(info['user_proxy'], pickleOut=True, outputObj=condorIdDict, logger=self.logger) as (parent, rpipe): + tokenDir = getattr(self.config.TaskWorker, 'SEC_TOKEN_DIRECTORY', None) + with HTCondorUtils.AuthenticatedSubprocess(info['user_proxy'], tokenDir, + pickleOut=True, outputObj=condorIdDict, + logger=self.logger) as (parent, rpipe): if not parent: resultAds = [] condorIdDict['ClusterId'] = schedd.submit(dagAd, 1, True, resultAds) diff --git a/src/python/TaskWorker/Actions/DataDiscovery.py b/src/python/TaskWorker/Actions/DataDiscovery.py index c8b4bf170e..3d2e46288a 100644 --- a/src/python/TaskWorker/Actions/DataDiscovery.py +++ b/src/python/TaskWorker/Actions/DataDiscovery.py @@ -40,7 +40,7 @@ def formatOutput(self, task, requestname, datasetfiles, locations, tempDir): resourceCatalog = CRIC(logger=self.logger, configDict=configDict) # can't affort one message from CRIC per file, unless critical ! 
with tempSetLogLevel(logger=self.logger, level=logging.ERROR): - for lfn, infos in datasetfiles.iteritems(): + for lfn, infos in datasetfiles.items(): ## Skip the file if it is not in VALID state. if not infos.get('ValidFile', True): self.logger.warning("Skipping invalid file %s", lfn) @@ -67,7 +67,7 @@ def formatOutput(self, task, requestname, datasetfiles, locations, tempDir): raise wmfile['workflow'] = requestname event_counter += infos['NumberOfEvents'] - for run, lumis in infos['Lumis'].iteritems(): + for run, lumis in infos['Lumis'].items(): datasetLumis.setdefault(run, []).extend(lumis) wmfile.addRun(Run(run, *lumis)) for lumi in lumis: diff --git a/src/python/TaskWorker/Actions/DryRunUploader.py b/src/python/TaskWorker/Actions/DryRunUploader.py index ddcda5400f..16698729f7 100644 --- a/src/python/TaskWorker/Actions/DryRunUploader.py +++ b/src/python/TaskWorker/Actions/DryRunUploader.py @@ -1,14 +1,18 @@ """ -Upload an archive containing all files needed to run the a to the UserFileCache (necessary for crab submit --dryrun.) +Upload an archive containing all files needed to run the task to the Cache (necessary for crab submit --dryrun.) """ import os import json -import urllib import tarfile import time +import sys +if sys.version_info >= (3, 0): + from urllib.parse import urlencode # pylint: disable=no-name-in-module +if sys.version_info < (3, 0): + from urllib import urlencode + from WMCore.DataStructs.LumiList import LumiList -from WMCore.Services.UserFileCache.UserFileCache import UserFileCache from TaskWorker.DataObjects.Result import Result from TaskWorker.Actions.TaskAction import TaskAction @@ -17,7 +21,7 @@ class DryRunUploader(TaskAction): """ - Upload an archive containing all files needed to run the task to the UserFileCache (necessary for crab submit --dryrun.) + Upload an archive containing all files needed to run the task to the Cache (necessary for crab submit --dryrun.) """ def packSandbox(self, inputFiles): @@ -46,18 +50,10 @@ def executeInternal(self, *args, **kw): self.logger.info('Uploading dry run tarball to the user file cache') t0 = time.time() - if 'S3' in kw['task']['tm_cache_url'].upper(): - uploadToS3(crabserver=self.crabserver, filepath='dry-run-sandbox.tar.gz', - objecttype='runtimefiles', taskname=kw['task']['tm_taskname'], logger=self.logger) - result = {'hashkey':'ok'} # a dummy one to keep same semantics as when using UserFileCache - os.remove('dry-run-sandbox.tar.gz') - else: - ufc = UserFileCache(mydict={'cert': kw['task']['user_proxy'], 'key': kw['task']['user_proxy'], 'endpoint': kw['task']['tm_cache_url']}) - result = ufc.uploadLog('dry-run-sandbox.tar.gz') - os.remove('dry-run-sandbox.tar.gz') - if 'hashkey' not in result: - raise TaskWorkerException('Failed to upload dry-run-sandbox.tar.gz to the user file cache: ' + str(result)) - self.logger.info('Uploaded dry run tarball to the user file cache: %s', str(result)) + uploadToS3(crabserver=self.crabserver, filepath='dry-run-sandbox.tar.gz', + objecttype='runtimefiles', taskname=kw['task']['tm_taskname'], logger=self.logger) + os.remove('dry-run-sandbox.tar.gz') + self.logger.info('Uploaded dry run tarball to the user file cache') # wait until tarball is available, S3 may take a few seconds for this (ref. 
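The iteritems() to items() switches above do not change the run/lumi bookkeeping; a minimal illustration of the datasetLumis merge with hypothetical run numbers:

datasetLumis = {}
infosLumis = {1: [10, 11], 2: [20]}            # hypothetical {run: [lumis]} for one file
for run, lumis in infosLumis.items():          # iteritems() no longer exists in py3
    datasetLumis.setdefault(run, []).extend(lumis)
print(datasetLumis)                            # -> {1: [10, 11], 2: [20]}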
issue #6706 ) t1 = time.time() lt1 = time.strftime("%H:%M:%S", time.localtime(t1)) @@ -79,7 +75,7 @@ def executeInternal(self, *args, **kw): time.sleep(5) update = {'workflow': kw['task']['tm_taskname'], 'subresource': 'state', 'status': 'UPLOADED'} self.logger.debug('Updating task status: %s', str(update)) - self.crabserver.post(api='workflowdb', data=urllib.urlencode(update)) + self.crabserver.post(api='workflowdb', data=urlencode(update)) finally: os.chdir(cwd) @@ -140,6 +136,5 @@ def dump(self, outname): 'avg_files': sum(self.filesPerJob)/float(len(self.filesPerJob)), 'min_files': min(self.filesPerJob)}) - with open(outname, 'wb') as f: + with open(outname, 'w') as f: json.dump(summary, f) - diff --git a/src/python/TaskWorker/Actions/Handler.py b/src/python/TaskWorker/Actions/Handler.py index 55042b2d6c..4787a25a1f 100644 --- a/src/python/TaskWorker/Actions/Handler.py +++ b/src/python/TaskWorker/Actions/Handler.py @@ -4,9 +4,7 @@ import logging import tempfile import traceback -from httplib import HTTPException - -from WMCore.Services.UserFileCache.UserFileCache import UserFileCache +from http.client import HTTPException from RESTInteractions import CRABRest from RucioUtils import getNativeRucioClient @@ -94,29 +92,13 @@ def executeAction(self, nextinput, work): #TODO: we need to do that also in Worker.py otherwise some messages might only be in the TW file but not in the crabcache. logpath = self.config.TaskWorker.logsDir+'/tasks/%s/%s.log' % (self._task['tm_username'], self.taskname) if os.path.isfile(logpath) and 'user_proxy' in self._task: #the user proxy might not be there if myproxy retrieval failed - cacheurldict = {'endpoint':self._task['tm_cache_url'], 'cert':self._task['user_proxy'], 'key':self._task['user_proxy']} - if 'S3' in self._task['tm_cache_url'].upper(): - # use S3 - try: - uploadToS3(crabserver=self.crabserver, objecttype='twlog', filepath=logpath, - taskname=self.taskname, logger=self.logger) - except Exception as e: - msg = 'Failed to upload logfile to S3 for task %s. ' % self.taskname - msg += 'Details:\n%s' % str(e) - self.logger.error(msg) - else: - # use old crabcache - try: - ufc = UserFileCache(cacheurldict) - logfilename = self.taskname + '_TaskWorker.log' - ufc.uploadLog(logpath, logfilename) - except HTTPException as hte: - msg = "Failed to upload the logfile to %s for task %s. More details in the http headers and body:\n%s\n%s" % (self._task['tm_cache_url'], self.taskname, hte.headers, hte.result) - self.logger.error(msg) - except Exception: #pylint: disable=broad-except - msg = "Unknown error while uploading the logfile for task %s" % self.taskname - self.logger.exception(msg) #upload logfile of the task to the crabcache - + try: + uploadToS3(crabserver=self.crabserver, objecttype='twlog', filepath=logpath, + taskname=self.taskname, logger=self.logger) + except Exception as e: + msg = 'Failed to upload logfile to S3 for task %s. 
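The 'wb' to 'w' change above is needed because json.dump() writes str in py3 and therefore requires a text-mode file; the file name and summary content below are made up.

import json

summary = {'avg_files': 2.5, 'min_files': 1}   # hypothetical summary content
with open('dryrun_summary.json', 'w') as f:    # text mode: json.dump() writes str in py3
    json.dump(summary, f)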
' % self.taskname + msg += 'Details:\n%s' % str(e) + self.logger.error(msg) return output diff --git a/src/python/TaskWorker/Actions/MyProxyLogon.py b/src/python/TaskWorker/Actions/MyProxyLogon.py index 96406d9f0c..09ab12441f 100644 --- a/src/python/TaskWorker/Actions/MyProxyLogon.py +++ b/src/python/TaskWorker/Actions/MyProxyLogon.py @@ -51,8 +51,8 @@ def tryProxyLogon(self, proxycfg=None): self.logger.error("===========PROXY ERROR END ==========================") raise TaskWorkerException(errmsg) - hoursleft = timeleft/3600 - minutesleft = (timeleft%3600)/60 + hoursleft = timeleft // 3600 + minutesleft = (timeleft % 3600) // 60 self.logger.info('retrieved proxy lifetime in h:m: %d:%d', hoursleft, minutesleft) return (userproxy, usergroups) diff --git a/src/python/TaskWorker/Actions/PostJob.py b/src/python/TaskWorker/Actions/PostJob.py index 23fe47a86b..6812803beb 100644 --- a/src/python/TaskWorker/Actions/PostJob.py +++ b/src/python/TaskWorker/Actions/PostJob.py @@ -1,7 +1,5 @@ #!/usr/bin/python # TODO: This is a long term issue and to maintain ~3k lines of code in one file is hard. -# Would be nice to separate all files and have one for Couch another for RDBMS and maybe -# in the future someone will want to use Mongo or ES... # ANOTHER TODO: # In the code it is hard to read: workflow, taskname, reqname. All are the same.... @@ -78,7 +76,6 @@ import hashlib import logging import logging.handlers -import commands import subprocess import unittest import datetime @@ -87,19 +84,18 @@ import random import shutil from shutil import move -from httplib import HTTPException +from http.client import HTTPException import htcondor import classad -import WMCore.Database.CMSCouch as CMSCouch from WMCore.DataStructs.LumiList import LumiList from WMCore.Services.WMArchive.DataMap import createArchiverDoc from TaskWorker import __version__ from TaskWorker.Actions.RetryJob import RetryJob from TaskWorker.Actions.RetryJob import JOB_RETURN_CODES -from ServerUtilities import isFailurePermanent, parseJobAd, mostCommon, TRANSFERDB_STATES, PUBLICATIONDB_STATES, encodeRequest, isCouchDBURL, oracleOutputMapping -from ServerUtilities import getLock +from ServerUtilities import isFailurePermanent, parseJobAd, mostCommon, TRANSFERDB_STATES, PUBLICATIONDB_STATES, encodeRequest, oracleOutputMapping +from ServerUtilities import getLock, getHashLfn from RESTInteractions import CRABRest ASO_JOB = None @@ -233,6 +229,7 @@ def prepareErrorSummary(logger, fsummary, job_id, crab_retry): if error_summary_changed: with getLock(G_FJR_PARSE_RESULTS_FILE_NAME): with open(G_FJR_PARSE_RESULTS_FILE_NAME, "a+") as fjr_parse_results: + # make sure the "json file" is written as multiple lines fjr_parse_results.write(json.dumps({job_id : {crab_retry : error_summary}}) + "\n") # Read, update and re-write the error_summary.json file @@ -295,8 +292,6 @@ def __init__(self, logger, aso_start_time, aso_start_timestamp, dest_site, sourc self.docs_in_transfer = None self.crab_retry = crab_retry self.retry_timeout = retry_timeout - self.couch_server = None - self.couch_database = None self.job_id = job_id self.dest_site = dest_site self.source_dir = source_dir @@ -316,24 +311,13 @@ def __init__(self, logger, aso_start_time, aso_start_timestamp, dest_site, sourc self.aso_start_timestamp = aso_start_timestamp proxy = os.environ.get('X509_USER_PROXY', None) self.proxy = proxy - self.aso_db_url = self.job_ad['CRAB_ASOURL'] self.rest_host = rest_host self.db_instance = db_instance self.rest_url = rest_host + '/crabserver/' + db_instance + '/' # 
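The MyProxyLogon change above replaces true division with floor division so the proxy lifetime prints as whole hours and minutes; for example:

timeleft = 86399                        # seconds left on the proxy, hypothetical value
hoursleft = timeleft // 3600
minutesleft = (timeleft % 3600) // 60
print('retrieved proxy lifetime in h:m: %d:%d' % (hoursleft, minutesleft))   # -> 23:59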
used in logging self.found_doc_in_db = False - #I don't think it is necessary to default to asynctransfer here, we are taking care of it - #in dagman creator and if CRAB_ASODB is not there it means it's old task executing old code - #But just to make sure... - self.aso_db_name = self.job_ad.get('CRAB_ASODB', 'asynctransfer') or 'asynctransfer' try: - if first_pj_execution(): - self.logger.info("Will use ASO server at %s." % (self.aso_db_url)) - if isCouchDBURL(self.aso_db_url): - self.couch_server = CMSCouch.CouchServer(dburl=self.aso_db_url, ckey=proxy, cert=proxy) - self.couch_database = self.couch_server.connectDatabase(self.aso_db_name, create=False) - else: - self.crabserver = CRABRest(self.rest_host, proxy, proxy, retry=2, userAgent='CRABSchedd') - self.crabserver.setDbInstance(self.db_instance) + self.crabserver = CRABRest(self.rest_host, proxy, proxy, retry=2, userAgent='CRABSchedd') + self.crabserver.setDbInstance(self.db_instance) except Exception as ex: msg = "Failed to connect to ASO database via CRABRest: %s" % (str(ex)) self.logger.exception(msg) @@ -343,7 +327,7 @@ def __init__(self, logger, aso_start_time, aso_start_timestamp, dest_site, sourc def save_docs_in_transfer(self): """ The function is used to save into a file the documents we are transfering so - we do not have to query couch to get this list every time the postjob is restarted. + we do not have to query the DB to get this list every time the postjob is restarted. """ try: filename = 'transfer_info/docs_in_transfer.%s.%d.json' % (self.job_id, self.crab_retry) @@ -596,7 +580,7 @@ def inject_to_aso(self): ## from the worker node directly to the permanent storage. needs_transfer = self.log_needs_transfer self.logger.info("Working on file %s" % (filename)) - doc_id = hashlib.sha224(source_lfn).hexdigest() + doc_id = getHashLfn(source_lfn) doc_new_info = {'state': 'new', 'source': fixUpTempStorageSite(logger=self.logger, siteName=source_site), 'destination': self.dest_site, @@ -713,7 +697,8 @@ def inject_to_aso(self): self.logger.info(msg) msg = "Previous document: %s" % (pprint.pformat(doc)) self.logger.debug(msg) - except (CMSCouch.CouchNotFoundError, NotFound): + + except (NotFound): ## The document was not yet uploaded to ASO database (if this is the first job ## retry, then either the upload from the WN failed, or cmscp did a direct ## stageout and here we need to inject for publication only). In any case we @@ -798,169 +783,152 @@ def inject_to_aso(self): ##= = = = = ASOServerJob = = = = = = = = = = = = = = = = = = = = = = = = = = = = def getDocByID(self, doc_id): - if not isCouchDBURL(self.aso_db_url): - docInfo = self.crabserver.get(api='fileusertransfers', data=encodeRequest({'subresource': 'getById', "id": doc_id})) - if docInfo and len(docInfo[0]['result']) == 1: - # Means that we have already a document in database! - docInfo = oracleOutputMapping(docInfo) - # Just to be 100% sure not to break after the mapping been added - if not docInfo: - self.found_doc_in_db = False - raise NotFound('Document not found in database') - # transfer_state and publication_state is a number in database. Lets change - # it to lowercase until we will end up support for CouchDB. 
- docInfo[0]['transfer_state'] = TRANSFERDB_STATES[docInfo[0]['transfer_state']].lower() - docInfo[0]['publication_state'] = PUBLICATIONDB_STATES[docInfo[0]['publication_state']].lower() - # Also change id to doc_id - docInfo[0]['job_id'] = docInfo[0]['id'] - self.found_doc_in_db = True # This is needed for further if there is a need to update doc info in DB - return docInfo[0] - else: + docInfo = self.crabserver.get(api='fileusertransfers', data=encodeRequest({'subresource': 'getById', "id": doc_id})) + if docInfo and len(docInfo[0]['result']) == 1: + # Means that we have already a document in database! + docInfo = oracleOutputMapping(docInfo) + # Just to be 100% sure not to break after the mapping been added + if not docInfo: self.found_doc_in_db = False - raise NotFound('Document not found in database!') + raise NotFound('Document not found in database') + # transfer_state and publication_state is a number in database. + docInfo[0]['transfer_state'] = TRANSFERDB_STATES[docInfo[0]['transfer_state']] + docInfo[0]['publication_state'] = PUBLICATIONDB_STATES[docInfo[0]['publication_state']] + # Also change id to doc_id + docInfo[0]['job_id'] = docInfo[0]['id'] + self.found_doc_in_db = True # This is needed for further if there is a need to update doc info in DB + return docInfo[0] else: - return self.couch_database.document(doc_id) + self.found_doc_in_db = False + raise NotFound('Document not found in database!') def updateOrInsertDoc(self, doc, toTransfer): """""" returnMsg = {} - if not isCouchDBURL(self.aso_db_url): - if not self.found_doc_in_db: - # This means that it was not founded in DB and we will have to insert new doc - newDoc = {'id': doc['_id'], - 'username': doc['user'], - 'taskname': doc['workflow'], - 'start_time': self.aso_start_timestamp, - 'destination': doc['destination'], - 'destination_lfn': doc['destination_lfn'], - 'source': doc['source'], - 'source_lfn': doc['source_lfn'], - 'filesize': doc['size'], - 'publish': doc['publish'], - 'transfer_state': doc['state'].upper(), - 'publication_state': 'NEW' if doc['publish'] else 'NOT_REQUIRED', - 'job_id': doc['jobid'], - 'job_retry_count': doc['job_retry_count'], - 'type': doc['type'], - 'rest_host': doc['rest_host'], - 'rest_uri': doc['rest_uri']} - try: - self.crabserver.put(api='fileusertransfers', data=encodeRequest(newDoc)) - except HTTPException as hte: - msg = "Error uploading document to database." - msg += " Transfer submission failed." 
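On the state decoding kept above: the Oracle-backed REST interface stores transfer and publication states as integers, and the TRANSFERDB_STATES / PUBLICATIONDB_STATES maps from ServerUtilities turn them back into names. The toy mapping below is invented purely to illustrate the lookup; the real numeric codes live in ServerUtilities.

TRANSFERDB_STATES_EXAMPLE = {0: 'NEW', 1: 'ACQUIRED', 2: 'FAILED', 3: 'DONE'}   # made-up codes
docInfo = {'id': 'abc123', 'transfer_state': 3}                                 # made-up row
docInfo['transfer_state'] = TRANSFERDB_STATES_EXAMPLE[docInfo['transfer_state']]
docInfo['job_id'] = docInfo['id']          # same renaming as in getDocByID()
print(docInfo)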
- msg += "\n%s" % (str(hte.headers)) - returnMsg['error'] = msg - updateDoc={'subresource':'updateTransfers', 'list_of_ids':[doc['_id']]} - updateDoc['list_of_transfer_state'] = [newDoc['transfer_state']] - # make sure that asoworker field in transfersdb is always filled, since - # otherwise whichever Publisher process looks first for things to do, grabs them - # https://github.com/dmwm/CRABServer/blob/8012e1297759bab620d89c8cb253f1832b4eb466/src/python/Databases/FileTransfersDB/Oracle/FileTransfers/FileTransfers.py#L27-L33 - # but since PUT API ignores an asoworker argument when inserting - # https://github.com/dmwm/CRABServer/blob/43f6377447922d46353072e86d960e3c78967a17/src/python/CRABInterface/RESTFileUserTransfers.py#L122-L125 - # we need to update the record after insetion with a POST, which requires list of ids and states - # https://github.com/dmwm/CRABServer/blob/43f6377447922d46353072e86d960e3c78967a17/src/python/CRABInterface/RESTFileTransfers.py#L131-L133 - if os.path.exists('USE_NEW_PUBLISHER'): - self.logger.info("USE_NEW_PUBLISHER: set asoworker=schedd in transferdb") - updateDoc['asoworker'] = 'schedd' - else: - self.logger.info("OLD Publisher: set asoworker=asoless in transferdb") - updateDoc['asoworker'] = 'asoless' - try: - self.crabserver.post(api='filetransfers', data=encodeRequest(updateDoc)) - except HTTPException as hte: - msg = "Error uploading document to database." - msg += " Transfer submission failed." - msg += "\n%s" % (str(hte.headers)) - returnMsg['error'] = msg - else: - # This means it is in database and we need only update specific fields. - newDoc = {'id': doc['id'], - 'username': doc['username'], - 'taskname': doc['taskname'], - 'start_time': self.aso_start_timestamp, - 'source': doc['source'], - 'source_lfn': doc['source_lfn'], - 'filesize': doc['filesize'], - 'transfer_state': doc.get('state', 'NEW').upper(), - 'publish': doc['publish'], - 'publication_state': 'NEW' if doc['publish'] else 'NOT_REQUIRED', - 'job_id': doc['jobid'], - 'job_retry_count': doc['job_retry_count'], - 'transfer_retry_count': 0, - 'subresource': 'updateDoc'} - try: - self.crabserver.post(api='fileusertransfers', data=encodeRequest(newDoc)) - except HTTPException as hte: - msg = "Error updating document in database." - msg += " Transfer submission failed." - msg += "\n%s" % (str(hte.headers)) - returnMsg['error'] = msg - # Previous post resets asoworker to NULL. This is not good, so we set it again - # using a different API to update the transfersDB record - updateDoc = {} - updateDoc['list_of_ids'] = [newDoc['id']] - updateDoc['list_of_transfer_state'] = [newDoc['transfer_state']] - updateDoc['subresource'] = 'updateTransfers' - if os.path.exists('USE_NEW_PUBLISHER'): - self.logger.info("USE_NEW_PUBLISHER: set asoworker=schedd in transferdb") - updateDoc['asoworker'] = 'schedd' - else: - self.logger.info("OLD Publisher: set asoworker=asoless in transferdb") - updateDoc['asoworker'] = 'asoless' - try: - self.crabserver.post(api='filetransfers', data=encodeRequest(updateDoc)) - except HTTPException as hte: - msg = "Error uploading document to database." - msg += " Transfer submission failed." 
- msg += "\n%s" % (str(hte.headers)) - returnMsg['error'] = msg - if toTransfer: - if not 'publishname' in newDoc: - newDoc['publishname'] = self.publishname - if not 'checksums' in newDoc: - newDoc['checksums'] = doc['checksums'] - if not 'destination_lfn' in newDoc: - newDoc['destination_lfn'] = doc['destination_lfn'] - if not 'destination' in newDoc: - newDoc['destination'] = doc['destination'] - with open('task_process/transfers.txt', 'a+') as transfers_file: - transfer_dump = json.dumps(newDoc) - transfers_file.write(transfer_dump+"\n") - if not os.path.exists('task_process/RestInfoForFileTransfers.json'): - #if not os.path.exists('task_process/rest_filetransfers.txt'): - restInfo = {'host':self.rest_host, - 'dbInstance': self.db_instance, - 'proxyfile': self.proxy} - with open('task_process/RestInfoForFileTransfers.json', 'w') as fp: - json.dump(restInfo, fp) - #with open('task_process/rest_filetransfers.txt', 'w+') as rest_file: - # rest_file.write(self.rest_url + '\n') - # rest_file.write(self.proxy) - else: - if not 'publishname' in newDoc: - newDoc['publishname'] = self.publishname - if not 'checksums' in newDoc: - newDoc['checksums'] = doc['checksums'] - if not 'destination_lfn' in newDoc: - newDoc['destination_lfn'] = doc['destination_lfn'] - if not 'destination' in newDoc: - newDoc['destination'] = doc['destination'] - with open('task_process/transfers_direct.txt', 'a+') as transfers_file: - transfer_dump = json.dumps(newDoc) - transfers_file.write(transfer_dump+"\n") - if not os.path.exists('task_process/RestInfoForFileTransfers.json'): - #if not os.path.exists('task_process/rest_filetransfers.txt'): - restInfo = {'host':self.rest_host, - 'dbInstance': self.db_instance, - 'proxyfile': self.proxy} - with open('task_process/RestInfoForFileTransfers.json','w') as fp: - json.dump(restInfo, fp) - #with open('task_process/rest_filetransfers.txt', 'w+') as rest_file: - # rest_file.write(self.rest_host + self.rest_url + '\n') - # rest_file.write(self.proxy) + if not self.found_doc_in_db: + # This means that it was not founded in DB and we will have to insert new doc + newDoc = {'id': doc['_id'], + 'username': doc['user'], + 'taskname': doc['workflow'], + 'start_time': self.aso_start_timestamp, + 'destination': doc['destination'], + 'destination_lfn': doc['destination_lfn'], + 'source': doc['source'], + 'source_lfn': doc['source_lfn'], + 'filesize': doc['size'], + 'publish': doc['publish'], + 'transfer_state': doc['state'].upper(), + 'publication_state': 'NEW' if doc['publish'] else 'NOT_REQUIRED', + 'job_id': doc['jobid'], + 'job_retry_count': doc['job_retry_count'], + 'type': doc['type'], + 'rest_host': doc['rest_host'], + 'rest_uri': doc['rest_uri']} + try: + self.crabserver.put(api='fileusertransfers', data=encodeRequest(newDoc)) + except HTTPException as hte: + msg = "Error uploading document to database." + msg += " Transfer submission failed." 
+ msg += "\n%s" % (str(hte.headers)) + returnMsg['error'] = msg + updateDoc={'subresource':'updateTransfers', 'list_of_ids':[doc['_id']]} + updateDoc['list_of_transfer_state'] = [newDoc['transfer_state']] + # make sure that asoworker field in transfersdb is always filled, since + # otherwise whichever Publisher process looks first for things to do, grabs them + # https://github.com/dmwm/CRABServer/blob/8012e1297759bab620d89c8cb253f1832b4eb466/src/python/Databases/FileTransfersDB/Oracle/FileTransfers/FileTransfers.py#L27-L33 + # but since PUT API ignores an asoworker argument when inserting + # https://github.com/dmwm/CRABServer/blob/43f6377447922d46353072e86d960e3c78967a17/src/python/CRABInterface/RESTFileUserTransfers.py#L122-L125 + # we need to update the record after insetion with a POST, which requires list of ids and states + # https://github.com/dmwm/CRABServer/blob/43f6377447922d46353072e86d960e3c78967a17/src/python/CRABInterface/RESTFileTransfers.py#L131-L133 + updateDoc['asoworker'] = 'schedd' + try: + self.crabserver.post(api='filetransfers', data=encodeRequest(updateDoc)) + except HTTPException as hte: + msg = "Error uploading document to database." + msg += " Transfer submission failed." + msg += "\n%s" % (str(hte.headers)) + returnMsg['error'] = msg else: - returnMsg = self.couch_database.commitOne(doc)[0] + # This means it is in database and we need only update specific fields. + newDoc = {'id': doc['id'], + 'username': doc['username'], + 'taskname': doc['taskname'], + 'start_time': self.aso_start_timestamp, + 'source': doc['source'], + 'source_lfn': doc['source_lfn'], + 'filesize': doc['filesize'], + 'transfer_state': doc.get('state', 'NEW').upper(), + 'publish': doc['publish'], + 'publication_state': 'NEW' if doc['publish'] else 'NOT_REQUIRED', + 'job_id': doc['jobid'], + 'job_retry_count': doc['job_retry_count'], + 'transfer_retry_count': 0, + 'subresource': 'updateDoc'} + try: + self.crabserver.post(api='fileusertransfers', data=encodeRequest(newDoc)) + except HTTPException as hte: + msg = "Error updating document in database." + msg += " Transfer submission failed." + msg += "\n%s" % (str(hte.headers)) + returnMsg['error'] = msg + # Previous post resets asoworker to NULL. This is not good, so we set it again + # using a different API to update the transfersDB record + updateDoc = {} + updateDoc['list_of_ids'] = [newDoc['id']] + updateDoc['list_of_transfer_state'] = [newDoc['transfer_state']] + updateDoc['subresource'] = 'updateTransfers' + updateDoc['asoworker'] = 'schedd' + try: + self.crabserver.post(api='filetransfers', data=encodeRequest(updateDoc)) + except HTTPException as hte: + msg = "Error uploading document to database." + msg += " Transfer submission failed." 
+ msg += "\n%s" % (str(hte.headers)) + returnMsg['error'] = msg + if toTransfer: + if not 'publishname' in newDoc: + newDoc['publishname'] = self.publishname + if not 'checksums' in newDoc: + newDoc['checksums'] = doc['checksums'] + if not 'destination_lfn' in newDoc: + newDoc['destination_lfn'] = doc['destination_lfn'] + if not 'destination' in newDoc: + newDoc['destination'] = doc['destination'] + with open('task_process/transfers.txt', 'a+') as transfers_file: + transfer_dump = json.dumps(newDoc) + transfers_file.write(transfer_dump+"\n") + if not os.path.exists('task_process/RestInfoForFileTransfers.json'): + #if not os.path.exists('task_process/rest_filetransfers.txt'): + restInfo = {'host':self.rest_host, + 'dbInstance': self.db_instance, + 'proxyfile': self.proxy} + with open('task_process/RestInfoForFileTransfers.json', 'w') as fp: + json.dump(restInfo, fp) + #with open('task_process/rest_filetransfers.txt', 'w+') as rest_file: + # rest_file.write(self.rest_url + '\n') + # rest_file.write(self.proxy) + else: + if not 'publishname' in newDoc: + newDoc['publishname'] = self.publishname + if not 'checksums' in newDoc: + newDoc['checksums'] = doc['checksums'] + if not 'destination_lfn' in newDoc: + newDoc['destination_lfn'] = doc['destination_lfn'] + if not 'destination' in newDoc: + newDoc['destination'] = doc['destination'] + with open('task_process/transfers_direct.txt', 'a+') as transfers_file: + transfer_dump = json.dumps(newDoc) + transfers_file.write(transfer_dump+"\n") + if not os.path.exists('task_process/RestInfoForFileTransfers.json'): + #if not os.path.exists('task_process/rest_filetransfers.txt'): + restInfo = {'host':self.rest_host, + 'dbInstance': self.db_instance, + 'proxyfile': self.proxy} + with open('task_process/RestInfoForFileTransfers.json','w') as fp: + json.dump(restInfo, fp) + #with open('task_process/rest_filetransfers.txt', 'w+') as rest_file: + # rest_file.write(self.rest_host + self.rest_url + '\n') + # rest_file.write(self.proxy) return returnMsg ##= = = = = ASOServerJob = = = = = = = = = = = = = = = = = = = = = = = = = = = = @@ -983,9 +951,7 @@ def get_transfers_statuses(self): """ Retrieve the status of all transfers from the cached file 'aso_status.json' or by querying an ASO database view if the file is more than 5 minutes old - or if we injected a document after the file was last updated. Calls to - get_transfers_statuses_fallback() have been removed to not generate load - on couch. + or if we injected a document after the file was last updated. """ statuses = [] query_view = False @@ -1024,127 +990,57 @@ def get_transfers_statuses(self): self.logger.debug("Changing query_view back to true") query_view = True break - if isCouchDBURL(self.aso_db_url): - if query_view: - query = {'reduce': False, 'key': self.reqname, 'stale': 'update_after'} - self.logger.debug("Querying ASO view.") - try: - view_results = self.couch_database.loadView('AsyncTransfer', 'JobsIdsStatesByWorkflow', query)['rows'] - view_results_dict = {} - for view_result in view_results: - view_results_dict[view_result['id']] = view_result - except: - msg = "Error while querying the NoSQL (Couch) database." 
- self.logger.exception(msg) - self.build_failed_cache() - raise TransferCacheLoadError(msg) - aso_info = { - "query_timestamp": time.time(), - "query_succeded": True, - "query_jobid": self.job_id, - "results": view_results_dict, - } - tmp_fname = "aso_status.%d.json" % (os.getpid()) - with open(tmp_fname, 'w') as fd: - json.dump(aso_info, fd) - os.rename(tmp_fname, "aso_status.json") - else: - self.logger.debug("Using cached ASO results.") - #Is this ever happening? - if not aso_info: - raise TransferCacheLoadError("Unexpected error. aso_info is not set. Deferring postjob") - for doc_info in self.docs_in_transfer: - doc_id = doc_info['doc_id'] - if doc_id not in aso_info.get("results", {}): - ## This is a legitimate use case. It can happen if the document has just been injected - ## (e.g.: transfer injected in the PJ and not in the WN). In this case the cache does not - ## have the document yet, so we defer the postjob. - msg = "Document with id %s not found in the transfer cache." % doc_id - raise TransferCacheLoadError(msg) - ## Use the start_time parameter to check whether the transfer state in aso_info - ## corresponds to the document we have to monitor. The reason why we have to do - ## this check is because the information in aso_info might have been obtained by - ## querying an ASO CouchDB view, which can return stale results. We don't mind - ## having stale results as long as they correspond to the documents we have to - ## monitor (the worst that can happen is that we will wait more than necessary - ## to see the transfers in a terminal state). But it could be that some results - ## returned by the view correspond to documents injected in a previous job retry - ## (or restart), and this is not what we want. If the start_time in aso_info is - ## not the same as in the document (we saved the start_time from the documents - ## in self.docs_in_transfer), then the view result corresponds to a previous job - ## retry. In that case, return transfer state = 'unknown'. - transfer_status = aso_info['results'][doc_id]['value']['state'] - if 'start_time' in aso_info['results'][doc_id]['value']: - start_time = aso_info['results'][doc_id]['value']['start_time'] - if start_time == doc_info['start_time']: - statuses.append(transfer_status) - else: - msg = "Got stale transfer state '%s' for document %s" % (transfer_status, doc_id) - msg += " (got start_time = %s, while in the document is start_time = %s)." % (start_time, doc_info['start_time']) - msg += " Transfer state may correspond to a document from a previous job retry." - msg += " Returning transfer state = 'unknown'." - self.logger.info(msg) - statuses.append('unknown') - else: - statuses.append(transfer_status) - elif not isCouchDBURL(self.aso_db_url): - if query_view: - self.logger.debug("Querying ASO RDBMS database.") - try: - view_results = self.crabserver.get(api='fileusertransfers', data=encodeRequest({'subresource': 'getTransferStatus', - 'username': str(self.job_ad['CRAB_UserHN']), - 'taskname': self.reqname})) - view_results_dict = oracleOutputMapping(view_results, 'id') - # There is so much noise values in aso_status.json file. So lets provide a new file structure. 
- # We will not run ever for one task which can use also RDBMS and CouchDB - # New Structure for view_results_dict is - # {"DocumentHASHID": [{"id": "DocumentHASHID", "start_time": timestamp, "transfer_state": NUMBER!, "last_update": timestamp}]} - for document in view_results_dict: - view_results_dict[document][0]['state'] = TRANSFERDB_STATES[view_results_dict[document][0]['transfer_state']].lower() - except: - msg = "Error while querying the RDBMS (Oracle) database." - self.logger.exception(msg) - self.build_failed_cache() - raise TransferCacheLoadError(msg) - aso_info = { - "query_timestamp": time.time(), - "query_succeded": True, - "query_jobid": self.job_id, - "results": view_results_dict, - } - tmp_fname = "aso_status.%d.json" % (os.getpid()) - with open(tmp_fname, 'w') as fd: - json.dump(aso_info, fd) - os.rename(tmp_fname, "aso_status.json") - else: - self.logger.debug("Using cached ASO results.") - if not aso_info: - raise TransferCacheLoadError("Unexpected error. aso_info is not set. Deferring postjob") - for doc_info in self.docs_in_transfer: - doc_id = doc_info['doc_id'] - if doc_id not in aso_info.get("results", {}): - msg = "Document with id %s not found in the transfer cache. Deferring PJ." % doc_id - raise TransferCacheLoadError(msg) ## Not checking timestamps for oracle since we are not injecting from WN - ## and it cannot happen that condor restarts screw up things. - transfer_status = aso_info['results'][doc_id][0]['state'] - statuses.append(transfer_status) + if query_view: + self.logger.debug("Querying ASO RDBMS database.") + try: + view_results = self.crabserver.get(api='fileusertransfers', data=encodeRequest({'subresource': 'getTransferStatus', + 'username': str(self.job_ad['CRAB_UserHN']), + 'taskname': self.reqname})) + view_results_dict = oracleOutputMapping(view_results, 'id') + # There is so much noise values in aso_status.json file. So lets provide a new file structure. + # We will not run ever for one task which can use also RDBMS + # New Structure for view_results_dict is + # {"DocumentHASHID": [{"id": "DocumentHASHID", "start_time": timestamp, "transfer_state": NUMBER!, "last_update": timestamp}]} + for document in view_results_dict: + view_results_dict[document][0]['state'] = TRANSFERDB_STATES[view_results_dict[document][0]['transfer_state']].lower() + except: + msg = "Error while querying the RDBMS (Oracle) database." + self.logger.exception(msg) + self.build_failed_cache() + raise TransferCacheLoadError(msg) + aso_info = { + "query_timestamp": time.time(), + "query_succeded": True, + "query_jobid": self.job_id, + "results": view_results_dict, + } + tmp_fname = "aso_status.%d.json" % (os.getpid()) + with open(tmp_fname, 'w') as fd: + json.dump(aso_info, fd) + os.rename(tmp_fname, "aso_status.json") + else: + self.logger.debug("Using cached ASO results.") + if not aso_info: + raise TransferCacheLoadError("Unexpected error. aso_info is not set. Deferring postjob") + for doc_info in self.docs_in_transfer: + doc_id = doc_info['doc_id'] + if doc_id not in aso_info.get("results", {}): + msg = "Document with id %s not found in the transfer cache. Deferring PJ." % doc_id + raise TransferCacheLoadError(msg) ## Not checking timestamps for oracle since we are not injecting from WN + ## and it cannot happen that condor restarts screw up things. 
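The status cache handling retained above writes aso_status.json through a per-pid temporary file followed by os.rename(), so a concurrently running post-job never reads a half-written cache. Standalone version with empty results and a hypothetical job id:

import json
import os
import time

aso_info = {"query_timestamp": time.time(),
            "query_succeded": True,          # key spelled as in the PostJob code
            "query_jobid": "1-1",            # hypothetical job id
            "results": {}}
tmp_fname = "aso_status.%d.json" % (os.getpid())
with open(tmp_fname, 'w') as fd:
    json.dump(aso_info, fd)
os.rename(tmp_fname, "aso_status.json")      # atomic on the same filesystem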
+ transfer_status = aso_info['results'][doc_id][0]['state'] + statuses.append(transfer_status) return statuses ##= = = = = ASOServerJob = = = = = = = = = = = = = = = = = = = = = = = = = = = = def load_transfer_document(self, doc_id): """ - Wrapper to load a document from CouchDB or RDBMS, catching exceptions. + Wrapper to load a document from RDBMS, catching exceptions. """ doc = None try: - if not isCouchDBURL(self.aso_db_url): - doc = self.getDocByID(doc_id) - else: - doc = self.couch_database.document(doc_id) - except CMSCouch.CouchError as cee: - msg = "Error retrieving document from ASO database for ID %s: %s" % (doc_id, str(cee)) - self.logger.error(msg) + doc = self.getDocByID(doc_id) except HTTPException as hte: msg = "Error retrieving document from ASO database for ID %s: %s" % (doc_id, str(hte.headers)) self.logger.error(msg) @@ -1169,10 +1065,7 @@ def get_transfers_statuses_fallback(self): for doc_info in self.docs_in_transfer: doc_id = doc_info['doc_id'] doc = self.load_transfer_document(doc_id) - if isCouchDBURL(self.aso_db_url): - status = doc['state'] if doc else 'unknown' - else: - status = doc['transfer_state'] if doc else 'unknown' + status = doc['transfer_state'] if doc else 'unknown' statuses.append(status) return statuses @@ -1206,7 +1099,7 @@ def cancel(self, doc_ids_reasons=None, max_retries=0): time.sleep(3*60) msg = "This is cancellation retry number %d." % (retry) self.logger.info(msg) - for doc_id, reason in doc_ids_reasons.iteritems(): + for doc_id, reason in doc_ids_reasons.items(): if doc_id in cancelled: continue msg = "Cancelling ASO transfer %s" % (doc_id) @@ -1223,57 +1116,33 @@ def cancel(self, doc_ids_reasons=None, max_retries=0): doc['state'] = 'killed' doc['end_time'] = now username = doc['username'] - if isCouchDBURL(self.aso_db_url): - # In case it is still CouchDB leave this in this loop and for RDBMS add - # everything to a list and update this with one call for multiple files. - # One bad thing that we are not saving failure reason. It should not be a big deal - # to implement, but I leave it so far to discuss how it is best to do it. - # I think best is to show the last reason and not all. - if reason: - if doc['failure_reason']: - if isinstance(doc['failure_reason'], list): - doc['failure_reason'].append(reason) - elif isinstance(doc['failure_reason'], str): - doc['failure_reason'] = [doc['failure_reason'], reason] - else: - doc['failure_reason'] = reason - res = self.couch_database.commitOne(doc)[0] - if 'error' in res: - msg = "Error cancelling ASO transfer %s: %s" % (doc_id, res) - self.logger.warning(msg) - if retry == max_retries: - not_cancelled.append((doc_id, msg)) - else: - cancelled.append(doc_id) - else: - transfersToKill.append(doc_id) - - if not isCouchDBURL(self.aso_db_url): - # Now this means that we have a list of ids which needs to be killed - # First try to kill ALL in one API call - newDoc = {'listOfIds': transfersToKill, - 'publish' : 0, - 'username': username, - 'subresource': 'killTransfersById'} - try: - killedFiles = self.crabserver.post(api='fileusertransfers', data=encodeRequest(newDoc, ['listOfIds'])) - not_cancelled = killedFiles[0]['result'][0]['failedKill'] - cancelled = killedFiles[0]['result'][0]['killed'] - break # no need to retry - except HTTPException as hte: - msg = "Error setting KILL status in database." - msg += " Transfer KILL failed." 
- msg += "\n%s" % (str(hte.headers)) - self.logger.warning(msg) - not_cancelled = transfersToKill - cancelled = [] - except Exception as ex: - msg = "Unknown error setting KILL status in database." - msg += " Transfer KILL failed." - msg += "\n%s" % (str(ex)) - self.logger.error(msg) - not_cancelled = transfersToKill - cancelled = [] + transfersToKill.append(doc_id) + + # Now this means that we have a list of ids which needs to be killed + # First try to kill ALL in one API call + newDoc = {'listOfIds': transfersToKill, + 'publish' : 0, + 'username': username, + 'subresource': 'killTransfersById'} + try: + killedFiles = self.crabserver.post(api='fileusertransfers', data=encodeRequest(newDoc, ['listOfIds'])) + not_cancelled = killedFiles[0]['result'][0]['failedKill'] + cancelled = killedFiles[0]['result'][0]['killed'] + break # no need to retry + except HTTPException as hte: + msg = "Error setting KILL status in database." + msg += " Transfer KILL failed." + msg += "\n%s" % (str(hte.headers)) + self.logger.warning(msg) + not_cancelled = transfersToKill + cancelled = [] + except Exception as ex: + msg = "Unknown error setting KILL status in database." + msg += " Transfer KILL failed." + msg += "\n%s" % (str(ex)) + self.logger.error(msg) + not_cancelled = transfersToKill + cancelled = [] # Ok Now lets do a double check on doc_ids which failed to update one by one. @@ -1445,7 +1314,7 @@ def get_defer_num(self): try: #open in rb+ mode instead of w so if the schedd crashes between the open and the write #we do not end up with an empty (corrupted) file. (Well, this can only happens the first try) - with open(DEFER_INFO_FILE, 'rb+') as fd: + with open(DEFER_INFO_FILE, 'r+') as fd: #put some spaces to overwrite possibly longer numbers (should never happen, but..) fd.write(str(defer_num + 1) + ' '*10) except IOError as e: @@ -1460,15 +1329,8 @@ def get_defer_num(self): ## = = = = = PostJob = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = def create_taskwebdir(self): - ## Create the task web directory in the schedd. 
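A note on the get_defer_num hunk above: the mode change from 'rb+' to 'r+' is needed because the code writes a str, and under Python 3 a file opened in binary mode accepts only bytes, so the old mode would raise TypeError at fd.write(). A self-contained illustration (the path used here is a throwaway example, not the real DEFER_INFO_FILE):

    defer_info_file = "/tmp/defer_num.example.txt"  # illustrative path only

    with open(defer_info_file, "w") as fd:
        fd.write("0")

    defer_num = 3
    # 'r+' keeps the existing content if the process dies between open() and
    # write(), and accepts str under Python 3; 'rb+' would require bytes.
    with open(defer_info_file, "r+") as fd:
        fd.write(str(defer_num + 1) + " " * 10)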
- self.logpath = os.path.expanduser("~/%s" % (self.reqname)) - try: - os.makedirs(self.logpath) - except OSError as ose: - if ose.errno != errno.EEXIST: - msg = "Failed to create log web-shared directory %s" % (self.logpath) - self.logger.error(msg) - raise + ## The task web directory in the schedd has been created by AdjustSites.py + self.logpath = os.path.realpath('WEB_DIR') ## = = = = = PostJob = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = @@ -2424,12 +2286,16 @@ def upload_input_files_metadata(self): "outdatasetname" : "/FakeDataset/fakefile-FakePublish-5b6a581e4ddd41b130711a045d5fecb9/USER", "directstageout" : direct_stageout } - configreq = configreq.items() + #TODO: there could be a better py3 way to get lists of outfileruns/lumis outfileruns = [] outfilelumis = [] - for run, lumis in ifile[u'runs'].iteritems(): + for run, lumis in ifile[u'runs'].items(): outfileruns.append(str(run)) outfilelumis.append(','.join(map(str, lumis))) + + configreq = [item for item in configreq.items()] # make a real list of (k,v) pairs as rest_api requires + #configreq['outfileruns'] = [run for run in outfileruns] + #configreq['outfilelumis'] = [lumis for lumis in outfilelumis] for run in outfileruns: configreq.append(("outfileruns", run)) for lumis in outfilelumis: @@ -2498,7 +2364,8 @@ def upload_output_files_metadata(self): 'directstageout' : int(file_info['direct_stageout']), 'globalTag' : 'None' } - configreq = configreq.items() + configreq = [item for item in configreq.items()] # make a real list of (k,v) pairs as rest_api requires + #configreq = configreq.items() if 'outfileruns' in file_info: for run in file_info['outfileruns']: configreq.append(("outfileruns", run)) @@ -2604,8 +2471,6 @@ def check_required_job_ad_attrs(self): """ required_job_ad_attrs = {'CRAB_UserRole': {'allowUndefined': True}, 'CRAB_UserGroup': {'allowUndefined': True}, - 'CRAB_ASOURL': {'allowUndefined': False}, - 'CRAB_ASODB': {'allowUndefined': True}, 'CRAB_AsyncDest': {'allowUndefined': False}, 'CRAB_DBSURL': {'allowUndefined': False}, 'DESIRED_CMSDataset': {'allowUndefined': True}, @@ -2760,7 +2625,8 @@ def get_output_file_info(filename): # Creating a string like '100:20,101:21,105:20...' # where the lumi is followed by a colon and number of events in that lumi. # Note that the events per lumi information is provided by WMCore version >=1.1.2 when parsing FWJR. - lumisAndEvents = ','.join(['{0}:{1}'.format(str(lumi), str(numEvents)) for lumi, numEvents in lumis.iteritems()]) + lumisAndEvents = ','.join(['{0}:{1}'.format(str(lumi), str(numEvents)) + for lumi, numEvents in lumis.items()]) file_info['outfilelumis'].append(lumisAndEvents) else: msg = "Output file info for %s not found in job report." % (orig_file_name) @@ -2832,7 +2698,7 @@ def set_state_ClassAds(self, state, exitCode=None): self.logger.warning(" -----> Failed to set job ClassAd attributes -----") maxSleep = 2*counter -1 self.logger.warning("Sleeping for %d minute at most...", maxSleep) - time.sleep(60 * random.randint(2*(counter/3), maxSleep+1)) + time.sleep(60 * random.randint(2*(counter//3), maxSleep+1)) else: self.logger.error("Failed to set job ClassAd attributes for %d times, will not retry. 
Dashboard may report stale job status/exit-code.", limit) @@ -3131,23 +2997,6 @@ def setUp(self): open(self.json_name, 'w').write(json.dumps(self.generateJobJson())) - def makeTempFile(self, size, pfn): - fh, path = tempfile.mkstemp() - try: - inputString = "CRAB3POSTJOBUNITTEST" - os.write(fh, (inputString * ((size/len(inputString))+1))[:size]) - os.close(fh) - cmd = "env -u LD_LIBRAY_PATH lcg-cp -b -D srmv2 -v file://%s %s" % (path, pfn) - print(cmd) - status, res = commands.getstatusoutput(cmd) - if status: - exmsg = "Couldn't make file: %s" % (res) - raise RuntimeError(exmsg) - finally: - if os.path.exists(path): - os.unlink(path) - - def getLevelOneDir(self): return datetime.datetime.now().strftime("%Y-%m") @@ -3157,22 +3006,6 @@ def getLevelTwoDir(self): def getUniqueFilename(self): return "%s-postjob.txt" % (uuid.uuid4()) - def testNonexistent(self): - self.full_args.extend(['/store/temp/user/meloam/CRAB3-UNITTEST-NONEXISTENT/b/c/', - '/store/user/meloam/CRAB3-UNITTEST-NONEXISTENT/b/c', - self.getUniqueFilename()]) - self.assertNotEqual(self.postjob.execute(*self.full_args), 0) - - source_prefix = "srm://dcache07.unl.edu:8443/srm/v2/server?SFN=/mnt/hadoop/user/uscms01/pnfs/unl.edu/data4/cms" - def testExistent(self): - source_dir = "/store/temp/user/meloam/CRAB3-UnitTest/%s/%s" % \ - (self.getLevelOneDir(), self.getLevelTwoDir()) - source_file = self.getUniqueFilename() - source_lfn = "%s/%s" % (source_dir, source_file) - dest_dir = source_dir.replace("temp/user", "user") - self.makeTempFile(200, "%s/%s" %(self.source_prefix, source_lfn)) - self.full_args.extend([source_dir, dest_dir, source_file]) - self.assertEqual(self.postjob.execute(*self.full_args), 0) def tearDown(self): if os.path.exists(self.json_name): diff --git a/src/python/TaskWorker/Actions/PreDAG.py b/src/python/TaskWorker/Actions/PreDAG.py index 123ce6a7d2..d1b440ae9c 100644 --- a/src/python/TaskWorker/Actions/PreDAG.py +++ b/src/python/TaskWorker/Actions/PreDAG.py @@ -32,10 +32,10 @@ import tempfile import functools import subprocess +from ast import literal_eval import classad -from ast import literal_eval from WMCore.DataStructs.LumiList import LumiList from ServerUtilities import getLock, newX509env, MAX_IDLE_JOBS, MAX_POST_JOBS @@ -67,13 +67,15 @@ def __init__(self): def readJobStatus(self): """Read the job status(es) from the cache_status file and save the relevant info into self.statusCacheInfo""" #XXX Maybe the status_cache filname should be in a variable in ServerUtilities? - if not os.path.exists("task_process/status_cache.txt"): + if not os.path.exists("task_process/status_cache.pkl"): return - with open("task_process/status_cache.txt") as fd: - fileContent = fd.read() - #TODO Splitting '\n' and accessing the second element is really fragile. 
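The replacement lines just below move PreDAG from literal_eval on a text file to the pickled status cache written by cache_status on the schedd. A rough sketch of the reading side, assuming the cache is a plain dict with a 'nodes' key as the new code expects (read_status_cache is an illustrative name; note also that pickle must be imported in the module, and that pickle files need binary mode, which is the same reason TaskManagerBootstrap below switches its open() calls to 'rb'/'wb'):

    import pickle

    def read_status_cache(path="task_process/status_cache.pkl"):
        # Return the per-node status dict, or an empty dict if the cache
        # is missing or unreadable.
        try:
            with open(path, "rb") as fd:
                status_cache = pickle.load(fd)
        except (OSError, pickle.UnpicklingError):
            return {}
        nodes = status_cache.get("nodes", {})
        # 'DagStatus' is bookkeeping for the whole DAG, not a job node.
        nodes.pop("DagStatus", None)
        return nodes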
- #It is what it is done in the client though, but we should change it - self.statusCacheInfo = literal_eval(fileContent.split('\n')[2]) + with open("task_process/status_cache.pkl", 'rb') as fd: + statusCache = pickle.load(fd) + if not 'nodes' in statusCache: + return + self.statusCacheInfo = statusCache['nodes'] + if 'DagStatus' in self.statusCacheInfo: + del self.statusCacheInfo['DagStatus'] def readProcessedJobs(self): """Read processed job ids""" @@ -96,7 +98,7 @@ def completedJobs(self, stage, processFailed=True): stagere['processing'] = re.compile(r"^0-\d+$") stagere['tail'] = re.compile(r"^[1-9]\d*$") completedCount = 0 - for jobnr, jobdict in self.statusCacheInfo.iteritems(): + for jobnr, jobdict in self.statusCacheInfo.items(): state = jobdict.get('State') if stagere[stage].match(jobnr) and state in ('finished', 'failed'): if state == 'failed' and processFailed: @@ -172,9 +174,9 @@ def executeInternal(self, *args): # need to use user proxy as credential for talking with cmsweb config.TaskWorker.cmscert = os.environ.get('X509_USER_PROXY') - config.TaskWorker.cmskey = os.environ.get('X509_USER_PROXY') + config.TaskWorker.cmskey = os.environ.get('X509_USER_PROXY') config.TaskWorker.envForCMSWEB = newX509env(X509_USER_CERT=config.TaskWorker.cmscert, - X509_USER_KEY=config.TaskWorker.cmskey) + X509_USER_KEY=config.TaskWorker.cmskey) # need to get username from classAd to setup for Rucio access task_ad = classad.parseOne(open(os.environ['_CONDOR_JOB_AD'])) @@ -206,11 +208,20 @@ def executeInternal(self, *args): sumEventsThr += throughput sumEventsSize += eventsize count += 1 - eventsThr = sumEventsThr / count - eventsSize = sumEventsSize / count + if count: + eventsThr = sumEventsThr / count + eventsSize = sumEventsSize / count + self.logger.info("average throughput for %s jobs: %s evt/s", count, eventsThr) + self.logger.info("average eventsize for %s jobs: %s bytes", count, eventsSize) + else: + self.logger.info("No probe job output could be found") + eventsThr = 0 + eventsSize = 0 - self.logger.info("average throughput for %s jobs: %s evt/s", count, eventsThr) - self.logger.info("average eventsize for %s jobs: %s bytes", count, eventsSize) + if not count or not eventsThr or not eventsSize: + retmsg = "Splitting failed because all probe jobs failed or anyhow failed to provide estimates" + self.logger.error(retmsg) + return 1 maxSize = getattr(config.TaskWorker, 'automaticOutputSizeMaximum', 5 * 1000**3) maxEvents = (maxSize / eventsSize) if eventsSize > 0 else 0 @@ -266,7 +277,10 @@ def executeInternal(self, *args): creator = DagmanCreator(config, crabserver=None, rucioClient=rucioClient) with config.TaskWorker.envForCMSWEB: creator.createSubdag(split_result.result, task=task, parent=parent, stage=self.stage) - self.submitSubdag('RunJobs{0}.subdag'.format(self.prefix), getattr(config.TaskWorker, 'maxIdle', MAX_IDLE_JOBS), getattr(config.TaskWorker, 'maxPost', MAX_POST_JOBS), self.stage) + self.submitSubdag('RunJobs{0}.subdag'.format(self.prefix), + getattr(config.TaskWorker, 'maxIdle', MAX_IDLE_JOBS), + getattr(config.TaskWorker, 'maxPost', MAX_POST_JOBS), + self.stage) except TaskWorkerException as e: retmsg = "DAG creation failed with:\n{0}".format(e) self.logger.error(retmsg) @@ -331,7 +345,7 @@ def adjustLumisForCompletion(self, task, unprocessed): # Now we turn lumis it into something like: # lumis=['1, 33, 35, 35, 37, 47, 49, 75, 77, 130, 133, 136','1,45,50,80'] # which is the format expected by buildLumiMask in the splitting algorithm - lumis = [",".join(str(l) for l in 
functools.reduce(lambda x, y:x + y, missing_compact[run])) for run in runs] + lumis = [",".join(str(l) for l in functools.reduce(lambda x, y: x + y, missing_compact[run])) for run in runs] task['tm_split_args']['runs'] = runs task['tm_split_args']['lumis'] = lumis diff --git a/src/python/TaskWorker/Actions/PreJob.py b/src/python/TaskWorker/Actions/PreJob.py index d5686d1484..6d10d96bc9 100644 --- a/src/python/TaskWorker/Actions/PreJob.py +++ b/src/python/TaskWorker/Actions/PreJob.py @@ -14,7 +14,6 @@ import CMSGroupMapper - class PreJob: """ Need a doc string here. @@ -434,28 +433,25 @@ def redo_sites(self, new_submit_text, crab_retry, use_resubmit_info): self.logger.error("Can not submit since DESIRED_Sites list is empty") self.prejob_exit_code = 1 sys.exit(self.prejob_exit_code) + ## Make sure that attributest which will be used in MatchMaking are SORTED lists + available = list(available) + available.sort() + datasites = list(datasites) + datasites.sort() ## Add DESIRED_SITES to the Job..submit content. new_submit_text = '+DESIRED_SITES="%s"\n%s' % (",".join(available), new_submit_text) new_submit_text = '+DESIRED_CMSDataLocations="%s"\n%s' % (",".join(datasites), new_submit_text) return new_submit_text - def touch_logs(self, crab_retry): """ - Create the log web-shared directory for the task and create the + Use the log web-shared directory created by AdjustSites.py for the task and create the job_out...txt and postjob...txt files with default messages. """ try: taskname = self.task_ad['CRAB_ReqName'] - logpath = os.path.expanduser("~/%s" % (taskname)) - try: - os.makedirs(logpath) - except OSError as oe: - if oe.errno != errno.EEXIST: - msg = "Failed to create log web-shared directory %s" % (logpath) - self.logger.info(msg) - return + logpath = os.path.relpath('WEB_DIR') job_retry = "%s.%s" % (self.job_id, crab_retry) fname = os.path.join(logpath, "job_out.%s.txt" % job_retry) with open(fname, 'w') as fd: diff --git a/src/python/TaskWorker/Actions/Recurring/BanDestinationSites.py b/src/python/TaskWorker/Actions/Recurring/BanDestinationSites.py index f94061d2a5..333201b24f 100644 --- a/src/python/TaskWorker/Actions/Recurring/BanDestinationSites.py +++ b/src/python/TaskWorker/Actions/Recurring/BanDestinationSites.py @@ -3,7 +3,7 @@ import sys import json import shutil -import urllib2 +import urllib.request import logging import traceback @@ -41,7 +41,7 @@ def writeBannedSitesToFile(self, bannedSites, saveLocation): def execute(self): blacklistedSites = [] try: - usableSites = urllib2.urlopen(self.config.Sites.DashboardURL).read() + usableSites = urllib.request.urlopen(self.config.Sites.DashboardURL).read() except Exception as e: # If exception is got, don`t change anything and previous data will be used self.logger.error("Got exception in retrieving usable sites list from %s. 
Exception: %s", diff --git a/src/python/TaskWorker/Actions/Recurring/FMDCleaner.py b/src/python/TaskWorker/Actions/Recurring/FMDCleaner.py index 141df4e903..5850ada48a 100644 --- a/src/python/TaskWorker/Actions/Recurring/FMDCleaner.py +++ b/src/python/TaskWorker/Actions/Recurring/FMDCleaner.py @@ -5,9 +5,8 @@ the filemetadata delete API """ import sys -import urllib import logging -from httplib import HTTPException +from http.client import HTTPException from RESTInteractions import CRABRest from TaskWorker.Actions.Recurring.BaseRecurringAction import BaseRecurringAction @@ -19,9 +18,9 @@ def _execute(self, crabserver): self.logger.info('Cleaning filemetadata older than 30 days..') ONE_MONTH = 24 * 30 try: - crabserver.delete(api='filemetadata', data=urllib.urlencode({'hours': ONE_MONTH})) + crabserver.delete(api='filemetadata', data=urlencode({'hours': ONE_MONTH})) #TODO return from the server a value (e.g.: ["ok"]) to see if everything is ok -# result = crabserver.delete(api='filemetadata', data=urllib.urlencode({'hours': ONE_MONTH}))[0]['result'][0] +# result = crabserver.delete(api='filemetadata', data=urlencode({'hours': ONE_MONTH}))[0]['result'][0] # self.logger.info('FMDCleaner, got %s' % result) except HTTPException as hte: self.logger.error(hte.headers) diff --git a/src/python/TaskWorker/Actions/Recurring/GenerateXML.py b/src/python/TaskWorker/Actions/Recurring/GenerateXML.py index 6be8bf9f1c..0381b65da7 100644 --- a/src/python/TaskWorker/Actions/Recurring/GenerateXML.py +++ b/src/python/TaskWorker/Actions/Recurring/GenerateXML.py @@ -1,11 +1,7 @@ # report health of TaskWorker to SLS import os import sys -import time -import json -import urllib import logging -import traceback import subprocess from datetime import datetime from TaskWorker.Actions.Recurring.BaseRecurringAction import BaseRecurringAction diff --git a/src/python/TaskWorker/Actions/Recurring/RenewRemoteProxies.py b/src/python/TaskWorker/Actions/Recurring/RenewRemoteProxies.py index 4aaf8641a6..a63fc0a1de 100644 --- a/src/python/TaskWorker/Actions/Recurring/RenewRemoteProxies.py +++ b/src/python/TaskWorker/Actions/Recurring/RenewRemoteProxies.py @@ -142,10 +142,10 @@ def get_proxy_from_MyProxy(self, ad): raise Exception("Failed to retrieve proxy.") return userproxy - def push_new_proxy_to_schedd(self, schedd, ad, proxy): + def push_new_proxy_to_schedd(self, schedd, ad, proxy, tokenDir): if not hasattr(schedd, 'refreshGSIProxy'): raise NotImplementedError() - with HTCondorUtils.AuthenticatedSubprocess(proxy) as (parent, rpipe): + with HTCondorUtils.AuthenticatedSubprocess(proxy, tokenDir) as (parent, rpipe): if not parent: schedd.refreshGSIProxy(ad['ClusterId'], ad['ProcID'], proxy, -1) results = rpipe.read() @@ -153,6 +153,8 @@ def push_new_proxy_to_schedd(self, schedd, ad, proxy): raise Exception("Failure when renewing HTCondor task proxy: '%s'" % results) def execute_schedd(self, schedd_name, collector): + + tokenDir = getattr(self.config.TaskWorker, 'SEC_TOKEN_DIRECTORY', None) self.logger.info("Updating tasks in schedd %s", schedd_name) self.logger.debug("Trying to locate schedd.") schedd_ad = collector.locate(htcondor.DaemonTypes.Schedd, schedd_name) @@ -196,7 +198,7 @@ def execute_schedd(self, schedd_name, collector): continue for ad in ad_list: try: - self.push_new_proxy_to_schedd(schedd, ad, proxyfile) + self.push_new_proxy_to_schedd(schedd, ad, proxyfile, tokenDir) except NotImplementedError: raise except Exception: diff --git a/src/python/TaskWorker/Actions/Recurring/TapeRecallStatus.py 
b/src/python/TaskWorker/Actions/Recurring/TapeRecallStatus.py index d5a6886c9d..42b9f3b328 100644 --- a/src/python/TaskWorker/Actions/Recurring/TapeRecallStatus.py +++ b/src/python/TaskWorker/Actions/Recurring/TapeRecallStatus.py @@ -19,33 +19,6 @@ class TapeRecallStatus(BaseRecurringAction): pollingTime = 60*4 # minutes rucioClient = None - def refreshSandbox(self, task): - - from WMCore.Services.UserFileCache.UserFileCache import UserFileCache - ufc = UserFileCache({'cert': task['user_proxy'], 'key': task['user_proxy'], - 'endpoint': task['tm_cache_url'], "pycurl": True}) - sandbox = task['tm_user_sandbox'].replace(".tar.gz", "") - debugFiles = task['tm_debug_files'].replace(".tar.gz", "") - sandboxPath = os.path.join("/tmp", sandbox) - debugFilesPath = os.path.join("/tmp", debugFiles) - try: - ufc.download(sandbox, sandboxPath, task['tm_username']) - ufc.download(debugFiles, debugFilesPath, task['tm_username']) - self.logger.info( - "Successfully touched input and debug sandboxes (%s and %s) of task %s (frontend: %s) using the '%s' username (request_id = %s).", - sandbox, debugFiles, task['tm_taskname'], task['tm_cache_url'], task['tm_username'], task['tm_DDM_reqid']) - except Exception as ex: - msg = "The CRAB3 server backend could not download the input and/or debug sandbox (%s and/or %s) " % ( - sandbox, debugFiles) - msg += "of task %s from the frontend (%s) using the '%s' username (request_id = %s). " % \ - (task['tm_taskname'], task['tm_cache_url'], task['tm_username'], task['tm_DDM_reqid']) - msg += "\nThis could be a temporary glitch, will try again in next occurrence of the recurring action." - msg += "Error reason:\n%s" % str(ex) - self.logger.info(msg) - finally: - if os.path.exists(sandboxPath): os.remove(sandboxPath) - if os.path.exists(debugFilesPath): os.remove(debugFilesPath) - def _execute(self, config, task): # setup logger @@ -110,12 +83,6 @@ def _execute(self, config, task): user_proxy = False self.logger.exception(twe) - if not 'S3' in recallingTask['tm_cache_url'].upper(): - # when using old crabcache had to worry about sandbox purging after 3 days - # Make sure the task sandbox in the crabcache is not deleted until the tape recall is completed - if user_proxy: - self.refreshSandbox(recallingTask) - # Retrieve status of recall request if not self.rucioClient: self.rucioClient = getNativeRucioClient(config=config, logger=self.logger) diff --git a/src/python/TaskWorker/Actions/RetryJob.py b/src/python/TaskWorker/Actions/RetryJob.py index 0ba0a8a091..133ba60c9c 100644 --- a/src/python/TaskWorker/Actions/RetryJob.py +++ b/src/python/TaskWorker/Actions/RetryJob.py @@ -134,7 +134,7 @@ def record_site(self, job_status): Need a doc string here. """ job_status_name = None - for name, code in JOB_RETURN_CODES._asdict().iteritems(): + for name, code in JOB_RETURN_CODES._asdict().items(): if code == job_status: job_status_name = name try: @@ -219,7 +219,7 @@ def check_memory_report(self): """ # If job was killed on the worker node, we probably don't have a FJR. if self.ad.get("RemoveReason", "").startswith("Removed due to memory use"): - job_rss = int(self.ad.get("ResidentSetSize","0"))/1000 + job_rss = int(self.ad.get("ResidentSetSize","0")) // 1000 exitMsg = "Job killed by HTCondor due to excessive memory use" exitMsg += " (RSS=%d MB)." % job_rss exitMsg += " Will not retry it." 
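The '/ 1000' to '// 1000' change just above, together with the matching ones in set_state_ClassAds and Splitter.py, deals with Python 3 true division: '/' now always yields a float, while this code wants integer values, for example as bounds for random.randint. A quick illustration (the RSS value is made up; the division by 1000 mirrors the job-ad handling above):

    import random

    rss = 2500000
    print(rss / 1000)    # 2500.0  float under Python 3 true division
    print(rss // 1000)   # 2500    int, matching the old py2 behaviour

    counter = 4
    max_sleep = 2 * counter - 1
    # randint needs integer bounds, hence 2*(counter//3) and not 2*(counter/3)
    print(random.randint(2 * (counter // 3), max_sleep + 1))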
@@ -322,7 +322,7 @@ def check_exit_code(self): if exitCode == 134: recoverable_signal = False try: - fname = os.path.expanduser("~/%s/job_out.%s.%d.txt" % (self.reqname, self.job_id, self.crab_retry)) + fname = os.path.realpath("WEB_DIR/job_out.%s.%d.txt" % (self.job_id, self.crab_retry)) with open(fname) as fd: for line in fd: if line.startswith("== CMSSW: A fatal system signal has occurred: illegal instruction"): @@ -338,7 +338,7 @@ def check_exit_code(self): if exitCode == 8001 or exitCode == 65: cvmfs_issue = False try: - fname = os.path.expanduser("~/%s/job_out.%s.%d.txt" % (self.reqname, self.job_id, self.crab_retry)) + fname = os.path.relpath("WEB_DIR/job_out.%s.%d.txt" % (self.job_id, self.crab_retry)) cvmfs_issue_re = re.compile("== CMSSW: unable to load /cvmfs/.*file too short") with open(fname) as fd: for line in fd: diff --git a/src/python/TaskWorker/Actions/Splitter.py b/src/python/TaskWorker/Actions/Splitter.py index 35ef1756b2..f18c1f9387 100644 --- a/src/python/TaskWorker/Actions/Splitter.py +++ b/src/python/TaskWorker/Actions/Splitter.py @@ -43,7 +43,7 @@ def execute(self, *args, **kwargs): splitparam['files_per_job'] = (len(data.getFiles()) + numProbes - 1) // numProbes elif kwargs['task']['tm_job_type'] == 'PrivateMC': # sanity check - nJobs = kwargs['task']['tm_totalunits'] / splitparam['events_per_job'] + nJobs = kwargs['task']['tm_totalunits'] // splitparam['events_per_job'] if nJobs > maxJobs: raise TaskWorkerException( "Your task would generate %s jobs. The maximum number of jobs in each task is %s" % diff --git a/src/python/TaskWorker/Actions/StageoutCheck.py b/src/python/TaskWorker/Actions/StageoutCheck.py index 6e7e43ab2c..baecdedf61 100644 --- a/src/python/TaskWorker/Actions/StageoutCheck.py +++ b/src/python/TaskWorker/Actions/StageoutCheck.py @@ -5,7 +5,7 @@ from TaskWorker.WorkerExceptions import TaskWorkerException from ServerUtilities import isFailurePermanent from ServerUtilities import getCheckWriteCommand, createDummyFile -from ServerUtilities import removeDummyFile, executeCommand +from ServerUtilities import removeDummyFile, execute_command from RucioUtils import getWritePFN class StageoutCheck(TaskAction): @@ -30,7 +30,7 @@ def checkPermissions(self, Cmd): Return 0 otherwise """ self.logger.info("Executing command: %s ", Cmd) - out, err, exitcode = executeCommand(Cmd) + out, err, exitcode = execute_command(Cmd) if exitcode != 0: isPermanent, failure, dummyExitCode = isFailurePermanent(err) if isPermanent: diff --git a/src/python/TaskWorker/Actions/TaskAction.py b/src/python/TaskWorker/Actions/TaskAction.py index 466328359c..9d48d76e01 100644 --- a/src/python/TaskWorker/Actions/TaskAction.py +++ b/src/python/TaskWorker/Actions/TaskAction.py @@ -1,12 +1,18 @@ import os import json -import urllib import logging from base64 import b64encode -from httplib import HTTPException +from http.client import HTTPException +import sys +if sys.version_info >= (3, 0): + from urllib.parse import urlencode # pylint: disable=no-name-in-module +if sys.version_info < (3, 0): + from urllib import urlencode from ServerUtilities import truncateError +from Utils.Utilities import encodeUnicodeToBytes + class TaskAction(object): """The ABC of all actions""" @@ -51,9 +57,9 @@ def uploadWarning(self, warning, userProxy, taskname): truncWarning = truncateError(warning) configreq = {'subresource': 'addwarning', 'workflow': taskname, - 'warning': b64encode(truncWarning)} + 'warning': b64encode(encodeUnicodeToBytes(truncWarning))} try: - self.crabserver.post(api='/task', 
data=urllib.urlencode(configreq)) + self.crabserver.post(api='task', data=urlencode(configreq)) except HTTPException as hte: self.logger.error("Error uploading warning: %s", str(hte)) self.logger.warning("Cannot add a warning to REST interface. Warning message: %s", warning) @@ -62,7 +68,7 @@ def uploadWarning(self, warning, userProxy, taskname): def deleteWarnings(self, userProxy, taskname): configreq = {'subresource': 'deletewarnings', 'workflow': taskname} try: - self.crabserver.post(api='task', data=urllib.urlencode(configreq)) + self.crabserver.post(api='task', data=urlencode(configreq)) except HTTPException as hte: self.logger.error("Error deleting warnings: %s", str(hte)) self.logger.warning("Can not delete warnings from REST interface.") diff --git a/src/python/TaskWorker/MasterWorker.py b/src/python/TaskWorker/MasterWorker.py index 1277619afb..6f614c5b1e 100644 --- a/src/python/TaskWorker/MasterWorker.py +++ b/src/python/TaskWorker/MasterWorker.py @@ -7,15 +7,20 @@ import shutil import sys import time -import urllib import signal import logging from base64 import b64encode -from httplib import HTTPException +from http.client import HTTPException from MultiProcessingLog import MultiProcessingLog +if sys.version_info >= (3, 0): + from urllib.parse import urlencode # pylint: disable=no-name-in-module +if sys.version_info < (3, 0): + from urllib import urlencode + #WMcore dependencies +from Utils.Utilities import encodeUnicodeToBytes from WMCore.Configuration import loadConfigurationFile #CRAB dependencies @@ -255,8 +260,8 @@ def _lockWork(self, limit, getstatus, setstatus): configreq = {'subresource': 'process', 'workername': self.config.TaskWorker.name, 'getstatus': getstatus, 'limit': limit, 'status': setstatus} try: - #self.server.post(self.restURInoAPI + '/workflowdb', data=urllib.urlencode(configreq)) - self.crabserver.post(api='workflowdb', data=urllib.urlencode(configreq)) + #self.server.post(self.restURInoAPI + '/workflowdb', data=urlencode(configreq)) + self.crabserver.post(api='workflowdb', data=urlencode(configreq)) except HTTPException as hte: msg = "HTTP Error during _lockWork: %s\n" % str(hte) msg += "HTTP Headers are %s: " % hte.headers @@ -299,8 +304,8 @@ def updateWork(self, taskname, command, status): configreq = {'workflow': taskname, 'command': command, 'status': status, 'subresource': 'state'} try: - #self.server.post(self.restURInoAPI + '/workflowdb', data=urllib.urlencode(configreq)) - self.crabserver.post(api='workflowdb', data=urllib.urlencode(configreq)) + #self.server.post(self.restURInoAPI + '/workflowdb', data=urlencode(configreq)) + self.crabserver.post(api='workflowdb', data=urlencode(configreq)) except HTTPException as hte: msg = "HTTP Error during updateWork: %s\n" % str(hte) msg += "HTTP Headers are %s: " % hte.headers @@ -354,7 +359,7 @@ def failBannedTask(self, task): warning = 'username %s banned in CRAB TaskWorker configuration' % task['tm_username'] configreq = {'subresource': 'addwarning', 'workflow': taskname, 'warning': warning} try: - self.crabserver.post(api='task', data=urllib.urlencode(configreq)) + self.crabserver.post(api='task', data=urlencode(configreq)) except Exception as e: self.logger.error("Error uploading warning: %s", str(e)) self.logger.warning("Cannot add a warning to REST interface. Warning message: %s", warning) @@ -380,9 +385,9 @@ def skipRejectedCommand(self, task): if command == 'KILL': # ignore, i.e. 
leave in status 'SUBMITTED' self.updateWork(taskname, command, 'SUBMITTED') warning = 'command %s disabled in CRAB TaskWorker configuration' % command - configreq = {'subresource': 'addwarning', 'workflow': taskname, 'warning': b64encode(warning)} + configreq = {'subresource': 'addwarning', 'workflow': taskname, 'warning': b64encode(encodeUnicodeToBytes(warning))} try: - self.crabserver.post(api='task', data=urllib.urlencode(configreq)) + self.crabserver.post(api='task', data=urlencode(configreq)) except Exception as e: self.logger.error("Error uploading warning: %s", str(e)) self.logger.warning("Cannot add a warning to REST interface. Warning message: %s", warning) diff --git a/src/python/TaskWorker/TaskManagerBootstrap.py b/src/python/TaskWorker/TaskManagerBootstrap.py index 68b3d112e7..2904af17d7 100644 --- a/src/python/TaskWorker/TaskManagerBootstrap.py +++ b/src/python/TaskWorker/TaskManagerBootstrap.py @@ -36,7 +36,7 @@ def bootstrap(): print("..done") in_args = [] if infile != "None": - with open(infile, "r") as fd: + with open(infile, "rb") as fd: in_args = pickle.load(fd) config = Configuration.Configuration() @@ -75,7 +75,7 @@ def bootstrap(): results = task.execute(in_args, task=ad).result print(results) - with open(outfile, "w") as fd: + with open(outfile, "wb") as fd: pickle.dump(results, fd) return 0 diff --git a/src/python/TaskWorker/Worker.py b/src/python/TaskWorker/Worker.py index ea09a86c3e..2f9673be76 100644 --- a/src/python/TaskWorker/Worker.py +++ b/src/python/TaskWorker/Worker.py @@ -1,21 +1,28 @@ from __future__ import print_function import os import time -import urllib import logging import traceback import multiprocessing -from Queue import Empty +from queue import Empty from base64 import b64encode from logging import FileHandler -from httplib import HTTPException +from http.client import HTTPException from logging.handlers import TimedRotatingFileHandler +import sys +if sys.version_info >= (3, 0): + from urllib.parse import urlencode # pylint: disable=no-name-in-module +if sys.version_info < (3, 0): + from urllib import urlencode + from RESTInteractions import CRABRest from TaskWorker.DataObjects.Result import Result from ServerUtilities import truncateError, executeCommand from TaskWorker.WorkerExceptions import WorkerHandlerException, TapeDatasetException +from Utils.Utilities import encodeUnicodeToBytes + ## Creating configuration globals to avoid passing these around at every request ## and tell pylink to bare with this :-) # pylint: disable=W0604, W0601 @@ -48,8 +55,8 @@ def failTask(taskName, crabserver, msg, log, failstatus='FAILED'): 'status': failstatus, 'subresource': 'failure', # Limit the message to 7500 chars, which means no more than 10000 once encoded. That's the limit in the REST - 'failure': b64encode(truncMsg)} - crabserver.post(api='workflowdb', data = urllib.urlencode(configreq)) + 'failure': b64encode(encodeUnicodeToBytes(truncMsg))} + crabserver.post(api='workflowdb', data = urlencode(configreq)) log.info("Failure message successfully uploaded to the REST") except HTTPException as hte: log.warning("Cannot upload failure message to the REST for task %s. 
HTTP exception headers follows:", taskName) @@ -188,7 +195,7 @@ def begin(self): """Starting up all the slaves""" if len(self.pool) == 0: # Starting things up - for x in xrange(1, self.nworkers + 1): + for x in range(1, self.nworkers + 1): self.logger.debug("Starting process %i", x) p = multiprocessing.Process(target = processWorker, args = (self.inputs, self.results, self.resthost, self.dbInstance, WORKER_CONFIG.TaskWorker.logsDir, x)) p.start() @@ -246,7 +253,7 @@ def checkFinished(self): return [] allout = [] self.logger.info("%d work on going, checking if some has finished", len(self.working.keys())) - for _ in xrange(len(self.working.keys())): + for _ in range(len(self.working.keys())): out = None try: out = self.results.get_nowait() diff --git a/src/python/TaskWorker/WorkerExceptions.py b/src/python/TaskWorker/WorkerExceptions.py index 7e9388ff23..5cfaa90e8d 100644 --- a/src/python/TaskWorker/WorkerExceptions.py +++ b/src/python/TaskWorker/WorkerExceptions.py @@ -23,19 +23,10 @@ class ConfigException(TaskWorkerException): TaskWorker configuration""" exitcode = 4000 -class PanDAException(TaskWorkerException): - """Generic exception interacting with PanDA""" - exitcode = 5000 - -class PanDAIdException(PanDAException): - """Returned in case there are issues with the expected - behaviour of PanDA id's (def, set)""" - exitcode = 5001 - -class NoAvailableSite(PanDAException): - """In case there is no site available to run the jobs - use this exception""" - exitcode = 5002 +class NoAvailableSite(TaskWorkerException): + """In case there is no site available to run the jobs + use this exception""" + exitcode = 5000 class WorkerHandlerException(TaskWorkerException): """Generic exception in case slave worker action diff --git a/src/python/UserFileCache/RESTBaseAPI.py b/src/python/UserFileCache/RESTBaseAPI.py deleted file mode 100644 index 695956cf54..0000000000 --- a/src/python/UserFileCache/RESTBaseAPI.py +++ /dev/null @@ -1,30 +0,0 @@ -# WMCore dependecies here -from WMCore.REST.Server import RESTApi -from WMCore.REST.Format import JSONFormat - -# CRABServer dependecies here -from UserFileCache.RESTFile import RESTFile, RESTLogFile, RESTInfo -import UserFileCache.RESTExtensions - -# external dependecies here -import os - - -class RESTBaseAPI(RESTApi): - """The UserFileCache REST API module""" - - def __init__(self, app, config, mount): - RESTApi.__init__(self, app, config, mount) - - self.formats = [ ('application/json', JSONFormat()) ] - - if not os.path.exists(config.cachedir) or not os.path.isdir(config.cachedir): - raise Exception("Failing to start because of wrong cache directory '%s'" % config.cachedir) - - if hasattr(config, 'powerusers'): - UserFileCache.RESTExtensions.POWER_USERS_LIST = config.powerusers - if hasattr(config, 'quota_user_limit'): - UserFileCache.RESTExtensions.QUOTA_USER_LIMIT = config.quota_user_limit * 1024 * 1024 - self._add( {'logfile': RESTLogFile(app, self, config, mount), - 'file': RESTFile(app, self, config, mount), - 'info': RESTInfo(app, self, config, mount)} ) diff --git a/src/python/UserFileCache/RESTExtensions.py b/src/python/UserFileCache/RESTExtensions.py deleted file mode 100644 index 32517da9bf..0000000000 --- a/src/python/UserFileCache/RESTExtensions.py +++ /dev/null @@ -1,195 +0,0 @@ -""" -This module aims to contain the method specific to the REST interface. -These are extensions which are not directly contained in WMCore.REST module. -Collecting all here since aren't supposed to be many. 
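Before the rest of the deleted UserFileCache module, a note on the urlencode and b64encode changes a few hunks above (TaskAction.py, MasterWorker.py, Worker.py): urlencode now lives in urllib.parse, and b64encode only accepts bytes under Python 3, which is why the strings are wrapped with encodeUnicodeToBytes from WMCore. A version-agnostic sketch, where encodeUnicodeToBytes is assumed to be equivalent to a plain UTF-8 encode and encode_warning is only an illustrative helper:

    import sys
    from base64 import b64encode

    if sys.version_info >= (3, 0):
        from urllib.parse import urlencode
    else:
        from urllib import urlencode

    def encode_warning(warning):
        # b64encode needs bytes in py3; .encode("utf-8") stands in for
        # WMCore's encodeUnicodeToBytes in this sketch.
        payload = b64encode(warning.encode("utf-8"))
        return urlencode({"subresource": "addwarning", "warning": payload})

    print(encode_warning("command KILL disabled in CRAB TaskWorker configuration"))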
-""" - -from ServerUtilities import USER_SANDBOX_EXCLUSIONS, NEW_USER_SANDBOX_EXCLUSIONS -from ServerUtilities import FILE_SIZE_LIMIT, FILE_MEMORY_LIMIT - -# WMCore dependecies here -from WMCore.REST.Validation import _validate_one -from WMCore.REST.Error import RESTError, InvalidParameter -from WMCore.Services.UserFileCache.UserFileCache import calculateChecksum -from WMCore.REST.Auth import get_user_info - -# external dependecies here -import tarfile -import hashlib -import cStringIO -import urllib2 -from os import fstat, walk, path, listdir - -# 600MB is the default user quota limit - overwritten in RESTBaseAPI if quota_user_limit is set in the config -QUOTA_USER_LIMIT = 1024*1024*600 -#these users have 10* basic user quota - overwritten in RESTBaseAPI if powerusers is set in the config -POWER_USERS_LIST = [] - -def http_error(msg, code=403): - try: - import requests - err = requests.HTTPError(msg) - err.response.status_code = code - return err - except ImportError: - url = '' # required for urllib2 HTTPError but we should acquire it from elsewhere - return urllib2.HTTPError(url, code, err, None, None) - -###### authz_login_valid is currently duplicatint CRABInterface.RESTExtension . A better solution -###### should be found for authz_* -def authz_login_valid(): - user = get_user_info() - if not user['login']: - err = "You are not allowed to access this resources" - raise http_error(err) - -def authz_operator(username): - """ Check if the the user who is trying to access this resource (i.e.: user['login'], the cert username) is the - same as username. If not check if the user is a CRAB3 operator. {... 'operator': {'group': set(['crab3']) ... in request roles} - If the user is not an operator and is trying to access a file owned by another user than raise - """ - user = get_user_info() - if user['login'] != username and\ - 'crab3' not in user.get('roles', {}).get('operator', {}).get('group', set()): - err = "You are not allowed to access this resource. You need to be a CRAB3 operator in sitedb to access other user's files" - raise http_error(err) - -def file_size(argfile): - """Return the file or cStringIO.StringIO size - - :arg file|cStringIO.StringIO argfile: file object handler or cStringIO.StringIO - :return: size in bytes""" - if isinstance(argfile, file): - return fstat(argfile.fileno()).st_size, True - elif isinstance(argfile, cStringIO.OutputType): - argfile.seek(0, 2) - filesize = argfile.tell() - argfile.seek(0) - return filesize, False - -def list_users(cachedir): - #file are stored in directories like u/username - for name in listdir(cachedir): #iterate over u ... 
- if name == 'lost+found': # skip root-owned file at top mount point - continue - if path.isdir(path.join(cachedir, name)): - for username in listdir(path.join(cachedir, name)): #list all the users under u - yield username - -def list_files(quotapath): - for _, _, filenames in walk(quotapath): - for f in filenames: - yield f - -def get_size(quotapath): - """Check the quotapath directory size; it doesn't include the 4096 bytes taken by each directory - - :arg str quotapath: the directory for which is needed to calculate the quota - :return: bytes taken by the directory""" - totalsize = 0 - for dirpath, _, filenames in walk(quotapath): - for f in filenames: - fp = path.join(dirpath, f) - totalsize += path.getsize(fp) - return totalsize - -def quota_user_free(quotadir, infile): - """Raise an exception if the input file overflow the user quota - - :arg str quotadir: the user path where the file will be written - :arg file|cStringIO.StringIO infile: file object handler or cStringIO.StringIO - :return: Nothing""" - filesize, _ = file_size(infile.file) - quota = get_size(quotadir) - user = get_user_info() - quotaLimit = QUOTA_USER_LIMIT*10 if user['login'] in POWER_USERS_LIST else QUOTA_USER_LIMIT - if filesize + quota > quotaLimit: - excquota = ValueError("User %s has reached quota of %dB: additional file of %dB cannot be uploaded." \ - % (user['login'], quota, filesize)) - raise InvalidParameter("User quota limit reached; cannot upload the file", errobj=excquota, trace='') - -def _check_file(argname, val): - """Check that `argname` `val` is a file - - :arg str argname: name of the argument - :arg file val: the file object - :return: the val if the validation passes.""" - # checking that is a valid file or an input string - # note: the input string is generated on client side just when the input file is empty - filesize = 0 - if not hasattr(val, 'file') or not (isinstance(val.file, file) or isinstance(val.file, cStringIO.OutputType)): - raise InvalidParameter("Incorrect inputfile parameter") - else: - filesize, realfile = file_size(val.file) - if realfile: - if filesize > FILE_SIZE_LIMIT: - raise InvalidParameter("File size is %sB. This is bigger than the maximum allowed size of %sB." % (filesize, FILE_SIZE_LIMIT)) - elif filesize > FILE_MEMORY_LIMIT: - raise InvalidParameter('File too large to be completely loaded into memory.') - - return val - - -def _check_tarfile(argname, val, hashkey, newchecksum): - """Check that `argname` `val` is a tar file and that provided 'hashkey` - matches with the hashkey calculated on the `val`. - - :arg str argname: name of the argument - :arg file val: the file object - :arg str hashkey: the sha256 hexdigest of the file, calculated over the tuple - (name, size, mtime, uname) of all the tarball members - :return: the val if the validation passes.""" - # checking that is a valid file or an input string - # note: the input string is generated on client side just when the input file is empty - _check_file(argname, val) - - digest = None - try: - #This newchecksum param and the if/else branch is there for backward compatibility. - #The parameter, older exclusion and checksum functions should be removed in the future. 
- if newchecksum == 2: - digest = calculateChecksum(val.file, exclude=NEW_USER_SANDBOX_EXCLUSIONS) - elif newchecksum == 1: - digest = calculateChecksum(val.file, exclude=USER_SANDBOX_EXCLUSIONS) - else: - tar = tarfile.open(fileobj=val.file, mode='r') - lsl = [(x.name, int(x.size), int(x.mtime), x.uname) for x in tar.getmembers()] - hasher = hashlib.sha256(str(lsl)) - digest = hasher.hexdigest() - except tarfile.ReadError: - raise InvalidParameter('File is not a .tgz file.') - if not digest or hashkey != digest: - raise ChecksumFailed("Checksums do not match") - return val - -class ChecksumFailed(RESTError): - "Checksum calculation failed, file transfer problem." - http_code = 400 - app_code = 302 - message = "Input file hashkey mismatch" - -def validate_tarfile(argname, param, safe, hashkey, optional=False): - """Validates that an argument is a file and matches the hashkey. - - Checks that an argument named `argname` exists in `param.kwargs` - and it is a tar file which matches the provided hashkey. If - successful the string is copied into `safe.kwargs` and the value - is removed from `param.kwargs`. - - If `optional` is True, the argument is not required to exist in - `param.kwargs`; None is then inserted into `safe.kwargs`. Otherwise - a missing value raises an exception.""" - _validate_one(argname, param, safe, _check_tarfile, optional, safe.kwargs[hashkey], safe.kwargs['newchecksum']) - -def validate_file(argname, param, safe, hashkey, optional=False): - """Validates that an argument is a file and matches the hashkey. - - Checks that an argument named `argname` exists in `param.kwargs` - and it is a tar file which matches the provided hashkey. If - successful the string is copied into `safe.kwargs` and the value - is removed from `param.kwargs`. - - If `optional` is True, the argument is not required to exist in - `param.kwargs`; None is then inserted into `safe.kwargs`. 
Otherwise - a missing value raises an exception.""" - _validate_one(argname, param, safe, _check_file, optional) diff --git a/src/python/UserFileCache/RESTFile.py b/src/python/UserFileCache/RESTFile.py deleted file mode 100644 index c3d665eaa0..0000000000 --- a/src/python/UserFileCache/RESTFile.py +++ /dev/null @@ -1,279 +0,0 @@ -# WMCore dependecies here -from WMCore.REST.Format import RawFormat -from WMCore.REST.Server import RESTEntity, restcall -from WMCore.REST.Validation import validate_str, _validate_one, validate_num -from WMCore.REST.Error import RESTError, InvalidParameter, MissingObject, ExecutionError - -# CRABServer dependecies here -from UserFileCache.__init__ import __version__ -from UserFileCache.RESTExtensions import ChecksumFailed, validate_file, validate_tarfile, authz_login_valid, authz_operator,\ - quota_user_free, get_size, list_files, list_users - -# external dependecies here -import re -import os -import shutil -import tarfile -import hashlib -import cherrypy -from cherrypy.lib.static import serve_file - -# here go the all regex to be used for validation -RX_USERNAME = re.compile(r"^\w+$") #TODO use WMCore regex -RX_HASH = re.compile(r'^[a-f0-9]{64}$') -RX_LOGFILENAME = re.compile(r"^[\w\-.: ]+$") -RX_SUBRES = re.compile(r"^fileinfo|userinfo|powerusers|basicquota|fileremove|listusers|usedspace$") - -def touch(filename): - """Touch the file to keep automated cleanup away - - :arg str filename: the filename path.""" - if os.path.isfile(filename): - os.utime(filename, None) - -def filepath(cachedir, username=None): - # NOTE: if we need to share a file between users (something we do not really want to make default or too easy...) we can: - # - use the group of the user instead of the user name, which can be retrieved from cherrypy.request.user - # - have an extra input parameter group=something (but this wouldn't be transparent when downloading it) - username = username if username else cherrypy.request.user['login'] - return os.path.join(cachedir, username[0], username) - -class RESTFile(RESTEntity): - """The RESTEntity for uploaded and downloaded files""" - - def __init__(self, app, api, config, mount): - RESTEntity.__init__(self, app, api, config, mount) - self.config = config - self.cachedir = config.cachedir - self.overwriteFile = False - - def validate(self, apiobj, method, api, param, safe): - """Validating all the input parameter as enforced by the WMCore.REST module""" - authz_login_valid() - - if method in ['PUT']: - validate_str("hashkey", param, safe, RX_HASH, optional=False) - validate_num("newchecksum", param, safe, optional=True) - validate_tarfile("inputfile", param, safe, 'hashkey', optional=False) - if method in ['GET']: - validate_str("hashkey", param, safe, RX_HASH, optional=False) - validate_str("username", param, safe, RX_USERNAME, optional=True) - if safe.kwargs['username']: - authz_operator(safe.kwargs['username']) - - @restcall - def put(self, inputfile, hashkey, newchecksum=0): - """Allow to upload a tarball file to be written in the local filesystem. - Base path of the local filesystem is configurable. - - The caller needs to be a CMS user with a valid CMS x509 cert/proxy. 
- - :arg file inputfile: file object to be uploaded - :arg str hashkey: the sha256 hexdigest of the file, calculated over the tuple - (name, size, mtime, uname) of all the tarball members - :return: hashkey, name, size of the uploaded file.""" - outfilepath = filepath(self.cachedir) - outfilename = None - result = {'hashkey': hashkey} - - # using the hash of the file to create a subdir and filename - outfilepath = os.path.join(outfilepath, hashkey[0:2]) - outfilename = os.path.join(outfilepath, hashkey) - - if os.path.isfile(outfilename) and not self.overwriteFile: - # we do not want to upload again a file that already exists - touch(outfilename) - result['size'] = os.path.getsize(outfilename) - else: - # check that the user quota is still below limit - quota_user_free(filepath(self.cachedir), inputfile) - - if not os.path.isdir(outfilepath): - os.makedirs(outfilepath) - handlefile = open(outfilename, 'wb') - inputfile.file.seek(0) - shutil.copyfileobj(inputfile.file, handlefile) - handlefile.close() - result['size'] = os.path.getsize(outfilename) - return [result] - - @restcall(formats = [('application/octet-stream', RawFormat())]) - def get(self, hashkey, username): - """Retrieve a file previously uploaded to the local filesystem. - The base path on the local filesystem is configurable. - - The caller needs to be a CMS user with a valid CMS x509 cert/proxy. - - :arg str hashkey: the sha256 hexdigest of the file, calculated over the tuple - (name, size, mtime, uname) of all the tarball members - :return: the raw file""" - filename = None - infilepath = filepath(self.cachedir, username) - - # defining the path/name from the hash of the file - filename = os.path.join(infilepath, hashkey[0:2], hashkey) - - if not os.path.isfile(filename): - raise MissingObject("Not such file") - touch(filename) - return serve_file(filename, "application/octet-stream", "attachment") - -class RESTLogFile(RESTFile): - """The RESTEntity for uploaded and downloaded logs""" - def __init__(self, app, api, config, mount): - RESTFile.__init__(self, app, api, config, mount) - self.overwriteFile = True - - def validate(self, apiobj, method, api, param, safe): - """Validating all the input parameter as enforced by the WMCore.REST module""" - authz_login_valid() - - if method in ['PUT']: - validate_file("inputfile", param, safe, 'hashkey', optional=False) - validate_str("name", param, safe, RX_LOGFILENAME, optional=False) - if method in ['GET']: - validate_str("name", param, safe, RX_LOGFILENAME, optional=False) - validate_str("username", param, safe, RX_USERNAME, optional=True) - if safe.kwargs['username']: - authz_operator(safe.kwargs['username']) - - @restcall - def put(self, inputfile, name): - return RESTFile.put(self, inputfile, name) - - @restcall(formats = [('application/octet-stream', RawFormat())]) - def get(self, name, username): - return RESTFile.get(self, name, username) - - -class RESTInfo(RESTEntity): - """REST entity for workflows and relative subresources""" - - def __init__(self, app, api, config, mount): - RESTEntity.__init__(self, app, api, config, mount) - self.cachedir = config.cachedir - - def validate(self, apiobj, method, api, param, safe): - """Validating all the input parameter as enforced by the WMCore.REST module""" - authz_login_valid() - if method in ['GET']: - validate_str('subresource', param, safe, RX_SUBRES, optional=True) - validate_str("hashkey", param, safe, RX_HASH, optional=True) - validate_num("verbose", param, safe, optional=True) - validate_str("username", param, safe, 
RX_USERNAME, optional=True) - if safe.kwargs['username']: - authz_operator(safe.kwargs['username']) - - @restcall - def get(self, subresource, **kwargs): - """Retrieves the server information, like delegateDN, filecacheurls ... - :arg str subresource: the specific server information to be accessed; - """ - if subresource: - return getattr(RESTInfo, subresource)(self, **kwargs) - else: - return [{"crabcache":"Welcome","version":__version__}] - @restcall - def fileinfo(self, **kwargs): - """Retrieve the file summary information. - - The caller needs to be a CMS user with a valid CMS x509 cert/proxy. - - :arg str hashkey: the sha256 hexdigest of the file, calculated over the tuple - (name, size, mtime, uname) of all the tarball members - :return: hashkey, name, size of the requested file""" - - hashkey = kwargs['hashkey'] - result = {} - filename = None - infilepath = filepath(self.cachedir, kwargs['username']) - - # defining the path/name from the hash of the file - filename = os.path.join(infilepath, hashkey[0:2], hashkey) - result['hashkey'] = hashkey - - if not os.path.isfile(filename): - raise MissingObject("Not such file") - result['exists'] = True - result['size'] = os.path.getsize(filename) - result['accessed'] = os.path.getctime(filename) - result['changed'] = os.path.getctime(filename) - result['modified'] = os.path.getmtime(filename) - touch(filename) - - return [result] - - - @restcall - def fileremove(self, **kwargs): - """Remove the file with the specified hashkey. - - The caller needs to be a CMS user with a valid CMS x509 cert/proxy. Users can only delete their own files - - :arg str hashkey: the sha256 hexdigest of the file, calculated over the tuple - (name, size, mtime, uname) of all the tarball members - """ - hashkey = kwargs['hashkey'] - - infilepath = filepath(self.cachedir) - # defining the path/name from the hash of the file - filename = os.path.join(infilepath, hashkey[0:2], hashkey) - - if not os.path.isfile(filename): - raise MissingObject("Not such file") - - try: - os.remove(filename) - except Exception as ex: - raise ExecutionError("Impossible to remove the file: %s" % str(ex)) - - @restcall - def userinfo(self, **kwargs): - """Retrieve the user summary information. 
- - :arg str username: username for which the informations are retrieved - - :return: quota, list of filenames""" - username = kwargs['username'] - userpath = filepath(self.cachedir, username) - - res = {} - files = list_files(userpath) - if kwargs['verbose']: - files_dict = {} - for file_ in files: - files_dict[file_] = self.fileinfo(hashkey=file_, username=username) - - res["file_list"] = files_dict if kwargs['verbose'] else list(files) - res["used_space"] = [get_size(userpath)] - - yield res - - #inserted by eric obeng summer student - @restcall - def usedspace(self, **kwargs): - """Retrieves only the used space of the user""" - username = kwargs["username"] - userpath = filepath(self.cachedir, username) - yield get_size(userpath) - - @restcall - def listusers(self, **kwargs): - """ Retrieve the list of power users from the config - """ - - return list_users(self.cachedir) - - @restcall - def powerusers(self, **kwargs): - """ Retrieve the list of power users from the config - """ - - return self.config.powerusers - - @restcall - def basicquota(self, **kwargs): - """ Retrieve the basic quota space - """ - - yield {"quota_user_limit" : self.config.quota_user_limit} diff --git a/src/python/UserFileCache/__init__.py b/src/python/UserFileCache/__init__.py deleted file mode 100644 index 42e449a66b..0000000000 --- a/src/python/UserFileCache/__init__.py +++ /dev/null @@ -1,3 +0,0 @@ -__version__ = 'development' - -#the __version__ will be automatically changed when building RPMs diff --git a/src/python/WMArchiveUploader.py b/src/python/WMArchiveUploader.py index 100dfda357..ae8493eee4 100644 --- a/src/python/WMArchiveUploader.py +++ b/src/python/WMArchiveUploader.py @@ -11,7 +11,7 @@ import signal import logging import subprocess -from httplib import HTTPException +from http.client import HTTPException from logging.handlers import TimedRotatingFileHandler from WMCore.WMException import WMException diff --git a/src/python/taskbuffer/FileSpec.py b/src/python/taskbuffer/FileSpec.py deleted file mode 100644 index 92005352ec..0000000000 --- a/src/python/taskbuffer/FileSpec.py +++ /dev/null @@ -1,117 +0,0 @@ -""" -file specification - -""" - -class FileSpec(object): - # attributes - _attributes = ('rowID', 'PandaID', 'GUID', 'lfn', 'type', 'dataset', 'status', 'prodDBlock', - 'prodDBlockToken', 'dispatchDBlock', 'dispatchDBlockToken', 'destinationDBlock', - 'destinationDBlockToken', 'destinationSE', 'fsize', 'md5sum', 'checksum') - # slots - __slots__ = _attributes+('_owner',) - - - # constructor - def __init__(self): - # install attributes - for attr in self._attributes: - setattr(self, attr, None) - # set owner to synchronize PandaID - self._owner = None - - - # override __getattribute__ for SQL and PandaID - def __getattribute__(self, name): - # PandaID - if name == 'PandaID': - if self._owner == None: - return 'NULL' - return self._owner.PandaID - # others - ret = object.__getattribute__(self, name) - if ret == None: - return "NULL" - return ret - - - # set owner - def setOwner(self, owner): - self._owner = owner - - - # return a tuple of values - def values(self): - ret = [] - for attr in self._attributes: - val = getattr(self, attr) - ret.append(val) - return tuple(ret) - - - # pack tuple into FileSpec - def pack(self, values): - for i in range(len(self._attributes)): - attr= self._attributes[i] - val = values[i] - setattr(self, attr, val) - - - # return state values to be pickled - def __getstate__(self): - state = [] - for attr in self._attributes: - val = getattr(self, attr) - 
state.append(val) - # append owner info - state.append(self._owner) - return state - - - # restore state from the unpickled state values - def __setstate__(self, state): - for i in range(len(self._attributes)): - if i+1 < len(state): - setattr(self, self._attributes[i], state[i]) - else: - setattr(self, self._attributes[i], 'NULL') - self._owner = state[-1] - - - # return column names for INSERT - def columnNames(cls): - ret = "" - for attr in cls._attributes: - if ret != "": - ret += ',' - ret += attr - return ret - columnNames = classmethod(columnNames) - - - # return expression of values for INSERT - def valuesExpression(cls): - ret = "VALUES(" - for attr in cls._attributes: - ret += "%s" - if attr != cls._attributes[len(cls._attributes)-1]: - ret += "," - ret += ")" - return ret - valuesExpression = classmethod(valuesExpression) - - - # return an expression for UPDATE - def updateExpression(cls): - ret = "" - for attr in cls._attributes: - ret = ret + attr + "=%s" - if attr != cls._attributes[len(cls._attributes)-1]: - ret += "," - return ret - updateExpression = classmethod(updateExpression) - - - - - diff --git a/src/python/taskbuffer/JobSpec.py b/src/python/taskbuffer/JobSpec.py deleted file mode 100644 index 6e3f4ac330..0000000000 --- a/src/python/taskbuffer/JobSpec.py +++ /dev/null @@ -1,141 +0,0 @@ -""" -job specification - -""" - -class JobSpec(object): - # attributes - _attributes = ('PandaID', 'jobDefinitionID', 'schedulerID', 'pilotID', 'creationTime', 'creationHost', - 'modificationTime', 'modificationHost', 'AtlasRelease', 'transformation', 'homepackage', - 'prodSeriesLabel', 'prodSourceLabel', 'prodUserID', 'assignedPriority', 'currentPriority', - 'attemptNr', 'maxAttempt', 'jobStatus', 'jobName', 'maxCpuCount', 'maxCpuUnit', 'maxDiskCount', - 'maxDiskUnit', 'ipConnectivity', 'minRamCount', 'minRamUnit', 'startTime', 'endTime', - 'cpuConsumptionTime', 'cpuConsumptionUnit', 'commandToPilot', 'transExitCode', 'pilotErrorCode', - 'pilotErrorDiag', 'exeErrorCode', 'exeErrorDiag', 'supErrorCode', 'supErrorDiag', - 'ddmErrorCode', 'ddmErrorDiag', 'brokerageErrorCode', 'brokerageErrorDiag', - 'jobDispatcherErrorCode', 'jobDispatcherErrorDiag', 'taskBufferErrorCode', - 'taskBufferErrorDiag', 'computingSite', 'computingElement', 'jobParameters', - 'metadata', 'prodDBlock', 'dispatchDBlock', 'destinationDBlock', 'destinationSE', - 'nEvents', 'grid', 'cloud', 'cpuConversion', 'sourceSite', 'destinationSite', 'transferType', - 'taskID', 'cmtConfig', 'stateChangeTime', 'prodDBUpdateTime', 'lockedby', 'relocationFlag', - 'jobExecutionID', 'VO', 'pilotTiming', 'workingGroup', 'processingType', 'prodUserName', - 'nInputFiles', 'countryGroup', 'batchID', 'parentID', 'specialHandling', 'jobsetID', - 'coreCount') - # slots - __slots__ = _attributes+('Files',) - - - # constructor - def __init__(self): - # install attributes - for attr in self._attributes: - setattr(self, attr, None) - # files list - self.Files = [] - - - # override __getattribute__ for SQL - def __getattribute__(self, name): - ret = object.__getattribute__(self, name) - if ret == None: - return "NULL" - return ret - - - # add File to files list - def addFile(self, file): - # set owner - file.setOwner(self) - # append - self.Files.append(file) - - - # pack tuple into JobSpec - def pack(self, values): - for i in range(len(self._attributes)): - attr= self._attributes[i] - val = values[i] - setattr(self, attr, val) - - - # return a tuple of values - def values(self): - ret = [] - for attr in self._attributes: - val = 
getattr(self, attr) - ret.append(val) - return tuple(ret) - - - # return state values to be pickled - def __getstate__(self): - state = [] - for attr in self._attributes: - val = getattr(self, attr) - state.append(val) - # append File info - state.append(self.Files) - return state - - - # restore state from the unpickled state values - def __setstate__(self, state): - for i in range(len(self._attributes)): - # schema evolution is supported only when adding attributes - if i+1 < len(state): - setattr(self, self._attributes[i], state[i]) - else: - setattr(self, self._attributes[i], 'NULL') - self.Files = state[-1] - - - # return column names for INSERT or full SELECT - def columnNames(cls): - ret = "" - for attr in cls._attributes: - if ret != "": - ret += ',' - ret += attr - return ret - columnNames = classmethod(columnNames) - - - # return expression of values for INSERT - def valuesExpression(cls): - ret = "VALUES(" - for attr in cls._attributes: - ret += "%s" - if attr != cls._attributes[len(cls._attributes)-1]: - ret += "," - ret += ")" - return ret - valuesExpression = classmethod(valuesExpression) - - - # return an expression for UPDATE - def updateExpression(cls): - ret = "" - for attr in cls._attributes: - ret = ret + attr + "=%s" - if attr != cls._attributes[len(cls._attributes)-1]: - ret += "," - return ret - updateExpression = classmethod(updateExpression) - - - # comparison function for sort - def compFunc(cls, a, b): - iPandaID = list(cls._attributes).index('PandaID') - iPriority = list(cls._attributes).index('currentPriority') - if a[iPriority] > b[iPriority]: - return -1 - elif a[iPriority] < b[iPriority]: - return 1 - else: - if a[iPandaID] > b[iPandaID]: - return 1 - elif a[iPandaID] < b[iPandaID]: - return -1 - else: - return 0 - compFunc = classmethod(compFunc) diff --git a/src/python/taskbuffer/__init__.py b/src/python/taskbuffer/__init__.py deleted file mode 100644 index e69de29bb2..0000000000 diff --git a/src/script/Deployment/Publisher/start.sh b/src/script/Deployment/Publisher/start.sh index f314c3b4c3..8264199040 100755 --- a/src/script/Deployment/Publisher/start.sh +++ b/src/script/Deployment/Publisher/start.sh @@ -65,12 +65,12 @@ ln -s /data/hostdisk/${SERVICE}/nohup.out nohup.out case $MODE in current) # current mode: run current instance - COMMAND_DIR=${PUBLISHER_ROOT}/lib/python2.7/site-packages/Publisher/ + COMMAND_DIR=${PUBLISHER_ROOT}/lib/python3.8/site-packages/Publisher/ CONFIG=${PUBLISHER_HOME}/PublisherConfig.py if [ "$debug" = true ]; then - python ${COMMAND_DIR}/SequentialPublisher.py --config ${CONFIG} --debug + python3 ${COMMAND_DIR}/SequentialPublisher.py --config ${CONFIG} --debug else - nohup python ${COMMAND_DIR}/PublisherMaster.py --config $PUBLISHER_HOME/PublisherConfig.py & + nohup python3 ${COMMAND_DIR}/PublisherMaster.py --config $PUBLISHER_HOME/PublisherConfig.py & fi ;; private) @@ -80,9 +80,8 @@ case $MODE in COMMAND_DIR=${GHrepoDir}/CRABServer/src/python/Publisher/ CONFIG=$PUBLISHER_HOME/PublisherConfig.py if [ "$debug" = true ]; then - python ${COMMAND_DIR}/SequentialPublisher.py --config ${CONFIG} --debug + python3 ${COMMAND_DIR}/SequentialPublisher.py --config ${CONFIG} --debug else - nohup python ${COMMAND_DIR}/PublisherMaster.py --config ${CONFIG} & + nohup python3 ${COMMAND_DIR}/PublisherMaster.py --config ${CONFIG} & fi esac - diff --git a/src/script/Deployment/TaskWorker/TaskWorkerConfig.py b/src/script/Deployment/TaskWorker/TaskWorkerConfig.py index 4d1e4fca4a..c7be0b02da 100644 --- 
a/src/script/Deployment/TaskWorker/TaskWorkerConfig.py +++ b/src/script/Deployment/TaskWorker/TaskWorkerConfig.py @@ -62,6 +62,10 @@ config.TaskWorker.cmskey = '/data/certs/servicekey.pem' config.TaskWorker.backend = 'glidein' + +# for connection to HTCondor scheds +config.TaskWorker.SEC_TOKEN_DIRECTORY = '/data/certs/tokens.d' + #Retry policy config.TaskWorker.max_retry = 4 config.TaskWorker.retry_interval = [30, 60, 120, 0] diff --git a/src/script/Deployment/TaskWorker/start.sh b/src/script/Deployment/TaskWorker/start.sh index b367acd3de..224b8cb457 100755 --- a/src/script/Deployment/TaskWorker/start.sh +++ b/src/script/Deployment/TaskWorker/start.sh @@ -65,12 +65,12 @@ ln -s /data/hostdisk/${SERVICE}/nohup.out nohup.out case $MODE in current) # current mode: run current instance - COMMAND_DIR=${TASKWORKER_ROOT}/lib/python2.7/site-packages/TaskWorker/ + COMMAND_DIR=${TASKWORKER_ROOT}/lib/python3.8/site-packages/TaskWorker/ CONFIG=${TASKWORKER_HOME}/current/TaskWorkerConfig.py if [ "$debug" = true ]; then - python ${COMMAND_DIR}/SequentialWorker.py $CONFIG --logDebug + python3 ${COMMAND_DIR}/SequentialWorker.py $CONFIG --logDebug else - nohup python ${COMMAND_DIR}/MasterWorker.py --config ${CONFIG} --logDebug & + nohup python3 ${COMMAND_DIR}/MasterWorker.py --config ${CONFIG} --logDebug & fi ;; private) @@ -80,9 +80,9 @@ case $MODE in COMMAND_DIR=${GHrepoDir}/CRABServer/src/python/TaskWorker CONFIG=$TASKWORKER_HOME/current/TaskWorkerConfig.py if [ "$debug" = true ]; then - python -m pdb ${COMMAND_DIR}/SequentialWorker.py ${CONFIG} --logDebug + python3 -m pdb ${COMMAND_DIR}/SequentialWorker.py ${CONFIG} --logDebug else - nohup python ${COMMAND_DIR}/MasterWorker.py --config ${CONFIG} --logDebug & + nohup python3 ${COMMAND_DIR}/MasterWorker.py --config ${CONFIG} --logDebug & fi esac diff --git a/src/script/Deployment/TaskWorker/updateTMRuntime.sh b/src/script/Deployment/TaskWorker/updateTMRuntime.sh index e4180fa1e9..5a9f82caf9 100755 --- a/src/script/Deployment/TaskWorker/updateTMRuntime.sh +++ b/src/script/Deployment/TaskWorker/updateTMRuntime.sh @@ -3,13 +3,10 @@ # with actual CRAB version CRAB3_DUMMY_VERSION=3.3.0-pre1 -# # Will replace the tarball in whatever is current TW TW_HOME=/data/srv/TaskManager - TW_CURRENT=${TW_HOME}/current - TW_RELEASE=`ls -l ${TW_CURRENT}|awk '{print $NF}'|tr -d '/'` TW_ARCH=$(basename $(ls -d ${TW_CURRENT}/sl*)) CRABTASKWORKER_ROOT=${TW_CURRENT}/${TW_ARCH}/cms/crabtaskworker/${TW_RELEASE} @@ -23,10 +20,19 @@ fi logFile=/tmp/updateRuntimeLog.txt +echo "working environment:" > ${logFile} +echo TW_HOME = $TW_HOME >> ${logFile} +echo TW_CURRENT = $TW_CURRENT >> ${logFile} +echo TW_RELEASE = $TW_RELEASE >> ${logFile} +echo TW_ARCH = $TW_ARCH >> ${logFile} +echo CRABTASKWORKER_ROOT = $CRABTASKWORKER_ROOT >> ${logFile} + # CRAB_OVERRIDE_SOURCE tells htcondor_make_runtime.sh where to find the CRABServer repository export CRAB_OVERRIDE_SOURCE=${GHrepoDir} +echo CRAB_OVERRIDE_SOURCE = $CRAB_OVERRIDE_SOURCE >> ${logFile} pushd $CRAB_OVERRIDE_SOURCE/CRABServer > /dev/null -sh bin/htcondor_make_runtime.sh > ${logFile} 2>&1 +echo "current working directory is: " `pwd` >> ${logFile} +sh bin/htcondor_make_runtime.sh >> ${logFile} 2>&1 mv TaskManagerRun-$CRAB3_DUMMY_VERSION.tar.gz TaskManagerRun.tar.gz >> ${logFile} 2>&1 mv CMSRunAnalysis-$CRAB3_DUMMY_VERSION.tar.gz CMSRunAnalysis.tar.gz >> ${logFile} 2>&1 @@ -55,6 +61,7 @@ if [ $? -eq 0 ] then echo "" echo "OK. 
New tarballs created and placed inside directory tree: $TW_CURRENT"
+  echo "See log in ${logFile}"
   echo "Previous files have been saved in $CRABTASKWORKER_ROOT/data/PreviousRuntime/"
   echo "BEWARE: Safest way to revert to original configuration is to re-deploy container"
 else
diff --git a/src/script/Monitor/logstash/crabtaskworker.conf b/src/script/Monitor/logstash/crabtaskworker.conf
index ffe0b97212..ac04e40a85 100644
--- a/src/script/Monitor/logstash/crabtaskworker.conf
+++ b/src/script/Monitor/logstash/crabtaskworker.conf
@@ -85,18 +85,6 @@ filter{
       overwrite => ["message"]
     }
-  # This filter is currently used for production publisher.
-  # after we deploy preprod to prod, this filter can be dropped
-  } else if [message] =~ /.*DEBUG:master.*/ {
-    grok{
-      match => {
-        #2021-04-15 22:23:39,566:DEBUG:master: 8 : 210415_093249:algomez_crab_QCDHT100to200TuneCP5PSWeights13TeV-madgraphMLM
-        "message" => "%{TIMESTAMP_ISO8601:timestamp_temp}:%{NOTSPACE:logMsg}:master:%{SPACE}%{INT:acquiredFiles}%{SPACE}:%{SPACE}%{NOTSPACE:taskName}"
-      }
-      add_field => {"log_type" => "acquired_files"}
-      overwrite => ["message"]
-    }
-
   # this filter mathces changes in #6861, which is already in preprod and dev
   } else if [message] =~ /.*acquired_files.*/ {
     grok{
@@ -159,6 +147,7 @@ filter{
   }
 }
+
 output {
   if [log_type] == "work_on_task_completed" {
     http {
@@ -235,18 +224,6 @@ output {
     }
   }
-  # This filter is currently used for production publisher.
-  # after we deploy preprod to prod, this filter can be dropped
-  if [log_type] == "acquired_files" {
-    http {
-      http_method => post
-      url => "http://monit-logs.cern.ch:10012/"
-      content_type => "application/json; charset=UTF-8"
-      format => "message"
-      message => '{"hostname": "%{hostname}", "rec_timestamp":"%{rec_timestamp}", "log_file": "%{log_file}", "producer": "crab", "type": "publisher", "timestamp":"%{timestamp}", "producer_time":"%{producer_time}", "log_type":"%{log_type}", "logMsg":"%{logMsg}", "acquiredFiles":"%{acquiredFiles}", "taskName":"%{taskName}" }'
-    }
-  }
-
   # this filter mathces changes in #6861, which is already in preprod and dev
   if [log_type] == "acquired_files_status" {
     http {
diff --git a/test-old/python/CRABInterface_t/RESTFileUserTransfers_t.py b/test-old/python/CRABInterface_t/RESTFileUserTransfers_t.py
index 65a76521d9..78af5db2fa 100644
--- a/test-old/python/CRABInterface_t/RESTFileUserTransfers_t.py
+++ b/test-old/python/CRABInterface_t/RESTFileUserTransfers_t.py
@@ -88,7 +88,7 @@ def testFileTransferPUT(self):
         print(self.fileDoc)
         self.server.put('/crabserver/dev/fileusertransfers', data=encodeRequest(self.fileDoc))
         # if I will put the same doc twice, it should raise an error.
-        # self.server.put('/crabserver/dev/fileusertransfers', data=urllib.urlencode(self.fileDoc))
+        # self.server.put('/crabserver/dev/fileusertransfers', data=urlencode(self.fileDoc))
         # This tasks are for the future and next calls
         if user not in self.tasks:
             self.tasks[user] = {'workflowName': workflowName, 'taskname': taskname, 'listOfIds': [],
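
Note, for orientation only and not part of the patch itself: the diffs above consistently move Python 2 standard-library imports to their Python 3 locations (httplib becomes http.client, urllib.urlencode becomes urllib.parse.urlencode) and switch the deployment scripts from the python to the python3 interpreter. The sketch below is a minimal illustration of those renamed imports under that assumption; put_doc, host, and path are hypothetical example names and do not exist in CRABServer.

# Illustrative sketch only -- not part of this patch. It shows the Python 3
# import paths that replace the Python 2 ones touched above; put_doc and its
# arguments are hypothetical example names.
from http.client import HTTPSConnection, HTTPException   # Python 2: httplib
from urllib.parse import urlencode                        # Python 2: urllib.urlencode

def put_doc(host, path, doc):
    """PUT a dict as a URL-encoded body and return the HTTP status code."""
    body = urlencode(doc, doseq=True).encode('utf-8')  # Python 3 expects a bytes payload
    conn = HTTPSConnection(host)
    try:
        conn.request('PUT', path, body=body,
                     headers={'Content-Type': 'application/x-www-form-urlencoded'})
        return conn.getresponse().status
    except HTTPException as exc:  # same exception class as in py2, new module path
        raise RuntimeError('HTTP layer failure: %s' % exc)
    finally:
        conn.close()

In the test diff above the encoding is actually delegated to encodeRequest from WMCore rather than calling urlencode directly; the sketch only maps the old import paths to the new ones.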