Scripts
Valentin Kuznetsov edited this page Dec 20, 2017 · 2 revisions
This page provides a description of all scripts available in CMSSpark.
- bin/run_spark is a bash script that wraps spark-submit. It sets up all the JARs (Java archive libraries) needed to submit the provided Python script to HDFS+Spark.
- run_aggregation is a bash script that aggregates CMS data streams into records and submits them to CERN MONIT.
- cron4aggregation is a bash script for use in crontab to submit the run_aggregation script.
- cron4dbs_condor is a bash script for use in crontab to submit the run_spark dbs_condor.py script. It identifies the last date of DBS+HTCondor data available on HDFS and uses it in the submission.
- cron4phedex is a bash script for use in crontab to submit the run_spark phedex.py script. It identifies the last date of PhEDEx data available on HDFS and uses it in the submission.
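The "identify the last available date, then submit" step performed by cron4dbs_condor and cron4phedex can be sketched in Python as follows. This is a minimal illustration, not the scripts' actual logic: it assumes the HDFS listing has already been collected as a list of path strings and that data directories end in YYYY/MM/DD. The example paths, the function names, and the `--date` option are hypothetical.

```python
import re
from datetime import date

# Assumption: data directories end in .../YYYY/MM/DD; the real HDFS layout may differ.
DATE_RE = re.compile(r"(\d{4})/(\d{2})/(\d{2})/?$")


def last_available_date(paths):
    """Return the most recent date found among the paths, or None if there is none."""
    found = []
    for path in paths:
        match = DATE_RE.search(path)
        if not match:
            continue  # skip entries that do not look like dated directories
        year, month, day = (int(part) for part in match.groups())
        try:
            found.append(date(year, month, day))
        except ValueError:
            continue  # skip impossible dates such as 2017/13/40
    return max(found) if found else None


def build_submission(script, day):
    """Build a run_spark command as an argument list (the --date option is hypothetical)."""
    return ["run_spark", script, "--date", day.strftime("%Y%m%d")]


if __name__ == "__main__":
    # Hypothetical listing, e.g. collected from `hadoop fs -ls`.
    listing = [
        "/hypothetical/condor/2017/12/17",
        "/hypothetical/condor/2017/12/19",
        "/hypothetical/condor/2017/12/18",
        "/hypothetical/condor/_SUCCESS",
    ]
    latest = last_available_date(listing)
    print(build_submission("dbs_condor.py", latest))
```

Returning the command as an argument list rather than a single string keeps it safe to pass to `subprocess.run` without shell quoting.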

- aso_stats.py provides a PySpark pipeline to collect ASO statistics.
- data_aggregation.py provides the PySpark CMS popularity pipeline, which collects data from the DBS, AAA, CMSSW, EOS, and JM data streams.
- dbs_aaa.py provides an example PySpark pipeline for DBS+AAA aggregation.
- dbs_adler.py provides an example PySpark pipeline for DBS LFN adler lookup.
- dbs_block_lumis.py provides an example PySpark pipeline for DBS block lumi aggregation.
- dbs_cmssw.py provides an example PySpark pipeline for DBS+CMSSW aggregation.
- dbs_condor.py provides an example PySpark pipeline for DBS+HTCondor aggregation.
- dbs_eos.py provides an example PySpark pipeline for DBS+EOS aggregation.
- dbs_jm.py provides an example PySpark pipeline for DBS+JobMonitoring aggregation.
- dbs_lfn.py provides an example PySpark pipeline for DBS LFN lookup aggregation.
- dbs_phedex.py provides an example PySpark pipeline for DBS+PhEDEx aggregation.
- fts_aso.py provides an example PySpark pipeline for FTS+ASO aggregation.
- jm_stats.py provides an example PySpark pipeline for JobMonitoring statistics.
- phedex.py provides an example PySpark pipeline for PhEDEx aggregation.
- phedex_agg.py provides an example PySpark pipeline for post-processing the PhEDEx aggregation (replaced by mergePhedex.py or its Go version, which are much faster).
- wmarchive.py provides an example PySpark pipeline for WMArchive aggregation.
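The names of the DBS+X pipelines above suggest a common shape: match records from a second data stream against DBS, then aggregate per dataset. The sketch below shows that join-then-aggregate shape in plain Python rather than PySpark, so it runs without a cluster. It is an illustration under stated assumptions, not the code of any pipeline listed here: the record fields (`lfn`, `dataset`, `bytes`) and the sample values are hypothetical and do not reflect the real schemas.

```python
from collections import defaultdict


def join_and_aggregate(dbs_files, access_records):
    """Join access records to DBS files by LFN, then total them per dataset."""
    # Lookup table playing the role of the DBS side of the join.
    lfn_to_dataset = {row["lfn"]: row["dataset"] for row in dbs_files}

    totals = defaultdict(lambda: {"accesses": 0, "bytes": 0})
    for rec in access_records:
        dataset = lfn_to_dataset.get(rec["lfn"])
        if dataset is None:
            continue  # inner-join behaviour: drop records with no DBS match
        totals[dataset]["accesses"] += 1
        totals[dataset]["bytes"] += rec["bytes"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical sample records.
    dbs_files = [
        {"lfn": "/store/a.root", "dataset": "/A/B/RAW"},
        {"lfn": "/store/b.root", "dataset": "/A/B/RAW"},
        {"lfn": "/store/c.root", "dataset": "/C/D/AOD"},
    ]
    access_records = [
        {"lfn": "/store/a.root", "bytes": 100},
        {"lfn": "/store/b.root", "bytes": 50},
        {"lfn": "/store/c.root", "bytes": 7},
        {"lfn": "/store/unknown.root", "bytes": 1},
    ]
    for dataset, stats in sorted(join_and_aggregate(dbs_files, access_records).items()):
        print(dataset, stats)
```

In PySpark the same two steps would typically be expressed as a DataFrame join followed by a grouped aggregation.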

- schema.py contains all data-stream schemas.
- spark_utils.py provides generic utilities to access data streams on HDFS and to define data-stream tables.
- utils.py provides generic utilities.

- cern_monit.py is a helper script to submit given records to CERN MONIT via the AMQ broker.
- dates.py is a helper script to produce a series of dates.
- getCSV.py is a helper script to fetch HDFS dataframes and store them in a local area (use together with dbs_condor.py).
- mergePhedex.py is a helper script to process the HDFS PhEDEx dataframe and produce an aggregated one.
- mergePhedex.go is a Go equivalent of mergePhedex.py which runs about 5 times faster (on a multi-core node).
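A helper that produces a series of dates, as dates.py is described as doing, could look roughly like the sketch below. The function name, the inclusive end date, and the YYYYMMDD output format are assumptions made for illustration, not the actual interface of dates.py.

```python
from datetime import date, timedelta


def date_series(start, end, fmt="%Y%m%d"):
    """Return every day from start to end (both inclusive) as formatted strings."""
    if end < start:
        raise ValueError("end date must not be earlier than start date")
    n_days = (end - start).days
    return [(start + timedelta(days=i)).strftime(fmt) for i in range(n_days + 1)]


if __name__ == "__main__":
    # One string per day, crossing a year boundary.
    for day in date_series(date(2017, 12, 30), date(2018, 1, 2)):
        print(day)
```

Such a series can be used to drive one submission per day when back-filling a range of dates.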