Scripts

This page provides a description of all scripts available in CMSSpark.

bin/run_spark is a bash script that wraps spark-submit. It sets up all the JAR (Java archive) files needed to submit the provided Python script to HDFS+Spark.
run_aggregation is a bash script to aggregate CMS data streams into records and submit them to CERN MONIT
cron4aggregation is a bash script to be used in crontab to submit the run_aggregation script
cron4dbs_condor is a bash script to be used in crontab to submit the run_spark dbs_condor.py script; it identifies the last date of DBS+HTCondor data available on HDFS and uses it in the submission
cron4phedex is a bash script to be used in crontab to submit the run_spark phedex.py script; it identifies the last date of PhEDEx data available on HDFS and uses it in the submission
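
The cron wrappers above pick the most recent date for which data exists on HDFS. The actual scripts are bash; the Python sketch below only illustrates the idea, assuming a hypothetical layout in which each day's data lives under a path ending in YYYY/MM/DD. The function name last_available_date and the sample paths are illustrative and are not taken from CMSSpark.

```python
import re
from datetime import date

# Assumption: daily data directories end in .../YYYY/MM/DD (hypothetical layout).
DATE_SUFFIX = re.compile(r"(\d{4})/(\d{2})/(\d{2})/?$")

def last_available_date(paths):
    """Return the latest date found among the given paths, or None."""
    dates = []
    for path in paths:
        match = DATE_SUFFIX.search(path)
        if match:
            year, month, day = (int(g) for g in match.groups())
            dates.append(date(year, month, day))
    return max(dates) if dates else None

# Hypothetical directory listing, e.g. as produced by an HDFS "ls" call.
listing = [
    "/hypothetical/data/2017/09/30",
    "/hypothetical/data/2017/10/01",
    "/hypothetical/data/_SUCCESS",
]
latest = last_available_date(listing)
print(latest.strftime("%Y%m%d"))
```

Paths that do not end in a date, such as marker files, are skipped, so a partially populated directory does not break the lookup.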

aso_stats.py provides a PySpark pipeline to collect ASO statistics
data_aggregation.py provides the PySpark CMS popularity pipeline to collect data from the DBS, AAA, CMSSW, EOS and JM data streams
dbs_aaa.py provides an example PySpark pipeline for DBS+AAA aggregation
dbs_adler.py provides an example PySpark pipeline for DBS LFN adler lookup
dbs_block_lumis.py provides an example PySpark pipeline for DBS block lumi aggregation
dbs_cmssw.py provides an example PySpark pipeline for DBS+CMSSW aggregation
dbs_condor.py provides an example PySpark pipeline for DBS+HTCondor aggregation
dbs_eos.py provides an example PySpark pipeline for DBS+EOS aggregation
dbs_jm.py provides an example PySpark pipeline for DBS+JobMonitoring aggregation
dbs_lfn.py provides an example PySpark pipeline for DBS LFN lookup aggregation
dbs_phedex.py provides an example PySpark pipeline for DBS+PhEDEx aggregation
fts_aso.py provides an example PySpark pipeline for FTS+ASO aggregation
jm_stats.py provides an example PySpark pipeline for JobMonitoring stats
phedex.py provides an example PySpark pipeline for PhEDEx aggregation
phedex_agg.py provides an example PySpark pipeline for post-processing PhEDEx aggregation (superseded by mergePhedex.py or its Go version, which are much faster)
wmarchive.py provides an example PySpark pipeline for WMArchive aggregation

schema.py contains all data-stream schemas
spark_utils.py provides generic utilities to access data streams on HDFS and define data-stream tables
utils.py provides generic utilities

cern_monit.py is a helper script to submit given records to CERN MONIT via an AMQ broker
dates.py is a helper script to produce a series of dates
getCSV.py is a helper script to fetch HDFS dataframes and store them in a local area (use together with dbs_condor.py)
mergePhedex.py is a helper script to process the HDFS PhEDEx dataframe and produce an aggregated one
mergePhedex.go is a Go equivalent of mergePhedex.py which runs about 5x faster (on a multi-core node)
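
As an illustration of what a date-series helper such as dates.py does, here is a minimal sketch. It is not the script's actual code: the function name date_series, its arguments and the YYYYMMDD output format are assumptions made for this example.

```python
from datetime import date, timedelta

def date_series(start, end, fmt="%Y%m%d"):
    """Return formatted dates from start to end, inclusive.

    An empty list is returned when end is earlier than start.
    """
    ndays = (end - start).days
    return [(start + timedelta(days=i)).strftime(fmt) for i in range(ndays + 1)]

# Example: a range that crosses a month boundary.
series = date_series(date(2017, 2, 27), date(2017, 3, 2))
print(series)  # ['20170227', '20170228', '20170301', '20170302']
```

A list like this could then be looped over to submit one job per day, for instance from a crontab wrapper.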
