### Scrutiny plot
In order to produce the scrutiny plot we perform several steps to aggregate the data:
- collect PhEDEx snapshots for the desired period of time, e.g. one year
- aggregate all PhEDEx dataframes into a single one which provides the number of days each dataset was present on a specific site
- collect dataset, campaign, release, and era aggregated dataframes for the desired period of time
The first step is done by running the `run_spark phedex.py` script from CMSSpark, e.g.
```shell
# collect daily PhEDEx snapshots for the last 346 days
hdir=hdfs:///cms
dates=`python src/python/CMSSpark/dates.py --range --format="%Y%m%d" --ndays=346`
for d in $dates; do
    cmd="PYTHONPATH=$PWD/src/python bin/run_spark phedex.py --yarn --fout=$hdir --date=$d"
    echo $cmd
    PYTHONPATH=$PWD/src/python bin/run_spark phedex.py --yarn --fout=$hdir --date=$d
done
```
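For reference, a minimal sketch of the date-range generation behind the loop above, assuming `dates.py --range --ndays=N` simply enumerates the last N days in the given format (the real script supports more options):

```python
# Illustration of the dates.py --range --ndays=N idea;
# this is a sketch, not the actual CMSSpark implementation.
from datetime import date, timedelta

def date_range(ndays, fmt="%Y%m%d"):
    "Yield the last ndays dates, oldest first, formatted with fmt"
    today = date.today()
    for n in range(ndays, 0, -1):
        yield (today - timedelta(days=n)).strftime(fmt)

print(' '.join(date_range(346)))
```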
The `phedex.py` job runs over PhEDEx snapshots on HDFS and collects DataFrames with the following attributes:

```
date, site, dataset, size, replica_date, groupid
```
Then, we stage the data back from HDFS to local disk:

```shell
hadoop fs -get $hdir/phedex .
```
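Once staged, a daily snapshot can be inspected locally, e.g. with pandas. This is a hedged sketch: the exact layout of part files under `phedex/` and the absence of a header row are assumptions, not the actual staged format.

```python
# Inspect one staged daily PhEDEx snapshot; the path and the lack of a
# header row are assumptions about the staged layout, adjust as needed.
import pandas as pd

cols = ['date', 'site', 'dataset', 'size', 'replica_date', 'groupid']
df = pd.read_csv('phedex/20170101/part-00000', names=cols)
print(df.head())
```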
Finally, we use either the Python or the Go code to merge all PhEDEx DataFrames into a single DataFrame:

```shell
# python script
python src/python/CMSSpark/mergePhedex.py --idir=$PWD/phedex --fout=phedex.csv --dates 20170101-20171212
# Go-based script
go run src/Go/mergePhedex.go -idir=$PWD/phedex -fout phedex.csv -dates 20170101-20171212
```
It produces a DataFrame with the following attributes:

```
site,dataset,min_date,max_date,min_rdate,max_rdate,min_size,max_size,days,gid
```

where min/max date, rdate, and size are the minimum and maximum date, replica creation date, and size, respectively. The days attribute is calculated from the min/max dates, and gid is the PhEDEx group id number.
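For illustration, the essence of the merge can be sketched in pandas, assuming all daily snapshots are concatenated into one DataFrame `df` with the attributes listed earlier. The real logic lives in `mergePhedex.py`/`mergePhedex.go`, and the inclusive day-counting convention below is an assumption of this sketch:

```python
import pandas as pd

def merge_phedex(df):
    "One row per (site, dataset) with min/max dates, replica dates and sizes"
    out = df.groupby(['site', 'dataset']).agg(
        min_date=('date', 'min'), max_date=('date', 'max'),
        min_rdate=('replica_date', 'min'), max_rdate=('replica_date', 'max'),
        min_size=('size', 'min'), max_size=('size', 'max'),
        gid=('groupid', 'first'),
    ).reset_index()
    # days on site, derived from the min/max snapshot dates (YYYYMMDD);
    # the inclusive +1 convention is an assumption of this sketch
    to_dt = lambda s: pd.to_datetime(s.astype(str), format='%Y%m%d')
    out['days'] = (to_dt(out['max_date']) - to_dt(out['min_date'])).dt.days + 1
    return out[['site', 'dataset', 'min_date', 'max_date', 'min_rdate',
                'max_rdate', 'min_size', 'max_size', 'days', 'gid']]
```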
The next step involves production of the dataset, era, campaign, and release DataFrames from HTCondor ClassAds logs merged (if necessary) with the DBS database snapshot on HDFS. To produce these DataFrames we use the `run_spark dbs_condor.py` combination of scripts in the following way:

```shell
PYTHONPATH=$PWD/src/python bin/run_spark dbs_condor.py --yarn --fout=$hdir --date=$d
```
It yields results into the `$hdir/dbs_condor` area on HDFS, which we can stage back to local disk.
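Once both outputs are staged locally, the scrutiny inputs can be assembled by joining them on the dataset name. A hedged sketch, assuming a staged dataset-level CSV under `dbs_condor/` with a dataset column and an access-count column called `naccess` (both file name and column names are assumptions, not the actual schema):

```python
import pandas as pd

phedex = pd.read_csv('phedex.csv')             # output of the merge step
usage = pd.read_csv('dbs_condor/dataset.csv')  # hypothetical staged file name
# left join keeps datasets that were on disk but never accessed
scrutiny = phedex.merge(usage[['dataset', 'naccess']], on='dataset', how='left')
scrutiny['naccess'] = scrutiny['naccess'].fillna(0)
print(scrutiny[['site', 'dataset', 'days', 'max_size', 'naccess']].head())
```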
The procedure described above for collecting the PhEDEx and DBS+HTCondor information is automated via a set of crontab jobs. In particular, we submit two scripts: `cron4phedex` and `cron4dbs_condor`.
They identify the last date of data available on HDFS and submit the appropriate jobs to collect the aggregated DataFrames.
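A minimal sketch of the "last available date" logic such a cron script needs, assuming per-date sub-directories named `YYYYMMDD` under the HDFS output area (the actual `cron4phedex`/`cron4dbs_condor` scripts may do this differently):

```python
import re
import subprocess

def last_date(hdir='hdfs:///cms/phedex'):
    "Return the most recent YYYYMMDD sub-directory found under hdir"
    out = subprocess.check_output(['hadoop', 'fs', '-ls', hdir]).decode()
    dates = re.findall(r'/(\d{8})\s*$', out, re.M)
    return max(dates) if dates else None

print(last_date())
```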