Goal:
Production covers three areas: generation+simulation (cmsim), Hit
Formatting (writeHits), and Digitization and pileup (writeDigis).
    cmsim:      generated+simulated events ---> .fz files
    writeHits:  .fz files ---> Objectivity database
    writeDigis: Objectivity database ---> Objectivity database
Resources:
Enstore
We use the StorageTek tape library in Enstore for mass storage of .fz
files and OBJY data files. The Enstore catalog of files on tape is
managed with a filesystem lookalike called pnfs. /pnfs is "mounted" on
gallo and velveeta and can be queried like a normal filesystem with
commands like "ls", "mkdir", etc. The exception is that to copy files to
and from Enstore you have to use the "encp" command instead of "cp".
Copying files to pnfs with encp means that they actually get written to
tape. The CMS production area in pnfs is under /pnfs/cms/production.
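The cp/encp rule can be sketched in a few lines of Python. This is only an illustration: the helper names `is_pnfs_path` and `copy_command` and the example file name are hypothetical, only the plain "encp <source> <destination>" form is assumed, and nothing is actually executed.

```python
import posixpath

PNFS_ROOT = "/pnfs"  # pnfs mount point named in the text


def is_pnfs_path(path):
    """Return True if the path lies inside the pnfs namespace."""
    norm = posixpath.normpath(path)
    return norm == PNFS_ROOT or norm.startswith(PNFS_ROOT + "/")


def copy_command(source, destination):
    """Build the argv for a copy; encp is required when pnfs is involved."""
    uses_pnfs = is_pnfs_path(source) or is_pnfs_path(destination)
    tool = "encp" if uses_pnfs else "cp"
    return [tool, source, destination]


if __name__ == "__main__":
    # Hypothetical file name, used only to show the resulting command.
    print(copy_command("/data/jetmet_production/example.fz",
                       "/pnfs/cms/production/example.fz"))
```

A wrapper like this only chooses the tool; whether the copy succeeded still has to be checked from the exit status of the command itself.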
We create jobs and launch them on gallo.fnal.gov. The .fz files come
from the /pnfs/cms/production/Projects/<data-set-name>/results area. The
CMS production area on gallo where we create and launch jobs is
/data/jetmet_production. The .fz files are also staged on the popcrn
nodes on which jobs are running.
The Objectivity database is called a federation because it is actually a
collection of database files. The C federation is kept on gallo.fnal.gov
and the D federation is kept on velveeta.fnal.gov. The scripts are kept
on gallo.fnal.gov. When running writeHits or writeDigis, it is important
to monitor the usage of the /data disk on velveeta and the /data disk on
gallo. The space requirements are about the same.
popcrn31-40:
These are the nodes of the production farm. Each node has two processors
and can run two jobs at a time. popcrn31-40 (except 37) are also pileup
servers: they keep the minimum bias data and serve it to jobs running
writeDigis that need pileup. popcrn1-2 are reserved for testing.
cmsprod account:
Username: cmsprod
Password: xxxxxxx
fbsng (farm batch system next generation):
This is the batch system used on the popcrn nodes. Some useful commands:
    setup fbsng        ( Must be done before using fbsng )
    fbs lj             ( Lists all jobs currently running or queued )
    fbs nodes          ( Lists all running jobs by node )
    fbs submit x.jdf   ( Submits a job. )
Each fbs job supports multiple processes.
production scripts:
The scripts we will use for production have already been checked out for
you in one of these directories:
    /data/jetmet_production/B_scripts
    /data/jetmet_production/C_scripts
    /data/jetmet_production/D_scripts
backup scripts:
Greg has written backup scripts for production. Go to the directory of
either the C or D federation and start backing up data and system files.
The syntax is as follows:
Backup of data files for both Hits and Digis:
HITS:
    fb -v -i `pwd` -o jetmet_production -n <dataset name> -w <dataset Hitsownername> -c <optional> backup_data_files > keyword.txt
There are five possible values of <optional> in the case of Hits:
    1. Collections
    2. Events
    3. MCInfo
    4. Hits
    5. THits
Make sure you repeat the backup command for each value of <optional>.
DIGIS:
    fb -v -i `pwd` -o jetmet_production -n <dataset name> -w <dataset Digisownername> -c <optional> backup_data_files > keyword.txt
There are three possible values of <optional> in the case of Digis:
    1. Collections
    2. Events
    3. Digis
Make sure you repeat the backup command for each value of <optional>.
Backup of system files:
    fb -v -i `pwd` -o <C or D federation> backup_system_files > keyword.txt
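Since the data-file backup has to be repeated once per -c value, it can help to generate the command lines instead of retyping them. The sketch below only builds the argument lists from the syntax shown above; the function name `backup_commands` and the example data set, owner, and directory are hypothetical, and the "> keyword.txt" redirection is left to whoever runs the commands.

```python
import shlex

# -c values listed in the text for each kind of data file.
HITS_OPTIONS = ["Collections", "Events", "MCInfo", "Hits", "THits"]
DIGIS_OPTIONS = ["Collections", "Events", "Digis"]


def backup_commands(dataset, owner, options, federation_dir):
    """Return one 'fb ... backup_data_files' argv per -c value."""
    return [
        ["fb", "-v", "-i", federation_dir, "-o", "jetmet_production",
         "-n", dataset, "-w", owner, "-c", option, "backup_data_files"]
        for option in options
    ]


if __name__ == "__main__":
    # Hypothetical data set, owner, and federation directory.
    for argv in backup_commands("example_data_set", "example_hits_owner",
                                HITS_OPTIONS, "/path/to/federation"):
        print(shlex.join(argv))
```

Printing the commands first, rather than running them directly, leaves room to inspect them before anything is written to the backup area.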
Objectivity servers:
Objectivity uses an AMS server to communicate across the network. It also
has a lock server to provide safe concurrent access to the database
files. The AMS server should be running on velveeta, gallo, and
popcrn31-40 at all times during production. Lock servers should be
running on popcrn06 and popcrn07 at all times during production. To
check, log on to the corresponding machines as cmsprod and type:
    oocheckams
    oocheckls
It doesn't matter where you do this from. If either needs to be started
(or stopped) you can use the following:
    setup systools
    cmd ams-server start
    cmd ams-server stop
    cmd ams-lock start &
    cmd ams-lock stop
In some circumstances, it may not be possible to stop the lock server.
In that case, call an expert (so he/she can issue a "kill -9" and do the
necessary cleanup).
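As a checklist aid, the host lists above can be turned into (host, command) pairs. This is a sketch only: the names `AMS_HOSTS`, `LOCK_HOSTS`, and `check_plan` are hypothetical, and the code does not log in anywhere or run the checks.

```python
# Hosts taken from the text: AMS on velveeta, gallo, popcrn31-40;
# lock servers on popcrn06 and popcrn07.
AMS_HOSTS = ["velveeta", "gallo"] + [f"popcrn{n}" for n in range(31, 41)]
LOCK_HOSTS = ["popcrn06", "popcrn07"]


def check_plan():
    """Return (host, check command) pairs covering every required server."""
    plan = [(host, "oocheckams") for host in AMS_HOSTS]
    plan += [(host, "oocheckls") for host in LOCK_HOSTS]
    return plan


if __name__ == "__main__":
    for host, command in check_plan():
        print(f"{host}: {command}")
```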
Web Site for more information:
http://computing.fnal.gov/cms/Monitor/cms_production.html
(1) So I just got in and I'm on shift today. What do I do ?
This tutorial will assume that you are starting out at the beginning. In
real life, you may start in the middle of any of these procedures.
(a) Attend the production meeting every Monday at 10:00 AM and Thursday
at 1:30 PM. Get instructions there on what samples need to be processed.
Otherwise, wait for instructions from the Production coach.
(b) If this is your first day on shift, make sure that your name is
included in the FBS job description file so that you will be notified
when jobs finish. To do this:
    i) Log into gallo as cmsprod.
    ii) cd /data/jetmet_production/C_scripts/cms_prod_util
        or D_scripts/cms_prod_util
        or B_scripts/cms_prod_util
    iii) emacs Templates/hits_template.jdf
    iv) Add your email address to the EMAIL lines (there is more than
        one place in the file) and save the changes.
    v) Repeat for all digis_template*.jdf files in Templates.
(c) For doing CMSIM:
    i) Log into gallo as cmsprod.
    ii) setup fbsng
    iii) cd /data/jetmet_production/C_scripts/cms_prod_util
         or D_scripts/cms_prod_util or B_scripts/cms_prod_util
    iv) Check the disk space of gallo:/data and velveeta:/data. There
        should be at least 50 GB free on each. If there is not, ask for
        assistance and free space will be made. When logged into gallo:
            df -k /data
            rsh velveeta df -k /data
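The 50 GB check can be automated by parsing the "df -k" output. The sketch below assumes the common column layout (Filesystem, 1K-blocks, Used, Available, Use%, Mounted on); that layout, the function names, and the sample output are assumptions, so compare against what df prints on gallo and velveeta before relying on it.

```python
REQUIRED_FREE_KB = 50 * 1024 * 1024  # 50 GB expressed in 1K blocks


def available_kb(df_output):
    """Extract the Available column from 'df -k <dir>' output.

    All lines after the header are joined first, because df may wrap
    a long device name onto a line of its own.
    """
    lines = df_output.strip().splitlines()
    fields = " ".join(lines[1:]).split()
    return int(fields[-3])  # ..., Available, Use%, Mounted on


def has_enough_space(df_output, required_kb=REQUIRED_FREE_KB):
    """Return True if at least required_kb kilobytes are free."""
    return available_kb(df_output) >= required_kb


if __name__ == "__main__":
    # Made-up sample output, for illustration only.
    sample = (
        "Filesystem 1K-blocks Used Available Use% Mounted on\n"
        "/dev/sda5 209715200 104857600 104857600 50% /data\n"
    )
    print(has_enough_space(sample))
```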
The following are the steps involved in cmsim.
    scripts/DeclareCMSIMJobs.sh -v -n 40 data_set_name [number]
This command creates a directory for data_set_name under the cmsim
directory. Then it creates a directory "production" under the data set
name directory. Then it creates the following directory structure under
the production directory:
    declared
    created
    in_progress
    done
    params
    problems
The command gets the list of all ntpl files for the given data set from
the specified directory on gallo and creates a file for each of them in
the "declared" directory. If it fails to create an entry in the
"declared" directory, it reports the error message in the "problems"
directory. If you receive no error message, everything is OK.
    scripts/CreateCMSIMJobs.sh -v data_set_name
This command creates a batch directory under the data set name
directory. Then it creates the following directory structure under the
batch directory:
    asociations
    created
    declared
    finished
    jdf
    logs
    params
    running
    scripts
    submitted
This command creates entries in the asociations, created and declared
directories for each ntpl file. It also creates a script from
cmsim_template for each ntpl file and puts it into the scripts
directory. Then it creates a job description file for each entry and
puts it in the jdf directory.
    scripts/RunJob.sh -v -j cmsim data_set_name
This command submits all jobs for the data set named in the command, one
by one, to the production farm. Entries for all jobs that have been
submitted successfully appear in the batch/submitted directory. After a
job has completed successfully, its file entry is moved to the done
directory. If a job does not run successfully, its file entry is moved
to the problems directory.
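To get a feel for the layout these scripts produce, the two directory trees can be recreated in a scratch area. This sketch only mirrors the directory names listed above; it is not the production scripts, and the function name `make_cmsim_layout` and the example data set name are hypothetical.

```python
import tempfile
from pathlib import Path

# Directory names copied from the listings in the text.
PRODUCTION_DIRS = ["declared", "created", "in_progress", "done",
                   "params", "problems"]
BATCH_DIRS = ["asociations", "created", "declared", "finished", "jdf",
              "logs", "params", "running", "scripts", "submitted"]


def make_cmsim_layout(root, data_set_name):
    """Create cmsim/<data_set_name>/{production,batch}/... under root."""
    base = Path(root) / "cmsim" / data_set_name
    for parent, names in (("production", PRODUCTION_DIRS),
                          ("batch", BATCH_DIRS)):
        for name in names:
            (base / parent / name).mkdir(parents=True, exist_ok=True)
    return base


if __name__ == "__main__":
    # Build the tree in a throwaway directory and print it.
    with tempfile.TemporaryDirectory() as scratch:
        base = make_cmsim_layout(scratch, "example_data_set")
        for path in sorted(base.rglob("*")):
            print(path.relative_to(scratch))
```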
(d) For doing OOHits:
    i) Log into gallo as cmsprod.
    ii) setup fbsng
    iii) cd /data/jetmet_production/C_scripts/cms_prod_util
         or D_scripts/cms_prod_util
    iv) Check the disk space of gallo:/data and velveeta:/data. There
        should be at least 50 GB free on each. If there is not, ask for
        assistance and free space will be made. When logged into gallo:
            df -k /data
            rsh velveeta df -k /data
    v) You will receive one or more data sets to process with
       writeHits for the day.
The following are the steps involved in OOHit formatting.
    scripts/DeclareHitsJobs.sh -v data_set_name
This command creates a directory for data_set_name under the OOHit
directory. Then it creates a directory "production" under the data set
name directory. Then it creates the following directory structure under
the production directory:
    declared
    created
    in_progress
    done
    problems
The command gets the list of all fz files for the given data set from
tape and creates a file for each of them in the "declared" directory,
without the fz suffix. If it fails to create an entry in the "declared"
directory, it reports the error message. If you receive no error
message, everything is OK.
    scripts/CreateHitsJobs.sh -v data_set_name
This command creates a batch directory under the data set name
directory. Then it creates the following directory structure under the
batch directory:
    asociations
    created
    declared
    finished
    jdf
    running
    scripts
    submitted
This command creates entries in the asociations, created and declared
directories for each fz file. It also creates a script from
hits_template for each fz file and puts it into the scripts directory.
Then it creates a job description file for each entry and puts it in the
jdf directory.
    scripts/RunJob.sh -v -j OOHit data_set_name [number]
This command submits all jobs for the data set named in the command, one
by one, to the production farm. Entries for all jobs that have been
submitted successfully appear in the batch/submitted directory. OOHit
formatting consists of three stages:
    Staging
    RunHits
    ValidateHits
In the first stage, each fz file is staged from enstore tape to the
run-time area. In the second stage, hit formatting is done. In the final
stage, the hit run number is validated. The three stages execute one
after the other, and each depends on the exit code of the previous
stage: if that exit code was not zero, the next stage is not executed.
After a job has completed successfully, its file entry is moved to the
done directory. If a job does not run successfully, its file entry is
moved to the problems directory.
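The stage chaining described above (run the next stage only if the previous one exited with zero) can be sketched as follows. The stage callables here are stand-ins that just return an exit code; the function name `run_stages` and the returned ("done" or "problems", executed stages) pair are illustrative assumptions, not the behaviour of the real job scripts.

```python
def run_stages(stages):
    """Run (name, callable) stages in order.

    Each callable returns an exit code. Execution stops at the first
    non-zero code, in which case the job counts as a problem.
    """
    executed = []
    for name, stage in stages:
        executed.append(name)
        if stage() != 0:
            return "problems", executed
    return "done", executed


if __name__ == "__main__":
    # Stand-in stages: RunHits fails, so ValidateHits never runs.
    outcome = run_stages([
        ("Staging", lambda: 0),
        ("RunHits", lambda: 1),
        ("ValidateHits", lambda: 0),
    ])
    print(outcome)
```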
(e) For doing OODigis:
    i) Log into gallo as cmsprod.
    ii) setup fbsng
    iii) cd /data/jetmet_production/C_scripts/cms_prod_util
         or D_scripts/cms_prod_util
         echo $PROD_RESOURCES
         setenv PROD_RESOURCES `pwd`/scripts
    iv) Check the disk space of velveeta:/data. There should be at
        least 50 GB free. If there is not, ask for assistance and free
        space will be made. When logged into gallo:
            df -k /data
            rsh velveeta df -k /data
    v) You will receive one or more data sets to process with
       writeDigis for the day.
The following steps are involved in OODigitization formatting.
    scripts/DeclareDigisJobs.sh -v data_set_name pileup_descriptor
This command creates a directory for data_set_name under the OOHit
directory. Then it creates a directory "production" under the data set
name directory. Then it creates the following directory structure under
the production directory:
    declared
    created
    in_progress
    done
    problems
The command gets the list of all fz files for the given data set from
tape and creates a file for each of them in the "declared" directory,
without the fz suffix. If it fails to create an entry in the "declared"
directory, it reports the error message. If you receive no error
message, everything is OK.
    scripts/CreateDigisJobs.sh -v data_set_name pileup_descriptor
This command creates a batch directory under the data set name
directory. Then it creates the following directory structure under the
batch directory:
    asociations
    created
    declared
    finished
    jdf
    running
    scripts
    submitted
This command creates entries in the asociations, created and declared
directories for each fz file. It also creates a script from
hits_template for each fz file and puts it into the scripts directory.
Then it creates a job description file for each entry and puts it in the
jdf directory.
    scripts/RunJob.sh -v -j OODigi data_set_name [number]
This command submits all jobs for the data set named in the command, one
by one, to the production farm. Entries for all jobs that have been
submitted successfully appear in the batch/submitted directory. After a
job has completed successfully, its file entry is moved to the done
directory. If a job does not run successfully, its file entry is moved
to the problems directory.
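All three job types follow the same Declare, Create, RunJob pattern, which the sketch below writes out as argument lists. It is assembled from the command lines quoted in this tutorial; the function name `production_commands` is hypothetical, the optional [number] argument is left out because its meaning is not described here, and nothing is executed.

```python
def production_commands(job_type, data_set_name, pileup_descriptor=None):
    """Return the Declare/Create/RunJob argv lists for one data set."""
    if job_type == "cmsim":
        stem, declare_opts, tail = "CMSIM", ["-v", "-n", "40"], []
    elif job_type == "OOHit":
        stem, declare_opts, tail = "Hits", ["-v"], []
    elif job_type == "OODigi":
        if pileup_descriptor is None:
            raise ValueError("OODigi jobs need a pileup descriptor")
        stem, declare_opts, tail = "Digis", ["-v"], [pileup_descriptor]
    else:
        raise ValueError(f"unknown job type: {job_type}")
    return [
        [f"scripts/Declare{stem}Jobs.sh", *declare_opts, data_set_name, *tail],
        [f"scripts/Create{stem}Jobs.sh", "-v", data_set_name, *tail],
        ["scripts/RunJob.sh", "-v", "-j", job_type, data_set_name],
    ]


if __name__ == "__main__":
    # Hypothetical data set and pileup descriptor names.
    for argv in production_commands("OODigi", "example_data_set",
                                    "example_pileup"):
        print(" ".join(argv))
```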
(2) What can I do while it's running ?
Some run-time sanity checks:
    i) Check that the jobs have been submitted OK with "fbs lj".
    ii) Check on which nodes jobs are running with "fbs nodes".
    iii) After a few minutes, check that the disk usage on
         velveeta:/data is growing. Check this from gallo using
         "rsh velveeta du -sk /data". Check it again shortly after and
         see if the disk usage has increased.
    iv) Each job will have N sections depending on what N you gave to
        the "create_hits,digis_jobs" script. Count the number of
        sections listed in the "fbs nodes" output for each job. Did any
        sections croak ?
    v) Lots of fun: After a while, you can get basic statistics on the
       Web page. Go to:
           http://computing.fnal.gov/cms/Monitor/cms_production.html
       and follow the "Production Farms" link. You can get various
       network traffic plots and CPU utilizations for all the popcrn
       nodes, gallo, and velveeta. Also, you can get a summary of which
       nodes have jobs running on them.
    vi) Do you know your FBS job id numbers ? Then you can check which
        event you are on with "python ~/bin/LogChecker.py <job_id>".
        This command parses the log files as they are written in
        /data/fbs-logs on gallo and gets the last event number
        processed.
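The "is the disk usage growing" check amounts to comparing two "du -sk" readings taken a few minutes apart. A small sketch, assuming the usual "<kilobytes><whitespace><directory>" output line; the function names and the sample readings are made up.

```python
def used_kb(du_output):
    """Parse the size column of a single 'du -sk <dir>' output line."""
    return int(du_output.split()[0])


def is_growing(earlier, later):
    """Return True if the later reading shows more disk usage."""
    return used_kb(later) > used_kb(earlier)


if __name__ == "__main__":
    # Made-up readings, as if taken a few minutes apart.
    first = "104857600\t/data\n"
    second = "104960000\t/data\n"
    print(is_growing(first, second))
```

If the usage is flat between two readings while jobs are listed as running, that is a hint to look at the job logs.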
(5) How do I know it finished OK ?
(a) When all sections of the job have exited, you will receive several
emails. Check the output of the email labeled "main." It looks like this
(which is a failed job!):
Section Info:
Job 2118 Section: main
Exec: ['/data/cms_production_220301/cms_production/scripts/digis_dispatcher',
'1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'jm_sm_qq_qqh120_inv', '1034']
Submit_Time: Sun Apr  8 13:29:10 2001
Start_Time: Sun Apr  8 13:29:16 2001
End_time: Sun Apr  8 13:35:52 2001
Exit Code:1
Number of Process 10
-----------------------------
Process Info:
-----------------------------
Process 1
Node: popcrn26
Start Time: Sun Apr  8 13:29:16 2001
End Time: Sun Apr  8 13:35:47 2001
Exit Code:1
Reason:Killed by BMGR
CPU Time: 187
-----------------------------
Process 2
Node: popcrn10
Start Time: Sun Apr  8 13:29:16 2001
End Time: Sun Apr  8 13:35:30 2001
Exit Code:1
Reason:Killed by BMGR
CPU Time: 150
-----------------------------
Process 3
Node: popcrn37
Start Time: Sun Apr  8 13:29:16 2001
End Time: Sun Apr  8 13:35:47 2001
Exit Code:1
Reason:Killed by BMGR
CPU Time: 40
-----------------------------
Note the CPU time of each process. If any stick out, there may have been
a problem. Also check the exit codes. If any are non-zero, there may have
been a problem. However, problems do arise that do not touch the exit
code, so beware.
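Reading many of these emails by eye is error-prone, so here is a sketch that pulls the node, exit code, and CPU time out of the "Process Info" part of a report shaped like the one above. The field layout is taken from that sample only. The function names are hypothetical, and the outlier rule (CPU time more than a factor of two away from the median) is an arbitrary assumption for illustration, not an official criterion.

```python
import re
from statistics import median

PROCESS_RE = re.compile(
    r"Process (\d+)\s+Node:\s*(\S+).*?Exit Code:\s*(-?\d+).*?CPU Time:\s*(\d+)",
    re.DOTALL,
)


def parse_processes(report):
    """Return one dict per process listed after 'Process Info:'."""
    _, _, body = report.partition("Process Info:")
    return [
        {"process": int(num), "node": node,
         "exit_code": int(code), "cpu_time": int(cpu)}
        for num, node, code, cpu in PROCESS_RE.findall(body)
    ]


def failed_processes(processes):
    """Processes whose exit code is non-zero."""
    return [p for p in processes if p["exit_code"] != 0]


def cpu_outliers(processes, factor=2.0):
    """Processes whose CPU time is far from the median (assumed rule)."""
    if not processes:
        return []
    mid = median(p["cpu_time"] for p in processes)
    return [p for p in processes
            if p["cpu_time"] < mid / factor or p["cpu_time"] > mid * factor]


if __name__ == "__main__":
    # Shortened report in the same shape as the email above.
    sample = (
        "Process Info:\n"
        "Process 1\nNode: popcrn26\nExit Code:1\nCPU Time: 187\n"
        "Process 2\nNode: popcrn10\nExit Code:1\nCPU Time: 150\n"
        "Process 3\nNode: popcrn37\nExit Code:1\nCPU Time: 40\n"
    )
    procs = parse_processes(sample)
    print(failed_processes(procs))
    print(cpu_outliers(procs))
```

As the text warns, a clean exit code does not prove the job was fine, so a summary like this complements the directory checks rather than replacing them.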
(b) Check the job directories.
    i) Log into gallo as cmsprod.
    ii) cd /data/jetmet_production/cms_db
    iii) "ls cmsim/<data_set_name>/production/problems" or
         "ls OOHits/production/problems" or
         "ls OODigis/production/problems"
    iv) If there are any entries here, then there was a problem.
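The same check can be scripted: list whatever sits in a "problems" directory and treat a non-empty result as a failure. A minimal sketch; the function name `problem_entries` is hypothetical, and the demonstration runs against a temporary directory rather than the real cms_db area.

```python
import tempfile
from pathlib import Path


def problem_entries(problems_dir):
    """Return the sorted entry names in a problems directory.

    A missing directory is treated as having no entries.
    """
    directory = Path(problems_dir)
    if not directory.is_dir():
        return []
    return sorted(entry.name for entry in directory.iterdir())


if __name__ == "__main__":
    # Stand-in directory with one fake problem entry.
    with tempfile.TemporaryDirectory() as scratch:
        problems = Path(scratch) / "production" / "problems"
        problems.mkdir(parents=True)
        (problems / "example_entry").touch()
        print(problem_entries(problems))
```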