CMS Monte Carlo Production at FNAL
Shift Takers Manual

Goal:
   Production covers three areas: generation and simulation (cmsim), hit
  formatting (writeHits), and digitization with pileup (writeDigis).
 
      cmsim: generated+simulated events ---> .fz files
      writeHits: .fz files ---> Objectivity database
      writeDigis: Objectivity database ---> Objectivity database

Resources:
    Enstore
       We use the StorageTek tape library in Enstore for mass
      storage of .fz files and OBJY data files.  The Enstore catalog of files
      on tape is managed with a filesystem lookalike called pnfs.
      /pnfs is "mounted" on gallo and velveeta and can be queried like
      a normal filesystem with commands like "ls", "mkdir", etc.  The
      exception is that to copy files to and from enstore you have to
      use the "encp" command instead of "cp".  Copying files to pnfs
      with encp means that they actually get written to tape.
      The CMS production area in pnfs is under /pnfs/cms/production.
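
The cp/encp rule above can be sketched as a small helper.  This is only an
illustration: the names is_pnfs and copy_command are hypothetical, and encp
is assumed to take a source and a destination like cp, as the text implies.
No encp options are shown.

```python
import posixpath

PNFS_ROOT = "/pnfs"  # pnfs mount point named in this manual


def is_pnfs(path):
    """Return True if the path lies inside the /pnfs namespace."""
    norm = posixpath.normpath(path)
    return norm == PNFS_ROOT or norm.startswith(PNFS_ROOT + "/")


def copy_command(src, dst):
    """Build the argument list for a copy.

    Anything that touches pnfs must go through encp (it reads or writes
    tape); everything else can use plain cp.
    """
    tool = "encp" if is_pnfs(src) or is_pnfs(dst) else "cp"
    return [tool, src, dst]
```

The helper only builds the command; it does not run it, so it can be tried
safely on any machine.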
 
      We create jobs and launch them on gallo.fnal.gov.  The .fz files
      come from the /pnfs/cms/production/Projects/<data-set-name>/results
      area. The CMS production area on gallo where we create and launch jobs
      is in /data/jetmet_production. The .fz files are also staged on the
      popcrn nodes on which jobs are running.
 
      The Objectivity database is called a federation because it is
      actually a collection of database files.  The C federation is kept on
      gallo.fnal.gov and the D federation is kept on velveeta.fnal.gov.  The
      scripts are kept on gallo.fnal.gov.  When running writeHits or
      writeDigis, it is important to monitor the usage of the /data disk on
      velveeta and the /data disk on gallo.  The space requirements are about
      the same.

 popcrn31-40:
    These are the nodes of the production farm.  Each node has two
      processors and can run two jobs at a time.  popcrn31-40 (except
      37) are also pileup servers.  They keep the minimum bias data and
      serve it to jobs running writeDigis that need pileup.  popcrn1-2 are
      reserved for testing.

cmsprod account:
     Username: cmsprod
       Password: xxxxxxx
 
        fbsng (farm batch system next generation):
        This is the batch system used on the popcrn nodes.  Some useful
      commands:
            setup fbsng      ( Must be done before using fbsng )
            fbs lj           ( Lists all jobs currently running or queued )
            fbs nodes        ( Lists all running jobs by node )
            fbs submit x.jdf ( Submits a job )
      Each fbs job supports multiple processes.
 
 production scripts:
      The scripts we will use for production have already been checked out
      for you in one of these directories:
              /data/jetmet_production/B_scripts
              /data/jetmet_production/C_scripts
              /data/jetmet_production/D_scripts
 
backup scripts:
        Greg has written backup scripts for production. Go to the directory of either the C or D
        federation and start backing up data and system files.
        The syntax is as follows:
 
        Backup datafiles for both Hits and Digis:
 
        HITS:
        fb -v -i `pwd` -o jetmet_production -n <dataset name> -w <dataset Hitsownername> -c <optional> backup_data_files > keyword.txt
 
        There are five possible values of <optional> for Hits:
        1. Collections
        2. Events
        3. MCInfo
        4. Hits
        5. THits
 
        Make sure you repeat the backup command for each of them.

        DIGIS:
        fb -v -i `pwd` -o jetmet_production -n <dataset name> -w <dataset Digisownername> -c <optional> backup_data_files > keyword.txt
 
        There are three possible values of <optional> for Digis:
        1. Collections
        2. Events
        3. Digis
 
        Make sure you repeat the backup command for each of them.
 
        Backup of system files:
        fb -v -i `pwd` -o <C or D federation> backup_system_files > keyword.txt
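
Because the data-file backup must be repeated once per <optional> value, it
can help to generate the command lines rather than type them.  The sketch
below only builds strings in the syntax shown above; the function name
backup_commands is hypothetical, federation_dir stands in for the `pwd`
argument, and the "> keyword.txt" redirection is left to the shift taker.

```python
# <optional> values listed in this manual for each kind of backup.
HITS_OPTIONS = ["Collections", "Events", "MCInfo", "Hits", "THits"]
DIGIS_OPTIONS = ["Collections", "Events", "Digis"]


def backup_commands(federation_dir, dataset, owner, options):
    """Return one 'fb ... backup_data_files' command line per option.

    federation_dir replaces `pwd` in the manual's syntax, so the commands
    do not depend on the current working directory.
    """
    template = ("fb -v -i {fed} -o jetmet_production -n {name} "
                "-w {owner} -c {opt} backup_data_files")
    return [
        template.format(fed=federation_dir, name=dataset, owner=owner, opt=opt)
        for opt in options
    ]
```

Printing the result for HITS_OPTIONS gives five lines and for DIGIS_OPTIONS
three, one per value in the lists above.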

 Objectivity servers:
          Objectivity uses the AMS server to communicate across the network.
          It also has a lock server to provide safe concurrent access to the
          database files.  The AMS server should be running on velveeta, gallo, and popcrn31-40 at all times
          during production. Lock servers should be running on popcrn06 and popcrn07 at all times
          during production.  To check, log on to the corresponding machines as cmsprod and type:
 
                 oocheckams
                 oocheckls
 
        It doesn't matter where you do this from.  If either needs to be
        started (or stopped), you can use the following:
 
                 setup systools
                 cmd ams-server start
                 cmd ams-server stop
                 cmd ams-lock start &
                 cmd ams-lock stop
 
        In some circumstances, it may not be possible to stop the lock
        server.  In that case, call an expert (who can issue a
        "kill -9" and do the necessary cleanup).
 
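The list of hosts to check can be written down as a short checklist
generator.  This is a sketch under the host lists stated above (AMS on
velveeta, gallo and popcrn31-40; lock servers on popcrn06 and popcrn07).
The helper names are hypothetical and nothing is executed on any node.

```python
def popcrn(first, last):
    """Return popcrn host names for an inclusive range, e.g. popcrn06."""
    return ["popcrn%02d" % n for n in range(first, last + 1)]


# Hosts taken from the text above.
AMS_HOSTS = ["velveeta", "gallo"] + popcrn(31, 40)
LOCK_HOSTS = popcrn(6, 7)


def server_checklist():
    """Return (host, command) pairs to run as cmsprod during production."""
    checks = [(host, "oocheckams") for host in AMS_HOSTS]
    checks += [(host, "oocheckls") for host in LOCK_HOSTS]
    return checks
```

Looping over server_checklist() and printing each pair gives a list to tick
off at the start of a shift.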
  Web Site for more information:
          http://computing.fnal.gov/cms/Monitor/cms_production.html

 
        (1) So I just got in and I'm on shift today.  What do I do ?
 
    This tutorial assumes that you are starting from the
beginning. In real life, you may start in the middle of any of these
procedures.
 
    (a) Attend the production meeting every Monday at 10:00 AM and Thursday at 1:30 PM.
        Get instructions there on what samples need to be processed.
        Otherwise, wait for instructions from the production coach.
 
    (b) If this is your first day on shift, make sure that your name
        is included in the FBS job description file so that you will be
        notified when jobs finish.  To do this:
           i) Log into gallo as cmsprod.
          ii) cd /data/jetmet_production/C_scripts/cms_prod_util
              (or D_scripts/cms_prod_util or B_scripts/cms_prod_util)
         iii) emacs Templates/hits_template.jdf
          iv) Add your email address to the EMAIL lines (there is more
              than one place in the file) and save the changes.
           v) Repeat for all digis_template*.jdf files in Templates.

     (c) For doing CMSIM:
           i) Log into gallo as cmsprod.
          ii) setup fbsng
         iii) cd /data/jetmet_production/C_scripts/cms_prod_util
              (or D_scripts/cms_prod_util or B_scripts/cms_prod_util)
          iv) Check the disk space of gallo:/data and velveeta:/data. There
              should be at least 50 GB free on each.  If there is not, ask
              for assistance and free space will be made.  When logged
              into gallo:
                             df -k /data
                             rsh velveeta df -k /data
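
The 50 GB check can be automated by parsing the "df -k" output.  The sketch
below assumes the usual df layout, with a header line and the "Available"
column third from the end, and it treats 50 GB as 50 * 1024 * 1024 blocks
of 1 KB.  Both are assumptions, and the function names are hypothetical.

```python
MIN_FREE_KB = 50 * 1024 * 1024  # 50 GB expressed in 1 KB blocks (assumed)


def available_kb(df_output):
    """Extract the Available column from 'df -k <dir>' output.

    df may wrap a long device name onto its own line, so all lines after
    the header are joined before splitting into fields.  The last three
    fields are then Available, Use% and the mount point.
    """
    lines = df_output.strip().splitlines()
    fields = " ".join(lines[1:]).split()
    return int(fields[-3])


def has_enough_space(df_output, minimum_kb=MIN_FREE_KB):
    """Return True if the filesystem has at least minimum_kb free."""
    return available_kb(df_output) >= minimum_kb
```

The text passed in would be the captured output of "df -k /data" on gallo
or of "rsh velveeta df -k /data".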
 
 
           The following are the steps involved in cmsim.
 
           scripts/DeclareCMSIMJobs.sh -v -n 40 data_set_name [number]
 
           This command creates a directory for data_set_name under the cmsim
           directory. It then creates a "production" directory under the data
           set name directory, with the following directory structure under
           it:
                declared
                created
                in_progress
                done
                params
                problems

           The command gets the list of all ntpl files for the given data set from
           the specified directory on gallo and creates an entry for each in the
           "declared" directory. If it fails to create an entry in the "declared"
           directory, it reports the error in the "problems" directory. If you
           receive no error message, everything is OK.

           scripts/CreateCMSIMJobs.sh -v data_set_name

           This command creates a batch directory under the data set name
           directory, with the following directory structure under it:

                asociations
                created
                declared
                finished
                jdf
                logs
                params
                running
                scripts
                submitted

           This command creates entries in the asociations, created and declared
           directories for each ntpl file. It also creates a script from
           cmsim_template for each ntpl file and puts it into the scripts directory.
           Then it creates a job description file for each entry and puts
           it in the jdf directory.

          scripts/RunJob.sh -v -j cmsim data_set_name

           This command submits, one by one, all jobs for the data set named
           in the command to the production farm. Entries for all jobs that
           have been submitted successfully appear in the batch/submitted
           directory.  After a job completes successfully, its entry is
           moved to the done directory. If a job does not run successfully,
           its entry is moved to the problems directory.
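
Since each job is represented by an entry that moves between state
directories, a quick count per directory shows how far a data set has got.
This is a minimal sketch, assuming every entry is a plain file or directory
inside the state directories listed above.  The function names are
hypothetical and production_dir must be supplied by the caller.

```python
import os

# State directories created under <data_set_name>/production.
STATES = ["declared", "created", "in_progress", "done", "problems"]


def summarize(production_dir):
    """Count the entries in each state directory (0 if it is missing)."""
    counts = {}
    for state in STATES:
        path = os.path.join(production_dir, state)
        counts[state] = len(os.listdir(path)) if os.path.isdir(path) else 0
    return counts


def needs_attention(counts):
    """Return True if any job entry ended up in the problems directory."""
    return counts["problems"] > 0
```

Calling summarize() on a production directory returns a dictionary with one
count per state.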
 
   (d) For doing OOHits:
           i) Log into gallo as cmsprod.
          ii) setup fbsng
         iii) cd /data/jetmet_production/C_scripts/cms_prod_util (or D_scripts/cms_prod_util)
          iv) Check the disk space of gallo:/data and velveeta:/data. There
              should be at least 50 GB free on each.  If there is not, ask
              for assistance and free space will be made.  When logged
              into gallo:
                             df -k /data
                             rsh velveeta df -k /data

           v) You will receive one or more data sets to process with
              writeHits for the day.
 
           The following are the steps involved in OOHit formatting.

           scripts/DeclareHitsJobs.sh -v data_set_name

           This command creates a directory for data_set_name under the OOHit
           directory. It then creates a "production" directory under the data
           set name directory, with the following directory structure under
           it:

                declared
                created
                in_progress
                done
                problems

           The command gets the list of all fz files for the given data set from
           tape and creates an entry for each in the "declared" directory,
           without the fz suffix. If it fails to create an entry in the "declared"
           directory, it reports an error message. If you receive no error
           message, everything is OK.

           scripts/CreateHitsJobs.sh -v data_set_name

           This command creates a batch directory under the data set name
           directory, with the following directory structure under it:

                asociations
                created
                declared
                finished
                jdf
                running
                scripts
                submitted

           This command creates entries in the asociations, created and declared
           directories for each fz file. It also creates a script from
           hits_template for each fz file and puts it into the scripts directory.
           Then it creates a job description file for each entry and puts
           it in the jdf directory.

           scripts/RunJob.sh -v -j OOHit data_set_name [number]

           This command submits, one by one, all jobs for the data set named
           in the command to the production farm. Entries for all jobs that
           have been submitted successfully appear in the batch/submitted
           directory. OOHit formatting consists of three stages:

                 Staging
                 RunHits
                 ValidateHits

           In the first stage, each fz file is staged from Enstore tape to the
           run-time area. In the second stage, hit formatting is done. In the
           final stage, the hit run number is validated. The three stages
           execute one after the other, and each depends on the exit code of
           the previous stage: if that exit code is not zero, the next stage
           is not executed. After a job completes successfully, its entry is
           moved to the done directory. If a job does not run successfully,
           its entry is moved to the problems directory.
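
The exit-code chaining between the three stages can be pictured with the
sketch below.  It is not the production dispatcher: the function name
run_stages is hypothetical and the commands for each stage are supplied by
the caller.  It only shows the rule that a stage runs if and only if the
previous one exited with code zero.

```python
import subprocess


def run_stages(stages):
    """Run (name, argv) stages in order and stop at the first failure.

    Returns (failed_stage_name, exit_code), or (None, 0) if every stage
    exited with code zero.
    """
    for name, argv in stages:
        code = subprocess.call(argv)
        if code != 0:
            # A non-zero exit code means the later stages are skipped.
            return name, code
    return None, 0
```

With stages named Staging, RunHits and ValidateHits, a failure in RunHits
means ValidateHits is never started, and the returned name shows where to
look first.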

    (e) For doing OODigis:
           i) Log into gallo as cmsprod.
          ii) setup fbsng
         iii) cd /data/jetmet_production/C_scripts/cms_prod_util (or D_scripts/cms_prod_util),
              then check and set PROD_RESOURCES:
                             echo $PROD_RESOURCES
                             setenv PROD_RESOURCES `pwd`/scripts
          iv) Check the disk space of velveeta:/data. There should be at
              least 50 GB free.  If there is not, ask for assistance and free
              space will be made.  When logged into gallo:

                             df -k /data
                             rsh velveeta df -k /data

           v) You will receive one or more data sets to process with
              writeDigis for the day.
              The following steps are involved in OODigitization formatting.

              scripts/DeclareDigisJobs.sh -v data_set_name pileup_descriptor

              This command creates a directory for data_set_name under the OOHit
              directory. It then creates a "production" directory under the
              data set name directory, with the following directory structure
              under it:

                     declared
                     created
                     in_progress
                     done
                     problems

              The command gets the list of all fz files for the given data set
              from tape and creates an entry for each in the "declared" directory,
              without the fz suffix. If it fails to create an entry in the "declared"
              directory, it reports an error message. If you receive no error
              message, everything is OK.

              scripts/CreateDigisJobs.sh -v data_set_name pileup_descriptor

              This command creates a batch directory under the data set name
              directory, with the following directory structure under it:

                    asociations
                    created
                    declared
                    finished
                    jdf
                    running
                    scripts
                    submitted

              This command creates entries in the asociations, created and
              declared directories for each fz file. It also creates a script
              from hits_template for each fz file and puts it into the scripts
              directory. Then it creates a job description file for each entry
              and puts it in the jdf directory.

              scripts/RunJob.sh -v -j OODigi data_set_name [number]

              This command submits, one by one, all jobs for the data set
              named in the command to the production farm. Entries for all
              jobs that have been submitted successfully appear in the
              batch/submitted directory. After a job completes successfully,
              its entry is moved to the done directory. If a job does not run
              successfully, its entry is moved to the problems directory.

   (2) What can I do while it's running ?
 
             Some run-time Sanity Checks

       i) Check that the jobs have been submitted OK with fbs lj.

      ii) Check on which nodes jobs are running with fbs nodes.

     iii) After a few minutes, check that the disk usage on
          velveeta:/data is growing.  Check this from gallo
          using "rsh velveeta du -sk /data".
          Check it again shortly after and see if disk space is
          accumulating.

      iv) Each job will have N sections depending on what N you gave to
          the "create_hits,digis_jobs" script.  Count the number of
          sections listed in the "fbs nodes" output for each job.  Did any
          sections croak ?

       v) Lots of fun: After a while, you can get basic statistics on the
          Web page.  Go to:
               http://computing.fnal.gov/cms/Monitor/cms_production.html
          and follow the "Production Farms" link.  You can get various
          network traffic plots and CPU utilizations for all the popcrn
          nodes, gallo, and velveeta.  Also, you can get a summary of which
          nodes have jobs running on them.

      vi) Do you know your FBS job id numbers ?  Then you can check
          which event you are on with "python ~/bin/LogChecker.py <job_id>".
          This command parses the log files as they are written in
          /data/fbs-logs on gallo and gets the last event number
          processed.
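
For check iii), two "du -sk" samples taken a few minutes apart can be
compared as below.  This is a minimal sketch that assumes the standard du
output of a size in KB followed by the path.  The function names are
hypothetical.

```python
def used_kb(du_output):
    """Extract the size in KB from the first field of 'du -sk' output."""
    return int(du_output.split()[0])


def growth_kb(earlier, later):
    """Return how many KB were added between two 'du -sk' samples."""
    return used_kb(later) - used_kb(earlier)
```

A result of zero or less while jobs are supposed to be writing suggests that
nothing is reaching velveeta:/data and is worth a closer look.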

  (3) How do I know it finished OK ?

  (a) When all sections of the job have exited, you will receive several
      emails.  Check the output of the email labeled "main".  It looks
      like this (this one is a failed job!):
 

        Section Info:
        Job 2118 Section: main
        Exec: ['/data/cms_production_220301/cms_production/scripts/
                digis_dispatcher', '1', '2', '3', '4', '5', '6', '7',
        '8', '9', '10', 'jm_sm_qq_qqh120_inv', '1034']
        Submit_Time: Sun Apr  8 13:29:10 2001
        Start_Time:  Sun Apr  8 13:29:16 2001
        End_time: Sun Apr  8 13:35:52 2001
        Exit Code:1
        Number of Process 10
        -----------------------------
        Process Info:
        -----------------------------
        Process 1
        Node: popcrn26
        Start Time: Sun Apr  8 13:29:16 2001
        End Time: Sun Apr  8 13:35:47 2001
        Exit Code:1
        Reason:Killed by BMGR
        CPU Time: 187
        -----------------------------
        Process 2
        Node: popcrn10
        Start Time: Sun Apr  8 13:29:16 2001
        End Time: Sun Apr  8 13:35:30 2001
        Exit Code:1
        Reason:Killed by BMGR
        CPU Time: 150
        -----------------------------
        Process 3
        Node: popcrn37
        Start Time: Sun Apr  8 13:29:16 2001
        End Time: Sun Apr  8 13:35:47 2001
        Exit Code:1
        Reason:Killed by BMGR
        CPU Time: 40
        -----------------------------

     Note the CPU time of each process.  If any stick out, there may
     have been a problem.  Also check the exit codes.  If any are non-zero,
     there may have been a problem.  However, problems do arise that do not
     touch the exit code, so beware.
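
Reading the email by eye gets tedious for jobs with many processes.  The
sketch below parses the "Process Info" part of a message in the format shown
above and flags processes with a non-zero exit code or an unusual CPU time.
The rule for "unusual" (more than a factor of two away from the median) is an
assumption made for illustration, not a production criterion, and the
function names are hypothetical.

```python
import re
import statistics

PROCESS_RE = re.compile(r"^\s*Process (\d+)\s*$")
FIELD_RE = re.compile(r"^\s*(Node|Exit Code|CPU Time):\s*(\S+)")


def parse_processes(text):
    """Return one dict per 'Process N' block found in the email text."""
    processes = []
    current = None
    for line in text.splitlines():
        match = PROCESS_RE.match(line)
        if match:
            current = {"process": int(match.group(1))}
            processes.append(current)
            continue
        match = FIELD_RE.match(line)
        # Fields before the first 'Process N' line belong to the section
        # header and are ignored here.
        if match and current is not None:
            key, value = match.groups()
            current[key] = value if key == "Node" else int(value)
    return processes


def suspicious(processes, factor=2.0):
    """Return process numbers with a bad exit code or odd CPU time."""
    times = [p["CPU Time"] for p in processes if "CPU Time" in p]
    median = statistics.median(times) if times else 0
    flagged = []
    for proc in processes:
        cpu = proc.get("CPU Time")
        bad_exit = proc.get("Exit Code", 0) != 0
        odd_cpu = (cpu is not None and median > 0
                   and (cpu > factor * median or cpu < median / factor))
        if bad_exit or odd_cpu:
            flagged.append(proc["process"])
    return flagged
```

As the text above warns, a clean result from such a check does not prove the
job was fine, because some problems never show up in the exit code.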

  (b) Check the job directories.

           i) Log into gallo as cmsprod.
          ii) cd /data/jetmet_production/cms_db
         iii) "ls cmsim/<data_set_name>/production/problems" or
              "ls OOHits/<data_set_name>/production/problems" or
              "ls OODigis/<data_set_name>/production/problems"
          iv) If there are any entries here, then there was a problem.