www.eu-eela.org
E-science grid facility for Europe and Latin America
COMPUTING ELEMENT
GIUSEPPE PLATANIA, INFN Catania
30 June - 4 July, 2008
www.eu-eela.eu Catania (Italy), Joint EELA/EGEEIII Tutorial for Trainers, 30.06.2008 – 04.07.2008
OUTLINE
• OVERVIEW
• INSTALLATION & CONFIGURATION
• TESTING
• FIREWALL SETUP
• TROUBLESHOOTING
OVERVIEW
• The Computing Element (CE) is the central service of a site.
• Its main functions are:
– manage jobs (job submission, job control)
– report the status of the jobs to the WMS
– publish all site information (site location, queues, CPU status, and so on) via LDAP (site BDII service)
• It can run several kinds of batch system:
– Torque + MAUI
– LSF
– SGE
– Condor
TORQUE + MAUI
• The Torque server consists of:
– pbs_server, which provides the basic batch services such as receiving/creating a batch job.
• The Torque client consists of:
– pbs_mom, which places the job into execution. It is also responsible for returning the job's output to the user.
• The MAUI system consists of:
– job_scheduler, which holds the site's policies used to choose which job must be executed.
Site BDII**
– By default it is installed on the CE
– It collects all site GRISes* (for example SE, RB, LFC, etc.)
– The name of the service is bdii
– The list of GRISes you want to publish is in: /opt/glite/etc/gip/site-urls.conf
– Log file: /opt/bdii/var/bdii.log
*GRIS = Grid Resource Information Service
**BDII = Berkeley Database Information Index
Computing Element installation & configuration using YAIM
WHAT KIND OF CE?
There are several metapackages that can be installed:
ig_CE – LCG ComputingElement without batch system packages.
ig_CE_LSF – LCG ComputingElement with LSF. IMPORTANT: provided for consistency; it does not install LSF, but it applies some fixes via ig_configure_node.
ig_CE_torque – LCG ComputingElement with Torque + MAUI.
HOW TO GET A HOST CERTIFICATE
• Host certificate for the CE.
– Please request it from your RA (Registration Authority)
• Install the host certificate (hostcert.pem and hostkey.pem) in /etc/grid-security.
– mkdir /etc/grid-security
– chmod 644 hostcert.pem
– chmod 400 hostkey.pem
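The permission steps above can be tried out safely with a small sketch like the one below. It is only an illustration: GRID_SEC_DIR is a hypothetical variable that defaults to a scratch directory, and the empty placeholder files stand in for the certificate and key issued by your RA. On a real CE the directory would be /etc/grid-security and the commands would be run as root.

```shell
# Sketch: apply the permissions from the slide to a certificate/key pair.
# GRID_SEC_DIR is a hypothetical variable; it defaults to a scratch directory
# so that the example does not touch /etc/grid-security.
GRID_SEC_DIR="${GRID_SEC_DIR:-$(mktemp -d)}"
mkdir -p "$GRID_SEC_DIR"

# Placeholder files stand in for the real hostcert.pem and hostkey.pem.
[ -f "$GRID_SEC_DIR/hostcert.pem" ] || : > "$GRID_SEC_DIR/hostcert.pem"
[ -f "$GRID_SEC_DIR/hostkey.pem" ]  || : > "$GRID_SEC_DIR/hostkey.pem"

chmod 644 "$GRID_SEC_DIR/hostcert.pem"   # certificate: readable by everyone
chmod 400 "$GRID_SEC_DIR/hostkey.pem"    # private key: readable by the owner only

# Read back the permission strings (first 10 characters of "ls -l").
cert_perms=$(ls -l "$GRID_SEC_DIR/hostcert.pem" | cut -c1-10)
key_perms=$(ls -l "$GRID_SEC_DIR/hostkey.pem" | cut -c1-10)
echo "hostcert.pem: $cert_perms"
echo "hostkey.pem:  $key_perms"
```

A private key that is readable by group or others is a common reason for grid services to refuse to start, so reading the modes back after installation is a cheap check.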
Repository settings
• REPOS="ca dag ig jpackage gilda glite-lcg_ce_torque glite-bdii"
Download and store the repo files:
• for name in $REPOS; do wget http://grid018.ct.infn.it/mrepo/repos/$name.repo -O /etc/yum.repos.d/$name.repo; done
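Before touching /etc/yum.repos.d it may help to see what the loop would do. The following dry-run sketch prints each wget command instead of executing it; BASE_URL and DEST_DIR are illustrative variable names that simply hold the values from the slide.

```shell
# Dry run of the download loop: print the commands, do not execute them.
REPOS="ca dag ig jpackage gilda glite-lcg_ce_torque glite-bdii"
BASE_URL="http://grid018.ct.infn.it/mrepo/repos"   # taken from the slide
DEST_DIR="/etc/yum.repos.d"

count=0
for name in $REPOS; do
    echo "wget $BASE_URL/$name.repo -O $DEST_DIR/$name.repo"
    count=$((count + 1))
done
echo "$count repo files would be downloaded"
```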
INSTALLATION
• yum install jdk java-1.5.0-sun-compat
• yum install lcg-CA
• yum install ig_CE_torque
If the CE is also the site BDII collector:
• yum install ig_BDII
GILDA RPMs:
• yum install gilda_utils
Customize ig-site-info.def
• Copy the ig-site-info.def template file provided by ig_yaim into the gilda directory and customize it:
cp /opt/glite/yaim/examples/siteinfo/ig-site-info.def /opt/glite/yaim/etc/gilda/<your_site-info.def>
• Open /opt/glite/yaim/etc/gilda/<your_site-info.def> in a text editor and set the following values according to your grid environment:
CE_HOST=<write the hostname of the CE you are installing>
BATCH_SERVER=$CE_HOST
WN_LIST=/opt/glite/yaim/etc/gilda/wn-list.conf
The file specified in WN_LIST must contain the hostnames of all your WNs.
WARNING: It is important to set it up before running the configure command.
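A quick sanity check of the WN list before configuring can be sketched as follows. This assumes the file holds one fully qualified WN hostname per line, which the slide does not state explicitly; the hostnames are placeholders, and a scratch file stands in for /opt/glite/yaim/etc/gilda/wn-list.conf.

```shell
# Scratch file standing in for the real wn-list.conf.
# ASSUMPTION: one WN hostname per line; the names below are placeholders.
WN_LIST=$(mktemp)
cat > "$WN_LIST" <<'EOF'
wn01.example.org
wn02.example.org
wn02.example.org
EOF

# Count non-empty entries and find hostnames that appear more than once.
total=$(grep -c . "$WN_LIST")
duplicates=$(sort "$WN_LIST" | uniq -d)

echo "WN entries: $total"
if [ -n "$duplicates" ]; then
    echo "Duplicate entries: $duplicates"
fi
rm -f "$WN_LIST"
```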
• Copy users and groups example files to /opt/glite/yaim/etc/gilda/
cp /opt/glite/yaim/examples/ig-groups.conf /opt/glite/yaim/etc/gilda/
cp /opt/glite/yaim/examples/ig-users.conf /opt/glite/yaim/etc/gilda/
• Append the gilda users and groups definitions to /opt/glite/yaim/etc/gilda/ig-users.conf and /opt/glite/yaim/etc/gilda/ig-groups.conf
cat /opt/glite/yaim/etc/gilda/gilda_ig-users.conf >> /opt/glite/yaim/etc/gilda/ig-users.conf
cat /opt/glite/yaim/etc/gilda/gilda_ig-groups.conf >> /opt/glite/yaim/etc/gilda/ig-groups.conf
GROUPS_CONF=/opt/glite/yaim/etc/gilda/ig-groups.conf
USERS_CONF=/opt/glite/yaim/etc/gilda/ig-users.conf
JAVA_LOCATION="/usr/java/j2sdk1.4.2_12"
SITE_EMAIL=grid-prod@<your_domain>
SITE_NAME=GILDA-01..05
SITE_LOC="Catania, ITALY"
SITE_LAT=37.5
SITE_LONG=15.152
SITE_WEB="https://gilda.ct.infn.it"
SITE_TIER="GILDA Testbed"
SITE_SUPPORT_SITE="grid-prod@<your_domain>"
JOB_MANAGER=lcgpbs
CE_BATCH_SYS=pbs
BATCH_BIN_DIR=/usr/bin
BATCH_VERSION=torque-2.1.9-4
CE_CPU_MODEL=Opteron
CE_CPU_VENDOR=AMD
CE_CPU_SPEED=3000
CE_OS="Scientific Linux"
CE_OS_RELEASE=4.5
CE_OS_VERSION="SL"
CE_MINPHYSMEM=2048
CE_MINVIRTMEM=4096
CE_SMPSIZE=2
CE_SI00=1000
CE_SF00=1200
CE_OUTBOUNDIP=TRUE
CE_INBOUNDIP=TRUE
DPM_HOST="dpm_hostname"
SE_LIST="$DPM_HOST"
SITE_BDII_HOST=$CE_HOST
BDII_REGIONS="CE SE"
BDII_CE_URL="ldap://$CE_HOST:2170/mds-vo-name=resource,o=grid"
BDII_SE_URL="ldap://$DPM_HOST:2170/mds-vo-name=resource,o=grid"
VOS="gilda"
ALL_VOMS="gilda"
QUEUES="short long infinite"
SHORT_GROUP_ENABLE=$VOS
LONG_GROUP_ENABLE=$VOS
INFINITE_GROUP_ENABLE=$VOS
To configure a queue for a single VO:
QUEUES="short long infinite gilda"
SHORT_GROUP_ENABLE=$VOS
LONG_GROUP_ENABLE=$VOS
INFINITE_GROUP_ENABLE=$VOS
GILDA_GROUP_ENABLE="gilda"
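The settings above follow a naming pattern: each queue in QUEUES has a matching variable made of the upper-case queue name plus _GROUP_ENABLE. A minimal sketch of a consistency check, using the values from the slide, could look like this. It assumes queue names contain only characters that are valid in shell variable names; it is not part of YAIM.

```shell
# Values copied from the slide (single-VO queue variant).
VOS="gilda"
QUEUES="short long infinite gilda"
SHORT_GROUP_ENABLE=$VOS
LONG_GROUP_ENABLE=$VOS
INFINITE_GROUP_ENABLE=$VOS
GILDA_GROUP_ENABLE="gilda"

missing=""
for queue in $QUEUES; do
    # Derive the variable name, e.g. "short" -> SHORT_GROUP_ENABLE.
    var="$(printf '%s' "$queue" | tr '[:lower:]' '[:upper:]')_GROUP_ENABLE"
    # Indirect lookup of the variable's value (empty if unset).
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
        missing="$missing $var"
    else
        echo "$queue -> $var=$value"
    fi
done

if [ -n "$missing" ]; then
    echo "Missing settings:$missing"
else
    echo "Every queue has a GROUP_ENABLE entry"
fi
```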
CE Torque CONFIGURATION
• Now we can configure the node:
/opt/glite/yaim/bin/ig_yaim -c -s /opt/glite/yaim/etc/gilda/<your_site-info.def> -n ig_CE_torque -n BDII_site
Computing Element testing
Testing
• Check that the local GRIS and the site BDII are running on the CE and are publishing the right information (CPU, site name and so on):
ldapsearch -x -h <ce_hostname> -p 2170 -b mds-vo-name=resource,o=grid
ldapsearch -x -h <ce_hostname> -p 2170 -b mds-vo-name=<site_name>,o=grid
• Become a gilda user:
# su - gilda001
• Create a file named test.sh and write:
#!/bin/sh
sleep 20    # useful for seeing the job status
hostname
• Save it and make it executable:
chmod 700 test.sh
[gilda001@ce gilda001]$ qsub -q short test.sh
[gilda001@ce gilda001]$ qstat -a

ce.localdomain:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - ----
3.wn.localdo    gilda001 short    test.sh      5839  --  --     -- 00:15 R   --
[gilda001@ce gilda001]$ qstat -a
[gilda001@ce gilda001]$
• The job has finished; list the output files:
[gilda001@ce gilda001]$ ls
test.sh.e3 test.sh.o3
• And show them:
[gilda001@ce gilda001]$ cat test.sh.e3    (error file)
[gilda001@ce gilda001]$
[gilda001@ce gilda001]$ cat test.sh.o3    (output file)
wn.localdomain
Log on to the UI:
hostname -> glite-tutor.ct.infn.it
Username -> catania01..30
Password -> GridCAT01..30
Grid passphrase -> CATANIA
[plt@glite-tutor plt]$ voms-proxy-init --voms gilda
[plt@glite-tutor plt]$ globus-job-run grid006.ct.infn.it:2119/jobmanager-lcgpbs -q short /bin/hostname
wn.localdomain
[plt@glite-tutor plt]$ edg-job-submit -r grid006.ct.infn.it:2119/jobmanager-lcgpbs-short hostname.jdl
Selected Virtual Organisation name (from proxy certificate extension): gilda
Connecting to host glite-rb.ct.infn.it, port 7772
Logging to host glite-rb.ct.infn.it, port 9002
********************************************************************************
JOB SUBMIT OUTCOME
The job has been successfully submitted to the Network Server.
Use edg-job-status command to check job current status. Your job identifier (edg_jobId) is:
- https://glite-rb.ct.infn.it:9000/Vo-4Ih1s-iDbBPr3rs69GQ
********************************************************************************
FIREWALL SETUP
/etc/sysconfig/iptables
*filter
:INPUT ACCEPT [0:0]
:FORWARD ACCEPT [0:0]
:OUTPUT ACCEPT [0:0]
:RH-Firewall-1-INPUT - [0:0]
-A INPUT -j RH-Firewall-1-INPUT
-A FORWARD -j RH-Firewall-1-INPUT
-A RH-Firewall-1-INPUT -i lo -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 2135 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 2119 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 2170 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 2811 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport maui -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport pbs_mom -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport pbs_resmom -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport pbs -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 3878:3879 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 3879 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 3882 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 1020:1023 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 20000:25000 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 32768:65535 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m udp -p udp --dport 32768:65535 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m tcp --syn -j REJECT
-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
COMMIT
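Most of the ACCEPT rules in this file share one template and differ only in protocol and destination port. As an illustration (not a replacement for the file above), the sketch below generates such lines from a list of protocol:port pairs; RULES is a hypothetical variable holding a subset of the numeric ports from the slides, and the output is only printed, not loaded into iptables.

```shell
# Sketch: generate ACCEPT rules in the same form as the file above.
CHAIN="RH-Firewall-1-INPUT"
# Subset of the slide's ports, written as protocol:port or protocol:first:last.
RULES="tcp:22 tcp:2135 tcp:2119 tcp:2170 tcp:2811 tcp:3878:3879 udp:3879 tcp:20000:25000"

generated=""
for rule in $RULES; do
    proto=${rule%%:*}    # text before the first colon
    dport=${rule#*:}     # text after the first colon (keeps port ranges intact)
    line="-A $CHAIN -m state --state NEW -m $proto -p $proto --dport $dport -j ACCEPT"
    echo "$line"
    generated="$generated$line
"
done
```

Keeping the port list in one place makes it easier to compare what a site opens with what the services on the CE actually need.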
IPTABLES STARTUP
/sbin/chkconfig iptables on
/etc/init.d/iptables start
Troubleshooting
[plt@ui plt]$ globus-job-run <ce_hostname>:2119/jobmanager-lcgpbs -q short /bin/hostname
GRAM Job submission failed because the connection to the server failed (check host and port) (error code 12)
Solution: check that the globus-gatekeeper daemon is up and running on the CE.
[plt@ui plt]$ globus-job-run <ce_hostname>:2119/jobmanager-lcgpbs -q short /bin/hostname
GRAM Job submission failed because authentication failed:
GSS Major Status: Authentication Failed
GSS Minor Status Error Chain:
init.c:499: globus_gss_assist_init_sec_context_async: Error during context initialization
init_sec_context.c:171: gss_init_sec_context: SSLv3 handshake problems
globus_i_gsi_gss_utils.c:888: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials
globus_i_gsi_gss_utils.c:847: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials: Couldn't verify the remote certificate
OpenSSL Error: s3_pkt.c:1046: in library: SSL routines, function SSL3_READ_BYTES: sslv3 alert bad certificate (error code 7)
Solution: the GILDA CA RPM is probably not installed on the CE.
[plt@ui plt]$ edg-gridftp-ls gsiftp://<ce_hostname>/
error the server sent an error response: 530 530 LCMAPS credential mapping NOT successful
error the server sent an error response: 530 530 LCMAPS credential mapping NOT successful
Solution: on the CE, check the VO mapping in /opt/edg/etc/lcmaps/gridmapfile and /opt/edg/etc/lcmaps/groupmapfile
The CE is publishing wrong information, such as:
GlueCEStateFreeCPUs: 0
GlueCEStateRunningJobs: 0
GlueCEStateStatus: Production
GlueCEStateTotalJobs: 0
GlueCEStateWaitingJobs: 4444
Run the script:
/opt/glite/etc/gip/plugin/glite-info-dynamic-scheduler-wrapper
and check whether it reports any errors. Often it fails because the batch system is down or in a locked state. In that case, restart the Torque service:
/etc/init.d/pbs_server restart
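Spotting this symptom can be automated with a small filter over the published attributes. The sketch below works on a sample copied from the slide rather than on live ldapsearch output, and it assumes that a GlueCEStateWaitingJobs value of 4444 is a placeholder that appears when the dynamic information is not being refreshed, which is how the slide treats it.

```shell
# Sample text copied from the slide; on a real site this would come from an
# ldapsearch query against the CE on port 2170.
ldif_sample='GlueCEStateFreeCPUs: 0
GlueCEStateRunningJobs: 0
GlueCEStateStatus: Production
GlueCEStateTotalJobs: 0
GlueCEStateWaitingJobs: 4444'

# Extract the value published for GlueCEStateWaitingJobs.
waiting=$(printf '%s\n' "$ldif_sample" |
    awk -F': ' '$1 == "GlueCEStateWaitingJobs" { print $2 }')

# ASSUMPTION: 4444 is treated as a "not updated" placeholder value.
if [ "$waiting" = "4444" ]; then
    verdict="suspect"
    echo "WaitingJobs=4444: dynamic information is probably not being updated"
else
    verdict="ok"
    echo "WaitingJobs=$waiting"
fi
```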
• If a query to the site BDII does not show the information about a site, look at the BDII log file /opt/bdii/var/bdii.log
• For example:
GILDA: ldap_bind: Can't contact LDAP server
Check:
– that bdii is up and running (ps aux | grep bdii)
– that the resource URL is in the list file /opt/glite/etc/gip/site-urls.conf
– the firewall setup
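The second check can be scripted. The sketch below uses a scratch file in place of /opt/glite/etc/gip/site-urls.conf and assumes each line has the form "<name> <ldap url>", modelled on the BDII_CE_URL and BDII_SE_URL values shown earlier; the hostnames and the is_published helper are hypothetical.

```shell
# Scratch copy standing in for /opt/glite/etc/gip/site-urls.conf.
# ASSUMPTION: each line is "<name> <ldap url>"; hostnames are placeholders.
urls_file=$(mktemp)
cat > "$urls_file" <<'EOF'
CE ldap://ce.example.org:2170/mds-vo-name=resource,o=grid
SE ldap://dpm.example.org:2170/mds-vo-name=resource,o=grid
EOF

# Hypothetical helper: succeeds if the host appears in an ldap:// URL.
is_published() {
    grep -qF "ldap://$1:" "$urls_file"
}

listed=""
unlisted=""
for host in ce.example.org lfc.example.org; do
    if is_published "$host"; then
        listed="$listed $host"
        echo "$host: listed"
    else
        unlisted="$unlisted $host"
        echo "$host: NOT listed, so the site BDII will not collect it"
    fi
done
rm -f "$urls_file"
```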