BLCR Integration

From Define Wiki
Revision as of 11:53, 7 December 2012 by Michael

PDD Link to Files: <file>\\srv-vfs2\PDD_DATA\Product Development\High Performance Computing\HPC Software Information\Platform\BLCR Integration|BLCR Integration</file>

BLCR 0.8.2 patch for 2.6.18-238.el5 Kernels

Note: see https://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=2990

  • Compilation error for BLCR 0.8.2 on 2.6.18-238.el5 kernel (CentOS 5.6):
  .
  CC [M]  /usr/src/redhat/BUILD/blcr-0.8.2/builddir/cr_module/kbuild/cr_watchdog.o
/usr/src/redhat/BUILD/blcr-0.8.2/builddir/cr_module/kbuild/cr_watchdog.c:44: warning: type defaults to 'int' in declaration of 'DECLARE_DELAYED_WORK'
/usr/src/redhat/BUILD/blcr-0.8.2/builddir/cr_module/kbuild/cr_watchdog.c:44: warning: parameter names (without types) in function declaration
/usr/src/redhat/BUILD/blcr-0.8.2/builddir/cr_module/kbuild/cr_watchdog.c: In function 'cr_wd_run':
/usr/src/redhat/BUILD/blcr-0.8.2/builddir/cr_module/kbuild/cr_watchdog.c:60: error: 'cr_wd_work' undeclared (first use in this function)
.
.
/usr/src/redhat/BUILD/blcr-0.8.2/builddir/cr_module/kbuild/cr_watchdog.c:93: error: 'cr_wd_work' undeclared (first use in this function)
gmake[5]: *** [/usr/src/redhat/BUILD/blcr-0.8.2/builddir/cr_module/kbuild/cr_watchdog.o] Error 1
gmake[4]: *** [_module_/usr/src/redhat/BUILD/blcr-0.8.2/builddir/cr_module/kbuild] Error 2
  • The following patch is needed for BLCR 0.8.2 to compile against the 2.6.18-238.el5 kernel.
File: cr_watchdog.c (line 35):

--- cr_module/cr_watchdog.c     21 Aug 2008 05:22:49 -0000      1.7
+++ cr_module/cr_watchdog.c     17 Feb 2011 19:42:05 -0000
@@ -32,7 +32,7 @@
 // Interval for the watchdog to wakeup HZ = 1s
 #define CR_WD_INTERVAL (HZ)

-#if HAVE_STRUCT_DELAYED_WORK
+#if HAVE_STRUCT_DELAYED_WORK && defined(DECLARE_DELAYED_WORK)
   typedef struct work_struct * cr_work_arg_t;
   #define CR_DECLARE_DELAYED_WORK DECLARE_DELAYED_WORK
 #else /* struct work_struct */
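
The patched guard only takes the DECLARE_DELAYED_WORK branch when the kernel headers actually define that macro. One way to see which branch a given kernel will take is to grep its workqueue header. The sketch below is illustrative only: the helper name has_delayed_work_macro is made up, and the header path in the usage comment is an assumption about where kernel-devel is installed.

```shell
# Hypothetical helper (not part of BLCR): succeeds if the given header file
# contains a #define for DECLARE_DELAYED_WORK.
has_delayed_work_macro() {
    grep -Eq '^[[:space:]]*#[[:space:]]*define[[:space:]]+DECLARE_DELAYED_WORK' "$1"
}

# Example use (the path is an assumption; adjust to your kernel-devel tree):
#   hdr=/lib/modules/$(uname -r)/build/include/linux/workqueue.h
#   if has_delayed_work_macro "$hdr"; then
#       echo "macro defined: DECLARE_DELAYED_WORK branch is used"
#   else
#       echo "macro missing: patched guard falls through to the work_struct branch"
#   fi
```

If the macro is missing while HAVE_STRUCT_DELAYED_WORK is set, the unpatched source fails with the errors shown in the build log above.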
Install kernel source on Compilation Nodes
kusu-ngedit
# in the node group editor, add the kernel-devel and kernel-headers packages
Install BLCR from source
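tar zxvf blcr-0.8.2.tar.gz
cd blcr-0.8.2
# the original note had "make --enable-test"; --enable-* options belong to ./configure, not make
# (run ./configure --help to confirm the exact name of the test option)
./configure
make
make insmod
make check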
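
After make insmod it can help to confirm that the kernel modules really are loaded before running the tests. The sketch below reads lsmod output and looks for two module names; the names blcr and blcr_imports are an assumption based on BLCR 0.8.x and should be checked against your build, and the helper name is hypothetical.

```shell
# Hypothetical helper: reads `lsmod`-style output on stdin and succeeds only
# if both assumed BLCR module names (blcr and blcr_imports) are listed.
blcr_modules_loaded() {
    awk '$1 == "blcr" { a = 1 }
         $1 == "blcr_imports" { b = 1 }
         END { exit !(a && b) }'
}

# Example use:
#   lsmod | blcr_modules_loaded && echo "BLCR modules loaded"
```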
Install BLCR from SRC RPM

NOTE: I had to edit the .spec file to remove all mentions of multilib, libdir32, etc.

rpm -Uvh blcr-0.8.2-1.src.rpm
cd /usr/src/redhat/SPECS
# edit the spec file to remove all references to multilib and libdir32
# (see the spec diff at the end of this page)
rpmbuild -bb blcr-0.8.2.spec
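
Before running rpmbuild, a quick count of the remaining multilib references shows whether the spec edit is complete. This is only a sketch; count_multilib_refs is a made-up helper name, and the two patterns are taken from the note above.

```shell
# Hypothetical helper: prints how many lines of the given spec file still
# mention multilib or libdir32. A result of 0 means the edit is complete.
count_multilib_refs() {
    # grep -c exits non-zero when the count is 0, so mask that status
    grep -Ec 'multilib|libdir32' "$1" || true
}

# Example use:
#   count_multilib_refs /usr/src/redhat/SPECS/blcr-0.8.2.spec
```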
Add RPMs to Platform Repo
# Copy the RPMs to the contrib dir
cp /usr/src/redhat/RPMS/x86_64/blcr* /depot/contrib/1000
# Update the repo DB
kusu-repoman -u -r centos-5.4-x86_64
# Add BLCR packages to the installer group and centos group
kusu-ngedit

# if you don't automatically update the node, then run:
kusu-cfmsync -n 'installer-centos-5.4-x86_64' -p -f
Adding BLCR echkpnt files to the PCM LSF directory
chmod 755 *.blcr
cp echkpnt.blcr $LSF_SERVERDIR
cp erestart.blcr $LSF_SERVERDIR

# copy to the node image CFM directory
mkdir /etc/cfm/compute-centos-5.4-x86_64/$LSF_SERVERDIR -p
# not needed if you RPM install on the headnode: mkdir /etc/cfm/installer-centos-5.4-x86_64/$LSF_SERVERDIR -p

# sync the files
kusu-cfmsync -f
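
The mkdir above only creates the target directory in the CFM tree; presumably the two scripts also have to be placed there before the sync picks them up. The sketch below shows one way to do that, assuming CFM mirrors absolute paths under the node group directory (as the prelink symlinks in the next section suggest). The helper name and its arguments are hypothetical.

```shell
# Hypothetical helper: stage_blcr_scripts SRC_DIR CFM_GROUP_DIR SERVERDIR
# Copies echkpnt.blcr and erestart.blcr from SRC_DIR into
# CFM_GROUP_DIR/SERVERDIR and makes them executable.
stage_blcr_scripts() {
    src_dir=$1
    dest_dir=$2/${3#/}          # strip the leading slash of SERVERDIR
    mkdir -p "$dest_dir" || return 1
    for f in echkpnt.blcr erestart.blcr; do
        cp "$src_dir/$f" "$dest_dir/$f" || return 1
        chmod 755 "$dest_dir/$f" || return 1
    done
}

# Example use (assumes LSF_SERVERDIR is set by the LSF profile):
#   stage_blcr_scripts . /etc/cfm/compute-centos-5.4-x86_64 "$LSF_SERVERDIR"
#   kusu-cfmsync -f
```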
Disable Prelinking on Nodes
Set the following in /etc/sysconfig/prelink:
PRELINKING=no

mkdir /etc/cfm/[node_group]/etc/sysconfig -p

ln -s /etc/sysconfig/prelink /etc/cfm/compute-centos-5.4-x86_64/etc/sysconfig/prelink 
ln -s /etc/sysconfig/prelink /etc/cfm/[any_other_node_groups]/etc/sysconfig/prelink 

kusu-cfmsync -f
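
With several node groups the edit and the symlinks are easy to script. The following is a sketch under the assumption that each node group has its own directory below a common CFM root; both helper names are made up for illustration.

```shell
# Hypothetical helper: set_prelinking_no FILE
# Rewrites an existing PRELINKING= line to PRELINKING=no, or appends one.
set_prelinking_no() {
    if grep -q '^PRELINKING=' "$1"; then
        sed 's/^PRELINKING=.*/PRELINKING=no/' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
    else
        echo 'PRELINKING=no' >> "$1"
    fi
}

# Hypothetical helper: link_prelink_config CFM_ROOT GROUP...
# Creates CFM_ROOT/GROUP/etc/sysconfig/prelink as a symlink to the
# installer's /etc/sysconfig/prelink for every listed node group.
link_prelink_config() {
    cfm_root=$1
    shift
    for group in "$@"; do
        mkdir -p "$cfm_root/$group/etc/sysconfig" || return 1
        ln -sf /etc/sysconfig/prelink "$cfm_root/$group/etc/sysconfig/prelink" || return 1
    done
}

# Example use:
#   set_prelinking_no /etc/sysconfig/prelink
#   link_prelink_config /etc/cfm compute-centos-5.4-x86_64
#   kusu-cfmsync -f
```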
Using LSF with BLCR
# submit a job with checkpointing enabled:
bsub -k "/home/viglen/job/mychkpntdir method=blcr 60"

# checkpoint the job (JOBID as shown by bjobs)
bchkpnt JOBID
# checkpoint the job and kill it once the checkpoint succeeds
bchkpnt -k JOBID

# restart the job
brestart /home/viglen/job/mychkpntdir JOBID
# if that fails, force the restart with -f
brestart -f /home/viglen/job/mychkpntdir JOBID

# migrate a job
bmig -m node_to_move_job_to JOBID
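
When scripting these steps, the JOBID can be captured from the bsub confirmation line instead of being read from bjobs by hand. The sketch below assumes the usual confirmation format "Job <1234> is submitted to ..."; the helper name and the job script in the usage comment are placeholders.

```shell
# Hypothetical helper: reads bsub output on stdin and prints the numeric job ID
# from a line such as "Job <1234> is submitted to default queue <normal>."
job_id_from_bsub() {
    sed -n 's/^Job <\([0-9][0-9]*\)> is submitted.*/\1/p'
}

# Example use (./myjob.sh is a placeholder job script):
#   JOBID=$(bsub -k "/home/viglen/job/mychkpntdir method=blcr 60" ./myjob.sh | job_id_from_bsub)
#   bchkpnt "$JOBID"
#   brestart /home/viglen/job/mychkpntdir "$JOBID"
```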
Debugging it
Test BLCR on individual nodes:
cr_run app
cr_checkpoint [--kill] [app_ps_id]
cr_restart context.ps_id

# BLCR didn't work on multi-threaded apps, so force a single thread:
export OMP_NUM_THREADS=1
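
The three commands above can be chained into a single-node smoke test. This is a rough sketch, not a tested procedure: it assumes cr_run execs the application so that the background PID is the application's PID, that cr_checkpoint writes its context file as context.PID in the current directory, and that a two-second delay is enough for the application to start. Both function names are hypothetical.

```shell
# Hypothetical helper: name of the context file expected for a given PID.
context_file_for() {
    echo "context.$1"
}

# Hypothetical smoke test: blcr_smoke_test APP [ARGS...]
# Returns 2 if the BLCR tools are not installed on this node.
blcr_smoke_test() {
    command -v cr_run >/dev/null 2>&1 || { echo "BLCR tools not found" >&2; return 2; }
    OMP_NUM_THREADS=1 cr_run "$@" &     # single thread, as noted above
    app_pid=$!
    sleep 2                             # crude wait for the app to start
    cr_checkpoint --kill "$app_pid" || return 1
    cr_restart "$(context_file_for "$app_pid")"
}

# Example use (./app is a placeholder for a long-running test program):
#   blcr_smoke_test ./app
```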
# ###################################
# spec diff on the multilib spec file
# ###################################
[root@master SPECS]# diff blcr-0.8.2.orig blcr-0.8.2.spec
55,65d54
< # Are we building both 32- and 64-bit libcr?
< %define build_libdir32 0
< %ifarch x86_64 ppc64
<   %define build_libdir32 %{is_enabled 1 multilib}
< %endif
<
< # Where to put 32-bit libs on a 64-bit platform
< %if %{build_libdir32}
<   %define libdir32 %(echo %{_libdir} | sed -e s/lib64/lib/)
< %endif
<
125d113
<       %{?libdir32:--enable-multilib} \
224,231d211
< %if %{build_shared} && %{build_libdir32}
< %{libdir32}/libcr.so.0
< %{libdir32}/libcr.so.0.5.2
< %{libdir32}/libcr_run.so.0
< %{libdir32}/libcr_run.so.0.5.2
< %{libdir32}/libcr_omit.so.0
< %{libdir32}/libcr_omit.so.0.5.2
< %endif
262,266d241
< %if %{build_libdir32}
< %{libdir32}/libcr.la
< %{libdir32}/libcr_run.la
< %{libdir32}/libcr_omit.la
< %endif
273,277d247
< %if %{build_shared} && %{build_libdir32}
< %{libdir32}/libcr.so
< %{libdir32}/libcr_run.so
< %{libdir32}/libcr_omit.so
< %endif
284,288d253
< %if %{build_static} && %{build_libdir32}
< %{libdir32}/libcr.a
< %{libdir32}/libcr_run.a
< %{libdir32}/libcr_omit.a
< %endif