BLCR Integration
Latest revision as of 11:53, 7 December 2012
PDD Link to Files: <file>\\srv-vfs2\PDD_DATA\Product Development\High Performance Computing\HPC Software Information\Platform\BLCR Integration|BLCR Integration</file>
BLCR 0.8.2 patch for 2.6.18-238.el5 Kernels
Note: Ref: https://upc-bugs.lbl.gov/bugzilla/show_bug.cgi?id=2990
- Compilation error for BLCR 0.8.2 on 2.6.18-238.el5 kernel (CentOS 5.6):
.
CC [M] /usr/src/redhat/BUILD/blcr-0.8.2/builddir/cr_module/kbuild/cr_watchdog.o
/usr/src/redhat/BUILD/blcr-0.8.2/builddir/cr_module/kbuild/cr_watchdog.c:44: warning: type defaults to 'int' in declaration of 'DECLARE_DELAYED_WORK'
/usr/src/redhat/BUILD/blcr-0.8.2/builddir/cr_module/kbuild/cr_watchdog.c:44: warning: parameter names (without types) in function declaration
/usr/src/redhat/BUILD/blcr-0.8.2/builddir/cr_module/kbuild/cr_watchdog.c: In function 'cr_wd_run':
/usr/src/redhat/BUILD/blcr-0.8.2/builddir/cr_module/kbuild/cr_watchdog.c:60: error: 'cr_wd_work' undeclared (first use in this function)
.
.
/usr/src/redhat/BUILD/blcr-0.8.2/builddir/cr_module/kbuild/cr_watchdog.c:93: error: 'cr_wd_work' undeclared (first use in this function)
gmake[5]: *** [/usr/src/redhat/BUILD/blcr-0.8.2/builddir/cr_module/kbuild/cr_watchdog.o] Error 1
gmake[4]: *** [_module_/usr/src/redhat/BUILD/blcr-0.8.2/builddir/cr_module/kbuild] Error 2
- The following patch is needed for BLCR 0.8.2 to compile against the 2.6.18-238.el5 kernel:
File: cr_watchdog.c (line 35):
--- cr_module/cr_watchdog.c 21 Aug 2008 05:22:49 -0000 1.7
+++ cr_module/cr_watchdog.c 17 Feb 2011 19:42:05 -0000
@@ -32,7 +32,7 @@
 // Interval for the watchdog to wakeup HZ = 1s
 #define CR_WD_INTERVAL (HZ)

-#if HAVE_STRUCT_DELAYED_WORK
+#if HAVE_STRUCT_DELAYED_WORK && defined(DECLARE_DELAYED_WORK)
 typedef struct work_struct * cr_work_arg_t;
 #define CR_DECLARE_DELAYED_WORK DECLARE_DELAYED_WORK
 #else /* struct work_struct */
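The patch matters only when the kernel headers provide struct delayed_work but not the DECLARE_DELAYED_WORK macro, which is what the compile error above suggests for this kernel. A minimal sketch for checking a header before building is shown here; has_delayed_work_macro is a hypothetical helper (not part of BLCR), and the header path is an assumption about where kernel-devel puts the workqueue header.

```shell
#!/bin/sh
# Sketch: report whether a kernel header defines DECLARE_DELAYED_WORK.
# has_delayed_work_macro is a hypothetical helper, not part of BLCR.
has_delayed_work_macro() {
    header=$1
    # Match "#define DECLARE_DELAYED_WORK(" allowing whitespace around '#'.
    grep -Eq '^[[:space:]]*#[[:space:]]*define[[:space:]]+DECLARE_DELAYED_WORK\(' "$header"
}

# Assumed location of the headers for the running kernel (kernel-devel).
HEADER="/lib/modules/$(uname -r)/build/include/linux/workqueue.h"

if [ -r "$HEADER" ]; then
    if has_delayed_work_macro "$HEADER"; then
        echo "DECLARE_DELAYED_WORK is defined in $HEADER"
    else
        echo "DECLARE_DELAYED_WORK is missing in $HEADER; the patched #if is needed"
    fi
else
    echo "kernel headers not found at $HEADER"
fi
```

If the macro is missing, the patched condition evaluates to false and the build uses the plain work_struct branch instead.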
Install kernel source on Compilation Nodes
kusu-ngedit
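(added kernel-devel and kernel-headers)
Install BLCR from source
tar zxvf blcr-0.8.2.tar.gz
cd blcr-0.8.2
# the original note read "make --enable-test"; make has no such option, so this is presumably a configure flag (unverified, check ./configure --help)
./configure --enable-test
make
make insmod
make check
Install BLCR from SRC RPM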
NOTE: I had to edit the .spec file to remove all mentions of multilib, libdir32, etc.
rpm -Uvh blcr-0.8.2-1.src.rpm
cd /usr/src/redhat/SPECS
# edit the spec file to remove the multilib sections; see the spec diff at the end of this page
# remove all references to: libdir32
rpmbuild -bb blcr-0.8.2.spec
Add RPMs to Platform Repo
# Copy the RPMs to the contrib dir
cp /usr/src/redhat/RPMS/x86_64/blcr* /depot/contrib/1000
# Update the repo DB
kusu-repoman -u -r centos-5.4-x86_64
# Add BLCR packages to the installer group and centos group
kusu-ngedit
# if the node is not updated automatically, then run:
kusu-cfmsync -n 'installer-centos-5.4-x86_64' -p -f
Adding BLCR echkpnt files to the PCM LSF directory
chmod 755 *.blcr
cp echkpnt.blcr $LSF_SERVERDIR
cp erestart.blcr $LSF_SERVERDIR
# copy to the node image CFM directory
mkdir /etc/cfm/compute-centos-5.4-x86_64/$LSF_SERVERDIR -p
# not needed if you RPM install on the headnode: mkdir /etc/cfm/installer-centos-5.4-x86_64/$LSF_SERVERDIR -p
# sync the files
kusu-cfmsync -f
Disable Prelinking on Nodes
Set the following in /etc/sysconfig/prelink:
PRELINKING=no
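The setting above can also be applied without opening an editor. This is a minimal sketch, assuming the file uses plain KEY=value lines; set_prelinking_no is a hypothetical helper name, and it keeps a .bak copy of the file when it rewrites an existing entry.

```shell
#!/bin/sh
# Sketch: force PRELINKING=no in a sysconfig-style file.
# set_prelinking_no is a hypothetical helper, not a Kusu or BLCR tool.
set_prelinking_no() {
    file=$1
    if grep -q '^PRELINKING=' "$file"; then
        # Rewrite the existing assignment; the original is kept as FILE.bak.
        sed -i.bak 's/^PRELINKING=.*/PRELINKING=no/' "$file"
    else
        # No assignment present yet, so append one.
        echo 'PRELINKING=no' >> "$file"
    fi
}

# Usage (as root on the head node):
#   set_prelinking_no /etc/sysconfig/prelink
```

Other settings in the file are left untouched, so the result can be linked into the CFM directories as described next.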
mkdir /etc/cfm/[node_group]/etc/sysconfig -p
ln -s /etc/sysconfig/prelink /etc/cfm/compute-centos-5.4-x86_64/etc/sysconfig/prelink
ln -s /etc/sysconfig/prelink /etc/cfm/[any_other_node_groups]/etc/sysconfig/prelink
kusu-cfmsync -f
Using LSF with BLCR
# submit a job with checkpointing enabled:
bsub -k "/home/viglen/job/mychkpntdir method=blcr 60"
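The quoted -k argument packs the checkpoint directory, the method and the checkpoint period into one string. A small sketch that builds the string in the same field order as the command above, so the values can be changed in one place, is given here; chkpnt_spec is a hypothetical helper, and ./my_app in the usage comment is a placeholder job command that does not appear in the original notes.

```shell
#!/bin/sh
# Sketch: build the argument string for "bsub -k".
# chkpnt_spec is a hypothetical helper; the field order mirrors the
# bsub example on this page (directory, method, period).
chkpnt_spec() {
    dir=$1
    period=$2
    method=${3:-blcr}   # default to the blcr method used on this page
    printf '%s method=%s %s\n' "$dir" "$method" "$period"
}

# Usage (./my_app is a placeholder for the real job command):
#   bsub -k "$(chkpnt_spec /home/viglen/job/mychkpntdir 60)" ./my_app
```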
# checkpoint the job
bchkpnt JOBID      # JOBID as shown in the bjobs output
bchkpnt -k JOBID   # kill the job once the checkpoint succeeds
# restart the job
brestart /home/viglen/job/mychkpntdir JOBID
# or, if that fails, force the restart with -f
brestart -f /home/viglen/job/mychkpntdir JOBID
# migrate a job
bmig -m node_to_move_job_to JOBID
Debugging it
Test blcr on individual nodes:
cr_run app
cr_checkpoint [--kill] [app_ps_id]
cr_restart context.ps_id
export OMP_NUM_THREADS=1   # BLCR did not work on multi-threaded apps
# ###################################
# spec diff on the multilib spec file
# ###################################
[root@master SPECS]# diff blcr-0.8.2.orig blcr-0.8.2.spec
55,65d54
< # Are we building both 32- and 64-bit libcr?
< %define build_libdir32 0
< %ifarch x86_64 ppc64
< %define build_libdir32 %{is_enabled 1 multilib}
< %endif
<
< # Where to put 32-bit libs on a 64-bit platform
< %if %{build_libdir32}
< %define libdir32 %(echo %{_libdir} | sed -e s/lib64/lib/)
< %endif
<
125d113
< %{?libdir32:--enable-multilib} \
224,231d211
< %if %{build_shared} && %{build_libdir32}
< %{libdir32}/libcr.so.0
< %{libdir32}/libcr.so.0.5.2
< %{libdir32}/libcr_run.so.0
< %{libdir32}/libcr_run.so.0.5.2
< %{libdir32}/libcr_omit.so.0
< %{libdir32}/libcr_omit.so.0.5.2
< %endif
262,266d241
< %if %{build_libdir32}
< %{libdir32}/libcr.la
< %{libdir32}/libcr_run.la
< %{libdir32}/libcr_omit.la
< %endif
273,277d247
< %if %{build_shared} && %{build_libdir32}
< %{libdir32}/libcr.so
< %{libdir32}/libcr_run.so
< %{libdir32}/libcr_omit.so
< %endif
284,288d253
< %if %{build_static} && %{build_libdir32}
< %{libdir32}/libcr.a
< %{libdir32}/libcr_run.a
< %{libdir32}/libcr_omit.a
< %endif
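
After editing, it can help to confirm that nothing multilib-related is left in the spec file before running rpmbuild. A minimal sketch, assuming that any remaining line mentioning libdir32 or multilib is unwanted; count_multilib_refs is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count spec-file lines that still mention libdir32 or multilib.
# count_multilib_refs is a hypothetical helper, not part of rpm or BLCR.
count_multilib_refs() {
    # grep -c prints the number of matching lines; "|| true" hides the
    # non-zero exit status grep returns when the count is 0.
    grep -cE 'libdir32|multilib' "$1" || true
}

# Usage:
#   count_multilib_refs /usr/src/redhat/SPECS/blcr-0.8.2.spec
# A result of 0 means the edited spec has no remaining references.
```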