Monday, November 19, 2018

TERM_SWAP: job killed after reaching LSF swap usage limit.

If your job was killed with this error, it exceeded the swap usage limit set for the queue.

The bsub -v option sets this limit for a job:

   -v swap_limit
               Set the total process virtual memory limit to
               swap_limit for the whole job. The default is
               no limit. Exceeding the limit causes the job to
               terminate.

               By default, the limit is specified in KB. Use
               LSF_UNIT_FOR_LIMITS in lsf.conf to specify a
               larger unit for the limit (MB, GB, TB, PB, or EB).
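
Since the default unit is KB, a limit you think of in GB has to be converted first. Here is a minimal sketch, assuming LSF_UNIT_FOR_LIMITS is not set on your cluster. The 16 GB value and the job script ./my_job.sh are made up for illustration, and the command is only printed with echo, not submitted:

```shell
# Assumption: LSF_UNIT_FOR_LIMITS is unset, so bsub -v expects KB.
swap_gb=16                              # desired limit in GB (example value)
swap_kb=$(( swap_gb * 1024 * 1024 ))    # 1 GB = 1024 * 1024 KB
# ./my_job.sh is a hypothetical job script; "big" is a queue from our cluster.
# echo prints the command instead of submitting it; drop "echo" to submit.
echo bsub -q big -v "$swap_kb" ./my_job.sh
```

This prints: bsub -q big -v 16777216 ./my_job.sh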


Without administrator permission, you cannot change lsf.conf.

So as a user, what you can do is switch to a different queue with a larger swap limit. Here is the command to check the limit for a queue, e.g.:

$ bqueues -l normal | grep -A 1 SWAP
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M       8 G      12 G
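
If you only want the swap value itself, something like the sketch below may help. The function name swap_limit is my own, and it assumes the layout shown in this post: a header line containing SWAPLIMIT, followed by a line where every column is a "number unit" pair. The sample input is the output for the normal queue; on a cluster you would pipe in bqueues -l normal instead.

```shell
# Print the SWAPLIMIT value from `bqueues -l` output read on stdin.
# Assumes each header column maps to a "number unit" pair on the next line.
swap_limit() {
    awk '
        /SWAPLIMIT/ {
            # remember which header column is SWAPLIMIT
            for (i = 1; i <= NF; i++) if ($i == "SWAPLIMIT") col = i
            next_line = NR + 1
            next
        }
        NR == next_line {
            # column N occupies fields 2N-1 (number) and 2N (unit)
            print $(2 * col - 1), $(2 * col)
        }
    '
}

# Real usage would be: bqueues -l normal | swap_limit
swap_limit <<'EOF'
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M       8 G      12 G
EOF
```

For the sample above this prints 12 G. It also works when the CORELIMIT column is missing, as for the interact queue.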


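Here is what I got from our cluster. A queue name followed by no limit lines (e.g. pcpgm) means grep found no SWAP line in its bqueues -l output, which presumably means no swap limit is set on that queue: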

$ for i in $(bqueues | grep -v QUEUE_NAME | cut -f1 -d' '); do echo "$i"; bqueues -l "$i" | grep -A 1 SWAP; done
interact
 MEMLIMIT SWAPLIMIT
     39 G      47 G 
interact-big
 MEMLIMIT SWAPLIMIT
    488 G     586 G 
gpu
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M      23 G      29 G 
elephant
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M       2 T       2 T 
big-multi
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M     498 G     500 G 
big
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M     498 G     500 G 
normal
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M       8 G      12 G 
medium
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M       9 G      12 G 
vshort
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M       8 G      10 G 
short
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M       8 G      10 G 
long
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M       9 G      12 G 
vlong
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M       8 G      12 G 
mpi-short
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M       8 G      10 G 
mpi-ib
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M     248 G     254 G 
aaa-chunk
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M     117 G     125 G 
pcpgmwgs
matlab
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M     250 G     250 G 
matlabdce
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M     250 G     250 G 
matlabdce-short
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M       4 G     250 G 
defaultlow
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M       8 G      12 G 
pcpgm
filemove
 CORELIMIT MEMLIMIT SWAPLIMIT
      0 M      12 G      12 G 
rerunnable
rerunnable-big
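
To pick a queue, it can help to compare the limits in one unit. Below is a rough sketch, not a polished tool. The helper to_mb and the 100 GB requirement are made up for illustration, and it assumes whole numbers and binary multiples (1 G = 1024 M). The input lines are a few queues copied by hand from the listing above.

```shell
# Convert "<number> <unit>" as printed by bqueues -l into MB.
# Assumes whole numbers and binary multiples (1 G = 1024 M).
to_mb() {
    case "$2" in
        M) echo "$1" ;;
        G) echo $(( $1 * 1024 )) ;;
        T) echo $(( $1 * 1024 * 1024 )) ;;
        *) echo "unknown unit: $2" >&2; return 1 ;;
    esac
}

# List queues whose swap limit is at least need_gb GB.
need_gb=100                        # example requirement
need_mb=$(to_mb "$need_gb" G)

# Input format: "queue number unit", one queue per line.
while read -r queue value unit; do
    limit_mb=$(to_mb "$value" "$unit") || continue
    if [ "$limit_mb" -ge "$need_mb" ]; then
        echo "$queue"
    fi
done <<'EOF'
normal 12 G
gpu 29 G
mpi-ib 254 G
big 500 G
elephant 2 T
EOF
```

With these sample lines it prints mpi-ib, big and elephant, one per line.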

