ATTENDING

  • Yunes, Nicolas <nicolas.yunes@montana.edu>
  • Wright, Michael <mwright@montana.edu>
  • Jerry Sheehan, Josh Turner, Pol Llovet, Aurelien Mazurie, Jonathan Hilmer, Thomas Heetderks

ABSENT

  • Johnson, Erick <erick.johnson@montana.edu>
  • Owkes, Mark <mark.owkes@montana.edu>
  • Young, Mark <myoung@montana.edu>
  • Lawrence, Martin <lawrence@montana.edu>
  • Sheppard, John <john.sheppard@montana.edu>
  • Dlakic, Mensur <mdlakic@montana.edu>
  • Rossmann, Doralyn <doralyn@montana.edu>

MINUTES: HPCAG Meeting #8 – (meeting presentation)

  1. Welcome - Jerry
  2. Technical Issues - Pol
    • Lustre File System Issues
      • cause unknown - may be related to NTP time drift
      • 2 of 8 servers failed
      • drives died - but NO data loss
      • can probably consider it solved
      • we have implemented SPLUNK and NetData for better future diagnostics
    • Orphaned Jobs
      • cause unknown - may be related to version of SLURM we are running
      • SLURM not always shutting jobs down properly
      • unneeded jobs are left to consume resources
      • setting CPUs to run in high-performance mode seems to resolve the issue
      • mainly one user (lab group) affected
  3. CFAC classroom node acquisition - Pol
    • we will acquire 8 compute nodes
    • classroom nodes will be on a separate queue, so classroom use is kept separate
    • this will give us 73 total nodes
  4. Increased Queue Wait Time - Pol
    • With increased cluster usage, wait time has increased
    • Proposed mitigation: move some appropriate users to XSEDE
      • discussion: Yunes' lab has some students that may be good candidates
      • discussion: can we run Mathematica on XSEDE?
      • not sure...
      • MATLAB should work - but needs testing
      • MATH dept might have some new users that would be good candidates
      • check with Young Lab for possible candidates (SDSC/TACC/IU)
  5. Faculty Candidate Interviews - Pol & Aurelien
    • some were pleasantly surprised about HPC resource availability
    • some thought we would have research storage
      • Jerry: hopefully we will soon - storage RFP should be posted in a couple of weeks
  6. CONDA - Aurelien & Pol
    • not space efficient - but otherwise great
    • has dramatically simplified scientific software configuration
    • Aurelien: it solved major problems for a researcher I worked with
    • may be worth bringing up again in the future - with more people
  7. ALCES FLIGHT - Jerry & Pol
    • AWS spot pricing is good
    • plug-n-play module management is nice
    • good for compute problems of fixed and limited time (not long-running jobs)

ACTIONS

FUTURE AGENDA

  • ...