ATTENDING

  • Jerry Sheehan, Josh Turner, Pol Llovet, Aurelien Mazurie, Jonathan Hilmer, Thomas Heetderks

ABSENT

MINUTES: HPCAG Meeting #5 – (meeting presentation)

  1. Welcome - Jerry
  2. Hyalite Expansion (June 2016) - Pol
    • 16 new nodes installed in June
    • NEW Hyalite overview:
      • 60 nodes (Xeon: 36 Sandy Bridge, 24 Haswell)
      • 16 cores per node for a total of 960 cores (1920 HT)
      • 4 GB RAM per core
      • 620 TB of Lustre scratch storage
      • 10 GbE fabric w/ RDMA
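As a quick sanity check on the overview figures above, the totals can be recomputed from the per-node numbers. This is only a minimal sketch: the variable names are illustrative, and the per-node and cluster-wide RAM values are derived here rather than stated in the presentation.

```python
# Figures taken from the Hyalite overview
nodes = 60
cores_per_node = 16
ram_per_core_gb = 4

total_cores = nodes * cores_per_node                 # matches the stated 960 cores
total_threads = total_cores * 2                      # matches the stated 1920 HT (2 threads per core)
ram_per_node_gb = cores_per_node * ram_per_core_gb   # derived, not stated in the minutes
total_ram_gb = nodes * ram_per_node_gb               # derived, not stated in the minutes

print(total_cores, total_threads, ram_per_node_gb, total_ram_gb)
```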
  3. Hyalite Maintenance (September 2016) - Pol
    • Lustre updated to 2.5.42
    • 10GbE network drivers updated
    • Migrated IPMI (management network) to new network
    • RobinHood installed and initialized (for Lustre management)
    • RDMA drivers installed
    • discussion: will the RDMA work with our existing compilers?
      • Pol: not sure
  4. Hyalite Usage / XDMoD Stats - Jonathan
    • Who is using the most CPU time
    • Who is running the longest jobs
    • Who is waiting the most
      • discussion: what happens when jobs go over time?
        • Pol: they are killed... so users should err on the side of caution and estimate long
    • Who waits the most per job
    • Waiting Hours vs Job Length
    • History of CPU Hours
    • Details on wait time
    • Waiting time, everyone, August
      • Pol: fair-share algorithm accumulates historical wait time and job data to try to make things fair over time
      • discussion: does the fairness algorithm affect only users, or also groups?
        • Pol: I think NO, but I am not sure... Sean says it does. As we accumulate more data, we can get a better idea of how this works
        • Pol: it may make sense to weight "fairness" below the priority queue...? Right now they are the same weight
        • Jerry: initially, we wanted to keep things as simple as possible -- we still want to keep changes simple, relying heavily on your feedback as we consider adjustments
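To illustrate Pol's point about weighting, here is a rough sketch of how a weighted-sum priority scheme behaves. It assumes the scheduler combines normalized factors (each between 0 and 1) multiplied by configurable weights, and that the "priority queue" corresponds to a partition factor. The two jobs, their factor values, and the weights are all hypothetical and are not taken from Hyalite's configuration.

```python
def job_priority(factors, weights):
    """Weighted sum of priority factors; each factor is expected in [0.0, 1.0]."""
    return sum(weights[name] * factors[name] for name in weights)

# Hypothetical jobs:
# job_a: heavy recent user (low fair-share factor) in a high-priority queue
# job_b: light recent user (high fair-share factor) in a normal queue
job_a = {"fairshare": 0.25, "partition": 1.0}
job_b = {"fairshare": 1.0, "partition": 0.5}

equal_weights = {"fairshare": 1000, "partition": 1000}   # same weight for both factors
queue_first = {"fairshare": 500, "partition": 1000}      # fairness weighted below the queue

for label, weights in (("equal", equal_weights), ("queue first", queue_first)):
    a = job_priority(job_a, weights)
    b = job_priority(job_b, weights)
    winner = "job_a" if a > b else "job_b"
    print(f"{label}: job_a={a}, job_b={b}, runs first: {winner}")
```

With equal weights the light user's job wins; once fairness is weighted lower, the job in the high-priority queue wins instead, which is the trade-off under discussion.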
    • Who waits for short jobs and why
      • discussion: I've been running jobs elsewhere (XSEDE resources) -- do we encourage others to do the same -- I see Hyalite as a test platform
      • Jerry: we will have more opportunity now to use Pol to work with users, now that we have Jonathan
      • Jerry: we have time in this group, as we move forward, to better define Hyalite's use case
      • Pol: things Hyalite is good at?... long jobs (too long for XSEDE)... other things?
      • Jerry: on the other hand, as Ben Poulter once said-- do we users, by our use, encourage bad user behavior?
  5. MATLAB Campus License Update - Pol
    • MATLAB Total Academic Headcount License
      • Faculty, Staff, or Student
      • Any machine (Home or On-Campus)
    • see UIT MATLAB Help Page about local Installation
    • Hyalite is running MATLAB version R2016a
      • Pol: do we need older versions than R2016a on Hyalite?
        • discussion: no
    • MATLAB HPC Mentors Monthly Meeting
      • Jerry: we were able to secure the funding for this license, but we were NOT able to secure funding for MATLAB support from CFAC -- so how do we support users?
      • discussion: this support should not be on UIT
      • Jerry: we could see if our advanced users would mind answering questions from other users...
      • Pol/Mike: maybe next year we'll get the CFAC funding for this support... when we've had time to demonstrate its need
      • Pol: Mathworks provides MATLAB MENTOR support through sysadmins (us), so we could take questions to them on a monthly basis
  6. Computational Chemistry Class (CHMY591) - Jonathan
    • Professor: Robert Szilagyi
    • Students: cap of 15, currently 4
    • Hyalite usage plan:
      • Students start with jobs on their own systems
      • Learn software, shared-user systems
      • Gradually move jobs to the cluster
      • By end of semester, submitting very long jobs
    • Software
      • TINKER
      • MOPAC
      • DFTB+
      • Gaussian09
      • Tcl shell
    • Estimates (rough Hilmer calculations)
      • Averaged: 25-50% of a single node’s capacity, 24/7 for a semester
      • Heavily imbalanced: weighted towards end of semester
      • Very long jobs: up to one week each (single core jobs)
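The averaged estimate above can be turned into core-hours for a rough sense of scale. This sketch assumes a 16-week semester, which is not stated in the minutes; the node and core counts come from the Hyalite overview in item 2, and the function name is illustrative.

```python
CORES_PER_NODE = 16    # from the Hyalite overview
CLUSTER_NODES = 60     # from the Hyalite overview
SEMESTER_WEEKS = 16    # assumption: semester length is not given in the minutes
HOURS_PER_WEEK = 24 * 7

def semester_core_hours(node_fraction):
    """Core-hours used when `node_fraction` of one node is busy around the clock."""
    return node_fraction * CORES_PER_NODE * HOURS_PER_WEEK * SEMESTER_WEEKS

low = semester_core_hours(0.25)    # 25% of one node
high = semester_core_hours(0.50)   # 50% of one node

# Share of the whole cluster that the averaged load represents
low_share = 0.25 / CLUSTER_NODES
high_share = 0.50 / CLUSTER_NODES

print(f"{low:.0f} to {high:.0f} core-hours per semester")
print(f"{low_share:.2%} to {high_share:.2%} of total cluster capacity")
```

Under these assumptions the averaged load is under 1% of the cluster, so the concern is the end-of-semester peak rather than the average.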
    • Jerry: [per Ben Poulter] this is a problem if this kind of use affects existing research jobs
    • Jerry: but this could be a great opportunity-- a unique story for classroom usage of Hyalite
    • discussion: a new CS hire (David Millman), whose specialty is HPC, would like to teach HPC... does he use Hyalite? or XSEDE? or AWS?
    • Jerry: at this point, AWS is not an option because of MSU's legal position on AWS indemnity
    • discussion: I would recommend a hybrid approach-- special queue with limited nodes for classroom use
    • discussion: would be great if we could acquire nodes dedicated to classroom use with CFAC $
    • discussion: I think it's a good thing -- teaching appropriate HPC usage is good... a class must teach correct behavior

ACTIONS

  • Need to post documentation on RDMA programmatic usage on Hyalite web pages
  • Need to post listing of Hyalite installed software/modules (with version numbers) on Hyalite web pages
  • Need to research Slurm job scheduling behavior as it relates to our researchers (for Slurm configuration adjustments)

FUTURE AGENDA

  • Data Challenge (Data Science Competition)
  • CyberCANOE / SAGE2
  • Hyalite Communication & Publicity