Making the Most of Your Allocation

The choice of hardware components and software configuration drives the time to solution and scalability of any application. The variety of codes that run on the DoD High Performance Computing Modernization Program (HPCMP) machines and the differences in platforms force researchers into a classic economics exercise: how to allocate limited HPC resources to achieve the most in the least amount of time.

Our results below focus on the following codes: ADCIRC, ALEGRA, Air Vehicles Unstructured Solver (AVUS), CTH, General Atomic and Molecular Electronic Structure System (GAMESS), HYbrid Coordinate Ocean Model (HYCOM), Improved Concurrent Electromagnetic Particle In Cell (ICEPIC), and Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS).

Benchmarking results can generally serve two purposes. First, at a fixed CPU count, relative machine performance can be judged on a per-application basis; this is normally done at what is termed a distinguished processor count. Second, the times for a particular test case on a particular machine at varying CPU counts indicate how well the application scales, especially when the same runs are made across widely disparate types of machines. The timings will certainly differ because of variations in OS, compilers, interconnects, memory layout, etc. However, if the runs are carried out to a large enough number of CPUs, a pattern should emerge that gives a sense of how well the application scales.

Here we focus exclusively on the former, hoping to help steer applications to their most appropriate computational platforms; application scalability is not considered. In the matrix summary table below, we use a red-to-green scale to denote each application's relative performance on a machine-by-machine basis, with red marking the poorest relative performance and green the best. The number beneath each test case name is the number of cores at which the performance was recorded.

Matrix Summary of all Application Runtimes (relative performance)
Architecture    ADCIRC     ADCIRC     ALEGRA     ALEGRA     AVUS       AVUS       CTH        CTH        GAMESS     GAMESS     GAMESS     HYCOM      ICEPIC     ICEPIC     LAMMPS
(test case)     Baroclinic Hurricane  ObliqueImp ExplWire   Waverider  Turret-td  Fixed-grid AMR        DFT-grad   MP2-grad   CC-energy  Lrg        Magnetron  Gyrotron   Au
(cores)         1024       512        1536       256        1024       1280       1280       1280       256        512        1024       1353       384        2048       1024
ERDC Diamond    93         100        100        100        100        100        94         100        83         81         92         96         95         100        100
MHPCC Mana      na         74         na         na         97         93         100        92         79         98         100        100        100        93         80
NAVY DaVinci    65         49         27         37         62         24         na         64         100        68         85         42         30         na         90
NAVY Einstein   na         na         50         40         52         36         71         52         40         88         78         65         45         37         41
ARL MRAP        na         na         49         40         52         36         70         50         40         88         78         65         46         36         41
NAVY Einstein2  87         15         52         32         56         41         na         50         45         100        88         68         50         43         49
ERDC Garnet     100        63         58         60         65         49         88         67         53         70         88         84         51         43         50

For more information on how these values are normalized, see Normalization Calculation for the Performance Matrix below.

Color scale bins, from red (poor) to green (best): 0-19, 20-39, 40-59, 60-79, 80-100

Matrix Summary of all Application Runtimes (duration in seconds)
Architecture    ADCIRC     ADCIRC     ALEGRA     ALEGRA     AVUS       AVUS       CTH        CTH        GAMESS     GAMESS     GAMESS     HYCOM      ICEPIC     ICEPIC     LAMMPS
(test case)     Baroclinic Hurricane  ObliqueImp ExplWire   Waverider  Turret-td  Fixed-grid AMR        DFT-grad   MP2-grad   CC-energy  Lrg        Magnetron  Gyrotron   Au
(cores)         1024       512        1536       256        1024       1280       1280       1280       256        512        1024       1353       384        2048       1024
ERDC Diamond    8959       3347       1640       944        941        1332       3399       2535       4701       2536       3658       3020       2559       3639       3182
MHPCC Mana      na         4504       na         na         967        1437       3186       2768       4929       2108       3354       2893       2443       3911       3993
NAVY DaVinci    12720      6836       6048       2556       1509       5535       na         3975       3911       3042       3924       6817       8144       na         3536
NAVY Einstein   na         na         3273       2348       1815       3699       4479       4912       9661       2342       4320       4443       5370       9925       7771
ARL MRAP        na         na         3343       2364       1815       3699       4542       5055       9661       2342       4320       4443       5328       10054      7771
NAVY Einstein2  9570       22506      3144       2969       1667       3245       na         5068       8635       2061       3801       4252       4897       8452       6444
ERDC Garnet     8308       5349       2806       1564       1445       2737       3603       3760       7346       2931       3807       3450       4819       8426       6343

Architectures Used in Study
DSRC   Name       Make  Model           Chip Set           Processor Speed (GHz)  Interconnect     Number of Cores  Cores per Node  Operating System
ERDC   Diamond    SGI   Altix ICE       Intel Xeon QC      2.8                    DDR4 InfiniBand  15,360           8               SUSE Linux
MHPCC  Mana       Dell  PowerEdge M610  Intel Xeon QC      2.8                    DDR InfiniBand   9,216            8               Linux
NAVY   DaVinci    IBM   Power6          IBM P6 DC          4.7                    InfiniBand       4,800            32              AIX
NAVY   Einstein   Cray  XT5             Cray Opteron QC    2.3                    SeaStar2+        12,736           8               CNL
ARL    MRAP       Cray  XT5             Cray Opteron QC    2.6                    SeaStar          10,400           8               CNL
NAVY   Einstein2  Cray  XT5             Cray Opteron QC    2.4                    SeaStar2+        12,736           8               CNL
ERDC   Garnet     Cray  XE6             AMD Opteron 64-bit 2.4                    Cray Gemini      20,224           16              CLE

ADCIRC

Code Description

Obtained from the University of North Carolina Institute of Marine Sciences, ADCIRC is a coastal circulation and storm surge model. It solves time-dependent, free-surface circulation and transport problems in two and three dimensions. It uses the finite element method (FEM) in space, permitting highly flexible, unstructured grids. Typical ADCIRC uses have included modeling tides and wind-driven circulation, the analysis of hurricane storm surge and flooding, determining dredging feasibility and material disposal studies, larval transport studies, and near-shore marine operations. ADCIRC solves the equations of motion for a moving fluid on a rotating earth. These equations are formulated using the traditional hydrostatic pressure and Boussinesq approximations and have been discretized in space using the finite element (FE) method and in time using the finite difference (FD) method.

ADCIRC can be run either as a two-dimensional depth integrated (2DDI) model or as a three-dimensional (3-D) model. In either case, elevation is obtained from the solution of the depth-integrated continuity equation in Generalized Wave-Continuity Equation (GWCE) form. Velocity is obtained from the solution of either the 2DDI or 3-D momentum equations. All of the nonlinear terms have been carefully retained in all of these equations.

ADCIRC can be run using either a Cartesian or a spherical coordinate system. It includes a least squares analysis routine that computes harmonic constituents for elevation and depth-averaged velocity during the course of the run, thereby avoiding the need to write out long time series for postprocessing.

ADCIRC has been optimized by unrolling loops for enhanced performance on multiple computer architectures. It includes MPI library calls to allow it to operate at high efficiency, typically better than 90 percent, on parallel computer architectures.

ADCIRC Baroclinic

This test case simulates the dynamics of the Turkish Straits System, including the Northeastern Aegean Sea, the Marmara Sea, and the southwest Black Sea. These seas are connected by the Dardanelles and Bosporus straits, and the salinity and density differences between the seas drive a two-layer flow system in both straits. The case is composed of 310,435 nodes and 605,099 elements, and the finest model resolution is 20 m. Although this case is rather I/O intensive, it scales out to approximately 2K cores.

ADCIRC Hurricane

This ADCIRC case is a Gulf of Mexico hurricane surge simulation. It represents a typical ADCIRC application that the U.S. Army Corps of Engineers uses when doing levee designs or flood plain mappings for FEMA. It exercises ADCIRC's wetting and drying algorithm, as well as the usual hydrodynamic solution for depth-averaged velocities and sea surface elevation. Its input decks contain 2,734,399 nodes and 5,357,158 elements.

ALEGRA

Code Description

Developed at Sandia National Laboratories, ALEGRA is an Arbitrary Lagrangian-Eulerian (ALE) code. Its dual nature provides flexibility, accuracy, and reduced numerical dissipation compared with a pure Eulerian code. Its modern remeshing technology also allows for robust mesh smoothing and control.

ALEGRA Wire Explosion

This is a two-dimensional simulation of a suspended aluminum wire exploding. It is represented within a rectilinear-biased mesh in cylindrical (r-z) geometry with 12.5 micrometer resolution (12.5 million elements), with the elapsed time at 1 microsecond.

ALEGRA Oblique Sphere Impact

This is a three-dimensional representation of the Grady-Kipp oblique sphere impact experiment, where a copper sphere of 3.18 mm radius hits a rectangular steel plate with a velocity of 4520 m/s at an angle of 30.8 degrees. The geometrical mesh is three-dimensional and rectilinear, with the problem having a resolution of 300 micrometers totaling 19.3 million elements. The elapsed time for this simulation is 15 microseconds.

AVUS

Code Description

Originating at Wright-Patterson AFB, the Air Vehicles Unstructured Solver (AVUS) is a computational fluid dynamics (CFD) code descended from an older version of COBALT_60. It simulates three-dimensional viscous flow over irregular geometries. Because it is grid-based, it must read in a sizeable grid file. It is a FORTRAN-90 code of approximately 29K lines, and it uses ParMETIS to partition the mesh; two versions of ParMETIS are included. AVUS's parallelism is exclusively through the Message Passing Interface (MPI); no OpenMP functionality is currently available. In this version, the restart and picture output files can optionally be written using parallel I/O (MPI-2 I/O).

AVUS Waverider

This problem was used in our earlier efforts as a "large" test case. It is a generic configuration for a supersonic/hypersonic vehicle that "rides" a shock wave that forms below the vehicle at such speeds, i.e., the attached shock generates lift for the vehicle.

AVUS Turret-TD

This is a model of a turret in a wind tunnel. The turret has a number of small pins that control the separation characteristics of the flow. Unlike previous test cases, this one is time-dependent. The grid file is read once at the beginning of the run, but both the restart and picture (pix) files are written out every 100 time-steps. Thus each file is written 10 times during the course of a 1000 time-step run.

If AVUS is compiled in serial I/O mode, all I/O is done through one process that must collect all the pieces of the restart and pix files from all the other processes. Thus the time-dependent case is expected to scale poorly in serial I/O mode as the process count grows.

If AVUS is compiled in parallel I/O mode, each process writes its own portion of the restart and pix files.

CTH

Code Description

This code originates at Sandia National Laboratories and is part of the computational structural mechanics technology area. The name is an acronym of an acronym: it stands for "CSQ to the Three-Halves", where CSQ stands for "CHARTD Squared" and CHARTD stands for Computational Hydrodynamics and Radiative Thermal Diffusion. It uses a two-step, second-order accurate Eulerian algorithm to solve the mass, momentum, and energy equations in shock physics work. This explicit approach avoids having to solve a linear system. CTH has both static and adaptive mesh capabilities, which our two test cases below exercise in lieu of a separate stand-alone adaptive mesh refinement application. Parallelism is provided through MPI. The application totals around 900K lines of code, of which 58 percent is FORTRAN and the remaining 42 percent is C. CTH requires NetCDF, which is bundled with the CTH distribution.

CTH Fixed-Grid

This model is a fixed-grid, long-rod penetrator with oblique impact. Specifically, a 7.67-cm-long, 0.767-cm-diameter rod made of 10 materials impacts a 0.64-cm-thick plate made of eight materials at an angle of 73.5 degrees. The initial velocity of the rod is 1210 m/s. The computation uses a 3-D fixed grid of 1840 x 230 x 460 cells and runs for 300 time-steps. A restart file, approximately 10 MB in size for each MPI process, is written at the beginning of the run and at the end of the run.

CTH AMR (Adaptive Mesh Refinement)

This case is the same as the FY2010 "standard" test case except that the maximum number of levels of refinement has been increased from six to eight, making the problem much more compute-intensive and memory-intensive than the previous version. The model is for the same problem as described above for the fixed-grid test case, but the mesh is allowed to adaptively refine in areas in response to the intensity of the computation rather than use a uniform mesh over the entire domain. The number of time-steps is 200. A restart file for each MPI process is written at the beginning of the run and at the end of the run.

GAMESS

Code Description

The General Atomic and Molecular Electronic Structure System (GAMESS) originates with the Gordon group at Iowa State University. It has been a mainstay in our benchmarking process for years, in part because of its memory-intensive nature. The application falls under the aegis of the computational chemistry, biology, and material science technology area. As an ab initio quantum chemistry code, it computes many integrals with molecular data in the form of atom positions and electron orbitals. It can be compiled with LAPI, sockets, SHMEM, and MPI, although recent versions have focused more on MPI and less on LAPI. It is written almost entirely (99 percent) in FORTRAN; the remaining one percent, within the communication layer, is C.

GAMESS MP2-grad

This test case performs a second-order Møller-Plesset (MP2) computation that finds the nuclear gradient vector of a "BC4" molecule (i.e., the [B(C(NO2)3)4]- oxygen-rich anion) using the restricted Hartree-Fock calculation with self-consistent field wave functions.

GAMESS DFT-grad

This test case performs a density functional theory computation to compute the nuclear gradient vector of a "POSS" molecule (i.e., the polyhedral oligomeric silsesquioxane molecule) using the restricted Hartree-Fock calculation with self-consistent field wave functions.

GAMESS CC-energy (modified)

The CC-energy test case is a "coupled cluster singles plus doubles plus a perturbative estimate of triples" energy calculation, denoted as CCSD(T), on an energetic heterocyclic ring compound.

HYCOM

Code Description

Standing for the HYbrid Coordinate Ocean Model, this code comes from the U.S. Naval Research Laboratory. It falls under the climate/weather/ocean modeling and simulation computational technology area. Coded 100 percent in FORTRAN, it is a primitive equation ocean general circulation model. Its communication layer is MPI. As with AVUS, MPI-2 parallel I/O is available for processing large binary files.

HYCOM large

The sole HYCOM test case is a 32-layer, 1/25 degree global model that simulates 1 day. It requires approximately 1.8 GB of memory per processor and about 180 GB of globally accessible scratch disk.

ICEPIC

Code Description

This application originates with the Air Force Research Laboratory at Kirtland AFB outside of Albuquerque, New Mexico. It serves as a representative of the computational electromagnetics and acoustics technology area. Described as a particle-in-cell plasma physics code, it is used widely in the design of electromagnetic devices. Ions and electrons move under the influence of electromagnetic fields. In ICEPIC, the particles are updated in a grid-free manner and are grouped into cells that are periodically adjusted to preserve the computational load balance. The fields are calculated on a structured, static grid and dual grid according to Maxwell's equations. As a 100 percent C/C++ code, ICEPIC can simulate plasmas contained within complex geometries.

ICEPIC magnetron

This test case simulates a magnetron for a high-power microwave source during startup. It features many transient waves, with particles representing electrons being created and moving in a grid-free way throughout the domain. There are fewer particles than in the larger gyrotron test case described below. This test case emphasizes finite difference time domain wave simulation more than the gyrotron test case does; its particle-in-cell workload is significantly smaller.

ICEPIC gyrotron

This test case is a large simulation of the gyrotron source of the airborne version of the active denial system (ADS). While the mechanics of the particle and field updates are similar to those of the smaller magnetron test described above, there are many more particles. It must therefore perform significantly more particle-in-cell (PIC) work than the magnetron case. In effect, this test case tests the physics more than the magnetron case.

LAMMPS

Code Description

An application from Sandia National Laboratories, LAMMPS, like GAMESS, falls in the computational chemistry, biology, and material science technology area. It is a classical molecular dynamics code that models particles in a solid, liquid, or gaseous state. It calculates atomic velocities, positions, system energy, and temperature. All actions occur within a box that is usually orthogonal. Distributed-memory, message-passing parallelism is accomplished by the use of MPI. It is written in C++ and is portable. A fast Fourier transform library, such as FFTW, is necessary to compile the code. Recent code development has worked toward enabling usage of GPGPUs via CUDA.

LAMMPS Au

This model contains a cluster of 121 functionalized gold nanoparticles. The gold nanoparticles are 5 nm in diameter and coated with alkanethiol ligands with eight carbons and a terminating methyl group. The ligands are simulated using the united atom method.

Normalization Calculation for the Performance Matrix

The table below shows the duration (in seconds) of the AVUS application running its Waverider test case across several of the DSRC machines. We first find the minimum duration, which in this case is 941 seconds (on the ERDC_Diamond machine). This is the best time across all machines, so it receives a normed score of 100 and every other machine is judged relative to it. To do this, we divide the minimum duration by the duration recorded on each machine and multiply the result by 100.

An Example Using the AVUS Waverider Case
Architecture     Duration (s)  Normalized
ERDC_Diamond     941           100
MHPCC_Mana       967           97
NAVY_DaVinci     1509          62
NAVY_Einstein    1815          52
ARL_MRAP         1815          52
NAVY_Einstein2   1667          56
ERDC_Garnet      1445          65

Minimum duration: 941

In summary, the function is:

normedValue = (minimumDuration over all machines / duration on single machine) * 100

Finally, round normedValue to the nearest integer for the table. For example, the NAVY_Einstein value of 941 / 1815 * 100 = 51.8 is reported as 52.
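The normalization can be sketched in Python as follows. This is a minimal illustration, not the script used to produce the tables: the function name `normalize_durations`, the use of `None` for "na" entries, and round-half-up rounding are all assumptions. Rounding to the nearest integer is what reproduces the tabulated scores; for instance, 941 / 1815 * 100 = 51.8 appears in the table as 52.

```python
import math


def normalize_durations(durations):
    """Scale runtimes so the fastest machine scores 100.

    `durations` maps a machine name to its runtime in seconds, or to None
    where no result is available ("na" in the tables).
    """
    valid = [d for d in durations.values() if d is not None]
    if not valid:
        return {name: None for name in durations}
    fastest = min(valid)
    scores = {}
    for name, duration in durations.items():
        if duration is None:
            scores[name] = None
        else:
            # Round half up to the nearest integer (an assumption that
            # matches the values shown in the example table).
            scores[name] = math.floor(fastest / duration * 100 + 0.5)
    return scores


# AVUS Waverider durations in seconds, copied from the example table.
waverider = {
    "ERDC_Diamond": 941,
    "MHPCC_Mana": 967,
    "NAVY_DaVinci": 1509,
    "NAVY_Einstein": 1815,
    "ARL_MRAP": 1815,
    "NAVY_Einstein2": 1667,
    "ERDC_Garnet": 1445,
}

for machine, score in normalize_durations(waverider).items():
    print(f"{machine:<16}{score}")
```

Machines without a result keep `None` rather than a score, mirroring the "na" cells in the matrix.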

From the above, for instance, we can see that running the AVUS Waverider test case on NAVY_Einstein takes almost twice as long as our fastest result on ERDC_Diamond. The above calculations are performed for each test case for each code across the DSRC architectures under study and finally summarized in the performance matrix. The results are then binned into one of five categories and highlighted by the associated color for ease of reading. The best results are those closest to 100, while poorer performers fall farther below it.
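The binning step can be sketched the same way. The five score ranges come from the legend under the relative-performance matrix; the function name `score_bin` and the convention of numbering the bins 0 (red end) through 4 (green end) are illustrative assumptions, since the intermediate colors are not named in the text.

```python
# Score ranges from the matrix legend, ordered from the red end to the green end.
BINS = [(0, 19), (20, 39), (40, 59), (60, 79), (80, 100)]


def score_bin(score):
    """Return the index (0-4) of the bin holding an integer score, or None for "na"."""
    if score is None:
        return None
    for index, (low, high) in enumerate(BINS):
        if low <= score <= high:
            return index
    raise ValueError(f"score out of range: {score}")


# Normalized AVUS Waverider scores taken from the example table.
for machine, score in [("ERDC_Diamond", 100), ("NAVY_Einstein", 52), ("ERDC_Garnet", 65)]:
    low, high = BINS[score_bin(score)]
    print(f"{machine}: {score} falls in the {low}-{high} bin")
```

Keeping the ranges in a single list means the legend and the coloring logic cannot drift apart if the bin boundaries are ever changed.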

View the FY2010 benchmarking results.