The choice of hardware components and software configuration drives the time to solution and scalability of any application. The variety of codes that run on the DoD High Performance Computing Modernization Program (HPCMP) machines and the differences in platforms force researchers into a classic economics exercise: how to allocate limited HPC resources to achieve the most in the least amount of time.
Our results below focus on the following codes: ADCIRC, ALEGRA, Air Vehicles Unstructured Solver (AVUS), CTH, General Atomic and Molecular Electronic Structure System (GAMESS), HYbrid Coordinate Ocean Model (HYCOM), Improved Concurrent Electromagnetic Particle In Cell (ICEPIC), and Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS).
Generally, benchmarking results serve two purposes. First, at a fixed CPU count, relative machine performance can be judged on a per-application basis. This is normally done at what is termed a distinguished processor count. Second, the times for a particular application test case on a particular machine at varying CPU counts indicate how well the application scales. This is especially true if the same runs are made across widely disparate types of machines. The timings will certainly differ because of variations in OS, compilers, interconnects, memory layout, etc. However, if carried out to a large enough number of CPUs, a pattern should emerge, providing a sense of how well the application scales. Here we focus exclusively on the former and disregard application scalability, with the aim of steering applications to their most appropriate computational platforms. In the matrix summary table below, we use a red-to-green scale to denote each application's relative performance on a machine-by-machine basis, with red highlighting the poorest relative performance and green the best. The number following each test case name is the number of cores at which the tabulated performance was recorded.
| Architecture | ADCIRC Baroclinic 1024 | ADCIRC Hurricane 512 | ALEGRA ObliqueImp 1536 | ALEGRA ExplWire 256 | AVUS Waverider 1024 | AVUS Turret-td 1280 | CTH Fixed-grid 1280 | CTH AMR 1280 | GAMESS DFT-grad 256 | GAMESS MP2-grad 512 | GAMESS CC-energy 1024 | HYCOM Lrg 1353 | ICEPIC Magnetron 384 | ICEPIC Gyrotron 2048 | LAMMPS Au 1024 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ERDC Diamond | 93 | 100 | 100 | 100 | 100 | 100 | 94 | 100 | 83 | 81 | 92 | 96 | 95 | 100 | 100 |
| MHPCC Mana | na | 74 | na | na | 97 | 93 | 100 | 92 | 79 | 98 | 100 | 100 | 100 | 93 | 80 |
| NAVY Davinci | 65 | 49 | 27 | 37 | 62 | 24 | na | 64 | 100 | 68 | 85 | 42 | 30 | na | 90 |
| NAVY Einstein | na | na | 50 | 40 | 52 | 36 | 71 | 52 | 40 | 88 | 78 | 65 | 45 | 37 | 41 |
| ARL MRAP | na | na | 49 | 40 | 52 | 36 | 70 | 50 | 40 | 88 | 78 | 65 | 46 | 36 | 41 |
| NAVY Einstein2 | 87 | 15 | 52 | 32 | 56 | 41 | na | 50 | 45 | 100 | 88 | 68 | 50 | 43 | 49 |
| ERDC Garnet | 100 | 63 | 58 | 60 | 65 | 49 | 88 | 67 | 53 | 70 | 88 | 84 | 51 | 43 | 50 |
For more information on how these values are normalized, see the Summary of Application Runtimes below.
| 0-19 | 20-39 | 40-59 | 60-79 | 80-100 |
|---|---|---|---|---|
| Architecture | ADCIRC Baroclinic 1024 | ADCIRC Hurricane 512 | ALEGRA ObliqueImp 1536 | ALEGRA ExplWire 256 | AVUS Waverider 1024 | AVUS Turret-td 1280 | CTH Fixed-grid 1280 | CTH AMR 1280 | GAMESS DFT-grad 256 | GAMESS MP2-grad 512 | GAMESS CC-energy 1024 | HYCOM Lrg 1353 | ICEPIC Magnetron 384 | ICEPIC Gyrotron 2048 | LAMMPS Au 1024 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ERDC Diamond | 8959 | 3347 | 1640 | 944 | 941 | 1332 | 3399 | 2535 | 4701 | 2536 | 3658 | 3020 | 2559 | 3639 | 3182 |
| MHPCC Mana | na | 4504 | na | na | 967 | 1437 | 3186 | 2768 | 4929 | 2108 | 3354 | 2893 | 2443 | 3911 | 3993 |
| NAVY Davinci | 12720 | 6836 | 6048 | 2556 | 1509 | 5535 | na | 3975 | 3911 | 3042 | 3924 | 6817 | 8144 | na | 3536 |
| NAVY Einstein | na | na | 3273 | 2348 | 1815 | 3699 | 4479 | 4912 | 9661 | 2342 | 4320 | 4443 | 5370 | 9925 | 7771 |
| ARL MRAP | na | na | 3343 | 2364 | 1815 | 3699 | 4542 | 5055 | 9661 | 2342 | 4320 | 4443 | 5328 | 10054 | 7771 |
| NAVY Einstein2 | 9570 | 22506 | 3144 | 2969 | 1667 | 3245 | na | 5068 | 8635 | 2061 | 3801 | 4252 | 4897 | 8452 | 6444 |
| ERDC Garnet | 8308 | 5349 | 2806 | 1564 | 1445 | 2737 | 3603 | 3760 | 7346 | 2931 | 3807 | 3450 | 4819 | 8426 | 6343 |
| DSRC | Name | Make | Model | Chip Set | Processor Speed (GHz) | Interconnect | Number of Cores | Cores per Node | Operating System |
|---|---|---|---|---|---|---|---|---|---|
| ERDC | Diamond | SGI | Altix ICE | Intel Xeon QC | 2.8 | DDR4 InfiniBand | 15360 | 8 | SUSE Linux |
| MHPCC | Mana | Dell | PowerEdge M610 | Intel Xeon QC | 2.8 | DDR InfiniBand | 9216 | 8 | Linux |
| NAVY | DaVinci | IBM | Power6 | IBM P6 DC | 4.7 | InfiniBand | 4800 | 32 | AIX |
| NAVY | Einstein | Cray | XT5 | Cray Opteron QC | 2.3 | SeaStar2+ | 12736 | 8 | CNL |
| ARL | MRAP | Cray | XT5 | Cray Opteron QC | 2.6 | SeaStar | 10400 | 8 | CNL |
| NAVY | Einstein2 | Cray | XT5 | Cray Opteron QC | 2.4 | SeaStar2+ | 12736 | 8 | CNL |
| ERDC | Garnet | Cray | XE6 | AMD Opteron 64-bit | 2.4 | Cray Gemini | 20224 | 16 | CLE |
Obtained from the University of North Carolina Institute of Marine Sciences, ADCIRC is a coastal circulation and storm surge model. It solves time-dependent, free-surface circulation and transport problems in two and three dimensions. It uses the finite element method (FEM) in space, permitting highly flexible, unstructured grids. Typical ADCIRC uses have included modeling tides and wind-driven circulation, the analysis of hurricane storm surge and flooding, determining dredging feasibility and material disposal studies, larval transport studies, and near-shore marine operations. ADCIRC solves the equations of motion for a moving fluid on a rotating earth. These equations are formulated using the traditional hydrostatic pressure and Boussinesq approximations and have been discretized in space using the finite element (FE) method and in time using the finite difference (FD) method.
ADCIRC can be run either as a two-dimensional depth integrated (2DDI) model or as a three-dimensional (3-D) model. In either case, elevation is obtained from the solution of the depth-integrated continuity equation in Generalized Wave-Continuity Equation (GWCE) form. Velocity is obtained from the solution of either the 2DDI or 3-D momentum equations. All of the nonlinear terms have been carefully retained in all of these equations.
ADCIRC can be run using either a Cartesian or a spherical coordinate system. It includes a least squares analysis routine that computes harmonic constituents for elevation and depth-averaged velocity during the course of the run, thereby avoiding the need to write out long time series for postprocessing.
ADCIRC has been optimized by unrolling loops for enhanced performance on multiple computer architectures. It includes MPI library calls to allow it to operate at high efficiency, typically better than 90 percent, on parallel computer architectures.
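The efficiency figure quoted above can be made concrete with a minimal sketch. It assumes the common definition of relative parallel efficiency (core-seconds at a baseline run divided by core-seconds at a larger run); the source does not state which definition it uses, and the timings below are hypothetical placeholders, not measured ADCIRC results.

```python
def parallel_efficiency(base_cores, base_seconds, cores, seconds):
    """Relative parallel efficiency of a run versus a smaller baseline run.

    Assumed definition (not stated in the source):
        efficiency = (base_cores * base_seconds) / (cores * seconds)
    A value of 1.0 means perfect scaling from the baseline.
    """
    return (base_cores * base_seconds) / (cores * seconds)


# Hypothetical timings for illustration only: doubling the core count
# from 64 to 128 reduces the runtime from 1000 s to 540 s.
efficiency = parallel_efficiency(64, 1000.0, 128, 540.0)
print(f"relative efficiency: {efficiency:.1%}")
```

Under this definition, "better than 90 percent" would mean that doubling the core count cuts the runtime to no more than about 56 percent of the baseline.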
The Baroclinic test case simulates the dynamics of the Turkish Straits System, including the Northeastern Aegean Sea, Marmara Sea, and the southwest Black Sea. These seas are connected to each other by the Dardanelles and the Bosporus straits, and the salinity and density differences between the seas drive a two-layer flow system in both straits. This case is composed of 310,435 nodes and 605,099 elements, and the finest resolution of the model is 20 m. Although this case is rather I/O intensive, it scales out to approximately 2K cores.
The Hurricane test case is a Gulf of Mexico hurricane storm surge simulation. It represents a typical ADCIRC application that the U.S. Army Corps of Engineers runs when doing levee designs or flood plain mappings for FEMA. Within ADCIRC, it exercises the wetting and drying algorithm, as well as the usual hydrodynamic solution for depth-averaged velocities and sea surface elevation. Its input decks contain 2,734,399 nodes and 5,357,158 elements.
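To give a feel for how much mesh each core handles in the two ADCIRC cases, the sketch below divides the node counts quoted above by the core counts listed for each case in the summary tables. It assumes a perfectly even partition with one MPI process per core, which is a simplification; the function name is illustrative.

```python
# Mesh sizes from the test case descriptions; core counts from the
# summary table headers (Baroclinic at 1024 cores, Hurricane at 512).
cases = {
    "Baroclinic": {"nodes": 310_435, "elements": 605_099, "cores": 1024},
    "Hurricane": {"nodes": 2_734_399, "elements": 5_357_158, "cores": 512},
}


def nodes_per_core(nodes, cores):
    """Average mesh nodes per core, assuming a perfectly even partition."""
    return nodes / cores


for name, case in cases.items():
    per_core = nodes_per_core(case["nodes"], case["cores"])
    ratio = case["elements"] / case["nodes"]
    print(f"{name}: {per_core:.0f} nodes per core, {ratio:.2f} elements per node")
```

At the benchmarked core counts, the Hurricane case carries roughly 5,300 nodes per core against roughly 300 for the Baroclinic case. At 2,048 cores, the Baroclinic case would drop to about 150 nodes per core under the same assumption.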
Developed at Sandia National Laboratories, ALEGRA is an Arbitrary Lagrangian Eulerian (ALE) code. The dual nature provides flexibility, accuracy, and reduced numerical dissipation over a pure Eulerian code. In addition, its modern remeshing technology allows for robust mesh smoothing and control.
The ExplWire test case is a two-dimensional simulation of a suspended aluminum wire exploding. It uses a rectilinear-biased mesh in cylindrical (r-z) geometry with 12.5 micrometer resolution (12.5 million elements) and covers 1 microsecond of simulated time.
The ObliqueImp test case is a three-dimensional representation of the Grady-Kipp oblique sphere impact experiment, where a copper sphere of 3.18 mm radius hits a rectangular steel plate with a velocity of 4520 m/s at an angle of 30.8 degrees. The mesh is three-dimensional and rectilinear, with a resolution of 300 micrometers totaling 19.3 million elements. The simulation covers 15 microseconds of simulated time.
Originating at Wright-Patterson AFB, the Air Vehicles Unstructured Solver (AVUS) is a computational fluid dynamics (CFD) code descended from an older version of COBALT_60. It simulates three-dimensional viscous flow over irregular geometries. At its foundation, it is grid-based and, as a result, must read in a sizeable grid file. It is a FORTRAN-90 code of approximately 29K lines, and it uses ParMETIS to partition the mesh. Two versions of ParMETIS are included. AVUS is parallelized exclusively through the Message Passing Interface (MPI); no OpenMP functionality is currently available. In this version, the restart and picture output files can optionally be written using parallel I/O (MPI-2 I/O).
The Waverider test case was used in our earlier efforts as a "large" test case. It is a generic configuration for a supersonic/hypersonic vehicle that "rides" a shock wave that forms below the vehicle at such speeds, i.e., the attached shock generates lift for the vehicle.
The Turret-td test case is a model of a turret in a wind tunnel. The turret has a number of small pins for control of the separation characteristics of the flow. Unlike previous test cases, this one is time-dependent. The grid file is read once at the beginning of the run, but both the restart and pix files are written out every 100 time-steps. Thus each file is written 10 times during the course of a 1000 time-step run.
If AVUS is compiled in serial I/O mode, all I/O is done through one process that must collect all the pieces of the restart and pix files from all the other processes. Thus the time-dependent case is expected to scale poorly in serial I/O mode as the process count grows.
If AVUS is compiled in parallel I/O mode, each process writes its own portion of the restart and pix files.
CTH originates at Sandia National Laboratories and is part of the computational structural mechanics technology area. The name is an acronym of an acronym; it stands for "CSQ to the Three-Halves". CSQ stands for "CHARTD Squared", where CHARTD stands for Computational Hydrodynamics and Radiative Thermal Diffusion. It uses a two-step, second-order accurate Eulerian algorithm to solve the mass, momentum, and energy equations in shock physics work. This is an explicit approach that avoids solving a linear system. CTH has both static and adaptive mesh capabilities, and our two test cases below exercise one each, in lieu of benchmarking adaptive mesh refinement in a separate stand-alone application. Parallelism is invoked through use of MPI. The application totals around 900K lines of code, of which 58 percent is FORTRAN and the remaining 42 percent C. CTH requires NetCDF, which is bundled with the CTH distribution.
The Fixed-grid test case models a long-rod penetrator with oblique impact. Specifically, a 7.67-cm-long, 0.767-cm-diameter rod made of 10 materials impacts a 0.64-cm-thick plate made of eight materials at an angle of 73.5 degrees. The initial velocity of the rod is 1210 m/s. The computation uses a 3-D fixed grid of 1840 x 230 x 460 cells and runs for 300 time-steps. A restart file, approximately 10 MB in size for each MPI process, is written at the beginning of the run and at the end of the run.
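The grid dimensions and restart size above translate into a per-core workload and an aggregate I/O volume. The sketch below works through that arithmetic at the 1280 cores listed for this case in the summary tables. It assumes one MPI process per core and an even split of cells, and it treats the "approximately 10 MB" restart size as exact; all variable names are illustrative.

```python
# Grid dimensions and restart size from the test case description.
nx, ny, nz = 1840, 230, 460
restart_mb_per_process = 10   # approximate figure from the text
restart_writes = 2            # beginning and end of the run

# Core count listed for the CTH Fixed-grid case in the summary tables.
# Assumption: one MPI process per core.
cores = 1280

total_cells = nx * ny * nz
cells_per_core = total_cells / cores

# Aggregate restart volume, in decimal gigabytes (1 GB = 1000 MB).
restart_gb_per_write = restart_mb_per_process * cores / 1000
restart_gb_total = restart_gb_per_write * restart_writes

print(f"total cells:      {total_cells:,}")
print(f"cells per core:   {cells_per_core:,.1f}")
print(f"restart per write: {restart_gb_per_write:.1f} GB")
print(f"restart per run:   {restart_gb_total:.1f} GB")
```

Under these assumptions the grid holds about 195 million cells, or roughly 152,000 cells per core, and each restart dump totals about 12.8 GB across all processes.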
The AMR test case is the same as the FY2010 "standard" test case except that the maximum number of levels of refinement has been increased from six to eight, making the problem much more compute-intensive and memory-intensive than the previous version. The model is for the same problem as described above for the fixed-grid test case, but the mesh is allowed to refine adaptively in response to the intensity of the computation rather than use a uniform mesh over the entire domain. The number of time-steps is 200. A restart file for each MPI process is written at the beginning of the run and at the end of the run.
The General Atomic and Molecular Electronic Structure System (GAMESS) originates with the Gordon group at Iowa State University. It has been a mainstay in our benchmarking process for years, in part because of its memory-intensive nature. The application falls under the aegis of the computational chemistry, biology, and material science technology area. As an ab initio quantum chemistry code, it computes many integrals with molecular data in the form of atom positions and electron orbitals. It can be compiled with LAPI, sockets, SHMEM, or MPI, although recent versions have focused more on MPI and less on LAPI. It is written almost entirely (99 percent) in FORTRAN; the remaining one percent, within the communication layer, is C.
The MP2-grad test case performs a second-order Møller-Plesset (MP2) computation that finds the nuclear gradient vector of a "BC4" molecule (i.e., the [B(C(NO2)3)4]- oxygen-rich anion) using the restricted Hartree-Fock calculation with self-consistent field wave functions.
The DFT-grad test case performs a density functional theory (DFT) computation of the nuclear gradient vector of a "POSS" molecule (i.e., the polyhedral oligomeric silsesquioxane molecule) using the restricted Hartree-Fock calculation with self-consistent field wave functions.
The CC-energy test case is a "coupled cluster singles plus doubles plus a perturbative estimate of triples" energy calculation, denoted as CCSD(T), on an energetic heterocyclic ring compound.
HYCOM, the HYbrid Coordinate Ocean Model, comes from the U.S. Naval Research Laboratory. It falls under the climate/weather/ocean modeling and simulation computational technology area. Coded 100 percent in FORTRAN, it is a primitive equation ocean general circulation model. Its communication layer is MPI. As with AVUS, MPI-2 parallel I/O is available for processing large binary files.
The sole HYCOM test case is a 32-layer, 1/25 degree global model that simulates 1 day. It requires approximately 1.8 GB of memory per processor and about 180 GB of globally accessible scratch disk.
ICEPIC originates with the Air Force Research Laboratory at Kirtland AFB outside of Albuquerque, New Mexico. It serves as a representative of the computational electromagnetics and acoustics technology area. Described as a particle-in-cell plasma physics code, it is used widely in the design of electromagnetic devices. Ions and electrons move under the influence of electromagnetic fields. In ICEPIC, the particles are updated in a grid-free manner and are grouped into cells that are periodically adjusted to preserve the computational load balance. The fields are calculated on a structured, static grid and dual grid according to Maxwell's equations. As a 100 percent C/C++ code, ICEPIC can simulate plasmas contained within complex geometries.
The Magnetron test case simulates a magnetron for a high-power microwave source during startup. It features many transient waves, with particles representing electrons being created and moving in a grid-free way throughout the domain. It has relatively few particles compared with the larger gyrotron test case described below, so it emphasizes finite-difference time-domain wave simulation, and its particle-in-cell workload is significantly smaller.
The Gyrotron test case is a large simulation of the gyrotron source of the airborne version of the active denial system (ADS). While the mechanics of the particle and field updates are similar to those of the smaller magnetron test described above, there are many more particles, so it performs significantly more particle-in-cell (PIC) work. In effect, this test case tests the physics more than the magnetron case.
As an application from Sandia National Laboratories, LAMMPS, like GAMESS, falls in the computational chemistry, biology, and material science technology area. It is a classical molecular dynamics code that models particles in a solid, liquid, or gaseous state. It calculates atomic velocities, positions, system energy, and temperature. All actions occur within a box that is usually orthogonal. Distributed-memory, message-passing parallelism is accomplished by the use of MPI. It is written in C++ and is portable. A fast Fourier transform library, such as FFTW, is necessary to compile the code. Recent code development has worked toward enabling usage of GPGPUs via CUDA.
The Au test case models a cluster of 121 functionalized gold nanoparticles. The gold nanoparticles are 5 nm in diameter and coated with alkanethiol ligands with eight carbons and a terminating methyl group. The ligands are simulated using the united atom method.
The table below shows the duration (in seconds) of the AVUS Waverider test case across the DSRC machines. The minimum duration is 941 seconds, on the ERDC_Diamond machine. Because this is the best time across all machines, it receives a normed score of 100, and every other machine is judged relative to it: we divide the minimum duration by the duration recorded on each machine and multiply the result by 100.
| Architecture | Duration | Normalized |
|---|---|---|
| ERDC_Diamond | 941 | 100 |
| MHPCC_Mana | 967 | 97 |
| NAVY_Davinci | 1509 | 62 |
| NAVY_Einstein | 1815 | 52 |
| ARL_MRAP | 1815 | 52 |
| NAVY_Einstein2 | 1667 | 56 |
| ERDC_Garnet | 1445 | 65 |
| Minimum Duration | 941 | |
normedValue = (minimumDuration over all machines / duration on a single machine) * 100
Finally, report normedValue as an integer in the table. The tabulated values are consistent with rounding to the nearest integer rather than truncation; for example, 941 / 1815 * 100 = 51.8 appears as 52.
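The normalization can be sketched in a few lines of Python using the AVUS Waverider durations from the table above. One assumption is worth flagging: the sketch rounds to the nearest integer, because that is what reproduces the tabulated scores (1815 seconds gives 51.8, which the table lists as 52), whereas plain truncation would give 51. The function and variable names are illustrative.

```python
import math

# Durations in seconds for the AVUS Waverider case, copied from the table above.
durations = {
    "ERDC_Diamond": 941,
    "MHPCC_Mana": 967,
    "NAVY_Davinci": 1509,
    "NAVY_Einstein": 1815,
    "ARL_MRAP": 1815,
    "NAVY_Einstein2": 1667,
    "ERDC_Garnet": 1445,
}


def normed_value(minimum_duration, duration):
    """Score a run against the fastest run; the fastest machine scores 100."""
    raw = minimum_duration / duration * 100
    # Assumption inferred from the tabulated values: round half up to the
    # nearest integer. math.floor(raw) alone would truncate instead.
    return math.floor(raw + 0.5)


minimum_duration = min(durations.values())
scores = {
    machine: normed_value(minimum_duration, duration)
    for machine, duration in durations.items()
}

for machine, score in scores.items():
    print(f"{machine:15s} {durations[machine]:5d} {score:4d}")
```

The resulting scores match the Normalized column of the table for all seven machines.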
For instance, running the AVUS Waverider test case on NAVY_Einstein takes almost twice as long as the fastest result, on ERDC_Diamond. These calculations are performed for each test case of each code across the DSRC architectures under study and summarized in the performance matrix. The results are then binned into one of five categories and highlighted by the associated color for ease of reading. The best results are those closest to 100; lower scores indicate proportionally longer runtimes.
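The binning step can be sketched as follows, using the five ranges from the legend under the matrix (0-19, 20-39, 40-59, 60-79, 80-100). The treatment of "na" entries as unbinned is an assumption, as is the function name; the example row is the NAVY Davinci row of the performance matrix.

```python
# Lower bound and label of each legend bin, highest first.
BINS = [(80, "80-100"), (60, "60-79"), (40, "40-59"), (20, "20-39"), (0, "0-19")]


def score_bin(score):
    """Return the legend bin for a normed score, or None for a missing run.

    Assumption: entries marked "na" in the matrix are left unbinned.
    """
    if score == "na":
        return None
    if not 0 <= score <= 100:
        raise ValueError(f"normed score out of range: {score}")
    for lower, label in BINS:
        if score >= lower:
            return label


# NAVY Davinci row of the performance matrix, in column order.
davinci_row = [65, 49, 27, 37, 62, 24, "na", 64, 100, 68, 85, 42, 30, "na", 90]
davinci_bins = [score_bin(score) for score in davinci_row]
print(davinci_bins)
```

Per the red-to-green scale described earlier, the lowest bin corresponds to red and the highest to green.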