NA-ASC-500-12 Issue 19
The Meisner Minute
Editorial by Bob Meisner
At the recent ASC Principal Investigators meeting, I had the opportunity to describe why the ASC Program is living La Vida Loca. Many of us were around in the mid 1990s, and many have heard the legend of the Accelerated Strategic Computing Initiative (ASCI). I’ve got to say that tackling the challenge of proving that massively parallel computing could credibly underpin Science-Based Stockpile Stewardship was awesome. But carrying the 100-teraFLOPS (TF) legacy across the memory-constrained, massively parallel architectural inflection point to exascale is crazy. These are the good old days.
Today, we run 100 TF high-fidelity simulations on petascale platforms, supporting the safety, surety, and reliability of the nation’s nuclear deterrent. These calculations, once considered the goal of ASCI, are now routine. Weapons stewards are performing many such simulations to address Significant Findings Investigations and to work through the Life Extension Programs, ensuring that the computational basis of these Department of Defense tasks is truly the best we can deliver without nuclear testing. We have surely come a long way since the pre-ASCI days of single-processor vector computing, mainly on Cray platforms. The physics algorithms are much more accurate today, and the geometric fidelity of today’s simulations is far superior to that of 1990; together these advances constitute what we call full-system, high-fidelity simulation tools. Although improved over previous generations of codes, significant advances are still needed to achieve the vision of the current ASC program, i.e., to predict with confidence the performance of our nuclear stockpile.
The pegposts of the Predictive Capability Framework track the major milestones yet to be incorporated into the codes for such a predictive capability. As we proceed to top the Top 500 list once again, we find that the memory footprint of our codes has grown almost exponentially, allowing us to increase the fidelity of our simulations and making us more confident that we can attain predictive stockpile simulations in our lifetimes. The ASC program has also allowed the three defense laboratories (and their partners) to tackle and solve many challenging problems: uncertainty quantification; physics mysteries from our nuclear tests; qualification of non-nuclear systems through high-fidelity simulations instead of expensive, and perhaps impossible, testing; as well as training the next generation of designers in computational simulation instead of testing.
The future is perhaps more challenging than the original ASCI task of moving from single vector processor production codes to massively parallel codes with better predictive capability. Computing platforms and their underlying chips are changing in significant ways. We must address this change in order for our current generation of ASC codes to run effectively. While the design of future computer architectures is unclear, it is clear that the industry is moving in a direction that will drive us to re-work our existing algorithms in significant ways. In fact, as we move forward, counting FLOPS is essentially guaranteed to be the wrong measure of a computer’s worth to the ASC program. Consequently, we are considering new and more meaningful metrics for effective future computers.
Suffice it to say that we are entering a new era in computing and you are leading the charge. The simulation tools you are providing today were only dreams just over 16 years ago. Today’s reality provides only a glimpse of what you can achieve. So when the old timers, myself included, fondly recall the good old days when there was an “I” in ASC, let us have our moment. But, realize that you are living the good old days and you are the architects of our underlying technical capability that ensures a safer nuclear world without testing.
P.S. Special thanks to Bob Weaver for his insights for this newsletter.
Viz Systems at Los Alamos Aid Progress in Predictive Science
The Production Visualization Project at Los Alamos National Laboratory (LANL) provides visualization (viz) systems to weapons designers. The scientists who are experts on the viz systems work directly with weapons designers in a physics-based, iterative discovery process using EnSight, a tool enabled for next-generation visualization and data analysis. They provide analytical expertise to help LANL weapons designers utilize the full power of the hardware and software. The ASC Program develops and deploys the hardware and software infrastructure for visualization and data analysis.
Bob Weaver, a weapons designer at LANL, commented about data analysis in a recent interview. “When we do data analysis — when we actually look at the results of these large 3D calculations — that whole process has become extremely user friendly. It’s almost become real time. There is a wealth of data in these calculations.”
Weaver goes on to say, “You can imagine a billion-cell calculation has a lot of detail throughout the problem. Our graphics visualization techniques — both small‑file daily interactions with the graphics of these calculations as well as our large three-dimensional visualizations on the stereo PowerWall theatre with our EnSight tools — are state of the art and very quick. We can actually visualize and look at the physics and understand the progress of the calculation almost in real time. This is really a large step forward for us.”
LANL completed the Visualization Cluster Upgrade Project in December 2011 (L2 Milestone 4469). Visualizations of groundbreaking simulations on the petascale systems Cielo and Roadrunner are being performed routinely using ViewMaster2, a new visualization cluster capable of delivering analysis of the scale and precision necessary for predictive scientific progress. This project is an excellent example of a co-designed platform with scientists, scientific and visualization experts, computer scientists, and computer system engineers working together to provide a significant step forward in post-processing performance and capacity.
ParaDiS Deployed as First Code Under LLNL's New Lorenz Application Portal
The ASC Physics and Engineering Models (PEM) simulation code for material strength and damage, the Parallel Dislocation Simulator (ParaDiS), was recently chosen as the first code to be deployed under the new Lorenz application portal at LLNL. Increasing the accessibility of PEM codes adds value to ASC by enhancing engagement with the wider scientific community. The ParaDiS code models the mechanisms of dislocation motion in unprecedented detail.
The Lorenz portal allows ASC computer users from across the country to quickly access all manner of information about their accounts and Livermore Computing (LC), thus simplifying aspects of high performance computing (HPC) that traditionally have been tedious or difficult. The portal will provide high-level Web interfaces for the setup, launch, monitoring, and analysis stages of user interaction with ParaDiS, thus helping new users become familiar with the code and what it can do. Experienced users will be able to interact with the code and the LC machines more directly.
ParaDiS is a free, large-scale dislocation dynamics simulation code to study the fundamental mechanisms of plasticity. Originally developed at Lawrence Livermore, it is written in C (with a little C++) and uses the MPI library for communication between processors. It runs routinely on 100–1000 processors, and scalability on 132,000 processors of BlueGene/L has been demonstrated.
Trios: A Collaborative Vehicle for I/O Software Technology
Trilinos I/O Support (Trios) is an open-source package of libraries developed as part of a newly formed I/O capability area of the Trilinos project. Trios was released as part of the Trilinos 10.10.1 software product in February 2012. Trios serves two important roles: as a repository for production-quality I/O libraries such as Exodus, Nemesis, and IOSS (codes traditionally managed as part of the SIERRA toolkit and in use by ASC codes for more than a decade), and as a vehicle for collaborative design, evaluation, and distribution of new techniques to improve I/O on advanced platforms.
The development portion of Trios contains several ASC-developed software products, including the Network Scalable Service Interface (Nessie). Nessie is the core framework used to develop "data services," a technology that leverages available compute resources on HPC systems for real-time management and analysis of simulation data. One data service built with Nessie provides caching and staging for applications that have "bursty" I/O operations (such as checkpoints). Published results demonstrate a 10x improvement in effective I/O rates for representative applications. We are actively developing a production-quality version of this staging service for the Exodus I/O library for users of the Alegra multiphysics codes.
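As a rough illustration of the staging idea only (this is a minimal sketch, not the Nessie API; the buffer size, file name, and single background thread are invented for the example), the following C program hands a checkpoint off to a "staging" path so the compute loop is not stalled by the slow write:

/* Illustrative sketch: decouple a bursty checkpoint write from compute.
 * Not the Nessie API; buffer size and file name are placeholders. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define CKPT_BYTES (1 << 20)            /* hypothetical 1 MiB checkpoint */

static char staged[CKPT_BYTES];          /* staging buffer (stand-in for a staging node) */
static int ckpt_ready = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

/* Background "staging service": drains the buffer to storage while the
 * application keeps computing. */
static void *drain(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!ckpt_ready)
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);

    FILE *f = fopen("checkpoint.dat", "wb");
    if (f) {
        fwrite(staged, 1, CKPT_BYTES, f);   /* slow I/O happens off the critical path */
        fclose(f);
    }
    return NULL;
}

int main(void)
{
    char *state = malloc(CKPT_BYTES);
    memset(state, 42, CKPT_BYTES);          /* pretend simulation state */

    pthread_t svc;
    pthread_create(&svc, NULL, drain, NULL);

    /* The "checkpoint" is just a fast memory copy into the staging buffer... */
    memcpy(staged, state, CKPT_BYTES);
    pthread_mutex_lock(&lock);
    ckpt_ready = 1;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);

    /* ...so the simulation can resume immediately instead of waiting on disk. */
    printf("compute resumes while the checkpoint drains\n");

    pthread_join(svc, NULL);
    free(state);
    return 0;
}

In an actual data service the staging buffer would live in the memory of separate compute nodes and drain to the parallel file system, but the benefit is the same: the bursty write is absorbed quickly and the simulation moves on.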
A second data service under development provides real-time analysis for the Sandia shock physics code CTH. This service uses a separate partition of compute nodes to detect material fragments as they are generated by the CTH simulation. The nature of the analysis required for fragment detection suggests that offloading it to a separate set of compute nodes could substantially reduce the I/O and analysis overheads on the CTH shock physics code, and possibly enable fragment-tracking analysis techniques that are not practical with traditional post-processing approaches. A detailed study of this work is underway as part of an FY13 ASC Level 2 milestone.
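A minimal sketch of the separate-partition pattern, assuming MPI and using a trivial maximum-value computation as a stand-in for fragment detection (this is not the actual CTH service or its protocol; NCELLS and the rank assignment are placeholders):

/* Illustrative sketch: split a job into simulation and analysis partitions
 * so analysis runs on separate ranks.  Not the actual CTH service. */
#include <mpi.h>
#include <stdio.h>

#define NCELLS 8    /* hypothetical per-rank field size */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Last rank plays the "analysis partition"; the rest simulate.
     * A real service would dedicate many nodes and a richer protocol. */
    int is_analysis = (rank == size - 1);
    MPI_Comm part;
    MPI_Comm_split(MPI_COMM_WORLD, is_analysis, rank, &part);

    if (!is_analysis) {
        double field[NCELLS];
        for (int i = 0; i < NCELLS; i++)
            field[i] = rank + 0.1 * i;          /* pretend simulation output */
        /* Ship this timestep's data to the analysis partition, then keep computing. */
        MPI_Send(field, NCELLS, MPI_DOUBLE, size - 1, 0, MPI_COMM_WORLD);
    } else {
        for (int src = 0; src < size - 1; src++) {
            double field[NCELLS];
            MPI_Recv(field, NCELLS, MPI_DOUBLE, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            /* Stand-in for fragment detection: report a maximum value. */
            double maxv = field[0];
            for (int i = 1; i < NCELLS; i++)
                if (field[i] > maxv) maxv = field[i];
            printf("analysis rank: max from rank %d = %.2f\n", src, maxv);
        }
    }

    MPI_Comm_free(&part);
    MPI_Finalize();
    return 0;
}

Run with two or more ranks, the last rank acts as the analysis partition while the others simulate; a production service would stream data every timestep rather than once.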
The integration of ASC I/O software into Trios allows us to leverage the professional quality code management and testing infrastructure in Trilinos to ensure a high-quality product. The broad availability of Trilinos has already facilitated a number of collaborative efforts with Oak Ridge National Laboratory, Georgia Institute of Technology, Northwestern University, and others. An article detailing the R&D associated with Trios is scheduled for publication in a special issue of Scientific Programming later in 2012.
Improving Application Performance with CSSE-Provided Technology
At times, there can be a staff-resource tradeoff between improving application models and modifying the application for higher performance or scalability. One of CSSE’s (Computational Systems and Software Engineering) roles is to lessen this tension by providing generally available software that can help improve application performance with minimal, if any, source code changes to the applications. CSSE-developed software, underlying or in partnership with the application, can exploit newer technology. Sandia staff members have completed a survey of thirteen of their software contributions over the last three years that have addressed this role. The survey included data designed to quantify the improvement. Some of these contributions have been reported in prior ASC newsletters. We highlight three more here.
New Robust Contact Capability Dramatically Improves Runtime
Recent additions to the DASH contact algorithm in the Sierra/Solid Mechanics module dramatically improved the robustness and efficiency of the overall capability. Problems that previously took 2 hours to run now complete in 2 minutes (a 25x reduction in contact iterations and a 100x reduction in overall solution iterations). Users also report that problems that previously could not converge in commercial codes are now running to completion in Sierra. Perhaps even more impressive, these solutions were obtained with minimal, simple contact specifications; that is, there was no need to specify master or slave surfaces, capture tolerances, or special iteration techniques.
These additions to the DASH contact capabilities have enabled robust simulations of high-deformation forging problems, system preloads, and problems with many layers of contacts. Contact search and enforcement in implicit codes is one of the more difficult problems in computational solid mechanics due to the highly nonlinear nature of the contact phenomenon and its combination with the nonlinear material and geometric effects that can occur in these types of problems. These effects include phenomena such as friction or general stick-slip, multiple contact combinations, interfaces with dramatically different stiffnesses (e.g., foam and steel), thermal softening, and nested contacts (e.g., multiple layers of shells or parts). These types of conditions appear with increasing frequency in various nuclear weapons applications as analysts include more and more detail in the numerical models.
The recent development activities applied rigorous code verification procedures and advanced features, such as adaptive penalty algorithms and methods to suppress intermediate rigid body modes, to achieve the reported performance and robustness gains.
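As a generic illustration of the adaptive penalty idea only (not the DASH implementation; the stiffness values and tolerance below are arbitrary), the sketch drives a one-dimensional contact problem: a node whose unconstrained equilibrium would penetrate a rigid wall at x = 0 is re-solved with an increasingly stiff penalty until the penetration drops below a tolerance:

/* Generic adaptive penalty contact loop in 1D (illustrative only). */
#include <stdio.h>
#include <math.h>

int main(void)
{
    const double k_struct = 1.0e3;   /* structural stiffness              */
    const double x_free   = -0.5;    /* equilibrium position ignoring wall */
    const double tol      = 1.0e-4;  /* allowed penetration               */
    double k_pen          = 1.0e3;   /* initial (too soft) penalty        */

    for (int iter = 0; iter < 20; iter++) {
        /* Equilibrium of structural spring vs. penalty spring when x < 0:
         *   k_struct*(x_free - x) + k_pen*(-x) = 0
         *   =>  x = k_struct * x_free / (k_struct + k_pen)              */
        double x = k_struct * x_free / (k_struct + k_pen);
        double penetration = fmax(0.0, -x);
        printf("iter %2d: k_pen = %.1e  penetration = %.2e\n",
               iter, k_pen, penetration);
        if (penetration < tol)
            break;                   /* contact constraint satisfied      */
        k_pen *= 10.0;               /* adapt: stiffen the penalty, re-solve */
    }
    return 0;
}

Each pass solves the (here trivial) equilibrium, checks the contact constraint, and only stiffens the penalty where needed, which is the general trade-off an adaptive penalty scheme manages between constraint accuracy and conditioning.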
Predictive Science Panel Held Spring Meeting at Los Alamos
On March 13–16, 2012, the Predictive Science Panel (PSP) met at Los Alamos National Laboratory (LANL). Under a new charter effective in January 2011, the 15-member panel of experts familiar with relevant scientific disciplines—such as theoretical, computational, and experimental science—came together to provide feedback on the quality and direction of the predictive science work for the NNSA Stockpile Stewardship Program.
At their closeout briefing, the panel acknowledged the knowledge and enthusiasm of the technical presenters and noted that they enjoyed the poster session presented by early-career staff at LANL. The panel provided both technical and programmatic suggestions to strengthen the ASC and Science Campaign programs.
The PSP is chartered by the LANL and LLNL Advanced Simulation and Computing (ASC) and Science Campaign (SC) programs to get feedback on the scope of work executed by the ASC and SC programs. Meetings are scheduled roughly every six months with the location alternating between LANL (spring) and LLNL (fall).
Los Alamos’ ASC Program Sponsors Metropolis Postdoctoral Fellowship
Following the death of Nicholas Metropolis on October 17, 1999, then Los Alamos National Laboratory Director John Browne wrote: “Nick’s work in mathematics and the beginnings of computer science forms the basis for nearly everything the Laboratory has done in computing and simulation science.”
In 2010, LANL inaugurated the Nicholas C. Metropolis Postdoctoral Fellowship in Computer and Computational Science. Under the Advanced Simulation & Computing Program, computer simulation capabilities are developed to support the Stockpile Stewardship Program as well as broader national security needs. “Given that much of our weapons work today is done on computers, I wanted to develop a fellowship that specifically targets computational and computer scientists to join LANL,” says Brian Albright, a scientist in the Plasma Theory and Applications Group and one of the architects of the fellowship.
To date, four recipients of this fellowship are pursuing advanced research in the areas of computational and computer science, physics, and engineering. Metropolis postdoc fellows have the opportunity to use the most powerful supercomputers in the world to perform cutting-edge research. An article about the fellowship and a few of its recipients appears on the website of LANL’s National Security Science magazine (http://www.lanl.gov/science/NSS/past_issues.shtml). Click on Issue 2, 2011, and find the article “Rolling out a New Supercomputing Fellowship at Los Alamos.”
For more information about the fellowship, go to the LANL Postdocs website at http://www.lanl.gov/science/postdocs/appointments_fellow.shtml.
Lawrence Livermore Sparks Improvements in HPC Energy Efficiency
As Lawrence Livermore National Laboratory (LLNL) sets its sights on exascale computing, Lab scientists and engineers are researching and developing techniques to improve the energy efficiency of high performance computing (HPC). LLNL is involved in several efforts to reduce the energy use of the computers and the facilities that house them and to promote new standards of quantifying efficiency gains beyond gross energy use.
"Today, U.S. servers and data centers are already using more than 1.5% of the total national electricity consumption," said Anna Maria Bailey, Computation Associate Director Facility Manager. "With 20-megawatt exascale systems expected to come online in the next 7 to 10 years, it's vital that we redefine a supercomputer's relation to energy."
LLNL has been a leader in optimizing the efficiency of HPC through various sustainability projects. For instance, since 2004—when ASC Purple was brought online—until today, through multiple generations of HPC platforms, the Terascale Simulation Facility (TSF) computing power has increased five-fold (in one quarter of the space) while using 2.4 times less electricity.
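Taken together, those two figures imply roughly a twelve-fold improvement in computing delivered per unit of electricity:

5 \times 2.4 = 12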
While enhanced data center efficiency and metrics such as power usage effectiveness (PUE)—the ratio of the total power drawn by a facility to the power used by the computing equipment itself, the remainder going to cooling and other overhead—have improved the overall power picture, scientists must now pursue innovations in smart cooling, heat re-use, renewable energy, and full lifecycle sustainability. The core focus areas include benchmarking, computational fluid dynamics, Leadership in Energy and Environmental Design certifications, HPC capability gap analysis, free cooling, liquid cooling, innovative electrical distribution, sustainable HPC solutions, HPC platform power budgets, and power management.
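For reference, PUE is conventionally defined as the ratio of total facility energy to the energy consumed by the IT equipment itself, so a value of 1.0 would mean zero cooling and distribution overhead:

\mathrm{PUE} = \frac{E_{\text{total facility}}}{E_{\text{IT equipment}}}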
"We're using several new techniques with Sequoia that address sustainability issues," Bailey said. "At 2 gigaFLOP/s per watt, it will be the world's most power-efficient supercomputer."
More than 91% of Sequoia will be cooled using a combination of liquid-cooling and air-cooling techniques. Its efficient design also includes an innovative 480-V electrical distribution system, which provides improved voltage optimization to reduce losses.
"Even with these techniques, Sequoia will use enough energy to power 7,200 homes," Bailey said. "We've got to bring that number down as we plan for exascale."
One way to improve an HPC center's energy efficiency is to implement "free cooling," a technique that uses the outside air to drive the machine-cooling process. Bailey and her team have completed a study that shows it would take a $5.5M investment to implement free cooling in the TSF.
While the upfront funding might seem substantial, free cooling would save an estimated 16 million kWh per year and would pay for itself in four years. In addition, this design would allow Building 453, which houses the TSF, to increase its computational capacity from 30 MW to 45 MW and improve the overall facility PUE from 1.3 to 1.15.
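Those numbers are self-consistent if one assumes an electricity rate of roughly 8.5 cents per kWh (an assumption inferred from the stated figures, not given in the article):

16 \times 10^{6}\ \text{kWh/yr} \times \$0.086/\text{kWh} \approx \$1.4\text{M/yr}, \qquad \frac{\$5.5\text{M}}{\$1.4\text{M/yr}} \approx 4\ \text{years}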
The LLNL team is also actively pursuing power management solutions, which will be critical to the success of energy management and, ultimately, exascale computing. The team's solution has been to create and implement a centralized, real-time data management infrastructure spanning all data sources, from individual computer racks to the entire Laboratory site. This effort presents many challenges: understanding how different types of hardware and software affect power utilization, correlating multiple data sources, coordinating with multiple owners of the data, accessing the data, selecting the best interface, comparing and viewing the data on a common platform, and creating various dashboards. Once complete, the infrastructure can be extended to all LLNL data centers and perhaps throughout DOE.
"One of our goals is to create solutions that can be adopted by the entire DOE complex," Bailey said. "We share a commitment to mission excellence and are working closely together to use computational efficiency as a viable alternative to measuring advances in HPC sustainable stewardship."
Popular Mechanics Features Story on Sequoia
The Sequoia facility infrastructure project is well underway; however, the facility requirements are challenging in many areas, with heavy emphasis on structural, mechanical, and electrical systems pushing the envelope of the building. Each rack is 4'W x 4'L x 7'H and weighs 4,500 pounds. The entire system comprises 96 racks, which adds over 210 tons of weight to the computer floor. Each rack requires 30 GPM of water at 64°F to 74°F, an air supply of 1,700 CFM, and 100 kW of electrical power, for a system total of 9.6 MW. These technical challenges are coming together in a compartmentalized master space plan to accommodate all of the utilities underneath the floor. December's Popular Mechanics highlighted many of Sequoia's features.
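The system-level totals follow directly from the per-rack figures:

96 \times 4{,}500\ \text{lb} = 432{,}000\ \text{lb} \approx 216\ \text{tons}, \qquad 96 \times 100\ \text{kW} = 9.6\ \text{MW}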
ASC Salutes Joel Stevenson
Joel is a Principal Member of Technical Staff in the Scientific Applications and User Support department at Sandia National Laboratories. He currently supports customers working on the ACES (Alliance for Computing at Extreme Scale) Cielo platform located at Los Alamos National Laboratory (LANL).
Over the last year or so, Joel has been engaged in helping Sandia code teams and analysts port applications to Cielo. He has been instrumental in enabling applications to run at scale, in providing detailed evidence of failure modes, in helping to isolate and identify problems encountered with the Panasas file system, and in applying defensive work-arounds to let simulations progress.
“I really enjoy working with subject matter experts like Jason Wilke and Steve Attaway on real-world design problems, and I appreciate the opportunity to learn the application space,” says Joel. “Getting a chance to run the Sandia hydrodynamics code CTH over the last few years has provided me with an appreciation for the work of the code developers and analysts. I think first-hand experience running the code gives me a better appreciation for the issues that users face, and makes me a better resource for users.”
In addition to this analytic effort, Joel also managed the call for Capability Computing Campaign (CCC) proposals for Cielo. He led the Sandia process for both the initial CCC-1 series of workload requests and the current CCC-2 campaign. Before this assignment, Joel supported the data gathering and decision process for determining the Sandia workload on the Purple platform at Lawrence Livermore National Laboratory (LLNL). Soliciting project proposals, refining computing estimates, and ensuring that each project was successfully running on Purple exposed Joel to many Sandia codes and code team members, an experience that has helped establish his reputation as someone who has the “right stuff” to handle production computing issues.
Joel originally joined Sandia in 1986, working in the Materials Science and Technology Center doing electrochemistry research, and left in 1997 to co-found Peak Sensor Systems, a supplier to the microelectronics fabrication industry. When he returned to Sandia in 2005, Joel was looking for a new challenge. He spent a little time working with the High Performance Storage System (HPSS) team, where he gained an appreciation for the complexities of moving and managing large data sets. Initially, Joel assisted customers in transferring data off Sandia’s Red Storm computer system to centralized file systems on the Sandia Restricted Network and the Sandia Classified Network, as well as to the Sandia Mass Storage System tape-based archives that manage data using the HPSS servers.
This experience was beneficial as he progressed to supporting longer-distance data movement for customers using Purple at LLNL and Cielo at LANL. The bulk of this long-distance data transfer travels across the ASC-funded DisCom Wide Area Network (WAN). The DisCom WAN provides 10-gigabit Ethernet connections between the three laboratories’ classified computing environments. These interconnects require constant observation and analysis, as minor changes or error conditions can drastically alter the performance of data transfer between the sites. Joel became very familiar with the various tools and their capabilities for data movement and performance, and he was able to create a “cookbook” recommendation for analysts to refer to when they need to move data between systems, either locally or across the DisCom WAN.
As a User Support professional, Joel helps beginners learn about high performance computing systems and often acts on their behalf to diagnose problems or manage production runs efficiently. Joel has helped several projects run efficiently on Purple and Cielo by avoiding the little learning errors that can derail early adopters and delay progress toward deliverables. Most large simulations run for days to weeks and create thousands of files. Managing this complexity takes a thorough understanding of the codes, the file systems, and the limitations of the individual computing platform. Joel has become expert in all these areas, making him especially valuable to the ASC program and our nuclear weapons mission.