S1 Sunday, Full Day
Room A102/104/106

Title: Introduction to Clusters: Build Yourself a PC Cluster NOW!

Presenters: Kwai L. Wong and Christian Halloy, University of Tennessee/Oak Ridge National Laboratory

Level: 50% Introductory | 30% Intermediate | 20% Advanced

Abstract:

This will be a well-paced, information-rich full-day tutorial describing step-by-step, with live demos, how to build and utilize a Linux PC cluster for High Performance Computing. PC clusters are being used for parallel computing applications at many sites around the world, and they offer by far the best price/performance ratio. Linux-based PC clusters are receiving increasing attention, but putting them together remains a challenge for most scientists and researchers. This tutorial offers a unique opportunity for participants to learn how to effectively build a PC cluster on-site from scratch. Several computers will be used to demonstrate in detail how to install, configure, customize, and eventually compute on the newly built cluster.

This tutorial will have three parts. The first part will cover basic Linux installation on an individual workstation. The second part will focus on integrating additional PCs into a cluster through network configuration using a switch. The final part will emphasize the installation of several freely available scientific libraries, such as PVM, MPI, ScaLAPACK, PETSc, and Aztec. Benchmark computations will be carried out to demonstrate the functionality of the finished PC cluster.
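
For readers who have not yet run a parallel job on a cluster, the following minimal MPI program is a hedged sketch of the kind of sanity check the final part leads up to (an illustrative example, not taken from the tutorial materials); each process reports which node it is running on, confirming that the message-passing layer spans the newly configured machines.

    /* Illustrative sketch only: a minimal MPI "sanity check" for a freshly
     * built cluster.  Compile with mpicc and launch with mpirun across the
     * nodes to confirm that every node can join the MPI job. */
    #include <mpi.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        char host[256];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        gethostname(host, sizeof(host));   /* which node is this process on? */
        printf("Process %d of %d running on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }

A run such as "mpirun -np 4 ./hello" that prints one line per node is a quick first confirmation that the interconnect and MPI installation are working.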

The authors have conducted similar training courses and constructed a production research PC cluster for ORNL's Solid State Division in spring 2000. See http://www.jics.utk.edu/SSDcluster.html for additional information.



S2 Sunday, Full Day
Room A101/103/105

Title: Introduction to Effective Parallel Computing

Presenters: Quentin F. Stout and Christiane Jablonowski, University of Michigan

Level: 75% Introductory | 25% Intermediate

Abstract:

This tutorial will provide a comprehensive overview of parallel computing, emphasizing those aspects most relevant to the user. It is suitable for new users, managers, students and anyone needing a general overview of parallel computing. It discusses software and hardware, with an emphasis on standards, portability, and systems that are now (or soon will be) commercially or freely available. Computers examined range from low-cost clusters to highly integrated supercomputers. The tutorial surveys basic concepts and terminology, and gives parallelization examples selected from engineering, scientific, and data intensive applications. These real-world examples are targeted at distributed memory systems using MPI, and shared memory systems using OpenMP. The tutorial shows basic parallelization approaches and discusses some of the software engineering aspects of the parallelization process, including the use of tools. It also discusses techniques for improving parallel performance. It helps attendees make intelligent planning decisions by covering the primary options that are available, explaining how they are used and what they are most suitable for. The tutorial also provides pointers to the literature and web-based resources.


S3 Sunday, Full Day
Room A107

Title: Practical Automatic Performance Analysis

Presenters: Michael Gerndt, University of Technology Munich; Barton P. Miller, University of Wisconsin; Tomàs Margalef, Autonomous University of Barcelona; Bernd Mohr, Research Centre Juelich

Level: 10% Introductory | 50% Intermediate | 40% Advanced

Abstract:

Efficient usage of today's hierarchical clustered machines promises scalable high performance at low cost, but it often demands the use of more than one parallel programming model in the same application. As a consequence, performance analysis and tuning become more difficult, creating a need for advanced tools.

In recent years, progress has been made towards the design and implementation of automatic performance analysis tools. The first research tools are already available; they either analyze program traces automatically, e.g., Kappa-Pi and KOJAK, or go further and implement a fully automatic on-line search, e.g., Paradyn. These tools will be presented in this tutorial. In addition, the tutorial will give an overview of standard and new performance analysis techniques, a concise presentation of performance properties for MPI and OpenMP, and an overview of other automatic performance analysis tools not presented in this tutorial.

The tutorial will be a combination of presentations and online demonstrations. It will cover information that both application developers and tool developers will find useful.

The tutorial is given by members of the APART working group (Automatic Performance Analysis: Resources and Tools), which is funded by the European Commission. APART is a collaborative effort of more than twenty partners from the United States and Europe. Over the next few years, APART will coordinate several development projects for automatic performance analysis tools in Europe and the United States.



S4 Sunday, Full Day
Room A108

Title: Java for High Performance Computing: Performance and Parallelisation

Presenters: Lorna Smith, Mark Bull, and David Henty, Edinburgh Parallel Computing Centre, The University of Edinburgh

Level: 20% Introductory | 60% Intermediate | 20% Advanced

Abstract:

Java offers a number of benefits as a language for High Performance Computing (HPC), especially in the context of the Computational Grid. For example, Java offers a high level of platform independence not observed with traditional HPC languages. This is crucial in an area where the lifetime of application codes exceeds that of most machines. In addition, the object-oriented nature of Java facilitates code re-use and reduces development time.

There are, however, a number of issues surrounding the use of Java for HPC, principally performance, numerics, and parallelism. EPCC is leading the Benchmarking initiative of the Java Grande Forum, which is specifically concerned with performance. The tutorial will focus on this work, examining performance issues relevant to HPC applications. It will consider benchmarks for evaluating different Java environments, for inter-language comparisons, and for testing the performance and scalability of different Java parallel models (native threads, message passing, and OpenMP).

The aim is to demonstrate that performance no longer prohibits Java as a base language for HPC and that the available parallel models offer realistic mechanisms for the development of parallel applications. The tutorial will include a number of practical coding sessions which reinforce the concepts described in the lectures.


S5 Sunday, Full Day
Room A110

Title: High-Performance Numerical Linear Algebra: Fast and Robust Kernels for Scientific Computing

Presenters: Jack Dongarra, University of Tennessee; Iain Duff, Rutherford Appleton Laboratory; Danny Sorensen, Rice University; Henk van der Vorst, Utrecht University

Level: 20% Introductory | 50% Intermediate | 30% Advanced

Abstract:

Present computers, even workstations and personal computers, allow the solution of very large-scale problems in science and engineering. A major part of the computational effort goes into solving linear algebra subproblems. We will discuss a variety of algorithms for these problems, indicating where each is appropriate and emphasizing their efficient implementation. Many of the sequential algorithms used satisfactorily on traditional machines fail to exploit the architecture of modern computers. We will consider techniques devised to utilize modern architectures more fully, especially the design of the Level 1, 2, and 3 BLAS, LAPACK, and ScaLAPACK.
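
To make the BLAS discussion concrete, here is a hedged sketch of a Level 3 BLAS call through a C interface (CBLAS), as provided for example by ATLAS or vendor BLAS libraries; the tiny matrices and row-major layout are illustrative choices, not material from the tutorial.

    /* Illustrative sketch: computing C = A*B with the Level 3 BLAS routine
     * DGEMM via the CBLAS interface.  Link against a BLAS library. */
    #include <cblas.h>
    #include <stdio.h>

    int main(void)
    {
        /* Two 2 x 2 row-major matrices (illustrative values). */
        double A[4] = {1.0, 2.0, 3.0, 4.0};
        double B[4] = {5.0, 6.0, 7.0, 8.0};
        double C[4] = {0.0, 0.0, 0.0, 0.0};

        /* C = 1.0 * A * B + 0.0 * C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

        printf("%g %g\n%g %g\n", C[0], C[1], C[2], C[3]);
        return 0;
    }

The point of the Level 3 interface is that a single such call lets the library perform blocking and other architecture-specific optimizations internally, rather than leaving them to hand-written triple loops.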

For large sparse linear systems, we will give an introduction to the field and guidelines on the selection of appropriate software. We will consider both direct and iterative methods of solution. In the case of direct methods, we will emphasize frontal and multifrontal methods, including variants that perform well on parallel machines. For iterative methods, our discussion will include CG, MINRES, SYMMLQ, BiCG, QMR, CGS, BiCGSTAB, GMRES, and LSQR. For large sparse eigenproblems, we will discuss some of the most widely used methods such as Lanczos, Arnoldi, and Jacobi-Davidson. The Implicitly Restarted Arnoldi Method will be introduced along with ARPACK, the software based upon that method.


S6 Sunday, Full Day
Room A201

Title: Sharable and Scalable I/O Solutions for High Performance Computing Applications

Presenters: Larry Schoof, Sandia National Laboratory; Mark Miller, Lawrence Livermore National Laboratory; Mike Folk and Albert Cheng, National Center for Supercomputing Applications

Level: 30% Introductory | 50% Intermediate | 20% Advanced

Abstract:

Two challenges facing HPC applications are the need to improve I/O performance and the need to share complex scientific data and data analysis software. The computational times for HPC applications have decreased in recent years by 2-3 orders of magnitude, but unfortunately I/O performance has not kept pace with these impressive increases in raw compute power. Scientific data and the tools for working with them have also evolved, from application “stovepipes” (little or no data interoperability) to a recognition of the value of large-scale integration (full data interoperability) that facilitates sharing data and tools among applications and across a varied and changing landscape of computing environments. In this tutorial we discuss two complementary I/O libraries that address these issues. The first, Hierarchical Data Format Version 5 (HDF5), represents and operates on scientific data as concrete arrays. The second, the Sets and Fields (SAF) data modeling system, represents and operates on scientific data as abstract fields. Building upon HDF5 as a foundation, SAF encapsulates parallel and scientific constructs intrinsically to provide greater sharability of data and interoperability of software.
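
For readers unfamiliar with HDF5, the following hedged sketch illustrates the "concrete array" view: a file, a dataspace describing array extents, and a dataset holding the values. It is an illustrative example assuming the HDF5 1.8-or-later C API (H5Dcreate2), not code from the tutorial, and the dataset name is a made-up placeholder.

    /* Illustrative sketch: writing a small 2-D array with the HDF5 C API
     * (assumes HDF5 1.8 or later for H5Dcreate2). */
    #include <hdf5.h>

    int main(void)
    {
        double  data[4][6];
        hsize_t dims[2] = {4, 6};

        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 6; j++)
                data[i][j] = i * 6 + j;

        hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC,
                                H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(2, dims, NULL);     /* array extents */
        hid_t dset  = H5Dcreate2(file, "/temperature", H5T_NATIVE_DOUBLE,
                                 space, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL,
                 H5P_DEFAULT, data);                        /* write the array */

        H5Dclose(dset);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }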


S15 Sunday, Full Day
Room C102 - C104

Title: High Performance Computing: What Role for the Individual Microprocessor, if Any?

Presenters: Yale N. Patt, The University of Texas at Austin

Level: 30% Introductory | 40% Intermediate | 30% Advanced

Abstract:

High performance computing applications continue to demand more and more performance capability. Where does the individual microprocessor fit? Process technology promises one billion transistors on each silicon die, running at 10 GHz, within a few years. Can that technology be harnessed, or are the naysayers right that Moore's Law is dead and the problems of increasing single-chip performance are just too hard? This tutorial will try to do several things. We will look at the arguments of the naysayers and point out why they should not deter us. We will explore the bottlenecks of a microarchitecture vis-a-vis high performance and describe how we are overcoming them. We will examine the relevant characteristics of some state-of-the-art microprocessors. Finally, we will discuss what we might see on a chip five years from now. More on the presenter can be found at http://www.ece.utexas.edu/~patt.


S7 Sunday, Half Day, AM
Room A205

Title: Introduction to Parallel Programming with OpenMP

Presenters: Tim Mattson, Intel Corporation; Rudolf Eigenmann, Purdue University

Level: 75% Introductory | 20% Intermediate | 5% Advanced

Abstract:

OpenMP is an Application Programming Interface for directive-driven parallel programming of shared memory computers. Fortran, C, and C++ compilers supporting OpenMP are available for Unix and Windows workstations. Most vendors of shared memory computers are committed to OpenMP, making it the de facto standard for writing portable, shared memory, parallel programs. This tutorial will provide a comprehensive introduction to OpenMP. We will start with basic concepts to bring the novice up to speed. We will then present a few more advanced examples to give some insight into the issues that come up for experienced OpenMP programmers.
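
As a flavour of the basics covered, here is a hedged sketch of a directive-parallelized loop in C (an illustrative example, not drawn from the tutorial notes); the reduction clause is the standard way to accumulate a sum across threads.

    /* Illustrative sketch: a serial loop parallelized with one OpenMP
     * directive.  Compile with an OpenMP-enabled compiler (e.g. -fopenmp). */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        const int n = 1000000;
        double sum = 0.0;

        /* Each thread gets a chunk of iterations and a private partial sum;
         * the reduction clause combines the partial sums at the end. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1.0);

        printf("sum = %f (up to %d threads)\n", sum, omp_get_max_threads());
        return 0;
    }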

Related Tutorial: S11, Advanced Parallel Programming with OpenMP (Sunday, Half Day, PM)



S8 Sunday, Half Day, AM
Room A207

Title: Understanding Network Performance

Presenters: Phillip Dykstra, WareOnEarth Communications Inc.

Level: 30% Introductory | 50% Intermediate | 20% Advanced

Abstract:

Supercomputers today are usually connected to local and wide area networks capable of transferring data at hundreds of megabits per second or more. Most remote users, however, see only a fraction of that potential. Wide area transfer rates of less than ten million bits per second are still commonplace. Why is this, and what can users and administrators of networks and systems do to improve the situation?

This tutorial will introduce the environment of high performance networking today. The behavior of TCP will be explained in detail along with the factors that limit its performance. Included are well-known tuning issues such as "window sizes" and lesser-known factors such as packet size, loss rates, and delay. Recent and proposed performance-related protocol changes will be discussed.
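
As a rough worked example of the window-size issue (with illustrative numbers, not figures from the tutorial): TCP can keep at most one window of data in flight per round trip, so the achievable rate is bounded by the window size divided by the round-trip time. Sustaining 100 megabits per second over a path with a 70 ms round-trip time therefore requires a window of at least 100 Mbit/s x 0.07 s, which is about 875 kilobytes, more than ten times the 64 KB limit of classic TCP windows without the window-scaling option.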

Numerous tools are introduced that can be used to measure, debug, and tune end systems and networks. You will learn how these tools work and what they tell you. Attendees should come away with a better understanding of what is happening to their data on the network and what is required to achieve higher performance.



S9 Sunday, Half Day, AM
Room A209

Title: The Emerging Grid: Introduction, Tools, Applications

Presenters: Ian Foster, Argonne National Laboratory; Ed Seidel, The Max Planck Institute for Gravitational Physics, Albert Einstein Institute

Level: 85% Introductory | 15% Intermediate | 0% Advanced

Abstract:

Grid computing is currently being deployed to solve some of our most challenging computing problems. This tutorial is for people interested in getting acquainted with grid computing technologies and approaches and in exploring how to apply grid technologies to their own large-scale computing problems. It will also be of interest to those with previous exposure to the concept who seek an update on how grid techniques have matured and solidified in the past year.

The tutorial will explore—largely through first-hand examples—how to apply grid techniques to complex problems in scientific and engineering computation. The tutorial provides a pragmatic overview of the grid concept, based on the latest models of grid architecture. It surveys several technologies that can be used to construct grids, focusing on the Globus Toolkit, Condor, GridPort, and Cactus. It illustrates the accomplishments, plans, and challenges faced by large Grid projects including the Grid Physics Network, the Particle Physics Data Grid, the NASA Information Power Grid, the Network for Earthquake Engineering Simulation Grid, and the Earth Systems Grid. It also includes a brief review of current research efforts to extend the scope, utility, and ease of grid computing.



S10 Sunday, Half Day, AM
Room A112

Title: Mixed-Mode Programming Introduction

Presenters: Daniel Duffy and Mark R. Fahey, Computer Sciences Corporation - Engineer Research and Development Center Major Shared Resource Center

Level: 40% Introductory | 40% Intermediate | 20% Advanced

Abstract:

This tutorial will discuss the benefits and pitfalls of multilevel parallelism (MP) using the Message Passing Interface (MPI) combined with threads. Examples from the authors' experiences will be discussed to motivate why multilevel parallelism is beneficial. While a general knowledge of MPI is assumed, the presentation will introduce both OpenMP directives and Pthreads. Furthermore, starting from the context of multithreading an existing MPI application, a general method for introducing threads will be discussed.
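
For orientation, the following hedged sketch (an illustrative construction, not the authors' code) shows the common "funneled" mixed-mode pattern: MPI handles communication between processes while OpenMP threads share the work inside each process, and MPI calls are made only outside the threaded region.

    /* Illustrative sketch of mixed-mode parallelism: MPI between processes,
     * OpenMP threads within each process. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, provided;
        double local = 0.0, global = 0.0;

        /* Request thread support where only the main thread calls MPI. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Threads share this process's portion of the work. */
        #pragma omp parallel for reduction(+:local)
        for (int i = rank; i < 1000000; i += size)
            local += 1.0 / (i + 1.0);

        /* Back on the main thread: combine the per-process results. */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum = %f\n", global);

        MPI_Finalize();
        return 0;
    }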

Included will be discussions of the pros and cons of various tools that can be used across platforms to help the application developer optimize and debug an MP program. Sample codes comparing different MP methods will also be shown, along with their resulting speedups. Finally, lessons learned from the authors' unsuccessful experiences will be presented.


S11 Sunday, Half Day, PM
Room A205

Title: Advanced Parallel Programming with OpenMP

Presenters: Tim Mattson, Intel Corporation; Rudolf Eigenmann, Purdue University

Level: 10% Introductory | 50% Intermediate | 40% Advanced

Abstract:

OpenMP is rapidly becoming the programming model of choice for shared-memory machines. After a very brief overview of OpenMP basics we will move on to intermediate and advanced topics, such as advanced OpenMP language features, traps that programmers may fall into, and a more extensive outlook on future OpenMP developments. We will also briefly discuss mixing OpenMP with message passing applications written in MPI. We will present many examples of OpenMP programs and discuss their performance behavior.
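
One of the classic traps alluded to above is incorrect data sharing. The hedged sketch below (an illustrative example, not taken from the tutorial) shows a loop temporary that is shared by default and therefore races, together with the private clause that fixes it.

    /* Illustrative sketch of a common OpenMP trap: the temporary 't' is
     * shared by default, so threads overwrite each other's value and the
     * results are wrong (and vary from run to run). */
    #include <math.h>

    void scale_buggy(double *x, int n)
    {
        double t;
        #pragma omp parallel for        /* BUG: 't' is shared */
        for (int i = 0; i < n; i++) {
            t = sqrt(fabs(x[i]));
            x[i] = t * t + t;
        }
    }

    /* Fixed version: declaring 't' private gives each thread its own copy. */
    void scale_fixed(double *x, int n)
    {
        double t;
        #pragma omp parallel for private(t)
        for (int i = 0; i < n; i++) {
            t = sqrt(fabs(x[i]));
            x[i] = t * t + t;
        }
    }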



S12 Sunday, Half Day, PM
Room A207

Title: Achieving Network Performance

Presenters: John Estabrook and Jim Ferguson, National Laboratory for Applied Network Research and the National Center for Supercomputing Applications

Level: 20% Introductory | 60% Intermediate | 20% Advanced

Abstract:

High-bandwidth Wide Area Networks (WANs) deployed in recent years by various Federal agencies and others have brought sky-high expectations and an equal amount of disappointment to many who have used them. The problems of poor end-to-end performance on what should be a fast network connection mostly lie closer to the ends of the network than to the well-engineered backbones. Application specialists and engineers with the National Laboratory for Applied Network Research (NLANR, www.nlanr.net) have developed tools and collected knowledge that can assist both application developers and their local network support staff. This tutorial will specifically address typical problems encountered when applications that run successfully on a Local Area Network are ported to run on a Wide Area Network. It will complement the tutorial "Understanding Network Performance," which focuses on the underlying issues of TCP. This tutorial will focus on application-level issues, resources for network monitoring, and the "state of the backbone."



S13 Sunday, Half Day, PM
Room A209

Title: Data Grids: Drivers, Technologies, Opportunities

Presenters: Ann Chervenak, USC Information Sciences Institute; Michael Wilde, Argonne National Laboratory

Abstract:

In numerous scientific, engineering, and business disciplines, terabyte- and petabyte-scale data collections are emerging as critical resources. These data sets must be shared by large communities of users that pool resources from a large number of institutions. This two-part tutorial shows how to design and implement new information infrastructures called "Data Grids" to access and analyze the enormous distributed datasets employed by these communities. Part 1 surveys the current body of data grid concepts and techniques. It details the goals, requirements, and architectures of both deployed and proposed data grids. Examples will be drawn from case studies and detailed requirements analyses from the physics, climate science, and engineering communities. Part 2 presents data grid implementation tools and techniques. We start by examining how to use Grid-enabled data transport and file replication components in application environments. We then focus on Grid-enabling applications directly with Data Grid toolkit components, and conclude with illustrations of integrating components of the Globus Toolkit for security, policy management, and resource monitoring with data management capabilities.



S14 Sunday, Half Day, PM
Room A112

Title: An Introduction to the TotalView Debugger

Presenters: Blaise M. Barney, Lawrence Livermore National Laboratory

Level: 60% Introductory | 20% Intermediate | 20% Advanced

Abstract:

The TotalView debugger has become a "de facto standard" tool within the High Performance Computing industry for debugging cross-platform, cross-language, multi-model parallel applications. TotalView's easy-to-use graphical user interface provides the means to see what an application is "really" doing at the deepest level. TotalView has been selected by the U.S. Department of Energy as the debugger software of choice for its Accelerated Strategic Computing Initiative (ASCI) program. TotalView has likewise been selected by a growing number of telco, petroleum, aerospace, university and HPC organizations as their debugger of choice.

This tutorial will begin by covering all of the essentials for using TotalView in a general programming environment. After covering these essentials, an emphasis will be placed upon debugging parallel programs, including threaded, MPI, OpenMP and hybrid programs. Though this tutorial would be best accompanied by hands-on exercises, the attendee will benefit from the many graphical examples and "screen captures" of TotalView debug sessions. This tutorial will conclude with examples and suggestions for the sometimes challenging task of debugging programs while they are executing in a batch system.


M1 Monday, Full Day
Room A102

Title: Advanced Topics in HPC Linux Cluster Design and Administration

Presenters: Troy Baer and Doug Johnson, Ohio Supercomputer Center

Level: 10% Introductory | 60% Intermediate | 30% Advanced

Abstract:

This tutorial describes a methodology for designing, installing, and administering a cluster of commodity computers as a resource for high performance computing in a production environment. This methodology is a result of past experience in cluster computing at OSC and elsewhere, and is used on both OSC's production cluster systems and on the distributed set of clusters deployed by OSC's Cluster Ohio project.

The tutorial discusses system design and installation, software configuration, high performance networks, resource management, scheduling, and performance monitoring. Wherever possible, currently available technologies and best current practice are described and related back to a common configuration, that of a cluster of dual-processor IA32 nodes interconnected by Myrinet. Other architectures and interconnect technologies are also discussed.



M3 Monday, Full Day
Room A201

Title: Securing Your Network

Presenters: Paula C. Albrecht, Crystal Clear Computing, Inc.

Level: 40% Introductory | 50% Intermediate | 10% Advanced

Abstract:

Pick up a newspaper, turn on the news, or read a magazine: security is a hot topic. Each year, the number of system and network attacks from the Internet continues to rise. Organizations need to be aware of these attacks and how to protect their systems and networks. This tutorial provides an overview of common attacks and the technologies available to protect your network against them. We will first take a look at who is attacking networks and the types of attacks they are using. Then we will examine techniques available for securing your systems and networks. These include cryptography, public key infrastructure (PKI), firewalls, virtual private networks (VPNs), and intrusion detection. We will introduce security terminology and network security concepts. Some detailed information on how the security technologies work, along with example uses, will also be provided. If you are concerned about the security of your systems and networks, or have ever wondered what network security is all about, this tutorial is for you.


M4 Monday, Full Day
Room 205

Title: Using MPI-2: Advanced Features of the Message-Passing Interface

Presenters: William Gropp, Ewing (Rusty) Lusk, and Rob Ross, Argonne National Laboratory; Rajeev Thakur, PRISMedia Networks, Inc.

Level: 20% Introductory | 40% Intermediate | 40% Advanced

Abstract:

This tutorial is about how to use MPI-2, the collection of advanced features that were added to MPI (Message-Passing Interface) by the second MPI Forum. These features include parallel I/O, one-sided communication, dynamic process management, language interoperability, and some miscellaneous features. Implementations of MPI-2 are beginning to appear: a few vendors have complete implementations; other vendors and research groups have implemented subsets of MPI-2, with plans for complete implementations.

This tutorial explains how to use MPI-2 in practice and, in particular, how to use it in a way that results in high performance. We present each feature of MPI-2 in the form of a series of examples (in C, Fortran, and C++), starting with simple programs and moving on to more complex ones. We also discuss how to combine MPI with OpenMP. We assume that attendees are familiar with the basic message-passing concepts of MPI-1.
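
As a hedged taste of the parallel I/O interface (an illustrative fragment in the spirit of the tutorial's examples, not taken from them, with a made-up file name), each process below writes its own block of integers into a single shared file at a rank-dependent offset.

    /* Illustrative sketch of MPI-2 parallel I/O: every process writes its
     * own contiguous block into one shared file. */
    #include <mpi.h>

    #define N 1024   /* elements written by each process (illustrative) */

    int main(int argc, char **argv)
    {
        int rank, buf[N];
        MPI_File fh;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (int i = 0; i < N; i++)
            buf[i] = rank;                      /* something recognizable */

        MPI_File_open(MPI_COMM_WORLD, "data.out",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Each rank writes at its own byte offset in the common file. */
        MPI_File_write_at(fh, (MPI_Offset)rank * N * sizeof(int),
                          buf, N, MPI_INT, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }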


M5 Monday, Full Day
Room A207

Title: Extreme! Scientific Parallel Computing

Presenters: Alice E. Koniges, David C. Eder, and David E. Keyes, Lawrence Livermore National Laboratory; Rolf Rabenseifner, High Performance Computing Center – Stuttgart

Level: 25% Introductory | 40% Intermediate | 35% Advanced

Abstract:

Teraflop performance is no longer a thing of the future. Indeed, advances in application computing continue to boggle the mind. What does it really take to get a major application performing at the "extreme" level? How do the challenges vary from cluster computing to the largest architectures? In the introductory material, we provide an overview of terminology, hardware, performance issues and software tools. Then, we draw from a series of large-scale application codes and discuss specific challenges and problems encountered in parallelizing these applications. The applications, some of which are drawn from a new book ("Industrial Strength Parallel Computing," Morgan Kaufmann Publishers, 2000), are a mix of industrial and government applications including aerospace, biomedical sciences, materials processing and design, and plasma and fluid dynamics. We also consider applications that were winners of Gordon Bell prizes for parallel performance. Advanced topics cover parallel I/O and file systems and combining MPI with Pthreads and/or OpenMP.


M6 Monday, Full Day
Room A209

Title: Programming with the Distributed Shared-Memory Model

Presenters: William Carlson, IDA Center for Computing Sciences; Tarek El-Ghazawi, The George Washington University; Bob Numrich, Cray Inc.; Kathy Yelick, University of California at Berkeley

Level: 30% Introductory | 50% Intermediate | 20% Advanced

Abstract:

The distributed shared-memory programming paradigm has been receiving increasing attention. Recent developments have resulted in viable distributed shared memory languages that are gaining vendor support, and several early compilers have been developed. This programming model has the potential of achieving a balance between ease of programming and performance. As in the shared-memory model, programmers need not explicitly specify data accesses. Meanwhile, programmers can exploit data locality using a model that enables the placement of data close to the threads that process them, to reduce remote memory accesses.

In this tutorial, we present the fundamental concepts associated with this programming model. These include execution models, synchronization, workload distribution, and memory consistency. We then introduce the syntax and semantics of three parallel programming languages of growing interest: Unified Parallel C (UPC), a parallel extension to ANSI C developed by a consortium of academia, industry, and government; Co-Array Fortran, developed at Cray; and Titanium, a Java-based language from UC Berkeley. It will be shown through experimental case studies that optimized distributed shared memory programs can be competitive with message passing codes, without a significant departure from the ease of programming of the shared memory model.


M7 Monday, Full Day
Room A107

Title: Data Mining for Scientific and Engineering Applications

Presenters: Robert Grossman, University of Illinois at Chicago & Magnify, Inc.; Chandrika Kamath, Lawrence Livermore National Laboratory; Vipin Kumar, Army High Performance Research Center, University of Minnesota

Level: 50% Introductory | 30% Intermediate | 20% Advanced

Abstract:

Due to advances in information technology and high performance computing, very large data sets are becoming available in many scientific disciplines. The rate of production of such data far outstrips our ability to analyze it manually. For example, a computational simulation can generate terabytes of data within a few hours, whereas human analysts may take several weeks to analyze these data sets. Other examples include several digital sky surveys and data sets from the fields of medical imaging, bioinformatics, and remote sensing. As a result, there is increasing interest in various scientific communities in exploring the use of emerging data mining techniques for the analysis of these large data sets.

Data mining is the semi-automatic discovery of patterns, associations, changes, anomalies, and statistically significant structures and events in data. Traditional data analysis is assumption driven: a hypothesis is formed and validated against the data. Data mining, in contrast, is discovery driven: patterns are automatically extracted from the data. The goal of the tutorial is to provide researchers and practitioners in the area of supercomputing with an introduction to data mining and its application to several scientific and engineering domains, including astrophysics, medical imaging, computational fluid dynamics, structural mechanics, and ecology.


M8 Monday, Half Day, AM
Room A108

Title: Performance Tuning Using Hardware Counter Data

Presenters: Shirley Moore, University of Tennessee; Nils Smeds, Parallelldatorcentrum

Level: 30% Introductory | 40% Intermediate | 30% Advanced

Abstract:

This tutorial concerns the Performance Application Programming Interface (PAPI) and its use by HPC application developers as well as HPC tool developers. PAPI is a specification and reference implementation of a cross-platform interface to hardware performance counters. Using PAPI, a developer need not re-adapt his or her measurement techniques for each new hardware platform the application is to run on. Furthermore, PAPI implements the abstractions needed by third-party performance evaluation tools. With a mature, platform-independent interface to the hardware counters, tool developers can dedicate their efforts to enhancing the functionality of their tools without the need to re-adapt the tool for new hardware platforms. The tutorial will cover the use of PAPI directly by an application developer as well as the mechanisms by which PAPI provides a platform-independent abstraction for tool developers. See http://icl.cs.utk.edu/projects/papi/ for more information about PAPI.
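
A hedged sketch of the direct-use case follows; the event choice and the measured kernel are illustrative, error checking is omitted for brevity, and the code is not taken from the tutorial. It counts floating-point operations and total cycles around a code section using PAPI's event-set calls.

    /* Illustrative sketch: measuring a code section with PAPI preset events.
     * Real code should check the return value of every PAPI call. */
    #include <papi.h>
    #include <stdio.h>

    static volatile double sink;

    static void compute_kernel(void)        /* placeholder work to measure */
    {
        double s = 0.0;
        for (int i = 1; i <= 1000000; i++)
            s += 1.0 / i;
        sink = s;                            /* keep the loop from being removed */
    }

    int main(void)
    {
        int event_set = PAPI_NULL;
        long long counts[2];

        PAPI_library_init(PAPI_VER_CURRENT);
        PAPI_create_eventset(&event_set);
        PAPI_add_event(event_set, PAPI_FP_OPS);   /* floating-point operations */
        PAPI_add_event(event_set, PAPI_TOT_CYC);  /* total cycles              */

        PAPI_start(event_set);
        compute_kernel();
        PAPI_stop(event_set, counts);

        printf("FP ops: %lld  cycles: %lld\n", counts[0], counts[1]);
        return 0;
    }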



M9 Monday, Half Day, AM
Room A110

Title: Benchmarks, Results, and Tricks the Vendors Don't Tell You

Presenters: Robb Graham and Henry Newman, Instrumental Inc

Level: 10% Introductory | 50% Intermediate | 40% Advanced

Abstract:

Benchmarks are often performed to determine which vendor's machine is best suited to the customer's needs. These benchmarks must be constructed in a fashion that ensures the results are an accurate representation of those needs. Vendors, in their efforts to enhance the benchmarks, may over-optimize the code and system. This over-optimization inflates the customer's expectations of timeline performance. Solid benchmarking techniques can help mitigate this problem. This tutorial will cover how to create rules and benchmarks for accurate performance predictions, and how to use these benchmarks for timeline performance modeling and prediction.


M10 Monday, Half Day, AM
Room A112

Title: Parallel Partitioning Software for Static, Adaptive, and Multi-phase Computations

Presenters: George Karypis, University of Minnesota; Karen Devine, Sandia National Laboratories

Level: 25% Introductory | 50% Intermediate | 25% Advanced

Abstract:

In recent years, a number of scalable, high-quality partitioning algorithms have been developed that are used extensively for decomposing scientific computations on parallel computers. The goal of this tutorial is to provide an overview of stand-alone graph partitioning packages (ParMetis and Jostle) and of higher-level tools for load balancing adaptive computations (Zoltan and DRAMA). The tutorial will cover both static and dynamic computations as well as recently developed algorithms and software packages suited for emerging multi-physics and multi-phase computations.


M11 Monday, Half Day, PM
Room A108

Title: Performance Technology for Complex Parallel Systems

Presenters: Allen D. Malony and Sameer Shende, University of Oregon; Bernd Mohr, Research Centre Juelich

Level: 10% Introductory | 50% Intermediate | 40% Advanced

Abstract:

Fundamental to the development and use of parallel systems is the ability to observe, analyze, and understand their performance. However, the growing complexity of parallel systems challenges performance technologists to produce tools and methods that are at once robust (scalable, extensible, configurable) and ubiquitous (cross-platform, cross-language). This half-day tutorial will focus on performance analysis in complex parallel systems, which include multi-threading, clusters of SMPs, mixed-language programming, and hybrid parallelism. Several representative complexity scenarios will be presented to highlight two fundamental performance analysis concerns: 1) the need for tight integration of performance observation (instrumentation and measurement) technology with sophisticated programming environments and system platforms, and 2) the ability to map execution performance data to high-level programming abstractions implemented on layered, hierarchical software systems. The tutorial will describe the TAU performance system in detail and demonstrate how it is used to successfully address the performance analysis concerns in each complexity scenario discussed. Tutorial attendees will be introduced to TAU's instrumentation, measurement, and analysis tools, and shown how to configure the TAU performance system for specific needs. A description of future enhancements of the TAU performance framework, including a demonstration of a prototype for automatic bottleneck analysis, will conclude the tutorial.



M12 Monday, Half Day, PM
Room A110

Title: The InfiniBand Architecture: What Does It Bring to High Performance Computing?

Presenter: Dhabaleswar K. Panda, The Ohio State University

Level: 20% Introductory | 40% Intermediate | 40% Advanced

Abstract:

The emerging InfiniBand Architecture (IBA) standard is generating a lot of excitement about building next-generation computing systems in a radically different manner. This is leading to the following common questions among many scientists, engineers, managers, developers, and users associated with High-Performance Computing (HPC): 1) What is IBA? 2) How is it different from other on-going developments and standardization efforts such as the Virtual Interface Architecture (VIA), PCI-X, Rapid I/O, etc.? and 3) What unique features and benefits does IBA bring to HPC?

This tutorial is designed to provide answers to the above questions. We will start with the background behind the origin of the IBA standard. Then we will familiarize attendees with the novel features of IBA (such as elimination of the standard PCI-bus-based architecture, provision for multiple transport services, mechanisms to support QoS and protection in the network, uniform treatment of interprocessor communication and I/O, and support for low-latency communication with the Virtual Interface). We will compare and contrast the IBA standard with other on-going developments and standards. We will show how the IBA standard enables next-generation computing systems to be designed to deliver not only high performance but also RAS (Reliability, Availability, and Serviceability). Open research challenges in designing IBA-based HPC systems will be outlined. The tutorial will conclude with an overview of on-going IBA-related research projects and products.

More information on this tutorial and the speaker can be obtained from:
http://www.cis.ohio-state.edu/~panda/sc01_tut.html


M13 Monday, Half Day, PM
Room A112

Title: Cache-Based Iterative Algorithms

Presenters: Ulrich J. Ruede, University of Erlangen; Craig Douglas, Center for Computational Sciences, University of Kentucky

Level: 30% Introductory | 50% Intermediate | 20% Advanced

Abstract:

In order to mitigate the effect of the gap between the high execution speed of modern RISC CPUs and the comparatively poor performance of main memory, computer architectures nowadays include several additional levels of smaller and faster cache memories, located physically between the processor and main memory. Efficient program execution, i.e., high MFLOPS rates, can only be achieved if codes respect this hierarchical memory design. Unfortunately, today's compilers are still far from automatically performing code transformations like the ones we apply to achieve remarkable speedups. As a consequence, much of this optimization effort is left to the programmer.

In this tutorial, we will first discuss the underlying hardware properties and then present both data layout optimizations, such as array padding, and data access optimizations, such as loop blocking. The application of these techniques to iterative numerical schemes such as Gauss-Seidel and multigrid can significantly enhance their cache performance and thus reduce their execution times on a variety of machines. We will consider both structured and unstructured grid computations.
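
A hedged sketch of loop blocking follows (an illustrative example with an arbitrary block size, not code from DiMEPACK or the tutorial): the familiar triple loop of matrix multiplication is reorganized so that small sub-blocks of the matrices are reused while they still reside in cache.

    /* Illustrative sketch: loop blocking (tiling) applied to matrix
     * multiplication.  BS is the block size and should be tuned so that
     * three BS x BS tiles fit in cache; 64 is only an illustrative value.
     * The caller is assumed to have zero-initialized c[]. */
    #define BS 64

    void matmul_blocked(int n, const double *a, const double *b, double *c)
    {
        for (int ii = 0; ii < n; ii += BS)
            for (int kk = 0; kk < n; kk += BS)
                for (int jj = 0; jj < n; jj += BS)
                    /* work on one BS x BS tile at a time */
                    for (int i = ii; i < ii + BS && i < n; i++)
                        for (int k = kk; k < kk + BS && k < n; k++) {
                            double aik = a[i * n + k];
                            for (int j = jj; j < jj + BS && j < n; j++)
                                c[i * n + j] += aik * b[k * n + j];
                        }
    }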

These techniques have been implemented in our multigrid library DiMEPACK, which is freely available on the web. The use of this library will also be discussed.