AGENDA

7:30-8:30 AM        Registration

8:30-8:45 AM        Welcome and Opening Remarks

Neena Imam, Oak Ridge National Laboratory

Bio: Dr. Neena Imam is a distinguished research scientist in the Computing and Computational Sciences Directorate (CCSD) at Oak Ridge National Laboratory (ORNL), performing research in extreme-scale computing. She has also served as the Director of Research Collaboration for CCSD for the last six years. Dr. Imam holds a doctoral degree in Electrical Engineering from the Georgia Institute of Technology, with Master's and Bachelor's degrees in the same field from Case Western Reserve University and the California Institute of Technology, respectively. She also served as the Science and Technology Fellow for Senator Lamar Alexander in Washington, D.C. (2010-2012).

8:45-9:30 AM        Keynote Address - Rising Power of the Network User

Biswanath Mukherjee, University of California, Davis

Abstract: The network user, armed with increasingly capable smart devices, is becoming more powerful. The user is demanding that high-performance computing (HPC) infrastructures become more agile, and that the services they facilitate become more application-centric, leading to an exceptional user experience. As applications become more distributed and dynamic, and with the adoption of micro-services, the application components need to be "chased" for monitoring and visibility. Noting also that different applications have different tolerances for available bandwidth, latency, etc., novel methods are required to develop real-time, application-centric user-experience analytics for HPC infrastructures. Given the huge data volumes involved, appropriate AI and machine-learning methods need to be applied.

Bio: Prof. Biswanath Mukherjee received the BTech (Hons) degree from the Indian Institute of Technology, Kharagpur (1980) and the Ph.D. from the University of Washington, Seattle (1987). He was General Co-Chair of the IEEE/OSA Optical Fiber Communications (OFC) Conference 2011, TPC Co-Chair of OFC 2009, and TPC Chair of IEEE INFOCOM'96. He is Editor of Springer's Optical Networks Book Series. He has served on eight journal editorial boards, most notably IEEE/ACM Transactions on Networking and IEEE Network. He is the winner of the 2004 Distinguished Graduate Mentoring Award, the 2009 College of Engineering Outstanding Senior Faculty Award, and the 2016 UC Davis International Community Building Award at UC Davis. He is a co-winner of 11 Best Paper Awards, mostly from IEEE conferences. He is the author of the textbook Optical WDM Networks (Springer, January 2006). He is Founder and President of Ennetix, Inc. He is also the winner of IEEE ComSoc's inaugural Outstanding Technical Achievement Award "for pioneering work on shaping the optical networking area" and is an IEEE Fellow.

9:30-9:45 AM      Short Break

Session 1: Next Big Things in Networking

9:45-11:50 AM

Moderator: Neena Imam, Oak Ridge National Laboratory

This session will explore the current state, recent developments, and future advances in developing and testing the networking and related capabilities needed to access, support, and serve data distributed across disparate time-space distances.

9:45-10:10 AM   Talk 1 - Science Elastic Optical Inter-Network (SEOIN)

Malathi Veeraraghavan, University of Virginia

Abstract: The overall vision is to realize a global-scale Science Elastic Optical Inter-Network (SEOIN) that offers high-speed end-to-end rate-guaranteed dynamic Layer-1 (L1) circuits to support high-speed data transfers for scientific-computing applications. This vision leverages two trends: (i) elastic optical networks enabled by FlexiGrid, and (ii) software-defined networking (SDN). The term "inter-network" is used to emphasize that the new protocols and algorithms implemented in SDN controllers should be designed to support a multi-domain (multi-organization) deployment. Hardware advances in NIC and driver designs will be leveraged to fill the high-speed circuits with continuous data transfer.

Bio: Malathi Veeraraghavan is a Professor in the Charles L. Brown Department of Electrical & Computer Engineering at the University of Virginia. She has led DOE and NSF-funded research projects on high-speed data transfers, multipoint data distribution, and traffic engineering of elephant flows.

10:10-10:35 AM   Talk 2 - Optimizing Distributed Data-Intensive Workflows

Nathan Tallent, Pacific Northwest National Laboratory

Abstract: We discuss techniques for optimizing the performance of data-intensive workflows that execute on geographically distributed and heterogeneous resources. We optimize for both throughput and response time. Optimizing for throughput, we alleviate data-transfer bottlenecks: to hide the latency of accessing remote data, we transparently introduce prefetching without changing workflow source code. Optimizing for response time, we introduce "intelligent" scheduling for a set of high-priority tasks, replacing the standard greedy scheduler, which ignores differing performance on heterogeneous resources. We show performance results for the Belle II workflow for high-energy physics.
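
For concreteness, the sketch below illustrates the general idea of heterogeneity-aware placement: each task is assigned to the resource with the earliest estimated completion time rather than to whichever resource is free first. It is a hypothetical illustration under invented runtimes, not the scheduler described in the talk.

```python
# Hypothetical sketch of heterogeneity-aware task placement (illustration only,
# not the scheduler from the talk). Each task goes to the resource that
# minimizes its estimated completion time, given per-resource runtime estimates.

def schedule(tasks, resources, est_runtime):
    """tasks: list of task ids; resources: list of resource ids;
    est_runtime[(task, resource)]: predicted runtime in seconds."""
    free_at = {r: 0.0 for r in resources}   # time at which each resource frees up
    placement = {}
    for t in tasks:
        # Earliest estimated completion time across heterogeneous resources.
        best = min(resources, key=lambda r: free_at[r] + est_runtime[(t, r)])
        placement[t] = best
        free_at[best] += est_runtime[(t, best)]
    return placement

if __name__ == "__main__":
    tasks = ["t1", "t2", "t3"]
    resources = ["cpu_site_A", "gpu_site_B"]           # hypothetical resources
    est = {("t1", "cpu_site_A"): 60, ("t1", "gpu_site_B"): 15,
           ("t2", "cpu_site_A"): 30, ("t2", "gpu_site_B"): 40,
           ("t3", "cpu_site_A"): 50, ("t3", "gpu_site_B"): 20}
    print(schedule(tasks, resources, est))
```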

Bio: Dr. Nathan Tallent is a senior HPC computer scientist in the Advanced Computing, Mathematics, and Data Division at Pacific Northwest National Laboratory. His research is at the intersection of tools, performance analysis, parallelism, and architectures. Currently he is working on techniques for diagnosing performance bottlenecks, generating performance models, dynamic program analysis, and evaluating new technology.

10:35-11:00 AM   Talk 3 - IBM Aspera Technology

Charles Shiflett, Aspera

Abstract: As performance has increased both across the network and on the host system, transfer performance has plateaued due to traditional APIs and algorithms that do not reflect today's performance needs. IBM Aspera addresses this problem through three primary strategies: (1) Aspera's FASP transport protocol, which performs well even with high RTT and packet loss; (2) the use of frameworks such as DPDK to bypass the BSD sockets stack; and (3) flexible I/O strategies that optimize storage bandwidth in traditional, cloud, and clustered storage scenarios. This talk will cover Aspera's approach to solving these issues, with a focus on security, network performance, storage, and cloud.

Bio: Mr. Charles Shiflett is the lead developer of Fasp0, a zero-copy, in-development version of the FASP protocol. At SC16, Aspera demonstrated 100 Gbps transfers using Fasp0 between Salt Lake City and Chicago with two servers.

11:00-11:25 AM    Talk 4 - Some Science Aspects of Wide-Area Data Transport

Nagi Rao, Oak Ridge National Laboratory

Abstract: Data transport infrastructure consists of wide-area networks, site networks, file and IO systems, data transfer nodes, and transfer software. Machine learning methods provide throughput profile estimates, which may be smooth (e.g., sigmoid neural networks) or non-smooth (e.g., forests of trees), with performance guarantees established by generalization theory. The concave-convex analysis of estimated throughput profiles characterizes transport performance: concave profiles indicate near-optimal throughput, while convex profiles indicate bottlenecks due to factors such as IO buffers, file system throughput, mismatched IO-network couplings, sub-optimal protocols, and transfer tools. The underlying mathematics of this analysis is based on statistical estimation, as well as protocol time-dynamics in the form of Poincare maps and Lyapunov exponents.
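
As a simple illustration of the concave-convex idea (a hypothetical sketch, not the analysis code behind the talk), a throughput profile sampled at uniformly spaced parameter values can be classified by the sign of its second differences:

```python
import numpy as np

def classify_profile(throughput):
    """Classify a throughput profile sampled at uniformly spaced values of the
    swept parameter (e.g., RTT): concave suggests near-optimal throughput,
    convex suggests a bottleneck. Illustrative sketch only."""
    second_diff = np.diff(np.asarray(throughput, dtype=float), n=2)
    if np.all(second_diff <= 0):
        return "concave (near-optimal throughput)"
    if np.all(second_diff >= 0):
        return "convex (suggests a bottleneck)"
    return "mixed (concave and convex regions)"

if __name__ == "__main__":
    # Hypothetical throughput (Gbps) measured at uniformly spaced RTT settings.
    print(classify_profile([9.5, 9.2, 8.6, 7.4, 5.0]))   # concave
    print(classify_profile([1.0, 1.5, 2.5, 4.5, 8.0]))   # convex
```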

Bio: Dr. Nagi Rao is a Corporate Fellow in the Computational Sciences and Engineering Division at Oak Ridge National Laboratory. He was on assignment as Technical Director of the C2BMC Knowledge Center at the Missile Defense Agency during 2008-2010. He has published more than 400 technical conference and journal papers in the areas of sensor networks, information fusion, and high-performance networking. He received his Ph.D. in Computer Science from Louisiana State University in 1988. He is a Fellow of the IEEE and has received the 2005 IEEE Technical Achievement Award and a 2014 R&D 100 Award.

11:25-11:50 AM      Talk 5 - Data Management Systems for Monitoring Workloads

Jessie Gaylord, Lawrence Livermore National Laboratory

Abstract: Data volumes are growing exponentially as sensors become less expensive and more prevalent. With these data comes the promise of new insight, but before analysis can begin, the data must be collected and organized. This talk will give an overview of data ingestion architectures for monitoring research workloads, both in practice and under development. A pipeline for the Multi-Informatics on Nuclear Operations Scenarios (MINOS) venture will be described to demonstrate the complexities of data management at scale, along with a potential solution built from open-source technologies.

Bio: Ms. Jessie Gaylord is the Associate Division Leader for Global Security Computing Applications at Lawrence Livermore National Laboratory (LLNL). As the Mission Assurance Lead for the MINOS venture for NNSA, she develops collaborative data management strategies to support nonproliferation research efforts across ten DOE laboratories. Jessie leads software engineering teams in developing scalable frameworks to support data science at LLNL and previously developed business intelligence solutions for the National Ignition Facility. Her interests include data engineering and stewardship, database development, and distributed compute architectures. Jessie has an MS in Computer Science from California State University, Chico and a BA in Economics from Washington University in St. Louis.

11:50-1:00 PM      Working Lunch (lunch provided)

Agenda: Discussion and Feedback on Morning Session

1:00-1:30 PM         Invited Talk - Data Transfer Techniques for Amazon Web Services (Load, Move, and Unload)

Brad Dispensa, Amazon Web Services

Abstract: Amazon Web Services (AWS) provides researchers with the possibility of unlimited storage and compute power, but moving data can still be challenging. This session will cover techniques for rapid ingestion of data from remote sites into the cloud using multiple protocols. Once the data is loaded, we will cover how to move it within the cloud for analysis and, finally, how to archive or evacuate all of that data efficiently and expeditiously.
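
For readers unfamiliar with the mechanics, the sketch below shows one common ingestion path: a managed multipart upload to Amazon S3 with parallel parts via the boto3 SDK. The bucket name, key, local path, and tuning values are placeholders, and the talk itself covers a broader set of protocols and services.

```python
# Minimal sketch of parallel multipart upload to Amazon S3 with boto3.
# Bucket, key, and path are placeholders; tune the transfer settings to the
# available bandwidth. (Illustration only; the talk also covers other
# ingestion protocols, intra-cloud movement, and archival.)
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,   # 64 MB parts
    max_concurrency=16,                     # parallel part uploads
    use_threads=True,
)

s3.upload_file(
    Filename="/data/experiment/run_0001.h5",   # placeholder local path
    Bucket="example-research-ingest-bucket",   # placeholder bucket
    Key="raw/run_0001.h5",
    Config=config,
)
```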

Bio: Brad Dispensa is a security and compliance specialist for Amazon Web Services in the Worldwide Public Sector unit. Prior to joining AWS, Brad was a technology and security director with the University of California, where he supported HPC environments for genomic and vascular-malformation research.

Session 2: Filesystems and IO for Resilient Data Transfer

1:30-4:20 PM

Moderator: Nagi Rao, Oak Ridge National Laboratory

This session will explore the current state, recent developments, and future advances in developing and testing the data and content capabilities needed to provide and support data over disparate time-space distances. It covers the areas of IO, storage, file systems, software stacks, and frameworks.

1:30-1:55 PM         Talk 1 - Data-Intensive Science Executed within Leadership-Scale Computing Facilities

Jack Wells, Oak Ridge National Laboratory

Abstract: High-performance computing centers like the Oak Ridge Leadership Computing Facility (OLCF) have seen growth in data-intensive research projects. The goals of experimental and observational data-intensive (EOD) science, such as the ATLAS and ALICE experiments at the Large Hadron Collider (LHC), are joining the goals of simulation studies in their requirements for access to computing at the largest scales. The BigPanDA project, along with other data projects, has served as a driver for innovation at OLCF. These innovations have included opportunistic backfill of Titan's scheduled compute nodes with the large, malleable workload available from distributed-computing projects, much as sand fills the gaps between rocks packed into a jar. The OLCF has also deployed multiple container strategies: automated container deployment as a framework for providing user-required services and applications, and HPC container runtimes focused on use within a batch submission system. Moreover, Titan's GPU-accelerated architecture has attracted a surge in machine-learning workloads. With the advent of the Summit supercomputer in 2018, with over 27,000 machine-learning-optimized GPUs, high-bandwidth data movement, and large node-local memory, the volume of data analysis and machine-learning workloads is expected to grow significantly. Summit will deliver five to ten times the computational performance of Titan and, upon completion, will give researchers in all fields of science unprecedented capability for solving some of the world's most pressing challenges.

Bio: Jack Wells is the Director of Science for the Oak Ridge Leadership Computing Facility (OLCF), a DOE Office of Science national user facility, and the Titan supercomputer, located at Oak Ridge National Laboratory (ORNL). Wells is responsible for the scientific outcomes of the OLCF's user programs. Wells has previously led both ORNL's Computational Materials Sciences group in the Computer Science and Mathematics Division and the Nanomaterials Theory Institute in the Center for Nanophase Materials Sciences. Prior to joining ORNL as a Wigner Fellow in 1997, Wells was a postdoctoral fellow within the Institute for Theoretical Atomic and Molecular Physics at the Harvard-Smithsonian Center for Astrophysics. Wells has a Ph.D. in physics from Vanderbilt University and has authored or co-authored over 80 scientific papers and edited one book, spanning nanoscience, materials science and engineering, nuclear and atomic physics, computational science, applied mathematics, and text-based data analytics.

1:55-2:20 PM         Talk 2 - Explaining Wide Area Data Transfer Performance

Rick Wagner, Globus

Abstract: Globus is a research data management service developed by the University of Chicago and used by thousands of researchers at institutions in the U.S. and abroad. Starting with log data for millions of Globus transfers involving billions of files and hundreds of petabytes, we developed both linear and nonlinear models of transfer performance. We show that the resulting models have high explanatory power and broaden understanding of the factors that influence file transfer rate by clarifying relationships between achieved transfer rates, transfer characteristics, and competing load.
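
To illustrate the flavor of such modeling (with invented features and synthetic data, not the Globus logs or the study's actual models), one could fit a simple linear model of achieved transfer rate against log-derived features:

```python
# Hypothetical sketch of fitting a linear model of transfer rate from
# log-derived features. The feature set and data are invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
# Invented per-transfer features: total bytes, file count, concurrent load.
bytes_tx = rng.uniform(1e8, 1e12, n)
file_count = rng.integers(1, 10_000, n)
concurrent = rng.integers(0, 20, n)
X = np.column_stack([np.log10(bytes_tx), np.log10(file_count), concurrent])

# Synthetic "achieved rate" (MB/s) with noise, for demonstration only.
rate = (50 + 40 * np.log10(bytes_tx) / 12 - 5 * np.log10(file_count)
        - 2 * concurrent + rng.normal(0, 5, n))

model = LinearRegression().fit(X, rate)
print("R^2:", model.score(X, rate))
print("coefficients:", dict(zip(["log_bytes", "log_files", "load"], model.coef_)))
```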

Bio: Rick Wagner is the Globus Professional Services manager, leading a team of engineers who support organizations and research projects in solving large-scale and complex data management challenges. He was previously the HPC Systems Manager at the San Diego Supercomputer Center, and his starting point in research was computational astrophysics.

2:20-2:45 PM         Talk 3 - State of the Lustre File System: Reliability, Resiliency, and Community Roadmap

Shawn Hall, BP

Abstract: The Lustre file system is a pivotal component technology for many HPC centers, providing high-performance parallel data access to researchers worldwide. Open Scalable File Systems (OpenSFS) is a nonprofit organization dedicated to the success of Lustre. This talk will give the current state and future plans for Lustre, with an emphasis on reliability and resiliency, as well as an introduction to how OpenSFS is a fundamental component of the Lustre community. A recently introduced feature, Progressive File Layouts (PFL), allows more complex layouts to be defined, paving the way for more software-level reliability and resiliency to be achieved in Lustre with File-level Redundancy (FLR). FLR will finally allow Lustre to break free from its dependence on hardware-level reliability, while at the same time offering the potential to increase the performance of read-heavy I/O workloads. This talk will also briefly re-introduce the PFL concept and discuss FLR's expected impacts on reliability, resiliency, and performance.
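
As a purely illustrative example of a progressive file layout, the snippet below defines a three-component PFL on a directory with the standard lfs setstripe tool, invoked here from Python. The extent boundaries, stripe counts, and path are hypothetical values, not a recommended configuration.

```python
# Hypothetical example of defining a Progressive File Layout (PFL) on a Lustre
# directory via the standard `lfs setstripe` tool, invoked from Python.
# Illustrative values only: small files stay on one OST, mid-size files use
# four OSTs, and anything larger is striped across all OSTs.
import subprocess

subprocess.run(
    ["lfs", "setstripe",
     "-E", "256M", "-c", "1",    # first component: up to 256 MiB, 1 stripe
     "-E", "4G",   "-c", "4",    # next component: up to 4 GiB, 4 stripes
     "-E", "-1",   "-c", "-1",   # remainder: stripe across all OSTs
     "/lustre/project/data"],    # placeholder directory
    check=True,
)
```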

Bio: Mr. Shawn Hall's experience is in large-scale system administration, having worked with high-performance computing clusters in industry and academia. He has worked on many aspects of large-scale systems, and his interests include parallel file systems, configuration management, performance analysis, and security. Shawn holds B.S. and M.S. degrees in Electrical and Computer Engineering from Ohio State University. Shawn is on the OpenSFS Board of Directors and currently holds the role of Director at Large.

2:45-3:10 PM         Talk 4 - Analytics Shipping Through Virtualized MapReduce on HPC Backend Storage Servers

Weikuan Yu, Florida State University

Abstract: Large-scale scientific applications on High-Performance Computing (HPC) systems, such as those from bioinformatics and metagenomics, are generating a colossal amount of data that needs to be analyzed in a timely manner for new knowledge but is too costly to transfer due to its sheer size. Many HPC systems have adopted in-situ analytics solutions that can analyze temporary datasets as they are generated, i.e., without storing them to long-term storage media. However, there is still an open question of how to conduct efficient analytics on permanent datasets that have been stored to backend persistent storage because of their long-term value. To fill this void, we exploit the analytics shipping model for fast analysis of large-scale scientific datasets on HPC backend storage servers. In this talk, I will present a Virtualized Analytics Shipping (VAS) framework that can ship MapReduce programs to Lustre storage servers. Our performance evaluation demonstrates that VAS offers an exemplary implementation of analytics shipping and delivers fast, virtualized MapReduce programs on backend HPC storage servers.

Bio: Dr. Weikuan Yu is a Professor in the Department of Computer Science at Florida State University. His main research interests include big data management and analytics frameworks, parallel I/O and storage, GPU memory architecture, and high-performance networking. Yu serves as an associate editor for IEEE Transactions on Parallel and Distributed Systems. He is a senior member of IEEE and life member of ACM.

3:10-3:30 PM         Afternoon Refreshment Break (coffee and light refreshments provided)

3:30-3:55 PM         Talk 5 - Data Management at the Scientific Data and Computing Center (SDCC) at Brookhaven National Laboratory

Eric Lancon, BNL

Abstract: With the High Luminosity Large Hadron Collider (HL-LHC) physics program at CERN looming on the horizon, exabytes of data will be produced over the next decade. In order to store and process this unprecedented amount of physics data, an intensive R&D program has been established. Preliminary conceptual designs for data management and organization based on large distributed federated data stores will be described.

Bio: Eric Lancon is a senior physicist and Director of the Scientific Data and Computing Center (SDCC) at Brookhaven National Laboratory. SDCC provides the local processing and storage capabilities for Relativistic Heavy Ion Collider (RHIC) experiments and other laboratory physics programs. SDCC is the U.S. data center for the ATLAS experiment at CERN (Geneva, Switzerland) and the Belle II experiment in Japan. The SDCC also serves as an analysis center for U.S. physicists on the ATLAS experiment. Over the last decade, Eric Lancon has played major roles in the area of distributed computing for the Large Hadron Collider (LHC) at CERN.

3:55-4:20 PM         Talk 6 - IO 500 Benchmark

John Bent, DDN

Abstract: At SC17, we announced the winner of the first twice-yearly IO500 benchmark list. This announcement instigated a detailed process of defining the benchmark. Although the need to evaluate storage systems is crystal clear, the method is notoriously murky. In this presentation, we will describe the motivation for the IO500 benchmark and discuss its methodology. We will also discuss and analyze the list, including the more recent ISC18 entrants.
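
As a rough sketch of how a composite storage score of this kind can be formed (illustrative phase names and numbers, not official IO500 results or the official scoring code), bandwidth and metadata results can be combined through geometric means:

```python
# Minimal sketch of a composite benchmark score built from geometric means of
# bandwidth and metadata results, in the spirit of the IO500 scoring scheme.
# All phase names and values below are invented for illustration.
from math import prod

def geometric_mean(values):
    return prod(values) ** (1.0 / len(values))

# Hypothetical phase results.
bandwidth_gib_s = [12.3, 4.1, 9.8, 3.5]      # e.g., easy/hard write/read phases
metadata_kiops  = [150.0, 42.0, 88.0, 61.0]  # e.g., metadata phases

bw_score = geometric_mean(bandwidth_gib_s)
md_score = geometric_mean(metadata_kiops)
overall  = (bw_score * md_score) ** 0.5

print(f"bandwidth score: {bw_score:.2f} GiB/s")
print(f"metadata score:  {md_score:.2f} kIOP/s")
print(f"overall score:   {overall:.2f}")
```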

Bio: Currently Global Field CTO at DDN, Dr. John Bent directs worldwide technical pre-sales for the world's largest HPC storage company. An author of almost 100 publications and patents, John has been influencing HPC storage since starting his career at Los Alamos National Lab in 2005. A graduate of Amherst College with a focus on anthropology, John served for two years as a Peace Corps volunteer in the Republic of Palau before earning a Ph.D. in computer science from the University of Wisconsin.

4:20-5:20 PM        Panel - Looking to the Future: Data-over-Time-Space Distance

The panel addresses future aspects of providing access to and supporting data in various forms over disparate distances in time and space. The state of the art in the areas of networking, IO, storage, file systems, and data frameworks will be discussed. Future directions in these areas, both short and long term, will be identified, with particular attention paid to potential game changers. The panel will also discuss "next big things" in these areas that are both expected and desired but potentially speculative.

Moderator: Neena Imam, Oak Ridge National Laboratory

Panelists:

Rich Carlson, ASCR/DOE
Tom Lehman, University of Maryland
Raj Kettimuthu, Argonne National Laboratory - Data-over-Time-Space Distance: Challenges, Current R&D Trajectories and Future
Inder Monga, ESnet - Networking in the "Jetson Age"
Barney Maccabe, Oak Ridge National Laboratory - Research Challenges

Bio: Mr. Richard Carlson is a program manager at the Department of Energy/Office of Science (DOE/SC), where he is responsible for the scientific collaboration and network research programs. The focus of these research programs is to develop a comprehensive understanding of how distributed computing environments, and the workflows that use them, behave. This is accomplished by developing theories of what should happen and experimentally validating those theories on real systems. The research programs also explore how to design, build, deploy, and operate network infrastructure and protocols that meet the changing needs of large distributed science communities. He has an MS-EE degree from the Illinois Institute of Technology and over 35 years of experience in the design, construction, and operation of high-performance IP networks to support large-scale DOE science initiatives.

Bio: Mr. Tom Lehman is the Director of Research at the University of Maryland/Mid-Atlantic Crossroads (UMD/MAX). His research and development interests include advanced network architectures, intelligent control planes, multi-layer inter-networking, network function virtualization, and cloud computing. His current research projects are focused on technology development to facilitate the orchestration of high performance network, computation, and storage resources in service of big data driven domain science application workflows.

Bio: Dr. Rajkumar Kettimuthu received the B.E. degree from Anna University, Chennai, India, and an M.S. and a Ph.D. from The Ohio State University, Columbus, OH, USA, all in Computer Science and Engineering. Since 2003, he has been working at Argonne National Laboratory, where he is currently a Computer Scientist in the Data Science and Learning Division. He has co-authored more than 100 articles in the areas of high-performance computing, distributed computing, and high-performance networking. He is a recipient of an R&D 100 Award.

Bio: Mr. Inder Monga serves as the Executive Director of the Energy Sciences Network, a high-performance network interconnecting the National Laboratory System in the United States, and as Division Director for the Scientific Networking Division at Lawrence Berkeley National Lab. In addition to managing the organization, his efforts are directed toward advancing networking services for collaborative and distributed science, as well as contributing to ongoing research projects including Intent-based interfaces, Software-Defined Networking, Named-Data Networking, and the Advanced Network Testbed Initiatives. He was selected to be ONF's Research Associate and co-leads the Next-Generation architecture activity. His work experience in the private sector has included network engineering for Wellfleet Communications and the Canadian telecom company Nortel, where he focused on application and network convergence.

Bio: Barney Maccabe currently serves as the Director of the Computer Science and Mathematics Division at Oak Ridge National Laboratory (ORNL). The division has over 100 technical staff working in a wide range of areas, including computational and applied mathematics; discrete systems; data analysis, visualization, management, and engineering; programming models and tools; performance modeling, measurement, and analysis; system software; and emerging technologies. Prior to joining ORNL in January of 2009, Dr. Maccabe served on the Computer Science faculty at the University of New Mexico, where he also served as director of the UNM Center for High Performance Computing and as CIO for the university.

5:20 PM                   Adjourn