IPDPS 2026 Tutorials
|
As in 2025, the IPDPS Workshops will be held on the first two days of the conference. To enrich the offerings for those two days, the conference will also conduct 9 Tutorials, open to all attendees. The IPDPS 2026 Tutorials selected for presentation are described here, listing the presenters and providing an abstract for each. They complement the one-day workshops, enabling attendees on the first two days of the conference to fill their day with workshops and tutorials covering a host of topics of interest to our community.
IPDPS 2026 Tutorial Co-Chairs
- Sascha Hunold, TU Wien, Austria
- Swann Perarnau, Argonne National Laboratory, USA
TUTORIAL I
Charm4Py: A Programming Model for Distributed Adaptive Python
Presenters
Laxmikant Kale, Ritvik Rao, Maya Taylor, and Aditya Bhosale (University of Illinois Urbana-Champaign)
Abstract
Many programming models have been developed to implement scalable parallel programs. Charm++ is built around the concept of distributed migratable objects, called chares. This empowers an introspective and adaptive runtime system (aRTS) to support automatic overlap of communication and computation, dynamic load balancing, energy optimization, fault tolerance, and resource elasticity. These signature adaptive capabilities distinguish Charm++ from MPI, the predominant parallel programming model.
Charm4Py is an implementation of the Charm++ programming model that makes these adaptive capabilities available to parallel Python programmers. Charm4Py programs use Python classes to implement chare objects; Charm++ features such as dynamic load balancing and asynchronous communication are made available to Charm4Py programs by linking it with the Charm++ aRTS under the hood. Charm4Py can thus be thought of as Distributed Adaptive Python, a parallel programming model with distributed adaptivity features that is compatible with existing Python conventions, and allows the use of existing popular Python libraries.
This tutorial aims to teach effective parallel programming using Charm4Py. Attendees will start with basic knowledge of Python, and by the end of the tutorial are expected to be able to code sophisticated parallel applications that can run on clusters, supercomputers, and cloud-based resources scalably and adaptively.
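The chare idea at the center of this tutorial can be previewed with a toy, single-process analogue: objects that react to asynchronous method invocations ("entry methods") delivered by a runtime. The sketch below is NOT the Charm4Py API (real Charm4Py distributes chares across processes via the Charm++ runtime and handles load balancing and migration); all class and method names here are invented for illustration only.

```python
# Toy, single-process analogue of the chare concept. NOT the Charm4Py API;
# every name below is invented for illustration.
import queue

class ToyChare:
    def __init__(self, index):
        self.thisIndex = index  # each chare knows its index in its collection

class Hello(ToyChare):
    def greet(self, results):
        results.append(f"hello from chare {self.thisIndex}")

class ToyRuntime:
    """Queues asynchronous method invocations and delivers them one by one."""
    def __init__(self):
        self.inbox = queue.Queue()

    def invoke(self, obj, method, *args):
        self.inbox.put((obj, method, args))   # asynchronous "send"

    def run(self):
        while not self.inbox.empty():
            obj, method, args = self.inbox.get()
            getattr(obj, method)(*args)       # deliver the message

rt = ToyRuntime()
chares = [Hello(i) for i in range(4)]   # analogue of a small chare array
results = []
for c in chares:
    rt.invoke(c, "greet", results)      # broadcast-style invocation
rt.run()
print(results)
```

The point of the analogy is the inversion of control: callers enqueue work on objects rather than calling them directly, which is what lets a real adaptive runtime reorder, migrate, and load-balance that work.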
TUTORIAL II
In-network Computing with SmartNICs for Parallel Applications
Presenters
Jeffrey Young (Georgia Institute of Technology), Elie Kfoury (University of South Carolina), Richard Graham (NVIDIA), Aaron Jezghani (Georgia Institute of Technology), Antonio Peña (Barcelona Supercomputing Center), and Paul Ruth (UNC Chapel Hill)
Abstract
The past few years have witnessed an increased level of support for and deployment of programmable network adapters, known as “SmartNICs”. These enhanced network devices offer standard packet processing capabilities as well as advanced “in-network” computing features built around programmable lightweight processing cores, FPGAs, and even CPU- and GPU-based platforms capable of running separate operating systems. SmartNICs have gained rapid adoption for data center tasks, including infrastructure management, packet filtering, and I/O acceleration. Increasingly, these devices are also being explored for high-performance computing (HPC) and AI application acceleration.
This tutorial offers an in-depth exploration of the state-of-the-art for SmartNICs and the emerging software ecosystems supporting them. Attendees will engage in hands-on exercises to better understand how to take advantage of SmartNICs for accelerating HPC and AI applications. Specific topics include MPI and OpenMP offloading, algorithmic modifications to utilize SmartNIC processors, in-line packet processing frameworks like P4, security and containerization efforts, and I/O acceleration techniques. Participants will have the opportunity to execute these exercises using cutting-edge SmartNICs like NVIDIA's BlueField-3 Data Processing Unit (DPU) and a cloud-based Netlab environment. The tutorial presenters will discuss additional techniques for optimizing applications to harness SmartNICs as communication accelerators in HPC systems.
TUTORIAL III
High-Performance and Smart Networking Technologies for HPC and AI
Presenters
Dhabaleswar K. Panda and Benjamin Michalowicz (The Ohio State University)
Abstract
High-performance networking technologies are generating considerable excitement for building next-generation High-End Computing (HEC) systems for HPC and AI with GPGPUs, accelerators, Data Processing Units (DPUs), and a variety of application workloads. This tutorial will provide an overview of these emerging technologies, their architectural features, current market standing, and suitability for designing HEC systems. It will start with a brief overview of the IB, HSE, RoCE, and Omni-Path interconnects, followed by an in-depth look at their architectural features and at the emerging NVLink/NVSwitch, EFA, and Slingshot architectures. We will then present advanced features of commodity high-performance networks that enable performance and scalability, together with an overview of enhanced offload-capable network adapters such as DPUs/IPUs (SmartNICs) and their capabilities. Next, we will survey software stacks for high-performance networks, including OpenFabrics Verbs, libfabric, and UCX, and compare their performance. Finally, challenges in designing MPI libraries for these interconnects, along with solutions and sample performance numbers, will be presented. Sample slides can be found at https://go.osu.edu/ipdps26-hpn
TUTORIAL IV
Strategies, tools, and materials for teaching Parallel and Distributed Computing
Presenters
Erik Saule and KR Subramanian (University of North Carolina at Charlotte) and David Bunde (Knox College)
Abstract
Virtually all the devices that students will program are fundamentally concurrent and parallel systems. This can no longer be ignored, as most modern programming models expose concurrency and/or parallelism to the developer. However, teaching parallel and distributed computing remains difficult because it requires a deep understanding of programming, systems, and algorithms, and so it is often relegated to upper-division electives. This tutorial will enable attendees to break down parallel computing concepts at different technical levels so that they can be taught at different points in the curriculum, from early classes to technical electives. It will also teach engaging pedagogical practices.
This tutorial is intended for everyone who engages with students. It is particularly helpful for graduate students who intend to apply for faculty positions and for early-career faculty who need to bolster their teaching statements and educational activities for grant submissions. We also hope IPDPS can help advertise it to the local community in New Orleans.
TUTORIAL V
Resilient Scheduling in Distributed Systems: Benchmarking and Adversarial Analysis with SAGA
Presenters
Jared Coleman (Loyola Marymount University) and Bhaskar Krishnamachari (University of Southern California)
Abstract
Task graph scheduling on heterogeneous parallel and distributed systems remains a foundational challenge in HPC, cloud, and edge computing. While decades of research have produced a large variety of heuristic schedulers, their robustness across diverse workloads is poorly understood, in part due to the reliance on limited benchmark suites.
This tutorial introduces two complementary open-source tools (SAGA and PISA) that provide a principled, reproducible methodology for analyzing scheduling algorithms across both typical and adversarial problem instances. SAGA offers a modular Python framework for implementing and benchmarking schedulers, including classical list-scheduling algorithms and hundreds of parametric variants. PISA, recently published at IPDPS 2025, uses simulated annealing to automatically synthesize problem instances that expose worst-case performance gaps between algorithms. Together, these tools help reveal the structural conditions under which algorithms fail, enabling deeper performance insight and more resilient algorithm design.
A version of this tutorial will be presented at ICDCN 2026, but IPDPS, with its large distributed computing audience, is a natural venue as well. The content builds on our recent IPDPS main-track and JSSPP/IPDPS workshop papers and is designed for researchers and practitioners in distributed systems.
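The kind of list-scheduling heuristic this tutorial benchmarks can be sketched in a few lines of plain Python. The example below is a minimal HEFT-style scheduler, NOT the SAGA API: the four-task graph, the two machine speeds, and the zero communication cost are all assumptions chosen to keep the example self-contained.

```python
# Minimal HEFT-style list scheduling sketch. Task graph, machine speeds, and
# zero communication cost are illustrative assumptions, not SAGA code.
work = {"A": 2, "B": 3, "C": 4, "D": 1}               # computation per task
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
pred = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
speed = [1.0, 2.0]                                     # two heterogeneous machines

def upward_rank(t):
    """Average execution time of t plus the longest downward path from it."""
    avg = work[t] * sum(1 / s for s in speed) / len(speed)
    return avg + max((upward_rank(s) for s in succ[t]), default=0.0)

order = sorted(work, key=upward_rank, reverse=True)    # priority list

avail = [0.0] * len(speed)    # when each machine becomes free
finish = {}                   # finish time of each scheduled task
for t in order:
    ready = max((finish[p] for p in pred[t]), default=0.0)
    # place t on the machine giving the earliest finish time
    best = min(range(len(speed)),
               key=lambda m: max(ready, avail[m]) + work[t] / speed[m])
    start = max(ready, avail[best])
    finish[t] = start + work[t] / speed[best]
    avail[best] = finish[t]

makespan = max(finish.values())
print(order, makespan)
```

Adversarial analysis in the PISA spirit amounts to searching over inputs like `work`, `succ`, and `speed` for instances where one such heuristic's makespan is much worse than another's.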
TUTORIAL VI
Building Scalable Agentic Systems for Science: Concepts, Architectures, and Hands-On with Academy
Presenters
Kyle Chard, Alok Kamatar, and Ian Foster (University of Chicago)
Abstract
Agentic systems, in which autonomous agents collaborate to solve complex problems, are emerging as a transformative methodology in AI. However, adapting agentic architectures to scientific cyberinfrastructure—spanning HPC systems, experimental facilities, and federated data repositories—introduces new technical challenges. In this half-day tutorial, we introduce participants to the design, deployment, and management of scalable agentic systems for scientific discovery. We will present Academy, a Python-based middleware platform built to support agentic workflows across heterogeneous research environments. Participants will learn core agentic system concepts, including asynchronous execution models, stateful agent orchestration, and dynamic resource management. A guided hands-on session will help attendees build and launch their own agentic workflows. We will present case studies in materials discovery, biology, and chemistry. This tutorial is designed for researchers, developers, and cyberinfrastructure professionals interested in advancing AI-driven science with next-generation autonomous systems.
TUTORIAL VII
Principles and Practice of High Performance Deep Learning Training and Inference
Presenters
Dhabaleswar K. (DK) Panda and Nawras Alnaasan (The Ohio State University)
Abstract
Recent advances in machine learning and deep learning (ML/DL) have led to many exciting challenges and opportunities. Modern ML/DL frameworks including PyTorch, TensorFlow, and cuML enable high-performance training, inference, and deployment for various types of ML models and deep neural networks (DNNs). This tutorial provides an overview of recent trends in ML/DL and the role of cutting-edge hardware architectures and interconnects in moving the field forward. We will also present an overview of different DNN architectures, ML/DL frameworks, DL training and inference, and hyperparameter optimization, with a special focus on parallelization strategies for large models such as GPT, LLaMA, DeepSeek, and ViT. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures to efficiently support large-scale distributed training. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU/GPU/DPU architectures available on modern HPC clusters. Throughout the tutorial, we include several hands-on exercises to enable attendees to gain firsthand experience of running distributed ML/DL training and hyperparameter optimizations on a modern GPU cluster.
TUTORIAL VIII
An Open-Source CXL Emulation Platform at Hyperscale
Presenters
Andres Marquez, Luanzheng Guo, and Mujahid Al Rafi (Pacific Northwest National Laboratory); Yiwei Yang (University of California, Santa Cruz); and Kenza Bouasker (University of Washington-Seattle and EPFL, the Swiss Federal Institute of Technology in Lausanne)
Abstract
The tutorial presents an innovative open-source Compute Express Link (CXL) emulation platform designed for hyperscale environments, offering a comprehensive implementation of the CXL 3.0 specification. This platform includes significant advancements such as memory device emulation accommodating up to 32 Host-managed Device Memory (HDM) decoders, a feature-rich Fabric Manager for efficient memory management, and an adaptable CXL switch topology supporting configurations with up to 256 virtual ports for effective cacheline distribution. Furthermore, the platform ensures full coherence protocol support for reliable and consistent memory operations across multiple hosts.
The tutorial emphasizes hands-on exploration of the platform's design architecture, which integrates QEMU instances as emulated hosts and a central memory server to understand multi-host topologies. Memory accesses are intercepted, redirected, and responded to with topology-aware latency and data injection, all coordinated through a distributed Fabric Manager. This streamlined design enables participants to simulate real-world data center challenges effectively.
In addition to hands-on experience, participants will also learn how to engage with the CXL Emulation Workgroup, a collaborative initiative hosted at https://cxl-emu.github.io, which welcomes researchers to contribute to the ongoing development of this promising platform.
TUTORIAL IX
Object-as-a-Service (OaaS): Streamlining Cloud-Native Application Development for the Edge-Cloud Continuum
Presenters
Pawissanutt Lertpongrujikorn, Mohsen Amini Salehi, and Tapajit Chandra Paul (University of North Texas)
Abstract
Serverless computing, particularly Function-as-a-Service (FaaS), has revolutionized cloud application development by abstracting infrastructure management. However, FaaS remains fundamentally stateless, forcing developers to rely on external storage services. This separation of compute and state introduces significant latency and complexity, particularly for data-intensive applications with strict Quality of Service (QoS) requirements.
To bridge this gap, we introduce Object-as-a-Service (OaaS), a novel serverless paradigm that borrows concepts from object-oriented programming to encapsulate application logic, state, and QoS policies into a single, autonomous deployment unit. By treating cloud resources as programmable objects, OaaS enables developers to build complex, stateful distributed applications with the same ease as writing monolithic code. We implemented OaaS in Oparaca, an open-source platform recently re-architected in Rust. Oparaca leverages a decentralized data plane powered by Zenoh and efficient communication protocols to support high-performance scenarios across the Edge-Cloud continuum, significantly minimizing data movement overhead.
In this tutorial, we will present the theoretical foundations of OaaS and the internal architecture of Oparaca. Participants will engage in hands-on exercises to develop, deploy, and manage stateful OaaS applications on a simulated Edge-Cloud environment using k3d. We will demonstrate how OaaS addresses critical IoT challenges—such as data locality, bandwidth constraints, and intermittent connectivity—by dynamically moving processing logic closer to data sources. Attendees will leave with practical experience in building next-generation serverless applications that are both developer-friendly and performance-optimized. This tutorial offers a comprehensive guide to the next generation of stateful serverless computing.
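The contrast the abstract draws between stateless FaaS and OaaS can be made concrete with a toy sketch: an "object as a service" bundles state, invocable functions, and a declarative QoS policy into one unit, so state survives across invocations without an external store. This is a conceptual illustration only, NOT the Oparaca API; every name below is invented.

```python
# Conceptual sketch of the OaaS idea, not Oparaca code. All names invented.
from dataclasses import dataclass, field

@dataclass
class QoSPolicy:
    max_latency_ms: int = 50        # declarative target the platform would enforce

@dataclass
class CounterObject:
    """A toy stateful service object: logic, state, and QoS travel together."""
    qos: QoSPolicy = field(default_factory=QoSPolicy)
    count: int = 0                  # state is part of the deployment unit

    def increment(self, by=1):      # an invocable "function" of the object
        self.count += by
        return self.count

obj = CounterObject(qos=QoSPolicy(max_latency_ms=20))
obj.increment()
obj.increment(by=4)
print(obj.count)   # state persisted across invocations, no external storage
```

In a stateless FaaS version of the same counter, each invocation would have to read and write the count through an external database, which is exactly the latency and complexity the abstract attributes to separating compute from state.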