IEEE Computer Society Technical Committees on Computer Architecture & Distributed Processing


IPDPS 2020 Program

34th IEEE International
Parallel & Distributed
Processing Symposium

May 18-22, 2020
Held as a virtual event

This page lists the 21 workshops that are part of the IPDPS 2020 program. Click on a workshop of interest – Monday workshops at the top of the page and Friday workshops at the bottom – and the link will take you to the workshop's home page, which provides detailed information on its papers and any other program material and events. Check individual workshop pages to see what events are planned.

The Main Conference program that follows shows the papers accepted for the conference, organized in technical sessions originally scheduled for Tuesday, Wednesday, and Thursday. These papers, along with all workshop papers, are published in the proceedings and accompanied by presentation slides from the authors.

This publication will be released by May 15 and will be available to all registrants.
IPDPS will hold virtual events to coincide with the conference dates of 18-22 May. Participation details are available here and in links in the program that follows.

  • Tuesday, May 19: Best paper presentations and Q&A session. 
  • Wednesday, May 20: Best paper announcement and TCPP public meeting. 
  • Thursday, May 21: IPDPS Town Hall meeting.

Events on these three days will take place from 9:00 AM to 10:00 AM US Central Daylight Time / 2:00 PM UTC. Check individual workshops for any scheduled events.

MONDAY - 18 May 2020




Visit individual workshop websites at the links shown.




Heterogeneity in Computing Workshop


Reconfigurable Architectures Workshop


High Performance Computational Biology


Graphs, Architectures, Programming, and Learning


NSF/TCPP Workshop on Parallel and Distributed Computing Education


High-level Parallel Programming Models and Supportive Environments


High-Performance Big Data and Cloud Computing


Accelerators and Hybrid Exascale Systems


Parallel / Distributed Combinatorics and Optimization


Advances in Parallel and Distributed Computational Models

TUESDAY - 19 May 2020


Virtual Session
9:00 to 10:00 AM US Central Daylight Time / 2:00 PM UTC

Best Paper Presentations and Q&A Session


See this page for details and a link to join the session.

Parallel Technical
Sessions 1, 2, 3, & 4

SESSION 1: Communication & NoCs

DozzNoC: Reducing Static and Dynamic Energy in NoCs with Low-latency Voltage Regulators using Machine Learning

Mark Clark, Yingping Chen, Avinash Karanth, Brian Ma, and Ahmed Louri


Neksus: An Interconnect for Heterogeneous System-In-Package Architectures

Vidushi Goyal, Xiaowei Wang, Valeria Bertacco, and Reetuparna Das


Accelerated Reply Injection for Removing NoC Bottleneck in GPGPUs

Yunfan Li and Lizhong Chen


Machine-agnostic and Communication-aware Designs for MPI on Emerging Architectures

Jahanzeb Maqbool Hashmi, Shulei Xu, Bharath Ramesh, Hari Subramoni, Mohammadreza Bayatpour, and Dhabaleswar K. (DK) Panda



SESSION 2: Storage & IO


ClusterSR: Cluster-Aware Scattered Repair in Erasure-Coded Storage 

Zhirong Shen, Jiwu Shu, Zhijie Huang, and Yingxun Fu


Stitch It Up: Using Progressive Data Storage to Scale Science

Jay Lofstead, John Mitchel, and Enze Chen 


HFetch: Hierarchical Data Prefetching for Scientific Workflows in Multi-Tiered Storage Environments

Hariharan Devarajan, Anthony Kougkas, and Xian-He Sun


CanarIO: Sounding the Alarm on IO-Related Performance Degradation

Michael Wyatt, Stephen Herbein, Kathleen Shoga, Todd Gamblin, and Michela Taufer



SESSION 3: Applications


A Study of Graph Analytics for Massive Datasets on Large-Scale Distributed GPUs

Vishwesh Jatala, Roshan Dathathri, Gurbinder Gill, Loc Hoang, V. Krishna Nandivada, and Keshav Pingali


A Highly Efficient Dynamical Core of Atmospheric General Circulation Model based on Leap-Format

Hang Cao, Liang Yuan, He Zhang, Baodong Wu, Shigang Li, Pengqi Lu, Yunquan Zhang, Yongjun Xu, and Minghua Zhang


Understanding GPU-Based Lossy Compression for Extreme-Scale Cosmological Simulations

Sian Jin, Pascal Grosset, Christopher M. Biwer, Jesus Pulido, Jiannan Tian, Dingwen Tao, and James P. Ahrens


Optimizing High Performance Markov Clustering for Pre-Exascale Architectures

Oguz Selvitopi, Md Taufique Hussain, Ariful Azad, and Aydin Buluc

SESSION 4: Distributed Algorithms


Tightening Up the Incentive Ratio for Resource Sharing Over the Rings

Yukun Cheng, Xiaotie Deng, and Yuhao Li


Communication-Efficient String Sorting

Timo Bingmann, Peter Sanders, and Matthias Schimek


SCSL: Optimizing Matching Algorithms to Improve Real-time for Content-based Pub/Sub Systems

Tianchen Ding, Shiyou Qian, Jian Cao, Guangtao Xue, and Minglu Li


Distributed Graph Realizations

John Augustine, Keerti Choudhary, Avi Cohen, David Peleg, Sumathi Sivasubramaniam, and Suman Sourav

Parallel Technical Sessions 5, 6, 7, & 8

SESSION 5: Reliability and QoS


Transaction-Based Core Reliability

Sang Wook Stephen Do and Michel Dubois


Understanding the Interplay between Hardware Errors and User Job Characteristics on the Titan Supercomputer

Seung-Hwan Lim, Ross Miller, and Sudharshan Vazhkudai


EC-Fusion: An Efficient Hybrid Erasure Coding Framework to Improve Both Application and Recovery Performance in Cloud Storage Systems

Han Qiu, Chentao Wu, Jie Li, Minyi Guo, Tong Liu, Xubin He, Yuanyuan Dong, and Yafei Zhao



SESSION 6: Learning Algorithms


Learning an Effective Charging Scheme for Mobile Devices

Tang Liu, Baijun Wu, Wenzheng Xu, Xiaobo Cao, Jian Peng, and Hongyi Wu


Optimize Scheduling of Federated Learning on Battery-powered Mobile Devices

Cong Wang, Xin Wei, and Pengzhan Zhou


Harnessing Deep Learning via a Single Building Block

Kunal Banerjee, Michael J. Anderson, Sasikanth Avancha, Anand Venkat, Gregory M. Henry, Evangelos Georganas, Hans Pabst, Alexander Heinecke, and Dhiraj D. Kalamkar


Experience-Driven Computational Resource Allocation of Federated Learning by Deep Reinforcement Learning

Yufeng Zhan, Peng Li, and Song Guo



SESSION 7: Data Analysis and Management


An Active Learning Method for Empirical Modeling in Performance Tuning

Jiepeng Zhang, Jingwei Sun, Wenju Zhou, and Guangzhong Sun


DASSA: Parallel DAS Data Storage and Analysis for Subsurface Event Detection

Bin Dong, Veronica Rodriguez, Xin Xing, Suren Byna, Jonathan Ajo-Franklin, and Kesheng Wu


Scaling of Union of Intersections for Inference of Granger Causal Networks from Observational Data

Mahesh Balasubramanian, Trevor Ruiz, Brandon Cook, Mr Prabhat, Sharmodeep Bhattacharyya, Aviral Shrivastava, and Kristofer Bouchard


GPU-Based Static Data-Flow Analysis for Fast and Scalable Android App Vetting

Xiaodong Yu, Fengguo Wei, Xinming Ou, Michela Becchi, Tekin Bicer, and Danfeng (Daphne) Yao



SESSION 8: Edge Computing


Robust Server Placement for Edge Computing

Dongyu Lu, Yuben Qu, Fan Wu, Haipeng Dai, Chao Dong, and Guihai Chen


EdgeIso: Effective Performance Isolation for Edge Devices
Yoonsung Nam, Yongjun Choi, Byeonghun Yoo, Yongseok Son, and Hyeonsang Eom


Busy-Time Scheduling on Heterogeneous Machines

Runtian Ren and Xueyan Tang


Scheduling Malleable Jobs Under Topological Constraints
Evripidis Bampis, Konstantinos Dogeas, Alexander Kononov, Giorgio Lucarelli, and Fanny Pascual

Best Papers


XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs
Cheng Li, Abdul Dakkak, Jinjun Xiong, Wei Wei, Lingjie Xu, and Wen-mei Hwu





Exploring the Binary Precision Capabilities of Tensor Cores for Epistasis Detection
Ricardo Nobre, Aleksandar Ilic, Sergio Santander-Jiménez, and Leonel Sousa





Understanding and Improving Persistent Transactions on Optane DC Memory
Pantea Zardoshti, Michael Spear, Aida Vosoughi, and Garret Swart





CycLedger: A Scalable and Secure Parallel Protocol for Distributed Ledger via Sharding
Mengqian Zhang, JiChen Li, Zhaohua Chen, Hongyin Chen, and Xiaotie Deng



WEDNESDAY - 20 May 2020


Virtual Session
9:00 to 10:00 AM US Central Daylight Time / 2:00 PM UTC

Best Paper Announcement and TCPP Public Meeting


See this page for details and a link to join the session.

Parallel Technical
Sessions 9, 10, 11, & 12

SESSION 9: Cloud Technology


Mitigating Large Response Time Fluctuations through Fast Concurrency Adapting in the Cloud

Jianshu Liu, Shungeng Zhang, Qingyang Wang, and Jinpeng Wei


DAG-Aware Joint Task Scheduling and Cache Management in Spark Clusters

Yinggen Xu, Liu Liu, and Zhijun Ding


Solving the Container Explosion Problem for Distributed High Throughput Computing

Tim Shaffer, Nicholas Hazekamp, Jakob Blomer, and Douglas Thain


Amoeba: QoS-Awareness and Reduced Resource Usage of Microservices with Serverless Computing

Zijun Li, Quan Chen, Shuai Xue, Tao Ma, Yong Yang, Zhuo Song, and Minyi Guo



SESSION 10: Machine Learning


Efficient I/O for Neural Network Training with Compressed Data

Zhao Zhang, Lei Huang, J. Gregory Pauloski, and Ian T. Foster


Not All Explorations Are Equal: Harnessing Heterogeneous Profiling Cost for Efficient MLaaS Training

Jun Yi, Chengliang Zhang, Wei Wang, Cheng Li, and Feng Yan


ASYNC: A Cloud Engine with Asynchrony and History for Distributed Machine Learning

Saeed Soori, Bugra Can, Mert Gurbuzbalaban, and Maryam Dehnavi


Benanza: Automatic μBenchmark Generation to Compute "Lower-bound" Latency and Inform Optimizations of Deep Learning Models on GPUs

Cheng Li, Abdul Dakkak, Jinjun Xiong, and Wen-mei Hwu


SESSION 11: GPU Computing

Adaptive Page Migration for Irregular Data-intensive Applications under GPU Memory Oversubscription

Debashis Ganguly, Ziyu Zhang, Jun Yang, and Rami Melhem


LOGAN: High-Performance GPU-Based X-Drop Long-Read Alignment
Alberto Zeni, Giulia Guidi, Marquita Ellis, Nan Ding, Marco D. Santambrogio, Steven Hofmeyr, Aydin Buluç, Leonid Oliker, and Katherine Yelick


Coordinated Page Prefetch and Eviction for Memory Oversubscription Management in GPUs

Qi Yu, Bruce R. Childers, Libo Huang, Cheng Qian, Hui Guo, and Zhiying Wang


A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs

Lingqi Zhang, Mohamed Wahib, Haoyu Zhang, and Satoshi Matsuoka



SESSION 12: Applications


DPF-ECC: Accelerating Elliptic Curve Cryptography with Floating-point Computing Power of GPUs

Lili Gao, Fangyu Zheng, Niall Emmart, Jiankuo Dong, Jingqiang Lin, and Charles Weems


Scalability Challenges of an Industrial Implicit Finite Element Code

Francois-Henry Rouet, Cleve Ashcraft, Jef Dawson, Roger Grimes, Erman Guleryuz, Seid Koric, Robert F. Lucas, James S. Ong, Todd Simons, and Ting-Ting Zhu


ETH: An Architecture for Exploring the Design Space of In-Situ Scientific Visualization

Greg Abram, Vignesh Adhinarayanan, Wu-chun Feng, David H. Rogers, and James P. Ahrens


Scaling Betweenness Approximation to Billions of Edges by MPI-based Adaptive Sampling

Alexander van der Grinten and Henning Meyerhenke

Parallel Technical Sessions 13, 14, 15, & 16

SESSION 13: Data Management


Improved Intermediate Data Management for MapReduce Frameworks

Haoyu Wang, Haiying Shen, Charles Reiss, Arnim Jain, and Yunqiao Zhang


Bandwidth-Aware Page Placement in NUMA
David Gureya, João Neto, Reza Karimi, João Barreto, Pramod Bhatotia, Vivien Quema, Rodrigo Rodrigues, Paolo Romano, and Vladimir Vlassov


HCompress: Hierarchical Data Compression for Multi-Tiered Storage Environments

Hariharan Devarajan, Anthony Kougkas, Luke Logan, and Xian-He Sun


FRaZ: A Generic High-Fidelity Fixed-Ratio Lossy Compression Framework for Scientific Floating-point Data

Robert R. Underwood, Sheng Di, Jon Calhoun, and Franck Cappello



SESSION 14: Storage & Caching


DELTA: Distributed Locality-Aware Cache Partitioning for Tile-based Chip Multiprocessors
Nadja Holtryd, Madhavan Manivannan, Per Stenström, and Miquel Pericas


Coordinated Management of Processor Configuration and Cache Partitioning to Optimize Energy under QoS Constraints
Mehrzad Nejat, Madhavan Manivannan, Miquel Pericas, and Per Stenström


StragglerHelper: Alleviating Straggling in Computing Clusters via Sharing Memory Access Patterns

Wenjie Liu, Ping Huang, and Xubin He



SESSION 15: Numerics


Evaluating the Numerical Stability of Posit Floating Point Arithmetic

Nicholas Buoncristiani, Sanjana Shah, David Donofrio, and John Shalf


Varity: Quantifying Floating-Point Variations in HPC Systems Through Randomized Testing

Ignacio Laguna


Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply

Da Yan, Wei Wang, and Xiaowen Chu



SESSION 16: IoT and Consensus


Data Collection of IoT Devices Using an Energy-Constrained UAV

Yuchen Li, Weifa Liang, Wenzheng Xu, and Xiaohua Jia


Argus: Multi-Level Service Visibility Scoping for Internet-of-Things in Enterprise Environments

Qian Zhou, Omkant Pandey, and Fan Ye


G-PBFT: A Location-based and Scalable Consensus Protocol for IoT-Blockchain Applications

LapHou Lao, Xiaohai Dai, Bin Xiao, and Songtao Guo


Byzantine Generalized Lattice Agreement

Giuseppe Antonio Di Luna, Emmanuelle Anceaume, and Leonardo Querzoni

THURSDAY - 21 May 2020


Virtual Session
9:00 to 10:00 AM US Central Daylight Time / 2:00 PM UTC

IPDPS Town Hall Meeting


See this page for details and a link to join the session.

Parallel Technical Sessions 17, 18, 19, & 20

SESSION 17: Graph Processing & Coding


A Heterogeneous PIM Hardware-Software Co-Design for Energy-Efficient Graph Processing

Yu Huang, Long Zheng, Pengcheng Yao, Jieshan Zhao, Xiaofei Liao, Hai Jin, and Jingling Xue


Spara: An Energy-Efficient ReRAM-based Accelerator for Sparse Graph Analytics Applications

Long Zheng, Jieshan Zhao, Yu Huang, Qinggang Wang, Zhen Zeng, Jingling Xue, Xiaofei Liao, and Hai Jin


Optimal Encoding and Decoding Algorithms for the RAID-6 Liberation Codes

Zhijie Huang, Hong Jiang, Zhirong Shen, Hao Che, Nong Xiao, and Ning Li


Sturgeon: Preference-aware Co-location for Improving Utilization of Power Constrained Computers

Pu Pang, Quan Chen, Deze Zeng, Chao Li, Jingwen Leng, Wenli Zheng, and Minyi Guo


SESSION 18: Parallel Algorithms


A High-Throughput Solver for Marginalized Graph Kernels on GPU

Yu-Hang Tang, Oguz Selvitopi, Doru Thom Popovici, and Aydin Buluc


Dynamic Graphs on the GPU

Muhammad A. Awad, Saman Ashkiani, Serban D. Porumbescu, and John D. Owens


Accelerating Parallel Hierarchical Matrix-Vector Products via Data Driven Sampling

Lucas Erlandson, Difeng Cai, Yuanzhe Xi, and Edmond Chow


NC Algorithms for Popular Matchings in One-Sided Preference Systems and Related Problems

Changyong Hu and Vijay Garg


SESSION 19: Performance, Power, and Energy


Smartly Handling Renewable Energy Instability in Supporting A Cloud Datacenter

Jiechao Gao, Haoyu Wang, and Haiying Shen


A Self-Optimized Generic Workload Prediction Framework for Cloud Computing

Vinodh Kumaran Jayakumar, Jaewoo Lee, In Kee Kim, and Wei Wang


SeeSAw: Optimizing Performance of In-Situ Analytics Applications under Power Constraints

Ivana Marincic, Venkatram Vishwanath, and Henry Hoffmann



SESSION 20: Resource Management


What does Power Consumption Behavior of HPC Jobs Reveal?
Tirthak Patel, Adam Wagenhäuser, Christopher Eibel, Timo Hönig, Thomas Zeiser, and Devesh Tiwari


Efficient Parallel Adaptive Partitioning for Load-balancing in Spatial Join

Jie Yang and Satish Puri


Union: An Automatic Workload Manager for Accelerating Network Simulation

Xin Wang, Misbah Mubarak, Yao Kang, Robert B. Ross, and Zhiling Lan


Auto-Tuning Parameter Choices using Bayesian Optimization

Harshitha Menon, Abhinav Bhatele, and Todd Gamblin

Parallel Technical Sessions 21, 22, 23, 24

SESSION 21: Runtime Systems


Inter-Job Scheduling of High-Throughput Material Screening Applications

Zhihui Du, Xining Hui, Yurui Wang, Jun Jiang, Jason Liu, Baokun Lu, and Chongyu Wang


Reservation and Checkpointing Strategies for Stochastic Jobs

Ana Gainaru, Brice Goglin, Valentin Honore, Guillaume Pallez, Padma Raghavan, Yves Robert, and Hongyang Sun


A Scheduling Approach to Incremental Maintenance of Datalog Programs

Shikha Singh, Sergey Madaminov, Michael Bender, Michael Ferdman, Ryan Johnson, Benjamin Moseley, Hung Ngo, Dung Nguyen, Soeren Olesen, Kurt Stirewalt, and Geoffrey Washburn


Dynamic Scheduling in Distributed Transactional Memory

Costas Busch, Maurice Herlihy, Miroslav Popovic, and Gokarna Sharma



SESSION 22: Performance Analysis


Learning Cost-Effective Sampling Strategies for Empirical Performance Modeling
Marcus Ritter, Alexandru Calotoiu, Sebastian Rinke, Thorsten Reimann, Torsten Hoefler, and Felix Wolf


The Case of Performance Variability on Dragonfly-based Systems

Abhinav Bhatele, Jayaraman J. Thiagarajan, Taylor Groves, Rushil Anirudh, Staci A. Smith, Brandon Cook, and David Lowenthal


Predicting and Comparing the Performance of Array Management Libraries

Donghe Kang, Oliver Ruebel, Suren Byna, and Spyros Blanas


Demystifying the Performance of HPC Scientific Applications on NVM-based Memory

Ivy B. Peng, Kai Wu, Jie Ren, Dong Li, and Maya Gokhale


SESSION 23: Communication


Packet-in Request Redirection for Minimizing Control Plane Response Time

Rui Xia, Haipeng Dai, Jiaqi Zheng, Hong Xu, Meng Li, and Guihai Chen


PCGCN: Partition-Centric Processing for Accelerating Graph Convolutional Network

Chao Tian, Lingxiao Ma, Zhi Yang, and Yafei Dai


ConMidbox: Consolidated Middleboxes Selection and Routing in SDN/NFV-Enabled Networks

Guiyan Liu, Songtao Guo, Pan Li, and Liang Liu


Scalable and Memory-Efficient Kernel Ridge Regression
Gustavo Chávez, Yang Liu, Pieter Ghysels, Xiaoye Sherry Li, and Elizaveta Rebrova



SESSION 24: Storage


SSDKeeper: Self-Adapting Channel Allocation to Improve the Performance of SSD Devices

Renping Liu, Xianzhang Chen, Yujuan Tan, Runyu Zhang, Liang Liang, and Duo Liu


FlashKey: A High-Performance Flash Friendly Key-Value Store

Madhurima Ray, Krishna Kant, Peng Li, and Sanjeev Trika


Pacon: Improving Scalability and Efficiency of Metadata Service through Partial Consistency

Yubo Liu, Yutong Lu, Zhiguang Chen, and Ming Zhao

Parallel Technical Sessions 25, 26, 27 & 28

SESSION 25: Program Analysis and Runtime Library


XPlacer: Automatic Analysis of Data Access Patterns on Heterogeneous CPU/GPU Systems
Peter Pirkelbauer, Pei-Hung Lin, Tristan Vanderbruggen, and Chunhua Liao


Improving Transactional Code Generation via Variable Annotation and Barrier Elision
João P.L. de Carvalho, Bruno C. Honorio, Alexandro Baldassin, and Guido Araujo


Evaluating Thread Coarsening and Low-cost Synchronization on Intel Xeon Phi
Hancheng Wu and Michela Becchi


AnySeq: A High Performance Sequence Alignment Library based on Partial Evaluation
André Müller, Bertil Schmidt, Andreas Hildebrandt, Richard Membarth, Roland Leißa, Matthis Kruse, and Sebastian Hack



SESSION 26: Scheduling


Analysis of a List Scheduling Algorithm for Task Graphs on Two Types of Resources

Lionel Eyraud-Dubois and Suraj Kumar


Optimal Convex Hull Formation on a Grid by Asynchronous Robots with Lights

Rory Hector, Ramachandran Vaidyanathan, Gokarna Sharma, and Jerry L. Trahan


On the Complexity of Conditional DAG Scheduling in Multiprocessor Systems
Alberto Marchetti-Spaccamela, Nicole Megow, Jens Schlöter, Martin Skutella, and Leen Stougie


Weaver: Efficient Coflow Scheduling in Heterogeneous Parallel Networks

Xin Sunny Huang, Yiting Xia, and T. S. Eugene Ng



SESSION 27: Fault Tolerance


Fault-Tolerant Containers Using NiLiCon
Diyu Zhou and Yuval Tamir


Aarohi: Making Real-Time Node Failure Prediction Feasible

Anwesha Das, Frank Mueller, and Barry Rountree


FP4S: Fragment-based Parallel State Recovery for Stateful Stream Applications

Pinchao Liu, Hailu Xu, Dilma Da Silva, Qingyang Wang, Sarker Tanzir Ahmed, and Liting Hu



SESSION 28: Multidisciplinary


Implementation and Evaluation of a Hardware Decentralized Synchronization Lock for MPSoCs
Maxime France-Pillois, Jérôme Martin, and Frederic Rousseau


Communication-Efficient Jaccard Similarity for High-Performance Distributed Genome Comparisons
Maciej Besta, Raghavendra Kanakagiri, Harun Mustafa, Mikhail Karasikov, Gunnar Ratsch, Torsten Hoefler, and Edgar Solomonik


Engineering Worst-Case Inputs for Pairwise Merge Sort on GPUs

Kyle Berney and Nodari Sitchinava


The Impossibility of Fast Transactions

Karolos Antoniadis, Diego Didona, Rachid Guerraoui, and Willy Zwaenepoel

FRIDAY - 22 May 2020




Visit individual workshop websites at the links shown.



Job Scheduling Strategies for Parallel Processing


Chapel Implementers and Users Workshop


Parallel and Distributed Scientific and Engineering Computing


Automatic Performance Tuning


Parallel Programming Models - Special Edition Machine Learning Performance and Security


Scalable Networks for Advanced Computing Systems


Parallel AI and Systems for the Edge


Resource Arbitration for Dynamic Runtimes


Scalable Deep Learning over Parallel and Distributed Infrastructures


High-Performance Storage


Parallel and Distributed Processing for Computational Social Systems




Best Paper Abstracts

XSP: Across-Stack Profiling and Analysis of Machine Learning Models on GPUs

Cheng Li, Abdul Dakkak, Jinjun Xiong, Wei Wei, Lingjie Xu, and Wen-mei Hwu

Abstract—There has been a rapid proliferation of machine learning/deep learning (ML) models and wide adoption of them in many application domains. This has made profiling and characterization of ML model performance an increasingly pressing task for both hardware designers and system providers, as they would like to offer the best possible system to serve ML models with the target latency, throughput, cost, and energy requirements while maximizing resource utilization. Such an endeavor is challenging as the characteristics of an ML model depend on the interplay between the model, framework, system libraries, and the hardware (or the HW/SW stack). Existing profiling tools are disjoint, however, and only focus on profiling within a particular level of the stack, which limits the thoroughness and usefulness of the profiling results.

This paper proposes XSP — an across-stack profiling design that gives a holistic and hierarchical view of ML model execution. XSP leverages distributed tracing to aggregate and correlate profile data from different sources. XSP introduces a leveled and iterative measurement approach that accurately captures the latencies at all levels of the HW/SW stack in spite of the profiling overhead. We couple the profiling design with an automated analysis pipeline to systematically analyze 65 state-of-the-art ML models. We demonstrate that XSP provides insights which would be difficult to discern otherwise. 


Exploring the Binary Precision Capabilities of Tensor Cores for Epistasis Detection

Ricardo Nobre, Aleksandar Ilic, Sergio Santander-Jiménez, and Leonel Sousa

Abstract—Genome-wide association studies are performed to correlate a number of diseases and other physical or even psychological conditions (phenotype) with substitutions of nucleotides at specific positions in the human genome, mainly single-nucleotide polymorphisms (SNPs). Some conditions, possibly because of the complexity of the mechanisms that give rise to them, have been identified to be more statistically correlated with genotype when multiple SNPs are jointly taken into account. However, the discovery of new associations between genotype and phenotype is exponentially slowed down by the increase of computational power required when epistasis, i.e., interactions between SNPs, is considered. This paper proposes a novel graphics processing unit (GPU)-based approach for epistasis detection that combines the use of modern tensor cores with native support for processing binarized inputs with algorithmic and target-focused optimizations. Using only a single mid-range Turing-based GPU, the proposed approach is able to evaluate 64.8 × 10¹² and 25.4 × 10¹² sets of SNPs per second, normalized to the number of patients, when considering 2-way and 3-way epistasis detection, respectively. This proposal is able to surpass the state-of-the-art approach by 6× and 8.2× in terms of the number of pairs and triplets of SNP allelic patient data evaluated per unit of time per GPU.


Understanding and Improving Persistent Transactions on Optane DC Memory

Pantea Zardoshti, Michael Spear, Aida Vosoughi, and Garret Swart

Abstract—Storing data structures in high-capacity byte-addressable persistent memory instead of DRAM or a storage device offers the opportunity to (1) reduce cost and power consumption compared with DRAM, (2) decrease the latency and CPU resources needed for an I/O operation compared with storage, and (3) allow for fast recovery as the data structure remains in memory after a machine failure. The first commercial offering in this space is Intel® Optane™ Direct Connect (Optane™ DC) Persistent Memory. Optane™ DC promises access time within a constant factor of DRAM, with larger capacity, lower energy consumption, and persistence. We present an experimental evaluation of persistent transactional memory performance, and explore how Optane™ DC durability domains affect the overall results. Given that neither of the two available durability domains can deliver performance competitive with DRAM, we introduce and emulate a new durability domain, called PDRAM, in which the memory controller tracks enough information (and has enough reserve power) to make DRAM behave like a persistent cache of Optane™ DC memory.

In this paper we compare the performance of these durability domains on several configurations of five persistent transactional memory applications. We find a large throughput difference, which emphasizes the importance of choosing the best durability domain for each application and system. At the same time, our results confirm that recently published persistent transactional memory algorithms are able to scale, and that recent optimizations for these algorithms lead to strong performance, with speedups as high as 6× at 16 threads.


CycLedger: A Scalable and Secure Parallel Protocol for Distributed Ledger via Sharding

Mengqian Zhang, JiChen Li, Zhaohua Chen, Hongyin Chen, and Xiaotie Deng

Abstract—Traditional public distributed ledgers have not been able to scale-out well and work efficiently. Sharding is deemed as a promising way to solve this problem. By partitioning all nodes into small committees and letting them work in parallel, we can significantly lower the amount of communication and computation, reduce the overhead on each node’s storage, as well as enhance the throughput of the distributed ledger. Existing sharding-based protocols still suffer from several serious drawbacks. The first thing is that all non-faulty nodes must connect well with each other, which demands a huge number of communication channels in the network. Moreover, previous protocols have faced great loss in efficiency in the case where the honesty of each committee’s leader is in question. At the same time, no explicit incentive is provided for nodes to actively participate in the protocol.

We present CycLedger, a scalable and secure parallel protocol for distributed ledger via sharding. Our protocol selects a leader and a partial set for each committee, who are in charge of maintaining intra-shard consensus and communicating with other committees, to reduce the amortized complexity of communication, computation, and storage on all nodes. We introduce a novel semi-commitment scheme between committees and a recovery procedure to prevent the system from crashing even when leaders of committees are malicious. To add incentive for the network, we use the concept of reputation, which measures each node’s trusty computing power. As nodes with a higher reputation receive more rewards, there is an encouragement for nodes with strong computing ability to work honestly to gain reputation. In this way, we strike out a new path to establish scalability, security, and incentive for the sharding-based distributed ledger.

Register today for access to the online proceedings. See the IPDPS website for registration details.
