IPDPS 2001

IPDPS 2001 TUTORIAL 3
Monday, April 23rd, 1 PM - 5 PM

Parallel and Distributed Data Mining

Domenico Talia, ISI-CNR, Italy
Mohammed J. Zaki, Rensselaer Polytechnic Institute, USA

WHO SHOULD ATTEND
This tutorial is meant for researchers, professionals and advanced students who are interested in learning how parallelism is exploited in mining very large volumes of data. People who are dealing with algorithms and applications for the efficient extraction of knowledge using parallel and distributed computing will be interested in this topic.

25% Introductory | 50% Intermediate | 25% Advanced

COURSE DESCRIPTION
Data mining is the automated analysis of large volumes of data in search of relationships and knowledge that are implicit in the data and are 'interesting' in the sense that they affect an organization's practice. Data mining and knowledge discovery on large amounts of data can benefit from the use of parallel computers to improve both performance and the quality of data selection. The goal of this tutorial is to provide researchers and practitioners with an introduction to mining large data sets by exploiting techniques from parallel and distributed computing and high performance data management.

This tutorial analyzes the different forms of parallelism that can be exploited in data mining techniques and algorithms. For each data mining technique, different approaches to parallel implementation are presented and discussed. Furthermore, parallel and distributed data mining systems and algorithms are discussed and compared. Finally, current research issues and perspectives in high-performance data mining are outlined.

A. Introduction
   1. Data Mining and the data mining process
   2. SMP, MPP, Cluster-based architectures, MPI, etc.
   3. High performance data management and data warehouses

B. Parallel and Distributed Data Mining Algorithms
   1. Decision Trees
   2. Association Rules
   3. Sequential Patterns
   4. Clustering
   5. Other techniques (parallel GAs, Neural Nets, etc.)

C. Architectures and Systems for High Performance Data Mining
   1. Overview: Parallel and Distributed Systems
   2. HPDM in the KDD Process
   3. High Performance Mining/KDD Systems

D. Challenges and Future Research
   1. Algorithmic Issues
   2. Systems Issues
   3. Summary and References

LECTURERS
DOMENICO TALIA is a senior researcher at the ISI-CNR - Institute of Systems Analysis and Information Technology of the Italian National Research Council - and a lecturer at the University of Calabria. Talia's main research interests are in parallel computation, parallel programming languages, parallel data mining and cellular automata. Talia is a member of the Editorial Board of the IEEE Computer Society Press, a member of the Editorial Board of the Parallel and Distributed Practices journal, a member of the Advisory Board of Euro-Par, and a Distinguished Speaker in the IEEE Computer Society Tutorial Program. He has published three books and more than 90 papers in international journals and conference proceedings. He is a member of the ACM and the IEEE Computer Society.

MOHAMMED J. ZAKI is an Assistant Professor of Computer Science at Rensselaer Polytechnic Institute. His research interests include the design of efficient, scalable, and parallel algorithms and systems for various data mining tasks. He has published over 45 papers in this area, and he recently co-edited the book "Large-scale Parallel Data Mining," Springer-Verlag, 2000. He co-chaired the ACM SIGKDD Workshop on Large-scale Parallel KDD Systems (1999) and is a co-chair of the IEEE IPDPS International Workshop on High Parallel and Distributed Data Mining (2000, 2001). He gave a similar tutorial at the ACM SIGKDD Conference, 2000. He serves on many program committees, including those of the International Conference on Machine Learning (2000), the ACM SIGMOD Workshop on Data Mining and Knowledge Discovery (2000), and the ACM SIGKDD Workshop on Distributed and Parallel Knowledge Discovery (2000). He is a member of the ACM and the IEEE.