Tutorials

July 22, 2008: This page is a draft, please check back for the complete information.

Process Mining: Beyond Business Intelligence

Wil van der Aalst
Department of Mathematics and Computer Science
Eindhoven University of Technology, The Netherlands


Schedule: Full Day

Abstract:

More and more information about processes is recorded in the form of event logs. Equipment ranging from embedded systems to enterprise information systems are logging the behaviors that take place. This data explosion allows for the analysis of reality and the construction of models that reflect what actually happened. This can be used to diagnose and improve processes in a variety of domains.

Especially business processes involving human actors are interesting to diagnose because these processes are not controlled by software and there may be a gap between what people think that happens and what really happens. Process mining provides a versatile and extendible way to analyze such processes. Using process mining techniques it is possible to extract different types of models from event logs, e.g., the construction of a process and organizational models. Moreover, other techniques support the conversion and analysis of models. Using conformance checking techniques models can also be compared with reality and existing models can be enhanced with additional information, e.g., indicating bottlenecks in a process.

Many vendors claim to offer support for Business Intelligence (BI). Unfortunately, these BI tools are not intelligent at all. Moreover, these tools require input data of a particular type and a predefined model. Process mining overcomes these limitations and makes it possible to extract new knowledge from information systems in a truly intelligent way.

Process mining addresses the problem that most organizations have very limited information about what is actually happening in their organization. In practice, there is often a significant gap between what is prescribed or supposed to happen, and what actually happens. Only a concise assessment of the organizational reality, which process mining strives to deliver, can help in verifying process models, and ultimately be used in a process redesign effort.

This tutorial aims to provide an overview of process mining techniques and, using many real-life examples, it will be shown how particular techniques can be applied and what kind of insights such analyses provide.

top

Large Array Databases: Prospects and Challenges

Peter Baumann
Jacobs University Bremen, Germany

Schedule: Half Day (AM)

Abstract:

Since the launch of Google Earth at the latest it is clear that online services for multi-Terabyte satellite imagery are becoming integral part of our Internet experience. Actually, 2-D imagery is but the tip of the iceberg - the general concept of multi-dimensional spatio-temporal array data covers 1-D sensor time series, 2-D imagery, 3-D CAT scans (x/y/t) and exploration data (x/y/z), 4-D climate, ocean, and cosmological simulation results as well as life science microarray data (x/y/z/t), and many more. Sizes range well into multi-Terabyte object sizes, in future: multi- Petabyte.

Today's inexpensive high-capacity storage media indeed allow to give online access to such data. We face the appearance of navigational interfaces; the next step will consist of advancing from data stewardship to service stewardship based on open, flexible access interfaces for value-adding processing, analysis, and mining. This poses new challenges on data management and, in particular, the design of open, interoperable services.

In this tutorial we present an overview on the state of the art in semantic Web services for large- scale, multi-dimensional array data. All aspects are addressed, such as modeling, query languages, query optimization and parallelization, and storage management. High emphasis will be devoted to applications in geo, life science, and the Grid; to this end, real-life use cases will be presented and discussed which stem from both projects carried out and our standardization work in the Open GeoSpatial Consortium (OGC).

top

Large graph mining: patterns, tools and case studies

Christos Faloutsos
Hanghang Tong
Department of Computer Science
Carnegie Mellon University
Pittsburgh, PA, USA


Schedule: Full Day

Abstract:

How do graphs look like? How do they evolve over time? How can we find patterns, anomalies and regularities in them? How to find influential nodes in the network? We will present both theoretical results and algorithms as well as case studies on several real applications. Our emphasis is on the intuition behind each method, and on guidelines for the practitioner.

The tutorial has the following parts:(a) Statistical properties and models and graph generators of static and evolving networks. (b) Tools for the analysis of static and dynamic graphs, like the Singular Value Decomposition, tensor decomposition for community detection, HITS/PageRank etc. (c) Proximity measurements on graphs, the main ideas to quantify the closeness of two nodes of the graph, fast algorithms to compute the proximity scores, applications of proximity, like CenterPiece subgraphs, pattern match, trend analysis etc.(d)Case studies of how a virus or information or influence spreads through the network, how to find influential bloggers or nodes to target for viral marketing, how to find fraudsters on eBay, how to find communities on graphs.

top

Log-linear models and conditional random fields

Charles Elkan
Department of Computer Science and Engineering
University of California, San Diego
La Jolla, CA, USA


Schedule: Full Day

Abstract:

Log-linear models are a far-reaching extension of logistic regression, while con- ditional random Þelds (CRFs) are a special case of log-linear models suitable for so-called structured learning tasks. Structured learning means learning to predict outputs that have internal structure. For example, recognizing handwritten words is more accurate when the correlations between neighboring letters are used to reÞne predictions. This tutorial will provide a simple but thorough introduction to these new developments in machine learning that have great potential for many novel applications.

The tutorial will Þrst explain what log-linear models are, with with concrete examples but also with mathematical generality. Next, feature-functions will be explained; these are the knowledge-representation technique underlying log-linear models. The tutorial will then present linear-chain CRFs, from the point of view that they are a special case of log-linear models. The Viterbi algorithm that makes inference tractable for linear-chain CRFs will be covered, followed by a discus- sion of inference for general CRFs. The presentation will continue with a general derivation of the gradient of log-linear models; this is the mathematical foundation of all log-linear training algorithms. Then, the tutorial will discuss two impor- tant special-case CRF training algorithms, one that is a variant of the perceptron method, and another one called contrastive divergence. Last but not least, the tu- torial will introduce publicly available software for training and using CRFs, and will explain a practical application of CRFs with hands-on detail.

top

Electronic Contract Systems: Background and Research Challenges

Kamalakar Karlapalem
CDE, IIIT Hyderabad, India
P. Radha Krishna
SET Labs, Infosys Technologies Ltd.
Hyderabad, India


Schedule: Half Day (PM)

Abstract:

The goal of e-contracting is to transform a traditional contract document into an executable e-contract. E-contract development is a novel application area and it affects the confluence of various databases and information infrastructure technologies, including active conceptual modeling, active databases, formal models, event-distributed architecture, XML data management, text mining, workflows, process mining and web services. Since transactional aspects of electronic contracts are equally important, we cover the open issues and related work on commitment aspects of e-contracts and corresponding payment dependencies.

Modeling and deployment of e-contracts is a challenging task as it involves both technological and business aspects. Several e-contract frameworks and systems are available in the literature, which either deal with the automatic handling of paper contracts or provide monitoring and enactment of such contracts. As contracts evolve, it is also necessary to have a system, which models and enacts their evolution. Thus, the e-contracts problem requires a wholesome solution, not a loosely coupled solution of various components. This tutorial provides current technologies that support e-contracting and discusses open issues in e-contracts.

top

Evolution of Rule-based Information Extraction: from Grammars to Algebra

Rajasekar Krishnamurthy
Sriram Raghavan
Huaiyu Zhu
IBM Almaden Research Center
San Jose, CA, USA


Schedule: Half Day (AM)

Abstract:

Information Extraction (IE) -- the task of distilling structured information ("annotations") from unstructured text -- is becoming an integral part of emerging data management applications. This tutorial presents an overview of the techniques and principles of rule-based information extraction. The tutorial will be broadly organized into three parts: (i) classical grammar-based approaches to rule-based IE , (ii) extended grammar-based systems that address some of the limitations of the classical approach, and finally, (iii) recent work that views extraction rules as declarative queries expressed in a formal algebraic framework. We will present a wide range of extraction tasks, from prototypical examples such as named-entity and relationship extraction, to tasks such as review/opinion extraction that have recently received attention in the literature. Using rules drawn from real-world annotators, we will motivate the approach developed in the early 90's (in systems such as FASTUS, TextPro, LaSIE, etc.) of using cascading grammars to express rules as regular expressions over token sequences. We will present an overview of CPSL, a specification language for grammar-based rules, that is at the heart of widely used extraction systems like GATE. In the second part of the talk, using examples of complex extraction tasks, we will highlight certain fundamental limitations of the grammar-based approach with respect to expressivity and the treatment of overlapping annotations. We will present AFst, an extended grammar-based system, that addresses some of these limitations by supporting rules over arbitrary graphs of annotations. Finally, the last part of the tutorial will present recent work that proposes an algebraic approach to rule-based extraction. Borrowing from database principles, rules are viewed as algebraic expressions specified using a declarative query over a database of annotation objects. We describe SystemT, an algebraic information extraction system, that enable scalable and efficient execution of extraction rules over large data sets. The tutorial will end with a summary of open problems as well as directions for future work. In addition, the material associated with the tutorial will include an extensive bibliography and pointers to downloadable extraction systems and benchmark data sets.

top

Information and Knowledge Management with Graphs and Matrices

Fei Wang
Department of Automation, Tsinghua University
Beijing, China

Tao Li
Florida International University
Miami, FL, USA

Chris Ding
University of Texas at Arlington, TX, USA

Schedule: Half Day (AM)

Abstract:

The field of information and knowledge management increasingly adapt methods and algorithms from advanced matrix computations, graph theory and optimization. Prominent examples are spectral clustering, non-negative matrix factorization, Principal Component Analysis (PCA) and Singular Value Decomposition (SVD), graph-Laplacian based semi-supervised learning, diffusion process, etc. Graph and matrix-based methods are rapidly becoming popular and significant in information and knowledge management for the following reasons: (1) Graph and matrix based methods are amenable to vigorous analysis and benefits from the well-established knowledge in matrix computations, graph theory and optimization; (2) Compared to probabilistic and information theoretic approaches, graph and matrix based methods are fast, easy to understand and implement; and (3) Graph and matrix based methods are especially suitable for parallel and distributed-memory computers to solve large scale challenging problems such as searching and extracting patterns from the entire Web.

This tutorial will present recent advances in algorithms and methods using graphs and matrices for modeling and analyzing massive, high-dimensional, and nonlinear-structured data. One main goal of the tutorial is to consolidate the recent ideas on information and knowledge management using graphs and matrices. We will summarize some open problems contained in this field and propose some future trends. We also wish to attract practitioners who seek novel ideas for applications.

top

Algorithmic Challenges in Online Advertising

Deepak Agarwal
Deepayan Chakrabarti
Yahoo! Research
Santa Clara, CA, US


Schedule: Half Day (PM)

Abstract:

This half-day tutorial presents a comprehensive introduction to the statistical challenges that arise in the emerging Þeld of computational advertising. Stated simply, the problem is to Þnd the best matching ads from a large inventory to any given context. This context could be user queries to a search engine where the ads are shown alongside search results ("sponsored search"), or webpage visits by users with ads placed on the webpage itself ("content match"). This problem is germane to many other applications like web search, recommendation systems, etc. Due to the massive scale of these problems, automated algorithmic approaches are the only feasible solutions.

The matching problem involves several algorithmic and statistical challenges, at the heart of which is the issue of constructing a match score for an ad in a given context. Classical approaches based on IR concepts cannot directly exploit rich clickstream data. However, learning from such retrospective data is also difficult due to bias in the ad-serving scheme under which the data was collected (e.g., positional bias, selection bias, etc.). Moreover, the click data is extremely sparse; a large fraction of the (context, ad) pairs occur rarely, and click rates are generally small.

Our roadmap is as follows. We begin with a detailed description of the problem and its accompanying challenging. Then, after a short discussion of the classical IR approach, we provide a comprehensive overview of state-of-the-art methods based on clickstream data. For learning from retrospective data, we look at feature-based methods such as logistic regression, Maximum Entropy, and hierarchical modeling. We also consider online learning methods, that use explore-exploit techniques to enhance the offline models. This will be followed by a discussion of several related problems, such as the modeling of non-stationarity over time, connections to collaborative filtering, indexing for fast retrieval at runtime, and ad ranking at run-time based on both click-rates and bids. We will also brießy discuss the applicability of these techniques to other related domains like web search and recommendation systems.

top

Web Evolution: Models, Methods, and Mining

Sourav S Bhowmick
Nanyang Technological University, Singapore
Qiankun Zhao
AOL Laboratories, China

Schedule: Half Day (PM)

Abstract:

The Web offers access to large amounts of heterogeneous information and allows this information to evolve any time and in any way. In this tutorial session we explore the issues involving evolution management on the Web. We motivate the necessity for managing evolution and give an overview of the evolutionary features of the Web. Next, we identify various research issues involved in evolution management. Specifically, our discussion can be categorized into three main components: (a) study of evolution of Web page content and structure (b) study of evolution of web communities (c) mining of evolutionaryWeb data. The tutorial session also reveal various application domains of evolution management systems(such as social networks, blogs, and web event detection). We conclude by identifying potential research directions in this area.

top

Matrix and Tensor Models for Information Retrieval

Ajit Paul Singh
Geoffrey Gordon
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA, USA


Schedule: Half Day (PM)

Abstract:

Matrix factorization is one of the workhorse methods in data mining, machine learning, and information retrieval. In various decorated forms it is the heart of latent variable topic models and collaborative Þltering. The purpose of this tutorial is to present a statistically grounded view of matrix factorization methods, including weighted singular value decompositions, exponential family PCA, probabilistic latent semantic indexing, max-margin matrix factorization, non-negative matrix factorization, and matrix co-clustering. The differences between these, and many other, methods can be reduced to a small number of modeling choices. We review the state-of-the art in learning these highly non-convex models, with a focus on techniques suited for very large sparse matrices. Many techniques for matrix factorization have natural extensions to the case of multidimensional arrays (tensors) and multiple matrix factorization, where the goal is to mix information from multiple heterogeneous matrices, often as an approach for modeling relational data. We provide a gentle introduction to these techniques, with a focus on when to use a particular variant of low-rank factorization.

top

Web search log analysis and user behavior modeling

Peiling Wang
Lei Wu
University of Tennessee
Knoxville, TN, USA

Dietmar Wolfram
Unviersity of Wisconsin-Milwaukee
Milwaukee, WI, USA


Schedule: Half Day (PM)

Abstract:

Web search logs capture valuable user-generated data as users naturally search the Website. These log data can reveal what users were searching for and how they searched. However, despite rich and informative, these transactional log records are unstructured and messy. The current IR strategies for handling structured documents (e.g., tf-idf, vector space) are not readily applicable to studying user query log data. The query corpora include large amount of search formulations that are short linguistic expressions, which reflects how the majority of the users interact with the Web). With server-side logs, search session boundaries are undefined, which makes individual search sessions difficult to identify. Even though individual users can be identified in an intranet environment or using client-side logs, identifying individual search sessions remains a big challenge. Using data mining strategies and technologies, we can process data once into a data model that is simple and uniformed to allow intensive exploration. We can explore the data in different ways to build models of Web search behaviors. However, current data mining tools developed for business applications do not apply to transactional query logs. Transforming unstructured log data into a relational database for mining requires a deep understanding of both IR and data mining. In addition, innovative tools must be developed to support ongoing analysis because new questions often emerge when the current hypotheses are being studied. Although the literature on Web transaction log analysis is growing fast over the past decade, the published research works, with few exceptions, tend to focus on presenting analytical results with insufficient coverage of technical details to enable later researchers to duplicate the study using the same data or different data. Many tools developed by individual projects are not shared outside of its research context. This gap must be filled so that findings can endure cross-examination and can be systematically compared.

This tutorial is built on research that the instructors have conducted on studying Web search behaviors over the past decade. Through a series of programmed intensive research projects of analyzing large amount of transactional logs from different search environments, one of which is supported by a National Leadership on Research grant from the Institute for Museum and Library Services (IMLS), the instructors have gained in-depth knowledge of and insight into Web search behaviors and unique experiences and skills on processing and analyzing large Web search transactional logs to model these behaviors. This tutorial will teach the algorithms and technical implementations that participants can use for their own research design and for Web applications that incorporate user observation and effective search support.

top

Machine Learning for Information Retrieval

Rong Jin
Michigan State University
East Lansing, MI, USA

Yi Zhang
University of California, Santa Cruz
Santa Cruz, CA, USA


Schedule: Half Day (AM)

Abstract:

The audience will be able to obtain a coherent overview of the diverse techniques from this tutorial that are developed for semi-supervised learning. Through the concrete examples of applying machine learning techniques to information retrieval tasks, the audience will learn how to solve the real-world IR problems with the presented techniques.

top

Data Stream Query Processing

Divesh Srivastava
AT&T Labs-Research
NJ, USA


Schedule: Half Day (AM)

Abstract:

Measuring and monitoring complex, dynamic phenomena -- trafficevolution in internet and telephone communication infrastructures,usage of the web, email and newsgroups, movement of financialmarkets, atmospheric conditions -- produces highly detailed andvoluminous data streams, i.e., data that arrives as a series of``observations'', often very rapidly.The applications that need to analyze these massive data streamsrequire sophisticated query capabilities to identify, e.g.,unusual/anomalous activity such as network intrusion detectionor telecom fraud detection, based on intricate relationshipsbetween the values of the underlying data streams.

Manipulating stream data presents many technical challenges.This is an active research area in the database community,involving stream operators, SQL extensions, query optimizationmethods, operator scheduling techniques, etc., with the goal ofdeveloping general-purpose (e.g., NiagaraCQ, Stanford Stream,Telegraph, Aurora, Nile) and specialized (e.g., Gigascope) datastream management systems.

This tutorial provides a perspective on the functionality andscalability issues in data stream query processing, based on theGigascope data stream management system.

The target audience of this tutorial includes database and datamining researchers, software and database application developers,corporate IT managers and CIOs, who are interested inunderstanding query processing issues over massive data streams.

top

Virtualization and Databases: State of the Art and Challenges

Ashraf Aboulnaga
University of Waterloo
Canada

Cristiana Amza
University of Toronto
Canada

Kenneth Salem
University of Waterloo
Canada



Schedule: Half Day (AM)

Abstract:

There is currently a lot of interest in resource virtualization as an important technique for addressing the problems of manageability, reliability, and security in computer systems. Resource virtualization decouples the user's perception of hardware and software resources from the actual implementation of these resources. It adds a flexible and programmable layer of software between user applications (i.e., programs, such as database systems) and the resources that they use. This layer of software maps the virtual resources perceived by the applications to real physical resources. For example, storage virtualization makes it possible for organizations to store their data centrally on storage servers, while allowing user applications to use traditional operating system interfaces to read and write this data as if it were on a local disk. The mapping from the virtual "disks" perceived by applications to the actual data on the storage server is managed by the storage virtualization layer. Another example is machine virtualization, in which the resources (CPU, disk, memory, network, etc.) of one or more physical machines are partitioned into multiple virtual machines, and independent operating systems and applications are installed on each virtual machine.

The power of resource virtualization comes from the ability to manage the mapping from virtual resources to physical resources in the virtualization layer, and to change it as needed. This is of interest to the database research community because resource virtualization, and the flexible mapping it introduces, can help solve many important problems in the area of database system manageability. At the same time, virtualizing database systems poses some unique research challenges.

The objective of this tutorial is to provide an overview of resource virtualization, focusing on the potential it has for improving the deployment, management, and availability of database systems. We will present the unique challenges of virtualizing database systems and overview the ongoing research in related areas. We will also provide case studies of resource virtualization products and of how resource virtualization affects database systems. We will address the impact of resource virtualization on the interaction between database systems and other components of a typical software stack.

top
Gold Supporters

Silver Supporters

Bronze Supporters

Video Supporter


Book Exhibits

Local Organization