|Full Day: 11:00am - 7:00pm|
|Room TBD||T2-FD||Large Graph Mining: Patterns, Tools, and Case Studies||Christos Faloutsos, Hanghang Tong
Carnegie Mellon University
|Room TBD||T3-FD||Log-Linear Models and Conditional Random Fields||Charles Elkan
University of California, San Diego
|Half Day (AM): 11:00am - 3:00pm|
|Room TBD||T5-AM||Evolution of Rule-based Information Extraction: From Grammars to Algebra||Rajasekar Krishnamurthy, Sriram Raghavan, Huaiyu Zhu
IBM Almaden Research Center
|Room TBD||T6-AM||Information and Knowledge Management with Graphs and Matrices||
Florida International University
University of Texas at Arlington
|Room TBD||T7-AM||Machine Learning for Information Retrieval||Rong Jin
Michigan State University
University of California Santa Cruz
|Half Day (PM): 3:30pm - 7:00pm|
|Room TBD||T11-PM||Algorithmic Challenges in Online Advertising||Deepak Agarwal, Deepayan Chakrabarti
|Room TBD||T14-PM||Web Search Log Analysis and User Behavior Modeling||Peiling Wang, Lei Wu
University of Tennessee
University of Wisconsin-Milwaukee
What do graphs look like? How do they evolve over time? How can we find patterns, anomalies, and regularities in them? How can we find influential nodes in a network? We will present theoretical results and algorithms as well as case studies from several real applications. Our emphasis is on the intuition behind each method and on guidelines for the practitioner.
The tutorial has the following parts: (a) statistical properties, models, and graph generators for static and evolving networks; (b) tools for the analysis of static and dynamic graphs, such as the Singular Value Decomposition, tensor decomposition for community detection, and HITS/PageRank; (c) proximity measurements on graphs: the main ideas for quantifying the closeness of two nodes, fast algorithms for computing proximity scores, and applications of proximity such as CenterPiece subgraphs, pattern matching, and trend analysis; (d) case studies of how a virus, information, or influence spreads through a network, how to find influential bloggers or nodes to target for viral marketing, how to find fraudsters on eBay, and how to find communities in graphs.
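To illustrate the kind of tool covered in part (b), here is a minimal power-iteration PageRank sketch; the function, the parameter defaults, and the toy graph are our own illustrations, not material from the tutorial itself.

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10, max_iter=100):
    """Power-iteration PageRank on a dense adjacency matrix (rows = out-links)."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # Column-stochastic transition matrix; dangling nodes jump uniformly.
    M = np.where(out_deg[:, None] > 0,
                 adj / np.maximum(out_deg, 1)[:, None],
                 1.0 / n).T
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = damping * (M @ r) + (1 - damping) / n
        if np.abs(r_next - r).sum() < tol:
            break
        r = r_next
    return r

# Toy 4-node example: node 0 is linked to by every other node.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [1, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
scores = pagerank(A)
print(scores.argmax())  # -> 0: the most linked-to node scores highest
```

Real graph-mining work uses sparse matrices rather than a dense array, but the fixed-point iteration is the same.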
Log-linear models are a far-reaching extension of logistic regression, while conditional random fields (CRFs) are a special case of log-linear models suitable for so-called structured learning tasks. Structured learning means learning to predict outputs that have internal structure. For example, recognizing handwritten words is more accurate when the correlations between neighboring letters are used to refine predictions. This tutorial will provide a simple but thorough introduction to these new developments in machine learning, which have great potential for many novel applications.
The tutorial will first explain what log-linear models are, with concrete examples but also with mathematical generality. Next, feature functions will be explained; these are the knowledge-representation technique underlying log-linear models. The tutorial will then present linear-chain CRFs from the point of view that they are a special case of log-linear models. The Viterbi algorithm that makes inference tractable for linear-chain CRFs will be covered, followed by a discussion of inference for general CRFs. The presentation will continue with a general derivation of the gradient of log-linear models; this is the mathematical foundation of all log-linear training algorithms. The tutorial will then discuss two important special-case CRF training algorithms: one that is a variant of the perceptron method, and another called contrastive divergence. Last but not least, the tutorial will introduce publicly available software for training and using CRFs and will explain a practical application of CRFs in hands-on detail.
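As a taste of the inference step mentioned above, the following is a minimal sketch of Viterbi decoding for a linear-chain CRF, assuming the per-position emission scores and label-transition scores (log-potentials) have already been computed; the toy numbers are illustrative, not from the tutorial.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Most-likely label path for a linear-chain CRF.

    emissions:   (T, K) score of each of K labels at each of T positions
    transitions: (K, K) score of moving from label i to label j
    """
    T, K = emissions.shape
    score = emissions[0].copy()          # best score ending in each label
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        # cand[i, j]: best path ending at t-1 in label i, then step i -> j
        cand = score[:, None] + transitions + emissions[t][None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # Recover the path by following backpointers from the best final label.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: 2 labels; transitions favor staying in the same label,
# but a strong emission at the last position flips the final label.
em = np.array([[2.0, 0.0], [0.1, 0.0], [0.0, 3.0]])
tr = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(viterbi(em, tr))  # -> [0, 0, 1]
```

The dynamic program is O(T·K²), which is what makes exact inference tractable for chains even though the number of label sequences is K^T.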
Information Extraction (IE) -- the task of distilling structured information ("annotations") from unstructured text -- is becoming an integral part of emerging data management applications. This tutorial presents an overview of the techniques and principles of rule-based information extraction. The tutorial will be broadly organized into three parts: (i) classical grammar-based approaches to rule-based IE, (ii) extended grammar-based systems that address some of the limitations of the classical approach, and finally, (iii) recent work that views extraction rules as declarative queries expressed in a formal algebraic framework. We will present a wide range of extraction tasks, from prototypical examples such as named-entity and relationship extraction to tasks such as review/opinion extraction that have recently received attention in the literature. Using rules drawn from real-world annotators, we will motivate the approach developed in the early 1990s (in systems such as FASTUS, TextPro, and LaSIE) of using cascading grammars to express rules as regular expressions over token sequences. We will present an overview of CPSL, a specification language for grammar-based rules that is at the heart of widely used extraction systems such as GATE. In the second part of the talk, using examples of complex extraction tasks, we will highlight certain fundamental limitations of the grammar-based approach with respect to expressivity and the treatment of overlapping annotations. We will present AFst, an extended grammar-based system that addresses some of these limitations by supporting rules over arbitrary graphs of annotations. Finally, the last part of the tutorial will present recent work that proposes an algebraic approach to rule-based extraction. Borrowing from database principles, rules are viewed as algebraic expressions specified using a declarative query over a database of annotation objects.
We describe SystemT, an algebraic information extraction system that enables scalable and efficient execution of extraction rules over large data sets. The tutorial will end with a summary of open problems and directions for future work. In addition, the material associated with the tutorial will include an extensive bibliography and pointers to downloadable extraction systems and benchmark data sets.
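The cascading-grammar idea of "regular expressions over token sequences" can be sketched as follows. This is a toy illustration, not the actual syntax of CPSL, GATE, AFst, or SystemT: one stage tags tokens, and a second stage matches a regular expression over the tag sequence.

```python
import re

# Stage 1: a dictionary and a character-pattern rule tag each token.
TITLES = {"Dr.", "Prof.", "Mr.", "Ms."}

def tag(token):
    if token in TITLES:
        return "TITLE"
    if re.fullmatch(r"[A-Z][a-z]+", token):
        return "CAP"
    return "OTHER"

def extract_person(tokens):
    """Stage 2: annotate <Person> ::= TITLE CAP+ over the token sequence."""
    # Encode each tag as one character so a plain regex can scan the sequence.
    tags = "".join({"TITLE": "T", "CAP": "C", "OTHER": "O"}[tag(t)] for t in tokens)
    return [" ".join(tokens[m.start():m.end()])
            for m in re.finditer(r"TC+", tags)]

tokens = "We met Dr. Alice Smith and Bob at IBM".split()
print(extract_person(tokens))  # -> ['Dr. Alice Smith']
```

Chaining several such stages, each matching over the annotations produced by the previous one, is what gives cascading grammars their name; the algebraic approach discussed in part (iii) instead expresses the same logic as operators over annotation tables.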
The field of information and knowledge management increasingly adapts methods and algorithms from advanced matrix computations, graph theory, and optimization. Prominent examples are spectral clustering, non-negative matrix factorization, Principal Component Analysis (PCA) and Singular Value Decomposition (SVD), graph-Laplacian-based semi-supervised learning, diffusion processes, etc. Graph- and matrix-based methods are rapidly becoming popular and significant in information and knowledge management for the following reasons: (1) they are amenable to rigorous analysis and benefit from the well-established knowledge in matrix computations, graph theory, and optimization; (2) compared to probabilistic and information-theoretic approaches, they are fast and easy to understand and implement; and (3) they are especially suitable for parallel and distributed-memory computers, and hence for large-scale problems such as searching and extracting patterns from the entire Web.
This tutorial will present recent advances in algorithms and methods that use graphs and matrices for modeling and analyzing massive, high-dimensional, and nonlinearly structured data. One main goal of the tutorial is to consolidate recent ideas on information and knowledge management using graphs and matrices. We will summarize open problems in this field and propose some future trends. We also wish to attract practitioners who seek novel ideas for applications.
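As a concrete instance of the graph-Laplacian machinery mentioned above, here is a minimal spectral-bipartitioning sketch: split a graph in two by the sign pattern of the Fiedler vector of the unnormalized Laplacian L = D - W. The function and the toy graph are our own illustration.

```python
import numpy as np

def spectral_bipartition(W):
    """Two-way graph cut from the unnormalized Laplacian L = D - W.

    The sign of the Fiedler vector (the eigenvector for the second-smallest
    eigenvalue of L) assigns each node to one of two clusters.
    """
    D = np.diag(W.sum(axis=1))
    L = D - W
    vals, vecs = np.linalg.eigh(L)   # eigh: eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)

# Two 3-node cliques joined by a single weak edge (2, 3).
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
labels = spectral_bipartition(W)
print(labels)  # nodes 0-2 land in one cluster, nodes 3-5 in the other
```

For k-way clustering one typically keeps the first k eigenvectors and runs k-means on the rows, and at Web scale the eigenvectors are computed with sparse iterative solvers rather than a dense eigendecomposition.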
This half-day tutorial presents a comprehensive introduction to the statistical challenges that arise in the emerging field of computational advertising. Stated simply, the problem is to find the best-matching ads from a large inventory for any given context. This context could be user queries to a search engine, where the ads are shown alongside search results ("sponsored search"), or webpage visits by users, with ads placed on the webpage itself ("content match"). The problem is germane to many other applications, such as web search and recommendation systems. Due to the massive scale of these problems, automated algorithmic approaches are the only feasible solutions.
The matching problem involves several algorithmic and statistical challenges, at the heart of which is the issue of constructing a match score for an ad in a given context. Classical approaches based on IR concepts cannot directly exploit rich clickstream data. However, learning from such retrospective data is also difficult due to biases in the ad-serving scheme under which the data were collected (e.g., positional bias, selection bias). Moreover, the click data are extremely sparse: a large fraction of (context, ad) pairs occur rarely, and click rates are generally small.
Our roadmap is as follows. We begin with a detailed description of the problem and its accompanying challenges. Then, after a short discussion of the classical IR approach, we provide a comprehensive overview of state-of-the-art methods based on clickstream data. For learning from retrospective data, we look at feature-based methods such as logistic regression, maximum entropy, and hierarchical modeling. We also consider online learning methods that use explore-exploit techniques to enhance the offline models. This will be followed by a discussion of several related problems, such as modeling non-stationarity over time, connections to collaborative filtering, indexing for fast retrieval at runtime, and runtime ad ranking based on both click rates and bids. We will also briefly discuss the applicability of these techniques to other related domains, such as web search and recommendation systems.
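The explore-exploit idea can be sketched with the simplest such policy, epsilon-greedy: serve the ad with the best current click-rate estimate most of the time, but occasionally serve a random ad so that rarely shown ads still accumulate data. This toy simulation is our own illustration, assuming made-up true click rates, not a method attributed to the tutorial.

```python
import random

def pick_ad(ctr_estimates, epsilon=0.1, rng=random):
    """Epsilon-greedy serving: exploit the best CTR estimate, but explore
    a uniformly random ad with probability epsilon."""
    ads = list(ctr_estimates)
    if rng.random() < epsilon:
        return rng.choice(ads)
    return max(ads, key=lambda a: ctr_estimates[a])

def update(ctr_estimates, counts, ad, clicked):
    """Running-mean update of the 0/1 click signal (incremental CTR estimate)."""
    counts[ad] += 1
    ctr_estimates[ad] += (clicked - ctr_estimates[ad]) / counts[ad]

# Simulated serving loop: ad "b" has the higher (hypothetical) true CTR.
true_ctr = {"a": 0.02, "b": 0.05}
est = {"a": 0.0, "b": 0.0}
counts = {"a": 0, "b": 0}
rng = random.Random(0)
for _ in range(20000):
    ad = pick_ad(est, epsilon=0.1, rng=rng)
    clicked = float(rng.random() < true_ctr[ad])
    update(est, counts, ad, clicked)
print(counts)  # the better ad "b" ends up served far more often
```

Production systems replace the running mean with the feature-based models named above (logistic regression, hierarchical models) so that estimates generalize across the sparse (context, ad) pairs.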
Web search logs capture valuable user-generated data as users naturally search a website. These logs can reveal what users were searching for and how they searched. However, despite being rich and informative, these transactional log records are unstructured and messy. Current IR strategies for handling structured documents (e.g., tf-idf, the vector space model) are not readily applicable to studying user query logs. Query corpora include a large number of search formulations that are short linguistic expressions, which reflects how the majority of users interact with the Web. With server-side logs, search session boundaries are undefined, which makes individual search sessions difficult to identify. Even though individual users can be identified in an intranet environment or with client-side logs, identifying individual search sessions remains a big challenge.
Using data mining strategies and technologies, we can process the data once into a data model that is simple and uniform enough to allow intensive exploration. We can then explore the data in different ways to build models of Web search behaviors. However, current data mining tools developed for business applications do not apply to transactional query logs. Transforming unstructured log data into a relational database for mining requires a deep understanding of both IR and data mining. In addition, innovative tools must be developed to support ongoing analysis, because new questions often emerge while current hypotheses are being studied. Although the literature on Web transaction log analysis has grown quickly over the past decade, the published research, with few exceptions, tends to focus on presenting analytical results with insufficient coverage of technical details to enable later researchers to replicate a study using the same or different data. Many tools developed by individual projects are not shared outside their research context. This gap must be filled so that findings can endure cross-examination and can be systematically compared.
This tutorial builds on research that the instructors have conducted on Web search behaviors over the past decade. Through a series of intensive research projects analyzing large amounts of transactional logs from different search environments, one of which is supported by a National Leadership on Research grant from the Institute for Museum and Library Services (IMLS), the instructors have gained in-depth knowledge of and insight into Web search behaviors, along with unique experience and skills in processing and analyzing large Web search transaction logs to model these behaviors. This tutorial will teach algorithms and technical implementations that participants can use in their own research designs and in Web applications that incorporate user observation and effective search support.
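One of the session-identification problems described above is often attacked with a simple idle-time heuristic: start a new session whenever the gap between a user's consecutive queries exceeds a cutoff. The sketch below is our own illustration of that heuristic; the 30-minute cutoff is a common convention in log studies, not necessarily the instructors' method.

```python
from datetime import datetime, timedelta

def split_sessions(log, gap=timedelta(minutes=30)):
    """Group one user's (timestamp, query) records into sessions, cutting a
    new session whenever the idle time between queries exceeds `gap`."""
    sessions = []
    for ts, query in sorted(log):
        if not sessions or ts - sessions[-1][-1][0] > gap:
            sessions.append([])          # idle too long: open a new session
        sessions[-1].append((ts, query))
    return sessions

t0 = datetime(2008, 10, 26, 9, 0)
log = [
    (t0, "cikm 2008"),
    (t0 + timedelta(minutes=2), "cikm 2008 tutorials"),
    (t0 + timedelta(hours=3), "weather napa"),   # long gap: new session
]
sessions = split_sessions(log)
print(len(sessions))  # -> 2
```

In practice this runs after user identification (by login, cookie, or IP heuristics), and the cutoff itself is a modeling choice that the analysis should report so that later studies can replicate it.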
From this tutorial, the audience will obtain a coherent overview of the diverse techniques developed for semi-supervised learning. Through concrete examples of applying machine learning techniques to information retrieval tasks, the audience will learn how to solve real-world IR problems with the presented techniques.