Machine Learning Summer School/Workshop 2009
University of Chicago
Preliminary schedule

All tutorials, plenary talks, and Parallel Talk I sessions will be held in the Assembly Hall.
Parallel Talk II sessions will be held in the Home Room.

Monday, June 1
Breakfast (8:00-8:20)
Welcome Speech by Stephen Smale (8:20-8:30)
Tutorial (8:30-12:00)

Title: Kernel Methods and Support Vector Machines
Speaker: John Shawe-Taylor
Slides
Video

Lunch Break (12:00-1:00)
Plenary Talk (1:00-2:00)

Title: Learning Phenomena Beyond Learning
Speaker: Leslie Valiant

Parallel Talk I (2:00-3:00)

Title: Geometric Inference for Probability Distributions
Speaker: Frederic Chazal
Slides
Video

Parallel Talk II (2:00-3:00)

Title: Zigzag Persistence
Speaker: Vin de Silva
Slides

Coffee Break (3:00-3:30)
Parallel Talk I (3:30-4:30)

Title: PAC-Bayes Analysis: Background and Applications
Speaker: John Shawe-Taylor
Slides
Video

Parallel Talk II (3:30-4:30)

Title: Harmonic and Geometric Multiscale Analysis of Data Sets in High Dimensions
Speaker: Mauro Maggioni
Slides

Parallel Talk I (4:30-5:30)

Title: TBA
Speaker: TBA

Parallel Talk II (4:30-5:30)

Title: Cut Locus and Topology from Point Data
Speaker: Tamal Dey
Slides
Video




Tuesday, June 2
Breakfast (8:00-8:30)
Tutorial (8:30-12:00)

Title: Graphical Models and Applications
Speaker: Yair Weiss
Video

Lunch Break (12:00-1:00)
Parallel Talk I (1:00-2:00)

Title: Euler Calculus and Topological Data Management
Speaker: Robert Ghrist
Slides
Video

Parallel Talk II (1:00-2:00)

Title: Visibility Constraints on Features of 3D Objects
Speaker: Ronen Basri

Coffee Break (2:00-2:30)
Parallel Talk I (2:30-3:30)

Title: Seeking Interpretable Models for High Dimensional Data
Speaker: Bin Yu
Slides
Video

Parallel Talk II (2:30-3:30)

Title: Learning Compressed Sensing
Speaker: Yair Weiss
Slides

Parallel Talk I (3:30-4:30)

Title: Learning Dictionaries for Image Analysis and Sensing
Speaker: Guillermo Sapiro
Slides
Video

Parallel Talk II (3:30-4:30)

Title: Learning Submanifolds of Euclidean Space
Speaker: Shmuel Weinberger

Poster Session (5:30-7:30)



Wednesday, June 3
Breakfast (8:00-8:30)
Tutorial (8:30-12:00)

Title: Geometric Methods and Manifold Learning
Speaker: Misha Belkin and Partha Niyogi
Slides
Video

Lunch Break (12:00-1:00)
Plenary Talk (1:00-2:00)

Title: Examining the Relative Influence of Familial, Genetic, and Environmental Covariate Information in Flexible Risk Models
Speaker: Grace Wahba
Slides
Video

Parallel Talk I (2:00-3:00)

Title: Vision and Hodge Theory
Speaker: Stephen Smale
Video

Parallel Talk II (2:00-3:00)

Title: TBA
Speaker: Afra Zomorodian
Slides

Coffee Break (3:00-3:30)
Parallel Talk I (3:30-4:30)

Title: Learning Deformable Models
Speaker: Yali Amit
Slides
Video

Parallel Talk II (3:30-4:30)

Title: Implementation of Hodge Theory for Simplicial Complexes
Speaker: Anil Hirani
Slides

Parallel Talk I (4:30-5:30)

Title: Statistical Classification and Cluster Processes
Speaker: Peter McCullagh
Slides
Video

Parallel Talk II (4:30-5:30)

Title: Analysis of Reproducing Kernel Spaces for Learning
Speaker: Ding-Xuan Zhou
Slides




Thursday, June 4
Breakfast (8:00-8:30)
Tutorial (8:30-12:00)

Title: Theory, Methods and Applications of Active Learning
Speaker: Rui Castro and Rob Nowak
Slides-Part I Slides-Part II Slides-Part III
Video

Lunch Break (12:00-1:00)
Plenary Talk (1:00-2:00)

Title: Sparse Representations from Inverse Problems to Pattern Recognition
Speaker: Stephane Mallat
Video

Parallel Talk I (2:00-3:00)

Title: Optimization Algorithms in Support Vector Machines
Speaker: Stephen Wright
Slides
Video

Parallel Talk II (2:00-3:00)

Title: Community Structure in Large Social and Information Networks
Speaker: Michael Mahoney
Slides

Coffee Break (3:00-3:30)
Parallel Talk I (3:30-4:30)

Title: Unsupervised Learning for Stereo Vision
Speaker: David McAllester
Slides
Video

Parallel Talk II (3:30-4:30)

Title: Active Learning and Selective Sensing
Speaker: Robert Nowak
Slides

Parallel Talk I (4:30-5:30)

Title: Learning Feature Hierarchies
Speaker: Yann LeCun
Slides
Video

Parallel Talk II (4:30-5:30)

Title: On the Dimension Complexity of DNFs
Speaker: Alexander Razborov
Slides




Friday, June 5
Breakfast (8:00-8:30)
Tutorial (8:30-12:00)

Title: Theory and Applications of Boosting
Speaker: Robert Schapire
Slides
Video

Lunch Break (12:00-1:00)
Parallel Talk I (1:00-2:00)

Title: On Surrogate Loss Functions, f-Divergences and Decentralized Detection
Speaker: Michael Jordan
Slides
Video

Plenary Talk (2:00-3:00)

Title: Generative Models for Image Analysis
Speaker: Stuart Geman
Slides
Video

Parallel Talk II (2:00-3:00)

Title: TBA
Speaker: TBA

Coffee Break (3:00-3:30)
Parallel Talk I (3:30-4:30)

Title: Similarity-based Classifiers: Problems and Solutions
Speaker: Maya Gupta
Slides
Video

Parallel Talk II (3:30-4:30)

Title: Object Detection with Discriminatively Trained Part Based Models
Speaker: Pedro Felzenszwalb
Slides

Parallel Talk I (4:30-5:30)

Title: TBA
Speaker: TBA

Parallel Talk II (4:30-5:30)

Title: What Do Unique Games, Structural Biology and the Low-Rank Matrix Completion Problem Have In Common?
Speaker: Amit Singer
Slides
Video




Saturday, June 6
Breakfast (8:00-8:30)
Tutorial (8:30-12:00)

Title: Bounding Excess Risk in Machine Learning
Speaker: Vladimir Koltchinskii
Slides
Video

Lunch Break (12:00-1:00)
Plenary Talk (1:00-2:00)

Title: Learning in Hierarchical Architectures: from Neuroscience to Derived Kernels
Speaker: Tommy Poggio
Slides
Video

Parallel Talk I (2:00-3:00)

Title: More Data Less Work: Runtime as a Monotonically Decreasing Function of Data Set Size
Speaker: Nati Srebro
Slides
Video

Parallel Talk II (2:00-3:00)

Title: Consistent Robust Methods for Logarithmic Time Prediction
Speaker: John Langford

Coffee Break (3:00-3:30)
Parallel Talk I (3:30-4:30)

Title: On Finding Low Error Clusterings
Speaker: Nina Balcan
Slides
Video

Parallel Talk II (3:30-4:30)

Title: Multi-Manifold Data Modeling via Spectral Curvature Clustering
Speaker: Gilad Lerman

Parallel Talk I (4:30-5:30)

Title: TBA
Speaker: Xiaochuan Pan
Video

Parallel Talk II (4:30-5:30)

Title: TBA
Speaker: TBA




Sunday, June 7



Monday, June 8
Breakfast (8:00-8:30)
Tutorial (8:30-12:00)

Title: Analysis of Clustering Procedures
Speaker: Sanjoy Dasgupta
Slides
Video

Lunch Break (12:00-1:00)
Plenary Talk (1:00-2:00)

Title: The Stability of the Contour of an Orientable 2-Manifold
Speaker: Herbert Edelsbrunner
Slides
Video

Parallel Talk I (2:00-3:00)

Title: On a Theory of Similarity Functions for Learning and Clustering
Speaker: Avrim Blum
Slides
Video

Parallel Talk II (2:00-3:00)

Title: Data Spectroscopy: Eigenspaces of Convolution Operators and Clustering
Speaker: Tao Shi
Slides

Coffee Break (3:00-3:30)
Parallel Talk I (3:30-4:30)

Title: A Bahadur Type Representation of the Linear Support Vector Machine and its Relative Efficiency
Speaker: Yoonkyung Lee
Slides
Video

Parallel Talk II (3:30-4:30)

Title: Two Models for Bayesian Supervised Dimension Reduction
Speaker: Sayan Mukherjee
Slides

Parallel Talk I (4:30-5:30)

Title: Fitting a Graph to Vector Data
Speaker: Dan Spielman
Slides
Video

Parallel Talk II (4:30-5:30)

Title: Semi-supervised and Active Learning with Dual Supervision
Speaker: Vikas Sindhwani




Tuesday, June 9
Breakfast (8:00-8:30)
Tutorial (8:30-12:00)

Title: An Overview of Compressed Sensing and Sparse Signal Recovery via L1 Minimization
Speaker: Emmanuel Candes
Video

Lunch Break (12:00-1:00)
Plenary Talk (1:00-2:00)

Title: Spectral Graph Theory, Linear Solvers, and Applications
Speaker: Gary Miller
Slides
Video

Parallel Talk I (2:00-3:00)

Title: Cheeger Cuts and p-Spectral Clustering
Speaker: Matthias Hein
Slides
Video

Parallel Talk II (2:00-3:00)

Title: TBA
Speaker: TBA

Coffee Break (3:00-3:30)
Parallel Talk I (3:30-4:30)

Title: Multiscale Geometry and Harmonic Analysis of Data Bases
Speaker: Ronald Coifman
Slides
Video

Parallel Talk II (3:30-4:30)

Title: A Multi-View Approach to Learning
Speaker: Sham Kakade

Parallel Talk I (4:30-5:30)

Title: Graphical Models for Speech Recognition: Articulatory and Audio-Visual Models
Speaker: Karen Livescu
Video

Parallel Talk II (4:30-5:30)

Title: TBA
Speaker: TBA




Wednesday, June 10
Breakfast (8:00-8:30)
Tutorial (8:30-12:00)

Title: Semi-supervised Learning
Speaker: Jerry Zhu
Slides
Video

Lunch Break (12:00-1:00)
Plenary Talk (1:00-2:00)

Title: Matrix Completion via Convex Optimization: Theory and Algorithms
Speaker: Emmanuel Candes
Video

Parallel Talk I (2:00-3:00)

Title: Drifting Games, Boosting and Online Learning
Speaker: Yoav Freund
Slides
Video

Parallel Talk II (2:00-3:00)

Title: Testing for Stationarity in Acoustic Signals: Parametric and Nonparametric Approaches
Speaker: Patrick Wolfe
Slides

Coffee Break (3:00-3:30)
Parallel Talk I (3:30-4:30)

Title: Recent Progress in Combinatorial Statistics
Speaker: Elchanan Mossel
Slides
Video

Parallel Talk II (3:30-4:30)

Title: An Inverse Problems Perspective on Learning Theory
Speaker: Lorenzo Rosasco

Parallel Talk I (4:30-5:30)

Title: MAP Estimation with Perfect Graphs
Speaker: Tony Jebara
Slides
Video

Parallel Talk II (4:30-5:30)

Title: Statistical Methods for Cortical Control of Hand Movement
Speaker: Greg Shakhnarovich




Thursday, June 11
Breakfast (8:00-8:30)
Poster Session (9:00-11:00)
Lunch Break (12:00-1:00)
Plenary Talk (1:00-2:00)

Title: Inference for Networks
Speaker: Peter Bickel
Slides
Video

Parallel Talk I (2:00-3:00)

Title: Cocktail Party Problem as Binary Classification
Speaker: DeLiang Wang
Slides
Video

Parallel Talk II (2:00-3:00)

Title: Variable-Selection Bayesian Regression: A Novel Prior Distribution and Applications to Genetic Association Studies
Speaker: Matthew Stephens

Coffee Break (3:00-3:30)
Parallel Talk I (3:30-4:30)

Title: Machine Learning in Acoustic Signal Processing
Speaker: Mark Hasegawa-Johnson
Slides
Video

Parallel Talk II (3:30-4:30)

Title: Supervised Spectral Latent Variable Models
Speaker: Cristian Sminchisescu

Parallel Talk I (4:30-5:30)

Title: Nonlinear Dimension Reduction by Spectral Connectivity Analysis and Diffusion Coarse-Graining
Speaker: Ann Lee
Video

Parallel Talk II (4:30-5:30)

Title: Speech as a Pattern of Points in Time
Speaker: Partha Niyogi


List of Abstracts

Yali Amit, University of Chicago
Title: Learning Deformable Models

Abstract: It is widely recognized that the fundamental building block in high level computer vision is the deformable template, which represents realizations of an object class in the image as noisy geometric instantiations of an underlying model. The instantiations typically come from a subset of some group centered at the identity which acts on the model or template. Thus in contrast to some machine learning applications where one tries to discover some unspecified manifold structure, here it is entirely determined by the group action and the model. Given a choice of group action and family of template models, a major challenge is to use a sample of images of the object to estimate the model and the distribution on the group. The primary obstacle is that the instantiations or group elements that produced each image are unobserved. I will describe a general formulation of this problem and then show some practical applications to object detection and recognition.



Nina Balcan, Microsoft
Title: On Finding Low Error Clusterings

Abstract: There has been substantial work on approximation algorithms for clustering data under distance-based objective functions such as k-median, k-means, and min-sum objectives. This work is fueled in part by the hope that approximating these objectives well will indeed yield more accurate solutions. That is, for problems such as clustering proteins by function, or clustering images by subject, there is some unknown correct "target" clustering and the implicit assumption is that clusterings that are approximately optimal in terms of these distance-based measures are also approximately correct in terms of error with respect to the target. In this work we show that if we make this implicit assumption explicit ---that is, if we assume that any c-approximation to the given clustering objective Phi is epsilon-close to the target --- then we can produce clusterings that are O(epsilon)-close to the target, even for values c for which obtaining a c-approximation is NP-hard. In particular, for the k-median, k-means, and min-sum objectives, we show that we can achieve this guarantee for any constant c > 1. Our results show how, for these clustering objectives, one can get much better guarantees on accuracy than those implied by results obtained so far in the approximation literature, by wisely using all the available information for the problem at hand.



Ronen Basri, TTI-C
Title: Visibility Constraints on Features of 3D Objects

Abstract: To recognize three-dimensional objects it is important to model how their appearances can change due to changes in viewpoint. A key aspect of this involves understanding which object features can be simultaneously visible under different viewpoints. We address this problem in an image-based framework, in which we use a limited number of images of an object taken from unknown viewpoints to determine which subsets of features might be simultaneously visible in other views. This leads to the problem of determining whether a set of images, each containing a set of features, is consistent with a single 3D object. We assume that each feature is visible from a disk of viewpoints on the viewing sphere. In this case we show the problem is NP-hard in general, but can be solved efficiently when all views come from a circle on the viewing sphere. We also give iterative algorithms that can handle noisy data and converge to locally optimal solutions in the general case. Our techniques can also be used to recover viewpoint information from the set of features that are visible in different images. We show that these algorithms perform well both on synthetic data and images from the COIL dataset. Joint work with Pedro Felzenszwalb, Ross Girshick, David Jacobs, and Caroline Klivans.



Misha Belkin, Ohio State University
Title: Geometric Methods and Manifold Learning

Abstract:



Peter Bickel, University of California, Berkeley
Title: Inference for Networks

Abstract: A great deal of attention has recently been paid to determining sub-communities in an unlabelled graph on the basis of relations (corresponding to edges) between individuals (corresponding to vertices), for probabilistic ergodic models of infinite unlabelled graphs (Newman, SIAM Review 2003; Airoldi et al., JMLR 2008; Leskovec, Kleinberg et al., SIGKDD 2005). We derive consistency properties of the Newman-Girvan index, and develop an index with better consistency properties and better performance on simulated data sets. This is joint work with Aiyou Chen.



Avrim Blum, Carnegie Mellon University
Title: On a Theory of Similarity Functions for Learning and Clustering

Abstract: Kernel methods have become powerful tools in machine learning. They perform well in many applications, and there is also a well-developed theory of what makes a given kernel useful for a given learning problem. However, this theory requires viewing kernels as implicit (and often difficult to characterize) maps into high-dimensional spaces. In this talk I will describe work on developing a theory that just views a kernel as a measure of similarity between data objects, and describes the usefulness of a given kernel (or more general similarity function) in terms of fairly intuitive, direct properties of how the similarity function relates to the task at hand, without need to refer to any implicit spaces. I will also talk about an extension of this framework to learning from purely unlabeled data, i.e., clustering. In particular, one can ask how much stronger the properties of a similarity function should be (in terms of its relation to the unknown desired clustering) so that it can be used to cluster well: to learn well without any label information at all. We find that if we are willing to relax the objective a bit (for example, allow the algorithm to produce a hierarchical clustering that we will call successful if some pruning is close to the desired clustering), then this question leads to a number of interesting graph-theoretic and game-theoretic properties that are sufficient to cluster well. This work can be viewed as defining a kind of PAC model for clustering. (This talk is based on joint work with Maria-Florina Balcan, Santosh Vempala, and Nati Srebro.)



Emmanuel Candes, California Institute of Technology
Tutorial Title: An Overview of Compressed Sensing and Sparse Signal Recovery via L1 Minimization

Abstract: In many applications, one often has fewer equations than unknowns. While this seems hopeless, the premise that the object we wish to recover is sparse or nearly sparse radically changes the problem, making the search for solutions feasible. This lecture will introduce sparsity as a key modeling tool together with a series of little miracles touching on many areas of data processing. These examples show that finding *that* solution to an underdetermined system of linear equations with minimum L1 norm often returns the "right" answer. Further, there is by now a well-established body of work going by the name of compressed sensing, which asserts that one can exploit sparsity or compressibility when acquiring signals of general interest, and that one can design nonadaptive sampling techniques that condense the information in a compressible signal into a small amount of data---in fewer data points than were thought necessary. We will survey some of these theories and trace back some of their origins to early work done in the 50's. Because these theories are broadly applicable in nature, the tutorial will move through several application areas that may be impacted, such as signal processing, bio-medical imaging, machine learning and so on. Finally, we will discuss how these theories and methods have far-reaching implications for sensor design and other types of designs.
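
As a concrete illustration of the L1 story above, here is a minimal Python sketch of basis pursuit: recovering a sparse vector from underdetermined measurements by solving min ||x||_1 subject to Ax = b as a linear program. The problem sizes, the random Gaussian A, and the solver choice are illustrative assumptions, not details from the talk.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m, k = 200, 80, 8                      # ambient dim, measurements, sparsity
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
b = A @ x_true                            # underdetermined: m < n equations

# min ||x||_1 s.t. Ax = b, via the split x = u - v with u, v >= 0
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=b,
              bounds=[(0, None)] * (2 * n))
x_hat = res.x[:n] - res.x[n:]
print("recovery error:", np.linalg.norm(x_hat - x_true))  # near zero here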

Plenary Title: Matrix Completion via Convex Optimization: Theory and Algorithms

Abstract: This talk considers a problem of considerable practical interest: the recovery of a data matrix from a sampling of its entries. In partially filled out surveys, for instance, we would like to infer the many missing entries. In the area of recommender systems, users submit ratings on a subset of entries in a database, and the vendor provides recommendations based on the user's preferences. Because users only rate a few items, we would like to infer their preference for unrated items (this is the famous Netflix problem). Formally, suppose that we observe m entries selected uniformly at random from a matrix. Can we complete the matrix and recover the entries that we have not seen? We show that perhaps surprisingly, one can recover low-rank matrices exactly from what appear to be highly incomplete sets of sampled entries; that is, from a minimally sampled set of entries. Further, perfect recovery is possible by solving a simple convex optimization program, namely, a convenient semidefinite program. A surprise is that our methods are optimal and succeed as soon as recovery is possible by any method whatsoever, no matter how intractable; this result hinges on powerful techniques in probability theory. Time permitting, we will also present a very efficient algorithm based on iterative singular value thresholding, which can complete matrices with about a billion entries in a matter of minutes on a personal computer.
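
The iterative singular value thresholding idea mentioned at the end can be sketched in a few lines of Python: alternately re-impose the observed entries and shrink the singular values of the current guess. The threshold tau, the sampling rate, and the iteration count below are illustrative assumptions, not the tuned algorithm from the talk.

import numpy as np

rng = np.random.default_rng(1)
n, r = 100, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, n))  # rank-r target
mask = rng.random((n, n)) < 0.4                                # observed entries

def svt(Z, tau):
    # soft-threshold the singular values of Z at level tau
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

X = np.zeros((n, n))
for _ in range(200):
    # keep the observed entries of M, shrink the spectrum elsewhere
    X = svt(np.where(mask, M, X), tau=2.0)
print("relative error:", np.linalg.norm(X - M) / np.linalg.norm(M))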



Rui Castro, Columbia University
Tutorial Title: Theory, Methods and Applications of Active Learning

Abstract: This tutorial will cover a wide range of topics related to active learning. Broadly speaking, active learning refers to sequential data selection and inference procedures that actively seek out highly informative data, rather than relying solely on random training examples or non-adaptively selected data. We will review basic theory and algorithms for active learning, and discuss applications in classification, regression, and sparse signal recovery. The tutorial will be organized into three main parts. The first part will introduce the problem of active learning, describe a few standard problems and algorithms, and pose the fundamental challenges of active learning. The second part is an in-depth "case-study" of active learning for binary classification, a problem where perhaps the most is known. In this context, we will show how theoretical analysis can quantify the capabilities and limitations of active learning. The final part will discuss future directions and emerging applications of active learning, including applications in cognitive and network sciences.



Frederic Chazal, INRIA, France
Title: Geometric Inference for Probability Distributions

Abstract: Data often comes in the form of a point cloud sampled from an unknown compact subset of Euclidean space. The general goal of geometric inference is then to recover geometric and topological features (Betti numbers, curvatures, ...) of this subset from the approximating point cloud data. In recent years, it appeared that the study of distance functions allows many of these questions to be addressed successfully. However, one of the main limitations of this framework is that it does not cope well with outliers or with background noise. In this talk, we will show how to extend the framework of distance functions to overcome this problem. Replacing compact subsets by measures, we will introduce a notion of distance function to a probability distribution in $\mathbb{R}^n$. These functions share many properties with classical distance functions, which makes them suitable for inference purposes. In particular, by considering appropriate level sets of these distance functions, it is possible to associate in a robust way topological and geometric features to a probability measure. If time permits, we will also mention a few other potential applications of this framework.
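
A standard empirical version of such a distance-to-measure function averages squared distances to the k nearest sample points, which is what makes it robust to outliers. The following short Python sketch, with an assumed mass parameter k and toy data, illustrates the idea; it is not code from the talk.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
theta = rng.uniform(0, 2 * np.pi, 500)
cloud = np.c_[np.cos(theta), np.sin(theta)]              # samples from a circle
cloud = np.vstack([cloud, rng.uniform(-2, 2, (50, 2))])  # plus background noise
tree = cKDTree(cloud)

def dist_to_measure(x, k=30):
    d, _ = tree.query(x, k=k)        # distances to the k nearest sample points
    return np.sqrt(np.mean(d ** 2))  # root mean squared distance

# small near the circle, larger at its center, despite the outliers
print(dist_to_measure([1.0, 0.0]), dist_to_measure([0.0, 0.0]))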



Ronald Coifman, Yale University
Title: Multiscale Geometry and Harmonic Analysis of Data Bases

Abstract: We describe a method for geometrization of databases such as questionnaires or lists of sensor outputs. Interlacing multiscale diffusion geometries of rows and columns of a data matrix results in a pair of "language ontologies" which are mutually supportive (certain words are used in certain contexts). This mutual geometry serves as a structure for harmonic analysis and signal processing on the database. We will illustrate on databases of audio (music), psychological questionnaires, science documents, images and many others. Joint work with Matan Gavish, Yale University.



Sanjoy Dasgupta, University of California, San Diego
Title: Analysis of Clustering Procedures

Abstract: Clustering procedures are notoriously short on rigorous guarantees. In this tutorial, I will cover some of the types of analysis that have been applied to clustering, and emphasize open problems that remain.
Part I. Approximation algorithms for clustering. Two popular cost functions for clustering are k-center and k-means; both are NP-hard to optimize exactly. (a) Algorithms for approximately optimizing these cost functions (one such algorithm is sketched below). (b) Hierarchical versions of such clusterings. (c) Clustering when data is arriving in a streaming or online manner.
Part II. Analysis of popular heuristics. (a) How good is k-means? How fast is it? (b) Probabilistic analysis of EM. (c) What approximation ratio is achieved by agglomerative heuristics for hierarchical clustering?
Part III. Statistical theory in clustering. What aspects of the underlying data distribution are captured by the clustering of a finite sample from that distribution? (a) Consistency of k-means. (b) The cluster tree and linkage algorithms. (c) Rates for vector quantization.
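
The sketch promised in Part I(a): farthest-first traversal (Gonzalez's algorithm), the classical 2-approximation for the k-center objective. This is a textbook algorithm written in Python as illustration, not material taken from the tutorial slides.

import numpy as np

def k_center(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = [int(rng.integers(len(X)))]
    d = np.linalg.norm(X - X[centers[0]], axis=1)  # distance to nearest center
    for _ in range(k - 1):
        nxt = int(np.argmax(d))                    # farthest point so far
        centers.append(nxt)
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return centers, d.max()                        # centers and k-center cost

X = np.random.default_rng(3).standard_normal((500, 2))
print(k_center(X, k=5))   # cost is within a factor 2 of the optimum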



Vin de Silva, Pomona College
Title: Zigzag Persistence

Abstract: Zigzag persistence is a new methodology for studying persistence of topological features across a family of spaces or point-cloud data sets. Building on classical results about quiver representations, zigzag persistence generalizes the highly successful theory of persistent homology and addresses several situations which are not covered by that theory. I will present theoretical and algorithmic foundations with a view towards applications in topological statistics. As an important example, I discuss a particular zigzag sequence derived from the levelsets of a real-valued function on a topological space. A powerful structure theorem, called the Pyramid Theorem, establishes a connection between this "levelset zigzag persistence" and the extended persistence of Cohen-Steiner, Edelsbrunner and Harer. This theorem resolves an open question concerning the symmetry of extended persistence. Moreover, the interval persistence of Dey and Wenger can be understood in this context; in some sense it carries three-quarters of the information produced by the other two theories. This is joint work with Gunnar Carlsson and Dmitriy Morozov.



Tamal Dey, Ohio State University
Title: Cut Locus and Topology from Point Data

Abstract: A cut locus of a point p in a compact Riemannian manifold M is defined as the set of points where minimizing geodesics issued from p stop being minimizing. It is known that a cut locus contains most of the topological information of M. Our goal is to utilize this property of cut loci to decipher the topology of M from a point sample. Recently it has been shown that Rips complexes can be built from a point sample P of M systematically to compute the Betti numbers, the ranks of the homology groups of M. Rips complexes can be computed easily and therefore are favored over others such as the restricted Delaunay, alpha, Cech, and witness complexes. However, the sizes of the Rips complexes tend to be large. Since the dimension of a cut locus is lower than that of the manifold M, a subsample of P approximating the cut locus is usually much smaller in size and hence admits a relatively small Rips complex.
In this talk we explore the above approach for point data sampled from surfaces embedded in any high dimensional Euclidean space. We present an algorithm that computes a subsample P' of a sample P of a 2-manifold where P' approximates a cut locus. Empirical results show that the first Betti number of M can be computed from the Rips complexes built on these subsamples. The sizes of these Rips complexes are much smaller than the one built on the original sample of M.
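
For intuition about the objects in this abstract, here is a tiny Python sketch that builds a Rips complex at one scale and reads off the Betti numbers b0 and b1 from ranks of real boundary matrices. It is a brute-force toy for small point sets only (the talk's point is precisely how to avoid large complexes), and the scale parameter and data are illustrative.

import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist, squareform

def betti_rips(X, eps):
    D, n = squareform(pdist(X)), len(X)
    edges = [e for e in combinations(range(n), 2) if D[e] < eps]
    tris = [(i, j, k) for i, j, k in combinations(range(n), 3)
            if max(D[i, j], D[i, k], D[j, k]) < eps]
    eidx = {e: m for m, e in enumerate(edges)}
    d1 = np.zeros((n, len(edges)))               # edge boundary: (i,j) -> j - i
    for m, (i, j) in enumerate(edges):
        d1[i, m], d1[j, m] = -1, 1
    d2 = np.zeros((len(edges), len(tris)))       # triangle boundary
    for m, (i, j, k) in enumerate(tris):
        d2[eidx[(i, j)], m], d2[eidx[(j, k)], m], d2[eidx[(i, k)], m] = 1, 1, -1
    r1 = np.linalg.matrix_rank(d1) if edges else 0
    r2 = np.linalg.matrix_rank(d2) if tris else 0
    return n - r1, len(edges) - r1 - r2          # (b0, b1)

theta = np.linspace(0, 2 * np.pi, 20, endpoint=False)
print(betti_rips(np.c_[np.cos(theta), np.sin(theta)], eps=0.7))  # (1, 1): one loop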



Herbert Edelsbrunner, Duke University
Title: The Stability of the Contour of an Orientable 2-Manifold

Abstract: Think of the view of the boundary of a solid shape as a projection of a 2-manifold to R^2. Its apparent contour is the projection of the critical points. Generalizing the projection to smooth mappings of a 2-manifold to R^2, we get the contour as the image of the points at which the derivative is not surjective. Measuring difference with the erosion distance (the Hausdorff distance between the complements), we prove that the contour is stable. Along the way, we introduce the by now well established method of persistent homology, including the stability of its diagrams, as well as an extension using zigzag modules. Joint work with Dmitriy Morozov and Amit Patel.



Pedro Felzenszwalb, University of Chicago
Title: Object Detection with Discriminatively Trained Part Based Models

Abstract: Object recognition is one of the fundamental challenges in computer vision. I will consider the problem of detecting objects of a generic category, such as people or cars, in static images. This is a difficult problem because objects in such categories can vary greatly in appearance. For example, people wear different clothes and take a variety of poses while cars come in various shapes and colors. We have developed an object detection system based on mixtures of multiscale deformable part models. The system is both highly efficient and accurate, achieving state-of-the-art results on the PASCAL object detection challenges. We train our models using a discriminative procedure that we call latent SVM. A latent SVM leads to a non-convex training problem. However, a latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function. I will also discuss how this framework may eventually make possible the effective use of models based on rich visual grammars. Joint work with Ross Girshick, David McAllester and Deva Ramanan.



Yoav Freund, University of California, San Diego
Title: Drifting Games, Boosting and Online Learning

Abstract: Drifting games provide a new and useful framework for analyzing learning algorithms. In this talk I will present the framework and show how it is used to derive a new boosting algorithm, called RobustBoost, and a new online prediction algorithm, called NormalHedge. I will present two sets of experiments using these algorithms on synthetic and real world data. The first set demonstrates that RobustBoost can learn from mislabeled training data. The second demonstrates an application of NormalHedge to tracking moving objects.



Stuart Geman, Brown University
Title: Generative Models for Image Analysis

Abstract: A probabilistic grammar for the grouping and labeling of parts and objects, when taken together with pose and part-dependent appearance models, constitutes a generative scene model and a Bayesian framework for image analysis. To the extent that the generative model generates features, as opposed to pixel intensities, the "inverse" or "posterior distribution" on interpretations given images is based on incomplete information; feature vectors are generally insufficient to recover the original intensities. I will argue for fully generative scene models, meaning models that in principle generate actual digital pictures. I will outline an approach to the construction of fully generative models through an extension of context-sensitive grammars and a re-formulation of the popular template models for image fragments. Mostly I will focus on the problem of learning template models from image data. Since the model is fully specified (generative), at the pixel level, the templates can be learned by maximum likelihood. A training set of eyes, for example, yields an ensemble of left and right eyes, of familiar and natural character, but not actually coming from any particular individuals in the training set. The upshot is a mixture distribution on image patches, consisting of a set of templates and a set of conditional patch distributions, one for each template. One way to test the model is to examine samples. I will show how to sample from the mixture distribution and I will show sample sets of eyes, mouths, and generic "background." Another way to test the model is to use it for detection, recognition, or classification. I will show the results of a test on ethnic classification based on the eye region of faces.



Robert Ghrist, University of Pennsylvania
Title: Euler Calculus and Topological Data Management

Abstract: This talk covers the basics of an integral calculus based on Euler characteristic, and its utility in data problems, particularly in aggregation of redundant data and inverse problems over networks. This calculus is a blend of integral-geometric and sheaf-theoretic techniques, and leads to surprisingly practical algorithms and computations. Qualitative versions of integral transforms for signal processing will be stressed.



Maya Gupta, University of Washington, Seattle
Title: Similarity-based Classifiers: Problems and Solutions

Abstract: Similarity-based learning assumes one is given similarities between samples to learn from, and can be considered a special case of graph-based learning where the graph is given and fully-connected. Such problems arise frequently in computer vision, bioinformatics, and problems involving human judgment. We will review the field of similarity-based classification and describe the main problems encountered in adapting standard algorithms for this problem, including different approaches to approximating indefinite similarities by kernels. We will motivate why local methods lessen the indefinite similarity problem, and show that a kernelized linear interpolation and local kernel ridge regression can be profitably applied to such similarity-based classification problems by framing them as weighted nearest-neighbor classifiers. Eight real datasets will be used to compare state-of-the-art methods and illustrate the open challenges in this field.
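
To make the weighted nearest-neighbor framing concrete, here is a small Python sketch of similarity-based k-NN classification that works directly from a similarity matrix, with no feature vectors or kernel; the similarity function and data below are invented for illustration.

import numpy as np

def similarity_knn(S, y_train, k=5):
    # S: (n_test, n_train) similarities; y_train: integer class labels
    preds = []
    for s in S:
        nn = np.argsort(-s)[:k]                   # k most similar training items
        votes = np.bincount(y_train[nn], weights=np.maximum(s[nn], 0),
                            minlength=y_train.max() + 1)
        preds.append(int(np.argmax(votes)))       # similarity-weighted vote
    return np.array(preds)

rng = np.random.default_rng(7)
X = rng.standard_normal((60, 2))
y = (X[:, 0] > 0).astype(int)
S = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))  # toy similarities
print(similarity_knn(S[50:, :50], y[:50]), y[50:])           # predicted vs true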



Mark Hasegawa-Johnson, University of Illinois
Title: Machine Learning in Acoustic Signal Processing

Abstract: This tutorial presents a framework for understanding and comparing applications of pattern recognition in acoustic signal processing. Representative applications will be delimited by two binary features: (1) regression vs. (2) classification (inferred variables are continuous vs. discrete), (A) instantaneous vs. (B) dynamic. (1. Regression) problems include imaging and sound source tracking using a device with unknown properties, and inverse problems, e.g., articulatory estimation from speech audio. (2. Classification) problems include, e.g., the detection of syllable onsets and offsets in a speech signal, and the classification of non-speech audio events. (A. Instantaneous) inference is performed using a universal approximator (neural network, Gaussian mixture, kernel regression), constrained or regularized, if necessary, to reduce generalization error (resulting in a support vector machine, shrunk net, pruned tree, or boosted classifier combination). (B. Dynamic) inference methods apply prior knowledge of state transition probabilities, either in the form of a regularization term (e.g., using Bayesian inference) or in the form of set constraints (e.g., using linear programming) or both; examples include speech-to-text transcription, acoustic-to-articulatory inversion using a switching Kalman filter, and computation of the query presence probability in an audio information retrieval task.



Matthias Hein, Saarland University
Title: Cheeger Cuts and p-Spectral Clustering

Abstract: Spectral clustering has become in recent years one of the most popular clustering algorithms. In this talk I discuss a generalized version of spectral clustering based on the second eigenvector of the graph p-Laplacian, a non-linear generalization of the graph Laplacian. The clustering obtained for 1 <= p <= 2 can be seen as an interpolation between the relaxation of the normalized cut (p=2) and the Cheeger cut (p=1). However, the main motivation for p-spectral clustering is the fact that one can show that the cut value obtained by thresholding the second eigenvector of the p-Laplacian converges towards the optimal Cheeger cut as p tends to 1. I will also present an efficient implementation which makes p-spectral clustering feasible for large-scale datasets.
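
For reference, the p = 2 baseline that p-spectral clustering improves on fits in a few lines of Python: take the second eigenvector of the graph Laplacian and sweep over thresholds, keeping the cut with the smallest Cheeger ratio cut(S) / min(|S|, |S-bar|). This sketch is the standard spectral heuristic, not the talk's p -> 1 algorithm.

import numpy as np

def cheeger_sweep(W):
    L = np.diag(W.sum(axis=1)) - W          # unnormalized graph Laplacian
    f = np.linalg.eigh(L)[1][:, 1]          # second (Fiedler) eigenvector
    order = np.argsort(f)
    best_ratio, best_S = np.inf, None
    for m in range(1, len(f)):
        in_S = np.zeros(len(f), dtype=bool)
        in_S[order[:m]] = True              # threshold after m smallest entries
        ratio = W[in_S][:, ~in_S].sum() / min(m, len(f) - m)
        if ratio < best_ratio:
            best_ratio, best_S = ratio, in_S.copy()
    return best_ratio, best_S

W = np.zeros((10, 10))                      # two cliques joined by one edge
W[:5, :5] = 1; W[5:, 5:] = 1
np.fill_diagonal(W, 0)
W[4, 5] = W[5, 4] = 1
print(cheeger_sweep(W))                     # finds the bottleneck edge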




Anil Hirani, University of Illinois, Urbana-Champaign
Title: Implementation of Hodge Theory for Simplicial Complexes

Abstract: Hodge decomposition is the main ingredient of Hodge theory. It may be familiar as Helmholtz-Hodge decomposition of vector fields into curl-free, divergence-free and harmonic parts. How can this be efficiently implemented for arbitrary dimensional simplicial complexes embedded in high dimensional Euclidean space? What are even the basic objects to be manipulated by a computer implementation of Hodge theory? How should the differential, its adjoint, and the Laplacian be implemented? I will describe what we use in PyDEC, our Python language based software library which can compute Hodge decompositions (it can also compute finite element solutions for certain partial differential equations). I will also make my own humble attempt to guess what this Hodge stuff may have to do with machine learning. I may be able to show a demo or two relevant to machine learning. Much of this is joint work with Nathan Bell.
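
For the lowest-dimensional case (a graph, with no 2-simplices), the decomposition the abstract refers to can be illustrated with plain least squares: split a flow on the edges into a gradient part plus a divergence-free remainder. This toy Python sketch is a simplification assuming only the vertex-edge incidence matrix; the full simplicial version in a library like PyDEC adds the curl/harmonic split using higher simplices as well.

import numpy as np

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]   # a triangle with a pendant edge
n = 4
d0 = np.zeros((len(edges), n))             # d0: vertex functions -> edge flows
for m, (i, j) in enumerate(edges):
    d0[m, i], d0[m, j] = -1.0, 1.0         # (d0 s) on edge (i,j) is s_j - s_i

f = np.array([1.0, 1.0, 1.0, 2.0])         # loop circulation plus pendant flow
s, *_ = np.linalg.lstsq(d0, f, rcond=None) # potential s minimizing ||d0 s - f||
grad_part = d0 @ s
rot_part = f - grad_part                   # divergence-free: d0.T @ rot_part = 0
print(grad_part.round(3), rot_part.round(3))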




Ali Jadbabaie, University of Pennsylvania
Title: Distributed Computation of Sparse Generators of Homology

Abstract: In this talk I will show how we can use discrete Hodge theory to compute a generator of homology for a simplicial complex in a distributed fashion. I will then demonstrate how one can refine the computed generator to find the sparsest generator in the same class in a distributed, scalable fashion. The proposed approach has some connections to l1 norm minimization and compressed sensing, and can be thought of as a generalization of network flow problems from graphs to simplicial complexes. The application of this approach is then demonstrated by finding the sparsest cover of a sensor network and finding network coverage holes. Joint work with Alireza Tahbaz-Salehi.



Tony Jebara, Columbia University
Title: MAP Estimation with Perfect Graphs

Abstract: Efficiently finding the maximum a posteriori (MAP) configuration of a graphical model is an important problem which is often implemented using message passing algorithms and linear programming. The optimality of such algorithms is only well established for singly-connected graphs such as trees. Recently, along with others, we have shown that matching and b-matching also admit exact MAP estimation under max product belief propagation. This leads us to consider a generalization of trees, matchings and b-matchings: the fascinating family of so-called perfect graphs. While MAP estimation in general loopy graphical models is NP-hard, for perfect graphs of a particular form, the problem is in P. This result leverages recent progress in defining perfect graphs (the strong perfect graph theorem, which was resolved after 4 decades), linear programming relaxations of MAP estimation and recent convergent message passing schemes. In particular, we convert any graphical model into a so-called NAND Markov random field. This model is straightforward to relax into a linear program whose integrality can be established in general by testing for graph perfection. This perfection test is performed efficiently using a polynomial time algorithm. Alternatively, known decomposition tools from perfect graph theory may be used to prove perfection for certain graphs. Thus, a general graph framework is provided for determining when MAP estimation in any graphical model is in P, has an integral linear programming relaxation and is exactly recoverable by message passing.



Michael Jordan, University of California, Berkeley
Title: On Surrogate Loss Functions, f-divergences and Decentralized Detection

Abstract: In 1951, David Blackwell published a seminal paper---widely cited in economics---in which a link was established between the risk based on 0-1 loss and a class of functionals known as f-divergences. The latter functionals have since come to play an important role in several areas of signal processing and information theory, including decentralized detection. Yet their role in these fields has largely been heuristic. We show that an extension of Blackwell's programme provides a solid foundation for the use of f-divergences in decentralized detection, as well as in more general problems of experimental design. Our extension is based on a connection between f-divergences and the class of so-called surrogate loss functions---computationally-inspired upper bounds on 0-1 loss that have become central in the machine learning literature on classification. (Joint work with XuanLong Nguyen and Martin Wainwright)



Sham Kakade, TTI-C
Title: A Multi-View Approach to Learning

Abstract: Extracting information relevant to a task in an unsupervised manner is one of the most promising directions for reducing the burden of collecting labeled data --- the latter of which is often prohibitively expensive, as it often involves procedures like human hand labeling, clinical trials, costly experiments, etc. The central questions here are: what structure in the unlabeled data can be exploited, and how do we utilize this structure? Often, our input X can be partitioned into two 'views' X1 and X2, where each view provides a natural source of information for predicting our target Y. For example, for the task of speaker identification, two input sources are an audio stream and a video stream, or, for a webpage classification task, our 'views' could be the word content and the hyperlink structure. We show that, under certain natural independence assumptions, canonical correlation analysis (a generalization of PCA to pairs of random variables) provides a dimensionality reduction technique such that we can project our input X to a lower dimensional space without losing predictive power for the target Y. Such a reduction could significantly reduce the number of labeled samples required. We then discuss how this idea can be utilized in a number of settings: including supervised learning, (hierarchical) clustering, time-series modeling, and domain adaptation (where our training and test distributions could differ substantially).
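
The CCA step at the heart of this reduction fits in a dozen lines of Python: whiten each view's covariance and take an SVD of the cross-covariance; the singular values are the canonical correlations. The ridge term eps and the synthetic two-view data are assumptions for the sketch.

import numpy as np

def cca(X1, X2, k, eps=1e-6):
    X1, X2 = X1 - X1.mean(0), X2 - X2.mean(0)
    n = len(X1)
    C11 = X1.T @ X1 / n + eps * np.eye(X1.shape[1])
    C22 = X2.T @ X2 / n + eps * np.eye(X2.shape[1])
    C12 = X1.T @ X2 / n
    W1 = np.linalg.inv(np.linalg.cholesky(C11)).T   # whitening for view 1
    W2 = np.linalg.inv(np.linalg.cholesky(C22)).T   # whitening for view 2
    U, s, Vt = np.linalg.svd(W1.T @ C12 @ W2)
    return W1 @ U[:, :k], W2 @ Vt[:k].T, s[:k]      # projections, correlations

rng = np.random.default_rng(4)
z = rng.standard_normal((1000, 2))                  # shared latent signal
X1 = z @ rng.standard_normal((2, 5)) + 0.1 * rng.standard_normal((1000, 5))
X2 = z @ rng.standard_normal((2, 6)) + 0.1 * rng.standard_normal((1000, 6))
print(cca(X1, X2, k=2)[2])                          # top correlations near 1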



Vladimir Koltchinskii, Georgia Institute of Technology
Tutorial Title: Bounding Excess Risk in Machine Learning

Abstract: We will discuss a general approach to the problem of bounding the "excess risk" of learning algorithms based on empirical risk minimization (possibly penalized). This approach has been developed in recent years by several authors (among others: Massart; Bartlett, Bousquet and Mendelson; Koltchinskii). It is based on powerful concentration inequalities due to Talagrand as well as on a variety of tools of empirical processes theory (comparison inequalities, entropy and generic chaining bounds on Gaussian, empirical and Rademacher processes, etc.). It provides a way to obtain sharp excess risk bounds in a number of problems such as regression, density estimation and classification, and for many different classes of learning methods (kernel machines, ensemble methods, sparse recovery). It also provides a general way to construct sharp data-dependent bounds on excess risk that can be used in model selection and adaptation problems.



John Langford, Yahoo Research
Title: Consistent Robust Methods for Logarithmic Time Prediction

Abstract: A natural goal when predicting one of n choices is to do it in O(log(n)) time. Existing algorithms for doing this suffer from several deficiencies---they are often otherwise inefficient in training, sometimes inconsistent, and not very robust when they are consistent. I will present a family of techniques which addresses all of these problems, yielding mechanisms for prediction which are online, logarithmic time, consistent, and robust, for a wide variety of prediction problems.



Yann LeCun, New York University
Title: Learning Feature Hierarchies

Abstract:



Ann Lee, Carnegie Mellon University
Title: Nonlinear Dimension Reduction by Spectral Connectivity Analysis and Diffusion Coarse-Graining

Abstract: For naturally occurring data, the dimension of the given input space is often very large while the data themselves have a low intrinsic dimensionality. Spectral kernel methods are non-linear techniques for transforming data into a coordinate system that efficiently reveals the geometric structure, in particular the "connectivity", of the data. In this talk, we will focus on one particular technique: diffusion maps and diffusion coarse-graining. The construction is based on a Markov random walk on the data and offers a general scheme for simultaneously reorganizing and quantizing graphs and arbitrarily shaped data sets in high dimensions using intrinsic geometry. We show that clustering in embedding spaces is equivalent to compressing operators and that the quantization distortion in diffusion space bounds the error of compression of the operator, thus giving a rigorous justification and a precise measure of performance of k-means clustering in spectral embedding spaces. We will discuss two particular applications of diffusion coarse-graining: one is choosing an appropriate set of prototype simple stellar population (SSP) spectra for parameter estimation of star formation history in galaxies. The other example is texture discrimination by a novel geometry-based metric on distributions. (Part of this work is joint with R.R. Coifman, S. Lafon, J. Richards and C. Schafer)
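
The basic diffusion-map construction described above is compact enough to write out in Python: form a Gaussian affinity matrix, row-normalize it into a Markov transition matrix, and use the top nontrivial eigenvectors (scaled by eigenvalues) as coordinates. The bandwidth, diffusion time, and toy data are assumed choices in this sketch.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def diffusion_map(X, n_coords=2, sigma=0.5, t=1):
    K = np.exp(-squareform(pdist(X)) ** 2 / (2 * sigma ** 2))
    P = K / K.sum(axis=1, keepdims=True)      # Markov random walk on the data
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # drop the trivial constant eigenvector; scale by eigenvalue^t
    return vecs[:, 1:n_coords + 1] * vals[1:n_coords + 1] ** t

rng = np.random.default_rng(5)
theta = rng.uniform(0, 2 * np.pi, 300)        # noisy circle in the plane
X = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.standard_normal((300, 2))
coords = diffusion_map(X)
print(coords.shape)                           # 300 points, 2 diffusion coordinates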



Yoonkyung Lee, Ohio State University
Title: A Bahadur Type Representation of the Linear Support Vector Machine and its Relative Efficiency

Abstract: The support vector machine has been used successfully in a variety of applications. Also on the theoretical front, its statistical properties, including Bayes risk consistency, have been examined rather extensively. Taking another look at the method, we investigate the asymptotic behavior of the linear support vector machine through a Bahadur type representation of the coefficients established under appropriate conditions. Their asymptotic normality and statistical variability are derived on the basis of the representation. Furthermore, direct theoretical comparison is made with likelihood-based approaches to classification, such as linear discriminant analysis and logistic regression, in terms of the asymptotic relative efficiency, where the efficiency of a classification procedure is defined using the excess risk from the Bayes risk.



Gilad Lerman, University of Minnesota
Title: Multi-Manifold Data Modeling via Spectral Curvature Clustering

Abstract: We propose a fast multi-way spectral clustering algorithm for multi-manifold data modeling, i.e., modeling data by mixtures of manifolds (possibly intersecting). We describe the supporting theory as well as the practical choices guided by it. We first develop the case of hybrid linear modeling, i.e., when the underlying manifolds are affine subspaces in a Euclidean space, and then we extend this setting to more general manifolds. We exemplify the practical use of the algorithm by demonstrating its successful application to problems of motion segmentation.



Karen Livescu, TTI-C
Title: Graphical Models for Speech Recognition: Articulatory and Audio-Visual Models

Abstract: Since the 1980s, the main approach to automatic speech recognition has been using hidden Markov models (HMMs), in which each state corresponds to a phoneme or part of a phoneme in the context of the neighboring phonemes. Despite their crude approximation of the speech signal, and the large margin for improvement still remaining, HMMs have proven difficult to beat. In the last few years, there has been increasing interest in more complex graphical models for speech recognition, involving multiple streams of states. I will describe two such approaches, one modeling pronunciation variation as the result of the "sloppy" behavior of articulatory variables (the states of the lips, tongue, etc.) and the other modeling the audio and visual states in audio-visual speech recognition (i.e. recognition enhanced by "lipreading").



Mauro Maggioni, Duke University
Title: Harmonic and Geometric Multiscale Analysis of Data Sets in High Dimensions

Abstract: In many applications one is faced with the task of analyzing large amounts of data, typically embedded in high-dimensional space, but with a lower effective dimensionality, due to physical or statistical constraints. We are interested in studying the geometry of such data sets, modeled as noisy manifolds or graphs, in particular in estimating their intrinsic dimensionality and finding intrinsic coordinate systems on the data. We discuss recent results in these directions, where eigenfunctions of a Laplacian on the data or the associated heat kernel can be used to introduce coordinates with provable guarantees on their bi-Lipschitz distortion. We also discuss ways of studying, fitting, denoising and regularizing functions defined on the data, by using Fourier or a wavelet-like multiscale analysis on the data. We present applications to nonlinear image denoising, semisupervised learning on a family of benchmark datasets, and, if time permits, to Markov decision processes.



Michael Mahoney, Stanford University
Title: Community Structure in Large Social and Information Networks

Abstract: The concept of a community is central to social network analysis, and thus a large body of work has been devoted to identifying community structure. For example, a community may be thought of as a set of web pages on related topics, a set of people who share common interests, or more generally as a set of nodes in a network more similar amongst themselves than with the remainder of the network. Motivated by difficulties we experienced at actually finding meaningful communities in large real-world networks, we have performed a large scale analysis of a wide range of social and information networks. Our main methodology uses local spectral methods and involves computing isoperimetric properties of the networks at various size scales -- a novel application of ideas from statistics and machine learning to internet data analysis. Our empirical results suggest a significantly more refined picture of community structure than has been appreciated previously. Our most striking finding is that in nearly every network dataset we examined, we observe tight but almost trivial communities at very small size scales, and at larger size scales, the best possible communities gradually ``blend in'' with the rest of the network and thus become less ``community-like.'' This behavior is not explained, even at a qualitative level, by any of the commonly-used network generation models. Moreover, this behavior is exactly the opposite of what one would expect based on experience with and intuition from expander graphs, from graphs that are well-embeddable in a low-dimensional structure, and from small social networks that have served as testbeds of community detection algorithms. Possible mechanisms for reproducing our empirical observations will be discussed, as will implications of these findings for clustering, classification, and more general data analysis in modern large social and information networks.



Stephane Mallat, Ecole Polytechnique
Title: Sparse Representations from Inverse Problems to Pattern Recognition

Abstract: Sparse representations are at the core of many low-level signal processing procedures and are used by most pattern recognition algorithms to reduce the dimension of the search space. Structuring sparse representations for pattern recognition applications requires taking into account invariance relative to physical deformations such as rotation, scaling or illumination. Sparsity, invariance and stability are conflicting requirements, which is a source of open problems. Structured sparse representations with locally linear vector spaces are introduced for super-resolution inverse problems and pattern recognition.



David McAllester, TTI-C
Title: Unsupervised Learning for Stereo Vision

Abstract: We consider the problem of learning to estimate depth from stereo image pairs. This can be formulated as unsupervised learning --- the training pairs are not labeled with depth. We have formulated an algorithm which maximizes the conditional likelihood of the left image given the right image in a model that involves latent information (depth). This unsupervised learning algorithm implicitly trains shape-from-texture and shape-from-shading monocular depth cues. The talk will present pragmatic results on the stereo vision problem as well as a general formulation of models and methods for maximizing conditional likelihood in a latent variable model where we wish to interpret the latent information as "labels".



Peter McCullagh, University of Chicago
Title: Statistical Classification and Cluster Processes

Abstract: After an introduction to the notion of an exchangeable random partition, we continue with a more detailed discussion of the Ewens process and some of its antecedents. The concept of an exchangeable cluster process will be described, the main example being the Gauss-Ewens process. Some applications of cluster processes will be discussed, including problems of classification or supervised learning, and cluster analysis (unsupervised learning). A second type of probabilistic model based on point processes is also described. By contrast with the Gauss-Ewens cluster process, the domain associated with each class is more diffuse and not localized in the feature space. For both models, the classification problem is interpreted as the problem of computing the predictive distribution for the class of a new object having a given feature vector. In one case, this is a conditional distribution given the observed features; in the other, a Papangelou conditional intensity.



Gary Miller, Carnegie Mellon University
Title: Spectral Graph Theory, Linear Solvers, and Applications

Abstract: We discuss the development of combinatorial methods for solving symmetric diagonally dominant linear systems. Over the last fifteen years the computer science community has made substantial progress in fast solvers for SDD systems. For general SDD systems the upper bound is $O(m \log^k n)$ for some constant $k$, where $m$ is the number of non-zero entries, due to Spielman and Teng. Newer methods, combinatorial multigrid, have a linear time guarantee for the planar case and work very well in practice. Critical to the use of these new solvers has been the reduction of problems to the solution of SDD systems. We present some of these reductions, including several from image processing.
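
The workflow the abstract describes (reduce a problem to an SDD system such as a graph Laplacian, then call a fast solver) looks like this in miniature. This Python sketch grounds one vertex to make the system positive definite and uses plain conjugate gradients from scipy; the combinatorial preconditioners discussed in the talk are what make such solves fast at scale, and everything below is illustrative.

import numpy as np
from scipy.sparse import lil_matrix, csr_matrix
from scipy.sparse.linalg import cg

n = 100                                    # path graph on n vertices
L = lil_matrix((n, n))
for i in range(n - 1):                     # assemble the graph Laplacian
    L[i, i] += 1; L[i + 1, i + 1] += 1
    L[i, i + 1] -= 1; L[i + 1, i] -= 1

b = np.zeros(n)
b[0], b[-1] = 1.0, -1.0                    # inject/extract one unit of current
Lg = csr_matrix(L[:-1, :-1])               # ground the last vertex: SPD system
x, info = cg(Lg, b[:-1])                   # info == 0 means CG converged
print(info, x[:3])                         # vertex potentials near the source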



Elchanan Mossel, University of California, Berkeley
Title: Recent Progress in Combinatorial Statistics

Abstract: I will discuss some recent progress in combinatorial statistics. In particular, I will describe progress in the areas of reconstructing graphical models, ML estimation of the Mallows model and diagnostics of MCMC.



Sayan Mukherjee, Duke University
Title: Two Models for Bayesian Supervised Dimension Reduction

Abstract: We study and develop two Bayesian frameworks for supervised dimension reduction that apply to nonlinear manifolds: Bayesian mixtures of inverse regressions and gradient based methods. Formal probabilistic models with likelihoods and priors are given for both methods, and efficient posterior estimates of the effective dimension reduction space and predictive factors can be obtained by a Gibbs sampling procedure. In the case of the gradient based methods, estimates of conditional dependence between covariates predictive of the response can also be inferred. Relations to manifold learning and Bayesian factor models are made explicit. The utility of the approach is illustrated on simulated and real examples.



Partha Niyogi, University of Chicago
Title: Speech as a Pattern of Points in Time

Abstract: This talk presents techniques for nonstationarity detection in the context of speech and audio waveforms, with broad application to any class of time series that exhibits locally stationary behavior. Many such waveforms, in particular information-carrying natural sound signals, exhibit a degree of controlled nonstationarity, and are often well modeled as slowly time-varying systems. The talk first describes the basic concepts of such systems and their analysis via local Fourier methods. Parametric approaches appropriate for speech are then introduced by way of time-varying autoregressive models, along with nonparametric approaches based on variation of time-localized estimates of the power spectral density of an observed random process, along with an efficient offline bootstrap procedure based on the Wold representation. Several real-world examples are given.



Rob Nowak, University of Wisconsin-Madison
Title: Active Learning and Selective Sensing

Abstract: Traditional approaches to machine learning and statistical inference are passive, in the sense that all data are collected prior to analysis in a non-adaptive fashion. One can envision, however, more active strategies in which information gleaned from previously collected data is used to guide the selection of new data. This talk discusses the emerging theory of such "active learning" methods. I will show that feedback between data analysis and data collection can be crucial for effective learning and inference. The talk will describe two active learning problems. First, I will consider binary-valued prediction (classification) problems, for which the prediction errors of passive learning methods can be exponentially larger than those of active learning. Second, I will discuss the role of active learning in the recovery of sparse vectors in noise. I will show that certain weak, sparse patterns are imperceptible from passive measurements, but can be recovered perfectly using selective sensing.
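
The exponential gap in the classification example has a one-dimensional caricature: to learn a threshold on [0, 1] to precision 1/n, passive learning needs on the order of n random labels, while actively bisecting needs about log2(n). A noiseless Python toy, with an arbitrary threshold value and precision:

def label(x, theta=0.731):          # oracle for an unknown 1-D threshold
    return x >= theta

lo, hi, queries = 0.0, 1.0, 0
while hi - lo > 1e-3:               # target precision 1/n with n = 1000
    mid = (lo + hi) / 2
    queries += 1
    if label(mid):
        hi = mid                    # threshold lies to the left of mid
    else:
        lo = mid
print(queries, (lo + hi) / 2)       # about log2(1000) ~ 10 queries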

top


Tommy Poggio, MIT
Title: Learning in Hierarchical Architectures: from Neuroscience to Derived Kernels

Abstract: Understanding the processing of information in our cortex is a significant part of understanding how the brain works, arguably one of the greatest problems in science today. In particular, our visual abilities are computationally amazing: computer science is still far from being able to create a vision engine that imitates them. Thus, visual cortex and the problem of the computations it performs may well be a good proxy for the rest of the cortex and for intelligence itself.
I will briefly review our work on developing a hierarchical feedforward architecture for object recognition based on the anatomy and the physiology of the primate visual cortex. These architectures compete with state-of-the-art computer vision systems; they mimic human performance on a specific but difficult natural image recognition task. I will sketch current work aimed at extending the model to the recognition of behaviors in time sequences of images and to accounting for attentional effects in human vision.
I will then describe a new attempt (with S. Smale, L. Rosasco and J. Bouvrie) to develop a mathematics for hierarchical kernel machines centered around the notion of a recursively defined "derived kernel" and directly suggested by the model and the underlying neuroscience of the visual cortex.

top


Alexander Razborov, University of Chicago
Title: On the Dimension Complexity of DNFs

Abstract: The dimension complexity of a given concept class is defined as the smallest dimension of its halfspace embedding. The best currently known algorithm for learning DNF formulas, discovered by A. Klivans and R. Servedio, proceeds precisely by placing an upper bound on the dimension complexity of the concept class consisting of all (poly-size) DNFs. In this talk we give a matching lower bound, thereby ruling out the possibility of improving the Klivans-Servedio algorithm along similar lines. Joint work with Alexander Sherstov (University of Texas).
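
For readers unfamiliar with the notion, a standard formulation (not necessarily the speakers' exact one): the dimension complexity of a class $C$ over a domain $X$ is the least $d$ for which there exist a map $\phi: X \to \mathbb{R}^d$ and, for each $f \in C$, a weight vector $w_f \in \mathbb{R}^d$ such that $f(x) = \mathrm{sign}(\langle w_f, \phi(x) \rangle)$ for all $x \in X$. A class of dimension complexity $d$ can thus be handed to any halfspace learner in $\mathbb{R}^d$, which is what drives the Klivans-Servedio algorithm.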

top


Lorenzo Rosasco, MIT
Title: An Inverse Problems Perspective on Learning Theory

Abstract: Learning machines are systems trained, instead of programmed, with a set of examples to perform a task. At a high level the problem of learning can be seen as an inverse problem, where one is interested in recovering a model from measurements. However, a closer look reveals that the mathematical frameworks for learning and inverse problems seem quite different, as are the ultimate goals: generalization in learning and stability in inverse problems. In this talk we show that in fact the connection between inverse problems and learning goes beyond an informal analogy. If we assume the functions of interest belong to a Hilbert space where functions can be point-wise evaluated, we can cast the problem of learning as an inverse problem defined by a bounded linear operator. Building on such a connection we can view different learning algorithms, such as early stopping, supervised dimensionality reduction and penalized empirical risk minimization, as different ways of finding stable solutions to a stochastic inverse problem. New as well as old algorithms are described in a unifying functional framework close to that used in other fields such as signal processing. The relation between generalization and stability can be formalized using operator theoretic results from inverse problems and concentration inequalities for random matrices.
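
The inverse-problems view makes early stopping concrete: gradient descent on the least-squares objective is the Landweber iteration, and halting after $t$ steps acts as spectral regularization, with $t$ playing the role of an inverse regularization parameter. A small NumPy sketch under these assumptions (problem sizes chosen for illustration):

    import numpy as np

    def landweber(X, y, steps, lr):
        """Gradient descent on ||Xw - y||^2 / n; the step count regularizes."""
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(steps):
            w -= lr * X.T @ (X @ w - y) / n
        return w

    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 200))        # ill-posed: more unknowns than data
    w_true = np.zeros(200); w_true[:5] = 1.0
    y = X @ w_true + 0.1 * rng.standard_normal(50)
    for t in (10, 100, 10_000):
        w = landweber(X, y, steps=t, lr=0.01)
        print(t, np.linalg.norm(w - w_true))  # error typically dips, then rises

On such an overparameterized problem the estimation error typically dips at a moderate iteration count and then creeps back up as the iterate begins to fit the noise, which is the stability/generalization trade-off the abstract formalizes.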

top


Guillermo Sapiro, University of Minnesota
Title: Learning Dictionaries for Image Analysis and Sensing

Abstract: Sparse representations have recently drawn much attention from the signal processing and learning communities. The basic underlying model consists of assuming that natural images, or signals in general, admit a sparse decomposition in some redundant dictionary. This means that we can find a linear combination of a few atoms from the dictionary that leads to an efficient representation of the original signal. Recent results have shown that learning overcomplete non-parametric dictionaries for image representations, instead of using off-the-shelf ones, significantly improves numerous image and video processing tasks.
In this talk, I will first present our results on learning multiscale overcomplete dictionaries for color image and video restoration. I will present the framework and provide numerous examples showing state-of-the-art results. I will then briefly show how to extend this to image classification, deriving energies and optimization procedures that lead to learning non-parametric dictionaries for sparse representations optimized for classification. I will conclude by showing results on the extension of this to sensing and the learning of incoherent dictionaries. The work I present in this talk is the result of great collaborations with J. Mairal (ENS, Paris), F. Rodriguez (UofM/Spain), J. Martin-Duarte (UofM/Kodak), I. Ramirez (UofM), F. Lecumberry (UofM), F. Bach (ENS, Paris), M. Elad (Technion, Israel), J. Ponce (ENS, Paris), and A. Zisserman (ENS/Oxford).
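
A toy version of the alternating scheme behind such methods, assuming scikit-learn is available: sparse-code the patches against the current dictionary, update the dictionary, and repeat. This generic routine stands in for the multiscale and discriminative variants described in the talk.

    import numpy as np
    from sklearn.decomposition import MiniBatchDictionaryLearning

    # Toy "patches": flat 8x8 blocks (in real use, patches cut from images).
    rng = np.random.default_rng(0)
    patches = rng.standard_normal((1000, 64))
    patches -= patches.mean(axis=1, keepdims=True)

    # Learn a 2x-overcomplete dictionary under an L1 sparsity penalty.
    dico = MiniBatchDictionaryLearning(n_components=128, alpha=1.0)
    codes = dico.fit(patches).transform(patches)   # sparse coefficients
    recon = codes @ dico.components_               # reconstruction from few atoms
    print(np.mean((patches - recon) ** 2), np.mean(codes != 0))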

top


Rob Schapire, Princeton University
Tutorial Title: Theory and Applications of Boosting

Abstract: Boosting is a general method for producing a very accurate classification rule by combining rough and moderately inaccurate "rules of thumb". While rooted in a theoretical framework of machine learning, boosting has been found to perform quite well empirically. This tutorial will introduce the boosting algorithm AdaBoost, and explain the underlying theory of boosting, including explanations that have been given as to why boosting often does not suffer from overfitting, as well as some of the myriad other theoretical points of view that have been taken on this algorithm. Some practical applications and extensions of boosting will also be described.
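
For concreteness, here is a generic rendering of AdaBoost with decision stumps in Python (my sketch, not the tutorial's own code): each round reweights the examples to emphasize mistakes, fits the best stump under the new weights, and adds it to the ensemble with a weight reflecting its accuracy.

    import numpy as np

    def adaboost(X, y, rounds):
        """AdaBoost with decision stumps; labels y must be in {-1, +1}."""
        n, d = X.shape
        w = np.full(n, 1.0 / n)
        ensemble = []
        for _ in range(rounds):
            best = None   # exhaustive stump search; fine for small data
            for j in range(d):
                for thr in np.unique(X[:, j]):
                    for sign in (1, -1):
                        pred = sign * np.where(X[:, j] >= thr, 1, -1)
                        err = w[pred != y].sum()
                        if best is None or err < best[0]:
                            best = (err, j, thr, sign)
            err, j, thr, sign = best
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
            pred = sign * np.where(X[:, j] >= thr, 1, -1)
            w *= np.exp(-alpha * y * pred)   # reweight toward the mistakes
            w /= w.sum()
            ensemble.append((alpha, j, thr, sign))
        return ensemble

    def predict(ensemble, X):
        votes = sum(a * s * np.where(X[:, j] >= t, 1, -1) for a, j, t, s in ensemble)
        return np.sign(votes)

    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 2))
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)   # no single stump is exact
    print(np.mean(predict(adaboost(X, y, rounds=50), X) == y))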

top


Greg Shakhnarovich, TTI-C
Title: Statistical Methods for Cortical Control of Hand Movement

Abstract: I will talk about statistical methods for decoding neural activity in the cortex to control dexterous hand activity, as part of a brain-machine interface setup, with the eventual goal of enabling functional neural prostheses for the arm and hand. The human hand is a complex biomechanical device with many degrees of freedom. However, movement involved in natural tasks, such as reaching, grasping and object manipulation, exhibits many constraints on the joint behavior of these degrees of freedom, due to a variety of factors. This may impose an inherently low-dimensional spatial and temporal structure on the movement and allow for efficient decoding, even with a noisy neural signal from a small cell population. I will describe recent results with statistical models that capture such structure, and its relationship to neural activity, to obtain state-of-the-art open-loop reconstruction of motion from multi-electrode cortical recordings. Joint work with C. Vargas-Irwin, P. Yadollahpour, M. Black, and J. Donoghue.

top


Tao Shi, Ohio State University
Title: Data Spectroscopy: Eigenspaces of Convolution Operators and Clustering

Abstract: In this talk, we focus on obtaining clustering information about a distribution from its i.i.d. samples. We develop theoretical results to understand and use clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function with a sufficiently fast tail decay. In particular, we provide population analysis to gain insights into which eigenvectors should be used and when the clustering information for the distribution can be recovered from the sample. We warn that a fixed number of top eigenvectors might at the same time contain redundant clustering information and miss relevant clustering information. We use this insight to design the Data Spectroscopic clustering (DaSpec) algorithm that utilizes properly selected eigenvectors to determine the number of clusters and to group the data accordingly. Our findings extend the intuitions underlying existing spectral techniques such as spectral clustering and Kernel Principal Components Analysis, and provide new understanding of their usability and modes of failure. Simulation studies and experiments on real world data are conducted to show the potential of our algorithm. DaSpec is found to handle unbalanced groups and recover clusters of different shapes better than competing methods. This is joint work with Prof. Mikhail Belkin (CSE, Ohio State University) and Prof. Bin Yu (Statistics, UC Berkeley).
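
The basic pipeline this analysis speaks to fits in a few lines of NumPy/SciPy: form a Gaussian kernel matrix on the sample, take its top eigenvectors, and read cluster membership off where each eigenvector localizes. This is a stripped-down illustration, not the DaSpec eigenvector-selection rule itself.

    import numpy as np
    from scipy.spatial.distance import cdist

    def top_eigvecs(X, bandwidth, k):
        """Top-k eigenvectors of the Gaussian kernel matrix on the sample X."""
        K = np.exp(-cdist(X, X, "sqeuclidean") / (2 * bandwidth ** 2))
        vals, vecs = np.linalg.eigh(K)      # eigh returns ascending order
        return vecs[:, ::-1][:, :k]

    # Two well-separated blobs: each top eigenvector localizes on one blob.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
    V = top_eigvecs(X, bandwidth=1.0, k=2)
    labels = np.argmax(np.abs(V), axis=1)   # assign by dominant eigenvector
    print(np.bincount(labels))              # roughly 100 / 100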

top


Vikas Sindhwani, IBM
Title: Semi-supervised and Active Learning with Dual Supervision

Abstract: As the capability to collect and store data moves into the petascale regime, it is clear that for typical machine learning tasks, human annotation effort can only touch a minuscule fraction of the total data available. The problem of learning from small labeled sets by utilizing large amounts of unlabeled examples has attracted significant interest in recent years. In this talk, we consider a different, complementary mechanism to potentially reduce the burden of labeling. We consider tasks where it is frequently possible to label smaller data entities more cheaply -- for example, by associating words, as opposed to documents, with classes in text categorization tasks. We present new semi-supervised learning algorithms, based on spectral bipartite graph partitioning and matrix approximation formulations for co-clustering, that support dual supervision in the form of labels for examples and/or features. We then consider the novel problem of active dual supervision: how to optimally query an example- and feature-labeling oracle to simultaneously collect two different forms of supervision, with the objective of building the best classifier in the most cost-effective manner. We present empirical studies demonstrating the potential of simple active dual supervision schemes to significantly reduce the overall cost of acquiring labeled data. This is joint work with Prem Melville and Jianying Hu.

top


Amit Singer, Princeton University
Title: What Do Unique Games, Structural Biology and the Low-Rank Matrix Completion Problem Have In Common?

Abstract: We will formulate several data-driven applications as MAX2LIN and d-to-1 games, and show how to (approximately) solve them using efficient spectral and semidefinite program relaxations. The relaxations perform incredibly well in the presence of a large number of outlier measurements that cannot be satisfied. We use random matrix theory to prove that the algorithms almost achieve the information-theoretic Shannon bound. The underlying group structure of the different applications (like SO(2), SO(3), GL(n), etc.) is heavily exploited. Applications include: cryo-electron microscopy and NMR spectroscopy for 3D protein structure determination, low-rank matrix completion, clock synchronization, and surface reconstruction in computer vision and optics. Partly joint work with Yoel Shkolnisky, Ronald Coifman and Fred Sigworth (Yale); Mihai Cucuringu and Yaron Lipman (Princeton); and Yosi Keller (Bar Ilan).
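
A minimal instance of the spectral relaxation in the SO(2) case (angular synchronization, a MAX2LIN problem): noisy relative angles $\theta_i - \theta_j$ populate a Hermitian matrix whose top eigenvector recovers the absolute angles up to a global rotation, even when a large fraction of the measurements are outliers. A NumPy sketch with illustrative parameters:

    import numpy as np

    rng = np.random.default_rng(0)
    n, p_outlier = 200, 0.4
    theta = rng.uniform(0, 2 * np.pi, n)        # ground-truth angles

    # Hermitian matrix of relative-angle measurements; 40% are pure outliers.
    H = np.zeros((n, n), dtype=complex)
    for i in range(n):
        for j in range(i + 1, n):
            good = rng.random() >= p_outlier
            d = theta[i] - theta[j] if good else rng.uniform(0, 2 * np.pi)
            H[i, j] = np.exp(1j * d)
            H[j, i] = np.conj(H[i, j])

    vals, vecs = np.linalg.eigh(H)
    est = np.angle(vecs[:, -1])                 # top eigenvector -> angles
    g = np.angle(np.mean(np.exp(1j * (theta - est))))   # fix the global rotation
    resid = np.angle(np.exp(1j * (est + g - theta)))
    print(np.std(resid))                        # small despite the outliers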

top


Stephen Smale, TTI-C
Title: Vision and Hodge Theory

Abstract: A general mathematical Hodge theory will be presented together with its relationship to spaces of images.

top


Cristian Sminchisescu, University of Bonn
Title: Supervised Spectral Latent Variable Models

Abstract: We present probabilistic structured prediction methods for learning input-output dependencies where correlations between outputs are modeled as low-dimensional manifolds constrained both by geometric, distance-preserving relations in the output distribution and by the predictive power of inputs. Technically this reduces to learning a probabilistic, input-conditional model over latent (manifold) and output variables using an alternation scheme. In one round, we optimize the parameters of an input-driven manifold predictor using latent targets given by preimages (conditional expectations) of the current manifold-to-output model. In the next round, we use the distribution given by the manifold predictor in order to maximize the probability of the outputs, with an additional, implicit geometry-preserving constraint on the manifold. The resulting Supervised Spectral Latent Variable Model (SSLVM) combines the properties of geometric manifold learning (it accommodates geometric constraints corresponding to any spectral embedding method, including PCA, ISOMAP or Laplacian Eigenmaps) with additional supervisory information to further constrain the model for predictive tasks in a probabilistic setting. We discuss the gains of this methodology over baseline PPCA + regression frameworks and demonstrate its potential in real-world computer vision benchmarks for recognition and three-dimensional human pose reconstruction. Joint work with Liefeng Bo, TTI-C.

top


Dan Spielman, Yale University
Title: Fitting a Graph to Vector Data

Abstract: We ask "What is the right graph to fit to a set of vectors?" We propose one solution that provides good answers to standard Machine Learning problems, that has interesting combinatorial properties, and that we can compute efficiently. Joint work with Jonathan Kelner and Samuel Daitch.

top


John Shawe-Taylor, University College London
Tutorial Title: Kernel Methods and Support Vector Machines

Abstract: Kernel methods have become a standard tool for pattern analysis during the last fifteen years since the introduction of support vector machines. We will introduce the key ideas and indicate how this approach to pattern analysis enables a relatively easy plug-and-play application of different tools. The problem of choosing and designing a kernel for specific types of data will also be considered, and an overview of different kernels will be given.

top


Nati Srebro, TTI-C
Title: More Data Less Work: Runtime as a Monotonically Decreasing Function of Data Set Size

Abstract: We are used to studying runtime as an increasing function of the data set size, and are happy when this increase is not so bad (e.g. when the runtime increases linearly, or even polynomially, with the data set size). Traditional runtime analysis of learning is also viewed this way, and studies how training runtime increases as more data is available. However, considering the true objective of training, which is to obtain a good predictor, I will argue that training runtime should actually be studied as a *decreasing* function of training set size. Focusing on training Support Vector Machines (SVMs), and combining ideas from optimization, statistical learning theory, and online methods, I will then present both theoretical and empirical results demonstrating how a simple stochastic subgradient descent approach indeed displays such monotonic decreasing behavior. I will also discuss a similar phenomenon in the context of Gaussian mixture clustering, where it appears that excess data turns the problem from computationally intractable to computationally tractable. Joint work with Shai Shalev-Shwartz, Karthik Sridharan, Yoram Singer, Greg Shakhnarovich and Sam Roweis.
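
A sketch of the kind of stochastic subgradient SVM solver behind these results (a Pegasos-style update; this rendering and its hyperparameters are mine, for illustration). Each step touches a single random example, so the work needed to reach a fixed accuracy is governed by the target accuracy rather than by the data set size; a larger training set only makes each step statistically more valuable.

    import numpy as np

    def pegasos(X, y, lam, steps, rng):
        """Stochastic subgradient descent on the regularized hinge loss."""
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, steps + 1):
            i = rng.integers(n)              # one random example per step,
            eta = 1.0 / (lam * t)            # at a cost independent of n
            margin = y[i] * (X[i] @ w)
            w *= 1 - eta * lam
            if margin < 1:
                w += eta * y[i] * X[i]
        return w

    rng = np.random.default_rng(0)
    X = rng.standard_normal((5000, 20))
    y = np.sign(X @ rng.standard_normal(20))
    w = pegasos(X, y, lam=1e-3, steps=2000, rng=rng)
    print(np.mean(np.sign(X @ w) == y))      # high accuracy after 2000 touches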

top


Matthew Stephens, University of Chicago
Title: Variable-Selection Bayesian Regression: A Novel Prior Distribution, and Applications to Genetic Association Studies

Abstract: Ongoing large-scale genetic association studies, in an attempt to identify variants and genes affecting susceptibility to common diseases, are typing hundreds of thousands of genetic variants (SNPs) in thousands of individuals, with the aim of identifying which SNPs are associated with disease status. Standard statistical analyses consider each SNP in turn, and perform hundreds of thousands of univariate tests. Here we consider analyzing multiple SNPs simultaneously using Bayesian variable-selection multiple regression (in the context of a continuous response). We suggest a novel prior distribution for the effect sizes of associated SNPs, based on the overall proportion of variance explained. We assess the computational feasibility of this approach through simulation, and investigate its merits relative to both standard univariate analyses, and penalized regression approaches such as LASSO.

top


Leslie Valiant, Harvard University
Title: Learning Phenomena Beyond Learning

Abstract: We shall argue that the study of computational learning is fundamental and has implications well beyond the phenomenon of learning itself. We shall give two illustrations. First we shall offer a formulation of biological evolution as a restricted form of PAC learning, and shall describe some results for this formulation. Second we shall discuss the phenomenon of intelligent data processing in terms of the question of how reasoning can be performed on learned data with some guarantees of the validity of conclusions. The framework of robust logic that addresses this phenomenon will be described, as well as the results of some experiments that implement this framework for a natural language dataset.

top


Grace Wahba, University of Wisconsin-Madison
Title: Examining the Relative Influence of Familial, Genetic, and Environmental Covariate Information in Flexible Risk Models

Abstract: We present a novel method for examining the relative influence of familial, genetic and environmental covariate information in flexible nonparametric risk models. Our goal is to investigate the relative importance of these three sources of information as they are associated with a particular outcome. To that end, we developed a method for incorporating arbitrary pedigree information in a smoothing spline ANOVA (SS-ANOVA) model. By expressing pedigree data as a positive semidefinite kernel matrix, the SS-ANOVA model is able to estimate a log-odds ratio as a multicomponent function of several variables: one or more functional components representing information from environmental covariates and/or genetic marker data, and another representing pedigree relationships. We report a case study on models for retinal pigmentary abnormalities in the Beaver Dam Eye Study (BDES). Our model verifies known facts about the epidemiology of this eye lesion --- found in eyes with early age-related macular degeneration (AMD) --- and shows significantly increased predictive ability in models that include all three of the genetic, environmental and familial data sources. The case study also shows that models that contain only two of these data sources, that is, pedigree-environmental covariates, pedigree-genetic markers, or environmental covariates-genetic markers, have comparable predictive ability, though less than the model with all three. This result is consistent with the notions that genetic marker data encodes --- at least partly --- pedigree data, and that familial correlations encode shared environment data as well.

top


DeLiang Wang, Ohio State University
Title: Cocktail Party Problem as Binary Classification

Abstract: Speech segregation, or the cocktail party problem, has proven to be extremely challenging. Part of the challenge stems from the lack of a carefully analyzed computational goal. While the separation of every sound source in a mixture is considered the gold standard, I argue that such an objective is neither realistic nor what the human auditory system does. Motivated by the auditory masking phenomenon, we have suggested instead the ideal time-frequency (T-F) binary mask as a main goal for computational auditory scene analysis. Ideal binary masking retains the mixture energy in T-F units where the local signal-to-noise ratio exceeds a certain threshold, and rejects the mixture energy in other T-F units. Recent psychophysical evidence shows that ideal binary masking leads to large speech intelligibility improvements in noisy environments for both normal-hearing and hearing-impaired listeners. The effectiveness of the ideal binary mask implies that sound separation may be formulated as a case of binary classification, which opens the cocktail party problem to a variety of pattern classification and clustering methods. As an example, I discuss a recent system that segregates unvoiced speech by supervised classification of acoustic-phonetic features.
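
The ideal T-F binary mask is easy to state in code, assuming SciPy and access to the premixed clean and noise signals; that access is exactly what makes the mask "ideal" rather than computable from the mixture alone. A toy sketch:

    import numpy as np
    from scipy.signal import stft, istft

    def ideal_binary_mask(speech, noise, fs, snr_db=0.0):
        """Keep T-F units whose local SNR exceeds the threshold; zero the rest."""
        _, _, S = stft(speech, fs)
        _, _, N = stft(noise, fs)
        mask = np.abs(S) ** 2 > 10 ** (snr_db / 10) * np.abs(N) ** 2
        _, _, M = stft(speech + noise, fs)
        _, segregated = istft(M * mask, fs)
        return mask, segregated

    fs = 16000
    t = np.arange(fs) / fs
    speech = np.sin(2 * np.pi * 440 * t)            # toy "speech": a pure tone
    noise = 0.5 * np.random.default_rng(0).standard_normal(fs)
    mask, segregated = ideal_binary_mask(speech, noise, fs)
    print(mask.mean())                              # fraction of units retained

Estimating this mask from the mixture alone is then the binary classification problem the abstract proposes.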

top


Shmuel Weinberger, University of Chicago
Title: Learning Submanifolds of Euclidean Space

Abstract: I will explain some of the issues involved with deducing aspects of the topology of submanifolds of Euclidean space from random and noisy samples. Joint work with Partha Niyogi and Steve Smale.

top


Yair Weiss, Hebrew University
Title: Learning Compressed Sensing

Abstract: Compressed sensing is a recent set of mathematical results showing that sparse signals can be exactly reconstructed from a small number of linear measurements. Interestingly, for ideal sparse signals with no measurement noise, random measurements allow perfect reconstruction while measurements based on principal component analysis (PCA) or independent component analysis (ICA) do not. At the same time, for other signal and noise distributions, PCA and ICA can significantly outperform random projections in terms of enabling reconstruction from a small number of measurements. In this talk we ask: given a training set typical of the signals we wish to measure, what is the optimal set of linear projections for compressed sensing? We show that the optimal projections are in general not the principal components nor the independent components of the data, but rather a seemingly novel set of projections that capture what is still uncertain about the signal, given the training set. We also show that the projections onto the learned uncertain components may far outperform random projections. This is particularly true in the case of natural images, where random projections have vanishingly small signal to noise ratio as the number of pixels becomes large. Joint work with Hyun-Sung Chang and Bill Freeman.
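
For contrast with the learned projections in the talk, here is the random-measurement baseline the abstract starts from, assuming scikit-learn: a sparse vector is recovered from a modest number of random Gaussian measurements, with orthogonal matching pursuit standing in for the usual L1 decoder.

    import numpy as np
    from sklearn.linear_model import orthogonal_mp

    rng = np.random.default_rng(0)
    n, m, k = 256, 64, 5                   # signal length, measurements, sparsity
    x = np.zeros(n)
    x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)

    A = rng.standard_normal((m, n)) / np.sqrt(m)   # random measurement matrix
    y = A @ x                                      # m << n linear measurements

    x_hat = orthogonal_mp(A, y, n_nonzero_coefs=k)
    print(np.linalg.norm(x_hat - x))               # near-exact recovery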

Tutorial Title: Graphical Models and Applications

Abstract: I will give a brief introduction to questions of representation, learning and inference in probabilistic graphical models and illustrate these ideas in applications from our own work in computational biology and computer vision.

top


Patrick Wolfe, Harvard University
Title: Testing for Stationarity in Acoustic Signals: Parametric and Nonparametric Approaches

Abstract: This talk presents techniques for nonstationarity detection in the context of speech and audio waveforms, with broad application to any class of time series that exhibits locally stationary behavior. Many such waveforms, in particular information-carrying natural sound signals, exhibit a degree of controlled nonstationarity, and are often well modeled as slowly time-varying systems. The talk first describes the basic concepts of such systems and their analysis via local Fourier methods. Parametric approaches appropriate for speech are then introduced by way of time-varying autoregressive models, followed by nonparametric approaches based on variation of time-localized estimates of the power spectral density of an observed random process, together with an efficient offline bootstrap procedure based on the Wold representation. Several real-world examples are given.
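
One of the parametric ideas in miniature: fit a low-order autoregressive model on sliding windows and flag nonstationarity where the coefficients drift. This is a generic NumPy illustration, not the talk's procedure; the Yule-Walker fit, window size, and signal are my choices.

    import numpy as np

    def ar_coeffs(x, order):
        """Yule-Walker estimate of AR coefficients from one window."""
        r = np.array([x[:len(x) - k] @ x[k:] for k in range(order + 1)]) / len(x)
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        return np.linalg.solve(R, r[1:order + 1])

    # Slide a window along the signal; coefficient drift flags nonstationarity.
    fs = 8000
    t = np.arange(4 * fs) / fs
    x = np.sin(2 * np.pi * np.where(t > 2, 400, 300) * t)   # frequency jump at 2 s
    wins = [x[i:i + 512] for i in range(0, len(x) - 512, 512)]
    coefs = np.array([ar_coeffs(w, order=2) for w in wins])
    drift = np.linalg.norm(np.diff(coefs, axis=0), axis=1)
    print(drift.argmax() * 512 / fs)                        # change point near 2 s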

top


Stephen Wright, University of Wisconsin-Madison
Title: Optimization Algorithms in Support Vector Machines

Abstract:

top


Bin Yu, University of California, Berkeley
Title: Seeking Interpretable Models for High Dimensional Data

Abstract: Extracting useful information from high-dimensional data is the focus of today's statistical research and practice. After the broad success of statistical machine learning on prediction through regularization, interpretability is gaining attention and sparsity has been used as its proxy. With the virtues of both regularization and sparsity, Lasso (L1 penalized L2 minimization) has been very popular recently. In this talk, I would like to discuss the theory and practice of sparse modeling. First, I will give an overview of recent research on sparsity and explain what useful insights have been learned from theoretical analyses of Lasso. Second, I will present collaborative research with the Gallant Lab at Berkeley on building sparse models (linear, nonlinear, and graphical) that describe fMRI responses in primary visual cortex area V1 to natural images.
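
For concreteness, the estimator at the center of the talk in a few lines of scikit-learn (an illustrative simulated run, not the V1 analysis; the penalty level is chosen by hand here):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 500))            # p >> n
    beta = np.zeros(500); beta[:10] = 1.0          # 10 true predictors
    y = X @ beta + 0.5 * rng.standard_normal(100)

    fit = Lasso(alpha=0.1).fit(X, y)
    support = np.flatnonzero(fit.coef_)
    print(len(support), support[:10])              # a sparse, interpretable fit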

top


Ding-Xuan Zhou, City University of Hong Kong
Title: Analysis of Reproducing Kernel Spaces for Learning

Abstract: Reproducing kernel Hilbert spaces play an important role in learning theory. Their mathematical properties determine the approximation powers and learning ability of the generated algorithms. In this talk we provide some analysis of reproducing kernel Hilbert spaces, including the smoothness of functions ensured by the regularity of kernels. In particular, we show that a Gaussian kernel with a fixed variance has weak learning ability, while allowing flexible variances of the Gaussian kernels increases the approximation powers. It is verified that the union of unit balls of reproducing kernel Hilbert spaces generated by Gaussian kernels with flexible variances is a uniform Glivenko-Cantelli class. This confirms the learnability of Gaussian kernels with changing variances. Some concrete algorithms for regression and online classification are discussed and learning rates are demonstrated.
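
To make the objects in the abstract concrete, here is a minimal regularized least-squares regressor in the RKHS of a Gaussian kernel (NumPy/SciPy; sizes and parameters are illustrative). The bandwidth, i.e. the variance of the Gaussian, is exactly the quantity whose flexibility the talk analyzes.

    import numpy as np
    from scipy.spatial.distance import cdist

    def gaussian_kernel(A, B, bandwidth):
        return np.exp(-cdist(A, B, "sqeuclidean") / (2 * bandwidth ** 2))

    def krr_fit(X, y, bandwidth, lam):
        """Regularized least squares in the RKHS of a Gaussian kernel."""
        K = gaussian_kernel(X, X, bandwidth)
        return np.linalg.solve(K + lam * len(X) * np.eye(len(X)), y)

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, (200, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
    for bw in (0.01, 0.5, 5.0):            # too rough, reasonable, too smooth
        alpha = krr_fit(X, y, bw, lam=1e-3)
        pred = gaussian_kernel(X, X, bw) @ alpha
        print(bw, np.mean((pred - np.sin(X[:, 0])) ** 2))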

top


Jerry Zhu, University of Wisconsin-Madison
Tutorial Title: Semi-Supervised Learning

Abstract: This tutorial covers classification approaches that utilize both labeled and unlabeled data. We will review self-training, Gaussian mixture models, co-training, multiview learning, graph transduction and manifold regularization, transductive SVMs, and a PAC bound for semi-supervised learning. We then discuss some new developments, including online semi-supervised learning, multi-manifold learning, and human semi-supervised learning.
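
Of the methods listed, graph transduction is the quickest to sketch: propagate labels over a similarity graph until they stabilize. A generic harmonic-function-style iteration in NumPy/SciPy (my sketch, not the tutorial's code):

    import numpy as np
    from scipy.spatial.distance import cdist

    def propagate(X, y, labeled, bandwidth=1.0, iters=200):
        """Average neighbors' scores on a Gaussian graph, clamping labeled points."""
        W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * bandwidth ** 2))
        np.fill_diagonal(W, 0.0)
        P = W / W.sum(axis=1, keepdims=True)
        f = np.zeros(len(X))
        f[labeled] = y[labeled]
        for _ in range(iters):
            f = P @ f
            f[labeled] = y[labeled]           # clamp the known labels
        return np.sign(f)

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 0.5, (100, 2)), rng.normal(2, 0.5, (100, 2))])
    y = np.r_[-np.ones(100), np.ones(100)]
    labeled = np.array([0, 100])              # a single label per class
    print(np.mean(propagate(X, y, labeled) == y))   # near-perfect with 2 labels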

top


Afra Zomorodian, Dartmouth College
Title:

Abstract:

top