Publications
of Mikhail Belkin
[Journal papers]  [Conference papers]  [Ph.D. Thesis]  [Book Chapter]  [Technical Reports]
Journal Papers and preprints:

Laplacian Support Vector Machines
Trained in the Primal
[pdf, bib, code]
S. Melacci, M. Belkin,
Journal of Machine Learning Research, vol. 12, pp. 1149–1184, 2011.

+ abstract

In the last few years, due to the growing ubiquity of unlabeled data, much effort has been spent by the machine learning community to develop better understanding and improve the quality of classifiers exploiting unlabeled data. Following the manifold regularization approach, Laplacian Support Vector Machines (LapSVMs) have shown state-of-the-art performance in semi-supervised classification. In this paper we present two strategies to solve the primal LapSVM problem, in order to overcome some issues of the original dual formulation. Whereas training a LapSVM in the dual requires two steps, using the primal form allows us to collapse training to a single step. Moreover, the computational complexity of the training algorithm is reduced from O(n^3) to O(n^2) using preconditioned conjugate gradient, where n is the combined number of labeled and unlabeled examples. We speed up training by using an early stopping strategy based on the prediction on unlabeled data or, if available, on labeled validation examples. This allows the algorithm to quickly compute approximate solutions with roughly the same classification accuracy as the optimal ones, considerably reducing the training time. Due to its simplicity, training LapSVM in the primal can be the starting point for additional enhancements of the original LapSVM formulation, such as those for dealing with large datasets. We present an extensive experimental evaluation on real-world data showing the benefits of the proposed approach.

Polynomial Learning of Distribution Families [arxiv]
M. Belkin, K. Sinha, submitted. Short version, FOCS 2010.

+ abstract

The question of polynomial learnability of probability distributions, particularly Gaussian mixture distributions, has recently received significant attention in theoretical computer science and machine learning. However, despite major progress, the general question of polynomial learnability of Gaussian mixture distributions still remained open. The current work resolves the question of polynomial learnability for Gaussian mixtures in high dimension with an arbitrary fixed number of components. The result on learning Gaussian mixtures relies on an analysis of distributions belonging to what we call "polynomial families" in low dimension. These families are characterized by their moments being polynomial in parameters and include almost all common probability distributions as well as their mixtures and products. Using tools from real algebraic geometry, we show that parameters of any distribution belonging to such a family can be learned in polynomial time and using a polynomial number of sample points. The result on learning polynomial families is quite general and is of independent interest. To estimate parameters of a Gaussian mixture distribution in high dimensions, we provide a deterministic algorithm for dimensionality reduction. This allows us to reduce learning a high-dimensional mixture to a polynomial number of parameter estimations in low dimension. Combining this reduction with the results on polynomial families yields our result on learning arbitrary Gaussian mixtures in high dimensions.

On Learning with Integral Operators
[pdf, bib]
L. Rosasco, M. Belkin, E. De Vito,
Journal of Machine Learning Research, vol. 11, pp. 905–934, 2010.

+ abstract

A large number of learning algorithms, for example,
spectral clustering, kernel Principal Components
Analysis and many manifold methods are
based on estimating eigenvalues and eigenfunctions
of operators defined by a similarity function or a
kernel, given empirical data. Thus for the analysis
of algorithms, it is an important problem to
be able to assess the quality of such approximations.
The contribution of our paper is twofold.
First, we use a technique based on a concentration
inequality for Hilbert spaces to provide new simplified
proofs for a number of results in spectral
approximation. Second, using these methods we
provide several new results for estimating spectral
properties of the graph Laplacian operator extending
and strengthening results from [28].

Data spectroscopy: eigenspace of convolution operators and clustering [pdf, bib]
Tao Shi, Mikhail Belkin, Bin Yu
The Annals of Statistics, vol. 37, Number 6B (2009), pp. 3960–3984.

+ abstract

This paper focuses on obtaining clustering information about a
distribution from its i.i.d. samples. We develop theoretical results to
understand and use clustering information contained in the eigenvectors
of data adjacency matrices based on a radial kernel function
with a sufficiently fast tail decay. In particular, we provide population
analyses to gain insights into which eigenvectors should be used and
when the clustering information for the distribution can be recovered
from the sample. We learn that a fixed number of top eigenvectors
might at the same time contain redundant clustering information and
miss relevant clustering information. We use this insight to design
the Data Spectroscopic clustering (DaSpec) algorithm that utilizes
properly selected eigenvectors to determine the number of clusters
automatically and to group the data accordingly. Our findings extend
the intuitions underlying existing spectral techniques such as
spectral clustering and Kernel Principal Components Analysis, and
provide new understanding into their usability and modes of failure.
Simulation studies and experiments on real world data are conducted
to show the potential of our algorithm. In particular, DaSpec is found
to handle unbalanced groups and recover clusters of different shapes
better than the competing methods.

Convergence of Laplacian Eigenmaps [pdf, bib]
M. Belkin, P. Niyogi
preprint, short version NIPS 2008.

+ abstract

Geometrically based methods for various tasks of data analysis
have attracted considerable attention over the last few years.
In many of these algorithms, a central role is played by the
eigenvectors of the graph Laplacian of a
data-derived graph. In this paper, we show that if points are
sampled uniformly at random from an unknown submanifold ${\cal M}$
of $\R^N$, then the eigenvectors of a suitably constructed
graph Laplacian converge to the eigenfunctions of the Laplace
Beltrami operator on ${\cal M}$. This basic result directly
establishes the convergence of spectral manifold learning
algorithms such as Laplacian Eigenmaps and Diffusion Maps.
It also has implications for the understanding of geometric algorithms
in data analysis, computational harmonic analysis, geometric
random graphs, and graphics.

Towards a Theoretical Foundation for Laplacian-Based
Manifold Methods [pdf, bib]
M. Belkin, P. Niyogi
Journal of Computer and System Sciences, 2008.
Volume 74, Issue 8, pp. 1289–1308
Special Issue on Learning Theory (invited).

+ abstract

In recent years manifold methods have attracted a considerable amount of attention in machine learning. However, most algorithms in that class may be termed "manifold-motivated", as they lack explicit theoretical guarantees. In this paper we take a step towards closing the gap between theory and practice for a class of Laplacian-based manifold methods. These methods utilize the graph Laplacian associated to a data set for a variety of applications in semi-supervised learning, clustering, and data representation.
We show that under certain conditions the graph Laplacian of a point cloud of data samples converges to the Laplace-Beltrami operator on the underlying manifold. Theorem 3.1 contains the first result showing convergence of a random graph Laplacian to the manifold Laplacian in the context of machine learning.

Manifold Regularization:
a Geometric Framework for Learning from Labeled and
Unlabeled Examples [pdf, bib]
M. Belkin, P. Niyogi, V. Sindhwani
Journal of Machine Learning Research, 7(Nov):2399–2434,
2006.

+ abstract
We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods, including support vector machines and regularized least squares, can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide a theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally, we briefly discuss unsupervised and fully supervised learning within our general framework.
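As an illustration of the framework, its Laplacian Regularized Least Squares special case admits a closed-form solution over the labeled and unlabeled points via the Representer theorem. The sketch below is ours, not the authors' reference code; the Gaussian kernel, dense heat-kernel graph, and all names (`lap_rls`, `gamma_A`, `gamma_I`) are illustrative assumptions:

```python
import numpy as np

def lap_rls(X, y_labeled, n_labeled, gamma_A=1e-2, gamma_I=1e-1, sigma=1.0, t=1.0):
    """Laplacian RLS sketch: coefficients over all labeled + unlabeled points,
        alpha = (J K + gA*l*I + gI*l/(l+u)^2 * L K)^{-1} J y,
    where J selects labeled points. Gaussian kernel and a dense heat-kernel
    graph Laplacian are assumptions of this sketch, not of the framework."""
    n = X.shape[0]
    l, u = n_labeled, n - n_labeled
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    K = np.exp(-d2 / (2 * sigma ** 2))                    # Gaussian kernel matrix
    W = np.exp(-d2 / t)                                   # graph weights
    L = np.diag(W.sum(1)) - W                             # unnormalized Laplacian
    J = np.zeros((n, n)); J[:l, :l] = np.eye(l)           # mask for labeled points
    y = np.zeros(n); y[:l] = y_labeled
    A = J @ K + gamma_A * l * np.eye(n) + (gamma_I * l / (l + u) ** 2) * (L @ K)
    alpha = np.linalg.solve(A, J @ y)
    # natural out-of-sample extension: evaluate the kernel expansion anywhere
    return lambda Z: np.exp(-((Z[:, None, :] - X[None, :, :]) ** 2).sum(-1)
                            / (2 * sigma ** 2)) @ alpha
```

Because the expansion is over kernel functions rather than graph nodes, the returned function can be evaluated at novel points, which is the out-of-sample extension the abstract contrasts with purely graph-based approaches.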

Consistency of Spectral Clustering [pdf, bib]
U. von Luxburg, M. Belkin, O. Bousquet,
The Annals of Statistics 2008, Vol. 36, No. 2, pp. 555–586.

+ abstract

Consistency is a key property of statistical
algorithms when the data is drawn from some
underlying probability distribution. Surprisingly,
despite decades of work, little is known about
consistency of most clustering algorithms. In this
paper we investigate consistency of the popular
family of spectral clustering algorithms, which
clusters the data with the help of eigenvectors of
graph Laplacian matrices. We develop new methods to
establish that for increasing sample size, those
eigenvectors converge to the eigenvectors of certain
limit operators. As a result we can prove that one of
the two major classes of spectral clustering
(normalized clustering) converges under very general
conditions, while the other (unnormalized clustering)
is only consistent under strong additional
assumptions, which are not always satisfied in real
data. We conclude that our analysis provides strong
evidence for the superiority of normalized spectral
clustering.
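The two objects the abstract contrasts can be written down directly. The following is a small illustrative sketch (our naming, not the paper's notation), building both Laplacians from a weight matrix and performing a standard two-way normalized spectral cut:

```python
import numpy as np

def graph_laplacians(W):
    """The two Laplacians whose limit behavior the paper contrasts:
    unnormalized L = D - W and normalized L_sym = I - D^{-1/2} W D^{-1/2}."""
    d = W.sum(1)
    L_un = np.diag(d) - W
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(d)) - D_inv_sqrt @ W @ D_inv_sqrt
    return L_un, L_sym

def normalized_spectral_clustering(W):
    """Two-way split from the sign of the second-smallest eigenvector of the
    normalized Laplacian (one common variant of normalized spectral clustering)."""
    _, L_sym = graph_laplacians(W)
    vals, vecs = np.linalg.eigh(L_sym)    # eigenvalues in ascending order
    return (vecs[:, 1] > 0).astype(int)   # sign of the Fiedler-like vector
```

On a graph with two densely connected blocks and weak cross-links, the sign pattern of the second eigenvector recovers the blocks.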

Semi-supervised Learning on Riemannian Manifolds
[pdf, bib]
M. Belkin, P. Niyogi
Machine Learning, 56, Special Issue on
Clustering (invited), pp. 209–239, 2004.

+ abstract

We consider the general problem of utilizing both
labeled and unlabeled data to improve classification
accuracy. Under the assumption that the data lie on a
submanifold in a high dimensional space, we develop
an algorithmic framework to classify a partially
labeled data set in a principled manner. The central
idea of our approach is that classification functions
are naturally defined only on the submanifold in
question rather than the total ambient space. Using
the Laplace-Beltrami operator one produces a basis
(the Laplacian Eigenmaps) for a Hilbert space of
square integrable functions on the submanifold. To
recover such a basis, only unlabeled examples are
required. Once such a basis is obtained, training can
be performed using the labeled data set. Our
algorithm models the manifold using the adjacency
graph for the data and approximates the Laplace-Beltrami
operator by the graph Laplacian. We provide
details of the algorithm, its theoretical
justification, and several practical applications for
image, speech, and text classification.

Laplacian Eigenmaps for Dimensionality Reduction and
Data Representation [pdf, bib]
M. Belkin, P. Niyogi
Neural Computation, June 2003; 15(6):1373–1396.

+ abstract

One of the central problems in machine learning and pattern recognition is to develop appropriate representations for complex data. We consider the problem of constructing a representation for data lying on a low-dimensional manifold embedded in a high-dimensional space. Drawing on the correspondence between the graph Laplacian, the Laplace-Beltrami operator on the manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for representing the high-dimensional data. The algorithm provides a computationally efficient approach to nonlinear dimensionality reduction that has locality-preserving properties and a natural connection to clustering. Some potential applications and illustrative examples are discussed.
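The embedding described in the abstract can be sketched in a few lines. This is a minimal illustration under our own choices (kNN graph, heat-kernel bandwidth `t`), not the authors' reference implementation: build heat-kernel weights on a neighborhood graph, then take the bottom non-constant solutions of the generalized problem L f = λ D f.

```python
import numpy as np

def laplacian_eigenmaps(X, n_neighbors=10, t=1.0, dim=2):
    """Minimal Laplacian Eigenmaps sketch: heat-kernel weights on a
    symmetrized kNN graph, then the bottom non-constant generalized
    eigenvectors of L f = lam D f."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    nbrs = np.argsort(d2, axis=1)[:, 1:n_neighbors + 1]   # skip self at col 0
    W = np.zeros((n, n))
    rows = np.repeat(np.arange(n), n_neighbors)
    W[rows, nbrs.ravel()] = np.exp(-d2[rows, nbrs.ravel()] / t)
    W = np.maximum(W, W.T)                                # symmetrize the graph
    deg = W.sum(1)
    L = np.diag(deg) - W                                  # graph Laplacian
    Dm12 = np.diag(1.0 / np.sqrt(deg))
    # L f = lam D f  <=>  (D^-1/2 L D^-1/2) u = lam u  with  f = D^-1/2 u
    _, U = np.linalg.eigh(Dm12 @ L @ Dm12)
    F = Dm12 @ U
    return F[:, 1:dim + 1]                                # drop the constant mode
```

The bottom eigenvector is constant (eigenvalue 0) and is discarded; the next `dim` eigenvectors give the locality-preserving coordinates.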
Refereed and Invited Conference
Proceedings:

Data Skeletonization via Reeb Graphs
X. Ge, I. Safa, M. Belkin, Y. Wang, NIPS 2011.

+ abstract

Recovering hidden structure from complex and noisy nonlinear data is one of the most fundamental problems in machine learning and statistical inference. While such data is often high-dimensional, it is of interest to approximate it with a low-dimensional or even one-dimensional space, since many important aspects of data are often intrinsically low-dimensional. Furthermore, there are many scenarios where the underlying structure is graph-like, e.g., river/road networks or various trajectories. In this paper, we develop a framework to extract, as well as to simplify, a one-dimensional "skeleton" from unorganized data using the Reeb graph. Our algorithm is very simple, does not require complex optimizations and can be easily applied to unorganized high-dimensional data such as point clouds or proximity graphs. It can also represent arbitrary graph structures in the data. We also give theoretical results to justify our method. We provide a number of experiments to demonstrate the effectiveness and generality of our algorithm, including comparisons to existing methods, such as principal curves. We believe that the simplicity and practicality of our algorithm will help to promote skeleton graphs as a data analysis tool for a broad range of applications.

An Iterated Graph Laplacian Approach for Ranking on Manifolds
Xueyuan Zhou, Mikhail Belkin, Nathan Srebro, KDD 2011.

+ abstract

Ranking is one of the key problems in information retrieval.
Recently, there has been significant interest in a class of
ranking algorithms based on the assumption that data is
sampled from a low dimensional manifold embedded in a
higher dimensional Euclidean space.
In this paper, we study a popular graph Laplacian based ranking algorithm [23] using an analytical method, which provides theoretical insights into the ranking algorithm going beyond the intuitive idea of diffusion. Our analysis shows that the algorithm is sensitive to a commonly used parameter due to the use of the symmetric normalized graph Laplacian. We also show that the ranking function may diverge to infinity at the query point in the limit of infinite samples. To address these issues, we propose an improved ranking algorithm on manifolds using Green's function of an iterated unnormalized graph Laplacian, which is more robust and density adaptive, as well as pointwise continuous in the limit of infinite samples.
We also, for the first time in the ranking literature, empirically explore two variants from a family of twice-normalized graph Laplacians. Experimental results on text and image data support our analysis and also suggest the potential value of twice-normalized graph Laplacians in practice.

Semi-supervised Learning by Higher Order Regularization [link]
X. Zhou, M. Belkin, AISTATS 2011.

+ abstract

Solutions of several graph Laplacian regularization based semi-supervised learning algorithms were recently shown by Nadler et al. (2009) to degenerate to constant functions with ``spikes'' at labeled points given infinite unlabeled points in $\R^d$ for $d\ge 2$. These algorithms solve optimization problems by penalizing an energy term $\int_{\Omega}\|\nabla f(x)\|^2 dx$. In this paper, we address this problem by using regularization based on the iterated Laplacian, which is equivalent to a higher order Sobolev semi-norm. Alternatively, it can be viewed as a generalization of the thin plate spline to an unknown domain in high dimension. We also discuss relationships between Reproducing Kernel Hilbert Spaces and Green's functions. Experimental results support our analysis by showing consistently improved results by using iterated Laplacians.

Polynomial Learning of Distribution Families [arxiv, bib]
M. Belkin, K. Sinha, FOCS 2010.

+ abstract

The question of polynomial learnability of probability distributions, particularly Gaussian mixture distributions, has recently received significant attention in theoretical computer science and machine learning. However, despite major progress, the general question of polynomial learnability of Gaussian mixture distributions still remained open. The current work resolves the question of polynomial learnability for Gaussian mixtures in high dimension with an arbitrary fixed number of components. The result on learning Gaussian mixtures relies on an analysis of distributions belonging to what we call "polynomial families" in low dimension. These families are characterized by their moments being polynomial in parameters and include almost all common probability distributions as well as their mixtures and products. Using tools from real algebraic geometry, we show that parameters of any distribution belonging to such a family can be learned in polynomial time and using a polynomial number of sample points. The result on learning polynomial families is quite general and is of independent interest. To estimate parameters of a Gaussian mixture distribution in high dimensions, we provide a deterministic algorithm for dimensionality reduction. This allows us to reduce learning a high-dimensional mixture to a polynomial number of parameter estimations in low dimension. Combining this reduction with the results on polynomial families yields our result on learning arbitrary Gaussian mixtures in high dimensions.

Toward Learning Gaussian Mixtures with Arbitrary Separation [pdf]
Mikhail Belkin, Kaushik Sinha, COLT 2010

+ abstract

In recent years analysis of complexity of learning Gaussian mixture models from sampled data
has received significant attention in computational machine learning and theory communities. In
this paper we present the first result showing that polynomial time learning of multidimensional
Gaussian Mixture distributions is possible when the separation between the component means is
arbitrarily small. Specifically, we present an algorithm for learning the parameters of a mixture of k
identical spherical Gaussians in n-dimensional space with an arbitrarily small separation between
the components, which is polynomial in dimension, inverse component separation and other input
parameters for a fixed number of components k. The algorithm uses a projection to k dimensions
and then a reduction to the 1-dimensional case. It relies on a theoretical analysis showing that
two 1-dimensional mixtures whose densities are close in the L2 norm must have similar means
and mixing coefficients. To produce the necessary lower bound for the L2 norm in terms of the
distances between the corresponding means, we analyze the behavior of the Fourier transform of a
mixture of Gaussians in one dimension around the origin, which turns out to be closely related to
the properties of the Vandermonde matrix obtained from the component means. Analysis of minors
of the Vandermonde matrix together with basic function approximation results allows us to provide
a lower bound for the norm of the mixture in the Fourier domain and hence a bound in the original
space. Additionally, we present a separate argument for reconstructing variance.

Semi-supervised Learning using Sparse
Eigenfunction Bases
[pdf, bib]
K. Sinha, M. Belkin, NIPS 2009.

+ abstract

We present a new framework for semi-supervised learning with sparse eigenfunction bases of kernel matrices. It turns out that when the data is clustered, that is, when the high density regions are sufficiently separated by low density valleys, each high density area corresponds to a unique representative eigenvector. Linear combinations of such eigenvectors (or, more precisely, of their Nystrom extensions) provide good candidates for classification functions when the cluster assumption holds. By first choosing an appropriate basis of these eigenvectors from unlabeled data and then using labeled data with Lasso to select a classifier in the span of these eigenvectors, we obtain a classifier which has a very sparse representation in this basis. Importantly, the sparsity corresponds naturally to the cluster assumption.
Experimental results on a number of real-world datasets show that our method is competitive with state of the art semi-supervised learning algorithms and outperforms the natural baseline algorithm (Lasso in the Kernel PCA basis).

A Note on Learning with Integral Operators
[pdf, bib]
L. Rosasco, M. Belkin, E. De Vito, COLT 2009.

+ abstract

A large number of learning algorithms, for example,
spectral clustering, kernel Principal Components
Analysis and many manifold methods are
based on estimating eigenvalues and eigenfunctions
of operators defined by a similarity function or a
kernel, given empirical data. Thus for the analysis
of algorithms, it is an important problem to
be able to assess the quality of such approximations.
The contribution of our paper is twofold.
First, we use a technique based on a concentration
inequality for Hilbert spaces to provide new simplified
proofs for a number of results in spectral
approximation. Second, using these methods we
provide several new results for estimating spectral
properties of the graph Laplacian operator extending
and strengthening results from [28].

Constructing Laplace Operator from Point Clouds in R^d
[pdf, bib, code]
M. Belkin, J. Sun, Y. Wang, SODA 2009, pp. 1031–1040.

+ abstract

We present an algorithm for approximating the Laplace-Beltrami operator from an arbitrary point cloud obtained from a k-dimensional manifold embedded in the d-dimensional space. We show that this PCD Laplace (Point Cloud Data Laplace) operator converges to the Laplace-Beltrami operator on the underlying manifold as the point cloud becomes denser. Unlike previous work, we do not assume that the data samples are independent identically distributed from a probability distribution and do not require a global mesh. The resulting algorithm is easy to implement. We present experimental results indicating that even for point sets sampled from a uniform distribution, PCD Laplace converges faster than the weighted graph Laplacian. We also show that using our PCD Laplacian we can directly estimate certain geometric invariants, such as manifold area.

Data Spectroscopy: Learning Mixture Models using Eigenspaces of
Convolution Operators
[pdf, bib]
T. Shi, M. Belkin, B. Yu, ICML 2008.

+ abstract

In this paper we develop a spectral framework for estimating mixture distributions, specifically Gaussian mixture models. In physics, spectroscopy is often used for the identification of substances through their spectrum. Treating a kernel function K(x, y) as "light" and the sampled data as "substance", the spectrum of their interaction (eigenvalues and eigenvectors of the kernel matrix K) unveils certain aspects of the underlying parametric distribution p, such as the parameters of a Gaussian mixture. Our approach extends the intuitions and analyses underlying the existing spectral techniques, such as spectral clustering and Kernel Principal Components Analysis (KPCA).
We construct algorithms to estimate parameters of Gaussian mixture models, including the number of mixture components, their means and covariance matrices, which are important in many practical applications. We provide a theoretical framework and show encouraging experimental results.

Discrete Laplace Operator for Meshed Surfaces
[pdf, bib]
M. Belkin, J. Sun, Y. Wang, SOCG 2008.

+ abstract

In recent years a considerable amount of work in graphics and geometric optimization used tools based on the Laplace-Beltrami operator on a surface. The applications of the Laplacian include mesh editing, surface smoothing, and shape interpolations among others. However, it has been shown [12, 23, 25] that the popular cotangent approximation schemes do not provide convergent pointwise (or even L^2) estimates, while many applications rely on pointwise estimation. Existence of such schemes has been an open question [12].
In this paper we propose the first algorithm for approximating the Laplace operator of a surface from a mesh with pointwise convergence guarantees applicable to arbitrary meshed surfaces. We show that for a sufficiently fine mesh over an arbitrary surface, our mesh Laplacian is close to the Laplace-Beltrami operator on the surface at every point of the surface.
Moreover, the proposed algorithm is simple and easily implementable. Experimental evidence shows that our algorithm exhibits convergence empirically and outperforms cotangent-based methods in providing accurate approximation of the Laplace operator for various meshes.

Component Based Shape Retrieval Using Differential Profiles.
L. Ding, M. Belkin, Proceedings of ACM International Conference on Multimedia Information Retrieval, 2008.

+ abstract

In this paper, we describe the use of differential profiles, which are computed from 2D shapes smoothed with Gaussian
functions, as the shape features for building a shape retrieval system. We build a global shape component dictionary
from all the shape descriptors collected from shapes available in a database and then represent each shape as a
probabilistic mixture of elements from such a dictionary. Finally, shape retrieval from a given database is simply
done by comparing the mixing coefficients of the model of a query shape and those of known shapes. Our retrieval
experiments are done on both object contour and line drawing collections and show promising results.

The value of labeled and unlabeled examples when the model is imperfect
[pdf, bib]
K. Sinha, M. Belkin, NIPS 2007.

+ abstract

Semi-supervised learning, i.e. learning from both labeled and unlabeled data, has received significant attention in the machine learning literature in recent years. Still, our understanding of the theoretical foundations of the usefulness of unlabeled data remains somewhat limited. The simplest and best understood situation is when the data is described by an identifiable mixture model, and where each class comes from a pure component. This natural setup and its implications were analyzed in [11, 5]. One important result was that in certain regimes, labeled data becomes exponentially more valuable than unlabeled data.
However, in most realistic situations, one would not expect that the data comes from a parametric mixture distribution with identifiable components. There have been recent efforts to analyze the non-parametric situation; for example, "cluster" and manifold assumptions have been suggested as a basis for analysis. Still, a satisfactory and fairly complete theoretical understanding of the non-parametric problem, similar to that in [11, 5], has not yet been developed.
In this paper we investigate an intermediate situation, when the data comes from a probability distribution which can be modeled, but not perfectly, by an identifiable mixture distribution. This seems applicable to many situations, for example, when a mixture of Gaussians is used to model the data. The contribution of this paper is an analysis of the role of labeled and unlabeled data depending on the amount of imperfection in the model.

Convergence of Laplacian
Eigenmaps [long tech report]
M. Belkin, P. Niyogi, NIPS 2006.

On the Relation Between Low Density Separation,
Spectral Clustering and Graph Cuts [pdf, bib]
H. Narayanan, M. Belkin, P. Niyogi, NIPS 2006.

+ abstract

One of the intuitions underlying many graph-based
methods for clustering and semi-supervised learning
is that class or cluster boundaries pass through
areas of low probability density. In this paper we
provide some formal analysis of that notion for a
probability distribution. We introduce a notion of
weighted boundary volume, which measures the length
of the class/cluster boundary weighted by the density
of the underlying probability distribution. We show
that sizes of the cuts of certain commonly used data
adjacency graphs converge to this continuous weighted
volume of the boundary.

Heat Flow and a Faster Algorithm to Compute the
Surface Area of a Convex Body [pdf, bib]
M. Belkin, H. Narayanan, P. Niyogi, FOCS 2006.

+ abstract

We draw on the observation that the amount of heat
diffusing outside of a heated body in a short period
of time is proportional to its surface area, to
design a simple algorithm for approximating the
surface area of a convex body given by a membership
oracle. Our method has a complexity of O*(n^4), where
n is the dimension, compared to O*(n^8.5) for the
previous best algorithm. We show that our complexity
cannot be improved given the current state of the art
in volume estimation.
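The observation is easy to check numerically. Below is an illustrative Monte Carlo sketch of our own construction (not the paper's membership-oracle algorithm): seed heat uniformly in the body, take one Gaussian step of variance t, and invert the small-t relation escaped fraction ≈ (A/V)·sqrt(t/(2π)). All names here are illustrative assumptions.

```python
import numpy as np

def surface_area_via_heat(is_inside, sample_inside, volume, dim, t=1e-4,
                          n=200_000, seed=0):
    """Illustrative Monte Carlo version of the heat-flow idea: the fraction
    of uniformly seeded 'heat' that escapes after a Gaussian step of
    variance t is ~ (A / V) * sqrt(t / (2*pi)) for small t; invert that
    relation to estimate the surface area A."""
    rng = np.random.default_rng(seed)
    pts = sample_inside(rng, n)                       # uniform samples in the body
    stepped = pts + rng.normal(scale=np.sqrt(t), size=(n, dim))
    escaped = np.mean(~is_inside(stepped))            # heat that left the body
    return escaped * volume * np.sqrt(2 * np.pi / t)

# Example body: the unit disk in the plane (area pi, perimeter 2*pi).
def disk_sampler(rng, n):
    r = np.sqrt(rng.uniform(size=n))                  # uniform over the disk
    th = rng.uniform(0, 2 * np.pi, size=n)
    return np.column_stack([r * np.cos(th), r * np.sin(th)])

def in_disk(P):
    return (P ** 2).sum(1) <= 1.0
```

For the unit disk the estimate should be close to 2π; the step variance t trades the small-t bias of the boundary-layer approximation against Monte Carlo noise.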

Maximum Margin Semi-Supervised Learning for Structured
Variables [pdf, bib]
Y. Altun, D. McAllester, M. Belkin, NIPS 2005.

+ abstract

Many real-world classification problems involve the prediction of multiple interdependent variables forming some structural dependency. Recent progress in machine learning has mainly focused on supervised classification of such structured variables. In this paper, we investigate structured classification in a semi-supervised setting. We present a discriminative approach that utilizes the intrinsic geometry of input patterns revealed by unlabeled data points and we derive a maximum-margin formulation of semi-supervised learning for structured variables. Unlike transductive algorithms, our formulation naturally extends to new test points.

Beyond the Point Cloud: from Transductive to
Semi-supervised Learning [pdf, bib]
V. Sindhwani, P. Niyogi, M. Belkin, ICML 2005.

+ abstract

Due to its occurrence in engineering domains and implications for natural learning, the problem of utilizing unlabeled data is attracting increasing attention in machine learning. A large body of recent literature has focused on the transductive setting where labels of unlabeled examples are estimated by learning a function defined only over the point cloud data. In a truly semi-supervised setting, however, a learning machine has access to labeled and unlabeled examples and must make predictions on data points never encountered before. In this paper, we show how to turn transductive and standard supervised learning algorithms into semi-supervised learners. We construct a family of data-dependent norms on Reproducing Kernel Hilbert Spaces (RKHS). These norms allow us to warp the structure of the RKHS to reflect the underlying geometry of the data. We derive explicit formulas for the corresponding new kernels. Our approach demonstrates state-of-the-art performance on a variety of classification tasks.

A Co-Regularization Approach to Semi-supervised
Learning with Multiple Views [pdf]
V. Sindhwani, P. Niyogi, M. Belkin,
ICML Workshop on Learning with Multiple Views, 2005.

+ abstract

The Co-Training algorithm uses unlabeled examples in multiple views to bootstrap classifiers in each view, typically in a greedy manner, and operating under assumptions of view-independence and compatibility. In this paper, we propose a Co-Regularization framework where classifiers are learnt in each view through forms of multi-view regularization. We propose algorithms within this framework that are based on optimizing measures of agreement and smoothness over labeled and unlabeled examples. These algorithms naturally extend standard regularization methods like Support Vector Machines (SVM) and Regularized Least Squares (RLS) for multi-view semi-supervised learning, and inherit their benefits and applicability to high-dimensional classification problems. An empirical investigation is presented that confirms the promise of this approach.

Linear Manifold Regularization for Large Scale
Semi-supervised Learning
V. Sindhwani, P. Niyogi, M. Belkin, S. Keerthi
ICML Workshop on Learning with Partially Classified
Training Data, 2005

Towards a Theoretical Foundation for Laplacian-Based
Manifold Methods [pdf, bib]
M. Belkin, P. Niyogi, COLT 2005

On Manifold Regularization [pdf]
M. Belkin, P. Niyogi, V. Sindhwani, AISTATS 2005

Limits of Spectral Clustering [pdf] Outstanding
Student Paper Award
U. von Luxburg, O. Bousquet, M. Belkin, NIPS 2004.

Regularization and Semi-supervised Learning on Large
Graphs [pdf, bib]
M. Belkin, I. Matveeva, P. Niyogi, COLT 2004.

On the Convergence of Spectral Clustering on Random
Samples: the Normalized Case [pdf]
U. von Luxburg, O. Bousquet, M. Belkin, COLT 2004.

Tikhonov Regularization and Semi-supervised Learning
on Large Graphs
M. Belkin, I. Matveeva, P. Niyogi
ICASSP 2004, Special Session: Manifolds and Geometry in
Signal Processing (Invited paper).

Using Manifold Structure for Partially Labeled
Classification [pdf]
M. Belkin, P. Niyogi, NIPS 2002

Laplacian Eigenmaps and Spectral Techniques for
Embedding and Clustering [pdf]
M. Belkin, P. Niyogi, NIPS 2001.

Using Eigenvectors of the Bigram Graph to Infer
Morpheme Identity [pdf]
M. Belkin, J. Goldsmith
Morphological and Phonological Learning: Proceedings of
the 6th Workshop
of the ACL Special Interest Group in Computational
Phonology (SIGPHON), Philadelphia, July 2002, pp. 41–47.
Ph.D. Thesis:

Problems of Learning on Manifolds [pdf, bib]
Ph.D. Dissertation, University of Chicago, Dept. of
Mathematics, 2003.
Advisor: Partha Niyogi.
Book chapter:

The Geometric Basis of Semi-supervised Learning [pdf]
V. Sindhwani, M. Belkin, P. Niyogi.
Book chapter in "Semi-Supervised Learning", Chapelle, O.,
B. Schölkopf and A. Zien, Eds., MIT Press, 2006.
Technical Reports:

Consistency of Spectral Clustering [pdf]
U. von Luxburg, M. Belkin, O. Bousquet
Max Planck Institute for Biological Cybernetics Technical
Report TR-134, 2004. (The Annals of Statistics, to
appear)

+ abstract

Consistency is a key property of statistical
algorithms when the data is drawn from some
underlying probability distribution. Surprisingly,
despite decades of work, little is known about
consistency of most clustering algorithms. In this
paper we investigate consistency of the popular
family of spectral clustering algorithms, which
clusters the data with the help of eigenvectors of
graph Laplacian matrices. We develop new methods to
establish that for increasing sample size, those
eigenvectors converge to the eigenvectors of certain
limit operators. As a result we can prove that one of
the two major classes of spectral clustering
(normalized clustering) converges under very general
conditions, while the other (unnormalized clustering)
is only consistent under strong additional
assumptions, which are not always satisfied in real
data. We conclude that our analysis provides strong
evidence for the superiority of normalized spectral
clustering.

Manifold Regularization: a Geometric Framework for
Learning from Examples
M. Belkin, P. Niyogi, V. Sindhwani
University of Chicago CS Technical Report TR-2004-06,
2004.

Regression and Regularization on Large Graphs
[link]
M. Belkin, I. Matveeva, P. Niyogi
The University of Chicago CS Technical Report
TR-2003-11.

Semi-supervised learning on manifolds
[link]
M. Belkin, P. Niyogi
The University of Chicago CS Technical Report
TR-2002-12.

Laplacian Eigenmaps for Dimensionality Reduction and
Data Representation
[link]
M. Belkin, P. Niyogi
The University of Chicago CS Technical Report TR-2002-01.