Mikhail Belkin

Associate Professor
Department of Computer Science and Engineering
Department of Statistics (courtesy)
Center for Cognitive Science

The Ohio State University,
Computer Science and Engineering,
2015 Neil Avenue, Dreese Labs 597.
Columbus, OH 43210 [map]
phone: 614-292-5841
email: mbelkin at cse.ohio-state.edu
Research interests: theory, applications and limitations of statistical machine learning.

My research focuses on designing and analyzing practical algorithms for machine learning based on non-linear structure of high dimensional data, in particular manifold and spectral methods. I am also interested in a range of theoretical questions concerning the computational and statistical limits of learning and in the mathematical foundations of learning structure from data.


JMLR action editor
ICML 2017 (area chair)
AAAI 2017 (SPC)
COLT 2016
AI & Statistics 2015 (SPC)
NIPS 2014 (area chair)
ICML 2014 (area chair)
NIPS 2011 (area chair)
NIPS 2010 (area chair)
ICML 2010 (area chair)
COLT 2010
SDM 2010 (local chair)

Curriculum Vitae [pdf]

Midwest Machine Learning synposium, The University of Chicago/TTIC, 2017.
2nd Information Modelling and Control of Complex Systems, OSU, 2017.
Information Modelling and Control of Complex Systems, OSU, 2016.
2014 Simons Institute Workshop on Spectral Algorithms: From Theory to Practice (co-chair)
SDM 2010 Workshop on Clustering
2009 Chicago Machine Learning Summer School
AAAI 2009 Fall Symposium on Manifold Learning and its Applications
Geometry, Random Matrices, and Statistical Inference (SAMSI 2007)
2005 Chicago Machine Learning Summer School.

Complete publication list in Google Scholar.
Selected and recent papers:
  • Diving into the shallows: a computational perspective on large-scale shallow learning [arxiv]
    Siyuan Ma, Mikhail Belkin
    + abstract
  • Graphons, mergeons, and so on! [arxiv, 3-min pre-NIPS video, NIPS video]
    Justin Eldridge, Mikhail Belkin, Yusu Wang, NIPS 2016 (oral presentation, 2% of submissions)
    + abstract
    In this work we develop a theory of hierarchical clustering for graphs. Our modeling assumption is that graphs are sampled from a graphon, which is a powerful and general model for generating graphs and analyzing large networks. Graphons are a far richer class of graph models than stochastic blockmodels, the primary setting for recent progress in the statistical theory of graph clustering. We define what it means for an algorithm to produce the "correct" clustering, give sufficient conditions in which a method is statistically consistent, and provide an explicit algorithm satisfying these properties.
  • Clustering with Bregman Divergences: an Asymptotic Analysis [link]
    Chaoyue Liu, Mikhail Belkin, NIPS 2016
    + abstract
    Clustering, in particular k-means clustering, is a central topic in data analysis. Clustering with Bregman divergences is a recently proposed generalization of k-means clustering which has already been widely used in applications. In this paper we analyze theoretical properties of Bregman clustering when the number of the clusters k is large. We establish quantization rates and describe the limiting distribution of the centers as k tends to infinity, extending well-known results for k-means clustering.
  • Learning a Hidden Basis Through Imperfect Measurements: An Algorithmic Primitive [arxiv]
    Mikhail Belkin, Luis Rademacher, James Voss, COLT 2016
    + abstract
    A number of important problems in theoretical computer science and machine learning can be interpreted as recovering a certain basis. These include certain tensor decompositions, Independent Component Analysis (ICA), spectral clustering and Gaussian mixture learning. Each of these problems reduces to an instance of our general model, which we call a "Basis Encoding Function" (BEF). We show that learning a basis within this model can then be provably and efficiently achieved using a first order iteration algorithm (gradient iteration). Our algorithm goes beyond tensor methods, providing a function-based generalization for a number of existing methods including the classical matrix power method, the tensor power iteration as well as cumulant-based FastICA. Our framework also unifies the unusual phenomenon observed in these domains that they can be solved using efficient non-convex optimization. Specifically, we describe a class of BEFs such that their local maxima on the unit sphere are in one-to-one correspondence with the basis elements. This description relies on a certain "hidden convexity" property of these functions. We provide a complete theoretical analysis of gradient iteration even when the BEF is perturbed. We show convergence and complexity bounds polynomial in dimension and other relevant parameters, such as perturbation size. Our perturbation results can be considered as a non-linear version of the classical Davis-Kahan theorem for perturbations of eigenvectors of symmetric matrices. In addition we show that our algorithm exhibits fast (superlinear) convergence and relate the speed of convergence to the properties of the BEF. Moreover, the gradient iteration algorithm can be easily and efficiently implemented in practice. Finally we apply our framework by providing the first provable algorithm for recovery in a general perturbed ICA model.
  • Back to the future: Radial Basis Function networks revisited [link]
    Qichao Que, Mikhail Belkin, AI & Statistics 2016.
    + abstract
    Radial Basis Function (RBF) networks are a classical family of algorithms for supervised learning. The most popular approach for training RBF networks has relied on kernel methods using regularization based on a norm in a Reproducing Kernel Hilbert Space (RKHS), which is a principled and empirically successful framework. In this paper we aim to revisit some of the older approaches to training the RBF networks from a more modern perspective. Specifically, we analyze two common regularization procedures, one based on the square norm of the coefficients in the network and another on using centers obtained by k-means clustering. We show that both of these RBF methods can be recast as certain data-dependent kernels. We provide a theoretical analysis of these methods as well as a number of experimental results, pointing out very competitive experimental performance as well as certain advantages over the standard kernel methods in terms of both flexibility (incorporating of unlabeled data) and computational complexity. Finally, our results shed light on some impressive recent successes of using soft k-means features for image recognition and other tasks.
  • The Hidden Convexity of Spectral Clustering [arxiv]
    James Voss, Mikhail Belkin, Luis Rademacher, AAAI 2016 (oral presentation)
    + abstract
    In recent years, spectral clustering has become a standard method for data analysis used in a broad range of applications. In this paper we propose a new class of algorithms for multiway spectral clustering based on optimization of a certain "contrast function" over a sphere. These algorithms are simple to implement, efficient and, unlike most of the existing algorithms for multiclass spectral clustering, are not initialization-dependent. Moreover, they are applicable without modification for normalized and un-normalized clustering, which are two common variants of spectral clustering. Geometrically, the proposed algorithms can be interpreted as recovering a discrete weighted simplex by means of function optimization. We give complete necessary and sufficient conditions on contrast functions for the optimization to guarantee recovery of clusters. We show how these conditions can be interpreted in terms of certain "hidden convexity" of optimization over a sphere.
  • Learning Privately from Multiparty Data [arxiv]
    Jihun Hamm, Paul Cao, Mikhail Belkin, ICML 2016
    + abstract
    Learning a classifier from private data collected by multiple parties is an important problem that has many potential applications. How can we build an accurate and differentially private global classifier by combining locally-trained classifiers from different parties, without access to any party's private data? We propose to transfer the `knowledge' of the local classifier ensemble by first creating labeled data from auxiliary unlabeled data, and then train a global ε-differentially private classifier. We show that majority voting is too sensitive and therefore propose a new risk weighted by class probabilities estimated from the ensemble. Relative to a non-private solution, our private solution has a generalization error bounded by O(ε^2M^2) where M is the number of parties. This allows strong privacy without performance loss when M is large, such as in crowdsensing applications. We demonstrate the performance of our method with realistic tasks of activity recognition, network intrusion detection, and malicious URL detection.
  • A Pseudo-Euclidean Iteration for Optimal Recovery in Noisy ICA [link]
    James Voss, Mikhail Belkin, Luis Rademacher, NIPS 2015
    Independent Component Analysis (ICA) is a popular model for blind signal separation. The ICA model assumes that a number of independent source signals are linearly mixed to form the observed signals. We propose a new algorithm, PEGI (for pseudo-Euclidean Gradient Iteration), for provable model recovery for ICA with Gaussian noise. The main technical innovation of the algorithm is to use a fixed point iteration in a pseudo-Euclidean (indefinite "inner product") space. The use of this indefinite "inner product" resolves technical issues common to several existing algorithms for noisy ICA. This leads to an algorithm which is conceptually simple, efficient and accurate in testing. Our second contribution is combining PEGI with the analysis of objectives for optimal recovery in the noisy ICA model. It has been observed that the direct approach of demixing with the inverse of the mixing matrix is suboptimal for signal recovery in terms of the natural Signal to Interference plus Noise Ratio (SINR) criterion. There have been several partial solutions proposed in the ICA literature. It turns out that any solution to the mixing matrix reconstruction problem can be used to construct an SINR-optimal ICA demixing, despite the fact that SINR itself cannot be computed from data. That allows us to obtain a practical and provably SINR-optimal recovery method for ICA with arbitrary Gaussian noise.
  • Polynomial Learning of Distribution Families [link]
    M. Belkin, K. Sinha, Siam Journal on Computing (SICOMP),44(4), 889-911, 2015.
    (Short version FOCS 2010).
    + abstract
    The question of polynomial learnability of probability distributions, particularly Gaussian mixture distributions, has recently received significant attention in theoretical computer science and machine learning. However, despite major progress, the general question of polynomial learnability of Gaussian mixture distributions still remained open. The current work resolves the question of polynomial learnability for Gaussian mixtures in high dimension with an arbitrary fixed number of components. The result on learning Gaussian mixtures relies on an analysis of distributions belonging to what we call "polynomial families" in low dimension. These families are characterized by their moments being polynomial in parameters and include almost all common probability distributions as well as their mixtures and products. Using tools from real algebraic geometry, we show that parameters of any distribution belonging to such a family can be learned in polynomial time and using a polynomial number of sample points. The result on learning polynomial families is quite general and is of independent interest. To estimate parameters of a Gaussian mixture distribution in high dimensions, we provide a deterministic algorithm for dimensionality reduction. This allows us to reduce learning a high-dimensional mixture to a polynomial number of parameter estimations in low dimension. Combining this reduction with the results on polynomial families yields our result on learning arbitrary Gaussian mixtures in high dimensions.
  • Beyond Hartigan Consistency: Merge Distortion Metric for Hierarchical Clustering [link]
    Justin Eldridge, Mikhail Belkin, Yusu Wang, COLT 2015, Mark Fulk award (best student paper)!
    + abstract
    Hierarchical clustering is a popular method for analyzing data which associates a tree to a dataset. Hartigan consistency has been used extensively as a framework to analyze such clustering algorithms from a statistical point of view. Still, as we show in the paper, a tree which is Hartigan consistent with a given density can look very different than the correct limit tree. Specifically, Hartigan consistency permits two types of undesirable configurations which we term over-segmentation and improper nesting. Moreover, Hartigan consistency is a limit property and does not directly quantify difference between trees. In this paper we identify two limit properties, separation and minimality, which address both over-segmentation and improper nesting and together imply (but are not implied by) Hartigan consistency. We proceed to introduce a merge distortion metric between hierarchical clusterings and show that convergence in our distance implies both separation and minimality. We also prove that uniform separation and minimality imply convergence in the merge distortion metric. Furthermore, we show that our merge distortion metric is stable under perturbations of the density. Finally, we demonstrate applicability of these concepts by proving convergence results for two clustering algorithms. First, we show convergence (and hence separation and minimality) of the recent robust single linkage algorithm of Chaudhuri and Dasgupta (2010). Second, we provide convergence results on manifolds for topological split tree clustering.
  • Crowd-ML: A Privacy-Preserving Learning Framework for a Crowd of Smart Devices
    J. Hamm, A. Champion, G. Chen, M. Belkin, and D. Xuan, ICDCS 2015
    + abstract
    Smart devices with built-in sensors, computational capabilities, and network connectivity have become increasingly pervasive. Crowds of smart devices offer opportunities to collec- tively sense and perform computing tasks at an unprecedented scale. This paper presents Crowd-ML, a privacy-preserving machine learning framework for a crowd of smart devices, which can solve a wide range of learning problems for crowdsensing data with differential privacy guarantees. Crowd-ML endows a crowdsensing system with the ability to learn classifiers or predictors online from crowdsensing data privately with minimal computational overhead on devices and servers, suitable for practical large-scale use of the framework. We analyze the performance and scalability of Crowd-ML and implement the system with off-the-shelf smartphones as a proof of concept. We demonstrate the advantages of Crowd-ML with real and simulated experiments under various conditions.
  • Learning with Fredholm Kernels [link]
    Qichao Que, Mikhail Belkin, Yusu Wang, NIPS 2014
    + abstract
    In this paper we propose a framework for supervised and semi-supervised learning based on reformulating the learning problem as a regularized Fredholm integral equation. Our approach fits naturally into the kernel framework and can be in- terpreted as constructing new data-dependent kernels, which we call Fredholm kernels. We proceed to discuss the “noise assumption” for semi-supervised learn- ing and provide both theoretical and experimental evidence that Fredholm kernels can effectively utilize unlabeled data under the noise assumption. We demonstrate that methods based on Fredholm learning show very competitive performance in the standard semi-supervised learning setting.
  • The More, the Merrier: the Blessing of Dimensionality for Learning Large Gaussian Mixtures. [arxiv]
    Joseph Anderson, Mikhail Belkin, Navin Goyal, Luis Rademacher, James Voss, COLT 2014
    + abstract
    In this paper we show that very large mixtures of Gaussians with known and identical covariance matrix are efficiently learnable in high dimension. More precisely, we prove that a mixture whose number of components is a polynomial of any fixed degree in the dimension n is polynomially learnable as long as a certain non-degeneracy condition on the means is satisfied. It turns out that this condition is generic in the sense of smoothed complexity, as soon as the dimensionality of the space is high enough. Moreover, we prove that no such condition can exist in low dimension. Our main result on mixture recovery relies on a new "Poissonization"-based technique, which transforms a mixture of Gaussian to a projection of a product distribution. The problem of learning the projection can be efficiently solved using some recent results on tensor decompositions, and this gives an efficient algorithm for learning the mixture. While our results require fixed known covariance matrix, we believe that this work is among the first steps toward better understanding the rare phenomenon of the "blessing of dimensionality" in the computational aspects of statistical inference.
  • Fast Algorithms for Gaussian Noise Invariant Independent Component Analysis. [NIPS archive, GI-ICA code]
    James Voss, Luis Rademacher, Mikhail Belkin, NIPS 2013
    + abstract
    The performance of standard algorithms for Independent Component Analysis quickly deteriorates under the addition of Gaussian noise. This is partially due to a common first step that typically consists of whitening, i.e., applying Principal Component Analysis (PCA) and rescaling the components to have identity covariance, which is not invariant under Gaussian noise. In our paper we develop the first practical algorithm for Independent Component Analysis that is provably invariant under Gaussian noise. The two main contributions of this work are as follows: 1. We develop and implement a more efficient version of a Gaussian noise invariant decorrelation (quasi-orthogonalization) algorithm using Hessians of the cumulant functions. 2. We propose a very simple and efficient fixed-point GI-ICA (Gradient Iteration ICA) algorithm, which is compatible with quasi-orthogonalization, as well as with the usual PCA-based whitening in the noiseless case. The algorithm is based on a special form of gradient iteration (different from gradient descent). We provide an analysis of our algorithm demonstrating fast convergence following from the basic properties of cumulants. We also present a number of experimental comparisons with the existing methods, showing superior results on noisy data and very competitive performance in the noiseless case.
  • Inverse Density as an Inverse Problem: The Fredholm Equation Approach [arxiv]
    Qichao Que, Mikhail Belkin, NIPS 2013
    + abstract
    In this paper we address the problem of estimating the ratio $\frac{q}{p}$ where $p$ is a density function and $q$ is another density, or, more generally an arbitrary function. Knowing or approximating this ratio is needed in various problems of inference and integration, in particular, when one needs to average a function with respect to one probability distribution, given a sample from another. It is often referred as {\it importance sampling} in statistical inference and is also closely related to the problem of {\it covariate shift} in transfer learning as well as to various MCMC methods. It may also be useful for separating the underlying geometry of a space, say a manifold, from the density function defined on it. Our approach is based on reformulating the problem of estimating $\frac{q}{p}$ as an inverse problem in terms of an integral operator corresponding to a kernel, and thus reducing it to an integral equation, known as the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization and kernel methods, leads to a principled kernel-based framework for constructing algorithms and for analyzing them theoretically. The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, simple and easy to implement. We provide detailed theoretical analysis including concentration bounds and convergence rates for the Gaussian kernel in the case of densities defined on $\R^d$, compact domains in $\R^d$ and smooth $d$-dimensional sub-manifolds of the Euclidean space. We also show experimental results including applications to classification and semi-supervised learning within the covariate shift framework and demonstrate some encouraging experimental comparisons. We also show how the parameters of our algorithms can be chosen in a completely unsupervised manner.
  • Blind Signal Separation in the Presence of Gaussian Noise [arxiv]
    Mikhail Belkin, Luis Rademacher, James Voss, COLT 2013
    + abstract
    A prototypical blind signal separation problem is the so-called cocktail party problem, with n people talking simultaneously and n different microphones within a room. The goal is to recover each speech signal from the microphone inputs. Mathematically this can be modeled by assuming that we are given samples from a n-dimensional random variable X=AS, where S is a vector whose coordinates are independent random variables corresponding to each speaker. The objective is to recover the matrix A^{-1} given random samples from X. A range of techniques collectively known as Independent Component Analysis (ICA) have been proposed to address this problem in the signal processing and machine learning literature. Many of these techniques are based on using the kurtosis or other cumulants to recover the components. In this paper we propose a new algorithm for solving the blind signal separation problem in the presence of additive Gaussian noise, when we are given samples from X=AS + \eta, where {\eta} is drawn from an unknown n-dimensional Gaussian distribution. Our approach is based on a method for decorrelating a sample with additive Gaussian noise under the assumption that the underlying distribution is a linear transformation of a distribution with independent components. Our decorrelation routine is based on the properties of cumulant tensors and can be combined with any standard cumulant-based method for ICA to get an algorithm that is provably robust in the presence of Gaussian noise. We derive polynomial bounds for sample complexity and error propagation of our method. Our results generalize the recent work of Arora et al. which deals with a special case of ICA when S is the uniform probability distribution over the binary cube.
  • Graph Laplacians on Singular Manifolds: Toward understanding complex spaces: graph Laplacians on manifolds with singularities and boundaries [arxiv]
    Mikhail Belkin, Qichao Que, Yusu Wang, Xueyuan Zhou, COLT 2012.
    + abstract
    Recently, much of the existing work in manifold learning has been done under the assumption that the data is sampled from a manifold without boundaries and singularities or that the functions of interest are evaluated away from such points. At the same time, it can be argued that singularities and boundaries are an important aspect of the geometry of realistic data. In this paper we consider the behavior of graph Laplacians at points at or near boundaries and two main types of other singularities: intersections, where different manifolds come together and sharp "edges", where a manifold sharply changes direction. We show that the behavior of graph Laplacian near these singularities is quite different from that in the interior of the manifolds. In fact, a phenomenon somewhat reminiscent of the Gibbs effect in the analysis of Fourier series, can be observed in the behavior of graph Laplacian near such points. Unlike in the interior of the domain, where graph Laplacian converges to the Laplace-Beltrami operator, near singularities graph Laplacian tends to a first-order differential operator, which exhibits different scaling behavior as a function of the kernel width. One important implication is that while points near the singularities occupy only a small part of the total volume, the difference in scaling results in a disproportionately large contribution to the total behavior. Another significant finding is that while the scaling behavior of the operator is the same near different types of singularities, they are very distinct at a more refined level of analysis. We believe that a comprehensive understanding of these structures in addition to the standard case of a smooth manifold can take us a long way toward better methods for analysis of complex non-linear data and can lead to significant progress in algorithm design.
  • Data Skeletonization via Reeb Graphs [pdf]
    X. Ge, I. Safa, M. Belkin, Y. Wang, NIPS 2011.
    + abstract
    Recovering hidden structure from complex and noisy non-linear data is one of the most fundamental problems in machine learning and statistical inference. While such data is often high-dimensional, it is of interest to approximate it with a low-dimensional or even one-dimensional space, since many important aspects of data are often intrinsically low-dimensional. Furthermore, there are many scenarios where the underlying structure is graph-like, e.g, river/road networks or various trajectories. In this paper, we develop a framework to extract, as well as to simplify, a one-dimensional "skeleton" from unorganized data using the Reeb graph. Our algorithm is very simple, does not require complex optimizations and can be easily applied to unorganized high-dimensional data such as point clouds or proximity graphs. It can also represent arbitrary graph structures in the data. We also give theoretical results to justify our method. We provide a number of experiments to demonstrate the effectiveness and generality of our algorithm, including comparisons to existing methods, such as principal curves. We believe that the simplicity and practicality of our algorithm will help to promote skeleton graphs as a data analysis tool for a broad range of applications.
  • Convergence of Laplacian Eigenmaps [pdf, bib]
    M. Belkin, P. Niyogi
    preprint, short version NIPS 2008.
    + abstract
    Geometrically based methods for various tasks of data analysis have attracted considerable attention over the last few years. In many of these algorithms, a central role is played by the eigenvectors of the graph Laplacian of a data-derived graph. In this paper, we show that if points are sampled uniformly at random from an unknown submanifold ${\cal M}$ of $\R^N$, then the eigenvectors of a suitably constructed graph Laplacian converge to the eigenfunctions of the Laplace Beltrami operator on ${\cal M}$. This basic result directly establishes the convergence of spectral manifold learning algorithms such as Laplacian Eigenmaps and Diffusion Maps. It also has implications for the understanding of geometric algorithms in data analysis, computational harmonic analysis, geometric random graphs, and graphics.
  • On Learning with Integral Operators [pdf, bib]
    L. Rosasco, M.Belkin, E. De Vito,
    Journal of Machine Learning Research, vol.11, pp.905-934, 2010.
    + abstract
    A large number of learning algorithms, for example, spectral clustering, kernel Principal Components Analysis and many manifold methods are based on estimating eigenvalues and eigenfunctions of operators defined by a similarity function or a kernel, given empirical data. Thus for the analysis of algorithms, it is an important problem to be able to assess the quality of such approximations. The contribution of our paper is two-fold. First, we use a technique based on a concentration inequality for Hilbert spaces to provide new simplified proofs for a number of results in spectral approximation. Second, using these methods we provide several new results for estimating spectral properties of the graph Laplacian operator extending and strengthening results from [28].
  • Data spectroscopy: eigenspaces of convolution operators and clustering [pdf, bib]
    Tao Shi, Mikhail Belkin, Bin Yu
    The Annals of Statistics, vol. 37, Number 6B (2009), 3960-3984.
    + abstract
    This paper focuses on obtaining clustering information about a distribution from its i.i.d. samples. We develop theoretical results to understand and use clustering information contained in the eigenvectors of data adjacency matrices based on a radial kernel function with a sufficiently fast tail decay. In particular, we provide population analyses to gain insights into which eigenvectors should be used and when the clustering information for the distribution can be recovered from the sample. We learn that a fixed number of top eigenvectors might at the same time contain redundant clustering information and miss relevant clustering information. We use this insight to design the Data Spectroscopic clustering (DaSpec) algorithm that utilizes properly selected eigenvectors to determine the number of clusters automatically and to group the data accordingly. Our findings extend the intuitions underlying existing spectral techniques such as spectral clustering and Kernel Principal Components Analysis, and provide new understanding into their usability and modes of failure. Simulation studies and experiments on real world data are conducted to show the potential of our algorithm. In particular, DaSpec is found to handle unbalanced groups and recover clusters of different shapes better than the competing methods.
  • Towards a Theoretical Foundation for Laplacian-Based Manifold Methods [pdf, bib]
    M. Belkin, P. Niyogi
    Journal of Computer and System Sciences, 2008.
    Volume 74, Issue 8, pp. 1289-1308. Special Issue on Learning Theory, invited (short version in COLT 2005).
    + abstract
    In recent years manifold methods have attracted a considerable amount of atten- tion in machine learning. However most algorithms in that class may be termed "manifold-motivated" as they lack any explicit theoretical guarantees. In this pa- per we take a step towards closing the gap between theory and practice for a class of Laplacian-based manifold methods. These methods utilize the graph Laplacian associated to a data set for a variety of applications in semi-supervised learning, clustering, data representation. We show that under certain conditions the graph Laplacian of a point cloud of data samples converges to the Laplace-Beltrami operator on the underlying mani- fold. Theorem 3.1 contains the .rst result showing convergence of a random graph Laplacian to the manifold Laplacian in the context of machine learning.
  • Discrete Laplace Operator for Meshed Surfaces [pdf, code, bib]
    M. Belkin, J. Sun, Y. Wang, 24th Annual Symposium on Computational Geometry (SOCG) 2008.
    + abstract
    In recent years a considerable amount of work in graphics and geometric optimization used tools based on the Laplace-Beltrami operator on a surface. The applications of the Laplacian include mesh editing, surface smoothing, and shape interpolations among others. However, it has been shown [12, 23, 25] that the popular cotangent approximation schemes do not provide convergent point-wise (or even L^2) estimates, while many applications rely on point-wise estimation. Existence of such schemes has been an open question [12]. In this paper we propose the first algorithm for approximating the Laplace operator of a surface from a mesh with point-wise convergence guarantees applicable to arbitrary meshed surfaces. We show that for a sufficiently fine mesh over an arbitrary surface, our mesh Laplacian is close to the Laplace-Beltrami operator on the surface at every point of the surface. Moreover, the proposed algorithm is simple and easily implementable. Experimental evidence shows that our algorithm exhibits convergence empirically and outperforms cotangent-based methods in providing accurate approximation of the Laplace operator for various meshes.
  • Consistency of Spectral Clustering [pdf, bib]
    U. von Luxburg, M. Belkin, O. Bousquet,
    The Annals of Statistics 2008, Vol. 36, No. 2, 555-586.
    + abstract
    Consistency is a key property of statistical algorithms when the data is drawn from some underlying probability distribution. Surprisingly, despite decades of work, little is known about consistency of most clustering algorithms. In this paper we investigate consistency of the popular family of spectral clustering algorithms, which clusters the data with the help of eigenvectors of graph Laplacian matrices. We develop new methods to establish that for increasing sample size, those eigenvectors converge to the eigenvectors of certain limit operators. As a result we can prove that one of the two major classes of spectral clustering (normalized clustering) converges under very general conditions, while the other (unnormalized clustering) is only consistent under strong additional assumptions, which are not always satisfied in real data. We conclude that our analysis provides strong evidence for the superiority of normalized spectral clustering.
  • Manifold Regularization: a Geometric Framework for Learning from Labeled and Unlabeled Examples [pdf, bib]
    M. Belkin, P. Niyogi, V. Sindhwani
    Journal of Machine Learning Research, 7(Nov):2399-2434, 2006.
    + abstract
    We propose a family of learning algorithms based on a new form of regularization that allows us to exploit the geometry of the marginal distribution. We focus on a semi-supervised framework that incorporates labeled and unlabeled data in a general-purpose learner. Some transductive graph learning algorithms and standard methods including support vector machines and regularized least squares can be obtained as special cases. We use properties of reproducing kernel Hilbert spaces to prove new Representer theorems that provide theoretical basis for the algorithms. As a result (in contrast to purely graph-based approaches) we obtain a natural out-of-sample extension to novel examples and so are able to handle both transductive and truly semi-supervised settings. We present experimental evidence suggesting that our semi-supervised algorithms are able to use unlabeled data effectively. Finally we have a brief discussion of unsupervised and fully supervised learning within our general framework.
  • Heat Flow and a Faster Algorithm to Compute the Surface Area of a Convex Body [pdf, bib]
    M. Belkin, H. Narayanan, P. Niyogi, FOCS 2006.
    + abstract
    We draw on the observation that the amount of heat diffusing outside of a heated body in a short period of time is proportional to its surface area, to design a simple algorithm for approximating the surface area of a convex body given by a membership oracle. Our method has a complexity of O*(n^4), where n is the dimension, compared to O*(n^8.5) for the previous best algorithm. We show that our complexity cannot be improved given the current state-of-the-art in volume estimation.
  • Semi-supervised Learning on Riemannian Manifolds [pdf, bib]
    M. Belkin, P. Niyogi
    Machine Learning, 56 (invited, special Issue on clustering), 209-239, 2004 (short version in NIPS 2002).
    + abstract
    We consider the general problem of utilizing both labeled and unlabeled data to improve classification accuracy. Under the assumption that the data lie on a submanifold in a high dimensional space, we develop an algorithmic framework to classify a partially labeled data set in a principled manner. The central idea of our approach is that classification functions are naturally defined only on the submanifold in question rather than the total ambient space. Using the Laplace-Beltrami operator one produces a basis (the Laplacian Eigenmaps) for a Hilbert space of square integrable functions on the submanifold. To recover such a basis, only unlabeled examples are required. Once such a basis is obtained, training can be performed using the labeled data set. Our algorithm models the manifold using the adjacency graph for the data and approximates the Laplace- Beltrami operator by the graph Laplacian. We provide details of the algorithm, its theoretical justification, and several practical applications for image, speech, and text classification.
  • Laplacian Eigenmaps for Dimensionality Reduction and Data Representation [pdf, bib]
    M. Belkin, P. Niyogi
    Neural Computation, June 2003; 15 (6):1373-1396 (short version in NIPS 2001).
    + abstract
    One of the central problems in machine learning and pattern recognition is to develop appropriate representations for complex data. We consider the problem of constructing a representation for data lying on a lowdimensional manifold embedded in a high-dimensional space. Drawing on the correspondence between the graph Laplacian, the Laplace Beltrami operator on the manifold, and the connections to the heat equation, we propose a geometrically motivated algorithm for representing the highdimensional data. The algorithm provides a computationally efficient approach to non dimensionality reduction that has locality-preserving properties and a natural connection to clustering. Some potential applications and illustrative examples are discussed.

Links to some implementations.


Slides and videos for some of my talks.

Research Funding.

My research is supported by the NSF and AFOSR.

Partha Niyogi Memorial Conference at the University of Chicago. Dec 4-6, 2011.


Little Sasha
Little Julia