Exploiting Data Geometry Using Semi-Supervised and Active Learning

Event Date: May 11, 2009
Speaker: Dr Aarti Singh
Speaker Affiliation: Princeton University, Department of Mathematics
Sponsor: ECE Faculty Candidate
Time: 9:30 AM
Location: MSEE 239
Contact Name: Host: Professor Avi Kak

With the advent of sophisticated data collection technologies and our
need to understand increasingly intricate and diverse networked systems,
we are witnessing an explosion in both the amount and complexity of
data. Despite the apparent high dimensionality of data, in many problems
the underlying physics of the data-generating system limits the degrees
of freedom and endows the data with a hidden lower dimensional geometry.
In this talk, I will describe how machine learning techniques can be
used to exploit data geometry in science and engineering applications,
ranging from the Internet and wireless networks to biomedical imaging.
In particular, I will focus on semi-supervised and active learning that
capitalize on the abundance of unlabeled data to enable efficient
inference using minimal labeled/annotated training samples. 

In the semi-supervised learning framework, we have access to a large
amount of unlabeled data, of which only a small random subset is
labeled. It is well accepted that unlabeled data can facilitate
inference if there is a link between unlabeled data geometry and the
target function of the inference task.  For example, the target function
may be smooth on data clusters or along the data manifold.  However,
recent attempts to characterize the amount of improvement possible when
such links exist have provided incomplete, and sometimes contradictory,
explanations of the benefits of using unlabeled data. My work bridges
the gap between these seemingly conflicting views using a novel minimax
framework based on finite sample bounds. The results quantify both the
amount of improvement possible using unlabeled data and the
relative value of unlabeled data. Moreover, the theory suggests a new
algorithmic approach to semi-supervised learning capable of adapting to
a wide range of data geometries. 
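
As a rough illustration of the cluster assumption mentioned above (the
target function is constant on data clusters), the following toy sketch
groups one-dimensional points into clusters using only the unlabeled
data, then spreads the few available labels within each cluster. It is
not the minimax framework or the algorithm from the talk; the function
name cluster_then_label, the gap parameter, and the synthetic data are
all hypothetical choices made for this example.

```python
import random


def cluster_then_label(points, labeled, gap):
    """Toy cluster-then-label rule for 1-D data (illustrative only).

    points  : list of floats (labeled and unlabeled together)
    labeled : dict mapping a point's index to its known label
    gap     : hypothetical threshold; neighbours in sorted order that are
              farther apart than this are placed in different clusters
    """
    # Step 1: use ALL points, labeled or not, to find the clusters.
    order = sorted(range(len(points)), key=lambda i: points[i])
    clusters, current = [], [order[0]]
    for prev, nxt in zip(order, order[1:]):
        if points[nxt] - points[prev] > gap:
            clusters.append(current)
            current = []
        current.append(nxt)
    clusters.append(current)

    # Step 2: give every point the majority label of its cluster.
    predictions = {}
    for members in clusters:
        votes = [labeled[i] for i in members if i in labeled]
        if not votes:
            continue  # no labeled point in this cluster: leave it unlabeled
        winner = max(set(votes), key=votes.count)
        for i in members:
            predictions[i] = winner
    return predictions


random.seed(0)
# Synthetic data (assumption): two well-separated clusters of 200 points.
points = [random.uniform(0.0, 1.0) for _ in range(200)]
points += [random.uniform(3.0, 4.0) for _ in range(200)]
labeled = {0: "A", 200: "B"}  # only one labeled example per cluster
predictions = cluster_then_label(points, labeled, gap=0.5)
```

With only two labels, every point in the pool receives a prediction,
because the unlabeled data reveal where the clusters are. When the
clusters overlap or the gap threshold is poorly chosen, this simple
rule breaks down, which is one way to see why the link between data
geometry and the target function matters.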

Active learning is a sequential feedback-driven process where we have
access to a large amount of unlabeled data from which we can select
examples for labeling. Labels are requested for the unlabeled examples
whose labels are predicted to be most informative, based on previously
gathered labeled and unlabeled data. If the target function of the
inference task has low-dimensional geometry, then active learning can
facilitate inference from few labeled examples. I will discuss an active
learning approach to designing fast and accurate spatial survey paths
for mobile sensors, and the resulting improvement in accuracy vs. latency.
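
As a minimal sketch of the feedback loop described above, consider the
textbook case of a one-dimensional threshold classifier with noiseless
labels. Each query goes to the pool point whose label is most uncertain
given the labels seen so far, which reduces to a binary search. This is
a standard illustration, not the survey-path method from the talk; the
function name active_threshold, the oracle, and the value of theta are
assumptions made for the example.

```python
import random


def active_threshold(pool, oracle):
    """Find a 1-D decision boundary with few label requests.

    pool   : list of unlabeled floats
    oracle : callable returning True for points at or above the boundary
             (assumed noiseless here)
    Returns the sorted index of the first positive point and the number
    of labels requested.
    """
    xs = sorted(pool)
    lo, hi = 0, len(xs)  # index of the first positive point lies in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2  # the point whose label is most uncertain
        queries += 1
        if oracle(xs[mid]):
            hi = mid          # positive: the boundary is at or left of mid
        else:
            lo = mid + 1      # negative: the boundary is right of mid
    return lo, queries


random.seed(0)
pool = [random.random() for _ in range(1000)]
theta = 0.37  # hypothetical true boundary, hidden from the learner


def oracle(x):
    return x >= theta


first_pos, n_queries = active_threshold(pool, oracle)
```

For a pool of 1000 points, this loop requests at most 10 labels, whereas
labeling points in random order would typically need far more to pin
down the boundary to the same resolution. Noisy labels or
higher-dimensional boundaries require more careful query rules than
plain bisection.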


Aarti Singh received her BE in Electronics and Communication Engineering
from Delhi University, India in 2001. She received her MS and PhD in
Electrical and Computer Engineering from the University of Wisconsin –
Madison in 2003 and 2008, respectively. Currently, she is a postdoctoral
research associate in the Program in Applied and Computational
Mathematics at Princeton University. Her research interests lie at the
intersection of
statistical signal processing and machine learning with applications to
wireless and sensor networks, Internet data analysis and biomedical imaging.