Exploiting Data Geometry Using Semi-Supervised and Active Learning
Event Date: May 11, 2009
Speaker: Dr. Aarti Singh
Speaker Affiliation: Princeton University, Department of Mathematics
Sponsor: ECE Faculty Candidate
Contact Name: Host: Professor Avi Kak
With the advent of sophisticated data collection technologies and our need to understand increasingly intricate and diverse networked systems, we are witnessing an explosion in both the amount and complexity of data. Despite the apparent high dimensionality of data, in many problems the underlying physics of the data-generating system limits the degrees of freedom and endows the data with a hidden lower-dimensional geometry. In this talk, I will describe how machine learning techniques can be used to exploit data geometry in science and engineering applications, ranging from the Internet and wireless networks to biomedical imaging. In particular, I will focus on semi-supervised and active learning, which capitalize on the abundance of unlabeled data to enable efficient inference using minimal labeled/annotated training samples.
In the semi-supervised learning framework, we have access to a large amount of unlabeled data, of which a small random subset is labeled. It is well accepted that unlabeled data can facilitate inference if there is a link between unlabeled data geometry and the target function of the inference task. For example, the target function may be smooth on data clusters or along the data manifold. However, recent attempts to characterize the amount of improvement possible when such links exist have provided incomplete, and sometimes contradictory, explanations of the benefits of using unlabeled data. My work bridges the gap between these seemingly conflicting views using a novel minimax framework based on finite sample bounds. The results quantify both the amount of improvement possible using unlabeled data and the relative value of unlabeled data. Moreover, the theory suggests a new algorithmic approach to semi-supervised learning capable of adapting to a wide range of data geometries.
Active learning is a sequential, feedback-driven process in which we have access to a large amount of unlabeled data from which we can select examples for labeling. Labels are requested for unlabeled examples whose labels are predicted to be highly informative, based on previously gathered labeled and unlabeled data. If the target function of the inference task has low-dimensional geometry, then active learning can facilitate inference from few labeled examples. I will discuss an active learning approach to designing fast and accurate spatial survey paths for mobile sensors, and the resulting improvement in the accuracy versus latency tradeoff.
BIO: Aarti Singh received her BE in Electronics and Communication Engineering from Delhi University, India, in 2001. She received her MS and PhD in Electrical and Computer Engineering from the University of Wisconsin – Madison in 2003 and 2008, respectively. Currently, she is a postdoctoral research associate at the Program in Applied and Computational Math at Princeton University. Her research interests lie at the intersection of statistical signal processing and machine learning, with applications to wireless and sensor networks, Internet data analysis, and biomedical imaging.