Speaker: 

Michael Mahoney

Institution: 

ICSI and Department of Statistics, UC Berkeley

Time: 

Monday, May 9, 2016 - 4:00pm to 5:00pm

Host: 

Location: 

RH306

One of the most straightforward formulations of a feature selection problem boils down to the linear algebraic problem of selecting good columns from a data matrix. This formulation has the advantage of yielding features that are interpretable to scientists in the domain from which the data are drawn, an important consideration when machine learning methods are applied to realistic scientific data. While simple, this problem is central to many other seemingly nonlinear learning methods. Moreover, while unsupervised, this problem also has strong connections with related supervised learning methods such as Linear Discriminant Analysis and Canonical Correlation Analysis. We will describe recent work implementing Randomized Linear Algebra algorithms for this feature selection problem in parallel and distributed environments on inputs ranging in size from one to tens of terabytes, as well as the application of these implementations to specific scientific problems in areas such as mass spectrometry imaging and climate modeling.
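To illustrate the column-selection formulation mentioned above, here is a minimal sketch of one standard Randomized Linear Algebra approach: sampling columns with probability proportional to their rank-k leverage scores. This is a generic illustration of the technique, not the specific implementation discussed in the talk; the function name and parameters are hypothetical.

```python
import numpy as np

def leverage_score_column_sample(A, k, c, seed=None):
    """Sample c columns of A with probability proportional to their
    rank-k leverage scores (a common randomized column-selection scheme).
    Hypothetical helper for illustration only."""
    rng = np.random.default_rng(seed)
    # Top-k right singular vectors of A.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :]                    # k x n
    # Leverage score of column j: squared norm of the j-th column of Vk.
    scores = np.sum(Vk**2, axis=0)   # the scores sum to k
    probs = scores / k
    cols = rng.choice(A.shape[1], size=c, replace=True, p=probs)
    return cols, A[:, cols]

# Example: a 100 x 50 matrix with (numerical) rank 5.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 50))
cols, C = leverage_score_column_sample(A, k=5, c=10, seed=1)
```

The selected columns `C` serve directly as interpretable features: each is an actual column of the data matrix rather than a linear combination of columns, which is what makes this formulation attractive for scientific applications.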