Speaker: 

Professor Padhraic Smyth

Institution: 

UCI

Time: 

Friday, March 4, 2011 - 4:00pm

Location: 

RH 306

With the proliferation of digital document collections in recent years (Web
pages, blogs, online news articles, etc.), there has been increasing
interest in automated tools that can summarize, classify, and
annotate such text. In this talk I will provide an overview of statistical
topic models (also known as latent Dirichlet allocation models), which
provide a flexible framework for statistical modeling of high-dimensional
count data, in particular, for word counts in text documents. The talk will
begin by discussing the underlying statistical principles of these models,
including their interpretation within a general probabilistic
matrix-decomposition framework and contrasting them with techniques such as
principal component analysis and clustering. Estimation methods and
algorithms will be reviewed, with a focus on Gibbs sampling approaches. This
introduction will be followed by an illustration of how these models can be
used to generate high-level summaries of document collections and to
automatically uncover thematic trends in text over time. The remainder of
the talk will focus on extensions of the basic topic modeling framework,
with applications in document classification and document retrieval, as well as
a discussion of parallel algorithms for scaling to very large data sets. A
number of different text data sets will be used during the talk as
illustrative examples, including archives of New York Times articles,
historical records of the Pennsylvania Gazette from the 18th century, large
databases of scientific publications such as PubMed and CiteSeer, and
publicly available emails from the Enron Corporation.