Marcus

Marcus would have you think he's very clever.
He's nice and all, but he's really just following our instructions.

Marcus is a web application designed by Wesley Loo and Andrew Reece.

Marcus uses Latent Dirichlet Allocation (LDA) to extract latent topics and themes from the text of 40,000 reviews for almost 2,000 films produced between 2010 and 2015. These topics are used to generate predictions about the commercial and critical success of a movie. (There's more high-level information about our methods a bit farther down this page, but if you want the real stat-fu, please see the paper we've posted on the Technical page.)

Here's a brief video explaining how Marcus works:

Marcus has a simple user interface: just enter a movie you want to know about. Marcus will ponder your request for a moment, and then output his observations and predictions.

Mouse over the different page elements in the image below for more detail.

There were a few steps involved in deriving the latent topics from raw text documents. Here's how we built our statistical model:

Latent Dirichlet Allocation (LDA) is a generative, hierarchical Bayesian model. Documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. (The set of words is usually defined as the unique vocabulary of a given corpus.) The only observed data are the words that make up each document in a target corpus.

Consider this plate diagram:

with these variable definitions:

α is the parameter of the Dirichlet prior on the per-document topic distributions,

β is the parameter of the Dirichlet prior on the per-topic word distributions,

Θ_{i} is the topic distribution for document i,

φ_{k} is the word distribution for topic k,

z_{ij} is the topic for the jth word in document i, and

w_{ij} is the jth word in document i (the only observed variable).
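To make these definitions concrete, here is a minimal sketch of the generative process they describe, written with NumPy. It is an illustration only, not the code behind Marcus: the function name `generate_corpus`, the toy sizes, and the symmetric prior values are all assumptions chosen for the example.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Sample a synthetic corpus from the LDA generative process (illustrative)."""
    rng = np.random.default_rng(seed)
    # phi[k]: word distribution for topic k, drawn from a symmetric Dirichlet(beta)
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    # theta[i]: topic distribution for document i, drawn from a symmetric Dirichlet(alpha)
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    z = np.empty((n_docs, doc_len), dtype=int)  # latent topic of each word slot
    w = np.empty((n_docs, doc_len), dtype=int)  # observed word id of each word slot
    for i in range(n_docs):
        for j in range(doc_len):
            # z_ij: pick a topic according to the document's topic mixture
            z[i, j] = rng.choice(n_topics, p=theta[i])
            # w_ij: pick a word according to that topic's word distribution
            w[i, j] = rng.choice(vocab_size, p=phi[z[i, j]])
    return theta, phi, z, w

# Toy sizes and prior values, chosen only for illustration
theta, phi, z, w = generate_corpus(
    n_docs=5, doc_len=20, n_topics=3, vocab_size=10, alpha=0.5, beta=0.5
)
```

In a real corpus only `w` is observed; `theta`, `phi`, and `z` are the quantities that inference has to recover.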

Learning these distributions is a problem of Bayesian inference. We used two different approaches to tackle this problem: Gibbs sampling and expectation-maximization.
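As a rough illustration of the Gibbs sampling approach, the sketch below implements a collapsed Gibbs sampler, one common variant for LDA. Whether Marcus uses this variant is an assumption; the function name `collapsed_gibbs_lda`, the toy documents, and the hyperparameter values are likewise made up for the example.

```python
import numpy as np

def collapsed_gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
                        n_iter=200, seed=0):
    """Estimate theta and phi from word-id documents (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), n_topics))   # words in document i assigned to topic k
    n_kw = np.zeros((n_topics, vocab_size))  # times word w is assigned to topic k
    n_k = np.zeros(n_topics)                 # total words assigned to topic k
    # Start from a random topic assignment for every word
    z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
    for i, doc in enumerate(docs):
        for j, word in enumerate(doc):
            k = z[i][j]
            n_dk[i, k] += 1
            n_kw[k, word] += 1
            n_k[k] += 1
    for _ in range(n_iter):
        for i, doc in enumerate(docs):
            for j, word in enumerate(doc):
                # Remove the current assignment of this word from the counts
                k = z[i][j]
                n_dk[i, k] -= 1
                n_kw[k, word] -= 1
                n_k[k] -= 1
                # Conditional probability of each topic given all other assignments
                p = (n_dk[i] + alpha) * (n_kw[:, word] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                # Record the newly sampled assignment
                z[i][j] = k
                n_dk[i, k] += 1
                n_kw[k, word] += 1
                n_k[k] += 1
    # Smoothed point estimates from the final counts
    theta_hat = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    phi_hat = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
    return theta_hat, phi_hat

# Toy corpus of word ids: two documents use words {0, 1}, two use words {2, 3}
docs = [[0, 1, 0, 1, 0, 1], [2, 3, 2, 3, 3, 2], [0, 0, 1, 1, 0, 1], [3, 2, 3, 2, 2, 3]]
theta_hat, phi_hat = collapsed_gibbs_lda(docs, n_topics=2, vocab_size=4)
```

Each row of `theta_hat` is an estimated topic mixture for one document, and each row of `phi_hat` is an estimated word distribution for one topic. This version reads the estimates off the final sample only; averaging over several samples after a burn-in period is a common refinement.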

Marcus was built with a mix of Python, JavaScript, and skillful means.

Implementation details:

- All code was written by Andrew Reece, and is available at his GitHub repo.
- The entire app runs on the Flask framework with Jinja2 templates.
- Phusion Passenger keeps Flask running.
- Prediction, web scraping and cleaning, and all LDA modeling are done in Python.
- Interactive features are a mashup of jQuery, jQuery UI, and D3.