Marcus uses Latent Dirichlet Allocation (LDA) to extract latent topics and themes from the text of 40,000 reviews of almost 2,000 films produced between 2010 and 2015. These topics are used to generate predictions about a movie's commercial and critical success. (There's more high-level information about our methods a bit farther down this page, but if you want the real stat-fu, please see the paper we've posted on the Technical page.)
Here's a brief video explaining how Marcus works:
Marcus has a simple user interface - just enter a movie you want to know about. Marcus will ponder your request for a moment, and then output his observations and predictions.
Mouse over the different page elements in the image below for more detail.
There were a few steps involved in deriving the latent topics from raw text documents. Here's how we built our statistical model:
Latent Dirichlet Allocation (LDA) is a generative, hierarchical Bayesian model. Documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. (Words are usually defined as the unique vocabulary in a given corpus.) The only observed data are the words that make up each document in a target corpus.
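To make that concrete, here's a minimal sketch of fitting LDA to a toy corpus with scikit-learn's `LatentDirichletAllocation`. The documents and topic count below are illustrative placeholders, not the actual review corpus or settings described above:

```python
# Toy LDA fit: four tiny "reviews", two topics (illustrative values only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "great acting and a moving story",
    "the plot was thin but the acting was great",
    "stunning visual effects and sound design",
    "the effects and sound carried a weak story",
]

# LDA models raw word counts, so we use CountVectorizer (not tf-idf).
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each row of doc_topics is one document's mixture over the 2 topics
# and sums to 1; lda.components_ holds the per-topic word weights.
print(doc_topics.shape)  # (4, 2)
```

The per-document topic mixtures produced this way are the features one could then feed into a downstream success-prediction model.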
Consider this plate diagram:
with these variable definitions:
α is the parameter of the Dirichlet prior on the per-document topic distributions,
β is the parameter of the Dirichlet prior on the per-topic word distributions,
Θi is the topic distribution for document i,
φk is the word distribution for topic k,
zij is the topic for the jth word in document i, and
wij is the jth word in document i, the only observed variable.
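The generative process these variables describe can be sketched directly in a few lines of NumPy. All sizes and hyperparameter values below are toy choices for illustration, not the ones used for Marcus:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3          # number of topics (toy value)
V = 8          # vocabulary size (toy value)
n_docs = 5     # documents in the toy corpus
doc_len = 20   # words per document (fixed here for simplicity)
alpha = 0.5    # symmetric Dirichlet prior on per-document topic mixtures
beta = 0.1     # symmetric Dirichlet prior on per-topic word distributions

# phi_k: one word distribution per topic, drawn from Dirichlet(beta)
phi = rng.dirichlet(np.full(V, beta), size=K)

corpus = []
for i in range(n_docs):
    theta_i = rng.dirichlet(np.full(K, alpha))      # topic mixture for doc i
    z = rng.choice(K, size=doc_len, p=theta_i)      # topic z_ij for each word slot
    w = [rng.choice(V, p=phi[z_ij]) for z_ij in z]  # observed word w_ij
    corpus.append(w)

print(len(corpus), len(corpus[0]))  # 5 20
```

Inference runs this process in reverse: only the words w are observed, and the goal is to recover plausible θ and φ.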
Learning these distributions is a problem of Bayesian inference; since exact inference in LDA is intractable, it must be approximated. We used two different approaches to tackle this problem: Gibbs sampling and expectation-maximization.
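To illustrate the Gibbs-sampling route, here is a minimal collapsed Gibbs sampler for LDA on a tiny integer-coded corpus. The corpus, hyperparameters, and sweep count are toy assumptions for the sketch, not Marcus's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: each document is a list of word ids from a vocabulary of size V.
corpus = [[0, 1, 1, 2], [2, 3, 3, 0], [4, 5, 4, 5]]
K, V = 2, 6
alpha, beta = 0.5, 0.1

# Count tables and random initial topic assignments.
n_dk = np.zeros((len(corpus), K))  # topic counts per document
n_kw = np.zeros((K, V))            # word counts per topic
n_k = np.zeros(K)                  # total words assigned to each topic
z = []
for d, doc in enumerate(corpus):
    z_d = rng.integers(K, size=len(doc))
    z.append(z_d)
    for w, k in zip(doc, z_d):
        n_dk[d, k] += 1
        n_kw[k, w] += 1
        n_k[k] += 1

# Collapsed Gibbs sweeps: resample each word's topic given all the others,
# with p(z=k) proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta).
for sweep in range(50):
    for d, doc in enumerate(corpus):
        for j, w in enumerate(doc):
            k = z[d][j]
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][j] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

# Posterior point estimates of theta and phi from the final counts.
theta = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
phi = (n_kw + beta) / (n_kw + beta).sum(axis=1, keepdims=True)
```

The expectation-maximization route instead alternates between computing expected topic assignments and re-estimating the distributions, as in variational inference for LDA.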