Topic Modeling 1: Simulated LDA Corpus

Because I am self-taught in many of the areas of computer science and more advanced statistics and probability theory I am most interested in, and because I have a deep aversion both to looking foolish and being full of it (in no way related to any long, troubled history of doing both or anything), I tend to try, whenever possible, to do things the hard way.

This means I use the command line for way more stuff than is healthy and I often rewrite when I could just copy and paste, obnoxious things like that. It also means that, when I'm learning a new model or tool, I try to write elaborate simulations that I usually never share to make sure I actually understand what's going in and what's coming out. This also happens when I'm trying to explain a concept or problem to someone, as evidenced by the earlier Monty Hall Monte Carlo post. You'll have to ask my coworkers, friends and family members whether this makes me more capable of imparting any of this knowledge to others or doing useful things with it, I guess. (On second thought, don't.) Part of the point of this site is to share these instances of "Incredible Overkill", as one of my professors once called it, in the (unlikely?) event that someone else may find them useful.

Probabilistic topic modeling was added to my ever-growing list of obsessions in the last year, and I have spent a lot of time consuming research papers, tutorials and tool documentation in an effort to apply this exciting area of research to problems at work and in personal projects. Wikipedia defines a topic model as "a type of statistical model for discovering abstract 'topics' that occur in a collection of documents." The research in this area is quite new: Probabilistic Latent Semantic Indexing and the most common topic model, Latent Dirichlet allocation (LDA), were introduced in 1999 and 2003, respectively. The chief developer of LDA, David Blei of Princeton's computer science department, has written many useful and accessible treatments of the technique, such as those available here, here, and here. Many of the most exciting areas of research in computational linguistics involve extensions of LDA, and many of them are being pursued by talented machine learning, computer science and computational linguistics professionals and academics in the Baltimore/Washington DC area, which I call home.

This is the first of a set of R scripts I wrote a couple of months ago in an effort to understand all the moving parts and assumptions of the basic model: what I am putting in and what I am getting out. When I do these things for myself, I will usually lay out the basic model in (excruciating?) detail, attempt to examine the assumptions inherent in it, and write code that simulates data generated according to those assumptions and fits the model to that simulated data. I usually also try to break those assumptions and see how the violations affect the model output. The output of this effort is usually not fit for public consumption - though it is ALWAYS well commented - so this post is the result of a fair amount of polishing. That said, it could still be wrong in any number of ways, and if you think it is, please do let me know.

This simulated generative process is based essentially on Blei's descriptions and equations in the papers linked above. As he describes it:

"We formally define a topic to be a distribution over a fixed vocabulary...

We assume that these topics are specified before any data has been generated.

Now for each document in the collection, we generate the words in a two-stage process.

  1. Randomly choose a distribution over topics.
  2. For each word in the document:
    1. Randomly choose a topic from the distribution over topics in step #1.
    2. Randomly choose a word from the corresponding distribution over the vocabulary."
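
In the notation from those papers, this process (plus the Dirichlet draws that generate the distributions themselves) can be written compactly as follows; in the code below, the rows of Theta play the role of the theta_d's, the rows of Phi play the role of the phi_k's, and alphA and betA are the symmetric Dirichlet parameters:

\begin{aligned}
\phi_k &\sim \mathrm{Dirichlet}(\beta), && k = 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), && d = 1, \dots, M \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(1, \theta_d), && n = 1, \dots, N_d \\
w_{d,n} \mid z_{d,n}, \phi_{1:K} &\sim \mathrm{Multinomial}(1, \phi_{z_{d,n}})
\end{aligned}

where N_d is the length of document d.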

Below is the source for the simulation function that generates an LDA corpus according to that process:

### Basic LDA Topic Model Simulation ###
### Generate Simulated Corpus ###
library(ggplot2)
library(tm)
library(plyr)     # for llply()
library(MCMCpack) # for rdirichlet()
 
simulateCorpus <- function(
	M, # number of documents
	nTerms,     # number of terms in the vocabulary
	docLengths, # vector of document lengths (one per document)
	K,  	# Number of Topics
	alphA, 	# parameter for symmetric 
                # Document/Topic dirichlet distribution
	betA, 	# parameter for Topic/Term dirichlet distribution
	Alpha=rep(alphA,K), # number-of-topics length vector 
                            # set to symmetric alpha parameter 
                            # across all topics
	Beta=rep(betA,nTerms))  # number-of-terms length vector 
                                # set to symmetric beta parameter 
                                # across all terms
	{
	# Labels
	Terms <- paste("Term",seq(nTerms))
	Topics <- paste("Topic", seq(K))
	Documents <- paste("Document", seq(M))
 
	## Generate latent topic and term distributions
	# "True" Document/Topic distribution matrix
	Theta <- rdirichlet(M, Alpha) 
	colnames(Theta) <- Topics
	rownames(Theta) <- Documents
 
	# "True" Topic/Term Distribution Matrix
	Phi <- rdirichlet(K, Beta) 
	colnames(Phi) <- Terms
	rownames(Phi) <- Topics
 
	## Function to generate individual document
	generateDoc <- function(docLength, topic_dist, terms_topics_dist){
		# docLength is specific document length
		# topic_dist is specific topic distribution for this document
		# terms_topics_dist is terms distribution matrix over all topics
		document <- c()
		for (i in seq(docLength)){
			# For each word in a document, 
			# choose a topic from that 
			# document's topic distribution
			topic <- which.max(rmultinom(1, 1, topic_dist)) # index of sampled topic
 
			# Then choose a term from that topic's term distribution
			term <- rmultinom(1, 1, terms_topics_dist[topic,]) 
 
			# and append term to document vector
			document <- c(document, 
			colnames(terms_topics_dist)[which.max(term)]) 
			}
		return(document)
		}
 
	## generate "observed" corpus as list of terms
	corpus <- list()
	for (i in seq(M)){
		corpus[[i]] <- generateDoc(docLengths[i], Theta[i,], Phi)
		}
 
	## convert document term vectors to frequency vectors
	freqsLists <- llply(corpus, table)
 
	## write values to termFreqMatrix
	termFreqMatrix <- matrix(nrow=M, ncol=nTerms, 0)
	colnames(termFreqMatrix) <- Terms
	rownames(termFreqMatrix) <- Documents
	for (i in seq(M)){
		termFreqMatrix[i,names(freqsLists[[i]])] <- freqsLists[[i]]
		}
 
	stopifnot(rowSums(termFreqMatrix) == docLengths)
 
	return(list("docs"=corpus, 
	'termFreqMatrix'=termFreqMatrix, 
	"Theta"=Theta, 
	"Phi"=Phi))
 
}
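
Here is a minimal usage sketch, assuming the function and libraries above have been sourced; the parameter values are purely illustrative, not recommendations. The Alpha and Beta defaults give symmetric Dirichlet priors, so explicitly passing an asymmetric Alpha vector is one easy way to break that assumption and see what happens.

# Minimal usage sketch: 100 documents, a 50-term vocabulary, 5 topics,
# symmetric Dirichlet priors, document lengths drawn around 100 words.
set.seed(42)
sim <- simulateCorpus(M          = 100,
                      nTerms     = 50,
                      docLengths = rpois(100, lambda = 100),
                      K          = 5,
                      alphA      = 0.1,
                      betA       = 0.05)

dim(sim$termFreqMatrix)   # 100 x 50 document/term frequency matrix
round(sim$Theta[1, ], 2)  # "true" topic weights for Document 1
sim$docs[[1]][1:10]       # first ten simulated tokens of Document 1

# Breaking the symmetric-prior assumption: an asymmetric Alpha that
# concentrates most documents on the first topic.
skewed <- simulateCorpus(100, 50, rpois(100, 100), 5, 0.1, 0.05,
                         Alpha = c(5, 0.1, 0.1, 0.1, 0.1))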

I will finish polishing the inferential and viz code and put that up shortly. In the meantime, if you're exploring topic models, you should be able to use this code to figure out how best to tune the canned R packages: play with the "true" hyperparameters and see what effect they have on the output.
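
As a rough sketch of one way to feed the simulated output to a canned package, the snippet below uses the topicmodels package; the package choice, control settings and variable names (dtm, fit, estTheta, estPhi) are illustrative placeholders only, not necessarily what the follow-up post will use.

library(topicmodels)

# Convert the simulated counts to a DocumentTermMatrix (tm is loaded above)
# and fit an LDA model by collapsed Gibbs sampling, handing the sampler the
# "true" K and alpha from the simulation.
dtm <- as.DocumentTermMatrix(sim$termFreqMatrix, weighting = weightTf)
fit <- LDA(dtm, k = 5, method = "Gibbs",
           control = list(alpha = 0.1, seed = 42, burnin = 1000, iter = 2000))

# Estimated distributions to compare against the "true" Theta and Phi
# (recovered topics are only identified up to a relabeling of the topics).
estTheta <- posterior(fit)$topics  # documents x topics matrix
estPhi   <- posterior(fit)$terms   # topics x terms matrix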