Daily Archives: April 1, 2012

Chunks, Topics, and Themes in LDA

[NB: This post continues a conversation begun on Ted Underwood’s blog under the post “A touching detail produced by LDA,” in which he demonstrates that there is an overlap between the works of the Shelley/Godwin family and a topic that includes the terms mind / heart / felt.  Rather than hijack his post, I’m responding here to questions that have more to do with process than content; however, to understand fully the genesis of this conversation, I encourage you to read Ted’s post and the comments there first.]


I appreciate your response because it is making me think carefully about what I understand LDA “topics” to represent.  I’m not sure that I’m on board with thinking of topics in terms of discourse or necessarily “ways” of writing.  Honestly, I’m not trying to be difficult here; rather, I’m trying to parse for myself what I mean when I talk about my expectations that particular terms “should” form the basis for a highly probable topic.  It seems to me that what one wants from topic modeling are lexical themes: in other words, lexical trends over the course of particular chunks of text.  I’m taking to heart here Matt Jockers’s recent post on the LDA buffet, in which he articulates the assumption that LDA analysis makes: that the world is composed of a certain number of topics (and in Mallet, we define the number of topics when we run the topic modeling application).  As a result, when I run a topic model analysis in Mallet, I am looking at the way graphemes (because the written symbol, of course, is divorced from its meaning) relate to other similar graphemes.  So, though topics may not have a one-to-one semantic relationship with particular volumes as the “main topic” or “supporting topics,” one might reasonably expect that a text with a 90% probability of including a list of graphemes from an LDA topic lexicon (for lack of a better word) would correspondingly address a thematic topic which depends heavily on a closely related vocabulary.  Similarly, the frequent use of words in a topic lexicon increases the probability that the LDA topic, through the repetition of those words, carries semantic weight, though the degree to which this is the case wouldn’t likely be determined by that initial topic probability.

I’m chasing the rabbit down a hole here, but I do so for the purpose of agreeing with your earlier claim that the kinds of results we get, their reliability, and their usefulness seem to be largely determined by the kinds of questions we’re asking in the first place.  I agree that when we use LDA to describe texts, that’s fundamentally different from using it to test assumptions/expectations.  In my research, I have attempted to draw very clear distinctions between when I am testing assumptions about the kinds of language that dominate a particular genre of poetry and when I am using LDA to generate a list of potential word groups that could then be used to describe poetic trends.  I see those as two very different projects.  When I’m working with poetry and specifically with ekphrasis, I am testing what people who write about this particular genre assume to be true: that the word “still,” or variations of it, will be one of the most commonly used words across all ekphrastic texts and will be used at a higher rate than in any other genre of poetry.  It’s true that the word “still” could be a semantic topic in many other kinds of poetry; however, what we’re trying to get at is whether a group of words closely allied with the word “still” will be the most dominant and recurring trend across all ekphrastic verse.  The next determination to be made, then, is whether or not that discovery carries semantic weight.  If “still,” “stillness,” “death,” “breathless,” etc. are not actually a dominant trend, have we overstated the case?
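To make the “higher rate” part of that assumption concrete, here is a minimal sketch that compares how often a word family occurs per 1,000 tokens in two sets of poems. It is plain word counting, not LDA, and it is not the workflow used in this project: the function names, the crude tokenizer, the choice of which forms count as variations of “still,” and the two placeholder lines are all assumptions made for illustration.

```python
import re

# Hypothetical word family; which forms belong with "still" is an assumption.
STILL_FAMILY = {"still", "stillness", "stilled"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def rate_per_thousand(poems, family):
    """Return occurrences of the word family per 1,000 tokens across the poems."""
    tokens = [tok for poem in poems for tok in tokenize(poem)]
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in family)
    return 1000.0 * hits / len(tokens)


# Placeholder lines invented for this sketch; they are not drawn from the dataset.
ekphrastic = ["The painted figure is still, and the stillness holds."]
other = ["The river runs and the wind moves on."]

print(rate_per_thousand(ekphrastic, STILL_FAMILY))
print(rate_per_thousand(other, STILL_FAMILY))
```

A difference in raw rates is only a starting point; whether it is large enough to matter, and whether it carries semantic weight, are separate questions.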

It seems that what you’re saying (and please intervene if I’m not articulating this correctly), and what I tend to agree with, is that “chunk size” should be determined by the questions being asked, and that stating the way in which data has been chunked reflects the types of results we want to get in return.  Taking this into consideration, though, certainly has helped the way I position what I’m doing.  For me it is significant to chunk at the level of individual poems; however, were I to change my question to something like, “Which poets trend more toward ekphrastic topics than others?”, that question, based on what we’re saying here, seems to require chunking volumes rather than individual poems.
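The difference between the two chunking choices can be sketched in a few lines. This is only an illustration under stated assumptions: the record layout (poet, volume, title, text), the function names, and the placeholder records are hypothetical and do not describe how the dataset is actually stored.

```python
from collections import defaultdict

# Placeholder records invented for this sketch: (poet, volume, title, text).
poems = [
    ("Poet A", "Volume 1", "Poem 1", "text of the first poem"),
    ("Poet A", "Volume 1", "Poem 2", "text of the second poem"),
    ("Poet B", "Volume 2", "Poem 3", "text of the third poem"),
]


def chunk_by_poem(records):
    """One document per poem, keyed by (poet, volume, title)."""
    return {(poet, volume, title): text for poet, volume, title, text in records}


def chunk_by_volume(records):
    """One document per volume: poems in the same volume are joined together."""
    grouped = defaultdict(list)
    for poet, volume, _title, text in records:
        grouped[(poet, volume)].append(text)
    return {key: "\n".join(texts) for key, texts in grouped.items()}


print(len(chunk_by_poem(poems)))    # one document per poem
print(len(chunk_by_volume(poems)))  # one document per volume
```

Either set of documents could then be written out one file per key, since the topic model assigns topic proportions per document: poem-level chunks describe individual poems, while volume-level chunks describe a poet’s volume as a whole.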

In other news, test models on all 4500 poems in my dataset, which is chunked at the level of the individual poem, yielded much more promising initial results than we thought we would get.  I would guess that it has something to do with the number of topics we assign when we run the model, and maybe one of the other ways forward is to talk about the threshold number of topics we need to assign in order to garner meaningful results from the model.  (Obviously people like Matt and Travis have hands-on experience with this; however, I’m wondering if the type of question we’re asking should have a definable impact on how many topics we generate for the different types of tests….) Hopefully, in the near future I’ll be able to share some of those very preliminary results… but I’m still in the midst of refining my queries and configuring my data.

Again, I’m engaged because I find what you’re doing both relevant and useful, and I think that having these mid-investigation conversations does help to inform the way ahead.  As you mention, perhaps many of these kinds of questions are answered in Matt Jockers’s book, but it is unlikely I’ll be able to use that before this first iteration of my project is done in the next month or two.  I believe that hearing anecdotal conversation about the low-level kinds of tests people are playing with really does help others along in their own work since we’re still figuring out what exactly we can do with this tool.