
Some Assembly Required: Understanding and Interpreting Topics in LDA Models of Figurative Language

The following is a small part from a much larger work in progress (my dissertation) about the potential to use latent Dirichlet allocation (LDA) to do exploratory work with highly figurative language.  In fact, my project uses LDA to model various iterations of an approximately 4,500 poem dataset (the majority of which are from the 20th century), and to consider the composition of that dataset in relationship to a smaller subset of the data that could be described as belonging to a poetic tradition called ekphrasis: poetry to, for, and about the visual arts.  There’s no way that I could begin to get into all the nitty gritty details about the rest of the project in this one blog post, so with apologies, I’m going to begin in media res, assuming that you know this much: probabilistic topic modeling, and in particular LDA, is a way of looking for patterns in large collections of text.  In previous posts I’ve mentioned that there are many good posts on what LDA is and how some humanists are using it.  Most recently Scott Weingart produced a blog post called “Topic Modeling for Humanists: A Guided Tour” that adds to the much needed collection of “How to get started” conversations.  Rather than focusing on how topic modeling can be useful to you, this is a post about how you, dear Reader, need to read our results—at least those of us who are working with figurative texts and particularly those of us working with poetry, the most figurative of them all.

If you’re just getting started, it’s important to begin with the following knowledge: data mining in any form makes two assumptions that Ian H. Witten, Eibe Frank, and Mark Hall point out in their introduction to the topic and to their graphical interface data mining software Weka.  They remind us that data mining results need to be actionable and that they need to be comprehensible.  I’ll go into what I think that means for my work, but suffice it to say, topic modeling assumes that texts, though amorphous, don’t hide information.  In fact, text mining in general assumes that writers go to great lengths to make clear, unambiguous arguments.  Computer scientists make that assumption because LDA was written to deal with large repositories and collections of non-fiction text.  When you’re reading the journal Science, for example, you don’t see lines like:

Little lion face

I stooped to pick
among the mass of thick
succulent blooms, the twice
streaked flanges of your silk

sunwheel relaxed in wide
dilation, I brought inside,
placed in a vase. Milk
of your shaggy stem

sticky on my fingers, and
your barbs hooked to my hand,
sudden stings from them
were sweet…

May Swenson wasn’t writing for Science, and her poem is about more than dandelions.  Pretty much any human reading that poem, even my undergraduates, gets that this is a poem about sex.  Science’s editors would never publish this; however, they may publish and have published plenty of articles about sex, reproduction, and the propagation of flora and fauna.  The terms they use, though, strive against ambiguity, while poetry revels in it.  We don’t have a well-established way of interpreting topics that accounts for poetry’s lush ambiguity, but we need one, because it would be a mistake to read a topic with the keywords wind, sky, light, trees, blue, white, snow… generated from a collection of poems the same way you would read and understand it in, say, David Blei’s 100-topic model of Science.

Rather than reposting Blei’s images here, I suggest that readers interested in understanding LDA look at his article from Communications of the ACM, because his illustrative examples on the first and second pages do an excellent job of showing how topics are generated.  Blei’s results are these wonderfully identifiable topics that make such sense: of course, we can interpret one topic as the genetics topic because it is composed of words like gene and dna, and another as the evolutionary biology topic because it is made up of words like survival and mutation.

So while the classic examples of topic models produce semantically and thematically coherent keyword distributions, should we expect highly figurative texts, particularly poems but not exclusive of other forms of highly figurative texts such as fiction and drama, to form around the same kind of thematic topics?  Returning once again to Blei’s most accessible article for humanists, he writes: “The interpretable topic distributions arise by computing the hidden structure that likely generated the observed collection of documents.” Blei clarifies his statement in a footnote which reads: “Indeed calling these models “topic models” is retrospective—the topics that emerge from the inference algorithm are interpretable for almost any collection that is analyzed.  The fact that these look like topics has to do with the statistical structure of observed language and how it interacts with the specific probabilistic assumptions of LDA” (Blei “Introduction” 79).  In other words, the topics from Science scan as comprehensible, cohesive topics because the texts from which they were derived strive to use language that identifies very literally with its subject.  The algorithm, however, does not know the difference between texts that tend to be literal and texts that tend to be figurative.  The same process for identifying topics applies to both literal and figurative texts: topics are a distribution over a fixed vocabulary.  The first stage of a topic modeling experiment with poetry, then, is a matter of determining what those distributions look like and whether or not they can be useful.
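The phrase “a distribution over a fixed vocabulary” can be made concrete with a toy sketch.  The vocabulary and probabilities below are invented for illustration (they are not drawn from my models); the point is only that a “topic” is nothing more than a weighting of every word in the corpus, with the weights summing to one:

```python
# A hypothetical, tiny "topic": a probability distribution over a
# fixed vocabulary. Real vocabularies have thousands of words, each
# with some (often tiny) probability.
topic = {
    "night": 0.30,
    "light": 0.25,
    "moon": 0.20,
    "stars": 0.15,
    "day": 0.10,
}

# The weights of a topic always sum to 1.
assert abs(sum(topic.values()) - 1.0) < 1e-9

# The "keywords" reported for a topic are simply its highest-weight words.
top_keywords = sorted(topic, key=topic.get, reverse=True)
print(top_keywords)  # ['night', 'light', 'moon', 'stars', 'day']
```

The model has no notion that “night” might be a metaphor; it only sees the weights.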

What would the same illustrative example Blei created for Science look like in an LDA model run on a corpus of poetry?  The poem below translates the LDA intuitions described in Blei’s article to the situation of a poem in a dataset of 4,500 poems.  For copyright purposes, I’m going to remove from the poem the words that would be removed during preprocessing (the stopwords) of Anne Sexton’s “The Starry Night” but if you want to see the whole poem, look here.

Anne Sexton’s Starry Night with stopwords removed.

In the case of Anne Sexton’s “Starry Night,” LDA infers that the three most prominent topics in the poem are 32, 2, and 54.  In the chart below, I list each topic assignment at the top with the estimated distribution of that topic across the document.  Under each topic is a list of the 15 keywords most strongly associated with that topic.

Topic 32 (29%)          Topic 2 (12%)          Topic 54 (9%)

[Table: the 15 keywords most strongly associated with each of the three topics.]
LDA analysis, then, reads Anne Sexton’s “Starry Night” as containing 29% of its words from topic 32, which seems generally to draw on language associated with time of day; 12% of its language from topic 2, which includes many words about death and dying; and 9% of its language from topic 54, which draws on the natural environment.  Strong coherence among keywords in topics 32 and 54 simplifies the interpretive task of assigning labels to them; however, topic 2 is not so easily labeled.  The terms “death, life, heart, dead, long, world” are extremely broad, and to my mind easily misread or misinterpreted without the context of the data they describe.  Only by referring back to “The Starry Night” (and other poems closely associated with topic 2) can we develop a sense of hermeneutic confidence about the comprehensibility of such results, which are discussed further on.
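The per-document percentages above can be understood in token-level terms: LDA assigns each (non-stopword) token in a poem to a topic, and a document’s proportion for a topic is simply the share of its tokens assigned to that topic.  A minimal sketch, with invented assignments (not Sexton’s actual model output) for a hypothetical 100-token poem:

```python
from collections import Counter

# Hypothetical per-token topic assignments for a 100-token poem.
token_topics = [32] * 29 + [2] * 12 + [54] * 9 + [17] * 50

# A document's topic proportion is the fraction of its tokens
# assigned to that topic.
counts = Counter(token_topics)
proportions = {t: n / len(token_topics) for t, n in counts.most_common()}
print(proportions)  # {17: 0.5, 32: 0.29, 2: 0.12, 54: 0.09}
```

In practice these proportions are estimated during inference rather than counted after the fact, but the intuition is the same.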

I’m doing a lot of cutting from the original document in which I make this argument (insert shameless plug here regarding my dissertation), so please forgive some of the logical leaps.  I want to jump ahead to the point: Why do we care about what kinds of topics these are, and how does that relate to the need for close readings?

Essentially, in my dataset, I found that four kinds of topics are most likely to appear when topic modeling poetry: OCR and foreign language topics; “large chunk” topics (a document larger than most of the rest with language that dominates a particular topic); semantically evident topics; and semantically opaque topics.  I’ll describe the latter two here:

1.)    Semantically evident topics—Some topics do appear just as one might expect them to in the 100-topic distribution of Science in Blei’s paper.  Topics 32 and 54, illustrated above in Anne Sexton’s “Starry Night,” exemplify how LDA groups terms in ways that appear at first blush to be thematic, as well.  Our understanding, though, of these semantically evident topics as they are generated by highly figurative texts requires a bit of refinement.  It may be accurate to say that time of day and natural landscapes are topics in “Starry Night.”  After all, Sexton does describe a painted landscape under the stars, but it would not be correct to say that 29% of the document is “about” the time of day.  As literary scholars, we understand that Sexton’s use of the tumultuous night sky depicted by Vincent Van Gogh provides a conceit for the more significant thematic exploration of two artists’ struggle with mental illness.  Therefore, it is important not to be seduced by the seeming transparency of semantically evident topics.  These topics reflect most powerfully Ted Underwood’s definition of “LDA topic” as “discourse.”  In other words, topics form around a manner of speech, and the significant questions to be asked regarding such topics have to do with what we learn about the relationships between forms of discourse associated with particular topics across documents within a specific dataset.

2.)    Semantically opaque topics—Some topics, such as topic 2 in the “Starry Night” example, are not immediately apparent.  In fact, I found them to be discouraging the first time I started running LDA models of the dataset because they are so difficult to synthesize into the single phrases used by so many researchers, not only in computer science but in digital humanities as well.  Determining a pithy label for a topic with the keywords “death, life, heart, dead, long, world, blood, earth…” is virtually impossible until you return to the data, read the poems most closely associated with the topic, and infer the commonalities among them:

[Table: poems most closely associated with topic 2; the proportion values were not preserved.]

When to the sessions of sweet silent thought (Sonnet 30)
By ways remote and distant waters sped (101)
A Psalm of Life
We Wear the Mask
The times are nightfall, look, their light grows less
The Slave’s Complaint
The Guitar
Tears in Sleep
The Man with the Hoe
A Short Testament
Beyond the Years
Dead Fires
O Little Root of a Dream
Bangladesh II
Vitae Summa Brevis Spem Nos Vetat Incohare Longam

Topic 2 is interesting for a number of reasons, not the least of which is that even though Paul Laurence Dunbar’s “We Wear the Mask” never once mentions the word “death,” the language Dunbar uses to describe the erasure of identity and the shackles of racial injustice is identified as drawing heavily from language associated with death, loss, and internal turmoil—language which “Starry Night” indisputably also draws from.  To say that this is a topic about “death, loss, and internal turmoil” is overly simplistic.  Just as semantically evident topics require interpretation, so do semantically opaque topics.  While the former tends to center around images, metaphors, and particular literary devices, the latter often emphasizes tone.  Words like “death, life, heart, dead, long, world” out of context tell us nothing about an author’s attitude or thematic affinities between poems, but when a close reader scales down into the compressed language of the poems themselves that draw from the topic’s language distribution, there are rich deposits of hermeneutic possibility.  There’s a lot that could be said about elegy here and the relationships between elegy and other poetic genres… but I’ll save that for another post.
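Producing a list like the one above is mechanically simple: documents are sorted by their estimated proportion for the chosen topic.  A generic sketch, with invented titles and proportions standing in for real model output:

```python
# Hypothetical mapping of document title -> per-topic proportions.
doc_topics = {
    "Poem A": {2: 0.41, 32: 0.10},
    "Poem B": {2: 0.08, 54: 0.30},
    "Poem C": {2: 0.27, 32: 0.15},
}

def top_docs_for_topic(doc_topics, topic, n=10):
    """Rank documents by their proportion for a single topic."""
    ranked = sorted(doc_topics.items(),
                    key=lambda kv: kv[1].get(topic, 0.0),
                    reverse=True)
    return [(title, props.get(topic, 0.0)) for title, props in ranked[:n]]

print(top_docs_for_topic(doc_topics, topic=2))
# [('Poem A', 0.41), ('Poem C', 0.27), ('Poem B', 0.08)]
```

The interpretive work, as argued above, only begins once this list exists: the ranked poems still have to be read.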

At long last, the point: if we assume that the “semantically evident” topics are actually about the words by themselves, we’re missing something important.  Semantically evident and semantically opaque topics in LDA models of highly figurative texts must be the starting point for an interpretive process.  It is incumbent upon us as digital humanists who use this methodology to explain that a topic with keywords like “night, light, moon, stars, day” isn’t just about time of day.  More likely, it’s about the use of time of day as images, metaphors, and other figurative proxies for another conversation, and none of that is evident without a combination of close and “networked” reading.  These four topic types appear in every model to varying degrees based on the number of topics input during the construction of my LDA models, and they represent the difference between topic models of figurative language and topic models of non-fiction, journalistic, or academic prose.  As a result, reading, navigating, and interpreting topics in a figurative dataset requires a slightly different approach than reading, navigating, and interpreting models of other kinds of text collections.  Moreover, understanding topics requires a networked interpretive strategy.  Texts need to be read in relationship to other texts in the corpus; how that happens, and what I suggest as best practices for doing networked readings, is a point I’ll have to make in the next post.

THATCampVA Tweet Visualization

A NodeXL visualization of #THATCampVA tweets and people mentioned in them

The tweeting and the tweeted: THATCampVA in 140 character sprints

When you spend most of your time using a tool in a way in which it was not intended, sometimes it’s satisfying to try to use it for what it was meant to do. This network visualization of Twitter mentions from the past weekend’s THATCampVA proved useful to me for just that purpose. There’s no real argument here besides… this is kind of pretty. I used NodeXL, which is social network analysis (SNA) software, to do the calculations and visualization of the network. NodeXL allowed me to access the Twitter search API and pull in all tweets since April 19th that include the #THATCampVA hashtag. I used the Harel-Koren Fast Multiscale algorithm to create the visualization. Those included in the visualization are people who tweeted about someone else and those who were referenced within tweets. The “edges” or lines between pictures (also known as vertices) represent the direction of the relationship. In other words, arrows originate at the image of the twitterer who wrote the tweet and are pointing to the person tweeted about or to.
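NodeXL handles this internally, but the underlying edge list—an arrow from each tweet’s author to each account mentioned in the tweet—can be sketched in a few lines.  The tweets below are invented examples, not actual #THATCampVA data:

```python
import re

# Invented sample tweets: (author, text).
tweets = [
    ("alice", "Great session with @bob at #THATCampVA!"),
    ("bob", "Thanks @alice and @carol for the demo #THATCampVA"),
]

# Directed edges: an arrow from the tweet's author to each account mentioned.
edges = [(author, mention.lower())
         for author, text in tweets
         for mention in re.findall(r"@(\w+)", text)]

print(edges)  # [('alice', 'bob'), ('bob', 'alice'), ('bob', 'carol')]
```

A layout algorithm like Harel-Koren then positions the vertices; the edge list itself is all the “network” there is.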


“The Venus Hottentot (1925)” as a network

At MSA13 this week, I will be presenting a couple of ways I have started mapping ekphrasis using social network analysis. The following visualization is a very early working through of how to identify “nodes” in the poem and how to define their relationships. In this case, I have “named” the subjectivities, voices, locations, languages, and “actors” within the poem. Then, in an Excel spreadsheet, I placed any subject initiating an action (defined as describing, narrating, relating, comparing, envoicing, placing, observing, etc) in the first column and the correlating object of that action in the second column. In other words, this formalizes my understanding that ekphrasis is something done to something else by someone else for someone else. The following is only a very preliminary visualization using the Network Diagram tool in Many Eyes.
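The two-column spreadsheet described above is just a directed edge list with the action as an edge label.  A sketch of the same structure, with invented node names and actions standing in for the poem’s actual subjectivities:

```python
from collections import Counter

# Each row: (acting subject, action, object acted upon), mirroring the
# two-column spreadsheet with the action kept as an edge label.
triples = [
    ("speaker", "describes", "painting"),
    ("speaker", "envoices", "figure"),
    ("painting", "places", "figure"),
]

# Directed edges for a network-diagram tool: subject -> object.
edges = [(subj, obj) for subj, _, obj in triples]

# A simple measure: how often each node is the *object* of an action,
# i.e. has ekphrasis done *to* it.
in_degree = Counter(obj for _, obj in edges)
print(edges)
print(in_degree)  # Counter({'figure': 2, 'painting': 1})
```

This formalizes the claim that ekphrasis is something done to something by someone: the arrows carry the doing.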



Reading failure

Today’s reading demonstrates how social network analysis can be used to read the silences, the absences, and the failures of a text to fill in detail.  Lauren Klein’s article at ARCADE, titled “When Reading Fails,” points to the difficulty of reading James Hemings, Thomas Jefferson’s cook and Sally Hemings’ older brother.  By mining the Jefferson papers, digitized at the University of Virginia, she visualizes Jefferson’s correspondence and learns that you can begin to see a story about Jefferson’s relationship to his staff by understanding the frequency with which, and the subjects about which, he corresponded with them.  What I find particularly interesting is her method of visualization.  Rather than the usual “hairball” that typifies most SNA visualizations, this one uses a single line to track correspondence.

