Some Assembly Required: Understanding and Interpreting Topics in LDA Models of Figurative Language

The following is a small part of a much larger work in progress (my dissertation) about the potential of latent Dirichlet allocation (LDA) for exploratory work with highly figurative language.  My project uses LDA to model various iterations of an approximately 4,500-poem dataset (the majority of which are from the 20th century), and to consider the composition of that dataset in relationship to a smaller subset of the data that could be described as belonging to a poetic tradition called ekphrasis: poetry to, for, and about the visual arts.  There’s no way I could get into all the nitty-gritty details of the rest of the project in this one blog post, so with apologies, I’m going to begin in media res, assuming that you know this much: probabilistic topic modeling, and in particular LDA, is a way of looking for patterns in large collections of text.  In previous posts I’ve mentioned that there are many good posts on what LDA is and how some humanists are using it.  Most recently Scott Weingart produced a blog post called “Topic Modeling for Humanists: A Guided Tour” that adds to the much-needed collection of “how to get started” conversations.  Rather than focusing on how topic modeling can be useful to you, this is a post about how you, dear Reader, need to read the results produced by those of us who work with figurative texts, and particularly by those of us who work with poetry, the most figurative of them all.

If you’re just getting started, it’s important to begin with the following knowledge: data mining in any form makes two assumptions that Ian H. Witten, Eibe Frank, and Mark Hall point out in their introduction to the topic and to their graphical-interface data mining software, Weka.  They remind us that data mining results need to be actionable and that they need to be comprehensible.  I’ll go into what I think that means for my work, but suffice it to say, topic modeling assumes that texts, though amorphous, don’t hide information.  In fact, text mining in general assumes that writers go to great lengths to make clear, unambiguous arguments.  Computer scientists make that assumption because LDA was written to deal with large repositories and collections of non-fiction text.  When you’re reading the journal Science, for example, you don’t see lines like:

Little lion face

I stooped to pick
among the mass of thick
succulent blooms, the twice
streaked flanges of your silk

sunwheel relaxed in wide
dilation, I brought inside,
placed in a vase. Milk
of your shaggy stem

sticky on my fingers, and
your barbs hooked to my hand,
sudden stings from them
were sweet…

May Swenson wasn’t writing for Science, and her poem is about more than dandelions.  Pretty much anyone reading that poem, even my undergraduates, gets that this is a poem about sex.  Science’s editors would never publish this; however, they may publish and have published plenty of articles about sex, reproduction, and the propagation of flora and fauna.  The terms they use, though, strive against ambiguity, while poetry revels in it.  We don’t have a well-established way of interpreting topics that accounts for poetry’s lush ambiguity, but we need one, because it would be a mistake to read a topic with the keywords wind, sky, light, trees, blue, white, snow… generated from a collection of poems the same way you would read and understand a topic in, say, David Blei’s 100-topic model of Science.

Rather than reposting Blei’s images here, I suggest that readers interested in understanding LDA look at his article from Communications of the ACM, because his illustrative examples on the first and second pages do an excellent job of showing how topics are generated.  Blei’s results are these wonderfully identifiable topics that make such sense: of course, we can interpret one topic as the genetics topic because it is composed of words like gene and dna, and another as the evolutionary biology topic because it is made up of words like survival and mutation.

So while the classic examples of topic models produce semantically and thematically coherent keyword distributions, should we expect highly figurative texts, particularly poems but also other forms of highly figurative writing such as fiction and drama, to form around the same kinds of thematic topics?  Returning once again to Blei’s most accessible article for humanists, he writes: “The interpretable topic distributions arise by computing the hidden structure that likely generated the observed collection of documents.” Blei clarifies his statement in a footnote which reads: “Indeed calling these models ‘topic models’ is retrospective—the topics that emerge from the inference algorithm are interpretable for almost any collection that is analyzed.  The fact that these look like topics has to do with the statistical structure of observed language and how it interacts with the specific probabilistic assumptions of LDA” (Blei “Introduction” 79).  In other words, the topics from Science scan as comprehensible, cohesive topics because the texts from which they were derived strive to use language that identifies very literally with its subject.  The algorithm, however, does not know the difference between texts that tend toward the literal and texts that tend toward the figurative.  The same process for identifying topics applies to both: a topic is a distribution over a fixed vocabulary.  The first stage of a topic modeling experiment with poetry, then, is a matter of determining what those distributions look like and whether or not they can be useful.
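To make that first stage a little more concrete, here is a minimal sketch of what inspecting topic-word distributions can look like in code. It uses scikit-learn’s LatentDirichletAllocation rather than the toolkit behind the models discussed in this post, and the file paths, stopword handling, and topic count are illustrative assumptions rather than my actual settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical corpus: one plain-text file per poem (the real dataset has ~4,500 poems).
paths = ["poems/starry_night.txt", "poems/little_lion_face.txt"]  # illustrative paths
poems = [open(p, encoding="utf-8").read() for p in paths]

# Preprocessing: lowercase, tokenize, and drop a generic English stopword list.
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(poems)
vocab = vectorizer.get_feature_names_out()

# Fit a model; the number of topics is a modeling choice, not a given.
lda = LatentDirichletAllocation(n_components=60, random_state=1).fit(X)

# Each row of components_ is a weighting over the fixed vocabulary; sorting a row
# surfaces that topic's most probable keywords, like the keyword lists shown below.
for topic_idx, weights in enumerate(lda.components_):
    top = [vocab[i] for i in weights.argsort()[::-1][:15]]
    print(topic_idx, ", ".join(top))
```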

What would the same illustrative example Blei created for Science look like in an LDA model run on a corpus of poetry?  The poem below translates the LDA intuitions described in Blei’s article to the situation of a poem in a dataset of 4,500 poems.  For copyright purposes, I show Anne Sexton’s “The Starry Night” with the words that would be removed during preprocessing (the stopwords) taken out, but if you want to see the whole poem, look here.

Anne Sexton’s Starry Night with stopwords removed.
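For readers curious about what that preprocessing step amounts to, here is a rough sketch of stripping stopwords from a single poem’s text; the file path is hypothetical, and NLTK’s English stopword list simply stands in for whatever list a given pipeline uses.

```python
import re
from nltk.corpus import stopwords  # assumes NLTK and its "stopwords" data are installed

# Hypothetical path; the NLTK English list stands in for the pipeline's actual stopword list.
text = open("poems/starry_night.txt", encoding="utf-8").read()
stops = set(stopwords.words("english"))

tokens = re.findall(r"[a-z']+", text.lower())
kept = [t for t in tokens if t not in stops]
print(" ".join(kept))  # roughly what the "stopwords removed" version of the poem looks like
```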

In the case of Anne Sexton’s “Starry Night,” the LDA model estimates that the three most prominent topics in the poem are 32, 2, and 54.  In the chart below, I list each topic assignment at the top along with the estimated proportion of the document assigned to that topic.  Under each topic is a list of the 15 keywords most strongly associated with it.

Topic 32 (29%): night, light, moon, stars, day, dark, sun, sleep, sky, wind, time, eyes, star, darkness, bright

Topic 2 (12%): death, life, heart, dead, long, world, blood, earth, man, soul, men, face, day, pain, die

Topic 54 (9%): tree, green, summer, flowers, grass, trees, flower, spring, leaves, sun, fruit, garden, winter, leaf, apple

LDA analysis, then, reads Anne Sexton’s “Starry Night” as containing 29% of its words from topic 32, which seems generally to draw on language associated with time of day; 12% of its language from topic 2, which includes many words about death and dying; and 9% of its language from topic 54, which draws on the natural environment.  Strong coherence among the keywords in topics 32 and 54 simplifies the interpretive task of assigning labels to them; however, topic 2 is not so easily labeled.  The terms “death, life, heart, dead, long, world” are extremely broad and, to my mind, easily misread or misinterpreted without the context of the data they describe.  Only by referring back to “The Starry Night” (and other poems closely associated with topic 2) can we develop a sense of hermeneutic confidence about the comprehensibility of such results, which I discuss further on.
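For those wondering where per-document percentages like these come from, here is a minimal sketch that continues the scikit-learn example above; the fitted lda model and vectorizer are carried over from that sketch, and the file path is again hypothetical.

```python
# Continues the earlier scikit-learn sketch: lda and vectorizer are the fitted model
# and vectorizer from that example; the path below is hypothetical.
doc = open("poems/starry_night.txt", encoding="utf-8").read()
doc_topic = lda.transform(vectorizer.transform([doc]))[0]  # topic proportions for one poem

# Report the poem's three most prominent topics with their estimated proportions.
for topic_idx in doc_topic.argsort()[::-1][:3]:
    print(f"Topic {topic_idx}: {doc_topic[topic_idx]:.0%}")
```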

I’m doing a lot of cutting from the original document in which I make this argument (insert shameless plug here regarding my dissertation), so please forgive some of the logical leaps.  I want to jump ahead to the point: Why do we care about what kinds of topics these are, and how does that relate to the need for close readings?

Essentially, in my dataset, I found that four kinds of topics are most likely to appear when topic modeling poetry: OCR and foreign-language topics; “large chunk” topics (a document larger than most of the rest with language that dominates a particular topic); semantically evident topics; and semantically opaque topics.  I’ll describe the latter two here:

1.)    Semantically evident topics—Some topics do appear just as one might expect them to from the 100-topic model of Science in Blei’s paper.  Topics 32 and 54, illustrated above in Anne Sexton’s “Starry Night,” exemplify how LDA groups terms in ways that appear at first blush to be thematic as well.  Our understanding of these semantically evident topics as they are generated by highly figurative texts, though, requires a bit of refinement.  It may be accurate to say that time of day and natural landscapes are topics in “Starry Night.”  After all, Sexton does describe a painted landscape under the stars, but it would not be correct to say that 29% of the document is “about” the time of day.  As literary scholars, we understand that Sexton’s use of the tumultuous night sky depicted by Vincent Van Gogh provides a conceit for the more significant thematic exploration of two artists’ struggles with mental illness.  Therefore, it is important not to be seduced by the seeming transparency of semantically evident topics.  These topics reflect most powerfully Ted Underwood’s definition of an “LDA topic” as “discourse.”  In other words, topics form around a manner of speech, and the significant questions to be asked regarding such topics have to do with what we learn about the relationships between forms of discourse associated with particular topics across documents within a specific dataset.

2.)    Semantically opaque topics—Some topics, such as topic 2 in the “Starry Night” example, are not immediately apparent.  In fact, I found them discouraging the first time I started running LDA models of the dataset because they are so difficult to synthesize into the single-phrase labels used by so many researchers in computer science and in digital humanities alike.  Determining a pithy label for a topic with the keywords “death, life, heart, dead, long, world, blood, earth…” is virtually impossible until you return to the data, read the poems most closely associated with the topic, and infer the commonalities among them; the table below lists those poems, and a sketch of how such a ranking can be pulled from a model follows it:

 

Topic  Proportion   Title
2      0.535248643  When to the sessions of sweet silent thought (Sonnet 30)
2      0.533343438  By ways remote and distant waters sped (101)
2      0.517398877  A Psalm of Life
2      0.481152152  We Wear the Mask
2      0.477938906  The times are nightfall, look, their light grows less
2      0.472091675  The Slave’s Complaint
2      0.451175606  The Guitar
2      0.447100571  Tears in Sleep
2      0.446314271  The Man with the Hoe
2      0.437962153  A Short Testament
2      0.433767746  Beyond the Years
2      0.433152279  Dead Fires
2      0.429638773  O Little Root of a Dream
2      0.427326132  Bangladesh II
2      0.425835136  Vitae Summa Brevis Spem Nos Vetat Incohare Longam
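A ranking like the one above can be produced from any fitted model by sorting the document-topic matrix on a single topic’s column. The sketch below continues the earlier scikit-learn example; the titles list is an assumed list of poem titles aligned with the rows of the document-term matrix X, and the topic number is illustrative.

```python
import numpy as np

# Continues the earlier scikit-learn sketch: lda and X are the fitted model and the
# document-term matrix; `titles` is an assumed list of poem titles aligned with X's rows.
doc_topic = lda.transform(X)              # rows: poems, columns: topic proportions
topic_id = 2                              # rank poems by this topic (illustrative choice)
ranked = np.argsort(doc_topic[:, topic_id])[::-1][:15]
for i in ranked:
    print(f"{doc_topic[i, topic_id]:.3f}  {titles[i]}")
```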

Topic 2 is interesting for a number of reasons, not the least of which is that even though Paul Laurence Dunbar’s “We Wear the Mask” never once mentions the word “death,” the language Dunbar uses to describe the erasure of identity and the shackles of racial injustice is identified as drawing heavily from language associated with death, loss, and internal turmoil—language which “Starry Night” indisputably also draws from.  To say that this is a topic about “death, loss, and internal turmoil” is overly simplistic.  Just as semantically evident topics require interpretation, so do semantically opaque topics.  While the former tend to center on images, metaphors, and particular literary devices, the latter often emphasize tone.  Words like “death, life, heart, dead, long, world” out of context tell us nothing about an author’s attitude or about thematic affinities between poems, but when a close reader scales down into the compressed language of the poems that draw from the topic’s language distribution, there are rich deposits of hermeneutic possibility.  There’s a lot that could be said here about elegy and the relationships between elegy and other poetic genres… but I’ll save that for another post.

At long last, the point: if we assume that the “semantically evident” topics are simply about the words themselves, we’re missing something important.  Semantically evident and semantically opaque topics in LDA models of highly figurative texts must be the starting point for an interpretive process.  It is incumbent upon those of us in digital humanities who use this methodology to explain that a topic with keywords like “night, light, moon, stars, day” isn’t just about time of day.  More likely, it’s about the use of time of day in images, metaphors, and other figurative proxies for another conversation, and none of that is evident without a combination of close and “networked” reading.  These four topic types appear in every model to varying degrees, depending on the number of topics specified when constructing my LDA models, and they represent the difference between topic models of figurative language and topic models of non-fiction, journalistic, or academic prose.  As a result, reading, navigating, and interpreting topics in a figurative dataset requires a slightly different approach than reading, navigating, and interpreting models of other kinds of text collections.  Moreover, understanding topics requires a networked interpretive strategy: texts need to be read in relationship to other texts in the corpus, and how that happens, and what I suggest as best practices for doing networked readings, is a point I’ll have to make in the next post.

2 thoughts on “Some Assembly Required: Understanding and Interpreting Topics in LDA Models of Figurative Language”

  1. Jonathan

    I have recently begun experimenting with topic modeling, and, in my more limited experience, I had similar intuitions about its usefulness. I have run topic models (using the R topicmodels package) on Marjorie Bowen's historical fiction, most of Conrad's novels, Infinite Jest, and a sample of late 19th-early 20th C sexology. I haven't actually read any of the Bowen novels, and it seemed as if the topics tracked fairly closely to individual novels, as was mostly the case with Conrad. I could also fairly easily identify the individual works of sexology by the topic lists, which I suppose means that the corpus wasn't large enough for it to reveal any commonalities.

    With Infinite Jest, a large but single novel, there were some intriguingly suggestive topic lists (but also semantically opaque ones, as you call them). The Gibbs sampling method uses randomization, as I understand it, and I was quite surprised by how variable each iteration of the model was.

  2. Pingback: Text Mining Workshop » THATCamp Southern California 2012
