Author Archives: lmrhody

Ekphrasis as an LDA Network in NodeXL

In an earlier post, I mentioned the value of visualizations as a means for exploring topic modeling data. That particular example used a small model of 276 poems labeled “ekphrastic” out of a much larger collection. At that point, I was still struggling with how to read the data, which felt overwhelming. How could I organize the relationships between topics and documents in such a way as to see salient connections produced by the model? The intermediate solution was to break the model down into groups of 3 topics and create bar graphs charting the likelihood that each document contained language from each topic. That solution worked in the short term, because it helped me discover that one topic was highly likely within a particular volume of ekphrastic verse: John Hollander’s The Gazer’s Spirit.

Still, what I wanted was an impressionistic overview of the documents’ association with all of the topics. The first 40 or so attempts at this process were a dismal failure. Partly because it was a learning process and partly because the results frequently resembled the much-maligned “hairball,” what I produced was completely incomprehensible. However, from August 20th to 24th I attended the Summer Social Webshop on Technology-Mediated Social Participation, funded by the NSF, the Social Media Research Foundation, and Grand. There, I met Marc Smith, who began developing NodeXL, a social media network analysis tool built to work with Microsoft Excel, while he worked for Microsoft Research. Marc, who now leads the Social Media Research Foundation and Connected Action, generously took time to demonstrate how to import my topic modeling data into NodeXL so that I could generate graphs that are more elegant and streamlined than any I’ve been able to produce to this point. The results aren’t just beautiful: they’re useful.

So, what are those results? They include unimodal and bimodal network graphs that visualize connections between documents and other documents, topics and other topics, and documents and topics, all created with an LDA model in MALLET. Using NodeXL’s algorithms, I am able to cluster groups with stronger ties in grid areas, assign them unique colors, and show the degree of probability the model calculates as a connection between nodes (either documents or topics, depending on the graph). The real power of NodeXL, though, is that in the future I can make my data public through the NodeXL gallery, and you can download my network graph and play with it yourself. The data isn’t quite there yet, but that’s what’s coming.

In the meantime, I’ll offer the following image of a network graph that I had hoped to produce with my earlier post about The Gazer’s Spirit. Though the topic label is small, Topic 3 can be seen in the top left-hand corner of the network diagram. The width and color of the edges in the diagram (that is, of the lines) are determined by the model’s estimation of how much of each topic is in each poem. If a line is thicker and lighter, the model estimates that a large portion of the poem draws its language from the corresponding topic. Similarly, the thinner and darker a line is, the lower the probability that the poem includes language from the corresponding topic.

Table 1: Ekphrastic Dataset – 276 poems and 15 topics
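The edge encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the workflow used in NodeXL: the poem ids, the proportions, the width range, and the 0.1 cutoff are all assumptions made up for the example. It turns document-topic proportions into an edge list in which each poem-to-topic tie carries a width scaled by the proportion.

```python
# Hypothetical document-topic proportions (poem id -> {topic: proportion}).
# The ids and numbers are illustrative, not taken from the actual model.
doc_topics = {
    "poem_a": {3: 0.62, 4: 0.05},
    "poem_b": {3: 0.08, 4: 0.47},
    "poem_c": {3: 0.31, 4: 0.29},
}

MIN_WIDTH, MAX_WIDTH = 1.0, 10.0  # assumed drawing range for edge widths


def build_edges(doc_topics, threshold=0.1):
    """Return (poem, topic label, proportion, width) rows for a bimodal graph."""
    edges = []
    for poem, topics in doc_topics.items():
        for topic, proportion in topics.items():
            if proportion < threshold:
                continue  # drop weak ties so the graph is less of a hairball
            width = MIN_WIDTH + proportion * (MAX_WIDTH - MIN_WIDTH)
            edges.append((poem, f"Topic {topic}", proportion, round(width, 2)))
    return edges


edges = build_edges(doc_topics)
for row in edges:
    print(row)
```

A table of rows like these (two node columns plus a weight) is the general shape of an edge list that a network tool can read; the exact columns a given tool expects may differ.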

Topic 3 (in the top left-hand corner) is primarily composed of connections to poems from The Gazer’s Spirit and is characterized by language that reflects a kind of courtship, including archaic references (thy, thee, thou) and the language of love (er, beauty, grace, eyes, heaven, divine, hand, love). This makes sense in the context of existing knowledge about Hollander’s volume. The collection reads very much like a tribute to painting and the visual arts by poetry, and the language of desire is prevalent throughout. Moreover, both W.J.T. Mitchell and James A.W. Heffernan, two prominent theorists in the ekphrastic tradition, insist that the language of love and desire is a strong, if not dominant, discourse across all of ekphrasis, based on a canon of poems mostly included in The Gazer’s Spirit. One might assume, then, that there would be strong connections between a topic composed of the language of courtship, love, and desire and most of the poems in the collection; however, only a few of the poems with a statistically significant portion of their language from Topic 3 are not also in The Gazer’s Spirit: “The Picture of Little T.C. in a Prospect of Flowers,” “The Art of Poetry [excerpt],” “Ozymandias,” “Canto I,” and “My Last Duchess.” Of those poems, none are by female poets.

Poems with highest proportion of Topic 3

The Temeraire (Supposed to Have Been Suggested to an Englishman of the Old Order by the Fight of the Monitor and Merrimac) by Herman Melville

To my Worthy Friend Mr. Peter Lilly: on that Excellent Picture of His majesty, and the Duke of York, drawne by him at Hampton-Court by Sir Richard Lovelace

From The Testament of Beauty, Book III by Robert Bridges

For Spring By Sandro Botticelli (In the Accademia of Florence) by Dante Gabriel Rossetti

To the Statue on the Capitol: Looking Eastward at Dawn by John James Piatt

The Poem of Jacobus Sadoletus on the Statue of Laocoon by Jacobus Sadoleto

To the Fragment of a Statue of Hercules, Commonly Called the Torso by Samuel Rogers

The Last of England by Ford Madox Brown

On the Group of the Three Angels Before the Tent of Abraham, by Rafaelle, in the Vatican by Washington Allston

Death’s Valley To accompany a picture; by request. “The Valley of the Shadow of Death,” from the painting by George Inness by Walt Whitman

Elegiac Stanzas Suggested by a Picture of Peele Castle, in a Storm, Painted by Sir George Beaumont by William Wordsworth

On the Medusa of Leonardo da Vinci in the Florentine Gallery by Percy B. Shelley

The Mind of the Frontispiece to a Book by Ben Jonson

Venus de Milo by Charles-Rene Marie Leconte de Lisle

The City of Dreadful Night by James Thomson

Sonnet by Pietro Aretino

For “Our Lady of the Rocks” By Leonardo da Vinci by Dante Gabriel Rossetti

Mona Lisa by Edith Wharton

Ode on a Grecian Urn by John Keats

The National Painting by Joseph Rodman Drake

The “Moses” of Michael Angelo by Robert Browning

Hiram Powers’ Greek Slave by Elizabeth Barrett Browning

From Childe Harold’s Pilgrimage, canto 4 by George Gordon Byron

The Picture of Little T. C. in a Prospect of Flowers by Andrew Marvell

Before the Mirror (Verses written under a Picture) Inscribed to J. A. Whistler by Algernon Charles Swinburne

For Venetian Pastoral By Giorgione (In the Louvre) by Dante Gabriel Rossetti

The Art of Poetry [excerpt] by Nicolas Boileau-Despreaux

Ozymandias by Percy B. Shelley

The Iliad, Book XVIII, [The Shield of Achilles] by Homer

Canto I by Dante Alighieri

The Hunters in the Snow by William Carlos Williams

Tiepolo’s Hound by Derek Walcott

St. Eustace by Derek Mahon

Three for the Mona Lisa by John Stone

My Last Duchess by Robert Browning

Table 2: Ekphrastic Dataset 15 Topic Model, Topic 3 Highlighted

The only remaining topic which includes the word love fairly high in the keyword distribution is Topic 4, which includes the following terms: portrait, monument, foreman, felt, woman, monuments, box, press, bacall, detail, young, thick, crimson, instrument, hotel, compartment, picked, cornell, Europe, lovers. As you can see from the network diagram below, none of the poems with high probabilities of containing Topic 3 are included in the Topic 4 distribution.

Table 3: Ekphrastic Dataset 15 Topic Model, Topic 4 Highlighted

Equally interesting, poems with the highest proportion of Topic 4 are also authored by female poets. Certainly, more poems by men include significant proportions of Topic 4 than poems by women include significant proportions of Topic 3; however, there are striking and salient points to be made about the contrasting networks:

Poems with highest proportion of Topic 4

“Utopia Parkway” after Joseph Cornell’s Penny Arcade Portrait of Lauren Bacall, 1945–46 by Lynda Hull

Canvas and Mirror by Evie Shockley

Portrait of Madame Monet on Her Deathbed by Mary Rose O’Reilley

Internal Monument by G. C. Waldrep

The Uses of Distortion by Caroline Crumpacker

Joseph Cornell, with Box by Michael Dumanis

Drawing Wildflowers by Jorie Graham

The Eye Like a Strange Balloon Mounts Toward Infinity by Mary Jo Bang

Visiting the Wise Men in Cologne by J.P. White

Rhyme by Robert Pinsky

The Street by Stephen Dobyns

The Portrait by Stanley Kunitz

“Picture of a 23-Year-Old Painted by His Friend of the Same Age, an Amateur” by C.P. Cavafy

Portrait in Georgia by Jean Toomer

For the Poem Paterson [1. Detail] by William Carlos Williams

The Dance by William Carlos Williams

Late Self-Portrait by Rembrandt by Jane Hirshfield

Sea Life in St. Mark’s Square by Mary O’Donnell

Washington’s Monument, February, 1885 by Walt Whitman

Still Life by Jorie Graham

Still Life by Tony Hoagland

The Family Photograph by Vona Groarke

The Corn Harvest by William Carlos Williams

Portrait of a Lady by T. S. Eliot

Portrait d’une Femme by Ezra Pound

This impressionistic overview of the ekphrastic dataset, prompted by the exploration of a network graph of the relationships between topics and poems, is a first step. It is enough, perhaps, to formulate a new hypothesis about the difference between “love” and “lovers” in ekphrastic poetry, or to lend further support to the growing sense that there is a much broader range of kinds of attraction and kinship (a range inclusive of both competitive and kindred discourses) than previous theorizations of the genre have taken into account. The network visualization does more than suggest that there are two very different discourses regarding love and affection in ekphrastic verse; it even suggests possible poems to read closely to see what those differences might be and whether they are worth pursuing further. Through the use of networked relationships between topics and documents, we begin with lists of poems in which the discourse of affinity, affection, and desire, whether as courtship or as partnership, can be further explored through close readings.

Meeting Edward Tufte’s claim that evidence should be both beautiful and useful, the NodeXL network diagrams of LDA data are a step toward developing methods of evaluating and exploring models of figurative language that do not necessarily fit the same criteria for models of non-figurative texts.

Some Assembly Required: Understanding and Interpreting Topics in LDA Models of Figurative Language


The following is a small part of a much larger work in progress (my dissertation) about the potential to use latent Dirichlet allocation (LDA) to do exploratory work with highly figurative language. In fact, my project uses LDA to model various iterations of an approximately 4,500-poem dataset (the majority of which are from the 20th century), and to consider the composition of that dataset in relation to a smaller subset of the data that could be described as belonging to a poetic tradition called ekphrasis: poetry to, for, and about the visual arts. There’s no way that I could begin to get into all the nitty-gritty details about the rest of the project in this one blog post, so with apologies, I’m going to begin in medias res, assuming that you know this much: probabilistic topic modeling, and in particular LDA, is a way of looking for patterns in large collections of text. In previous posts I’ve mentioned that there are many good posts on what LDA is and how some humanists are using it. Most recently Scott Weingart produced a blog post called “Topic Modeling for Humanists: A Guided Tour” that adds to the much-needed collection of “How to get started” conversations. Rather than focusing on how topic modeling can be useful to you, this is a post about how you, dear Reader, need to read our results, at least those of us who are working with figurative texts and particularly those of us working with poetry, the most figurative of them all.

If you’re just getting started, it’s important to begin with the following knowledge: data mining in any form makes two assumptions that Ian H. Witten, Eibe Frank, and Mark Hall point out in their introduction to the topic and to Weka, their data mining software with a graphical interface. They remind us that data mining results need to be actionable and they need to be comprehensible. I’ll go into what I think that means for my work, but suffice it to say, topic modeling assumes that texts, though amorphous, don’t hide information. In fact, text mining in general assumes that writers go to great lengths to make clear, unambiguous arguments. Computer scientists make that assumption because LDA was written to deal with large repositories and collections of non-fiction text. When you’re reading the journal Science, for example, you don’t see lines like:

Little lion face

I stooped to pick
among the mass of thick
succulent blooms, the twice
streaked flanges of your silk

sunwheel relaxed in wide
dilation, I brought inside,
placed in a vase. Milk
of your shaggy stem

sticky on my fingers, and
your barbs hooked to my hand,
sudden stings from them
were sweet…

May Swenson wasn’t writing for Science, and her poem is about more than dandelions. Pretty much any human reading that poem, even my undergraduates, gets that this is a poem about sex. Science’s editors would never publish this; however, they may publish and have published plenty of articles about sex, reproduction, and the propagation of flora and fauna. The terms they use, though, strive against ambiguity, while poetry revels in it. We don’t have a well-established way of interpreting topics that accounts for poetry’s lush ambiguity, but we need one, because it would be a mistake to read a topic with the keywords “wind, sky, light, trees, blue, white, snow…” when generated from a collection of poems the same way you would read and understand it in, say, David Blei’s 100-topic model of Science.

Rather than reposting Blei’s images here, I suggest that readers interested in understanding LDA look at his article from Communications of the ACM, because his illustrative examples on the first and second pages do an excellent job of showing how topics are generated. Blei’s results are these wonderfully identifiable topics that make such sense: of course, we can interpret one topic as the genetics topic because it is composed of words like gene and dna, and another as the evolutionary biology topic because it is made up of words like survival and mutation.

So while the classic examples of topic models produce semantically and thematically coherent keyword distributions, should we expect highly figurative texts, particularly poems but also other highly figurative forms such as fiction and drama, to form around the same kind of thematic topics? Returning once again to Blei’s most accessible article for humanists, he writes: “The interpretable topic distributions arise by computing the hidden structure that likely generated the observed collection of documents.” Blei clarifies his statement in a footnote which reads: “Indeed calling these models ‘topic models’ is retrospective – the topics that emerge from the inference algorithm are interpretable for almost any collection that is analyzed. The fact that these look like topics has to do with the statistical structure of observed language and how it interacts with the specific probabilistic assumptions of LDA” (Blei “Introduction” 79). In other words, the topics from Science scan as comprehensible, cohesive topics because the texts from which they were derived strive to use language that identifies very literally with its subject. The algorithm, however, does not know the difference between texts that tend to be more literal and texts that tend to be more figurative. The same process for identifying topics applies to both literal and figurative texts: each topic is a distribution over a fixed vocabulary. The first stage of a topic modeling experiment with poetry, then, is a matter of determining what those distributions look like and whether or not they can be useful.
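To make “a distribution over a fixed vocabulary” concrete, here is a minimal sketch of the generative story that LDA assumes and then works backward from. It is a simplification (it leaves out the Dirichlet priors), and the six-word vocabulary, the two topics, and the 70/30 mixture are invented for illustration. Each topic assigns a probability to every word in the vocabulary, and a document is produced by repeatedly picking a topic according to the document’s mixture and then picking a word from that topic.

```python
import random

# A tiny fixed vocabulary and two hypothetical topics. Each topic gives a
# probability to every word in the vocabulary, and each list sums to 1.
vocabulary = ["night", "stars", "moon", "death", "heart", "blood"]
topics = {
    "A": [0.40, 0.30, 0.20, 0.04, 0.03, 0.03],
    "B": [0.05, 0.05, 0.05, 0.35, 0.30, 0.20],
}
# Hypothetical topic proportions for a single document.
doc_mixture = {"A": 0.7, "B": 0.3}


def generate_words(n, rng):
    """Draw n words: pick a topic by the mixture, then a word from that topic."""
    words = []
    for _ in range(n):
        topic = rng.choices(list(doc_mixture), weights=list(doc_mixture.values()))[0]
        word = rng.choices(vocabulary, weights=topics[topic])[0]
        words.append(word)
    return words


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = generate_words(10, rng)
print(sample)
```

Nothing in this process distinguishes a literal use of “night” from a figurative one, which is the point of the paragraph above: the model only sees which words tend to occur together.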

What would the same illustrative example Blei created for Science look like in an LDA model run on a corpus of poetry? The poem below translates the LDA intuitions described in Blei’s article to the situation of a poem in a dataset of 4,500 poems. For copyright purposes, I’m going to remove from Anne Sexton’s “The Starry Night” the words that would be removed during preprocessing (the stopwords), but if you want to see the whole poem, look here.

Anne Sexton’s Starry Night with stopwords removed.
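For readers who have not seen this preprocessing step, stopword removal can be sketched as follows. The stopword list here is a tiny invented one and the example line is made up (it is not from Sexton’s poem); real preprocessing would use a much longer list, and the list used for this project is not shown in the post.

```python
import re

# A tiny illustrative stopword list; an assumption for this sketch, not the
# list used in the project.
STOPWORDS = {"a", "an", "and", "at", "the", "of", "in", "is", "to", "with"}


def remove_stopwords(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


line = "A painter looks at the sky and the river in the evening"  # invented line
print(remove_stopwords(line))
# → ['painter', 'looks', 'sky', 'river', 'evening']
```

What remains after this step is the set of words the model actually counts, which is why the poem above looks so sparse.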

In the case of Anne Sexton’s “Starry Night,” LDA assumes that the three most prominent topics in the poem are 32, 2, and 54. In the chart below, I list the topic assignment at the top with the estimated distribution of the topic across the document. Under each topic is a list of the top 15 keywords most strongly associated with that topic.

Topic 32 (29%)	Topic 2 (12%)	Topic 54 (9%)
night	death	tree
light	life	green
moon	heart	summer
stars	dead	flowers
day	long	grass
dark	world	trees
sun	blood	flower
sleep	earth	spring
sky	man	leaves
wind	soul	sun
time	men	fruit
eyes	face	garden
star	day	winter
darkness	pain	leaf
bright	die	apple

LDA analysis, then, reads Anne Sexton’s “Starry Night” as containing 29% of its words from topic 32, which seems generally to draw on language associated with time of day, 12% of its language from topic 2, which includes many words about death and dying, and 9% of its language from topic 54, which draws on the natural environment. Strong coherence among keywords in topics 32 and 54 simplifies the interpretive task of assigning labels to them; however, topic 2 is not so easily labeled. The terms “death, life, heart, dead, long, world” are extremely broad, and to my mind easily misread or misinterpreted without the context of the data that they describe. Only by referring back to “The Starry Night” (and other poems closely associated with topic 2) can we develop a sense of hermeneutic confidence about the comprehensibility of such results, which are discussed further on.

I’m doing a lot of cutting from the original document in which I make this argument (insert shameless plug here regarding my dissertation), so please forgive some of the logical leaps. I want to jump ahead to the point: Why do we care about what kinds of topics these are, and how does that relate to the need for close readings?

Essentially, in my dataset, I found that four kinds of topics are most likely to appear when topic modeling poetry: OCR and foreign language topics; “large chunk” topics (a document larger than most of the rest with language that dominates a particular topic); semantically evident topics; and semantically opaque topics. I’ll describe the latter two here:

1.) Semantically evident topics: Some topics do appear just as one might expect them to in the 100-topic distribution of Science in Blei’s paper. Topics 32 and 54, illustrated above in Anne Sexton’s “Starry Night,” exemplify how LDA groups terms in ways that appear at first blush to be thematic, as well. Our understanding, though, of these semantically evident topics as they are generated by highly figurative texts requires a bit of refinement. It may be accurate to say that time of day and natural landscapes are topics in “Starry Night.” After all, Sexton does describe a painted landscape under the stars, but it would not be correct to say that 29% of the document is “about” the time of day. As literary scholars, we understand that Sexton’s use of the tumultuous night sky depicted by Vincent Van Gogh provides a conceit for the more significant thematic exploration of two artists’ struggle with mental illness. Therefore, it is important not to be seduced by the seeming transparency of semantically evident topics. These topics reflect most powerfully Ted Underwood’s definition of “LDA topic” as “discourse.” In other words, topics form around a manner of speech, and the significant questions to be asked regarding such topics have to do with what we learn about the relationships between forms of discourse associated with particular topics across documents within a specific dataset.

2.) Semantically opaque topics: Some topics, such as topic 2 in the “Starry Night” example, are not immediately apparent. In fact, I found them discouraging the first time I started running LDA models of the dataset because they are so difficult to synthesize into the single phrases used by so many researchers, not only in computer science but in digital humanities as well. Determining a pithy label for a topic with the keywords “death, life, heart, dead, long, world, blood, earth…” is virtually impossible until you return to the data, read the poems most closely associated with the topic, and infer the commonalities among them:

Topic	Proportion		Title
2	0.535248643		When to the sessions of sweet silent thought (Sonnet 30)
2	0.533343438		By ways remote and distant waters sped (101)
2	0.517398877		A Psalm of Life
2	0.481152152		We Wear the Mask
2	0.477938906		The times are nightfall, look, their light grows less
2	0.472091675		The Slave’s Complaint
2	0.451175606		The Guitar
2	0.447100571		Tears in Sleep
2	0.446314271		The Man with the Hoe
2	0.437962153		A Short Testament
2	0.433767746		Beyond the Years
2	0.433152279		Dead Fires
2	0.429638773		O Little Root of a Dream
2	0.427326132		Bangladesh II
2	0.425835136		Vitae Summa Brevis Spem Nos Vetat Incohare Longam
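A ranked list like the one above can be produced with a short sort. The sketch below is only an illustration of that step, assuming the model output has already been read into (topic, proportion, title) rows; the three topic 2 rows are copied from the table above, and the topic 32 row is invented for contrast.

```python
# (topic id, proportion, title) rows. The topic 2 rows are copied from the
# table above; the topic 32 row is invented for contrast.
rows = [
    (2, 0.517398877, "A Psalm of Life"),
    (2, 0.535248643, "When to the sessions of sweet silent thought (Sonnet 30)"),
    (32, 0.29, "Hypothetical poem"),
    (2, 0.481152152, "We Wear the Mask"),
]


def top_poems(rows, topic, n=15):
    """Return the n titles with the largest proportion of the given topic."""
    matching = [(prop, title) for t, prop, title in rows if t == topic]
    matching.sort(reverse=True)  # largest proportion first
    return [title for prop, title in matching[:n]]


for title in top_poems(rows, topic=2, n=3):
    print(title)
```

The list is only a reading list: it says which poems to return to, not what they have in common.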

Topic 2 is interesting for a number of reasons, not the least of which is that even though Paul Laurence Dunbar’s “We Wear the Mask” never once mentions the word “death,” the language Dunbar uses to describe the erasure of identity and the shackles of racial injustice is identified as drawing heavily from language associated with death, loss, and internal turmoil, language which “Starry Night” indisputably also draws from. To say that this is a topic about “death, loss, and internal turmoil” is overly simplistic. Just as semantically evident topics require interpretation, so do semantically opaque topics. While the former tend to center around images, metaphors, and particular literary devices, the latter often emphasize tone. Words like “death, life, heart, dead, long, world” out of context tell us nothing about an author’s attitude or thematic affinities between poems, but when a close reader scales down into the compressed language of the poems themselves that draw from the topic’s language distribution, there are rich deposits of hermeneutic possibility. There’s a lot that could be said about elegy here and the relationships between elegy and other poetic genres… but I’ll save that for another post.

At long last, the point: if we assume that the “semantically evident” topics are actually about the words by themselves, we’re missing something important. Semantically evident and semantically opaque topics in LDA models of highly figurative texts must be the starting point for an interpretive process. It is incumbent upon us as digital humanists who use this methodology to explain that a topic with keywords like “night, light, moon, stars, day” isn’t just about time of day. More likely, it’s about the use of time of day as images, metaphors, and other figurative proxies for another conversation, and none of that is evident without a combination of close and “networked” reading. These four topic types appear in every model to varying degrees, based on the number of topics input during the construction of my LDA models, and represent the difference between topic models of figurative language and topic models of non-fiction, journalistic, or academic prose. As a result, reading, navigating, and interpreting topics in a figurative dataset requires a slightly different approach than reading, navigating, and interpreting models of other kinds of text collections. Moreover, understanding topics requires a networked interpretive strategy. Texts need to be read in relationship to other texts in the corpus; how that happens, and what I suggest as best practices for doing networked readings, is a point I’ll have to make in the next post.

Why use visualizations to study poetry?


[Note: This post was a DHNow Editor’s Choice on May 1, 2012.]

The research I am doing presently uses visualizations to show latent patterns that may be detected in a set of poems using computational tools, such as topic modeling. In particular, I’m looking at poetry that takes visual art as its subject, a genre called ekphrasis, in an attempt to distinguish the types of language poets tend to invoke when creating a verbal art that responds to a visual one. Studying words’ relationships to images and then creating more images to represent those patterns calls to mind a longstanding contest between modes of representation: which one represents information “better”? Since my research is dedicated to revealing the potential for collaborative and kindred relationships between modes of representation historically seen in competition with one another, using images to further demonstrate patterns of language might be seen as counter-productive. Why use images to make literary arguments? Do images tell us something “new” that words cannot?

Without answering that question, I’d like instead to present an instance of when using images (visualizations of data) to “see” language led to an improved understanding of the kinds of questions we might ask and the types of answers we might want to look for, an understanding that wouldn’t have been possible had we not seen them differently, through graphical array.

Currently, I’m using a tool called MALLET to create a model of the possible “topics” found in a set of 276 ekphrastic poems. There are already several excellent explanations of what topic modeling is and how it works (many thanks to Matt Jockers, Ted Underwood, and Scott Weingart, who posted these explanations with humanists in mind), so I’m not going to spend time explaining what the tool does here; however, I will say that working with a set of 276 poems is atypical. Topic modeling was designed to work on millions of words, and 276 poems doesn’t even come close; however, part of the project has been to determine a threshold at which we can get meaningful results from a small dataset. So, this particular experiment is playing with the lower thresholds of the tool’s usefulness.

When you run a topic model (train-topics) in MALLET, you tell the program how many topics to create, and when the model runs, it can output a variety of results. As part of the tinkering process, I’ve been adjusting the number of topics MALLET uses to generate the model, and was just about to despair that the real tests I wanted to run wouldn’t be possible at 276 poems. Perhaps it was just too few poems to find recognizable patterns. For each topic, MALLET assigns an ID number and a set of “topic keys” as keywords for that topic. Usually, when the topic model is working, the results are “readable” because they represent similar language. MALLET would not call a topic “Sea,” for example, but might instead provide the following keywords:

blue, water, waves, sea, surface, turn, green, ship, sail, sailor, drown

The researcher would look at those terms and think, “Oh, clearly that’s a nautical/sea/sailing topic,” and dub it as such. My results, however, on 15 topics over 276 poems were not readable in the same way. For example, topic 3 included the following topic keys:

3	0.04026	with self portrait him god how made shape give thing centuries image more world dread he lands down back protest shaped dream upon will rulers lords slave gazes hoe future

I don’t blame you if you don’t see the pattern there. I didn’t. Except, well, knowing some of the poems in the set pretty well, I know that it put together “Landscape with the Fall of Icarus” by W.C. Williams with “The Poem of Jacobus Sadoletus on the Statue of Laocoon” with “The New Colossus” with “The Man with the Hoe Written after Seeing the Painting by Millet.” I could see that we had lots of kinds of gods represented, farming, and statues, but that’s only because I knew the poems. Without topic modeling, I might put this category together as a “masters” grouping, but it’s not likely. Rather than look for connections, I was focused on the fact that the topic keys didn’t make a strong case for their being placed together, and other categories seemed similarly opaque. However, just to be sure that I could, in fact, visualize results of future tests, I went ahead and imported the topic associations by file. In other words, MALLET can also produce a file that lists each topic (0-14 in this case) with each file name in the dataset and a percentage. The percentage represents the degree to which the topic is represented inside each file. I imported the MALLET output of topics and files associated with them into Google Fusion Tables and created a dynamic bar graph that lists file-ids along the vertical axis and shows, along the horizontal axis, the degree to which the given topic (in this case topic 3) is present in the file. As I clicked through each topic’s graph, I figured I was seeing results that demonstrated MALLET’s confusion, since the dataset was so small. But then I saw this: [Below should be a Google Visualization. You may need to “refresh” your browser page to see it. If you still cannot see it, a static version of the file is visible here.]

If the graph’s visualization is working, when you pass your mouse over the bars in the graph that are higher than 0.4, the file-id (a random number assigned during the course of preparing the data) appears. Each of these file-ids begins with the same prefix: GS. In my dataset, that means that the files with the highest representation of topic 3 in them can all be found in John Hollander’s collection The Gazer’s Spirit. This anthology is considered to be one of the most authoritative and diverse, beginning with classical ekphrasis and running all the way up to and including poems from the 1980s and 1990s. I had expected, given the disparity in time periods, that the poems from this collection would be the most difficult to group together because the diction of the poems changes dramatically from the beginning of the volume to the end. In other words, I would have expected the poems to blend with the other ekphrastic poems throughout the dataset more in terms of their similar diction than by anything else. MALLET has no way of knowing that these files are included in the same anthology. All of the bibliographical information about the poems has been stripped from the text being tested. There has to be something else. What that something else might be requires another layer of interpretation. I will need to return to the topic model to see if a similar pattern is present when I use other numbers of topics, or if I add some non-ekphrastic poems to the set being tested, but seeing the affinity in language between the poems included in The Gazer’s Spirit in contrast to other ekphrastic poems proved useful. Now, I’m not inclined to throw the whole test away, but instead to perform more tests to see if this pattern emerges again in other circumstances. I’m not at square one. I’m at a square 2 that I didn’t expect.
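The check described above (which files sit above 0.4 for topic 3, and what prefixes they share) can also be done without a graph. The following is a minimal sketch, assuming the document-topic output has already been read into (file id, topic, proportion) rows. Only the “GS” prefix comes from the post; the “XX” prefix, the ids, and the numbers are invented.

```python
from collections import Counter

# Hypothetical (file id, topic, proportion) rows. Only the "GS" prefix comes
# from the post; the other prefix, the ids, and the numbers are invented.
rows = [
    ("GS1042", 3, 0.61),
    ("GS2210", 3, 0.48),
    ("XX0931", 3, 0.07),
    ("GS1042", 7, 0.12),
    ("XX5520", 3, 0.22),
]


def prefixes_above(rows, topic, threshold=0.4, prefix_len=2):
    """Count file-id prefixes among files whose share of a topic exceeds threshold."""
    return Counter(
        file_id[:prefix_len]
        for file_id, t, proportion in rows
        if t == topic and proportion > threshold
    )


counts = prefixes_above(rows, topic=3)
print(counts)
# → Counter({'GS': 2})
```

A tally like this only confirms what the bar graph made visible; the interpretive question of why those files cluster still has to be answered by reading the poems.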

The visualization in the end didn’t produce “new knowledge.” It isn’t hard to imagine that an editor would choose poems that construct a particular argument about what “best” represents a particular genre of poetry; however, if these poems did truly represent the diversity of ekphrastic verse, wouldn’t we see other poems also highly associated with a “Gazer’s Spirit topic”? What makes these poems stand out so clearly from others of their kind? Might their similarity mark a reason for why critics of the 90s and 2000s define the tropes, canons, and traditions of ekphrasis in a particular vein? I’m now returning to the test and to the texts to see what answers might exist there that I and others have missed as close readers. Could we, for instance, run an analysis that determines how closely other kinds of ekphrasis are associated with Gazer’s Spirit’s definition of ekphrasis? Is it possible that poetry by male poets is more frequently associated with that strain of ekphrastic discourse than poetry by female poets?

This particular visualization doesn’t make an “argument” in the way humanists are accustomed to making them. It doesn’t necessarily produce anything wholly “new” that couldn’t have been discovered some other way; however, it did help this researcher get past a particular kind of blindness. It helped me to see alternatives and to consider what has been missed along the way, and there is, and will be, something new in that.

Chunks, Topics, and Themes in LDA


[NB: This post is the continuation of a conversation begun on Ted Underwood’s blog under the post “A touching detail produced by LDA,” in which he demonstrates that there is an overlay between the works of the Shelley/Godwin family and a topic which includes the terms mind / heart / felt. Rather than hijack his post, I’m responding here to questions having to do more with process than content; however, to understand fully the genesis of this conversation, I encourage you to read Ted’s post and the comments there first.]

Ted-

I appreciate your response because it is making me think carefully about what I understand LDA “topics” to represent. I’m not sure that I’m on board with thinking of topics in terms of discourse or necessarily “ways” of writing. Honestly, I’m not trying to be difficult here; rather, I’m trying to parse for myself what I mean when I talk about my expectations that particular terms “should” form the basis for a highly probable topic. It seems to me that what one wants from topic modeling are lexical themes; in other words, lexical trends over the course of particular chunks of text. I’m taking to heart here Matt Jockers’s recent post on the LDA buffet, in which he articulates the assumption that LDA analysis makes: that the world is composed of a certain number of topics (and in Mallet, we define that number when we run the topic modeling application). As a result, when I run a topic model analysis in Mallet, I am looking at the way graphemes (because the written symbol, of course, is divorced from its meaning) relate to other similar graphemes. So, though topics may not have a one-to-one semantic relationship with particular volumes as the “main topic” or “supporting topics,” one might reasonably expect that a text with a 90% probability of including a list of graphemes from an LDA topic lexicon (for lack of a better word) would correspondingly address a thematic topic which depends heavily on a closely related vocabulary. Similarly, the frequent use of words in a topic lexicon increases the probability that the LDA topic, through the repetition of those words, carries semantic weight, though the degree to which this is the case wouldn’t likely be determined by that initial topic probability.

I’m chasing the rabbit down a hole here, but I do so for the purpose of agreeing with your earlier claim that what kinds of results we get, their reliability, and their usefulness seem to be largely determined by the kinds of questions we’re asking in the first place. I agree that when we use LDA to describe texts, that’s fundamentally different from using it to test assumptions/expectations. In my research, I have attempted to draw very clear distinctions between when I am testing assumptions about the kinds of language that dominate a particular genre of poetry and when I am using LDA to generate a list of potential word groups that could then be used to describe poetic trends. I see those as two very different projects. When I’m working with poetry and specifically with ekphrasis, I am testing what people who write about this particular genre assume to be true: that the word “still,” or variations of it, will be one of the most commonly used words across all ekphrastic texts and used at a higher rate than in any other genre of poetry. It’s true that the word “still” could be a semantic topic in many other kinds of poetry; however, what we’re trying to get at is whether a group of words closely allied with the word “still” will be the most dominant and recurring trend across all ekphrastic verse. The next determination, then, to be made is whether or not that discovery carries semantic weight. If still, stillness, death, breathless, etc. are not actually a dominant trend, have we overstated the case?

It seems that what you’re saying (and please intervene if I’m not articulating this correctly), and what I tend to agree with, is that “chunk size” should be determined by the questions being asked, and that stating the way in which the data has been chunked signals the types of results we want to get in return. Taking this into consideration certainly has helped the way I position what I’m doing. For me it is significant to chunk at the level of individual poems; however, were I to change my question to something like, “Which poets trend more toward ekphrastic topics than others?”, then, based on what we’re saying here, that question seems to require chunking volumes rather than individual poems.
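To make the chunking choice concrete, here is a minimal sketch of preparing the same poems as two different sets of documents for a topic model. The (volume_id, poem_id, text) layout, the sample lines, and the helper names `chunk_by_poem` and `chunk_by_volume` are hypothetical; this is one plausible way to do it, not a description of how my data is actually prepared.

```python
from collections import defaultdict

# Hypothetical poems: (volume_id, poem_id, text). Invented for illustration.
poems = [
    ("vol_a", "a1", "the statue stands still"),
    ("vol_a", "a2", "a painted field of grain"),
    ("vol_b", "b1", "marble gods above the sea"),
]

def chunk_by_poem(poems):
    """One document per poem: suits questions asked about individual poems."""
    return {poem_id: text for _, poem_id, text in poems}

def chunk_by_volume(poems):
    """One document per volume: suits questions asked about poets or books."""
    grouped = defaultdict(list)
    for volume_id, _, text in poems:
        grouped[volume_id].append(text)
    return {vol: "\n".join(texts) for vol, texts in grouped.items()}

poem_docs = chunk_by_poem(poems)
volume_docs = chunk_by_volume(poems)
print(len(poem_docs), len(volume_docs))
```

The model only ever sees whichever dictionary is handed to it, which is why the chunking decision is worth stating alongside the results.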

In other news, test models on all 4,500 poems in my dataset, which is chunked at the level of the individual poem, yielded much more promising initial results than we thought we would get. I would guess that it has something to do with the number of topics we assign when we run the model, and maybe one of the other ways forward is to talk about the threshold number of topics we need to assign in order to garner meaningful results from the model. (Obviously people like Matt and Travis have hands-on experience with this; however, I’m wondering if the type of question we’re asking should have a definable impact on how many topics we generate for the different types of tests.) Hopefully, in the near future I’ll be able to share some of those very preliminary results, but I’m still in the midst of refining my queries and configuring my data.

Again, I’m engaged because I find what you’re doing both relevant and useful, and I think that having these mid-investigation conversations does help to inform the way ahead. As you mention, perhaps many of these kinds of questions are answered in Matt Jockers’s book, but it is unlikely I’ll be able to use that before this first iteration of my project is done in the next month or two. I believe that hearing anecdotal conversation about the low-level kinds of tests people are playing with really does help others along in their own work since we’re still figuring out what exactly we can do with this tool.

Lisa @ Work

This site has moved as of March 20, 2013 to a new location: www.lisarhody.com
