In an earlier post, I mentioned the value of visualizations as a means for exploring topic modeling data. That particular example used a small model of 276 poems labeled “ekphrastic” out of a much larger collection. At that point, I was still struggling with how to read the data, which felt overwhelming. How could I organize the relationships between topics and documents in such a way as to see salient connections produced by the model? The intermediate solution was to break the model down into groups of 3 topics and create bar graphs charting the likelihood that each document contained language from each topic. That solution worked in the short term, because it helped me discover that one topic was highly likely to appear within a particular volume of ekphrastic verse: John Hollander’s The Gazer’s Spirit.
Still, what I wanted was an impressionistic overview of the documents’ association with all of the topics. The first 40 or so attempts at this process were a dismal failure. Partly because it was a learning process and partly because the results frequently resembled the much-maligned “hairball,” what I produced was completely incomprehensible. However, from August 20th to 24th I attended the Summer Social Webshop on Technology-Mediated Social Participation, funded by the NSF, the Social Media Research Foundation, and Grand. There, I met Marc Smith, who began developing NodeXL, a social media network analysis tool built to work with Microsoft Excel, while he worked for Microsoft Research. Marc, who now leads the Social Media Research Foundation and Connected Action, generously took time to demonstrate how to import my topic modeling data into NodeXL so that I could generate graphs that are more elegant and streamlined than any I’ve been able to produce to this point. The results aren’t just beautiful: they’re useful.
So, what are those results? They include unimodal and bimodal network graphs that visualize connections between documents and other documents, between topics and other topics, and between documents and topics, all drawn from an LDA model created in MALLET. Using NodeXL’s algorithms, I am able to cluster groups with stronger ties in grid areas, assign them unique colors, and show the degree of probability the model calculates for a connection between nodes (either documents or topics, depending on the graph). The real power of NodeXL, though, is that in the future I can make my data public through the NodeXL gallery, and you can download my network graph and play with it yourself. The data isn’t quite there yet, but that’s what’s coming.
In the meantime, I’ll offer the following image of a network graph that I had hoped to produce with my earlier post about The Gazer’s Spirit. Though the topic label is small, Topic 3 can be seen in the top left-hand corner of the network diagram. The width and color of the edges in the diagram (that is, of the lines) are determined by the model’s estimate of how much of each topic is in each poem. If a line is thicker and lighter, the model estimates that a large portion of the poem draws its language from the corresponding topic. Similarly, the thinner and darker a line is, the lower the probability that the poem includes language from the corresponding topic.
Table 1: Ekphrastic Dataset – 276 poems and 15 topics
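To make the visual encoding concrete, here is a minimal sketch of how a topic proportion could be mapped to an edge's width and color, with thicker, lighter lines for stronger ties. This is not NodeXL's own code: the function name `edge_style`, the width range, and the two RGB blues are all assumptions chosen for illustration.

```python
def edge_style(proportion, min_width=1.0, max_width=8.0):
    """Map a topic proportion in [0, 1] to an (edge width, RGB color) pair.

    Hypothetical helper: the width range and the two blues are assumed
    values, not settings taken from NodeXL.
    """
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must be between 0 and 1")

    # Stronger ties get thicker lines.
    width = min_width + proportion * (max_width - min_width)

    # Interpolate from dark blue (weak tie) to light blue (strong tie).
    dark = (0, 0, 139)
    light = (173, 216, 230)
    color = tuple(round(d + proportion * (l - d)) for d, l in zip(dark, light))
    return width, color


weak_width, weak_color = edge_style(0.1)
strong_width, strong_color = edge_style(0.8)
```

Under these assumptions, a poem with an estimated proportion of 0.8 gets a wider, lighter edge than one at 0.1, which matches how the diagram is meant to be read.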
Topic 3 (in the top left-hand corner) is composed primarily of connections to poems from The Gazer’s Spirit and is characterized by language that reflects a kind of courtship, including archaic references (thy, thee, thou) and the language of love (er, beauty, grace, eyes, heaven, divine, hand, love). This makes sense in the context of existing knowledge about Hollander’s volume. The collection reads very much like a tribute to painting and the visual arts by poetry, and the language of desire is prevalent throughout. Moreover, both W.J.T. Mitchell and James A.W. Heffernan, two prominent theorists in the ekphrastic tradition, insist that the language of love and desire is a strong, if not dominant, discourse across all of ekphrasis, based on a canon of poems mostly included in The Gazer’s Spirit. One might assume, then, that there would be strong connections between a topic composed of the language of courtship, love, and desire and most of the poems in the collection; however, only a few of the poems that draw a statistically significant portion of their language from Topic 3 are not also in The Gazer’s Spirit: “The Picture of Little T. C. in a Prospect of Flowers,” “The Art of Poetry [excerpt],” “Ozymandias,” “Canto I,” and “My Last Duchess.” Of those poems, none are by female poets.
Poems with highest proportion of Topic 3
|The Temeraire (Supposed to Have Been Suggested to an Englishman of the Old Order by the Flight of the Monitor and Merrimac) by Herman Melville|
|To my Worthy Friend Mr. Peter Lilly: on that Excellent Picture of His majesty, and the Duke of York, drawne by him at Hampton-Court by Sir Richard Lovelace|
|From The Testament of Beauty, Book III by Robert Bridges|
|For Spring By Sandro Botticelli (In the Accademia of Florence) by Dante Gabriel Rossetti|
|To the Statue on the Capitol: Looking Eastward at Dawn by John James Piatt|
|The Poem of Jacobus Sadoletus on the Statue of Laocoon by Jacobus Sadoleto|
|To the Fragment of a Statue of Hercules, Commonly Called the Torso by Samuel Rogers|
|The Last of England by Ford Madox Brown|
|On the Group of the Three Angels Before the Tent of Abraham, by Rafaelle, in the Vatican by Washington Allston|
|Death’s Valley To accompany a picture; by request. “The Valley of the Shadow of Death,” from the painting by George Inness by Walt Whitman|
|Elegiac Stanzas Suggested by a Picture of Peele Castle, in a Storm, Painted by Sir George Beaumont by William Wordsworth|
|On the Medusa of Leonardo da Vinci in the Florentine Gallery by Percy B. Shelley|
|The Mind of the Frontispiece to a Book by Ben Jonson|
|Venus de Milo by Charles-Rene Marie Leconte de Lisle|
|The City of Dreadful Night by James Thomson|
|Sonnet by Pietro Aretino|
|For “Our Lady of the Rocks” By Leonardo da Vinci by Dante Gabriel Rossetti|
|Mona Lisa by Edith Wharton|
|Ode on a Grecian Urn by John Keats|
|The National Painting by Joseph Rodman Drake|
|The “Moses” of Michael Angelo by Robert Browning|
|Hiram Powers’ Greek Slave by Elizabeth Barrett Browning|
|From Childe Harold’s Pilgrimage, canto 4 by George Gordon Byron|
|The Picture of Little T. C. in a Prospect of Flowers by Andrew Marvell|
|Before the Mirror (Verses written under a Picture) Inscribed to J. A. Whistler by Algernon Charles Swinburne|
|For Venetian Pastoral By Giorgione (In the Louvre) by Dante Gabriel Rossetti|
|The Art of Poetry [excerpt] by Nicolas Boileau-Despreaux|
|Ozymandias by Percy B. Shelley|
|The Iliad, Book XVIII, [The Shield of Achilles] by Homer|
|Canto I by Dante Alighieri|
|The Hunter in the Snow by William Carlos Williams|
|Tiepolo’s Hound by Derek Walcott|
|St. Eustace by Derek Mahon|
|Three for the Mona Lisa by John Stone|
|My Last Duchess by Robert Browning|
Table 2: Ekphrastic Dataset 15 Topic Model, Topic 3 Highlighted
The only remaining topic that includes the word love fairly high in its keyword distribution is Topic 4, which includes the following terms: portrait, monument, foreman, felt, woman, monuments, box, press, bacall, detail, young, thick, crimson, instrument, hotel, compartment, picked, cornell, Europe, lovers. As you can see from the network diagram below, none of the poems with high probabilities of containing Topic 3 are included in the Topic 4 distribution.
Table 3: Ekphrastic Dataset 15 Topic Model, Topic 4 Highlighted
Equally interesting, the poems with the highest proportions of Topic 4 are authored by female poets. Certainly, more poems by men include significant proportions of Topic 4 than poems by women include significant proportions of Topic 3; however, there are striking and salient points to be made about the contrasting networks:
Poems with highest proportion of Topic 4
|“Utopia Parkway” after Joseph Cornell’s Penny Arcade Portrait of Lauren Bacall, 1945–46 by Lynda Hull|
|Canvas and Mirror by Evie Shockley|
|Portrait of Madame Monet on Her Deathbed by Mary Rose O’Reilley|
|Internal Monument by G. C. Waldrup|
|The Uses of Distortion by Caroline Crumpacker|
|Joseph Cornell, with Box by Michael Dumanis|
|Drawing Wildflowers by Jorie Graham|
|The Eye Like a Strange Balloon Mounts Toward Infinity by Mary Jo Bang|
|Visiting the Wise Men in Cologne by J.P. White|
|Rhyme by Robert Pinsky|
|The Street by Stephen Dobyns|
|The Portrait by Stanley Kunitz|
|“Picture of a 23-Year-Old Painted by His Friend of the Same Age, an Amateur” by C.P. Cavafy|
|Portrait in Georgia by Jean Toomer|
|For the Poem Paterson [1. Detail] by William Carlos Williams|
|The Dance by William Carlos Williams|
|Late Self-Portrait by Rembrandt by Jane Hirshfield|
|Sea Life in St. Mark’s Square by Mary O’Donnell|
|Washington’s Monument, February, 1885 by Walt Whitman|
|Still Life by Jorie Graham|
|Still Life by Tony Hoagland|
|The Family Photograph by Vona Groarke|
|The Corn Harvest by William Carlos Williams|
|Portrait of a Lady by T. S. Eliot|
|Portrait d’une Femme by Ezra Pound|
This impressionistic overview of the ekphrastic dataset, prompted by exploring a network graph of the relationships between topics and poems, is a first step. It is enough, perhaps, to formulate a new hypothesis about the difference between “love” and “lovers” in ekphrastic poetry, or to lend further support to the growing sense that there is a much broader range of kinds of attraction and kinship (a range inclusive of both competitive and kindred discourses) than previous theorizations of the genre have taken into account. The network visualization does more than suggest that there are two very different discourses regarding love and affection in ekphrastic verse; it also suggests possible poems to read closely to see what those differences might be and whether they are worth pursuing further. Through the use of networked relationships between topics and documents, we begin with lists of poems in which the discourse of affinity, affection, and desire, as courtship or as partnership, can be further explored through close readings.
Meeting Edward Tufte’s claim that evidence should be both beautiful and useful, the NodeXL network diagrams of LDA data are a step toward developing methods of evaluating and exploring models of figurative language that do not necessarily fit the same criteria for models of non-figurative texts.
Great post as always! I am curious how you decided your cutoff for keeping the edges of "poems with a statistically significant portion of its language from" particular topics, and which edges you decided to throw out. Also, just a quick suggestion: making the stronger edges lighter rather than darker was visually confusing. Maybe reverse that or change the color scheme somehow?
You ask a great question. In this case, there are several "trimmings" built into the process. In the first layer, the script that I use for converting the model into an Excel workbook with individual spreadsheets (one for document distributions, one for word distributions, one for word probability distributions, one for topic to topic symmetrized K-L divergence, one for document to document symmetrized K-L divergence, and one for document to topic probabilities, among a few others) allows me to establish a threshold that the estimated topic proportion for a document must meet in order to be included in the spreadsheet. In this instance, that threshold was .1. In the portion of the post you're referring to, I list all the documents with an estimated topic proportion greater than 10%. In this case, that's 35 documents out of 276. That's not to say that documents with lower proportions might not also render useful results, but I feel fairly confident that my top 35 are sound. In fact, the list is cut and pasted almost directly from my spreadsheet, so the poems are listed in order of their probability score, from highest (.8) to lowest (.1). The same is true for Topic 4, meaning that the further down the list you go, the lower the proportion of the poem that is estimated to come from that topic.
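For anyone who wants to try the same kind of trimming, here is a minimal sketch of the idea: keep only the document-to-topic pairs whose estimated proportion is greater than 0.1, and list them from strongest to weakest. It is not the conversion script mentioned above. The names `topic_edges` and `poems_for_topic`, the dictionary layout, and the sample proportions are all made up for illustration.

```python
def topic_edges(doc_topics, threshold=0.1):
    """Return (document, topic, proportion) edges above the threshold.

    `doc_topics` is assumed to map each document name to a dict of
    {topic number: estimated proportion}. Edges are sorted strongest first.
    """
    edges = [
        (doc, topic, proportion)
        for doc, proportions in doc_topics.items()
        for topic, proportion in proportions.items()
        if proportion > threshold
    ]
    return sorted(edges, key=lambda edge: edge[2], reverse=True)


def poems_for_topic(doc_topics, topic, threshold=0.1):
    """List (document, proportion) pairs for one topic, highest first."""
    return [
        (doc, proportion)
        for doc, edge_topic, proportion in topic_edges(doc_topics, threshold)
        if edge_topic == topic
    ]


# Invented sample proportions, not values from the ekphrastic model.
sample = {
    "poem_a": {3: 0.80, 4: 0.05},
    "poem_b": {3: 0.12, 4: 0.60},
    "poem_c": {3: 0.02, 4: 0.09},
}

edges = topic_edges(sample)
topic_3_poems = poems_for_topic(sample, 3)
```

Each surviving tuple can serve as one row of an edge list for a bimodal (document-to-topic) graph, and the per-topic list corresponds to the "poems with highest proportion" lists in the post.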
Regarding the graph's design, I agree. The difficulty is that if you switch the colors (dark blue=strong, light blue=weak) then you completely overwhelm the weaker ties, and they can no longer be seen. Though the weaker ties are… well… weaker, they aren't insignificant. As I mentioned, proportions lower than .1 were trimmed out. Perhaps it would make more sense to use just one visual cue (either thickness of edge or color saturation); however, that also limits the ability to read the graph. For someone who is color blind, the use of color alone would inhibit understanding the degree of relatedness. Similarly, having a single color with variations in the thickness of the edges makes for a much more difficult graph to read, especially if, when you print it out in black and white, you still want to be able to distinguish between ties that overlay one another. Admittedly, yes, it does create a sense of sensory confusion, but I went with what I felt was the lesser of two evils in this case. HOWEVER, in my document to document graph where I'm able to do really interesting things with clustering algorithms (and I'm really excited about how well this has been working), the darker=stronger works better because the position of the documents is more forcefully determined by symmetric K-L divergence.
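Since symmetric K-L divergence drives the document to document graph, here is a minimal sketch of one common way to compute it for two topic distributions. It assumes the symmetrized value is the average of the two directed divergences (some tools use the sum instead), and it adds a small smoothing constant so zero proportions do not break the logarithm. The function names and the `epsilon` value are illustrative assumptions, not the ones used in my workbook script.

```python
import math


def smooth(dist, epsilon=1e-12):
    """Add a tiny constant to every entry and renormalize to sum to 1."""
    total = sum(dist) + epsilon * len(dist)
    return [(value + epsilon) / total for value in dist]


def kl_divergence(p, q):
    """Directed K-L divergence D(p || q) for two strictly positive lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def symmetrized_kl(p, q, epsilon=1e-12):
    """Average of D(p || q) and D(q || p) after smoothing (an assumed convention)."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    p, q = smooth(p, epsilon), smooth(q, epsilon)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


# Invented topic distributions for three hypothetical documents.
doc_a = [0.8, 0.1, 0.1]
doc_b = [0.1, 0.8, 0.1]
doc_c = [0.7, 0.2, 0.1]
```

With these made-up distributions, `doc_a` and `doc_c` come out closer to each other than `doc_a` and `doc_b`, so a layout based on this measure would place them nearer together.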
At some point in the not too distant future, I plan to post some of these graphs on the NodeXL gallery site. If you download the free NodeXL plug-in/template and then download my network graphs, you can play around with it, too. I'm eager to see what other people can do with my graphs.
Thanks! This adds a lot to the original post; I definitely look forward to seeing where this goes. I asked about your network because I've recently been doing similar work, and Ted Underwood apparently has as well. We should all chat about it sometime.
Will you be coming to the NEH-sponsored Topic Modeling Workshop in November? I'm going to be there. I don't know if Ted Underwood will be there, but I hope so. Perhaps, we will have a chance to get together and talk–and maybe even find something to collaborate on–there. Thank you for your questions and your comments, which have been really helpful as I'm working on revisions.
I'll be there! I don't think Ted will, but we can definitely chat then, would love to discuss this project more.