[NB: This post is the continuation of a conversation begun on Ted Underwood’s blog under the post “A touching detail produced by LDA,” in which he demonstrates that there is an overlay between the works of the Shelley/Godwin family and a topic which includes the terms mind / heart / felt. Rather than hijack his post, I’m responding here to questions having to do more with process than content; however, to understand fully the genesis of this conversation, I encourage you to read Ted’s post and the comments there first.]
I appreciate your response because it is making me think carefully about what I understand LDA “topics” to represent. I’m not sure that I’m on board with thinking of topics in terms of discourse or necessarily “ways” of writing. Honestly, I’m not trying to be difficult here; rather, I’m trying to parse for myself what I mean when I talk about my expectations that particular terms “should” form the basis for a highly probable topic. It seems to me that what one wants from topic modeling are lexical themes: in other words, lexical trends over the course of particular chunks of text. I’m taking to heart here Matt Jockers’s recent post on the LDA buffet, in which he articulates the assumption that LDA analysis makes: that the world is composed of a certain number of topics (and in Mallet, we define how many topics there are when we run the topic modeling application). As a result, when I run a topic model analysis in Mallet, I am looking at the way graphemes (because the written symbol, of course, is divorced from its meaning) relate to other similar graphemes. So, though topics may not have a one-to-one semantic relationship with particular volumes as the “main topic” or “supporting topics,” one might reasonably expect that a text with a 90% probability of including a list of graphemes from an LDA topic lexicon (for lack of a better word) would correspondingly address a thematic topic which depends heavily on a closely related vocabulary. Similarly, the frequent use of words in a topic lexicon increases the probability that the LDA topic, through the repetition of those words, carries semantic weight, though the degree to which this is the case wouldn’t likely be determined by that initial topic probability.
I’m chasing the rabbit down a hole here, but I do so for the purpose of agreeing with your earlier claim that the kinds of results we get, their reliability, and their usefulness seem to be largely determined by the kinds of questions we’re asking in the first place. I agree that when we use LDA to describe texts, that’s fundamentally different from using it to test assumptions/expectations. In my research, I have attempted to draw very clear distinctions between when I am testing assumptions about the kinds of language that dominate a particular genre of poetry and when I am using LDA to generate a list of potential word groups that could then be used to describe poetic trends. I see those as two very different projects. When I’m working with poetry and specifically with ekphrasis, I am testing what people who write about this particular genre assume to be true: that the word “still,” or variations of it, will be one of the most commonly used words across all ekphrastic texts and used at a higher rate than in any other genre of poetry. It’s true that the word “still” could be a semantic topic in many other kinds of poetry; however, what we’re trying to get at is that a group of words closely allied with the word “still” will be the most dominant and recurring trend across all ekphrastic verse. The next determination, then, to be made is whether or not that discovery carries semantic weight. If still, stillness, death, breathless, etc. are not actually a dominant trend, have we overstated the case?
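To make the assumption being tested a little more concrete, here is a minimal sketch of the simplest possible version of that check: counting how often a “still” word family appears per 1,000 tokens in each genre. This is a plain frequency count, not LDA, and everything in it is an assumption for illustration: the function names, the `(genre, text)` input shape, the crude tokenizer, and the word family itself (taken from the list above).

```python
import re
from collections import defaultdict

# Hypothetical word family, taken from the terms listed in the post.
STILL_FAMILY = {"still", "stillness", "death", "breathless"}


def tokenize(text):
    """Crude tokenizer: lowercase alphabetic runs only (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def rate_per_thousand(poems, family=STILL_FAMILY):
    """Return {genre: family-word hits per 1,000 tokens}.

    poems is an iterable of (genre, text) pairs; that shape is hypothetical.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for genre, text in poems:
        tokens = tokenize(text)
        totals[genre] += len(tokens)
        hits[genre] += sum(1 for token in tokens if token in family)
    # Skip genres with no tokens to avoid dividing by zero.
    return {g: 1000 * hits[g] / totals[g] for g in totals if totals[g]}


if __name__ == "__main__":
    # Toy lines invented for the example, not drawn from any real dataset.
    toy = [
        ("ekphrastic", "Still the urn stands still"),
        ("other", "The river runs and runs"),
    ]
    print(rate_per_thousand(toy))
```

A count like this only speaks to the “higher rate than any other genre” half of the claim; whether the words cluster together as a dominant trend is the part the topic model is being asked to show.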
It seems that what you’re saying (and please intervene if I’m not articulating this correctly), which I tend to agree with, is that “chunk size” should be something determined by the questions being asked, and that stating the way in which data has been chunked reflects the types of results we want to get in return. Taking this into consideration, though, certainly has helped the way I position what I’m doing. For me it is significant to chunk at the level of individual poems; however, were I to change my question to something like, “Which poets trend more toward ekphrastic topics than others?”, then based on what we’re saying here, that question seems to require chunking volumes rather than individual poems.
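As a rough illustration of the chunking choice described above, the same corpus can be turned into model “documents” in two ways depending on the question. This is only a sketch under stated assumptions: the record keys (`poet`, `volume`, `title`, `text`) and both function names are hypothetical, and the output is simply a mapping from a chunk identifier to the text that would be handed to the topic modeling tool.

```python
from collections import defaultdict


def chunk_by_poem(records):
    """One document per poem, for questions about individual poems."""
    return {
        (r["poet"], r["volume"], r["title"]): r["text"]
        for r in records
    }


def chunk_by_volume(records):
    """One document per volume, for questions about poets or collections."""
    grouped = defaultdict(list)
    for r in records:
        grouped[(r["poet"], r["volume"])].append(r["text"])
    # Join the poems of each volume into a single chunk of text.
    return {key: "\n".join(texts) for key, texts in grouped.items()}


if __name__ == "__main__":
    # Placeholder records invented for the example.
    records = [
        {"poet": "Poet A", "volume": "Vol 1", "title": "First", "text": "one"},
        {"poet": "Poet A", "volume": "Vol 1", "title": "Second", "text": "two"},
        {"poet": "Poet B", "volume": "Vol 2", "title": "Third", "text": "three"},
    ]
    print(len(chunk_by_poem(records)), "poem-level chunks")
    print(len(chunk_by_volume(records)), "volume-level chunks")
```

The point of keeping both functions side by side is that the chunking decision becomes an explicit, reportable step rather than an accident of how the files happened to be stored.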
In other news, test models on the whole 4500 poems in my dataset, which is chunked at the level of individual poem, yielded much more promising initial results than we thought we would get. I would guess that it has something to do with the number of topics we assign when we run the model, and maybe one of the other ways forward is to talk about the threshold number of topics we need to assign in order to garner meaningful results from the model. (Obviously people like Matt and Travis have hands-on experience with this; however, I’m wondering if the type of question we’re asking should have a definable impact on how many topics we generate for the different types of tests….) Hopefully, in the near future I’ll be able to share some of those very preliminary results… but I’m still in the midst of refining my queries and configuring my data.
Again, I’m engaged because I find what you’re doing both relevant and useful, and I think that having these mid-investigation conversations does help to inform the way ahead. As you mention, perhaps many of these kinds of questions are answered in Matt Jockers’s book, but it is unlikely I’ll be able to use that before this first iteration of my project is done in the next month or two. I believe that hearing anecdotal conversation about the low-level kinds of tests people are playing with really does help others along in their own work since we’re still figuring out what exactly we can do with this tool.