"Methods of extracting keywords and topics from text collections" (Maciej Eder)

The Centre for Digital Humanities and Information Society invites all students and employees to the workshops "Methods of extracting keywords and topics from text collections" by Maciej Eder (Institute of Polish Language, Polish Academy of Sciences). The workshops will take place in the Jakobi 2-105 computer class on:

  • Tuesday 2nd May 12-16
  • Wednesday 3rd May 12-16
  • Thursday 4th May 10-12

Attending all workshops earns 1 ECTS. To attend, please fill in the registration form.


"Methods of extracting keywords and topics from text collections"

The workshop will offer an introduction to methods of extracting information from collections of written texts. It will start with keyword analysis and the different methods of extracting keywords, followed by a discussion of topic modeling, a technique for extracting groups of semantically related words from text collections. Keyword analysis in its different flavors (log-likelihood (LL) keywords, Zeta, tf-idf) identifies the most relevant words in a collection of texts, or between two subcorpora, or even between two texts, by comparing the frequencies of particular words and determining whether the differences are statistically significant. Topic modeling, on the other hand, provides a more comprehensive search for words that exhibit some semantic similarity, as defined by their textual contexts, which makes it possible to discover the latent thematic structure of the documents in question.
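As a rough illustration of the log-likelihood flavor of keyword analysis described above, the sketch below scores each word of a target text against a reference text. It assumes the two-term form of the log-likelihood (G2) statistic commonly used in corpus linguistics; the function names, the whitespace tokenization and the toy texts are hypothetical, and this is not necessarily the exact computation performed by AntConc or by the workshop materials.

```python
import math
from collections import Counter


def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Two-term G2 keyness of one word, given its counts in two corpora."""
    # Expected counts if the word were equally frequent in both corpora.
    combined = freq_a + freq_b
    expected_a = total_a * combined / (total_a + total_b)
    expected_b = total_b * combined / (total_a + total_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


def keywords(target_tokens, reference_tokens, top_n=5):
    """Words over-represented in the target, ranked by G2 (highest first)."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    n_target, n_reference = len(target_tokens), len(reference_tokens)
    scored = []
    for word, freq in target.items():
        ref_freq = reference[word]
        # Keep only words whose relative frequency is higher in the target.
        if freq / n_target > ref_freq / n_reference:
            scored.append((word, log_likelihood(freq, n_target, ref_freq, n_reference)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # Toy texts, invented purely for illustration.
    target_text = "whale whale whale sea ship".split()
    reference_text = "sea ship ship land land".split()
    for word, score in keywords(target_text, reference_text):
        print(word, round(score, 2))
```

A higher score means a larger frequency difference relative to what chance would predict; in practice the scores are compared against a chi-squared critical value, and the corpora are far larger than this toy example.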

The workshop will be divided into a theoretical introduction, followed by a hands-on session. The tools used include AntConc (a freeware standalone tool that implements keyword extraction) and the R programming environment, which will be used for topic modeling.

Du, K., Dudar, J. and Schöch, C. (2022). Evaluation of Measures of Distinctiveness. Classification of Literary Texts on the Basis of Distinctive Words. Journal of Computational Literary Studies, 1(1), https://doi.org/10.48694/jcls.102

Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., Hoiberg, D., et al. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331(6014): 176–82, https://www.science.org/doi/10.1126/science.1199644

Jockers, M. L. (2013). Macroanalysis: Digital Methods and Literary History. University of Illinois Press.

Paquot, M. and Bestgen, Y. (2009). Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction. In Jucker, A. H., Schreier, D. and Hundt, M. (eds). Amsterdam / New York: Rodopi, pp. 247–69.

Goldstone, A. and Underwood, T. (2012). What can topic models of PMLA teach us about the history of literary scholarship? Journal of Digital Humanities, 2(1), https://journalofdigitalhumanities.org/2-1/what-can-topic-models-of-pmla-teach-us-by-ted-underwood-and-andrew-goldstone/
