"Methods of extracting keywords and topics from text collections" (Maciej Eder)

The Centre for Digital Humanities and Information Society invites all master's students, doctoral students, and University of Tartu staff to the workshops "Methods of extracting keywords and topics from text collections", led by Maciej Eder (Institute of Polish Language, Polish Academy of Sciences). The workshops take place in the computer lab at Jakobi 2–105 at the following times:

Tuesday, 2 May, 12:00–16:00
Wednesday, 3 May, 12:00–16:00
Thursday, 4 May, 10:00–12:00

Participants who attend all of the workshops can earn 1 ECTS credit. To take part, please fill in the registration form.

---

"Methods of extracting keywords and topics from text collections"

The workshop will offer an introduction to methods of extracting information from collections of written texts. It will start with keyword analysis and different methods of extracting keywords, followed by a discussion of topic modeling, a technique for extracting groups of semantically related words from text collections. Keyword analysis in its different flavors (log-likelihood (LL) keywords, Zeta, tf-idf) identifies the most relevant words in a collection of texts, between two subcorpora, or even between two texts, by comparing the frequencies of particular words and determining whether the differences are statistically significant. Topic modeling, on the other hand, provides a more comprehensive search for words that exhibit some semantic similarity, as defined by their textual contexts, which makes it possible to discover the latent thematic structure of the documents in question.
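As a rough illustration of the frequency comparison described above, the following Python sketch ranks the words of one small text by log-likelihood (G2) keyness against another. It is only a minimal sketch under stated assumptions: the function name `log_likelihood_keywords` and the two toy sentences are invented for illustration, and the code is not taken from AntConc or from the workshop materials.

```python
import math
from collections import Counter


def log_likelihood_keywords(target_tokens, reference_tokens):
    """Rank words over-represented in the target corpus by log-likelihood (G2).

    Hypothetical helper written for illustration only.
    """
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    n_target = sum(target.values())
    n_reference = sum(reference.values())

    scores = {}
    for word, a in target.items():
        b = reference.get(word, 0)
        # Expected counts if the word were equally frequent in both corpora.
        e_target = n_target * (a + b) / (n_target + n_reference)
        e_reference = n_reference * (a + b) / (n_target + n_reference)
        g2 = 2 * a * math.log(a / e_target)
        if b > 0:
            g2 += 2 * b * math.log(b / e_reference)
        # Keep only words that are relatively more frequent in the target.
        if a / n_target > b / n_reference:
            scores[word] = g2
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data (assumed, not from any real corpus).
target_text = "the whale swam and the whale dived under the ship".split()
reference_text = "the ship sailed and the crew slept on the ship".split()

keywords = log_likelihood_keywords(target_text, reference_text)
for word, score in keywords:
    print(f"{word}\t{score:.2f}")
```

With texts this small the scores carry no statistical weight; in practice the same comparison is run over two full subcorpora, and the G2 values are read against a chi-squared threshold.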

The workshop will be divided into a theoretical introduction followed by a hands-on session. The tools used include AntConc (a freeware standalone tool that implements keyword extraction) and the R programming environment, which will be used for topic modeling.

Du, K., Dudar, J. and Schöch, C. (2022). Evaluation of Measures of Distinctiveness. Classification of Literary Texts on the Basis of Distinctive Words. Journal of Computational Literary Studies, 1(1), https://doi.org/10.48694/jcls.102

Michel, J.-B., Shen, Y. K., Aiden, A. P., Veres, A., Gray, M. K., Pickett, J. P., Hoiberg, D., et al. (2011). Quantitative analysis of culture using millions of digitized books. Science, 331(6014): 176–82, https://www.science.org/doi/10.1126/science.1199644

Jockers, M. L. (2013). Macroanalysis: Digital Methods and Literary History. University of Illinois Press.

Paquot, M. and Bestgen, Y. (2009). Distinctive words in academic writing: A comparison of three statistical tests for keyword extraction. In Jucker, A. H., Schreier, D. and Hundt, M. (eds). Amsterdam / New York: Rodopi, pp. 247–69.

Goldstone, A. and Underwood, T. (2012). What can topic models of PMLA teach us about the history of literary scholarship? Journal of Digital Humanities, 2(1), https://journalofdigitalhumanities.org/2-1/what-can-topic-models-of-pmla-teach-us-by-ted-underwood-and-andrew-goldstone/

