Wordlist generator updated to replace diacritics

— Posted in Taal & Literatuur by

One of the nicest things about the online corpus tools I develop and maintain is that I get questions and suggestions from a whole range of researchers around the globe. This time, someone asked me to include an option to remove the Spanish inverted punctuation marks at the start of a word or sentence (¿ and ¡), and an option to include or exclude diacritics, as in ø and ö, and characters such as ß, the German 'sharp' or 'double' S. The script already filtered out these characters, but in a rather crude way: a word like schön simply became schon, and äh even became a single letter, h, which is strange (and incorrect) to see in the results.

Inverted question mark and exclamation mark

With this update, characters can be either kept (schön stays schön), or replaced (schön becomes schon). The update is live, so if you're interested, head over to https://www.reuneker.nl/files/wordlist.
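For readers who want to reproduce the keep/replace behaviour offline, here is a minimal Python sketch. It is not the code behind the online tool; the function names (`replace_diacritics`, `strip_inverted_marks`, `normalise`) and the `SPECIAL` mapping are my own assumptions for illustration. It relies on Unicode decomposition to strip accents, with an explicit mapping for characters such as ß and ø that do not decompose.

```python
import unicodedata

# Assumed mapping for characters without a Unicode decomposition.
SPECIAL = {"ß": "ss", "ø": "o", "Ø": "O"}


def replace_diacritics(word: str) -> str:
    """Replace accented characters with their base letters (schön -> schon)."""
    out = []
    for ch in word:
        if ch in SPECIAL:
            out.append(SPECIAL[ch])
            continue
        # NFD splits 'ö' into 'o' plus a combining diaeresis; drop the mark.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def strip_inverted_marks(text: str) -> str:
    """Remove the Spanish inverted question and exclamation marks."""
    return text.replace("¿", "").replace("¡", "")


def normalise(word: str, keep_diacritics: bool = True, strip_inverted: bool = True) -> str:
    """Apply the two options independently, as hypothetical tool settings."""
    if strip_inverted:
        word = strip_inverted_marks(word)
    if not keep_diacritics:
        word = replace_diacritics(word)
    return word


print(normalise("schön"))                         # kept as-is
print(normalise("schön", keep_diacritics=False))  # diacritic replaced
print(normalise("¿Qué", keep_diacritics=False))   # mark stripped, accent replaced
```

Replacing the character by its base letter, rather than deleting it, is what keeps äh from collapsing into a single h.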

Updates for the N-gram generator

— Posted in Taal & Literatuur by

Once in a while I receive emails from researchers all over the world with thanks and/or suggestions for the scripts I provide online, such as the frequency list and n-gram generators. For the latter tool, I had a nice email conversation with a researcher from overseas, which led to the following enhancements and updates. I really enjoy these kinds of things, so if you have any suggestions or feedback, you know where to find me.

  • Slight efficiency rewrite of output rendering. (2024-01-26)
  • Added feature for respecting or ignoring sentence boundaries. (2024-01-25)
  • Added feature for including or excluding numbers. (2024-01-25)
  • Added top limits above 1,000 (2,000, 3,000, 4,000, 5,000, 10,000). (2024-01-25)
  • Added feature for (virtually) unlimited results. (2024-01-22)
  • Added feature for unigrams. (2024-01-22)
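
To make these options concrete, here is a rough Python sketch of an n-gram counter with the same switches. It is an illustration only, not the code of the online generator: the function name `ngrams`, its parameters, and the very simple sentence splitting on ., ! and ? are all assumptions of mine.

```python
import re
from collections import Counter


def ngrams(text, n=2, respect_sentences=True, include_numbers=True, top=None):
    """Count n-grams; returns (ngram, frequency) pairs, most frequent first."""
    # When boundaries are respected, n-grams never span two sentences.
    # Assumption: a sentence ends at '.', '!' or '?' (naive, e.g. for "3.5").
    segments = re.split(r"[.!?]+", text) if respect_sentences else [text]
    counts = Counter()
    for segment in segments:
        tokens = re.findall(r"\w+", segment.lower())
        if not include_numbers:
            tokens = [t for t in tokens if not t.isdigit()]
        # n=1 yields unigrams; a segment shorter than n yields nothing.
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    # top=None means no limit on the number of results.
    return counts.most_common(top)


sample = "The cat sat. The cat ran."
print(ngrams(sample, n=2))                           # boundaries respected
print(ngrams(sample, n=2, respect_sentences=False))  # also counts "sat the"
print(ngrams(sample, n=1, top=2))                    # two most frequent unigrams
```

With boundaries respected, the bigram "sat the" is not counted, because it straddles two sentences; ignoring boundaries adds it to the list.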