N-gram generator

You can use this page to generate n-grams for your texts. This site was made by Alex Reuneker. For a short demonstration and how-to, see https://youtu.be/4aRD_NwuHW4. For questions, see the contact details at https://www.reuneker.nl. If you use this site for your research, please cite it as follows:

@misc{reuneker_ngram,
   author = {Alex Reuneker},
   title = {{N}-gram generator (version )},
   howpublished = {\url{https://www.reuneker.nl/ngram}},
   year = {2019},
   note = {[Accessed ...]}
}

Upcoming: working on chunking for parsing very large text files (>250MB; testing involves a 2.3GB plain text file). It's looking promising. Also working on a paper on the n-gram generator for a gentle introduction and to refer to.

Steps

Copy a text (from a website, a book, a larger corpus, et cetera).
Paste the text into the input area below. You don't have to remove weird characters, tags, white spaces and new lines; the script does it for you.
Set 'ngram' to the desired number of words or leave it at 2 (bigrams), and set the number of results wanted (or leave it at 50). If you're going to sort on probability (see 'explanation'), it can be useful to set a minimal frequency for the n-grams included in the list. Sentence boundaries are respected by default, meaning that a sequence like '[...] list. Sentence [...]' above is not processed. If you choose to ignore sentence boundaries, the example would count as the bigram 'list sentence'.
Click 'Generate ngrams' and wait a bit.
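
To make the effect of the sentence-boundary setting concrete, here is a minimal sketch in Python. It is not the site's own JavaScript; the function name `ngrams`, the parameter `respect_sentences`, and the simple rules for what counts as a word or a sentence end are assumptions made for illustration only.

```python
import re
from collections import Counter

def ngrams(text, n=2, respect_sentences=True):
    """Count n-grams of n words (hypothetical helper, not the site's code)."""
    # Assumption: a sentence ends at '.', '!' or '?'.
    chunks = re.split(r"[.!?]+", text) if respect_sentences else [text]
    counts = Counter()
    for chunk in chunks:
        # Assumption: a word is a run of letters, optionally with inner
        # apostrophes, compared in lowercase.
        words = re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", chunk.lower())
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts

text = "Set a minimal frequency for the list. Sentence boundaries are respected."
print("list sentence" in ngrams(text))                         # False
print(ngrams(text, respect_sentences=False)["list sentence"])  # 1
```

With boundaries respected, each sentence is processed separately, so no n-gram can span the full stop. With boundaries ignored, the whole text is treated as one long sequence of words.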

Input and settings

Choose preferred settings, or leave at default.

Settings: N-gram, Results per page, Sort, Minimal frequency, Sentence boundaries, Numbers, Possessive s ('s), Diacritics.

Paste a text to analyze below.

Words to exclude (space separated):

Note that loading a large text file may take a while, and no sign of processing will be visible. Please be patient, and take into account your computer's specifications, as all processing is done client-side. On my laptop, loading a 60MB text file takes 20-25 seconds.

The sample text is Sir Arthur Conan Doyle's A Scandal in Bohemia taken from Project Gutenberg.

Results

Results will be presented here after you click 'generate n-grams'. Please realise that large texts may take a while and may cause your browser to slow down. In such cases, it is wise not to set 'results per page' to 'unlimited'.

About

The n-gram function was written in vanilla JavaScript, and your text is not uploaded to any server. Your computer itself does all the work. Small texts are processed very quickly. Longer texts take more time: getting all bigrams from the King James Bible (4.2MB; almost 800,000 words) took my laptop approximately 90 seconds with multiple tabs and other applications open. Please note that retrieving a long or (virtually) unlimited list of results may slow down or even crash your browser due to memory limitations of your computer and browser architecture.

Short explanation

An n-gram is just a sequence of n words. So, 'cat' is a unigram, 'my cat' is a bigram, and 'my sleepy cat' is a trigram. There are more interesting things to know about n-grams. A short explanation of frequencies, probabilities and my (rather unusual) 'strength' measure follows.

Probabilities

This site not only generates all n-grams and their frequencies from a text you provide, but also calculates the conditional probability of the n-gram given the first word. Let's take the text 'I see a cat. I see a cat and a dog.' The n-grams 'I see', 'see a' and 'a cat' all occur twice and thus have a frequency of 2. The word 'I' occurs twice in the text and is followed by 'see' both times. I am thus quite certain that when I encounter the word 'I', the next word will be 'see'. The probability is calculated as the number of times 'I see' occurs, divided by the number of times 'I' occurs. 'I see' occurs two times and 'I' occurs two times, so 2/2=1. If we do the same for 'see' in 'see a', which both also occur two times, we get the same result: 2/2=1. But take 'a' in 'a cat'. We have already seen that 'a cat' occurs twice, but 'a' occurs three times ('a cat', 'a cat' and 'a dog'). We therefore divide 2 (the frequency of 'a cat') by 3 and get 2/3=0.67. This is the probability that, given the word 'a', the next word is 'cat'. The fun thing is that the remaining 1-0.67=0.33 is exactly the probability of 'a' being followed by 'dog': 'a dog' occurs only once, while 'a' occurs three times, so 1/3=0.33.
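
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation as described, not the site's own code; the names `bigrams`, `unigrams` and `probability` are chosen for illustration, and the example text is split into its two sentences by hand so that no bigram crosses the sentence boundary.

```python
from collections import Counter

# The example text, lowercased and pre-split into its two sentences.
sentences = [
    ["i", "see", "a", "cat"],
    ["i", "see", "a", "cat", "and", "a", "dog"],
]

bigrams = Counter()   # frequency of each pair of adjacent words
unigrams = Counter()  # frequency of each single word
for words in sentences:
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def probability(first, second):
    """Frequency of the bigram divided by the frequency of its first word."""
    return bigrams[(first, second)] / unigrams[first]

print(round(probability("i", "see"), 2))  # 1.0
print(round(probability("a", "cat"), 2))  # 0.67
print(round(probability("a", "dog"), 2))  # 0.33
```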

Strength

While there is lots more to n-grams than what I wrote above, I've added one more feature that I can introduce without too much theorizing. One of the problems of the aforementioned probabilities is that infrequent n-grams involving low-frequency words can have high probabilities. Take 'unladylike girls' in Alcott's 'Little Women'. The n-gram occurs only once, but the adjective 'unladylike' also occurs only once. This means the probability of 'girls' given 'unladylike' is 1. Now take 'Mrs. March', which occurs 141 times in the novel. There are other 'Mesdames' (the plural of Mrs.), like 'Mrs. Gardiner' and 'Mrs. King', but none is as frequent as 'Mrs. March'. The probability, however, is 'only' 0.59, because of the other mesdames. I nevertheless find this n-gram more interesting than an n-gram that has a high probability but occurs only once or a few times.

Therefore, the strength measure I introduce here takes both frequency and probability into account. To do so, it takes the natural logarithm of the n-gram's frequency and multiplies it by its probability. The resulting number isn't really meaningful in itself, but only relative to that of the other n-grams, which makes it great for sorting and finding those n-grams that have the right balance between frequency and probability. Check, for instance, Dickens' 'A Christmas Carol'. Sorting n-grams on frequency places 'in the' on top, sorting on probability places 'piece of' on top, and sorting on strength places 'Tiny Tim' at number 1. As for informativeness, I'd take 'Tiny Tim'.
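
As a small illustration of the measure described above (natural logarithm of the frequency, multiplied by the probability), here is a sketch in Python using the two 'Little Women' examples. The function name `strength` and the list layout are assumptions for illustration; the frequencies and probabilities are the ones given in the text.

```python
import math

def strength(frequency, probability):
    """Natural logarithm of the n-gram's frequency times its probability."""
    return math.log(frequency) * probability

# (n-gram, frequency, probability) as given in the explanation above
examples = [
    ("unladylike girls", 1, 1.0),
    ("Mrs. March", 141, 0.59),
]

# Sort from strongest to weakest
ranked = sorted(examples, key=lambda e: strength(e[1], e[2]), reverse=True)
for ngram, freq, prob in ranked:
    print(ngram, round(strength(freq, prob), 2))
# Mrs. March 2.92
# unladylike girls 0.0
```

Because the natural logarithm of 1 is 0, any n-gram that occurs only once gets a strength of 0, however high its probability. That is exactly why 'Mrs. March' outranks 'unladylike girls' here.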

Updates/version history

Version 1.3.1.9: Option added to either remove or replace diacritics (schön→schn; schön→schon). This also fixes some small inconsistencies in the results in comparison to, for instance, the results of the Wordlist generator. No more single letters, caused by e.g. removing diacritics from a German word like äh, which simply became h. (2025-01-06)
Version 1.3.1.8: Multiple bug fixes, including a severe bug causing deviating results up to a factor of 3. Therefore, be sure to use the latest version of this page. The script may temporarily be slower because of this. This requires work on script efficiency. Added option to remove possessive 's. Added general statistics (word totals, TTR). (2024-04-05)
Version 1.3.1.7: Added feature for loading text file, especially useful for large texts/corpora. (2024-02-03)
Version 1.3.1.6: Fixed maximum in page numbering. (2024-02-01)
Version 1.3.1.5: Added sample text (Sherlock Holmes). (2024-02-01)
Version 1.3.1.4: Improved algorithm for filtering words to exclude (so their deletion does not render incorrect n-grams). Added processing timer. Added version numbering and links to previous versions. (2024-02-01)
Version 1.3.1.3: Added option to exclude not only numbers, but also words containing numbers. (2024-01-29)
Version 1.3.1.2: Slight efficiency rewrite of output rendering. (2024-01-26)
Version 1.3.1.1: Added feature for respecting or ignoring sentence boundaries. (2024-01-25)
Version 1.3.1.0: Added feature for including or excluding numbers. (2024-01-25)
Version 1.3.0: Added top limits above 1,000 (2,000, 3,000, 4,000, 5,000, 10,000) to respect or ignore sentence boundaries. (2024-01-25)
Version 1.2.1: Added feature for (virtually) unlimited results. (2024-01-22)
Version 1.2.0: Added feature for unigrams. (2024-01-22)
Version 1.1.2: Added feature to export results to CSV. (2022-08-31)
Version 1.1.1: Added feature to exclude words before processing text. (2022-08-31)
Version 1.1.0: Fixed newline problem resulting in n+1-grams. (2021-01-17)
Version 1.0.1: Added strength measure. (2019-12-11)
Version 1.0.1: Added conditional probabilities. (2019-12-10)
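
The difference between removing and replacing diacritics (version 1.3.1.9) can be illustrated with a short Python sketch. This is not the site's own code; the function names `remove_diacritics` and `replace_diacritics` are hypothetical, and the approach via Unicode normalization is an assumption about one way to get the behaviour described.

```python
import unicodedata

def has_diacritic(char):
    """True if the character decomposes into a base letter plus a combining mark."""
    return any(unicodedata.combining(c) for c in unicodedata.normalize("NFD", char))

def remove_diacritics(word):
    """Drop every letter that carries a diacritic (schön -> schn)."""
    composed = unicodedata.normalize("NFC", word)
    return "".join(ch for ch in composed if not has_diacritic(ch))

def replace_diacritics(word):
    """Keep the base letter and drop only the mark (schön -> schon)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(remove_diacritics("schön"), replace_diacritics("schön"))  # schn schon
print(remove_diacritics("äh"), replace_diacritics("äh"))        # h ah
```

The second line shows the single-letter problem mentioned above: removing the diacritic letter from 'äh' leaves only 'h', whereas replacing it keeps a two-letter word.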
