Searching CroALa with lists

See also Transforming an index into a list of CroALa searches.

If you deal with language and literature, lists are very interesting instruments. Not to read, of course, but to do something with them. If we take a list of words and compare it to another, there is a lot to learn and discover, provided the lists are sufficiently long. However, comparing long lists is a job best left to others; in our case, to computers.

But what is interesting in comparing lists? Let's say you have a list of mythological names and a list of words from a (long) Latin epic. A subset at the intersection of these lists tells us which mythological names (from our list) occur in the epic — but also which names don't occur there.
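A minimal sketch of such a comparison, using two tiny made-up lists (the names, the words and the file names are illustrative assumptions, not CroALa data). grep's -F (fixed strings), -x (whole line) and -f (patterns from a file) options keep the names that also occur in the word list; adding -v keeps the names that do not.

```shell
# work in a throwaway directory
workdir=$(mktemp -d)

# hypothetical sample data, one entry per line
printf '%s\n' Achilles Hector Orpheus > "$workdir/names.txt"
printf '%s\n' arma virumque Hector cano Achilles > "$workdir/epic-words.txt"

# names that occur in the epic (the intersection of the two lists)
found=$(grep -F -x -f "$workdir/epic-words.txt" "$workdir/names.txt")

# names from our list that do not occur in the epic
missing=$(grep -F -x -v -f "$workdir/epic-words.txt" "$workdir/names.txt")

echo "found:"
echo "$found"
echo "missing:"
echo "$missing"
```

Both results come out in the order of the names list, so the same names list can be run against several epics and the outputs compared line by line.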

Take ten more lists of ten more long epics, and apply the same procedure. See what happens then, think about what it all means, and see what this comparison makes you think about.


However, before we come to this part of the action, we have to learn where to find lists, and how to prepare and adapt them.

One list of mythological names — and other names from antiquity — is freely available on Wikisource.1) It's the list of lemmata from the Dictionary of Greek and Roman Biography and Mythology (1867), an encyclopedia by William Smith. Smith's dictionary on Wikisource is incomplete — a lot of articles are missing — but, for list-forging, this is irrelevant.

So, we take (that is, copy and paste) Smith's list of lemmata. In its Wikisource format it looks like this:

*[[../Abammon Magister|Abammon Magister]]
*[[../Abas (mythology) 1.|Abas 1.]]
*[[../Abas (mythology) 2.|Abas 2.]]

and with some searching and replacing we make it much simpler2):
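One way such a search-and-replace could be done with sed is sketched below. This is only an assumption about the procedure: the file names are hypothetical, and the sketch takes the goal to be one bare name per line, with numbering such as "1." stripped, multi-word entries dropped, and duplicates removed.

```shell
# work in a throwaway directory
workdir=$(mktemp -d)

# the three Wikisource lines quoted above, saved under a hypothetical file name
printf '%s\n' \
  '*[[../Abammon Magister|Abammon Magister]]' \
  '*[[../Abas (mythology) 1.|Abas 1.]]' \
  '*[[../Abas (mythology) 2.|Abas 2.]]' > "$workdir/smith-raw.txt"

# 1. keep only the label after the |   2. drop the closing ]]
# 3. drop a trailing number such as " 1."
# then discard entries that still contain a space, and remove duplicates
sed -e 's/^.*|//' -e 's/]]$//' -e 's/ [0-9][0-9]*\.$//' "$workdir/smith-raw.txt" \
| grep -v ' ' \
| sort -u > "$workdir/smith-simple.txt"

cat "$workdir/smith-simple.txt"
```

With these three input lines the sketch leaves a single entry, Abas; the two-word entry Abammon Magister is dropped.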


Now we can order the computer to compare this list to another; in our case, to the list of all words occurring in CroALa (called words.R in the PhiloLogic system), where the interesting part goes like this (actually it is preceded by a lot of less interesting numbers, also occurring in the texts):


It is important to realize that the list of words in CroALa has 271,281 potentially interesting entries, and the (simplified) list of all Smith's names in A has 1145 words. That means that a computer has to compare each of Smith's 1145 names with each of CroALa's 271,281 words; it has to make 310,616,745 comparisons. This could take a long time: should each comparing operation last a millisecond (it lasts less, I hope), the comparison would require about 86 hours. On my old laptop it takes about 86 seconds, but it is enough for me to get impatient. That's how spoiled we have become.
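The arithmetic can be checked in the shell itself. A small sketch follows; the variable names are mine, and the one-millisecond cost per comparison is the same guess as in the text.

```shell
names=1145      # Smith's names in A
words=271281    # potentially interesting CroALa words

# every name is compared with every word
comparisons=$((names * words))
echo "comparisons: $comparisons"

# at 1 ms per comparison: milliseconds -> seconds -> whole hours
seconds=$((comparisons / 1000))
hours=$((seconds / 3600))
echo "hours at 1 ms each: $hours"
```

Shell arithmetic works on integers only, so the hours are rounded down.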

Ordering the computer around

How do you make a computer compare lists? I know of three possible ways.

First, there are the Linux tools (commands) grep and sort, which can be combined in a simple bash script to find the intersection of the two lists (called ${file1} and ${file2} in the script):

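#!/bin/bash
# comparing two lists of words (separated by newlines)
# usage: ./ filename1 filename2
# calls tools: grep, sort
# take arguments: the list of search words, then the list to be searched
file1="$1"
file2="$2"
# compare with grep (-i: ignore case, -w: whole words, -f: read patterns from file1),
# sort alphabetically, write the result to a file named after both inputs
grep -i -w -f "${file1}" "${file2}" \
| sort -d - \
> "${file1}-${file2}-zajedno"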

With this script, a search for 100 of Smith's A-words among CroALa's 271,281 took 53.337 seconds. A search for 1000 words would require, I guess, ten times more: about 8.9 minutes. This is still less than a coffee break.

Finding similar, not only identical

You understand already that our grep search for Abaeus will find only Abaeus. We had to use the grep -i option to make it find abaeus as well.
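A small sketch of the difference, on two made-up lists (the entries and file names are assumptions for illustration only):

```shell
# work in a throwaway directory
workdir=$(mktemp -d)

# hypothetical list of names, and hypothetical list of words to be searched
printf '%s\n' Abaeus Abas > "$workdir/names"
printf '%s\n' abaeus Abaei Abaeusque Abas abasque > "$workdir/words"

# case-sensitive search for whole words
exact=$(grep -w -f "$workdir/names" "$workdir/words")

# the same search, ignoring case
nocase=$(grep -i -w -f "$workdir/names" "$workdir/words")

echo "without -i:"
echo "$exact"
echo "with -i:"
echo "$nocase"
```

Without -i only Abas is found; with -i the lowercase abaeus turns up as well. Forms such as Abaei, Abaeusque and abasque are missed in both runs, because -w accepts whole words only.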

However, in CroALa, which is a collection of natural language, we can expect not only Abaei, Abaeum, Abaeo, but also Abaeusque, Abaeive, Abaeumne, and even Abęi, Abeus.

What to do?

The solution is called regular expressions.

In grep's dialect of regex notation we would use:3)

 grep -i '\baba\?e[iuoe].*'

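(\b means “word boundary”, here the beginning of a word; \? means “zero or one occurrence of the preceding character”, here the second a; [iuoe] means “either i, u, o, or e”; and .* means “any sequence of characters, or none”. The final .* is not necessary, because grep prints every line that contains a match anyway. The single quotes keep the shell from removing the backslashes before grep sees them.)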
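To see what this expression catches, it can be tried on a handful of forms. The sketch below assumes GNU grep (\b and \? are GNU extensions to basic regular expressions), and the word list is made up for the purpose.

```shell
# hypothetical sample of word forms, one per line, piped into grep
matches=$(printf '%s\n' Abaeus abaeus Abaei Abaeumne Abeus Abas Sabaei |
  grep -i '\baba\?e[iuoe]')

echo "$matches"
```

The first five forms are found. Abas is not, since no e follows the stem, and neither is Sabaei, where the sequence does not start at a word boundary.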

All the changes we should make

The orthography in neo-Latin is so uncontrolled, though not unpredictable, that we have to do a lot of regex transformations to cover all variants. With each transformation the chances grow that you'll find what you didn't look for.

On top of that come all the endings of Latin inflection.

However, when you use computer scripts, you have to think hard only once, and can later reuse what you've written. Here's the hack I made today to search for names from an index in CroALa (remember regex and PhiloLogic?):

# use on an already partially processed file
# we use sed, the Linux search-and-replace command-line tool
# (each line ends with the pipe, so that comments can sit between the stages)
# 1. replace all x-s at the end of Latin words
sed 's/x$/[xc]*/g' inputfilename |
# 2. diphthong oE can be written as E; 3. remove the endings -Us, -Um
sed 's/[oO]E/o?E/g' | sed 's/U[sm]$/*/g' |
# 4. remove the ending -a; 5. let anything come after endings -o, -r etc.
sed 's/a$/*/g' | sed 's/[ornlm]$/&*/g' |
# 6. remove the ending -as, leaving the a; 7. let anything come after -os
sed 's/as$/a*/g' | sed 's/os$/os*/g' |
# 8. -ns and -rs have -tis genitive etc.; 9. remove the ending -Es
sed 's/\([nr]\)s$/\1[st]*/g' | sed 's/Es$/*/g' |
# 10. -is in nominative produces -em, -es etc.; 11. nom. -e may produce -ae, -is etc.
sed 's/Is$/[EI]*/g' | sed 's/E$/[aEI]*/g' |
# 12. -i is usually nom. pl.; 13. -ti- can be written as -ci-
sed 's/I$/[IoU]*/g' | sed 's/tI\([aEIoU]\)/[tc]I\1/g' |
# 14. aspiration everywhere
sed 's/t\([aEIOUr]\)/th?\1/g' > outputfilename
# should be improved further for beginnings such as Ae-, Ho-
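
To get a feeling for what the rules produce, here is a sketch that applies just three of them (the ones for final -x, -as, and -ns/-rs) to three sample names. The names and the choice of rules are only an illustration; the full chain above also expects a partially processed input file, which this sketch does not reproduce.

```shell
# three of the substitutions from the chain above, in the same order,
# combined into one sed call with -e; the sample names are hypothetical
patterns=$(printf '%s\n' Ajax Abas Mars |
  sed -e 's/x$/[xc]*/' \
      -e 's/as$/a*/' \
      -e 's/\([nr]\)s$/\1[st]*/')

echo "$patterns"
```

Ajax becomes Aja[xc]*, Abas becomes Aba*, and Mars becomes Mar[st]*: each name turns into a search pattern that leaves room for its inflected forms.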
1) Other lists of interest: Orbis latinus, a list of Latin place names; Pleiades Project prepares a list of ancient place names; there are lists of Latin words according to frequency, and a list of Latin words from old reference books. Also, any index to a book or a text will offer interesting material.
2) You'll notice that some entries have gone missing during the transformation. For experimental purposes, we decided to leave out all entries consisting of several words. We'll think about them tomorrow.
3) The shorter the word, the harder it is to find with regex in flective languages and unstable orthographical systems.
z/croala-lists.txt · Last modified: 03. 09. 2012. 23:38 by njovanov