Lectio quarta: Regular expressions in CroALa

This page is part of CroALa: schola, a manual for using CroALa and its PhiloLogic system.

Regular expressions (“regex”) are the part of computer science in which a linguist or a philologist feels at home. This is something we can understand: it applies to words and sentences, and it helps us realize, or remember, that programming languages, like mathematics, are actually languages (and become even more so with each new abstraction layer).

The introduction to regex I still like the best, after almost two decades of fiddling with computers, is the one given by Stephen Ramsay, and it can be found here.1)

What does regex do? Through it we can tell the computer, e.g., to find all “strings” (series of characters) beginning with patri, followed by any combination of letters. In regex notation, “any combination of letters” is written as .* (the dot stands for any single character, the asterisk for “repeated zero or more times”), or, in PhiloLogic queries, * for short. (The logically inclined will notice that one of the possible combinations is no letters at all.)
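To see the pattern at work outside PhiloLogic, here is a minimal Python sketch. The word list is a made-up sample, not CroALa data, and the use of re.fullmatch (the pattern must cover the whole word) is only an assumption about how a word search behaves. Python has no bare * shorthand, so the pattern is spelled out as patri.*.

```python
import re

# Hypothetical sample of word forms; not taken from CroALa.
words = ["patri", "patria", "patriarcha", "pater", "compatriota"]

# "patri" followed by any characters, including none at all.
pattern = re.compile(r"patri.*")

# fullmatch: the pattern has to cover the word from first to last character
# (an assumption about how a word search behaves).
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)
```

Note that the bare patri is kept (the “no letters at all” case), while pater and compatriota are dropped.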

Patriae fumus igne alieno luculentior

PhiloLogic understands regex. To demonstrate, let's say we want to find all mentions of patria in CroALa, to see what its authors had to say about one's own country. So we send CroALa the following query string:

&word=patri*

See the results here, or, in a form more informative for our current purposes, like this (there we added the &OUTPUT=AF command you remember from Amphitrite).

Non uni angulo natus

As you have seen, in Latin and in CroALa there are many words beginning with patri-, many more than just the cases of patria: oblique cases of pater, forms of patriarcha and its derivatives, patricius, patricidium, and a lot more. How can we make the search results more pertinent?

There are two ways. A brute-force approach would simply give as a query string all forms of patria, separating them with the regex operator |, which means OR:

&word=patria|patriae|patriam|patriarum|patriis|patrios

Thanks to the ancient Romans, who economized on endings in the first declension, this should cover all cases of patria. However, it doesn't actually do the job; it does something else. Try it out for yourself, with this search string:2)

&word=patria|patriae|patriam|patriarum|patriis|patrios&OUTPUT=kwic

Can you tell what happens?

So it goes. The plural forms of patria are actually plural forms of the adjective patrius. That's still OK, because patrii fines is a metonymy for patria. But we are missing occurrences such as patriaeque and patriamne (the ancient Romans built into Latin an option to append certain conjunctions and particles to a word). Moreover, in CroALa the orthography gets quite lively, and patrię is the same as patriae, only written differently. How do we find it?
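Why these forms slip through can be checked with a small Python sketch. It assumes, as a stand-in for PhiloLogic's behaviour, that each alternative has to match a whole word. The list of forms is a hypothetical sample built from the examples just mentioned.

```python
import re

# The brute-force alternation from the query string above.
alternation = re.compile(r"patria|patriae|patriam|patriarum|patriis|patrios")

# Hypothetical sample: plain forms, enclitic forms, and an e caudata spelling.
forms = ["patria", "patriae", "patriis", "patriaeque", "patriamne", "patrię"]

# A form is found only if one alternative covers it completely.
found = [w for w in forms if alternation.fullmatch(w)]
missed = [w for w in forms if not alternation.fullmatch(w)]

print("found: ", found)
print("missed:", missed)
```

An explicit list finds only what is spelled out in it, letter for letter; every enclitic and every spelling variant would need its own entry.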

The second approach requires us, therefore, to compose a more nuanced regex. The best way to do it is to explore the first list of results (for patri*) and add rules to our regex trying to exclude words which are not forms of patria.

We do it step by step, so as not to get lost.

A more nuanced patria

We expect only vowels after patri-. This is written as:

&word=patri[aeiou].*

See how the list of results gets shorter (remember to use the query string &word=patri[aeiou].*&OUTPUT=AF).

But we don't want forms of patriarcha and patriarchatus. This really means that we don't want the letter after the vowel to be r. There is a regex for that:

&word=patri[aeiou][^r].*&OUTPUT=AF

The list of results, even though it covers more than patria, seems satisfactory. But we lost patria itself (because [^r] requires one more character after the vowel, and patria has none), so we add it back with the | regex operator:

&word=patri[aeiou][^r].*|patri[aeiou]&OUTPUT=AF
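The effect of each refinement can be traced on a small scale in Python. This is only a sketch: the word list is a hypothetical sample rather than a CroALa result list, the helper whole_word_matches is invented for the illustration, and whole-word matching is assumed.

```python
import re

# Hypothetical sample of forms beginning with patri-; not a CroALa result list.
words = [
    "patria", "patriae", "patriam", "patriaeque",
    "patriarcha", "patriarchatus", "patricius", "patricidium", "patris",
]

# The successive patterns from the text.
steps = [
    r"patri.*",                          # anything beginning with patri
    r"patri[aeiou].*",                   # a vowel must follow patri
    r"patri[aeiou][^r].*",               # ...then one character that is not r
    r"patri[aeiou][^r].*|patri[aeiou]",  # ...or nothing at all after the vowel
]

def whole_word_matches(pattern, candidates):
    """Return the candidates that the pattern covers from first to last character."""
    regex = re.compile(pattern)
    return [w for w in candidates if regex.fullmatch(w)]

results = {pattern: whole_word_matches(pattern, words) for pattern in steps}
for pattern, kept in results.items():
    print(f"{pattern:35} {kept}")
```

In this sample the third pattern drops patriarcha and patriarchatus but also patria, and the fourth brings patria back.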

The only thing we're still missing is the e caudata (ę) — as well as other possible orthographical quirks. PhiloLogic offers a special function for that. An uppercase E in your regex will find not only e (and E), but also all its accented variants, ę, é, è, ê, etc. Try it out, first on its own:

&word=patriE&OUTPUT=AF

and then add it to our regex:

&word=patri[aeiou][^r].*|patri[aeiou]|patriE*&OUTPUT=AF
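Standard regex engines have no such uppercase-letter convention, but the idea can be imitated. The following Python sketch is an assumption about the principle, not PhiloLogic's actual implementation: the function expand_uppercase_e and the set of variants in E_VARIANTS are invented for the illustration, and the pattern is written with .* because Python does not accept a bare *.

```python
import re

# Assumed set of variants for illustration; PhiloLogic's own list may differ.
E_VARIANTS = "eEęéèêë"

def expand_uppercase_e(pattern):
    """Replace every uppercase E with a character class of e and its variants.

    Naive on purpose: it would also rewrite an E that already stands
    inside a character class.
    """
    return pattern.replace("E", "[" + E_VARIANTS + "]")

# Hypothetical sample of spellings.
words = ["patrie", "patrię", "patrięque", "patria"]

regex = re.compile(expand_uppercase_e(r"patriE.*"))
matches = [w for w in words if regex.fullmatch(w)]
print(regex.pattern)
print(matches)
```

Both the plain e and the e caudata are caught by the one pattern, while patria is left to the other alternatives of the full query.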

Have you noticed how the ranking of authors in the result list varies?

Exercises

1) Isn't Ramsay's page a masterwork of simplicity?
2) I don't give the link because it would confuse the DokuWiki formatting syntax. This manual is a DokuWiki instance, by the way.
 
z/croala-regex.txt · Last modified: 01. 09. 2012. 13:58 by njovanov
 