====== Lectio quarta: Regular expressions in CroALa ======
//This page is part of// [[z:croala-schola|CroALa: schola]], //a manual for using CroALa and its// [[http://sites.google.com/site/philologic3/|PhiloLogic]] //system.//

Regular expressions, the "regex", are the part of computer science which makes a linguist (or a philologist) feel comfortable. This is something we can understand, it applies to words and sentences, and it helps us realize, or remember, that programming languages --- as well as mathematics --- are actually //languages// (and becoming even more so with each new [[http://en.wikipedia.org/wiki/Abstraction_layer|abstraction layer]]).

<blockquote>The introduction to regex I still like the best, after almost two decades of fiddling with computers, is the one given by [[http://lenz.unl.edu/|Stephen Ramsay]], and it can be found [[http://etext.lib.virginia.edu/services/helpsheets/unix/regex.html|here]].((Isn't [[http://lenz.unl.edu/|Ramsay's page]] a masterwork of simplicity?))</blockquote>

What does regex do? Through it we can tell the computer, for example, to find all "strings" (series of characters) beginning with ''patri'', followed by any combination of letters. In regex notation, "any combination of characters" is written as ''.*'': the dot stands for any single character, and the asterisk means "repeated zero or more times". In CroALa queries we can write just ''*'' for short. (The logically inclined will notice that one of the possible combinations is also //no letters at all//.)
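
If you want to try this idea outside CroALa, here is a minimal sketch in Python. The word list is invented for illustration, and we assume that the pattern has to cover the whole word, which is what the ''patri*'' search appears to do. Note that Python does not know the ''*'' shorthand: there ''patri*'' would mean ''patr'' followed by zero or more ''i'', so the pattern is spelled out in full.

```python
import re

# Hypothetical sample words, not taken from CroALa.
words = ["patri", "patria", "patriarcha", "pater", "compatriota"]

# "patri" followed by zero or more arbitrary characters.
pattern = re.compile(r"patri.*")

# fullmatch() succeeds only if the pattern covers the whole word
# (an assumption about how a word search behaves).
matches = [w for w in words if pattern.fullmatch(w)]
print(matches)
```

The bare ''patri'' is matched too, because "zero or more characters" includes no characters at all.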

===== Patriae fumus igne alieno luculentior =====

PhiloLogic understands regex. To demonstrate, let's say we want to find all mentions of //patria// in CroALa, thinking about what the authors in CroALa had to say about one's own country. So we send to CroALa the following query string:
  &word=patri*
See the results [[http://www.ffzg.unizg.hr/klafil/croala/cgi-bin/search3t?dbname=croala&word=patri*|here]], or, in a form more informative for our current purposes, [[http://www.ffzg.unizg.hr/klafil/croala/cgi-bin/search3t?dbname=croala&word=patri*&OUTPUT=AF|like this]] (there we added the ''&OUTPUT=AF'' command you remember from [[z:croala-schola#lectio-secundacontrolling-output-of-a-sea-goddess|Amphitrite]]).
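
Typing such addresses by hand gets tedious, so here is a small sketch that assembles them in Python. The base address and the ''dbname'', ''word'' and ''OUTPUT'' parameters are copied from the links above; the function ''croala_url'' is a hypothetical helper of ours, not something PhiloLogic provides.

```python
from urllib.parse import urlencode

# Base address copied from the links on this page.
BASE = "http://www.ffzg.unizg.hr/klafil/croala/cgi-bin/search3t"


def croala_url(word, output=None):
    """Build a CroALa search URL (hypothetical helper, not part of PhiloLogic)."""
    params = {"dbname": "croala", "word": word}
    if output:
        params["OUTPUT"] = output
    # Leave regex characters readable instead of percent-encoding them,
    # as the links on this page do.
    return BASE + "?" + urlencode(params, safe="*|[]^.")


print(croala_url("patri*", output="AF"))
```

The ''safe'' argument is a design choice: without it the ''*'' would be written as ''%2A'', which is harder to read when you compare the address with the query strings in this lesson.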

===== Non uni angulo natus =====


As you have seen, in Latin and in CroALa there are many words beginning with //patri-//, many more than just the case forms of //patria//. There are oblique cases of //pater//, forms of //patriarcha// and its derivatives, //patricius// and //patricidium//, and a lot more. How can we make search results more pertinent?

There are two ways. A **brute force** approach would simply give as a query string all forms of //patria//, separating them with the regex mark ''|'', which means OR:
  &word=patria|patriae|patriam|patriarum|patriis|patrias
Thanks to the ancient Romans, who economized with endings for the first declension, this should cover all cases of //patria//.  However, it doesn't actually do the job; it does something else. Try it out for yourself, with this search string:((I don't give the link because it would confuse the [[http://www.dokuwiki.org/syntax|DokuWiki formatting syntax]]. This manual is a [[http://www.dokuwiki.org/dokuwiki|DokuWiki]] instance, by the way.))

  &word=patria|patriae|patriam|patriarum|patriis|patrias&OUTPUT=kwic

Can you tell what happens?

So it goes. The plural forms of //patria// are actually plural forms of the adjective //patrius//.  That's still OK, because //patrii fines// is a metonymy for //patria//. But we are missing occurrences such as //patriaeque// and //patriamne// (the ancient Romans built into Latin an option to append certain conjunctions and particles //after// a word). Moreover, in CroALa the orthography gets quite lively, and //patrię// is the same as //patriae//, only written differently. But how do we find it?
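
The gap is easy to see in a small Python sketch. We use a shortened version of the alternation (singular forms only) and an invented word list, and we assume again that the pattern must cover the whole word.

```python
import re

# Shortened brute-force alternation: a few singular forms of "patria".
brute = re.compile(r"patria|patriae|patriam")

# Hypothetical sample words, including enclitics and an e caudata.
words = ["patria", "patriae", "patriaeque", "patriamne", "patrię"]

# Words that the explicit list of forms does not cover.
missed = [w for w in words if not brute.fullmatch(w)]
print(missed)
```

Every form has to be listed explicitly, so anything with an appended //-que// or //-ne//, or with an unusual spelling, slips through.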

The second approach requires us, therefore, to compose a **more nuanced regex**. The best way to do it is to explore the [[http://www.ffzg.unizg.hr/klafil/croala/cgi-bin/search3t?dbname=croala&word=patri*&OUTPUT=AF|first list of results]] (for ''patri*'') and add rules to our regex trying to exclude words which are //not// forms of //patria//.

We do it step by step, so as not to get lost.
===== A more nuanced patria =====
We expect a vowel right after //patri-//. A set of allowed characters is written in square brackets, so "one vowel, followed by anything" becomes:
  &word=patri[aeiou].*
See how the list of results gets shorter (remember to use the query string ''&word=patri[aeiou].*&OUTPUT=AF'').

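But we don't want forms of //patriarcha// and //patriarchatus//. This really means that we don't want the letter after the vowel to be //r//. There is a regex for that: a caret at the start of the brackets negates the set, so ''[^r]'' means "any single character except //r//":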
  &word=patri[aeiou][^r].*&OUTPUT=AF
The list of results, even though it covers more than //patria//, seems satisfactory. But we lost the //patria// itself (because it does not have a non-r-letter //after// the vowel), so we add it back with the ''|'' regex operator:
  &word=patri[aeiou][^r].*|patri[aeiou]&OUTPUT=AF
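
To watch the narrowing happen without querying CroALa, the steps above can be sketched in Python on a handful of invented sample words. As before, we assume that each pattern has to cover the whole word.

```python
import re

# Hypothetical sample words, not taken from CroALa.
words = ["patria", "patriae", "patriam", "patriarcha",
         "patribus", "patricius", "patris"]

# The patterns from this lesson, in the order we built them.
steps = [
    r"patri.*",                          # everything beginning with patri-
    r"patri[aeiou].*",                   # a vowel must follow
    r"patri[aeiou][^r].*",               # ... and then something that is not r
    r"patri[aeiou][^r].*|patri[aeiou]",  # add the bare patria back
]

results = {}
for pattern in steps:
    regex = re.compile(pattern)
    results[pattern] = [w for w in words if regex.fullmatch(w)]
    print(pattern, "->", results[pattern])
```

Each step keeps fewer words than the one before, except the last, which deliberately restores //patria//.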
The only thing we're still missing is the //e caudata// (ę) --- as well as other possible orthographical quirks.  PhiloLogic offers a special function for that. An uppercase E in your regex will find not only //e// (and //E//), but also all its accented variants, //ę, é, è, ê,// etc. Try it out, first on its own:
  &word=patriE&OUTPUT=AF
and then add it to our regex:
  &word=patri[aeiou][^r].*|patri[aeiou]|patriE*&OUTPUT=AF
Have you noticed how the ranking of authors in the result list varies?
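
Ordinary regex engines have no uppercase-E trick, but it can be imitated with a character set. The following Python sketch is only a rough imitation under our own assumptions: the helper ''expand_upper_e'' is hypothetical, the list of variants is limited to the ones named above, and PhiloLogic itself may well work differently.

```python
import re

# Variants of e named in this lesson; the real list may be longer.
E_VARIANTS = "[eEęéèê]"


def expand_upper_e(pattern):
    """Replace every uppercase E with a set of e variants (hypothetical helper)."""
    return pattern.replace("E", E_VARIANTS)


# Hypothetical sample words.
words = ["patrie", "patrię", "patrié", "patria"]

regex = re.compile(expand_upper_e("patriE"))
found = [w for w in words if regex.fullmatch(w)]
print(found)
```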
===== Exercises =====