====== Lectio quarta: Regular expressions in CroALa ======

//This page is part of// [[z:croala-schola|CroALa: schola]], //a manual for using CroALa and its// [[http://sites.google.com/site/philologic3/|PhiloLogic]] //system.//

Regular expressions, or "regex", are the part of computer science in which a linguist (or a philologist) feels at home. This is something we can understand: it applies to words and sentences, and it helps us realize, or remember, that programming languages, like mathematics, are actually //languages// (and are becoming even more so with each new [[http://en.wikipedia.org/wiki/Abstraction_layer|abstraction layer]]).

The introduction to regex I still like the best, after almost two decades of fiddling with computers, is the one given by [[http://lenz.unl.edu/|Stephen Ramsay]], and it can be found [[http://etext.lib.virginia.edu/services/helpsheets/unix/regex.html|here]].((Isn't [[http://lenz.unl.edu/|Ramsay's page]] a masterwork of simplicity?))

What does regex do? Through it we can tell the computer, e.g., to find all "strings" (series of characters) beginning with ''patri'', followed by any combination of letters. In regex notation, "any combination of letters" is written as ''.*'', which in CroALa queries can be shortened to ''*''. (The logically inclined will notice that one of the possible combinations is also //no letters at all//.)

===== Patriae fumus igne alieno luculentior =====

PhiloLogic understands regex. To demonstrate, let's say we want to find all mentions of //patria// in CroALa, to see what its authors had to say about one's own country. So we send CroALa the following query string:

  &word=patri*

See the results [[http://www.ffzg.unizg.hr/klafil/croala/cgi-bin/search3t?dbname=croala&word=patri*|here]], or, in a form more informative for our current purposes, [[http://www.ffzg.unizg.hr/klafil/croala/cgi-bin/search3t?dbname=croala&word=patri*&OUTPUT=AF|like this]] (there we added the ''&OUTPUT=AF'' command you remember from [[z:croala-schola#lectio-secundacontrolling-output-of-a-sea-goddess|Amphitrite]]).

===== Non uni angulo natus =====

As you have seen, in Latin and in CroALa there are many words beginning with //patri-//, many more than just the cases of //patria//: oblique cases of //pater//, forms of //patriarcha// and its derivatives, //patricius//, //patricidium//, and a lot more. How can we make the search results more pertinent? There are two ways.

A **brute force** approach would simply give as a query string all forms of //patria//, separating them with the regex mark ''|'', which means OR:

  &word=patria|patriae|patriam|patriarum|patriis|patrios

Thanks to the ancient Romans, who economized on endings in the first declension, this should cover all cases of //patria//. However, it doesn't actually do the job; it does something else. Try it out for yourself, with this search string:((I don't give the link because it would confuse the [[http://www.dokuwiki.org/syntax|DokuWiki formatting syntax]].
This manual is a [[http://www.dokuwiki.org/dokuwiki|DokuWiki]] instance, by the way.))

  &word=patria|patriae|patriam|patriarum|patriis|patrios&OUTPUT=kwic

Can you tell what happens? So it goes. What we took for plural forms of //patria// are actually plural forms of the adjective //patrius//. That's still OK, because //patrii fines// is a metonymy for //patria//. But we are missing occurrences such as //patriaeque// and //patriamne// (the ancient Romans built into Latin an option to append certain conjunctions and particles //after// a word). Moreover, in CroALa the orthography gets quite lively, and //patrię// is the same as //patriae//, only written differently. How do we find it?

The second approach requires us, therefore, to compose a **more nuanced regex**. The best way to do it is to explore the [[http://www.ffzg.unizg.hr/klafil/croala/cgi-bin/search3t?dbname=croala&word=patri*&OUTPUT=AF|first list of results]] (for ''patri*'') and add rules to our regex, trying to exclude words which are //not// forms of //patria//. We do it step by step, so as not to get lost.

===== A more nuanced patria =====

We expect only vowels after //patri-//. This is written as:

  &word=patri[aeiou].*

See how the list of results gets shorter (remember to use the query string ''&word=patri[aeiou].*&OUTPUT=AF''). But we don't want forms of //patriarcha// and //patriarchatus//. This really means that we don't want the letter after the vowel to be //r//. There is a regex for that: ''[^r]'' matches any single character except //r//.

  &word=patri[aeiou][^r].*&OUTPUT=AF

The list of results, even though it covers more than //patria//, seems satisfactory. But we have lost //patria// itself (because it has no letter at all //after// the vowel, and ''[^r]'' requires one), so we add it back with the ''|'' regex operator:

  &word=patri[aeiou][^r].*|patri[aeiou]&OUTPUT=AF

The only thing we're still missing is the //e caudata// (ę), as well as other possible orthographical quirks. PhiloLogic offers a special function for that.
An uppercase E in your regex will find not only //e// (and //E//), but also all its accented variants, //ę, é, è, ê,// etc. Try it out, first on its own:

  &word=patriE&OUTPUT=AF

and then add it to our regex:

  &word=patri[aeiou][^r].*|patri[aeiou]|patriE*&OUTPUT=AF

Have you noticed how the ranking of authors in the result list varies?

===== Exercises =====
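
If you would like to replay this lesson's narrowing steps away from CroALa, here is a minimal sketch in Python. Mind the assumptions: the word list is a made-up sample and not CroALa's index, Python's ''re'' module is only a stand-in for whatever PhiloLogic does internally, the short ''*'' is written out as ''.*'', and the uppercase E is imitated with an explicit character class, which is a guess at its behaviour and not PhiloLogic's actual implementation.

```python
import re

# Hypothetical sample of word forms (not taken from CroALa's real index).
WORDS = [
    "patria", "patriae", "patriam", "patriaeque", "patriamne",
    "patrię", "patriarcha", "patriarchatus", "patricius",
    "patricidium", "patris", "patrios", "patriis",
]

# Assumption: stand-in for PhiloLogic's uppercase E (e plus accented variants).
E_VARIANTS = "[eEęéèê]"

# The pattern reached before the e caudata is taken into account.
WITHOUT_E = r"patri[aeiou][^r].*|patri[aeiou]"
# The final pattern of the lesson, with "patriE*" spelled out.
FINAL = WITHOUT_E + r"|patri" + E_VARIANTS + r".*"

# The lesson's steps in order: (query as typed in CroALa, standard regex).
STEPS = [
    ("patri*", r"patri.*"),
    ("patri[aeiou].*", r"patri[aeiou].*"),
    ("patri[aeiou][^r].*", r"patri[aeiou][^r].*"),
    ("patri[aeiou][^r].*|patri[aeiou]", WITHOUT_E),
    ("...|patriE*", FINAL),
]


def matching(pattern, words=WORDS):
    """Return the words that the pattern matches from first to last letter."""
    rx = re.compile(pattern)
    return [w for w in words if rx.fullmatch(w)]


for label, pattern in STEPS:
    print(f"{label:34} {matching(pattern)}")
```

Running it prints one line per step. Watch //patria// drop out at the third step and return at the fourth, while //patrię// vanishes at the second step and comes back only with the last one.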