====== Searching CroALa with lists ======

See also [[z:croala-list-index|Transforming an index into a list of CroALa searches]].

If you deal with language and literature, lists are very interesting instruments. Not to read, of course, but to do something with them. If we take a list of words and compare it to another, there is a lot to learn and discover --- provided the lists are sufficiently long. However, comparing long lists is a job best left to others; in our case, to computers.

But what is interesting in comparing lists? Let's say you have a list of mythological names and a list of words from a (long) Latin epic. The intersection of these lists tells us which mythological names (from our list) occur in the epic --- but also which names //don't// occur there. Take ten more lists from ten more long epics, and apply the same procedure. See what happens then, think about what it all means, see what this comparison makes you think about.

===== Preparations =====

However, before we come to this part of the action, we have to learn where to find lists, and how to prepare and adapt them.

One list of mythological names --- and other names from antiquity --- is freely available on Wikisource.((Other lists of interest: Orbis latinus, a list of Latin place names; the Pleiades Project is preparing a list of ancient place names; there are lists of Latin words according to frequency, and a list of Latin words from old reference books. Also, any index to a book or a text will offer interesting material.)) It's the list of lemmata from the [[http://en.wikisource.org/wiki/Dictionary_of_Greek_and_Roman_Biography_and_Mythology|Dictionary of Greek and Roman Biography and Mythology]] (1867), an encyclopedia by [[http://en.wikisource.org/wiki/Author:William_Smith_(1813-1893)|William Smith]]. Smith's dictionary on Wikisource is incomplete --- a lot of articles are missing --- but, for list-forging, this is irrelevant.

So, we take (that is, copy and paste) Smith's list of lemmata. In its Wikisource format it looks like this:

  *[[../Abaeus|Abaeus]]
  *[[../Abammon Magister|Abammon Magister]]
  *[[../Abantiades|Abantiades]]
  *[[../Abantias|Abantias]]
  *[[../Abantidas|Abantidas]]
  *[[../Abarbarea|Abarbarea]]
  *[[../Abaris|Abaris]]
  *[[../Abas (mythology) 1.|Abas 1.]]
  *[[../Abas (mythology) 2.|Abas 2.]]

and with some searching and replacing we make it much simpler((You'll notice that some entries have gone missing during the transformation. For experimental purposes, we decided to leave out all entries consisting of several words. We'll think about them tomorrow.)):

  Abaeus
  Abantiades
  Abantias
  Abantidas
  Abarbarea
  Abaris
  Abascantus
  Abderus
  Abdias
  Abellio
  Abgarus
  Abia

Now we can order the computer to compare this list to another --- in our case, to the list of all words occurring in CroALa (called ''words.R'' in the PhiloLogic system), where the interesting part goes like this (actually it is preceded by a lot of less interesting numbers, also occurring in the texts):

  aaron
  aarone
  aaronem
  aaroni
  aaronis
  aathor
  ab
  ab2
  ab3
  aba
  abac
  abachuc
  abachuch
  abacos
  abacosque
  abacta
  abactam
  abacti
  abactis
  abacto
  abactor
  abactorem
  abactos
  abactum
  abactus
  abactę
  abacuc
  abacuch
  abacum
  abacus
  abaddir
  abadon
  abaffy
  abagare
  abagari

It is important to realize that the list of words in CroALa has 271,281 potentially interesting entries, and the (simplified) list of all Smith's names beginning with A has 1145 words. That means that a computer has to compare each of Smith's 1145 names with each of CroALa's 271,281 words; it has to make 1145 × 271,281 = 310,616,745 comparisons. This could take a long time --- if each comparison took a millisecond (it takes less, I hope), the whole job would require about 86 hours. On my old laptop it takes about 86 seconds, but it is enough for me to get impatient. That's how spoiled we have become.

===== Ordering the computer around =====

How do you make a computer compare lists? I know of three possible ways.

First, there are the Linux tools (commands) ''grep'' and ''sort'', which can be combined in a simple [[http://floppix.ccai.com/scripts1.html|bash script]] to find the intersection of the two lists (called ''${file1}'' and ''${file2}'' in the script):

  #!/bin/bash
  # comparing two lists of words (separated by newlines)
  # usage: ./cpr-lists.sh filename1 filename2
  # calls tools: grep, sort
  # take arguments, compare with grep, sort alphabetically
  file1=$1
  file2=$2
  # -i: ignore case; -w: whole words only; -f: read the patterns from file1
  grep -i -w -f "${file1}" "${file2}" \
    | sort -d - \
    > "${file1}-${file2}-zajedno"

With this script, a search for 100 of Smith's A-words among CroALa's 271,281 took 53.337 seconds. A search for 1000 words would take, I guess, ten times longer, about 8.9 minutes. This is still less than a coffee break.

===== Finding similar, not only identical =====

You understand already that our grep search for ''Abaeus'' will find only ''Abaeus''. We had to use the ''grep -i'' option to make it find ''abaeus'' as well. However, in CroALa, which is a collection of natural language, we can expect not only ''Abaei, Abaeum, Abaeo'', but also ''Abaeusque, Abaeive, Abaeumne'', and even ''Abęi, Abeus''. What to do?

The solution is called [[z:croala-regex|regular expressions]]. In grep's dialect of regex notation we would use:((The shorter the word, the harder it is to find with regex in inflected languages and unstable orthographical systems.))

  grep -i '\baba\?e[iuoe].*'

(The quotes keep the shell from swallowing the backslashes. ''\b'' means "word boundary", here the beginning of a word; ''\?'' means "zero or one occurrence of the preceding character"; ''[iuoe]'' means "either i, u, o, or e", and ''.*'' --- a combination which is not necessary in my version of grep, it is there by default --- means "anything or nothing". To catch ''Abęi'' as well, the ''e'' would have to become ''[eę]''.)

===== All the changes we should do =====

The orthography in neo-Latin is so uncontrolled, though not unpredictable, that we have to do a lot of regex transformations to cover all variants. With each transformation the chances grow that you'll find what you didn't look for. On top of it come all the endings of Latin inflection.
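
The risk of finding what you didn't look for can be seen on a small scale. The sketch below runs the ''Abaeus'' pattern from the previous section over five word forms. The forms are made up for illustration and not taken from CroALa, the function names are hypothetical, and ''\b'' and ''\?'' assume GNU grep.

```shell
#!/bin/sh
# Hypothetical sample: five word forms, one per line (not real CroALa data).
sample_words() {
  printf '%s\n' Abaeus abaeumne Abeus abeunt abacus
}

# The pattern is single-quoted so that the shell passes the backslashes
# through to grep. \b and \? are GNU extensions to basic regular expressions.
find_abaeus() {
  grep -i '\baba\?e[iuoe]'
}

# Print every sample form that the pattern accepts.
sample_words | find_abaeus
```

Four of the five forms come back: ''Abaeus'', ''abaeumne'' and ''Abeus'' as intended, but also ''abeunt'', a form of the verb ''abeo'' that has nothing to do with the name. Only ''abacus'' is rejected.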

However, when you use computer scripts, you have to think hard //once//, and later reuse what you've written. Here's the hack I made today to search for names from an index in CroALa (remember [[z:croala-regex|regex and PhiloLogic]]?):

  # use on an already partially processed file
  # we use sed, the Linux command-line search-and-replace tool
  # each stage ends with the pipe, so that comments can sit between stages
  # 1. replace all x-s at the end of Latin words
  sed 's/x$/[xc]*/g' inputfilename |
  # 2. diphthong oE can be written as E; 3. remove the endings -Us, -Um
  sed 's/[oO]E/o?E/g' | sed 's/U[sm]$/*/g' |
  # 4. remove the ending -a; 5. let anything come after endings -o, -r etc.
  sed 's/a$/*/g' | sed 's/[ornlm]$/&*/g' |
  # 6. remove the ending -as, leaving the a; 7. let anything come after -os
  sed 's/as$/a*/g' | sed 's/os$/os*/g' |
  # 8. -ns and -rs have -tis genitive etc. 9. remove the ending -Es
  sed 's/\([nr]\)s$/\1[st]*/g' | sed 's/Es$/*/g' |
  # 10. -is in nominative produces -em, -es etc. 11. nom. -e may produce -ae, -is etc.
  sed 's/Is$/[EI]*/g' | sed 's/E$/[aEI]*/g' |
  # 12. -i is usually nom. pl.; 13. -ti- can be written as -ci-
  sed 's/I$/[IoU]*/g' | sed 's/tI\([aEIoU]\)/[tc]I\1/g' |
  # 14. aspiration everywhere
  sed 's/t\([aEIOUr]\)/th?\1/g' > outputfilename
  # should be improved further for beginnings such as Ae-, Ho-
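
To get a feeling for what such rules produce, here is a minimal sketch that applies four of them, in the same order, to five names from Smith's list. It rests on assumptions: the capital E, I and U in the patterns suggest that the earlier processing step upper-cased those vowels, so the sample names were upper-cased by hand in the same way; the function names are hypothetical; and the rules are collected in one ''sed'' call with several ''-e'' expressions instead of a chain of pipes, which should give the same result for these rules.

```shell
#!/bin/sh
# Hypothetical sample: five names with E, I and U upper-cased by hand
# (an assumption about what the "partially processed" input looks like).
sample_names() {
  printf '%s\n' AbaEUs AbarIs AbIa AbantIas AbdErUs
}

# Four rules from the chain, applied in order by a single sed process.
# In the replacement part, * and [EI] are literal characters.
ending_patterns() {
  sed -e 's/U[sm]$/*/' \
      -e 's/a$/*/' \
      -e 's/as$/a*/' \
      -e 's/Is$/[EI]*/'
}

# Print the search pattern generated for each sample name.
sample_names | ending_patterns
```

The output is ''AbaE*'', ''Abar[EI]*'', ''AbI*'', ''AbantIa*'' and ''AbdEr*''. Because this is only a subset, the full chain would change some of these patterns further, for instance through the -ti-/-ci- rule.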