====== Exploring CroALa through Janus Pannonius ======

A lab diary.

===== What we already have =====

  * Access Janus Pannonius' works in CroALa (as listed in the [[z:croala-index-auctorum|Index auctorum]])
  * Study [[http://ramminger.userweb.mwn.de/search/searchresults.htm?searchField=IAN+PAN&Submit.x=16&Submit.y=15&srcriteria=phrase&aUngleichA=1|Pannonius' words]] excerpted for Johann Ramminger's Neulateinische Wortliste (as listed among the [[z:croala-nlw|Auctores Croatici (et vicini) in NLW]])
  * A procedure for [[z:croala-large-scale|large-scale querying in CroALa]], looking for names Pannonius used

====== Querying CroALa with a list of Pannonius' words ======

===== The algorithm =====

  - Produce a list of words from a text by Pannonius. Normalize orthographically if necessary (ę = ae, etc.)
  - Send the list to a Latin lemmatizing service, e.g. the Morpheus lemmatizer offered by the Perseus Digital Library
  - Collect the words **not** recognized by the lemmatizer. They are interesting because they are either names or words unknown to the service (the current version of Morpheus does not recognize [[http://services-qa.projectbamboo.org/bsp/morphologyservice/analysis/word?word=accipiter&lang=lat&engine=morpheus|accipiter]]).
    - The unrecognized common words can be included in the service, or explored as possible neo-Latin contributions to vocabulary
    - The unrecognized names are considered prominently thematic, and worthy of further comparison: what do authors in CroALa have to say about Autolycus or Demophoont? Do they say the same things, or something different?
  - The list of unrecognized words is then sent to CroALa for automatic querying. A script notes the number of occurrences found
    - Obviously, there will always //be// some occurrences, until we decide to exclude Pannonius from the queried set through a [[http://www.ffzg.unizg.hr/klafil/croala/cgi-bin/search3t?dbname=croala&author=NOT+panonije|NOT panonije]] search
  - Query results are formatted into [[http://www.ffzg.unizg.hr/klafil/croala/xpr/2012-10-jp-unk.html|a third list]], with links to searches in CroALa and data on frequency in the collection. This list could also include links to other searches (in CAMENA / Termini, in Poeti d'Italia in lingua latina, in Google Books, in archive.org).
  - [[http://www.ffzg.unizg.hr/klafil/croala/xpr/2012-10-jp-unk100.html|A subset of this list, including only queries with fewer than 100 results in CroALa]], seems better suited for close reading experiments
  - A [[https://www.google.com/fusiontables/DataSource?docid=1YJCACjkiwYb5JD7KfmrnxPSv-OG2PC8EfOoae5Q|Google Fusion Table of this list]] enables not only sharing, but also (public) manipulation of data, such as sorting and filtering

===== A problem: regular expression homonyms =====

Because we are searching a corpus in an inflected language, and because orthography varies across texts in CroALa, we want to use regular expressions in searches. Fortunately, PhiloLogic offers not only regex, but also ways to control it.

There is a problem. A search for //d*// will find many "homonyms". And yet, when we try to stem words (to take only the roots as the query basis), //deus// will end up as //d*//.

In extreme cases, homonymy can be removed "by hand", but our hypothesis is that computers can serve for "triage", for finding interesting passages in a corpus. How can we make them do this effectively?

We had to experiment. Using the ability of regex to match //any single character// (through the operator ''.''), we built several searches for the same stem (followed by two, three, four, or any number of characters).
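The variants just described can be generated mechanically from a stem. What follows is a minimal sketch in Perl, not the script actually used: the helper name ''query_variants'' is hypothetical, and the choice of two to four trailing dots plus a final ''.*'' is an assumption based on the sample queries in this diary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not part of the original scripts):
# build the query variants for one stem, i.e. the stem followed by
# exactly two, three, or four arbitrary characters, and by any number.
sub query_variants {
    my ($stem) = @_;
    my @queries = map { $stem . ('.' x $_) } 2 .. 4;
    push @queries, $stem . '.*';
    return @queries;
}

# Print one query string per line for a sample stem.
print "$_\n" for query_variants('ETH?RUSC');
```

Each of the four strings would then be sent to CroALa as a separate search, so that the counts can be compared side by side.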
Here are the query strings, together with the number of occurrences found in CroALa:

<code>
"ETH?RUSC..","19"
"ETH?RUSC...",""
"ETH?RUSC....","2"
"ETH?RUSC.*","39"
</code>

The idea is that comparing these slightly different query strings will help us categorize the cases and find out what is useful where. A starting point for such comparisons can be seen here, as a table of names from Latin poems by Marko Marulić (Marcus Marulus): [[http://www.ffzg.unizg.hr/klafil/croala/xpr/marul-radices.html|X]].

See [[z:croala-regex-homonyms|the algorithm and the scripts]] for achieving everything described above.

====== Querying Pannonius' texts in CroALa with a set of words found elsewhere ======

  * Headings of Ravisius Textor's ([[http://www.uni-mannheim.de/mateo/camenaref/tixier.html|Jean Tixier's]]) Epitheta, presented as starting points for parallel searches in Pannonius and in CroALa (letter A): [[http://www.ffzg.unizg.hr/klafil/croala/xpr/tix-a-q.html|X]]

Note that Textor's headings with problematic words (very short ones, or n-gram phrases) are also included there.

===== The algorithm =====

  - Produce a list of words found or compiled elsewhere (Marulić's names, Ravisius Textor's headings and epitheta)
  - Prepare the list for orthographic variation:
    - stem it (keeping the original word as a CSV field)
    - treat separately stems of 1--3 characters, of 4--5, of 6--7, everything else, and PHRASES (recognizable by a space in the field)
  - Query all of CroALa with items from the list (we use a [[z:croala-query-all-bash|Bash script]])
  - Produce a report:
    - Keep the query and the number of occurrences, strip everything else
    - Join (Linux ''paste'') the two CSV fields ("QUERY","NUMBER OF OCCURRENCES") to the original CSV list
  - Query Pannonius' texts in CroALa with the list
  - Produce a report, as above
  - Paste it all together (in the order: original, Pannonius, CroALa total)
  - Transform CSV into HTML rows (with a [[z:croala-csv-rows|Perl script]])
  - Produce an HTML report with a report template (with a [[z:croala-report-template|Perl script]])

===== Problems and fine-tuning =====

We also produce a simple report on the ratio of the CroALa total to Pannonius' occurrences (n = CroALa / Pannonius). The lower the ratio, the more potentially interesting the result.

The report:

<code>
abyssus (ABIS?S.*), linea 11: 124
accipiter (AC?CIPITE.*), linea 16: 66
acerra (ACER?R.*), linea 18: 74
achaia (ACH?A.*), linea 22: 266
achates (ACH?AT.*), linea 23: 18
achelous (ACH?ELO.*), linea 24: 29
acheron (ACH?ERON.*), linea 25: 59
achiui (ACH?IU.*), linea 27: 12
ACIES (ACIE.), linea 28: 183
ACIES (ACIE*), linea 29: 163
acron (ACRON.*), linea 37: 3
actaeon (ACTA?O?EON.*), linea 38: 10
acus (ACU), linea 42: 14
adamas (ADAM.*), linea 44: 216
addua (ADDU.*), linea 46: 685
adonis (ADON.*), linea 49: 123
aduena (ADUEN.*), linea 53: 221
adulter (ADULTE.*), linea 56: 137
adytum (ADIT.*), linea 58: 238
aeacus (A?O?EAC.*), linea 59: 12
aedes (A?O?ED.*), linea 61: 119
aeetes (A?O?EET.*), linea 63: 7
aegeria (A?O?EGER.*), linea 65: 60
aegis (A?O?EG.*), linea 67: 58
aello (A?O?EL?L.*), linea 73: 235
aeneas (A?O?ENE.*), linea 74: 48
aeneis (A?O?ENE.*), linea 75: 48
aeolus (A?O?EOL.*), linea 77: 57
aequor (A?O?EQUO.*), linea 78: 49
aerumna (A?O?ERUMN.*), linea 82: 179
aeschylus (A?O?ESCH?IL.*), linea 83: 3
aestas (A?O?EST.*), linea 86: 170
AESTUS PRO CALORE (A?ESTU*), linea 87: 102
aether (A?O?ETH?E.*), linea 90: 82
aethiope (A?O?ETH?IOP.*), linea 91: 62
aethra (A?O?ETH?R.*), linea 94: 52
aetna (A?O?ETN.*), linea 95: 12
aeua (A?O?EU.*), linea 96: 17
aeuum (A?O?EU.*), linea 97: 17
afri (AFR*), linea 99: 198
africa (AFRIC.*), linea 100: 153
africus (AFRIC.*), linea 101: 153
agamemnon (AGAMEMNON.*), linea 102: 24
agenor (AGENO.*), linea 107: 4
ager (AGER), linea 108: 55
ager (AGR*), linea 109: 127
agger (AGGE.*), linea 110: 67
agna (AGN.), linea 112: 83
agna (AGN..), linea 113: 155
agnus (AGN..), linea 114: 155
agrestes (AGREST.*), linea 115: 130
agricola (AGRICOL.*), linea 116: 74
ahenum (AH?EN.*), linea 117: 45
ala (AL.), linea 120: 170
ala (AL..), linea 121: 86
alba (ALB..), linea 125: 214
alba (ALB.), linea 126: 27
alcides (ALCID.*), linea 132: 11
alcinous (ALCINO.*), linea 133: 6
alcmaeon (ALCMA?O?EON.*), linea 134: 6
alcmena (ALCMEN.*), linea 135: 5
alea (ALE), linea 137: 24
ales (AL..), linea 140: 86
ales (ALIT...), linea 141: 19
ales (ALIT.), linea 142: 17
alexander (ALEXANDE.*), linea 144: 245
alexandria (ALEXANDR.*), linea 145: 254
alga (ALG..), linea 147: 14
alloquium (AL?LOQUI.*), linea 150: 60
alnus (ALN*), linea 153: 20
aloe (ALOE*), linea 154: 12
aloeus (ALO?A?E.*), linea 155: 127
aloidae (ALOID.*), linea 156: 2
alpes (ALP*), linea 157: 71
alpheus (ALPH?E.*), linea 158: 44
altare (ALTAR.*), linea 159: 169
alueus (ALUE.*), linea 160: 58
aluus (ALU.), linea 162: 83
aluus (ALU..), linea 163: 71
amantes (AMANT.*), linea 165: 39
amaracus (AMARAC.*), linea 166: 5
amaranthus (AMARANTH?.*), linea 167: 12
amaror (AMARO.*), linea 168: 23
amator (AMATO.*), linea 171: 113
amazones (AMAZON.*), linea 172: 39
ambitio (AMBITI.*), linea 174: 102
ambrosia (AMBROS.*), linea 175: 26
amica (AMIC.*), linea 178: 76
amicitia (AMICIT.*), linea 179: 44
amicus (AMIC.*), linea 180: 76
amnis (AMN.), linea 182: 33
amnis (AMN..), linea 183: 55
amphion (AMPH?ION.*), linea 189: 11
amphitheatrum (AMPH?ITH?EATR.*), linea 191: 10
amplexus (AMPLEX.*), linea 195: 259
amyclae (AMICL.*), linea 198: 11
amygdalus (AMIGDAL.*), linea 199: 5
anas (ANAS), linea 201: 7
anchora (ANCH?OR.*), linea 207: 45
ancus (ANC..), linea 210: 14
andromeda (ANDROMED.*), linea 211: 19
anethum (ANETH?.*), linea 212: 4
angelus (ANGEL.*), linea 214: 735
anglia (ANGL.*), linea 215: 946
angli (ANGL.*), linea 216: 946
anguilla (ANGUIL?L.*), linea 218: 7
anguis (ANGU.*), linea 219: 66
angulus (ANGUL.*), linea 220: 244
anima (ANIM.*), linea 223: 193
animus (ANIM.*), linea 226: 193
annales (AN?NAL.*), linea 227: 149
ANNA soror Didus (ANNA?[AE]|ANN[AE][QNU]*), linea 228: 369
annus (AN?N.*), linea 233: 112
anser (ANSE.*), linea 236: 19
antenor (ANTENO.*), linea 237: 20
anteus (ANTE.*), linea 238: 83
antrum (ANTR.*), linea 244: 35
anus (AN), linea 246: 56
apelles (APEL?L.*), linea 247: 118
aper (APER), linea 248: 16
aper (APR..), linea 249: 44
aper (APR.), linea 250: 12
apes (APE.), linea 252: 30
apes (APU.), linea 253: 281
apex (APEX.*), linea 254: 72
apicius (APIC.*), linea 255: 92
apis (API.), linea 258: 60
apollo (APOL?L.*), linea 262: 20
apostoli (APOSTOL.*), linea 264: 310
aqua (AQU..), linea 268: 70
aqua (AQU.), linea 269: 142
aquila (AQUIL.*), linea 273: 374
aquilo (AQUIL.*), linea 274: 374
ara (AR..), linea 275: 48
ara (AR.), linea 276: 111
arabes (ARAB.*), linea 277: 41
ARAR SIUE ARARIS (ARAR.), linea 282: 21
arator (ARATO.*), linea 283: 30
aratrum (ARATR.*), linea 284: 86
aratus (ARAT.*), linea 285: 40
araxes (ARAX.*), linea 286: 12
arbor (ARBI.*), linea 287: 194
arbustum (ARBUST.*), linea 288: 36
arca (ARC..), linea 290: 71
arca (ARC.), linea 291: 71
arcades (ARCAD.*), linea 292: 116
arces (ARCE.), linea 293: 33
arces (ARCI...), linea 295: 26
arctos (ARCT.*), linea 299: 49
ARCUS COELESTIS (ARCU.+CO?A?EL*), linea 301: 1
ardor (ARDO.*), linea 302: 76
area (ARE.), linea 303: 143
area (ARE..), linea 304: 33
arena (AREN.*), linea 305: 53
argentum (ARGENT.*), linea 307: 268
argi (ARG.), linea 309: 16
argi (ARG..), linea 310: 16
argilla (ARGIL?L.*), linea 312: 15
arion (ARION.*), linea 324: 29
ARION EQUUS (ARION*), linea 325: 29
arista (ARIST.*), linea 327: 201
aristoteles (ARISTOTEL.*), linea 331: 343
arma (ARM.), linea 332: 50
arma (ARMI.), linea 333: 56
arma (ARMOR*), linea 334: 182
armenia (ARMEN.*), linea 335: 205
armentum (ARMENT.*), linea 337: 130
arnus (ARN.), linea 340: 12
arrius (AR?R.*), linea 345: 65
artemisia (ARTEMIS.*), linea 346: 12
arundo (ARUND.*), linea 352: 106
ARUNDO PRO IACULO (ARUND*), linea 353: 106
aruum (ARU..), linea 356: 44
aruum (ARU.), linea 357: 31
aruum (ARUI.), linea 358: 41
ascra (ASCR.*), linea 361: 119
asia (ASIA*), linea 363: 288
asinus (ASIN.*), linea 366: 136
aspectus (ASPECT.*), linea 369: 267
aspis (ASPID*), linea 372: 55
astra (ASTR.*), linea 377: 36
astraea (ASTRA?O?E.*), linea 378: 63
astus (AST.), linea 380: 34
astutia (ASTUT.*), linea 381: 28
athenae (ATH?EN.*), linea 388: 258
athesis (ATH?ES.*), linea 390: 10
athos (ATH?.*), linea 392: 142
atlas (ATH?LANT*), linea 395: 13
atlas (ATH?LAS*), linea 396: 36
atomi (ATOM.*), linea 397: 22
atria (ATRI.), linea 399: 208
atria (ATRI..), linea 400: 52
atridae (ATRID.*), linea 402: 21
atropos (ATROP.*), linea 403: 5
auceps (AUCEP.*), linea 409: 8
aucupium (AUCUPI.*), linea 410: 12
audacia (AUDAC.*), linea 411: 154
auena (AUEN.*), linea 413: 81
auernus (AUERN.*), linea 414: 40
augur (AUGU.*), linea 416: 221
augustus (AUGUST.*), linea 419: 200
aula (AUL..), linea 424: 233
aula (AUL.), linea 425: 78
aulaea (AULA?O?E.*), linea 427: 188
aulis (AULID*), linea 428: 3
aura (AUR.), linea 431: 76
aura (AUR..), linea 432: 76
aures (AURE.), linea 433: 67
aures (AURIB..), linea 435: 98
aurora (AUROR.*), linea 438: 254
ausonia (AUSON.*), linea 440: 9
ausonii (AUSONI.*), linea 441: 9
auspicium (AUSPICI.*), linea 443: 58
auster (AUSTE.*), linea 444: 68
ausus (AUS.), linea 445: 36
ausus (AUS....), linea 446: 55
ausus (AUS..), linea 447: 61
autolycus (AUTOLIC.*), linea 449: 1
autumnus (AUTUMN.*), linea 450: 21
auus (AU..), linea 451: 52
auus (AU.), linea 452: 257
axis (AXE), linea 455: 53
axis (AXI.), linea 456: 50
</code>

This is done via the following Perl code, applied to a CSV file:

<code perl>
#!/usr/bin/perl
# csv-2col.pl - report on columns and ratio col5/col3
use strict;
use warnings;
use Text::CSV;

my $file = shift or die 'filename!';
my $csv  = Text::CSV->new({ binary => 1 });

open(my $fh, '<', $file) or die "Could not open '$file' $!\n";
while (my $line = <$fh>) {
  chomp $line;
  if ($csv->parse($line)) {
    my @columns = $csv->fields();
    # skip rows whose Pannonius count (col3) is missing, empty, or zero;
    # the pattern is anchored, so counts such as 10 or 102 are kept
    if (defined $columns[3] && $columns[3] !~ /^0*$/) {
      # an empty CroALa count (col5) is treated as 0
      print $columns[0], " (", $columns[1], "), linea ", $., ": ",
        int(($columns[5] || 0) / $columns[3]), "\n";
    }
  } else {
    my $err = $csv->error_input;
    warn "Failed to parse line: $err\n";
  }
}
close($fh);
</code>
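
Because lower ratios are the more interesting ones, the report lines can be sorted by ratio before reading. A minimal sketch, assuming the line format printed by the script above (word, query in parentheses, ''linea'' number, ratio). The helper names ''parse_report_line'' and ''sort_by_ratio'' are hypothetical and not part of the original scripts; the three sample lines are taken from the report above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse one report line into a hash reference.
# Returns nothing (undef in scalar context) when the line does not
# have the assumed shape "word (QUERY), linea N: RATIO".
sub parse_report_line {
    my ($line) = @_;
    return unless $line =~ /^(.+) \((.*)\), linea (\d+): (\d+)$/;
    return { word => $1, query => $2, line => $3, ratio => $4 };
}

# Hypothetical helper: keep only well-formed lines and sort them by
# ascending ratio, breaking ties alphabetically by word.
sub sort_by_ratio {
    my @entries = grep { defined } map { parse_report_line($_) } @_;
    return sort { $a->{ratio} <=> $b->{ratio} or $a->{word} cmp $b->{word} } @entries;
}

my @sample = (
    'angelus (ANGEL.*), linea 214: 735',
    'autolycus (AUTOLIC.*), linea 449: 1',
    'acron (ACRON.*), linea 37: 3',
);

# Print ratio, word, and query, tab-separated, lowest ratio first.
printf "%d\t%s\t%s\n", $_->{ratio}, $_->{word}, $_->{query}
    for sort_by_ratio(@sample);
```

With these three lines, //autolycus// (ratio 1) comes first and //angelus// (ratio 735) last, which matches the reading order suggested by the ratio.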