Diarium Latinitatis

Task: find out computationally which words in a Latin text are rare or strange.

Phases

prepare a wordlist (all words in the text)
use an existing lemmatizing service to process the wordlist
compare list of lemmatized words with the primary wordlist
check the list of differences (example: a list of strange words from Ludovik Crijević Tuberon, Commentarii de temporibus suis)
check the strange words in other (nonclassical) dictionaries
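
The first phase can be sketched in a few lines of shell. This is only an illustration, not the procedure used here (the script below starts from an AntConc wordlist); `make_wordlist` is a hypothetical helper name, and the sketch assumes a plain-text input with ASCII letters only.

```shell
#!/bin/bash
# Sketch: build a sorted, de-duplicated wordlist from a plain-text file.
# make_wordlist is a hypothetical helper, not part of the original toolchain.
make_wordlist() {
    # turn every run of non-letters into a single newline: one word per line
    tr -cs '[:alpha:]' '\n' < "$1" |
    # fold to lowercase so that "Arma" and "arma" count as one wordform
    tr '[:upper:]' '[:lower:]' |
    # drop any empty line left at the start of the stream
    sed '/^$/d' |
    # sort and remove duplicates; the C locale keeps the order predictable for comm
    LC_ALL=C sort -u
}

# small demonstration on a throwaway file
sample=$(mktemp)
printf 'Arma virumque cano, Troiae qui primus ab oris\narma cano\n' > "$sample"
make_wordlist "$sample"
rm -f "$sample"
```

Sorting in the C locale is a design choice: every later `comm` step then has to use the same locale, otherwise the comparison may silently misalign.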

Preprocess by excluding frequent words; several lists exist: Claude Pavur's (18653 wordforms), James H. Dee's database, and Anne Mahoney's 200 essential Latin words. A list of words in which the -que ending is not a conjunction is also available (among other useful things).
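
One possible use of the -que list is to strip the enclitic from every wordform except those listed as exceptions. The sketch below assumes this use (the page itself does not say how the list is applied); `strip_que` and both file arguments are hypothetical, with one word per line in each file.

```shell
#!/bin/bash
# Sketch: strip enclitic -que from wordforms, except for listed exceptions.
# usage: strip_que EXCEPTIONS WORDLIST   (hypothetical helper and file names)
strip_que() {
    awk -v ex="$1" '
        # load the exceptions (forms where -que is not a conjunction)
        BEGIN { while ((getline line < ex) > 0) keep[line] = 1 }
        # exceptions and words not ending in -que pass through unchanged
        ($0 in keep) || !/.que$/ { print; next }
        # otherwise drop the enclitic
        { sub(/que$/, ""); print }
    ' "$2"
}

# small demonstration on throwaway files
ex=$(mktemp); wl=$(mktemp)
printf 'atque\nquoque\nusque\n' > "$ex"
printf 'aqua\natque\npopulusque\nquoque\nvirumque\n' > "$wl"
strip_que "$ex" "$wl"
rm -f "$ex" "$wl"
```

The output is no longer guaranteed to be sorted or free of duplicates, so it would need another `sort -u` before any `comm` step.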

Lemmatizing services: Archimedes (XML-RPC), LemLat, PrePro2010 (XML-API). All results require postprocessing, cleaning etc.

Compare list2 (lemmatized words) with list1 (primary wordlist) using comm, which requires both lists to be sorted in the same order.
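
A minimal sketch of the comparison, assuming list2 holds the wordforms the lemmatizer recognized, one per line. `only_in_first` is a hypothetical helper name; it sorts both inputs itself so that `comm` sees them in the same order.

```shell
#!/bin/bash
# Sketch: print the lines of the first list that are absent from the second.
# only_in_first is a hypothetical helper name.
only_in_first() {
    local a b
    a=$(mktemp)
    b=$(mktemp)
    # comm needs both inputs sorted in the same collation order
    LC_ALL=C sort -u "$1" > "$a"
    LC_ALL=C sort -u "$2" > "$b"
    # -2 drops lines unique to the second file, -3 drops lines common to both
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}

# small demonstration on throwaway files
list1=$(mktemp); list2=$(mktemp)
printf 'cano\nabbas\narma\nbassa\n' > "$list1"
printf 'cano\narma\n' > "$list2"
only_in_first "$list1" "$list2"
rm -f "$list1" "$list2"
```

Because `comm` compares whole lines, a trailing space or tab on either side is enough to make two otherwise identical words count as different.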

Bash script wrapper

Our bash script, which serves as a wrapper and pre-processor for the Archimedes Project XML-RPC call, looks like this:

#!/bin/bash
# Jovanovic, 2011-10, lemmatizing words
# usage: ./comm-lemm.sh antconc-result-filename
# requires: vlist.sed, latstop.txt, rpc3.py, 11lemclean.sed, 11lemclean2.sed
# step 0: clean up an AntConc wordlist
# step 1: remove the frequent Latin words. Ensure the unix format, remove whitespaces.
tr -d '\011' < "$1" | sed -f vlist.sed - | sort - | comm -23 - latstop.txt > c"$1"
# vlist.sed holds cleaning commands
# latstop.txt holds frequent Latin words (sorted, one per line, as comm requires)

# let the result of step 1 become a 'file' variable
FILE=c"$1"
# step 2: send rarer words to lemmatizer
# clean up the results
# sort and save the lemmata
python rpc3.py "${FILE}" | sed -f 11lemclean.sed - | sort - | sed -f 11lemclean2.sed - > lem2"${FILE}"
# 11lemclean.sed holds cleaning commands for lemmatizer
# 11lemclean2.sed holds another set of cleaning commands

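# lem2FILE should now hold only the forms which were lemmatized;
# comm -23 prints the lines found only in FILE, i.e. the forms left unlemmatized.
# Watch for stray spaces and tabs: comm matches whole lines exactly!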
comm -23 "${FILE}" lem2"${FILE}" > r-"${FILE}"
