Task: find out computationally which words in a Latin text are rare or strange.
Preprocess by excluding frequent words; several lists are available: Claude Pavur's (18653 wordforms), James H. Dee's database, and Anne Mahoney's 200 essential Latin words. There is also a list of words in which the -que ending is not a conjunction (among other useful things).
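A list of -que exceptions could be used during preprocessing roughly as follows. This is only a sketch: the helper name `strip_que`, the file name `que-exceptions.txt` and the four sample exceptions are assumptions made for illustration, not the contents of the list mentioned above.

```shell
#!/bin/bash
# strip_que (hypothetical helper): remove a final -que from each word read
# on stdin, unless the word is listed in the exceptions file given as $1.
strip_que() {
    awk -v exc="$1" '
        BEGIN { while ((getline line < exc) > 0) keep[line] = 1 }
        !($0 in keep) && length($0) > 3 && /que$/ { sub(/que$/, "") }
        { print }'
}

tmp=$(mktemp -d)
# Sample exceptions, one word per line (illustrative only).
printf '%s\n' atque neque quoque usque > "$tmp/que-exceptions.txt"

# Sample wordforms: two carry the enclitic, two are exceptions.
result=$(printf '%s\n' arma virumque atque populusque quoque \
    | strip_que "$tmp/que-exceptions.txt")
echo "$result"
```

Stripping the enclitic before the stopword comparison lets forms such as virumque match an entry for virum, while words like atque stay intact.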
Lemmatizing services: Archimedes (XML-RPC), LemLat, PrePro2010 (XML-API). All results require postprocessing, cleaning etc.
Compare list2 (lemmatized words) with list1 (primary wordlist) using comm; both lists must be sorted in the same order.
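A minimal sketch of such a comparison, assuming one word per line. The file names and the sample words are hypothetical; the sketch sorts both lists under `LC_ALL=C` first, because comm expects sorted input.

```shell
#!/bin/bash
# Hypothetical sample data, one word per line.
workdir=$(mktemp -d)
printf '%s\n' arma virum cano troiae > "$workdir/list1.txt"
printf '%s\n' cano arma > "$workdir/list2.txt"

# comm expects both inputs sorted in the same collation order.
export LC_ALL=C
sort -u "$workdir/list1.txt" > "$workdir/list1.sorted"
sort -u "$workdir/list2.txt" > "$workdir/list2.sorted"

# -2 suppresses lines unique to the second file, -3 suppresses common lines,
# so only the lines unique to list1 remain.
only_in_list1=$(comm -23 "$workdir/list1.sorted" "$workdir/list2.sorted")
echo "$only_in_list1"
```

With this sample data only troiae and virum are printed, since arma and cano occur in both lists.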
Our bash script, which serves as a wrapper and pre-processor for the Archimedes Project XML-RPC call, looks like this:
#!/bin/bash
# Jovanovic, 2011-10, lemmatizing words
# usage: ./comm-lemm.sh antconc-result-filename
# requires: vlist.sed, latstop.txt, rpc3.py, 11lemclean.sed, 11lemclean2.sed
# step 0: clean up an AntConc wordlist
# step 1: remove the frequent Latin words. Ensure the Unix format, remove whitespace (tr deletes the tabs).
tr -d '\011' < "$1" | sed -f vlist.sed - | sort - | comm -23 - latstop.txt > c"$1"
# vlist.sed holds cleaning commands
# latstop.txt holds frequent Latin words (sorted, as comm requires)
# let the result of step 1 become a 'file' variable
FILE=c"$1"
# step 2: send rarer words to lemmatizer
# clean up the results
# sort and save the lemmata
python rpc3.py "${FILE}" | sed -f 11lemclean.sed - | sort - | sed -f 11lemclean2.sed - > lem2"${FILE}"
# 11lemclean.sed holds cleaning commands for lemmatizer
# 11lemclean2.sed holds another set of cleaning commands
# keep only the forms which were lemmatized
# then comm -23: keep the lines found only in the first file; watch for spaces and tabs, which make identical words compare as different!
comm -23 "${FILE}" lem2"${FILE}" > r-"${FILE}"
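The warning about spaces and tabs can be addressed by normalizing both files before the final comparison. A minimal sketch, assuming one word per line; the helper name `normalize_list` and the sample files are hypothetical and not part of the script above.

```shell
#!/bin/bash
# normalize_list (hypothetical helper): strip carriage returns, tabs and
# spaces, drop empty lines, then sort so that comm sees comparable input.
normalize_list() {
    tr -d '\r\t ' < "$1" | sed '/^$/d' | LC_ALL=C sort -u
}

tmp=$(mktemp -d)
# Sample data: the same words, but with stray whitespace in the first file.
printf 'abdicare \n\tzelotypus\nbombax\r\n' > "$tmp/forms.txt"
printf 'abdicare\nbombax\n' > "$tmp/lemmatized.txt"

normalize_list "$tmp/forms.txt" > "$tmp/forms.norm"
normalize_list "$tmp/lemmatized.txt" > "$tmp/lemmatized.norm"

# Without normalization every line of forms.txt would count as unique.
leftover=$(LC_ALL=C comm -23 "$tmp/forms.norm" "$tmp/lemmatized.norm")
echo "$leftover"
```

Here only zelotypus is left over; without the normalization step the trailing space, the leading tab and the carriage return would make all three forms look different from their counterparts.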