Diarium Latinitatis

Task: find out computationally which words in a Latin text are rare or strange.

Phases

  1. prepare a wordlist (all word forms in the text)
  2. process the wordlist with an existing lemmatizing service
  3. compare the list of lemmatized words with the primary wordlist
  4. look up the strange words in other (non-classical) dictionaries
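
Phase 1 could be sketched with standard Unix tools as follows. The function name make_wordlist is hypothetical, and the sketch assumes plain ASCII text (no macrons, ligatures or other non-ASCII letters); the wrapper script further down starts from an AntConc wordlist instead.

```shell
# Hypothetical helper: read a text on stdin, print its unique word forms,
# lowercased, one per line.
# Assumption: ASCII-only Latin; LC_ALL=C keeps tr and sort byte-oriented.
make_wordlist() {
    LC_ALL=C tr -cs '[:alpha:]' '\n' |
        LC_ALL=C tr '[:upper:]' '[:lower:]' |
        sed '/^$/d' |
        LC_ALL=C sort -u
}

# usage: make_wordlist < text.txt > wordlist.txt
```

Sorting at this stage matters because comm, used in the later phases, expects sorted input.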

Tools

Preprocess by excluding frequent words. Several lists of them exist: Claude Pavur's (18653 wordforms), James H. Dee's database, and Anne Mahoney's 200 essential Latin words. There is also a list of words in which the -que ending is not a conjunction (among other useful things).
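
A list of -que exceptions could be applied roughly like this. This is only a sketch: the function name strip_que and both file arguments are hypothetical, and it assumes one lowercase word per line in each file.

```shell
# Hypothetical helper: strip the enclitic -que from word forms, except for
# words on an exception list (forms such as quoque or usque).
# $1 = exception list, one word per line (assumed to be non-empty)
# $2 = wordlist, one word form per line
strip_que() {
    awk '
        # first file: remember the exceptions
        NR == FNR { keep[$0] = 1; next }
        # second file: cut -que unless the form is an exception
        /que$/ && !($0 in keep) && length($0) > 3 { sub(/que$/, "") }
        { print }
    ' "$1" "$2" | LC_ALL=C sort -u
}

# usage: strip_que que-exceptions.txt wordlist.txt > wordlist-noque.txt
```

The output is re-sorted because stripping the ending can change the order and create duplicates (virumque becomes virum, which may already be in the list).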

Lemmatizing services: Archimedes (XML-RPC), LemLat, PrePro2010 (XML-API). The results of all of them require postprocessing and cleaning.

Compare list2 (lemmatized words) with list1 (primary wordlist): the comm utility, which requires both lists to be sorted in the same order.
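
The comparison itself might look like the sketch below. The function name only_in_first is hypothetical; the sketch re-sorts both lists under LC_ALL=C on the assumption that they may have been sorted under different locales.

```shell
# Hypothetical helper: print the lines found in list1 ($1) but not in list2 ($2).
# comm needs both inputs sorted in the same collation order, so both files
# are re-sorted into temporary copies first.
only_in_first() {
    tmp=$(mktemp -d) || return 1
    LC_ALL=C sort -u "$1" > "$tmp/a"
    LC_ALL=C sort -u "$2" > "$tmp/b"
    # -2 suppresses lines unique to the second file, -3 the common lines
    LC_ALL=C comm -23 "$tmp/a" "$tmp/b"
    rm -rf "$tmp"
}

# usage: only_in_first list1.txt list2.txt > only-in-list1.txt
```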

Bash script wrapper

Our bash script, which serves as a wrapper and pre-processor for the Archimedes Project XML-RPC call, looks like this:

#!/bin/bash
# Jovanovic, 2011-10, lemmatizing words
# usage: ./comm-lemm.sh antconc-result-filename
# requires: vlist.sed, latstop.txt, rpc3.py, 11lemclean.sed, 11lemclean2.sed
# step 0: clean up an AntConc wordlist (Unix format, no stray whitespace; tr deletes the tabs)
# step 1: sort the list, then remove the frequent Latin words with comm
tr -d '\011' < "$1" | sed -f vlist.sed - | sort - | comm -23 - latstop.txt > c"$1"
# vlist.sed holds the cleaning commands
# latstop.txt holds frequent Latin words; comm expects it to be sorted

# store the name of the step 1 output file in the FILE variable
FILE=c"$1"
# step 2: send rarer words to lemmatizer
# clean up the results
# sort and save the lemmata
python rpc3.py "${FILE}" | sed -f 11lemclean.sed - | sort - | sed -f 11lemclean2.sed - > lem2"${FILE}"
# 11lemclean.sed holds cleaning commands for lemmatizer
# 11lemclean2.sed holds another set of cleaning commands

# keep only the forms which were lemmatized
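# then comm -23 (prints the lines found only in the first file);
# watch for stray spaces and tabs, which make identical words compare as different!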
comm -23 "${FILE}" lem2"${FILE}" > r-"${FILE}"
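
The warning about spaces and tabs can be addressed by normalizing both lists before the final comm. A minimal sketch follows; the function name normalize_list is hypothetical and is not one of the sed scripts listed above.

```shell
# Hypothetical helper: normalize a wordlist read from stdin so that
# invisible characters cannot make identical words differ for comm.
# It drops carriage returns, trims leading and trailing whitespace,
# removes empty lines, and sorts the result byte-wise.
normalize_list() {
    tr -d '\r' |
        sed 's/^[[:space:]]*//; s/[[:space:]]*$//' |
        sed '/^$/d' |
        LC_ALL=C sort -u
}

# usage: normalize_list < lem2cwordlist.txt > lem2cwordlist.clean.txt
```

Running both inputs through the same normalization also guarantees that they share one sort order.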

 
z/diariumlatinitatis.txt · Last modified: 01. 11. 2011. 23:23 by njovanov
 