Add morphological information to aligned text

An XQuery that joins two XML files: one with sentences aligned in the Alpheios online editor, and one with morphological information obtained from the Perseus Project's Morpheus parser via the SoSoL API.

Caveat: to conform to the Perseus treebank XML format, every punctuation mark in the sentence to be aligned should be separated from the preceding word by an added space.
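The spacing described in the caveat could be applied automatically before the text is pasted into the editor. The following is only a sketch: the function name local:space-punctuation is hypothetical, and the sample string is an illustration rather than text from the aligned sentence.

```xquery
xquery version "3.0";

(: Hypothetical helper, not part of Alpheios or Perseus tooling:
   insert one space before a punctuation character that directly
   follows a non-space character. :)
declare function local:space-punctuation($s as xs:string) as xs:string {
  replace($s, "(\S)(\p{P})", "$1 $2")
};

(: returns "Veni , vidi , vici ." :)
local:space-punctuation("Veni, vidi, vici.")
```

Note that \p{P} also matches hyphens and apostrophes, and that in a run of consecutive punctuation marks only the first is separated; a real text may need a narrower character class or a second pass.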

Input

We need two files: an XML file with morphological information (treebank.xml) and an XML file with aligned texts (alignment.xml). In practice you will have to give the precise local path of each file (e.g. /home/user1/treebank.xml).
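One way to avoid hard-coding the paths is to declare them as external variables with default values (XQuery 3.0). This is a sketch under assumptions: the variable names are hypothetical, and how a value is bound from outside, and how a relative file name is resolved, depends on the XQuery processor you use.

```xquery
xquery version "3.0";

(: Hypothetical variable names; the defaults are the file names used in this README. :)
declare variable $alignment-uri as xs:string external := "alignment.xml";
declare variable $treebank-uri  as xs:string external := "treebank.xml";

(: Count the words in each input as a quick sanity check before joining. :)
<inputs>
  <alignment uri="{$alignment-uri}" words="{count(doc($alignment-uri)//*:wds[1]/*:w)}"/>
  <treebank uri="{$treebank-uri}" words="{count(doc($treebank-uri)//*:sentence/*:word)}"/>
</inputs>
```

If the two counts differ, the files do not tokenize the text in the same way, and a positional join will pair the wrong words.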

Example of alignment file:

<aligned-text xmlns="http://alpheios.net/namespaces/aligned-text">
    <language lnum="L1" xml:lang="lat"/>
    <language lnum="L2" xml:lang="hrv"/>
    <sentence>
        <wds lnum="L1">
            <w n="1-1">
                <text>Dum</text>
                <refs nrefs="1-1"/>
            </w>
            <w n="1-2">
                <text>paucos</text>
                <refs nrefs="1-8"/>
            </w>
            <w n="1-3">
                <text>dies</text>
                <refs nrefs="1-9"/>
            </w>
            <w n="1-4">
                <text>ad</text>
                <refs nrefs="1-5 1-6"/>
            </w>
            <w n="1-5">
                <text>Vesontionem</text>
                <refs nrefs="1-7"/>
            </w>
            (...)
        </wds>
    </sentence>
</aligned-text>

Example of treebank file:

<?xml version="1.0" encoding="UTF-8"?>
<treebank xmlns="http://nlp.perseus.tufts.edu/syntax/treebank/1.5" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:cnt="http://www.w3.org/2008/content#" xmlns:oac="http://www.openannotation.org/ns/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xml:lang="lat" xsi:schemaLocation="http://nlp.perseus.tufts.edu/syntax/treebank/1.5 treebank-1.5.xsd" version="1.5" format="aldt">
  <date/>
  <annotator>
    <short/>
    <name/>
    <address/>
  </annotator>
  <sentence id="1" document_id="urn:cts:latinLit:phi0448.phi001.perseus-lat1:1.39.1" subdoc="" span="">
    <word id="1" form="Dum" head="0" relation="nil" lemma="dum" postag="c--------"/>
    <word id="2" form="paucos" head="0" relation="nil" lemma="pauci" postag="n-p---ma-"/>
    <word id="3" form="dies" head="0" relation="nil" lemma="dies" postag="n-p---ma-"/>
    <word id="4" form="ad" head="0" relation="nil" lemma="ad" postag="r--------"/>
    <word id="5" form="Vesontionem" postag="----------" head="0" lemma="" relation="nil"/>
    (...)
  </sentence>
</treebank>

The XQuery

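Luckily, the XQuery is quite simple. Stack Overflow helped, as usual. The query pairs words purely by position: the n-th word of the first wds element in the alignment file is matched with the n-th word of the treebank sentence, so both files must tokenize the sentence identically (hence the caveat about punctuation above).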

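(: add morphological information to aligned texts :)
<wds lnum="L1">
   {
     (: $i: each word of the first wds element (language L1) in the alignment file :)
     for $i in doc("alignment.xml")//*:wds[1]/*:w,
         (: $p: the treebank word at the same position as $i :)
         $p in doc("treebank.xml")//*:sentence/*:word[(count($i/preceding-sibling::*:w) + 1)]
     return
           (: copy n, text and refs from the alignment, then append the treebank word :)
           element w {
             attribute n {$i/@n} ,
             $i/*:text,
             $i/*:refs,
             $p
           }
   }
</wds>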
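The query above is written for a single sentence: with several sentences in treebank.xml, the predicate on word would select the n-th word of every sentence. A possible variant pairs sentences by position first and then words within each pair. It is a sketch that assumes both files list the sentences in the same order; the check attribute is a hypothetical addition, not part of either format.

```xquery
xquery version "3.0";

let $alignment := doc("alignment.xml")
let $treebank  := doc("treebank.xml")
(: pair the n-th aligned sentence with the n-th treebank sentence :)
for $s at $sn in $alignment//*:sentence
let $tb := ($treebank//*:sentence)[$sn]
return
  <wds lnum="L1">
  {
    for $w at $wn in $s/*:wds[1]/*:w
    let $word := $tb/*:word[$wn]
    return
      element w {
        attribute n { $w/@n },
        (: hypothetical flag: mark pairs whose surface forms differ :)
        if (string($w/*:text) = string($word/@form))
        then ()
        else attribute check { "form-mismatch" },
        $w/*:text,
        $w/*:refs,
        $word
      }
  }
  </wds>
```

Searching the output for check="form-mismatch" shows where the tokenization of the two files diverges, which is usually the first pair after a punctuation mark that was not separated by a space.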

The result

An excerpt of the output:

<wds lnum="L1">
  <w n="1-1">
    <text xmlns="http://alpheios.net/namespaces/aligned-text">Dum</text>
    <refs xmlns="http://alpheios.net/namespaces/aligned-text" nrefs="1-1"/>
    <word xmlns="http://nlp.perseus.tufts.edu/syntax/treebank/1.5" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:oac="http://www.openannotation.org/ns/" xmlns:cnt="http://www.w3.org/2008/content#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="1" form="Dum" head="0" relation="nil" lemma="dum" postag="c--------"/>
  </w>
  <w n="1-2">
    <text xmlns="http://alpheios.net/namespaces/aligned-text">paucos</text>
    <refs xmlns="http://alpheios.net/namespaces/aligned-text" nrefs="1-8"/>
    <word xmlns="http://nlp.perseus.tufts.edu/syntax/treebank/1.5" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:oac="http://www.openannotation.org/ns/" xmlns:cnt="http://www.w3.org/2008/content#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="2" form="paucos" head="0" relation="nil" lemma="pauci" postag="n-p---ma-"/>
  </w>
  <w n="1-3">
    <text xmlns="http://alpheios.net/namespaces/aligned-text">dies</text>
    <refs xmlns="http://alpheios.net/namespaces/aligned-text" nrefs="1-9"/>
    <word xmlns="http://nlp.perseus.tufts.edu/syntax/treebank/1.5" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:oac="http://www.openannotation.org/ns/" xmlns:cnt="http://www.w3.org/2008/content#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="3" form="dies" head="0" relation="nil" lemma="dies" postag="n-p---ma-"/>
  </w>
  <w n="1-4">
    <text xmlns="http://alpheios.net/namespaces/aligned-text">ad</text>
    <refs xmlns="http://alpheios.net/namespaces/aligned-text" nrefs="1-5 1-6"/>
    <word xmlns="http://nlp.perseus.tufts.edu/syntax/treebank/1.5" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:oac="http://www.openannotation.org/ns/" xmlns:cnt="http://www.w3.org/2008/content#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" id="4" form="ad" head="0" relation="nil" lemma="ad" postag="r--------"/>
  </w>
  (...)
</wds>

Discussion

A morphologically enriched aligned XML text lets us construct, for example, a Moodle question in which students check whether the automatic parsing is correct (with the translation there to help them). It is also a step towards Moodle cloze exercises in which the lemma is given and the student has to supply the form that fits the sentence.
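As a rough illustration of the cloze idea, the treebank words can be turned into text in Moodle's embedded-answers (Cloze) syntax, where a gap is written as {1:SHORTANSWER:=answer}. This is only a sketch: it makes a gap of every word that has a lemma, which a real exercise would not do, and it does not escape characters that have a special meaning inside a Cloze answer.

```xquery
xquery version "3.0";

(: Build one line of Cloze text from the first treebank sentence:
   a gap followed by the lemma in parentheses, or the bare form
   when the parser supplied no lemma. :)
string-join(
  for $word in (doc("treebank.xml")//*:sentence)[1]/*:word
  return
    if ($word/@lemma != "")
    then concat("{1:SHORTANSWER:=", $word/@form, "} (", $word/@lemma, ")")
    else string($word/@form),
  " "
)
```

The resulting string can be pasted into the question text of an embedded-answers question; the translation from the alignment file could be added next to it by hand or by a further query.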

We'll also think about using the part of speech information (postag) from the treebank.
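A first step in that direction could be to read the part of speech from the first character of postag. The sketch below uses a hypothetical function, local:pos, and maps only a handful of codes, including the three that occur in the sample above (c, n, r); anything else, such as the unparsed Vesontionem, is reported as unknown.

```xquery
xquery version "3.0";

(: Hypothetical helper: translate the first postag character into a
   readable label. Only a few codes are covered. :)
declare function local:pos($postag as xs:string?) as xs:string {
  switch (substring($postag, 1, 1))
    case "n" return "noun"
    case "v" return "verb"
    case "a" return "adjective"
    case "c" return "conjunction"
    case "r" return "preposition"
    default return "unknown"
};

for $word in doc("treebank.xml")//*:sentence/*:word
return <pos form="{$word/@form}">{ local:pos($word/@postag) }</pos>
```

Because the labels come straight from the automatic parse, they inherit its mistakes: in the sample, paucos is tagged n and would therefore be labelled a noun.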

And, of course, a fully treebanked file (with dependencies marked) can be used as well; it will enable us to combine syntax, words, translation and grammatical information in exercises.