Lectio quinta: CroALa X-ray

By now you may be curious about what CroALa is made of. It consists of two main parts: the set of texts, and the system for indexing and searching them. The two parts are relatively independent, like partners in a relationship: if you break up, it's not the end of the world, you can find somebody else (OK, let's not take the metaphor any further).

We'll explain the texts first, then the system that processes them.

The texts

“The texts are coded in TEI XML.” This actually means that you see this:

Manus elevata

Imbrium satura terra
Inebriatus titubat orbis
in abscondito rerum rumore

Os signatum fons
Manus elevata pons

but underneath is this:

<?xml version="1.0" encoding="utf-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <fileDesc xml:id="golub-i-poemata">
            <titleStmt>
                <title>Poemata tria, versio electronica</title>
                <author key="golub01">
                    <name>Golub, Ivan</name>
                    <date when="1930">n. 1930</date>
                </author>
                <editor>
                    <name>Neven Jovanović</name>
                </editor>
                <respStmt>
                    <resp>Hanc editionem electronicam curavit</resp>
                    <name>Neven Jovanović</name>
                </respStmt>
            </titleStmt>
            <editionStmt>
                <edition>Prema tiskanom izdanju (1997).</edition>
            </editionStmt>
            <extent ana="A">Mg:A 78 verborum, 20 versus</extent>
            <publicationStmt>
                <p>elektronska verzija: Digitalizacija hrvatskih latinista, znanstveni
                    projekt na Filozofskom fakultetu Sveučilišta u Zagrebu,
                    Hrvatska.  <date when="2012-07">Srpnja 2012</date></p>
            </publicationStmt>
            <sourceDesc>
                <bibl>Golub, Ivan: Ultima solitudo personae / Lice osame. Zagreb : Ceres, 1997 (Biblioteka Salona, knj. 7)  (prvo objavljivanje 1984).</bibl>
            </sourceDesc>
        </fileDesc>
            
        <profileDesc>
            <creation>
                <date when="1984">1984</date>
            </creation>
            <textClass>
                <keywords scheme="biblio/croala-catalogus-aetatum.xml#typ01">
                    <term>poesis</term>
                </keywords>
                <keywords scheme="biblio/croala-catalogus-aetatum.xml#aet01">
                    <term>Latinitas novissima (1800-2000)</term>
                    <term>Saeculum 20 (1901-2000)</term>
                    <term>1951-2000</term>
                </keywords>
                <keywords scheme="biblio/croala-catalogus-aetatum.xml#gen01">
                    <term>poesis - poema</term>
                </keywords>
            </textClass>
            
        </profileDesc>
        <revisionDesc>
            <change>
                <date>2012-07-22</date>
                <name>Neven Jovanović</name>
                Versio prima.
            </change>
           
        </revisionDesc>
    </teiHeader>
    <text>
        <body>
            <div type="poesis-poema">
                <head>Manus elevata</head>
                <lg>
                    <l>Imbrium satura terra</l>
                    <l>Inebriatus titubat orbis</l>
                    <l>in abscondito rerum rumore</l>
                </lg>
                <lg>
                    <l>Os signatum fons</l>
                    <l>Manus elevata pons</l>
                </lg>
            </div>
        </body>
    </text>
</TEI>

What we know (all the literary conventions we have acquired by being exposed to culture throughout our lives) is told to computers by means of encoding, which marks beginnings and ends. div means “here starts (or ends) a whole”; head means “here starts (or ends) the title of a whole”; lg means “here starts (or ends) a group of lines of verse”, and l means “here starts (or ends) a line of verse”. And so on. The data that enable us to search CroALa by author, title of work, period, genre etc. are found in the header of a TEI XML document, between <teiHeader> and </teiHeader>.
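To see how a program can make use of these marked beginnings and ends, here is a minimal sketch in Python, using only the standard library module xml.etree.ElementTree. It is not how CroALa or PhiloLogic read the files; it only illustrates that a few lines of code can pull the title, the author, the genre terms and the stanzas out of the encoding. The sample is a shortened copy of the document above, and the name read_poem is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

# All TEI elements live in this namespace (declared on the root <TEI> element).
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# A shortened copy of the TEI document shown above.
SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc xml:id="golub-i-poemata">
      <titleStmt>
        <title>Poemata tria, versio electronica</title>
        <author key="golub01"><name>Golub, Ivan</name></author>
      </titleStmt>
    </fileDesc>
    <profileDesc>
      <textClass>
        <keywords scheme="biblio/croala-catalogus-aetatum.xml#gen01">
          <term>poesis - poema</term>
        </keywords>
      </textClass>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="poesis-poema">
        <head>Manus elevata</head>
        <lg><l>Imbrium satura terra</l>
        <l>Inebriatus titubat orbis</l>
        <l>in abscondito rerum rumore</l></lg>
        <lg><l>Os signatum fons</l>
        <l>Manus elevata pons</l></lg>
      </div>
    </body>
  </text>
</TEI>
"""


def read_poem(xml_text):
    """Hypothetical helper: collect header metadata and verse structure."""
    root = ET.fromstring(xml_text)

    # Metadata comes from the header, between <teiHeader> and </teiHeader>.
    header = root.find("tei:teiHeader", TEI_NS)
    title = header.findtext(".//tei:titleStmt/tei:title", namespaces=TEI_NS)
    author = header.findtext(".//tei:author/tei:name", namespaces=TEI_NS)
    terms = [t.text for t in header.findall(".//tei:keywords/tei:term", TEI_NS)]

    # The poem itself: one <div>, its <head>, and its groups of lines (<lg>).
    div = root.find(".//tei:body/tei:div", TEI_NS)
    head = div.findtext("tei:head", namespaces=TEI_NS)
    stanzas = [
        [line.text for line in lg.findall("tei:l", TEI_NS)]
        for lg in div.findall("tei:lg", TEI_NS)
    ]
    return {"title": title, "author": author, "terms": terms,
            "head": head, "stanzas": stanzas}


poem = read_poem(SAMPLE)
print(poem["author"], "/", poem["title"])
print(poem["head"])
for stanza in poem["stanzas"]:
    print()
    for line in stanza:
        print(line)
```

The last few lines print the poem roughly as a reader sees it: the title first, then the stanzas separated by blank lines.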

The TEI XML files can be used outside PhiloLogic (the system described in the next section): we can take them out of the system and put others in, or we can publish the files elsewhere, through a system that will use them in different ways. They do not depend on a particular piece of software; they depend on a standard. TEI stands for the Text Encoding Initiative, which maintains a standard for the representation of texts in digital form, developed chiefly for texts studied and used in the humanities, social sciences and linguistics.

You'll also notice that TEI XML texts are not only machine-readable, they are also (almost) readable by humans.

The system

The system used for searching and displaying the texts and their parts is called PhiloLogic. It is developed at the University of Chicago by the ARTFL Project and the Digital Library Development Center. It is also open source, which means not only that anybody can use it for their own purposes, but also that anybody can contribute to it: improve it, create extensions for it etc.

PhiloLogic takes a set of texts, in TEI XML or even in “plain text” (.txt) form, and indexes them; afterwards it uses the indexes to help us find what we want. By default it is programmed to look for the most obvious things (such as author, editor, title, year of creation, year of publication) and to treat certain encoding in a certain way (for example, to present everything between <p> and </p> as a paragraph of text), but all of this can be changed according to our needs. You have to learn how the system functions and where to configure it (and you have to learn at least the basics of Perl and of the Linux command line), and you have to experiment and expect some frustration, but it can be done. Experto credite. Also, the developers are helpful when you have a question.
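The general idea of indexing can be shown with a toy example. The sketch below, in Python, is a deliberate simplification and says nothing about how PhiloLogic is actually implemented: it builds a small inverted index that maps each word to the places where it occurs, and then answers a search by looking the word up instead of rereading the texts. The names LINES, build_index and search, and the use of (document id, line number) pairs as locations, are assumptions made for this illustration.

```python
import re
from collections import defaultdict

# Hypothetical mini-corpus: (document id, line number) -> text of a verse line.
LINES = {
    ("golub-i-poemata", 1): "Imbrium satura terra",
    ("golub-i-poemata", 2): "Inebriatus titubat orbis",
    ("golub-i-poemata", 3): "in abscondito rerum rumore",
    ("golub-i-poemata", 4): "Os signatum fons",
    ("golub-i-poemata", 5): "Manus elevata pons",
}


def build_index(lines):
    """Map every lowercased word to the list of locations where it occurs."""
    index = defaultdict(list)
    for location, text in sorted(lines.items()):
        for word in re.findall(r"\w+", text.lower()):
            index[word].append(location)
    return index


def search(index, word):
    """Return the locations of a word; the lookup ignores letter case."""
    return index.get(word.lower(), [])


index = build_index(LINES)
print(search(index, "Manus"))
print(search(index, "caelum"))
```

The texts are read once, when the index is built; every later search is a single dictionary lookup. A real system also has to deal with word forms, metadata and much larger collections, which is where the configuration work comes in.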

The important thing here is that we can take “our” (let's say, Croatian Latin) texts out of the system, put in other texts encoded according to our rules 1), create a new database, and everything will, magically, work, even without configuration.

1) Sorry, the rules are in Croatian, at least for the time being.
 
z/croala-x-ray.txt · Last modified: 06. 09. 2012. 18:24 by njovanov
 