

INSTITUTE OF LINGUISTICS
Faculty of Philosophy, University of Zagreb
Ivana Lučića 3, 10000 Zagreb, CROATIA
tel. (+385 1) 6120-011, 6120-142; fax (+385 1) 6156-879
e-mail: zzl@ffzg.hr
CROATIAN NATIONAL CORPUS
The Institute of Linguistics is one of the organizational units
of the Faculty of Philosophy at the University of Zagreb, Croatia.
The Institute is the center of most of the linguistically oriented
projects at the Faculty of Philosophy.
This document covers several topics on the Institute of
Linguistics:
- History
- Aims
- Activities (projects, publications)
- Staff
HISTORY
The Institute of Linguistics was founded in 1960 upon the
suggestion of five distinguished professors of the Faculty of
Philosophy (Mirko Deanović, Rudolf Filipović, Vladimir Gortan,
Josip Hamm and Vojmir Vinja). The initial purpose of the
institution was to organize linguistic research at the Faculty
more efficiently. The first director of the
Institute was Professor Ljudevit Jonke (1960-1963), followed by
Professor Rudolf Filipović (1963-1983), Professor Milan Moguš (1983-1992),
Dr. Maja Bratanić (1992-1994) and Dr. Vesna Muhvić-Dimanovski (1994-).
AIMS
The general aims of the Institute, defined at its founding,
have been:
- the development of linguistics and linguistic research
methodology in the field of Croatian and other languages;
- research into phenomena in different fields of general
and comparative linguistics of Indo-European and non-Indo-European
languages;
- the investigation of correspondences between Croatian and
other languages.
ACTIVITIES
Projects
In the 1960s the main activity of the Institute was
predominantly oriented towards Croatian in contact with other
languages. In the 1970s the project of compiling
Croatian language corpora started. The projects (concluded and
current) can be divided into three areas:
- Computer processing of the Croatian language
- Contrastive projects (Croatian vs. other, primarily
European, languages)
- Current projects (funded by the Ministry
of Science and Technology of the Republic of Croatia)
1. Computer processing of the Croatian language
Two areas of computational linguistics have been covered by
projects:
- Corpus compiling and processing:
A 1-million-token corpus of running text in Croatian (M-corpus)
covering the 1937-1978 period, compiled by Prof. Milan
Moguš (project MZT RH 6-03-048, which ended in 1996).
The corpus is divided into 5 sub-corpora (prose, poetry,
drama, secondary school textbooks, newspapers; 200,000
tokens each).
The processing includes:
- alphabetic dictionary of types and lemmas
- frequency dictionary of types and lemmas
- concordances (KWIC) lemmatized as well as non-lemmatized
The Croatian national corpus, composed of two components: a
30-million corpus of contemporary Croatian language (30m) and the
Croatian Electronic Text Archive (HETA) (see current project
130718 below).
The Croatian-Slovene parallel corpus (see current project 130821
below).
Croatian participation in the ELAN project (see current project
130729 below).
For a survey of corpus projects in Croatia and at the Institute,
see:
- Marko Tadić, »Računalna obradba hrvatskih
korpusa: povijest, stanje i perspektive«, Suvremena
lingvistika 43-44 (1997), pp. 387-394 (ISSN
0586-0296).
- Marko Tadić, »Corpus-building projects in
Institute of linguistics, Philosophical Faculty,
University of Zagreb«, invited lecture,
University of Tübingen, 1999-12-01.
- Computational Morphology of Croatian (see
current project 130718 below)
A computational model of the Croatian inflectional system has been
designed and prototyped. As a result of that research,
a morphological generator of Croatian word-forms
has been developed. The lexicon is currently being
populated. The result of generation will be a
Croatian morphological dictionary in the form of a list of
word-forms accompanied by MSDs (morphosyntactic
descriptions) that can be used for semi-automatic
lemmatization and corpus tagging. The dictionary will be
coded according to the MULTEXT-East
v2 specification.
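The processing steps named in this section (frequency dictionaries of types and lemmas, KWIC concordances, and lemmatization through a list of word-forms with MSDs) can be illustrated with a minimal sketch. It is not the Institute's software: the mini-lexicon, the function names and the MSD strings are hypothetical, and the tags are only MULTEXT-East-style illustrations that have not been checked against the specification.

```python
from collections import Counter

# Hypothetical mini-lexicon: word-form -> list of (lemma, MSD) analyses.
# The MSD strings are illustrative only, not verified against any spec.
LEXICON = {
    "kuća": [("kuća", "Ncfsn")],
    "kuće": [("kuća", "Ncfsg"), ("kuća", "Ncfpn")],
    "kući": [("kuća", "Ncfsd"), ("kuća", "Ncfsl")],
    "je": [("biti", "Vcr3s")],
    "u": [("u", "Sl")],
    "velika": [("velik", "Afpfsn")],
}


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    tokens = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [t for t in tokens if t]


def lemmas_of(token):
    """Return the set of candidate lemmas; unknown forms map to themselves."""
    analyses = LEXICON.get(token)
    if not analyses:
        return {token}
    return {lemma for lemma, _msd in analyses}


def frequency_dictionaries(tokens):
    """Count types (distinct word-forms) and lemmas."""
    type_freq = Counter(tokens)
    lemma_freq = Counter()
    for tok in tokens:
        candidates = lemmas_of(tok)
        # Count a lemma only when it is unambiguous; ambiguous forms would
        # be left for manual review in a semi-automatic workflow.
        if len(candidates) == 1:
            lemma_freq[next(iter(candidates))] += 1
    return type_freq, lemma_freq


def kwic(tokens, keyword, width=2, by_lemma=False):
    """Return (left context, key, right context) triples for each hit."""
    hits = []
    for i, tok in enumerate(tokens):
        match = keyword in lemmas_of(tok) if by_lemma else tok == keyword
        if match:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


if __name__ == "__main__":
    toks = tokenize("Kuća je velika. U kući je velika kuća.")
    types, lemmas = frequency_dictionaries(toks)
    print(sorted(types.items()))
    print(sorted(lemmas.items()))
    for left, key, right in kwic(toks, "kuća", by_lemma=True):
        print(f"{left:>12} [{key}] {right}")
```

In this sketch a form such as "kuće" has two MSD analyses but a single lemma, so it can still be lemmatized automatically. A form with several candidate lemmas would be skipped in the lemma counts, which is where the "semi-automatic" step would come in.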
2. Contrastive projects
The areas covered by the Institute's contrastive projects
are:
- Phraseological dictionaries of Croatian and 11 different
languages (French, Slovene, Czech, Polish, German, Italian,
Russian, Ukrainian, Latin, Greek, Modern Greek);
- Languages in contact: the investigation of Anglicisms in
several European languages;
- Contrastive analysis of Croatian and English, German, Russian,
French, Spanish, Italian...
3. Current projects
- CROATIAN PARTICIPATION IN ELAN PROJECT (MZT RH 130729)
Today, computationally processed corpora are an indispensable
source of linguistic material for linguistic
description at all levels of natural language.
Until recently the main obstacle to accessing corpus data
was the unstandardized format of the texts included.
With the introduction of SGML (ISO 8879:1986) and TEI (Text
Encoding Initiative: Sperberg-McQueen & Burnard 1994),
a standardized format for storing and annotating corpora
was established. Meanwhile, the set of procedures for
accessing standardized corpus data remained
unstandardized. For that reason the EU, via two non-profit
organizations (PAROLE and TELRI), which were constituted
after the close of two large European projects bearing
the same names, started the ELAN project (European
Language Activity Network). The ELAN project aims at
delivering a standard tool for maintaining and querying
corpora and/or lexica for more than 20 West, Central
and East European languages. The final result of ELAN will be
a network of corpora and/or electronically stored
lexicons installed on servers of the institutions
involved in the project. That network of linguistic resources
will be accessible through a WWW service and a standardized
interface.
The Institute of Linguistics at the Faculty of Philosophy,
University of Zagreb, was invited to the ELAN project
because of its participation in the TELRI I project (regrettably
not as a full member). Since the Institute is in effect the
reference institution for corpus linguistics in the Republic
of Croatia, it can participate in the ELAN project by
producing a corpus of Croatian of approximately
2,000,000 words. The Institute will get access to the
newest generation of corpus tools, which will become a
standardized tool in the field of corpus linguistics.
Acquiring the new corpus methodology will have a direct
impact on the Croatian national corpus, which is also
being collected at the Institute within MZT
project 130718.
The ELAN project is also important because it is the first
international EU-funded project in which the Croatian language
is institutionally, organizationally and terminologically
completely separated from Serbian, and that is an
opportunity which should not be neglected.
- CROATIAN-SLOVENE PARALLEL CORPUS (MZT RH 130821)
This parallel corpus should be the starting point for
applied linguistic research on two neighbouring languages
which are also genetically close. Croatian and Slovene
have undergone significant changes since both states
gained independence in 1991. The corpus will be
retrievable via the WWW and is planned to provide
important linguistic material for students of Croatian
and Slovene at the Faculties in Zagreb and Ljubljana, and for
translators from and into both languages. Because the existing
Croatian-Slovene dictionaries are old (and usually Serbo-Croatian -
Slovene), such a corpus would be an
indispensable resource for new dictionaries.
- SEMANTIC FIELDS AND SYNTAX
The main focus of this project is to analyse the
principles of semantic field organization and their
prototypical syntactic determiners, expressed primarily on
the syntagmatic or sentence level. The main aim is thus
to show how meaning and vocabulary organization
connect and intermesh with syntactic phenomena. An
analysis of this kind, which brings together semantic and
syntactic characteristics of parts of the vocabulary of
different languages, sheds light on the basic principles
of how natural languages function. The analysis will
primarily be centered on semantic fields in Croatian and
English, but will also include data from non-Indo-European
languages. Croatian and English represent languages
belonging to different families of the Indo-European group,
and by contrasting them differences can be determined,
while at the same time principles that go beyond
language-specific analysis appear. The inclusion of data from
non-Indo-European languages is a way of confirming the basic
principles of semantic field organization, and it is
expected that these data will provide additional evidence for
universal principles of how linguistic phenomena are
manifested on the semantic and syntactic levels. Previous
work of the principal investigator on the prototypical
organization of semantic fields, and the researchers'
theory of "Role and Reference Grammar", are
starting points that in combination should provide the
basis for a more comprehensive understanding of the
complexity of semantic and syntactic phenomena.
- CROATIAN-GERMAN LINGUISTIC RELATIONS
The project will include both diachronic and synchronic
levels of Croatian-German linguistic relations. On the
diachronic level, grammars, dictionaries and textbooks
of German and written in German will be analyzed.
German-language reviews, books and newspapers published
in Croatia will also be dealt with. On the synchronic
level, the influence of the German language on 20th-century
Croatian dialects will be investigated, described and
analyzed. The social and institutional aspects of
teaching German in Croatia between the 18th and 20th
centuries will also be investigated.
- CONTRASTIVE GRAMMARS: CROATIAN AND FRENCH
The basic aim of the research is the elaboration of
contrastive grammars of Croatian and French. The basic
thesis of the project starts from the fact that there is
no modern Croatian grammar with special reference to
French grammar, nor any French grammar with special
reference to Croatian. As far as the general theory of
grammar is concerned, there are different approaches which
in principle do not enable the elaboration of a grammar
of a language. These grammars should contribute to the
creation of a thorough grammatical theory and develop a
general grammatical theory taking into account different
models of grammatical description as well as modern
linguistic methods. The result of the research would be
the elaboration and publication of two grammars: the
Croatian grammar with special reference to French and the
French grammar with special reference to Croatian. This
project would contribute to the introduction of Croatian
into international scholarship, especially to its presence in the
French language area. As the project will be using the
results of computational linguistics and formal grammars,
especially so-called regular grammars, the project will
also contribute to the development of general knowledge
in this area, both in Croatia and in the world. The
project plans collaboration with Maurice Gross (Laboratoire
d’automatique documentaire et linguistique - Université
Paris VII et Paris VIII), with the Centre Lucien Tesnière
(Université de Franche-Comté - Besançon), with the
Laboratoire de linguistique informatique (Université
Paris XIII) and with Olivier Soutet (Sorbonne, Paris IV).
Publications
The Institute maintains an active publishing programme
comprising several series of publications, such as:
- Computer concordances of the texts of the (older)
Croatian literature;
- Contrastive studies of Croatian and other languages;
- Phraseological dictionaries;
- The English Element in European Languages series;
- Linguistic textbooks.
More than 60 books and studies have been published as a
result of work on the Institute's projects.
The Institute is also co-publisher of the journal Suvremena
lingvistika (Contemporary Linguistics), which is indexed in
the MLA and BL. It also publishes the Bulletin of the
Institute of Linguistics, in which thorough bibliographical data
as well as articles on ongoing projects are published.
STAFF
1. Permanent
Vesna Muhvić-Dimanovski, Ph.D.
Ph.D. in contrastive linguistics; field of interest:
languages in contact, Anglicisms in Croatian and German, research
on neologisms in Croatian and other European languages
Ida Raffaelli, Ph.D.
Ph.D. in linguistics; field of interest: semantics in historical
perspective, mediaeval French, chronicles
Boško Bekavac
B.A. in linguistics and information sciences; field of interest:
computational linguistics, linguistic tools, corpus linguistics,
SGML, XML
Sanja Fulgosi
B.A. in Croatian studies; field of interest: Croatian corpora,
morphology, POS tagging
Krešimir Šojat
B.A. in English and German studies; field of interest: Croatian
corpora, collocations, semantic tagging
Ivana Simeon
B.A. in general linguistics and Russian language and literature; field of interest:
parallel corpora processing and annotation, M(A)T, translation theory
2. Principal researchers of the Institute's current projects
Professor Zrinjka Glovacki-Bernardi
Ph.D. in linguistics; field of interest: discourse analysis,
languages in contact: German and Croatian; principal researcher of
the project: Croatian-German linguistic relations.
Professor Dubravka Sesar
Ph.D. in linguistics; field of interest: Slavic languages,
language standardization processes, Czech and Slovak; principal
researcher of the project: The analysis of West-Slavic
languages.
Marko Tadić, assistant professor
Ph.D. in computational linguistics; field of interest: corpus
linguistics, corpora compiling and processing, computational
morphology; principal researcher of the projects: Computational
processing of the Croatian language, Croatian participation in
ELAN project, Croatian-Slovene parallel corpus.
Professor Milena Žic-Fuchs
Ph.D. in semantics; field of interest: cognitive linguistics,
semantics, syntax; principal researcher of the project: Semantic
fields and syntax.
Last change 2001-04-25 by Marko Tadić