Corpus methodology
Standards and milestones in corpus compilation
Aims and parameters
ParCorOE, an aligned parallel corpus of Old English prose, comprises 300,000 words in the source language (Old English), plus the parallel version in the target language (Present-Day English). The following corpus parameters have been set.
- Orientation: ParCorOE is a historical corpus, rather than a corpus devised for translation, comparative linguistics or second language learning.
- Number of languages: ParCorOE is a bilingual corpus, involving Old English and Present-Day English.
- Directionality: ParCorOE is a unidirectional corpus, from Old English to Present-Day English.
- Target of description: ParCorOE targets textual forms (tokens or inflections) rather than dictionary words or lemmas.
- Genre choice: ParCorOE contains prose texts only.
Compilation standards
The following standards, which serve the general aim of increasing the searchability and recoverability of information, have guided the design and compilation of ParCorOE.
Standard 1: Alignment
An aligned Old English-Present-Day English parallel corpus consists of a parallel text, that is to say, an Old English text placed alongside its translation into Present-Day English, with alignment at text, sentence and word level, in such a way that each source language segment is paired with a target language segment. Word, sentence, and text alignment requires tokenisation at these three structural levels. Alignment pairings should be marked by highlighting the source and the target segments.
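The pairing described in this standard can be sketched as a small data structure. This is a minimal illustration, not the ParCorOE implementation: the class names, field names and the `align_word_for_word` helper are all assumptions, the tokeniser is deliberately naive, and the sample sentence is an illustrative phrase rather than a fragment taken from the corpus.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class WordPair:
    """One word-level alignment: a source token and its target gloss."""
    source: str
    target: str


@dataclass
class AlignedFragment:
    """One sentence-level alignment unit with its word-level pairs."""
    text_id: str      # hypothetical text identifier
    fragment_no: int  # hypothetical fragment number within the text
    source: str       # Old English segment
    target: str       # Present-Day English translation of the segment
    words: list[WordPair] = field(default_factory=list)


def tokenise(segment: str) -> list[str]:
    """Naive whitespace tokenisation with edge punctuation stripped."""
    stripped = (tok.strip(".,;:!?") for tok in segment.split())
    return [tok for tok in stripped if tok]


def align_word_for_word(text_id, fragment_no, source, gloss, translation):
    """Pair each source token with one item of a word-for-word gloss."""
    src, tgt = tokenise(source), tokenise(gloss)
    if len(src) != len(tgt):
        raise ValueError("gloss must supply exactly one item per source token")
    pairs = [WordPair(s, t) for s, t in zip(src, tgt)]
    return AlignedFragment(text_id, fragment_no, source, translation, pairs)


# Illustrative data only (hypothetical identifiers, not a corpus record).
fragment = align_word_for_word(
    "demo", 1,
    "Her for se here.",
    "here went the army",
    "In this year the army went.",
)
```

Each `WordPair` is the unit that a highlighting layer would need in order to mark a source word and its target counterpart together. One-to-many alignments, which a real corpus would have to handle, are outside the scope of this sketch.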
Standard 2: Annotation
Three types of annotation must be distinguished: markup at text level, and syntactic annotation and morphological tagging at sentence/word level. Fragments (tokens) comprise at least one sentence or one syntactically independent period and are identified by means of a text number.
Standard 3: Lemmatisation
The corpus must be fully lemmatised, so that all the textual attestations are grouped under the relevant lemma, and each lemma is provided with all its inflections.
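As a rough sketch of what full lemmatisation implies for the data, the grouping of attestations under lemmas can be derived from a list of (form, lemma) pairs. The function name `group_by_lemma` and the sample pairs are assumptions made for illustration; they do not reproduce the project's databases.

```python
from collections import defaultdict


def group_by_lemma(attestations):
    """Group attested forms under their lemma.

    `attestations` is an iterable of (form, lemma) pairs. The result maps
    each lemma to the sorted list of its distinct, lower-cased forms.
    """
    grouped = defaultdict(set)
    for form, lemma in attestations:
        grouped[lemma].add(form.lower())
    return {lemma: sorted(forms) for lemma, forms in sorted(grouped.items())}


# Illustrative attestations only; not extracted from the corpus.
sample = [
    ("cyning", "cyning"),
    ("cyninges", "cyning"),
    ("Cyning", "cyning"),
    ("cyningas", "cyning"),
    ("here", "here"),
]
lemma_table = group_by_lemma(sample)
```

Because the table is computed from the attestations rather than stored by hand, it changes whenever the underlying attestations change.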
Standard 4: Automation
Within the limits imposed by the available written standards and the variation that they present, the annotation of the parallel corpus must be automatic. This includes not only syntactic annotation and morphological tagging, but also the necessary lemmatisation. Lemmas and inflections must be listed dynamically.
Standard 5: Feeding
The corpus must be fed with the information available from The Knowledge Base of Old English (OEKB). The parallel corpus may retrieve information from the relational databases in OEKB in order to maximise the automation of the tasks of tagging, annotation and lemmatisation.
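The idea of feeding the corpus from a relational database can be illustrated with an in-memory SQLite table. Everything about the schema here is hypothetical: the table name `inflections`, its columns and the sample rows are assumptions for the sketch and say nothing about how OEKB is actually organised.

```python
import sqlite3


def make_demo_kb():
    """Build a tiny in-memory stand-in for a lexical database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inflections (form TEXT PRIMARY KEY, lemma TEXT, category TEXT)"
    )
    conn.executemany(
        "INSERT INTO inflections VALUES (?, ?, ?)",
        [
            ("cyninges", "cyning", "noun"),
            ("for", "faran", "verb"),
        ],
    )
    return conn


def annotate(conn, forms):
    """Look up lemma and category for each form; unknown forms get None values."""
    annotated = []
    for form in forms:
        row = conn.execute(
            "SELECT lemma, category FROM inflections WHERE form = ?",
            (form.lower(),),
        ).fetchone()
        annotated.append((form, row[0], row[1]) if row else (form, None, None))
    return annotated


demo_annotations = annotate(make_demo_kb(), ["For", "cyninges", "xyz"])
```

The sketch simplifies in one important respect: it assumes that each spelling maps to a single lemma, whereas homographs would need extra context to resolve.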
Standard 6: Searchability
The corpus must be searchable by text, fragment and word, as well as by morphological tag and syntactic annotation. Combined searches by inflectional form and lemma are also required. The corpus must be based on a concordance and an index, so that the main layouts are interconnected.
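A minimal sketch of how an index and a concordance might support combined searches is given below. The `Token` fields, the tag labels and the function names are assumptions for illustration, and the four tokens are sample data rather than corpus records.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    text_id: str      # hypothetical text identifier
    fragment_no: int  # fragment number within the text
    position: int     # word position within the fragment
    form: str         # attested inflectional form
    lemma: str
    tag: str          # hypothetical morphological tag


def build_index(tokens):
    """Map lower-cased forms and lemmas to token positions in the list."""
    by_form, by_lemma = defaultdict(list), defaultdict(list)
    for i, tok in enumerate(tokens):
        by_form[tok.form.lower()].append(i)
        by_lemma[tok.lemma].append(i)
    return by_form, by_lemma


def search(tokens, by_form, by_lemma, form=None, lemma=None, tag=None):
    """Combined search: every criterion that is given must hold."""
    candidates = set(range(len(tokens)))
    if form is not None:
        candidates &= set(by_form.get(form.lower(), []))
    if lemma is not None:
        candidates &= set(by_lemma.get(lemma, []))
    hits = [tokens[i] for i in sorted(candidates)]
    if tag is not None:
        hits = [tok for tok in hits if tok.tag == tag]
    return hits


def kwic(tokens, hit, width=2):
    """Concordance line: the hit in brackets with `width` words of context."""
    frag = sorted(
        (t for t in tokens
         if (t.text_id, t.fragment_no) == (hit.text_id, hit.fragment_no)),
        key=lambda t: t.position,
    )
    idx = next(i for i, t in enumerate(frag) if t.position == hit.position)
    left = " ".join(t.form for t in frag[max(0, idx - width):idx])
    right = " ".join(t.form for t in frag[idx + 1:idx + 1 + width])
    return f"{left} [{hit.form}] {right}".strip()


# Illustrative tokens only (hypothetical identifiers and tags).
tokens = [
    Token("demo", 1, 1, "Her", "her", "ADV"),
    Token("demo", 1, 2, "for", "faran", "VERB"),
    Token("demo", 1, 3, "se", "se", "DET"),
    Token("demo", 1, 4, "here", "here", "NOUN"),
]
by_form, by_lemma = build_index(tokens)
verb_hits = search(tokens, by_form, by_lemma, lemma="faran", tag="VERB")
```

Here the index answers form and lemma queries directly, while `kwic` rebuilds the surrounding context from the fragment, which is one way in which the two layouts could be interconnected.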
Standard 7: Dissemination
The corpus must be available online in open access and must be searchable with an Internet browser. Users should not need training or previous experience with database software in order to search ParCorOE.
ParCorOEv1: The Pilot Corpus
In order to fix design inadequacies and compilation shortcomings, a ten-thousand-word pilot corpus was compiled and annotated. The selection of texts comprised the major genres of historical prose, religious prose and translations from Latin: The Anglo-Saxon Chronicle, Orosius, Ælfric's Lives of Saints, Cura Pastoralis, and Bede's Ecclesiastical History. The Old English texts, as well as their translations into Present-Day English, were extracted from Fernández Cuesta et al. (1997). The pilot corpus had two building blocks: the concordance (including a word index) to the texts and the parallel corpus layouts. Two layouts were distinguished: the static presentation, which offered the running Old English and Present-Day English texts, aligned them by fragment and word, and provided a word-for-word gloss as well as a fragment translation; and the dynamic presentation, which was aligned at word level in such a way that each word was highlighted in the source and in the target text. Full tagging and annotation were imported from the relational databases, including the information on lemma, alternative spellings, lexical category, morphological class, inflectional paradigm, derivational paradigm, meaning definition, and the references of the secondary sources that discuss the lemma or the inflectional form in question.
About us
RGFGs, Nerthus Project
Department of Modern Languages, University of La Rioja.
Nerthus Project - Universidad de La Rioja © 2024