EGlossa - Linguistic Corpora

Linguistic Corpora in Annotation

Learn how to work with annotated corpora for linguistic analysis and pattern validation.

What Are Corpora

Corpora are collections of annotated linguistic examples used to analyze patterns, validate grammatical theories, and train machine learning models. In EGlossa, corpora are versioned, searchable, and collaborative.

Support multiple linguistic frameworks
Explore annotated sentences interactively
Share and collaborate on corpus construction

Example corpus entry

The quick brown fox jumps over the lazy dog
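
One way to picture an annotated entry like the sentence above is as a list of tokens, each carrying a tag. The sketch below is illustrative only: the class names `Token` and `CorpusEntry`, the `entry_id` value, and the Universal POS style tags are assumptions, not EGlossa's actual storage format.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    form: str  # surface word as it appears in the sentence
    pos: str   # part-of-speech tag (Universal POS style labels, assumed here)


@dataclass
class CorpusEntry:
    entry_id: str   # hypothetical identifier
    language: str   # language code of the sentence
    tokens: list[Token] = field(default_factory=list)

    def text(self) -> str:
        # Rebuild the plain sentence from the token forms.
        return " ".join(t.form for t in self.tokens)


entry = CorpusEntry(
    entry_id="ex-001",
    language="en",
    tokens=[
        Token("The", "DET"), Token("quick", "ADJ"), Token("brown", "ADJ"),
        Token("fox", "NOUN"), Token("jumps", "VERB"), Token("over", "ADP"),
        Token("the", "DET"), Token("lazy", "ADJ"), Token("dog", "NOUN"),
    ],
)

print(entry.text())
```

Keeping the tag next to each token makes later steps such as search, diffing, and export simple list operations.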

Corpora Features

Interactive Search

Query linguistic patterns across corpora with regex, POS tags, or semantic features.
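
A query that combines a regex over word forms with a POS filter can be sketched in a few lines. The function `search` and the tagged sentence are hypothetical stand-ins for illustration; they do not reflect EGlossa's query engine.

```python
import re

# A tagged sentence as (form, POS) pairs; the tags are illustrative.
SENTENCE = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("jumps", "VERB"), ("over", "ADP"), ("the", "DET"), ("lazy", "ADJ"),
    ("dog", "NOUN"),
]


def search(tokens, form_regex=None, pos=None):
    """Return (index, form, tag) for tokens matching every given filter."""
    pattern = re.compile(form_regex) if form_regex else None
    hits = []
    for i, (form, tag) in enumerate(tokens):
        if pattern and not pattern.search(form):
            continue  # form does not match the regex
        if pos and tag != pos:
            continue  # tag does not match the requested POS
        hits.append((i, form, tag))
    return hits


adjectives = search(SENTENCE, pos="ADJ")
s_final = search(SENTENCE, form_regex=r"s$")  # forms ending in "s"

print(adjectives)
print(s_final)
```

Filters are combined with a logical AND, so passing both `form_regex` and `pos` narrows the result.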

Version Control

Track changes to corpora over time using Git-style diffs and branching.
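
To give a feel for what a diff between two corpus versions looks like, the sketch below compares two annotation snapshots with Python's standard `difflib`. The `form/TAG` line format and the version labels `corpus@v1` and `corpus@v2` are assumptions made for the example, not EGlossa's own diff output.

```python
import difflib

# Two snapshots of the same annotated sentence, one token per line.
# In v1 "jumps" is mistagged as a noun; v2 corrects it.
v1 = ["The/DET", "quick/ADJ", "brown/ADJ", "fox/NOUN", "jumps/NOUN"]
v2 = ["The/DET", "quick/ADJ", "brown/ADJ", "fox/NOUN", "jumps/VERB"]

diff = list(
    difflib.unified_diff(
        v1, v2,
        fromfile="corpus@v1",  # hypothetical version labels
        tofile="corpus@v2",
        lineterm="",           # input lines carry no trailing newline
    )
)

print("\n".join(diff))
```

Lines prefixed with `-` were removed and lines prefixed with `+` were added, so a single retagging shows up as one removed and one added line.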

Export Options

Download corpora in multiple formats including XML, JSON, and CSV for offline analysis.
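
Once downloaded, the three formats carry the same information in different shapes. The sketch below serializes one small token list to each using only the Python standard library; the field names `form` and `pos` and the element names `sentence` and `token` are assumed for illustration and may differ from EGlossa's real export schema.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Illustrative token records; the field names are assumptions.
tokens = [
    {"form": "fox", "pos": "NOUN"},
    {"form": "jumps", "pos": "VERB"},
]


def to_json(rows):
    return json.dumps(rows, ensure_ascii=False, indent=2)


def to_csv(rows):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["form", "pos"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


def to_xml(rows):
    root = ET.Element("sentence")
    for row in rows:
        tok = ET.SubElement(root, "token", pos=row["pos"])
        tok.text = row["form"]  # word form becomes the element text
    return ET.tostring(root, encoding="unicode")


print(to_json(tokens))
print(to_csv(tokens))
print(to_xml(tokens))
```

CSV suits flat token tables, while JSON and XML can also hold nested structure such as syntax trees.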

Workflow Tutorial

1. Search Corpora

Use the search interface to find patterns with filters for word class, syntactic role, or semantic frame.

2. Annotate Samples

Apply grammatical tags interactively and review consensus annotations from other researchers.
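
A consensus annotation can be thought of as a per-token majority vote across annotators. The sketch below shows that idea under simple assumptions: every annotator tags the same tokens in the same order, and the function name `consensus` is hypothetical. How EGlossa actually computes consensus is not specified here.

```python
from collections import Counter


def consensus(tags_by_annotator):
    """Majority tag and agreement ratio for each token position.

    tags_by_annotator: one tag sequence per annotator, all the same length.
    On a tie, the tag that was seen first among the tied tags wins.
    """
    result = []
    for position_tags in zip(*tags_by_annotator):
        tag, votes = Counter(position_tags).most_common(1)[0]
        result.append((tag, votes / len(position_tags)))
    return result


# Three annotators tagging the same three-token sample.
annotations = [
    ["DET", "NOUN", "VERB"],
    ["DET", "NOUN", "NOUN"],
    ["DET", "ADJ", "VERB"],
]

merged = consensus(annotations)
for tag, agreement in merged:
    print(f"{tag}\t{agreement:.2f}")
```

Positions with low agreement are natural candidates for review before the annotation is accepted.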


Example Corpora Use

Verb Subcategorization

Query all instances of transitive verbs and their object types in the Spanish-English parallel corpus.

SELECT * FROM corpus WHERE verb = 'transitive' AND object = 'direct'
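
To try a query of this shape locally, the sketch below loads a few rows into an in-memory SQLite database and runs an equivalent parameterized query. The table layout (`sentence`, `lemma`, `verb`, `object` columns) and the sample rows are assumptions for illustration; the real corpus schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per verb occurrence.
conn.execute(
    "CREATE TABLE corpus (sentence TEXT, lemma TEXT, verb TEXT, object TEXT)"
)
conn.executemany(
    "INSERT INTO corpus VALUES (?, ?, ?, ?)",
    [
        ("She reads the book", "read", "transitive", "direct"),
        ("He sleeps", "sleep", "intransitive", None),
        ("Ella lee el libro", "leer", "transitive", "direct"),
    ],
)

# Same condition as the query above, with bound parameters.
rows = conn.execute(
    "SELECT sentence, lemma FROM corpus "
    "WHERE verb = ? AND object = ? ORDER BY rowid",
    ("transitive", "direct"),
).fetchall()

for sentence, lemma in rows:
    print(lemma, "|", sentence)

conn.close()
```

Bound parameters keep user-supplied filter values out of the SQL string itself.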

Cross-linguistic Patterns

Analyze case marking patterns across Germanic and Slavic language corpora.

FILTER [case: nominative] in German + Polish datasets
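
The filter above can be read as: keep only nominative forms, then group the results by language. A minimal sketch of that logic follows; the `datasets` dictionary, its field names, and the function `filter_by_case` are hypothetical and only mirror the filter expression, not EGlossa's implementation.

```python
# Tiny illustrative datasets: the word for "dog" in two cases per language.
datasets = {
    "German": [
        {"form": "der Hund", "case": "nominative"},
        {"form": "den Hund", "case": "accusative"},
    ],
    "Polish": [
        {"form": "pies", "case": "nominative"},
        {"form": "psa", "case": "accusative"},
    ],
}


def filter_by_case(data, case, languages):
    """Return the forms marked with `case`, grouped by language."""
    return {
        lang: [tok["form"] for tok in data[lang] if tok["case"] == case]
        for lang in languages
    }


nominatives = filter_by_case(datasets, "nominative", ["German", "Polish"])
print(nominatives)
```

Grouping by language keeps the results side by side, which makes it easier to compare how each language marks the same case.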