- DUE 10 September 2025; 100% correct solutions and
explanations.
QUESTION 1
1) What is a corpus, and how does it differ from other data types?
A corpus is a large, structured, and electronically stored collection of authentic
linguistic data, usually in the form of written texts or transcribed speech, that is
compiled for the purpose of linguistic analysis or natural language processing
(NLP). It is designed to represent language use as naturally as possible and
provides researchers with empirical evidence of how language is used in real
contexts. Corpora may be general (covering many topics and genres) or specialized
(focusing on a particular domain, register, or variety of language).
A corpus differs from other data types in several ways:
- Authenticity: Unlike artificial or constructed examples, corpora consist of
  naturally occurring language samples.
- Structure: Corpora are systematically organized, usually documented with
  metadata (e.g., source, genre, date), and often annotated with linguistic
  information (e.g., part-of-speech tags, syntactic structures).
- Size: Corpora are usually large and deliberately sampled, making them more
  representative of language patterns than small anecdotal examples or
  intuition-based data.
- Machine-readability: They are stored in electronic form and are accessible
  for computational analysis using NLP tools.
- Comparability: Unlike general datasets, corpora are typically designed to
  allow linguistic comparison across genres, registers, dialects, or time
  periods.
Thus, while general data types may include numbers, images, or arbitrary text
collections, a corpus is unique in being a linguistically informed dataset created for
systematic study of language.
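The properties listed above can be illustrated with a minimal Python sketch. It is a toy illustration only, not a real corpus: the three sample texts, the Document class, the pos_counts_by_genre helper, and the tag labels (DET, NOUN, VERB, and so on) are all assumptions invented for this example. It shows metadata (genre, year) stored alongside annotated tokens, and how that structure makes a comparison across genres straightforward.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Document:
    """One text in the toy corpus: metadata plus annotated tokens."""
    genre: str    # metadata: text category (hypothetical labels)
    year: int     # metadata: year of the text (invented values)
    tokens: list  # annotation: list of (word, part-of-speech tag) pairs


# Invented sample data; a real corpus would hold far more authentic text.
corpus = [
    Document("news", 2021, [("the", "DET"), ("market", "NOUN"), ("fell", "VERB")]),
    Document("fiction", 1999, [("she", "PRON"), ("fell", "VERB"), ("silent", "ADJ")]),
    Document("news", 2023, [("the", "DET"), ("bank", "NOUN"), ("rose", "VERB")]),
]


def pos_counts_by_genre(docs):
    """Count part-of-speech tags separately for each genre."""
    counts = {}
    for doc in docs:
        counter = counts.setdefault(doc.genre, Counter())
        counter.update(pos for _, pos in doc.tokens)
    return counts


counts = pos_counts_by_genre(corpus)
print(counts["news"]["NOUN"])     # nouns in the news texts
print(counts["fiction"]["VERB"])  # verbs in the fiction texts
```

An arbitrary text collection would carry neither the genre labels nor the part-of-speech tags, so the same comparison would first require that information to be added by hand or by an NLP tool.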
2) What is the technical term for splitting a corpus into different linguistic
units such as paragraphs, sentences, and words in NLP?