Why build a corpus?

Why build a corpus? It's a little hard to give a pat answer to that question. Some people have asked me if I'm here to document an endangered dialect. This is partly true, because documenting written German as it is used in South Tyrol is a major goal of the project. However, South Tyrolean German is not actually immediately endangered. Many South Tyrolean children still learn German natively and go to German-language elementary schools (there are also Italian-language elementary schools; the German-speakers have to study Italian and the Italians have to study German).

Corpus-based approaches have become a major player in the field of linguistics. You can use a corpus for a lot of theoretical research questions and also for a lot of practical applications.

Here's a simple example of a practical application. Imagine that you are building a voice-to-text system (you talk into a microphone, and the computer recognizes your words and prints them on the screen). Suppose you dictate the following:

We had nothing but beautiful blue skies for our whole vacation.

Now, the words blue and blew are pronounced exactly the same. So how can the computer tell which word belongs here? Well, suppose you have a large electronic collection of English texts, containing many millions of words. Suppose the program is also already pretty sure that the following word is skies. Your program could look through this corpus and count how often each phrase occurs:

blue skies 276 times
blew skies 3 times

So, your program judges that "blue skies" is more probable than "blew skies".
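
For the curious, here is roughly what that counting could look like in Python. This is only a minimal sketch: the tiny toy_corpus and the helper names count_bigrams and pick_homophone are made up for illustration, and a real system would use a far larger corpus and a more careful statistical model.

```python
from collections import Counter
import re


def count_bigrams(text):
    """Count adjacent word pairs (bigrams) in the lowercased text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))


def pick_homophone(candidates, next_word, bigram_counts):
    """Return the candidate that most often precedes next_word in the corpus."""
    return max(candidates, key=lambda w: bigram_counts[(w, next_word)])


# Toy stand-in for a corpus of many millions of words (invented sentences).
toy_corpus = (
    "We saw blue skies all week. "
    "Blue skies are forecast for tomorrow. "
    "The wind blew hard across the lake. "
    "Nothing but blue skies ahead."
)

counts = count_bigrams(toy_corpus)
best = pick_homophone(["blue", "blew"], "skies", counts)
print(best)
```

With this toy text the program settles on "blue", because "blue skies" shows up in the corpus and "blew skies" never does.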

This is an overly simplistic example, but it gives the general idea of the sorts of things you can do with a corpus. Many different kinds of software rely on the statistics you can obtain from a corpus. There are also important theoretical questions that you can pursue using the same general statistical approach.

My job is not to build the applications or conduct the research, although I need to know about both to build the corpus the right way. I'm assembling the data, which is further upstream in the whole process.

The job requirements for this position were an odd combination. They needed someone who speaks German; knows something about the dialects of German; has a background in theoretical linguistics; is an experienced programmer who knows Unix, Perl, MySQL, XML, and web development; knows something about computational linguistics; and has experience developing online linguistic resources. It's an odd, quirky set of skills, but the position was a perfect match for me, almost as if the job had been created with me in mind (which it wasn't, of course; the folks here had never heard of me before I applied).

The corpus we are building is one regional module of a much larger corpus of German written since the beginning of the 20th century. The main corpus was developed by a team in Berlin. There are other teams developing regional modules in Switzerland, in Vienna, and in other cities. We communicate with these other teams to coordinate our file formats, our metadata (information such as author, year, type of text, etc.), our sampling techniques, and so on.

