Friday, February 4, 2011

Corpus

A corpus can be defined simply as a collection of linguistic data, either spoken or written. However, can every collection of linguistic data be called a 'corpus'? Although definitions vary, there is a consensus among linguists that a corpus should ideally be (1) machine readable, (2) authentic, and (3) representative enough to serve as a sample of a language variety. A collection of linguistic data that fulfils these three requirements can be called a corpus.
Among these requirements, the second and third are quite familiar to linguists. The first, 'machine readable', is more familiar to computer scientists. For this reason, this post highlights that term. The other focus is the preprocessing stage of a corpus.

Machine Readability

Why 'machine readable'? The term 'readable' is closely related to 'machine recognition'. A machine can be a standalone device such as an iPad or a portable text reader, but it mostly refers to a computer. Computer recognition is essential for tagging, extraction, and other NLP tasks; they cannot be executed unless the linguistic data is readable. Therefore, linguistic data that the computer cannot recognize must be converted to a readable format. For example, handwriting must first be scanned and then converted to a computer-recognizable format, say PDF or Word.

There you go: you have your linguistic data. But this is not all that 'machine readable' means. Your linguistic data must also be 'processable', so that further NLP tasks such as pattern matching or extraction can be performed on it.
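
To make 'processable' a little more concrete, here is a minimal sketch of pattern matching on plain text with Python's standard `re` module. The sample sentence, the helper name `find_pattern`, and the two patterns are made up for illustration only; a scanned image or a binary document cannot be searched this way until its text has been extracted.

```python
import re

# Made-up sample text standing in for a tiny corpus
text = "The corpus contains 3 texts. Each text was collected in 2011."

def find_pattern(text, pattern):
    """Return every non-overlapping match of a regular expression in the text."""
    return re.findall(pattern, text)

# Four-digit numbers (a rough stand-in for years)
years = find_pattern(text, r"\b\d{4}\b")

# Words that start with a capital letter followed by lowercase letters
capitalized = find_pattern(text, r"\b[A-Z][a-z]+\b")

print(years)
print(capitalized)
```

The point is only that once the data is plain text, a few lines are enough to start matching and extracting.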


Preprocessing

The question is, what format is best for processing? As far as I understand, Word and PDF are not good choices. The best might be plain text (.txt, as produced by Notepad). However, it really depends on which corpus-processing software or program you use. If you use Python, .txt is unlikely to be a problem. For other software, you might need to take additional steps.
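
For the Python case, loading a .txt corpus can be sketched as below. The file name `corpus.txt`, the helper `load_corpus`, and the sample sentences are hypothetical; the sketch writes its own small file into a temporary folder so that it runs on its own. It assumes the file is saved as UTF-8, so change the `encoding` argument if yours is not.

```python
import tempfile
from pathlib import Path

def load_corpus(path, encoding="utf-8"):
    """Read a plain-text corpus file into a single string."""
    return Path(path).read_text(encoding=encoding)

with tempfile.TemporaryDirectory() as tmp:
    # Create a small sample file so the example is self-contained
    sample = Path(tmp) / "corpus.txt"
    sample.write_text("Saya membaca buku. Dia menulis surat.", encoding="utf-8")

    corpus_text = load_corpus(sample)

print(len(corpus_text), "characters loaded")
```

Stating the encoding explicitly avoids depending on the operating system's default, which differs from one machine to another.
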
If you use Unitex (assuming it is already installed on your computer), follow these steps:

(1) Save your corpus as a .txt file (Encoding: Unicode)
(2) Convert it to a Unitex file
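
Step (1) can also be done with a short script instead of a text editor. The sketch below re-encodes an existing text file as UTF-16 little-endian with a byte order mark, which is what Windows Notepad calls 'Unicode'. Whether your Unitex version expects exactly this encoding is an assumption here, so check its manual. The file names, the helper `convert_to_utf16le`, and the Latin-1 source encoding are also made up for the example. Step (2) is done inside Unitex itself and is not covered.

```python
import codecs
import tempfile
from pathlib import Path

def convert_to_utf16le(src, dst, src_encoding="latin-1"):
    """Re-encode a text file as UTF-16 little-endian with a byte order mark."""
    text = Path(src).read_text(encoding=src_encoding)
    data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    Path(dst).write_bytes(data)
    return data

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "corpus_latin1.txt"
    dst = Path(tmp) / "corpus_unicode.txt"

    # Sample input in a legacy encoding (assumed for illustration)
    src.write_bytes("café au lait".encode("latin-1"))

    written = convert_to_utf16le(src, dst)

    # Reading back with "utf-16" uses the byte order mark to pick the byte order
    round_trip = dst.read_text(encoding="utf-16")

print(round_trip)
```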

Although this takes more than one step, it does not mean that one program is more complicated than another. In Unitex, this stage is really important, for it is where preprocessing and lexical parsing take place. Depending on the integrated MRD (machine-readable dictionary) available, Unitex performs character recognition, tokenization, grammatical and semantic tagging, and phrase and sentence chunking.

Unitex Manual. Preprocessing Window

After preprocessing and lexical parsing, your corpus is tagged with information from your MRD. It is then displayed with some statistical information, such as the number of sentences, tokens, and simple, compound, and unknown words. Consider the following illustration.

Unitex. Corpus Display
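
To give a rough feel for where such figures come from, here is a minimal Python sketch that counts sentences, tokens, and unknown words against a tiny word list. It is not how Unitex computes its statistics: the sentence splitting and tokenization rules are simplified assumptions, and `TOY_DICTIONARY` is a hypothetical stand-in for a real MRD.

```python
import re

# Hypothetical mini-dictionary standing in for an MRD
TOY_DICTIONARY = {"the", "cat", "sat", "on", "mat", "a", "dog", "barked"}

def corpus_statistics(text, dictionary):
    """Count sentences, tokens, distinct word forms, and words missing from the dictionary."""
    # Naive rule: a sentence ends at '.', '!' or '?'
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # A token is either a run of word characters or a single punctuation mark
    tokens = re.findall(r"\w+|[^\w\s]", text)
    words = [t.lower() for t in tokens if t.isalpha()]
    unknown = sorted({w for w in words if w not in dictionary})
    return {
        "sentences": len(sentences),
        "tokens": len(tokens),
        "word_forms": len(set(words)),
        "unknown_words": unknown,
    }

stats = corpus_statistics("The cat sat on the mat. A dog barked at the cat!", TOY_DICTIONARY)
print(stats)
```

A word counts as 'unknown' here simply because it is missing from the word list, which is why the size and quality of the dictionary matter so much at this stage.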

What we can take from this post is that the structure and format of linguistic data are essential for corpus processing. It is important to understand this before applying it to your corpus. Be advised that data structure and format may vary from one program to another.
