Corpus Processing: Machine Readable Dictionary

The accuracy of corpus-based extraction depends heavily on a machine-readable dictionary (MRD). Without one, extraction can rely only on surface form (orthography), with no linguistic analysis. But what is a dictionary? A dictionary is a collection of entries supported by linguistic information such as part of speech, a conceptual definition, or an equivalent in the paired language (for a bilingual dictionary). There are several types of dictionary. Let me classify them into two: published dictionaries and MRDs for NLP.

(a) Published Dictionary

We can easily find published dictionaries in the bookstores around us. Some are bilingual, some are monolingual. Some come with a CD-ROM on which the dictionary database is stored. The database may also be stored in standalone portable devices such as a pocket electronic dictionary, an iPhone or a cellphone (are they the same? ^^). Although the dictionaries I mentioned are machine readable in a sense, they are built for human readers and generally cannot be used as-is to support NLP tasks.

(b) Machine Readable Dictionary for NLP

An MRD for NLP is a database of entries accompanied by linguistic information. It is used to tag tokens in the corpus. Most MRDs are monolingual, and they are used by linguists and computer scientists. This is the kind of MRD that various programs use to perform NLP tasks.

Components of MRD

Most MRDs consist of entries and part-of-speech tags, but some include semantic tags as well.

(1) Entry

The entry is the most essential component of any dictionary. Entries are usually vocabulary items or affixes. Vocabulary items can be simple forms like 'father', 'take', or 'son', or compound forms like 'father in law' and 'take into account'. All of these are free forms: they can stand on their own.

Affixes, in contrast, depend on a free morpheme and cannot stand alone. For example, -ment is an affix that derives a noun from a verb (development, procurement, engagement). Therefore, affixes are called 'bound' morphemes. Bound entries also include case markers like 이/가 in Korean, which mark the subject.

One example is the MRD used in Unitex. The MRD for English is composed of simple forms, compound forms and also inflected forms, which multiplies the number of entries. For the word 'take', for example, you have the base lemma 'take', the third-person singular 'takes', the gerund 'taking', the past 'took', the past participle 'taken', and the related agent noun 'taker'. With this information, the dictionary can give you an accurate analysis when it is applied to a corpus.
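To make the idea of inflected entries concrete, here is a minimal Python sketch. The table `INFLECTED_FORMS` and the function `lemma_of` are hypothetical names made up for this illustration, and the grammatical labels are simplified. This is not the actual Unitex dictionary format, only a picture of why storing inflected forms helps.

```python
# Hypothetical mini-dictionary: each inflected form maps to (lemma, label).
# The labels are simplified for illustration and are not Unitex codes.
INFLECTED_FORMS = {
    "take": ("take", "base form"),
    "takes": ("take", "third-person singular"),
    "taking": ("take", "gerund"),
    "took": ("take", "past"),
    "taken": ("take", "past participle"),
}


def lemma_of(token):
    """Return (lemma, label) for a token, or None if it is not listed."""
    return INFLECTED_FORMS.get(token.lower())


for word in ["Took", "taking", "table"]:
    print(word, "->", lemma_of(word))
```

Because 'took' is stored as its own entry, the lookup can relate it to 'take' even though the two surface forms look quite different. A token that is missing from the dictionary, like 'table' here, simply gets no analysis.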

(2) Part of Speech

The process of assigning POS labels is called tagging. A POS tag provides grammatical information about the entry. Let's say we have the lemma 'run'. Different MRDs come with different levels of tag detail. One MRD might assign 'V', which stands for verb, to the lemma 'run'. Another might be more specific and assign 'VI' (intransitive verb). Yet another might assign more than one tag. Consider the following examples:

- He runs around the square.

- After a six-mile run, he stopped.

In the first example, 'run' is a verb, but in the second example 'run' is a noun. Therefore, an entry can well have multiple tags. Not only content words (N, V, A, Adv) but also function words have tags. Here are some examples:

the, DET --> determiner

of, PREP --> preposition
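Here is a small Python sketch of dictionary-based tagging with the tags mentioned above. `POS_DICT` and `tag_tokens` are hypothetical names for this illustration, and the tokenizer is just a whitespace split, which a real system would not rely on.

```python
# Hypothetical POS dictionary: an entry may carry more than one tag.
POS_DICT = {
    "run": ["V", "N"],  # verb in "he runs", noun in "a six-mile run"
    "the": ["DET"],
    "of": ["PREP"],
}


def tag_tokens(sentence):
    """Attach every candidate tag to each token; unknown tokens get []."""
    return [(tok, POS_DICT.get(tok.lower(), [])) for tok in sentence.split()]


for token, tags in tag_tokens("the run of luck"):
    print(token, tags)
```

Note that the lookup only lists the candidate tags for 'run'. Deciding whether a given occurrence is a verb or a noun still requires looking at the context.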

(3) Semantic Information

Having semantic tags on your entries is a plus. Let's say you are given the task of extracting all nouns that denote animals. This is impossible if you rely only on POS tags. To carry out this task, you need semantic tags. Tags that carry semantic information might look like this:

horse, N+Anl --> Horse is a noun that denotes an animal

president, N+Hum --> President is a noun that denotes a human

With this information, all you need to do is enter the tag code (for example N+Anl) in your query.
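As a rough Python sketch of the animal-noun task, assume the entries are stored as plain lines in the simplified "entry, POS+Semantic" notation shown above. The names `RAW_ENTRIES`, `parse_entry` and `find_entries` are hypothetical, and the 'run' and 'dog' entries are added only for illustration.

```python
# Entries in the simplified "entry, POS+Semantic" notation.
RAW_ENTRIES = [
    "horse, N+Anl",
    "president, N+Hum",
    "run, V",        # no semantic tag
    "dog, N+Anl",    # extra entry added for this illustration
]


def parse_entry(line):
    """Split 'horse, N+Anl' into ('horse', 'N', {'Anl'})."""
    word, codes = [part.strip() for part in line.split(",", 1)]
    pos, *semantic = codes.split("+")
    return word, pos, set(semantic)


def find_entries(lines, pos, semantic):
    """Return the words whose POS tag and semantic tag both match."""
    result = []
    for line in lines:
        word, entry_pos, entry_sem = parse_entry(line)
        if entry_pos == pos and semantic in entry_sem:
            result.append(word)
    return result


print(find_entries(RAW_ENTRIES, "N", "Anl"))  # ['horse', 'dog']
```

With only POS tags, 'horse', 'president' and 'dog' would all look the same (N). The semantic code is what lets the query keep the animals and drop 'president'.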

There may be more components to discuss, but these three are considered the basics of an MRD. With them, you can perform corpus processing tasks based on linguistic analysis.


Friday, February 4, 2011
