Corpus Processing: Auto-Affixing Model for Indonesian with Unitex

This post describes a method for performing automatic affixing in Indonesian with Unitex. Unitex has two types of dictionary: the simple-word dictionary and the inflected-word dictionary. The inflected dictionary is generated from the simple dictionary. Entries in the simple dictionary may be canonical or compound words, but they are not inflected. This dictionary is then inflected by an inflectional LGG (local grammar graph), which carries the affixes and the inflection rules and in turn produces the inflected dictionary.

In Indonesian, there are three kinds of affixes: prefixes, suffixes and infixes. Some would add the ‘circumfix’, which is a combination of a prefix and a suffix. The patterns are mostly predictable, but of course there are some exceptions (or many^^). We are not going to discuss the whole phenomenon here. To focus the discussion, this post concentrates on the prefix meN-.

To describe the automatic affixing method, some linguistic background is presented before we proceed to the processing part. The prefix meN- marks verbs in the active form, but it can be dropped in some contexts. It can also be used to mark a formal register.

N stands for a nasal sound. meN- is the underlying form, but it can take various surface forms, such as meng-, men-, mem- and me-. Consider the following examples:

(a) nikah → me+nikah → menikah ‘to get married’ (direct concatenation)
(b) pukul → me+pukul → memukul ‘to hit’ (mem- is chosen, and the first letter of the canonical word, ‘p’, is deleted)

Instrument

Unitex defines inflection as the process of attaching a bound morpheme to a lexeme. In morphology, inflection is distinguished from derivation. You might want to read this to decide whether affixation in Indonesian counts as inflection or derivation, or you can consult Alwi et al. (1988).

To perform automatic prefixing with meN-, two Unitex instruments are essential: dictionaries and an inflectional graph for the prefix. Two dictionaries are involved. The first is the dictionary of canonical words. It contains all the canonical/simple words that are to be inflected with the prefix meN- in order to create the second one, the inflected dictionary. The canonical-word dictionary has the following format in Unitex.

larang,V1+Tr
gali,V2+Tr
…
kemas,V8+Tr
…

The entries above are verbs in canonical form. After the comma come the tag codes (V followed by a number, then Tr). V marks the part of speech (POS) and Tr marks a transitive verb, because the verbs that take meN- here are transitive. The transitive tag is optional and is not the focus of this discussion. The focus is the POS code. Each POS code is followed by a number: V1, V2, V3 and so on. The number identifies the affixing method and the name of the LGG file that implements it.
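To make the entry format concrete, here is a minimal Python sketch that splits one dictionary line into the parts described above. It is only an illustration outside Unitex: the names `Entry` and `parse_entry` are my own, and the sketch assumes every tag is a POS letter followed by a class number, as in the entries shown.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    lemma: str              # canonical word, e.g. "kemas"
    pos: str                # part of speech, e.g. "V"
    inflection_class: int   # the number after the POS, e.g. 8
    features: List[str]     # optional tags such as "Tr"


def parse_entry(line: str) -> Entry:
    """Split a line such as 'kemas,V8+Tr' into its components."""
    lemma, _, codes = line.partition(",")
    tag, *features = codes.strip().split("+")
    pos = tag.rstrip("0123456789")   # "V8" -> "V"
    number = tag[len(pos):]          # "V8" -> "8"
    return Entry(lemma.strip(), pos, int(number), features)


for line in ["larang,V1+Tr", "gali,V2+Tr", "kemas,V8+Tr"]:
    print(parse_entry(line))
```

The combination of `pos` and `inflection_class` (V1, V2, V8) is what points to the inflectional graph used for that word.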

Some of the descriptions are presented here.

Every canonical word tagged V1 can be concatenated directly. No deletion is required at the first letter of the canonical word. The LGG below illustrates direct concatenation of the prefix meN- with larang ‘to forbid’.

V1 Inflection

This graph is designed to concatenate ‘me’ and ‘larang’ directly, without deleting the first letter of any canonical word tagged V1. The boxes insert ‘e’ and ‘m’ (marked by ‘I=’ in the boxes). The graph reads from the right, because generation in an inflectional graph follows a first-in, first-out principle. The output below the arrow before the final state is the code I designed to mark direct concatenation. V2 inflection is executed in a different way. Consider the following illustration.

V2 Inflection

The code :b under the arrow before the final state indicates that gali ‘to dig’ takes meN- in the form meng-. No deletion is required here, but two additional letters, ‘n’ and ‘g’, are needed. The inflectional graph for V8 is different because it includes deletion.

V8 Inflection

For words tagged V8, the first letter of the canonical word must be deleted (marked by X=1). The inserts come first, followed by the deletion. This inflectional graph applies to words like kemas, kira, kasih etc. The code for this type of inflection is :h. This LGG inflects kemas to mengemas ‘to pack’.
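The three classes described so far can be mimicked outside Unitex with a few lines of Python. This is only a sketch of the string operations (insert a prefix, optionally delete the first letter), not a rendering of the actual graphs. The table `RULES` and the function `inflect` are hypothetical names of my own.

```python
# For each class: (surface form of meN-, number of initial letters to delete).
# Only the three classes described in this post are covered.
RULES = {
    "V1": ("me", 0),    # direct concatenation: larang -> melarang
    "V2": ("meng", 0),  # meng- without deletion: gali -> menggali
    "V8": ("meng", 1),  # meng- with first letter deleted: kemas -> mengemas
}


def inflect(lemma: str, tag: str) -> str:
    """Attach the surface form of meN- that belongs to the given class."""
    prefix, n_delete = RULES[tag]
    return prefix + lemma[n_delete:]


for lemma, tag in [("larang", "V1"), ("gali", "V2"), ("kemas", "V8"),
                   ("kira", "V8"), ("kasih", "V8")]:
    print(lemma, "->", inflect(lemma, tag))
# larang -> melarang
# gali -> menggali
# kemas -> mengemas
# kira -> mengira
# kasih -> mengasih
```

Keeping the rule in a table keyed by the class tag mirrors the idea that the number in the POS code selects the affixing method.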

Once the inflectional graphs have been designed for all canonical words, they are applied to the dictionary of canonical words. The result is an inflected dictionary. Here is an illustration.

Inflected Words Dictionary

This dictionary is ready to be applied to a Unitex-readable corpus in Indonesian. It recognizes sequences and provides both the canonical and the inflected form, as shown by the text automaton below. (I created a small corpus. It is not representative, but it is enough to show how the proposed auto-affixing model works.)

Automaton for Inflected Words

The automaton presents two forms: the basic form and the surface form. Look at the boxes, which hold the lexical entries. The inflected word mengemas is shown on top, and the canonical word kemas is shown below. Saya ‘I’ is presented in the same way, but for a different reason: it begins with an uppercase letter, while the dictionary is written in lowercase. Unitex has an option to either ignore or respect the uppercase/lowercase distinction.
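What the automaton shows can be imitated with a toy lookup: each surface form maps to its canonical form, and the case handling is a switch. The dictionary `INFLECTED` and the function `lookup` below are illustrative names of my own, filled only with the two words discussed above. Unitex itself does this through its own dictionary application, not through code like this.

```python
# Toy inflected dictionary: surface form -> canonical form (all lowercase).
INFLECTED = {
    "mengemas": "kemas",
    "saya": "saya",
}


def lookup(token, ignore_case=True):
    """Return the canonical form of a token, or None if it is not listed."""
    key = token.lower() if ignore_case else token
    return INFLECTED.get(key)


for token in ["Saya", "mengemas"]:
    print(token, "->", lookup(token))
# Saya -> saya
# mengemas -> kemas

print(lookup("Saya", ignore_case=False))
# None
```

With the case distinction respected, Saya is not found, because only the lowercase form is listed.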

To sum up, this post has introduced a method of automatic affixing in Indonesian. The processing reaches good precision only when a representative dictionary is applied. The post has described how an inflected-word dictionary is created by an inflectional LGG, which in turn performs the automatic affixing. Once the dictionary is applied, sequence recognition and extraction tasks can also be carried out.

In the future, LGG is expected to perform automatic affixing for more complex and irregular cases. Consider the following meN- forms, where the canonical words all start with the same sound/letter but are affixed in different ways:

pijat → memijat ‘to massage’ (first letter is deleted)
pukul → memukul ‘to hit’ (first letter is deleted)
proses → memproses ‘to process’ (first letter is not deleted)


Sunday, February 6, 2011
