To describe the automatic affixing method, some linguistic background is presented before we proceed to the processing part. The prefix meN- marks verbs in the active form, but it can be dropped in some contexts. It can also be used to mark a formal register.
(a) nikah me+nikah menikah ‘to get married’ (direct concatenation)
(b) pukul me+pukul memukul ‘to hit’ (the allomorph mem- is selected; the first letter ‘p’ of the canonical word is deleted)
Instrument
Unitex defines inflection as the process of attaching a bound morpheme to a lexeme. In morphology, inflection is distinguished from derivation. To decide whether affixation in Indonesian counts as inflection or derivation, you may want to read up on this distinction, or consult Alwi et al. (1988).
To perform automatic meN- prefixing, two Unitex instruments are essential: a dictionary and an inflectional graph for the prefix. Two dictionaries are involved in automatic affixing. The first is the dictionary of canonical words. It contains all the canonical (simple) words that are to be inflected with the prefix meN- to create the second one, the inflected dictionary. The format of the canonical words dictionary in Unitex is as follows.
larang, V1+Tr
gali, V2+Tr
…
kemas, V8+Tr
…
The format above lists verbs in their canonical forms. After the comma come the tag codes (V followed by a number, then Tr). V marks the part of speech (POS) and Tr marks a transitive verb, reflecting the assumption that verbs that can take meN- are transitive by nature. The transitive tag is optional, however, and it is not the focus of this discussion. The focus is the POS code. Each POS code is accompanied by a digit: V1, V2, V3, and so on. These numbers reflect the affixing method and the name of the corresponding LGG file.
Some of the descriptions are presented here.
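As a rough illustration of how such entries break down, here is a small Python sketch that splits a line of the canonical dictionary into its parts. This is not Unitex code; the names `Entry` and `parse_entry` are my own, and the sketch only assumes the "word, code+tag" layout shown above.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    lemma: str           # canonical form, e.g. "larang"
    pos_code: str        # e.g. "V1"; the digit selects the inflectional graph
    features: List[str]  # optional tags such as ["Tr"]


def parse_entry(line: str) -> Entry:
    """Split one 'word, code+tag' line (hypothetical helper, not a Unitex API)."""
    lemma, codes = line.split(",", 1)
    # The first code is the POS code; anything after '+' is an optional tag.
    pos_code, *features = codes.strip().split("+")
    return Entry(lemma.strip(), pos_code, features)


entries = [parse_entry(line) for line in ["larang, V1+Tr", "gali, V2+Tr", "kemas, V8+Tr"]]
for entry in entries:
    print(entry.lemma, entry.pos_code, entry.features)
```

Since the Tr tag is optional, a line such as "gali, V2" parses the same way and simply yields an empty feature list.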
V1 Inflection
This graph is designed to concatenate ‘me’ and ‘larang’ directly, without deleting the first letter of any canonical word tagged V1. The boxes insert ‘e’ and ‘m’ (marked by ‘I=’ in the boxes). The graph reads from the right, because generation in an inflectional graph follows a first-in, first-out principle. The output below the arrow before the final state is the code I designed to mark direct concatenation. V2 inflection is executed in a different way. Consider the following illustration.
V2 Inflection
The code :b under the arrow before the final state indicates that gali ‘to dig’ takes meN- in the form meng-. No deletion is required here, but two additional letters, ‘n’ and ‘g’, are needed. The inflectional graph for V8 is different because it includes deletion.
V8 Inflection
For words tagged V8, the first letter of the canonical word must be deleted (marked by X=1). The insertions come first, followed by the deletion. This inflectional graph applies to words such as kemas, kira, and kasih. The code for this type of inflection is :h. This LGG inflects kemas to mengemas ‘to pack’.
Once these inflectional graphs have been designed for all canonical words, they must be applied to the dictionary of canonical words. The result is an inflected dictionary. Here is an illustration.
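The three graphs described so far boil down to two operations: choose an allomorph of meN- and optionally delete the first letter of the canonical word. The Python sketch below mimics that logic outside Unitex. The `RULES` table and the `inflect` function are hypothetical names of mine, and the code for V1 is left as `None` because the post does not state it.

```python
# POS code -> (prefix allomorph, leading letters to delete, inflection code)
# Only the three classes discussed in the post are listed.
RULES = {
    "V1": ("me", 0, None),   # direct concatenation; code not stated in the post
    "V2": ("meng", 0, "b"),  # meng- without deletion
    "V8": ("meng", 1, "h"),  # meng- plus deletion of the first letter
}


def inflect(lemma: str, pos_code: str):
    """Return (inflected form, inflection code) for a canonical word."""
    prefix, n_delete, code = RULES[pos_code]
    # Drop the first n_delete letters of the canonical word, then prepend the prefix.
    return prefix + lemma[n_delete:], code


for lemma, pos_code in [("larang", "V1"), ("gali", "V2"), ("kemas", "V8")]:
    print(lemma, "->", inflect(lemma, pos_code))
```

Keeping the rules in a table mirrors the idea that the digit in the POS code names the graph to use, so adding a class means adding one row rather than new logic.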
Inflected Words Dictionary
This dictionary is ready to be applied to a Unitex-readable corpus in Indonesian. It recognizes sequences and provides both the canonical and the inflected form, as shown in the text automaton below. (I created a small corpus. It is not representative, but it is enough to show how the proposed auto-affixing model works.)
Automaton for Inflected Words
The automaton presents two forms: the basic form and the surface form. Look at the boxes, which are repositories for the lexicon. The inflected word mengemas is shown on top, and the canonical word kemas is shown below. Saya ‘I’ is presented in the same way, but only because it begins with an uppercase letter, whereas the dictionary is designed in lowercase. However, Unitex has an option to either ignore or respect the uppercase/lowercase distinction.
To sum up, this post has introduced a method of automatic affixing in Indonesian. However, the processing achieves better precision only when a representative dictionary is applied. This post has described how an inflected words dictionary is created by an inflectional LGG, which in turn performs automatic affixing. Once the dictionary has been applied, sequence recognition and extraction tasks can also be executed.
In the future, LGG is expected to handle more complex and irregular affixing. Consider the following meN- forms, where the canonical words all start with the same sound/letter but are affixed in different ways:
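The lookup behaviour described here can be pictured with a few lines of Python. This is only a sketch of the idea, not of how Unitex implements it: the `INFLECTED` table, the `lookup` function, and the `ignore_case` flag are my own illustrative names, and I am assuming that saya is listed in the dictionary in lowercase.

```python
from typing import Optional

# Surface form -> canonical form, all stored in lowercase (illustrative data).
INFLECTED = {
    "mengemas": "kemas",
    "saya": "saya",
}


def lookup(token: str, ignore_case: bool = True) -> Optional[str]:
    """Return the canonical form of a token, or None if it is not listed."""
    if token in INFLECTED:
        return INFLECTED[token]
    if ignore_case:
        # Fall back to the lowercase entry for tokens such as "Saya".
        return INFLECTED.get(token.lower())
    return None


print(lookup("mengemas"))
print(lookup("Saya"))
print(lookup("Saya", ignore_case=False))
```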
pijat memijat ‘to massage’ (first letter is deleted)
pukul memukul ‘to hit’ (first letter is deleted)
proses memproses ‘to process’ (first letter is not deleted)
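One possible way to handle such cases, sketched in Python rather than in an LGG, is to keep deletion as the default for p-initial words and list the words that keep their first letter as exceptions. This is an assumption of mine, not something the current graphs do; `KEEP_INITIAL` and `mem_prefix` are hypothetical names, and the exception set contains only the one example given above.

```python
# Words that take mem- but keep their initial 'p' (illustrative, from the example above).
KEEP_INITIAL = {"proses"}


def mem_prefix(lemma: str) -> str:
    """Attach mem- to a p-initial canonical word, deleting 'p' unless listed as an exception."""
    if not lemma.startswith("p"):
        raise ValueError("this sketch only covers words starting with 'p'")
    stem = lemma if lemma in KEEP_INITIAL else lemma[1:]
    return "mem" + stem


for word in ["pijat", "pukul", "proses"]:
    print(word, "->", mem_prefix(word))
```

In dictionary terms, the same effect could be obtained by giving the exceptional words a different POS code, so that they are routed to a graph without deletion.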