Corpus Processing: January 2014

Auto Affixing Model for Indonesian Revisited

Prihantoro

= = = = = =

You can read the post here, but I will upload the pictures later, so it is a bit messed up for now. I have also uploaded the MS Word version (the tidier one); click or copy and paste the following link: https://drive.google.com/file/d/0B-kgsOSeEERoT2w3UXBpaURHY2s/edit?usp=sharing

= = = = = =

I received a comment from Alexis Neme about my article ‘Auto Affixing Model for Indonesian’, which I posted on my blog, corpusprocessing.blogspot.com. Unfortunately I was on vacation at the time and had left my project at home. When I returned, I tried to apply his suggestion right away. He suggested a way to avoid writing the prefixes backward. I had written them backward because that was the only method I had found that handled affixing in Indonesian.

1. LEMMA tag in Semitic Module to Avoid Backward Writing

To avoid backward writing, he recommended that I use Semitic mode and the <LEMMA> tag, where the lemma is treated as a consonantal skeleton, as in Arabic. He had applied this feature to Tagalog. At first there were some problems in applying the tag, related to the Semitic module in UNITEX, but they were resolved fairly quickly. Semitic mode handles simple prefix concatenation in Indonesian. Consider the following entry line in the uninflected form:

makan,$V1+TO_EAT

It is inflected by the following inflectional graph:

The inflected form entry lines are as follows:

memakan,makan.V+TO_EAT:a

makan-memakan,makan.V+TO_EAT:d

makan-makan,makan.V+TO_EAT:c

Not only does it successfully prefix <me> to the lemma <makan>, resulting in <memakan>, it also successfully reduplicates it into <makan-memakan> (showing reciprocal action) and <makan-makan> (eating together). In Indonesian, reduplication is marked by a hyphen between the original lemma and the reduplicated form. However, this feature needs further improvement to handle all affixing and reduplication features in Indonesian.
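To make the three entry lines above concrete, here is a minimal Python sketch of the same simple concatenation and reduplication. It is plain string handling, not how UNITEX or its inflectional graphs work internally; the function name `inflect_simple` and its parameters are my own assumptions for illustration.

```python
def inflect_simple(lemma, pos, sem, prefix="me"):
    """Build DELAF-style lines using only concatenation (hypothetical helper)."""
    prefixed = prefix + lemma
    forms = [
        (prefixed, "a"),                # prefixing: memakan
        (lemma + "-" + prefixed, "d"),  # reciprocal reduplication: makan-memakan
        (lemma + "-" + lemma, "c"),     # plain reduplication: makan-makan
    ]
    # inflected form, lemma, grammatical code, semantic code, inflectional code
    return [f"{form},{lemma}.{pos}+{sem}:{code}" for form, code in forms]


lines = inflect_simple("makan", "V", "TO_EAT")
for line in lines:
    print(line)
```

Nothing in this sketch deletes or replaces characters, which is why it only covers lemmas like <makan>.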

2. Failure to Handle Two-Level Morphology in Semitic Mode

I am aware that infixes are quite productive in Tagalog, and for this, Semitic mode helps significantly. In Indonesian, infixes are not as productive as prefixes and suffixes. However, unlike Tagalog (at least to my knowledge), some prefixes and reduplications in Indonesian require two (or more) morphological operations, beyond simple concatenation (some linguists know this under the terms non-concatenative morphology or two-level morphology).

For instance, <menyuruh> is the inflected form of <suruh> ‘to command someone to do something’. Notice that this is not as simple as concatenating <me> to <makan> (see the previous section). There are two operations here: first, the initial character of the lemma string <suruh> is deleted, leaving <uruh>; then the prefix <meny> is attached. This phenomenon is actually the result of nasal assimilation in phonology (see Prince and McCarthy’s Optimality Theory, 1993). But since in UNITEX we deal with the orthographic representation, I will not discuss this further here.

Another example is <sayur-mayur> ‘variety of vegetables’, which is the reduplicated form of <sayur> ‘vegetable’. Here, a hyphen is inserted after the lemma, giving <sayur->, and then the lemma is copied, giving <sayur-sayur>. Finally, the initial character of the copied lemma is replaced by <m>, resulting in <sayur-mayur>.
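The two multi-step derivations just described can be traced step by step in a short Python sketch. This is only an illustration on the orthographic strings; the function names are hypothetical, and the prefix <meny> is passed in by hand rather than chosen by any rule.

```python
def nasal_prefix_steps(lemma, prefix="meny"):
    """Return the intermediate strings: delete the initial character, then prefix."""
    stem = lemma[1:]                # suruh -> uruh
    return [stem, prefix + stem]    # uruh -> menyuruh


def rhyming_reduplication_steps(lemma, new_initial="m"):
    """Return the intermediate strings: add hyphen, copy lemma, replace initial of the copy."""
    with_hyphen = lemma + "-"                          # sayur-
    copied = with_hyphen + lemma                       # sayur-sayur
    replaced = with_hyphen + new_initial + lemma[1:]   # sayur-mayur
    return [with_hyphen, copied, replaced]


verb_steps = nasal_prefix_steps("suruh")
noun_steps = rhyming_reduplication_steps("sayur")
print(verb_steps)
print(noun_steps)
```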

I added the entry lines with <suruh> and <sayur>, and I also wrote the following inflectional graphs expecting <menyuruh> and <sayur-mayur> as the result. However, this is the result that I got:

makan,$V1+TO_EAT

suruh,$V2+TO_COMMAND

sayur,$N2+VEGETABLE

The result was not what I had expected. For <suruh>, I expected that <x=1> would delete the initial character of the lemma, turning <suruh> into <uruh>. Instead, it deleted the initial character of the inflected form, giving <enysuruh>. As for <sayur-mayur>, I wrote <R=m> in my graph before the position of the copied lemma (the second one), expecting that it would replace the initial character of the copied lemma. Instead, it replaced the initial character of the original lemma (the first one).

3. Some Possible Solutions

In my opinion, there are some possible solutions for this with regard to the inflectional graph. The first is to make UNITEX execute the operations in order. The <x=1> and <R=m> metasymbols must apply to the right (instead of to the left). This would avoid turning <meny> into <eny>, and would instead turn <suruh> into <uruh>. After the deletion, simple concatenation can be done. As for the reduplication, it would replace the initial character of the copied lemma (the right one) instead of the original lemma (the left one), resulting in <sayur-mayur> instead of <mayur-sayur>.
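The difference between applying an operator to the left and to the right can be sketched in Python as follows. This only models the behaviour I observed and the behaviour I would like on plain strings; it does not reproduce UNITEX's actual graph evaluation, and the helper names are hypothetical.

```python
def delete_initial(s, n=1):
    """Delete the first n characters of s."""
    return s[n:]


def replace_initial(s, ch):
    """Replace the first character of s with ch."""
    return ch + s[1:]


verb = "suruh"
# Operator applied to the left: it hits the already prefixed form.
observed_verb = delete_initial("meny" + verb)
# Operator applied to the right: it hits the lemma before the prefix is attached.
expected_verb = "meny" + delete_initial(verb)

noun = "sayur"
# Applied to the left: the original (first) lemma is changed.
observed_noun = replace_initial(noun, "m") + "-" + noun
# Applied to the right: the copied (second) lemma is changed.
expected_noun = noun + "-" + replace_initial(noun, "m")

print(observed_verb, expected_verb)
print(observed_noun, expected_noun)
```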

The second solution is to write metasymbols with SIMILAR (but not identical) functionality to delete <X=?> and replace <R=?>, but which can apply in an order that supports users’ needs.

<1 2 3 4 5 6 7 8 9>

<X=5> deletes the fifth character (not the first five characters)

<1 2 3 4 6 7 8 9>
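As a rough illustration of the positional delete proposed above, the following Python sketch removes exactly one character at a 1-based position. The function `delete_at` is a hypothetical stand-in for the proposed metasymbol, not an existing UNITEX operator.

```python
def delete_at(s, position):
    """Delete the single character at a 1-based position (hypothetical <X=n> behaviour)."""
    if not 1 <= position <= len(s):
        raise ValueError("position is outside the string")
    return s[:position - 1] + s[position:]


result = delete_at("123456789", 5)
print(result)
```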

In the case of <sayur-mayur>, here is how I expect it to work. It copies <sayur> to give <sayur-sayur>. The metasymbol is then expected to replace the 7th character with <m>, resulting in <sayur-mayur>. I am thinking of <R7=m>.

<1 2 3 4 5 6 7 8 9 10 11>

<R7=m> replaces the 7th character of the string with <m>

<1 2 3 4 5 6 m 8 9 10 11>
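A small Python sketch shows how a positional replace written as <R7=m> could be read and applied to the reduplicated string. The operator syntax is only my proposal, and the parser below is an assumption for illustration; UNITEX does not provide it.

```python
import re


def apply_positional_replace(operator, s):
    """Apply a proposed operator such as <R7=m>: replace the n-th character (1-based)."""
    match = re.fullmatch(r"<R(\d+)=(.)>", operator)
    if match is None:
        raise ValueError("not a positional replace operator: " + operator)
    position = int(match.group(1))
    new_char = match.group(2)
    if not 1 <= position <= len(s):
        raise ValueError("position is outside the string")
    return s[:position - 1] + new_char + s[position:]


lemma = "sayur"
reduplicated = lemma + "-" + lemma   # sayur-sayur, 11 characters
result = apply_positional_replace("<R7=m>", reduplicated)
print(result)
```

The position 7 works here only because <sayur> has five characters plus the hyphen; a lemma of a different length would need a different position.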

Using the Semitic module is not entirely bad, as it is able to handle simple concatenation and simple reduplication. Unfortunately, it fails to account for two-level morphology. Using the English module, as I did previously, can handle two-level morphology, but it is visually disturbing and cannot handle reduplication. This is to be expected, as the two modules (Semitic and English) were not designed for Indonesian. It is therefore necessary to build a special module capable of handling the linguistic features of Indonesian.


Saturday, January 25, 2014
