Auto Affixing Model for Indonesian Revisited
Prihantoro
= = = = = =
You can read the post here, but I will upload the pictures later, so it is a bit messy for now. I have also uploaded the Ms. Word version (the tidier one); click or copy and paste the following link: https://drive.google.com/file/d/0B-kgsOSeEERoT2w3UXBpaURHY2s/edit?usp=sharing
= = = = = =
I received a comment from Alexis Neme about my article ‘Auto Affixing Model for Indonesian’, which I posted on my blog, corpusprocessing.blogspot.com. Unfortunately, I was on vacation at the time and had left my project at home. When I returned, I tried his suggestion right away. He suggested a way to avoid writing the prefixes backward. I had written them backward because that was the only method that handled affixing in Indonesian.
1. LEMMA Tag in Semitic Module to Avoid Backward Writing
To avoid backward writing, he recommended that I use Semitic mode and the <LEMMA> tag, where the lemma is treated as a consonantal skeleton, as in Arabic. He had applied this feature to Tagalog. At first there were some problems in applying the tag, which concerned the Semitic module in UNITEX, but they were resolved quickly. Semitic mode handles simple prefix concatenation in Indonesian. Consider the following uninflected entry line:
makan, $V1+TO_EAT
It is inflected by the following inflectional graph:
The inflected form entry lines are as follows:
memakan,makan.V+TO_EAT:a
makan-memakan,makan.V+TO_EAT:d
makan-makan,makan.V+TO_EAT:c
Not only does it successfully prefix <me> to the lemma <makan>, resulting in <memakan>, it also successfully reduplicates the lemma into <makan-memakan> (showing reciprocal action) and <makan-makan> (eat together). In Indonesian, reduplication is marked by a hyphen between the original lemma and the reduplicated form. However, this feature needs further improvement to handle all affixing and reduplication features in Indonesian.
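The three inflected forms above can be reproduced with a minimal Python sketch. This is not UNITEX code and says nothing about how the inflectional graph is drawn; the function names `prefix_me` and `reduplicate` are hypothetical and only illustrate simple concatenation and hyphenated reduplication.

```python
def prefix_me(lemma: str) -> str:
    """Simple concatenation: attach the prefix <me> to the lemma."""
    return "me" + lemma


def reduplicate(lemma: str, second: str = "") -> str:
    """Join the lemma and a second element with a hyphen.

    If no second element is given, the lemma itself is copied.
    """
    return lemma + "-" + (second if second else lemma)


lemma = "makan"
forms = {
    "a": prefix_me(lemma),                      # memakan
    "d": reduplicate(lemma, prefix_me(lemma)),  # makan-memakan
    "c": reduplicate(lemma),                    # makan-makan
}

# Print the forms in the same layout as the inflected entry lines.
for code, form in forms.items():
    print(f"{form},{lemma}.V+TO_EAT:{code}")
```

Each form needs only one kind of operation (attach or copy), which is why Semitic mode copes with these cases.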
2. Failure to Handle Two-Level Morphology in Semitic Mode
I am aware that infixes are quite productive in Tagalog, and for this, Semitic mode helps significantly. In Indonesian, infixes are not as productive as prefixes and suffixes. However, unlike Tagalog (at least to my knowledge), some prefixes and reduplications in Indonesian require two (or more) morphological operations beyond simple concatenation (some linguists know this under the terms non-concatenative morphology or two-level morphology).
For instance, <menyuruh> is the inflected form of <suruh> ‘to command someone to do something’. Notice that this is not as simple as concatenating <me> to <makan> (see the previous section). There are two operations here: first, the initial character of the lemma string <suruh> is deleted, leaving <uruh>; then the prefix <meny> is attached. This phenomenon is actually the result of nasal assimilation in phonology (see Prince and McCarthy’s Optimality Theory, 1993). But since in UNITEX we deal with the orthographic representation, I will not discuss this further here.
Another example is <sayur-mayur> (variety of vegetables), which is the reduplicated form of <sayur> ‘vegetable’. Here, a hyphen is inserted after the lemma, giving <sayur->, and then the lemma is copied, giving <sayur-sayur>. Finally, the initial character of the copied lemma is replaced by <m>, resulting in <sayur-mayur>.
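The two multi-step derivations just described can be written out as a short Python sketch. It is only an illustration of the order of operations on the orthographic string, under the assumption that the prefix variant <meny> and the replacement character <m> are given; it does not model nasal assimilation, and the function names are hypothetical.

```python
def meny_form(lemma: str) -> str:
    """Two operations: delete the initial character, then attach <meny>."""
    stem = lemma[1:]      # suruh -> uruh
    return "meny" + stem  # uruh  -> menyuruh


def rhyming_reduplication(lemma: str, onset: str = "m") -> str:
    """Three operations: add a hyphen, copy the lemma, replace the copy's
    initial character with the given onset."""
    copy = lemma               # sayur- + sayur
    copy = onset + copy[1:]    # sayur -> mayur (only the copy changes)
    return lemma + "-" + copy  # sayur-mayur


print(meny_form("suruh"))              # menyuruh
print(rhyming_reduplication("sayur"))  # sayur-mayur
```

In both functions the deletion or replacement targets a specific piece (the lemma, or the copied lemma) before the pieces are joined, and that is exactly the part that turns out to matter.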
I added entry lines for <suruh> and <sayur> (shown below with the earlier <makan> entry) and wrote inflectional graphs for them, expecting <menyuruh> and <sayur-mayur> as the result:
makan,$V1+TO_EAT
suruh,$V2+TO_COMMAND
sayur,$N2+VEGETABLE
The result was not what I had expected. For <suruh>, I expected that <x=1> would delete the initial character of the lemma, turning <suruh> into <uruh>. Instead, it deleted the initial character of the inflected form, producing <enysuruh>. As for <sayur-mayur>, I wrote <R=m> in my graph before the position of the copied lemma (the second one), expecting that it would replace the initial character of the copied lemma. Instead, it replaced the initial character of the original lemma (the first one).
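The difference between what I expected and what I got can be mimicked with a toy Python model. This is an assumption about the observed behavior only, not a description of how UNITEX is implemented: the operation is simply applied either to the whole form (its left edge) or to the one piece it was meant for. The helper names are hypothetical.

```python
def delete_first(s: str) -> str:
    """Delete the initial character of a string."""
    return s[1:]


def replace_first(s: str, ch: str) -> str:
    """Replace the initial character of a string with ch."""
    return ch + s[1:]


# Observed: the operation hits the left edge of the whole inflected form.
observed_verb = delete_first("meny" + "suruh")
observed_noun = replace_first("sayur" + "-" + "sayur", "m")

# Expected: the operation hits only the piece to its right.
expected_verb = "meny" + delete_first("suruh")
expected_noun = "sayur" + "-" + replace_first("sayur", "m")

print(observed_verb, expected_verb)  # enysuruh menyuruh
print(observed_noun, expected_noun)  # mayur-sayur sayur-mayur
```

The same two helpers give the wrong or the right form depending only on which string they are applied to.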
3. Some Possible Solutions
In my opinion, there are some possible solutions for this with regard to the inflectional graph. The first is to make UNITEX execute the operations in order. The <x=1> and <R=1> metasymbols must apply to the right (instead of to the left). This would avoid turning <meny> into <eny>, and would instead turn <suruh> into <uruh>. After the deletion, simple concatenation can be done. As for the reduplication, it would replace the initial character of the copied lemma (the right one) instead of the original lemma (the left one), resulting in <sayur-mayur> instead of <mayur-sayur>.
The second solution is to write metasymbols with SIMILAR (but not identical) functionality to delete <X=?> and replace <I=?>, but which can be applied in an order that supports the user's needs.
<m e n y s u r u h>
<1 2 3 4 5 6 7 8 9>
<X=5> deletes the fifth character (not the first five characters)
<m e n y u r u h>
<1 2 3 4 6 7 8 9>
In the case of <sayur-mayur>, here is how I expect it to work. It copies <sayur> to <sayur-sayur>. The metasymbol is then expected to replace the 7th character with <m>, resulting in <sayur-mayur>. I am thinking of <R7=m>:
<s a y u r - s a y u r>
<1 2 3 4 5 6 7 8 9 10 11>
<R7=m> replaces the 7th character of the string with <m>
<s a y u r - m a y u r>
<1 2 3 4 5 6 7 8 9 10 11>
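The intended behavior of the two proposed metasymbols, <X=5> and <R7=m>, can be sketched in Python as follows. These metasymbols are only my proposal, not existing UNITEX operators, and the function names `delete_at` and `replace_at` are hypothetical. Positions are counted from 1, as in the diagrams above.

```python
def delete_at(form: str, pos: int) -> str:
    """Proposed <X=pos>: delete only the character at 1-based position pos."""
    if not 1 <= pos <= len(form):
        raise ValueError(f"position {pos} is outside the form {form!r}")
    return form[:pos - 1] + form[pos:]


def replace_at(form: str, pos: int, ch: str) -> str:
    """Proposed <Rpos=ch>: replace the character at 1-based position pos."""
    if not 1 <= pos <= len(form):
        raise ValueError(f"position {pos} is outside the form {form!r}")
    return form[:pos - 1] + ch + form[pos:]


print(delete_at("menysuruh", 5))          # menyuruh
print(replace_at("sayur-sayur", 7, "m"))  # sayur-mayur
```

Because the position is counted over the whole concatenated form, the result no longer depends on whether the operation is applied to the left or to the right. The drawback is that the position depends on the length of the prefix and of the lemma, so a fixed number such as 5 or 7 only suits lemmas of the same shape.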
Using the Semitic module is not entirely bad, as it can handle simple concatenation and simple reduplication. Unfortunately, it fails to account for two-level morphology. Using the English module, as I did previously, helps in handling two-level morphology, but it is visually disturbing, and it cannot handle reduplication. This is to be expected, as the two modules (Semitic and English) were not designed for Indonesian. Therefore, it is necessary to build a special module that can handle the linguistic features of Indonesian.