Saturday, January 25, 2014

Auto Affixing Model for Indonesian Revisited


Prihantoro

= = = = = = 
You can read the post here, but I will upload the pictures later, so it is a bit messy for now. I also uploaded the MS Word version (the tidier one) here, or copy and paste the following link: https://drive.google.com/file/d/0B-kgsOSeEERoT2w3UXBpaURHY2s/edit?usp=sharing
= = = = = =
I received a comment from Alexis Neme about my article ‘Auto Affixing Model for Indonesian’, which I posted on my blog corpusprocessing.blogspot.com. Unfortunately, I was on vacation at the time and had left my project at home. When I returned, I tried to apply his suggestion right away. He suggested a way to avoid writing the prefixes backward. I had written them backward because that was the only method that satisfied affixing in Indonesian.

1.    LEMMA tag in Semitic Module to Avoid Backward Writing
To avoid backward writing, he recommended that I use Semitic mode and the <LEMMA> tag, where the lemma is treated as a consonantal skeleton, as in Arabic. He had applied this feature to Tagalog. At first, there were some problems in applying the tag, which concerned the Semitic module in UNITEX, but they were resolved pretty soon. Semitic mode satisfies simple prefix concatenation in Indonesian. Consider the following entry line in the uninflected form:

makan,$V1+TO_EAT

It is inflected by the following inflectional graph:

The inflected form entry lines are as follows:
memakan,makan.V+TO_EAT:a
makan-memakan,makan.V+TO_EAT:d
makan-makan,makan.V+TO_EAT:c
Not only does it successfully prefix <me> to the lemma <makan>, resulting in <memakan>, it also successfully reduplicates the lemma into <makan-memakan> (showing reciprocal action) and <makan-makan> (eat together). In Indonesian, reduplication is marked by a hyphen between the original lemma and the reduplicated form. However, this feature needs more improvement to handle all affixing and reduplication features in Indonesian.
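As a rough illustration outside UNITEX, the three entry lines above can be reproduced with plain string concatenation in Python. This is only a sketch: the function name `inflect_simple` and the mapping of the codes a, c and d to the three forms are assumptions made for illustration (read off the entry lines above), and the code says nothing about how the Semitic-mode graph works internally.

```python
def inflect_simple(lemma, pos, sem):
    """Build entry lines by simple concatenation (hypothetical helper).

    Each tuple pairs an inflected form with the code shown in the
    entry lines above.
    """
    forms = [
        ("me" + lemma, "a"),             # prefixing: memakan
        (lemma + "-" + lemma, "c"),      # full reduplication: makan-makan
        (lemma + "-me" + lemma, "d"),    # reduplication + prefix: makan-memakan
    ]
    return [f"{form},{lemma}.{pos}+{sem}:{code}" for form, code in forms]


for line in inflect_simple("makan", "V", "TO_EAT"):
    print(line)
```

Everything here is pure concatenation, which is exactly the kind of operation that Semitic mode handled without trouble.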

2.    Failure to Handle Two-Level Morphology in Semitic Mode
I am aware that infixes are quite productive in Tagalog, and for this, Semitic mode helps significantly. In Indonesian, infixes are not as productive as prefixes and suffixes. However, unlike Tagalog (at least to my knowledge), some prefixes and reduplications in Indonesian require two (or more) morphological operations, beyond simple concatenation (some linguists are familiar with these terms: non-concatenative morphology or two-level morphology).
For instance, <menyuruh> is the inflected form of <suruh> ‘to command someone to do something’. Notice that this is not as simple as concatenating <me> to <makan> (see the previous section). There are two operations here: first, the initial character of the lemma string is deleted, so <suruh> becomes <uruh>. After that, the prefix <meny> is attached. This phenomenon is actually the result of nasal assimilation in phonology. See Prince and McCarthy’s Optimality Theory, 1993. But since in UNITEX we deal with the orthographical representation, I will not discuss this further here.
Another example is <sayur-mayur> (variety of vegetables), which is the reduplicated form of <sayur> ‘vegetable’. Here, a hyphen is inserted after the lemma, giving <sayur->, and then the lemma is copied, giving <sayur-sayur>. Finally, the initial character of the copied lemma is replaced by <m>, resulting in <sayur-mayur>.
I added the entry lines for <suruh> and <sayur>, and I also wrote inflectional graphs for them, expecting <menyuruh> and <sayur-mayur> as the result. These are the entry lines, followed by the result that I got:

makan,$V1+TO_EAT
suruh,$V2+TO_COMMAND
sayur,$N2+VEGETABLE


The result was not what I had expected. For <suruh>, I expected that <X=1> would delete the initial character of the lemma, turning <suruh> into <uruh>. But it turned out to delete the initial character of the inflected form, giving <enysuruh>. As for <sayur-mayur>, I wrote <R=m> in my graph before the position of the copied lemma (the second one), expecting that it would replace the initial character of the copied lemma. But it turned out to replace the initial character of the original lemma (the first one).
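The difference between what I expected and what I got can be mimicked with a few lines of Python. This is only a toy simulation of the scope problem, not UNITEX's actual implementation: the helper names (`delete_initial`, `replace_initial`, `observed`, `expected`) are hypothetical, and the only assumption is that the operator acts either on the whole string built so far or on the lemma segment alone.

```python
def delete_initial(s):
    """Drop the first character of s."""
    return s[1:]


def replace_initial(s, ch):
    """Replace the first character of s with ch."""
    return ch + s[1:]


def observed(lemma):
    """Operators act on the whole string built so far (what I got)."""
    prefixed = delete_initial("meny" + lemma)
    reduplicated = replace_initial(lemma + "-" + lemma, "m")
    return prefixed, reduplicated


def expected(lemma):
    """Operators act on the lemma (or its copy) only (what I wanted)."""
    prefixed = "meny" + delete_initial(lemma)
    reduplicated = lemma + "-" + replace_initial(lemma, "m")
    return prefixed, reduplicated


print(observed("suruh")[0], expected("suruh")[0])
print(observed("sayur")[1], expected("sayur")[1])
```

The operations are the same in both functions; only the segment they apply to differs, and that is enough to produce the wrong forms.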

3.    Some Possible Solutions
In my opinion, there are some possible solutions for this with regard to the inflectional graph. The first is to make UNITEX execute the operations in order. The <X=1> and <R=m> metasymbols must apply to the right (instead of to the left). This avoids reducing <meny> to <eny>; instead, <suruh> is reduced to <uruh>. After the deletion, simple concatenation can be done. As for the reduplication, the metasymbol will replace the initial character of the copied lemma (the right one) instead of the original lemma (the left one), resulting in <sayur-mayur> instead of <mayur-sayur>.

[Figure: right arrows showing the direction in which <R=m> and <X=1> should apply]
The second solution is to write metasymbols with SIMILAR (but not the same) functionality to delete <X=?> and replace <I=?>, but which apply in an order that supports the users’ needs.

<m   e   n   y   s   u   r   u   h>
<1   2   3   4   5   6   7   8   9>


<X=5> deletes the fifth character only (not the first five characters)

<menyuruh>
<12346789>
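The positional delete I have in mind can be sketched in Python. The function name `delete_at` is hypothetical; it only illustrates the behaviour I would like <X=5> to have, using 1-based positions as in the table above.

```python
def delete_at(s, position):
    """Delete the single character at a 1-based position (hypothetical <X=n>)."""
    if not 1 <= position <= len(s):
        raise ValueError(f"position {position} is outside the string")
    index = position - 1  # convert to Python's 0-based indexing
    return s[:index] + s[index + 1:]


print(delete_at("menysuruh", 5))
```

Applied to <menysuruh>, deleting position 5 removes the <s> and leaves <menyuruh>.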

In the case of <sayur-mayur>, here is how I expect it to work. It copies <sayur>, giving <sayur-sayur>. The metasymbol is then expected to replace the 7th character with <m>, resulting in <sayur-mayur>. I am thinking of <R7=m>.

<s   a   y   u   r   -   s   a   y   u   r>
<1   2   3   4   5   6   7   8   9   10  11>
                         m

<R7=m> replaces the 7th character of the string with <m>
<s   a   y   u   r   -   m   a   y   u   r>
<1   2   3   4   5   6   7   8   9   10  11>
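The positional replace can be sketched in the same way. Again, `replace_at` is a hypothetical name, and the code only shows the behaviour I would like <R7=m> to have on the reduplicated string.

```python
def replace_at(s, position, ch):
    """Replace the character at a 1-based position with ch (hypothetical <Rn=c>)."""
    if not 1 <= position <= len(s):
        raise ValueError(f"position {position} is outside the string")
    index = position - 1  # convert to Python's 0-based indexing
    return s[:index] + ch + s[index + 1:]


lemma = "sayur"
reduplicated = lemma + "-" + lemma      # sayur-sayur
print(replace_at(reduplicated, 7, "m"))
```

Because the position is counted over the whole string, the replacement lands on the copied lemma and not on the original one.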
               
Using the Semitic module is not completely bad, as it can handle simple concatenation and simple reduplication. Unfortunately, it fails to account for two-level morphology. Using the English module, as I did previously, helps in handling two-level morphology, but it is visually disturbing, and it cannot handle reduplication. This is normal, as the two modules (Semitic and English) were not designed for Indonesian. Therefore, it is necessary to build a special module that can handle the linguistic features of Indonesian.

Wednesday, November 13, 2013

Indonesian Spell Checker

You might need an Indonesian spell checker running in MS Word. Well, you can use this indodic spell checker! Get it here, or copy and paste this link
http://indodic.com/SpellCheckInstall.html into your address bar.
It is basically an Indonesian wordlist (tokens) copied into the MS Word wordlist repository for the proofing language. It is as easy as one, two, three. You can even create your own spell checker by creating your own wordlist! Use this as a guide! You can even use this for some other languages as well!!! Try it^^

AntLab_Laurence Anthony


Laurence Anthony's AntLab software tools are very useful for corpus processing.

  1. AntConc: A freeware concordance program
  2. AntPConc: A freeware parallel concordance program
  3. AntWordProfiler: A freeware word profiling program for Windows, Macintosh OS X, and Linux similar to Paul Nation's Range program
  4. AntMover: A freeware text structural analyzer program
  5. AntCLAWS-GUI: A front-end interface to the CLAWS tagger developed at Lancaster University, UCREL. Note that you must have CLAWS installed before you can use AntCLAWS-GUI. See the readme file.
The download link is here. If nothing happens, copy and paste this: http://www.antlab.sci.waseda.ac.jp/software.html

Friday, December 9, 2011

Transducer for Auto-Convert

Local Grammar Graphs might be applied to support Computer-Assisted Language Learning. Here, I have composed a transducer (a local grammar that can give output) to convert archaic English words to Modern English. We know that literature and religious texts are two major genres where archaic English is often preserved. Consider this excerpt from Sir Walter Scott's Ivanhoe:

-----He exclaimed in a lower tone "Couldst thou have ruled thine unreasonable passion, thy father ....----

We can see that there are some archaic words in the text. The conventional way to solve this problem is to consult dictionaries, especially those containing archaic entries. But should you leaf through your dictionary again and again, or keep typing queries if your dictionary is electronic? Can we automate this process? The answer is YES. By applying the transducer to this kind of text, which must be machine-readable, we can obtain the equivalents in present-day English (or, we can say, Modern English). The result will look like this:

-----He exclaimed in a lower tone "Couldst [could] thou [you] have ruled thine unreasonable passion, thy [your] father ....----


The transducer that I designed locates the archaic words in the text by consulting a lexical resource (a machine-readable dictionary that I had previously constructed), which contains archaic entries and their equivalents in present-day English, and automatically assigns the present-day English equivalents to the words.
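Outside Unitex, the same idea (look each word up in a lexicon and append the modern equivalent in brackets) can be sketched in Python. This is not the transducer itself: the three-entry `ARCHAIC` dictionary and the function name `annotate` are assumptions for illustration, with the entries taken from the bracketed output above.

```python
import re

# Tiny illustrative lexicon; a real resource would contain many more entries.
ARCHAIC = {"couldst": "could", "thou": "you", "thy": "your"}


def annotate(text, lexicon=ARCHAIC):
    """Append [modern equivalent] after every word found in the lexicon."""
    def add_gloss(match):
        word = match.group(0)
        modern = lexicon.get(word.lower())
        return f"{word} [{modern}]" if modern else word

    # Visit every alphabetic token; everything else is left untouched.
    return re.sub(r"[A-Za-z]+", add_gloss, text)


excerpt = '"Couldst thou have ruled thine unreasonable passion, thy father'
print(annotate(excerpt))
```

Words that are not in the lexicon (such as "thine" here) are simply left as they are, so the quality of the result depends entirely on the coverage of the dictionary.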


Sunday, February 6, 2011

Auto-Affixing Model for Indonesian with Unitex

This post deals with a method of performing automatic affixing tasks in Indonesian with Unitex. There are two types of dictionary in Unitex: the simple word dictionary and the inflected word dictionary. The inflected dictionary is generated from the simple dictionary. The words in the simple dictionary might be canonical or compound words, but they are not inflected. This dictionary is then inflected by an inflectional LGG. The inflectional LGG carries the affixes and the rules for inflection, which in turn create the inflected dictionary.
In Indonesian, there are three kinds of affixes: prefix, suffix and infix. Well, some might add ‘circumfix’, which is a combination of prefix and suffix. The patterns are mostly predictable, but of course, there are some exceptions (or many^^). However, we are not going to discuss the whole phenomenon here. To focus the discussion, the prefix meN- is highlighted in this post.

In order to describe the automatic affixing method, some linguistic background is presented before we proceed to the processing part. The prefix meN- marks verbs in active form, but can be dropped in some contexts. This prefix can also be used to mark a formal situation.
N stands for a nasal sound. meN- is the underlying form, but it might take various surface forms such as meng-, men-, mem- and me-. Consider the following examples:

Saturday, February 5, 2011

Position Swap (Transformational Grammar)

a. "there is a cat," said the man.
b. the man said "there is a cat".

In the direct speech above, the positions of the agent, the reporting verb and the speech are swapped. We may refer to this as a transformation. Basically, the strings are composed of three variables. The first variable is the agent. The agent is the one who utters the speech, and it is usually human. Syntactically, it is realized as an NP. The second variable is the reporting verb. Reporting verbs are verbs used in reported speech, such as say, reply, shout, etc. The third variable is the speech uttered by the agent. In direct speech, it is marked by quotes ("speech"). The speech can be just one word, a phrase or a clause.

First variable (V0): agent

Second variable (V1): Reporting verb

Third variable (V2): speech <"speech">

The aim is to swap the order V2+V1+V0 to V0+V1+V2. In this way, sentence (a) is changed to sentence (b). There are several steps to do this, and we're going to do it with an LGG.


First, LGG must be designed to recognize V2+V1+V0.

Second, the positions of the variables must be swapped. To do this, we must set up a perimeter around each variable, which is called the variable function and is marked by red brackets. How do we set up a variable?

$ + variable name + opening bracket + box filled with lexicon + $ + variable name (must be the same as the previous one) + closing bracket

LGG for Direct Speech Transformation


Speech is marked by quotes (opening and closing). However, in Unitex, a quote must be escaped with a backslash (\). Therefore, in the box, quotes are written as \". Inside the quotes, there is a speech sequence. This sequence begins with an uppercase letter just after the opening bracket. A token starting with an uppercase letter is written as <PRE>. Speech is not limited to one word; it can be a phrase or a clause. Since the sequence is unlimited, we put a loop on the token box, meaning a sequence of tokens. The sequence ends with a comma and a quotation mark.

Unfortunately, there is no dictionary set for reporting verbs. However, a verb is most likely a reporting verb when it occurs right after speech. Therefore, I just put a box indicating any verb. The agent is really open to further development. My LGG captures only NPs that match one of two patterns.

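Third (after a long explanation, here's the climax), we must swap the order. Note that in the graph the variables are numbered in the order in which they appear in sentence (a), so the numbering is the reverse of the list above. var0 is set up on the speech, excluding the quotation marks. Why? Because otherwise it would include the comma at the end of the speech. Therefore, it spans from the beginning of the speech to right before the comma. var1 is set on the reporting verb and var2 on the agent. The following output is given so that the positions are swapped: $var2$ $var1$ "$var0$".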

From the concordance under the LGG, you can see that it successfully transforms sentences composed of direct speech + reporting verb + agent (e.g. "there is a cat," said the man) into agent + reporting verb + direct speech (e.g. the man said "there is a cat").
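For readers who do not use Unitex, the same swap can be approximated with a regular expression in Python. This is a swapped-in technique, not the LGG: the pattern, the group names and the function name `swap_direct_speech` are assumptions for illustration, and the pattern accepts any single word as the reporting verb, just as the graph accepts any verb.

```python
import re

# speech: text inside the quotes, up to the comma before the closing quote
# verb:   a single word right after the closing quote
# agent:  everything up to the final full stop
PATTERN = re.compile(
    r'"(?P<speech>[^"]+?),"\s*(?P<verb>\w+)\s+(?P<agent>[^".]+)\.'
)


def swap_direct_speech(sentence):
    """Rewrite '"speech," verb agent.' as 'agent verb "speech".'"""
    return PATTERN.sub(r'\g<agent> \g<verb> "\g<speech>".', sentence)


print(swap_direct_speech('"there is a cat," said the man.'))
```

The named groups play the role of the graph variables, and the replacement string plays the role of the output. As with the graph, the comma is kept outside the speech group so that it does not appear in the transformed sentence.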

Friday, February 4, 2011

Local Grammar Graph

This post deals with some aspects of the local grammar graph (LGG), a sequence recognition and extraction tool. Local Grammar was developed by Maurice Gross to capture and describe linguistic phenomena.
Some Advantages of LGG over Regular Expressions

Question: is it the only tool? There is another tool called regular expressions, a term many computer scientists are familiar with. However, LGG has some advantages over regular expressions. First, LGG is supported by a graphical interface. It's user friendly. For someone like me who does not know much programming, it has proven to be very helpful. Check out this illustration to see how an LGG looks:

LGG with Output


Second, with LGG, you can give output. So here is the order: you recognize, extract, and after that give output. The example above shows an LGG that is equipped with output. In the example, the output is designed to give a description of every extracted lexical item. You can also have a translation for each lexical item, as long as your dictionary supports it.