Corpus Processing: LGG

Sunday, February 6, 2011

Auto-Affixing Model for Indonesian with Unitex

This post describes a method for performing automatic affixing in Indonesian with Unitex. Unitex has two types of dictionary: the simple-word dictionary and the inflected-word dictionary. The inflected dictionary is generated from the simple dictionary. The entries in the simple dictionary may be canonical forms or compound words, but they are not inflected. The simple dictionary is then inflected by an inflectional LGG, which carries the affixes and the inflection rules and in turn produces the inflected dictionary.
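To make the simple-to-inflected step concrete, here is a minimal Python sketch of the idea. It is not how Unitex implements inflection. The two lemmas, the grammatical code V and the prefixes di- and ter- are illustrative assumptions, and the output lines only imitate the "inflected form,lemma.code" layout of an inflected dictionary entry.

```python
# Toy "simple dictionary": lemma -> grammatical code (both are assumptions).
SIMPLE_DICT = {"baca": "V", "makan": "V"}

# Toy inflection rules as (prefix, suffix) pairs; ("", "") keeps the bare lemma.
AFFIX_RULES = [("", ""), ("di", ""), ("ter", "")]


def inflect(simple_dict, rules):
    """Apply every affix rule to every lemma and return dictionary-style lines."""
    lines = []
    for lemma, code in simple_dict.items():
        for prefix, suffix in rules:
            form = prefix + lemma + suffix
            # Layout: inflected form, then the lemma, then the grammatical code.
            lines.append(f"{form},{lemma}.{code}")
    return lines


if __name__ == "__main__":
    for line in inflect(SIMPLE_DICT, AFFIX_RULES):
        print(line)
```

In a real inflectional LGG the rules are drawn as graph paths rather than listed as pairs, but the principle is the same: one uninflected entry goes in, several inflected entries come out.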

In Indonesian, there are three kinds of affixes: prefixes, suffixes and infixes. Some would add the 'circumfix', which is a combination of a prefix and a suffix. The patterns are mostly predictable, but of course there are some exceptions (or many ^^). However, we are not going to discuss all of these phenomena here. To focus the discussion, this post highlights the prefix meN-.

Before we proceed to the processing part, some linguistic background is needed to describe the automatic affixing method. The prefix meN- marks verbs in the active form, but it can be dropped in some contexts. This prefix can also be used to mark a formal situation.

N stands for a nasal sound. meN- is the underlying form, but it may take various surface forms such as meng-, men-, mem- and me-. Consider the following examples.
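As a rough illustration of how the surface form depends on the first sound of the stem, the Python sketch below applies a simplified version of the usual nasal assimilation rule. The function name men_prefix and the exact letter groupings are my own assumptions for this sketch. It ignores exceptions, one-syllable stems and many loanwords, and it also produces meny- before s, a form not listed above.

```python
def men_prefix(stem: str) -> str:
    """Attach meN- to a stem using a simplified nasal-assimilation rule."""
    stem = stem.lower()
    first = stem[0]
    if first in "aeiough":
        return "meng" + stem       # vowels, g, h: meng- is added
    if first == "k":
        return "meng" + stem[1:]   # k is dropped after meng-
    if first in "bfv":
        return "mem" + stem        # b, f, v: mem- is added
    if first == "p":
        return "mem" + stem[1:]    # p is dropped after mem-
    if first in "cdjz":
        return "men" + stem        # c, d, j, z: men- is added
    if first == "t":
        return "men" + stem[1:]    # t is dropped after men-
    if first == "s":
        return "meny" + stem[1:]   # s is dropped after meny-
    return "me" + stem             # l, r, m, n, w, y and the rest: me-


if __name__ == "__main__":
    for word in ["ambil", "baca", "tulis", "kirim", "lihat"]:
        print(word, "->", men_prefix(word))
```

For example, men_prefix("baca") gives "membaca" and men_prefix("tulis") gives "menulis". An inflectional LGG encodes the same choices as separate paths in a graph instead of if-statements.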

Saturday, February 5, 2011

Position Swap (Transformational Grammar)

a. "there is a cat," said the man.

b. the man said "there is a cat".

In the direct speech above, the positions of the agent, the reporting verb and the speech are swapped. We may refer to this as a transformation. Basically, the strings are composed of three variables. The first variable is the agent: the one who utters the speech, usually a human. Syntactically, it is realized as an NP. The second variable is the reporting verb. Reporting verbs are verbs used in reported speech, such as say, reply and shout. The third variable is the speech uttered by the agent. In direct speech it is marked by quotes ("speech"). The speech can be a single word, a phrase or a clause.

First variable (V0): agent

Second variable (V1): Reporting verb

Third variable (V2): speech <"speech">

The aim is to swap the order from V2+V1+V0 to V0+V1+V2. In this way, sentence (a) is changed to sentence (b). This takes several steps, and we are going to do it with an LGG.

First, the LGG must be designed to recognize V2+V1+V0.

Second, the positions of the variables must be swapped. To do this, we must set up a perimeter around each variable, which is called a variable function and is marked by red brackets. How do we set up a variable?

$ + variable name + opening bracket + box filled with lexicon + $ + variable name (must be the same as the previous one) + closing bracket, for example $var0( ... $var0)

LGG for Direct Speech Transformation

Speech is marked by opening and closing quotes. However, in Unitex a quote must be escaped with a backslash (\). Therefore, in the box, the quotes are written as \". Inside the quotes there is a speech sequence. This sequence starts with an uppercase letter just after the opening bracket. A token starting with an uppercase letter is written as <PRE>. Speech is not limited to one word; it can also be a phrase or a clause. Because the length of the sequence is unlimited, we put a loop on the box that matches the remaining tokens, so that it accepts a sequence of tokens. The sequence is ended by a comma and a quotation mark.

Unfortunately, there is no dictionary set for reporting verbs. However, a verb is most likely a reporting verb when it occurs right after the speech. Therefore, I just put <V>, indicating any verb. The agent part is open to further development. My LGG captures only NPs that match two simple patterns.

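Third (after the long explanation, here is the climax), we must swap the order. Note that the variables in the graph are named by their order in the input, so their numbering is the reverse of V0 to V2 above. var0 is set up on the speech, excluding the quotation marks. Why? Because otherwise it would carry the comma inside the speech. Therefore, it spans from the beginning of the speech to right before the comma. var1 is set on the reporting verb and var2 is set on the agent. The following output is then given so that the positions are swapped: $var2$ $var1$ "$var0$".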

From the concordance under the LGG, you can see that it successfully transforms sentences composed of direct speech + reporting verb + agent (e.g. "there is a cat," said the man) into agent + reporting verb + direct speech (e.g. the man said "there is a cat").
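For readers who know regular expressions better than graphs, the same swap can be imitated in Python. This is only a stand-in sketch, not Unitex. The named groups var0, var1 and var2 play the role of the graph variables. The short verb list and the small agent pattern (determiner plus one word, or a capitalized name) are assumptions made for the example, since a regular expression has no dictionary behind it.

```python
import re

# Hypothetical hand-picked reporting verbs; the LGG uses a dictionary lookup instead.
REPORTING_VERBS = ("said", "replied", "shouted")

PATTERN = re.compile(
    r'"(?P<var0>[^"]+?),?"\s+'                             # speech, without quotes and final comma
    r'(?P<var1>' + "|".join(REPORTING_VERBS) + r')\s+'      # reporting verb
    r'(?P<var2>(?:[Tt]he|[Aa]n?)\s+\w+|[A-Z]\w*)'           # agent: small NP or a capitalized name
)


def swap_direct_speech(text: str) -> str:
    """Rewrite 'speech + verb + agent' as 'agent + verb + speech'."""
    return PATTERN.sub(r'\g<var2> \g<var1> "\g<var0>"', text)


if __name__ == "__main__":
    print(swap_direct_speech('"there is a cat," said the man.'))
```

Running it on sentence (a) prints the man said "there is a cat". Text that does not match the pattern is returned unchanged.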

Friday, February 4, 2011

Local Grammar Graph

This post deals with some aspects of the local grammar graph (LGG), a tool for recognizing and extracting sequences. Local grammars were developed by Maurice Gross to capture and describe linguistic phenomena.

Some Advantages of LGG over Regular Expressions

Is LGG the only tool? No. There is another tool called regular expressions, a term familiar to many computer scientists. However, LGG has some advantages over regular expressions. First, LGG is supported by a graphical interface, which makes it user-friendly. For someone like me who does not know much programming, it has proven very helpful. Check out this illustration to see how an LGG looks.

LGG with Output

Second, with an LGG you can produce output. The order is this: you recognize, extract, and then give output. The example above shows an LGG equipped with output. In the example, the output is designed to give a description of every extracted lexical item. You can also get a translation for each item, as long as your dictionary supports it.
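The recognize, extract and output order can be sketched in a few lines of Python. This is an analogy, not Unitex. The tiny lexicon, its descriptions and translations, and the bracketed output format are all invented for the example.

```python
import re

# Hypothetical toy dictionary: word -> (description, English translation).
LEXICON = {
    "kucing": ("noun", "cat"),
    "makan": ("verb", "eat"),
}


def annotate(text: str) -> str:
    """Recognize words, look each one up, and attach output to the known ones."""
    def add_output(match):
        word = match.group(0)
        entry = LEXICON.get(word.lower())
        if entry is None:
            return word  # unknown word: no output is attached
        description, translation = entry
        return f"{word}[{description}: {translation}]"

    return re.sub(r"\w+", add_output, text)


if __name__ == "__main__":
    print(annotate("Kucing makan ikan"))
```

Here "ikan" is left untouched because it is not in the toy lexicon, which mirrors the point above: the output can only be as rich as the dictionary behind it.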
