Friday, December 9, 2011

Transducer for Auto-Convert

Local Grammar Graphs might be applied to support Computer Assisted Language Learning. Here, I have composed a transducer (a local grammar that can give output) to convert archaic English words to Modern English. We know that literature and religious texts are two major genres where archaic English is often preserved. Consider this excerpt from Sir Walter Scott's Ivanhoe:

-----He exclaimed in a lower tone "Couldst thou have ruled thine unreasonable passion, thy father ....----

We can see that there are some archaic words scattered through the text. The conventional way to deal with them is to consult dictionaries, especially those containing archaic entries. But should you look each word up in your dictionary, or type it in if your dictionary is electronic? Can we automate this process? The answer is YES. By applying the transducer to this kind of text, which must be machine readable, we can obtain the equivalents in present-day English (or, we can say, Modern English). The result will look like this:

-----He exclaimed in a lower tone "Couldst [could] thou [you] have ruled thine unreasonable passion, thy [your] father ....----


The transducer I designed locates the archaic words by consulting a lexical resource (a machine readable dictionary that I had previously constructed) containing archaic entries and their equivalents in present-day English, and it automatically assigns the present-day English form to each word.
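As an aside, here is a minimal sketch in Python of the same lookup-and-annotate idea, just to make the mechanics concrete. The word list and the bracketed output format are toy assumptions of mine; the actual conversion is done by the transducer and the machine readable dictionary described above.

    import re

    # Toy archaic-to-modern lexicon; entries are illustrative only.
    ARCHAIC = {
        "couldst": "could",
        "thou": "you",
        "thine": "your",
        "thy": "your",
    }

    def annotate(text):
        # After each archaic word, insert its modern equivalent in brackets.
        def repl(match):
            word = match.group(0)
            modern = ARCHAIC.get(word.lower())
            return word + " [" + modern + "]" if modern else word
        return re.sub(r"[A-Za-z]+", repl, text)

    print(annotate("Couldst thou have ruled thine unreasonable passion, thy father"))
    # Couldst [could] thou [you] have ruled thine [your] unreasonable passion, thy [your] father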


Sunday, February 6, 2011

Auto-Affixing Model for Indonesian with Unitex

This post deals with a method for performing automatic affixing tasks in Indonesian with Unitex. There are two types of dictionary in Unitex: the simple word dictionary and the inflected word dictionary. The inflected dictionary is derived from the simple one. The words in the simple dictionary may be canonical words or compounds, but they are not inflected. This dictionary is then inflected by an inflectional LGG, which carries the affixes and the inflection rules and in turn produces the inflected dictionary.
In Indonesian, there are three kinds of affixes: prefixes, suffixes and infixes. Well, some might add the 'circumfix', which is a combination of a prefix and a suffix. The patterns are mostly predictable, but of course there are some exceptions (or many^^). However, we are not going to discuss the whole phenomenon here. To focus the discussion, the prefix meN- is highlighted in this post.

In order to describe the automatic affixing method, some linguistic background is presented before we proceed to the processing part. The prefix meN- marks verbs in the active form, but it can be dropped in some contexts. This prefix can also be used to mark a formal situation.
N stands for a nasal sound. meN- is the underlying form, but it takes various surface forms such as meng-, mem-, men- and me-. Consider the following examples.
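Here is a minimal sketch in Python of how the underlying meN- might surface before different initial sounds. The rule table and the sample stems are my own simplified assumptions (the meny- allomorph and other exceptions are left out); in Unitex the real work is done by the inflectional LGG, not by code like this.

    # Simplified sketch of how the underlying meN- surfaces (my own rule
    # table, not the inflectional LGG itself; meny- is left out).
    def attach_men(stem):
        first = stem[0].lower()
        if first in "aeiough":
            return "meng" + stem        # vowels, g, h: goreng -> menggoreng
        if first == "k":
            return "meng" + stem[1:]    # k is dropped: kirim -> mengirim
        if first in "bf":
            return "mem" + stem         # baca -> membaca
        if first == "p":
            return "mem" + stem[1:]     # p is dropped: pukul -> memukul
        if first in "dcjz":
            return "men" + stem         # dengar -> mendengar
        if first == "t":
            return "men" + stem[1:]     # t is dropped: tulis -> menulis
        return "me" + stem              # l, m, n, r, w, y: lihat -> melihat

    for stem in ["baca", "tulis", "kirim", "goreng", "lihat"]:
        print(stem, "->", attach_men(stem))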

Saturday, February 5, 2011

Position Swap (Transformational Grammar)

a."there is a cat," said the man.
b. the man said "there is a cat".

In the direct speech above, the positions of the agent, the reporting verb and the speech are swapped. We may refer to this as a transformation. Basically, the strings are composed of three variables. The first variable is the agent. The agent is the one uttering the speech, and it is usually human; syntactically, it is constituted by an NP. The second variable is the reporting verb. Reporting verbs are verbs used in reported speech, such as say, reply, shout etc. The third variable is the speech uttered by the agent. In direct speech it is marked by quotes ("speech"). The speech can be just one word, a phrase or a clause.

First variable (V0): agent

Second variable (V1): Reporting verb

Third variable (V2): speech <"speech">

The aim is to swap the order V2+V1+V0 into V0+V1+V2. In this way, sentence (a) is changed into sentence (b). There are several steps to do this, and we are going to do it with an LGG.


First, an LGG must be designed to recognize V2+V1+V0.

Second, the positions of the variables must be swapped. To do this, we must set up a perimeter around each variable, called a variable function, which is marked by red brackets in the graph. How do we set up a variable?

$ + variable name + opening bracket, then the box(es) filled with the lexicon to be captured, then $ + the same variable name + closing bracket. In other words, the captured material sits between a box containing $var0( and a box containing $var0).

LGG for Direct Speech Transformation


Speech is marked by quotes (opening and closing). However, in Unitex a quote must be escaped by a backslash, so in the box it is written as \". Inside the quotes there is the speech sequence. This sequence starts with an uppercase letter just after the opening quote; a token starting with an uppercase letter is written as <PRE>. Speech is not limited to one word; it can be a phrase or a clause. Since the sequence is of unlimited length, we put a loop on <TOKEN>, meaning a sequence of tokens. The sequence is ended by a comma and a quotation mark.

Unfortunately there is no dictionary set for reporting verbs. However, a verb is most likely to be a reporting verb when it appears right after the speech. Therefore, I just put <V>, indicating any verb. The agent slot is really open to development; my LGG there captures only simple NP patterns.

Third (after the long explanation, here is the climax), we must swap the order. In the graph, var0 is set up on the speech, excluding the quotation marks. Why? Because otherwise it would also capture the comma inside the speech; so var0 runs from the beginning of the speech to right before the comma. var1 is set on the reporting verb and var2 on the agent (note that the graph variables are numbered in the order they are captured in sentence (a), so var0 holds the speech rather than the agent of the V0 label above). The following output is then given so that the positions are swapped: $var2$ $var1$ "$var0$".
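For readers who think in regular expressions, here is a minimal sketch in Python of the same swap. It is only a rough stand-in for the LGG: the reporting-verb list and the agent pattern are hard-coded assumptions (the graph uses <V> and dictionary lookup instead), but the named groups play the same role as $var0$, $var1$ and $var2$.

    import re

    # Rough regex stand-in for the LGG. The reporting verbs and the agent
    # pattern are toy assumptions; the uppercase start mirrors <PRE>.
    PATTERN = re.compile(
        r'"(?P<var0>[A-Z][^"]*?),"\s+'        # speech, without quotes or comma
        r'(?P<var1>said|replied|shouted)\s+'  # reporting verb (toy list, not <V>)
        r'(?P<var2>the \w+)\.'                # agent (toy NP)
    )

    def swap(sentence):
        # Reorder to: agent + reporting verb + "speech".
        return PATTERN.sub(r'\g<var2> \g<var1> "\g<var0>".', sentence)

    print(swap('"There is a cat," said the man.'))
    # the man said "There is a cat".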

From the concordance produced under the LGG, you can see that it successfully turns sentences composed of direct speech + reporting verb + agent (e.g. "there is a cat," said the man) into agent + reporting verb + direct speech (e.g. the man said "there is a cat").

Friday, February 4, 2011

Local Grammar Graph

This post deals with some aspects of the local grammar graph (LGG), a sequence recognizer and extraction tool. Local grammar was developed by Maurice Gross to capture and describe linguistic phenomena.
Some Advantages of LGG over Regular Expressions

Question: is it the only tool? There is another tool called the regular expression, a term many computer scientists are familiar with. However, LGG has some advantages over regular expressions. First, LGG is supported by a graphical interface. It is user friendly; for someone like me who does not know much programming, it has proven to be very helpful. Check out the illustration to see how an LGG looks.

LGG with Output


Second, with an LGG you can give output. So here is the order: you recognize, you extract, and after that you give output. The example above shows an LGG that is equipped with output. In that example, the output is designed to give a description of every extracted lexical item. You can also output a translation for each item, as long as your dictionary supports it.
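To make the recognize-extract-output order concrete, here is a minimal sketch in Python that attaches a description to each recognized item. The lexicon and the slash notation are toy assumptions of mine, not the LGG shown in the figure.

    import re

    # Minimal recognize -> extract -> give-output pipeline with a toy
    # description lexicon (my own example, not the LGG in the figure).
    DESCRIPTIONS = {"Jakarta": "city", "Java": "island"}

    def describe(text):
        pattern = re.compile(r"\b(" + "|".join(DESCRIPTIONS) + r")\b")
        # Attach the description to every recognized item.
        return pattern.sub(lambda m: m.group(1) + "/" + DESCRIPTIONS[m.group(1)], text)

    print(describe("Jakarta lies on the island of Java"))
    # Jakarta/city lies on the island of Java/island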

Machine Readable Dictionary

The accuracy of corpus-based extraction relies heavily on the MRD. Without it, extraction is based only on the surface form (orthography), without linguistic analysis. But what is a dictionary? A dictionary is a collection of entries supported by some linguistic information, such as part of speech, a conceptual definition, or an equivalent in the paired language (for a bilingual dictionary). There are several types of dictionary; let me classify them into two: the published dictionary and the MRD for NLP.
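As a toy picture of what such entries can look like, here is a small sketch in Python. The entries, the Indonesian equivalents and the comma/dot layout (loosely echoing Unitex's DELA dictionaries) are my own illustrative assumptions.

    # Toy MRD entries: each entry carries an inflected form, a lemma, a part
    # of speech and a bilingual equivalent. All values are invented for
    # illustration; the layout loosely echoes Unitex's DELA dictionaries.
    ENTRIES = [
        {"form": "runs",  "lemma": "run",   "pos": "V", "id_equiv": "berlari"},
        {"form": "house", "lemma": "house", "pos": "N", "id_equiv": "rumah"},
    ]

    for e in ENTRIES:
        print(f'{e["form"]},{e["lemma"]}.{e["pos"]}  ->  {e["id_equiv"]}')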

(a) Published Dictionary


We can find published dictionaries easily in the bookstores around us. Some of them are bilingual, some are monolingual. Some also come with a CD-ROM on which the dictionary database is stored. The database may also be stored on standalone portables such as a pocket electronic dictionary, an iPhone or a cellphone (are they the same?^^). Although the dictionaries I mentioned are machine readable, they cannot be used to support NLP tasks.


(b) Machine Readable Dictionary for NLP

Corpus

A corpus can simply be defined as a collection of linguistic data, whether spoken or written. However, can all linguistic data be called a 'corpus'? Although there are several definitions, there is a consensus among linguists that a corpus should ideally be (1) machine readable, (2) authentic, and (3) representative enough to be a sample of a language variety. By fulfilling those three requirements, a collection of linguistic data can be called a corpus.
Among these requirements, the second and third points are quite familiar to linguists. Unlike those two, the first term, 'machine readable', is more familiar to computer scientists. For this reason, this posting highlights that term. Another essential focus is the preprocessing stage of a corpus.

Machine Readability

Why 'machine readable'? The term 'readable' is closely related to machine recognition. 'Machine' can refer to a standalone tool such as an iPad or a portable text-reader, but it mostly refers to the computer. Computer recognition is essential for tagging, extraction and other NLP tasks; they cannot be executed unless the linguistic data is readable. Therefore, linguistic data that is not recognized by the computer must be converted into a readable format. For example, handwriting must be scanned first and later converted into a computer-recognizable format, say PDF or Word format.

There you go, you have your linguistic data. But 'machine readable' means more than this: your linguistic data must also be able to be processed in order to perform NLP tasks such as pattern matching or extraction.