Indonesian WordNet. This project is developing an initial version of an Indonesian WordNet. We have built a web application that collects user annotations mapping English semantic concepts from Princeton WordNet to Indonesian semantic concepts from the Kamus Besar Bahasa Indonesia (KBBI). We have also experimented with establishing these mappings automatically using Latent Semantic Analysis. (Riset Unggulan Universitas Indonesia - 2007)
Finite-state morphological analyser. In collaboration with the University of Sydney, we are developing a linguistically sophisticated morphological parser based on two-level morphology. Built with the Xerox finite-state toolkit, the parser will transduce between surface forms and Indonesian stems annotated with rich syntactic and semantic information. We are also developing a model of Indonesian reduplication, a form of non-concatenative morphology, using the compile-replace technique.
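To make the surface-to-lexical mapping concrete, the following is a minimal Python sketch of full reduplication only (e.g. buku-buku from the stem buku). It is not the compile-replace technique itself and does not use the Xerox toolkit; it uses a regular-expression backreference as a stand-in. The tag name +Redup and the function names analyse and generate are hypothetical, since the project's actual tag set is not given here.

```python
import re

# Hypothetical tag for full reduplication; the real tag set is an assumption.
REDUP_TAG = "+Redup"

# A stem, a hyphen, then an exact copy of the same stem.
_FULL_REDUP = re.compile(r"^(\w+)-\1$")


def analyse(surface: str) -> str:
    """Map a surface form to a lexical form (stem plus tag, if reduplicated)."""
    form = surface.lower()
    match = _FULL_REDUP.match(form)
    if match:
        return match.group(1) + REDUP_TAG
    # Anything else is returned unchanged (lowercased); affixation is not handled.
    return form


def generate(lexical: str) -> str:
    """Map a lexical form back to its surface form."""
    if lexical.endswith(REDUP_TAG):
        stem = lexical[: -len(REDUP_TAG)]
        return f"{stem}-{stem}"
    return lexical
```

With these definitions, analyse("buku-buku") returns "buku+Redup" and generate("buku+Redup") returns "buku-buku". Forms whose two halves differ, such as sayur-mayur, do not match the pattern and pass through unchanged; partial and affixed reduplication would need a real transducer.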
Corpus repository website. We are designing and implementing a website that will serve as a central corpus repository for various Indonesian resources. The corpus collection will be designed to comply with existing standards, e.g. OLAC and TEI, and to enable rich annotation of multimedia data. This is a collaboration with the University of Sydney.
Speech recognition. We are experimenting with the development of large-scale continuous Indonesian speech recognition using open-source systems such as Sphinx and Julius, with a particular emphasis on improving Indonesian-specific language modelling to increase accuracy.
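As an illustration of the kind of statistical language model such recognisers typically consume, here is a small Python sketch of a bigram model with add-one (Laplace) smoothing. The choice of bigrams, the smoothing method, the class name BigramModel and the toy sentences are all assumptions made for illustration; the project's actual modelling approach is not described here.

```python
import math
from collections import Counter


class BigramModel:
    """Bigram language model with add-one smoothing (illustrative only)."""

    def __init__(self, sentences):
        self.context_counts = Counter()  # how often each token precedes another
        self.bigram_counts = Counter()   # how often each (prev, word) pair occurs
        self.vocab = {"</s>"}            # every token that can follow a context
        for sentence in sentences:
            words = sentence.lower().split()
            tokens = ["<s>"] + words + ["</s>"]
            self.vocab.update(words)
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))

    def prob(self, prev, word):
        """P(word | prev) with add-one smoothing over the training vocabulary."""
        vocab_size = len(self.vocab)
        return (self.bigram_counts[(prev, word)] + 1) / (
            self.context_counts[prev] + vocab_size
        )

    def log_prob(self, sentence):
        """Natural-log probability of a whitespace-tokenised sentence."""
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        return sum(math.log(self.prob(p, w)) for p, w in zip(tokens, tokens[1:]))


# Toy training data, invented for this example.
model = BigramModel(["saya makan nasi", "saya minum teh"])
```

Under this sketch, model.log_prob("saya makan nasi") is higher than model.log_prob("nasi makan saya"), because the first word order was seen in training. Out-of-vocabulary words still receive a small non-zero probability, but the distribution is only normalised over the training vocabulary; a production model would add an explicit unknown-word token.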
Indonesian treebank. Work will shortly commence on the development of an Indonesian treebank, i.e. a parsed corpus, which will be aligned to the Penn Treebank. Such a treebank should be an invaluable resource for, among other things, statistical machine translation and probabilistic language modelling.
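For readers unfamiliar with parsed corpora, the sketch below reads one tree written in Penn Treebank-style bracketed notation and recovers its words. It assumes the Indonesian treebank would adopt that notation, which is not stated here; the example sentence and its labels are invented, and the function names parse_brackets and leaves are hypothetical.

```python
def parse_brackets(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())  # nested constituent
            else:
                children.append(tokens[pos])  # a word at a leaf
                pos += 1
        pos += 1  # consume the closing ')'
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the tree")
    return tree


def leaves(tree):
    """Return the words of a tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


# Invented example; the labels are Penn-style but not an agreed Indonesian tag set.
example = "(S (NP (PRP Saya)) (VP (VBP makan) (NP (NN nasi))))"
tree = parse_brackets(example)
```

Here leaves(tree) returns ["Saya", "makan", "nasi"]. The parser expects balanced brackets and does no recovery on malformed input.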
- Electronic version of Kamus Besar Bahasa Indonesia.
- Indonesian spelling checker for Lotus SmartSuite software.
- Modelling of Indonesian syntax using feature-structure unification formalisms.
- Development of various cross-language information retrieval applications for Indonesian.