A language independent approach to multilingual document representation including Arabic
Article written by: Aliane, Hassina; Boucham, Souhila
Abstract: The Arabic language is of increasing interest in the field of Multilingual Information Retrieval (MIR). This work addresses the problem of multilingual document representation for collections that include Arabic. The proposed approach combines a surface analysis with a Latent Semantic Analysis (LSA) algorithm in a new way, breaking the terms used by LSA down into units that correspond more closely to morphemes. These units are variable-length character N-gram candidates extracted from the text fragments delimited by borders. The length of the N-gram candidates is variable because each language has its own properties. This strategy gives interesting performance for languages such as Arabic, in which words are not explicitly defined and different words are not separated by spaces. The results obtained are encouraging, and their variability shows that they can be improved further.
Language:
English
Subject:
Computer science
Keywords:
Multilingual document representation
Virtual document
Principle of border
Concept types
Pivot language
Multilingual information retrieval (MIR)
Variable length character N-grams
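
The abstract above describes extracting variable-length character N-gram candidates from fragments separated by borders and using them as the terms for LSA. The following is only a minimal sketch of that general idea, not the authors' implementation. It assumes that borders can be approximated by whitespace, punctuation and digits, and that candidate lengths range from 2 to 5 characters; the function names (`fragments`, `char_ngrams`, `ngram_counts`, `term_document_matrix`) and the parameters `min_n` and `max_n` are hypothetical and do not come from the article. The LSA step itself (a truncated singular value decomposition of the resulting matrix) is not shown.

```python
import re
from collections import Counter

# Assumption: a "border" is any run of non-word characters, digits or
# underscores. The article's actual border definition is not given here.
BORDER = re.compile(r"[\W\d_]+")


def fragments(text):
    """Split text into non-empty fragments at border characters."""
    return [f for f in BORDER.split(text.lower()) if f]


def char_ngrams(fragment, min_n=2, max_n=5):
    """Yield every character n-gram of length min_n..max_n in one fragment."""
    for n in range(min_n, max_n + 1):
        for i in range(len(fragment) - n + 1):
            yield fragment[i:i + n]


def ngram_counts(text, min_n=2, max_n=5):
    """Count n-gram candidates over all fragments of a document."""
    counts = Counter()
    for frag in fragments(text):
        counts.update(char_ngrams(frag, min_n, max_n))
    return counts


def term_document_matrix(docs, min_n=2, max_n=5):
    """Build a count matrix with one row per n-gram and one column per document."""
    per_doc = [ngram_counts(d, min_n, max_n) for d in docs]
    vocab = sorted(set().union(*per_doc))
    matrix = [[c[t] for c in per_doc] for t in vocab]
    return vocab, matrix


if __name__ == "__main__":
    # Documents in different languages share one n-gram vocabulary.
    docs = ["the book of the boy", "كتاب الولد"]
    vocab, matrix = term_document_matrix(docs, min_n=2, max_n=3)
    print(len(vocab), "n-gram candidates for", len(docs), "documents")
```

Because the n-grams never cross a border, the same procedure applies to any script without a language-specific tokenizer or stemmer, which is the language-independent property the abstract emphasizes. In a full pipeline, the count matrix would typically be weighted and then reduced by a truncated SVD to obtain the LSA representation.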