This paper seeks to advance the Yorùbá language computationally by designing a rule-based tagger for it. The work adopts Standard Theory and Principles and Parameters Theory to segment the syntactic structure of the language and to encode it for the computer in Prolog. Some hundred Yorùbá words are coded to serve as the lexicon or dictionary, and a number of syntactic rules are programmed over these words. The work tags the parts of speech in non-derivative Yorùbá sentences. It reveals that not all Yorùbá NPs can complement prepositions in prepositional phrases (PPs). It also shows that there is a need to reclassify Yorùbá words so that machines such as computers can generate grammatically acceptable Yorùbá sentences.
Key words: Parts of Speech Tagging, Rule-based Parts of Speech Tagging, and Yorùbá Language
Part-of-speech tagging (POS tagging) plays a vital role in different fields of natural language processing (NLP), including machine translation (Rabbi et al. 2009:1). It is defined as the process of assigning to each word in a running text a label that indicates the status of that word within some system for categorizing the words of the language according to their morphological and/or syntactic properties (the set of labels is frequently known as a “tagset”). In other words, POS tagging assigns a morpho-syntactic category to each morpheme, including punctuation marks, in a given text according to its context. It is a significant prerequisite for putting a human language on the engineering track. A tagset is required before a part-of-speech tagger can be developed. Earlier, POS tagging was done manually with much effort, but automatic POS tagging is now becoming more common.
A part-of-speech (POS) tagger is one of the important components in the development of any serious NLP application today. A POS tagger takes a sentence as input and assigns a unique part-of-speech tag to each lexical item of the sentence. POS tagging is used as an early stage of linguistic text analysis in many applications, including subcategory acquisition, text-to-speech synthesis, and alignment of parallel corpora. A POS tagger can also be used in other areas of NLP such as semantic analysis, information retrieval, shallow parsing, information extraction and machine translation (Abney 1996:3).
POS taggers are either rule-based or statistical (stochastic). Rule-based part-of-speech tagging uses handwritten rules for tagging. It depends on a dictionary or lexicon to obtain the possible tags for each word to be tagged. Handwritten rules are then used to identify the correct tag when a word has more than one possible tag. Disambiguation is done by analyzing the linguistic features of the word, its preceding word, its following word and other aspects (tagset).
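The lexicon-plus-rules procedure described above can be sketched in a few lines. This is a minimal illustration in Python, not the paper's Prolog implementation. The six-word lexicon, the tag names (N, V, P, UNK) and the single disambiguation rule for the word ní (taken here as either the verb "have" or the preposition "at/in") are assumptions made for the example and are not drawn from the paper's lexicon or tagset.

```python
import unicodedata


def nfc(word):
    """Normalize to NFC so that tone-marked letters compare reliably."""
    return unicodedata.normalize("NFC", word)


# Hypothetical toy lexicon: each word maps to every tag it may carry.
LEXICON = {
    nfc("Olú"): ["N"],
    nfc("ìwé"): ["N"],
    nfc("ilé"): ["N"],
    nfc("wà"): ["V"],
    nfc("ra"): ["V"],
    nfc("ní"): ["V", "P"],  # ambiguous: verb "have" or preposition "at/in"
}


def disambiguate(candidates, previous_tags):
    """Hand-written rule (an assumption for this sketch): a word that may be
    V or P is tagged P once a verb has already occurred, otherwise V."""
    if set(candidates) == {"V", "P"}:
        return "P" if "V" in previous_tags else "V"
    return candidates[0]


def tag_sentence(sentence):
    """Look each word up in the lexicon, then resolve ambiguity by rule."""
    tagged = []
    for word in sentence.split():
        candidates = LEXICON.get(nfc(word), ["UNK"])
        if len(candidates) == 1:
            tag = candidates[0]
        else:
            tag = disambiguate(candidates, [t for _, t in tagged])
        tagged.append((word, tag))
    return tagged


if __name__ == "__main__":
    print(tag_sentence("Olú ní ìwé"))
    print(tag_sentence("Olú wà ní ilé"))
```

In this sketch the lexicon supplies the candidate tags and the rule inspects only the tags to the left of the ambiguous word. A fuller rule set would also look at the following word, as the description above indicates.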
The stochastic model of part-of-speech tagging, on the other hand, uses probabilities to assign possible tags to the words in a running text. This type of tagger requires little hand-coding of rules. Stochastic part-of-speech tagging can be divided into supervised and unsupervised approaches. There has been a dramatic increase in the application of stochastic models to NLP over the last few years. The appeal of stochastic techniques over traditional rule-based techniques comes from the ease with which the necessary statistics can be acquired automatically and from the fact that very little handcrafted knowledge needs to be built into the system. In contrast, the rules in rule-based systems are usually difficult to construct and typically not very robust.
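To make the contrast concrete, the simplest supervised stochastic tagger can be sketched as a most-frequent-tag baseline: the statistics are counted from a hand-tagged corpus and no rules are written. The sketch is in Python for illustration only. The three-sentence corpus, the tag names and the function names are hypothetical and are not taken from the paper or from any cited work.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged toy corpus (word, tag); illustrative only.
TOY_CORPUS = [
    [("Olú", "N"), ("ní", "V"), ("ìwé", "N")],
    [("Olú", "N"), ("wà", "V"), ("ní", "P"), ("ilé", "N")],
    [("Olú", "N"), ("ra", "V"), ("ìwé", "N"), ("ní", "P"), ("ilé", "N")],
]


def train_unigram_tagger(tagged_sentences):
    """Count how often each word carries each tag and keep the commonest."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


def tag(model, words, default="UNK"):
    """Assign each word its most frequent training tag, ignoring context."""
    return [(w, model.get(w, default)) for w in words]


MODEL = train_unigram_tagger(TOY_CORPUS)

if __name__ == "__main__":
    first_sentence = [w for w, _ in TOY_CORPUS[0]]
    print(tag(MODEL, first_sentence))
```

Because this baseline ignores context, it always gives ní the tag it carries most often in the toy corpus (P), even in the first training sentence, where it was tagged V. Context-sensitive stochastic models, or hand-written disambiguation rules, address exactly this kind of error.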
Part-of-speech tagging is one of the branches of computational linguistics (CL), a discipline that spans linguistics and computer science. As the name suggests, CL comprises two related areas of research: natural language (NL) is studied, and operational methods are developed. Both fields are investigated in their own right and divided into various topics. This area of linguistic specialization introduces a variety of NL phenomena together with appropriate implementations in a programming language (here, Prolog). The topics dealt with in CL include morphology, finite-state techniques, syntax, context-free grammars, parsing, and semantic construction (Striegnitz 2003:1). CL is also concerned with computational models of human cognition (Uszkoreit 2000:1 in Odoje 2010:1). By this, Uszkoreit means that CL sets out to let computers capture native speakers’ intuitive, tacit knowledge of their language, so that a person can hold a dialogue with a computer system. Native speakers’ intuitive knowledge involves the generation of sentences and grammaticality judgments on uttered sentences. The task is overwhelming and interdisciplinary, requiring expertise in artificial intelligence, cognitive psychology, mathematics, logic and linguistics, among others.
The Yorùbá Language
The Yorùbá language belongs to the West Benue-Congo branch of the Niger-Congo phylum of African languages (Williamson/Blench 2000:31). Fabunmi and Salawu (2005:392) observe that the majority of its speakers reside in the southwestern part of Nigeria, with a population of about sixty million. Yorùbá is regarded as one of the three major languages of Nigeria. Any language spoken by more than a handful of people, as Yorùbá is, tends to split into dialects that may differ from one another. The diverse varieties of Yorùbá used by groups smaller than the total community of speakers within the geographical area are referred to as dialects of the same language. Apart from Nigeria, the language is also spoken in countries such as the Republic of Bénin, Togo, Ghana, Côte d'Ivoire, Sudan and Sierra Leone. Outside Africa, a great number of speakers of the language are in Brazil, Cuba, Haiti, the Caribbean Islands, Trinidad and Tobago, the UK and America (Fabunmi and Salawu 2005:392).
There have been some efforts at computerizing the Yorùbá language. These efforts are traced to the 1990s, although this does not mean that there were no attempts before then. According to Adegbola (2009:56) in Odòjé (2010:29), the idea of text-to-speech synthesis started in 1985. The efforts are this recent because awareness of the importance of such activity has only just begun. Moreover, there is little or no support from the Nigerian government for the development of CL in Nigeria. Efforts have come from individuals, some universities and private organizations. Apart from Odejobi and some others in the universities, other work on computerizing Yorùbá has been done by private organizations such as the African Languages Technology Initiative (henceforth alt-i) and Ogun Radio, 90.5 in Abeokuta. The more prominent of these two organizations is alt-i, because the Abeokuta radio station is not well known, although to our understanding it exists on the Internet. The activities of alt-i include automatic speech recognition of Yorùbá, Yorùbá spelling checkers, automatic diacritic application for Yorùbá, localization of Microsoft operating systems and office suite, and machine translation (Òdòjé 2010:27). Other works developed to computerize the Yorùbá language include Odòjé’s (2010) Machine Translation, Toye’s (2008) Interlinear Machine Translation, and Aládéṣọ̀tẹ̀ et al.’s (2011) A Computational Model of Yorùbá Morphology Lexical Analyzer.
Every computerized morpho-syntactic work on the language can tag, for instance Odòjé’s (2010) Machine Translation, Toye’s (2008) Interlinear Machine Translation and Aládéṣọ̀tẹ̀ et al.’s (2011) A Computational Model of Yorùbá Morphology Lexical Analyzer. Although these works can tag the language’s parts of speech, they were not designed for that purpose. The only work purposely designed to tag Yorùbá parts of speech is Adedjouma et al.’s (2013) Part-of-Speech Tagging of Standard Yorùbá, Language of Niger-Congo Family.
Good as Adedjouma et al.’s (2013) ‘Part-of-Speech Tagging of Standard Yorùbá, Language of Niger-Congo Family’ may be, at least as a pioneer in tagging the language’s parts of speech, it has some Yorùbá linguistic anomalies and limitations. A careful and critical examination of the work reveals:
i. That it takes the language to have thirty (30) letters of the alphabet: eighteen (18) consonants and twelve (12) vowels.
ii. That it takes the language to have three (3) tones.
iii. That the work used the Bible.
iv. That the work did not show the tagged items.
v. That the work was stochastic based.
Firstly, it may be orthographically misleading for some alien Yorùbá and second-language learners to adhere strictly to this work, because every Yorùbá speaker understands that there are twenty-five (25) letters of the alphabet in the language: eighteen (18) consonants and seven (7) vowels (Bámgbóṣé 2010:16). Phonetically, the language is made up of thirty (30) sounds: eighteen (18) consonants and twelve (12) vowels. The vowels are further divided into oral (7) and nasal (5) vowels. However, the work presents [n] as a nasal vowel in the language, which may be quite misleading.
Secondly, Oyèbádé (2007:236) affirms that Yorùbá attests three tonemes but five tones. He argues further that the two extra tones are phonetic expressions of one or another of the tonemes, which can easily be identified through the minimal pair test. Yorùbá has three tonemes: high (´), low (`), and mid (¯, usually unmarked in orthography). This is exemplified in the minimal pairs below:
[Figure not included in this excerpt]
In the pair of words in (a), the two words are identical except for the pitch on the first vowel; this shows that mid and low tones are phonemic. In the second pair (b), the two words are identical except for the tone on the last vowel, high and mid respectively. So high and mid tones are also phonemic in the language. Taken together, these observations show that Yorùbá has three tonemes: high, mid and low.
The high tone has a variant called the ‘rising tone’. This variant appears when the high tone follows a low tone. For instance, the verb ‘to sound’, ró, bears a high tone.
When it is nominalised, a low-toned prefix is added to the verb and the tone of the verb becomes a rising tone: ì-rǒ ‘a sound’. The gliding (rising) tone may also be derived when a vowel is elided but its tone is transferred to a surviving neighboring vowel, as in the derivation of the word ‘policeman’, with the following schematized steps:
[Figure not included in this excerpt]
The other gliding tone is the falling tone, a manifestation of the low tone when it occurs after a high tone, as in pákò [pákô] ‘chewing stick’ and kúrò [kúrô] ‘leave’. It is worth noting, however, that although the argument here is phonetic, phonetic explanation can never be discarded in linguistic analysis.
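The two gliding tones described above can be stated as a pair of context rules over a sequence of tonemes. The following Python sketch is only an illustration of that statement. The labels H, M, L, R (rising) and F (falling), the function name, and the choice to compare each toneme with the underlying (not the surface) tone of the preceding syllable are assumptions made for the example. The sketch does not model the vowel-elision source of the rising tone.

```python
def surface_tones(tonemes):
    """Map a sequence of tonemes ('H', 'M', 'L') to surface tones.

    Rules taken from the description above:
      - a high tone after a low tone surfaces as rising ('R');
      - a low tone after a high tone surfaces as falling ('F').
    """
    surface = []
    previous = None
    for toneme in tonemes:
        if toneme == "H" and previous == "L":
            surface.append("R")
        elif toneme == "L" and previous == "H":
            surface.append("F")
        else:
            surface.append(toneme)
        # Assumption: the context is the underlying toneme, not its surface form.
        previous = toneme
    return surface


if __name__ == "__main__":
    print(surface_tones(["L", "H"]))  # nominalised form: low prefix + high verb
    print(surface_tones(["H", "L"]))  # high-low words such as pákò, kúrò
```

Under these rules the five surface tones reduce to the three tonemes, since R and F appear only in the contexts where H and L, respectively, would otherwise occur.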