INVITED SPEAKERS

Bringing Vanitha into the Digital Age


The availability of CHILDES, a large database of transcribed speech between parents and their young children (MacWhinney, 2000), has transformed the language acquisition literature over the last 20 years by allowing for large-scale, computerized statistical analyses of parental input and hypothesis testing based on computational models. However, there is only one small South Asian language corpus available on CHILDES: the Narasimhan corpus on the acquisition of Tamil, which follows the development of the child Vanitha from 9 months to 33 months of age (Narasimhan, 1981). It was donated to the CHILDES project in 1985, so there have been no new additions of data on the acquisition of South Asian languages to the CHILDES database in 20 years. We believe that this has been both a consequence and a cause of the relative paucity of literature on the acquisition of South Asian languages.

The problem is that the Narasimhan corpus is unsuitable in its current form for machine processing of the kind that has been so fruitful in studies of English and certain European languages. There are two reasons. The first is that the corpus is transcribed phonemically, so slightly different pronunciations of the same Tamil word are often represented differently in the corpus, making it difficult to calculate even the most basic measures used in corpus studies of language acquisition. The second reason is that the transcriptions use non-ASCII characters to represent retroflex consonants.

In combination, these two factors make it impossible to use many standard software tools (including most text editors and spreadsheets) or other computational resources (such as electronic Tamil lexicons) to perform automated morphological, lexical, or syntactic analysis on the transcripts. We have converted the Narasimhan corpus to a standardized orthographic transcription using an ASCII transliteration scheme. This allows us to use the full range of available software tools and lexicographic resources to perform part-of-speech tagging and morphosyntactic analysis. We will review the progress we have made, the obstacles we have overcome, and the remaining challenges, and we will present some basic data on the corpus, including lexical frequency and noun/verb co-occurrences. Our goals are to demonstrate the need for additional machine-readable corpora on the acquisition of Dravidian languages, to inspire other researchers to join us in collecting such data and contributing it to CHILDES, and to facilitate their doing so by guiding them around some pitfalls.
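
As a rough illustration of why the conversion matters, the following minimal Python sketch maps retroflex symbols to ASCII letters, collapses pronunciation variants onto one orthographic form, and then counts lexical frequency. The symbol mapping, the variant table, and the sample tokens are all hypothetical and chosen only for illustration; the abstract does not specify the transliteration scheme or the tools actually used for the corpus.

```python
from collections import Counter

# Hypothetical mapping from retroflex symbols to ASCII letters.
# The scheme actually used for the corpus is not given in the abstract.
RETROFLEX_TO_ASCII = {
    "ʈ": "T",
    "ɖ": "D",
    "ɳ": "N",
    "ɭ": "L",
}

# Hypothetical table that maps a transliterated pronunciation variant
# to a single standardized orthographic form.
VARIANT_TO_STANDARD = {
    "veeNu": "veeNum",
}


def to_ascii(token):
    """Replace each retroflex symbol with its assumed ASCII equivalent."""
    return "".join(RETROFLEX_TO_ASCII.get(ch, ch) for ch in token)


def normalize(token):
    """Transliterate a token, then collapse known variants to one form."""
    ascii_token = to_ascii(token)
    return VARIANT_TO_STANDARD.get(ascii_token, ascii_token)


def lexical_frequency(utterances):
    """Count normalized word forms across whitespace-separated utterances."""
    counts = Counter()
    for utterance in utterances:
        counts.update(normalize(token) for token in utterance.split())
    return counts


# Illustrative tokens only; these are not taken from the corpus.
sample = ["veeɳum", "veeɳu viiʈu"]
counts = lexical_frequency(sample)
```

Without the normalization step, the two pronunciations in the sample would be counted as separate word types; with it, they contribute to a single entry, and every key in the resulting table is plain ASCII.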