Bringing Vanitha into the Digital Age
The availability of CHILDES, a large database of transcribed speech between parents and their young children (MacWhinney, 2000), has transformed the
language acquisition literature over the last 20 years by allowing for large-scale,
computerized statistical analyses of parental input and hypothesis
testing based on computational models. However, there is only one small
South Asian language corpus available on CHILDES: the Narasimhan
corpus on the acquisition of Tamil, following the development of the child
Vanitha from 9 months to 33 months of age (Narasimhan, 1981). It was
donated to the CHILDES project in 1985, so there have been no new additions
of data on the acquisition of South Asian languages to the CHILDES
database in 20 years. We believe that this has been both a consequence
and a cause of the relative paucity of literature on the acquisition of South
Asian languages.
The problem is that the Narasimhan corpus is unsuitable in its current
form for machine processing of the kind that has been so fruitful in studies
of English and certain European languages. There are two reasons, the first
being that the corpus is transcribed phonemically. Thus, slightly different
pronunciations of the same Tamil word are often represented differently in
the corpus, making it difficult to calculate even the most basic measures
used in corpus studies of language acquisition. The second reason is that
the transcriptions use non-ASCII characters to represent retroflex consonants.
In combination, these two factors make it impossible to use many
standard software tools (including most text editors and spreadsheets) or
other computational resources (such as electronic Tamil lexicons) to perform
automated morphological, lexical, or syntactic analysis on the transcripts.
We have converted the Narasimhan corpus to a standardized orthographic
transcription using an ASCII transliteration scheme. This allows us
to use the full range of available software tools and lexicographic resources
to perform part-of-speech tagging and morphosyntactic analysis. We will
review the progress we have made, the obstacles overcome, and the remaining
challenges, as well as present some basic data on the corpus, including
lexical frequency and noun/verb co-occurrences. Our goals are to demonstrate
the need for additional machine-readable corpora on the acquisition
of Dravidian languages, to inspire other researchers to join us in collecting
such data and contributing it to CHILDES, and to facilitate their doing so
by guiding them around some pitfalls.
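
As a rough illustration of the kind of processing an ASCII transliteration makes possible, the following Python sketch maps a few dotted retroflex characters to ASCII capitals and then counts word frequencies. The mapping table, the function names to_ascii and lexical_frequency, and the sample strings are all assumptions made for this example; they are not the transliteration scheme or the data of the converted corpus.

```python
from collections import Counter

# Hypothetical mapping from dotted retroflex characters to ASCII capitals.
# This table is illustrative only, not the scheme used for the corpus.
RETROFLEX_TO_ASCII = {
    "\u1e6d": "T",  # t with dot below
    "\u1e0d": "D",  # d with dot below
    "\u1e47": "N",  # n with dot below
    "\u1e37": "L",  # l with dot below
    "\u1e63": "S",  # s with dot below
}

_TABLE = str.maketrans(RETROFLEX_TO_ASCII)


def to_ascii(text):
    """Replace each mapped retroflex character with its ASCII stand-in."""
    return text.translate(_TABLE)


def lexical_frequency(utterances):
    """Count whitespace-separated word forms after transliteration."""
    counts = Counter()
    for utterance in utterances:
        counts.update(to_ascii(utterance).split())
    return counts


if __name__ == "__main__":
    # Toy strings invented for the example, not taken from the corpus.
    sample = ["vi\u1e6du vi\u1e6du", "paa\u1e37"]
    for word, n in lexical_frequency(sample).most_common():
        print(word, n)
```

Once every token is plain ASCII and each word has a single spelling, counts like these can be compared directly against an electronic lexicon or loaded into a spreadsheet.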