The Children's Picture Books Lexicon (CPB-LEX): A large-scale lexical database from children's picture books

This article presents CPB-LEX, a large-scale database of lexical statistics derived from children's picture books (age range 0-8 years). Such a database is essential for research in psychology, education and computational modelling, where rich details on the vocabulary of early print exposure are required. CPB-LEX was built through an innovative method of computationally extracting lexical information from automatic speech-to-text captions and subtitle tracks generated from social media channels dedicated to reading picture books aloud. It consists of 25,585 types (wordforms) and their frequency norms (raw and Zipf-transformed), a lexicon of bigrams (two-word sequences and their transitional probabilities) and a document-term matrix (which records how important each word in the corpus is to each book). Several immediate contributions of CPB-LEX to behavioural science research are reported, including that the new CPB-LEX frequency norms strongly predict age of acquisition and outperform comparable child-input lexical databases. The database allows researchers and practitioners to extract lexical statistics for high-frequency words, which can be used to develop word lists. The paper concludes with an investigation of how CPB-LEX can be used to extend recent modelling research on the lexical diversity children receive from picture books in addition to child-directed speech. Our model shows that the vocabulary input from a relatively small number of picture books can ...
Source: Behavior Research Methods - Category: Psychiatry & Psychology - Source Type: research
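
To make the statistics named in the abstract concrete, the following is a minimal Python sketch of how a Zipf-transformed frequency and a bigram transitional probability could be computed from tokenised book text. It is an illustration under stated assumptions, not the CPB-LEX pipeline: the two toy "books", the whitespace tokenisation and the function names `zipf` and `transitional_probability` are all hypothetical, and the Zipf value uses the plain definition (log10 of frequency per million words, plus 3) without any smoothing the authors may have applied.

```python
import math
from collections import Counter

# Hypothetical toy "books"; these sentences are NOT taken from CPB-LEX.
books = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Assumption: simple whitespace tokenisation of already-lowercased text.
tokens_per_book = [book.split() for book in books]
all_tokens = [tok for toks in tokens_per_book for tok in toks]

counts = Counter(all_tokens)   # raw frequency of each type (wordform)
total = len(all_tokens)        # corpus size in tokens


def zipf(word):
    """Zipf value: log10(frequency per million words) + 3 (unsmoothed)."""
    per_million = counts[word] / total * 1_000_000
    return math.log10(per_million) + 3


# Count bigrams within each book so no pair spans a book boundary.
bigram_counts = Counter()
first_counts = Counter()       # how often each word starts a bigram
for toks in tokens_per_book:
    for w1, w2 in zip(toks, toks[1:]):
        bigram_counts[(w1, w2)] += 1
        first_counts[w1] += 1


def transitional_probability(w1, w2):
    """Forward transitional probability P(w2 | w1); 0.0 if w1 never starts a bigram."""
    if first_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / first_counts[w1]


print(transitional_probability("the", "cat"))  # 1 of the 4 bigrams starting with "the"
print(transitional_probability("sat", "on"))   # "sat" is always followed by "on" here
print(round(zipf("the"), 2))
```

On a corpus this small the Zipf values are inflated, because every word accounts for a large share of the tokens; the scale only becomes interpretable at corpus sizes in the range the abstract describes.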