The Oxford Handbook of Corpus Phonology


Edited by Jacques Durand, Ulrike Gut & Gjert Kristoffersen


Oxford Handbooks in Linguistics Series

Oxford: Oxford University Press, 2014

Hardcover. xvi + 662 pages. ISBN 978-0199571932. £95


Reviewed by Anne Przewozny-Desriaux

Université Jean-Jaurès, Toulouse 2



The Oxford Handbook of Corpus Phonology is a welcome 21st addition to the Oxford Handbooks in Linguistics series. This collection of 32 contributions by 47 renowned linguists aims to be the first comprehensive examination of corpus phonology, defined as the use of purpose-built corpora “for studying speakers’ and listeners’ knowledge and use of the sound system of their native language(s), the laws underlying such sound systems, and the acquisition of these systems in first and second language learning” [13].

Over the last decades, as phonological corpora in a variety of languages have developed and gained descriptive and technical strength, several crucial methodological issues have been dealt with on the fly, so as to fit specific structuring and scientific requirements. Corpus phonology developed from the need for phonology and phonetics to interact with other disciplines in the social, cognitive and biological sciences. As is thoroughly illustrated in the book, this new field benefits from theoretical advances and methodological inputs from a variety of research fields such as diachronic and synchronic phonology, phonetics, corpus linguistics, psycholinguistics, speech and information technologies, computer science, applied statistics and mathematics. On several occasions within the book, corpus phonology is said to be “still in its infancy”, but it has evolved dramatically over the last few years. This justifies the three major aims of the book: to characterise corpus phonology, to provide the scientific community with a description of the diversity of practices, and to propose concrete solutions towards international standards for the design, compilation, annotation, analysis, archiving and dissemination of large phonological corpora (both in terms of data and metadata). The editors wish to provide the means to reach a scientific consensus on the future practice of building and using phonological corpora.

The contributions (587 pages followed by a 48-page general bibliography and a thematic index) are organised in four complementary parts: Part I is composed of contributions relating to the design, compilation and exploitation of phonological corpora. Part II explores major applications in corpus-based phonology. Part III focuses on the tools and methodological issues that are at stake. Part IV gives an overview of major phonological corpora in a variety of languages around the world.

The editors’ Introduction is an overview of corpus phonology as an interdisciplinary field of research. The book provides ample proof of this interdisciplinary bias, given the domains of research and applications of each of the 47 contributors. Chapter 2, by Ulrike Gut and Holger Voormann (“Corpus design”), opens Part I with a full-length definition of corpus phonology. The authors label corpus phonology and purpose-built phonological corpora as “a method” to be put in perspective with other methods from which phonology has benefited so far (comparative, experimental and acoustic-perceptual approaches, for example). The chapter then moves on to such essential properties of phonological corpus design as primary data compilation, data selection and annotation, corpus storage, sustainability and sharing. They discuss the strategies that may be chosen by the researcher and skilfully sketch a multifaceted definition of what a phonological corpus actually is. The burning issue of corpus representativeness and size, often left aside for lack of a proper definition, is also discussed. In the light of all these factors, the two scholars offer an interesting methodology for corpus creation and improvement on practical grounds. Their theory of agile (cyclic) corpus creation rejects the classical linear phases of corpus creation, which often prevent the researcher (this we must admit) from detecting inadequacies of corpus design and annotation at an early stage. While the crucial question of corpus-based data collection is sometimes not considered a real issue (or is simply skimmed over), Bruce Birch [ch.3] explores the theoretical prerequisites of data collection and some key issues such as control over primary data, the continuum of data collection (from scripted speech experiments to non-scripted ones), context and contextual variation, and the observer’s paradox (the author emphasises the fact that unselfconscious speech is no Holy Grail for understanding speech).

In chapter 4, Elisabeth Delais-Roussarie and Brechtje Post turn to phonetic and phonological annotation in corpora and to the issues of abstraction, segmentation and discretisation. Transcription procedures and the various levels of representation are thoroughly assessed with clear illustrations. In twenty illuminating pages, the authors choose to illustrate the diversity of theoretical approaches and objectives with a focus on symbolic representation systems of prosodic phenomena. Helmer Strik and Catia Cucchiarini [ch.5] present the procedures of (semi-)automatic phonological annotation of the speech signal. This chapter is most welcome, as this is today a major problem that researchers in corpus phonology wish to explore … and settle (Strik and Cucchiarini incidentally remind the reader of the costs of orthographic and phonological transcriptions in terms of time and quality). The authors draw a portrait of various notation systems, compare methods and question the accuracy and reliability of phonological transcription, as well as the problems of subjectivity and of intra- or inter-subject variation in human-made transcriptions. The validity of semi-automated phonological transcriptions is discussed in connection with the notions of reference transcriptions and dynamic programming algorithms. They offer a balanced conclusion about the level of assistance that these methods can bring to human-made transcriptions of large corpora.

Hermann Moisl [ch.6, “Statistical corpus exploitation”] explores still another side of corpus phonology: that of dealing with the overload of electronic data in corpus linguistics by means of statistical methods. The author concentrates on cluster analysis, which he applies to phonological concerns. Notions such as variable selection and data representation are tackled, followed by clustering methods and hypothesis generation, all illustrated with an analysis of NECTE data. The chapter ends with a useful selective review of the literature on statistical methods. Peter Wittenburg, Paul Trilsbeek and Florian Wittenburg [ch.7] raise the questions of archiving and disseminating data, from the traditional models to crucial changes in “an all-digital world” [132]. They discuss the management of data, its costs, contemporary channels of dissemination as well as techniques of curation and preservation of the data. The importance of descriptive, structural and administrative metadata in the design and exploitation of phonological corpora is most clearly expressed in chapter 8, where Daan Broeder and Dieter Van Uytvanck stress their strategic raison d’être. They present metadata standards and focus on practical considerations in metadata design, such as granularity and interoperability. Part I ends with Laurent Romary and Andreas Witt’s chapter on data formats [ch.9], where the authors explore “the possibility of providing the research and industrial communities that commonly use spoken corpora with a set of well-documented standardized formats that allow a high reuse rate of annotated spoken resources” [167]. They review user scenarios and components of annotation schema, so as to encompass the various types of annotation within a multi-tier annotated corpus. The end of the chapter is mainly devoted to the Text Encoding Initiative framework.

Part II is concerned with “Applications”, in other words with how corpus-based methods can strengthen research in phonology and related fields. E. Delais-Roussarie and Hiyon Yoo [ch.10] define the relationship of phonology with phonetics in corpus-based research, and the emergence of new paradigms and subsequent methods. They emphasise the different realities that lie behind the terms “corpus” and “data”, and comment on experimental data, their conditions of production and use, and their advantages and shortcomings. The chapter examines several examples of corpus-based approaches in phonology and in phonetics, and it questions the relevance of data and the validity of their representation. The question raised by Gjert Kristoffersen and Hanne Gram Simonsen [ch.11] is “how far a corpus [TAUS and NoTa-corpus] can take us in our attempt at disentangling the different structural factors that seem to be active” [215] in a complex ongoing change in East Norwegian, namely the apicalisation of former laminal /s/ before /l/ (as in Oslo). This is an enlightening chapter on the treatment of the interaction between phonological, morphological and lexical factors while taking account of intra- and extra-linguistic dimensions. The issue of French liaison enables Jacques Durand [ch.12] to focus on the central notion of variation and variationist principles in contemporary linguistic research and to add proof, if need be, of the value of corpus-based analysis. After reviewing three paradigmatic examples of the study of French liaison through corpora (Ågren 1973, Encrevé 1988 and de Jong 1988), J. Durand focuses more explicitly on the PFC programme. He examines the advantages and shortcomings of corpora for the study of variation, which leads him to assert their contribution to phonological analysis as long as they can account for variability and irregularities in usage and heterogeneity of data on empirical grounds.

Yvan Rose [ch.13] considers “Corpus-based investigations of child phonological development”. After a historical synthesis of the (once restricted) use of corpora for phonological development and a discussion of some issues in corpus-based research (such as the interaction between phonological units) and methodological challenges, the author discusses specific problems in child phonology, such as the transitional character of phonological systems in children. The chapter relies on the Phon software to illustrate its points. In chapter 14, Ulrike Gut offers her view on Second Language Acquisition, which uses corpora for both scientific and pedagogical goals, and on corpus-based research on aspects of L2 phonological acquisition. She shows that a corpus-based methodology that accounts for variation and the relative frequency of L2 production patterns may complement other methodologies to reinforce L2 teaching and learning. The author is enthusiastic about new opportunities, from the evaluation of current theories of SLA to inventories of tone in L2 English or findings on the fluency of L2 learners.

Part III (“Tools and methods”) offers still another perspective on corpus phonology. In eight contributions, the authors discuss prominent exploitation tools and options for improvement. The scope of ELAN (a generic multimedia annotation tool), its functionalities, data model and search modes are evaluated by Han Sloetjes in chapter 15. In a clear style, Tina John and Lasse Bombien [ch.16] describe the multiple-tool EMU speech database system. They present its database framework, options of annotation, management system, graphical interfaces and additional facilities. Paul Boersma [ch.17] provides an up-to-date technical account of the Praat program. His chapter is based on a practical strategy. The reader who is not yet used to Praat will find an illustrated step-by-step description of what this computer program offers in terms of corpus building, acoustic analyses and annotation of sounds. Scripting with Praat is not considered, as it is the subject of Caren Brinckmann’s chapter 18. Praat scripts may be created for automated acoustic analyses. Here again the author provides clear examples of what Praat scripts may add to transcription, annotation, analysis and distribution. Another level of explanation deals with Praat scripting as an interpreted programming language, with Praat’s variables, built-in mathematical functions, commands and control structures. Yvan Rose and Brian MacWhinney describe the PhonBank database on phonological development [ch.19]. They focus on its salient goals, functionalities and tools (with Phon and CLAN). They put a particular emphasis on data compatibility and the value of PhonBank for many theoretical views in phonology and phonological development.

Chapter 20 is an enthusiastic description of the EXMARaLDA system. Thomas Schmidt and Kai Wörner give an overview of its design, data models and metadata. They provide a detailed account of three software tools used to transcribe, manage and search data. Michael Kipp [ch.21] analyses ANVIL, a track-based annotation tool for video recordings. In this respect the chapter may be seen as complementary to the previous ones that deal with other track-based annotation tools. Annotation concepts, information encoding and interoperability (with ELAN) are discussed in depth. In chapter 22, Atanas Tchobanov tackles the key issues of web-based archiving and sharing of phonological corpora. These are complex problems for researchers who work on ‘old’ corpus projects, since the design of a web-based administration is quite a new concern in corpus phonology. What ‘remains’ to be done is to transfer corpora to the web, which is a complicated task. The author examines solutions for deploying a phonological corpus on the web and offers the reader step-by-step guidance and illustrations. He answers fundamental questions about dynamic web applications, structuring the corpus, filling in and sharing data, and he anticipates new technological challenges.

Part IV is dedicated to remarkable phonological corpora in different languages around the world. Many theoretical issues of language description and analysis through corpora that are considered earlier in the book are tackled from another perspective. For each corpus, the authors introduce the historical background of the project, the theoretical bias, protocols and methodological features, essential goals and applications as well as their most recent analyses and strategies. Thus the IViE corpus is devoted to the study of prosodic variation in urban varieties of British English; PFC relies on both dialectological and sociolinguistic methods to describe authentic oral French around the world; NoTa-Oslo and TAUS were designed to evaluate spontaneous speech from Oslo speakers; the LeaP corpus was dedicated to the acquisition of prosody by L2 learners of English and German; DECTE is a diachronic corpus of Tyneside English combining NECTE and NECTE2; the LANCHART corpus (Danish) is focused on the study of real-time language change from a sociolinguistic point of view; the Goeman-Taeldeman-Van Reenen Database and Soundbites cover traditional dialects of Dutch; Valibel is a sociolinguistic corpus of Belgian French with an emphasis on attitudes to language; the ANDOSL Map Task Corpus combines scripted and non-scripted material to study the interaction between intonation and speech in Australian English; TAICORP is a phonological corpus of spontaneous speech dedicated to the study of L1 acquisition of Southern Min Chinese. This mosaic of corpora and databases highlights the great diversity of phonological corpora. It also convinces us that they have become multi-purpose corpora, enlarging the scope of linguistic study. The contributors show beyond doubt that multidisciplinary openness may contribute efficiently to phonological and linguistic knowledge.

As the chapters are written and illustrated by authors who have a deep and recognised experience in the handling of corpora that have proved their worth, the book brings lively analyses of key issues in corpus phonology together with sensible proposals. The present reviewer believes that this Handbook fulfils its aims as a high-level reference work in corpus phonology and as a comprehensive practical guide for a better handling of linguistic corpora in their international and scientific diversity. It will be of interest to any well-informed researcher or advanced student already committed to working with corpora or interested in phonology and phonetics, in language variation and change, dialectology, sociolinguistics and language acquisition across languages.


