Introducing Speech and Language Processing
Reviewed by Graham Ranger
Coleman's Introducing Speech and Language Processing (henceforth ISLP) is a very well presented manual running to more than 300 pages which aims at initiating the reader in some of the fairly complex and technical computational and mathematical methods used in processing linguistic sounds and structures. An introduction is followed by eight evenly balanced chapters each of which targets a particular technique in speech and language processing. Each chapter is headed by a page featuring a "chapter preview" and by a list of "key terms" to be found in the chapter. ISLP contains a good number of exercises both within the text of each chapter, to be done while reading. At the end of the chapters there are further exercises, advice for complementary reading and preparatory reading for the next unit. The index is perhaps a little sparse but it is usefully complemented by a glossary of technical terms. In addition to the book itself, the accompanying CD-ROM contains a good deal of extra material including software (a C-code compiler, a Prolog interpreter, a text editor and a sound editor) in addition to accompanying examples and data files. A companion website http://www.islp.org.uk/ provides answers to some, if not all, of the exercises, help with bugs and problems installing or running the software, free extras, updates and links to other relevant websites. Very little has been left to chance in ensuring that the reader benefits fully from his or her reading of ISLP!
I will begin by presenting the goals of ISLP as they appear in Chapter 1 as well as my own particular interest in these goals before going on to look at the main body of the book.
ISLP is, Chapter 1 tells us, "a first, basic, elementary and short textbook in speech and natural language processing for beginners with little or no previous experience of computer programming […] it is primarily written for language, speech and linguistics students. Consequently, I shall assume some prior knowledge of basic linguistics" . I ought perhaps to say that I am not an expert in computing but a lecturer in linguistics and so this review is written from the point of view of the target student rather than from that of the lecturer interested in adopting ISLP as a coursebook. One of the purposes of ISLP, Coleman tells us, is to bring together two fields in language processing which have traditional remained divided: "signal processing […] which [is] normally taught in electronic engineering degrees, and computational linguistics, which is mainly directed to students of computer science or linguistics" . Coleman goes on to provide a number of reasons to use ISLP, for students in various fields of research. His own reason for writing the book, he tells us, "rests on a refusal to let the old sociological divide between arts and sciences stands in the way of a new wave of spoken language researchers with a foot in both camps" . The book adopts a hands-on approach to programming skills. "The emphasis […] is on pratical exercises in the use and modification of computer programs written in C and Prolog, in order that students will rapidly gain confidence and acquire new skills" . The necessary C and Prolog programs are included on the CD-ROM, as I have said, and the reader is encouraged to "tinker" with these, the better to understand how they work. The introduction goes on to indicate the basic "computational set-up needed" for the book and the "computational skills" that the reader must have. This paragraph, I would suggest, is a little disingenuous, beginning as it does: "You need to know a little about how to use a PC: 1. How to power it up (and, if necessary, how to start Windows). 2. How to use the mouse and keyboard." Some of the skills required later on in the manual, in particular concerning the installation and use of the C compiler and the Prolog interpreter definitely require a little more expertise than this, in particular when things do not go as expected! The introduction ends with the structure of the book, which follows two "'strands' […] one on speech processing and the other on natural language processing" . I will be following this in the next paragraphs, dealing firstly with chapters 2-4 (speech processing) then with 5-9 (natural language processing) both in terms of speech recognition (6-7) and grammars (8-9).
Chapters 2-4 look at speech processing and provide an introduction to working with C programming. Chapter 2, "Sounds and numbers," is conceived around a confidence-building exercise first generating and then interpreting a cosine wave in terms of sound. To produce a cosine wave we cannot take literally every point on the wave, and so we must decide to take a limited number of samples at regular intervals.
These samples are measured using a discrete series of values at fixed intervals. This transformation of the continuous to the discrete is called quantization (2.4). The passage in the opposite direction involved when a signal is played back, for example, is called reconstruction. A series of 64000 quantization values (±32000) is enough, Coleman tells us, for "quite a high fidelity representation of speech" . 64000 values can be encoded using 16 bits (or binary digits): one denotes the sign (+ or -) while the other 15 give us 215 (or 32000) different levels. (At this point in the chapter there is a short paragraph representing an exchange in a lesson perhaps, between the student and the lecturer featuring the student's question "What is the definition of a bit?" and the answer. I enjoyed this original feature of ISLP, in particular since the naive questions of the typical student frequently corresponded to questions I was asking myself while reading. I did however find, in some chapters, when the going got tough, that the student fell conspicuously silent!). Next, the chapter comments in detail upon the listing of a program file called coswave.c the purpose of which is to produce a cosine wave by using the parameters introduced earlier in the chapter (the samples and the values assigned to them). The chapter closes with a summary and a number of exercises to compile and run the coswave.c program. Compilation involves turning the C code of the program firstly into an object code which the computer can understand (the file is then called coswave.o where the suffix .o denotes object code) and then into an executable program bleep.exe (the name bleep is arbitrary). This can now be run. It writes to the disk a data file coswave.dat which can be imported into a sound editor like "Cool Edit" (the program provided on the CD-ROM). These operations are done by typing out commands on the command line, not something many of us are used to doing, I imagine. The next exercises involve modifying the parameters of coswave.c (samples and values) to produce differences in the sound produced and even getting the program to write out the sample numbers and values into a text file which can be visualised or converted into graphic form using Excel or similar software.
Chapter 5 is a pivotal chapter between the speech processing and language processing strands of the book. In it the author presents the theory and practice of finite-state machines (FSM's). "A finite-state machine works rather like a board game in which you move a piece from one position to the next in order to get from one side of the board (the start) to the other (the end)" . A finite-state machine allowing us to generate English-like monosyllables is represented graphically with the following diagram:
Starting from state 1 we move through the machine step by step until we reach an end state. This finite-state machine is implemented using the Prolog programming language (rather than C). Coleman comments upon some of the more significant differences between the two languages and then on the program itself. There follows an exercise in which the reader is asked to try it out using the SWI-Prolog interpreter which can be installed from the CD-ROM. The next step is to move to finite-state transducers (of FST's). These are like FSM's except that they work with pairs of symbols. Again a Prolog implementation of this is provided, with the aim of relating English-like phoneme strings to spellings. And so the Prolog query
In Chapter 6, Introduction to Speech recognition techniques, Coleman divides such techniques into two sorts: knowledge-based approaches and pattern-matching approaches. The first relies on linguistic theory, the second on database examples. He looks at knowledge-based approaches first presenting the decision tree as a way of recognising speech. An example of a decision tree for phoneme recognition is given below:
Such methods are apparently not always reliable and hence are less favoured in commercial applications. Coleman looks briefly at ways in which such systems can be made more flexible and less inclined to fail in case of error. The chapter then move on to Pattern-matching approaches to speech-recognition. Here, rather than look for minimal features, speech samples are compared with a database of recorded patterns, often consisting of whole words. This very pragmatic approach to the problem has not always interested linguists but has achieved more significant results and receives commercial applications. One of the problems it presents, however, is that the sample is not necessarily temporally aligned with the database example. This requires us to implement something called dynamic time warping: a method for stretching or compressing samples intelligently in order to align and compare them with recorded examples. The method of dynamic time warping is explained in some detail. The chapter ends wth a consideration of problems posed for the pattern-matching approach by further sorts of variation: accent, gender, speaker, etc.
Chapter 7 returns to the probabilistic finite-state models evoked towards the end of Chapter 5. Probabilistic methods, Coleman tells us, are useful in situations of uncertainty whether these be acoustic (as addressed with dynamic time warping in Chapter 6), categorial, structural or homophonous. Uncertainties may, moreover, accumulate within a single unit of speech. The question of structural uncertainty is addressed by appealing to the preceding context. Hence in the sequence the low swinging he considers that the possibility of swinging being an adjective is 0.8 while that of it being a noun is 0.2 on a scale of 0-1. A reference corpus is used to establish probabilities. A problem arises though, since one may produce a perfectly good sentence which would receive a probability estimate of 0 simple because it does not occur as such in the reference corpus. This problem is answered in the following way: rather than take into account the whole sentence, the probability of a sentence occurring is calculated, following the Markov assumption, as the product of the probability of bits of the sentence only. For example the probability of finding I saw a Tibetan lama in a corpus may be calculated using bigrams (sets of two words) as the probability of finding "I" times the probability of find "I saw" times the probability of finding "saw a" and so on. Coleman considers a number of problems such probabilistic models (known as Hidden Markov Models) have to take into account. This part of the chapter was, I found, rather too technical and soon became difficult to follow for the non specialist. The chapter ends with a consideration of Chomsky's objections (in Chomsky 1957, 1965 and Miller and Chomsky 19633) to statistical models of language (including Markov models) and a number of possible rejoinders to Chomsky's criticism.
Chapter 8 Parsing brought me back to more familiar territory with a Prolog implementation of phrase structure rules of the form S à NP + VP; NP à det + N etc. Coleman first asks the reader to parse a given sentence intuitively and then looks at how a program might go about the same task. Whereas a person intuitively adopts a bottom-up approach (i.e. starts with the words and then works from them towards abstractions such as NP, VP or S) a program has to take a top-down approach, beginning with the highest term (S or IP) and working down, asking itself "how can we build the tree downwards so as to generate the input string?" . A simple parsing program ("a 'toy' grammar” 232), written in Prolog, is worked through, and we are told how the computer has, whenever it reaches a dead-end, to backtrack and try another route. This program, called dog_grammar.pl, will, when given a string of words, tells us whether a string is well-formed or not, using recursive descent parsing. The Prolog query ip([the,quick,brown,fox,jumped,over,the,lazy,dogs],). returns the answer: Yes. It can also generate a list of all possible sentences, corresponding to the lexicon and the rules programmed into it. More usefully, perhaps, a second program works as a simple parser, so that the query
The book's short final chapter, Using probabilistic grammars, brings together the considerations of probabilities presented in Chapter 7 and the simple Definite Clause Grammars of Chapter 8. The chapter begins with the observation that "Human grammaticality judgements are gradient" . This, and a number of other factors, lead to the elaboration of a probabilistic grammar in which each of the rules receives a probability rating and in which the sum of all sentence probabilities is 1. Probabilities will initially depend upon a reference corpus, but this will, as before, pose the problem of how to deal with sentence structures which include rules that are not attested in the corpus. Coleman considers these problems before looking at an example of such a probabilistic grammar, written in Prolog. We are shown how the grammar, which includes a very limited lexicon and set of rules, allows us for example to calculate the probabilities associated with acceptable sentences. Within this sort of grammar it is still difficult to integrate collocations such as kick the bucket or the semantic constraints which make the emaciated man starved a more likely sentence than the emaciated man was fat. One way of dealing with these difficulties is by using bigger chunks of sentence as primitives in what are called Tree adjoining grammars. In a similar vein, data oriented parsing constructs a corpus with "syntax trees instead of grammar rules" .
ISLP ends with a table presenting the ASCII codes, a useful fourteen-page glossary of technical terms and a full list of references.
In conclusion, ISLP is an ambitious attempt to initiate non specialists to complex techniques for speech and language processing. I found it somewhat difficult to follow occasionally but I suspect that, had I followed all the advice for complementary and preparatory reading (notably in C and Prolog programming), a good many of my difficulties would have evaporated. ISLP is, in any case, undoubtedly more suited for use as a course manual than as a self-study book. In general the explanations provided are clear and Coleman's tone always encouraging. The author has a mission to reconcile otherwise separate disciplines and writes engagingly throughout. A large amount of complementary material is generously included, both on the CD-ROM and online, to help the reader along as much as possible.4 I would recommend ISLP as an excellent starting point for the linguistics researcher who wants to know more about how computational techniques can apply to his or her discipline.
2. N. Chomsky, Syntactic Structures (The Hague: Mouton, 1957). back
3. N. Chomsky, op. cit, N. Chomsky, Aspects of the Theory of Syntax (Cambridge, MA: MIT Press, 1965) and N. Chomsky & G.A. Miller, "Finitary models of language users," in R.D. Luce, R.R. Bush & E. Galanter eds., Handbook of Mathematical Psychology, vol. II (New York: John Wiley, 1963) 419-91. back
4. One very minor caveat: the sound editor Cool Edit is shareware and is not fully functional unless paid for. The SAVE function, in particular, stops after 30 days. The freeware sound editor Audacity is able to do all that is required within the scope of ISLP. back