Statistical approaches to processing natural language text have become dominant in recent years. This foundational text is the first comprehensive introduction to statistical natural language processing (NLP) to appear. The book contains all the theory and algorithms needed for building NLP tools. It provides broad but rigorous coverage of mathematical and linguistic foundations, as well as detailed discussion of statistical methods, allowing students and researchers to construct their own implementations. The book covers collocation finding, word sense disambiguation, probabilistic parsing, information retrieval, and other applications.
A must-read for anyone looking to get into NLP. It teaches from first principles, including briefly touching on information theory and entropy. I felt it was well grounded and proceeded at a good pace. No prior knowledge is required.
I picked this up at the same time as "Speech and Language Processing" (Jurafsky & Martin), and while Foundations is a much narrower book (making up for it in depth), I think that's for the better, as I found SLP far too broad and thin.
As the great American anthropologist-linguist Edward Sapir put it, all grammars leak. Some sentences are obviously grammatical, some are obviously ungrammatical, but there are gray areas; native speakers of English disagree on whether sentences such as "Who did Jo think said John saw him?" and "The boys read Mary's stories about each other" are grammatical. A way of resolving this difficulty is to look at a large corpus of texts: sentence structures that occur there often are grammatical, sentence structures that never occur are ungrammatical, and those that occur rarely are in a gray area. We will also need to assign a nonzero probability to sentence structures that we have never seen before, higher if they resemble ones that we've seen before than if they don't. Before Noam Chomsky invented them in 1957, neither "Colorless green ideas sleep furiously" nor "Furiously sleep ideas green colorless" had ever occurred in an English text, but sentences like the former occur much more frequently than sentences like the latter. This book discusses various algorithms used in corpus-based linguistics: parsing text, aligning text in two languages, deciding on the meaning of ambiguous words such as "plant" (a living organism from the kingdom Plantae, or a factory) and "interest" (curiosity, or a share in a company). These algorithms do not always work correctly, but they work well enough to be used in the real world.
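The idea of giving never-seen word sequences a small but nonzero probability is usually handled with smoothing. As an illustrative sketch (not code from the book), here is add-one (Laplace) smoothing over bigrams, using Chomsky's example as a toy corpus; the function name and vocabulary size are assumptions for the example:

```python
from collections import Counter

def bigram_prob(corpus, vocab_size):
    """Return an add-one (Laplace) smoothed bigram probability function.

    Illustrative sketch only: even a word pair never seen in the corpus
    gets a small nonzero probability, rather than zero.
    """
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def p(w2, w1):
        # P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V) — never zero
        return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

    return p

corpus = "colorless green ideas sleep furiously".split()
p = bigram_prob(corpus, vocab_size=5)

# The attested order gets a higher probability than the reversed one,
# but the unseen pair still gets probability > 0.
assert p("green", "colorless") > p("colorless", "green") > 0
```

Real systems use better-calibrated schemes (Good-Turing, Kneser-Ney), which the book covers, but the principle is the same: redistribute a little probability mass from seen events to unseen ones.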
A classic on natural language processing. If you know nothing about natural language processing, or have a piecemeal understanding, this book will give you an overview of the field in a rigorous and yet comprehensible way.
Note that this book was written in 1999, so it far predates the current practice of using neural networks for natural language. This book will give you exactly what it says in the title, Foundations, not "modern best practices."
This 1999 book does a good job of explaining the different areas of statistical NLP. It was easy to read and very clear, even in the formula-heavy sections. The sections on collocations (multi-word phrases) and verb subcategorization were largely new to me. The problems natural-language research has faced are similar to those computer vision faces, but easier; as a result, researchers have made much more progress on the higher-level organization of concepts instead of getting stuck at the level of simple features and object recognition, as computer vision has.
This and Speech and Language Processing by Jurafsky and Martin are the two big introductory texts in natural language processing. I prefer the Jurafsky book: it goes into more detail, has more examples, and is written more for use as a class text. The Manning and Schütze book is much more mathematically oriented and goes into more detail on algorithms, so if you're focusing on the statistical aspect more than the language aspect, refer to this book. Ideally, you probably want both.