It is no secret that the world is drowning in text and data. This causes real problems for everyday users who need to make sense of all the information available, and for software engineers who want to make their text-based applications more useful and user-friendly. Whether building a search engine for a corporate website, automatically organizing email, or extracting important nuggets of information from the news, dealing with unstructured text can be daunting.
Taming Text is a hands-on, example-driven guide to working with unstructured text in the context of real-world applications. It explores how to automatically organize text, using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. This book gives examples illustrating each of these topics, as well as the foundations upon which they are built.
Purchase of the print book comes with an offer of a free PDF, ePub, and Kindle eBook from Manning. All code from the book is also available.
Good overview of different systems (Solr, OpenNLP & Mahout), approaches & algorithms for working with (unstructured) text: search, analysis, clustering, etc. The book itself is completely practical, with references to articles & books for people interested in more detailed/theoretical information.
This book is like a hands-on guide to Text Analytics and Processing, as it talks about the Open Source projects related to this topic. Beyond that, it is a very good introduction to the basics of Text Analysis and how one can use Open Source Solutions like Lucene, Solr, OpenNLP, etc. to do the same.
The last chapter, "Untamed text: the next frontier," is very good. For someone interested in Text Processing and Analytics, that last chapter is a very good read with lots of ideas about what could be done next in this field.
This book was alright. Many of the theoretical bits are things I've already encountered. The practical bits weren't so relevant because they were too tightly coupled to specific Java libraries and other pieces of software. Nothing too earth-shattering here, but it might be just right for some readers.