Author(s): Michael McCandless, Erik Hatcher, and Otis Gospodnetić
Published: July, 2010
These days, search performance is crucial when we work with the huge amounts of data found in many enterprises. That is hard work, and Lucene is here to help us. This is an interesting book, although sometimes you need to read a section several times to understand the more complex topics.
You have more than 450 pages available to learn Lucene.
Each chapter includes many, many sections! I am going to mention only those that are covered over many pages, not the shortest ones. What follows is my summary overview of this interesting work.
Because the book includes so many sections in each chapter, there are two side effects; I expand on my opinion about this at the bottom of the review. Many of these sections are built from images, source code and explanation. Some are long and others quite concise, so you will find this pattern mentioned many times below.
Part I: Core Lucene
Chapter 01: Meet Lucene
A solid chapter. It opens with today's information explosion and then introduces Lucene, explaining what it is and what it can do, and even includes the history of its creation. There is a valuable image of the many components involved in a search application, along with a long and important explanation of these components.
A sample application is shown too, with its explanation, instructions and output. There is also an excellent, long explanation of the core indexing classes and the core searching classes.
Chapter 02: Building a search index
The chapter starts by explaining the indexing process with important material, and sample source code is included with its explanation. The API methods for deleting and updating documents are introduced and explained.
Field options are well covered, with valuable information. The rest of the chapter is long and almost entirely theory, covering topics such as boosting documents and fields, indexing numbers, dates and times, and concurrency, thread safety and locking issues.
Chapter 03: Adding search to your application
After a concise introduction to the searching API, a short source code sample for a TermQuery is introduced and explained. The same goes for QueryParser, which even gets an image illustrating how it works.
IndexSearcher is covered too, including almost a page of source code on near-real-time search with its explanation, which I found interesting. The same goes for Lucene scoring.
Another section is Lucene’s diverse queries, where TermQuery, TermRangeQuery, NumericRangeQuery, PrefixQuery, BooleanQuery, PhraseQuery, WildcardQuery and FuzzyQuery are covered over a good number of pages, including important source code with its explanation. You will see for yourself that Lucene offers good support for queries.
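As a rough illustration of the idea behind fuzzy matching, here is a minimal sketch in plain Python that computes Levenshtein edit distance and uses it to select terms close to a query term. It is not Lucene's implementation or API; the names edit_distance and fuzzy_match and the max_edits threshold are my own assumptions, chosen only for illustration.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query: str, terms: Iterable[str], max_edits: int = 2) -> List[str]:
    """Return the terms within max_edits edits of the query (hypothetical helper)."""
    return [term for term in terms if edit_distance(query, term) <= max_edits]
```

A real search library would not compare the query against every term like this; the sketch only shows what "close enough" means for a fuzzy query.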
Chapter 04: Lucene’s analysis process
The chapter starts with an image and an explanation of the analysis process during indexing, followed by What is inside an analyzer, where important terms like token and token stream are explained with valuable theory and helpful images.
Among other sections, Synonyms and aliases is covered, with important and valuable source code and its explanation.
The section Language analysis issues is crucial and well covered.
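To make the terms token and token stream more concrete, here is a minimal sketch in plain Python, not Lucene's API. It assumes an analyzer is a tokenizer followed by a chain of filters; the names Token, tokenize, lowercase_filter, stop_filter and analyze, and the stop word list, are hypothetical and chosen only for illustration.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """A term plus its character offsets in the original text."""
    term: str
    start: int
    end: int


# Hypothetical stop word list, for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to"})


def tokenize(text: str) -> Iterator[Token]:
    """Split text into runs of word characters and yield them as a token stream."""
    for match in re.finditer(r"\w+", text):
        yield Token(match.group(), match.start(), match.end())


def lowercase_filter(tokens: Iterable[Token]) -> Iterator[Token]:
    """Lowercase each term, keeping the original offsets."""
    for token in tokens:
        yield token._replace(term=token.term.lower())


def stop_filter(tokens: Iterable[Token]) -> Iterator[Token]:
    """Drop tokens whose term is a stop word."""
    for token in tokens:
        if token.term not in STOP_WORDS:
            yield token


def analyze(text: str) -> List[Token]:
    """Run the whole chain: tokenizer first, then each filter in order."""
    return list(stop_filter(lowercase_filter(tokenize(text))))
```

Each stage consumes the stream produced by the previous one, which is why the order of the filters matters: here the stop filter only works because the terms have already been lowercased.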
Chapter 05: Advanced search techniques
A long chapter. After a concise introduction to Lucene’s field cache, there is important coverage of Sorting search results: many sorting options are shown through source code with its explanation and output, over many pages.
The same goes for Span queries, with long coverage of the variations, each illustrated with an image for better understanding. The same approach is used again for Filtering a search, which is well covered.
Chapter 06: Extending search
The chapter starts quickly with a geographic sorting scenario, covered in roughly three pages of source code with its explanation. The same goes for the custom Collector, where two approaches are used.
There is deeper and longer coverage of QueryParser, based on many source code samples with their explanation over roughly five pages, even including a table of its extensibility points. The same approach is used again for Filters and Payloads.
A really interesting chapter with a lot of source code.
Part II: Applied Lucene
Chapter 07: Extracting text with Tika
The chapter starts quickly with an introduction to Tika, including a table of roughly two pages listing the document formats it can parse, an explanation of its API and how to install it. How to extract text programmatically is then covered in two pages of source code with its explanation. Not everything is perfect: Tika's limitations are covered too.
To complete the chapter, there is material on indexing custom XML with SAX and Apache Commons Digester, each with its own sample source code and explanation.
Chapter 08: Essential Lucene extensions
This chapter is closely based on Luke, with many images of its interface and explanations of its features: the Overview tab, the Documents tab, searching with QueryParser, Files support, and so on.
Also important and valuable is a two-page table of the API for Analyzers, Tokenizers and TokenFilters; I found this table very interesting.
An important section is Highlighting query terms. An image shows the process flow and the classes and interfaces involved, with an explanation of each component. Sample source code for applying highlighting is included, even working with CSS.
In addition, spell checking is covered through source code with its explanation. A valuable extra is roughly a page of ideas for improving spell checking.
To complete the chapter, several query extensions are introduced, such as MoreLikeThis, RegexQuery and more.
Chapter 09: Further Lucene extensions
The chapter starts quickly with Chaining Filters, covered in roughly three pages of source code with its explanation. An interesting section using the same approach is Storing an index in Berkeley DB.
Another interesting section is Synonyms working with WordNet: how to build the index and how to use it with an analyzer are well covered through images, source code and explanation, and the images are a good complement.
A valuable section is the one on the XML QueryParser, with an interesting image of the three common options for building a Lucene query from a search UI; valuable source code and a detailed explanation of an .xsd file are included too.
Spatial Lucene is included too, with important images of the globe, tiers and grid boxes. The source code and its explanation are well introduced, covering important topics such as searching and performance.
Near the end of the chapter, Searching multiple indexes remotely is well covered, explained with an image and important, valuable source code with a concise explanation.
Chapter 10: Using Lucene from other programming languages
An interesting chapter to consider. It is based largely on source code samples showing how you can work with Lucene from other programming languages. The ports covered are:
- CLucene (C++)
- Lucene.Net (C#)
- KinoSearch and Lucy (Perl)
- Ferret (Ruby)
- PyLucene (Python)
Chapter 11: Lucene administration and performance tuning
This chapter is neither very long nor very short, but it is to the point. It offers valuable theory and explanation of source code, covering topics such as tuning, threads, managing disk and memory usage, and the index.
Some images of process flows and performance output are also available to complement the long theory in many of the sections.
Part III: Case studies
Here we have three very interesting chapters. They share common features: considerable theory and explanation of each case, some code snippets to complement the ideas, valuable images of simple and complex processes, some views or output results, and finally some JMX configuration.
These three final chapters are:
- Chapter 12: Case study 1: Krugle
- Chapter 13: Case study 2: SIREn
- Chapter 14: Case study 3: LinkedIn
What I liked:
- Each chapter has a lot of sections, so you get extensive coverage of Lucene
- A lot of theory available
- Many tables are available as a useful complement
- Valuable images help you understand complex processes and functions
What I disliked:
- Each chapter has a lot of sections, but many of them are based only on theory, so you get the idea but not the practice
- Because of the previous point, you need to read a chapter several times; there are a lot of topics to learn
- Many times, with the sample source code, I felt that a deeper explanation of the code was needed
I hope not to see major API changes in Apache Lucene 3.3.0 compared with the version covered in the book, which is 3.0.