SharePoint search: Inside the Index book ‘review’
Inside the Index and Search Engines is 624 pages of lovely SharePoint search info. It is the sort of book that sets me apart from my colleagues: I was delighted when it arrived; everyone else was sympathetic.
The audience is “administrators” and “developers”. I’m never sure how technical they are imagining when they say “administrators”, so I waded in anyway. The book defines topics for administrators as: managing the index file, configuring the end-user experience, managing metadata, search usage reports, configuring BDC applications, monitoring performance, and administering protocol handlers and iFilters. I skimmed through the content for developers and found some useful nuggets in there too.
Contents:
1. Introducing Enterprise Search in SharePoint 2007
2. The End-User Search Experience
3. Customizing the Search User Interface
4. Search Usage Reports
5. Search Administration
6. Indexing and Searching Business Data
7. Search Deployment Considerations
8. Search APIs
9. Advanced Search Engine Topics
10. Searching with Windows SharePoint Services 3.0
The book begins by setting the scene, with lots of fluff about why search matters and some slightly awkward praise for Microsoft’s efforts. It gets much more interesting later, so you can probably skip most of the introduction.
Content I found useful:
Chapter 1. Introducing Enterprise Search in SharePoint 2007
p.28-33 includes a comparison of features for a quick overview of Search Server, Search Server Express and SharePoint Server.
“Queries that are submitted first go through layers of word breakers and stemmers before they are executed against the content index file is available. Word breaking is a technique for isolating the important words out of the content, and stemmers store the variations on a word” p.32
Keyword query syntax p.44
- maximum query length 1024 characters
- by default is not case sensitive
- defaults to AND queries
- phrase searches can be run with quote marks
- wildcard searching is not supported at the level of keyword syntax search queries. Developers could build this functionality using CONTAINS in the SQL query syntax
- exclude words with
- you can search for properties e.g. rnib author:loasby
- property searches can include prefix searches e.g. author:loas
- properties are ANDed unless it is the same property repeated (which would run as an OR search)
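To make the rules above concrete, here is a minimal Python sketch of how a keyword query string might be assembled. The function name build_keyword_query and its arguments are my own invention for illustration, not anything from the book or from SharePoint; the only facts it relies on are the ones listed above (terms separated by spaces, phrases in quote marks, property:value filters, 1024 character limit).

```python
MAX_QUERY_LENGTH = 1024  # maximum keyword query length noted in the list above


def build_keyword_query(terms, properties=None):
    """Join free-text terms and property filters into one keyword query string.

    Hypothetical helper for illustration only; it is not part of any SharePoint API.
    """
    parts = []
    for term in terms:
        # Multi-word terms are wrapped in quote marks so they run as a phrase search.
        parts.append(f'"{term}"' if " " in term else term)
    for name, value in (properties or []):
        # property:value filter, e.g. author:loasby (a prefix such as author:loas also works).
        parts.append(f"{name}:{value}")
    query = " ".join(parts)
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError(f"query is {len(query)} characters; the limit is {MAX_QUERY_LENGTH}")
    return query


print(build_keyword_query(["rnib"], [("author", "loasby")]))
# prints: rnib author:loasby
```

Because queries default to AND, simply joining the parts with spaces is enough; the sketch does not try to handle exclusions or property values containing spaces.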
Search URL parameters p.50
- k = keyword query
- s = the scope
- v = sort e.g. “&v=date”
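As a rough illustration of how those three parameters fit together, this Python sketch builds a results URL. The host name, the results page path and the scope name in the example are placeholders I made up; only the parameter names k, s and v come from the book.

```python
from urllib.parse import urlencode


def build_search_url(base_url, keywords, scope=None, sort=None):
    """Build a search results URL from the k, s and v parameters.

    Hypothetical helper for illustration; base_url is whatever results page your site uses.
    """
    params = {"k": keywords}  # k = keyword query
    if scope:
        params["s"] = scope   # s = the scope
    if sort:
        params["v"] = sort    # v = sort, e.g. "date"
    # urlencode escapes spaces and colons so the query survives in a URL.
    return f"{base_url}?{urlencode(params)}"


# Placeholder host, path and scope name, for illustration only.
url = build_search_url(
    "http://intranet.example/searchcenter/Pages/results.aspx",
    "rnib author:loasby",
    scope="All Sites",
    sort="date",
)
print(url)
```

Leaving out scope and sort gives a URL with only the k parameter, which is the simplest way to link straight to a set of results.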
Chapter 4: The Search Usage Reports
Search queries report contains:
- number of queries
- query origin site collections
- number of queries per scope
- query terms
Search results report contains:
- search result destination pages (which URL was clicked by users)
- queries with zero results
- most clicked best bets
- search results with zero best bets
- queries with low clickthrough
Data can be exported to Excel (useful if I need to share the data in an accessible format).
You cannot view data beyond the 30-day data window. The suggested solution is to export every report!
Chapter 5: Search Administration
Can manage the crawl by:
- create content sources
- define crawl rules: exclude content (can use wildcard patterns), follow/noindex, crawl URLs with query strings
- define crawl schedules
- remove unwanted items with immediate effect
- troubleshoot crawls
There’s a useful but off-topic box about file shares vs. SharePoint on p.225
Crawler can discover metadata from:
- file properties e.g. name, extension, date and size
- additional Microsoft Office properties
- SharePoint list columns
- meta tags in HTML
- Email subject and to fields
- User profile properties
You can view the list of crawled properties via the Metadata Property Mappings link in the Configure Search Settings page. “Included In Index” indicates whether the property is searchable.
Managed properties can be:
- exposed in advanced search and in query syntax
- displayed in search results
- used in search scope rules
- used in custom relevancy ranking
Adjusting the weight of properties in ranking is not an admin interface task and can only be done via the programming interface.
High Confidence Results: A different (more detailed?) display for results that the search engine believes are an exact match for the query.
Authoritative Pages
- sites central to high-priority business processes should be authoritative
- sites that encourage collaboration and actions should be authoritative
- external sites should not be authoritative
Thesaurus p.291
- an XML file on the server with no admin interface
- no need to include stemming variations
- different language thesauri exist. The one used depends on the language specified by client apps sending requests
- tseng.xml and tsenu.xml
Noise words p.294
- language specific plain text files, in the same directory as the thesaurus
- for US English the file name is noiseenu.txt
Diacritic-sensitive search
- off by default
Chapter 8 – Search APIs
Mostly too technical but buried in the middle of chapter 8 are the ranking parameters:
- saturation constant for term frequency
- saturation constant for click distance
- weight of click distance for calculating relevance
- saturation constant for URL depth
- weight of URL depth for calculating relevance
- weight for ranking applied to non-default language
- weight of HTML, XML and TXT content type
- weight of document content types (Word, PowerPoint, Excel and Outlook)
- weight of list items content types
They’ll come in handy when I’m baffling over some random ranking decisions that SP has made.
Chapter 9 – Advanced Search Engine Topics
Skipped through most of this but it does cover the Codeplex Faceted Search on p.574-585
A good percentage of the book was valuable to a non-developer, particularly one who is happy to skip over chunks of code. I’ve seen and heard a lot of waffle about what SharePoint search does and doesn’t do, so it was great to get some solid answers.
Inside the Index and Search Engines: Microsoft® Office SharePoint® Server 2007