ia play

the good life in a digital age

why your search engine (probably) isn’t rubbish

Now all search engines struggle, to varying degrees, with the knotty mess that is natural language. But they don’t generally get called rubbish for not succeeding with the meaty search challenges.

Rubbish search engines are the ones that can’t seem to answer the most basic requests in a sensible manner. These are the ones that get mocked as “random link generators”, the gibbering wrecks of their breed.

Go to Homebase and search for “rabbit hutch” (we need another one as two of our girls are about to produce heaps of bunnies at the same time).

The first result is “Small plastic pet carrier”. There are a number of other carriers and cages. Then there’s a “Beech Finish Small Corner Desk with Hutch”. Finally there’s a Pentland Rabbit Hutch at result #8. I asked for “rabbit hutch” and they’ve got a rabbit hutch to sell me, but they’re showing me pet carriers and beech finish corner desks.

This is a rubbish set of results. But it doesn’t mean the search engine is rubbish.

Somebody made a rubbish decision. They’ve set it up shonky.

So before you reach for the million pound enterprise search project, try having a quick look under the bonnet with a spanner.

Is it AND or OR?

This is reasonably easy to test, if you can’t ask someone who knows.

Pick a word that will be rare on your site and another word that doesn’t appear with the rare one, e.g. “Topaz form” for my intranet. A rare word is one that should only appear once or twice in the entire dataset, so you can check that the other word doesn’t appear with it. You may need to be a bit imaginative, but unique things like product codes can be helpful here. If the query returns no results you’ve probably got an AND search. If you get more than a couple of results (including ones that don’t mention Topaz) you’ve probably got OR.

(This can get messed up if there is query expansion going on, but hopefully the rare word isn’t one that any of the query expansion rules will act on.)
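Here’s a minimal sketch of that test, assuming a toy in-memory set of documents rather than a real search engine. The document titles, the `search` function and the `guess_mode` helper are all made up for illustration; on a real site you’d run the query by hand and eyeball the results.

```python
# Toy documents standing in for an intranet; all titles are invented.
DOCS = {
    1: "topaz project overview",
    2: "annual leave form",
    3: "expenses claim form",
    4: "new starter form",
}


def search(query, mode):
    """Return ids of documents matching the query terms under AND or OR."""
    terms = query.lower().split()
    hits = []
    for doc_id, text in DOCS.items():
        words = set(text.split())
        matched = [t in words for t in terms]
        if (mode == "AND" and all(matched)) or (mode == "OR" and any(matched)):
            hits.append(doc_id)
    return hits


def guess_mode(result_ids, rare_word):
    """Rule of thumb: no results suggests AND; hits lacking the rare word suggest OR."""
    if not result_ids:
        return "probably AND"
    if any(rare_word not in DOCS[i].split() for i in result_ids):
        return "probably OR"
    return "unclear"


and_results = search("topaz form", "AND")
or_results = search("topaz form", "OR")
print(guess_mode(and_results, "topaz"))
print(guess_mode(or_results, "topaz"))
```

The “unclear” case is there because a handful of results that all mention the rare word doesn’t tell you much either way.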

AND is more likely to be problematic as a setting. You’ll get lots of “no results”. You’ll need your users to be super precise with their terminology and spell every word right.  If they are looking for “holiday form” and the form is called “annual leave form” they’ll get no results.

OR will generate lots of results. This is ok if the sort order is sensible. Very few people care that Google returned 2,009,990 results for their query. They just care that the first result is spot-on.

So most of the time you probably want an OR set-up.

(Preferably combined with support for phrase searching, so users can choose to put their searches in nice speech marks to run a stricter exact-phrase search, if they want to and know how to.)
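To make the OR-plus-phrase idea concrete, here is a rough sketch, assuming a tiny list of invented document titles. The `search` and `contains_phrase` functions are hypothetical, and the ranking (count of matched words) is far cruder than anything a real engine does.

```python
# Invented document titles standing in for an intranet.
DOCS = [
    "annual leave form",
    "holiday policy",
    "expenses claim form",
    "form for annual leave requests",
]


def contains_phrase(words, phrase_words):
    """True if phrase_words appears as a contiguous run inside words."""
    n = len(phrase_words)
    return any(words[i:i + n] == phrase_words for i in range(len(words) - n + 1))


def search(query):
    """OR search ranked by matched-term count; quoted queries match the exact phrase."""
    query = query.strip().lower()
    if len(query) > 2 and query.startswith('"') and query.endswith('"'):
        phrase_words = query[1:-1].split()
        return [d for d in DOCS if contains_phrase(d.split(), phrase_words)]
    terms = set(query.split())
    scored = [(len(terms & set(d.split())), d) for d in DOCS]
    # Keep anything matching at least one term, best matches first.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: s[0], reverse=True)
    return [d for _, d in ranked]


print(search("holiday form"))
print(search('"leave form"'))
```

With OR, “holiday form” still brings back the annual leave form rather than a “no results” page, and the user who knows about speech marks can narrow things down themselves.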

Is there crazy stemming/query expansion going on?

Query expansion is search systems trying to be clever, often getting it wrong, and not telling you what they’ve done so you can unpick it. Basically the search system is taking the words you gave it and giving you results for those words, plus some others that it thinks are relevant or related.

Typical types of expansion are stemming (expand a search for fish to include fishes and fishing), spelling correction, and synonyms (expand a search for cockerel to include rooster).

This is probably what is happening if you are getting results that don’t seem to include the words you searched for anywhere on the page (although metadata is another option).

Now this stuff can be really, really helpful. If it is any good.

Have you got smart sophisticated query expansion like Google?  Or does it do silly (from a day-to-day not a Latin perspective) stemming like equating animation with animals? If it is the silly version then definitely switch it off (or tweak it if you can).

Even if you’ve got smart expansion options available, it’s generally best practice to either give the user the option of running the expanded (or alternate) query, or at the very least of undoing it if you’ve got it wrong. They won’t always spot the options (Google puts lots of effort into coming up with the right way of doing this) but it’s bad search engine etiquette to force your query on a user.
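A rough sketch of what “tell the user and let them undo it” could look like, assuming two tiny hand-made expansion tables. `STEM_VARIANTS`, `SYNONYMS`, `expand` and `search` are all invented for illustration and don’t reflect how any particular product does it.

```python
# Hypothetical expansion tables; a real engine would have far larger ones.
STEM_VARIANTS = {"fish": ["fishes", "fishing"]}
SYNONYMS = {"cockerel": ["rooster"]}


def expand(terms):
    """Return the extra terms the expansion rules would add."""
    added = []
    for term in terms:
        for extra in STEM_VARIANTS.get(term, []) + SYNONYMS.get(term, []):
            if extra not in terms and extra not in added:
                added.append(extra)
    return added


def search(query, docs, allow_expansion=True):
    """OR search that reports which terms were added, so the UI can offer an undo."""
    terms = query.lower().split()
    added = expand(terms) if allow_expansion else []
    wanted = set(terms) | set(added)
    hits = [d for d in docs if wanted & set(d.lower().split())]
    return {"hits": hits, "added_terms": added}


docs = ["Rooster care guide", "Cockerel for sale", "Fishing permits"]
expanded = search("cockerel", docs)
strict = search("cockerel", docs, allow_expansion=False)
print(expanded)
print(strict)
```

Because the result carries `added_terms`, the results page can say something like “also showing results for rooster” and link to the same search with `allow_expansion=False`.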

Is the sort order sensible?

That Homebase example. The main problem here is sorting by price low-high. That’d be fine (actually very considerate of Homebase) if I’d navigated to a category full of rabbit hutches. But I didn’t. I searched for rabbit hutches and got a mixed bag of results that included plenty of things that a small child could tell you aren’t rabbit hutches.

The solution? Sort by relevancy.
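As a sketch of the difference, here are three of the products from that results page in a toy catalogue. The prices and the `keywords` field are invented (I’m guessing the pet carrier matched on some behind-the-scenes metadata), and the `relevance` score is just a count of query words found, nothing like a real ranking function.

```python
# Product titles are from the search above; prices and keywords are made up.
PRODUCTS = [
    {"title": "Small plastic pet carrier", "keywords": "rabbit guinea pig travel", "price": 9.99},
    {"title": "Beech Finish Small Corner Desk with Hutch", "keywords": "office furniture", "price": 59.99},
    {"title": "Pentland Rabbit Hutch", "keywords": "rabbit outdoor pet housing", "price": 79.99},
]


def relevance(product, query):
    """Count how many query terms appear in the product's title or keywords."""
    terms = query.lower().split()
    words = (product["title"] + " " + product["keywords"]).lower().split()
    return sum(1 for t in terms if t in words)


query = "rabbit hutch"
# OR matching: anything that mentions at least one query term gets in.
matches = [p for p in PRODUCTS if relevance(p, query) > 0]

by_price = sorted(matches, key=lambda p: p["price"])
by_relevance = sorted(matches, key=lambda p: relevance(p, query), reverse=True)

print([p["title"] for p in by_price])
print([p["title"] for p in by_relevance])
```

Same three matches either way. Price low-high buries the one actual rabbit hutch at the bottom, while relevancy puts it first.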

I’ve seen quite a lot of bad search set-ups recently where the sort order was set to alphabetical. Why? Unless, as Martin said when I bemoaned this on Twitter, your main use case is “to enable people to find stuff about aardvarks”.

News sites sometimes go with most recent as the sort order. Kinda makes sense but you need to be sure the top results are still relevant not just recent.

Interestingly, sort order doesn’t matter so much if you’ve gone for AND searches and you haven’t got any query expansion going on. If you’re pretty sure that everything in the result set is relevant, then you’ve got more freedom over sort order. If not, stick with relevancy.

(I don’t need to tell you that you want relevancy sorted high-low, do I?)

So people, stop giving me grief over navigation. Let’s talk about that rubbish search engine you’ve got. I could probably fix that for you.

Written by Karen

March 5th, 2010 at 6:04 am

Posted in search
