More like this

Ben Hammersley.com: Latent Semantic Indexing in the Guardian

Search sucks. No matter how clever your search engine’s system is, no matter how many clever page-ranking formulae you apply, or how many super-speedy processors you throw at it, the current way to search the internet doesn’t work very well.

Searching by a keyword misses out a boatload of stuff, for one simple reason: many of the documents you might find useful do not contain the keyword. Consider this: you want to find documents on Iraqi politics. You’re a leader writer, perhaps, and one sherry too far gone. You turn to Google, and what do you search for? “Iraqi politics”? “Saddam Hussein”? “Abd al-Rahman Arif”? Well, yes, all of these – and each one will be useful, but none will give you the entire picture. You want the search to return not just the keyword hits, but documents on the same topic that don’t necessarily mention the keyword.

Here is the conference blurb on a presentation about LSI.

Latent semantic indexing (LSI) is an information retrieval technique known to substantially improve recall in full-text search engines. LSI works by applying a dimensionality reduction technique called singular value decomposition (SVD) to a vector space data model, reducing noise and bringing out latent relationships within the data. While most of the research on LSI has been done in the domain of text searches, where LSI search engines can actually retrieve relevant documents that do not match any keyword in a query, the linear algebra implementation of the technique makes it applicable to a wide range of problems in bioinformatics, including gene and protein sequencing, gene regulatory networks, and medical imaging. Many of these potential applications remain completely unexplored.
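To make the blurb a little more concrete, here is a minimal sketch of the idea in Python with NumPy. It is not anyone's production system: the five-document corpus, the raw word counts (real engines usually apply a weighting such as tf-idf), the choice of k = 2, and the helper names `to_vector` and `lsi_scores` are all assumptions made up for illustration. It builds a term-document matrix, applies SVD, keeps the top k dimensions, and ranks documents against a query by cosine similarity in the reduced space.

```python
import numpy as np

# Toy corpus (hypothetical). Document 2 never mentions "iraq".
docs = [
    "iraq politics saddam",
    "saddam hussein baghdad",
    "hussein baghdad regime",
    "football league goal",
    "goal football match",
]

# Vocabulary and a term -> row index lookup.
vocab = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocab)}


def to_vector(text):
    """Raw term-count vector over the vocabulary (unknown words are ignored)."""
    vec = np.zeros(len(vocab))
    for word in text.split():
        if word in index:
            vec[index[word]] += 1.0
    return vec


# Term-document matrix: one row per term, one column per document.
A = np.column_stack([to_vector(doc) for doc in docs])

# Singular value decomposition, then keep only the k strongest dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed number of latent "topics" for this toy corpus
U_k = U[:, :k]

# Each document projected into the k-dimensional latent space.
doc_vecs = (U_k.T @ A).T  # shape: (number of documents, k)


def lsi_scores(query):
    """Cosine similarity between the projected query and every document."""
    q = U_k.T @ to_vector(query)
    denom = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q)
    return doc_vecs @ q / np.maximum(denom, 1e-12)


scores = lsi_scores("iraq")
for doc, score in zip(docs, scores):
    print(f"{score:6.3f}  {doc}")
```

In this toy corpus the third document shares no word with the query, yet it should score about as highly as the documents that do mention "iraq", because it shares terms with documents that share terms with them. The football documents should score near zero. That is the "more like this" behaviour the blurb describes, at a scale small enough to check by hand.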

This idea, though we never called it LSI, was at the core of the startup I moved out to Seattle for three years ago.