The startup that brought me to Seattle was building a system to locate, within a repository, documents that resembled the one currently on the user’s display (we called it “no click search” since it required no keywords or query language: we knew what you wanted). For reasons I won’t go into here, the company failed, but the idea is still pretty compelling to me.
Unfortunately, not much thought was put into how this technology might work in different problem domains. It was all about websites, and we know now what a sound business proposition web content sites turned out to be. It really could have been effective as a back-office or editorial research tool: see the Google Search Appliance for an example.
How it worked was crude but effective at small scale. A web crawler process would ingest documents, parse them down to content (removing navigational chrome and HTML tags), boil off the stop words and other non-meaningful stuff, then build a vector from the most frequently appearing words in the remaining extract. That vector was used for comparisons with other documents, letting the user ask for “more like this”: documents similar to the one currently on screen. I wish it were still running somewhere: it was really something to see. On days when nothing seemed to be going well, I would actually use the system on our customers’ content (the Seattle Post-Intelligencer was the first to sign up) and it never failed to put things in perspective.
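For the curious, here’s roughly what that pipeline looks like in code. This is a from-memory sketch in Python, not the original implementation; the stop-word list, the naive tokenizer, and the top-50 cutoff are all my stand-ins for whatever we actually used.

```python
import math
import re
from collections import Counter

# A token stop-word list for illustration; the real one was much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
              "for", "on", "that", "this", "with", "as", "was", "be"}

def extract_vector(html: str, top_n: int = 50) -> Counter:
    """Strip tags, drop stop words, keep the most frequent remaining terms."""
    text = re.sub(r"<[^>]+>", " ", html)         # crude tag/chrome removal
    words = re.findall(r"[a-z]+", text.lower())  # naive tokenization
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return Counter(dict(counts.most_common(top_n)))

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Compare two term-frequency vectors; 1.0 means identical profiles."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def more_like_this(current: Counter, corpus: dict[str, Counter]) -> list[tuple[str, float]]:
    """Rank every document in the corpus against the one on screen."""
    ranked = [(doc_id, cosine_similarity(current, vec)) for doc_id, vec in corpus.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```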
But what’s really needed, I think, is the next step after assigning or deriving a vector. Look at the old Dewey Decimal system or the newer CIP structure: why can’t the same be done for documents, such that each one gets a unique code based on what it is or contains?
I still think there’s some kind of hierarchy or structure waiting to be built. It seems likely to me that it will have to be based on a lexicon or domain-specific thesaurus: one for law would not work for aerospace engineering, frinstance. So there’s no way to fully auto-generate the classification map yet, though I do see that as the end goal. I’d like to think that, given a document of meaningful size and therefore a representative extract (this starts to resemble statistics), a document could be “filed” under some code or other arbitrary representation and then evaluated against others.
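To make the “filing” idea concrete, here’s a toy sketch of what a lexicon-driven filer might look like, using a few real Dewey law classes as the codes. The seed terms and weights are invented for illustration; a real classification map would come from a curated domain thesaurus, which is exactly the part that resists auto-generation.

```python
from collections import Counter

# Hypothetical slice of a law lexicon: Dewey codes mapped to seed terms.
# 345 is criminal law, 346 private law, 347 courts and civil procedure.
LAW_LEXICON = {
    "345": Counter(criminal=3, prosecution=2, defendant=2, sentencing=1),
    "346": Counter(contract=3, breach=2, tort=2, liability=1),
    "347": Counter(court=3, procedure=2, appeal=2, civil=1),
}

def overlap(doc: Counter, seed: Counter) -> int:
    """Weighted overlap between a document vector and a category's seed terms."""
    return sum(doc[t] * w for t, w in seed.items())

def file_document(doc: Counter) -> str:
    """'File' the document under the code whose seed terms it best matches."""
    return max(LAW_LEXICON, key=lambda code: overlap(doc, LAW_LEXICON[code]))

doc = Counter(contract=5, breach=3, damages=2, court=1)
print(file_document(doc))  # -> "346"
```

Once documents carry codes like these, comparing them reduces to comparing codes, or grouping by code prefix the way Dewey’s own hierarchy does.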