My own Google
I'm writing my own search engine. I know it sounds crazy, but it's just the sort of project that fuels my fire. For now, I'm limiting the scope of the search to URLs that have appeared on my Trend Sweet Trend page, which currently number more than 7,000. So far I've written three Perl scripts and set up five tables on my newly LAMP'd web server.
One table stores the URLs and the last time each was indexed. Another is a queue of URLs waiting to be indexed. A third stores the parsed text from the indexing operation. The other two are empty for now: one will hold a cache of each page, and the other will keep track of links between sites once I unleash a crawling feature.
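For the curious, here's roughly what that five-table layout could look like. This is only a sketch: it uses Python and SQLite instead of my actual Perl and MySQL setup so it can run anywhere, and every table and column name here (`urls`, `queue`, `terms`, `cache`, `links`, `last_indexed`, `occurrences`) is made up for illustration.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative guesses,
# and SQLite stands in for the MySQL database on the real server.
SCHEMA = """
CREATE TABLE IF NOT EXISTS urls (
    id INTEGER PRIMARY KEY,
    url TEXT NOT NULL UNIQUE,
    last_indexed TEXT            -- NULL until the page is first indexed
);
CREATE TABLE IF NOT EXISTS queue (
    url_id INTEGER PRIMARY KEY REFERENCES urls(id)
);
CREATE TABLE IF NOT EXISTS terms (
    url_id INTEGER NOT NULL REFERENCES urls(id),
    term TEXT NOT NULL,
    occurrences INTEGER NOT NULL,
    PRIMARY KEY (url_id, term)
);
CREATE TABLE IF NOT EXISTS cache (
    url_id INTEGER PRIMARY KEY REFERENCES urls(id),
    html TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS links (
    from_url_id INTEGER NOT NULL REFERENCES urls(id),
    to_url_id INTEGER NOT NULL REFERENCES urls(id),
    PRIMARY KEY (from_url_id, to_url_id)
);
"""


def create_schema(conn):
    """Create the five tables if they do not exist yet."""
    conn.executescript(SCHEMA)


def table_names(conn):
    """Return the names of all tables in the database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]
```

Calling `create_schema(sqlite3.connect(":memory:"))` builds the whole thing in a throwaway in-memory database, which is handy for poking at the design before committing to it.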
As for the three scripts, one grabs the URLs from my Trend Sweet Trend database and puts them in my search engine's crawling table. The second checks when each site in the URL table was last indexed and adds the URLs that need indexing to the queue. Finally, the third script uses curl to visit each page, parses out the text using HTML::Parser, and stores the count of each term in my search terms table. It's quite primitive, indexes a lot of garbage text, and can't do any boolean queries, but besides that it's great. Watch out, Google. Anybody got a spare datacenter with 100,000 commodity boxes?
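Here's a minimal sketch of the logic behind the second and third scripts. It's in Python rather than Perl, with the standard library's `html.parser` standing in for HTML::Parser, and it skips the curl fetch entirely by working on an HTML string. The names `needs_indexing`, `TextExtractor`, and `count_terms` are hypothetical, and the seven-day re-index window is an assumption I picked for the example, not the real setting.

```python
import re
from collections import Counter
from datetime import datetime, timedelta
from html.parser import HTMLParser


def needs_indexing(last_indexed, now, max_age=timedelta(days=7)):
    """True if a URL was never indexed or its index is older than max_age.

    The seven-day default is an assumed value for illustration only.
    """
    return last_indexed is None or now - last_indexed >= max_age


class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_terms(html):
    """Return a Counter mapping each lowercased term to its occurrence count."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Naive tokenizer: runs of ASCII letters and digits count as terms.
    return Counter(re.findall(r"[a-z0-9]+", text))
```

Dropping script and style bodies is one cheap way to cut down on the garbage text. The tokenizer is still as primitive as the rest: no stemming, no stop words, and no booleans.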