This project started so innocently. I had attempted to use a "commercial" indexing application on my web site, which resulted in what amounted to a SYN flood attack on my site. That got me wondering just how hard it is to programmatically obtain the content of a web page, and I quickly found out it isn't difficult at all. So, what to do with the result?
Although I doubt I need indexing (or, at least, need to compete with the Big Boys out there), I have been curious about what pages are actually in my web site. And, of course, once you get to programming something, it becomes reason enough just to keep going. There were many, many issues, and it was (sometimes) fun solving them. Sometimes it was simply no fun at all.
Once I had my four test sites successfully mapping (during which I was introduced to non-standard HTML, my own!), I began to work through my bookmarked sites (where appropriate). It was fascinating to see that sites are often marked by particular styles of coding, some of which broke SiteMapper. For example, one woman is something of a neat freak, and insists on placing each attribute of an A (anchor) tag on its own line. This broke my regular expression, which did not tolerate line breaks inside a tag. Some of my target sites are Microsoft SharePoint sites, and they required some special treatment. There were other problems too, which have since faded from memory (probably into nightmares).
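To illustrate the line-break problem, here is a minimal Python sketch. It is not SiteMapper's actual pattern or code: the sample markup, both regular expressions, and the `extract_links` helper are assumptions made up for this example. It only shows how a pattern that expects an anchor tag on one line can miss a tag whose attributes are spread over several lines, and one way to make it more tolerant.

```python
import re

# Hypothetical sample markup: the first anchor puts each attribute on its
# own line, the second keeps everything on a single line.
HTML = '''<a
   href="/about.html"
   title="About">About</a>
<a href="/contact.html">Contact</a>'''

# Naive pattern (an assumption, not the original): it requires a literal
# space after "<a", and "." does not match a newline by default.
NAIVE = re.compile(r'<a .*?href="([^"]*)"', re.IGNORECASE)

# More tolerant pattern: \s matches any whitespace, including newlines,
# and [^>] matches any character except ">", newlines included.
TOLERANT = re.compile(r'<a\s[^>]*?href="([^"]*)"', re.IGNORECASE)


def extract_links(pattern, html):
    """Return every href value the given pattern captures in html."""
    return pattern.findall(html)


if __name__ == "__main__":
    print(extract_links(NAIVE, HTML))
    print(extract_links(TOLERANT, HTML))
```

With this sample, the naive pattern finds only the single-line anchor, while the tolerant one finds both. Regular expressions remain fragile against real-world HTML, so a proper HTML parser may be the safer choice where one is available.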
I really haven't tried all that many sites, so do let me know if you find another that breaks the application.