Retroactive Robots Exclusion at the Wayback Machine
I recently discovered exactly how the Wayback Machine deals with changes to robots.txt.
First, some background. I have a weblog I’ve been running since 2002, switching from B2 to WordPress and changing the permalink structure twice (with appropriate HTTP redirects each time) as nicer structures became available. Unfortunately, some spiders kept hitting the old URLs over and over again, even though those URLs answered with a 301 permanent redirect to the new locations. So, foolishly, I added the old links to robots.txt to get the spiders to stop.
Flash forward to earlier this week. I’ve made a post on Slashdot, which reminds me of a review I did of Might and Magic IX nearly four years ago. I head to my blog, pull up the post… and to my horror, discover that it’s missing half a sentence at the beginning of a paragraph and I don’t remember the sense of what I originally wrote!
My backups are too recent (ironic, that), so I hit the Wayback Machine. They only have the post going back to 2004, which is still missing the chunk of text. Then I remember that the link structure was different, so I try hitting the oldest archived copies of the main page, and I’m able to pull up the summary with a link to the original location. I click on it… and I see:
Excluded by robots.txt (or words to that effect).
Now, this page was not blocked when ia_archiver spidered it; it was only blocked later. The Wayback Machine had retroactively blocked access to the page based on the current robots.txt content. I searched through the documentation and couldn’t determine whether the data had actually been removed or just hidden, so I decided to alter my site’s robots.txt file, fire off a request for clarification, and see what happened.
As it turns out, several days later, they unblocked the file, and I was able to restore the missing text.
In summary, the Wayback Machine will block end-users from accessing any archived page that your current robots.txt file disallows, even if the page was allowed when it was crawled. If you remove the restriction from your robots.txt, access is re-enabled (in my case, after several days), but only for pages it had archived in the first place.
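
If you want to check ahead of time whether a robots.txt change would hide an old URL from a crawler such as ia_archiver, something like the following rough sketch may help. It uses Python's standard urllib.robotparser module. The robots.txt contents, the /archives/ path, and the example.com URLs are all made up for illustration, and the check only evaluates the robots.txt rules themselves; it says nothing certain about how the Wayback Machine applies them internally.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, url: str, agent: str = "ia_archiver") -> bool:
    """Return True if the given robots.txt text disallows `url` for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# Hypothetical robots.txt that blocks an old permalink structure.
ROBOTS_WITH_OLD_LINKS = """\
User-agent: *
Disallow: /archives/
"""

# Hypothetical robots.txt after the restriction has been removed.
ROBOTS_AFTER_FIX = """\
User-agent: *
Disallow:
"""

# Hypothetical old-style permalink.
OLD_URL = "http://example.com/archives/000123.html"

if __name__ == "__main__":
    print("blocked before fix:", is_blocked(ROBOTS_WITH_OLD_LINKS, OLD_URL))
    print("blocked after fix: ", is_blocked(ROBOTS_AFTER_FIX, OLD_URL))
```

Running a check like this against your live robots.txt before adding a Disallow line is a cheap way to spot rules that would also cover old URLs you may someday want to look up in an archive.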