[rust-dev] robots.txt prevents Archive.org from storing old documentation

Isaac Dupree ml at isaac.cedarswampstudios.org
Mon Jul 14 19:30:35 PDT 2014


On 07/14/2014 09:56 PM, Chris Morgan wrote:
> On Tue, Jul 15, 2014 at 4:16 AM, Brian Anderson <banderson at mozilla.com> wrote:
>> Can somebody file an issue describing exactly what we should do and cc me?
> 
> Nothing. Absolutely nothing.
> 
> robots.txt rules do not apply to historical data; if archive.org has
> archived something, the introduction of a new Disallow rule will not
> remove the contents of a previous scan.

Although that is the robots.txt standard, archive.org does retroactively
apply robots.txt Disallow rules to already-archived content.
https://archive.org/about/exclude.php

> It therefore has three months in which to make a scan of a release
> before that release is marked obsolete with the introduction of a
> Disallow directive.
> 
> This is right and proper. Special casing a specific user agent is not
> the right thing to do. The contents won’t be changing after the
> release, anyway, so allowing archive.org to continue scanning it is a
> complete waste of effort.

It's my understanding that archive.org doesn't have the funding to
reliably crawl everything on the Web promptly.  I agree with the
principle that "Special casing a specific user agent is not the right
thing to do," but I also support the Internet Archive's mission.

Another option is an `X-Robots-Tag: noindex` HTTP header, which is more
robust at banning indexing[1] and still allows archiving (whereas
`X-Robots-Tag: noindex, noarchive` would disallow it).  It's likely less
robust from the perspective of keeping our website serving that header
consistently long-term, though.  For HTML files, the same directives can
also go in a <meta name="robots"> tag in the head.
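
As a rough illustration of the header approach, here is a minimal sketch
in Python, using only the standard library.  The path prefixes, the
function name `robots_tag_for`, and the port are made up for the example;
the real site would more likely set this in its web server configuration.

```python
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical prefixes for superseded documentation (assumption, not
# the real layout of the Rust documentation site).
OBSOLETE_PREFIXES = ("/0.9/", "/0.10/")


def robots_tag_for(path, allow_archiving=True):
    """Return the X-Robots-Tag value for a path, or None if none is needed."""
    if not path.startswith(OBSOLETE_PREFIXES):
        return None  # current docs: leave indexing alone
    # "noindex" alone keeps pages out of search results; adding
    # "noarchive" is the stricter variant that also forbids archiving.
    return "noindex" if allow_archiving else "noindex, noarchive"


class DocsHandler(SimpleHTTPRequestHandler):
    """Static file handler that tags obsolete pages on every response."""

    def end_headers(self):
        tag = robots_tag_for(self.path)
        if tag is not None:
            self.send_header("X-Robots-Tag", tag)
        super().end_headers()


def serve(port=8000):
    """Serve the current directory locally with the extra header."""
    ThreadingHTTPServer(("127.0.0.1", port), DocsHandler).serve_forever()
```

Calling `serve()` and requesting a file under one of the listed prefixes
should show the extra header in the response; anything else is served
without it.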

-Isaac

[1] (Google can still list a robots.txt-disallowed page as a search
result if many sites it trusts link to that page)


