[rust-dev] robots.txt prevents Archive.org from storing old documentation

Evan G eg1290 at gmail.com
Mon Jul 14 19:34:54 PDT 2014


It's not about "special casing a user agent"; it's about archiving duplicate
copies of old documents. Right now, everything is crawled from the
current docs, and none of the archived docs are allowed. With this change,
the IA would store multiple copies of old documentation: once as the "old"
entry for docs.rust-lang.org/ and once as the "new" entry for
docs.rust-lang.org/0.9/.
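
Roughly, as I read it (the paths here are my guess at the layout, not the
real robots.txt):

    # today: the old release trees are disallowed for every crawler
    User-agent: *
    Disallow: /0.9/
    Disallow: /0.8/
    # ...and so on for each old release

    # the change would open those trees back up, e.g. by dropping the
    # Disallow lines or by special-casing the IA's crawler
    # (an empty Disallow allows everything for that agent):
    User-agent: ia_archiver
    Disallow: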

At least that's how I'm understanding the situation. Also, if you're really
interested, all you have to do is a "git checkout 0.9" and run rustdoc.
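
Something along these lines (the repo URL, tag name, and doc target are
from memory, so treat this as a sketch rather than exact commands):

    git clone https://github.com/rust-lang/rust.git
    cd rust
    git checkout 0.9            # the release tag for that version
    ./configure && make docs    # or point rustdoc at the crate roots directly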


On Mon, Jul 14, 2014 at 9:30 PM, Isaac Dupree
<ml at isaac.cedarswampstudios.org> wrote:

> On 07/14/2014 09:56 PM, Chris Morgan wrote:
> > On Tue, Jul 15, 2014 at 4:16 AM, Brian Anderson <banderson at mozilla.com>
> > wrote:
> >> Can somebody file an issue describing exactly what we should do and cc
> >> me?
> >
> > Nothing. Absolutely nothing.
> >
> > robots.txt rules do not apply to historical data; if archive.org has
> > archived something, the introduction of a new Disallow rule will not
> > remove the contents of a previous scan.
>
> Although that is the robots.txt standard, archive.org does retroactively
> apply robots.txt Disallow rules to already-archived content.
> https://archive.org/about/exclude.php
>
> > It therefore has three months in which to make a scan of a release
> > before that release is marked obsolete with the introduction of a
> > Disallow directive.
> >
> > This is right and proper. Special casing a specific user agent is not
> > the right thing to do. The contents won’t be changing after the
> > release, anyway, so allowing archive.org to continue scanning it is a
> > complete waste of effort.
>
> It's my understanding that archive.org doesn't have the funding to
> reliably crawl everything on the Web promptly.  I agree with the
> principle that "Special casing a specific user agent is not the right
> thing to do." but I also support the Internet Archive's mission.
>
> Another option is an `X-Robots-Tag: noindex` HTTP header, which is more
> robust at preventing indexing[1], and it still allows archiving (whereas
> `X-Robots-Tag: noindex, noarchive` would disallow it).  It's likely less
> robust from the perspective of keeping our website serving that header
> consistently long-term, though.  For HTML files, the same directive can
> also go in a robots <meta> tag in the head.
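>
> As a rough sketch (the nginx config is purely hypothetical; I haven't
> checked what actually serves docs.rust-lang.org):
>
>     # hypothetical nginx location block for an old release tree
>     location /0.9/ {
>         add_header X-Robots-Tag "noindex";
>     }
>
> and, per HTML file, the equivalent robots meta tag in the <head>:
>
>     <meta name="robots" content="noindex">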
>
> -Isaac
>
> [1] (Google can still list a robots.txt-disallowed page as a search
> result if many sites it trusts link to that page)
>