This is understandable, but at the same time, none of the anti-paywall lists are as good as archive.today. They actually have paid accounts at a bunch of paywalled sites, and use them when scraping.
Unfortunately, they’ve allegedly modified the contents of some archived articles, so even though they may be better at archiving, nothing they archive is of any value because it cannot be trusted.
What if somebody used archive.today to bypass a paywall and then archived that using Web Archive? (So we’re sure the content stays the same)
They’re injecting data into the pages during archiving, so that wouldn’t work.
So are they removing all other websites that post lies or modify their articles to suit their narrative at times?
Fox News? MSN? CNN? BBC? Reuters? AP?
Why the sudden urge to validate the archives? How many articles have been proven to be modified?
Seems like they’ve been wanting to remove an entity the empire doesn’t control and they’re using this as a cover to do it.
That’s exactly one of the main reasons they use archive sites for citations. But when an archival site does that, it becomes useless.

Good reminder to donate to web.archive.org
While archive.org is good and more trustworthy than archive.is, it isn’t as useful for bypassing paywalls.
But Wikipedia doesn’t need to bypass paywalls, and you can bypass them yourself with a bit of work.
There are websites with paywalls that even Bypass Paywalls Clean can’t bypass. In cases where it can, it sometimes just fetches the article contents from archive.today.
That doesn’t mean an alternative shouldn’t be found, but we also shouldn’t pretend that nothing is being lost by losing access to unpaywalled sources. For practical purposes, a paywalled source means no source for most readers, unless a non-paywalled alternative can be found to replace it.
That’s good for you, and it is okay for you to use archive.today personally, as long as you block their DDoSing.
But Wikipedia does not need to bypass paywalls, and they don’t require the source to be freely (or easily) viewable to verify the info.
I’m still deciding how much I agree or disagree with this. It’s true that they do cite books which you often can’t read online, but adding information backed up by a paywalled proof feels a bit “trust me bro”. E.g. I could find/create a site with an impossibly large paywall and no one would realistically be able to check my sources.
I do hope this move results in more support for the IA/Wayback Machine and helps them update some of their crawler tech. Thanks to the rise of AI, some sites are effectively (through CAPTCHAs etc.) or actively (through straight-up greed [coughRedditcough]) blocked from being archived almost entirely, which is frustrating for legit archivists/contributors.
For anyone curious, I looked into the DDoSing, and what was done was that a simple string of JavaScript was added to archive[.]today that made a background request to the blog with a randomly generated search parameter. Every time someone looked at an archive, they unknowingly sent a request to the blog under attack.
Good reminder to pay for journalism.
The Guardian, Le Monde, El País, Tageszeitung and many others need subscribers to stay independent of the oligarchs.
The Guardian is surviving by slowly becoming a tabloid. Not sure I would have paid for it anyway, and I’m not sure this was preventable by paying for it in the first place.
I appreciate the Guardian a lot more than I did before, now that someone gave me a NYTimes subscription and I’ve seen how bad they are now. For all the Guardian’s faults, they do still break some stories and cover the news somewhat comprehensively, perhaps better than the Times, which is too busy trying to cover for Israel to report honestly on Epstein and has apparently surrendered to the administration besides.
Paying for journalism is ideal, but unfortunately it makes it difficult to cite/link to a source the way Wikipedia needs to in order to ensure the information remains open and accessible.
Admittedly, I’m not familiar enough with these outlets to know whether their paywalls are significant, but the problem with direct article links is that those links can change. Archival services (I suppose not archive[.]is) are important for ensuring those articles remain accessible in the format they were presented in.
I’ve come across a number of older Wikipedia articles about more minor or obscure events where links lead to local news outlet websites that no longer exist or were consumed by larger media outlets, and as a result no longer provide an appropriate citation.
Paying for journalism simply reinforces the idea that those who don’t pay for it don’t get it, i.e. more paywalls, not fewer.
So what you’re saying is if we refuse to pay for journalism long enough, the journalists will eventually give up and just work for free? Not have to travel for their investigations, eat nothing and need no private home?
Democracy isn’t possible without an independent press.
Epstein was prosecuted because the frigging Miami Herald reported on his abuses in 2018. He would have continued raping and trafficking kids for who knows how long without that. In a world where the media is owned by Epstein, that won’t happen.
What democracy? Every person in the leadership of America and most of the world was either friends with Epstein or on his payroll.
They’re already mostly owned and working for ultra-rich interests. There have been plenty of outlets over the years that had paying users; they’re mostly owned at this point. Those that aren’t are getting quite click-baity.
Capitalism is hard on news. Fascism is worse.
It’s not our fault the media decided to switch to a subscription model while not providing a product worth paying a subscription for, even before they downgrade it every year.
It’s a problem, but one of their own making.
I haven’t said that journalists have to work for free. Just that we don’t have to be the ones who are trickled out to feed them. It doesn’t have to be “poors vs workers”, unlike what the media is telling you, ya know? A better system is possible.
Huh, I don’t get that argument. To me, it seems that citizens paying journalists is desirable. I’m genuinely curious, who else should pay them in your view?
It could be the citizens but done indirectly, for example via taxes. Even better, not all citizens: just tax the rich and put the money into a journalism pool, so the rich can’t choose to benefit any particular newspaper or editorial line.
Also remember that the journalists who need support the most are local papers and news stations. The big ones have plenty of donors, and while they’re worth supporting, they’re less likely to completely collapse than the news outlets run in your city.
Go look for that independent source. They will report more news that actually affects you as well.
If this is not an announcement, Lemmy lets you edit your post titles so you can correct that mistake instead of luring in people who think lemmy.world is also banning links using archive.today.
I’m not speculating on your intent, only pointing out that you can correct this situation instead of apologizing after the fact.
https://lemmy.world/c/ukraine was where I saw this. I didn’t write it. Thought Lemmy would have linked to the original; was wrong. FYI.
The root of the problem is that Wikipedia not having local snapshots leaves its articles vulnerable to eroding sources.
Is it reasonable for them to keep their own local snapshots?
That’s not a trivial amount of work and data, particularly if it’s multimedia.
I think it’s a concerning issue affecting long-term viability of the platform. It’ll only get worse as time goes on and sources go offline.
Okay, so what’s the current go-to alternative for bypassing paywalls?
I’m afraid there aren’t any. You can use the Bypass Paywalls Clean extension, though.
Oh well, archive.today it is in the meantime I guess.
Copy the headline and find the same thing free somewhere else. Usually it’s a news site full of unreadable slop. Paywalls used to be almost worth bypassing. No more. Just another money grab, pretending to protect valuable information. Not.
Fair point. Very few if any news sites provide unique articles.
I’ve switched to .md when the community mentioned something was up with the .today domain. Hopefully that one isn’t compromised.
It’s the same person running all of them, so yeah it is.
Damn.
URLs:
archive[.]today
archive[.]fo
archive[.]is
archive[.]li
archive[.]md
archive[.]ph
archive[.]vn
archiveiya74codqgiixo33q62qlrqtkgmcitqx5u2oeqnmn5bpcbiyd[.]onion

That’s very 1984 of them
Democracy died in daylight; the darkness hides the rotten body.
It’s relatively possible it never got out of the planning stages intact.
Or ever made it into planning?
Bro, any archiving/scraping tool can be used for DDoS. You just tell it to archive the same site over and over, and now you have a different IP spamming the endpoint.
In this case, their CAPTCHA page intentionally included code to DoS a particular blog, sending a request to search for a random string every 300ms (search is very CPU-intensive). This was regardless of the archived site you were trying to view.
Any good archiver will check for an archived copy before making a request, and will batch requests. This was very different from the attack you’re imagining: if you opened any archive.today page, it would poll a developer’s personal blog, regardless of whether you were interacting with content from that blog.
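For illustration, here’s a minimal sketch of that check-first, rate-limited pattern (the lookup endpoint and its response handling are placeholders I made up, not any real archive’s API):

```typescript
// Sketch only: "archive.example" and its lookup endpoint are hypothetical placeholders.
const ARCHIVE_LOOKUP = "https://archive.example/lookup?url=";

// Ask the archive whether a snapshot already exists before touching the live site.
async function alreadyArchived(url: string): Promise<boolean> {
  const res = await fetch(ARCHIVE_LOOKUP + encodeURIComponent(url));
  return res.ok; // pretend 200 = snapshot exists, 404 = not archived yet
}

// Archive a batch of URLs politely: skip what's already archived, rate-limit the rest.
async function archiveBatch(urls: string[]): Promise<void> {
  for (const url of urls) {
    if (await alreadyArchived(url)) continue;      // reuse the existing snapshot
    await fetch(url);                              // one fetch per page for the snapshot
    await new Promise((r) => setTimeout(r, 2000)); // wait between live requests
  }
}
```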
Don’t know all the details, FYI, basically. I forget where I saw the same site mentioned for the same thing. Don’t call me bro, Bro.
Everyone seems to be ignoring the fact that he only did this in response to a malicious dox attempt.
He only modified archived pages in response to a dox attempt?
And the thing is, the discovery of the modified pages revealed that it wasn’t even the first time he’d modified pages. And he used a real person’s identity to try and shift blame.
Irrespective of the doxxing allegations, if he’s done all this multiple times already, it means the page archives can’t be trusted AND there’s no guarantee that anything archived with the service will be available tomorrow.
Seems like we need to switch to URLs that contain the SHA256 of the page they’re linking to, so we can tell if anything has changed since the link was created.
Actually a pretty good idea.
Only works for archived pages though, because for any regular page, a large portion of the page will be dynamically generated; hashing the HTML will only say the framework hasn’t changed.
You would need a way of verifying that the SHA256 corresponds to a true copy of the site at the time, though, and not a faked page. You could do something like have a distributed network of archives that coordinate archival at the same time, then use the SHA256 to see which archives fetched exactly the same page at the same moment through some search functionality. If add-ons are already being used for the crawling, then we may be mostly there already, since those add-ons would just need to certify their archive and could discard the actual copy of the page afterwards. You would need a way to validate those workers, though, since a bad actor could run a whole bunch at the same time to legitimise a fake archival.
The idea is to verify the archival copy’s URL, not to verify the original content. So yes, a server could push different content to the archiver than to people, or vary it by region, or an AitM could modify the content on its way to the archiver. But adding the SHA256 as a URL query parameter means that if someone publishes a link to an archive copy online, anyone else using the link can know they’re looking at the same content the other person was referencing.
If the archive content changes, that URL will be invalid; if someone uses a fake hash, the URL will be invalid (which is why MD5, with its known collision attacks, wouldn’t be appropriate).
The beauty of this technique is that query parameters are generally ignored if unsupported by the web server, so any archival service could start using this technique today, and all it would require is a browser extension to validate the parameter.
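To make that concrete, the validation itself could be as small as this sketch (the `sha256` parameter name is just my assumption for illustration, not an existing convention of any archive):

```typescript
// Sketch only: assumes the archive link carries a hypothetical "sha256" query parameter.
async function verifyArchiveLink(archiveUrl: string): Promise<boolean> {
  const expected = new URL(archiveUrl).searchParams.get("sha256");
  if (!expected) return false; // nothing to verify against

  // Fetch the archived copy and hash its raw bytes with Web Crypto.
  const body = await (await fetch(archiveUrl)).arrayBuffer();
  const digest = await crypto.subtle.digest("SHA-256", body);
  const actual = Array.from(new Uint8Array(digest))
    .map((b) => b.toString(16).padStart(2, "0"))
    .join("");

  // Any change to the archived bytes, or a forged hash, makes these differ.
  return actual === expected.toLowerCase();
}
```

A real extension would presumably hash the bytes it actually rendered instead of re-fetching, but the comparison is the same.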
Link it to something like Web of Trust, and you’ve solved the separate issue you described.
In fact, this is a feature WoT could add to their extension today, and it would “Just Work”. For that matter, Archive.org could add it to their extension today, too.
Remind me to ping Jason about that.
> Seems like we need to switch to URLs that contain the SHA256 of the page they’re linking to, so we can tell if anything has changed since the link was created.
IPFS says hi
Yes; the problem IPFS has is the same problem IPv6 has: adoption, since it requires everyone to switch to something new.
The hash-in-a-URL solution can function cleanly in the background on top of what people already use.
IPFS has gateways though, so you can link to the latest version of a page which can be updated by the owner, or alternatively link to a specific revision of the page that is immutable and can’t be forged.
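For example, the two flavors of gateway link look roughly like this (the identifiers are shortened placeholders, not real content):

```typescript
// Placeholders only: "bafy..." and "k51..." stand in for real identifiers.
const immutableLink = "https://ipfs.io/ipfs/bafy..."; // content-addressed CID: changing the page changes the link
const mutableLink   = "https://ipfs.io/ipns/k51...";  // IPNS name: the owner can repoint it at newer content
```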
As they should since it doesn’t matter.
It wasn’t a dox attempt though. The blog just collected information that was already publicly available on other sites.
Unfortunately, they shot themselves in the foot by responding the way they did. They basically did the job of anyone who wants them taken down and not trusted. It was probably the worst way they could have reacted. Such a tragedy to lose such a valuable website.
Who cares why they did it?
It proves they can and do alter the “archived” website, so its usefulness as a source is completely gone.
Archiving a site inherently requires altering it, to change embed URLs, scripts, etc. The fact they had that capability was never in question.
Yeah, ESH. His response of editing an archive showed the site to be unreliable as an archive. DDoSing from the site as a counter to the dox attempt caused the site serious reputational harm as well.
It sucks because his site was actually more reliable than The Internet Archive.