damn i really hope they stay. this right after their Spotify crawl and domain suspension doesn’t inspire hope.

    • mrmaplebar@fedia.io
      arrow-up
      6
      ·
      5 hours ago

      Maybe I’m missing something, but I’m confused how they can promise “high speed access” to the data while also claiming:

      We do not host any copyrighted materials here. We are a search engine, and as such only index metadata that is already publicly available. When downloading from these external sources, we would suggest to check the laws in your jurisdiction with respect to what is allowed. We are not responsible for content hosted by others.

      Do they have the data or do they not have it?

      They also claim to be able to do things like extract text and deduplicate the data… That seems to suggest a significant amount of storage and compute power for a non-profit that has only been around for ~3 years.

      I find this entire thing fishy as fuck. Call me a conspiracy theorist, but I’m not convinced that the entire existence of this data theft operation isn’t simply to be an illicit data broker for AI companies. And now there is direct evidence tying both Anthropic and Nvidia to them.

      • hexagonwin@lemmy.sdf.orgOP
        English
        arrow-up
        3
        ·
        4 hours ago

        i think they mean they’ll provide direct access to data hosted by "third parties" (torrents?), without the captchas and throttling/rate limiting present when normally using the Anna’s Archive website

        they’re asking for text extraction and dedup in exchange for providing datasets. at least publicly they claim this whole project is aimed at data preservation and wide access… they’re mostly aggregating/collecting data from other shadow libraries and even if they have malicious(?) intent, i’d say they’re a net positive since their code and data are mostly(?) open sourced.

      • B0rax@feddit.org
        English
        arrow-up
        3
        ·
        4 hours ago

        Nono, they need deduplication and text extracts in exchange for access.

  • Almacca@aussie.zone
    English
    arrow-up
    20
    arrow-down
    1
    ·
    23 hours ago

    Have you seen the quality of some of those OCR scans? I’m reading the Stainless Steel Rat books from Anna’s Archive right now, and the number of errors is ridiculous, and it’s not an isolated case. Pretty much every one I’ve read had at least a few. Good luck getting decent training data from them.

  • BlueSquid0741@lemmy.sdf.org
    English
    arrow-up
    20
    ·
    1 day ago

    Anna’s Archive is the perfect place to find specific translations of ebooks. Something I hadn’t thought of the need for until recently.