• Hot Saucerman@lemmy.ml
    link
    fedilink
    English
    arrow-up
    4
    arrow-down
    1
    ·
    edit-2
    9 months ago

    Books3 is the definition of “not publicly available” because it’s all from pirated material downloaded from private torrent tracker Bibliotik.

    Books3 is literally why several of AI groups are being sued by various authors like Sarah Silverman and George R.R. Martin.

    Books3 was always illicitly obtained material which put into question whether an LLM using it could really fall under Fair Use. (It most likely does, but it’s still a legal question that hasn’t been answered yet.)

    Books3 Link: https://huggingface.co/datasets/the_pile_books3

    Books3 Description from Link:

    This dataset is Shawn Presser’s work and is part of EleutherAi/The Pile dataset.

    This dataset contains all of bibliotik in plain .txt form, aka 197,000 books processed in exactly the same way as did for bookcorpusopen (a.k.a. books1). seems to be similar to OpenAI’s mysterious “books2” dataset referenced in their papers. Unfortunately OpenAI will not give details, so we know very little about any differences. People suspect it’s “all of libgen”, but it’s purely conjecture.