Danbooru

Waifu Diffusion

Posted under General

This topic has been locked.

Legitimate question: is it not possible for creators of these art AIs to scrape Pixiv in the same way? I mean, if someone wants to use publicly available art to train an AI, then they can get at it whether the artist likes it or not. I'm not saying that it's a good thing to do, especially if the artist takes issue with it, but placing blame on an imageboard seems a little like scapegoating to me.

Granted, one artist that I saw complaining also said that Danbooru hosts paid rewards. While it's true that we do to a certain extent, we haven't accepted them for a couple of years at this point, specifically because it ran the risk of pissing artists off enough to DMCA the place. So it seems like it's just venting.

Grahf said:

Legitimate question: is it not possible for creators of these art AIs to scrape Pixiv in the same way?

Sure that’s possible, but the thing is, the art itself is somewhat useless if you don’t know what it is. I’m pretty sure the AI-makers scraped Danbooru because it is known as the best-tagged anime image board. Nai Diffusion, for example, works on the same tags as Danbooru to give its users what they want because exactly those tags have been used as input to train the AI.

Pixiv tags are pretty useless for that. Not only do they allow only a very small amount of tags (12, I believe?), which is not enough to actually describe an image, but those tags are often used inconsistently, with several different tags used for the same thing, requiring artists to use the precious little amount of tags for redundancy, leaving even less room to describe anything besides characters and copyright. From the perspective of an AI developer, Pixiv tags are absolutely useless. Scraping a site with untagged images is of course possible, but will only let you generate basically random images.

linkining0006 said:

I cannot believe I am making an account just to make this reply but […]

Hi friend o/ I see you’ve found your way here from Twitter too. Getting rid of the API wouldn’t help because it’s rather easy to just scrape the webpage itself and parse the HTML. It would just make it more annoying to do, but would not stop anyone with a commercial interest in getting the data.

Grahf said:

I'm not saying that it's a good thing to do, especially if the artist takes issue with it, but placing blame on an imageboard seems a little like scapegoating to me.

Pixiv and most other image boards in general don’t have nearly as good tags as Danbooru does. AI training needs data. Good data in, good model out.

Recently this tweet made the rounds and has resulted in a lot of takedown requests: https://twitter.com/elf_248/status/1576837031140855809. A site called NovelAI (https://novelai.net) recently announced a paid AI generation tool that they advertised as being trained on Danbooru (without our permission). Artists have picked up on this and are placing the blame on Danbooru.

This is my reply to someone who asked me about it via email:

We have no affiliation with NovelAI and we don't support or endorse what they do. They're doing this without our permission. Artists who don't want their works to be used by NovelAI should ask NovelAI directly to remove their works from their training data and to ban their name from their prompt system. Removing an artist's works from Danbooru does nothing to stop NovelAI from using the data they already have, or from taking their works directly from Pixiv or Twitter instead of Danbooru. If they can take images from Danbooru, they can do the same for Pixiv and Twitter.

In particular, these AI models are all based on Stable Diffusion, which was trained on nearly 5 billion images scraped from the entire internet. Removing works from Danbooru does nothing because these AI models are already pretrained on billions of images taken directly from artists on Twitter, Pixiv, ArtStation, and DeviantArt, along with thousands of other sites like Reddit, Pinterest, and others.

Artists need to understand that these models are trained on the entire internet, not just Danbooru. They're trained on millions of images posted directly by the artists themselves, and reposted by thousands of other sites. The only way to stop their works from being used is to talk to the AI developers themselves. Artists can use https://haveibeentrained.com/ or https://spawning.ai/ to try to check if their works have been used by AI and to try to opt out. Spawning.ai is a new project to build a global opt-out system for artists. It's still new, but it might be an artist's best hope if they want to say their works shouldn't be used by AI.

(PS: I would appreciate it if any fluent Japanese speakers could help me translate something into Japanese to tell artists. Join the #translations channel on the Danbooru discord).

evazion said:

Recently this tweet made the rounds and has resulted in a lot of takedown requests: https://twitter.com/elf_248/status/1576837031140855809. A site called NovelAI (https://novelai.net) recently announced a paid AI generation tool that they advertised as being trained on Danbooru (without our permission). Artists have picked up on this and are placing the blame on Danbooru.

This is my reply to someone who asked me about it via email:

(PS: I would appreciate it if any fluent Japanese speakers could help me translate something into Japanese to tell artists. Join the #translations channel on the Danbooru discord).

Just a user passing by, but I'd contact NovelAI directly and ask them to rectify their statement, possibly with a tweet too. They are actively spreading misinformation and causing damage. I know this website walks a thin line on copyright, but the archival purposes are undoubtedly of value.
I don't wanna see a rise in banned artists.

kittey said:

Sure that’s possible, but the thing is, the art itself is somewhat useless if you don’t know what it is. I’m pretty sure the AI-makers scraped Danbooru because it is known as the best-tagged anime image board. Nai Diffusion, for example, works on the same tags as Danbooru to give its users what they want because exactly those tags have been used as input to train the AI.

Pixiv tags are pretty useless for that. Not only do they allow only a very small amount of tags (12, I believe?), which is not enough to actually describe an image, but those tags are often used inconsistently, with several different tags used for the same thing, requiring artists to use the precious little amount of tags for redundancy, leaving even less room to describe anything besides characters and copyright. From the perspective of an AI developer, Pixiv tags are absolutely useless. Scraping a site with untagged images is of course possible, but will only let you generate basically random images.

I don't think this is quite true. Stable Diffusion is so effective because it's trained on literal billions of images. Instead of tags it's trained on captions and text-image pairs scraped from thousands of different sites. It doesn't matter that the data is incredibly noisy; the lesson from recent advances in AI is that more data is always better than less data, no matter how noisy the data is.

An AI trained directly on Pixiv, Twitter, ArtStation, DeviantArt, or any other artist site would be even better than one trained on Danbooru, because it has more data, no matter how noisy the tags are. And they're a lot less noisy than images scraped off random sites on the internet. The success of AI comes from the ability to cut through incredible noise to find the signal.

kittey said:

Sure that’s possible, but the thing is, the art itself is somewhat useless if you don’t know what it is. I’m pretty sure the AI-makers scraped Danbooru because it is known as the best-tagged anime image board. Nai Diffusion, for example, works on the same tags as Danbooru to give its users what they want because exactly those tags have been used as input to train the AI.

Pixiv tags are pretty useless for that. Not only do they allow only a very small amount of tags (12, I believe?), which is not enough to actually describe an image, but those tags are often used inconsistently, with several different tags used for the same thing, requiring artists to use the precious little amount of tags for redundancy, leaving even less room to describe anything besides characters and copyright. From the perspective of an AI developer, Pixiv tags are absolutely useless. Scraping a site with untagged images is of course possible, but will only let you generate basically random images.

ComradeMokou said:

Pixiv and most other image boards in general don’t have nearly as good tags as Danbooru does. AI training needs data. Good data in, good model out.

This is neglecting to account for the fact that AIs can already label data, and the AI-labeled data can then be used to train a different AI.

https://openai.com/blog/vpt/
This paper shows a technique where OpenAI:
1. Downloaded a huge amount of unlabeled data (Minecraft letsplays) from the internet
2. Paid a small number of people to create a small amount of labeled data (by playing Minecraft and recording all their mouse and keyboard inputs)
3. Trained an AI on the small amount of labeled data that predicts inputs
4. Ran that AI on all the unlabeled data and created labels for it
5. Trained a second AI on the large amount of now-labeled data to play Minecraft

Even if Danbooru never existed, someone could apply this technique to Pixiv and other sites to get a bunch of well-tagged images.

But Danbooru does exist, so step 2 can be skipped, no need to pay people to tag images.
And Danbooru already has an open source AI tagging model, so step 3 can be skipped too.
All someone really needs is the AI tagging model + the ability to scrape images from Pixiv, Twitter, etc, and they can get tagged images. Danbooru's database itself isn't a key component here.
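The five steps above can be sketched in a few lines of Python. This is only a toy illustration of the pseudo-labeling idea, not what OpenAI or anyone else actually ran: the "tagger" is a nearest-centroid classifier standing in for a real neural network, and the function names, feature vectors, and tags (`tag_a`, `tag_b`) are all made up for the example.

```python
from collections import defaultdict


def train_tagger(labeled):
    """Step 3: fit a toy nearest-centroid 'tagger' on a small labeled set.

    `labeled` is a list of (feature_vector, tag) pairs. A centroid per tag
    is only a stand-in for a real model.
    """
    sums = {}
    counts = defaultdict(int)
    for features, tag in labeled:
        if tag not in sums:
            sums[tag] = [0.0] * len(features)
        sums[tag] = [s + f for s, f in zip(sums[tag], features)]
        counts[tag] += 1
    return {tag: [s / counts[tag] for s in total] for tag, total in sums.items()}


def predict_tag(centroids, features):
    """Return the tag whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda tag: dist(centroids[tag]))


def pseudo_label(centroids, unlabeled):
    """Step 4: run the tagger over unlabeled data to get (features, tag) pairs."""
    return [(features, predict_tag(centroids, features)) for features in unlabeled]


# Step 2: a small hand-labeled set (made-up 2-D features and tags).
small_labeled = [([0.0, 0.1], "tag_a"), ([0.2, 0.0], "tag_a"),
                 ([1.0, 0.9], "tag_b"), ([0.8, 1.0], "tag_b")]
# Step 1: a larger pile of unlabeled data.
unlabeled = [[0.1, 0.2], [0.9, 0.8], [0.05, 0.0], [1.1, 1.0]]

tagger = train_tagger(small_labeled)
pseudo_labeled = pseudo_label(tagger, unlabeled)
# Step 5 would train a second, larger model on `pseudo_labeled`.
```

With an existing tagging model, `train_tagger` drops out entirely and only the `pseudo_label` step remains, which is the point being made here.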

evazion said:

I don't think this is quite true. Stable Diffusion is so effective because it's trained on literal billions of images. Instead of tags it's trained on captions and text-image pairs scraped from thousands of different sites. It doesn't matter that the data is incredibly noisy; the lesson from recent advances in AI is that more data is always better than less data, no matter how noisy the data is.

An AI trained directly on Pixiv, Twitter, ArtStation, DeviantArt, or any other artist site would be even better than one trained on Danbooru, because it has more data, no matter how noisy the tags are. And they're a lot less noisy than images scraped off random sites on the internet. The success of AI comes from the ability to cut through incredible noise to find the signal.

The benefit of NovelAI's tag-based approach has less to do with the quality of the images themselves and more to do with consistency and control for the user. Noisy data can produce good images but it can be frustrating (or impossible) to get exactly the character you want in two different poses/scenes/expressions/etc due to the randomness inherent to current AI techniques.
But with proper consistent tagging, you get this: https://old.reddit.com/r/NovelAi/comments/xn8r0v/image_generation_progress_showcase_when_you/

You could argue that Danbooru makes it easier. I would argue that artist-added tags on Pixiv and other sites are good enough to produce good results, especially if you're mainly interested in characters. Pixiv tags may seem noisy to a human, but I don't think they are to an AI. It's not hard for an AI to figure out that two tags refer to the same character, or that one tag may refer to two separate characters.

And you don't need Danbooru to make a dataset of well-tagged images. The LAION-5B dataset that Stable Diffusion is based on already contains millions of tagged images. If you browse https://haveibeentrained.com you can see they have millions of images from sites like Gelbooru, Yande.re, Konachan, Zerochan, Minitokyo, Safebooru.org, Pinterest, random wallpaper sites, and others. Many of these images already have full tags in the text description. All you have to do is download the LAION-5B dataset, filter it for images from anime-related sites with good tags, and now you have a dataset of millions of well-tagged images.
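As a rough sketch of what that filtering step could look like, assuming the metadata has already been loaded as a list of dicts: the field names `url` and `text`, the hostnames in the allow-list, and the `min_tags` threshold are all assumptions made for illustration, not the real dataset schema or a vetted site list.

```python
from urllib.parse import urlparse

# Hypothetical allow-list; the hostnames are illustrative only.
ANIME_HOSTS = {"gelbooru.com", "konachan.com", "safebooru.org"}


def host_of(url):
    """Return the lowercase hostname of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def filter_rows(rows, hosts=ANIME_HOSTS, min_tags=5):
    """Keep rows hosted on an allowed site whose caption looks tag-like.

    'Tag-like' is crudely approximated as having at least `min_tags`
    whitespace-separated tokens.
    """
    kept = []
    for row in rows:
        host = host_of(row["url"])
        on_allowed = any(host == h or host.endswith("." + h) for h in hosts)
        if on_allowed and len(row["text"].split()) >= min_tags:
            kept.append(row)
    return kept


rows = [
    {"url": "https://img.gelbooru.com/a.jpg",
     "text": "1girl solo long_hair smile blue_eyes"},
    {"url": "https://example.com/b.jpg",
     "text": "a photo of a dog on a beach"},
    {"url": "https://konachan.com/c.jpg", "text": "wallpaper"},
]
filtered = filter_rows(rows)
```

Here only the first row survives: the second is on a site outside the allow-list and the third has too short a caption to count as tagged.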

As far as AI tagging goes, the first open source AI tagging system trained on Danbooru dates back to 2015: https://github.com/rezoo/illustration2vec (by Japanese developers). And the results were actually pretty good even back then. So people have been releasing AI tagging systems based on Danbooru long before we ever made our own.

Toks said:

This is neglecting to account for the fact that AIs can already label data, and the AI-labeled data can then be used to train a different AI.

https://openai.com/blog/vpt/
This paper shows a technique where OpenAI:
1. Downloaded a huge amount of unlabeled data (Minecraft letsplays) from the internet
2. Paid a small number of people to create a small amount of labeled data (by playing Minecraft and recording all their mouse and keyboard inputs)
3. Trained an AI on the small amount of labeled data that predicts inputs
4. Ran that AI on all the unlabeled data and created labels for it
5. Trained a second AI on the large amount of now-labeled data to play Minecraft

Even if Danbooru never existed, someone could apply this technique to Pixiv and other sites to get a bunch of well-tagged images.

But Danbooru does exist, so step 2 can be skipped, no need to pay people to tag images.
And Danbooru already has an open source AI tagging model, so step 3 can be skipped too.
All someone really needs is the AI tagging model + the ability to scrape images from Pixiv, Twitter, etc, and they can get tagged images. Danbooru's database itself isn't a key component here.

The benefit of NovelAI's tag-based approach has less to do with the quality of the images themselves and more to do with consistency and control for the user. Noisy data can produce good images but it can be frustrating (or impossible) to get exactly the character you want in two different poses/scenes/expressions/etc due to the randomness inherent to current AI techniques.
But with proper consistent tagging, you get this: https://old.reddit.com/r/NovelAi/comments/xn8r0v/image_generation_progress_showcase_when_you/

They could attain tags regardless, this is true. That said, we certainly make it easier for them.

Something that just occurred to me is that the tagging data used in these models is also subject to copyright. Unless I’ve missed something in the Terms, only Danbooru and the taggers themselves can license that data to 3rd parties, and from what I can see that hasn’t happened. If you are sympathetic to the idea that companies using an artist’s work without permission to create an AI model and profit from it is wrong/illegal, then companies using the tag data generated by the users of this site for such purposes would also be wrong/illegal.

IANAL though, so take everything I’ve said with a grain of salt, but it might be something worth asking a lawyer about. At the very least it would be good PR for the site to come out against it as well.

evazion said:

Recently this tweet made the rounds and has resulted in a lot of takedown requests: https://twitter.com/elf_248/status/1576837031140855809. A site called NovelAI (https://novelai.net) recently announced a paid AI generation tool that they advertised as being trained on Danbooru (without our permission). Artists have picked up on this and are placing the blame on Danbooru.

This is my reply to someone who asked me about it via email:

(PS: I would appreciate it if any fluent Japanese speakers could help me translate something into Japanese to tell artists. Join the #translations channel on the Danbooru discord).

Maybe it'd be useful to put a disclaimer about this on Danbooru itself, on whichever page tells artists how to submit takedown requests, so that they're more likely to see it. Preemptively tell them "you can do this if you want, but here's why it's pointless".

ComradeMokou said:

A meta tag could be created denoting the artist’s wishes on their art being used for AI training, say do_not_train or something. I guess the opposite could also be created for artists that want to explicitly allow their art to be used for training, but I don’t know how much use that would really get. It could be useful for the dataset creators or those creating the AI models if they care or are forced to care about having explicit permission in the future.

If an artist makes it known to Danbooru (or otherwise?) that they don’t want their art used in AI training sets then their artist tag could be updated to implicate the do_not_train tag. This could even be an incentive for those who would otherwise wish to keep their art on Danbooru vs opting to be a banned_artist.

do_not_train is a very non-intuitive name for anyone who doesn’t know about the training of AI models. Please consider something more descriptive like not_for_ai-training before you start populating it.
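Whatever the tag ends up being called, the implication idea quoted above could work roughly like this. It is a minimal sketch and not Danbooru's actual implementation: the artist tag `example_artist`, the implication table, and both function names are hypothetical.

```python
def expand_tags(tags, implications):
    """Return `tags` plus every tag they imply, following chains transitively."""
    result = set(tags)
    stack = list(tags)
    while stack:
        tag = stack.pop()
        for implied in implications.get(tag, ()):
            if implied not in result:
                result.add(implied)
                stack.append(implied)
    return result


def usable_for_training(post_tags, implications, opt_out_tag="do_not_train"):
    """A dataset builder honoring the tag would skip any post that carries it."""
    return opt_out_tag not in expand_tags(post_tags, implications)


# Hypothetical implication table: the artist tag is made up.
implications = {"example_artist": ["do_not_train"]}

post_a = ["example_artist", "1girl"]
post_b = ["other_artist", "1girl"]
allowed = [usable_for_training(p, implications) for p in (post_a, post_b)]
```

The opt-out only has an effect if whoever builds the dataset chooses to check for it, which is the caveat already raised in the quoted post.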

evazion said:

I asked NovelAI to stop using our name to advertise their paid AI model without our permission, but I think the damage has been done and they probably don't care about us too much.

So, NovelAI is the reason why many artists I like to translate are dropping like flies right now?

evazion said:

I put a statement on the contact page: https://danbooru.donmai.us/contact.

I asked NovelAI to stop using our name to advertise their paid AI model without our permission, but I think the damage has been done and they probably don't care about us too much.

Nice. For what it's worth I think adding a banner statement might be useful too, at least temporarily. I hardly even noticed the contact page before this and I imagine there's plenty others who won't. Might not make a difference but probably won't hurt?

pronebone said:

So, NovelAI is the reason why many artists I like to translate are dropping like flies right now?

Yeah, a Japanese artist pointed out how they announced they have been using Danbooru to train, and the tweet gained traction.

Xiry said:

Ditto

kittey said:

do_not_train is a very non-intuitive name for anyone who doesn’t know about the training of AI models. Please consider something more descriptive like not_for_ai-training before you start populating it.

Hrmm, yeah that would make it clearer what the tag is for. Do not train popped to mind because it sounded similar to Do Not Track, the browser thing. Could even be no ai-training or against ai-training, the latter of which might be better because it sounds less like an opt-out on Danbooru’s side and more like a record of the artist’s wishes which is the entire goal.

Xiry said:

Nice. For what it's worth I think adding a banner statement might be useful too, at least temporarily. I hardly even noticed the contact page before this and I imagine there's plenty others who won't. Might not make a difference but probably won't hurt?

Might also want to be more straightforward on Twitter, like reply directly to the NovelAI post or the other popular posts, to make sure they see us. The current two disclaimers aren’t getting enough attention; people are still thinking that banning themselves from Danbooru is the way to go.
