Danbooru Archives

To bump this thread: I've made a new torrent of Danbooru with images up to 30 November 2017 and the BigQuery JSON metadata. (2.9m images, 1.9tb, 50m tags.) This is meant primarily for machine learning, but hopefully it'll be useful for whatever else people might want.

I consider it alpha because there may be some problems with the SQL version we made from the JSON metadata, but if anyone would like to test it out before a formal release and get a jump on the bulk of the downloading, you can get the 10 .torrent files from https://mega.nz/#!QHIxCA5D!Izw5vxZo11upkHNITvuKZ3mmjf8BhabJOwdPMYF5kD4

Let me know about any problems with using the data or client incompatibilities.

@gwern-bot said:

I've made a new torrent of Danbooru with images up to 30 November 2017 and the BigQuery JSON metadata. (2.9m images, 1.9tb, 50m tags.) [...]

Interesting, I hope it's useful for your purposes. You might consider ruling out posts with some of the tags listed in help:third-party edit, or posts marked status:deleted.

Posts are often deleted because they have a third-party problem, even though the deletion itself was just marked "poor quality" or "breaks rules" by approvers, or the real reason is only given in the flag reason. Those poor-quality/breaks-rules messages only stay up for a month or two before the flag log is removed.

Also, did you download the tags for the images as well? It would be useful for any project to have them, along with a means of updating them (adding and removing tags) to stay in sync with the site.

Additional filtering might not be a bad idea ('waifu2x' too, for the reasons I outline in my comments in that thread). I haven't done it by default, though - that would have complicated the download, and if it turns out to be an issue, people can do it easily with SQL queries, e.g. as sketched below.
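
To give the flavor of it, here is a minimal sketch of that kind of filtering against the Sqlite3 metadata, in Python; the 'posts' table and the 'is_deleted'/'tag_string' columns are assumptions about the converted schema, and the tag names are only illustrative:

    import sqlite3

    # Assumed schema: a 'posts' table with 'id', 'is_deleted', and a
    # space-separated 'tag_string' column; adjust to the actual conversion.
    conn = sqlite3.connect("danbooru2017.sqlite")

    rows = conn.execute(
        """
        SELECT id FROM posts
        WHERE is_deleted = 0
          AND tag_string NOT LIKE '%waifu2x%'
          AND tag_string NOT LIKE '%third-party_edit%'
        """
    )
    keep_ids = [r[0] for r in rows]
    print(len(keep_ids), "posts kept after filtering")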

And yes, the tags are included: they're present as the JSON export of the BigQuery mirror of the Danbooru database, and then converted to a (hopefully more convenient) Sqlite3 database. So a developer could use a little commandline tool to simultaneously edit the local SQL and push the same change to the live Danbooru website with appropriate POSTs in the API.
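
As a rough sketch of what such a tool could look like (in Python; the table/column names are again assumptions about the converted schema, and the exact Danbooru endpoint, HTTP verb, and parameter names should be checked against the current API docs rather than taken from this):

    import sqlite3
    import requests

    def retag(db, post_id, new_tags, login, api_key):
        # Hypothetical helper: update the tag string locally, then push the
        # same edit to the live site. The 'posts'/'tag_string' schema and the
        # PUT /posts/<id>.json call with post[tag_string] are assumptions to
        # be verified against the Danbooru API documentation.
        db.execute("UPDATE posts SET tag_string = ? WHERE id = ?",
                   (new_tags, post_id))
        db.commit()
        r = requests.put(
            f"https://danbooru.donmai.us/posts/{post_id}.json",
            data={"post[tag_string]": new_tags},
            auth=(login, api_key),  # login/API-key credentials
        )
        r.raise_for_status()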

To update this: the JSON we used turned out to be the wrong BQ dump (oops) and was badly out of date. We pulled the correct up-to-date one but the schema changed and is substantially more complex so the SQL conversion script is broken ATM. We did, however, dump the December images while we were messing with it, so now it's 2.94m images with 77.5m tags. This should be an official/final version of it. I hope it's useful.

Writeup: https://www.gwern.net/Danbooru2017
The new torrents: https://www.gwern.net/docs/anime/danbooru2017-torrent.tar.xz

Yes, there is no cropping and the aspect ratio is preserved; the empty space left by shrinking the largest dimension to 512px is filled with a black background (since JPG doesn't support transparency). The exact ImageMagick call (from 'rescale_image.sh') is:

convert -resize 512x512\> -extent 512x512\> -limit thread 1 -gravity center -background black "$@" ./512px/$BUCKET/$ID.jpg

Huh. Checking a few 512px examples locally, that is indeed what happened. That's unfortunate. I can't recall the 512px torrent because a lot of people are already on it. But it can't be *that* common an issue, since I did browse several hundred images (checking to make sure I hadn't screwed up the SFW filtering) and didn't notice it... (Only a few hundred images seem affected, if 'transparent_background lineart' is a reasonably comprehensive search.) I guess for the next version I'll switch to white backgrounds, even though it'll invalidate most of the files and I think it looks a little worse. Alternatively, users could simply do an image-quality check on average brightness or number of distinct colors; a rough sketch of such a check is below.
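
For anyone who wants to do that kind of check, a quick heuristic along those lines might look like the following (PIL-based; the cutoffs are arbitrary guesses, not tuned values):

    from PIL import Image, ImageStat

    def looks_flat_filled(path, brightness_cutoff=20, max_colors=256):
        # Flag images that are suspiciously dark overall or that use very few
        # distinct colors (e.g. lineart flattened onto a black background).
        # Both thresholds are arbitrary and would need tuning on real data.
        img = Image.open(path).convert("RGB")
        mean_brightness = ImageStat.Stat(img.convert("L")).mean[0]
        colors = img.getcolors(maxcolors=max_colors)  # None if > max_colors
        return mean_brightness < brightness_cutoff or colors is not None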

An update on uses of Danbooru2017:

- the drawing colorizer style2paints (https://github.com/lllyasviel/style2paints) recommends its use for training the colorizing NN
- yu45020's "Text Segmentation and Image Inpainting" project (https://github.com/yu45020/Text_Segmentation_Image_Inpainting), with the goal of automatically erasing text in manga/anime images for scanlation, uses it
- the thesis "Application of Generative Adversarial Network on Image Style Transformation and Image Processing" (https://cloudfront.escholarship.org/dist/prd/content/qt66w654x7/qt66w654x7.pdf), Wang 2018, uses it for anime<->face CycleGAN conversion; samples of photographic human faces turned into anime faces are... somewhat recognizable but not very good
- the paper "Improving Shape Deformation in Unsupervised Image-to-Image Translation" (https://arxiv.org/abs/1808.04325), Gokaslan et al 2018, does something similar but on more sets of images (eg face<->cat); the anime<->face conversions (pg11) are, IMO, much better than in Wang 2018
