Danbooru

BigQuery dataset (queryable dump)

Posted under General

EDIT 2022-01-10: There's now an official dump, which can be found here. The dataset is danbooru1.danbooru_public.

My dump is no longer maintained.

Edit by @nonamethanks - 2020-11-21

Since the old-style BigQuery links are dead, the database can now be found at this link.

Queries now need the fully qualified table name, so use danbooru-data.danbooru.posts instead of the old [danbooru.posts]. See forum #178212 for an example of an updated query.

Danbooru now also has its own official BigQuery dump of most data, not just posts; see this commit for the URLs.

With some assistance from albert, I've set up a Google BigQuery data dump of the Danbooru posts tables so anyone who cares to do so can run queries. You can access the dump here. You may see Konachan and Yandere tables there too but those aren't complete so I don't recommend bothering with them.

It's updated nightly and should contain basically anything you can get through the API. If you really want to, you can download the whole thing as a dump using these instructions. Should save you from trying to scrape the API or something silly like that.

The query syntax is very similar to SQL, with a few differences due to fancier datatypes like repeated/nested fields.
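For instance, tags is a repeated record, so per-post aggregates use legacy BigQuery's scoped aggregation. A minimal sketch (column names taken from the queries in this thread; untested):

```sql
-- Count how many tags each post has, using legacy BigQuery's
-- scoped aggregation (WITHIN RECORD) over the repeated tags field.
SELECT
  id,
  COUNT(tags.name) WITHIN RECORD AS tag_count
FROM
  [danbooru-data:danbooru.posts]
LIMIT 10;
```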

An example query:

SELECT tags.name, COUNT(id) AS num, SUM(file_size)/1000000000 AS GB FROM [danbooru.posts] GROUP BY tags.name ORDER BY GB DESC LIMIT 20;

This returns the tags that account for the most total file size. For example, it shows that 1girl has just over a terabyte of image data across 1.6 million images.

SELECT favs, COUNT(id) AS num, SUM(file_size)/1000000000 AS GB FROM [danbooru.posts] GROUP BY favs ORDER BY GB DESC LIMIT 20;

This groups by user favorites instead of tags. You can see that user 19831 has 414,000 favourites, which amount to around 380 GB.

Updated by nonamethanks

Just found this last night and holy crap, this is so useful. @Allynay, don't know if you're still around, but thanks for making this.

Here are some quick overall stats on the site, total filesizes and average scores broken down by rating.

Row | rating | posts   | GB       | total_score | upvotes | downvotes | total_votes | average_score | average_upvotes | average_downvotes
1   | s      | 1842132 | 1235.26  | 8745742     | 5916507 | 92639     | 6009146     | 4.7476        | 3.2118          | -0.0503
2   | q      | 382207  | 280.74   | 3192023     | 2132681 | 49423     | 2182104     | 8.3516        | 5.5799          | -0.1293
3   | e      | 220375  | 136.93   | 1863289     | 1237469 | 63895     | 1301364     | 8.4551        | 5.6153          | -0.2899
SELECT
  rating,
  COUNT(id) AS posts,
  SUM(file_size)/1000000000 AS GB,
  SUM(score) AS total_score,
  SUM(up_score) AS upvotes,
  ABS(SUM(down_score)) AS downvotes,
  SUM(up_score + ABS(down_score)) AS total_votes,
  AVG(score) AS average_score,
  AVG(up_score) AS average_upvotes,
  AVG(down_score) AS average_downvotes,
  STDDEV_POP(score) AS stddev_score,
  STDDEV_POP(up_score) AS stddev_upvotes,
  STDDEV_POP(down_score) AS stddev_downvotes
FROM
  [danbooru-data:danbooru.posts] 
GROUP BY
  rating;

This is also useful for queries that time out on the site or don't fit into the 6-tag search limit. Here's one to find unbanned artists, since banned_artist -status:banned times out:

SELECT
  id
FROM
  [danbooru-data:danbooru.posts]
WHERE
  tags.name = 'banned_artist' AND
  is_banned = false;

evazion said:

Just found this last night and holy crap, this is so useful. @Allynay, don't know if you're still around, but thanks for making this.


No worries, glad someone found it useful. I'll be adding Yande.re and Gelbooru at some point as well. Konachan maybe.

How are you keeping this updated? Are you scraping the API or using the database replica (I think I saw albert say something about giving people access to that once?)?

Because if you have more tables, and it wouldn't be too much trouble to put them on BigQuery too, that'd be extremely helpful to me. It'd be nice to have the users table to resolve user IDs. And post_versions in particular would save me a lot of time in scraping it, if you happen to already have it.

evazion said:

How are you keeping this updated? Are you scraping the API or using the database replica (I think I saw albert say something about giving people access to that once?)?

Because if you have more tables, and it wouldn't be too much trouble to put them on BigQuery too, that'd be extremely helpful to me. It'd be nice to have the users table to resolve user IDs. And post_versions in particular would save me a lot of time in scraping it, if you happen to already have it.

I'm using the DB replica. Until very recently there was no easy way to keep an up to date copy of the DB (other boorus have a change_seq column you can order by, Danbooru only has updated). I'm working on the new way (PubSub) and if it seems to work without compromising integrity I'll probably stick to that.

I had a look and I don't have permissions for the post_versions or users tables (I do have tags and tag_aliases though). While there doesn't seem to be anything sensitive in the post_versions table and Albert might be okay with giving me access, the users table contains waaaay too much sensitive info. Postgres does have column-level access, though.

I might work on scraping it but it'll take a bit of work. What do you need the post_versions table and the ability to resolve users for? (if it's something interesting to me it's more likely I'll put in the effort :p)

The users table is no problem. I already have a scraper, and most things except the posts table and the various *_versions tables can be scraped quickly. The main thing I need it for is matching uploader/approver IDs with usernames in the posts table.

As for post_versions, the big thing is for doing "Which posts did user X add tag Y to?" queries. That came up twice today in needing to mass revert certain edits in topic #13160 and in topic #13153.
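For what it's worth, that kind of lookup would be a one-liner if the versions data were ever loaded. A purely hypothetical sketch, assuming a post_versions table with post_id, updater_id, and an added_tags field; no such table exists in this dump, and the user id 12345 is a placeholder:

```sql
-- Hypothetical: "which posts did user X add tag Y to?"
-- Assumes a post_versions table (not present in this dump).
SELECT
  post_id
FROM
  [danbooru-data:danbooru.post_versions]
WHERE
  updater_id = 12345 AND
  added_tags = 'tank';
```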

The other thing I'm interested in is answering various tag usage questions. Namely:

For any given tag:

  • Who coined this tag?
  • How has this tag grown over time?
  • How many people have ever used this tag?
  • Who are the top taggers for this tag?
  • What posts has it been added to recently? What posts has it been removed from?

For any given user:

  • How many different posts have they tagged?
  • What are the top tags that they've added? That they've removed? And to which posts?

I had a crazy idea of writing a userscript that would query these things from BigQuery and add reports to user pages and to wiki pages. I think it would be doable, assuming the cost of these queries isn't prohibitive.

But I can scrape post_versions myself, don't worry about it. I was just hoping that you might have access in the replica already.

I'm not that familiar with SQL, so is there a way to search for positive and negative tags, or various tag combinations?

I originally tried...

SELECT
  id,tags
FROM
  [danbooru-data:danbooru.posts]
WHERE
  tags.name = 'tank' AND
  tags.name <> 'girls_und_panzer'

...as a test case, but that returned zero results.

If I can search with tag combinations, one thing I'd like to do is to search for all posts that have the tank tag but aren't tagged with a specific tank name.

Try this:

Search: tank -girls_und_panzer

Results: https://docs.google.com/spreadsheets/d/1sfOF2lbsQ8-xTLtmVelpbK4kGxeEGl0hVkwluCgfrsA/edit?usp=sharing

SELECT
  id,
  CONCAT("http://danbooru.donmai.us/posts/", STRING(id)) AS url,
  created_at,
  rating,
  score,
  COUNT(favs) WITHIN RECORD as favcount,
FROM
  [danbooru-data:danbooru.posts]
WHERE
  id     IN (SELECT id FROM [danbooru-data:danbooru.posts] WHERE tags.name = 'tank') AND
  id NOT IN (SELECT id FROM [danbooru-data:danbooru.posts] WHERE tags.name = 'girls_und_panzer') AND
  TRUE
ORDER BY id DESC;

I'm not sure if that's the best way of doing it, but it seems to work.
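Legacy BigQuery's OMIT RECORD IF clause with SOME/EVERY scoping might express the same thing without the subselects. An untested sketch:

```sql
SELECT
  id
FROM
  [danbooru-data:danbooru.posts]
-- Drop records that have no tank tag, or that do have girls_und_panzer.
OMIT RECORD IF
  EVERY(tags.name != 'tank') OR SOME(tags.name = 'girls_und_panzer')
ORDER BY id DESC;
```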

Here are some more interesting queries:

Average upload scores of contributors vs members, grouped by rating

https://docs.google.com/spreadsheets/d/1dmM-GFUJ-Tz4tA9yJR4X0xbTJhz3KMcvSh8zXNVSfJk/edit?usp=sharing

SELECT
  CASE WHEN approver_id = 0 THEN TRUE ELSE FALSE END AS contributor,
  rating,
  COUNT(id) AS posts,
  AVG(score) AS average_score,
  STDDEV(score) AS stddev_score,
  COUNT(favs) / posts AS average_favcount
FROM
  [danbooru-data:danbooru.posts]
WHERE
  created_at BETWEEN TIMESTAMP('2016-08-01') AND TIMESTAMP('2016-09-01')
GROUP BY
  rating,
  contributor
ORDER BY
  rating DESC;
Most active approvers

https://docs.google.com/spreadsheets/d/1YGl69uN_ZOcWwn8Q0AekUj0ur0CY__UO_T7-L68QJDE/edit?usp=sharing

SELECT
  CONCAT("http://danbooru.donmai.us/users/", STRING(approver_id)) AS user,
  COUNT(id) AS approved_posts,
  AVG(score) AS average_score,
  COUNT(favs) / approved_posts AS average_favcount
FROM
  [danbooru-data:danbooru.posts] 
WHERE
  created_at BETWEEN TIMESTAMP('2016-08-01') AND TIMESTAMP('2016-09-01') AND
  approver_id <> 0
GROUP BY
  user
ORDER BY
  approved_posts DESC;
Top sources, by highest average score

https://docs.google.com/spreadsheets/d/1Fhq_HZjOpf8XqW0pzWNn7tOSzEi3GrdsBucnEVPDScA/edit?usp=sharing

SELECT
  COUNT(id) AS posts,
  DOMAIN(source) AS domain,
  AVG(score) AS avg_score,
FROM
  [danbooru-data:danbooru.posts]
GROUP BY
  domain
HAVING
  posts > 500
ORDER BY
  avg_score DESC;
Top sources, by number of uploads

https://docs.google.com/spreadsheets/d/1Qzafuo-dYxvEl8nLAk32MAM2IZUP7FRUyBQU2kFHpLg/edit?usp=sharing

SELECT
  COUNT(id) AS posts,
  DOMAIN(source) AS domain,
  AVG(score) AS avg_score,
FROM
  [danbooru-data:danbooru.posts]
GROUP BY
  domain
ORDER BY
  posts DESC;
Unusual Pixiv source URLs

https://docs.google.com/spreadsheets/d/13y9mSms5SqqWZHAuNhmzIt11agkEXYUyqGjwgiutwUg/edit?usp=sharing

SELECT
  id,
  CONCAT("http://danbooru.donmai.us/posts/", STRING(id)) AS url,
  source
FROM
  [danbooru-data:danbooru.posts]
WHERE
  REGEXP_MATCH(source, r'pixiv\.net') AND
  NOT REGEXP_MATCH(source, r'https?://comic.pixiv.net/viewer/stories/\d+') AND 
  
  NOT REGEXP_MATCH(source, r'https?://img\d*.pixiv.net/img/[-_a-zA-Z0-9]+/\d+((_m)|((_big)?_p\d+))?.(jpg|png|gif)') AND 
  NOT REGEXP_MATCH(source, r'https?://i[1234].pixiv.net/img\d{1,3}/img/[-_a-zA-Z0-9]+/\d+(_m|((_big)?_p\d+))?.(jpg|png|gif)') AND 
  NOT REGEXP_MATCH(source, r'https?://i[1234].pixiv.net/img\d{1,3}/img/[-_a-zA-Z0-9]+/\d+.webm') AND 
  NOT REGEXP_MATCH(source, r'https?://i[1234].pixiv.net/img\d{1,3}/img/[-_a-zA-Z0-9]+/mobile/\d+_480mw(_p\d+)?.jpg') AND 
  NOT REGEXP_MATCH(source, r'https?://i[1234].pixiv.net/img-original/img/\d{4}/(\d\d/){5}\d+_p\d+.(jpg|png|gif)') AND 
  NOT REGEXP_MATCH(source, r'https?://i[1234].pixiv.net/c/(480x960|600x600|1200x1200)/img-master/img/\d{4}/(\d\d/){5}\d+_p\d+_master1200.jpg') AND 
  NOT REGEXP_MATCH(source, r'https?://i[1234].pixiv.net/img-zip-ugoira/img/\d{4}/(\d\d/){5}\d+_ugoira(600x600|1920x1080).zip') AND 
  
  NOT REGEXP_MATCH(source, r'https?://touch.pixiv.net/member_illust.php\?mode=medium&illust_id=\d+') AND
  NOT REGEXP_MATCH(source, r'https?://www.pixiv.net/member_illust.php\?mode=(medium|manga)&illust_id=\d+') AND
  NOT REGEXP_MATCH(source, r'https?://www.pixiv.net/member_illust.php\?mode=manga_big&illust_id=\d+&page=\d+') AND
  
  NOT REGEXP_MATCH(source, r'https?://i[1234].pixiv.net/background/img/\d{4}/(\d\d/){5}\d+_[a-f0-9]{32}.jpg') AND
  NOT REGEXP_MATCH(source, r'https?://img\d+.pixiv.net/profile/[-_a-zA-Z0-9]+/(ws_|bg_)?\d+.(jpg|png|gif)') AND 
  NOT REGEXP_MATCH(source, r'https?://i[1234].pixiv.net/img\d{1,3}/profile/[-_a-zA-Z0-9]+/(ws_|bg_)?\d+.(jpg|png|gif)') AND 
  NOT REGEXP_MATCH(source, r'https?://www.pixiv.net/member.php\?id=\d+') AND 
  TRUE
ORDER BY id DESC;

Some of this deserves to be looked at in more detail than I have time for now, but tl;dr:

  • Contributor uploads on average score a couple points higher than member uploads, no matter the rating.
  • One approver accounts for ~33% of approvals. The top two account for ~50%. The top ~20% of approvers (<10 people of 50) account for ~80% of approvals. (This was for one month only though).
  • Pixiv has 1.6 million uploads, Twitter is second at only ~80,000 (twitter.com + twimg.com).
  • ...but interestingly, nijie.info and hentai-foundry.com uploads have the highest average scores out of any source.

Here's a list of the top 50 uploaders ranked by highest total scores. Some things to note:

  • Score is more a measure of popularity than of quality, and porn is more popular than non-porn. So take scores with a grain of salt.
  • Upvotes and downvotes don't add up to total score. I think that's because favorites used to give a free point that wasn't counted as an upvote.
  • rating:e attracts more downvotes than rating:s, so if a person has a lot of downvotes it could be because they upload a lot of rating:e.
  • This is only the top 50, but the Google spreadsheet below lists all 14265 people who have ever uploaded anything. Which is not as many uploaders as I'd thought we'd have, tbh.
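The free-favorite-point theory could be checked directly against the dump. An untested sketch:

```sql
-- How many posts have a score that differs from up_score + down_score
-- (down_score is stored as a negative number), and by how much on average?
SELECT
  COUNT(id) AS mismatched_posts,
  AVG(score - (up_score + down_score)) AS avg_gap
FROM
  [danbooru-data:danbooru.posts]
WHERE
  score != up_score + down_score;
```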
Top Uploaders By Highest Score

https://docs.google.com/spreadsheets/d/18s5PT20Oz7IcjnpLyf5RxAOQtHZlUiezFUypInGPVW4/edit?usp=sharing

Row | id           | user                     | uploads | up_votes | down_votes | total_score | total_favs | avg_score | avg_favs
1   | user #254161 | Schrobby                 | 54365   | 301200   | -3320      | 531461      | 978985     | 9.78      | 18.01
2   | user #32251  | Ars                      | 27588   | 206541   | -3840      | 430274      | 926145     | 15.6      | 33.57
3   | user #13047  | NCAA_Gundam              | 44547   | 180026   | -3307      | 322602      | 744092     | 7.24      | 16.7
4   | user #11314  | Kikimaru                 | 59652   | 179956   | -14952     | 317727      | 824897     | 5.33      | 13.83
5   | user #1      | albert                   | 131514  | 283523   | -4329      | 311596      | 1424246    | 2.37      | 10.83
6   | user #102191 | Mr_GT                    | 46699   | 270253   | -1467      | 279261      | 1194571    | 5.98      | 25.58
7   | user #15754  | Evangeline_A.K._McDowell | 28459   | 164048   | -1895      | 271430      | 628649     | 9.54      | 22.09
8   | user #49091  | nanami                   | 64792   | 185851   | -2065      | 269845      | 688443     | 4.16      | 10.63
9   | user #19599  | Apollyon                 | 23333   | 160806   | -3901      | 253020      | 886977     | 10.84     | 38.01
10  | user #81291  | CodeKyuubi               | 12917   | 135341   | -1472      | 239133      | 575542     | 18.51     | 44.56
11  | user #351692 | dean_exia                | 14264   | 113847   | -2718      | 228752      | 582483     | 16.04     | 40.84
12  | user #366860 | gary25566                | 24948   | 126220   | -1506      | 227115      | 410068     | 9.1       | 16.44
13  | user #49984  | dereyoruk                | 15489   | 133326   | -1653      | 210703      | 447979     | 13.6      | 28.92
14  | user #30072  | Doragonn                 | 25081   | 122132   | -1687      | 208806      | 457044     | 8.33      | 18.22
15  | user #307587 | psich                    | 22420   | 87533    | -862       | 189600      | 489302     | 8.46      | 21.82
16  | user #460797 | user_460797              | 12943   | 144479   | -1807      | 173934      | 295100     | 13.44     | 22.8
17  | user #356975 | SciFi                    | 26802   | 78308    | -3295      | 173778      | 353118     | 6.48      | 13.18
18  | user #13506  | RaisingK                 | 33145   | 115268   | -574       | 169716      | 309194     | 5.12      | 9.33
19  | user #20119  | Snesso                   | 22466   | 147889   | -297       | 151270      | 572183     | 6.73      | 25.47
20  | user #11077  | Gunflame                 | 14071   | 130845   | -841       | 147476      | 668480     | 10.48     | 47.51
21  | user #366578 | DakuTree                 | 16767   | 116616   | -244       | 132209      | 388142     | 7.89      | 23.15
22  | user #30466  | Mysterio006              | 13747   | 110508   | -425       | 124609      | 447036     | 9.06      | 32.52
23  | user #330014 | zaregoto                 | 10462   | 76895    | -1589      | 123137      | 236593     | 11.77     | 22.61
24  | user #371758 | zeparoh                  | 11417   | 88884    | -803       | 122663      | 214225     | 10.74     | 18.76
25  | user #62191  | v571866                  | 13729   | 53600    | -575       | 121689      | 317841     | 8.86      | 23.15
26  | user #136247 | Kadoya                   | 24318   | 61369    | -1266      | 117609      | 292031     | 4.84      | 12.01
27  | user #110107 | HNTI                     | 9702    | 81257    | -1313      | 116200      | 409189     | 11.98     | 42.18
28  | user #159945 | Tsuki_no_Sakura          | 19407   | 99427    | -526       | 115106      | 191013     | 5.93      | 9.84
29  | user #149704 | keonas                   | 10184   | 60866    | -1225      | 113699      | 224802     | 11.16     | 22.07
30  | user #95414  | Herrmobel                | 20256   | 109377   | -1024      | 112366      | 591594     | 5.55      | 29.21
31  | user #412246 | RinkaS                   | 9941    | 53628    | -1214      | 111051      | 171891     | 11.17     | 17.29
32  | user #34945  | Monki                    | 13204   | 74785    | -724       | 110866      | 222546     | 8.4       | 16.85
33  | user #369231 | Lannihan                 | 9048    | 62863    | -930       | 104697      | 184818     | 11.57     | 20.43
34  | user #12464  | Altered                  | 15855   | 55236    | -250       | 101231      | 236578     | 6.38      | 14.92
35  | user #39411  | Action_Kamen             | 15286   | 90192    | -1022      | 95318       | 490680     | 6.24      | 32.1
36  | user #206319 | Xeano94                  | 9857    | 53196    | -1182      | 90155       | 224057     | 9.15      | 22.73
37  | user #11896  | animeboy12               | 11198   | 62691    | -1706      | 86312       | 406400     | 7.71      | 36.29
38  | user #166417 | Randeel                  | 5120    | 67819    | -825       | 86067       | 143839     | 16.81     | 28.09
39  | user #155924 | magenta-crimson          | 18801   | 52406    | -1126      | 85622       | 181505     | 4.55      | 9.65
40  | user #372231 | lkjh098                  | 9568    | 26835    | -2672      | 84502       | 174546     | 8.83      | 18.24
41  | user #420773 | Fenen                    | 12607   | 28863    | -594       | 80471       | 153202     | 6.38      | 12.15
42  | user #133311 | Gauron1786               | 16280   | 50431    | -470       | 79910       | 181748     | 4.91      | 11.16
43  | user #14602  | Dbx                      | 18120   | 47623    | -455       | 79292       | 198847     | 4.38      | 10.97
44  | user #397518 | Sacriven                 | 9700    | 62221    | -1207      | 75693       | 133786     | 7.8       | 13.79
45  | user #15864  | Magus                    | 8932    | 60385    | -337       | 73323       | 144919     | 8.21      | 16.22
46  | user #108584 | Krugger                  | 20056   | 44826    | -1515      | 72319       | 218002     | 3.61      | 10.87
47  | user #348646 | Rastamepas               | 5549    | 53098    | -431       | 71271       | 161269     | 12.84     | 29.06
48  | user #39276  | john1980                 | 11788   | 31445    | -1509      | 68703       | 204381     | 5.83      | 17.34
49  | user #59648  | Jigsy                    | 6472    | 46830    | -474       | 67811       | 205339     | 10.48     | 31.73
50  | user #179709 | Anonymous9000            | 53375   | 45869    | -864       | 67264       | 196784     | 1.26      | 3.69
SELECT
  CONCAT("user #", STRING(u.id)) AS id,
  u.name AS user,
  COUNT(p.id) AS uploads,
  SUM(p.up_score) AS up_votes,
  SUM(p.down_score) AS down_votes,
  SUM(p.score) AS total_score,
  COUNT(p.favs) AS total_favs,
  ROUND(SUM(p.score) / COUNT(p.id), 2) AS avg_score,
  ROUND(COUNT(p.favs) / COUNT(p.id), 2) AS avg_favs,
FROM [danbooru-data:danbooru.posts] AS p
JOIN [turing-zone-143603:danbooru_latest.users] AS u ON
  p.uploader_id = u.id
GROUP BY id, user
ORDER BY total_score DESC

Ayyy, some hard evidence that is proof of my dedication to quality. It's a shame I can't code for shit, I'd like to do some querying myself but I can't see the benefit of taking hours to learn something I'll never use again.

CodeKyuubi said:

Ayyy, some hard evidence that is proof of my dedication to quality. It's a shame I can't code for shit, I'd like to do some querying myself but I can't see the benefit of taking hours to learn something I'll never use again.

Well, to be fair, nearly everyone above rank 15 is a veteran user in terms of join date :p.

Comics are another thing that rarely get high scores. Just 2 of the 390 images with 100+ score are comic rating:safe, despite this combination making up over 1 in 20 posts on the site. An average of 50 purely from uploading hentai is probably equivalent to about an average of 5 for uploading non-hentai comics.

The above data cannot be taken wholly at face value. Should someone who uploads a million posts with an average score of 1 be the top uploader? Should someone with only 40 posts with an average score of 50 be the top uploader? Should someone with the highest score count but also the lowest ratio of positive to negative score be the top uploader? I say no to all of the above.

Total positive score, total negative score, and total post count must all be taken into consideration individually; otherwise you start losing context. Some kind of weighting system where all of those values are normalized would be a bit more useful, but the weighting would have to be tailored to the interests of that particular user.
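One crude way to do that kind of weighting in the dump itself: damp each uploader's average score with a fixed prior, so a huge pile of score-1 posts and a tiny handful of score-50 posts both get pulled toward the middle. The prior of 10 phantom zero-score posts is an arbitrary choice, and this is just a sketch:

```sql
-- Average score damped by a prior of 10 "phantom" posts at score 0,
-- so neither mass uploaders nor tiny samples dominate the ranking.
SELECT
  uploader_id,
  COUNT(id) AS uploads,
  SUM(score) / (COUNT(id) + 10) AS damped_avg_score
FROM
  [danbooru-data:danbooru.posts]
GROUP BY uploader_id
ORDER BY damped_avg_score DESC
LIMIT 50;
```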

Provence said:

Score =/= Quality
Score = Popularity
And with that, users like Zettamorose, oecchi or bad mongo will never rank high there, since they mostly upload male_focus :3.

To be fair, less than 10% of my posts are Kancolle and Touhou, which are the biggest culprits of score = popularity != quality. About a third of my posts are original or have no copytag due to being unknown, and separately, about 30% of my posts are also scans.

You don't have to believe me, but I've always held myself to a high standard for uploading high-quality art, and not min-tagging images just to get an upload.

Yay! This is fun to play with. I decided to do some analysis of tags. There's been a lot of argument that pictures with certain tags get more points than others, but how true is it?

Top Tags by Total Score

https://docs.google.com/spreadsheets/d/1K_WZ9OuUOtGlRgAMTdY1Q1JKISIUXjv2V-CAIaQYbqQ/edit?usp=sharing

SELECT
  tags.name,
  COUNT(id) AS num,
  SUM(score) AS total_score,
  AVG(score) AS average_score,
  STDDEV_SAMP(score) as stddev_score,
  AVG(IF(rating = "e", score, NULL)) as average_score_e,
  AVG(IF(rating = "q", score, NULL)) as average_score_q,
  AVG(IF(rating = "s", score, NULL)) as average_score_s,
  AVG(rating = "e") * 100 as rating_e_pct,
  AVG(rating = "q") * 100 as rating_q_pct,
  AVG(rating = "s") * 100 as rating_s_pct
FROM
  [danbooru-data:danbooru.posts]
GROUP BY
  tags.name
ORDER BY
  total_score DESC
LIMIT
  100;

It looks like the answer is that tags do make a difference. comic doesn't show up in the top 100 at all, which probably means a very low average score. High-scoring tags on safe images include ass, small_breasts, bikini, collarbone, and panties. Low-scoring tags on safe images include monochrome, 1boy, school_uniform, glasses and weapon. Only two copyrights showed up: kantai_collection is unusually high-scoring, but touhou posts have average scores.
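The comic hunch could be checked directly rather than inferred from its absence in the top 100. An untested sketch:

```sql
-- Average score of comic posts, broken down by rating.
SELECT
  rating,
  COUNT(id) AS posts,
  AVG(score) AS average_score
FROM
  [danbooru-data:danbooru.posts]
WHERE
  tags.name = 'comic'
GROUP BY rating;
```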
