
Danbooru Archives

Posted under General

OK, so I went out and ordered a new HD; I'll grab the dump then. And as Shinjidude says, please don't RAR it if at all possible, it'll make it a pain to update or to pull parts out of. Although I realise torrents weren't made for packing 380K individual files, so some kind of organisation scheme is necessary.

EDIT: Actually, I've previously seen EVERYTHING.TORRENT, which had several hundred thousand files. It was so big it was distributed via a torrent itself, but other than that I believe it worked fine.

Shinjidude said:
This might be a nice idea. Does anyone know if it's possible to update a torrent as it's running? It would be useful to be able to keep it up-to-date as it went.

Nope. A .torrent is static, there's no way to update it.

It's not possible to update a torrent as it runs. Perhaps something like Arch Linux's pacman would work, where you sync a database against your local files and only download posts matching the tags you want.

I'd program it myself, but I only have a basic knowledge of Python and wouldn't even know where to start.

The images are distributed into directories based on the first 2 characters of the MD5, so I would have created 256 separate archives using the 'store' compression method. Alternatively, I could break it up into 10k-20k images per archive based on the post ID.
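
Roughly what that grouping looks like in Python, for reference (zipfile's ZIP_STORED standing in for RAR's store mode, since there's no stdlib RAR writer, and the paths are just placeholders):

    import os
    import zipfile

    SRC = "danbooru-dump"   # assumed layout: danbooru-dump/<2-char MD5 prefix>/<md5>.<ext>
    DST = "archives"
    os.makedirs(DST, exist_ok=True)

    # One uncompressed ("store") archive per two-character MD5 prefix, 256 in total.
    for prefix in (f"{i:02x}" for i in range(256)):
        src_dir = os.path.join(SRC, prefix)
        if not os.path.isdir(src_dir):
            continue
        with zipfile.ZipFile(os.path.join(DST, prefix + ".zip"), "w",
                             compression=zipfile.ZIP_STORED) as zf:
            for name in sorted(os.listdir(src_dir)):
                zf.write(os.path.join(src_dir, name), arcname=prefix + "/" + name)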

Something centralized like rsync isn't an option due to the size and number of files. It takes several hours for the daily backup to my NAS to complete over a gigabit LAN, with both the source and destination volumes running five HDDs in RAID.

Grouping based on prefix sounds appropriate. It'd be nice to have a way of dynamically distributing something in a decentralized way that can auto-update, though. Maybe I'll do some research to see if such a mechanism exists.

I still think they should go unpacked. This way you can actually move to a newer snapshot using the same target/source dirs, without fighting with 256 outdated RARs. And anyway, if you're packing, do not use RAR. I just remembered that RAR has the same stupidity tar has, i.e. no sane central header, so it has to scan the whole archive just to list the contents. It's a huge pain for big archives with many files.

葉月 said:
I still think they should go unpacked. This way you can actually move to a newer snapshot using the same target/source dirs, without fighting with 256 outdated RARs. And anyway, if you're packing, do not use RAR. I just remembered that RAR has the same stupidity tar has, i.e. no sane central header, so it has to scan the whole archive just to list the contents. It's a huge pain for big archives with many files.

I'll just do unpacked; creating the torrent now. I'm used to using RAID arrays and SSDs, so large archives don't affect me as much as other people :D

yosome said:
I'll just do unpacked; creating the torrent now. I'm used to using RAID arrays and SSDs, so large archives don't affect me as much as other people :D

It's not just about unpacking, but about the fact that with archives you prevent the possibility of updating the data in place. So people can't just move over to a newer snapshot without redownloading everything, etc.

Leaving it unpacked would also help with updates: someone who has an older copy can just point the updated torrent at their existing copy. The client will hash-check the existing content before downloading, so the user only has to download the new material. It'll also create seeds faster (for the same reason).

Edit: And 葉月 already mentioned this. So, yeah.

Well, a single torrent with all 400,000 images doesn't appear to be possible. The resulting .torrent (2MB piece size) is 53.2MB, mostly from the file names. µTorrent can't even open the .torrent file. Azureus consumes about 2.7GB of RAM before hanging. The original Python client hangs as well, while consuming around 1.5GB of RAM.

Packing the files into RAR archives with the store method won't require downloading the whole set again for updates. When a file is added to a RAR archive, it is appended onto the end, and the 'store' method doesn't perform any compression, so the existing data in the archive is left byte-for-byte untouched. BitTorrent hash-checks pieces, not individual files, so clients only need to download the pieces that have changed; the existing pieces still pass the hash check.
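
A toy example of the piece check, with a single byte string standing in for the concatenated archive data (in a real multi-file torrent the pieces run across file boundaries, so the last partial piece of the old data would also change if new data gets appended mid-piece):

    import hashlib

    PIECE_SIZE = 2 * 1024 * 1024  # 2MB pieces, as used for the full-set torrent

    def piece_hashes(data, piece_size=PIECE_SIZE):
        # SHA-1 of each fixed-size piece, the way BitTorrent slices the payload.
        return [hashlib.sha1(data[i:i + piece_size]).hexdigest()
                for i in range(0, len(data), piece_size)]

    old = b"x" * (10 * PIECE_SIZE)       # stand-in for the existing archive contents
    new = old + b"y" * (3 * PIECE_SIZE)  # the same contents with new files appended

    # Every old piece keeps its hash, so an existing copy passes the recheck
    # for those pieces and only the appended ones have to be downloaded.
    assert piece_hashes(new)[:10] == piece_hashes(old)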

yosome said:
Well, a single torrent with all 400,000 images doesn't appear to be possible. The resulting .torrent (2MB piece size) is 53.2MB, mostly from the file names. µTorrent can't even open the .torrent file. Azureus consumes about 2.7GB of RAM before hanging. The original Python client hangs as well, while consuming around 1.5GB of RAM.

That's odd. I made some test torrents with 100k files (100MB of data in total) and they were only 3.6MB in size (almost all of that file names; the piece size didn't change the resulting file's size in any noticeable way). I'll try again with a bit more data and longer file names.

Packing the files into RAR archives with the store method won't require downloading the whole set again for updates. When a file is added to a RAR archive, it is appended onto the end, and the 'store' method doesn't perform any compression, so the existing data in the archive is left byte-for-byte untouched. BitTorrent hash-checks pieces, not individual files, so clients only need to download the pieces that have changed; the existing pieces still pass the hash check.

That's only true assuming you download the RARs and never touch them, which means keeping both the archives and an unpacked copy around, so the snapshot now takes 260GB instead of 130GB. That's a big hit. If anything, grouping the files by post ID is the only way that'd let us roll out updates which could be downloaded without having everything else at hand. But it's still not gonna help with seeding, as you can't seed from the unpacked snapshot anymore. Maybe splitting it into 4 or so torrents, each having a portion of the prefixes (0-4, 5-9, etc.), would be a way to bring it down to a manageable level?
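
For what it's worth, cutting the 256 prefixes into four equal ranges is trivial to express, e.g.:

    # Split the 256 two-character MD5 prefixes into 4 contiguous ranges of
    # 64 directories each: 00-3f, 40-7f, 80-bf, c0-ff.
    prefixes = [f"{i:02x}" for i in range(256)]
    groups = [prefixes[i:i + 64] for i in range(0, 256, 64)]
    for g in groups:
        print(f"{g[0]}-{g[-1]}: {len(g)} dirs")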

葉月 said:
That's odd. I made some test torrents with 100k files (100MB data in total) and they were only 3.6MB in size (almost 100% of that filenames, the piece size didn't change the resulting file's size in any noticeable way). I'll try again with a bit more data and longer filenames.

Each file name is 32 bytes, not including the extensions or any path information. Multiply that by 400,000 and you're looking at over 12MB of text.
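
As a rough back-of-the-envelope split of where that space goes (ignoring the bencoding overhead, per-file length fields and path lists, which is part of why the real .torrent comes out bigger):

    files      = 400_000
    name_bytes = 32 + len(".jpg")   # 32-character MD5 plus an extension
    data_size  = 130 * 1024**3      # roughly 130GB of images
    piece_size = 2 * 1024**2        # 2MB pieces

    names  = files * name_bytes                 # ~13.7MB of file names
    pieces = -(-data_size // piece_size) * 20   # 20-byte SHA-1 per piece, ~1.3MB
    print(f"file names:   {names / 1024**2:.1f} MB")
    print(f"piece hashes: {pieces / 1024**2:.1f} MB")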

葉月 said:
Maybe splitting it into 4 or so torrents, each having a portion of the prefixes (0-4, 5-9, etc.) would be a way to bring it down to a manageable level?

Attempting this now with 00-3f. 107448 files totaling 32.15GB.

yosome said:
Each file name is 32 bytes, not including the extensions or any path information. Multiply that by 400,000 and you're looking at over 12MB of text.

So I just retried with 100k files, the same dir structure as Danbooru, and 32 bytes + .jpg per file name. The result is 4.5MB for 600MB of data vs. 4.7MB for 4GB of data, both with 512KB pieces (that's the biggest Deluge allows, for whatever reason). So yeah, the file names do add up to 12MB+, but given how little the size grows with more data, it shouldn't really be more than maybe 20MB total with 2MB pieces. And it seems that Deluge has no problems opening a 100k-file torrent; it just takes 40s or so to chew through it before it displays the selection dialogue.

Updated

I ran into issues with the HD and then µTorrent, which made me stop downloading until now. Any chance you could seed it again? Also, while we're at it, could you post all the rest of the torrents, if it's not inconvenient for you?

I've started re-seeding 4-7, though my bandwidth isn't great. If someone could seed 0-3 for a while as well, I'd appreciate it, as I only got about 2/3 through it before it died. A future 8-9 would be much appreciated too if it's not too much trouble.

EDIT: I don't think the tracker is up, actually, but you should still be able to find peers via DHT.

Ah, so that's why it's so slow. I'll try beating µTorrent into seeding 0-3 tomorrow, but I had to remove it from the torrent list for now, as it pretty much crashes when both are in the list (2h+ startup times are more or less equivalent to a crash, IMHO).
