Danbooru

Image search

Posted under General

petopeto said:
I'm just cutting off at a fixed value: 90% by default, 30% on "show more", or whatever's specified in the "threshold" parameter. If you include a value for the suggested threshold, I can use it both as the default and as the dupe-check threshold.

You might as well not use any threshold at all for "show more", and just show all 16 results. Using 30% means you either get none of the other results (for grayscale images where the noise level is often below 30%), or you get all of them anyway.

I do now return the suggested threshold level anyway. In my tests it was always slightly above the first noise pic, though I'm open to tuning the std.dev. threshold of 5% if needed. In case of no relevant match, I return a 90% threshold. You should use the returned threshold to decide match relevance, and use no threshold (or 0%) for "show more".
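For illustration, here's a rough sketch of how such a suggested threshold could be derived from the noise scores. This is my guess at the approach, not the actual iqdb code; the fixed multiplier stands in for the tunable std.dev. parameter mentioned above:

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

// Hypothetical sketch: treat everything below the best match as noise and
// place the cutoff a few standard deviations above the noise mean. Falls
// back to 90% when there are too few samples to estimate the noise floor.
double suggested_threshold(std::vector<double> scores) {
    if (scores.size() < 3) return 90.0;          // too few samples: fallback
    std::sort(scores.begin(), scores.end(), std::greater<double>());
    size_t n = scores.size() - 1;                // all but the best match
    double mean = 0.0, var = 0.0;
    for (size_t i = 1; i < scores.size(); ++i) mean += scores[i];
    mean /= n;
    for (size_t i = 1; i < scores.size(); ++i)
        var += (scores[i] - mean) * (scores[i] - mean);
    double cutoff = mean + 5.0 * std::sqrt(var / n);  // "5" = tunable margin
    return std::min(90.0, cutoff);               // never exceed the fallback
}
```

With one clear match among noise, the cutoff lands slightly above the first noise pic, which matches the behavior described above.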

Another good post to test this is http://moe.imouto.org/post/similar/609?services=all

This is trickier, but: is there any way we could get tags for results?

I really don't want to attempt to mirror Danbooru's DB. I can return the image resolution because it never changes, and because the image DB has fields for it anyway. But tracking tag and rating changes would be a nightmare, I'd have to retrieve the entire Danbooru DB at regular intervals. Or for tags one could implement a JSON/XML interface to query recent tag changes, but there isn't even an internal change log for ratings, so retrieving that would be impossible. But even if it were feasible I'd like to avoid this if at all possible.

You could however retrieve tags and ratings with a standard JSON/XML call for each relevant match. You might even cache the result. I decided not to do this for myself because it adds a delay to showing the matches, especially if one of the services is down. If you do it in JS in the browser, it would not slow down the results at all, but the user might briefly see images he does not want to see.

If you decide to do it either way, it would be useful to implement a JSON/XML query interface in the danbooru engine that takes multiple MD5s for all relevant matches from that service, so you only have to make one call per service instead of one for each match. This should not be too much work.
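On the client side, the batched call might look something like this. The endpoint and the comma-separated md5 parameter are assumptions for illustration; this interface doesn't exist in Danbooru yet, which is the point of the suggestion:

```cpp
#include <string>
#include <vector>

// Sketch of the proposed batched lookup: one request per service carrying
// every relevant MD5, instead of one request per match. The URL format is
// hypothetical.
std::string batched_md5_query(const std::string& base,
                              const std::vector<std::string>& md5s) {
    std::string url = base + "/post/index.xml?md5=";
    for (size_t i = 0; i < md5s.size(); ++i) {
        if (i) url += ',';       // comma-separate the hashes
        url += md5s[i];
    }
    return url;
}
```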

piespy said:
I really don't want to attempt to mirror Danbooru's DB. I can return the image resolution because it never changes, and because the image DB has fields for it anyway. But tracking tag and rating changes would be a nightmare, I'd have to retrieve the entire Danbooru DB at regular intervals. Or for tags one could implement a JSON/XML interface to query recent tag changes, but there isn't even an internal change log for ratings, so retrieving that would be impossible. But even if it were feasible I'd like to avoid this if at all possible.

Adding XML to post_tag_history_controller/index would be fairly trivial, and I think it's important that Danbooru track ratings internally too (there's a ticket for this, don't know if it'll be implemented).

You could however retrieve tags and ratings with a standard JSON/XML call for each relevant match. You might even cache the result. I decided not to do this for myself because it adds a delay to showing the matches, especially if one of the services is down. If you do it in JS in the browser, it would not slow down the results at all, but the user might briefly see images he does not want to see.

can_see_post needs to be done server-side. I don't like putting load on other image boards for every image search. If any remote server is down (all danboorus go down fairly regularly), the call fails, and with it either the results or the filtering; with four separate servers in the mix for every query, problems would be common. I don't think this is a workable solution.

Maybe I could do this on a separate metadata-mirror server, running on the same host as the danbooru. I'll ping that ticket and see if rating history will be implemented any time soon, or if it's just a maybe-someday wishlist.

If everyone who uses your search mod had to maintain a copy of each service's DB, then it'd be better to integrate it into the search after all.

With a nice interface to query recent tag and rating changes (ideally, all changes since $TIMESTAMP), it should be workable after all. Since it's just storing and retrieving, and no tag searches or other complicated DB queries, performance should be no problem.
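To illustrate why the storage side stays simple, here's a minimal sketch of applying such a change feed to a metadata mirror. All type and field names are hypothetical:

```cpp
#include <map>
#include <string>
#include <vector>

// One entry from the hypothetical "changes since $TIMESTAMP" feed.
struct Change {
    std::string md5;
    std::string tags;
    char rating;            // 's', 'q' or 'e'
};

// Applying a batch is just key overwrites keyed by MD5 -- no tag search,
// no complicated DB queries, so performance should indeed be no problem.
void apply_changes(std::map<std::string, Change>& mirror,
                   const std::vector<Change>& batch) {
    for (const auto& c : batch)
        mirror[c.md5] = c;  // last write wins; older state is irrelevant
}
```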

Well, I'll have to wait for these changes anyway.

Oh, and I wanted to mention that the search would be a little faster if you just passed the URL of the preview on moe.imouto.org. For pics that I already have in the DB, the script simply uses that data, so you eliminate having to transfer the image data in most cases, and I can also avoid having to make a thumbnail since I already have one.

From my Apache logs, queries that upload image data take on average about 100 ms longer than similar queries passing only a URL. A file upload should probably be used only if you don't have a URL to pass instead.
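A client could prefer the URL path and fall back to uploading only when no public URL exists, e.g. (endpoint and parameter names assumed for illustration):

```cpp
#include <cctype>
#include <cstdio>
#include <string>

// Minimal percent-encoder for the url= parameter.
std::string url_encode(const std::string& s) {
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        if (std::isalnum(c) || c == '-' || c == '.' || c == '_' || c == '~') {
            out += c;
        } else {
            std::snprintf(buf, sizeof buf, "%%%02X", c);
            out += buf;
        }
    }
    return out;
}

// Sketch: pass the preview URL when we have one (cheap path, the server
// already has the image or can fetch it), otherwise signal that the
// caller must POST the file instead.
std::string build_search_query(const std::string& preview_url) {
    if (!preview_url.empty())
        return "search?url=" + url_encode(preview_url);
    return "search";   // no URL available: fall back to a file upload
}
```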

I don't think the tag thing is urgent: it's a problem for the general image search, but not for the dupe search, since it connects local hits to local posts, and all of those are processed just like regular local results; blacklisting those works fine (can_see_post probably doesn't yet, I'll check that later). But I'd like to have it working in general eventually.

For pics that I already have in the DB, the script simply uses that data, so you eliminate having to transfer the image data in most cases, and I can also avoid having to make a thumbnail since I already have one.

I'll make this an option, since it won't work for private servers. If it'll help, I can add a "thumbnail=0" parameter to searches for local posts, too; since the server already has thumbs for its own images, you can always skip thumbnailing those. It only needs the thumbs for searching remote URLs.

Can you put the real PHP code up? This mini one is falling more and more out of date.

Something broke when you added the url argument; it's always passed in empty here.

Maybe you should show a message when the query returns an error, so that you can see when there's a problem, instead of just showing no posts, which could also be due to no similar posts existing.

petopeto said:
Can you put the real PHP code up? This mini one is falling more and more out of date.

OK... http://haruhidoujins.yi.org/iqdb-php-20080229.tar.bz2

It's very specific to my site though. I'm not sure if it'll be usable anywhere else without quite a bit of work. I should probably just maintain the mini script better instead. Then again, with the proposed changes the mini script wouldn't be too useful anymore either...

Well, if you have any suggestions on how to make the script more adaptable please let me know.

I'd suggest making the mini-script the primary interface to the engine (searching, thumbnailing, thresholding, etc), and using that as the backend for your HTML results, too. There'll only be one script to maintain that anyone else will really need.

It shows the server as (down) in the service list on the left if there's no response. I'll have it do that on <error>, too. The actual errors are available in the XML output.

OK, I've refactored it now. The backend code is in web/iqdb-php.inc, and the thin XML frontend in web/iqdb-xml.php, with site-specific settings stored in web/iqdb-opt.inc. Make sure to set up a cronjob to regularly clean out old thumbnails generated from the uploaded data.

http://haruhidoujins.yi.org/iqdb-20080303.tar.bz2

It also has the latest DB code, including a decently working find_duplicates function (though this is very raw and has no front-end, it just outputs the internal image IDs). I hope I didn't break anything though...

MugiMugi said:
Guess I need to say thx for that code :) I don't own a Linux server, and the code didn't compile on my BSD box due to network differences.

It compiles fine on OS X though, which is BSD-based. Without error messages it's hard to guess why it failed. Maybe you didn't have the ImageMagick devel package? I suppose I should look into autoconf et al., but without a box to test it on (besides OS X and Linux, where it already works), it's hard to do much with it. And I've never worked with autoconf before... other than running ./configure regularly.

# gmake
g++ -c -o iqdb.o iqdb.cpp -O2 -DNDEBUG -Wall -DLinuxBuild -DImMagick -g `pkg-config --cflags ImageMagick`
iqdb.cpp: In function `void server(const char*, int, char**)':
iqdb.cpp:493: error: `IPPROTO_TCP' was not declared in this scope
iqdb.cpp:496: error: aggregate `sockaddr_in bindaddr' has incomplete type and cannot be defined
iqdb.cpp:551: error: aggregate `sockaddr_in client' has incomplete type and cannot be defined
gmake: *** [iqdb.o] Error 1

Cannot provide much more info than that; I'm no expert on C/C++.

Looks like BSD defines the constants in different headers. After a little googling, adding these at the top of iqdb.cpp might help:

#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>

If not, someone who actually knows BSD would need to help.

Updated version

http://iqdb.yi.org/iqdb-20080417.tar.bz2

Aside from some bugfixes I've reworked the update code. It now modifies the DB file directly instead of first reading it all in, and then writing it all back. So updates are much faster now. It's also possible to update a running server, because it now supports the "add" and "remove" commands. Note that these changes cannot be saved to disk, so you will need to update the DB file in parallel.
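To make the parallel-update requirement concrete, here's a sketch of the two commands an updater would have to issue for each new image. The command syntax is assumed for illustration, not the real iqdb protocol:

```cpp
#include <string>
#include <utility>

// An "add" sent to the running server only changes its in-memory DB, so
// the same image must also be added to the DB file for the change to
// survive a restart. Returns both command strings (live server, on-disk).
std::pair<std::string, std::string>
add_image_commands(const std::string& dbfile, const std::string& image) {
    std::string live = "add " + image;                        // to the server
    std::string disk = "./iqdb add " + dbfile + " " + image;  // on-disk update
    return {live, disk};
}
```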

To enable these changes I've had to redesign the DB file format. To upgrade an old version of the file, use "iqdb rehash foo.db". Afterwards it can no longer be read by older versions of the code, so work on a copy first until you're sure it all works fine.

I STRONGLY RECOMMEND making a backup of your DB before upgrading, especially if you have a lot of pics that would take forever to rebuild. I've tested these changes a lot and they work for me, but bugs are always a possibility. If the program aborts without the DB being closed properly, it can become corrupt (since it's now updated directly and not rewritten completely).

Another change is that by default the server reads the entire DB into memory, so that searching is possible without having to read anything from disk. If you are short on memory, uncomment the "DEFS+=-DUSE_DISK_CACHE" line in the Makefile before compiling to reduce memory usage to the absolute minimum, and read query data from disk as needed (using mmap).

I've also added a testing application to check that all DB operations work as they should. Build it with "make test-db", put any test.jpg file in the current directory, and simply run it; it will do various regression tests. If it finishes without an exception being thrown, it's all working.

For anyone else running Opera, I've found a neat way to add a "Query IQDB" item to the context menu of images. First you need to make a custom menu file (if you don't have one already) and copy the [Image Link Popup Menu] and [Image Popup Menu] sections from the default menu file. Then add the following line to both sections where you want the entry to be:

Item, "Query image on IQDB" = Copy image address & Go to page, "http://iqdb.yi.org/?url=%c"

This does a multi-service search query; if you only want to search Danbooru, change the hostname to danbooru.iqdb.yi.org.

By the way, if albert is reading this, we should probably change the "Similar" link on the post pages to point to iqdb.yi.org rather than haruhidoujins.yi.org.

Thanks for keeping at this, piespy, this is as useful a tool as it ever was :)
