[WIP] Add preliminary phash ETL #76
base: master
a3a08c9 is a little bit of cleanup, and also manually wraps the whole action in a transaction. I discovered this was necessary after running the ETL against a database loaded with ads downloaded from the ProPublica data store. Under load, the reads and writes got a chance to overlap in time, which uncovered an error in the code before a3a08c9: my streaming SQL library was automatically creating a read-only transaction whenever it detected it was being run with no transaction. Manually wrapping the whole thing in my own transaction allows reads and writes to be interleaved. Running 10000 ads took about 15 minutes with the default parallelism settings. |
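For anyone following along, here is a rough sketch of the "one explicit transaction around streamed reads and batched writes" idea. It is not the Haskell code from the PR: it uses Python's `sqlite3` with an in-memory database as a stand-in for Postgres, and `placeholder_hash`, `hash_all_ads`, and the three-column `ads` table are hypothetical names invented for the illustration.

```python
import hashlib
import sqlite3


def placeholder_hash(url: str) -> str:
    """Stand-in for a real perceptual hash; it only hashes the URL text."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()[:16]


def hash_all_ads(conn: sqlite3.Connection, batch_size: int = 2) -> int:
    """Stream rows and write results back inside one explicit transaction."""
    updated = 0
    conn.execute("BEGIN")  # one read-write transaction for the whole job
    try:
        reader = conn.cursor()
        writer = conn.cursor()
        reader.execute("SELECT id, url FROM ads ORDER BY id")
        while True:
            rows = reader.fetchmany(batch_size)  # read a small batch
            if not rows:
                break
            # Writes are interleaved with the still-open read cursor.
            writer.executemany(
                "UPDATE ads SET phash = ? WHERE id = ?",
                [(placeholder_hash(url), ad_id) for ad_id, url in rows],
            )
            updated += len(rows)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return updated


def demo() -> tuple:
    # isolation_level=None turns off implicit transactions, so BEGIN and
    # COMMIT above are the only transaction boundaries for the job.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE ads (id INTEGER PRIMARY KEY, url TEXT, phash TEXT)")
    conn.executemany(
        "INSERT INTO ads (url) VALUES (?)",
        [(f"https://example.com/{i}.png",) for i in range(5)],
    )
    updated = hash_all_ads(conn)
    remaining = conn.execute(
        "SELECT COUNT(*) FROM ads WHERE phash IS NULL"
    ).fetchone()[0]
    return updated, remaining
```

Calling `demo()` runs the job over five fake ads and returns how many rows were updated and how many still have a NULL `phash`.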
this looks awesome and I intend to take a deeper look tomorrow, hopefully in the morning. thanks!! |
:) Great! No hurry! |
Hope I'm not jumping the gun too much, but I wanted to explore some search ideas. Implemented in ca9bfd6. There's also some code to generate a view of the search results, which isn't intended for public consumption, but just for debugging the search and giving a sense of how phash distances map to perceptual similarity. I'm not thrilled with the ability to find similar images, but probably I need to import more images before I can get a sense for whether search-by-meme will work when backed by phash. Same caveats as before: needs cleanup and documentation, open to rewriting in a language already used in Example CLI call and output:
("range-bounds" search type finds some images at various phash distances from the query target - really useful for seeing what phash considers a good match vs. a terrible match) The top match doesn't say anything about phash usability unfortunately, because my search query was a url taken directly from the Oh and I see the LICENSE file has a Greg Hale copyright. That's just autogenerated boilerplate. During cleanup of course ProPublica can have the copyright and pick the license. |
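To make the "range-bounds" idea concrete, here is a minimal sketch of how such a search could work, assuming phashes are compared by Hamming distance over integer hashes. The function names, the half-open `(lo, hi)` ranges, and the "first candidate in each range" rule are assumptions for illustration, not the behavior of the CLI in ca9bfd6.

```python
from typing import Dict, List, Optional, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where two integer hashes differ."""
    return bin(a ^ b).count("1")


def range_bounds_search(
    query: int,
    candidates: Dict[str, int],
    bounds: List[Tuple[int, int]],
) -> List[Optional[Tuple[str, int]]]:
    """For each half-open range [lo, hi), return the first candidate whose
    distance from the query falls in that range, or None if there is none."""
    results: List[Optional[Tuple[str, int]]] = []
    for lo, hi in bounds:
        match = None
        for url, candidate_hash in candidates.items():
            distance = hamming_distance(query, candidate_hash)
            if lo <= distance < hi:
                match = (url, distance)
                break
        results.append(match)
    return results
```

Given one sample per distance range, you can eyeball where "looks similar" turns into "looks unrelated" and pick a cutoff from that.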
Hey @imalsogreg, this looks awesome. Seriously impressive! (Especially the parallelism and speed you mention.) I don't know much about nix. We use Docker almost exclusively and however I end up deploying this, it'll be as a Docker image. So that means Haskell is probably fine for now, since it won't have to share dependencies or live on the same server as anything else. Maintainability-wise, something like Ruby or Python always makes everyone's lives easier, but if you find that you've scratched your itch and don't want to port it, Haskell can work in production. I'm working on getting it installed now, locally, with homebrew. I'll keep some notes, but so far it's just been zeromq and I'm going to have to take a look at installing phash from source, it looks like. I'll keep you posted, but I love the work so far! |
@jeremybmerrill ooh I shudder thinking of trying to wrangle the dependencies in homebrew from scratch - but if you're happy to try, it may work! I can build this into a docker container, if you're interested in playing with it sooner than later and dependencies get frustrating. Thanks for the kind words! I'll stick around as a contributor until this gets used and finally retired. So feel free to set priorities in that context - if the most useful thing for me to do is to port to ruby or python in order to help you and other contributors, that's totally fine. If other priorities come first, also fine :) A hint for building phash - use the configure flags and post-configure steps here: https://github.com/NixOS/nixpkgs/blob/98c1ad879a34954944972ac7465343d325f7156b/pkgs/development/libraries/phash/default.nix#L16-L19 |
Ah, thanks for those phash hints! I think I'm almost there getting this running in Docker. I gave up on doing it locally once I realized I'd have to build pHash from source; not going to try to do that twice :). Right now, it looks like the Haskell app almost builds. It complains about not being able to find
|
Nice. I forgot to add |
Ha, I've done it a million times. Thanks! I will hope to pick this up again tomorrow. I got it building correctly in a Dockerfile and |
Oh hey, and it looks like it's calculating phashes. Seems to require me to run |
I get this error occasionally. Do you know what it is?
|
You got the dependencies sorted? Rad! I haven't seen your error, but Quick disclaimer that I wouldn't trust the cli tool with prod database yet 😅 Yah, |
Got it. Here's an easy question: Right now, I kick off all the Haskell stuff with |
@jeremybmerrill a couple options for the sandboxed build:
|
@imalsogreg ah, great! I see it there in Which, FYI, is this:
|
Looks like installing graphicsmagick fixes that Incidentally, the I'm trying to search now Unfortunately, I can't get anything but a blank HTML page. (I mean, it has HTML content, but the search results are nothing but the exact image I passed in.) I wonder if the problem is the |
- Null is the new sentinel value for computing phashes during ETL
- During search, only non-null-phash rows are considered
- All considered rows must have images length == phash length
- Number of considered rows must be greater than 0
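The rules above could be checked roughly like this. It is only a sketch: the `AdRow` type, the function name, and the choice to raise `ValueError` on a violation are assumptions, since the commit message does not say what happens when a rule is broken.

```python
from typing import List, NamedTuple, Optional


class AdRow(NamedTuple):
    """Hypothetical shape of one row from the ads table."""
    images: List[str]
    phash: Optional[List[str]]  # None means "not computed yet"


def rows_considered_for_search(rows: List[AdRow]) -> List[AdRow]:
    """Apply the search-time rules to rows read from the database."""
    # Only rows whose phash has been computed (non-null) are considered.
    considered = [row for row in rows if row.phash is not None]
    # Every considered row needs exactly one phash per image.
    for row in considered:
        if len(row.images) != len(row.phash):
            raise ValueError("images and phash lengths differ")
    # A search over zero rows is treated as an error in this sketch.
    if not considered:
        raise ValueError("no rows with computed phashes")
    return considered
```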
@jeremybmerrill I think probably the hard failure on image download could be responsible for both issues? When the download fails, the whole transaction populating I just pushed a commit that silences download errors, so they won't bother the data load. I also changed the behavior of
If the number of valid rows from the db for a search is zero, Could you rerun the Sorry, bugs :( A good |
Hey @imalsogreg, I think it was my bad with the search function, I think I had made some modification that messed it up. Now I'm a little bit confused. Is the |
@jeremybmerrill |
Without the annotations, some possible combination of psql, haskell postgres libraries, and libpq was resulting in a type error during the sql query (`phash` treated as `text` rather than `text[]`)
027173c to 7e06414
Proof of concept phash ETL job (#74)
This adds data to a `phash` column in the `ads` table, with the perceptual hash of each image in the ad. A prereq for doing reverse image search.

Written in Haskell to make it easy for me to experiment, but I'm happy to rewrite in the other languages from the project if that will help with maintainability. The code here is more for demo of the approach.

There is a single executable with 3 subcommands: one to update the `ads` table with phashes for all images in each ad, one to reset that column, and one to check database connectivity.

The implementation is careful to stream rows from `ads` and to batch writes back in to update the `phashs` column, since your database may be larger than what fits in memory on the cron machine driving this. It also keeps a least-recently-used cache of phashes for urls, since we expect a lot of redundancy in ad image urls across different users' ads, and we don't want to repeatedly download those. I did some light testing with my local database and verified the basic logic and caching work. I need to do more work to verify that the threading stuff works and gives a speedup.

The code in the PR has a dependency on `phash` bindings that might not be available on your system. I've been using nix locally to manage the dependencies across languages. This (and a few other things) may prevent this branch from running on your machine. I wanted to get comments on the approach before doing the work of trying to pick versions that work with the package managers on the machines you use in dev. If you don't mind using nix for this, I find that a very clean way of sharing dependencies across machines. But of course if you don't like nix or your environment rules it out, I can get things working in your current standard way.

Future directions: it would be very natural to add one or two commands to the executable for image-search-by-similarity. That would be too expensive a query for users to run from scratch (full table scan for every request) - I have a couple ideas for how to store state across runs to make that search cheap.
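As a rough illustration of the caching and batching described above (not the Haskell implementation in this PR), here is a small Python sketch. `LRUCache` and `batched` are hypothetical helper names, and the `compute` callback stands in for "download the image and compute its phash".

```python
from collections import OrderedDict
from typing import Callable, Iterable, Iterator, List


class LRUCache:
    """Tiny least-recently-used cache mapping url -> phash."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def get_or_compute(self, url: str, compute: Callable[[str], str]) -> str:
        if url in self._items:
            self._items.move_to_end(url)  # mark as most recently used
            return self._items[url]
        value = compute(url)  # cache miss: do the expensive work
        self._items[url] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used
        return value


def batched(items: Iterable, size: int) -> Iterator[List]:
    """Yield lists of at most `size` items without materializing the input."""
    batch: List = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

The cache keeps memory bounded while still skipping repeat downloads for urls seen recently, and `batched` lets each write cover several rows without holding the whole table in memory.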
Any comments/thoughts appreciated! :) Looking forward to getting to the next steps.