[py-tx] embeded tx hash for unletterboxing #1684

Merged: 5 commits, Nov 27, 2024

Conversation

@Mackay-Fisher Mackay-Fisher (Contributor) commented Nov 6, 2024

Preprocessing and Hashing Enhancements in ThreatExchange CLI


This update introduces new preprocessing functionality to remove black letterbox borders and apply image rotations before hashing, improving the flexibility and accuracy of image-based workflows. It resolves issue #1666 and integrates preprocessing with brute-force rotation workflows for more robust hashing options.

Summary of Changes

Enhanced unletterbox Method in photo.py

The unletterbox method preprocesses images by detecting and removing black letterbox borders, with support for flexible configuration:

  • Parameters:
    • file_path: The path to the input image file.
    • black_threshold: Sets the brightness threshold for detecting black borders (default: 10).
  • Returns: Cropped image bytes, enabling in-memory processing or downstream workflows.
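
A minimal usage sketch of the method as described above (the module path and input file name are assumptions, and the merged signature may differ slightly):

```python
# Illustrative only: module path and input file are assumed, not taken from the PR diff.
from threatexchange.content_type.photo import PhotoContent

# Returns the cropped image bytes; the caller decides whether to hash them
# in memory or write them to disk.
cropped_bytes = PhotoContent.unletterbox("letterboxed_example.jpg", black_threshold=10)
```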

Updated HashCommand in hash_cmd.py

The HashCommand class now supports preprocessing control through additional command-line arguments:

  • --photo-preprocess: Specifies the preprocessing type:
    • unletterbox: Removes black borders from images.
    • rotations: Generates all simple rotations of the image (e.g., ROTATE90, FLIPX).
  • --black-threshold: Sets the brightness threshold for detecting black borders. Lower values make the border detection more sensitive, while higher values make it less sensitive.
  • --save-preprocess: If enabled, saves the processed image to disk with a suffix reflecting the preprocessing type (_unletterboxed, _ROTATE90, etc.).
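
A rough argparse sketch of how these flags could be declared (flag names come from the list above; the defaults and help strings are illustrative and may differ from the merged code):

```python
import argparse

ap = argparse.ArgumentParser()
ap.add_argument(
    "--photo-preprocess",
    choices=["unletterbox", "rotations"],
    help="preprocessing to apply to photos before hashing",
)
ap.add_argument(
    "--black-threshold",
    type=int,
    default=10,
    help="brightness threshold for black border detection (default: %(default)s)",
)
ap.add_argument(
    "--save-preprocess",
    action="store_true",  # store_true implies a False default, as noted in review below
    help="save the preprocessed image data as new files",
)
```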

Improved Handling of Image Formats and Extensions

  • The output file format is now inferred from the input file's extension, ensuring consistency between the input and output file types.
  • Default fallback to PNG for unsupported or missing extensions.
  • Output filenames dynamically reflect the preprocessing type (_unletterboxed, _ROTATE90, etc.) and retain the original format (e.g., .jpg, .png).
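
A small sketch of the extension handling described above (the helper name is hypothetical, and only the missing-extension fallback is shown; handling unsupported formats would need an explicit allow-list):

```python
from pathlib import Path

def preprocessed_output_path(file: Path, preprocess_suffix: str) -> Path:
    """Append the preprocessing suffix while keeping the original extension."""
    ext = file.suffix or ".png"  # fall back to PNG when the extension is missing
    return file.with_name(f"{file.stem}{preprocess_suffix}{ext}")

# e.g. preprocessed_output_path(Path("bridge.jpg"), "_unletterboxed") -> Path("bridge_unletterboxed.jpg")
```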

Expanded Testing in test_pdq_letterboxing.py

  • test_unletterbox_image: Verifies that unletterboxed image bytes generate the expected PDQ hash.
  • test_rotations_with_photo_content: Validates that all eight rotations produce distinct and accurate hashes.
  • test_unletterboxfile_creates_output_file: Ensures that the processed file is correctly saved with the expected suffix when --save-preprocess is enabled.
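
A condensed sketch of what the first test might assert (import paths are assumed, and the fixture path and expected hash are hypothetical placeholders; hash_from_file and hash_from_bytes are the existing signal-type methods discussed later in this thread):

```python
from threatexchange.content_type.photo import PhotoContent
from threatexchange.signal_type.pdq.signal import PdqSignal

LETTERBOXED_IMAGE_PATH = "tests/data/letterboxed_sample.png"  # hypothetical fixture path
EXPECTED_PDQ_HEX = "..."  # known-good PDQ hash for the fixture

def test_unletterbox_image():
    cropped_bytes = PhotoContent.unletterbox(LETTERBOXED_IMAGE_PATH, 10)
    assert PdqSignal.hash_from_bytes(cropped_bytes) == EXPECTED_PDQ_HEX
```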

@Mackay-Fisher Mackay-Fisher changed the title [py-tx] embeded tx hash passthrough for file generation and byte augm… [py-tx] embeded tx hash for unletterboxing Nov 6, 2024
@Dcallies Dcallies (Contributor) commented Nov 8, 2024

Before I look at the code, thank you for the very comprehensive summary and test writeup. You could probably even skimp a little bit in future PRs!

I also like that you looked for other potential improvements that made sense in the area. I was considering suggesting a --save option during @haianhng31's --rotation diffs as well.

@Mackay-Fisher Mackay-Fisher (Contributor, Author) replied:

Of course, thank you!

I can also update the save output option and file output for the rotation if you would like.

@Dcallies Dcallies (Contributor) left a comment

As mentioned, thanks for the strong summary writeup.

Top-level things:

  1. I think there's a /data/ directory in python-threatexchange. We've also previously loaded test images from the pdq/data directory - there's a trick you can see in a couple of tests for resolving paths relative to the current file. What do you think about synthesizing a new letterboxed bridge photo from the bridge-mods directory and use the meta logo only for the manual testing for this diff? Why - I don't want folks to be concerned about image rights, and the bridge-mods are only for the purpose of this repo.
  2. I think we should simplify the version of the interface we use on PhotoContent for now, and move the specific letterboxing into its own directory under content_type, or at least its own file. I foresee us adding more preprocessing tricks like this in the future.
  3. Can you explain more how you came to choose the thresholds, e.g. 40?

I found this random library that does something similar - you might want to give their code a read for ideas: https://github.com/Animenosekai/bordercrop/blob/main/bordercrop/bordercrop.py, though as mentioned we are trying to limit the number of external dependencies, otherwise we could even just use the library as is. We don't need all the features they have.

Overall strong work, thanks for the effort put in!

Comment on lines 103 to 104
ap.add_argument(
"--rotations",
@Dcallies Dcallies (Contributor) commented Nov 8, 2024

(no changes needed): Do you think we should combine rotations into your generalization of preprocessing?

Alternatively, what do you think about making rotation and unletterboxing mutually exclusive?

Contributor (Author):

Yes, I thought it made the most sense to combine them, but I didn't want to jump ahead and do it beforehand. I can add it to the list of actions.

Also, would there ever be a workflow where someone may want to both check for rotation and process the image for letterboxing?

python-threatexchange/threatexchange/cli/hash_cmd.py (outdated, resolved)
Comment on lines 98 to 99
type=bool,
default=False,
Contributor:

nit: If you make the action store_true then the default is false IIRC.

@@ -118,7 +144,17 @@ def execute(self, settings: CLISettings) -> None:
        if not self.rotations:
            for file in self.files:
                for hasher in hashers:
                    hash_str = hasher.hash_from_file(file)
                    if isinstance(hasher, PdqSignal) and (
Contributor:

Why only PdqSignal? Wouldn't other image perceptual hashing algorithms benefit from this?

We also generally want to avoid places where we do isinstance(<interface_implementation>), in preference to having it handled by the interface itself.

@Mackay-Fisher Mackay-Fisher (Contributor, Author) commented Nov 13, 2024

Okay, I thought it was specific to PdqSignal. As for why I bypassed the interface: not all signal types have the hash_from_bytes method, and I did not always want to create a new file with the updated image bytes. Is there a way around this, or would it be better to always create the new file, even if only temporarily, and then save the output if the user passes the flag to save it?

Contributor:

It's not specific to PdqSignal, and the hash_from_bytes method is itself part of a wider interface.

I think I eventually decided it was simpler to write everything to tmpfiles, which is how we ended up with the current implementation.

However, similar to the feedback I gave for --rotations, we can make our life a lot easier by having the preprocessing happen in between the file input and the for file in self.files.

I like your idea of providing a way to pass a flag to save it, but let's save that for a followup.

@@ -118,7 +144,17 @@ def execute(self, settings: CLISettings) -> None:
        if not self.rotations:
            for file in self.files:
                for hasher in hashers:
                    hash_str = hasher.hash_from_file(file)
                    if isinstance(hasher, PdqSignal) and (
                        self.content_type.get_name() == "photo"
Contributor:

Hmm, more specialization.

Comment on lines 106 to 181

    @classmethod
    def detect_top_border(
        cls, grayscale_img: Image.Image, black_threshold: int = 10
    ) -> int:
        """
        Detect the top black border by counting rows with only black pixels.
        Uses a default black threshold of 10 so that only rows with pixel brightness
        of 10 or lower will be removed.

        Returns the first row that is not all blacked out from the top.
        """
        width, height = grayscale_img.size
        for y in range(height):
            row_pixels = list(grayscale_img.crop((0, y, width, y + 1)).getdata())
            if all(pixel < black_threshold for pixel in row_pixels):
                continue
            return y
        return height

    @classmethod
    def detect_bottom_border(
        cls, grayscale_img: Image.Image, black_threshold: int = 10
    ) -> int:
        """
        Detect the bottom black border by counting rows with only black pixels from the bottom up.
        Uses a default black threshold of 10 so that only rows with pixel brightness
        of 10 or lower will be removed.

        Returns the first row that is not all blacked out from the bottom.
        """
        width, height = grayscale_img.size
        for y in range(height - 1, -1, -1):
            row_pixels = list(grayscale_img.crop((0, y, width, y + 1)).getdata())
            if all(pixel < black_threshold for pixel in row_pixels):
                continue
            return height - y - 1
        return height

    @classmethod
    def detect_left_border(
        cls, grayscale_img: Image.Image, black_threshold: int = 10
    ) -> int:
        """
        Detect the left black border by counting columns with only black pixels.
        Uses a default black threshold of 10 so that only columns with pixel brightness
        of 10 or lower will be removed.

        Returns the first column from the left that is not all blacked out in the column.
        """
        width, height = grayscale_img.size
        for x in range(width):
            col_pixels = list(grayscale_img.crop((x, 0, x + 1, height)).getdata())
            if all(pixel < black_threshold for pixel in col_pixels):
                continue
            return x
        return width

    @classmethod
    def detect_right_border(
        cls, grayscale_img: Image.Image, black_threshold: int = 10
    ) -> int:
        """
        Detect the right black border by counting columns with only black pixels from the right.
        Uses a default black threshold of 10 so that only columns with pixel brightness
        of 10 or lower will be removed.

        Returns the first column from the right that is not all blacked out in the column.
        """
        width, height = grayscale_img.size
        for x in range(width - 1, -1, -1):
            col_pixels = list(grayscale_img.crop((x, 0, x + 1, height)).getdata())
            if all(pixel < black_threshold for pixel in col_pixels):
                continue
            return width - x - 1
        return width
Contributor:

By putting these all at the top level, we are signaling that they are part of the "official" interface for photos.

Instead, let's move this functionality into its own file in a new /preprocessing directory. We can add unletterbox.py as its own module, with these 4 methods then as standalone.
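
A sketch of the split being suggested here (the directory and module names follow the suggestion, not necessarily what was merged):

```python
# threatexchange/content_type/preprocess/unletterbox.py  (proposed location)
from PIL import Image

def detect_top_border(grayscale_img: Image.Image, black_threshold: int = 10) -> int: ...
def detect_bottom_border(grayscale_img: Image.Image, black_threshold: int = 10) -> int: ...
def detect_left_border(grayscale_img: Image.Image, black_threshold: int = 10) -> int: ...
def detect_right_border(grayscale_img: Image.Image, black_threshold: int = 10) -> int: ...
```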


    @classmethod
    def unletterbox(
        cls, file_path: Path, save_output: bool = False, black_threshold: int = 40
Contributor:

blocking: Instead of making save_output an argument here, it might be better to compose it from the outside by taking the bytes, which will give the caller more control over the file directory.

blocking q: Can you tell more about how you picked 40? We may want to be very conservative by default (even to only 100% black).


Then removing the edges to give back a cleaned image bytes.

Return the new hash of the cleaned image with an option to create a new output file as well
Contributor:

I don't think this returns the hash, no?

Contributor (Author):

Ah, no. I had it returning the hash at first, but that created a circular dependency, so I removed it and did not update the comment. I will update.


            # Convert the cropped image to bytes for hashing
            with io.BytesIO() as buffer:
                cropped_img.save(buffer, format=img.format)
Contributor:

Why .img? Should we use the same format that was passed in?

"""
# Open the original image
with Image.open(file_path) as img:
grayscale_img = img.convert("L")
Contributor:

blocking q: Hmm, why convert to grayscale first? Won't this convert some full colors to black?

Contributor (Author):

I revised this and updated it to check each individual R, G, B value of the pixel.

@Mackay-Fisher (Contributor, Author):

  1. What do you think about synthesizing a new letterboxed bridge photo from the bridge-mods directory and use the meta logo only for the manual testing for this diff? Why - I don't want folks to be concerned about image rights, and the bridge-mods are only for the purpose of this repo.

I will update the images.

@Mackay-Fisher (Contributor, Author):

I found this random library that does something similar - you might want to give their code a read for ideas: https://github.com/Animenosekai/bordercrop/blob/main/bordercrop/bordercrop.py, though as mentioned we are trying to limit the number of external dependencies, otherwise we could even just use the library as is. We don't need all the features they have.

This was super helpful. I can go ahead and use the Pillow library and implement it in a similar format, for simpler dependencies and cleaner code. It should work the same way. However, if you would like me to use this library directly, I can do that as well.

Also, I looked into it more: that library also handles image URLs and more formats than the basic ones we are using. If that is by design, I can add a pass-through to support that.

@Dcallies Dcallies self-requested a review November 13, 2024 16:37
@Dcallies Dcallies (Contributor) left a comment

Some responses to comments - use "request review" in the upper right-hand corner of the summary page to send it back to me for review (the little refresh icon).

python-threatexchange/threatexchange/cli/hash_cmd.py (outdated, resolved)

@Dcallies Dcallies (Contributor) left a comment

Getting closer!

blocking: Instead of adding the test file to /threatexchange, can you add it to https://github.com/facebook/ThreatExchange/tree/main/pdq/data/bridge-mods instead?

Because threatexchange is in the same repo as PDQ, you can read the pdq directory from ThreatExchange tests using the trick I noted inline with __file__.
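
One common form of the __file__ trick mentioned here (the number of parent hops depends on where the test file actually lives, so the depth below is an assumption):

```python
from pathlib import Path

# Resolve the shared pdq/data directory relative to this test file rather than the CWD.
THIS_DIR = Path(__file__).parent
REPO_ROOT = THIS_DIR.parents[3]  # adjust the depth for the test's real location
BRIDGE_MODS_DIR = REPO_ROOT / "pdq" / "data" / "bridge-mods"
```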

Optional / For your consideration: At the rate we are going, this feels like it will take a few more passes. You can simplify the PR by breaking the changes to unletterboxing.py and photo out into their own PR (the bottom/simplest part of the stack), though splitting a PR in git is pretty daunting the first time. Here's a random article describing a method using cherry-pick, but there are others.

Comment on lines 96 to 98
default=10,
help=(
"Set the black threshold for unletterboxing (default: 5)."
Contributor:

blocking: documentation for default seems off. There also might be a default argparse option that will display the default for you.

blocking q: Can you tell me how you chose 10 for this?

@Mackay-Fisher Mackay-Fisher (Contributor, Author) commented Nov 25, 2024

I tested the implementation against the bordercrop case and found that a black_threshold of 0 consistently failed to identify all black borders. Setting the threshold to 15 gave the best results, as shown in the tests I ran (https://github.com/Mackay-Fisher/Image-Testing).

The reason a threshold of 0 doesn't work is the effect of image compression and processing. In compressed formats like JPEG, compression introduces variations in pixel values, so a pixel intended to be pure black (0, 0, 0) might be encoded anywhere from (1, 1, 1) to (15, 15, 15) (15 is the largest channel value I found that still represents true black, and the channels can differ from one another). For example, we could get a pixel like (8, 9, 15), which would fail a low threshold because the check looks at each channel individually. Grayscaling typically reduces this, but it also opens up the possibility of misclassifying colors. These small deviations occur because compression algorithms prioritize reducing file size over preserving exact pixel values.

Additionally, image noise and color-profile conversions can further alter pixel values slightly; this is most likely introduced during the resizing operations when creating the letterboxed image. I found that by not grayscaling, as the bordercrop library does, I was better able to reduce the black threshold needed to clean the image.

Setting the threshold to 15 accounted for these variations; it was the minimum threshold that consistently cropped the borders correctly without hurting the original image.
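
A quick worked example of the per-channel check being described, using the strict less-than semantics from the snippet reviewed below (the pixel values are illustrative):

```python
def is_pixel_black(pixel, threshold):
    r, g, b = pixel
    return r < threshold and g < threshold and b < threshold

# A "black" pixel perturbed by JPEG compression:
print(is_pixel_black((8, 9, 15), 0))   # False: no channel is strictly below 0
print(is_pixel_black((8, 9, 15), 15))  # False: the 15 channel is not < 15
print(is_pixel_black((8, 9, 15), 16))  # True: every channel falls below 16
```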

Comment on lines 104 to 106
"--save-output",
action="store_true",
help="for photos, generate all 8 simple rotations",
help="If true, saves the processed image as a new file.",
Contributor:

blocking: To help a user understand what this option does, suggest naming it --save-preprocess

Since store_true doesn't take an argument, suggest this as an alternative help:

save the preprocessed image data as new files

        rotation_type = []
        if self.photo_preprocess == "unletterbox":
            updated_bytes.append(
                PhotoContent.unletterbox(str(file), self.black_threshold)
Contributor:

fine as a followup: whoops, we should make unletterbox take a Path object for consistency with the rest of the library

Comment on lines 160 to 169
with tempfile.NamedTemporaryFile() as temp_file:
temp_file.write(bytes_data)
temp_file_path = pathlib.Path(temp_file.name)
for hasher in hashers:
hash_str = hasher.hash_from_file(temp_file_path)
if hash_str:
print(rotation_type.name, hasher.get_name(), hash_str)
print(
f"{rotation_type[idx].name if rotation_type else ''} {hasher.get_name()} {hash_str}"
)
if self.save_output:
Contributor:

ignorable: We can simplify this logic by using the delete= keyword of NamedTemporaryFile: https://docs.python.org/3/library/tempfile.html#tempfile.NamedTemporaryFile

delete=not self.save_output
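
A self-contained sketch of that suggestion (the function and variable names are illustrative, not the PR's code):

```python
import pathlib
import tempfile

def hash_preprocessed(bytes_data: bytes, hasher, save_output: bool) -> str:
    # When save_output is True, delete=False leaves the temp file on disk after the block.
    with tempfile.NamedTemporaryFile(delete=not save_output, suffix=".png") as temp_file:
        temp_file.write(bytes_data)
        temp_file.flush()
        return hasher.hash_from_file(pathlib.Path(temp_file.name))
```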

Comment on lines 175 to 178
        output_path = file.with_stem(f"{file.stem}{suffix}")
        with open(output_path, "wb") as output_file:
            output_file.write(bytes_data)
        print(f"Processed image saved to: {output_path}")
Contributor:

This might get a bit messy - you can include files from many locations. Additionally, do we know the format of the resulting image? Without the extension the file might not be usable.

    cropped_img = image.crop((left, top, width - right, height - bottom))

    with io.BytesIO() as buffer:
        cropped_img.save(buffer, format=image.format)
Contributor:

ignorable: Ah I see, we keep the original format, I like this choice.

    Check if each color channel in the pixel is below the threshold
    """
    r, g, b = pixel
    return r < threshold and g < threshold and b < threshold
Contributor:

blocking q: Shouldn't this be <=? Your default threshold is 0. Can it be negative?

from PIL import Image


def is_pixel_black(pixel, threshold):
Contributor:

blocking: missing typing

"""
width, height = image.size
for y in range(height):
row_pixels = list(image.crop((0, y, width, y + 1)).getdata())
Contributor:

I couldn't tell from reading the pillow docs, but can you use the returned core.image object as an iterator without wrapping it in a list?

Contributor (Author):

Sorry, I thought I had responded to this. I get a mypy error if I do not wrap it in a list, because mypy does not recognize the iterator. It does work, since the image object is iterable, but for the sake of mypy I left it as a list. I can change it to use a type: ignore if that would be better.

@Mackay-Fisher (Contributor, Author):

Sorry for the late response. I resolved the blocking comments and tried my best to address the follow-ups. Please let me know if anything is missing or if you would like more clarification on the testing.

@Dcallies Dcallies (Contributor) left a comment

I didn't see any leftover blockers, so let's merge this version!

@Dcallies Dcallies merged commit cb1f21f into facebook:main Nov 27, 2024
6 checks passed