Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pytx] Implement FileContent class #1680

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@
from enum import Enum, auto
import typing as t

from threatexchange import common
import common


class ContentType:
Expand All @@ -32,7 +32,7 @@ def extract_additional_content(
* Photo => run OCR and extract text
* Video => break out photo thumbnail, close caption text, audio
"""
return []
return {}


class RotationType(Enum):
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
import typing as t
from content_base import ContentType
from photo import PhotoContent
from video import VideoContent
from PIL import Image

class FileContent(ContentType):
"""
ContentType representing a generic file.

Determines if a file is a photo or video based on file extension.
"""

VALID_PHOTO_EXTENSIONS = {ext.lower() for ext in Image.registered_extensions()}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is doing work during module import time, which is generally considered an anti-pattern, but I don't think it's worth changing

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you provide more details?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you tell me more details?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the first result for me from google for "doing work during module import time"
https://www.benkuhn.net/importtime/

As mentioned, I don't think it's worth changing.

VALID_VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov", ".mkv"}

@classmethod
def get_content_type_from_filename(cls, file_name: str) -> t.Type[ContentType]:
"""
Determines content type based on file extension.
"""
file_extension = file_name.lower().rsplit('.', 1)[-1]
file_extension = f".{file_extension}"

if file_extension in cls.VALID_PHOTO_EXTENSIONS:
return PhotoContent
elif file_extension in cls.VALID_VIDEO_EXTENSIONS:
return VideoContent
else:
return None
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
import pytest
from threatexchange.content_type.photo import PhotoContent
from threatexchange.content_type.video import VideoContent
from threatexchange.content_type.file_content import FileContent

@pytest.mark.parametrize("file_name,expected_content_type", [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

("file.jpg", PhotoContent),
("file.JPG", PhotoContent),
("file.mp4", VideoContent),
("file.MP4", VideoContent),
("archive.photo.png", PhotoContent),
("movie.backup.mp4", VideoContent),
])
def test_file_content_detection(file_name, expected_content_type):
"""
Tests that FileContent correctly identifies the content type
as either PhotoContent or VideoContent based on file extension.
"""
content_type = FileContent.get_content_type_from_filename(file_name)
assert content_type == expected_content_type, f"Failed for {file_name}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably don't need to add a message - parameterize will do that for you.


def test_unknown_file_type():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: You can combine this into your previous test by just doing

("file.txt", None)

"""
Tests that an unknown file type returns None.
"""
file_content = FileContent.get_content_type_from_filename("file.txt")
assert file_content is None
Loading