Add CLIP text model #643
base: main
Conversation
Force-pushed 24bc7e7 to d658a9f
Ports the CLIP text model from Hugging Face. Adds numeric verification tests for the various components of the stack when executing in eager mode. Verifications are made for float32 and bfloat16. There are tests for toy-sized components and the whole model, as well as the Large pretrained variant. These tests do not include testing with IREE. Functionality for mask creation is not yet ported.
Force-pushed d658a9f to a6de6cb
sharktank/sharktank/layers/norm.py (Outdated)
if bias_name in self.theta.keys:
    self.bias = self.theta_tensor(bias_name)
else:
    self.bias = None
Don't bother if-else-ing. Just set `self.bias = None` prior to the if statement.
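A minimal sketch of the suggested shape (the class and helper names here are hypothetical stand-ins, not the actual sharktank API):

```python
class NormLayerSketch:
    """Hypothetical stand-in for the layer under review."""

    def __init__(self, theta_keys, bias_name="bias"):
        # Default first, then override only when the key exists --
        # no else branch needed.
        self.bias = None
        if bias_name in theta_keys:
            self.bias = self._theta_tensor(bias_name)

    def _theta_tensor(self, name):
        # Placeholder for the real theta tensor lookup.
        return f"tensor:{name}"
```

This keeps the happy path as the only branch and makes the default explicit.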
Done.
# position_ids (1, len position emb) is contiguous in memory and exported when serialized
self.register_buffer(
    "position_ids",
    torch.arange(config.max_position_embeddings).expand((1, -1)),
You should use `unsqueeze` instead of `expand`.
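For this particular shape, the two are equivalent; `unsqueeze(0)` just states the intent (add a leading batch dim) directly:

```python
import torch

n = 8  # stand-in for config.max_position_embeddings

# expand((1, -1)) works here, but it is the more general broadcast tool.
via_expand = torch.arange(n).expand((1, -1))

# unsqueeze(0) says exactly what is happening: add one leading dimension.
via_unsqueeze = torch.arange(n).unsqueeze(0)

assert via_expand.shape == via_unsqueeze.shape == (1, n)
assert torch.equal(via_expand, via_unsqueeze)
```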
I forgot to mention in the PR description that this is the initial port of the model and I did not attempt to optimize anything. The main goal was to put it under test. I would rather do these modifications later, as they would be tracked in separate commits and it will be clear what changed compared to the original.
If you're directly referencing the huggingface code for implementation, do you want to link it in a comment?
    f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
    f" {self.num_heads})."
)
self.scale = self.head_dim**-0.5
Avoid using `**` for inverse powers. It's better to do `1.0 / math.sqrt(self.head_dim)`. The numerical precision on pow operators is usually significantly worse.
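The two forms are mathematically identical; the concern is how each is computed (a general `pow` versus a dedicated `sqrt` plus one division). A quick sketch of the suggested rewrite:

```python
import math

head_dim = 64  # typical CLIP-Large head dim: 1024 / 16 heads

# Form used in the code under review: a general pow with a negative
# fractional exponent.
scale_pow = head_dim ** -0.5

# Suggested form: sqrt has well-specified rounding, then one division.
scale_sqrt = 1.0 / math.sqrt(head_dim)

assert abs(scale_pow - scale_sqrt) < 1e-12
```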
)

# apply the causal_attention_mask first
if causal_attention_mask is not None:
Why do you separate the causal attention mask from the regular attention mask? They should just occur together. Even the causal attention mask should really just be a bool.
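One way to read this suggestion (a sketch, not the sharktank API; the function name and mask conventions here are hypothetical) is to fold the causal structure and the padding mask into a single boolean mask up front, where `True` means "may attend":

```python
import torch

def combined_attention_mask(padding_mask: torch.Tensor, tgt_len: int) -> torch.Tensor:
    """Merge a (batch, src_len) bool padding mask with a causal mask into
    one (batch, 1, tgt_len, src_len) bool mask. True = attend."""
    causal = torch.ones(tgt_len, tgt_len, dtype=torch.bool).tril()
    # Broadcast: (1, 1, tgt, src) & (batch, 1, 1, src)
    return causal[None, None, :, :] & padding_mask[:, None, None, :]

padding = torch.tensor([[True, True, False]])  # last token is padding
mask = combined_attention_mask(padding, tgt_len=3)
```

A boolean mask in this form can be passed once to the attention op instead of applying two additive masks in sequence.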
)
attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)

attn_weights = ops.softmax(attn_weights, dim=-1)
Rather than decomposing, we should use the `ops.scaled_dot_product_attention` operation. Attention is attention, so we should avoid replicating the decomposed version everywhere.
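As a sketch of the equivalence being pointed at (using PyTorch's built-in fused op in place of sharktank's `ops.scaled_dot_product_attention`, which I'm assuming has the same semantics):

```python
import math
import torch

torch.manual_seed(0)
# (batch, heads, seq, head_dim)
q = torch.randn(2, 4, 5, 8)
k = torch.randn(2, 4, 5, 8)
v = torch.randn(2, 4, 5, 8)

# Decomposed attention, as in the code under review.
scale = 1.0 / math.sqrt(q.shape[-1])
weights = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
decomposed = weights @ v

# Fused equivalent: one op, same default 1/sqrt(head_dim) scaling.
fused = torch.nn.functional.scaled_dot_product_attention(q, k, v)

assert torch.allclose(decomposed, fused, atol=1e-5)
```

Keeping the fused form also lets the compiler pick an optimized attention kernel instead of pattern-matching the decomposition.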
    return_dict if return_dict is not None else self.config.use_return_dict
)

encoder_states = () if output_hidden_states else None
This logic of empty tuple vs `None`, plus the repeated tuple concatenation, makes it unclear what it is actually attempting to do. Relying on `None` + tuple feels weird.
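One hypothetical way to make the intent explicit (a sketch; `run_encoder` and its shape are made up for illustration): accumulate into a plain list and convert only at the end, so the `None`-vs-tuple distinction lives in exactly one place.

```python
def run_encoder(layers, hidden_state, output_hidden_states: bool):
    """Sketch: collect intermediate states in a list instead of
    threading `None` vs empty tuple through the loop."""
    collected = []
    for layer in layers:
        if output_hidden_states:
            collected.append(hidden_state)
        hidden_state = layer(hidden_state)
    if output_hidden_states:
        collected.append(hidden_state)
    return hidden_state, tuple(collected) if output_hidden_states else None

# Toy usage: three "layers" that each add 1.
out, states = run_encoder([lambda x: x + 1] * 3, 0, output_hidden_states=True)
```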
sharktank/sharktank/types/tensors.py (Outdated)
def size(self, dim: Optional[int] = None) -> tuple[int]:
    if dim is None:
        return tuple(self.shape)
    else:
No `else` condition required. Just include the return when the condition is not taken.
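The early-return style being suggested, on a hypothetical stand-in class (`TensorLike` is made up; only the `size` body mirrors the code under review):

```python
from typing import Optional

class TensorLike:
    """Hypothetical stand-in exposing `shape` like the reviewed class."""

    def __init__(self, shape):
        self.shape = shape

    def size(self, dim: Optional[int] = None):
        # Early return; the fall-through replaces the else branch.
        if dim is None:
            return tuple(self.shape)
        return self.shape[dim]
```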
Done.
    return embeddings


class ClipAttention(BaseLayer):
At a quick glance it seems much of this is reused from the punet Attention (outside the decomposed SDPA, which we should try avoiding) from here:
class AttentionLayer(ThetaLayer): |
last_hidden_state = self.final_layer_norm(last_hidden_state)

if self.eos_token_id == 2:
    # The `eos_token_id` was incorrect before PR #24773: Let's keep what have been done here.
This PR number I assume is a reference to diffusers? It's confusing to have in our repo.
|
||
@with_clip_data | ||
def testSmokeExportLargeF32FromHuggingFace(self): | ||
repo_id = "openai/clip-vit-large-patch14" |
We might want to avoid downloading full models for developers running locally. Should we implement a toy model to accommodate? Thoughts @rsuderman?
Ports the CLIP text model from Hugging Face. This is the first iteration, so not much is changed from the original model. Things like dropout and checkpointing are removed.
Adds numeric verification tests for the various components of the stack when executing in eager mode. Verifications are made for float32 and bfloat16. There are tests for toy-sized components and the whole model, as well as the Large pretrained variant.
These tests do not include testing with IREE.
Functionality for mask creation is not yet ported.