Cv2 5085 move get items with similar text to presto #2023

DGaffney
Contributor

@DGaffney DGaffney commented Sep 4, 2024

Description

Move the `get_items_from_similar_text` functionality to use the sync endpoint on Presto.

References: CV2-5079, CV2-5083, CV2-5085

How has this been tested?

Not tested yet. I will intentionally break all tests first to highlight the fixes that are needed.

Things to pay attention to during code review

Nothing in particular! One question for @caiosba: are we comfortable using sync for these calls, generally speaking? Perhaps we could do another refactor at the end of the epic to make everything async?

Checklist

  • I have performed a self-review of my own code
  • I have added unit and feature tests, if the PR implements a new feature or otherwise would benefit from additional testing
  • I have added regression tests, if the PR fixes a bug
  • I have added logging, exception reporting, and custom tracing with any additional information required for debugging
  • I considered secure coding practices when writing this code. Any security concerns are noted above.
  • I have commented my code in hard-to-understand areas, if any
  • I have made needed changes to the README
  • My changes generate no new warnings
  • If I added a third party module, I included a rationale for doing so and followed our current guidelines

* CV2-5087 move Articles side effecting saves to to it via presto

* CV2-5082 move article indexing to presto

* resolve test errors

* updates for broken tests

* small tweak

* set to sync

* more fixes

* rename function and revert request

* add response suppression and move to specific path for side effecting requests

* extend similar media to allow for temporary texts

* fix broken test fixture

* revert back to async

* fix another test

* fixes per PR review

* fixes per PR review

* more fixes after review

* CV2-5080 update request model alegre calls to use presto-based alegre querying

* move to sync

* update for bypassing async calls in tests
@DGaffney DGaffney marked this pull request as ready for review September 5, 2024 17:09
Contributor

@caiosba caiosba left a comment

@DGaffney is it expected that this PR is against develop and not the epic branch?

@DGaffney DGaffney changed the base branch from develop to epic/cv2-5050-text-vectorization-via-presto September 9, 2024 13:19
@DGaffney
Contributor Author

DGaffney commented Sep 9, 2024

> @DGaffney is it expected that this PR is against develop and not the epic branch?

Good catch, fixed!

…085-move-get-items-with-similar-text-to-presto
@DGaffney DGaffney requested a review from caiosba September 9, 2024 21:31
@DGaffney DGaffney merged commit c361771 into epic/cv2-5050-text-vectorization-via-presto Sep 11, 2024
4 checks passed
@DGaffney DGaffney deleted the cv2-5085-move-get-items-with-similar-text-to-presto branch September 11, 2024 13:55
DGaffney added a commit that referenced this pull request Oct 24, 2024
* Cv2 5082 article indexing to presto (#1994)

* CV2-5087 move Articles side effecting saves to to it via presto

* CV2-5082 move article indexing to presto

* resolve test errors

* updates for broken tests

* small tweak

* set to sync

* more fixes

* rename function and revert request

* add response suppression and move to specific path for side effecting requests

* extend similar media to allow for temporary texts

* fix broken test fixture

* revert back to async

* fix another test

* fixes per PR review

* fixes per PR review

* more fixes after review

* Cv2 5080 request model to presto (#2015)

* CV2-5080 update request model alegre calls to use presto-based alegre querying

* move to sync

* update for bypassing async calls in tests

* Cv2 5086 smooch nlu to presto 2 (#2019)

* Cv2 5082 article indexing to presto (#1994)

* CV2-5087 move Articles side effecting saves to to it via presto

* CV2-5082 move article indexing to presto

* resolve test errors

* updates for broken tests

* small tweak

* set to sync

* more fixes

* rename function and revert request

* add response suppression and move to specific path for side effecting requests

* extend similar media to allow for temporary texts

* fix broken test fixture

* revert back to async

* fix another test

* fixes per PR review

* fixes per PR review

* more fixes after review

* CV2-5086 second attempt on clean Smooch NLU to Presto branch

* fix broken test stubs

* fix typo brought over from previous PR

* alias and rename per caio review

* fix syntax

* Mayyyyybe its the alias?

* fix old reference

* move another stale function reference

* Replace with alias

* symbolize aliased method names

* Revert to proper function

* Cv2 5085 move get items with similar text to presto (#2023)

* Cv2 5082 article indexing to presto (#1994)

* CV2-5087 move Articles side effecting saves to to it via presto

* CV2-5082 move article indexing to presto

* resolve test errors

* updates for broken tests

* small tweak

* set to sync

* more fixes

* rename function and revert request

* add response suppression and move to specific path for side effecting requests

* extend similar media to allow for temporary texts

* fix broken test fixture

* revert back to async

* fix another test

* fixes per PR review

* fixes per PR review

* more fixes after review

* Cv2 5080 request model to presto (#2015)

* CV2-5080 update request model alegre calls to use presto-based alegre querying

* move to sync

* update for bypassing async calls in tests

* CV2-5085 move get_items_from_similar_text calls to use sync endpoint

* review and resolve broken tests

* update stub

* Cv2 5084 Update Reindexing to use async presto endpoint (#2031)

* CV2-5084 update reindexing strategy to use singular requests for now

* switch to async

* merge in latest changes on epic branch

* update fixture

* fix stub path

* Update reindex_alegre_workspace.rb

* CV2-5081 switch text to presto-based querying (#2034)

* CV2-5081 switch text to presto-based querying

* more test stub updates

* fix more stubs

* update stub and typo

* CV2-5050 add explicit callback for text

* update stub

* more tweaking during testing

* Bot events for test endpoint

* move async query to sync

* CV2-5324: create a method for create relationship and use it everywhere (#2053)

* A couple of improvements for shared feeds (#2056)

* Make sure that the feed `Cluster.last_request_date` field is calculated the same way as `ClusterTeam.last_request_date`
* Make sure that if a parent item is tagged, all of its child items are also included in the cluster, even if they are not tagged individually
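
As a rough illustration of the second point, here is a minimal sketch in plain Ruby. It assumes that cluster membership is decided by a tag filter and that each item has at most one parent; the `ClusterItem` struct and `items_for_tag` method are hypothetical names and are not taken from the check-api code.

```ruby
# Hypothetical stand-in for an item; not the real ProjectMedia model.
ClusterItem = Struct.new(:id, :parent_id, :tags, keyword_init: true)

# Returns the items that carry the tag themselves, plus the direct
# children of tagged items (assumption: only one level of nesting).
def items_for_tag(items, tag)
  tagged_ids = items.select { |item| item.tags.include?(tag) }.map(&:id)
  items.select do |item|
    tagged_ids.include?(item.id) || tagged_ids.include?(item.parent_id)
  end
end

items = [
  ClusterItem.new(id: 1, parent_id: nil, tags: ['health']),
  ClusterItem.new(id: 2, parent_id: 1, tags: []),
  ClusterItem.new(id: 3, parent_id: nil, tags: [])
]
included_ids = items_for_tag(items, 'health').map(&:id)
```

In this sketch the untagged child (id 2) is included because its parent is tagged, while the unrelated untagged item (id 3) is left out.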

Reference: CV2-5331.

* CV2-5371: fix sentry issue (#2058)

* Update setuptools module and pin to known good version. (#2059)

* 5120 – Dont create duplicate tags and clean up `#` (#2054)

Context
While looking into a tags issue I noticed a few things:
    - when we made a request with duplicate tags:
        - we got an error, so the job was retried
        - the tag was added twice to the FactCheck
        - the tag was added once to the ProjectMedia
    - when we made a request with a tag with a #:
        - we got an error, so the job was retried
            - there seem to have been two errors related to this:
                - ActiveRecord::RecordInvalid: Tag already exists
                - ActiveRecord::RecordInvalid: Text has already been taken
        - the tag with the # was added to the FactCheck
        - the tag was not added to the ProjectMedia

What was happening
    There were a few things happening at the same time:
        - Creation of a ProjectMedia with tags
        - Creation of a FactCheck with tags
        - For ProjectMedia, tags are objects of the Tag class; for FactCheck, they are a simple array
    For the ProjectMedia:
        - It would create the tag, try to create the same tag again, fail, retry, and so on
        - This happened because we have validations in place for the Tag class
    For FactCheck:
        - It would just create the tags twice
        - This happened because we have no validations for the tags array

TL;DR: Tags were not cleaned up properly before we created them for ProjectMedia and FactCheck, and the two handled tags differently.

How it was fixed
- I created a helper to clean up the tags before creating them
- We need to make sure tags:
    - are stripped
    - are unique
    - don't have a leading `#`
- We use this helper both in FactCheck.rb and Tag.rb
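
A minimal sketch of what such a clean-up helper could look like, in plain Ruby. The module and method names (`TagHelpers.clean_tags`) are hypothetical and this is not the helper from the PR; it also assumes that only a single leading `#` is removed and that uniqueness is case-sensitive.

```ruby
# Hypothetical helper; names are illustrative, not the PR's actual code.
module TagHelpers
  # Strips surrounding whitespace, removes one leading '#',
  # then drops blank entries and duplicates.
  def self.clean_tags(tags)
    Array(tags)
      .map { |tag| tag.to_s.strip.sub(/\A#/, '').strip }
      .reject(&:empty?)
      .uniq
  end
end

cleaned = TagHelpers.clean_tags([' #health', 'health', 'covid ', '#'])
```

Running the same clean-up before both code paths create tags is what keeps the FactCheck array and the Tag records consistent with each other.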

References: 5120
PR: 2054

* CV2-5005: Sentry issue related to ES (#2057)

* CV2-5005: limit ES date to updated fields only

* CV2-5005: fix tests

* CV2-5005: test coverage

* CV2-5371: fix sentry error (#2062)

* Fix setuptools version pin for check-api builds (#2063)

* Pin version for this distribution.

* CV2-5391: use save instead of save! to avoid raising error (#2061)

* CV2-5391: use save instead of save! to avoid raising error

* CV2-5391: fix archived validation error

* Reset item status to default one when claim/fact-check is detached. (#2064)

Fixes CV2-4502.

* CV2-5418: fix sentry issue (#2066)

* Add ukrainian translation (#2065)

* Add ukrainian translation

* Add 'uk' to config.i18n.available_locales

* CV2-5420: include cached fields that require ES or PG updates (#2067)

* :create_project_media_tags should be able to ignore tag already added to item (#2068)

A small fix to how tags are created in the background, to make sure :create_project_media_tags can ignore a tag that was already added to the item.

References: 5426, 5120
PR: 2068

* CV2-5392: export cluster description (#2070)

* Setting initial value for `last_request_date` for feed clusters. (#2072)

A change introduced in CV2-5331 normalized how the `last_request_date` field for a shared feed cluster is calculated. But there is an issue: if the cluster doesn't have any requests, no value is set. The fix is to make sure there is an initial value, which can be the date of the last item that joined the cluster when that item has no requests.
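
The fallback can be sketched in plain Ruby as follows. This is only an illustration under assumptions: `FeedItem`, `joined_at`, `request_dates` and `initial_last_request_date` are hypothetical names, not the real cluster model or its fields.

```ruby
# Hypothetical stand-in for an item that joined a cluster.
FeedItem = Struct.new(:joined_at, :request_dates, keyword_init: true)

# Uses the most recent request date when there is one; otherwise falls
# back to the date of the last item that joined the cluster.
def initial_last_request_date(items)
  latest_request = items.flat_map(&:request_dates).max
  return latest_request if latest_request

  items.map(&:joined_at).max
end

no_requests = [
  FeedItem.new(joined_at: Time.utc(2024, 10, 1), request_dates: []),
  FeedItem.new(joined_at: Time.utc(2024, 10, 5), request_dates: [])
]
fallback_date = initial_last_request_date(no_requests)
```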

Fixes: CV2-5446.

* set the annotator to be the “Smooch Bot” (#2069)

Set the annotator to "Smooch Bot" so that the bot user is displayed in the content warning cover when a user is blocked.

Reference: CV2-5142

* CV2-5419: rescue ActiveRecord::RecordNotUnique for relationship save (#2071)

* CV2-5419: rescue ActiveRecord::RecordNotUnique for relationship save

* CV2-5419: delete suggested relation before create confirmed one

* CV2-5451: set confirmed before creation based on relationship type (#2073)

* Make sure that an item can't be related to itself. (#2075)

We already had an Active Record validation for that, but it could be bypassed by direct updates or race conditions. This PR makes it more robust by adding a database constraint.

Fixes: CV2-5437.

* Report bad relationship structure only if relationship is not nil. (#2074)

Fixes: CV2-5454.

* Revert rejecting suggestion if relationship creation fails. (#2076)

When a confirmed relationship is created, any suggestion between the same two items should be deleted. We had logic for that in two different places. This PR keeps it in only one place (the relationship model) and removes it from the `create_unless_exists` method, which now also takes the relationship type into account. Having this logic in a `before_create` that contains a transaction means that the `destroy` of the suggestion will be rolled back by PostgreSQL if the `create` fails.

Fixes: CV2-5436.

* Fixing Sentry error

* CV2-5434: skip process the TeamTask background job if there is a more recent one (#2078)

* Setting retry interval for GenericWorker. (#2080)

This change tries to avoid a single case reported by Sentry: a race condition where the related object was still not fully persisted in the database when the job executed. Setting a longer retry interval should avoid this case.

Fixes: CV2-5459.

* Adding a log line for outgoing Smooch requests. (#2081)

Adding a log line for the outgoing Smooch requests. This can help debug some issues.

Reference: CV2-5378.

* Do not send report if search results were already received. (#2082)

I noticed this regression, introduced by CV2-5451. `confirmed_by` and `confirmed_at` are now set for all relationships, even the ones created as confirmed matches. So, in order to know if a report should be sent for an accepted suggestion, we can't rely solely on the existence of a value for `confirmed_by` or `confirmed_at`. We know that a suggestion is accepted after the relationship was created, so if `confirmed_at` is before or at the same time as `created_at`, this is not an accepted suggestion but a relationship that was already created as a confirmed match.
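
The timestamp comparison described above can be sketched like this in plain Ruby. `RelationshipStub` and `accepted_suggestion?` are hypothetical names used only for illustration; the real check lives in the application code and may differ.

```ruby
# Hypothetical stand-in for a relationship record.
RelationshipStub = Struct.new(:created_at, :confirmed_at, keyword_init: true)

# A relationship counts as an accepted suggestion only when it was
# confirmed strictly after it was created.
def accepted_suggestion?(relationship)
  return false if relationship.confirmed_at.nil?

  relationship.confirmed_at > relationship.created_at
end

created = Time.utc(2024, 10, 1, 12, 0, 0)
created_as_confirmed = RelationshipStub.new(created_at: created, confirmed_at: created)
accepted_later = RelationshipStub.new(created_at: created, confirmed_at: created + 60)
```

Under these assumptions `accepted_suggestion?(created_as_confirmed)` is false and `accepted_suggestion?(accepted_later)` is true, so a report would only be sent in the second case.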

Reference: CV2-5451.

* CV2-5348: refactor ES cached field calling and remove retry_on_conflict (#2083)

* CV2-5348: set retry_on_conflict to zero and pass id instead of the object

* CV2-5348: skip blank obj

* CV2-5348: apply PR comments

* CV2-5190: Create a Link and Claim from tipline message that contain both link and long text (#2084)

* CV2-5190: create a link and claim from tipline message that contains link and long text

* CV2-5190: handle link and short text

* CV2-5190: fix tests

* CV2-5190: fix CC

* Request/5424 add tagalog translations (#2085)

* Add Tagalog hardcoded strings

* Bump rails-i18n

* Request/5424 add tagalog translations (#2086)

* Add Tagalog hardcoded strings

* Bump rails-i18n

* Bump rails-i18n again. Changed long date format instead of the default one

* update fixtures on broken tests

* add webmock

* review and resolve missing line errors

* resolve changes from Sawy

* gut source and id from any tests and responses

* more tweaking to resolve broken tests

---------

Co-authored-by: Caio <[email protected]>
Co-authored-by: Mohamed El-Sawy <[email protected]>
Co-authored-by: Martin Peck <[email protected]>
Co-authored-by: Manu Vasconcelos <[email protected]>
Co-authored-by: Alexandre Amoedo Amorim <[email protected]>
Co-authored-by: Daniele Valverde <[email protected]>
3 participants