Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store content object collections in search index [FC-0062] #35469

Merged
merged 48 commits into from
Sep 13, 2024
Merged
Show file tree
Hide file tree
Changes from 37 commits
Commits
Show all changes
48 commits
Select commit Hold shift + click to select a range
ed30ce2
feat: Add Library Collections REST endpoints
yusuf-musleh Aug 14, 2024
3d64ea4
test: Add tests for Collections REST APIs
yusuf-musleh Aug 22, 2024
931f688
chore: Add missing __init__ files
yusuf-musleh Aug 23, 2024
b2b38cb
feat: Verify collection belongs to library
yusuf-musleh Aug 23, 2024
a16b398
feat: Add events emitting for Collections
yusuf-musleh Aug 26, 2024
be39d03
chore: fix pylint errors
yusuf-musleh Aug 26, 2024
22fa791
refactor: Use convert_exceptions + update tests
yusuf-musleh Aug 29, 2024
0bf0f78
refactor: Relocate convert_exceptions, remove utils
yusuf-musleh Aug 30, 2024
45cf886
refactor: Remove Collection handlers skeleton code
yusuf-musleh Aug 30, 2024
366a7e9
refactor: Move Collections views/tests
yusuf-musleh Aug 30, 2024
504aa4f
feat: Add REST endpoints to update Components in a Collections (temp)…
pomegranited Sep 3, 2024
a2da207
refactor: pull view functionality into api
pomegranited Sep 3, 2024
aa3c1e3
temp: use temporary openedx-learning branch
pomegranited Sep 4, 2024
6ae83ba
refactor: use collection.key as ID in search records
pomegranited Sep 4, 2024
abb1eb1
refactor: use Collection.key as identifier in content_libraries
pomegranited Sep 4, 2024
0eb52cf
test: adds tests for events raised by collections api
pomegranited Sep 4, 2024
6f39767
temp: use temporary openedx-events branch
pomegranited Sep 4, 2024
bcc5cbb
fix: pass a ContentKey to the search doc handlers
pomegranited Aug 28, 2024
0e32451
feat: add tags.collections to content object search index
pomegranited Aug 28, 2024
3c6617e
temp: use openedx-events temporary branch
pomegranited Sep 5, 2024
0c29552
feat: emit CONTENT_OBJECT_ASSOCIATIONS_CHANGED
pomegranited Sep 5, 2024
e20ee24
temp: use WIP openedx-events branch
pomegranited Sep 6, 2024
9619341
test: fix test failing after changes to openedx-learning
pomegranited Sep 6, 2024
1764e87
fix: emit both CONTENT_OBJECT_ASSOCIATIONS_CHANGED and CONTENT_OBJECT…
pomegranited Sep 6, 2024
ae7926f
docs: minor comment update
pomegranited Sep 6, 2024
007f80a
fix: catch IntegrityError in create_library_collection
pomegranited Sep 6, 2024
197733b
Merge remote-tracking branch 'origin/master' into yusuf-musleh/collec…
pomegranited Sep 6, 2024
436822e
docs: replaces CONTENT_OBJECT_TAGS_CHANGED with CONTENT_OBJECT_ASSOCI…
pomegranited Sep 10, 2024
d517d80
refactor: store collections outside of tags index
pomegranited Sep 10, 2024
d312ff3
chore: updates openedx-events==9.14.0
pomegranited Sep 10, 2024
9a94316
Merge branch 'yusuf-musleh/collections-crud-rest-api' into jill/colle…
pomegranited Sep 10, 2024
66b2aa8
fix: missing import
pomegranited Sep 10, 2024
715527c
Merge branch 'master' into yusuf-musleh/collections-crud-rest-api
pomegranited Sep 10, 2024
d5aeff8
chore: updates openedx-learning==0.11.4
pomegranited Sep 11, 2024
e6b469d
refactor: Update collections crud rest api (#683)
ChrisChV Sep 11, 2024
4bc7c22
Merge branch 'yusuf-musleh/collections-crud-rest-api' into jill/colle…
pomegranited Sep 11, 2024
bd44874
test: fixed failing test after refactor
pomegranited Sep 11, 2024
bfd548b
feat: index collections when created or updated
pomegranited Sep 11, 2024
0037530
feat: adds num_children to collection index
pomegranited Sep 11, 2024
8a380a7
Merge branch 'master' into jill/collection-components-search
pomegranited Sep 12, 2024
eb78583
Merge branch 'jill/collection-components-search' into jill/reindex-co…
pomegranited Sep 12, 2024
5bdcc9e
fix: store collection.key in the document block_id field
pomegranited Sep 12, 2024
a81ea9a
fix: store library block collections as a dict in search index
pomegranited Sep 12, 2024
360ec35
fix: reindex tags + collections if no changes specified
pomegranited Sep 12, 2024
ab25e2c
Merge branch 'jill/collection-components-search' into jill/reindex-co…
pomegranited Sep 12, 2024
2c6c8cf
fix: apply library collection changes synchronously
pomegranited Sep 12, 2024
79f83d7
Merge pull request #684 from open-craft/jill/reindex-collection
rpenido Sep 12, 2024
8d9f738
Merge branch 'master' into jill/collection-components-search
ChrisChV Sep 13, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 18 additions & 6 deletions docs/hooks/events.rst
Original file line number Diff line number Diff line change
Expand Up @@ -233,17 +233,29 @@ Content Authoring Events
- 2023-07-20

* - `LIBRARY_BLOCK_CREATED <https://github.com/openedx/openedx-events/blob/c0eb4ba1a3d7d066d58e5c87920b8ccb0645f769/openedx_events/content_authoring/signals.py#L167>`_
- org.openedx.content_authoring.content_library.created.v1
- org.openedx.content_authoring.library_block.created.v1
- 2023-07-20

* - `LIBRARY_BLOCK_UPDATED <https://github.com/openedx/openedx-events/blob/c0eb4ba1a3d7d066d58e5c87920b8ccb0645f769/openedx_events/content_authoring/signals.py#L178>`_
- org.openedx.content_authoring.content_library.updated.v1
- org.openedx.content_authoring.library_block.updated.v1
- 2023-07-20

* - `LIBRARY_BLOCK_DELETED <https://github.com/openedx/openedx-events/blob/c0eb4ba1a3d7d066d58e5c87920b8ccb0645f769/openedx_events/content_authoring/signals.py#L189>`_
- org.openedx.content_authoring.content_library.deleted.v1
- org.openedx.content_authoring.library_block.deleted.v1
- 2023-07-20

* - `CONTENT_OBJECT_TAGS_CHANGED <https://github.com/openedx/openedx-events/blob/c0eb4ba1a3d7d066d58e5c87920b8ccb0645f769/openedx_events/content_authoring/signals.py#L207>`_
- org.openedx.content_authoring.content.object.tags.changed.v1
- 2024-03-31
* - `LIBRARY_COLLECTION_CREATED <https://github.com/openedx/openedx-events/blob/main/openedx_events/content_authoring/signals.py#L219>`_
- org.openedx.content_authoring.content_library.collection.created.v1
- 2024-08-23

* - `LIBRARY_COLLECTION_UPDATED <https://github.com/openedx/openedx-events/blob/main/openedx_events/content_authoring/signals.py#L230>`_
- org.openedx.content_authoring.content_library.collection.updated.v1
- 2024-08-23

* - `LIBRARY_COLLECTION_DELETED <https://github.com/openedx/openedx-events/blob/main/openedx_events/content_authoring/signals.py#L241>`_
- org.openedx.content_authoring.content_library.collection.deleted.v1
- 2024-08-23

* - `CONTENT_OBJECT_ASSOCIATIONS_CHANGED <https://github.com/openedx/openedx-events/blob/eb17e03f075b272ad8a29e8435d6a514f8884131/openedx_events/content_authoring/signals.py#L205-L214>`_
- org.openedx.content_authoring.content.object.associations.changed.v1
- 2024-09-06
89 changes: 49 additions & 40 deletions openedx/core/djangoapps/content/search/api.py
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@
searchable_doc_for_course_block,
searchable_doc_for_collection,
searchable_doc_for_library_block,
searchable_doc_collections,
searchable_doc_tags,
)

Expand Down Expand Up @@ -296,16 +297,12 @@ def rebuild_index(status_cb: Callable[[str], None] | None = None) -> None:
status_cb("Counting courses...")
num_courses = CourseOverview.objects.count()

# Get the list of collections
status_cb("Counting collections...")
num_collections = authoring_api.get_collections().count()

# Some counters so we can track our progress as indexing progresses:
num_contexts = num_courses + num_libraries + num_collections
num_contexts = num_courses + num_libraries
num_contexts_done = 0 # How many courses/libraries we've indexed
num_blocks_done = 0 # How many individual components/XBlocks we've indexed

status_cb(f"Found {num_courses} courses, {num_libraries} libraries and {num_collections} collections.")
status_cb(f"Found {num_courses} courses, {num_libraries} libraries.")
with _using_temp_index(status_cb) as temp_index_name:
############## Configure the index ##############

Expand All @@ -326,6 +323,7 @@ def rebuild_index(status_cb: Callable[[str], None] | None = None) -> None:
Fields.tags + "." + Fields.tags_level1,
Fields.tags + "." + Fields.tags_level2,
Fields.tags + "." + Fields.tags_level3,
Fields.collections,
Fields.type,
Fields.access_id,
Fields.last_published,
Expand All @@ -339,6 +337,7 @@ def rebuild_index(status_cb: Callable[[str], None] | None = None) -> None:
Fields.content,
Fields.tags,
Fields.description,
Fields.collections,
# If we don't list the following sub-fields _explicitly_, they're only sometimes searchable - that is, they
# are searchable only if at least one document in the index has a value. If we didn't list them here and,
# say, there were no tags.level3 tags in the index, the client would get an error if trying to search for
Expand Down Expand Up @@ -379,6 +378,7 @@ def index_library(lib_key: str) -> list:
doc = {}
doc.update(searchable_doc_for_library_block(metadata))
doc.update(searchable_doc_tags(metadata.usage_key))
doc.update(searchable_doc_collections(metadata.usage_key))
docs.append(doc)
except Exception as err: # pylint: disable=broad-except
status_cb(f"Error indexing library component {component}: {err}")
Expand All @@ -390,10 +390,43 @@ def index_library(lib_key: str) -> list:
status_cb(f"Error indexing library {lib_key}: {err}")
return docs

############## Collections ##############
def index_collection_batch(batch, num_done) -> int:
docs = []
for collection in batch:
try:
doc = searchable_doc_for_collection(collection)
# Uncomment below line once collections are tagged.
# doc.update(searchable_doc_tags(collection.id))
docs.append(doc)
except Exception as err: # pylint: disable=broad-except
status_cb(f"Error indexing collection {collection}: {err}")
num_done += 1

if docs:
try:
# Add docs in batch of 100 at once (usually faster than adding one at a time):
_wait_for_meili_task(client.index(temp_index_name).add_documents(docs))
except (TypeError, KeyError, MeilisearchError) as err:
status_cb(f"Error indexing collection batch {p}: {err}")
return num_done

for lib_key in lib_keys:
status_cb(f"{num_contexts_done + 1}/{num_contexts}. Now indexing library {lib_key}")
status_cb(f"{num_contexts_done + 1}/{num_contexts}. Now indexing blocks in library {lib_key}")
lib_docs = index_library(lib_key)
num_blocks_done += len(lib_docs)

# To reduce memory usage on large instances, split up the Collections into pages of 100 collections:
library = lib_api.get_library(lib_key)
collections = authoring_api.get_collections(library.learning_package.id, enabled=True)
num_collections = collections.count()
num_collections_done = 0
status_cb(f"{num_collections_done + 1}/{num_collections}. Now indexing collections in library {lib_key}")
paginator = Paginator(collections, 100)
for p in paginator.page_range:
num_collections_done = index_collection_batch(paginator.page(p).object_list, num_collections_done)
status_cb(f"{num_collections_done}/{num_collections} collections indexed for library {lib_key}")

num_contexts_done += 1

############## Courses ##############
Expand Down Expand Up @@ -430,39 +463,6 @@ def add_with_children(block):
num_contexts_done += 1
num_blocks_done += len(course_docs)

############## Collections ##############
status_cb("Indexing collections...")

def index_collection_batch(batch, num_contexts_done) -> int:
docs = []
for collection in batch:
status_cb(
f"{num_contexts_done + 1}/{num_contexts}. "
f"Now indexing collection {collection.title} ({collection.id})"
)
try:
doc = searchable_doc_for_collection(collection)
# Uncomment below line once collections are tagged.
# doc.update(searchable_doc_tags(collection.id))
docs.append(doc)
except Exception as err: # pylint: disable=broad-except
status_cb(f"Error indexing collection {collection}: {err}")
finally:
num_contexts_done += 1

if docs:
try:
# Add docs in batch of 100 at once (usually faster than adding one at a time):
_wait_for_meili_task(client.index(temp_index_name).add_documents(docs))
except (TypeError, KeyError, MeilisearchError) as err:
status_cb(f"Error indexing collection batch {p}: {err}")
return num_contexts_done

# To reduce memory usage on large instances, split up the Collections into pages of 100 collections:
paginator = Paginator(authoring_api.get_collections(enabled=True), 100)
for p in paginator.page_range:
num_contexts_done = index_collection_batch(paginator.page(p).object_list, num_contexts_done)

status_cb(f"Done! {num_blocks_done} blocks indexed across {num_contexts_done} courses, collections and libraries.")


Expand Down Expand Up @@ -575,6 +575,15 @@ def upsert_block_tags_index_docs(usage_key: UsageKey):
_update_index_docs([doc])


def upsert_block_collections_index_docs(usage_key: UsageKey):
"""
Updates the collections data in documents for the given Course/Library block
"""
doc = {Fields.id: meili_id_from_opaque_key(usage_key)}
doc.update(searchable_doc_collections(usage_key))
_update_index_docs([doc])


def _get_user_orgs(request: Request) -> list[str]:
"""
Get the org.short_names for the organizations that the requesting user has OrgStaffRole or OrgInstructorRole.
Expand Down
48 changes: 47 additions & 1 deletion openedx/core/djangoapps/content/search/documents.py
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,9 @@
from hashlib import blake2b

from django.utils.text import slugify
from django.core.exceptions import ObjectDoesNotExist
from opaque_keys.edx.keys import LearningContextKey, UsageKey
from openedx_learning.api import authoring as authoring_api

from openedx.core.djangoapps.content.search.models import SearchAccess
from openedx.core.djangoapps.content_libraries import api as lib_api
Expand Down Expand Up @@ -52,6 +54,8 @@ class Fields:
tags_level1 = "level1"
tags_level2 = "level2"
tags_level3 = "level3"
# List of collection.key strings this object belongs to.
collections = "collections"
# The "content" field is a dictionary of arbitrary data, depending on the block_type.
# It comes from each XBlock's index_dictionary() method (if present) plus some processing.
# Text (html) blocks have an "html_content" key in here, capa has "capa_content" and "problem_types", and so on.
Expand Down Expand Up @@ -223,6 +227,35 @@ def _tags_for_content_object(object_id: UsageKey | LearningContextKey) -> dict:
return {Fields.tags: result}


def _collections_for_content_object(object_id: UsageKey | LearningContextKey) -> dict:
"""
Given an XBlock, course, library, etc., get the collections for its index doc.

e.g. for something in Collections "COL_A" and "COL_B", this would return:
{
"collections": ["COL_A", "COL_B"],
Copy link
Contributor

@rpenido rpenido Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pomegranited Sorry for the late request (I overlooked this before), but what about changing the structure here?

Suggested change
"collections": ["COL_A", "COL_B"],
"collections": [
{ "display_name": "Collection A", key: "COL_A" },
{ "display_name": "Collection B", key: "COL_B" },
],

That way, we can use the display_name as a searchable attribute and the slug/key to a dev action (like a redirect).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure @rpenido , but I decided to store the collections data like we store the tags data, e.g.

   "collections": {
      "display_name": ["Collection One", "Another Collection"],
      "key": ["collection-one", "another-key"],
   },

cf a81ea9a

}

Returns an empty dict if the object is not in any collections.
"""
# Gather the collections associated with this object
result = {}
collections = []
try:
component = lib_api.get_component_from_usage_key(object_id)
collections = authoring_api.get_entity_collections(
component.learning_package_id,
component.key,
).values_list("key", flat=True)
except ObjectDoesNotExist:
log.warning(f"No component found for {object_id}")

if collections:
result[Fields.collections] = list(collections)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the component is not in any collections, we must return an empty list to ensure the index is updated with this information.

I created a small PR here: open-craft#685

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


return result


def searchable_doc_for_library_block(xblock_metadata: lib_api.LibraryXBlockMetadata) -> dict:
"""
Generate a dictionary document suitable for ingestion into a search engine
Expand Down Expand Up @@ -265,6 +298,19 @@ def searchable_doc_tags(usage_key: UsageKey) -> dict:
return doc


def searchable_doc_collections(usage_key: UsageKey) -> dict:
"""
Generate a dictionary document suitable for ingestion into a search engine
like Meilisearch or Elasticsearch, with the collections data for the given content object.
"""
doc = {
Fields.id: meili_id_from_opaque_key(usage_key),
}
doc.update(_collections_for_content_object(usage_key))

return doc


def searchable_doc_for_course_block(block) -> dict:
"""
Generate a dictionary document suitable for ingestion into a search engine
Expand All @@ -288,7 +334,7 @@ def searchable_doc_for_collection(collection) -> dict:
found using faceted search.
"""
doc = {
Fields.id: collection.id,
Fields.id: collection.key,
Copy link
Contributor

@rpenido rpenido Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are better with the id here. I fixed it in the other PR and add comments here: 1738447

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#35321 is merged, so we can update this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the collection.key in order to call the REST APIs though -- I've re-added it in the internal PR: open-craft@5bdcc9e

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for catching this @pomegranited!

Fields.type: DocType.collection,
Fields.display_name: collection.title,
Fields.description: collection.description,
Expand Down
37 changes: 24 additions & 13 deletions openedx/core/djangoapps/content/search/handlers.py
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,14 @@

from django.db.models.signals import post_delete
from django.dispatch import receiver
from openedx_events.content_authoring.data import ContentLibraryData, ContentObjectData, LibraryBlockData, XBlockData
from openedx_events.content_authoring.data import (
ContentLibraryData,
ContentObjectChangedData,
LibraryBlockData,
XBlockData,
)
from opaque_keys import InvalidKeyError
from opaque_keys.edx.keys import UsageKey
from openedx_events.content_authoring.signals import (
CONTENT_LIBRARY_DELETED,
CONTENT_LIBRARY_UPDATED,
Expand All @@ -16,20 +23,19 @@
XBLOCK_CREATED,
XBLOCK_DELETED,
XBLOCK_UPDATED,
CONTENT_OBJECT_TAGS_CHANGED,
CONTENT_OBJECT_ASSOCIATIONS_CHANGED,
)
from openedx.core.djangoapps.content_tagging.utils import get_content_key_from_string

from openedx.core.djangoapps.content.course_overviews.models import CourseOverview
from openedx.core.djangoapps.content.search.models import SearchAccess

from .api import only_if_meilisearch_enabled, upsert_block_tags_index_docs
from .api import only_if_meilisearch_enabled, upsert_block_collections_index_docs, upsert_block_tags_index_docs
from .tasks import (
delete_library_block_index_doc,
delete_xblock_index_doc,
update_content_library_index_docs,
upsert_library_block_index_doc,
upsert_xblock_index_doc
upsert_xblock_index_doc,
)

log = logging.getLogger(__name__)
Expand Down Expand Up @@ -145,22 +151,27 @@ def content_library_updated_handler(**kwargs) -> None:
update_content_library_index_docs.apply(args=[str(content_library_data.library_key)])


@receiver(CONTENT_OBJECT_TAGS_CHANGED)
@receiver(CONTENT_OBJECT_ASSOCIATIONS_CHANGED)
@only_if_meilisearch_enabled
def content_object_tags_changed_handler(**kwargs) -> None:
def content_object_associations_changed_handler(**kwargs) -> None:
"""
Update the tags data in the index for the Content Object
Update the collections/tags data in the index for the Content Object
"""
content_object_tags = kwargs.get("content_object", None)
if not content_object_tags or not isinstance(content_object_tags, ContentObjectData):
content_object = kwargs.get("content_object", None)
if not content_object or not isinstance(content_object, ContentObjectChangedData):
log.error("Received null or incorrect data for event")
return

try:
# Check if valid if course or library block
get_content_key_from_string(content_object_tags.object_id)
except ValueError:
usage_key = UsageKey.from_string(str(content_object.object_id))
except InvalidKeyError:
log.error("Received invalid content object id")
return

upsert_block_tags_index_docs(content_object_tags.object_id)
# This event's changes may contain both "tags" and "collections", but this will happen rarely, if ever.
# So we allow a potential double "upsert" here.
if "tags" in content_object.changes:
upsert_block_tags_index_docs(usage_key)
elif "collections" in content_object.changes:
upsert_block_collections_index_docs(usage_key)
Copy link
Contributor

@rpenido rpenido Sep 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: If we have empty changes, we assume everything will change.
Also, we cannot use elif here, or we may skip the collecions update.

Suggested change
if "tags" in content_object.changes:
upsert_block_tags_index_docs(usage_key)
elif "collections" in content_object.changes:
upsert_block_collections_index_docs(usage_key)
if not content_object.changes or "tags" in content_object.changes:
upsert_block_tags_index_docs(usage_key)
if not content_object.changes or "collections" in content_object.changes:
upsert_block_collections_index_docs(usage_key)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 360ec35

Loading
Loading