Skip to content
This repository has been archived by the owner on Mar 1, 2024. It is now read-only.

feat: load SharePoint Pages content, feat: load docs from root folder in drive, feat: optionally only load specific file types. #930

Open
wants to merge 17 commits into
base: main
Choose a base branch
from

Conversation

CraftingLevi
Copy link

@CraftingLevi CraftingLevi commented Feb 6, 2024

Description

Added functionality to load more data from a Sharepoint Site.

  • Added functionality to download files from the root folder of a Sharepoint Site
  • Added functionality to download the HTML content of Sharepoint Pages for a Sharepoint Site
  • Added functionality to limit downloading of files to specific file types.

Fixes # (issue)
#936
#937
#938

Type of Change

Please delete options that are not relevant.

  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • Added new unit/integration tests
  • Added new notebook (that tests end-to-end)
  • I stared at the code and made sure it makes sense

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I ran make format; make lint to appease the lint gods

@CraftingLevi
Copy link
Author

The tests fail in _extract_page, the output type hints states ‘None | Dict[str…’ , because sometimes a SharePoint page has no ‘TextWebParts’, meaning there is no valid text to extract from the HTML. When this is the case, None is returned.

instead, this situation should raise a ValueError that is handled by _download_pages_and_extract_metadata by passing on that page.

@CraftingLevi
Copy link
Author

Tested locally after installing test_requirements.txt, got a missing dependency error (llmsherpa), which is likely an issue with test_requirements.txt.

After installing llmsherpa, tests run with 91 passed, 13 skipped, 1 warning in 17.43s

@CraftingLevi
Copy link
Author

I found out there is a way to do batch requests. I have an implementation running now that, for a sharepoint with +200 sites cuts down the download step from 1 minute to 7 seconds.

Please wait with running tests until this is committed.

@CraftingLevi CraftingLevi reopened this Feb 7, 2024
@CraftingLevi
Copy link
Author

Successfully implemented the use of batch requests to retrieve the content of pages, significantly speeding up retrieval of page content.

Ran all tests locally with result 91 passed, 13 skipped, 2 warnings in 8.73s, after format and lint.

@CraftingLevi CraftingLevi changed the title Add support for loading files from root drive in for Sharepoint Site, loading SharePoint pages from a site and limiting document loading to specific filetypes. feat: load SharePoint Pages content, feat: load docs from root folder in drive, feat: optionally only load specific file types. Feb 7, 2024
@CraftingLevi CraftingLevi marked this pull request as draft February 8, 2024 14:13
@CraftingLevi
Copy link
Author

Closes #938
Closes #937
Closes #936

@CraftingLevi CraftingLevi marked this pull request as ready for review February 8, 2024 14:20
Copy link
Collaborator

@anoopshrma anoopshrma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great @CraftingLevi! Have added minor comments that needs to be addressed. Then we can move ahead and merge

llama_hub/microsoft_sharepoint/README.md Show resolved Hide resolved
llama_hub/microsoft_sharepoint/base.py Outdated Show resolved Hide resolved
llama_hub/microsoft_sharepoint/base.py Outdated Show resolved Hide resolved
llama_hub/microsoft_sharepoint/base.py Outdated Show resolved Hide resolved
@CraftingLevi
Copy link
Author

CraftingLevi commented Feb 13, 2024

@anoopshrma I've added all the requested changes and synced the fork.

In summary: I've adjusted the default argument of 'root' for 'sharepoint_folder_path' to "", in case someone ever wants to have a folder in the root folder called 'root', then this allows for that to happen. And it works nicely with the if/else comment you've made.

Ran make format, make lint and make test, seems all good.

@anoopshrma
Copy link
Collaborator

Hey @CraftingLevi ,
Highly appreciate you contribution here.
With recent llama_index update to v0.10.x, llama_hub is now deprecated. All the loaders have now moved into llamaindex repo.

Could you push your changes there directly!!
That would help a lot!
https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/readers/llama-index-readers-microsoft-sharepoint

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants