feat: load SharePoint Pages content, feat: load docs from root folder in drive, feat: optionally only load specific file types. #930

CraftingLevi · 2024-02-06T15:13:12Z

Description

Added functionality to load more data from a SharePoint site.

Added functionality to download files from the root folder of a SharePoint site
Added functionality to download the HTML content of SharePoint Pages for a SharePoint site
Added functionality to limit downloading of files to specific file types.
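
As a rough illustration of the file type filter described above, here is a minimal sketch. The function name `filter_by_file_types` and the shape of `items` (dicts with a "name" key) are assumptions for illustration, not the code in this PR; only the `file_types` argument name comes from the commits.

```python
from typing import Dict, List, Optional


def filter_by_file_types(
    items: List[Dict[str, str]], file_types: Optional[List[str]] = None
) -> List[Dict[str, str]]:
    """Keep only items whose name ends with one of the given extensions.

    `items` mimics drive item metadata with a "name" key (an assumption);
    passing `file_types=None` keeps everything.
    """
    if not file_types:
        return list(items)
    # Normalize "pdf" and ".PDF" to ".pdf" so callers can use either form.
    suffixes = tuple("." + ft.lower().lstrip(".") for ft in file_types)
    return [item for item in items if item["name"].lower().endswith(suffixes)]
```

Filtering on metadata before downloading means unwanted files never have to be fetched at all.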

Fixes # (issue)
#936
#937
#938

Type of Change

Please delete options that are not relevant.

This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

Added new unit/integration tests
Added new notebook (that tests end-to-end)
I stared at the code and made sure it makes sense

Suggested Checklist:

I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes
I ran make format; make lint to appease the lint gods

CraftingLevi · 2024-02-07T06:28:32Z

The tests fail in _extract_page: its return type hint states ‘None | Dict[str…’ because a SharePoint page sometimes has no ‘TextWebParts’, meaning there is no valid text to extract from the HTML. When this is the case, None is returned.

Instead, this situation should raise a ValueError, which _download_pages_and_extract_metadata handles by skipping that page.
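
The raise-and-skip pattern proposed here could look roughly like the sketch below. The function names and the simplified `text_web_parts` field are hypothetical stand-ins, not the actual `_extract_page` or `_download_pages_and_extract_metadata` code.

```python
from typing import Any, Dict, List


def extract_page(page: Dict[str, Any]) -> Dict[str, str]:
    """Return the page text, or raise ValueError when there is none."""
    # "text_web_parts" is a hypothetical, simplified field name.
    parts = page.get("text_web_parts") or []
    if not parts:
        raise ValueError(f"Page {page.get('id')!r} has no text web parts")
    return {"id": str(page.get("id")), "text": "\n".join(parts)}


def extract_all_pages(pages: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Extract every page that has text, skipping the ones that do not."""
    extracted = []
    for page in pages:
        try:
            extracted.append(extract_page(page))
        except ValueError:
            # No usable text on this page: pass over it and continue.
            continue
    return extracted
```

With this shape the extractor's return type no longer needs `None`, and the decision to skip empty pages lives in one place in the caller.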

CraftingLevi · 2024-02-07T08:13:27Z

Tested locally after installing test_requirements.txt, got a missing dependency error (llmsherpa), which is likely an issue with test_requirements.txt.

After installing llmsherpa, tests run with 91 passed, 13 skipped, 1 warning in 17.43s

CraftingLevi · 2024-02-07T13:43:12Z

I found out there is a way to do batch requests. I have an implementation running now that, for a SharePoint with 200+ sites, cuts the download step from 1 minute to 7 seconds.

Please hold off on running the tests until this is committed.

CraftingLevi · 2024-02-07T15:24:32Z

Successfully implemented the use of batch requests to retrieve the content of pages, significantly speeding up retrieval of page content.
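
For readers unfamiliar with batching, the sketch below shows one way to group page requests, assuming Microsoft Graph JSON batching, which accepts at most 20 requests per batch. The helper name and the relative page URL are assumptions; the endpoint this PR actually calls is not shown in the conversation.

```python
from typing import Dict, Iterator, List

# Microsoft Graph JSON batching accepts at most 20 requests per batch.
MAX_BATCH_SIZE = 20


def build_batch_payloads(
    site_id: str, page_ids: List[str], batch_size: int = MAX_BATCH_SIZE
) -> Iterator[Dict[str, list]]:
    """Yield one `$batch` request body per chunk of page ids."""
    for start in range(0, len(page_ids), batch_size):
        chunk = page_ids[start : start + batch_size]
        yield {
            "requests": [
                {
                    # The id lets each response be matched to its request.
                    "id": page_id,
                    "method": "GET",
                    # Assumed relative URL; the PR's actual endpoint may differ.
                    "url": f"/sites/{site_id}/pages/{page_id}",
                }
                for page_id in chunk
            ]
        }
```

Each payload would then be POSTed to the Graph `$batch` endpoint with the usual bearer token, so 200 pages need about 10 round trips instead of 200.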

Ran all tests locally with result 91 passed, 13 skipped, 2 warnings in 8.73s, after format and lint.

CraftingLevi · 2024-02-08T14:19:49Z

Closes #938
Closes #937
Closes #936

anoopshrma

This looks great @CraftingLevi! I have added minor comments that need to be addressed. Then we can move ahead and merge.

llama_hub/microsoft_sharepoint/README.md

llama_hub/microsoft_sharepoint/base.py

CraftingLevi · 2024-02-13T11:43:55Z

@anoopshrma I've added all the requested changes and synced the fork.

In summary: I've changed the default value of 'sharepoint_folder_path' from 'root' to "". That way, if someone ever has a folder named 'root' inside the root folder, it can still be targeted. It also works nicely with the if/else comment you made.
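
A minimal sketch of why an empty-string default is unambiguous, assuming the Microsoft Graph drive endpoints for listing children. The helper `folder_children_endpoint` is hypothetical and not the code in base.py.

```python
def folder_children_endpoint(drive_id: str, sharepoint_folder_path: str = "") -> str:
    """Build the Graph URL that lists the children of a drive folder."""
    base = f"https://graph.microsoft.com/v1.0/drives/{drive_id}"
    path = sharepoint_folder_path.strip("/")
    if path == "":
        # Empty path: list the drive's root folder itself.
        return f"{base}/root/children"
    # Non-empty path, including a folder literally named "root".
    return f"{base}/root:/{path}:/children"
```

With a sentinel value of 'root', the second branch could never be reached for a folder that is actually called 'root'; the empty string avoids that collision.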

Ran make format, make lint and make test, seems all good.

anoopshrma · 2024-02-15T15:44:11Z

Hey @CraftingLevi,
I highly appreciate your contribution here.
With the recent llama_index update to v0.10.x, llama_hub is now deprecated. All the loaders have moved into the llama_index repo.

Could you push your changes there directly?
That would help a lot!
https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/readers/llama-index-readers-microsoft-sharepoint

levi added 4 commits February 6, 2024 14:05

added functionality to load all documents, added functionality to loa…

da3a699

…d all pages

updated with new 'include' argument and 'file_types' argument to filt…

7211e27

…er file_types when loading documents.

Added documentation

ba27fab

running format and lint

ea1dfb6

fixed issue with None in output type hints

caaf94f

running format and lint

b70102e

CraftingLevi closed this Feb 7, 2024

levi and others added 3 commits February 7, 2024 15:49

implemented batch call for page content

1146d69

Improved type hints, documentation and comments

45d5d41

Merge branch 'run-llama:main' into main

6c5965d

CraftingLevi reopened this Feb 7, 2024

Merge branch 'main' of https://github.com/CraftingLevi/llama-hub

616f55a

CraftingLevi marked this pull request as draft February 8, 2024 14:13

CraftingLevi marked this pull request as ready for review February 8, 2024 14:20

fix typo

2901c8c

anoopshrma approved these changes Feb 12, 2024

llama_hub/microsoft_sharepoint/README.md (resolved)

llama_hub/microsoft_sharepoint/base.py (outdated, resolved)

llama_hub/microsoft_sharepoint/base.py (outdated, resolved)

llama_hub/microsoft_sharepoint/base.py (outdated, resolved)

levi and others added 5 commits February 13, 2024 11:33

Updated ReadMe

0e01fa2

Updated docstring _get_site_id_with_host_name

aca4c9e

Merge branch 'run-llama:main' into main

f03b2e9

changed default sharepoint_folder_path argument to "" instead of root

8647de0

Updated ReadMe with more explicit instructions on how to use paramete…

9f1875a

…rs, and what the default parameters are.

fix typo

ba66cee

CraftingLevi requested a review from anoopshrma February 15, 2024 08:45

anoopshrma approved these changes Feb 15, 2024
