Note: if you'd like to ask a question or open a discussion, head over to the Discussions section and post it there.
Describe the bug and how to reproduce it
I put some .docx and .pptx files in the source documents folder (ingestion worked fine with just the state of the union sample) and now it won't ingest them.
Expected behavior
I expected it to ingest my documents.
Environment (please complete the following information):
Additional context
Here's what I'm getting. I'm not sure what to change on the NLTK front.
Here's a summary of what's happening:
- The script is run with python3 ingest.py.
- NLTK reports "Error loading averaged_perceptron_tagger" and "Error loading punkt"; the messages indicate the required NLTK resource punkt cannot be found.
- The script appends to an existing vectorstore at db, using embedded DuckDB with persistence (data stored in db).
- It loads documents from the source_documents directory.
- While loading new documents, the same NLTK errors appear again.
- Loading is then interrupted by a traceback from load_single_document, which shows the punkt resource was not found and advises using the NLTK Downloader to obtain it.
In summary, ingest.py fails because the NLTK punkt resource is missing, which prevents it from loading and processing documents.
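For reference, here's what the missing-resource lookup normally expects — a minimal sketch of fetching the resources by hand and verifying they can be found (this assumes the download itself succeeds, which, as the log below shows, is exactly what fails for me):

>>> import nltk
>>> nltk.data.path  # directories NLTK searches; the same list shown in the traceback
>>> nltk.download('punkt')  # saves to ~/nltk_data by default
>>> nltk.download('averaged_perceptron_tagger')
>>> nltk.data.find('tokenizers/punkt')  # raises LookupError if still missing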
The big error I'm getting:
LookupError:
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

Whole thing:
Appending to existing vectorstore at db
Using embedded DuckDB with persistence: data will be stored in: db
Loading documents from source_documents
Loading new documents: 1%|▏ | 1/98 [00:02<04:42, 2.92s/it][nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1002)>
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data] [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify
[nltk_data] failed: unable to get local issuer certificate
[nltk_data] (_ssl.c:1002)>
[... the same punkt and averaged_perceptron_tagger errors repeat several more times ...]
Loading new documents: 7%|█▌ | 7/98 [00:05<01:06, 1.36it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/privateGPT-main/ingest.py", line 89, in load_single_document
return loader.load()
^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 71, in load
elements = self._get_elements()
^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/langchain/document_loaders/word_document.py", line 102, in _get_elements
return partition_docx(filename=self.file_path, **self.unstructured_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/unstructured/partition/docx.py", line 144, in partition_docx
para_element: Optional[Text] = _paragraph_to_element(paragraph)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/unstructured/partition/docx.py", line 185, in _paragraph_to_element
return _text_to_element(text)
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/unstructured/partition/docx.py", line 201, in _text_to_element
elif is_possible_narrative_text(text):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 76, in is_possible_narrative_text
if exceeds_cap_ratio(text, threshold=cap_threshold):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 273, in exceeds_cap_ratio
if sentence_count(text, 3) > 1:
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 222, in sentence_count
sentences = sent_tokenize(text)
^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 38, in sent_tokenize
return _sent_tokenize(text)
^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
tokenizer = load(f"tokenizers/punkt/{language}.pickle")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/nltk/data.py", line 750, in load
opened_resource = _open(resource_url)
^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/nltk/data.py", line 876, in _open
return find(path_, path + [""]).open()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/nltk/data.py", line 583, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/Users/kylepennell/nltk_data'
- '/Users/kylepennell/Desktop/myenv2/nltk_data'
- '/Users/kylepennell/Desktop/myenv2/share/nltk_data'
- '/Users/kylepennell/Desktop/myenv2/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/kylepennell/Desktop/privateGPT-main/ingest.py", line 166, in <module>
main()
File "/Users/kylepennell/Desktop/privateGPT-main/ingest.py", line 150, in main
texts = process_documents([metadata['source'] for metadata in collection['metadatas']])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/privateGPT-main/ingest.py", line 118, in process_documents
documents = load_documents(source_directory, ignored_files)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/privateGPT-main/ingest.py", line 107, in load_documents
for i, docs in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 873, in next
raise value
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/Users/kylepennell/nltk_data'
- '/Users/kylepennell/Desktop/myenv2/nltk_data'
- '/Users/kylepennell/Desktop/myenv2/share/nltk_data'
- '/Users/kylepennell/Desktop/myenv2/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
(myenv2) kylepennell@Kyles-MacBook-Pro privateGPT-main % python3 ingest.py
[nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data] [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify
[nltk_data] failed: unable to get local issuer certificate
[nltk_data] (_ssl.c:1002)>
Appending to existing vectorstore at db
Using embedded DuckDB with persistence: data will be stored in: db
Loading documents from source_documents
Loading new documents: 0%| | 0/98 [00:00<?, ?it/s][nltk_data] Error loading averaged_perceptron_tagger: <urlopen error
[nltk_data] [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify
[nltk_data] failed: unable to get local issuer certificate
[nltk_data] (_ssl.c:1002)>
[... the same averaged_perceptron_tagger error repeats eleven more times ...]
Loading new documents: 1%|▏ | 1/98 [00:04<06:30, 4.03s/it][nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1002)>
[... the punkt error repeats three more times, followed by four more averaged_perceptron_tagger errors ...]
Loading new documents: 7%|█▌ | 7/98 [00:04<01:01, 1.49it/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/privateGPT-main/ingest.py", line 96, in load_single_document
return loader.load()
^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 71, in load
elements = self._get_elements()
^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/langchain/document_loaders/word_document.py", line 102, in _get_elements
return partition_docx(filename=self.file_path, **self.unstructured_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/unstructured/partition/docx.py", line 144, in partition_docx
para_element: Optional[Text] = _paragraph_to_element(paragraph)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/unstructured/partition/docx.py", line 185, in _paragraph_to_element
return _text_to_element(text)
^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/unstructured/partition/docx.py", line 201, in _text_to_element
elif is_possible_narrative_text(text):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 76, in is_possible_narrative_text
if exceeds_cap_ratio(text, threshold=cap_threshold):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 273, in exceeds_cap_ratio
if sentence_count(text, 3) > 1:
^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/unstructured/partition/text_type.py", line 222, in sentence_count
sentences = sent_tokenize(text)
^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/unstructured/nlp/tokenize.py", line 38, in sent_tokenize
return _sent_tokenize(text)
^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/nltk/tokenize/__init__.py", line 106, in sent_tokenize
tokenizer = load(f"tokenizers/punkt/{language}.pickle")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/nltk/data.py", line 750, in load
opened_resource = _open(resource_url)
^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/nltk/data.py", line 876, in _open
return find(path_, path + [""]).open()
^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/myenv2/lib/python3.11/site-packages/nltk/data.py", line 583, in find
raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:
>>> import nltk
>>> nltk.download('punkt')
For more information see: https://www.nltk.org/data.html
Attempted to load tokenizers/punkt/PY3/english.pickle
Searched in:
- '/Users/kylepennell/nltk_data'
- '/Users/kylepennell/Desktop/myenv2/nltk_data'
- '/Users/kylepennell/Desktop/myenv2/share/nltk_data'
- '/Users/kylepennell/Desktop/myenv2/lib/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- ''
**********************************************************************
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/kylepennell/Desktop/privateGPT-main/ingest.py", line 173, in <module>
main()
File "/Users/kylepennell/Desktop/privateGPT-main/ingest.py", line 157, in main
texts = process_documents([metadata['source'] for metadata in collection['metadatas']])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/privateGPT-main/ingest.py", line 125, in process_documents
documents = load_documents(source_directory, ignored_files)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/kylepennell/Desktop/privateGPT-main/ingest.py", line 114, in load_documents
for i, docs in enumerate(pool.imap_unordered(load_single_document, filtered_files)):
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 873, in next
raise value
LookupError:
[... same "Resource punkt not found" message and search paths as above ...]
(myenv2) kylepennell@Kyles-MacBook-Pro privateGPT-main % python3
Python 3.11.4 (v3.11.4:d2340ef257, Jun 6 2023, 19:15:51) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download('punkt')
[nltk_data] Error loading punkt: <urlopen error [SSL:
[nltk_data] CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data] unable to get local issuer certificate (_ssl.c:1002)>
False
>>> exit()
(myenv2) kylepennell@Kyles-MacBook-Pro privateGPT-main %
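For anyone else hitting this: from what I can tell, this looks like the known macOS quirk where the Python.org installer ships its own OpenSSL and doesn't trust the system certificate store, so urllib (which NLTK's downloader uses under the hood) can't verify the connection to the download server. Two things to try — running the certificate installer bundled with the Python.org build, or, as a less safe one-off, disabling verification just for the download. A sketch (the application path assumes a Python.org 3.11 install):

% "/Applications/Python 3.11/Install Certificates.command"

or, if that isn't available:

>>> import ssl
>>> # One-off workaround: skip certificate verification for the download only
>>> ssl._create_default_https_context = ssl._create_unverified_context
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('averaged_perceptron_tagger')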