I had the idea to make a dataset out of IBM's Project CodeNet code samples (a little under 14 million files in total). I converted them all into text files (ending in .txt rather than .py, .c, etc.). After a couple of earlier encoding problems, which I solved by removing about 1 million files with bad encodings (leaving roughly 12 million files), I tried to run it again. It then gave a different error:
Traceback (most recent call last):
File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/unstructured/partition/json.py", line 45, in partition_json
dict = json.loads(file_text)
^^^^^^^^^^^^^^^^^^^^^
File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/json/__init__.py", line 346, in loads
return _default_decoder.decode(s)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 2 (char 1)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/media/impromise/ExternalDrive/chatgpt-retrieval-main.txt.txt/chatgpt.py", line 36, in <module>
index = VectorstoreIndexCreator().from_loaders([loader])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/langchain/indexes/vectorstore.py", line 81, in from_loaders
docs.extend(loader.load())
^^^^^^^^^^^^^
File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/langchain/document_loaders/directory.py", line 156, in load
self.load_file(i, p, docs, pbar)
File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/langchain/document_loaders/directory.py", line 105, in load_file
raise e
File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/langchain/document_loaders/directory.py", line 99, in load_file
sub_docs = self.loader_cls(str(item), **self.loader_kwargs).load()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 86, in load
elements = self._get_elements()
^^^^^^^^^^^^^^^^^^^^
File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/langchain/document_loaders/unstructured.py", line 172, in _get_elements
return partition(filename=self.file_path, **self.unstructured_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/unstructured/partition/auto.py", line 230, in partition
elements = partition_json(filename=filename, file=file, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/unstructured/documents/elements.py", line 138, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/unstructured/file_utils/filetype.py", line 519, in wrapper
elements = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/media/impromise/ExternalDrive/miniconda3/lib/python3.11/site-packages/unstructured/partition/json.py", line 48, in partition_json
raise ValueError("Not a valid json")
ValueError: Not a valid json
This has had me stuck for a while. The same error came up before, and at the time I decided to move the affected folders aside to fix later, but there are 4053 folders overall and I couldn't move every problematic one. I also looked through my dataset to make sure there were no shell, JSON, or CSV files that could cause such an issue, but there were none: it's just the .txt files, the cat.pdf file, and a Word doc. Without the dataset, I've been able to run the program with little difficulty, and certain (if not most) folders were fine to use as datasets on their own.
Why is this error happening, and how can I fix it? I am using Ubuntu 22.04 and Python 3.11.4. I've placed the program files on an external hard drive, where the program has been able to run. If the error can't be fixed, is there any way to work around it?
json.loads() takes a JSON string and converts it into a dictionary. By a JSON string we mean a JSON object enclosed in quotes, such as '{"name": "John", "age": 30, "car": null}'. While it is wrapped in quotes you can't access any value with dictionary methods, because it behaves like a plain string; that's why you convert it into a dictionary with json.loads(STRING_NAME).
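For example, a minimal sketch of the difference (the literal values are just illustrative):

import json

# A JSON object serialized as a Python string: json.loads turns it into a dict.
text = '{"name": "John", "age": 30, "car": null}'
data = json.loads(text)
print(data["name"])  # John

# Arbitrary text (e.g. source code saved as .txt) is not valid JSON,
# so json.loads raises json.decoder.JSONDecodeError, as in the traceback above.
json.loads("print('hello')")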
The text you are loading does not contain data in JSON form.
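If the goal is just to index plain-text code files, one way to sidestep unstructured's file-type detection (which is what ends up calling partition_json on files it guesses are JSON) is to tell DirectoryLoader to use TextLoader explicitly. This is only a rough sketch, assuming the dataset sits in a hypothetical dataset/ directory; adjust the path, glob, and encoding to your layout:

from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.indexes import VectorstoreIndexCreator

# Load every .txt file as plain text instead of letting
# UnstructuredFileLoader auto-detect the file type.
loader = DirectoryLoader(
    "dataset/",                # hypothetical path to the CodeNet .txt files
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
    silent_errors=True,        # skip files that still fail instead of aborting
    show_progress=True,
)

index = VectorstoreIndexCreator().from_loaders([loader])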