Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor local source references #140

Merged
merged 2 commits into from
Nov 5, 2024
Merged

Refactor local source references #140

merged 2 commits into from
Nov 5, 2024

Conversation

geoo89
Copy link
Collaborator

@geoo89 geoo89 commented Nov 2, 2024

Streamline how local files can be referenced as source sheets. Previously, only certain combinations of formats and lists/dicts worked.

To test, create a folder csv with a subfolder safeguarding and put some csv files inside. Then use the configuration below as config.json (it also references the xlsx files within this repository). Only invoke pull_data.

Note that files_dict key or files_list entries may not contain certain special characters such as /, as they are used as filenames. To reference a list of file paths, you'll need to either need to have all files in the same folder and specify a basepath, or use the sheet_names dict.

Question: Because the storage filenames (after running pull_data) are determined by the files_list entries, should we automatically append .xlsx and .json (for the respective subformats) so these file endings don't have to be specified in the list? This way we avoid storage filenames ending in .xlsx.json .json.json.

Note: I haven't tested this on any existing deployments, @edmoss345 could you please check that it doesn't break anything? I also haven't tested zip archives (with the files_archive config option). @edmoss345 I don't remember where you had an example use case, maybe you can test this or refer me to that zip file. Are there any deployments where this is in use currently?

{
    "meta": {
        "version": "1.0.0",
        "pipeline_version": "1.0.0"
    },
    "parents": {},
    "flows_outputbasename": "parenttext_all",
    "output_split_number": 1,
    "sheet_names" : {
        "csv_safeguarding" : "csv/safeguarding",
        "xlsx_safeguarding" : "excel_files/safeguarding crisis.xlsx"
    },
    "sources": {
        "safeguarding_csv_dict": {
            "parent_sources": [],
            "format": "sheets",
            "subformat": "csv",
            "files_dict": {
                "safeguarding": "csv/safeguarding"
            }
        },
        "safeguarding_csv_list": {
            "parent_sources": [],
            "format": "sheets",
            "subformat": "csv",
            "files_list": [
                "csv_safeguarding"
            ]
        },
        "safeguarding_csv_list_remap": {
            "parent_sources": [],
            "format": "sheets",
            "subformat": "csv",
            "basepath": "csv",
            "files_list": [
                "safeguarding"
            ]
        },
        "safeguarding_xlsx_dict": {
            "parent_sources": [],
            "format": "sheets",
            "subformat": "xlsx",
            "files_dict": {
                "safeguarding": "excel_files/safeguarding crisis.xlsx"
            }
        },
        "safeguarding_xlsx_list_remap": {
            "parent_sources": [],
            "format": "sheets",
            "subformat": "xlsx",
            "files_list": [
                "xlsx_safeguarding"
            ]
        },
        "safeguarding_xlsx_list": {
            "parent_sources": [],
            "basepath": "excel_files",
            "format": "sheets",
            "subformat": "xlsx",
            "files_list": [
                "safeguarding crisis.xlsx"
            ]
        }
    },
    "steps": []
}

@geoo89 geoo89 requested a review from edmoss345 November 2, 2024 13:35
@edmoss345
Copy link
Collaborator

@geoo89 I think the use case for the use of the archive was the initial south africa deployment, https://github.com/IDEMSInternational/parenttext-south-africa/blob/main/config.py. But I think this was before we transitioned to using your updated pipeline code. I am not sure we have a use case of using the archive functionality with the json config file but I will set something up to do a test within the crisis deployment

@edmoss345
Copy link
Collaborator

@geoo89 I have tested and alll seems to be working as expected

@geoo89 geoo89 merged commit 8905a39 into main Nov 5, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants