
Ocn 027 rw0 #292

Open · wants to merge 23 commits into master
Conversation

@clairehemmerly (Collaborator) commented Jun 7, 2022

Checklist for Reviewing a Pre-Processing Script

  • Does the python script contain the following 4 sections: Download data and save to your data directory, Process data, Upload processed data to Carto/Upload processed data to Google Earth Engine, Upload original data and processed data to Amazon S3 storage?
  • Does the script have the standardized variable names for: dataset_name, raw_data_file, processed_data_file?
  • Does the script create and use a 'data' directory?
  • Does the script use a python module to automatically download the data? If this is not possible, are there explicit instructions for how to download the data (step by step with instructions about every input parameter and button to click to find the exact data) and does it use shutil to move the data from 'Downloads' into the data directory?
  • Is the script automated as much as possible to minimize rewriting code the next time the dataset updates (ex 1: can you automatically pull out column names instead of typing them yourself? ex 2: if you are dropping columns with no data, did you use pandas to find the nodata columns instead of dropping the column by name?)
  • Are there comments on almost every line of code to explicitly state what is being done and why?
  • Are you uploading to the correct AWS location?
  • For GEE scripts, did you explicitly define the band manifest with a pyramiding policy so that we can easily change it later if we need to?
  • Does the README contain all relevant links?
  • Does the README state the original file type?
  • Does the README list all of the processing steps that were taken in the script?
  • For netcdfs, does the README state which variables were pulled from the netcdf?
  • Did you use the util functions whenever possible?
  • Have you checked the processed data file on your computer to make sure it matches the data you uploaded to Carto? (Spaces or symbols in column titles often get changed in the upload process; please change these before uploading so that the backed-up processed data matches the data on Carto.)
  • Do the folder name and python script file name match the dataset_name variable in the script? Are they all lowercase? Does the processing script end in '_processing.py'? (This part is often forgotten.)
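The automation items in the checklist (pulling column names programmatically, finding nodata columns with pandas instead of dropping them by name) can be sketched roughly as follows; the frame and column names here are made up purely for illustration:

```python
import pandas as pd

# hypothetical frame standing in for a processed dataset
df = pd.DataFrame({
    'name': ['a', 'b'],
    'n_input': [1.0, 2.0],
    'pct_n': [None, None],  # a column with no data
})

# pull the column names programmatically instead of typing them yourself
all_cols = list(df.columns)

# find the nodata columns with pandas rather than hard-coding their names
nodata_cols = [col for col in df.columns if df[col].isna().all()]
df = df.drop(columns=nodata_cols)
```

When the source adds or renames columns on the next update, this code keeps working without edits, which is the point of the checklist item.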

@clairehemmerly clairehemmerly requested a review from chrowe June 7, 2022 16:42
@clairehemmerly clairehemmerly self-assigned this Jun 7, 2022
@chrowe (Member) left a comment


A few minor change requests and a couple of questions, but overall this looks good.
The only thing I didn't check was whether the column names in Carto match those in the files sent to S3. Did you do that?


You can also download the original dataset [directly through Resource Watch](http://wri-public-data.s3.amazonaws.com/resourcewatch/raster/ocn_027_rw0_nitrogen_plumes.zip), or [from the source website](https://knb.ecoinformatics.org/view/urn%3Auuid%3Ac7bdc77e-6c7d-46b6-8bfc-a66491119d07).

###### Note: This dataset processing was done by Claire Hemmerly, and QC'd by [Chris Rowe](https://www.wri.org/profile/chris-rowe).
Member:

You can add any link here if you want.

ocn_027a_rw0_nitrogen_plumes/README.md (outdated; resolved)

You can view the processed Wastewater Plumes in Coastal Areas dataset [on Resource Watch](https://resourcewatch.org/data/explore/11804f04-d9c7-47b9-8d27-27ce6ed6c042).

You can also download the original dataset [directly through Resource Watch](http://wri-public-data.s3.amazonaws.com/resourcewatch/raster/ocn_027_rw0_nitrogen_plumes.zip), or [from the source website](https://knb.ecoinformatics.org/view/urn%3Auuid%3Ac7bdc77e-6c7d-46b6-8bfc-a66491119d07).
Member:

Since this is a raster file, we don't link to our s3 version. You can just link to the source.
Side note: do you know why the zip file on our s3 is so large? My browser says it is 11 GB, but the source file looks to be 250 MB.

@clairehemmerly (Collaborator, author) replied Jul 12, 2022:

I compressed the zip file and uploaded it to the correct folder. I deleted the one that was 11 GB.
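For reference, an archive that much larger than its source often means the entries were written uncompressed (ZIP_STORED, the zipfile default); a minimal sketch of the fix, using a dummy stand-in for a raster file:

```python
import os
import tempfile
import zipfile

# dummy stand-in for a raster file: 100 kB of highly compressible bytes
src = tempfile.NamedTemporaryFile(delete=False, suffix='.tif')
src.write(b'\x00' * 100_000)
src.close()

# ZIP_STORED writes entries uncompressed; ZIP_DEFLATED actually compresses them
archive = src.name + '.zip'
with zipfile.ZipFile(archive, 'w', compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write(src.name, arcname=os.path.basename(src.name))

deflated_size = os.path.getsize(archive)
```

The same `compression=zipfile.ZIP_DEFLATED` argument works when building the upload archive in a processing script.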

ocn_027a_rw0_nitrogen_plumes/README.md (outdated; resolved)

# convert the data type of columns to integer
for col in gdf.columns[1:9]:
    gdf[col] = gdf[col].fillna(0).astype('int')
Member:

Are you sure we want na values to be zero? Are there any existing zero values?

@clairehemmerly (Collaborator, author) replied Jul 13, 2022:

The only NA values are in the columns showing % nitrogen input, which we're not using, and the same rows have 0 for the nitrogen input in g/yr, so updating the NAs to 0 should be fine. I had trouble creating the processed shapefile when the values were floats rather than integers; I got an error saying, essentially, that the numbers were too big.
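One way to back up this kind of claim before filling is to count the genuine zeros and the NAs explicitly; a small sketch with illustrative values only:

```python
import pandas as pd

# illustrative column of nitrogen input values in g/yr
s = pd.Series([0.0, 5.0, None, 3.0])

# count genuine zeros and missing values before collapsing NA into 0
existing_zeros = int((s == 0).sum())
na_count = int(s.isna().sum())

# the conversion used in the script
converted = s.fillna(0).astype('int')
```

If `existing_zeros` is nonzero and the NA rows carry real (unknown) values, the fill would silently merge two different meanings into one code, which is the reviewer's concern.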

ocn_027c_rw0_wastewater_watersheds/README.md (outdated; resolved)
ocn_027c_rw0_wastewater_watersheds/README.md (outdated; resolved)

# convert the data type of columns to integer
for col in gdf.columns[1:9]:
    gdf[col] = gdf[col].fillna(0).astype('int')
Member:

Again, just want to make sure we are accurately translating the data here.


# load in the polygon shapefile
shapes = os.path.join(raw_data_file_unzipped, 'effluent_N_pourpoints_all.shp')
Member:

Can you use processed_data_file here instead of shapes for consistency with our other scripts?

Member:

Actually I think the main thing is that the final thing you upload is called processed_data_file.

@clairehemmerly (Collaborator, author):

I've updated this section; let me know if it looks okay.
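The naming convention being asked for — the final artifact that gets uploaded always lives in a variable called processed_data_file — might look like this; the directory and dataset name are taken from the checklist conventions, and the exact filename is a hypothetical:

```python
import os

# standardized variable names per the checklist; values here are illustrative
data_dir = 'data'
dataset_name = 'ocn_027a_rw0_nitrogen_plumes'

# whatever gets uploaded at the end is always called processed_data_file
processed_data_file = os.path.join(data_dir, dataset_name + '_edit.shp')
```

Intermediate variables (like a loaded shapefile path) can keep descriptive names, as long as the uploaded file itself is referenced through processed_data_file for consistency with the other scripts.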

set_default_credentials(username=CARTO_USER,
                        base_url="https://{user}.carto.com/".format(user=CARTO_USER),
                        api_key=CARTO_KEY)

# upload data frame to Carto
to_carto(gdf, dataset_name + '_edit', if_exists='replace')
Member:

For consistency can you use processed_data_file instead of gdf?

'''
Upload original data and processed data to Amazon S3 storage
'''
# initialize AWS variables
aws_bucket = 'wri-public-data'
Member:

This should actually be aws_bucket = 'wri-projects' for raster datasets.
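A minimal sketch of the corrected setup: the bucket name comes from the review comment, while the key prefix is an assumption based on the download URL elsewhere in this thread, and the upload call itself is left commented out since it needs credentials:

```python
# corrected bucket for raster datasets, per the review
aws_bucket = 'wri-projects'

# hypothetical prefix, following the raster path seen in the download URL
s3_prefix = 'resourcewatch/raster/'
dataset_name = 'ocn_027_rw0_nitrogen_plumes'
s3_key = s3_prefix + dataset_name + '.zip'

# the upload itself would then be, e.g.:
# import boto3
# boto3.client('s3').upload_file(local_zip_path, aws_bucket, s3_key)
```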

@clairehemmerly (Collaborator, author):

Fixed.

@clairehemmerly (Collaborator, author):
I added the processed data file steps and checked the column names; they should match now.
