
Error reading some raster files using mos.read. Size issue? #550

Open
JimShady opened this issue Apr 5, 2024 · 9 comments
JimShady commented Apr 5, 2024

Hello.

The second file (87 MB) reads fine; the first (7.9 GB) does not.

I recall there was an issue with reading files larger than 2GB, but I thought that this had been resolved with Mosaic 0.4. So is it something else?

[screenshot omitted]


JimShady commented Apr 5, 2024

Actually, looking at the release notes, maybe the change did not make it into 0.4?

https://github.com/databrickslabs/mosaic/releases/tag/v_0.4.1


sllynn commented Apr 10, 2024

@milos-colic will have an authoritative answer here, but I think you'll need to use the 'retile_on_read' strategy for reading large rasters since there's no way around the 2GB limit on each row object in Spark.

raster_df = (
  spark.read
  .format("gdal")
  .option("raster.read.strategy", "retile_on_read") # sets the reader strategy
  .option("sizeInMB", "42") # sets the upper bound for size of raster in each row in the output dataframe
  .load("/path/to/file")
)
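As a rough illustration of the arithmetic behind this suggestion (figures taken from the thread; the real tile count depends on band layout and compression, and this is plain Python rather than any Mosaic API):

```python
import math

# Back-of-envelope: how many tiles "retile_on_read" would need to emit
# so that no single Spark row object exceeds the 2 GB limit.
SPARK_ROW_LIMIT_MB = 2 * 1024     # hard ceiling per row object in Spark
raster_size_mb = 7.9 * 1024       # the ~7.9 GB GeoTIFF from this report
tile_size_mb = 42                 # the sizeInMB value in the example above

min_tiles = math.ceil(raster_size_mb / tile_size_mb)
print(min_tiles)                  # lower bound on rows in the output DataFrame
assert tile_size_mb < SPARK_ROW_LIMIT_MB
```

Any sizeInMB comfortably below 2048 works; smaller tiles mean more rows and more parallelism, at the cost of per-tile overhead.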


JimShady commented Apr 10, 2024

I didn't realise this was available in the options. I'll try it out and get back to you. By the way, I think it would be good to call this out explicitly in the documentation. Thanks.


sllynn commented Apr 10, 2024

Agreed. Hope it helps you make progress.

JimShady commented:

Hi @sllynn. No luck unfortunately. I'm just trying to turn a raster into an H3 table. This is my code:

raster_df = (
  spark.read
  .format("gdal")
  .option("raster.read.strategy", "retile_on_read")
  .option("sizeInMB", "42")
  .load("dbfs:/ghsl/GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0.tif")
  .select(mos.rst_rastertogridavg("tile", F.lit(9)).alias("result"))
  .select(F.explode("result"))
  .select(F.explode("col").alias("my_array"))
  .select(F.col("my_array.cellID").alias("cellID"), F.col("my_array.measure").alias("measure"))
)
raster_df.write.parquet("dbfs:/ghsl/h3/")

Error is below:

[screenshot of the error omitted]

Could it be because my raster is in CRS 54009 rather than WGS84?
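(For context on the CRS question: ESRI:54009 is World Mollweide, while H3 indexes cells by WGS84 longitude/latitude, so a reprojection has to happen somewhere in the pipeline. The standard spherical inverse-Mollweide mapping can be sketched in plain Python; the radius below is an assumed authalic approximation, and a real pipeline would use PROJ/GDAL instead.)

```python
import math

# Illustrative only: map World Mollweide (x, y) in metres back to
# WGS84 (lon, lat) in degrees, using the spherical inverse formulas.
R = 6371007.181  # assumed authalic sphere radius in metres

def mollweide_inverse(x, y, lon0=0.0):
    """Invert the Mollweide projection for a point (x, y)."""
    theta = math.asin(y / (R * math.sqrt(2)))
    lat = math.asin((2 * theta + math.sin(2 * theta)) / math.pi)
    lon = lon0 + math.pi * x / (2 * R * math.sqrt(2) * math.cos(theta))
    return math.degrees(lon), math.degrees(lat)

print(mollweide_inverse(0.0, 0.0))  # the projection origin maps to (0.0, 0.0)
```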

The file is available here if you/anyone wants to try to debug:

https://jeodpp.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/GHS_POP_GLOBE_R2023A/GHS_POP_E2020_GLOBE_R2023A_54009_100/V1-0/GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0.zip

I will try to complete the process using a WGS84 version of the file in the meantime ...

JimShady commented:

Failed again on the WGS84 version of the file.

[screenshot of the error omitted]

JimShady commented:

Just wanted to add that the 'retile_on_read' option does work. It was the next stage of my code (converting to H3) that was causing the crash.

I should add that retile on read is very slow. I find myself wondering why it physically rewrites the data as smaller files. Why not just leverage VRTs?
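(For readers unfamiliar with the VRT suggestion: a GDAL VRT is a small XML file that references a window of a source raster rather than copying its pixels, which is why a VRT-based retile could avoid the rewrite cost. A hypothetical sketch, using only the stdlib; the element names follow GDAL's VRT format, and the file name and window sizes are made up for illustration. This is not what Mosaic does internally.)

```python
import xml.etree.ElementTree as ET

def make_tile_vrt(src, xoff, yoff, xsize, ysize):
    """Build a minimal one-band VRT describing an xsize*ysize window of src.

    The VRT stores only a reference to the source file plus the window
    coordinates; no pixel data is read or copied.
    """
    root = ET.Element("VRTDataset", rasterXSize=str(xsize), rasterYSize=str(ysize))
    band = ET.SubElement(root, "VRTRasterBand", dataType="Float32", band="1")
    source = ET.SubElement(band, "SimpleSource")
    ET.SubElement(source, "SourceFilename", relativeToVRT="0").text = src
    ET.SubElement(source, "SrcRect", xOff=str(xoff), yOff=str(yoff),
                  xSize=str(xsize), ySize=str(ysize))
    ET.SubElement(source, "DstRect", xOff="0", yOff="0",
                  xSize=str(xsize), ySize=str(ysize))
    return ET.tostring(root, encoding="unicode")

vrt = make_tile_vrt("GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0.tif",
                    xoff=0, yoff=0, xsize=4096, ysize=4096)
print(vrt)  # a few hundred bytes describing a 4096x4096 tile, no pixels copied
```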

mjohns-databricks commented:

@JimShady we are giving attention to raster_to_grid in #556, which gets into retiling. It will come with 0.4.2 in about a week.


mjohns-databricks commented May 15, 2024

We got 0.4.2 out, but it didn't include the raster_to_grid and related work on tessellation performance. We had to streamline the release due to a dependency issue that arose from the latest geopandas (see the docs). So 0.4.3 is coming soon with more in-flight work.
