
Error reading some raster files using mos.read. Size issue? #550

Open
JimShady opened this issue Apr 5, 2024 · 9 comments
JimShady commented Apr 5, 2024

Hello.

The second file (87 MB) reads fine; the first (7.9 GB) does not.

I recall there was an issue with reading files larger than 2GB, but I thought that this had been resolved with Mosaic 0.4. So is it something else?

[screenshot omitted]


JimShady commented Apr 5, 2024

Actually, looking at the release notes, maybe the change did not make it into 0.4?

https://github.com/databrickslabs/mosaic/releases/tag/v_0.4.1


sllynn commented Apr 10, 2024

@milos-colic will have an authoritative answer here, but I think you'll need to use the 'retile_on_read' strategy for reading large rasters since there's no way around the 2GB limit on each row object in Spark.

raster_df = (
  spark.read
  .format("gdal")
  .option("raster.read.strategy", "retile_on_read") # sets the reader strategy
  .option("sizeInMB", "42") # sets the upper bound for size of raster in each row in the output dataframe
  .load("/path/to/file")
)
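As a rough illustration of the arithmetic behind this suggestion (figures taken from the thread; the real tile count depends on band layout and compression, and this is plain Python rather than any Mosaic API):

```python
import math

# Back-of-envelope: how many tiles "retile_on_read" would need to emit
# so that no single Spark row object exceeds the 2 GB limit.
SPARK_ROW_LIMIT_MB = 2 * 1024     # hard ceiling per row object in Spark
raster_size_mb = 7.9 * 1024       # the ~7.9 GB GeoTIFF from this report
tile_size_mb = 42                 # the sizeInMB value in the example above

min_tiles = math.ceil(raster_size_mb / tile_size_mb)
print(min_tiles)                  # lower bound on rows in the output DataFrame
assert tile_size_mb < SPARK_ROW_LIMIT_MB
```

Any sizeInMB comfortably below 2048 works; smaller tiles mean more rows and more parallelism, at the cost of per-tile overhead.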


JimShady commented Apr 10, 2024

I didn't realise this was available in the options. I'll try it out and get back to you. By the way, I think it would be good to call this out explicitly in the documentation. Thanks.


sllynn commented Apr 10, 2024

Agreed. Hope it helps you make progress.

JimShady commented:

Hi @sllynn. No luck unfortunately. I'm just trying to turn a raster into an H3 table. This is my code:

raster_df = (
  spark.read
  .format("gdal")
  .option("raster.read.strategy", "retile_on_read")
  .option("sizeInMB", "42")
  .load("dbfs:/ghsl/GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0.tif")
  .select(mos.rst_rastertogridavg("tile", F.lit(9)).alias("result"))
  .select(F.explode("result"))
  .select(F.explode("col").alias("my_array"))
  .select(F.col("my_array.cellID").alias("cellID"), F.col("my_array.measure").alias("measure"))
)
raster_df.write.parquet("dbfs:/ghsl/h3/")

Error is below:

[screenshot of the error omitted]

Could it be because my raster is in CRS 54009 rather than WGS84?
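(For context on the CRS question: ESRI:54009 is World Mollweide, while H3 indexes cells by WGS84 longitude/latitude, so a reprojection has to happen somewhere in the pipeline. The standard spherical inverse-Mollweide mapping can be sketched in plain Python; the radius below is an assumed authalic approximation, and a real pipeline would use PROJ/GDAL instead.)

```python
import math

# Illustrative only: map World Mollweide (x, y) in metres back to
# WGS84 (lon, lat) in degrees, using the spherical inverse formulas.
R = 6371007.181  # assumed authalic sphere radius in metres

def mollweide_inverse(x, y, lon0=0.0):
    """Invert the Mollweide projection for a point (x, y)."""
    theta = math.asin(y / (R * math.sqrt(2)))
    lat = math.asin((2 * theta + math.sin(2 * theta)) / math.pi)
    lon = lon0 + math.pi * x / (2 * R * math.sqrt(2) * math.cos(theta))
    return math.degrees(lon), math.degrees(lat)

print(mollweide_inverse(0.0, 0.0))  # the projection origin maps to (0.0, 0.0)
```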

The file is available here if you/anyone wants to try to debug:

https://jeodpp.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/GHS_POP_GLOBE_R2023A/GHS_POP_E2020_GLOBE_R2023A_54009_100/V1-0/GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0.zip

I will try to complete the process using a WGS84 version of the file in the meantime ...

JimShady commented:

Failed again on the WGS84 version of the file.

[screenshot of the error omitted]

JimShady commented:

Just wanted to add that the 'retile_on_read' option does work. It was the next stage of my code (converting to H3) that was causing the crash.

I should add that retile on read is very slow. I find myself wondering why it physically rewrites the data as smaller files. Why not just leverage VRTs?
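(For readers unfamiliar with the VRT suggestion: a GDAL VRT is a small XML file that references a window of a source raster rather than copying its pixels, which is why a VRT-based retile could avoid the rewrite cost. A hypothetical sketch, using only the stdlib; the element names follow GDAL's VRT format, and the file name and window sizes are made up for illustration. This is not what Mosaic does internally.)

```python
import xml.etree.ElementTree as ET

def make_tile_vrt(src, xoff, yoff, xsize, ysize):
    """Build a minimal one-band VRT describing an xsize*ysize window of src.

    The VRT stores only a reference to the source file plus the window
    coordinates; no pixel data is read or copied.
    """
    root = ET.Element("VRTDataset", rasterXSize=str(xsize), rasterYSize=str(ysize))
    band = ET.SubElement(root, "VRTRasterBand", dataType="Float32", band="1")
    source = ET.SubElement(band, "SimpleSource")
    ET.SubElement(source, "SourceFilename", relativeToVRT="0").text = src
    ET.SubElement(source, "SrcRect", xOff=str(xoff), yOff=str(yoff),
                  xSize=str(xsize), ySize=str(ysize))
    ET.SubElement(source, "DstRect", xOff="0", yOff="0",
                  xSize=str(xsize), ySize=str(ysize))
    return ET.tostring(root, encoding="unicode")

vrt = make_tile_vrt("GHS_POP_E2020_GLOBE_R2023A_54009_100_V1_0.tif",
                    xoff=0, yoff=0, xsize=4096, ysize=4096)
print(vrt)  # a few hundred bytes describing a 4096x4096 tile, no pixels copied
```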

mjohns-databricks commented:

@JimShady we are giving attention to raster_to_grid in #556, which gets into retiling. It will come with 0.4.2 in about a week.


mjohns-databricks commented May 15, 2024

We got 0.4.2 out, but it didn't include the raster_to_grid and related work on tessellation performance. We had to streamline the release due to a dependency issue that arose from the latest geopandas (see the docs). So 0.4.3 is coming soon with more in-flight work.
