Write excel in single file with v2 #549

cristichircu · 2022-02-16T18:07:46Z

cristichircu
Feb 16, 2022

I see that in v2, writing Excel produces multiple files (one per partition). I know that's consistent with the behavior for JSON, Parquet, etc., but is there any chance you'll provide an option to write to a single file?
I'm aware that you can do .coalesce(1) before the write to get a single file, but the actual file still gets a generated name. If I want to save the file under a predefined name, the only way I see is to add extra steps: locate the part file, rename and move it, delete the generated folder, etc.
Any thoughts?
Thanks!
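
For reference, the extra steps described above can be sketched roughly as follows. This is a minimal sketch, assuming the output folder is on a local filesystem reachable through java.nio.file and is flat (no leftover subfolders); on HDFS or object storage the same steps would go through the Hadoop FileSystem API instead. The class name SingleExcelFile and the method promote are hypothetical and not part of spark-excel.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of spark-excel. Assumes a local, flat output folder. */
final class SingleExcelFile {

    private SingleExcelFile() {}

    /**
     * Moves the only part file in outputDir to target, then deletes the
     * marker/checksum files (names starting with "_" or ".") and outputDir itself.
     * The target is assumed to live outside outputDir.
     */
    static Path promote(Path outputDir, Path target) throws IOException {
        List<Path> parts = new ArrayList<>();
        List<Path> leftovers = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(outputDir)) {
            for (Path entry : entries) {
                String name = entry.getFileName().toString();
                if (name.startsWith("_") || name.startsWith(".")) {
                    leftovers.add(entry); // e.g. _SUCCESS, .part-...xlsx.crc
                } else {
                    parts.add(entry);     // the actual data file(s)
                }
            }
        }
        // Refuse to guess when the write produced zero or several part files.
        if (parts.size() != 1) {
            throw new IllegalStateException(
                "Expected exactly one part file in " + outputDir + " but found " + parts.size());
        }
        if (target.getParent() != null) {
            Files.createDirectories(target.getParent());
        }
        Files.move(parts.get(0), target, StandardCopyOption.REPLACE_EXISTING);
        for (Path leftover : leftovers) {
            Files.delete(leftover);
        }
        Files.delete(outputDir);
        return target;
    }
}
```

After a coalesce(1) write to a temporary folder, a call such as SingleExcelFile.promote(tempFolder, finalFile) would leave only the renamed file behind. The check for exactly one part file runs before anything is moved or deleted, so a multi-partition write is left untouched.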

cristichircu · 2022-02-24T09:48:28Z

cristichircu
Feb 24, 2022
Author

@quanghgx @nightscape Would you be interested in having this functionality in spark-excel v2?
What I'm trying to do is basically preserve the single output file behavior we had in v1. What I'm thinking of is setting an option like "excelFileName" and, after the Excel files are successfully written to "path", renaming the partition file to the requested name and deleting the success and .crc files (of course, only if a single Excel file is written, i.e. if the client does a coalesce(1) or similar before writing to Excel).

The client code would look like:
output.coalesce(1).write().format("excel").option("excelFileName", "excelFile.xlsx").mode(SaveMode.Append).save("//path//to//folder")

For the rename and cleanup I can extend org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter, override commitJob, and register the output committer in com.crealytics.spark.v2.excel.ExcelWriteBuilder (similar to what happens in org.apache.spark.sql.execution.datasources.v2.parquet.ParquetWrite with org.apache.parquet.hadoop.ParquetOutputCommitter).

Some other improvements that would be nice, but that I'm not sure how or where to implement (any suggestions?):

do the coalesce somewhere in spark-excel (if the option is set)
avoid the separate option and have the client send the entire path like .save("//path//to//folder//excelFile.xlsx"). Not sure where I could split the path up
create missing folders in the path before the exception is thrown (org.apache.spark.sql.execution.datasources.DataSource#checkAndGlobPathIfNecessary)
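
For the second item, the split itself could look something like the sketch below. It is only an illustration using plain string handling, so that URI-style paths are not mangled; where this would hook into spark-excel is still the open question. The class name ExcelTarget is hypothetical, and treating a path as a file only when it ends in .xlsx or .xls is an assumption.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical holder for a save path split into folder and workbook name. */
final class ExcelTarget {
    final String folder;
    final Optional<String> fileName;

    private ExcelTarget(String folder, Optional<String> fileName) {
        this.folder = folder;
        this.fileName = fileName;
    }

    /**
     * Splits "folder/name.xlsx" into folder and file name. Paths that do not
     * end in .xlsx or .xls, or that contain no '/', are returned unchanged
     * as a folder with no file name.
     */
    static ExcelTarget parse(String path) {
        String lower = path.toLowerCase(Locale.ROOT);
        int slash = path.lastIndexOf('/');
        boolean namesWorkbook = (lower.endsWith(".xlsx") || lower.endsWith(".xls"))
                && slash >= 0
                && slash < path.length() - 1;
        if (!namesWorkbook) {
            return new ExcelTarget(path, Optional.empty());
        }
        // Drop the run of separators in front of the file name, e.g. "folder//excelFile.xlsx".
        int end = slash;
        while (end > 0 && path.charAt(end - 1) == '/') {
            end--;
        }
        String folder = end == 0 ? "/" : path.substring(0, end);
        return new ExcelTarget(folder, Optional.of(path.substring(slash + 1)));
    }
}
```

With this, ExcelTarget.parse("//path//to//folder//excelFile.xlsx") yields the folder "//path//to//folder" and the file name "excelFile.xlsx", while a path without an Excel extension is passed through as a folder. One caveat of the extension check is that a folder whose name itself ends in .xlsx would be misread as a file.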

Not sure if this is something that would be useful to others. Any feedback is welcomed. Thanks!

6 replies

quanghgx Apr 19, 2022
Collaborator

Hi @cristichircu, I do think this feature should be useful for spark-excel users.

However, the way we coalesce or repartition, create a temporary folder, and then move the file to the target path might introduce unseen issues:

Users might not be aware that they are responsible for ordering alongside the hidden coalesce (repartition)
Invisible temporary folders need proper cleanup and protection of their data, and sometimes storage planning

As another approach, instead of relying too much on the Spark data source built-in utilities, we might need to introduce some of our own. The Spark data source API v2 doesn't place any restriction on how to structure the output layout, or even require that we output to the file system at all. To be honest, I am also looking for a way to do that properly.

For a start, how about moving ahead with your use case and sharing how to do that in a wiki with a branch from spark-excel?

cristichircu Apr 19, 2022
Author

Thanks for your reply and for pointing out the possible shortcomings, @quanghgx!
Just so I'm sure we're on the same page: your suggestion would be for me to try doing this outside of spark-excel, and we can discuss moving (some of) it into spark-excel afterwards (depending on the result). Did I understand you correctly?
Thank you!

quanghgx Apr 19, 2022
Collaborator

I think having a working version will be great; whether it lives outside or inside the Spark data source is up to you.
Once that is shared with the community, maybe we will get some feedback and figure out the next steps from there.

cristichircu Apr 19, 2022
Author

Ok, I'll see what we can come up with outside of spark-excel and I'll add a branch with a test case or something to showcase it.

Thanks for the feedback!

Should I close the discussion or leave it open and continue it when I have something working? :)

quanghgx Apr 20, 2022
Collaborator

Thanks @cristichircu. Maybe we can leave it open ;). Frankly, I really want to learn from this thread.

For Excel, if writing must be supported, then writing to a single file seems more useful in its typical use-cases than to a bunch of files with random names. Excel might not be the best choice as an intermediate data storage format.

Let's see what we can come up with.
