feat(downsample): support export to multiple paths in Iceberg format #1720
Conversation
```diff
- private def exportDataToRow(exportData: ExportRowData): Row = {
-   val dataSeq = new mutable.ArrayBuffer[Any](3 + downsamplerSettings.exportPathSpecPairs.size)
+ private def exportDataToRow(exportData: ExportRowData, exportTableConfig: ExportTableConfig): Row = {
+   val dataSeq = new mutable.ArrayBuffer[Any](exportTableConfig.tableSchema.fields.length)
```
Quick question: is the size we specify here the array size?
Yes, this is the array size for dataSeq. It depends on the number of standard + dynamic columns in tableSchema.
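For context, a minimal self-contained sketch of how the row might be sized and built; the stand-in types, field names, and column order are assumptions, not the PR's exact code:

```scala
import scala.collection.mutable
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Stand-ins for FiloDB's real ExportRowData / ExportTableConfig (shapes assumed).
case class ExportRowData(standardColumnValues: Seq[Any], dynamicLabelValues: Seq[Any])
case class ExportTableConfig(tableSchema: StructType)

def exportDataToRow(exportData: ExportRowData, exportTableConfig: ExportTableConfig): Row = {
  // The constructor argument is an initial capacity hint; the buffer then receives
  // one value per schema field (standard columns + dynamic label columns).
  val dataSeq = new mutable.ArrayBuffer[Any](exportTableConfig.tableSchema.fields.length)
  dataSeq ++= exportData.standardColumnValues // fixed/standard columns first
  dataSeq ++= exportData.dynamicLabelValues   // then one value per dynamic label column
  Row.fromSeq(dataSeq)
}
```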
```scala
val dynamicColNames = exportTableConfig.labelColumnMapping.map(pair => pair._2 + " string").mkString(", ")
val partitionColNames = exportTableConfig.partitionByCols.mkString(", ")
s"""
   |CREATE TABLE IF NOT EXISTS $catalog.$database.${exportTableConfig.tableName} (
```
Nitpick (not blocking): should we move this string constant up and only do the string formatting inside this function?
This is not really a string constant; the string is built dynamically from this function's input params.
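To illustrate, a self-contained sketch of the dynamic DDL; ExportTableConfig's shape and the `USING iceberg` clause are assumptions based on the diff above:

```scala
// The column and partition lists come from per-table config, so the statement
// cannot be hoisted to a compile-time constant.
case class ExportTableConfig(tableName: String,
                             labelColumnMapping: Seq[(String, String)],
                             partitionByCols: Seq[String])

def createTableSql(catalog: String, database: String, cfg: ExportTableConfig): String = {
  val dynamicColNames = cfg.labelColumnMapping.map(pair => pair._2 + " string").mkString(", ")
  val partitionColNames = cfg.partitionByCols.mkString(", ")
  s"""CREATE TABLE IF NOT EXISTS $catalog.$database.${cfg.tableName} (
     |  $dynamicColNames
     |) USING iceberg PARTITIONED BY ($partitionColNames)""".stripMargin
}
```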
```diff
- for ((value, expected) <- inputOutputPairs) {
-   val map = Map("key" -> value)
-   BatchExporter.makeLabelString(map) shouldEqual expected
+ it("should give correct export schema") {
```
Lovely change, so much simpler.
```scala
val tableSchema = {
  // NOTE: ArrayBuffers are sometimes used instead of Seq's because otherwise
  // ArrayIndexOutOfBoundsExceptions occur when Spark exports a batch.
  val fields = new mutable.ArrayBuffer[StructField](labelColumnMapping.length + 9)
```
Nit: add a comment on why we are adding +9. It might be better to define this as a constant and add a description there.
Added comment in the latest commit.
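A sketch of the suggested refactor (the constant name and the example mapping are assumptions; the PR itself only added a comment):

```scala
import scala.collection.mutable
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical constant naming the 9 fixed export-schema columns, so the "+9" is self-documenting.
val NumStandardColumns = 9
val labelColumnMapping = Seq(("_ws_", "workspace"), ("_ns_", "namespace")) // example mapping

val tableSchema = {
  // Size the buffer up front: standard columns plus one string column per mapped label.
  val fields = new mutable.ArrayBuffer[StructField](labelColumnMapping.length + NumStandardColumns)
  labelColumnMapping.foreach { case (_, colName) => fields += StructField(colName, StringType) }
  StructType(fields)
}
```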
```scala
val tableName = group.as[String]("table")
val tablePath = group.as[String]("table-path")
val labelColumnMapping = group.as[Seq[String]]("label-column-mapping")
  .sliding(2, 2).map(seq => (seq.head, seq.last)).toSeq
```
For my understanding, what does sliding(2, 2) do?
It creates pairs (size 2) from the flat list defined under the label-column-mapping config key. Let me add an example.
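For instance, with hypothetical config values:

```scala
// sliding(2, 2) walks the list in windows of size 2, stepping by 2: non-overlapping pairs.
val raw = Seq("_ws_", "workspace", "_ns_", "namespace") // assumed label-column-mapping entries
val labelColumnMapping = raw
  .sliding(2, 2)                    // Iterator(Seq("_ws_", "workspace"), Seq("_ns_", "namespace"))
  .map(seq => (seq.head, seq.last)) // each window becomes a (label, columnName) pair
  .toSeq                            // Seq(("_ws_", "workspace"), ("_ns_", "namespace"))
```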
Added comments in the latest commit to explain the sliding and the creation of the labelColumnMapping Seq[(a,b), (c,d)].
```scala
  headTask ++ tailTasks
}
// export/downsample RDDs in parallel
exportTasks.par.foreach(_.apply())
```
Should we add a config to control the degree of parallelism? What is the default behavior?
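For reference: by default, .par runs on a shared ForkJoinPool sized to the number of available processors. If a knob were wanted, a sketch like this (the setting name is an assumption; Scala 2.12 syntax, 2.13 needs the scala-parallel-collections module) could cap it:

```scala
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

val exportTasks: Seq[() => Unit] = Seq.fill(8)(() => ()) // stand-in for the real export tasks
val exportParallelism = 4 // hypothetically read from downsampler settings

// Override the default task pool so the degree of parallelism is config-driven.
val parTasks = exportTasks.par
parTasks.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(exportParallelism))
parTasks.foreach(_.apply())
```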
```diff
@@ -75,6 +76,40 @@ class Downsampler(settings: DownsamplerSettings) extends Serializable {
   lazy val exportLatency =
     Kamon.histogram("export-latency", MeasurementUnit.time.milliseconds).withoutTags()

+  /**
+   * Exports an RDD for a specific export key.
```
kudos on the func doc 👍
…ilodb#1720) Adds support for export from the downsampler job to multiple destinations in the Iceberg format. Support for export of CSV-formatted data is removed.

Co-authored-by: nikitag55 <[email protected]>
Pull Request checklist
Adds support for export from the downsampler job to multiple destinations; data is now exported in the Iceberg format.
This PR removes all support for export of CSV-formatted data.
From #1718:
Current behavior:
The Spark Downsampler job currently exports data in CSV format.

New behavior:
Changes the Spark Downsampler job to export data in Iceberg format and write it to an Iceberg table with a specific schema.
Changes include the following (an illustrative write-path sketch follows this list):

- BatchExporter.scala: create the RDD in the new schema and convert the RDD to a DataFrame so it can be written to the Iceberg table.
- filodb-defaults.conf: configuration changes for export; the corresponding code changes span DownsamplerSettings.scala, BatchExporter.scala, and DownsamplerMain.scala.
- Additional changes to remove the old CSV-based export.
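An illustrative sketch of the new write path (not the PR's exact code; the catalog, database, table, schema, and RDD values below are stand-ins):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("downsampler-export").getOrCreate()

// Stand-ins for the job's real values (assumed names):
val catalog = "my_catalog"
val database = "my_db"
val tableName = "downsampled_metrics"
val tableSchema = StructType(Seq(StructField("metric", StringType), StructField("workspace", StringType)))
val rowRdd = spark.sparkContext.parallelize(Seq(Row("cpu_usage", "ws1")))

// Convert the RDD to a DataFrame with the table schema, then append to the Iceberg
// table via Spark's DataFrameWriterV2 API (requires an Iceberg-enabled catalog).
val df = spark.createDataFrame(rowRdd, tableSchema)
df.writeTo(s"$catalog.$database.$tableName").append()
```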