
feat(downsample): support export to multiple paths in Iceberg format #1720

Merged
merged 7 commits into develop on Feb 26, 2024

Conversation

@alextheimer (Contributor) commented Feb 23, 2024

Pull Request checklist

  • The commit message(s) follow the contribution guidelines?
  • Tests for the changes have been added (for bug fixes / features)?
  • Docs have been added / updated (for bug fixes / features)?

Adds support for exporting data from the downsampler job to multiple destinations; data is now exported in Iceberg format.
This PR removes all support for exporting CSV-formatted data.

From #1718:

Current behavior:
The Spark Downsampler job currently exports data in CSV format.

New behavior:
Changes the Spark Downsampler job to export data in Iceberg format and write it to an Iceberg table with a specific schema.

The changes include the following:

  • Updates BatchExporter.scala to create an RDD with the new schema and convert the RDD to a DataFrame so it can be written to an Iceberg table.
  • filodb-defaults.conf changes for export include the following (see the config sketch after this list):
  1. Adds catalog and database Iceberg properties to filodb-defaults.conf.
  2. Under groups, each group should include additional dynamic, time-series-label-based columns in addition to the fixed columns of the Iceberg table.
  3. Each group should also include the table name and table location.
  4. Each group can also specify whether partitioning should be done by additional label-based columns.
  • To implement the above, additional changes are made in DownsamplerSettings.scala, BatchExporter.scala, and DownsamplerMain.scala.
  • Implements a new ExportTableConfig class to carry all per-group data (table path, table name, export rules) parsed from the config.
  • ExportTableConfig is also used to generate the export schema dynamically.
  • Adds Iceberg create-table and create-database methods.
  • In Iceberg, data partitioning is handled directly by the CREATE TABLE SQL statement.
  • Renames the destination-format key in filodb-defaults.conf to destination-path.
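
For illustration, a hypothetical per-group export config under the new shape might look like the following. The table, table-path, and label-column-mapping keys match the keys parsed in DownsamplerSettings.scala (shown later in this conversation); the catalog/database key names, the partition key name, and all values are assumptions, not the exact contents of filodb-defaults.conf:

# Sketch only; catalog/database and partition key names are assumptions.
catalog = "local"
database = "downsampled_db"
groups = [
  {
    table = "metrics_table"
    table-path = "s3://bucket/warehouse/metrics_table"
    # flat list of alternating (label, column) names, parsed into pairs
    label-column-mapping = ["_ws_", "workspace", "_ns_", "namespace"]
    # optional additional label-based partition columns
    partition-by-columns = ["namespace"]
  }
]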

Additional changes to remove the old CSV-based export:

  • Removes the old CSV-based partition-by-template handling used to generate data partition keys.
  • Removes CSV-specific properties like options and path-spec from filodb-defaults.conf. As mentioned above, Iceberg does not require a path spec for partitioning; it is handled by the Iceberg CREATE TABLE statement.
  • Removes the old CSV-based make-label-string methods/properties. In the Iceberg table, labels are stored in a labels column of type map<string, string>, so stringified labels are no longer required.

- private def exportDataToRow(exportData: ExportRowData): Row = {
-   val dataSeq = new mutable.ArrayBuffer[Any](3 + downsamplerSettings.exportPathSpecPairs.size)
+ private def exportDataToRow(exportData: ExportRowData, exportTableConfig: ExportTableConfig): Row = {
+   val dataSeq = new mutable.ArrayBuffer[Any](exportTableConfig.tableSchema.fields.length)
Contributor:

Quick question: is the size we specify here the array size?

Contributor:

Yes, this is the array size for dataSeq. The size depends on the number of standard plus dynamic columns in tableSchema.

val dynamicColNames = exportTableConfig.labelColumnMapping.map(pair => pair._2 + " string").mkString(", ")
val partitionColNames = exportTableConfig.partitionByCols.mkString(", ")
s"""
|CREATE TABLE IF NOT EXISTS $catalog.$database.${exportTableConfig.tableName} (
Contributor:

Nitpick (not blocking): should we move this string constant up and only do the string formatting inside this function?

Contributor:

This is not really a string constant; the string is created dynamically from the function's input parameters.
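
For context, a minimal sketch of how the full statement might come together. The fixed columns shown (timestamp, value) and the LOCATION clause are assumptions; the labels map<string, string> column, the dynamic label columns, and SQL-driven partitioning are described elsewhere in this PR, and dynamicColNames, partitionColNames, catalog, database, and tableName come from the snippet above:

// Sketch only, not the exact SQL merged in this PR; assumes a SparkSession
// named `spark` is in scope.
val createTableSql =
  s"""
     |CREATE TABLE IF NOT EXISTS $catalog.$database.${exportTableConfig.tableName} (
     |  timestamp bigint,
     |  value double,
     |  labels map<string, string>,
     |  $dynamicColNames
     |)
     |USING iceberg
     |PARTITIONED BY ($partitionColNames)
     |LOCATION '${exportTableConfig.tablePath}'
     |""".stripMargin
spark.sql(createTableSql)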

- for ((value, expected) <- inputOutputPairs) {
-   val map = Map("key" -> value)
-   BatchExporter.makeLabelString(map) shouldEqual expected
+ it("should give correct export schema") {
Contributor:

Lovely change, so much simpler.

val tableSchema = {
// NOTE: ArrayBuffers are sometimes used instead of Seq's because otherwise
// ArrayIndexOutOfBoundsExceptions occur when Spark exports a batch.
val fields = new mutable.ArrayBuffer[StructField](labelColumnMapping.length + 9)
Contributor:

Nit: add a comment on why we are adding +9. It might be better to define this as a constant and describe it there.

Contributor:

Added a comment in the latest commit.
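
For reference, a sketch of the constant-based alternative the reviewer suggested (the merged change went with a comment instead; the constant name here is made up, and the fragment mirrors the snippet above):

// Number of fixed (non-label) columns in the export schema; the dynamic
// label-based columns from labelColumnMapping are appended after these.
val NumFixedColumns = 9
val fields = new mutable.ArrayBuffer[StructField](labelColumnMapping.length + NumFixedColumns)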

val tableName = group.as[String]("table")
val tablePath = group.as[String]("table-path")
val labelColumnMapping = group.as[Seq[String]]("label-column-mapping")
.sliding(2, 2).map(seq => (seq.head, seq.last)).toSeq
Contributor:

For my understanding, what does sliding(2, 2) do?

Contributor:

It creates size-2 pairs from the config defined under the label-column-mapping key. Let me add an example.

@nikitag55 (Contributor) commented Feb 23, 2024:

Added comments in the latest commit to explain the sliding call and the creation of labelColumnMapping as Seq[(a,b), (c,d)].
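
For reference, a small worked example of the sliding(2, 2) call (the label and column names are hypothetical):

// sliding(2, 2) yields non-overlapping windows of size 2, so a flat list of
// alternating names becomes a Seq of (label, column) pairs.
val labelColumnMapping = Seq("_ws_", "workspace", "_ns_", "namespace")
  .sliding(2, 2).map(seq => (seq.head, seq.last)).toSeq
// labelColumnMapping == Seq(("_ws_", "workspace"), ("_ns_", "namespace"))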

headTask ++ tailTasks
}
// export/downsample RDDs in parallel
exportTasks.par.foreach(_.apply())
Contributor:

Should we add a config to control the degree of parallelism? What is the default behavior?
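
For reference: by default, .par runs tasks on Scala's default pool, which is sized to the number of available cores. A config-driven degree of parallelism could be wired in like the sketch below (not part of this PR; exportParallelism is a hypothetical setting):

import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

val exportParallelism = 4  // hypothetical: could be read from DownsamplerSettings
val parTasks = exportTasks.par
// Replace the default task support with a pool of the configured size.
parTasks.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(exportParallelism))
parTasks.foreach(_.apply())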

@@ -75,6 +76,40 @@ class Downsampler(settings: DownsamplerSettings) extends Serializable {
lazy val exportLatency =
Kamon.histogram("export-latency", MeasurementUnit.time.milliseconds).withoutTags()

/**
* Exports an RDD for a specific export key.
Contributor:

kudos on the func doc 👍

nikitag55 previously approved these changes Feb 23, 2024
@alextheimer alextheimer merged commit 745eb47 into develop Feb 26, 2024
1 check passed
alextheimer added a commit to alextheimer/FiloDB that referenced this pull request Mar 13, 2024
feat(downsample): support export to multiple paths in Iceberg format (filodb#1720)

Adds support for export from the downsampler job to multiple destinations in the Iceberg format.
Support for export of CSV-formatted data is removed.

---------

Co-authored-by: nikitag55 <[email protected]>