test: Create test data in-memory #901

Merged
merged 1 commit into main from in-memory-test-data on Nov 1, 2024

Conversation

@nightscape (Owner) commented Nov 1, 2024

User description

instead of reading from files


PR Type

Tests


Description

  • Replaced reading test data from CSV files with in-memory DataFrame creation in DataFrameWriterApiComplianceSuite.
  • Updated ManyPartitionReadSuite to generate test data programmatically, removing dependency on external CSV files.
  • Defined an explicit schema for the in-memory DataFrame to ensure correct nullability and data types (see the sketch after this list).
  • Simplified test setup and execution by eliminating file I/O operations.
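
For orientation, a minimal sketch of the pattern the PR adopts: build the test rows in memory and pass an explicit schema to createDataFrame instead of reading a CSV fixture. The column names mirror those in the suggestion diff further below (col1, col2, col3); the values and types here are illustrative only, not the exact code in the suites.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("in-memory-test-data").getOrCreate()

// Explicit schema: column names, types, and nullability are stated up front
// rather than inferred from the data.
val schema = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true),
  StructField("col3", StringType, nullable = true)
))

// Rows created in memory instead of being read from a CSV file.
val rows = Seq(Row("foo", "bar", "1"), Row("baz", "bang", "2"))
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
```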

Changes walkthrough 📝

Relevant files

Tests

DataFrameWriterApiComplianceSuite.scala: Use in-memory DataFrame for DataFrameWriter tests
src/test/scala/com/crealytics/spark/excel/v2/DataFrameWriterApiComplianceSuite.scala
  • Replaced CSV file reading with in-memory DataFrame creation.
  • Updated test cases to use the new in-memory DataFrame.
  • Simplified DataFrame writing logic by removing the file dependency.
  +16/-17

ManyPartitionReadSuite.scala: Generate in-memory data for partitioned read tests
src/test/scala/com/crealytics/spark/excel/v2/ManyPartitionReadSuite.scala
  • Replaced CSV file reading with programmatically generated data.
  • Defined the schema explicitly for the in-memory DataFrame.
  • Updated test logic to use in-memory data for partitioned writes.
  +31/-8

💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

github-actions bot commented Nov 1, 2024

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Data Integrity
Ensure that the new in-memory DataFrame creation matches the expected schema and data types of the original CSV-based DataFrame.
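
One way a reviewer might spot-check that (a sketch only; dfInMemory and the expected field types are illustrative names, not identifiers from the PR):

```scala
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Illustrative assertion: the in-memory DataFrame should expose the same
// schema (names, types, nullability) that the CSV-backed fixture produced.
val expectedSchema = StructType(Seq(
  StructField("col1", StringType, nullable = true),
  StructField("col2", StringType, nullable = true),
  StructField("col3", StringType, nullable = true)
))
assert(dfInMemory.schema == expectedSchema, s"Unexpected schema:\n${dfInMemory.schema.treeString}")
```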

Data Generation Logic
Verify the logic for generating in-memory data, especially the calculations and conditions used for generating rows per partition.

    github-actions bot commented Nov 1, 2024

    PR Code Suggestions ✨

    Explore these optional code suggestions:

Category | Suggestion | Score
    Enhancement
    Ensure the DataFrame schema is explicitly defined to match expected data types and column names

Ensure that the DataFrame schema matches the expected types and column names when creating it from a sequence. Currently, the schema is inferred, which might not align with expectations, especially if later DataFrame operations depend on specific column types.

    src/test/scala/com/crealytics/spark/excel/v2/DataFrameWriterApiComplianceSuite.scala [25-30]

     val data = Seq(
    -  ("foo", "bar", "1"),
    -  ("baz", "bang", "2")
    +  Row("foo", "bar", 1),
    +  Row("baz", "bang", 2)
     )
    -spark.createDataFrame(data).toDF("col1", "col2", "col3")
    +val schema = StructType(Array(
    +  StructField("col1", StringType, true),
    +  StructField("col2", StringType, true),
    +  StructField("col3", IntegerType, true)
    +))
    +spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
    Suggestion importance[1-10]: 7

    Why: Explicitly defining the schema ensures that the DataFrame has the correct data types, which can prevent runtime errors and improve code reliability. This is particularly important if subsequent operations depend on specific data types.

Score: 7
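
To make the motivation concrete, here is a small comparison of inferred versus explicit schemas (a sketch; spark is assumed to be the suite's SparkSession):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Inferred: every column comes out as a string, because the source tuples hold strings.
val inferred = spark
  .createDataFrame(Seq(("foo", "bar", "1"), ("baz", "bang", "2")))
  .toDF("col1", "col2", "col3")
inferred.printSchema() // col3: string

// Explicit: col3 is declared as an integer and nullability is stated, as the suggestion proposes.
val explicit = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row("foo", "bar", 1), Row("baz", "bang", 2))),
  StructType(Array(
    StructField("col1", StringType, nullable = true),
    StructField("col2", StringType, nullable = true),
    StructField("col3", IntegerType, nullable = true)
  ))
)
explicit.printSchema() // col3: integer
```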
    Maintainability
    Refactor data generation into a separate method to improve code organization and maintainability

    Refactor the data generation logic to a separate method to enhance code readability
    and maintainability.

    src/test/scala/com/crealytics/spark/excel/v2/ManyPartitionReadSuite.scala [49-74]

    -val data = (1 to 19).flatMap { col1 =>
    -  ...
    +def generateData(): Seq[Row] = {
    +  (1 to 19).flatMap { col1 =>
    +    ...
    +  }
     }
    +val data = generateData()
     val dfInput = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
    Suggestion importance[1-10]: 6

    Why: Refactoring the data generation logic into a separate method enhances code readability and maintainability by isolating functionality, making the code easier to understand and modify.

Score: 6
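
A fleshed-out version of that refactor might look like this (a sketch; the partition count and column values are illustrative, since the suggestion elides the actual generation loop, and spark and schema are assumed to come from the surrounding suite):

```scala
import org.apache.spark.sql.Row

// Hypothetical helper: produce a deterministic block of rows for each col1 value,
// so every partition in the test receives a predictable number of rows.
def generateData(partitions: Int = 19, rowsPerPartition: Int = 10): Seq[Row] =
  (1 to partitions).flatMap { col1 =>
    (1 to rowsPerPartition).map(i => Row(col1, s"value_${col1}_$i"))
  }

val data = generateData()
val dfInput = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
```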
    Best practice
    Implement error handling for DataFrame write operations to manage exceptions effectively

    Add error handling for DataFrame operations to manage potential runtime exceptions,
    especially when dealing with file I/O operations or DataFrame transformations.

    src/test/scala/com/crealytics/spark/excel/v2/DataFrameWriterApiComplianceSuite.scala [60-65]

    -df.write
    -  .format("excel")
    -  .option("path", targetDir)
    -  .option("header", value = true)
    -  .mode(writeMode)
    -  .save()
    +try {
    +  df.write
    +    .format("excel")
    +    .option("path", targetDir)
    +    .option("header", value = true)
    +    .mode(writeMode)
    +    .save()
    +} catch {
    +  case e: Exception => println("Error writing DataFrame to Excel: " + e.getMessage)
    +}
    Suggestion importance[1-10]: 5

    Why: Adding error handling can help manage potential runtime exceptions during file I/O operations, improving robustness. However, the suggestion uses a generic exception type, which is less informative.

Score: 5
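
Since printing and continuing would hide failures in a test, an alternative sketch (assuming the suites use ScalaTest assertions; df, targetDir, and writeMode come from the surrounding test) lets the error surface with context:

```scala
import scala.util.{Failure, Success, Try}

// Sketch: wrap the write in Try and fail the test with a descriptive message
// instead of swallowing the exception, so a broken write cannot go unnoticed.
Try {
  df.write
    .format("excel")
    .option("path", targetDir)
    .option("header", value = true)
    .mode(writeMode)
    .save()
} match {
  case Success(_)  => () // continue with read-back assertions
  case Failure(ex) => fail(s"Writing DataFrame to Excel failed: ${ex.getMessage}", ex)
}
```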
    Readability
    Improve variable names to enhance code readability and clarity

    Use a more descriptive variable naming convention for dfInput and dfFinal to reflect
    their roles or contents more clearly.

    src/test/scala/com/crealytics/spark/excel/v2/ManyPartitionReadSuite.scala [74-78]

    -val dfInput = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
    -val dfFinal = dfInput.union(dfInput)
    +val initialDataFrame = spark.createDataFrame(spark.sparkContext.parallelize(data), schema)
    +val combinedDataFrame = initialDataFrame.union(initialDataFrame)
    Suggestion importance[1-10]: 4

    Why: Using more descriptive variable names can improve code readability, making it easier for developers to understand the purpose of each variable. However, this change has a relatively minor impact on functionality.

Score: 4

@nightscape merged commit ce5d005 into main on Nov 1, 2024
40 checks passed
@nightscape deleted the in-memory-test-data branch on November 1, 2024 at 21:46