[SPARK-45502][BUILD] Upgrade Kafka to 3.6.0 #1
Closed
Conversation
…8s Docker images

### What changes were proposed in this pull request?
This PR aims to add a symbolic link file, `spark-examples.jar`, in the example jar directory.

```
$ docker run -it --rm spark:latest ls -al /opt/spark/examples/jars | tail -n6
total 1620
drwxr-xr-x 1 root root    4096 Oct 11 04:37 .
drwxr-xr-x 1 root root    4096 Sep  9 02:08 ..
-rw-r--r-- 1 root root   78803 Sep  9 02:08 scopt_2.12-3.7.1.jar
-rw-r--r-- 1 root root 1564255 Sep  9 02:08 spark-examples_2.12-3.5.0.jar
lrwxrwxrwx 1 root root      29 Oct 11 04:37 spark-examples.jar -> spark-examples_2.12-3.5.0.jar
```

### Why are the changes needed?
Like the PySpark example (`pi.py`), we can submit the examples without considering the version numbers, which was painful before.

```
bin/spark-submit \
  --master k8s://$K8S_MASTER \
  --deploy-mode cluster \
  ...
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples.jar 10000
```

The following is the driver pod log.

```
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit ... --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi local:///opt/spark/examples/jars/spark-examples.jar 10000
Files local:///opt/spark/examples/jars/spark-examples.jar from /opt/spark/examples/jars/spark-examples.jar to /opt/spark/work-dir/./spark-examples.jar
```

### Does this PR introduce _any_ user-facing change?
No, this is an additional file.

### How was this patch tested?
Manually build the Docker image and do `ls`.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43324 from dongjoon-hyun/SPARK-45497.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
This PR introduces a selectable animation for Spark SQL plan nodes on the UI, which lights up the selected node and its linked nodes and edges.

### Why are the changes needed?
Better UX for SQL plan visualization and debugging. Especially for large queries, users can now concentrate on the current node and its nearest neighbors to get a better understanding of node lineage.

### Does this PR introduce _any_ user-facing change?
Yes, let's see the video.

### How was this patch tested?
https://github.com/apache/spark/assets/8326978/f5ba884c-acce-46b8-8568-3ead55c91d4f

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43307 from yaooqinn/SPARK-45480.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Kent Yao <[email protected]>
…ivers in MasterPage

### What changes were proposed in this pull request?
This PR aims to show the number of abnormally completed drivers in MasterPage.

### Why are the changes needed?
In the `Completed Drivers` table, there are various exit states.

<img width="841" alt="Screenshot 2023-10-11 at 12 01 21 AM" src="https://github.com/apache/spark/assets/9700541/ff0b33f5-c546-42e7-870c-8323e2eefded">

We had better show the abnormally completed drivers at the top of the page.

**BEFORE**
```
Drivers: 0 Running (0 Waiting), 7 Completed
```

**AFTER**
```
Drivers: 0 Running (0 Waiting), 7 Completed (1 Killed, 4 Failed, 0 Error)
```

<img width="676" alt="Screenshot 2023-10-11 at 12 00 03 AM" src="https://github.com/apache/spark/assets/9700541/94deab1f-b9f7-4e5b-8284-aaac4f7520df">

### Does this PR introduce _any_ user-facing change?
Yes, this is a new UI field. However, since this is UI-only, there will be no technical issues.

### How was this patch tested?
Manually build Spark and check the UI.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43328 from dongjoon-hyun/SPARK-45500.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…iterator` for `IterableOnce`

### What changes were proposed in this pull request?
This PR replaces `toIterator` with `iterator` for `IterableOnce` to clean up deprecated API usage.

### Why are the changes needed?
Clean up deprecated Scala API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43295 from LuciferYang/SPARK-45469.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
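For readers unfamiliar with the deprecation, a minimal sketch of the substitution (the collection here is illustrative, not code from the PR):

```scala
// Scala 2.13 deprecates IterableOnce.toIterator in favor of iterator.
val xs = Seq(1, 2, 3)

// Before (deprecated):
//   val it = xs.toIterator
// After:
val it: Iterator[Int] = xs.iterator

it.foreach(println) // prints 1, 2, 3
```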
### What changes were proposed in this pull request?
Added the `ArrayAssembler` feature transformer for `pyspark.ml.connect`.

### Why are the changes needed?
Feature parity for `pyspark.ml.feature.VectorAssembler`.

### Does this PR introduce _any_ user-facing change?
Yes.

```
class ArrayAssembler(
    Transformer,
    HasInputCols,
    HasOutputCol,
    HasFeatureSizes,
    HasHandleInvalid,
    ParamsReadWrite,
):
    """
    A feature transformer that merges multiple input columns into an array type column.

    Parameters
    ----------
    You need to set param `inputCols` for specifying input column names, and set param
    `featureSizes` for specifying the corresponding input column feature size. For a scalar
    type input column, the corresponding feature size must be set to 1; otherwise, set the
    corresponding feature size to the feature array length. The output column is of
    "array<double>" type and contains the array of assembled features. All elements in the
    input feature columns must be convertible to double type.

    You can set the 'handleInvalid' param to specify how to handle invalid input values
    (None or NaN): if it is set to 'error', an error is thrown for invalid input values;
    if it is set to 'keep', it returns the relevant number of NaN values in the output.

    .. versionadded:: 4.0.0

    Examples
    --------
    >>> from pyspark.ml.connect.feature import ArrayAssembler
    >>> import numpy as np
    >>>
    >>> spark_df = spark.createDataFrame(
    ...     [
    ...         ([2.0, 3.5, 1.5], 3.0, True, 1),
    ...         ([-3.0, np.nan, -2.5], 4.0, False, 2),
    ...     ],
    ...     schema=["f1", "f2", "f3", "f4"],
    ... )
    >>> assembler = ArrayAssembler(
    ...     inputCols=["f1", "f2", "f3", "f4"],
    ...     outputCol="out",
    ...     featureSizes=[3, 1, 1, 1],
    ...     handleInvalid="keep",
    ... )
    >>> assembler.transform(spark_df).select("out").show(truncate=False)
    """
```

### How was this patch tested?
Unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43199 from WeichenXu123/SPARK-45397.

Authored-by: Weichen Xu <[email protected]>
Signed-off-by: Weichen Xu <[email protected]>
…able`

### What changes were proposed in this pull request?
#39062 added `createTable` to `JdbcDialect`, but did not add comments for its parameters. This PR completes them.

### Why are the changes needed?
Add comments for the parameters.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Comment-only change, so tests are unnecessary.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42799 from Hisoka-X/SPARK-41516_jdbc_add_comment.

Authored-by: Jia Fan <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
…nfigurable

### What changes were proposed in this pull request?
This PR adds a new config, `spark.sql.defaultCacheStorageLevel`, so that people can use `set spark.sql.defaultCacheStorageLevel=xxx` to change the default storage level of `dataset.cache`.

### Why are the changes needed?
Most people use the default storage level, so this PR makes it easy to change the storage level without touching code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43259 from ulysses-you/cache.

Authored-by: ulysses-you <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
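A rough usage sketch, assuming a running `SparkSession`; the storage level chosen here is only an example:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Switch the storage level used by subsequent Dataset.cache()/CACHE TABLE calls
// (historically MEMORY_AND_DISK) without touching application code.
spark.sql("SET spark.sql.defaultCacheStorageLevel=MEMORY_ONLY")

val df = spark.range(1000).toDF("id")
df.cache()   // now cached with MEMORY_ONLY
df.count()   // materializes the cache
```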
…other-pure-statement`

### What changes were proposed in this pull request?
This PR fixes two compilation warnings related to `other-pure-statement`:

```
[error] /Users/yangjie01/SourceCode/git/spark-mine-sbt/core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala:164:54: a pure expression does nothing in statement position
[error] Applicable -Wconf / nowarn filters for this fatal warning: msg=<part of the message>, cat=other-pure-statement, site=org.apache.spark.scheduler.OutputCommitCoordinatorSuite
[error]       0 until rdd.partitions.size, resultHandler, () => ())

[error] /Users/yangjie01/SourceCode/git/spark-mine-sbt/streaming/src/main/scala/org/apache/spark/streaming/util/FileBasedWriteAheadLog.scala:142:71: a pure expression does nothing in statement position
[error] Applicable -Wconf / nowarn filters for this fatal warning: msg=<part of the message>, cat=other-pure-statement, site=org.apache.spark.streaming.util.FileBasedWriteAheadLog.readAll.readFile
[error]       CompletionIterator[ByteBuffer, Iterator[ByteBuffer]](reader, () => reader.close())
```

and removes the corresponding suppression rules from the compilation options:

```
"-Wconf:cat=other-pure-statement&site=org.apache.spark.streaming.util.FileBasedWriteAheadLog.readAll.readFile:wv",
"-Wconf:cat=other-pure-statement&site=org.apache.spark.scheduler.OutputCommitCoordinatorSuite:wv",
```

On the other hand, the code corresponding to the following two suppression rules no longer exists, so those rules were also cleaned up in this PR:

```
"-Wconf:cat=other-match-analysis&site=org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction.catalogFunction:wv",
"-Wconf:cat=other-pure-statement&site=org.apache.spark.sql.streaming.sources.StreamingDataSourceV2Suite.testPositiveCase.\\$anonfun:wv",
```

### Why are the changes needed?
Code cleanup.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43312 from LuciferYang/other-pure-statement.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
…oxyInstance().getClass`

### What changes were proposed in this pull request?
This PR replaces `Proxy.getProxyClass()` with `Proxy.newProxyInstance().getClass` to clean up deprecated API usage, referring to https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/java/lang/reflect/Proxy.java#L376-L391:

```
 * @deprecated Proxy classes generated in a named module are encapsulated
 *      and not accessible to code outside its module.
 *      {@link Constructor#newInstance(Object...) Constructor.newInstance}
 *      will throw {@code IllegalAccessException} when it is called on
 *      an inaccessible proxy class.
 *      Use {@link #newProxyInstance(ClassLoader, Class[], InvocationHandler)}
 *      to create a proxy instance instead.
 *
 * @see <a href="#membership">Package and Module Membership of Proxy Class</a>
 * @revised 9
 */
@Deprecated
@CallerSensitive
public static Class<?> getProxyClass(ClassLoader loader,
                                     Class<?>... interfaces)
    throws IllegalArgumentException
```

For the `InvocationHandler`, since the `invoke` method doesn't need to be actually called in the current scenario, but the `InvocationHandler` can't be null, a new `DummyInvocationHandler` has been added as follows:

```
private[spark] object DummyInvocationHandler extends InvocationHandler {
  override def invoke(proxy: Any, method: Method, args: Array[AnyRef]): AnyRef = {
    throw new UnsupportedOperationException("Not implemented")
  }
}
```

### Why are the changes needed?
Clean up deprecated API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43291 from LuciferYang/SPARK-45467.

Lead-authored-by: YangJie <[email protected]>
Co-authored-by: yangjie01 <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
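A condensed sketch of the resulting call pattern (the class loader and interface list are placeholders, not the PR's actual call sites):

```scala
import java.lang.reflect.{InvocationHandler, Method, Proxy}

// Stand-in for the DummyInvocationHandler described above.
object DummyInvocationHandler extends InvocationHandler {
  override def invoke(proxy: Any, method: Method, args: Array[AnyRef]): AnyRef =
    throw new UnsupportedOperationException("Not implemented")
}

val loader: ClassLoader = Thread.currentThread().getContextClassLoader
val interfaces: Array[Class[_]] = Array(classOf[Runnable]) // placeholder interface list

// Before (deprecated since JDK 9):
//   Proxy.getProxyClass(loader, interfaces: _*)
// After:
val proxyClass: Class[_] =
  Proxy.newProxyInstance(loader, interfaces, DummyInvocationHandler).getClass
```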
### What changes were proposed in this pull request?
The PR adds codegen support for `get_json_object`.

### Why are the changes needed?
Improve codegen coverage and performance.

GitHub benchmark data (https://github.com/panbingkun/spark/actions/runs/4497396473/jobs/7912952710):
<img width="879" alt="image" src="https://user-images.githubusercontent.com/15246973/227117793-bab38c42-dcc1-46de-a689-25a87b8f3561.png">

Local benchmark data:
<img width="895" alt="image" src="https://user-images.githubusercontent.com/15246973/227098745-9b360e60-fe84-4419-8b7d-073a0530816a.png">

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add new UT. Pass GA.

Closes #40506 from panbingkun/json_code_gen.

Authored-by: panbingkun <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
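For context, a tiny example of the expression whose evaluation this PR code-generates (assuming an active SparkSession named `spark`; the JSON literal is illustrative):

```scala
// get_json_object(json, path) extracts the value at a JSONPath from a JSON string.
// With this PR the expression gains a codegen implementation instead of always
// being evaluated in interpreted mode.
spark.sql("""SELECT get_json_object('{"a": {"b": 2}}', '$.a.b') AS b""").show()
// prints a single-row result containing 2
```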
…ence#refersTo(null)`

### What changes were proposed in this pull request?
This PR simply replaces `Reference#isEnqueued` with `Reference#refersTo` in `CompletionIteratorSuite`; the solution refers to https://github.com/openjdk/jdk/blob/dfacda488bfbe2e11e8d607a6d08527710286982/src/java.base/share/classes/java/lang/ref/Reference.java#L436-L454:

```
 * @deprecated
 * This method was originally specified to test if a reference object has
 * been cleared and enqueued but was never implemented to do this test.
 * This method could be misused due to the inherent race condition
 * or without an associated {@code ReferenceQueue}.
 * An application relying on this method to release critical resources
 * could cause serious performance issue.
 * An application should use {@link ReferenceQueue} to reliably determine
 * what reference objects that have been enqueued or
 * {@link #refersTo(Object) refersTo(null)} to determine if this reference
 * object has been cleared.
 *
 * @return {@code true} if and only if this reference object is
 *         in its associated queue (if any).
 */
@Deprecated(since="16")
public boolean isEnqueued() {
    return (this.queue == ReferenceQueue.ENQUEUED);
}
```

### Why are the changes needed?
Clean up deprecated API usage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43325 from LuciferYang/SPARK-45499.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
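A minimal sketch of the replacement in test-style code (the reference and queue below are illustrative, not the suite's actual fixtures):

```scala
import java.lang.ref.{ReferenceQueue, WeakReference}

val queue = new ReferenceQueue[Object]()
var referent: Object = new Object()
val ref = new WeakReference[Object](referent, queue)

referent = null   // drop the strong reference
System.gc()       // request collection (not guaranteed to run immediately)

// Before (deprecated since JDK 16): ref.isEnqueued
// After: refersTo(null) checks whether the referent has been cleared.
val cleared: Boolean = ref.refersTo(null)
```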
### What changes were proposed in this pull request?
Correct the function groups in connect.functions.

### Why are the changes needed?
To be consistent with 17da438.

### Does this PR introduce _any_ user-facing change?
Yes, it will change the scaladoc (when it is available).

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43309 from zhengruifeng/connect_function_scaladoc.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…ot match specified timestampFormat

### What changes were proposed in this pull request?
This PR fixes CSV/JSON schema inference, which reported an error when timestamps did not match the specified timestampFormat.

```scala
// e.g.
val csv = spark.read.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .option("inferSchema", true).csv(Seq("2884-06-24T02:45:51.138").toDS())
csv.show()
// error
// Caused by: java.time.format.DateTimeParseException: Text '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
```

This bug only happened when a partition had one row. The data type should be `StringType`, not `TimestampType`, because the value does not match the `timestampFormat`.

Using CSV as an example: in `CSVInferSchema::tryParseTimestampNTZ`, inference first uses `timestampNTZFormatter.parseWithoutTimeZoneOptional` and returns `TimestampType`. If the same partition has another row, it then uses `tryParseTimestamp` to parse that row with the user-defined `timestampFormat`, finds that it can't be converted to a timestamp with that format, and finally returns `StringType`. But when there is only one row, parsing a normal timestamp with `timestampNTZFormatter.parseWithoutTimeZoneOptional` is not right. We should only parse it that way when `spark.sql.timestampType` is `TIMESTAMP_NTZ`. If `spark.sql.timestampType` is `TIMESTAMP_LTZ`, we should directly parse it with `tryParseTimestamp`, to avoid returning `TimestampType` when timestamps do not match the specified `timestampFormat`.

### Why are the changes needed?
Fix a schema inference bug.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add new test.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43243 from Hisoka-X/SPARK-45433-inference-mismatch-timestamp-one-row.

Authored-by: Jia Fan <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
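With the fix in place, the single-row example above should fall back to string inference; a hedged expectation sketch (assuming the default `spark.sql.timestampType` of `TIMESTAMP_LTZ` and an active `spark` session):

```scala
import spark.implicits._

val csv = spark.read
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss")
  .option("inferSchema", true)
  .csv(Seq("2884-06-24T02:45:51.138").toDS())

// The value does not match the user-supplied timestampFormat, so the column
// is now inferred as a string instead of failing at parse time.
csv.printSchema()
// root
//  |-- _c0: string (nullable = true)
```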
### What changes were proposed in this pull request?
This PR adds an optional `ExecuteHolder` to `SparkConnectPlanner`. This allows plugins to access the `ExecuteHolder` to, for example, create a `QueryPlanningTracker` without the need to create a new `CommandPlugin` interface like the one proposed in #42984.

### Why are the changes needed?
There is currently no way to track queries executed by a `CommandPlugin`. For this, the plugin needs access to the `ExecuteHolder`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Adjusted existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43311 from dillitz/plugin-new-approach.

Authored-by: Robert Dillitz <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
…ender to ExecuteResponseObserver

### What changes were proposed in this pull request?
Small improvements can be made to the way a new ExecuteGrpcResponseSender is attached to the observer:
* Since we now have addGrpcResponseSender in ExecuteHolder, it should be ExecuteHolder's responsibility to interrupt the old sender and ensure there is only one at a time, not ExecuteResponseObserver's responsibility.
* executeObserver is used as a lock for synchronization. An explicit lock object could be better.

This also fixes a small bug: ExecuteGrpcResponseSender would not be woken up by an interrupt if it was sleeping on the grpcCallObserverReadySignal. This would result in the sender potentially sleeping until the deadline (2 minutes) and only then being removed, which could delay timing the execution out by these 2 minutes. It should **not** cause any hang or wait on the client side, because if ExecuteGrpcResponseSender is interrupted, it means that the client has already come back with a new reattach, and the old sender is being kicked out.

### Why are the changes needed?
Minor cleanup of previous work.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests in ReattachableExecuteSuite.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43181 from juliuszsompolski/SPARK-44855.

Authored-by: Juliusz Sompolski <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
…testore

### What changes were proposed in this pull request?
Our Spark environment features a number of parallel structured streaming jobs, many of which use the state store. Most use the state store for dropDuplicates and work with a tiny amount of information, but a few have a substantially large state store requiring use of RocksDB. In such a configuration, Spark allocates a minimum of `spark.sql.shuffle.partitions * queryCount` partitions, each of which pre-allocates about 74 MB (observed on EMR/Hadoop) of disk storage for RocksDB. This allocation is due to pre-allocation of log file space using [fallocate](https://github.com/facebook/rocksdb/blob/main/include/rocksdb/options.h#L871-L880), requiring users to either unnaturally reduce shuffle partitions, split running Spark instances, or allocate a large amount of wasted storage.

This PR provides users with the option to simply disable fallocate so RocksDB uses far less space for the smaller state stores, reducing complexity and disk storage at the expense of performance.

### Why are the changes needed?
As previously mentioned, these changes allow a Spark context to support many parallel structured streaming jobs when using RocksDB state stores without the need to allocate a glut of excess storage.

### Does this PR introduce _any_ user-facing change?
Users disable the fallocate RocksDB performance optimization by configuring `spark.sql.streaming.stateStore.rocksdb.allowFAllocate=false`.

### How was this patch tested?
1) A few test cases were added
2) The state store size was validated by running this script with and without fallocate disabled

```
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
import datetime

if disable_fallocate:
    spark.conf.set("spark.sql.streaming.stateStore.rocksdb.allowFAllocate", "false")

spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
)

schema = StructType(
    [
        StructField("one", TimestampType(), False),
        StructField("two", StringType(), True),
    ]
)

now = datetime.datetime.now()
data = [(now, y) for y in range(300)]

init_df = spark.createDataFrame(data, schema)
path = "/tmp/stream_try/test"
init_df.write.format("parquet").mode("append").save(path)

stream_df = spark.readStream.schema(schema).format("parquet").load(path)
stream_df = stream_df.dropDuplicates(["one"])

def foreach_batch_function(batch_df, epoch_id):
    batch_df.write.format("parquet").mode("append").option("path", path + "_out").save()

stream_df.writeStream.foreachBatch(foreach_batch_function).option(
    "checkpointLocation", path + "_checkpoint"
).start()
```

With these results (local run, Docker container with small FS):

```
allowFAllocate=True (current default)
---------------------
root@0ef384f699e0:/tmp# du -sh spark-d43a2964-c92a-4d94-9fdd-f3557a651fd9
808M    spark-d43a2964-c92a-4d94-9fdd-f3557a651fd9
  |
  |-->4.1M StateStoreId(opId=0,partId=0,name=default)-d59b907c-8004-47f9-a8a1-dec131f73505
  |--> <snip>
  |-->4.1M StateStoreId(opId=0,partId=199,name=default)-b49a93fe-1007-4e92-8f8f-5767aef41e5c

allowFAllocate=False (new feature)
----------------------
root@0ef384f699e0:/tmp# du -sh spark-00cb768d-2659-453c-8670-4aaf70148041
7.9M    spark-00cb768d-2659-453c-8670-4aaf70148041
  |
  |-->40K StateStoreId(opId=0,partId=0,name=default)-45b38d9c-737b-49b1-bb82-
  |--> <snip>
  |-->40K StateStoreId(opId=0,partId=199,name=default)-28a6cc02-2693-4360-b47a-1f1ab0d54a61
```

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43202 from schenksj/feature/rocksdb_allow_fallocate.

Authored-by: Scott Schenkein <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
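The same switches expressed in Scala at session build time (a sketch; the app name is a placeholder, and the keys are the ones referenced in this PR):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("many-small-state-stores")
  // Keep RocksDB as the state store provider, but skip RocksDB's log-file
  // pre-allocation (fallocate) so tiny state stores stay tiny on disk.
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  .config("spark.sql.streaming.stateStore.rocksdb.allowFAllocate", "false")
  .getOrCreate()
```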
### What changes were proposed in this pull request?
This PR refines the docstring of `DataFrameReader.parquet` by adding more examples.

### Why are the changes needed?
To improve PySpark documentation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
doctest

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43301 from allisonwang-db/spark-45221-refine-parquet.

Lead-authored-by: Hyukjin Kwon <[email protected]>
Co-authored-by: allisonwang-db <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
Sort before show.

### Why are the changes needed?
The order of rows is non-deterministic after groupBy, so the tests fail in some environments.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43331 from zhengruifeng/py_collect_groupby.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
…thods to consume previous 'analyze' result

### What changes were proposed in this pull request?
This PR adds a Python UDTF API for the `eval` and `terminate` methods to consume the previous `analyze` result. This also works for subclasses of the `AnalyzeResult` class, allowing the UDTF to return custom state from `analyze` to be consumed later.

For example, we can now define a UDTF that performs complex initialization in the `analyze` method and then returns the result of that in the `terminate` method:

```
def MyUDTF(self):
    @dataclass
    class AnalyzeResultWithBuffer(AnalyzeResult):
        buffer: str

    @udtf
    class TestUDTF:
        def __init__(self, analyze_result):
            self._total = 0
            self._buffer = do_complex_initialization(analyze_result.buffer)

        @staticmethod
        def analyze(argument, _):
            return AnalyzeResultWithBuffer(
                schema=StructType()
                .add("total", IntegerType())
                .add("buffer", StringType()),
                with_single_partition=True,
                buffer=argument.value,
            )

        def eval(self, argument, row: Row):
            self._total += 1

        def terminate(self):
            yield self._total, self._buffer

    self.spark.udtf.register("my_ddtf", MyUDTF)
```

Then the results might look like:

```
sql(
    """
    WITH t AS (
      SELECT id FROM range(1, 21)
    )
    SELECT total, buffer
    FROM test_udtf("abc", TABLE(t))
    """
).collect()

> 20, "complex_initialization_result"
```

### Why are the changes needed?
In this way, the UDTF can perform potentially expensive initialization logic in the `analyze` method just once and reuse the result of such initialization rather than repeating the initialization in `eval`.

### Does this PR introduce _any_ user-facing change?
Yes, see above.

### How was this patch tested?
This PR adds new unit test coverage.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43204 from dtenedor/prepare-string.

Authored-by: Daniel Tenedorio <[email protected]>
Signed-off-by: Takuya UESHIN <[email protected]>
…1 after get_json_object supports codegen

### What changes were proposed in this pull request?
This PR is a follow-up to #40506; it updates JsonBenchmark-jdk21-results.txt accordingly.

### Why are the changes needed?
Update JsonBenchmark-jdk21-results.txt.
https://github.com/panbingkun/spark/actions/runs/6489918873

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Only updates the results of the benchmark.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43346 from panbingkun/get_json_object_followup.

Authored-by: panbingkun <[email protected]>
Signed-off-by: yangjie01 <[email protected]>
### What changes were proposed in this pull request?
This PR refines the docstring of `DataFrame.show` by adding more examples.

### Why are the changes needed?
To improve PySpark documentation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
doctest

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43252 from allisonwang-db/spark-45442-refine-show.

Authored-by: allisonwang-db <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
The github-actions bot added the BUILD, DOCS, CORE, SQL, DSTREAM, ML, STRUCTURED STREAMING, PYTHON, KUBERNETES labels on Oct 12, 2023.
What changes were proposed in this pull request?
Upgrade Apache Kafka from 3.4.1 to 3.6.0
Why are the changes needed?
A bunch of improvements.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
GitHub CI.
Was this patch authored or co-authored using generative AI tooling?
No