Promote 0.9.28 #1889

Merged
merged 45 commits on Nov 13, 2024
Changes from 1 commit
Commits (45)
42c92c3
misc(core): Adding unit tests for histograms for StitchRvsExec (#1831)
sandeep6189 Aug 8, 2024
059bfb9
fix bugs caused by stitching empty and non-empty data. (#1832)
yu-shipit Aug 12, 2024
aed6961
hotfix(0.9.27.1): revert "Now metadata queries support _type_ filter …
alextheimer Aug 21, 2024
74107b0
fix(query): fix parse issue with escaped dot character (#1841)
sherali42 Aug 26, 2024
5ae7cbe
feat(core): Do not index any existing _type_ field since it is a rese…
vishramachandran Sep 5, 2024
46236ea
hotfix(0.9.27.1): revert "Now metadata queries support _type_ filter …
alextheimer Sep 11, 2024
8108083
hotfix(0.9.27.1): revert "Now metadata queries support _type_ filter …
alextheimer Aug 21, 2024
7cc9cca
Unrevert "Now metadata queries support _type_ filter (#1819)"
alextheimer Sep 11, 2024
a7312de
hotfix(0.9.27.1): un/revert "Now metadata queries support _type_ filt…
alextheimer Sep 12, 2024
4320860
fix(query): negative rate/increase due to NaN chunk (#1846)
sherali42 Sep 13, 2024
c2946bd
fix(query): Generalizaing the column filter check to span multiple te…
sandeep6189 Sep 16, 2024
26ab573
feat(core): Add support for Tantivy based time series index (#1852)
rfairfax Sep 20, 2024
101f566
feat(query): Add support for LogicalPlan updates, to use higher level…
sandeep6189 Sep 20, 2024
29dcec6
fix(core): Support Rust 1.78 (#1854)
rfairfax Sep 24, 2024
f9cbbcf
adding metrics for failover (#1856)
kvpetrov Sep 26, 2024
6fe42cc
fix(core): Fix Rust publishing of x86_64 binaries (#1857)
rfairfax Sep 26, 2024
e310445
fix(build): Support glibc qualifiers for Rust targets (#1858)
rfairfax Sep 27, 2024
aab8633
metrics for failover (#1859)
kvpetrov Sep 27, 2024
45f8d22
adding metrics for failover (#1856)
kvpetrov Sep 26, 2024
a685755
metrics for failover (#1859)
kvpetrov Sep 27, 2024
95ee5d2
Merge pull request #1860 from kvpetrov/shard_failover_metric
kvpetrov Sep 27, 2024
33e4656
misc(sparkjobs): force push metrics publish from index job (#1862)
sherali42 Oct 2, 2024
ebee7ae
fix(core): Fix tantivy column cache not releasing memory from deleted…
rfairfax Oct 7, 2024
22d0b08
fix(query) Fixed mismatched schema regarding fixedVectorLen. (#1855)
yu-shipit Oct 7, 2024
ccc70dd
fix(query) Fixed mismatched schema regarding fixedVectorLen. (#1855) …
yu-shipit Oct 7, 2024
6d0e997
misc(query): increment counter when query plan updated with next leve…
sandeep6189 Oct 8, 2024
80a37ac
fix(core): Improve performance for Tantivy indexValues call (#1867)
rfairfax Oct 11, 2024
69df0c2
feat(query): Support multiple aggregation rules for HierarchicalQuery…
sandeep6189 Oct 15, 2024
81cde60
feat(core): Now metadata queries support _type_ filter (#1819)
vishramachandran Jul 29, 2024
f7182b9
feat(core): Do not index any existing _type_ field since it is a rese…
vishramachandran Sep 5, 2024
4450bcd
Cherry-pick: support for _type_ filter in metadata queries
sherali42 Oct 15, 2024
84f7ade
fix(core): Don't index part keys with invalid schema (#1870)
rfairfax Oct 16, 2024
fd59ebb
fix(query) the schema provided by _type_ does not match colIDs in the…
yu-shipit Oct 21, 2024
6657340
fix(query): removing max/min aggregations from hierarchical query exp…
sandeep6189 Oct 24, 2024
880d6e9
Merge branch 'develop' into integ-merge
amolnayak311 Nov 1, 2024
7ed9466
Merge pull request #1876 from amolnayak311/integ-merge
amolnayak311 Nov 1, 2024
376e7c6
fix(coordinator): update LogicalPlanParser to correctly handle scalar…
Tanner-Meng Nov 1, 2024
7f008e7
perf(query) Memoize the part of the logical plan tree traversal for r…
amolnayak311 Nov 1, 2024
722caa4
Merge branch 'develop' into integ-merge-take2
amolnayak311 Nov 1, 2024
b5b3c0a
Merge pull request #1877 from amolnayak311/integ-merge-take2
amolnayak311 Nov 4, 2024
74c238f
Version bumnp to 0.9.28 (#1878)
amolnayak311 Nov 4, 2024
a606dbc
perf(query) Eliminate the allocation of memory for RepeatValueVector …
amolnayak311 Nov 8, 2024
9181cd9
Merge pull request #1888 from amolnayak311/integration
amolnayak311 Nov 13, 2024
5003655
Merge branch 'integration' into promote-0.9.28
amolnayak311 Nov 13, 2024
2c5f291
Version bump to 0.9.28.0
amolnayak311 Nov 13, 2024
adding metrics for failover (#1856)
Co-authored-by: Kier Petrov <kpetrov@apple.com>
kvpetrov and Kier Petrov committed Sep 27, 2024
commit 45f8d227ebba5d192733f1e256f202ba9a795970
Changes to HighAvailabilityPlanner:
@@ -7,6 +7,7 @@ import scala.jdk.CollectionConverters._

import com.typesafe.scalalogging.StrictLogging
import io.grpc.ManagedChannel
+import kamon.Kamon

import filodb.coordinator.GrpcPlanDispatcher
import filodb.coordinator.ShardMapper
@@ -17,6 +18,9 @@ import filodb.grpc.GrpcCommonUtils
import filodb.query.{LabelNames, LabelValues, LogicalPlan, SeriesKeysByFilters}
import filodb.query.exec._

+object HighAvailabilityPlanner {
+final val FailoverCounterName = "single-cluster-plans-materialized"
+}
/**
* HighAvailabilityPlanner responsible for using underlying local planner and FailureProvider
* to come up with a plan that orchestrates query execution between multiple
@@ -48,6 +52,24 @@ class HighAvailabilityPlanner(dsRef: DatasetRef,
import QueryFailureRoutingStrategy._
import LogicalPlan._

+// legacy failover counter captures failovers when we send a PromQL to the buddy
+// cluster
+val legacyFailoverCounter = Kamon.counter(HighAvailabilityPlanner.FailoverCounterName)
+.withTag("cluster", clusterName)
+.withTag("type", "legacy")
+
+// full failover counter captures failovers when we materialize a plan locally and
+// send an entire plan to the buddy cluster for execution
+val fullFailoverCounter = Kamon.counter(HighAvailabilityPlanner.FailoverCounterName)
+.withTag("cluster", clusterName)
+.withTag("type", "full")
+
+// partial failover counter captures failovers when we materialize a plan locally and
+// send some parts of it for execution to the buddy cluster
+val partialFailoverCounter = Kamon.counter(HighAvailabilityPlanner.FailoverCounterName)
+.withTag("cluster", clusterName)
+.withTag("type", "partial")
+
// HTTP endpoint is still mandatory as metadata queries still use it.
val remoteHttpEndpoint: String = queryConfig.remoteHttpEndpoint
.getOrElse(throw new IllegalArgumentException("remoteHttpEndpoint config needed"))
@@ -110,6 +132,7 @@ class HighAvailabilityPlanner(dsRef: DatasetRef,
qContext)
}
case route: RemoteRoute =>
+legacyFailoverCounter.increment()
val timeRange = route.timeRange.get
val queryParams = qContext.origQueryParams.asInstanceOf[PromQlQueryParams]
// rootLogicalPlan can be different from queryParams.promQl
@@ -341,6 +364,7 @@ class HighAvailabilityPlanner(dsRef: DatasetRef,
localActiveShardMapper: ActiveShardMapper,
remoteActiveShardMapper: ActiveShardMapper
): ExecPlan = {
+partialFailoverCounter.increment()
// it makes sense to do local planning if we have at least 50% of shards running
// as the query might overload the few shards we have while doing second level aggregation
// Generally, it would be better to ship the entire query to the cluster that has more shards up
@@ -360,6 +384,7 @@ class HighAvailabilityPlanner(dsRef: DatasetRef,
buddyGrpcEndpoint = Some(remoteGrpcEndpoint.get)
)
val context = qContext.copy(plannerParams = haPlannerParams)
+logger.info(context.getQueryLogLine("Using shard level failover"))
val plan = localPlanner.materialize(logicalPlan, context);
plan
}
@@ -419,6 +444,7 @@ class HighAvailabilityPlanner(dsRef: DatasetRef,
qContext: QueryContext,
localActiveShardMapper: ActiveShardMapper, remoteActiveShardMapper: ActiveShardMapper
): GenericRemoteExec = {
+fullFailoverCounter.increment()
val timeout: Long = queryConfig.remoteHttpTimeoutMs.getOrElse(60000)
val plannerParams = qContext.plannerParams.copy(
failoverMode = filodb.core.query.ShardLevelFailoverMode,
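
The counters above share one metric name (HighAvailabilityPlanner.FailoverCounterName, i.e. "single-cluster-plans-materialized") and are told apart only by their "cluster" and "type" tags, so a single dashboard query can be filtered or grouped by failover kind. A minimal, self-contained sketch of that Kamon pattern follows; the FailoverMetrics helper and the example tag values are illustrative and not part of this PR.

import kamon.Kamon

// Illustrative helper (not FiloDB source): one counter name, distinguished by tags.
// Kamon keeps a separate instrument per unique (name, tags) combination.
object FailoverMetrics {
  final val FailoverCounterName = "single-cluster-plans-materialized"

  // Returns the tagged counter for the given cluster and failover type
  // ("legacy", "full", "partial" or "shardUnavailable" in this PR).
  def failoverCounter(clusterName: String, failoverType: String) =
    Kamon.counter(FailoverCounterName)
      .withTag("cluster", clusterName)
      .withTag("type", failoverType)
}

// Example usage with illustrative values:
// FailoverMetrics.failoverCounter("my-cluster", "partial").increment()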
Changes to SingleClusterPlanner:
@@ -65,6 +65,12 @@ class SingleClusterPlanner(val dataset: Dataset,
private val shardColumns = dsOptions.shardKeyColumns.sorted
private val dsRef = dataset.ref

+// failed failover counter captures failovers which are not possible because at least one shard
+// is down on both the primary and DR clusters. The query is executed only when
+// partial results are acceptable; otherwise an exception is thrown.
+val shardUnavailableFailoverCounter = Kamon.counter(HighAvailabilityPlanner.FailoverCounterName)
+.withTag("cluster", clusterName)
+.withTag("type", "shardUnavailable")

val numPlansMaterialized = Kamon.counter("single-cluster-plans-materialized")
.withTag("cluster", clusterName)
@@ -192,10 +198,12 @@ class SingleClusterPlanner(val dataset: Dataset,
if (!shardInfo.active) {
if (queryContext.plannerParams.allowPartialResults)
logger.debug(s"Shard: $shard is not available however query is proceeding as partial results is enabled")
-else
+else {
+shardUnavailableFailoverCounter.increment()
throw new filodb.core.query.ServiceUnavailableException(
s"Remote Buddy Shard: $shard is not available"
)
+}
}
val dispatcher = RemoteActorPlanDispatcher(shardInfo.address, clusterName)
dispatcher
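
Taken together with the comment above, the shardUnavailable counter is incremented only when a failover cannot proceed: if a buddy shard is down and partial results are allowed, the query continues; otherwise the counter is bumped and the query fails. A standalone sketch of that decision is below; ShardInfo, ServiceUnavailableException and the tag values here are simplified stand-ins, not the FiloDB classes.

import kamon.Kamon

// Simplified stand-ins for the FiloDB types referenced in the diff.
final case class ShardInfo(active: Boolean, address: String)
final class ServiceUnavailableException(msg: String) extends RuntimeException(msg)

object ShardFailoverCheck {
  // Same metric name and tag scheme as the planner counters; tag values are illustrative.
  private val shardUnavailableFailoverCounter =
    Kamon.counter("single-cluster-plans-materialized")
      .withTag("cluster", "example-cluster")
      .withTag("type", "shardUnavailable")

  // Mirrors the guarded increment added above: count (and fail) only when the shard
  // is down and partial results are not acceptable.
  def checkShard(shard: Int, shardInfo: ShardInfo, allowPartialResults: Boolean): Unit =
    if (!shardInfo.active) {
      if (allowPartialResults)
        println(s"Shard: $shard is not available, proceeding because partial results are enabled")
      else {
        shardUnavailableFailoverCounter.increment()
        throw new ServiceUnavailableException(s"Remote Buddy Shard: $shard is not available")
      }
    }
}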