Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refresh with cascade: true - refreshing unrelated materialized views #275

Open
yonibaciu opened this issue Feb 15, 2019 · 9 comments
Open

Comments

@yonibaciu
Copy link

yonibaciu commented Feb 15, 2019

Hi,

There is a flaw in the Postgres Adapter DependencyParser class to find out which mat views are dependencies of the mat view that is being refreshed. It results in unrelated mat views being refreshed.
The reason is tsort(dependency_hash) puts all mat views in the db linearly in an array including ones that have no relation to the mat view being refreshed.

Take this example:
A depends on B which depends on C
E depends on D which depends on C

No matter what you do, A will either be before E in the array or after it:
example #1 [C,B,A,D,E]
example #2 [C,D,E,B,A]

If I refresh (cascade: true) mat view E it will also unnecessarily refresh B then A in example #1.
If I refresh (cascade: true) mat view A it will also unnecessarily refresh E then D in example #2.

TL;DR - Forcing all mat views in the db into a single linear dependency array will cause unnecessary (and costly) refreshes of unrelated mat views.

EDIT: This problem is not only about mat views having a common 'ancestry'. It also applies to situations with no common 'ancestry':

A depends on B which depends on C
F depends on E which depends on D

tsort(dependency_hash) might put all mat views in a single array that will look like the following:

[C,B,A,D,E,F]

Refreshing F with cascade:true will result in all mat views being refreshed.

@francois-belle
Copy link

Any update on this issue?

@derekprior
Copy link
Contributor

Any update on this issue?

Well tested patches are welcome

@jgigault
Copy link

François says: Apparently, Scenic's algorithm for retrieving dependent materialized views cannot be trusted: #275

rsanheim added a commit to simpledotorg/simple-server that referenced this issue Jun 30, 2021
rsanheim added a commit to simpledotorg/simple-server that referenced this issue Jun 30, 2021
* Start adding new mat views

* Add BPs per month

* Refactor and add specs for v2

* freeze

* linting

* Turn cascade off, it refreshes too many tables

See scenic-views/scenic#275

* Use RefreshMaterializedViews to refresh the new matviews

* Visits depend up on States, so refresh states first

* States depend up on Visits, so refresh Visits first

* Enforce a known good time for all these specs

* Enforce end of June for UTC, ET, and IST

* Freeze time to avoid intermittent failures
@derekprior
Copy link
Contributor

@yonibaciu

Take this example:
A depends on B which depends on C
E depends on D which depends on C

If I refresh (cascade: true) mat view E it will also unnecessarily refresh B then A
If I refresh (cascade: true) mat view A it will also unnecessarily refresh E then D

There's an argument this is correct, is there not? If we're refreshing e, then we have to refresh it's direct decendent d and it's direct descendent c. It sounds like you're expecting it to stop there. However, we know that b depends on c. If We know c has been refreshed then it stands to reason that b may change. If we refresh b then we have to refresh a for the same reason.

I can see an argument that we should only refresh down the tree so to speak, but I think that would be more surprising to users who know they did something to change e, so request for it to be refreshed and are surprised a and b are now stale.

@derekprior
Copy link
Contributor

This problem is not only about mat views having a common 'ancestry'. It also applies to situations with no common 'ancestry':

A depends on B which depends on C
F depends on E which depends on D

Refreshing F with cascade:true will result in all mat views being refreshed

This is definitely a bug.

@yonibaciu
Copy link
Author

@yonibaciu

Take this example:
A depends on B which depends on C
E depends on D which depends on C

If I refresh (cascade: true) mat view E it will also unnecessarily refresh B then A
If I refresh (cascade: true) mat view A it will also unnecessarily refresh E then D

There's an argument this is correct, is there not? If we're refreshing e, then we have to refresh it's direct decendent d and it's direct descendent c. It sounds like you're expecting it to stop there. However, we know that b depends on c. If We know c has been refreshed then it stands to reason that b may change. If we refresh b then we have to refresh a for the same reason.

I can see an argument that we should only refresh down the tree so to speak, but I think that would be more surprising to users who know they did something to change e, so request for it to be refreshed and are surprised a and b are now stale.

I respectfully disagree. "cascade" usually means uni-directional, meaning down the tree (or up the tree if you prefer to look at it that way). I would consider another flag for the behavior you are describing.

Perhaps there should be three: "upstream" (the current cascade), "downstream" and "both" where "both" is the behavior you are describing.

@yonibaciu
Copy link
Author

@derekprior

here is code to refresh without affecting unnecessary mat views. It also includes refresh_all and refresh_list that refresh multiple mat views, maintaining dependency order and making sure a mat view is only refreshed once.

  class MatViewRefresher
    DEPENDENCY_SQL = <<-SQL
      SELECT materialized_view, string_agg(depends_on,',') AS depends_on
      FROM (
        SELECT DISTINCT class_for_rewrite.relname AS materialized_view, class_for_depend.relname AS depends_on
        FROM pg_rewrite AS rewrite
        JOIN pg_class AS class_for_rewrite ON rewrite.ev_class = class_for_rewrite.oid
        JOIN pg_namespace AS rewrite_namespace ON rewrite_namespace.oid = class_for_rewrite.relnamespace
        JOIN pg_depend AS depend ON rewrite.oid = depend.objid
        JOIN pg_class AS class_for_depend ON depend.refobjid = class_for_depend.oid
        JOIN pg_namespace AS depend_namespace ON depend_namespace.oid = class_for_depend.relnamespace
        WHERE class_for_depend.relkind = 'm'
          AND rewrite_namespace.nspname = 'public'
          AND class_for_rewrite.relkind = 'm'
          AND class_for_depend.relname != class_for_rewrite.relname
      ) a
      GROUP BY materialized_view
      ORDER BY materialized_view;
    SQL

    def refresh(mat_view, concurrently: false, cascade: false, logger: Logger.new(STDOUT))
      @dependencies = ActiveRecord::Base.connection.select_all(DEPENDENCY_SQL)
      @already_refreshed = []

      if cascade
        refresh_recursive(mat_view, concurrently: concurrently, logger: logger)
      else
        refresh_simple(mat_view, concurrently: concurrently, logger: logger)
      end
    end

    def refresh_all(concurrently: false, logger: Logger.new(STDOUT))
      @dependencies = ActiveRecord::Base.connection.select_all(DEPENDENCY_SQL)
      @already_refreshed = []

      list_mat_views = ActiveRecord::Base.connection.select_values("SELECT oid::regclass::text FROM pg_class WHERE relkind = 'm'")
      list_mat_views.each do |mat_view|
        refresh_recursive(mat_view, concurrently: concurrently, logger: logger)
      end
    end

    def refresh_list(list_mat_views, concurrently: false, logger: Logger.new(STDOUT))
      @dependencies = ActiveRecord::Base.connection.select_all(DEPENDENCY_SQL)
      @already_refreshed = []

      list_mat_views.each do |mat_view|
        refresh_recursive(mat_view, concurrently: concurrently, logger: logger)
      end
    end

    private

    def refresh_recursive(mat_view, concurrently: false, logger:)
      depends_on_row = @dependencies.detect { |row| row['materialized_view'] == mat_view }
      depends_on_mat_views = []
      if depends_on_row
        depends_on_mat_views = depends_on_row['depends_on'].split(',')
        depends_on_mat_views.each do |depends_on_mat_view|
          refresh_recursive(depends_on_mat_view, concurrently: concurrently, logger: logger)
        end
      end
      unless @already_refreshed.include?(mat_view)
        refresh_simple(mat_view, concurrently: concurrently, logger: logger)
        @already_refreshed << mat_view
      end
    end

    def refresh_simple(mat_view, concurrently: false, logger:)
      logger.info("Refreshing the #{mat_view} view #{concurrently ? 'concurrently' : 'not concurrently'}") unless Rails.env.test?
      ActiveRecord::Base.connection.execute("REFRESH MATERIALIZED VIEW #{concurrently ? 'CONCURRENTLY' : ''} #{mat_view}")
    end
  end

@23tux
Copy link

23tux commented Dec 15, 2024

I stumbled upon the same issue today: A small view that has another small dependent should be refreshed. I did a refresh(cascade: true) and it refreshed 4 other, completely unrelated views. Unfortunately, those 4 are HUGE 😢

Last comment is about 1.5 years ago. Is there any chance this get's merged?

@derekprior
Copy link
Contributor

Last comment is about 1.5 years ago. Is there any chance this get's merged?

It needs a PR and tests, but yes. I also need to think through the impact of the functionality change and whether this warrants a major version bump.

I'm hoping to do some holiday-time OSS maintenance. If you want to see this definitely land in that time period and are willing to open a PR with tests (that fail without the change and pass with it) then that would greatly increase the chances of this getting attention.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants