Efficient Computation of Statistics for a Non-linear N-dimensional Function with Dependent Variables #508

CFoye-Creare opened this issue Apr 3, 2023 · 3 comments · May be fixed by #509

CFoye-Creare commented Apr 3, 2023

Objective
Develop an efficient method to compute the statistics s(c) of a non-linear n-dimensional function c = f(x, y) with statistically dependent inputs x and y, by finding an operator F such that F(s(x), s(y)) = s(c), while minimizing computational expense.

Background
Directly evaluating the non-linear n-dimensional function f(x, y) over the full data and then computing s(f(x, y)) to obtain s(c) is computationally expensive. Instead, we want an efficient way to compute F(s(x), s(y)) that yields s(c) at reduced cost.

Input:

  1. A non-linear n-dimensional function f(x, y)
  2. Statistics on x and y, denoted s(x) and s(y)

Output:
Statistics on the function output c = f(x, y), denoted s(c), such that F(s(x), s(y)) = s(c). For example, given two PDFs P(x) and P(y), we want to approximate F such that F(P(x), P(y)) = PDF(c).
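
To make the target concrete, here is a minimal sketch of the expensive direct route s(f(x, y)) that the operator F is meant to replace. The toy function f, the correlated-normal inputs, and the sample size are all placeholders for illustration:

```python
import numpy as np

# Toy stand-in for the expensive model f(x, y); the real f is what we
# want to avoid evaluating at every data point.
def f(x, y):
    return np.exp(-x**2) * np.sin(y) + x * y

rng = np.random.default_rng(0)

# Statistically dependent inputs x and y (correlated Gaussians).
cov = [[1.0, 0.6], [0.6, 1.0]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=100_000).T

# Direct route: evaluate f everywhere, then take statistics of c.
c = f(x, y)
print("mean of c:", c.mean())
print("variance of c:", c.var())
```

The goal is an operator F that reproduces these statistics from s(x) and s(y) alone, with far fewer evaluations of f.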

Constraints:
The proposed method should significantly reduce computational cost compared to directly calculating s(f(x, y)) to obtain s(c).

Evaluation Metrics:
The efficiency of the proposed method will be evaluated based on the following criteria:

  1. Accuracy: The computed s(c) should be accurate and comparable to the result obtained from s(f(x, y)).
  2. Computational cost: The proposed method should demonstrate a significant reduction in computational cost compared to calculating s(f(x, y)) directly.
  3. Scalability: The method should be able to handle large-scale problems with high-dimensional functions and large datasets for x and y.
  4. Robustness: The method should be robust to variations in the function and input data.

Deliverables

  • A mathematical representation or model that captures the relationship between the dependent variables x and y and the function f(x, y), allowing us to approximate F(s(x), s(y)) without directly computing s(f(x, y)).
  • An algorithm or method that efficiently computes F(s(x), s(y)) to obtain s(c) based on the derived mathematical representation or model. This algorithm should be designed to minimize computational cost while maintaining accuracy, scalability, and robustness.
  • A method for validating and quantifying the accuracy of the derived s(c) against the result obtained from s(f(x, y)), using techniques such as cross-validation, error analysis, or comparison with benchmark datasets (a minimal sketch follows this list).
  • A thorough analysis of the computational cost, scalability, and robustness of the proposed method compared to directly computing s(f(x, y)). This will help demonstrate the practical benefits and efficiency of the derived approach.
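
As a starting point for the validation deliverable, here is a minimal error-analysis sketch. The statistic values are hypothetical placeholders for illustration only; in practice the direct value comes from s(f(x, y)) on the full data and the approximate value from the cheap operator F:

```python
def relative_error(approx_stat: float, direct_stat: float) -> float:
    """Relative error of an approximated statistic against the direct one."""
    return abs(approx_stat - direct_stat) / abs(direct_stat)

# Hypothetical placeholder values for illustration only.
direct_mean = 0.4173   # from s(f(x, y)) on the full dataset
approx_mean = 0.4168   # from F(s(x), s(y))

err = relative_error(approx_mean, direct_mean)
print(f"relative error: {err:.2%}")
```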
CFoye-Creare self-assigned this Apr 3, 2023
CFoye-Creare commented:

Some possible approaches include:

  1. Surrogate Modeling
    Use surrogate models, such as Gaussian Process Regression or Radial Basis Function networks, to approximate f(x, y), then compute F(s(x), s(y)) from the surrogate. A fast approximation of the function reduces the cost of generating output statistics (see the sketch after this list).
  2. Sparse Grid Techniques
    Use sparse grid techniques to approximate f(x, y), then compute F(s(x), s(y)). Sparse grids reduce computational cost by handling high-dimensional functions with far fewer grid points than a full tensor grid.
  3. Machine Learning
    Use machine learning techniques, such as neural networks or support vector machines, to learn the relationship between the inputs and output of f(x, y), then compute F(s(x), s(y)) from the learned approximation.
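
As a concrete illustration of option 1, here is a minimal surrogate sketch using scikit-learn's GaussianProcessRegressor. The toy f, the training design, and the query distribution are assumptions for illustration, not the project's actual model:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Toy stand-in for the expensive model f(x, y).
def f(x, y):
    return np.exp(-x**2) * np.sin(y) + x * y

rng = np.random.default_rng(0)

# Fit the surrogate on a small design (only a few hundred expensive evals).
X_train = rng.uniform(-2.0, 2.0, size=(200, 2))
z_train = f(X_train[:, 0], X_train[:, 1])
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), normalize_y=True)
gp.fit(X_train, z_train)

# Push many dependent input samples through the cheap surrogate instead of f.
X_query = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]],
                                  size=50_000)
c_hat = gp.predict(X_query)
print("surrogate mean:", c_hat.mean(), "| surrogate variance:", c_hat.var())
```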

CFoye-Creare added a commit that referenced this issue Apr 3, 2023
CFoye-Creare commented:

Also notable is the application of dimensionality reduction techniques before building surrogate models or approximations; a minimal PCA sketch follows the list below.

  1. Principal Component Analysis (PCA):
    PCA is a linear dimensionality reduction technique that transforms the input data into a new coordinate system by finding orthogonal axes (principal components) that capture the most variance in the data. The principal components are linear combinations of the original features, and the transformed data can be represented with fewer dimensions by retaining only the components that account for the most significant variance.
    Relevant Paper:
    Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: a review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065), 20150202.
    (https://royalsocietypublishing.org/doi/10.1098/rsta.2015.0202)
  2. t-Distributed Stochastic Neighbor Embedding (t-SNE):
    t-SNE is a non-linear dimensionality reduction technique that maps high-dimensional data to a lower-dimensional space while preserving local structures. It measures pairwise similarities between data points in the high-dimensional space and the lower-dimensional space, and minimizes the divergence between these similarity distributions using a gradient descent approach. t-SNE is particularly useful for visualizing complex data structures.
    Relevant paper:
    van der Maaten, L., & Hinton, G. (2008). Visualizing Data using t-SNE. Journal of Machine Learning Research, 9(Nov), 2579-2605.
    (https://jmlr.org/papers/v9/vandermaaten08a.html)
  3. Uniform Manifold Approximation and Projection (UMAP):
    UMAP is a non-linear dimensionality reduction technique based on manifold learning and topology. It approximates the high-dimensional manifold structure by constructing a graph representation of the data and then optimizes an embedding in the lower-dimensional space to preserve both local and global structures. UMAP is computationally efficient and scalable, making it suitable for large-scale data analysis and visualization.
    Relevant paper:
    McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv preprint arXiv:1802.03426.
    (https://arxiv.org/abs/1802.03426)
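
As an illustration of the first option, here is a minimal PCA sketch with scikit-learn. The synthetic low-rank data is an assumption for illustration; in practice X would be the high-dimensional inputs feeding the surrogate:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic inputs with low intrinsic dimension: 50 observed features
# driven by 5 latent factors plus a little noise.
latent = rng.normal(size=(10_000, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(10_000, 50))

# Keep enough principal components to explain 95% of the variance,
# then build the surrogate on the reduced coordinates.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(f"reduced from {X.shape[1]} to {X_reduced.shape[1]} dimensions")
```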

CFoye-Creare added a commit that referenced this issue Apr 3, 2023
CFoye-Creare added a commit that referenced this issue Apr 3, 2023
CFoye-Creare added a commit that referenced this issue Apr 4, 2023
CFoye-Creare commented:

Since opening this issue, we have evolved our approach. We now use the Law of the Unconscious Statistician (LOTUS) together with a change of variables: for a random variable X with density p, LOTUS gives E[f(X)] = ∫ f(x) p(x) dx, so statistics of f(X) can be computed by weighting function evaluations with the input PDF rather than transforming every data point.

Our approach became:

  1. Calculating a PDF with n_bins bins for each coarse-resolution grid square
  2. Sampling the centers of each bin and evaluating the fine-scale soil moisture there
  3. Multiplying the evaluated soil moisture by the PDF to weight it correctly (see the sketch below)
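
A minimal numpy sketch of these three steps for a single coarse grid square. The logistic soil_moisture function and the input distribution are placeholders for the real fine-scale model and data:

```python
import numpy as np

# Placeholder for the real fine-scale soil-moisture model.
def soil_moisture(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Fine-scale input values within one coarse-resolution grid square.
x_fine = rng.normal(loc=0.3, scale=1.0, size=100_000)

# Step 1: build an n_bins PDF of the inputs for this grid square.
n_bins = 32
pdf, edges = np.histogram(x_fine, bins=n_bins, density=True)

# Step 2: evaluate the fine-scale model only at the bin centers.
centers = 0.5 * (edges[:-1] + edges[1:])
sm = soil_moisture(centers)

# Step 3: weight by the PDF (LOTUS: E[f(X)] ≈ Σ f(x_i) p(x_i) Δx).
widths = np.diff(edges)
mean_lotus = np.sum(sm * pdf * widths)
var_lotus = np.sum((sm - mean_lotus) ** 2 * pdf * widths)

print("LOTUS mean:", mean_lotus, "| direct mean:", soil_moisture(x_fine).mean())
```

Only n_bins evaluations of the model are needed per grid square, versus one per fine-scale sample for the direct calculation.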

This commit demonstrates the approach on toy data: ab6618a.
This commit demonstrates it on real data: 27140d2.

Here are some figures:

Figure: MSE vs. bin size for calculating mean soil moisture.

Figure: MSE vs. bin size for calculating variance.

CFoye-Creare linked a pull request May 22, 2023 that will close this issue