Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Polish analysis of a multiclosure test #1950

Closed
wants to merge 15 commits into from
Closed

Conversation

andreab1997
Copy link
Contributor

Here we collect new functions and template that allow the analysis of multiclosure tests.
@comane @giovannidecrescenzo @mariaubiali

@scarlehoff
Copy link
Member

If you can do this as a branch off the new data reader that would be immensely helpful :__

and it should also work for everything other than ttbar and jet/dijets ...

@andreab1997
Copy link
Contributor Author

If you can do this as a branch off the new data reader that would be immensely helpful :__

and it should also work for everything other than ttbar and jet/dijets ...

I would do it but we actually need jets and ttbar

Comment on lines +222 to +223
import matplotlib.pyplot as plt
import matplotlib.colors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should use validphys.subplots instead


import numpy as np
from sklearn.decomposition import PCA
from sklearn import preprocessing
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
from sklearn import preprocessing

Seems to be unused

Comment on lines +70 to +79
# fits_dataset_predictions = [
# ThPredictionsResult.from_convolution(pdf, dataset) for pdf in fits_pdf
# ]

# dimensions here are (Nfits, Ndat, Nrep)
# reps = np.asarray([th.error_members for th in fits_dataset_predictions])

# reshape so as to get PCs from all the samples
# reps = reps.reshape(reps.shape[1],-1)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# fits_dataset_predictions = [
# ThPredictionsResult.from_convolution(pdf, dataset) for pdf in fits_pdf
# ]
# dimensions here are (Nfits, Ndat, Nrep)
# reps = np.asarray([th.error_members for th in fits_dataset_predictions])
# reshape so as to get PCs from all the samples
# reps = reps.reshape(reps.shape[1],-1)

Comment on lines +83 to +92
# rescale feature matrix
reps_scaled = reps # preprocessing.scale(reps)

# choose number of principal components (PCs) based on explained variance ratio
n_comp = 1
for _ in range(reps.shape[0]):
pca = PCA(n_comp).fit(reps_scaled.T)
if np.sum(pca.explained_variance_ratio_) >= explained_variance_ratio:
break
n_comp += 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# rescale feature matrix
reps_scaled = reps # preprocessing.scale(reps)
# choose number of principal components (PCs) based on explained variance ratio
n_comp = 1
for _ in range(reps.shape[0]):
pca = PCA(n_comp).fit(reps_scaled.T)
if np.sum(pca.explained_variance_ratio_) >= explained_variance_ratio:
break
n_comp += 1
# choose number of principal components (PCs) based on explained variance ratio
n_comp = 1
for _ in range(reps.shape[0]):
pca = PCA(n_comp).fit(reps.T)
if np.sum(pca.explained_variance_ratio_) >= explained_variance_ratio:
break
n_comp += 1

I think that we can remove this. Or otherwise set preprocessing as an option. The idea was to rescale the feature matrix so as to have it unit free. I am not sure however, whether we really want to have it as we compute the PCA for each dataset separately

Comment on lines +113 to +114
if n_comp <=1:
return None, None, n_comp
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably a bit ugly, but it's the way I had found to deal either with datasets that have one datapoint only or with those that for the given explained variance ratio shrink to 1.
I would be happy to discuss on this

@comane
Copy link
Member

comane commented Mar 4, 2024

Hi @andreab1997, is it fine for you if I commit to this branch?
This way I can add some of the modifications I am suggesting plus other stuff (such as docstrings and further plotting runcards)

@andreab1997
Copy link
Contributor Author

Hi @andreab1997, is it fine for you if I commit to this branch? This way I can add some of the modifications I am suggesting plus other stuff (such as docstrings and further plotting runcards)

Yes fell free. Maybe first we want to rebase this on the new commondata reader, as @scarlehoff was suggesting

@comane comane force-pushed the multict_analysis branch from 20ee788 to a4eabd8 Compare March 4, 2024 17:24
@comane
Copy link
Member

comane commented Mar 4, 2024

@andreab1997, @scarlehoff I am trying to rebase on master but I get a series of conflict that I don't understand too much.
In particular, I get a conflict with multiclosure_inconsistent_output.py which should not even be in master

@comane
Copy link
Member

comane commented Mar 5, 2024

If you can do this as a branch off the new data reader that would be immensely helpful :__
and it should also work for everything other than ttbar and jet/dijets ...

I would do it but we actually need jets and ttbar

@scarlehoff Is this still an issue?

@andreab1997
I did rebase this branch on master by git cherry picking every relevant commit from here (rebasing was too messy I think).
The new branch name is 240305_multict_analysis

If the one above is not an issue please feel free to close this PR and open one for this new branch.

@scarlehoff
Copy link
Member

@scarlehoff Is this still an issue?

No, @andreab1997 fixed it. And on the process of fixing it we found out that some of the values were wrong... (I think for dijet, the value of Q had an extra square or an extra square root, I don't remember right now). It should be now fixed in master.

@comane
Copy link
Member

comane commented Mar 5, 2024

@scarlehoff Is this still an issue?

No, @andreab1997 fixed it. And on the process of fixing it we found out that some of the values were wrong... (I think for dijet, the value of Q had an extra square or an extra square root, I don't remember right now). It should be now fixed in master.

Ok, great, thanks for that!
Then I think it's better if we close this PR and use the new branch instead as that one is rebased on master.

@comane comane closed this Mar 5, 2024
@comane comane deleted the multict_analysis branch March 5, 2024 16:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants