NEW: peds-simulation action #58

Merged: 55 commits merged into qiime2:dev, Aug 14, 2024
Conversation

@cherman2 (Contributor) commented May 11, 2023

This PR adds peds-bootstrap functionality.

Bootstrapping PEDS shuffles the donors and tests whether the distribution of PEDS values from the real donors is higher than the distribution from the shuffled (fake) donors. A sketch of this idea follows the TODO list below.

TODO:

  • Implement an output for peds-bootstrap
  • Implement a visualization for peds-bootstrap
  • Add unit tests
  • Add a usage example
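
A minimal sketch of the shuffle-and-compare idea described above, assuming a compute_peds callable that returns a summary statistic (e.g. a median PEDS value) for a set of recipient/donor pairings; none of these names come from the plugin itself:

```python
import numpy as np


def shuffle_test(recipients, donors, compute_peds, replicates=999, seed=42):
    rng = np.random.default_rng(seed)
    # Statistic for the real (matched) recipient/donor pairings.
    real = compute_peds(recipients, donors)
    mismatched = np.empty(replicates)
    for i in range(replicates):
        # Shuffle the donor assignments to create "fake" pairings and
        # build up the null distribution.
        mismatched[i] = compute_peds(recipients, rng.permutation(donors))
    # One-sided p-value: how often do fake donors score at least as
    # high as the real donors? (+1 smoothing avoids p == 0.)
    p = (np.sum(mismatched >= real) + 1) / (replicates + 1)
    return real, p
```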

@cherman2 marked this pull request as draft May 11, 2023 23:22
@cherman2 marked this pull request as ready for review May 14, 2024 18:40
@cherman2 added this to the Manuscript Prep milestone May 23, 2024
@gregcaporaso (Member) left a comment:

@cherman2, this is looking good. I have a bunch of comments, but I don't think any should take too long. Maybe when I'm back from my trip and you've had a chance to work on this, we can sit together for a final review and get it all merged? I bet we can do that in a couple of hours when the time comes.

Resolved review threads on q2_fmt/plugin_setup.py (two) and q2_fmt/_peds.py (one, all outdated).

q2_fmt/_peds.py (outdated):
```python
            drop_incomplete_timepoint=drop_incomplete_timepoint)
        real_temp = peds["measure"]
    else:
        shifted_list = recipient[reference_column].sample(frac=1).to_list()
```
@gregcaporaso (Member):

I recommend calling this "shuffled" list, as "shifted" would imply something different. The operation we're carrying out here is like shuffling a deck of cards, not (for example) shifting every item's index in a list by adding one to it.

@gregcaporaso (Member):

And would it work to bypass creating this new list altogether? For example, this SO post shows how to do this operation in place. It looks like you don't use the metadata after this, so you could keep shuffling it over and over.

@cherman2 (Contributor, Author):

Re: not creating the list. Those examples don't quite do what I am doing, because those shuffled lists keep the ID and value tied together, so they are really just shuffling the rows in the DataFrame.

The list allows me to keep the sample IDs (the index) the same while shuffling the reference list. I will look into a simpler way to do this.
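
A toy illustration of the distinction (the column and index names are made up): `df.sample(frac=1)` shuffles whole rows, keeping each sample ID tied to its reference, while writing a shuffled list back into one column keeps the index fixed and breaks the ID-value pairing, which is what the mismatched-PEDS simulation needs:

```python
import pandas as pd

df = pd.DataFrame({'reference': ['donor1', 'donor2', 'donor3']},
                  index=['sample1', 'sample2', 'sample3'])

# Row shuffle: IDs travel with their values, so pairings survive.
rows_shuffled = df.sample(frac=1, random_state=0)

# Column shuffle: the index stays put and only the values move.
df['reference'] = df['reference'].sample(frac=1, random_state=0).to_list()
```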

Resolved review threads on q2_fmt/_peds.py (three) and q2_fmt/tests/test_engraftment.py (one, all outdated).

q2_fmt/tests/test_engraftment.py:
```python
        Fs1 = feature_peds_df.set_index("id").at['Feature 1 __',
                                                 'subject']
        Fs2 = feature_peds_df.set_index("id").at['Feature 2',
                                                 'subject']
        self.assertEqual("1", Fs1)
        self.assertEqual("2", Fs2)


class TestBoot(TestBase):
```
@gregcaporaso (Member):

Are there relevant boundary conditions that should be tested, such as a single donor, a single recipient, PEDS=1, PEDS=0, ...?
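
For illustration only, boundary tests might look like the sketch below, written against a toy stand-in for the PEDS computation (the plugin's real API isn't shown in this thread):

```python
import unittest

import numpy as np


def toy_peds(donor_features, recipient_features):
    # Stand-in: fraction of donor features also present in the recipient.
    donor = np.asarray(donor_features, dtype=bool)
    recip = np.asarray(recipient_features, dtype=bool)
    return (donor & recip).sum() / donor.sum()


class TestBoundaries(unittest.TestCase):
    def test_peds_is_one_when_every_donor_feature_engrafts(self):
        self.assertEqual(toy_peds([1, 1, 0], [1, 1, 1]), 1.0)

    def test_peds_is_zero_when_no_donor_feature_engrafts(self):
        self.assertEqual(toy_peds([1, 1, 0], [0, 0, 1]), 0.0)
```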

```python
            bootstrap_replicates=999)
        real_median = stats["A:measure"].values
        fake_median = stats["B:measure"].values
        self.assertGreater(real_median, fake_median)
```
@gregcaporaso (Member):

Do these tests pass all the time, or do you see intermittent failures? Ideally failures won't ever occur; if there are intermittent failures, they should at least be really rare.

@cherman2 (Contributor, Author) commented Jul 11, 2024:

I have not experienced intermittent failures. Did you? I ran the tests 20 times and all of them passed.
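
(As an aside, and not necessarily something this PR needs: fixing the generator's seed is a common way to make a stochastic test fully deterministic.)

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed: every test run sees
shuffled = rng.permutation(10)   # exactly the same shuffles
```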

@cherman2 (Contributor, Author) commented Jul 8, 2024

I am still working on this PR; I just wanted to jot down some notes in the interim:

  • I am going to change the name of this to peds-simulation
  • I am going to replace bootstrap-replicates with replicates
  • I am going to add a method description explaining that this is a Monte Carlo simulation and that it can serve as a rough replacement for a baseline comparison

```python
        'subject_column': T_subject,
        'filter_missing_references': Bool,
        'drop_incomplete_subjects': Bool,
        'drop_incomplete_timepoint': List[Str],
```
@gregcaporaso (Member):

The List[Str] type doesn't align with the description above; reminder to sort that out.

```python
        np.testing.assert_array_equal(recip_mask, exp_r_mask)

    def test_simulate_uniform_distro(self):
        mismatch_peds = [0, 0, 0, 0, 0, 0]
```
@gregcaporaso (Member):

I recommend adding some diversity to this test. For example, make mismatch_peds = [1, 2, 3], and then confirm that you see each of those values at least once in the results (intermittent failure is possible, so it's worth mentioning that in a comment, noting that it should be extremely rare). A sketch of that test follows below.
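
A sketch of that suggested test; rng.choice here stands in for the plugin's actual simulation of mismatched PEDS values:

```python
import unittest

import numpy as np


class TestSimulationDiversity(unittest.TestCase):
    def test_simulation_sees_every_value(self):
        mismatch_peds = [1, 2, 3]
        rng = np.random.default_rng(0)
        draws = rng.choice(mismatch_peds, size=999)
        for value in mismatch_peds:
            # Intermittent failure is possible in principle (each value
            # is absent with probability (2/3)**999) but extremely rare.
            self.assertIn(value, draws)
```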

@cherman2 assigned cherman2 and unassigned gregcaporaso Jul 31, 2024
@cherman2 (Contributor, Author) commented Aug 1, 2024

Back to you, @gregcaporaso

Thanks for all the feedback!

@cherman2 assigned gregcaporaso and unassigned cherman2 Aug 1, 2024
@cherman2 (Contributor, Author) commented Aug 8, 2024

Fixing merge conflicts from #88.

q2_fmt/_peds.py (outdated):

```python
    # Transforming to a Series for easy series math in _per_subject_stats()
    peds_iters = pd.Series(peds_iters)
    return peds_iters


def _per_subject_stats(mismatched_peds, actual_temp,
                       iterations, mismatched_pairs_n):
    def peds_sim_stats(value, peds_iters, num_iterations):
```
@gregcaporaso (Member):

Instead of passing in num_iterations, compute it internally here as len(peds_iters), to avoid a caller-provided num_iterations that doesn't match the number of values in peds_iters.
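
A sketch of that suggestion; the p-value-style statistic in the body is an assumption for illustration, not the PR's actual math:

```python
import numpy as np


def peds_sim_stats(value, peds_iters):
    # Derive the count from the data itself so it can never disagree
    # with the number of simulated values.
    peds_iters = np.asarray(peds_iters)
    num_iterations = len(peds_iters)
    # Illustrative one-sided p-value: fraction of simulated values at
    # least as large as the observed one, with +1 smoothing.
    count_ge = int(np.sum(peds_iters >= value))
    return (count_ge + 1) / (num_iterations + 1)
```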

@gregcaporaso (Member) left a comment:

I have one little comment that should be addressed before merge, but after that this should be good to go.

@cherman2, just as a reminder, you should write up notes from our discussion about how dependence between samples and comparisons of baseline samples to (their actual) donor samples can both make the test overly conservative. This will be good to document along with instructions on how the user could address this if they want to (but it requires large recipient numbers, which is why we're not doing it by default).

@cherman2 (Contributor, Author) commented Aug 9, 2024

There may be slight dependence issues in this method. @ebolyen, I would love your feedback on this.
From my understanding, dependence issues will make the test more conservative, which is less concerning (I think).
This data is longitudinal, and I think there are two issues here:

  1. I am not sure if it's an issue to have a sample from the same subject in the mismatched-PEDS values. I think this is similar to the fact that a real sample-donor pair can be compared to the same sample with a fake donor in mismatched PEDS, e.g. sample1-donor1 vs. sample1-donor2. This may be an issue if the user has maintenance donors that differ between subject samples, like: subject1sample1 has donor1, subject1sample2 has donor2. subject1sample2 could be randomized to have donor1 in mismatched PEDS and could have a relatively high overlap, because donor1 did donate to subject1, just not at that sampling.

  2. Baseline timepoints are currently included as real PEDS values but are expected to have low PEDS values because no transfer has occurred. This would be a negative control and might identify whether the selected donor has a microbiome composition similar to their recipient's (some studies are interested in that), but it could make my global p-value less significant than it should be. However, most studies should have enough post-FMT recipients to balance this out.

We have thought of a couple of solutions:

  1. This could be a per-timepoint method, but that really lowers our comparison N.
  2. We can document these possible issues that would cause conservative test statistics and let users decide how to filter (like filtering out their baseline or filtering to one timepoint); see the sketch after this list.
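
A hypothetical illustration of the user-side filtering in option 2; the column name and the convention that timepoint 0 is baseline are assumptions, not the plugin's actual metadata schema:

```python
import pandas as pd

md = pd.DataFrame(
    {'timepoint': [0, 1, 2, 0, 1, 2]},
    index=pd.Index(['s1', 's2', 's3', 's4', 's5', 's6'], name='sample-id'))

post_fmt = md[md['timepoint'] > 0]    # drop baseline (pre-FMT) samples
single_tp = md[md['timepoint'] == 1]  # or restrict to one timepoint
```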

@cherman2 (Contributor, Author) commented Aug 14, 2024

Evan and I discussed offline and decided that this is a limitation of the method, and we will document it accordingly. But, as discussed above, these limitations make the method more conservative, so these issues are less concerning.

@gregcaporaso (Member) left a comment:

Just two test edits and this is good to go.

Resolved review threads on q2_fmt/plugin_setup.py (two, outdated).
cherman2 and others added 2 commits August 14, 2024 14:12
Co-authored-by: Greg Caporaso <[email protected]>
Co-authored-by: Greg Caporaso <[email protected]>
@gregcaporaso merged commit 0335d7b into qiime2:dev Aug 14, 2024
4 checks passed
@lizgehret assigned cherman2 and unassigned gregcaporaso Aug 15, 2024
@cherman2 removed their assignment Sep 5, 2024