Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protocol for extending the variables pane #560

Merged
merged 35 commits into from
Oct 23, 2024
Merged

Conversation

dfalbel
Copy link
Contributor

@dfalbel dfalbel commented Oct 2, 2024

This PR is a proposal of a mechanism that allows package authors to implement custom behavior for objects in the variables pane.

A description of how the protocol works is available here: https://github.com/posit-dev/ark/blob/daf9348affc0f1b190928a2fb0cad5bfadaf263c/doc/variables-pane-extending.md

See also #561 for the custom inspect behavior.

Motivation

The main motivation is to allow package authors to improve the display of objects for which the internal R data structure does not show useful information for users. For example, reticulate R obejcts are environments containing a flag and an external pointer (x <- reticulate::py_eval("1", convert = FALSE)):

image

Or torch tensors (x = torch::torch_randn(10, 10)) :

image

Or xml2 documents (x <- xml2::read_xml("<foo> <bar> text <baz/> </bar> </foo>")):

image

How it's implemented

Package authors can implement a set of S3 like methods, each customizing a specific part of the UI.
Methods are discovered by ark when packages are loaded and lazily stored in an environment. Package authors don't need to export these methods in their packages. See here for the description of methods.

When responding to events from the variables pane, we try to apply the custom methods and if the method works correctly we use it's value, otherwise we try to continue using the default implementation.

Methods are expllicitly implemented for Ark, so it's expected that package authors will take extra care to make they work correctly with Positron.

Comparison to RStudio

RStudio uses the utils::str S3 in a few situations to obtain the values description, see usages of ValueFromStr in eg: https://github.com/rstudio/rstudio/blob/2175496934a54e21979461a1c67eb13ad5048cf2/src/cpp/session/modules/SessionEnvironment.R#L122

There are many guardrails though, as str methods are commonly implemented and package authors might not be aware of such bugs when used from RStudio.

This comment was marked as resolved.

@t-kalinowski

This comment was marked as resolved.

Copy link
Contributor

@lionel- lionel- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems like a good idea!

Before reviewing the implementation, I'd just like to discuss a potential simplification of the extension mechanism. What if we had a single generic ps_variable_proxy() that would be in charge of converting an object to a base type representation (data frame, vector, list, environment) that Ark would be able to display in the usual way. This is consistent with how genericity works in the tidyverse via vctrs proxies.

The display type could be set to the first element of the class vector, but if that's too inconvenient we could have an additional method for it as proposed here.

Laziness for deeply/infinitely nested objects:

  • Could be supported via environments. The only downside is that environments have the restriction of uniquely named elements, but that seems reasonable for object internals since that's consistent with struct-like types?

  • Or we could allow list-proxies containing classed objects that themselves have a proxy method.

What do you think?

@dfalbel
Copy link
Contributor Author

dfalbel commented Oct 3, 2024

I like the idea of a single ps_variable_proxy.

I think I'd still want to customize the display value though, eg for a torch_tensor, I think the display value should be something like Float [1:10, 1:10], I think the closest proxy would be a length 1 character vector with that text, which would still be displayed with quotes around it:"Float [1:10, 1:10]".

I'm fine with display type = class(x)[1], we could remove display_type.

So the proxy would replace has_children(), get_children() and get_child_at().

My main concern with a single proxy implementation is performance. The way the variables pane is currenty implemented, we don't keep any state after the representations are sent to the front-end and we only get back a vector of access keys from the front-end when we want to expand a node.

Considering we implement the second option where variable_proxy() returns a list-proxy containing classed objects implementing variable_proxy. So suppose that A is an object that implements the variable_proxy() and it's children are also objects that also implement variable_proxy() (For instance, A could be a reticulate object representing a Python dictionary and B a dictionary contained in A).

When we click on B, so we get it's children, we'll receive a request from the front-end containing [A, B].
To compute the children of B, we'll need to:

  1. ps_variable_proxy(A)
  2. find B there and then
  3. ps.variable_proxy(B)
Root
└── A
    ├── B
    │   ├── B1
    │   └── B2
    └── C

If A is large, building it's variable_proxy() can be time consuming and we'll need to rebuild it for every. neste child one expands.

The main reason for having two separate get_children() and get_child() is that there might be an efficient way of finding B in A without building the entire list of possible children.

IMO this will be a rare situation though, except for reticulate, most use cases I can think would only want to show a nice display value. So maybe it's fine to let package authors guarantee that variable_proxy is fast enough, even if they need to implement some kind of caching, etc. Or maybe, we want to keep some state in the variables pane so subsequent requests don't need to recompute the children.

Let me know what you think!

@lionel-
Copy link
Contributor

lionel- commented Oct 3, 2024

hmm if performance is an issue I think we should be able to store the proxies in our current_bindings? These would only be updated if the objects change across top-level calls.

@t-kalinowski
Copy link
Member

I think we can add support for a ps_variable_proxy, which could be in addition to the methods proposed in the PR, as an alternative API. However, I would not want to require that the proxy approach is the only supported API.

The 'base-atomic-proxy' approach creates difficulties for objects that cannot be easily materialized as a base atomic. This is particularly problematic for objects that would be much larger in memory when materialized as a base atomic type, and for objects that have no suitable base-atomic equivalent.

For example, large sparse arrays or arrays with smaller dtypes pose challenges. It's not uncommon to have a single-digit GB-sized NumPy array with a 'float32' dtype. Materializing that as a 'base-atomic-proxy' would mean allocating a double-digit GB double array every time the variables pane is updated. Similarly, if a user has a 10 GB array of int8s, we'd be materializing a 40 GB R integer array with each variables pane update.

(We encounter these same limitations of the vctrs-proxy approach in other places, like rstudio/reticulate#1481)

Many objects also lack a suitable base atomic equivalent. This opens up tricky questions. For instance:

  1. Python AST nodes: import ast; n = ast.Name('x'). Would this need to be presented as a string or an R symbol?

  2. Python futures: fut = concurrent.futures.ThreadPoolExecutor().submit(fn). Would reticulate have to construct an R {coro} or {future} proxy?

  3. Lazy-loaded arrays like https://www.bioconductor.org/packages/release/bioc/html/DelayedArray.html: How would the correct "true" dimensions of such an array be communicated without materializing a full proxy?

@lionel-
Copy link
Contributor

lionel- commented Oct 3, 2024

In my mind the variable pane should never contain GBs of data and any materialised data should be truncated - to support large data we'd need a similar approach to the data viewer and then the proxy becomes a slice proxy.

I see your points regarding flexibility of data types in the context of interop. I think this aligns with Daniel's request to keep a display value extension point?

@t-kalinowski
Copy link
Member

I think this aligns with Daniel's request to keep a display value extension point?

I'm probably misunderstanding, but I don't think it does. A code example might help.

With the proxy approach, what method(s) would reticulate implement to display a NumPy array?

@dfalbel
Copy link
Contributor Author

dfalbel commented Oct 3, 2024

I think for a numpy array, reticulate would only implement display_value and a proxy method that returns anything that's scalar atomic, so the arrow \/ to list children doesn't appear for it in the variables pane (equivalent to current has_children() = FALSE).

Storing proxies in current_bindings is probably possible, but the implementation will be quite tricky, because an object implementing variable_proxy() could be any level nested into an object that is pointed by one of the current_bindings.
I can take a look at what that could look like.

@lionel-
Copy link
Contributor

lionel- commented Oct 3, 2024

@dfalbel Feel free to take a look but I think you've both convinced me that the proxy approach would not be a simplification.

@dfalbel
Copy link
Contributor Author

dfalbel commented Oct 3, 2024

Ok, I didn't get to it, but happy to experiment if you think it's worth it. I set up an example package here: https://github.com/dfalbel/testVariablesPaneExtension/tree/main/R with an example that works for R6 classes and some examples of what I'd like to make it work for torch.

Copy link
Contributor

@lionel- lionel- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really nice PR!
Approved but I think @DavisVaughan should also take a look.

I believe this is the very first API exposed from Ark to R packages (if we exclude rstudioapi support) so we should discuss what that should look like. I like the setup you've created to register methods but I wonder if:

.ps.register_ark_method("ark_variable_display_value", ...)

should have the opposite namespacing:

.ark.register_method("positron_variable_display_value", ...)
  • The variable pane is currently a positron-only feature so should be namespaced in positron_
  • I'm imagining the register_method() function will be common to both ark and positron features, so should be namespaced as such.

We probably should meet and discuss how to expose this API.

crates/ark/src/interface.rs Outdated Show resolved Hide resolved
use crate::modules::ARK_ENVS;

#[derive(Debug, PartialEq, EnumString, EnumIter, IntoStaticStr, Display, Eq, Hash, Clone)]
pub enum ArkGenerics {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be in its own module one level above?

Or renamed to ArkVariableGenerics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a preference? It feels like this could be used elsewhere, so we could move to its own module above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved one level above in ff887ff . Happy to rever if you prefer making it specific for the variables pane.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this change, it seems like that infrastructure could be useful in the future.

Comment on lines 115 to 126
fn parse_method(name: &String) -> Option<(Self, String)> {
for method in ArkGenerics::iter() {
let method_str: &str = method.clone().into();
if name.starts_with::<&str>(method_str) {
if let Some((_, class)) = name.split_once(".") {
return Some((method, class.to_string()));
}
}
}
None
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Neat!

Comment on lines 104 to 108
Rf_lang3(
r_symbol!(":::"),
r_symbol!(package),
r_symbol!(format!("{generic}.{class}")),
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can replace this with RCall which works similarly to RFunction (the latter is implemented in terms of the former).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why not pass a function to register_method()? To avoid inlining objects in R calls? If so needs a comment, the implicit evaluation is surprising.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC, the motivation for storing a call pkgname:::methodname instead of the closure object was to ensure that:

  1. Everything worked seamlessly with devtools::load_all() and similar functions (no risk of calling an outdated method from a stale cache);
  2. An escape hatch remained available for monkey-patching a method in the package namespace (e.g., to enable workarounds for misbehaving methods).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated to use RCall in aadef2b

}
}

pub fn register_method(generic: Self, class: &str, method: RObject) -> anyhow::Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to take a consuming self (as opposed to &self)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My reasoning was that a static method would make more clear that we're actually modfying some global state and not just an object state. But loooks weird indeed - fixed in 16d4ab8

crates/ark/src/variables/methods.rs Outdated Show resolved Hide resolved
T: TryFrom<RObject>,
<T as TryFrom<harp::RObject>>::Error: std::fmt::Debug,
{
if !r_is_object(x) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is rather generic infrastructure, we might want to allow registering methods for base types in the future (only us would be allowed to)?

crates/ark/src/variables/methods.rs Outdated Show resolved Hide resolved
crates/ark/src/variables/variable.rs Outdated Show resolved Hide resolved
let kind: Option<String> = ArkGenerics::VariableKind.try_dispatch(value, vec![])?;
match kind {
None => Ok(None),
// Enum reflection is not beautiful, we want to parse a VariableKind from it's
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this note.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was more like a rant :P. It's hard to go from the string representation to an enum value in rust. Here we had to create a json and then read it with serde.

@dfalbel
Copy link
Contributor Author

dfalbel commented Oct 9, 2024

Thanks @lionel-

I agree that we should have the opposite namespacing and just changed here: a21dc37

I'm happy to chat further on how to expose this API. Currently we don't necessarily need to expose .ark_register_method() as we already automatically find methods in packages, but it's nice for interactive debugging and for testing purposes.

Copy link
Contributor

@DavisVaughan DavisVaughan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review - will do more in a bit

crates/ark/src/methods.rs Outdated Show resolved Hide resolved
crates/ark/src/methods.rs Outdated Show resolved Hide resolved
crates/ark/src/modules/positron/methods.R Outdated Show resolved Hide resolved
crates/ark/src/methods.rs Outdated Show resolved Hide resolved
crates/ark/src/methods.rs Outdated Show resolved Hide resolved
crates/ark/src/methods.rs Outdated Show resolved Hide resolved
crates/ark/src/methods.rs Outdated Show resolved Hide resolved
crates/ark/src/methods.rs Outdated Show resolved Hide resolved
crates/ark/src/variables/variable.rs Outdated Show resolved Hide resolved
crates/ark/src/variables/variable.rs Outdated Show resolved Hide resolved
crates/ark/src/variables/variable.rs Outdated Show resolved Hide resolved
crates/ark/src/variables/variable.rs Outdated Show resolved Hide resolved
crates/ark/src/variables/variable.rs Show resolved Hide resolved
@dfalbel dfalbel force-pushed the feature/extend-variables branch from 96e9917 to e6e2e18 Compare October 17, 2024 19:26
@dfalbel
Copy link
Contributor Author

dfalbel commented Oct 18, 2024

As discussed, I have updated the PR with:

  • Removing auto-registration in favor of only explicit method registration.
  • Added an allowed list of packages, so we can test the API for some time before making it really public.

I have torch PR implementing how packages should explicitly register methods: mlverse/torch@faa11e4

@dfalbel dfalbel merged commit 4bb6ca9 into main Oct 23, 2024
6 checks passed
@dfalbel dfalbel deleted the feature/extend-variables branch October 23, 2024 12:20
@github-actions github-actions bot locked and limited conversation to collaborators Oct 23, 2024
@t-kalinowski
Copy link
Member

t-kalinowski commented Nov 11, 2024

@lionel- @DavisVaughan Can I ask why

  • automatic discovery of methods in packages was removed,
  • and only packages on the explicit allow list are allowed to register methods

I’d like to implement methods not only in reticulate but also in keras and tensorflow. The recent changes before merge feel like unnecessary added complexity, inconvenience, and communication overhead, for reasons I don’t fully understand.

@lionel-
Copy link
Contributor

lionel- commented Nov 14, 2024

(1) is a choice for explicitness. This API seems sufficiently specific that magic and extra convenience did not feel warranted.
(2) is about not committing to any public interface in the short term until the mlverse had time to experiment with the API.

@t-kalinowski
Copy link
Member

That makes sense, thanks.

@wesm
Copy link
Contributor

wesm commented Dec 3, 2024

I was reading through this since it was mentioned in posit-dev/positron#5573.

A side comment: I wanted to mention that in https://github.com/posit-dev/ark/blob/main/doc/variables-pane-extending.md, Positron is not mentioned at all. Ark contains a lot of code that powers features in Positron but not easy to use elsewhere. Since Ark is intended to be usable outside of Positron, it may make sense to more clearly separate out the Positron-specific code and documentation so that people don't get the wrong idea that Ark is only relevant to Positron.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants