
Pre-proposal: standardize object representations for ai and a protocol to retrieve them #128

Open
mlucool opened this issue Dec 6, 2024 · 13 comments

@mlucool
Contributor

mlucool commented Dec 6, 2024

Summary

To deeply integrate AI into Jupyter, we should standardize both a method on objects to represent themselves and a messaging protocol for retrieving these representations. We propose using _ai_repr_(**kwargs) -> str | dict for objects to return representations. Additionally, we suggest creating a registry in kernels (e.g. IPython) for users to set representations for objects that do not define this method, along with a new message type for retrieving these representations.

Motivation

Users should be able to include representations of instances of objects from their kernel as they interact with AI. This capability is what sets a productive Jupyter experience apart from other IDE-based approaches. For example, you should be able to use Jupyter AI and ask "Given @myvar, how do I add another row?" or "What's the best way to do X with @myvar?"

While using something like _repr_* may have been sufficient, it can slow down display requests and does not allow passing information to hint at the desired shape of the representation. For example, imagine a Chart. In a multimodal model, we may want to use a rendered image, but in a text-only model, we may want to pass only a description. Other model parameters or user preferences may also matter, such as the size of the context window or how verbose the user wants the representation to be.

Because of this, we suggest defining a new standard called _ai_repr_(**kwargs) -> str | dict. This method should return either a string or a MIME bundle. Additionally, since many libraries will not have this defined initially, there should be a registry where users can create a set of defaults and/or overrides, allowing them to use this feature without waiting for libraries to define it themselves.

Finally, the UI (e.g., jupyter-ai) needs a way to retrieve these representations for a given object. This is best done by introducing a new message type that can include the object and the kwargs. We expect this process to be slow at times (e.g., generating an image for a chart), so the control channel should not be used. Instead, a normal comms message can be used today, and as support for subshells improves, we can use that to avoid blocking while kernels are busy.
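
To make the message-type idea concrete, here is a rough sketch of what a request/reply pair could look like, mirroring the shape of inspect_request. The message type names, field names, and kwargs below are illustrative assumptions, not a finalized spec:

```python
# Hypothetical wire format for retrieving an AI representation of a kernel
# object. "ai_repr_request"/"ai_repr_reply" and their fields are assumptions
# for illustration only.
ai_repr_request = {
    "header": {"msg_type": "ai_repr_request"},
    "content": {
        "name": "myvar",  # name of the object in the kernel namespace
        # Free-form kwargs forwarded to _ai_repr_(**kwargs); intentionally
        # left unspecified at this early stage.
        "kwargs": {"multimodal": False, "context_window": 1_000_000},
    },
}

ai_repr_reply = {
    "header": {"msg_type": "ai_repr_reply"},
    "content": {
        # A MIME bundle, mirroring display_data.
        "data": {"text/plain": "A chart titled Sales with series named ['q1', 'q2']"},
    },
}
```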

Example

Continuing with the chart object example, we may want to add something like the below. Typically, this fictional chart returns structured data for its JS display to render, but here we want an image for the context, which we expect to be slow to compute (e.g., a headless browser may need to be launched):

class JSBasedChart:
    ...

    def _ai_repr_(self, **kwargs):
        return {
            "text/plain": f"A chart titled {self.title} with series named {self.series_names}",
            "image/png": self.get_image()
        }

Other MIME types can also be used to enable the caller to represent the object in a way optimized for the model being used (e.g., XML). For example, we could imagine pandas' DataFrame defining this method:

class DataFrame:
    ...
    def _ai_repr_(self, **kwargs):
        info_buf = io.StringIO()
        self.info(buf=info_buf, memory_usage=False, show_counts=False)

        return {
            "text/plain": self.to_string(),
            "application/foo": {
                "type": "pandas.DataFrame",
                "value": f"Some random rows from the dataframe:\n{self.sample(min(5, len(self)))}",
                "structure": info_buf.getvalue()
            }
        }

Now the caller can use this MIME type to render the object in the context window using XML if it chooses:

<variable>
    <name>{name}</name>
    <type>{type}</type>
    <value>{value}</value>
    <structure>{structure}</structure>
</variable>
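
As an illustration of how a caller might fill this template from an _ai_repr_ MIME bundle (the helper name and the application/foo handling are assumptions, not part of the proposal):

```python
def render_variable_xml(name, obj_type, bundle):
    """Render a MIME bundle from _ai_repr_ into the XML template above.

    Hypothetical consumer-side helper; falls back to text/plain when the
    richer (assumed) application/foo payload is absent.
    """
    payload = bundle.get("application/foo", {})
    value = payload.get("value", bundle.get("text/plain", ""))
    structure = payload.get("structure", "")
    return (
        "<variable>\n"
        f"    <name>{name}</name>\n"
        f"    <type>{obj_type}</type>\n"
        f"    <value>{value}</value>\n"
        f"    <structure>{structure}</structure>\n"
        "</variable>"
    )

xml = render_variable_xml(
    "df",
    "pandas.DataFrame",
    {"application/foo": {"value": "rows...", "structure": "3 columns"}},
)
```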

This approach intentionally mirrors how repr works in the Jupyter ecosystem, but it is focused on non-displayed reprs. In a similar fashion, we don't want to over-specify return types, because we want to encourage innovation in this area.

Given the desire to query for this from the front end, we also propose a new message type similar to inspect_request, but allowing kwargs to be passed in by the caller. We intentionally do not want to define what these kwargs are at this early stage, preferring to let extension providers innovate and reach a consensus on what is useful. In the example above, we may pass multimodal=False and update the code in JSBasedChart to not render an image or we may pass context_window=1_000_000 and let the DataFrame repr include statistics per column or maybe even put small tables into the context window as is.

CC @Carreau @krassowski @SylvainCorlay

@SylvainCorlay
Member

Thank you for the detailed pre-proposal @mlucool.

I love the idea of Jupyter-AI and other LLM-powered tools being able to utilize the current state of the kernel in a standardized manner rather than merely using the content of the documents at hand.

This approach resonates with the preference many users had for IPython's autocompletion, which offered more dynamic and context-aware suggestions compared to tools that relied purely on static analysis. By utilizing runtime information, we can achieve more accurate and relevant results.

@Carreau
Member

Carreau commented Dec 9, 2024

From the investigation I did for @mlucool, I think it is possible (with some tweaks) to reuse the IPython display formatter; that would not only allow defining _ai_repr_, but also registering hooks for objects whose source we do not control, which I think is important in order to fill the gap while projects start to adopt this.

  • _ai_repr_(**kwargs) -> str | dict: I think it should always -> dict; it's easier to add | str later than to remove it.
  • We likely want something that is usable outside of Jupyter (or with minimal dependencies); I'm wondering if IPython is already too much.
  • We need to think about how to both have global and per-call configurations:
In [1]: from IPython.core.formatters import BaseFormatter, FormatterABC
   ...:
   ...:
   ...: class LLMFormatter(BaseFormatter):
   ...:     format_type = "x-vendor/llm"
   ...:     print_method = "_ai_repr_"
   ...:     _return_type = dict
   ...:
   ...:
   ...: llm_formatter = LLMFormatter()
   ...:
   ...:
   ...: class Foo:
   ...:     def _ai_repr_(self, *args, **kwargs):
   ...:         return {"text/plain": "this is foo"}
   ...:
   ...:
   ...: llm_formatter(Foo())
Out[1]: {'text/plain': 'this is foo'}

We can also register formatter for external objects:

In [2]: llm_formatter.for_type(int, lambda x:{'text/plain':'this is the integer %s' % x})

In [3]: llm_formatter(1)
Out[3]: {'text/plain': 'this is the integer 1'}

I think as a first step I can:

  • add a default get_ipython().llm_formatter instance set up as above;
  • modify its __call__ to pass all extra *args and **kwargs through to _ai_repr_.

I'm suggesting we create a package called ai_repr with:

from ai_repr import formatter # or ai_formatter 

That basically exposes the same functionality outside of IPython, and has a way of registering formatters via entry points, so that you can, for example, pip install pandas-ai-repr and have it work out of the box.
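
A minimal sketch of what such a standalone registry could look like (the class and function names are placeholders; the real package would presumably also discover hooks via entry points):

```python
class AIReprFormatter:
    """IPython-free stand-in for the formatter idea above (illustrative)."""

    def __init__(self):
        self._type_hooks = {}

    def for_type(self, typ, func):
        # Register a hook for objects whose source we do not control.
        self._type_hooks[typ] = func

    def __call__(self, obj, **kwargs):
        # Prefer an explicitly registered hook, then the object's own
        # _ai_repr_, then a plain-repr fallback.
        hook = self._type_hooks.get(type(obj))
        if hook is not None:
            return hook(obj, **kwargs)
        if hasattr(obj, "_ai_repr_"):
            return obj._ai_repr_(**kwargs)
        return {"text/plain": repr(obj)}


formatter = AIReprFormatter()
formatter.for_type(int, lambda x, **kw: {"text/plain": f"this is the integer {x}"})
```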

@mlucool
Contributor Author

mlucool commented Dec 10, 2024

I'm open to whatever will spread this the fastest across project owners.

Curious if you have thoughts @dlqqq or @Zsailer given your ownership of GAI projects in this space.

@dlqqq

dlqqq commented Dec 11, 2024

@mlucool Thank you for opening this JEP! I'm really excited to see others driving thought leadership on how we can improve Jupyter AI & other AI extensions further. I need to head out now for a personal matter, so I will give this a thorough review tomorrow morning.


@3coins

3coins commented Dec 11, 2024

Excited to see this being discussed. A past colleague of mine recently asked about these capabilities in Jupyter AI and what it would take to integrate them.

Want to PoC how close I can get Jupyter AI to work like Pandas AI.

@mlucool
Contributor Author

mlucool commented Dec 11, 2024

We'll make a PR that includes this in Jupyter AI as part of a larger PR that demos some things we think would be good to discuss with the community in the near future.

@dlqqq

dlqqq commented Dec 11, 2024

@mlucool Thank you for opening this again! I've reviewed this and have left some recommendations & initial thoughts below.

Explore only returning MIME bundles

It may be better to only return Dict[str, Any], even if the object only has a string representation. This adds a requirement that implementations must always provide a MIME type. To see why this may be useful, consider the two classes:

class RawXmlTree:
    """Stores an arbitrary XML tree."""
    ...

class RawHtmlTree:
    """Stores an arbitrary HTML tree."""
    ...

Both of these classes have string representations, but the strings may be so similar in structure that the language model can't infer the format of the string. By requiring the implementation to always return a Dict[str, Any], we can ensure that the language model always receives context about the format of a string representation via its stated MIME type.

Using union return types (str | Dict[str, Any]) also adds a minor amount of developer overhead, since every consumer will have to check the type of the returned value before doing anything with the returned representation.
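
For example, every consumer of a str | Dict[str, Any] return type would need a normalization branch like this (illustrative sketch; the function name is made up):

```python
from typing import Any, Dict, Union


def normalize_repr(result: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    # The overhead the union return type imposes on every consumer:
    # bare strings must be wrapped into a MIME bundle before use.
    if isinstance(result, str):
        return {"text/plain": result}
    return result
```

Requiring Dict[str, Any] everywhere removes this branch from every call site.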

Explore preferring a functional interface

One issue that may impede adoption of this proposal is that we don't have control over packages outside of Project Jupyter. If a project doesn't wish to implement this JEP, then we wouldn't have a way of computing an AI representation for its classes, since they will lack the _ai_repr_() method.

We could subclass each of those classes or implement meta-programming to add the methods immediately on import. However, these approaches add difficulty to the implementation, which may also impede adoption.

The most basic implementation of this proposal can be described by a single, top-level function that provides the API:

def compute_ai_repr(obj: Any) -> Dict[str, Any]:
    ...

By defining the top-level API as a single function that takes one object instead of multiple methods on every object, we can provide AI representations for objects from packages outside of Project Jupyter, without requiring an upstream change.

Objects may still define an _ai_repr_() for convenience. The definition of the top-level API can be easily adapted to suit this:

def compute_ai_repr(obj: Any) -> Dict[str, Any]:
    if hasattr(obj, '_ai_repr_'):
        return obj._ai_repr_()
    
    ...

A functional top-level API provides benefits for both implementers & consumers. I believe we should consider this when discussing implementation ideas.

Explore including implementation guidance

The JEP should also include guidance on implementing AI representations. This will drive adoption by improving consistency across implementations & by making it easier for future contributors to write new implementations. As we experiment further, we should think deeply about what guidance we should provide to implementers.

For example, here are some rough guidelines that I think are worth considering as we experiment further:

  • Implementations must provide at least one string representation under text/plain.
  • AI representations of primitive objects should return their values literally. If @foo references a string, float, or bool, then we should include the value of foo literally.
  • AI representations of fixed-length structs (dataclasses, Pydantic models, etc.) may return the entire contents serialized as a JSON dictionary.
  • AI representations of dynamic-length objects (arbitrary lists, dicts, sets) should be size-dependent.
    • If the object is "small" in size, the AI representation may include the entire contents. We will need to experiment more to find the right size metric to use to distinguish this case.
    • Otherwise, the AI representation should produce a natural language summary of its structure. e.g. "A pandas dataframe with X rows and 3 columns, whose column names are (..., ..., ...) respectively."
  • AI representations should be deterministic, i.e. they should always be identical for the same object with the same internal state. If we want to include a "preview" of rows in a dataframe, then we should do this deterministically, e.g. by always showing the first 5 rows.
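
As a rough illustration of the size-dependent and determinism guidelines above (the threshold value and summary wording are assumptions to be refined by experimentation):

```python
# Hypothetical cutoff; the right size metric needs experimentation.
SMALL_THRESHOLD = 20


def list_ai_repr(value, **kwargs):
    """Sketch of a size-dependent AI representation for a list."""
    if len(value) <= SMALL_THRESHOLD:
        # "Small": include the entire contents literally.
        return {"text/plain": repr(value)}
    # Otherwise, summarize the structure deterministically: a fixed-size
    # preview of the first elements rather than a random sample.
    element_types = sorted({type(v).__name__ for v in value})
    return {
        "text/plain": (
            f"A list of {len(value)} elements with element types "
            f"{element_types}; first 5 elements: {value[:5]!r}"
        )
    }
```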

@dlqqq

dlqqq commented Dec 11, 2024

@mlucool

We'll make a PR that includes this in Jupyter AI as part a larger PR that demos some things we think would be good to discuss with the community in the near future.

I'm really excited to see a proof of concept for this, and would be happy to review your PR! Jupyter AI has support for context commands, which take the syntax @<context-provider>:<argument>. We have an @file context command, which allows you to include a file's text content with your prompt via @file:<file-name>. Perhaps your team could implement something like @var:<var-name> to get AI representations of local variables in the notebook?

This source file may be helpful: https://github.com/jupyterlab/jupyter-ai/blob/main/packages/jupyter-ai/jupyter_ai/context_providers/file.py

Note that I will be out of office from tomorrow to Mon Dec 16, so I will only be able to review your PR after that. 👋

@mlucool
Contributor Author

mlucool commented Dec 13, 2024

Thanks for the reply!

Looks like there is consensus for Dict[str, Any] so we should go with that.

A functional top-level API provides benefits for both implementers & consumers. I believe we should consider this when discussing implementation ideas.

I think this is somewhat close to what @Carreau proposed with a package and a registry that handles more of this. While I agree that the single function is enough, I'm not sure it adds anything practically here, but maybe I'm missing something.

The JEP should also include guidance on implementing AI representations

I am hesitant to be too prescriptive here. I think it's pretty unknown at this point, and I would prefer not to have any "must" describing the output. Even guidance like "must be deterministic" feels extreme. What if you asked an LLM to turn your large repr into something small automatically? Should that be an antipattern if it turns out to be very effective?

I'm really excited to see a proof of concept for this, and would be happy to review your PR!

As an FYI, the proof of concept is not limited to just this feature and is meant as a discussion point on AI in Jupyter in general. The PR is likely too big to be merged, so we wanted to get feedback on key parts (one of which is this concept).

@Carreau Carreau self-assigned this Dec 17, 2024
@echarles
Member

_ai_repr_(**kwargs) -> str | dict

What about adding bytes as a potential output? str | dict | bytes
That could be useful for heavy data structures being serialized as bytes.

@mlucool
Contributor Author

mlucool commented Dec 18, 2024

What about adding bytes as potential output? str | dict | bytes

As noted by others above, I agree that Dict[str, Any] is better, to reduce the load on downstream applications of figuring out how to process things. This doesn't preclude anyone from using bytes as a value, but it keeps a key-value pair as the only shape consumers need to handle.

@mlucool
Contributor Author

mlucool commented Dec 18, 2024

@govinda18 made a PR to demo this and many other features in Jupyter AI: jupyterlab/jupyter-ai#1157. The screencast should give readers a pretty good sense of its power - it loads data from a CSV and knows column names, all without the user needing to be explicit. Really, I encourage people to try the PR themselves and get a sense of how it feels.

As noted, that is a large PR, so we can further discuss other features there (note: it uses the previous internal idea of just calling it __llm__ over the proposed _llm_repr_).

As briefly discussed with @krassowski and @Carreau, we'll make a JEP focusing on the protocol change, which is clearly within Jupyter's purview. The other mechanics we can be a bit more agnostic about, and they are not clearly something we need a JEP for.
