Issues with function input names not matching the column name #1206

Open
1 of 2 tasks
xzdandy opened this issue Sep 24, 2023 · 0 comments · May be fixed by #1227

xzdandy commented Sep 24, 2023

Search before asking

  • I have searched the EvaDB issues and found no similar bug report.

Bug

The function is defined as follows:

import pandas as pd
from evadb.catalog.catalog_type import ColumnType
from evadb.functions.abstract.abstract_function import AbstractFunction
from evadb.functions.decorators.decorators import forward, setup
from evadb.functions.decorators.io_descriptors.data_types import PandasDataframe

class Chunk(AbstractFunction):
    """
    Arguments:
        None

    Input Signatures:
        input_dataframe (DataFrame) : A DataFrame containing a column of strings.

    Output Signatures:
        output_dataframe (DataFrame) : A DataFrame containing chunks of strings.

    Example Usage:
        You can use this function to concatenate strings in a DataFrame and split them into chunks.
    """

    @property
    def name(self) -> str:
        return "Chunk"

    @setup(cacheable=False)
    def setup(self) -> None:
        # Any setup or initialization can be done here if needed
        pass

    @forward(
        input_signatures=[
            PandasDataframe(
                columns=["input_string"],
                column_types=[ColumnType.TEXT],
                column_shapes=[(None,)],
            )
        ],
        output_signatures=[
            PandasDataframe(
                columns=["chunks"],
                column_types=[ColumnType.TEXT],
                column_shapes=[(None,)],
            )
        ],
    )
    def forward(self, input_dataframe):
        # Ensure input is provided
        if input_dataframe.empty:
            raise ValueError("Input DataFrame must not be empty.")

        # Define the maximum chunk size (the slicing below counts characters, not tokens)
        max_tokens_per_chunk = 16000  # Adjust this value as needed

        # Initialize lists for the output DataFrame
        output_strings = []

        # Iterate over rows of the input DataFrame
        for _, row in input_dataframe.iterrows():
            input_string = row["input_string"]

            # Split the input string into chunks of maximum tokens
            chunks = [input_string[i:i + max_tokens_per_chunk] for i in range(0, len(input_string), max_tokens_per_chunk)]

            output_strings.extend(chunks)

        # Create a DataFrame with the output strings
        output_dataframe = pd.DataFrame({"chunks": output_strings})

        return output_dataframe

The row["input_string"] lookup fails when the input DataFrame comes from the SlackCSV table, whose column is named text rather than input_string. We get the following error message:

KeyError: 'input_string'
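
One possible workaround, sketched here outside EvaDB with plain pandas, is to read the input column by position instead of by the name declared in the signature. The helper name chunk_first_column and its max_chars parameter are hypothetical and used only for illustration. The sketch assumes the function receives a single-column DataFrame whose column keeps the table's name (text here); it is not the fix proposed in the linked pull request.

```python
import pandas as pd

MAX_CHARS_PER_CHUNK = 16000  # same value as in the report; counts characters, not tokens


def chunk_first_column(
    input_dataframe: pd.DataFrame, max_chars: int = MAX_CHARS_PER_CHUNK
) -> pd.DataFrame:
    """Hypothetical helper: chunk the strings in the first column, whatever its name."""
    if input_dataframe.empty:
        raise ValueError("Input DataFrame must not be empty.")

    output_strings = []
    # iloc[:, 0] selects the first column by position, so the lookup does not
    # depend on the column being called "input_string".
    for input_string in input_dataframe.iloc[:, 0]:
        output_strings.extend(
            input_string[i:i + max_chars]
            for i in range(0, len(input_string), max_chars)
        )
    return pd.DataFrame({"chunks": output_strings})


# A frame shaped like the SlackCSV input, with a column named "text".
slack_like = pd.DataFrame({"text": ["abcdefgh", "ij"]})
result = chunk_first_column(slack_like, max_chars=3)
print(result["chunks"].tolist())
```

Positional access trades the KeyError for an implicit assumption about column order, so it only suits functions that take exactly one input column.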

Environment

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!