Add 'convert' command for GenAIScript processing (#953)
* feat: add 'convert' command for GenAIScript processing ♻️

* feat: ➿ Add support for concurrent file conversions

* ci: 🧪 add model version and error logging

* feat: ✨ add rewrite and cancel-word options to CLI

* docs: ✏️ add convert command documentation

* fix: 🐛 refine file processing logic and instructions

* refactor: update text assignment logic 💡

* refactor: ♻️ Remove unnecessary file fetching logic

* feat: ✨ add --cancel-word option to CLI convert command

* fix: handle cancelled files in conversion process 🛑
pelikhan authored Dec 17, 2024
1 parent e0aafe1 commit ecd5473
Showing 16 changed files with 417 additions and 122 deletions.
9 changes: 6 additions & 3 deletions .github/workflows/ollama.yml
@@ -31,7 +31,10 @@ jobs:
- name: start ollama
run: yarn ollama:start
- name: run summarize-ollama-phi3
run: yarn test:summarize --model ollama:phi3.5 --out ./temp/summarize-ollama-phi3
run: yarn test:summarize --model ollama:phi3.5:latest --out ./temp/summarize-ollama-phi3
env:
OLLAMA_HOST: "http://localhost:11434"

OLLAMA_HOST: "http://localhost:11434"
- name: run convert-ollama-phi3
run: yarn cli convert summarize --model ollama:phi3.5:latest "packages/sample/src/rag/*.md" --cache-name sum
env:
OLLAMA_HOST: "http://localhost:11434"
1 change: 1 addition & 0 deletions .gitignore
@@ -31,3 +31,4 @@ packages/core/*.temp.*
packages/sample/test.txt
packages/sample/poems/*.txt
packages/sample/src/rag/markdown.md.txt
*.genai.md
39 changes: 38 additions & 1 deletion docs/src/content/docs/reference/cli/commands.md
@@ -49,7 +49,7 @@ Options:
-mtc, --max-tool-calls <number> maximum tool calls for the run
-se, --seed <number> seed for the run
-em, --embeddings-model <string> embeddings model for the run
--cache enable LLM result cache
-c, --cache enable LLM result cache
-cn, --cache-name <name> custom cache file name
-cs, --csv-separator <string> csv separator (default: "\t")
-ff, --fence-format <string> fence format (choices: "xml", "markdown", "none")
@@ -129,6 +129,43 @@ Options:
-h, --help display help for command
```

## `convert`

```
Usage: genaiscript convert [options] <script> [files...]
Converts file through a GenAIScript. Each file is processed separately through
the GenAIScript and the LLM output is saved to a <filename>.genai.md (or custom
suffix).
Options:
-s, --suffix <string> suffix for converted files (default:
".genai.md")
-rw, --rewrite rewrite input file with output
-cw, --cancel-word <string> cancel word which allows the LLM to notify
to ignore output
-ef, --excluded-files <string...> excluded files
-egi, --exclude-git-ignore exclude files that are ignored through the
.gitignore file in the workspace root
-m, --model <string> 'large' model alias (default)
-sm, --small-model <string> 'small' alias model
-vm, --vision-model <string> 'vision' alias model
-ma, --model-alias <nameid...> model alias as name=modelid
-ft, --fallback-tools Enable prompt-based tools instead of
builtin LLM tool calling builtin tool
calls
-o, --out <string> output folder. Extra markdown fields for
output and trace will also be generated
--vars <namevalue...> variables, as name=value, stored in
env.vars. Use environment variables
GENAISCRIPT_VAR_name=value to pass
variable through the environment
-c, --cache enable LLM result cache
-cn, --cache-name <name> custom cache file name
-cc, --concurrency <number> number of concurrent conversions
-h, --help display help for command
```
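The `-cc, --concurrency <number>` option caps how many conversions run at the same time. The sketch below shows one common way to get that behavior, a fixed number of worker "lanes" pulling from a shared index. It is an illustration under stated assumptions, not the code in this commit: `mapWithConcurrency` and the fake per-file worker are hypothetical names.

```typescript
// Hypothetical helper: runs `worker` over `items` with at most `limit`
// calls in flight at once. Not part of the GenAIScript API.
async function mapWithConcurrency<T, R>(
    items: T[],
    limit: number,
    worker: (item: T) => Promise<R>
): Promise<R[]> {
    const results: R[] = new Array<R>(items.length)
    let next = 0 // index of the next item to claim

    // Each lane claims the next unprocessed index until none remain.
    const lane = async (): Promise<void> => {
        while (next < items.length) {
            const index = next++
            results[index] = await worker(items[index])
        }
    }

    // Invalid limits (NaN, Infinity, values below 1) fall back to 1.
    const width = Number.isFinite(limit) && limit >= 1 ? Math.floor(limit) : 1
    const laneCount = Math.min(width, items.length)
    await Promise.all(Array.from({ length: laneCount }, () => lane()))
    return results
}

// Usage: pretend each "conversion" just computes an output file name.
const sampleFiles = ["a.md", "b.md", "c.md"]
mapWithConcurrency(sampleFiles, 2, async (file) => file + ".genai.md").then(
    (outputs) => console.log(outputs)
)
```

Results are stored by input index, so the output order matches the input order even when conversions finish out of order.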

## `scripts`

```
83 changes: 83 additions & 0 deletions docs/src/content/docs/reference/cli/convert.mdx
@@ -0,0 +1,83 @@
---
title: Convert
description: Learn how to apply a script to many files and extract the output.
sidebar:
order: 2
keywords: CLI tool execution, genai script running, stdout streaming, file globbing, environment configuration
---

Converts a set of files, separately, using a script.

```bash
npx genaiscript convert <script> "<files...>"
```

where `<script>` is the id or file path of the script to run, and `<files...>` is one or more file paths or glob patterns to convert.
Unlike `run`, which works on all files at once, `convert` processes each file individually.
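As a rough mental model (a sketch only, not the GenAIScript implementation), the difference between the two commands can be pictured as one call over all files versus one call per file. `Script`, `runOnce`, `convertEach`, and `countFiles` are hypothetical names used purely for illustration.

```typescript
// Hypothetical stand-in for "invoke a script on some files".
type Script = (files: string[]) => string

// `run`-style: a single invocation sees every file.
function runOnce(script: Script, files: string[]): string {
    return script(files)
}

// `convert`-style: one invocation per file, results keyed by input file.
function convertEach(script: Script, files: string[]): Record<string, string> {
    const results: Record<string, string> = {}
    for (const file of files) results[file] = script([file])
    return results
}

const countFiles: Script = (files) => `saw ${files.length} file(s)`
const inputs = ["a.md", "b.md"]

console.log(runOnce(countFiles, inputs)) // saw 2 file(s)
console.log(convertEach(countFiles, inputs)) // each entry: saw 1 file(s)
```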

## Files

`convert` takes one or more [glob](<https://en.wikipedia.org/wiki/Glob_(programming)>) patterns to match files in the workspace.

```bash sh
npx genaiscript convert <script> "**/*.md" "**/*.ts"
```

### --excluded-files &lt;files...&gt;

Excludes the specified files from the file set.

```sh "--excluded-files <excluded-files...>"
npx genaiscript convert <script> <files> --excluded-files <excluded-files...>
```

### --exclude-git-ignore

Excludes files ignored by the `.gitignore` file at the workspace root.

```sh "--exclude-git-ignore"
npx genaiscript convert <script> <files> --exclude-git-ignore
```

## Output

The output for each file is saved to a new or existing file. You can control which part of the output is saved and where it is saved.
By default, a conversion result of a file `<filename>` is saved to a file `<filename>.genai.md`.

### --suffix &lt;suffix&gt;

The `--suffix` option allows you to specify a suffix to append to the output file name.

```sh "--suffix .genai.txt"
npx genaiscript convert <script> <files> --suffix .genai.txt
```

GenAIScript automatically "unfences" markdown output whose fence language matches the suffix (the part after `.genai`). For example, with `--suffix .genai.txt`, suppose the LLM generates:

````markdown
```txt
:)
```
````

The converted content in `<filename>.genai.txt` will be `:)`.
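The unfencing rule can be approximated with the sketch below. It is a simplified illustration, not the actual GenAIScript logic: `fenceLanguage` and `unfence` are hypothetical helpers, and the sketch only handles output that consists of a single fenced block whose language label equals the suffix extension exactly.

```typescript
// Three backticks, built programmatically to keep this example easy to embed.
const FENCE = "`".repeat(3)

// Hypothetical helper: ".genai.txt" -> "txt"; undefined if the suffix
// does not follow the ".genai.<ext>" shape.
function fenceLanguage(suffix: string): string | undefined {
    const match = /\.genai\.([^.]+)$/.exec(suffix)
    return match ? match[1] : undefined
}

// Hypothetical helper: strips one enclosing fence labeled `language`.
// Any other output is returned unchanged.
function unfence(output: string, language: string): string {
    const trimmed = output.trim()
    const open = FENCE + language + "\n"
    const close = "\n" + FENCE
    if (trimmed.startsWith(open) && trimmed.endsWith(close))
        return trimmed.slice(open.length, trimmed.length - close.length)
    return output
}

const llmOutput = [FENCE + "txt", ":)", FENCE].join("\n")
const language = fenceLanguage(".genai.txt")
console.log(language ? unfence(llmOutput, language) : llmOutput) // prints ":)"
```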

### --rewrite

This flag overrides `--suffix` and tells GenAIScript to rewrite the original file with the converted content.

```sh "--rewrite"
npx genaiscript convert <script> <files> --rewrite
```
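Taken together, the default suffix, `--suffix`, and `--rewrite` determine where each result is written. A minimal sketch of that decision, assuming the suffix is simply appended to the input file name as described above (`outputPath` and `OutputOptions` are hypothetical names, not GenAIScript APIs):

```typescript
// Hypothetical option bag mirroring the --suffix and --rewrite flags.
interface OutputOptions {
    suffix?: string
    rewrite?: boolean
}

// Default suffix documented for the convert command.
const DEFAULT_SUFFIX = ".genai.md"

// Returns the path the converted content would be written to.
function outputPath(filename: string, options: OutputOptions = {}): string {
    // --rewrite wins over any suffix: the input file itself is overwritten.
    if (options.rewrite) return filename
    return filename + (options.suffix ?? DEFAULT_SUFFIX)
}

console.log(outputPath("docs/intro.md")) // docs/intro.md.genai.md
console.log(outputPath("docs/intro.md", { suffix: ".genai.txt" })) // docs/intro.md.genai.txt
console.log(outputPath("docs/intro.md", { suffix: ".genai.txt", rewrite: true })) // docs/intro.md
```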

### --cancel-word &lt;word&gt;

The `--cancel-word` (`-cw`) option sets a keyword that the LLM can return to signal that its output should be ignored, for example when a file needs no changes.

```sh '--cancel-word "<NO>"'
npx genaiscript convert <script> <files> --cancel-word "<NO>"
```
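One way to picture the cancel-word check is the sketch below. The matching rule is an assumption: the documentation above does not say whether the output must equal the cancel word or merely contain it, so this sketch uses a substring test. `isCancelled` is a hypothetical helper.

```typescript
// Hypothetical helper: true when the LLM output should be discarded.
// Assumption: any occurrence of the cancel word cancels the file.
function isCancelled(output: string, cancelWord?: string): boolean {
    return (
        cancelWord !== undefined &&
        cancelWord !== "" &&
        output.includes(cancelWord)
    )
}

// Simulated per-file outputs; only non-cancelled files would be written.
const simulatedOutputs: Record<string, string> = {
    "a.md": "# Fixed title",
    "b.md": "<NO>",
}
const filesToWrite = Object.entries(simulatedOutputs)
    .filter(([, text]) => !isCancelled(text, "<NO>"))
    .map(([file]) => file)

console.log(filesToWrite) // [ 'a.md' ]
```

Pairing `--cancel-word` with `--rewrite` is useful because files the script leaves alone are never overwritten.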

## Read more

The full list of options is available in the [CLI reference](/genaiscript/reference/cli/commands#convert).
8 changes: 1 addition & 7 deletions docs/src/content/docs/reference/cli/run.mdx
@@ -12,7 +12,7 @@ Runs a script on files and streams the LLM output to stdout or a folder from the
npx genaiscript run <script> "<files...>"
```

where `<script>` is the id or file path of the tool to run, and `[spec]` is the name of the spec file to run it on.
where `<script>` is the id or file path of the tool to run, and `<files...>` is the name of the spec file to run it on.

Files can also include [glob](<https://en.wikipedia.org/wiki/Glob_(programming)>) pattern.

@@ -26,12 +26,6 @@ If multiple files are specified, all files are included in `env.files`.
npx genaiscript run <script> "src/*.bicep" "src/*.ts"
```

## Credentials

The LLM connection configuration is read from environment variables or from a `.env` file in the workspace root directory.

See [configuration](/genaiscript/getting-started/configuration).

## Files

`run` takes one or more [glob](<https://en.wikipedia.org/wiki/Glob_(programming)>) patterns to match files in the workspace.
69 changes: 27 additions & 42 deletions docs/src/content/docs/samples/sc.mdx
@@ -17,75 +17,60 @@ Starting at the top of the script, we see that it's a GenAI script, which is evi
```ts
script({
title: "Spell checker",
system: ["system", "system.files", "system.diff"],
temperature: 0.1,
system: [
"system.output_plaintext",
"system.assistant",
"system.files",
"system.changelog",
"system.safety_jailbreak",
"system.safety_harmful_content",
],
temperature: 0.2,
cache: "sc",
})
```

This block sets the title of the script to "Spell checker" and specifies that it uses several system prompts, such as file operations and diff generation. The `temperature` is set to `0.1`, indicating that the script will generate output with low creativity, thus favoring precision.

### Fetching Files for Checking

Next, we check for files to process, first from the environment and then from Git if none are provided.

```ts
let files = env.files
if (files.length === 0) {
const gitStatus = await host.exec("git diff --name-only --cached")
files = await Promise.all(
gitStatus.stdout
.split(/\r?\n/g)
.filter((filename) => /\.(md|mdx)$/.test(filename))
.map(async (filename) => await workspace.readText(filename))
)
}
```

In this block, we're assigning files from the `env` variable, which would contain any files passed to the script. If no files are provided, we execute a Git command to get a list of all cached (staged) modified files and filter them to include only `.md` and `.mdx` files. We then read the content of these files for processing.

### Defining the File Types to Work on

Following this, there's a `def` call:

```ts
def("FILES", files, { endsWith: [".md", ".mdx"] })
def("FILES", files)
```

This line defines `FILES` to be the array of files we gathered. The options object `{ endsWith: [".md", ".mdx"] }` tells GenAI that we're only interested in files ending with `.md` or `.mdx`.
This line defines `FILES` to be the array of files we gathered.

The `$`-prefixed backtick notation is used to write the prompt template:

```ts
$`Fix the spelling and grammar of the content of FILES. Use diff format for small changes.
- do NOT fix the frontmatter
- do NOT fix code regions
- do NOT fix \`code\` and \`\`\`code\`\`\`
$`Fix the spelling and grammar of the content of ${files}. Return the full file with corrections
If you find a spelling or grammar mistake, fix it.
If you do not find any mistakes, respond <NO> and nothing else.
- only fix major errors
- use a technical documentation tone
- minimize changes; do NOT change the meaning of the content
- if the grammar is good enough, do NOT change it
- do NOT modify the frontmatter. THIS IS IMPORTANT.
- do NOT modify code regions. THIS IS IMPORTANT.
- do NOT fix \`code\` and \`\`\`code\`\`\` sections
- in .mdx files, do NOT fix inline typescript code
`
```

This prompt instructs GenAI to fix spelling and grammar in the content of the defined `FILES`, outputting small changes in diff format. It also specifies constraints, such as not fixing the frontmatter, code regions, inline code in markdown, and inline TypeScript code in MDX files.

Finally, there is a `defFileOutput` call:

```ts
defFileOutput(files, "fixed markdown or mdx files")
```

This call declares the intent that the script will generate "fixed markdown or mdx files" based on the input files.

## How to Run the Script with GenAIScript CLI

Running this spell checker script is straightforward with the GenAIScript CLI. First, ensure you have the CLI installed by following the instructions in the [GenAIScript documentation](https://microsoft.github.io/genaiscript/getting-started/installation).

Once you have the CLI installed, navigate to your local copy of the script in your terminal or command line interface. Run the following command to execute the spell checker:

```shell
genaiscript run sc
genaiscript convert sc "**/*.md" --rewrite
```

Remember, you do not need to specify the `.genai.mts` extension when using the `run` command.
Remember, you do not need to specify the `.genai.mts` extension when using the `convert` command.

And there you have it—a detailed walkthrough of a GenAI spell checker script for markdown files. Happy coding and perfecting your documents!

Expand All @@ -97,10 +82,10 @@ And there you have it—a detailed walkthrough of a GenAI spell checker script f

The following measures are taken to ensure the safety of the generated content.

- This script includes system prompts to prevent prompt injection and harmful content generation.
- This script includes system prompts to prevent prompt injection and harmful content generation.
- [system.safety_jailbreak](/genaiscript/reference/scripts/system#systemsafety_jailbreak)
- [system.safety_harmful_content](/genaiscript/reference/scripts/system#systemsafety_harmful_content)
- The generated description is saved to a file at a specific path, which allows for a manual review before committing the changes.
- The generated description is saved to a file at a specific path, which allows for a manual review before committing the changes.

Additional measures to further enhance safety would be to run [a model with a safety filter](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/content-filter?tabs=warning%2Cuser-prompt%2Cpython-new)
or validate the message with a [content safety service](/genaiscript/reference/scripts/content-safety).
1 change: 1 addition & 0 deletions package.json
@@ -72,6 +72,7 @@
"gcm": "node packages/cli/built/genaiscript.cjs run gcm --model github:gpt-4o",
"prd": "node packages/cli/built/genaiscript.cjs run prd -prd --model github:gpt-4o",
"genai": "node packages/cli/built/genaiscript.cjs run",
"genai:convert": "node packages/cli/built/genaiscript.cjs convert",
"genai:debug": "yarn compile-debug && node packages/cli/built/genaiscript.cjs run",
"upgrade:deps": "zx scripts/upgrade-deps.mjs",
"cli": "node packages/cli/built/genaiscript.cjs",
5 changes: 1 addition & 4 deletions packages/cli/src/api.ts
@@ -34,10 +34,7 @@ export async function run(
*/
signal?: AbortSignal
}
): Promise<{
exitCode: number
result?: GenerationResult
}> {
): Promise<GenerationResult> {
if (!scriptId) throw new Error("scriptId is required")
if (typeof files === "string") files = [files]

51 changes: 49 additions & 2 deletions packages/cli/src/cli.ts
@@ -37,6 +37,7 @@ import {
OPENAI_MAX_RETRY_DELAY,
OPENAI_RETRY_DEFAULT_DEFAULT,
OPENAI_MAX_RETRY_COUNT,
GENAI_MD_EXT,
} from "../../core/src/constants" // Core constants
import {
errorMessage,
@@ -47,6 +48,7 @@ import {
import { CORE_VERSION, GITHUB_REPO } from "../../core/src/version" // Core version and repository info
import { logVerbose } from "../../core/src/util" // Utility logging
import { semverSatisfies } from "../../core/src/semver" // Semantic version checking
import { convertFiles } from "./convert"

/**
* Main function to initialize and run the CLI.
@@ -177,7 +179,7 @@ export async function cli() {
"-em, --embeddings-model <string>",
"embeddings model for the run"
)
.option("--cache", "enable LLM result cache")
.option("-c, --cache", "enable LLM result cache")
.option("-cn, --cache-name <name>", "custom cache file name")
.option("-cs, --csv-separator <string>", "csv separator", "\t")
.addOption(
@@ -245,6 +247,51 @@ export async function cli() {
.description("Launch test viewer")
.action(scriptTestsView) // Action to view the tests

program
.command("convert")
.description(
"Converts file through a GenAIScript. Each file is processed separately through the GenAIScript and the LLM output is saved to a <filename>.genai.md (or custom suffix)."
)
.arguments("<script> [files...]")
.option(
"-s, --suffix <string>",
"suffix for converted files",
GENAI_MD_EXT
)
.option("-rw, --rewrite", "rewrite input file with output")
.option(
"-cw, --cancel-word <string>",
"cancel word which allows the LLM to notify to ignore output"
)
.option("-ef, --excluded-files <string...>", "excluded files")
.option(
"-egi, --exclude-git-ignore",
"exclude files that are ignored through the .gitignore file in the workspace root"
)
.option("-m, --model <string>", "'large' model alias (default)")
.option("-sm, --small-model <string>", "'small' alias model")
.option("-vm, --vision-model <string>", "'vision' alias model")
.option("-ma, --model-alias <nameid...>", "model alias as name=modelid")
.option(
"-ft, --fallback-tools",
"Enable prompt-based tools instead of builtin LLM tool calling builtin tool calls"
)
.option(
"-o, --out <string>",
"output folder. Extra markdown fields for output and trace will also be generated"
)
.option(
"--vars <namevalue...>",
"variables, as name=value, stored in env.vars. Use environment variables GENAISCRIPT_VAR_name=value to pass variable through the environment"
)
.option("-c, --cache", "enable LLM result cache")
.option("-cn, --cache-name <name>", "custom cache file name")
.option(
"-cc, --concurrency <number>",
"number of concurrent conversions"
)
.action(convertFiles)

// Define 'scripts' command group for script management tasks
const scripts = program
.command("scripts")
@@ -281,7 +328,7 @@ export async function cli() {

// Define 'cache' command for cache management
const cache = program.command("cache").description("Cache management")
const clear = cache
cache
.command("clear")
.description("Clear cache")
.argument("[name]", "Name of the cache, tests")