From 364827abd70a4d5c53f4f721917278917b03226d Mon Sep 17 00:00:00 2001
From: marcus
Date: Tue, 2 Jan 2024 12:18:38 -0800
Subject: [PATCH] replace old internal README with one suitable for this set of packages

---
 README.md | 79 ++++++++++---------------------------------------------
 1 file changed, 14 insertions(+), 65 deletions(-)

diff --git a/README.md b/README.md
index 3653fb6b..25022fe6 100644
--- a/README.md
+++ b/README.md
@@ -1,72 +1,21 @@
-# llama-cpp-rs
-A reimplementation of the parts of Microsoft's [guidance](https://github.com/guidance-ai/guidance) that don't slow things down. Based on [llama.cpp](https://github.com/ggerganov/llama.cpp) with bindings in Rust.
+# llama-cpp-rs-2
 
-## Features
+A wrapper around the [llama-cpp](https://github.com/ggerganov/llama.cpp/) library for Rust.
 
-✅ Guaranteed LLM output formatting (see [formatting](#formatting))
+# Goals
 
-✅ Dynamic prompt templates
+- Safe
+- Up to date (llama-cpp-rs is out of date)
+- Abort free (llama.cpp will abort if you violate its invariants. This library will attempt to prevent that by either
+  ensuring the invariants are upheld statically or by checking them ourselves and returning an error)
+- Performant (no meaningful overhead over using llama-cpp-sys-2)
+- Well documented
 
-✅ Model Quantization
+# Non-goals
 
-✅ Fast (see [performance](#performance))
+- Idiomatic Rust (I will prioritize a more direct translation of the C++ API over a more idiomatic Rust API due to
+  maintenance burden)
 
-## Prompt storage
+# Contributing
 
-You can store context on the filesystem if it will be reused, or keep the gRPC connection open to keep it in memory.
-
-## Formatting
-
-For a very simple example, assume you pass an LLM a transcript: you just sent the user a verification code, but you don't know whether they have received it yet, or whether they can even access the 2FA device. You ask the user for the code, they respond, and you prompt the LLM.
-
-````
-
-What is the user's verification code?
-```yaml
-verification code: '
-````
-
-A traditional solution (and the only solution offered by OpenAI) is to give a stop condition of `'` and hope the LLM fills in a string and stops when it is done. You get *no control* over how it will respond. Without spending extra compute on a longer prompt you cannot specify that the code is 6 digits or what to output if it does not exist, and even with the longer prompt there is no guarantee it will be followed.
-
-We do things differently by adding the ability to force an LLM's output to follow a regex and by allowing bidirectional streaming.
-
-- Given the regex `(true)|(false)` you can force an LLM to respond only with true or false.
-- Given `([0-9]+)|(null)` you can extract a verification code that a user has given.
-
-Combining the two leads to something like
-
-````{ prompt: "verification code: '" }````
-
-````{ generate: "(([0-9]+)|(null))'" }````
-
-which will always output the user's verification code or `null`.
-
-When combined with bidirectional streaming we can do neat things. For example, if the LLM yields a null `verification code`, we can send a second message asking for a `reason` (with the regex `(not arrived)|(unknown)|(device inaccessible)`).
-
-### Comparisons
-
-Guidance uses complex templating syntax. Dynamism is achieved through function calling and conditional statements in a Handlebars-like DSL. The function calling is a security nightmare (especially in a language as dynamic as Python) and conditional templating does not scale.
-
-[lmql](https://lmql.ai/) uses a similar approach in that control flow stays in the "host" language, but it is a superset of Python supported via decorators. Performance is difficult to control, and it is nearly impossible to use in a concurrent setting such as a web server.
-
-We instead stick the LLM on a GPU (or many, if resources are required) and call it using gRPC.
-
-Dynamism is achieved in the client code (where it belongs) by streaming messages back and forth between the client and `llama-cpp-rpc` with minimal overhead.
-
-## Performance
-
-Numbers were run on an RTX 3090 running a fine-tuned 7B Mistral model (unquantized). With quantization we can run state-of-the-art 70B models on consumer hardware.
-
-||Remote hosting|FS context storage|Concurrency|Raw tokens/s|Guided tokens/s|
-|----|----|----|----|----|----|
-|Llama-cpp-rpc|✅|✅|✅|65|56|
-|Guidance|❌|❌|❌|30|5|
-|LMQL|❌|❌|❌|30|10|
-
-## Dependencies
-
-### Ubuntu
-
-```bash
-sudo apt install -y curl libssl-dev libclang-dev pkg-config cmake git protobuf-compiler
-```
+Contributions are welcome. Please open an issue before starting work on a PR.
\ No newline at end of file
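A note on the "Abort free" goal in the new README: llama.cpp calls `abort()` when its invariants are violated, so a safe wrapper has to validate arguments before crossing the FFI boundary and surface failures as `Result`s instead. Below is a minimal sketch of that pattern only; the `Context`, `Batch`, and `DecodeError` types, their fields, and the commented-out FFI call are hypothetical stand-ins for illustration, not the crate's actual API.

```rust
// Sketch: uphold llama.cpp's invariants in safe Rust and return an error
// rather than letting the C++ side abort the process.
// All names here are hypothetical, not the real llama-cpp-2 API.

#[derive(Debug)]
pub enum DecodeError {
    /// The batch holds more tokens than the context was created with.
    BatchTooLarge { n_tokens: usize, n_ctx: usize },
    /// The underlying library reported a non-zero status code.
    Backend(i32),
}

pub struct Batch {
    tokens: Vec<i32>,
}

pub struct Context {
    n_ctx: usize,
    // raw: *mut llama_context, // the real wrapper would hold a raw FFI handle
}

impl Context {
    /// Checked wrapper: verify the invariant ourselves, then call into the FFI.
    pub fn decode(&mut self, batch: &Batch) -> Result<(), DecodeError> {
        if batch.tokens.len() > self.n_ctx {
            // The unchecked C++ path would abort here; we return an error instead.
            return Err(DecodeError::BatchTooLarge {
                n_tokens: batch.tokens.len(),
                n_ctx: self.n_ctx,
            });
        }
        // let status = unsafe { llama_decode(self.raw, /* raw batch */) };
        let status = 0; // stand-in for the raw llama.cpp call
        if status == 0 {
            Ok(())
        } else {
            Err(DecodeError::Backend(status))
        }
    }
}
```

Checks like this are cheap (a length comparison) compared to a decode call, which is how the "Abort free" and "Performant" goals can coexist: the error path costs almost nothing, while the unchecked path risks an unrecoverable abort inside the C++ code.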