diff --git a/docs/docs/quick-start/getting-started-01.md b/docs/docs/quick-start/getting-started-01.md
index 66eaedb2f..e535bd5a9 100644
--- a/docs/docs/quick-start/getting-started-01.md
+++ b/docs/docs/quick-start/getting-started-01.md
@@ -46,7 +46,7 @@ dspy.inspect_history(n=1)
 ```
 
 **Output:**
-See this [gist](https://gist.github.com/okhat/aff3c9788ccddf726fdfeb78e40e5d22)
+See this [gist](https://gist.github.com/okhat/aff3c9788ccddf726fdfeb78e40e5d22).
 
 DSPy has various built-in modules, e.g. `dspy.ChainOfThought`, `dspy.ProgramOfThought`, and `dspy.ReAct`. These are interchangeable with basic `dspy.Predict`: they take your signature, which is specific to your task, and they apply general-purpose prompting techniques and inference-time strategies to it.
 
@@ -151,7 +151,7 @@ len(trainset), len(valset), len(devset), len(testset)
 ```
 
 What kind of metric can suit our question-answering task? There are many choices, but since the answers are long, we may ask: How well does the system response _cover_ all key facts in the gold response? And the other way around, how well does the system response _not say things_ that aren't in the gold response?
 
-That metric is essentially a **semantic F1**, so let's load a `SemanticF1` metric from DSPy. This metric is actually implemented as a [very simple DSPy module](/docs/building-blocks/modules) using whatever LM we're working with.
+That metric is essentially a **semantic F1**, so let's load a `SemanticF1` metric from DSPy. This metric is actually implemented as a [very simple DSPy module](https://github.com/stanfordnlp/dspy/blob/77c2e1cceba427c7f91edb2ed5653276fb0c6de7/dspy/evaluate/auto_evaluation.py#L21) using whatever LM we're working with.
 
 ```python
@@ -192,7 +192,7 @@ dspy.inspect_history(n=1)
 ```
 
 **Output:**
-See this [gist](https://gist.github.com/okhat/57bf86472d1e14812c0ae33fba5353f8)
+See this [gist](https://gist.github.com/okhat/57bf86472d1e14812c0ae33fba5353f8).
 
 For evaluation, you could use the metric above in a simple loop and just average the score. But for nice parallelism and utilities, we can rely on `dspy.Evaluate`.
 
diff --git a/docs/docs/quick-start/getting-started-02.md b/docs/docs/quick-start/getting-started-02.md
index 87da009f7..91784f9ca 100644
--- a/docs/docs/quick-start/getting-started-02.md
+++ b/docs/docs/quick-start/getting-started-02.md
@@ -93,7 +93,11 @@ class RAG(dspy.Module):
     def forward(self, question):
         context = search(question, k=self.num_docs)
         return self.respond(context=context, question=question)
-
+```
+
+Let's use the `RAG` module.
+
+```python
 rag = RAG()
 rag(question="what are high memory and low memory on linux?")
 ```
@@ -111,7 +115,7 @@ dspy.inspect_history()
 ```
 
 **Output:**
-See this [gist](https://gist.github.com/okhat/d807032e138862bb54616dcd2f4d481c)
+See this [gist](https://gist.github.com/okhat/d807032e138862bb54616dcd2f4d481c).
 
 In the previous guide with a CoT module, we got nearly 40% in terms of semantic F1 on our `devset`. Would this `RAG` module score better?
 
@@ -151,7 +155,7 @@ optimized_rag = tp.compile(RAG(), trainset=trainset, valset=valset,
 ```
 
 **Output:**
-See this [gist](https://gist.github.com/okhat/d6606e480a94c88180441617342699eb)
+See this [gist](https://gist.github.com/okhat/d6606e480a94c88180441617342699eb).
 
 The prompt optimization process here is pretty systematic; you can learn about it, for example, in this paper. Importantly, it's not a magic button. It's very possible, for instance, that it will overfit your training set and not generalize well to a held-out set, which makes it essential that we iteratively validate our programs.
 
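
For reference, here is a minimal sketch of how the `SemanticF1` metric that the updated link in getting-started-01.md points to can be combined with `dspy.Evaluate`. The model name, the toy example, and the thread/table settings are placeholders, not the guide's own configuration:

```python
import dspy
from dspy.evaluate import SemanticF1

# Placeholder LM; substitute whatever model the guide configures.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

cot = dspy.ChainOfThought("question -> response")

# SemanticF1 is itself a small DSPy module: it uses the configured LM to judge
# how well the prediction covers the gold response, and vice versa.
metric = SemanticF1()

# Toy example with a placeholder gold answer.
example = dspy.Example(
    question="what are high memory and low memory on linux?",
    response="(gold answer text)",
).with_inputs("question")

pred = cot(**example.inputs())
print(metric(example, pred))  # score a single example

# For a whole devset, dspy.Evaluate adds parallelism and reporting utilities.
evaluate = dspy.Evaluate(devset=[example], metric=metric, num_threads=8,
                         display_progress=True, display_table=1)
evaluate(cot)
```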
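Similarly, a self-contained sketch of the `RAG` module that getting-started-02.md now splits into two code blocks. Only `forward` appears in the hunk above; the `__init__` body and the `search` stub here are illustrative assumptions, and an LM is assumed to be configured as in the previous sketch:

```python
import dspy

def search(query: str, k: int = 5) -> list[str]:
    # Stand-in retriever: the guide wires this up to a real search index.
    return [f"(retrieved passage {i} for: {query})" for i in range(1, k + 1)]

class RAG(dspy.Module):
    def __init__(self, num_docs=5):
        super().__init__()
        self.num_docs = num_docs
        # Answer the question with chain-of-thought over the retrieved context.
        self.respond = dspy.ChainOfThought("context, question -> response")

    def forward(self, question):
        context = search(question, k=self.num_docs)
        return self.respond(context=context, question=question)

rag = RAG()
print(rag(question="what are high memory and low memory on linux?").response)
```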
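Finally, a rough sketch of the optimization and re-validation step that the last hunk's output refers to, using `dspy.MIPROv2` as the teleprompter `tp`. The optimizer settings are illustrative only, and `RAG`, `trainset`, `valset`, and `devset` are assumed to come from the guide (or the sketches above):

```python
import dspy
from dspy.evaluate import SemanticF1

metric = SemanticF1()
tp = dspy.MIPROv2(metric=metric, auto="medium")

# Optimize the program's prompts against the training/validation splits
# (RAG, trainset, and valset are assumed to be defined as in the guide).
optimized_rag = tp.compile(RAG(), trainset=trainset, valset=valset,
                           max_bootstrapped_demos=2, max_labeled_demos=2)

# Prompt optimization can overfit the trainset, so re-check the optimized
# program on held-out data before trusting the gains.
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=8,
                         display_progress=True)
evaluate(RAG())           # baseline score
evaluate(optimized_rag)   # optimized score; compare the two
```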