Merge pull request stanfordnlp#1637 from stanfordnlp/docs_oct2024
Docs
okhat authored Oct 16, 2024
2 parents 77c2e1c + 8d05895 commit a1eae3f
Showing 2 changed files with 10 additions and 6 deletions.
6 changes: 3 additions & 3 deletions docs/docs/quick-start/getting-started-01.md
@@ -46,7 +46,7 @@ dspy.inspect_history(n=1)
```

**Output:**
-See this [gist](https://gist.github.com/okhat/aff3c9788ccddf726fdfeb78e40e5d22)
+See this [gist](https://gist.github.com/okhat/aff3c9788ccddf726fdfeb78e40e5d22).


DSPy has various built-in modules, e.g. `dspy.ChainOfThought`, `dspy.ProgramOfThought`, and `dspy.ReAct`. These are interchangeable with basic `dspy.Predict`: they take your signature, which is specific to your task, and they apply general-purpose prompting techniques and inference-time strategies to it.
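To make that interchangeability concrete, here's a minimal sketch (the `question -> response` signature and the sample question are illustrative, and it assumes an LM has already been configured as earlier in the guide):

```python
import dspy

# Assumes an LM is already configured, e.g. as earlier in this guide:
# dspy.configure(lm=dspy.LM('openai/gpt-4o-mini'))

# The same task-specific signature plugs into different modules.
qa = dspy.Predict('question -> response')
cot = dspy.ChainOfThought('question -> response')  # adds an intermediate reasoning field

qa(question="what are high memory and low memory on linux?")
cot(question="what are high memory and low memory on linux?")
```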
@@ -151,7 +151,7 @@ len(trainset), len(valset), len(devset), len(testset)

What kind of metric can suit our question-answering task? There are many choices, but since the answers are long, we may ask: How well does the system response _cover_ all key facts in the gold response? And, the other way around, how well does the system response avoid _saying things_ that aren't in the gold response?

-That metric is essentially a **semantic F1**, so let's load a `SemanticF1` metric from DSPy. This metric is actually implemented as a [very simple DSPy module](/docs/building-blocks/modules) using whatever LM we're working with.
+That metric is essentially a **semantic F1**, so let's load a `SemanticF1` metric from DSPy. This metric is actually implemented as a [very simple DSPy module](https://github.com/stanfordnlp/dspy/blob/77c2e1cceba427c7f91edb2ed5653276fb0c6de7/dspy/evaluate/auto_evaluation.py#L21) using whatever LM we're working with.
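In rough terms, using it looks like the sketch below (assuming the `cot` module and a dev-set `example` from the earlier steps):

```python
from dspy.evaluate import SemanticF1

metric = SemanticF1()  # implemented as a DSPy module, so it uses the configured LM

pred = cot(**example.inputs())  # run our program on the example's inputs
score = metric(example, pred)   # semantic F1 between gold and system responses
```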


```python
@@ -192,7 +192,7 @@ dspy.inspect_history(n=1)
```

**Output:**
-See this [gist](https://gist.github.com/okhat/57bf86472d1e14812c0ae33fba5353f8)
+See this [gist](https://gist.github.com/okhat/57bf86472d1e14812c0ae33fba5353f8).

For evaluation, you could use the metric above in a simple loop and just average the score. But for nice parallelism and utilities, we can rely on `dspy.Evaluate`.
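A sketch of what that can look like (thread count and display settings are illustrative):

```python
# Wrap the dev set and metric in an evaluator that runs in parallel.
evaluate = dspy.Evaluate(devset=devset, metric=metric, num_threads=16,
                         display_progress=True, display_table=5)

# Score our module: this runs it on every dev example and averages the metric.
evaluate(cot)
```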

10 changes: 7 additions & 3 deletions docs/docs/quick-start/getting-started-02.md
@@ -93,7 +93,11 @@ class RAG(dspy.Module):
    def forward(self, question):
        context = search(question, k=self.num_docs)
        return self.respond(context=context, question=question)
-
+```
+
+Let's use the RAG module.
+
+```
rag = RAG()
rag(question="what are high memory and low memory on linux?")
```
@@ -111,7 +115,7 @@ dspy.inspect_history()
```

**Output:**
-See this [gist](https://gist.github.com/okhat/d807032e138862bb54616dcd2f4d481c)
+See this [gist](https://gist.github.com/okhat/d807032e138862bb54616dcd2f4d481c).


In the previous guide with a CoT module, we got nearly 40% in terms of semantic F1 on our `devset`. Would this `RAG` module score better?
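One way to find out is to reuse the same evaluator; as a sketch, assuming `evaluate` was built with our devset and the `SemanticF1` metric as in the previous guide:

```python
evaluate(RAG())
```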
@@ -151,7 +155,7 @@ optimized_rag = tp.compile(RAG(), trainset=trainset, valset=valset,
```

**Output:**
-See this [gist](https://gist.github.com/okhat/d6606e480a94c88180441617342699eb)
+See this [gist](https://gist.github.com/okhat/d6606e480a94c88180441617342699eb).


The prompt optimization process here is pretty systematic; you can learn about it, for example, in this paper. Importantly, it's not a magic button: it's very possible for it to overfit your training set, for instance, and not generalize well to a held-out set, which makes it essential that we iteratively validate our programs.
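As a minimal sketch of that validation step, assuming the `evaluate` instance from before and the `optimized_rag` produced above:

```python
# Re-score the optimized program on held-out data and compare it against the
# unoptimized baseline measured earlier.
evaluate(optimized_rag)
```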
