
Commit 5a4e27e

Enhance evaluation section in README.md
1 parent b9776bf commit 5a4e27e

File tree

README.md

1 file changed: +20 -18 lines


README.md

Lines changed: 20 additions & 18 deletions
@@ -3,9 +3,9 @@
 [![Documentation Status](https://app.readthedocs.org/projects/sdialog/badge/?version=latest)](https://sdialog.readthedocs.io)
 [![CI](https://img.shields.io/github/actions/workflow/status/idiap/sdialog/ci.yml?label=CI)](https://github.com/idiap/sdialog/actions/workflows/ci.yml)
 [![codecov](https://codecov.io/gh/idiap/sdialog/graph/badge.svg?token=2210USI8I0)](https://app.codecov.io/gh/idiap/sdialog?displayType=list)
+[![Demo](https://img.shields.io/badge/Demo%20video-YouTube-red?logo=youtube)](https://www.youtube.com/watch?v=oG_jJuU255I)
 [![PyPI version](https://badge.fury.io/py/sdialog.svg)](https://badge.fury.io/py/sdialog)
 [![Downloads](https://static.pepy.tech/badge/sdialog)](https://pepy.tech/project/sdialog)
-[![Demo video](https://img.shields.io/badge/Demo%20video-YouTube-red?logo=youtube)](https://www.youtube.com/watch?v=oG_jJuU255I)
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/idiap/sdialog/)

 Quick links: [Website](https://sdialog.github.io/) | [GitHub](https://github.com/idiap/sdialog) | [Docs](https://sdialog.readthedocs.io) | [API](https://sdialog.readthedocs.io/en/latest/api/sdialog.html) | [ArXiv paper](https://arxiv.org/abs/2512.09142) | [Demo (video)](demo.md) | [Tutorials](https://github.com/idiap/sdialog/tree/main/tutorials) | [Datasets (HF)](https://huggingface.co/datasets/sdialog) | [Issues](https://github.com/idiap/sdialog/issues)

@@ -67,7 +67,7 @@ support_persona = SupportAgent(name="Ava", politeness="high", communication_styl
 customer_persona = Customer(name="Riley", issue="double charge", desired_outcome="refund")

 # (Optional) Let's define two mock tools (just plain Python functions) for our support agent
-def account_verification(user_id):
+def verify_account(user_id):
     """Verify user account by user id."""
     return {"user_id": user_id, "verified": True}
 def refund(amount):

@@ -84,7 +84,7 @@ react_refund = SimpleReflexOrchestrator(
 support_agent = Agent(
     persona=support_persona,
     think=True, # Let's also enable thinking mode
-    tools=[account_verification, refund],
+    tools=[verify_account, refund],
     name="Support"
 )
 simulated_customer = Agent(

@@ -192,27 +192,29 @@ See [Dialog section](https://sdialog.readthedocs.io/en/latest/sdialog/index.html
 <summary>Score dialogs with built‑in metrics and LLM judges, and compare datasets with aggregators and plots.</summary>

 Dialogs can be evaluated using the different components available inside the `sdialog.evaluation` module.
-Use [built‑in metrics](https://sdialog.readthedocs.io/en/latest/api/sdialog.html#module-sdialog.evaluation) (readability, flow, linguistic features, LLM judges) or easily create new ones, then aggregate and compare datasets (sets of dialogs) via `DatasetComparator`.
+Use [built‑in metrics](https://sdialog.readthedocs.io/en/latest/api/sdialog.html#module-sdialog.evaluation)—conversational features, readability, embedding-based, LLM-as-judge, flow-based, functional correctness (30+ metrics across six categories)—or easily create new ones, then aggregate and compare datasets (sets of dialogs) via `Comparator`.

 ```python
-from sdialog.evaluation import LLMJudgeRealDialog, LinguisticFeatureScore
-from sdialog.evaluation import FrequencyEvaluator, MeanEvaluator
-from sdialog.evaluation import DatasetComparator
-
-reference = [...] # list[Dialog]
-candidate = [...] # list[Dialog]
+from sdialog import Dialog
+from sdialog.evaluation import LLMJudgeYesNo, ToolSequenceValidator
+from sdialog.evaluation import FrequencyEvaluator, Comparator

-judge = LLMJudgeRealDialog()
-flesch = LinguisticFeatureScore(feature="flesch-reading-ease")
+# Two quick checks: did the agent ask for verification, and did it call tools in order?
+judge_verify = LLMJudgeYesNo(
+    "Did the support agent try to verify the customer?",
+    reason=True,
+)
+tool_seq = ToolSequenceValidator(["verify_account", "refund"])

-comparator = DatasetComparator([
-    FrequencyEvaluator(judge, name="Realistic dialog rate"),
-    MeanEvaluator(flesch, name="Mean Flesch Reading Ease"),
+comparator = Comparator([
+    FrequencyEvaluator(judge_verify, name="Asked for verification"),
+    FrequencyEvaluator(tool_seq, name="Correct tool order"),
 ])

-results = comparator({"reference": reference, "candidate": candidate})
-
-# Plot results for each evaluator
+results = comparator({
+    "model-A": Dialog.from_folder("output/model-A"),
+    "model-B": Dialog.from_folder("output/model-B"),
+})
 comparator.plot()
 ```
 </details>
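A note on what the new example measures: both checks are yes/no per dialog, and `FrequencyEvaluator` reports how often each one passes across a dataset. The sketch below illustrates the idea behind the "Correct tool order" check and its aggregation; it is a standalone approximation, not sdialog's implementation, and the helper names, matching rule, and tool-call traces are assumptions made purely for illustration.

```python
# Illustrative sketch only: not sdialog's ToolSequenceValidator/FrequencyEvaluator code.
# Assumes we already have, for each simulated dialog, the ordered tool names the agent called.

def follows_tool_order(called_tools, expected_order):
    """True if expected_order appears, in order, within called_tools (assumed matching rule)."""
    remaining = iter(called_tools)
    return all(tool in remaining for tool in expected_order)

def pass_rate(flags):
    """Fraction of dialogs passing a yes/no check, i.e. what a frequency-style evaluator reports."""
    return sum(flags) / len(flags) if flags else 0.0

# Hypothetical tool-call traces from three simulated support dialogs
traces = [
    ["verify_account", "refund"],  # verified, then refunded: pass
    ["refund"],                    # refunded without verifying: fail
    ["verify_account", "refund"],  # pass
]
print(pass_rate([follows_tool_order(t, ["verify_account", "refund"]) for t in traces]))  # ~0.67
```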
