Evaluating Gen AI Applications: Using Ragas and DeepEval
This is a beginner-friendly blog post for anyone exploring tools to evaluate RAG pipelines, summarization systems, or general LLM-driven applications. Throughout this post, you’ll find practical explanations and examples using Ragas and DeepEval, focusing on how they handle metrics like faithfulness, contextual precision, and answer relevancy. If you’re new to these concepts or just starting to assess your LLM outputs, this guide will give you a clear foundation. For a deeper dive, including installation, advanced configurations, and evolving features, be sure to explore each project’s official website and GitHub documentation.
Why evaluate LLM outputs at all?
We’ve all been using ChatGPT, Gemini, or other LLMs to help with various tasks professionally and personally. When you integrate an LLM into your own applications, like a chatbot, summarizer, or multi-step agent, you can’t just rely on gut feel that it works as expected.
You need to systematically measure things like:
- Faithfulness: is it hallucinating, or grounded in the evidence you provided?
- Contextual precision: does it only pull what’s needed from your docs?
- Answer relevancy: is it even addressing the user’s question?
That’s where specialized LLM evaluation frameworks come in.
For this post, we’ll be considering two tools to assess exactly these qualities: Ragas and DeepEval.
Ragas vs DeepEval
Ragas
Ragas is a Python library designed mainly for evaluating RAG (retrieval-augmented generation) systems. It offers metrics like:
- faithfulness: measures whether the generated answer is faithful to the provided context
- context_precision: evaluates how much of the retrieved context is relevant to the question
- context_recall: measures how much of the relevant context was actually retrieved
- answer_relevancy: assesses whether the answer is relevant to the user's question
```python
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision, context_recall, answer_relevancy
from datasets import Dataset

# Prepare dataset (this is the complex part!)
dataset = Dataset.from_dict({
    "question": ["Why do leaves change color?"],
    "answer": ["Leaves change color due to chlorophyll breakdown."],
    "contexts": [["Leaves turn color in autumn due to changes in daylight."]],
    "ground_truth": ["Leaves change color because chlorophyll fades."]
})

# Evaluate
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, context_precision, context_recall, answer_relevancy]
)
```
It uses LLM-as-a-judge under the hood, sending structured prompts to models like GPT-4, and supports multiple evaluation models including local models.
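Because the scores are LLM-judged, the choice of judge model matters. The sketch below shows how you might point the evaluation at a specific judge and embedding model; it assumes a recent Ragas release that exposes `LangchainLLMWrapper` and `LangchainEmbeddingsWrapper`, and the model names are illustrative placeholders rather than recommendations:

```python
# Hedged sketch: configure the judge LLM and embeddings explicitly.
# Assumes a recent Ragas version with LangchainLLMWrapper / LangchainEmbeddingsWrapper;
# the model names below are illustrative placeholders.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import faithfulness, answer_relevancy

judge_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
judge_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

results = evaluate(
    dataset=dataset,               # the Dataset built in the snippet above
    metrics=[faithfulness, answer_relevancy],
    llm=judge_llm,                 # LLM used as the judge
    embeddings=judge_embeddings,   # embeddings used by answer_relevancy
)
print(results)
```

A locally hosted model can be swapped in the same way, as long as it is wrapped in a LangChain-compatible interface.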
DeepEval
DeepEval is a broader LLM quality framework. Think of it as pytest for LLMs, allowing you to:
- Create `LLMTestCase` objects with `input`, `actual_output`, `expected_output`, and `retrieval_context`.
- Use `.measure()` to get scores, or `assert_test()` to enforce quality thresholds in CI/CD (a pytest-style sketch appears later in this section).
It supports metrics like:
- FaithfulnessMetric
- ContextualPrecisionMetric
- ContextualRecallMetric
- AnswerRelevancyMetric
- GEval (custom, criteria-based evaluation)
```python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, ContextualPrecisionMetric, AnswerRelevancyMetric, ContextualRecallMetric

# Create test case
test_case = LLMTestCase(
    input="Your question",
    actual_output="LLM response",
    expected_output="Expected response",
    retrieval_context=["context1", "context2"]
)

# Measure individual metrics; each metric exposes .score (and .reason) after measure()
faithfulness = FaithfulnessMetric()
faithfulness.measure(test_case)

context_precision = ContextualPrecisionMetric()
context_precision.measure(test_case)

answer_relevancy = AnswerRelevancyMetric()
answer_relevancy.measure(test_case)

contextual_recall = ContextualRecallMetric()
contextual_recall.measure(test_case)
```
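The `GEval` metric from the list above deserves a quick illustration: it lets you describe your own evaluation criteria in plain language and have the judge LLM score against them. Here is a minimal sketch; the criteria wording and threshold are illustrative, not taken from the experiments in this post:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Illustrative custom criterion; wording and threshold are placeholders
correctness = GEval(
    name="Correctness",
    criteria="Check whether the actual output is factually consistent with the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.7,
)

correctness.measure(test_case)  # test_case from the snippet above
print(correctness.score, correctness.reason)
```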
Note: DeepEval also supports RAGAS metrics through its `RagasMetric` class, which bundles four RAGAS metrics (answer relevancy, faithfulness, contextual precision, and contextual recall).
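To close the loop on the CI/CD point above, here is a hedged sketch of what a pytest-style check might look like. The thresholds and example strings are illustrative, and exact import paths can shift between DeepEval versions:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric

def test_rag_answer_quality():
    # Illustrative test case; in practice, feed in outputs from your own pipeline
    test_case = LLMTestCase(
        input="Why do leaves change color in the fall?",
        actual_output="Chlorophyll breaks down as days shorten, revealing other pigments.",
        retrieval_context=["Leaves turn color in autumn due to changes in daylight and temperature."],
    )
    # assert_test fails the test if any metric scores below its threshold
    assert_test(test_case, [
        FaithfulnessMetric(threshold=0.7),
        AnswerRelevancyMetric(threshold=0.7),
    ])
```

Run it with pytest or DeepEval's own test runner so that regressions in answer quality fail the build like any other test.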
Key Example Questions & Responses
Both tools were tested on the same four example scenarios, illustrating typical, incomplete, and incorrect uses of context. Note: The evaluation scores below are only meaningful when considered with the specific context and expected outputs provided.
Example | Question | Response | Context | Reference |
---|---|---|---|---|
1 | Why do leaves change color in the fall? | Leaves change color in autumn because shorter days and cooler temperatures cause chlorophyll to break down, revealing other pigments. | • “Leaves turn color in autumn due to changes in daylight and temperature.” • “Photosynthesis slows and chlorophyll breaks down, revealing red and yellow pigments.” • “Some tree species have different pigment compounds that show up prominently in fall.” | Leaves change color in autumn because chlorophyll breaks down due to shorter days and cooler weather. |
2 | Why do we have seasons on Earth? | Seasons occur because the Earth’s axis is tilted, which changes how sunlight hits different parts of the planet throughout the year. | • “The tilt of Earth’s axis relative to its orbit causes different parts of Earth to receive varying amounts of sunlight throughout the year.” • “This axial tilt is approximately 23.5 degrees.” • “Changes in distance from the sun during the orbit do not cause seasons.” | Earth has seasons because its axis is tilted relative to its orbital plane, causing varying sunlight. |
3 | Why do we have seasons on Earth? (limited context) | We have seasons because Earth moves around the Sun. | • “Earth revolves around the Sun in an elliptical orbit.” • “There are four main seasons: spring, summer, autumn, and winter.” | Earth has seasons because its axis is tilted relative to its orbit, changing sunlight. |
4 | Why do leaves change color in the fall? (unrelated context) | Leaves change color because fish swim upstream in autumn. | • “Fish migrate upstream to spawn in freshwater rivers.” • “This migration ensures the survival of the next generation.” | Leaves change color due to breakdown of chlorophyll in cooler temperatures. |
Metric Results Overview
Below are the aggregated scores each tool produced for the four examples. Higher numbers are better (max = 1.000).
Ragas Scores
Example | Faithfulness | Context Precision | Context Recall | Answer Relevancy |
---|---|---|---|---|
1 | 1.000 | 1.000 | 1.000 | 0.987 |
2 | 1.000 | 1.000 | 1.000 | 0.970 |
3 | 1.000 | 0.000 | 0.000 | 0.974 |
4 | 0.000 | 0.000 | 0.000 | 0.000 |
DeepEval Scores
Example | Faithfulness | Context Precision | Answer Relevancy | Contextual Recall |
---|---|---|---|---|
1 | 1.000 | 1.000 | 1.000 | 1.000 |
2 | 1.000 | 1.000 | 1.000 | 1.000 |
3 | 1.000 | 0.000 | 1.000 | 0.000 |
4 | 0.000 | 0.000 | 0.000 | 0.000 |
Overall Assessment
On these four examples, Ragas and DeepEval produced consistent, sensible scores. Both tools correctly identified:
- High-quality responses with relevant context
- Responses that ignore or misuse provided context
- Completely incorrect or hallucinated responses
Biases to watch out for
Bias name | What it means |
---|---|
First case bias | The LLM judge may favor the first listed option. |
Self-evaluation bias (LLM evaluator bias) | Tends to prefer outputs matching its own typical style. |
Alignment bias (style calibration bias) | Rewards verbose, hedged, balanced language matching its training. |
Conclusion
Evaluating LLM outputs systematically is critical for building reliable AI applications. Based on the analysis above, here are practical recommendations for choosing between Ragas and DeepEval:
Choose Ragas if:
- Research-focused: You need detailed RAG-specific metrics with academic rigor
- Batch evaluation: You have large datasets and want to evaluate them all at once
- Custom embeddings: You want to use specific embedding models for answer relevancy
- Academic publishing: You need reproducible results with standardized metrics
Choose DeepEval if:
- Production workflows: You need CI/CD integration with `assert_test()` and pytest
- Iterative development: You want to test individual cases and get immediate feedback
- Cost optimization: You need to run evaluations multiple times with caching
- Debugging: You want detailed reasoning for why metrics failed
When to Use Both:
Consider using both tools in complementary ways: Ragas for initial dataset evaluation and DeepEval for ongoing monitoring and CI/CD workflows.
By understanding these trade-offs and the specific context of your evaluation needs, you can make an informed decision that balances technical rigor with practical implementation requirements.