Evaluating Gen AI Applications: Using Ragas and DeepEval
This is a beginner-friendly blog post for anyone exploring tools to evaluate RAG pipelines, summarization systems, or general LLM-driven applications. Throughout this post, you’ll find practical explanations and examples using Ragas and DeepEval, focusing on how they handle metrics like faithfulness, contextual precision, and answer relevancy. If you’re new to these concepts or just starting to assess your LLM outputs, this guide will give you a clear foundation. For a deeper dive, including installation, advanced configurations, and evolving features, be sure to explore each project’s official website and GitHub documentation.
Why evaluate LLM outputs at all?
We’ve all been using ChatGPT, Gemini, or other LLMs to help with various tasks professionally and personally. When you integrate an LLM into your own applications, like a chatbot, summarizer, or multi-step agent, you can’t just rely on gut feel that it works as expected.
You need to systematically measure things like:
- Faithfulness: is it hallucinating, or grounded in the evidence you provided?
- Contextual precision: does it only pull what’s needed from your docs?
- Answer relevancy: is it even addressing the user’s question?
That’s where specialized LLM evaluation frameworks come in.
For this post, we’ll be considering two tools to assess exactly these qualities: Ragas and DeepEval.
Ragas vs DeepEval
Ragas
Ragas is a Python library designed mainly for evaluating RAG (retrieval-augmented generation) systems. It offers metrics like:
- faithfulness: measures whether the generated answer is faithful to the provided context
- context_precision: evaluates how much of the retrieved context is relevant to the question
- context_recall: measures how much of the relevant context was actually retrieved
- answer_relevancy: assesses whether the answer is relevant to the user's question
```python
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision, context_recall, answer_relevancy
from datasets import Dataset

# Prepare dataset (this is the complex part!)
dataset = Dataset.from_dict({
    "question": ["Why do leaves change color?"],
    "answer": ["Leaves change color due to chlorophyll breakdown."],
    "contexts": [["Leaves turn color in autumn due to changes in daylight."]],
    "ground_truth": ["Leaves change color because chlorophyll fades."]
})

# Evaluate
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, context_precision, context_recall, answer_relevancy]
)
```
It uses LLM-as-a-judge under the hood, sending structured prompts to models like GPT-4, and supports multiple evaluation models including local models.
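Because the scores are LLM-judged, the choice of judge model matters. The sketch below shows how you might point the evaluation at a specific judge and embedding model; it assumes a recent Ragas release that exposes `LangchainLLMWrapper` and `LangchainEmbeddingsWrapper`, and the model names are illustrative placeholders rather than recommendations:

```python
# Hedged sketch: configure the judge LLM and embeddings explicitly.
# Assumes a recent Ragas version with LangchainLLMWrapper / LangchainEmbeddingsWrapper;
# the model names below are illustrative placeholders.
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import faithfulness, answer_relevancy

judge_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
judge_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

results = evaluate(
    dataset=dataset,               # the Dataset built in the snippet above
    metrics=[faithfulness, answer_relevancy],
    llm=judge_llm,                 # LLM used as the judge
    embeddings=judge_embeddings,   # embeddings used by answer_relevancy
)
print(results)
```

A locally hosted model can be swapped in the same way, as long as it is wrapped in a LangChain-compatible interface.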
DeepEval
DeepEval is a broader LLM quality framework. Think of it as pytest for LLMs, allowing you to:
- Create `LLMTestCase` objects with `input`, `actual_output`, `expected_output`, and `retrieval_context`.
- Use `.measure()` to get scores, or `assert_test()` to enforce quality thresholds in CI/CD (a pytest-style sketch appears later in this section).
It supports metrics like:
- FaithfulnessMetric
- ContextualPrecisionMetric
- ContextualRecallMetric
- AnswerRelevancyMetric
- GEval (custom, criteria-based evaluation)
```python
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, ContextualPrecisionMetric, AnswerRelevancyMetric, ContextualRecallMetric

# Create test case
test_case = LLMTestCase(
    input="Your question",
    actual_output="LLM response",
    expected_output="Expected response",
    retrieval_context=["context1", "context2"]
)

# Measure individual metrics; each metric exposes .score (and .reason) after measure()
faithfulness = FaithfulnessMetric()
faithfulness.measure(test_case)

context_precision = ContextualPrecisionMetric()
context_precision.measure(test_case)

answer_relevancy = AnswerRelevancyMetric()
answer_relevancy.measure(test_case)

contextual_recall = ContextualRecallMetric()
contextual_recall.measure(test_case)
```
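The `GEval` metric from the list above deserves a quick illustration: it lets you describe your own evaluation criteria in plain language and have the judge LLM score against them. Here is a minimal sketch; the criteria wording and threshold are illustrative, not taken from the experiments in this post:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# Illustrative custom criterion; wording and threshold are placeholders
correctness = GEval(
    name="Correctness",
    criteria="Check whether the actual output is factually consistent with the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.7,
)

correctness.measure(test_case)  # test_case from the snippet above
print(correctness.score, correctness.reason)
```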
Note: DeepEval also supports RAGAS metrics through its `RagasMetric` class, which bundles four RAGAS metrics (answer relevancy, faithfulness, contextual precision, and contextual recall).
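To close the loop on the CI/CD point above, here is a hedged sketch of what a pytest-style check might look like. The thresholds and example strings are illustrative, and exact import paths can shift between DeepEval versions:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric

def test_rag_answer_quality():
    # Illustrative test case; in practice, feed in outputs from your own pipeline
    test_case = LLMTestCase(
        input="Why do leaves change color in the fall?",
        actual_output="Chlorophyll breaks down as days shorten, revealing other pigments.",
        retrieval_context=["Leaves turn color in autumn due to changes in daylight and temperature."],
    )
    # assert_test fails the test if any metric scores below its threshold
    assert_test(test_case, [
        FaithfulnessMetric(threshold=0.7),
        AnswerRelevancyMetric(threshold=0.7),
    ])
```

Run it with pytest or DeepEval's own test runner so that regressions in answer quality fail the build like any other test.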
Key Example Questions & Responses
Both tools were tested on the same four example scenarios, illustrating typical, incomplete, and incorrect uses of context. Note: The evaluation scores below are only meaningful when considered with the specific context and expected outputs provided.
Example | Question | Response | Context | Reference |
---|---|---|---|---|
1 | Why do leaves change color in the fall? | Leaves change color in autumn because shorter days and cooler temperatures cause chlorophyll to break down, revealing other pigments. | • “Leaves turn color in autumn due to changes in daylight and temperature.” • “Photosynthesis slows and chlorophyll breaks down, revealing red and yellow pigments.” • “Some tree species have different pigment compounds that show up prominently in fall.” | Leaves change color in autumn because chlorophyll breaks down due to shorter days and cooler weather. |
2 | Why do we have seasons on Earth? | Seasons occur because the Earth’s axis is tilted, which changes how sunlight hits different parts of the planet throughout the year. | • “The tilt of Earth’s axis relative to its orbit causes different parts of Earth to receive varying amounts of sunlight throughout the year.” • “This axial tilt is approximately 23.5 degrees.” • “Changes in distance from the sun during the orbit do not cause seasons.” | Earth has seasons because its axis is tilted relative to its orbital plane, causing varying sunlight. |
3 | Why do we have seasons on Earth? (limited context) | We have seasons because Earth moves around the Sun. | • “Earth revolves around the Sun in an elliptical orbit.” • “There are four main seasons: spring, summer, autumn, and winter.” | Earth has seasons because its axis is tilted relative to its orbit, changing sunlight. |
4 | Why do leaves change color in the fall? (unrelated context) | Leaves change color because fish swim upstream in autumn. | • “Fish migrate upstream to spawn in freshwater rivers.” • “This migration ensures the survival of the next generation.” | Leaves change color due to breakdown of chlorophyll in cooler temperatures. |
Metric Results Overview
Below are the aggregated scores each tool produced for the four examples. Higher numbers are better (max = 1.000).
Ragas Scores
Example | Faithfulness | Context Precision | Context Recall | Answer Relevancy |
---|---|---|---|---|
1 | 1.000 | 1.000 | 1.000 | 0.987 |
2 | 1.000 | 1.000 | 1.000 | 0.970 |
3 | 1.000 | 0.000 | 0.000 | 0.974 |
4 | 0.000 | 0.000 | 0.000 | 0.000 |
DeepEval Scores
Example | Faithfulness | Context Precision | Answer Relevancy | Contextual Recall |
---|---|---|---|---|
1 | 1.000 | 1.000 | 1.000 | 1.000 |
2 | 1.000 | 1.000 | 1.000 | 1.000 |
3 | 1.000 | 0.000 | 1.000 | 0.000 |
4 | 0.000 | 0.000 | 0.000 | 0.000 |
Overall Assessment
On these four examples, Ragas and DeepEval produced consistent, sensible scores. Both tools correctly identified:
- High-quality responses with relevant context
- Responses that ignore or misuse provided context
- Completely incorrect or hallucinated responses
Biases to watch out for
Bias name | What it means |
---|---|
First case bias | The LLM judge may favor the first listed option. |
Self-evaluation bias (LLM evaluator bias) | Tends to prefer outputs matching its own typical style. |
Alignment bias (style calibration bias) | Rewards verbose, hedged, balanced language matching its training. |
Conclusion
Evaluating LLM outputs systematically is critical for building reliable AI applications. Based on the analysis above, here are practical recommendations for choosing between Ragas and DeepEval:
Choose Ragas if:
- Research-focused: You need detailed RAG-specific metrics with academic rigor
- Batch evaluation: You have large datasets and want to evaluate them all at once
- Custom embeddings: You want to use specific embedding models for answer relevancy
- Academic publishing: You need reproducible results with standardized metrics
Choose DeepEval if:
- Production workflows: You need CI/CD integration with `assert_test()` and pytest
- Iterative development: You want to test individual cases and get immediate feedback
- Cost optimization: You need to run evaluations multiple times with caching
- Debugging: You want detailed reasoning for why metrics failed
When to Use Both:
Consider using both tools in complementary ways: Ragas for initial dataset evaluation and DeepEval for ongoing monitoring and CI/CD workflows.
By understanding these trade-offs and the specific context of your evaluation needs, you can make an informed decision that balances technical rigor with practical implementation requirements.