This is a beginner-friendly blog post for anyone exploring tools to evaluate RAG pipelines, summarization systems, or general LLM-driven applications. Throughout this post, you’ll find practical explanations and examples using Ragas and DeepEval, focusing on how they handle metrics like faithfulness, contextual precision, and answer relevancy. If you’re new to these concepts or just starting to assess your LLM outputs, this guide will give you a clear foundation. For a deeper dive, including installation, advanced configurations, and evolving features, be sure to explore each project’s official website and GitHub documentation.


Why evaluate LLM outputs at all?

We’ve all been using ChatGPT, Gemini, or other LLMs to help with various tasks professionally and personally. When you integrate an LLM into your own applications, like a chatbot, summarizer, or multi-step agent, you can’t just rely on gut feel that it works as expected.

You need to systematically measure things like:

  • Faithfulness: is it hallucinating, or grounded in the evidence you provided?
  • Contextual precision: does it only pull what’s needed from your docs?
  • Answer relevancy: is it even addressing the user’s question?

That’s where specialized LLM evaluation frameworks come in.

For this post, we’ll be considering two tools to assess exactly these qualities: Ragas and DeepEval.


Ragas vs DeepEval

Ragas

Ragas is a Python library designed mainly for evaluating RAG (retrieval-augmented generation) systems. It offers metrics like:

  • faithfulness - Measures if the generated answer is faithful to the provided context
  • context_precision - Evaluates how much of the retrieved context is relevant to the question
  • context_recall - Measures how much of the relevant context was actually retrieved
  • answer_relevancy - Assesses if the answer is relevant to the user’s question
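
Here’s what a minimal evaluation run looks like (exact imports and dataset fields vary across Ragas releases, so check the docs for your version):
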
from ragas import evaluate
from ragas.metrics import faithfulness, context_precision, context_recall, answer_relevancy
from datasets import Dataset

# Prepare dataset (this is the complex part!)
dataset = Dataset.from_dict({
    "question": ["Why do leaves change color?"],
    "answer": ["Leaves change color due to chlorophyll breakdown."],
    "contexts": [["Leaves turn color in autumn due to changes in daylight."]],
    "ground_truth": ["Leaves change color because chlorophyll fades."]
})

# Run the evaluation (an LLM judge is called under the hood, so the default
# setup typically expects an OPENAI_API_KEY in the environment)
results = evaluate(
    dataset=dataset,
    metrics=[faithfulness, context_precision, context_recall, answer_relevancy]
)
print(results)  # dict-like summary of scores, e.g. {'faithfulness': 1.0, ...}

Under the hood, Ragas uses LLM-as-a-judge: it sends structured prompts to an evaluator model such as GPT-4, and you can swap in other evaluation models, including locally hosted ones.
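
For example, you can point the judge at a model of your choosing. The sketch below assumes a Ragas 0.1-style API plus the langchain_openai package; the wrapper classes and argument names may differ in newer releases:

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.metrics import faithfulness, answer_relevancy

# Wrap a LangChain chat model so Ragas can use it as the evaluator
judge_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
judge_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

results = evaluate(
    dataset=dataset,                         # the Dataset built earlier
    metrics=[faithfulness, answer_relevancy],
    llm=judge_llm,                           # evaluator model (LLM-as-a-judge)
    embeddings=judge_embeddings,             # used by answer_relevancy
)
print(results)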

DeepEval

DeepEval is a broader LLM quality framework. Think of it as pytest for LLMs, allowing you to:

  • Create LLMTestCase objects with input, actual_output, expected_output, and retrieval_context.
  • Use .measure() to get scores, or assert_test() to enforce quality thresholds in CI/CD (a pytest-style example is sketched further below).

It supports metrics like:

  • FaithfulnessMetric
  • ContextualPrecisionMetric
  • ContextualRecallMetric
  • AnswerRelevancyMetric
  • GEval (score any criterion you describe in natural language)
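
A minimal example using the metric classes directly (import paths below reflect recent DeepEval releases and may differ in yours):
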
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    AnswerRelevancyMetric,
)

# Create a test case describing one interaction with your LLM app
test_case = LLMTestCase(
    input="Your question",
    actual_output="LLM response",
    expected_output="Expected response",        # required by the contextual metrics
    retrieval_context=["context1", "context2"]
)

# Measure individual metrics; after .measure(), each metric exposes .score and .reason
for metric in (
    FaithfulnessMetric(),
    ContextualPrecisionMetric(),
    ContextualRecallMetric(),
    AnswerRelevancyMetric(),
):
    metric.measure(test_case)
    print(type(metric).__name__, metric.score)
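
GEval deserves a quick illustration: you describe an evaluation criterion in natural language and the judge LLM scores it. The “Correctness” criterion below is our own example, not a built-in:

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

# A custom, natural-language criterion scored by the judge LLM
correctness = GEval(
    name="Correctness",
    criteria="Is the actual output factually consistent with the expected output?",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)
correctness.measure(test_case)  # reuses the test_case defined above
print(correctness.score, correctness.reason)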

Note: DeepEval also supports RAGAS metrics through their RagasMetric class, which provides four RAGAS metrics (answer relevancy, faithfulness, contextual precision, and contextual recall).
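
To wire this into CI/CD, you write ordinary pytest-style tests and let assert_test() fail the build when a metric drops below its threshold. A minimal sketch (the question, context, and 0.7 threshold are just illustrative):

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

def test_autumn_leaves():
    test_case = LLMTestCase(
        input="Why do leaves change color in the fall?",
        actual_output="Chlorophyll breaks down as days shorten, revealing other pigments.",
        retrieval_context=["Photosynthesis slows and chlorophyll breaks down in autumn."],
    )
    # Fails the pytest run (and your CI job) if either score falls below 0.7
    assert_test(test_case, [
        AnswerRelevancyMetric(threshold=0.7),
        FaithfulnessMetric(threshold=0.7),
    ])

Files written this way can be run with pytest or the deepeval test run CLI.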


Key Example Questions & Responses

Both tools were tested on the same four example scenarios, illustrating typical, incomplete, and incorrect uses of context. Note: The evaluation scores below are only meaningful when considered with the specific context and expected outputs provided.

Example 1
  Question: Why do leaves change color in the fall?
  Response: Leaves change color in autumn because shorter days and cooler temperatures cause chlorophyll to break down, revealing other pigments.
  Context:
    • “Leaves turn color in autumn due to changes in daylight and temperature.”
    • “Photosynthesis slows and chlorophyll breaks down, revealing red and yellow pigments.”
    • “Some tree species have different pigment compounds that show up prominently in fall.”
  Reference: Leaves change color in autumn because chlorophyll breaks down due to shorter days and cooler weather.

Example 2
  Question: Why do we have seasons on Earth?
  Response: Seasons occur because the Earth’s axis is tilted, which changes how sunlight hits different parts of the planet throughout the year.
  Context:
    • “The tilt of Earth’s axis relative to its orbit causes different parts of Earth to receive varying amounts of sunlight throughout the year.”
    • “This axial tilt is approximately 23.5 degrees.”
    • “Changes in distance from the sun during the orbit do not cause seasons.”
  Reference: Earth has seasons because its axis is tilted relative to its orbital plane, causing varying sunlight.

Example 3 (limited context)
  Question: Why do we have seasons on Earth?
  Response: We have seasons because Earth moves around the Sun.
  Context:
    • “Earth revolves around the Sun in an elliptical orbit.”
    • “There are four main seasons: spring, summer, autumn, and winter.”
  Reference: Earth has seasons because its axis is tilted relative to its orbit, changing sunlight.

Example 4 (unrelated context)
  Question: Why do leaves change color in the fall?
  Response: Leaves change color because fish swim upstream in autumn.
  Context:
    • “Fish migrate upstream to spawn in freshwater rivers.”
    • “This migration ensures the survival of the next generation.”
  Reference: Leaves change color due to breakdown of chlorophyll in cooler temperatures.

Metric Results Overview

Below are the aggregated scores each tool produced for the four examples. Higher numbers are better (max = 1.000).

Ragas Scores

Example   Faithfulness   Context Precision   Context Recall   Answer Relevancy
1         1.000          1.000               1.000            0.987
2         1.000          1.000               1.000            0.970
3         1.000          0.000               0.000            0.974
4         0.000          0.000               0.000            0.000

DeepEval Scores

Example   Faithfulness   Contextual Precision   Answer Relevancy   Contextual Recall
1         1.000          1.000                  1.000              1.000
2         1.000          1.000                  1.000              1.000
3         1.000          0.000                  1.000              0.000
4         0.000          0.000                  0.000              0.000

Overall Assessment

On these four examples, Ragas and DeepEval produce consistent, closely matching scores. Both tools reliably flag:

  • High-quality responses with relevant context
  • Responses that ignore or misuse provided context
  • Completely incorrect or hallucinated responses

Biases to watch out for

Because both frameworks rely on an LLM as the judge, their scores inherit that judge’s biases:

  • First case bias: the judge may favor the first option it is shown.
  • Self-evaluation bias (LLM evaluator bias): the judge tends to prefer outputs that match its own typical style.
  • Alignment bias (style calibration bias): the judge rewards verbose, hedged, balanced language that resembles its training data.

Conclusion

Evaluating LLM outputs systematically is critical for building reliable AI applications. Based on the analysis above, here are practical recommendations for choosing between Ragas and DeepEval:

Choose Ragas if:

  • Research-focused: You need detailed RAG-specific metrics with academic rigor
  • Batch evaluation: You have large datasets and want to evaluate them all at once
  • Custom embeddings: You want to use specific embedding models for answer relevancy
  • Academic publishing: You need reproducible results with standardized metrics

Choose DeepEval if:

  • Production workflows: You need CI/CD integration with assert_test() and pytest
  • Iterative development: You want to test individual cases and get immediate feedback
  • Cost optimization: You need to run evaluations multiple times with caching
  • Debugging: You want detailed reasoning for why metrics failed

When to Use Both:

Consider using both tools in complementary ways: Ragas for initial dataset evaluation and DeepEval for ongoing monitoring and CI/CD workflows.

By understanding these trade-offs and the specific context of your evaluation needs, you can make an informed decision that balances technical rigor with practical implementation requirements.

References & Resources