How to Build a Self-Testing AI System with LlamaIndex and OpenAI Using Retrieval, Instrumentation, and Automated Quality Testing

In this tutorial, we build an advanced agentic workflow using LlamaIndex and an OpenAI model. We focus on building a reliable retrieval-augmented generation (RAG) agent that can consult evidence, use its tools deliberately, and evaluate its own results. By structuring the system around retrieval, answer synthesis, and self-evaluation, we show how agentic patterns go beyond simple chatbots toward more reliable, controllable AI systems suitable for research and analytics applications.

!pip -q install -U llama-index llama-index-llms-openai llama-index-embeddings-openai nest_asyncio


import os
import asyncio
import nest_asyncio
nest_asyncio.apply()


from getpass import getpass


if not os.environ.get("OPENAI_API_KEY"):
   os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY: ")

We set up the environment and install the dependencies needed to run the agentic workflow. We load the OpenAI API key securely at runtime so that it is never hard-coded, and we patch the notebook's event loop with nest_asyncio so asynchronous operations run smoothly.

from llama_index.core import Document, VectorStoreIndex, Settings
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding


Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0.2)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")


texts = [
   "Reliable RAG systems separate retrieval, synthesis, and verification. Common failures include hallucination and shallow retrieval.",
   "RAG evaluation focuses on faithfulness, answer relevancy, and retrieval quality.",
   "Tool-using agents require constrained tools, validation, and self-review loops.",
   "A robust workflow follows retrieve, answer, evaluate, and revise steps."
]


docs = [Document(text=t) for t in texts]
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine(similarity_top_k=4)

We configure the OpenAI language model and embedding model and build a small knowledge base for the agent. We convert raw text into indexed documents so the agent can retrieve relevant evidence when answering questions.
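Before wiring the index into an agent, it can help to sanity-check retrieval directly. The snippet below is a minimal sketch; the probe question is only an illustrative example and is not part of the main workflow.

# Quick sanity check of the index (illustrative query)
probe = query_engine.query("What are common failure modes of RAG systems?")
print(probe)  # synthesized answer
for i, n in enumerate(probe.source_nodes):
    print(f"[{i+1}]", n.node.get_content()[:120])  # retrieved snippets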

from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator


faith_eval = FaithfulnessEvaluator(llm=Settings.llm)
rel_eval = RelevancyEvaluator(llm=Settings.llm)


def retrieve_evidence(q: str) -> str:
    """Retrieve supporting passages from the knowledge base for a question."""
    r = query_engine.query(q)
    out = []
    for i, n in enumerate(r.source_nodes or []):
        out.append(f"[{i+1}] {n.node.get_content()[:300]}")
    return "\n".join(out)


def score_answer(q: str, a: str) -> str:
    """Score an answer for faithfulness and relevancy against retrieved context."""
    resp = query_engine.query(q)
    ctx = [n.node.get_content() for n in resp.source_nodes or []]
    f = faith_eval.evaluate(query=q, response=a, contexts=ctx)
    rel = rel_eval.evaluate(query=q, response=a, contexts=ctx)
    return f"Faithfulness: {f.score}\nRelevancy: {rel.score}"

We define the agent's two main tools: evidence retrieval and answer scoring. Automated faithfulness and relevancy scoring lets the agent judge the quality of its own responses.
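It can also be useful to exercise the two tools on their own before handing them to the agent. The sketch below does exactly that; the question and draft answer are hypothetical examples chosen for illustration.

# Exercise the tools standalone (hypothetical question and draft answer)
question = "How should a RAG workflow be evaluated?"
print(retrieve_evidence(question))

draft_answer = "Evaluate faithfulness, answer relevancy, and retrieval quality."
print(score_answer(question, draft_answer))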

from llama_index.core.agent.workflow import ReActAgent
from llama_index.core.workflow import Context


agent = ReActAgent(
   tools=[retrieve_evidence, score_answer],
   llm=Settings.llm,
   system_prompt="""
Always retrieve evidence first.
Produce a structured answer.
Evaluate the answer and revise once if scores are low.
""",
   verbose=True
)


ctx = Context(agent)

We build a ReAct-based agent and specify its system behavior: retrieve evidence, generate a structured answer, and evaluate the result. We also create a workflow Context that maintains the agent's state across interactions. This step combines the tools and the reasoning logic into a single agent workflow.
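As an alternative to passing plain callables, the same functions can be wrapped explicitly with FunctionTool, which lets us attach names and descriptions the agent can read. This is a minimal sketch; the names and descriptions are illustrative choices, not part of the original setup.

# Optional: wrap the callables as explicit tool objects
from llama_index.core.tools import FunctionTool

retrieve_tool = FunctionTool.from_defaults(
    fn=retrieve_evidence,
    name="retrieve_evidence",
    description="Fetch supporting passages from the knowledge base for a question.",
)
score_tool = FunctionTool.from_defaults(
    fn=score_answer,
    name="score_answer",
    description="Score an answer for faithfulness and relevancy against retrieved context.",
)

agent_with_tools = ReActAgent(
    tools=[retrieve_tool, score_tool],
    llm=Settings.llm,
    system_prompt="Always retrieve evidence first, answer, then evaluate and revise once if scores are low.",
    verbose=True,
)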

async def run_brief(topic: str):
   q = f"Design a reliable RAG + tool-using agent workflow and how to evaluate it. Topic: {topic}"
   handler = agent.run(q, ctx=ctx)
   async for ev in handler.stream_events():
       print(getattr(ev, "delta", ""), end="")
   res = await handler
   return str(res)


topic = "RAG agent reliability and evaluation"
loop = asyncio.get_event_loop()
result = loop.run_until_complete(run_brief(topic))


print("nnFINAL OUTPUTn")
print(result)

We run the complete agent loop by passing a topic through the system and streaming the agent's reasoning and output as it works. The agent carries out its full cycle of retrieval, answer generation, and evaluation end to end.
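Outside a notebook, where no event loop is already running, the same coroutine can be driven with asyncio.run instead of the nest_asyncio-based loop handling above. This is a minimal sketch of that variant, reusing the same illustrative topic.

# In a plain Python script, asyncio.run can replace the notebook-style loop handling
if __name__ == "__main__":
    result = asyncio.run(run_brief("RAG agent reliability and evaluation"))
    print("\n\nFINAL OUTPUT\n")
    print(result)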

In conclusion, we have shown how an agent can retrieve supporting evidence, generate a structured answer, and evaluate its faithfulness and relevancy before finalizing a response. We kept the design modular and transparent, making it easy to extend the workflow with additional tools, evaluators, or domain-specific information sources, as sketched below. This approach shows how agentic AI built with LlamaIndex and OpenAI models can produce capable, reliable systems that check their own reasoning and responses.
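As one example of such an extension, an additional tool can simply be appended to the agent's tool list. The word_count helper below is hypothetical and purely illustrative; any domain-specific check could take its place.

# Hypothetical extra tool: report the length of a draft answer
def word_count(text: str) -> str:
    """Return the number of words in a draft answer."""
    return f"Word count: {len(text.split())}"

extended_agent = ReActAgent(
    tools=[retrieve_evidence, score_answer, word_count],
    llm=Settings.llm,
    system_prompt="Retrieve evidence, answer, evaluate, and keep the final answer concise.",
    verbose=True,
)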


