The Power of Tool Orchestration
“Which Texas microtransit services reached new performance milestones in the recent quarter, and who are the key partners involved?”
Via is a global transit-tech company with development, operations, partnerships, and analytics teams spread across countries and continents. System design, service performance, and partnership data are created by different teams and managed across diverse, dedicated systems. Answering the question above requires filtering by service type, geographic lookup, quarterly trend analysis, comparative ranking, and partner-account mapping. No single database query or API call can answer it, but an AI agent with the right tools can, in under a minute!
Internal business teams manage interaction with existing and potential partners. These interactions often raise questions that require combining data from multiple sources in ways that the sources weren’t designed for. Traditionally, this means navigating several analytics tools, exporting CSVs, and manually cross-referencing - a slow and error-prone process.
The agentic approach is fundamentally different: the agent interprets the user’s intent, selects the right tools, and synthesizes results into a single response. The value is not in any one tool, but in the agent’s ability to orchestrate tools intelligently with routing guardrails and multi-step execution. This architecture is built to scale, so new tools and skills can be added without redesigning the whole system.
To make this difference concrete, let’s compare how a typical question is answered using traditional analytics versus an agentic approach.
Tools + Reasoning = Intelligence
We built our internal agentic AI assistant using LangChain for agent orchestration, AWS Bedrock for the LLM, and many specialized tools for accessing Via transportation data and relevant external sources. This is more than a single LLM prompt: the agent can route queries, call tools, execute multi-step analysis, and adapt across conversation context, with user feedback captured for iterative improvement.
Our tool registry includes specialized capabilities:
| Category | Purpose | Example Query |
|---|---|---|
| Service Design | Describe transportation service attributes and configuration | “Which microtransit services operate in Texas?” |
| Performance Metrics | Retrieve operational metrics and trends | “Show me ridership trends for 2025/Q4” |
| Geospatial | Location-based filtering and proximity search | “Which services operate within 20 miles of Austin, TX?” |
| External Data | Integrate benchmarks and third-party datasets | “How do these services compare to NTD averages?” |
The anti-pattern in tool design is creating one giant get_everything() tool with 50 parameters. Instead, we built a set of focused tools that an agent can chain. This approach offers key advantages:
- The agent can reason about which specific tools to use
- New combinations emerge without requiring new tools
- Prevents context pollution: focused tools return only relevant data, keeping the agent’s working context clean
- Reduces inference cost: smaller, targeted responses mean fewer tokens and faster processing
- Each tool is easier to maintain and test independently
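As a rough illustration of this design, here is a minimal sketch of two focused tools. The names, fields, and in-memory catalog are invented for the example and stand in for Via's real data systems:

```python
from dataclasses import dataclass

# Hypothetical sketch of two focused, composable tools. In production these
# would query internal systems; here a tiny in-memory catalog stands in.

@dataclass
class Service:
    name: str
    state: str
    service_type: str

_CATALOG = [
    Service("Service A", "TX", "microtransit"),
    Service("Service B", "TX", "paratransit"),
    Service("Service C", "NJ", "microtransit"),
]

def find_services(state: str, service_type: str) -> list[Service]:
    """Focused lookup: filter the catalog by state and service type only."""
    return [s for s in _CATALOG
            if s.state == state and s.service_type == service_type]

def get_kpis(service_name: str, quarter: str) -> dict:
    """Focused metrics call: return only the KPIs for one service and quarter."""
    # Placeholder payload; a real tool would fetch from the metrics store.
    return {"service": service_name, "quarter": quarter, "utilization": 0.0}
```

Neither tool knows the whole workflow; the agent chains `find_services` and `get_kpis` to answer a question like the Texas one, and new combinations fall out for free.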
The agent’s job is to understand intent, plan execution, chain tools intelligently, synthesize results, and explain its reasoning. These capabilities are what distinguish an agent from a thin LLM wrapper.
Real example 1: Multi-Step Analysis
Let's see how the agent handles a complex request: "Which microtransit services in Texas saw the most significant performance gains in the recent quarter?"
The agent interprets “performance gains” as a comparative question and understands that it first needs to find the relevant services (Texas + microtransit). It correctly calls a dedicated tool to retrieve them. Next, it selects an appropriate comparison window (quarter) and ranks results by KPI deltas, calling a dedicated KPI-retrieval tool. The tool defines the available KPIs, and the agent decides which to use and how to interpret the changes.
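The final ranking step could be sketched like this; the service names, KPI values, and choice of relative change as the metric are assumptions for illustration only:

```python
# Illustrative sketch of ranking services by KPI delta between two quarters.
# Assumes the previous-quarter value is non-zero.

def rank_by_gain(kpis: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """kpis maps service name -> (previous_quarter_value, current_quarter_value).

    Returns services sorted by relative gain, largest first.
    """
    deltas = {
        name: (curr - prev) / prev  # relative change vs. the previous quarter
        for name, (prev, curr) in kpis.items()
    }
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

# Made-up numbers: Service A improved 25%, Service B improved 10%.
ranked = rank_by_gain({
    "Service A": (4.0, 5.0),
    "Service B": (3.0, 3.3),
})
```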
Real example 2: Discovery
Let’s ask an open-ended exploratory question - "What is the common feature for top utilization TaaS services?"
*TaaS stands for transportation-as-a-service: services where Via provides turnkey solutions including management of drivers, vehicles, and other operational aspects.
The agent:
- Retrieves TaaS services
- And their Q4 2025 utilization metrics
- Identifies the top four performers
- Analyzes their attributes to extract commonalities
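The last step above, extracting commonalities, can be sketched as a simple frequency count over service attributes. The attribute names and values here are invented for the example:

```python
from collections import Counter

# Hedged sketch of the "extract commonalities" step: count attribute values
# across the top performers and keep those shared by most of them.

top_services = [
    {"mode": "campus shuttle", "area": "small", "demand": "concentrated"},
    {"mode": "campus shuttle", "area": "small", "demand": "concentrated"},
    {"mode": "campus shuttle", "area": "small", "demand": "mixed"},
    {"mode": "paratransit",    "area": "large", "demand": "concentrated"},
]

def common_features(services: list[dict], threshold: float = 0.75) -> set[str]:
    """Return attribute=value pairs shared by at least `threshold` of services."""
    counts = Counter(
        (key, value) for s in services for key, value in s.items()
    )
    cutoff = threshold * len(services)
    return {f"{k}={v}" for (k, v), n in counts.items() if n >= cutoff}
```

In practice the agent performs this kind of synthesis in its reasoning rather than via explicit code, but the logic is the same: surface what the top performers have in common.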
We discover that university campus shuttles achieve higher utilization than typical microtransit services because they serve concentrated demand in small geographic areas with predictable travel patterns. This insight required analysis across multiple data sources!
The discovery can be improved with conversation mode, which maintains context across follow-up questions. This mirrors how a human analyst works and removes the friction of repeatedly reconfiguring dashboards or re-running queries.
Development Lessons: Making Agents Reliable
Building an agent that works in production requires addressing several key challenges.
Challenge 1: Agent Loops
Problem: The agent repeatedly invokes the same tool with identical parameters, unnecessarily consuming tokens, increasing latency, and frustrating users.
Solution: We implemented a layered approach to prevent loops:
- Limited tool results: Each tool returns only essential, structured data (not full datasets), keeping context clean and consistent.
- High-Reasoning Models: We leverage models such as Claude Sonnet to enable the agent to 'self-reflect' on its execution plan. Unlike smaller models prone to repetitive logic, Sonnet more reliably identifies when a task is complete or requires a different tool.
- Recursion limits: As a safety net, we return structured empty results and enforce a maximum number of reasoning steps, forcing termination if the other safeguards fail.
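The first two safeguards depend on the tools and the model; the safety net can be sketched as a small guard that rejects duplicate calls and enforces a step budget. Names and defaults below are illustrative, not our production code:

```python
# Minimal sketch of loop safeguards: deduplicate identical tool calls and
# cap the total number of reasoning steps.

class LoopGuard:
    def __init__(self, max_steps: int = 10):
        self.max_steps = max_steps
        self.steps = 0
        self.seen: set[tuple] = set()

    def allow(self, tool_name: str, params: dict) -> bool:
        """Return False if the step budget is spent or this exact call repeats."""
        self.steps += 1
        if self.steps > self.max_steps:
            return False  # force termination
        key = (tool_name, tuple(sorted(params.items())))
        if key in self.seen:
            return False  # identical call already made
        self.seen.add(key)
        return True
```

When `allow` returns `False`, the orchestration layer can feed the agent a structured "stop" result instead of executing the tool again.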
Challenge 2: User Experience & Performance
Problem: Queries to analytics databases can take 1–2 minutes to complete. Users need fast, consistent response times and visibility into what the agent is doing in order to maintain confidence.
Solution: We implemented multiple optimizations to reduce latency and keep users engaged:
- Performance optimization: We reduced latency through agent instance caching, data pre-warming, query result caching, and efficient conversation state management via LangGraph’s checkpointer.
- Progress transparency: The UI streams real-time updates as the agent reasons and executes tools, giving users visibility into what’s happening. Long-running queries run asynchronously with status updating to maintain responsiveness.
These optimizations enable the system to provide sub-two-minute response times while keeping users informed throughout the process.
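As one concrete piece, query-result caching can be sketched as a small time-to-live cache. The class name, API, and 5-minute default are assumptions for illustration, not Via's actual implementation:

```python
import time

# Illustrative TTL cache for analytics query results: repeated questions
# within the TTL are answered from memory instead of re-querying the database.

class QueryCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired; drop it
            return None
        return value

    def put(self, key: str, value) -> None:
        self._store[key] = (time.monotonic(), value)
```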
Challenge 3: Empty Tool Results
Problem: The agent expects tool results as structured input for reasoning. When a tool encounters missing data or unavailable resources (e.g. database connection failure), errors must be handled gracefully to ensure valid input for the agent.
Solution: We built tools to return structured responses with clear status indicators, even when no matching data is found. This prevents tool-call loops, maintains system functionality, and clearly communicates temporary unavailability to users.
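A sketch of this envelope pattern, with invented field names and statuses, might look like:

```python
# Every tool result carries a status plus data, so the agent never receives a
# raw exception or bare empty output it might misinterpret.

def run_tool(query_fn, *args):
    try:
        rows = query_fn(*args)
    except ConnectionError:
        return {"status": "unavailable", "data": [],
                "message": "Data source temporarily unavailable; try again later."}
    if not rows:
        return {"status": "no_results", "data": [],
                "message": "No matching records found."}
    return {"status": "ok", "data": rows, "message": ""}
```

Because "no results" and "source down" are distinct, explicit statuses, the agent can stop retrying and explain the situation to the user instead of looping.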
Challenge 4: Hallucinations and Irrelevant Topics
Problem: LLMs can generate plausible-sounding but incorrect results, especially when asked about data they don't have access to or when extrapolating beyond tool results. As a result, out-of-scope prompts can lead to irrelevant answers and confuse users.
Solution: We implemented a two-layer defense against hallucinations:
Prevention through Prompt Engineering:
- Citation enforcement: Instructions to reference which tools provided the information
- Scope limitations: Clear boundaries on what data is available
- Data grounding requirement: "Only use data from tool results, never make up numbers, facts or service names"
User guidance:
- Adding example questions in the UI illustrates available capabilities and helps users become familiar with the tool.
Challenge 5: Context Limit
Context windows in modern LLMs are ever-increasing, but they remain a practical constraint in production systems. Tool calling adds input overhead: the model reads the tool definitions, and the context keeps growing with chat history, tool outputs, and system instructions. As workflows become more complex, this can increase latency, raise cost, and degrade output quality when too much intermediate data is retained, leading to the 'lost in the middle' phenomenon.
Given our current tool set and routing guardrails, the existing agent-to-tools architecture is close to the effective frontier for reliability, controllability, and implementation complexity. Looking ahead, we plan to evaluate a multi-agent design (for example, a Supervisor with specialized sub-agents) for workflows that are longer, more decomposable, or that benefit from parallel execution.
Testing Agentic Systems with “LLM as a Judge”
Unlike traditional software where a specific input always produces the exact same output, LLM-based agents are non-deterministic. The same question might be answered using different wording, different tool sequences, or slightly varied levels of detail each time:
- Wording: "12 services" vs. "There are twelve services."
- Structure: Bullet lists vs. numbered lists vs. paragraphs.
- Order: Services listed alphabetically vs. by launch date.
- Detail level: A minimal answer vs. a comprehensive explanation.
Semantically, these responses are equivalent. However, a naive string comparison (checking for an exact match) would fail these tests, even though the agent answered correctly.
Solution: Curated Regression Baselines + LLM Judge
To ensure our agent remains reliable as we add tools and update models, we supplemented our traditional test suite with a semantic evaluation layer. We utilize a separate, high-reasoning LLM to act as a "Judge" that compares the agent's output against a "Gold Standard" ground truth.
1. Establishing a Baseline
We curated a set of representative questions with verified answers that we have manually confirmed as correct and well-formatted. This serves as a constant baseline to evaluate newer versions of the agent.
2. The Evaluation Dimensions
The judge evaluates the output across two distinct dimensions:
- Semantic Similarity: Do both responses provide equivalent facts and conclusions? The judge ignores stylistic choices like wording, bullet points, or list order.
- Numerical Accuracy: The judge extracts numerical data and validates that values match within a specific tolerance (e.g., 5%), ensuring data integrity even if the prose changes.
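The numerical-accuracy dimension is the easier of the two to make deterministic. A sketch of such a check follows; the regex, the positional pairing of numbers, and the 5% default tolerance are illustrative assumptions:

```python
import re

# Extract numbers from the gold answer and the agent's answer, then compare
# them pairwise within a relative tolerance.

def numbers_match(expected: str, actual: str, tolerance: float = 0.05) -> bool:
    pattern = r"-?\d+(?:\.\d+)?"
    exp = [float(x) for x in re.findall(pattern, expected)]
    act = [float(x) for x in re.findall(pattern, actual)]
    if len(exp) != len(act):
        return False  # a number is missing or extra
    return all(
        abs(a - e) <= tolerance * max(abs(e), 1e-9)
        for e, a in zip(exp, act)
    )
```

A real judge would let the LLM align numbers semantically rather than positionally, but a deterministic check like this is a cheap first gate.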
3. Scoring Guidelines
The judge returns a structured score based on the following rubric:
| Score | Status | Description |
|---|---|---|
| 0.8 – 1.0 | PASS | Identical core information; all key facts present; numbers within tolerance. |
| 0.6 – 0.8 | MARGINAL | Similar information but missing context or containing minor numerical drift. |
| Below 0.6 | FAIL | Incorrect information, significant numerical errors, or hallucinations. |
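Mapping the judge's score to a verdict is then a one-liner per band. Since the rubric's bands share endpoints, this sketch assumes each band's lower bound is inclusive:

```python
# Map the judge's numeric score to a test verdict per the rubric above.

def verdict(score: float) -> str:
    if score >= 0.8:
        return "PASS"
    if score >= 0.6:
        return "MARGINAL"
    return "FAIL"
```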
4. Testing the Agentic Process
Beyond the final text, our regression tests validate the "internal" health of the agent:
- Tool Orchestration: Did the agent call the expected tools in the correct sequence?
- Efficiency: Did it avoid unnecessary loops or redundant tool calls?
- Latency: Did execution time remain within acceptable limits?
- State Management: Did it maintain context correctly across multi-turn conversations?
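Checks like these can run against the agent's execution trace. The following sketch assumes a simple trace format (a list of steps with a tool name and latency) that is an invention for this example:

```python
# Illustrative regression check over an agent trace: verify the tool sequence,
# flag redundant calls, and enforce a latency budget. Returns failure messages.

def check_trace(trace: list[dict], expected_tools: list[str],
                max_latency_s: float = 120.0) -> list[str]:
    failures = []
    called = [step["tool"] for step in trace]
    if called != expected_tools:
        failures.append(f"tool sequence {called} != expected {expected_tools}")
    if len(set(called)) != len(called):
        failures.append("redundant tool calls detected")
    total = sum(step["latency_s"] for step in trace)
    if total > max_latency_s:
        failures.append(f"total latency {total:.1f}s exceeds budget")
    return failures
```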
Key Takeaways
We built this agentic AI assistant to address a core business challenge: internal teams need to quickly and easily find information about our services, but the information is fragmented across many specialized, distributed systems. By orchestrating specialized tools through an AI agent, we deliver comprehensive answers in one place.
Agentic systems don’t replace dashboards - they reduce the friction around them. What once required hours of manual navigation, data exports, and report synthesis can now be achieved in a conversation. As a result, operations managers can more easily investigate performance, identify issues, and generate insights without relying on the data team.
Key Design Principles:
- Focused, composable and structured tools
- Strong reasoning LLM
- User experience that is focused on ease of use and functionality
- Guardrails for prompts, execution flow and responses
- LLM-as-judge testing using semantic similarity
Domain-specific AI assistants are within reach for most engineering teams. You don't need massive infrastructure or custom models. Start small: identify recurring business questions, build 3-5 focused tools, use a strong reasoning LLM, test with real users, and iterate.
Our roadmap focuses on expanding capabilities while maintaining reliability:
- Expanding tool coverage to allow reasoning over additional data sources
- Proactive insights where the agent suggests analyses based on detected trends
- Multi-modal outputs including charts, maps, and visualizations
We employ an agent-to-tools structure today because it delivers the best reliability-to-complexity tradeoff, and we plan to move to a multi-agent design only when context growth and workflow complexity make specialization and parallelism clearly beneficial.
Looking ahead, we see a path toward a unified AI ecosystem. This might include bridging our structured data agent with the existing RAG-based Slack-bot, allowing users to query both live operational data and static company documentation in a single conversation.
This blog post describes Via's internal AI assistant for transportation data analysis. The patterns and lessons are applicable to other domain-specific AI assistant projects.
City Performance Director