“Which Texas microtransit services reached new performance milestones in the recent quarter, and who are the key partners involved?”
Via is a global transit-tech company with development, operations, partnerships and analytics teams spread across countries and continents. System design, service performance and partnership data are created by different teams and managed across diverse, dedicated systems. Answering the question above requires filtering by service type, geographic lookup, quarterly trend analysis, comparative ranking, and partner accounts mapping. No single database query or API call can answer it, but an AI agent with the right tools can in under a minute!
Internal business teams manage interaction with existing and potential partners. These interactions often raise questions that require combining data from multiple sources in ways that the sources weren’t designed for. Traditionally, this means navigating several analytics tools, exporting CSVs, and manually cross-referencing - a slow and error-prone process.
The agentic approach is fundamentally different: the agent interprets the user’s intent, selects the right tools, and synthesizes results into a single response. The value is not in any one tool, but in the agent’s ability to orchestrate tools intelligently with routing guardrails and multi-step execution. This architecture is built to scale, so new tools and skills can be added without redesigning the whole system.
To make this difference concrete, let’s compare how a typical question is answered using traditional analytics versus an agentic approach.
We built our internal agentic AI assistant using LangChain for agent orchestration, AWS Bedrock for the LLM, and many specialized tools for accessing Via transportation data and relevant external sources. This is more than a single LLM prompt: the agent can route queries, call tools, execute multi-step analysis, and adapt across conversation context, with user feedback captured for iterative improvement.
Our tool registry includes specialized capabilities:
| Category | Purpose | Example Query |
|---|---|---|
| Service Design | Describe transportation service attributes and configuration | “Which microtransit services operate in Texas?” |
| Performance Metrics | Retrieve operational metrics and trends | “Show me ridership trends for 2025/Q4” |
| Geospatial | Location-based filtering and proximity search | “Which services operate within 20 miles of Austin, TX?” |
| External Data | Integrate benchmarks & third-party datasets | “How do these services compare to NTD averages?” |
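To make the registry idea concrete, here is a minimal sketch of two focused tools chained by hand. The tool names, fields, and sample data are illustrative, not Via's actual schema:

```python
# A minimal sketch of the "focused tools" idea: small, single-purpose
# functions an agent can chain, instead of one get_everything() tool.
# All names and data below are illustrative.

def find_services(service_type: str, state: str) -> list[dict]:
    """Service Design tool: filter services by type and state."""
    catalog = [
        {"name": "Arlington On-Demand", "type": "microtransit", "state": "TX"},
        {"name": "Jersey City Via", "type": "microtransit", "state": "NJ"},
    ]
    return [s for s in catalog
            if s["type"] == service_type and s["state"] == state]

def get_kpis(service_name: str, quarter: str) -> dict:
    """Performance Metrics tool: return KPIs for one service and quarter."""
    # In production this would query an analytics database.
    return {"service": service_name, "quarter": quarter, "utilization": 3.1}

# The agent sees a registry of narrow tools and decides which to call.
TOOL_REGISTRY = {
    "find_services": find_services,
    "get_kpis": get_kpis,
}

texas = TOOL_REGISTRY["find_services"]("microtransit", "TX")
kpis = [TOOL_REGISTRY["get_kpis"](s["name"], "2025/Q4") for s in texas]
```

Each tool stays simple enough to document in a short description the LLM can reason about, which is what makes intelligent chaining possible.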
The anti-pattern in tool design is creating one giant get_everything() tool with 50 parameters. Instead, we built a set of focused tools that an agent can chain. This approach offers key advantages:
The agent’s job is to understand intent, plan execution, chain tools intelligently, synthesize results, and explain its reasoning. These capabilities are what distinguish an agent from a thin LLM wrapper.
Let's see how the agent handles a complex request: "Which microtransit services in Texas saw the most significant performance gains in the recent quarter?"
The agent interprets “performance gains” as a comparative question, and understands that it first needs to find the relevant services (Texas + Microtransit). It correctly calls a dedicated tool to retrieve them. Next, it selects an appropriate comparison window (Quarter), and ranks results based on KPI deltas by calling a dedicated tool for KPIs retrieval. The tool defines available KPIs, and the agent decides which to use and how to interpret change.
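The plan the agent assembles for this question can be sketched as three chained steps. The services, KPI values, and quarters below are made up for illustration:

```python
# Hypothetical sketch of the agent's plan for "which Texas microtransit
# services saw the biggest performance gains": (1) find the relevant
# services, (2) fetch the KPI for two quarters, (3) rank by delta.
# Service names and numbers are illustrative.

def find_services(service_type, state):
    return ["Arlington On-Demand", "Fort Worth ZIPZONE"]

def get_kpi(service, quarter):
    data = {
        ("Arlington On-Demand", "Q3"): 2.4, ("Arlington On-Demand", "Q4"): 3.0,
        ("Fort Worth ZIPZONE", "Q3"): 2.8, ("Fort Worth ZIPZONE", "Q4"): 2.9,
    }
    return data[(service, quarter)]

def rank_by_gain(services, prev_q, curr_q):
    """Rank services by KPI delta between two quarters, largest first."""
    deltas = {s: get_kpi(s, curr_q) - get_kpi(s, prev_q) for s in services}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_by_gain(find_services("microtransit", "TX"), "Q3", "Q4")
```

The key point is that no single tool computes "performance gains"; the agent composes the comparison from the narrow tools it has.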
Let’s ask an open-ended exploratory question - "What is the common feature for top utilization TaaS services?"
*TaaS stands for transportation-as-a-service: services where Via provides turnkey solutions including management of drivers, vehicles, and other operational aspects.
The agent:
We discover that university campus shuttles achieve higher utilization than typical microtransit services because they serve concentrated demand in small geographic areas with predictable travel patterns. This insight required analysis across multiple data sources!
The discovery can be improved with conversation mode, which maintains context across follow-up questions. This mirrors how a human analyst works and removes the friction of repeatedly reconfiguring dashboards or re-running queries.
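A minimal sketch of that conversation mode, assuming a simple entity-carryover design (the structure is illustrative, not the production implementation):

```python
# Sketch of conversation mode: entities resolved in earlier turns are
# carried into follow-up questions, so the user never re-specifies filters.

class Conversation:
    def __init__(self):
        self.context = {}  # entities resolved in earlier turns

    def ask(self, question: str, **resolved_entities) -> dict:
        # Merge newly resolved entities over the remembered ones.
        self.context.update(resolved_entities)
        return {"question": question, "filters": dict(self.context)}

chat = Conversation()
chat.ask("Which microtransit services operate in Texas?",
         service_type="microtransit", state="TX")
# The follow-up still carries service_type and state from the first turn.
followup = chat.ask("How did their ridership change last quarter?")
```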
Building an agent that works in production requires addressing several key challenges.
Problem: The agent repeatedly invokes the same tool with identical parameters, unnecessarily consuming tokens, increasing latency, and frustrating users.
Solution: We implemented a layered approach to prevent loops:
Recursion limits: As a safety net, we enforce a maximum number of reasoning steps and return structured empty results to force termination if the other safeguards fail.
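Two of these safeguards can be sketched in a few lines: deduplicating identical tool calls, and a hard cap on reasoning steps. The loop driver below is a toy stand-in for the agent executor, not the real one:

```python
# Sketch of two loop safeguards: skip tool calls the agent has already
# made with identical parameters, and cap the number of reasoning steps.
# The "plan" and driver loop are illustrative.

MAX_STEPS = 10

def run_agent(plan):
    """plan: iterable of (tool_name, frozen_args) pairs the agent requests."""
    seen, results = set(), []
    for step, (tool, args) in enumerate(plan):
        if step >= MAX_STEPS:
            # Safety net: emit a structured result instead of looping forever.
            results.append({"status": "terminated", "reason": "max_steps"})
            break
        key = (tool, args)
        if key in seen:
            # Identical call already made - return its marker, don't re-run.
            results.append({"status": "skipped", "reason": "duplicate_call"})
            continue
        seen.add(key)
        results.append({"status": "ok", "tool": tool})
    return results

out = run_agent([("get_kpis", ("svc", "Q4")), ("get_kpis", ("svc", "Q4"))])
```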
Problem: Queries to analytics databases can take 1-2 minutes to complete. Users need fast, consistent response times, plus visibility into what the agent is doing, to maintain confidence in the system.
Solution: We implemented multiple optimizations to reduce latency and keep users engaged:
These optimizations enable the system to deliver sub-two-minute response times while keeping users informed throughout the process.
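Two of the mitigations can be sketched together: caching slow query results and streaming status updates so the user always sees progress. The function names and timings are illustrative:

```python
# Sketch of two latency mitigations: cache results of slow analytics
# queries, and emit status updates while work is in flight. Illustrative.

import time
from functools import lru_cache

@lru_cache(maxsize=128)
def slow_analytics_query(service: str, quarter: str) -> dict:
    time.sleep(0.05)  # stand-in for a 1-2 minute warehouse query
    return {"service": service, "quarter": quarter, "rides": 1200}

def answer(service, quarter, on_status=print):
    on_status(f"Fetching KPIs for {service} {quarter}...")
    result = slow_analytics_query(service, quarter)
    on_status("Synthesizing answer...")
    return result

events = []
first = answer("Arlington", "Q4", on_status=events.append)

t0 = time.perf_counter()
cached = answer("Arlington", "Q4", on_status=events.append)  # cache hit
elapsed = time.perf_counter() - t0
```

The status callback is where a production system would stream progress messages to the chat UI instead of appending to a list.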
Problem: The agent expects tool results as structured input for reasoning. When a tool encounters missing data or unavailable resources (e.g. database connection failure), errors must be handled gracefully to ensure valid input for the agent.
Solution: We built tools to return structured responses with clear status indicators, even when no matching data is found. This prevents tool-call loops, maintains system functionality, and clearly communicates temporary unavailability to users.
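The convention can be sketched as follows: every tool returns a status plus data, so the agent always receives valid, parseable input even on failure. The tool and its data are illustrative:

```python
# Sketch of the structured-response convention: a tool never raises into
# the agent loop; it reports "ok", "not_found", or "unavailable" instead.
# Tool name and data are illustrative.

def lookup_service(name: str, db_available: bool = True) -> dict:
    if not db_available:
        # e.g. database connection failure - report it, don't crash.
        return {"status": "unavailable",
                "message": "Analytics database temporarily unreachable."}
    known = {"Arlington On-Demand": {"state": "TX"}}
    if name not in known:
        # "No data" is a valid, explicit result - not an exception.
        return {"status": "not_found", "data": None}
    return {"status": "ok", "data": known[name]}

ok = lookup_service("Arlington On-Demand")
missing = lookup_service("Atlantis Shuttle")
down = lookup_service("Arlington On-Demand", db_available=False)
```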
Problem: LLMs can generate plausible-sounding but incorrect results, especially when asked about data they don't have access to or when extrapolating beyond tools results. As a result, out-of-scope prompts can lead to irrelevant answers and confuse users.
Solution: We implemented a two-layer defense against hallucinations:
Prevention through Prompt Engineering:
User guidance:
Context windows in modern LLMs keep growing, but they remain a practical constraint in production systems. Tool calling adds input overhead: the model must read every tool definition, and context keeps growing with chat history, tool outputs, and system instructions. As workflows become more complex, this increases latency, raises cost, and degrades output quality when too much intermediate data is retained, leading to the 'lost in the middle' phenomenon.
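One common mitigation is a context-budget policy: keep system instructions, then admit the most recent turns until a token budget is spent, dropping old bulky tool outputs first. The sketch below uses a crude word count in place of a real tokenizer and is not our production trimmer:

```python
# Sketch of a context-budget policy: always keep system messages, then
# keep the newest turns that fit the budget. Word count stands in for a
# real tokenizer; the policy itself is illustrative.

def trim_history(messages, budget=50):
    """messages: list of {'role', 'content'} dicts, oldest first."""
    tokens = lambda m: len(m["content"].split())
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept, used = [], sum(tokens(m) for m in system)
    for m in reversed(rest):  # newest first
        if used + tokens(m) > budget:
            break
        kept.append(m)
        used += tokens(m)
    return system + list(reversed(kept))

history = (
    [{"role": "system", "content": "You are a transit data assistant."}]
    + [{"role": "tool", "content": "row " * 30}]  # old bulky tool output
    + [{"role": "user", "content": "And for Q4?"}]
)
trimmed = trim_history(history, budget=20)
```

With a budget of 20 "tokens", the old 30-word tool output is dropped while the system prompt and the latest user turn survive.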
Given our current tool set and routing guardrails, the existing agent-to-tools architecture is close to the effective frontier for reliability, controllability, and implementation complexity. Looking ahead, we plan to evaluate a multi-agent design (for example, a supervisor with specialized sub-agents) for workflows that are longer, more decomposable, or that benefit from parallel execution.
Unlike traditional software where a specific input always produces the exact same output, LLM-based agents are non-deterministic. The same question might be answered using different wording, different tool sequences, or slightly varied levels of detail each time:
Semantically, these responses are equivalent. However, a naive string comparison (checking for an exact match) would fail these tests, even though the agent answered correctly.
To ensure our agent remains reliable as we add tools and update models, we supplemented our traditional test suite with a semantic evaluation layer. We utilize a separate, high-reasoning LLM to act as a "Judge" that compares the agent's output against a "Gold Standard" ground truth.
We curated a set of representative questions with verified answers that we have manually confirmed as correct and well-formatted. This serves as a constant baseline to evaluate newer versions of the agent.
The judge evaluates the output across two distinct dimensions:
The judge returns a structured score based on the following rubric:
| Score | Status | Description |
|---|---|---|
| 0.8 – 1.0 | PASS | Identical core information; all key facts present; numbers within tolerance. |
| 0.6 – 0.8 | MARGINAL | Similar information but missing context or containing minor numerical drift. |
| Below 0.6 | FAIL | Incorrect information, significant numerical errors, or hallucinations. |
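A sketch of the evaluation harness around that rubric: in production the judge is a separate high-reasoning LLM; the keyword-overlap judge below is only a stand-in so the scoring and grading flow is runnable:

```python
# Sketch of the LLM-as-judge harness: a judge returns a 0-1 score, which
# is mapped onto the PASS / MARGINAL / FAIL rubric. The toy judge here is
# a placeholder for the real LLM call.

def mock_judge(answer: str, gold: str) -> float:
    """Toy judge: fraction of gold-answer terms present in the answer."""
    gold_terms = set(gold.lower().split())
    hit = sum(1 for t in gold_terms if t in answer.lower())
    return hit / len(gold_terms)

def grade(score: float) -> str:
    if score >= 0.8:
        return "PASS"
    if score >= 0.6:
        return "MARGINAL"
    return "FAIL"

gold = "Arlington On-Demand had the largest utilization gain in Q4"
status = grade(mock_judge(
    "The largest Q4 utilization gain was in Arlington On-Demand", gold))
```

Because the judge scores meaning rather than exact strings, reworded but equivalent agent answers still pass, which is exactly what a naive string comparison gets wrong.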
Beyond the final text, our regression tests validate the "internal" health of the agent:
We built this agentic AI assistant to address a core business challenge: internal teams need to quickly and easily find information about our services, but the information is fragmented across many specialized, distributed systems. By orchestrating specialized tools through an AI agent, we deliver comprehensive answers in one place.
Agentic systems don’t replace dashboards - they reduce the friction around them. What once required hours of manual navigation, data exports, and synthesizing reports now can be achieved in a conversation. As a result, operations managers can more easily investigate performance, identify issues, and generate insights without relying on the data team.
Key Design Principles:
Domain-specific AI assistants are within reach for most engineering teams. You don't need massive infrastructure or custom models. Start small: identify recurring business questions, build 3-5 focused tools, use a strong reasoning LLM, test with real users, and iterate.
Our roadmap focuses on expanding capabilities while maintaining reliability:
We employ an agent-to-tools structure today because it delivers the best reliability-to-complexity tradeoff, and we will move to a multi-agent design only when context growth and workflow complexity make specialization and parallelism clearly beneficial.
Looking ahead, we see a path toward a unified AI ecosystem. This might include bridging our structured data agent with the existing RAG-based Slack-bot, allowing users to query both live operational data and static company documentation in a single conversation.
This blog post describes Via's internal AI assistant for transportation data analysis. The patterns and lessons are applicable to other domain-specific AI assistant projects.