The Power of Tool Orchestration
“Which Texas microtransit services reached new performance milestones in the recent quarter, and who are the key partners involved?”
Via is a global transit-tech company with development, operations, partnerships, and analytics teams spread across countries and continents. System design, service performance, and partnership data are created by different teams and managed across diverse, dedicated systems. Answering the question above requires filtering by service type, geographic lookup, quarterly trend analysis, comparative ranking, and partner-account mapping. No single database query or API call can answer it, but an AI agent with the right tools can, in under a minute!
Internal business teams manage interaction with existing and potential partners. These interactions often raise questions that require combining data from multiple sources in ways that the sources weren’t designed for. Traditionally, this means navigating several analytics tools, exporting CSVs, and manually cross-referencing - a slow and error-prone process.
The agentic approach is fundamentally different: the agent interprets the user’s intent, selects the right tools, and synthesizes results into a single response. The value is not in any one tool, but in the agent’s ability to orchestrate tools intelligently with routing guardrails and multi-step execution. This architecture is built to scale, so new tools and skills can be added without redesigning the whole system.
To make this difference concrete, let’s compare how a typical question is answered using traditional analytics versus an agentic approach.
Tools + Reasoning = Intelligence
We built our internal agentic AI assistant using LangChain for agent orchestration, AWS Bedrock for the LLM, and many specialized tools for accessing Via transportation data and relevant external sources. This is more than a single LLM prompt: the agent can route queries, call tools, execute multi-step analysis, and adapt across conversation context, with user feedback captured for iterative improvement.
Our tool registry includes specialized capabilities:
| Category | Purpose | Example Query |
|---|---|---|
| Service Design | Describe transportation service attributes and configuration | “Which microtransit services operate in Texas?” |
| Performance Metrics | Retrieve operational metrics and trends | “Show me ridership trends for 2025/Q4” |
| Geospatial | Location-based filtering and proximity search | “Which services operate within 20 miles of Austin, TX?” |
| External Data | Integrate benchmarks and third-party datasets | “How do these services compare to NTD averages?” |
The anti-pattern in tool design is creating one giant get_everything() tool with 50 parameters. Instead, we built a set of focused tools that an agent can chain. This approach offers key advantages:
- The agent can reason about which specific tools to use
- New combinations emerge without requiring new tools
- Prevents context pollution: focused tools return only relevant data, keeping the agent’s working context clean
- Reduces inference cost: smaller, targeted responses mean fewer tokens and faster processing
- Each tool is easier to maintain and test independently
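As a rough illustration of this design, here is a minimal sketch of two focused tools. The names, fields, and in-memory catalog are invented for the example and stand in for Via's real data systems:

```python
from dataclasses import dataclass

# Hypothetical sketch of two focused, composable tools. In production these
# would query internal systems; here a tiny in-memory catalog stands in.

@dataclass
class Service:
    name: str
    state: str
    service_type: str

_CATALOG = [
    Service("Service A", "TX", "microtransit"),
    Service("Service B", "TX", "paratransit"),
    Service("Service C", "NJ", "microtransit"),
]

def find_services(state: str, service_type: str) -> list[Service]:
    """Focused lookup: filter the catalog by state and service type only."""
    return [s for s in _CATALOG
            if s.state == state and s.service_type == service_type]

def get_kpis(service_name: str, quarter: str) -> dict:
    """Focused metrics call: return only the KPIs for one service and quarter."""
    # Placeholder payload; a real tool would fetch from the metrics store.
    return {"service": service_name, "quarter": quarter, "utilization": 0.0}
```

Neither tool knows the whole workflow; the agent chains `find_services` and `get_kpis` to answer a question like the Texas one, and new combinations fall out for free.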
The agent’s job is to understand intent, plan execution, chain tools intelligently, synthesize results, and explain its reasoning. These capabilities are what distinguish an agent from a thin LLM wrapper.
Real example 1: Multi-Step Analysis
Let's see how the agent handles a complex request: "Which microtransit services in Texas saw the most significant performance gains in the recent quarter?"
The agent interprets “performance gains” as a comparative question and understands that it first needs to find the relevant services (Texas + microtransit). It correctly calls a dedicated tool to retrieve them. Next, it selects an appropriate comparison window (quarter) and ranks results by KPI deltas, calling a dedicated KPI-retrieval tool. The tool defines the available KPIs, and the agent decides which to use and how to interpret the changes.
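The final ranking step could be sketched like this; the service names, KPI values, and choice of relative change as the metric are assumptions for illustration only:

```python
# Illustrative sketch of ranking services by KPI delta between two quarters.
# Assumes the previous-quarter value is non-zero.

def rank_by_gain(kpis: dict[str, tuple[float, float]]) -> list[tuple[str, float]]:
    """kpis maps service name -> (previous_quarter_value, current_quarter_value).

    Returns services sorted by relative gain, largest first.
    """
    deltas = {
        name: (curr - prev) / prev  # relative change vs. the previous quarter
        for name, (prev, curr) in kpis.items()
    }
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

# Made-up numbers: Service A improved 25%, Service B improved 10%.
ranked = rank_by_gain({
    "Service A": (4.0, 5.0),
    "Service B": (3.0, 3.3),
})
```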
Real example 2: Discovery
Let’s ask an open-ended exploratory question - "What is the common feature for top utilization TaaS services?"
*TaaS stands for transportation-as-a-service: services where Via provides turnkey solutions including management of drivers, vehicles, and other operational aspects.
The agent:
- Retrieves TaaS services
- And their Q4 2025 utilization metrics
- Identifies the top four performers
- Analyzes their attributes to extract commonalities
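The last step above, extracting commonalities, can be sketched as a simple frequency count over service attributes. The attribute names and values here are invented for the example:

```python
from collections import Counter

# Hedged sketch of the "extract commonalities" step: count attribute values
# across the top performers and keep those shared by most of them.

top_services = [
    {"mode": "campus shuttle", "area": "small", "demand": "concentrated"},
    {"mode": "campus shuttle", "area": "small", "demand": "concentrated"},
    {"mode": "campus shuttle", "area": "small", "demand": "mixed"},
    {"mode": "paratransit",    "area": "large", "demand": "concentrated"},
]

def common_features(services: list[dict], threshold: float = 0.75) -> set[str]:
    """Return attribute=value pairs shared by at least `threshold` of services."""
    counts = Counter(
        (key, value) for s in services for key, value in s.items()
    )
    cutoff = threshold * len(services)
    return {f"{k}={v}" for (k, v), n in counts.items() if n >= cutoff}
```

In practice the agent performs this kind of synthesis in its reasoning rather than via explicit code, but the logic is the same: surface what the top performers have in common.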
We discover that university campus shuttles achieve higher utilization than typical microtransit services because they serve concentrated demand in small geographic areas with predictable travel patterns. This insight required analysis across multiple data sources!
The discovery can be improved with conversation mode, which maintains context across follow-up questions. This mirrors how a human analyst works and removes the friction of repeatedly reconfiguring dashboards or re-running queries.
Development Lessons: Making Agents Reliable
Building an agent that works in production requires addressing several key challenges.
Challenge 1: Agent Loops
Problem: The agent repeatedly invokes the same tool with identical parameters, unnecessarily consuming tokens, increasing latency, and frustrating users.
Solution: We implemented a layered approach to prevent loops:
- Limited tool results: Each tool returns only essential, structured data (not full datasets), keeping context clean and consistent.
- High-Reasoning Models: We leverage models such as Claude Sonnet to enable the agent to 'self-reflect' on its execution plan. Unlike smaller models prone to repetitive logic, Sonnet more reliably identifies when a task is complete or requires a different tool.
- Recursion limits: As a safety net, we return structured empty results and enforce a maximum number of reasoning steps, forcing termination if the other safeguards fail.
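The first two safeguards depend on the tools and the model; the safety net can be sketched as a small guard that rejects duplicate calls and enforces a step budget. Names and defaults below are illustrative, not our production code:

```python
# Minimal sketch of loop safeguards: deduplicate identical tool calls and
# cap the total number of reasoning steps.

class LoopGuard:
    def __init__(self, max_steps: int = 10):
        self.max_steps = max_steps
        self.steps = 0
        self.seen: set[tuple] = set()

    def allow(self, tool_name: str, params: dict) -> bool:
        """Return False if the step budget is spent or this exact call repeats."""
        self.steps += 1
        if self.steps > self.max_steps:
            return False  # force termination
        key = (tool_name, tuple(sorted(params.items())))
        if key in self.seen:
            return False  # identical call already made
        self.seen.add(key)
        return True
```

When `allow` returns `False`, the orchestration layer can feed the agent a structured "stop" result instead of executing the tool again.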
Challenge 2: User Experience & Performance
Problem: Queries to analytics databases can take 1–2 minutes to complete. Users need fast, consistent response times and visibility into what the agent is doing in order to maintain confidence.
Solution: We implemented multiple optimizations to reduce latency and keep users engaged:
- Performance optimization: We reduced latency through agent instance caching, data pre-warming, query result caching, and efficient conversation state management via LangGraph’s checkpointer.
- Progress transparency: The UI streams real-time updates as the agent reasons and executes tools, giving users visibility into what’s happening. Long-running queries run asynchronously with status updating to maintain responsiveness.
These optimizations enable the system to provide sub-two-minute response times while keeping users informed throughout the process.
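As one concrete piece, query-result caching can be sketched as a small time-to-live cache. The class name, API, and 5-minute default are assumptions for illustration, not Via's actual implementation:

```python
import time

# Illustrative TTL cache for analytics query results: repeated questions
# within the TTL are answered from memory instead of re-querying the database.

class QueryCache:
    def __init__(self, ttl_seconds: float = 300.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._store[key]  # expired; drop it
            return None
        return value

    def put(self, key: str, value) -> None:
        self._store[key] = (time.monotonic(), value)
```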
Challenge 3: Empty Tool Results
Problem: The agent expects tool results as structured input for reasoning. When a tool encounters missing data or unavailable resources (e.g. database connection failure), errors must be handled gracefully to ensure valid input for the agent.
Solution: We built tools to return structured responses with clear status indicators, even when no matching data is found. This prevents tool-call loops, maintains system functionality, and clearly communicates temporary unavailability to users.
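A sketch of this envelope pattern, with invented field names and statuses, might look like:

```python
# Every tool result carries a status plus data, so the agent never receives a
# raw exception or bare empty output it might misinterpret.

def run_tool(query_fn, *args):
    try:
        rows = query_fn(*args)
    except ConnectionError:
        return {"status": "unavailable", "data": [],
                "message": "Data source temporarily unavailable; try again later."}
    if not rows:
        return {"status": "no_results", "data": [],
                "message": "No matching records found."}
    return {"status": "ok", "data": rows, "message": ""}
```

Because "no results" and "source down" are distinct, explicit statuses, the agent can stop retrying and explain the situation to the user instead of looping.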
Challenge 4: Hallucinations and Irrelevant Topics
Problem: LLMs can generate plausible-sounding but incorrect results, especially when asked about data they don't have access to or when extrapolating beyond tool results. As a result, out-of-scope prompts can lead to irrelevant answers and confuse users.
Solution: We implemented a two-layer defense against hallucinations:
Prevention through Prompt Engineering:
- Citation enforcement: Instructions to reference which tools provided the information
- Scope limitations: Clear boundaries on what data is available
- Data grounding requirement: "Only use data from tool results, never make up numbers, facts or service names"
User guidance:
- Adding example questions in the UI illustrates available capabilities and helps users become familiar with the tool.
Challenge 5: Context Limit
Context windows in modern LLMs are ever-increasing, but they remain a practical constraint in production systems. Tool calling adds input overhead: the model reads the tool definitions, and the context keeps growing with chat history, tool outputs, and system instructions. As workflows become more complex, this can increase latency, raise cost, and degrade output quality when too much intermediate data is retained, leading to the 'lost in the middle' phenomenon.
Given our current tool set and routing guardrails, the existing agent-to-tools architecture is close to the effective frontier for reliability, controllability, and implementation complexity. Looking ahead, we plan to evaluate a multi-agent design (for example, a Supervisor with specialized sub-agents) for workflows that are longer, more decomposable, or that benefit from parallel execution.
Testing Agentic Systems with “LLM as a Judge”
Unlike traditional software where a specific input always produces the exact same output, LLM-based agents are non-deterministic. The same question might be answered using different wording, different tool sequences, or slightly varied levels of detail each time:
- Wording: "12 services" vs. "There are twelve services."
- Structure: Bullet lists vs. numbered lists vs. paragraphs.
- Order: Services listed alphabetically vs. by launch date.
- Detail level: A minimal answer vs. a comprehensive explanation.
Semantically, these responses are equivalent. However, a naive string comparison (checking for an exact match) would fail these tests, even though the agent answered correctly.
Solution: Curated Regression Baselines + LLM Judge
To ensure our agent remains reliable as we add tools and update models, we supplemented our traditional test suite with a semantic evaluation layer. We utilize a separate, high-reasoning LLM to act as a "Judge" that compares the agent's output against a "Gold Standard" ground truth.
1. Establishing a Baseline
We curated a set of representative questions with verified answers that we have manually confirmed as correct and well-formatted. This serves as a constant baseline to evaluate newer versions of the agent.
2. The Evaluation Dimensions
The judge evaluates the output across two distinct dimensions:
- Semantic Similarity: Do both responses provide equivalent facts and conclusions? The judge ignores stylistic choices like wording, bullet points, or list order.
- Numerical Accuracy: The judge extracts numerical data and validates that values match within a specific tolerance (e.g., 5%), ensuring data integrity even if the prose changes.
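The numerical-accuracy dimension is the easier of the two to make deterministic. A sketch of such a check follows; the regex, the positional pairing of numbers, and the 5% default tolerance are illustrative assumptions:

```python
import re

# Extract numbers from the gold answer and the agent's answer, then compare
# them pairwise within a relative tolerance.

def numbers_match(expected: str, actual: str, tolerance: float = 0.05) -> bool:
    pattern = r"-?\d+(?:\.\d+)?"
    exp = [float(x) for x in re.findall(pattern, expected)]
    act = [float(x) for x in re.findall(pattern, actual)]
    if len(exp) != len(act):
        return False  # a number is missing or extra
    return all(
        abs(a - e) <= tolerance * max(abs(e), 1e-9)
        for e, a in zip(exp, act)
    )
```

A real judge would let the LLM align numbers semantically rather than positionally, but a deterministic check like this is a cheap first gate.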
3. Scoring Guidelines
The judge returns a structured score based on the following rubric:
| Score | Status | Description |
|---|---|---|
| 0.8 – 1.0 | PASS | Identical core information; all key facts present; numbers within tolerance. |
| 0.6 – 0.8 | MARGINAL | Similar information but missing context or containing minor numerical drift. |
| Below 0.6 | FAIL | Incorrect information, significant numerical errors, or hallucinations. |
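Mapping the judge's score to a verdict is then a one-liner per band. Since the rubric's bands share endpoints, this sketch assumes each band's lower bound is inclusive:

```python
# Map the judge's numeric score to a test verdict per the rubric above.

def verdict(score: float) -> str:
    if score >= 0.8:
        return "PASS"
    if score >= 0.6:
        return "MARGINAL"
    return "FAIL"
```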
4. Testing the Agentic Process
Beyond the final text, our regression tests validate the "internal" health of the agent:
- Tool Orchestration: Did the agent call the expected tools in the correct sequence?
- Efficiency: Did it avoid unnecessary loops or redundant tool calls?
- Latency: Did execution time remain within acceptable limits?
- State Management: Did it maintain context correctly across multi-turn conversations?
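Checks like these can run against the agent's execution trace. The following sketch assumes a simple trace format (a list of steps with a tool name and latency) that is an invention for this example:

```python
# Illustrative regression check over an agent trace: verify the tool sequence,
# flag redundant calls, and enforce a latency budget. Returns failure messages.

def check_trace(trace: list[dict], expected_tools: list[str],
                max_latency_s: float = 120.0) -> list[str]:
    failures = []
    called = [step["tool"] for step in trace]
    if called != expected_tools:
        failures.append(f"tool sequence {called} != expected {expected_tools}")
    if len(set(called)) != len(called):
        failures.append("redundant tool calls detected")
    total = sum(step["latency_s"] for step in trace)
    if total > max_latency_s:
        failures.append(f"total latency {total:.1f}s exceeds budget")
    return failures
```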
Key Takeaways
We built this agentic AI assistant to address a core business challenge: internal teams need to quickly and easily find information about our services, but the information is fragmented across many specialized, distributed systems. By orchestrating specialized tools through an AI agent, we deliver comprehensive answers in one place.
Agentic systems don’t replace dashboards - they reduce the friction around them. What once required hours of manual navigation, data exports, and report synthesis can now be achieved in a conversation. As a result, operations managers can more easily investigate performance, identify issues, and generate insights without relying on the data team.
Key Design Principles:
- Focused, composable and structured tools
- Strong reasoning LLM
- User experience that is focused on ease of use and functionality
- Guardrails for prompts, execution flow and responses
- LLM-as-judge testing using semantic similarity
Domain-specific AI assistants are within reach for most engineering teams. You don't need massive infrastructure or custom models. Start small: identify recurring business questions, build 3-5 focused tools, use a strong reasoning LLM, test with real users, and iterate.
Our roadmap focuses on expanding capabilities while maintaining reliability:
- Expanding tool coverage to allow reasoning over additional data sources
- Proactive insights where the agent suggests analyses based on detected trends
- Multi-modal outputs including charts, maps, and visualizations
We employ an agent-to-tools structure today because it delivers the best reliability-to-complexity tradeoff, and we plan to move to a multi-agent design only when context growth and workflow complexity make specialization and parallelism clearly beneficial.
Looking ahead, we see a path toward a unified AI ecosystem. This might include bridging our structured data agent with the existing RAG-based Slack-bot, allowing users to query both live operational data and static company documentation in a single conversation.
This blog post describes Via's internal AI assistant for transportation data analysis. The patterns and lessons are applicable to other domain-specific AI assistant projects.
City Performance Director