Are LLMs Suitable for Finance and Risk Tasks?
Professionals in financial risk management, equity research, and related fields often approach Large Language Models (LLMs) with skepticism — especially when it comes to quantitative work. Their concern is simple: LLMs are black boxes, and they don’t guarantee deterministic results, which is crucial for finance applications.
Well, they’re quite right.
What’s more, LLMs are not designed to calculate: they are trained on text and code, not numbers. However, this does not mean that GenAI cannot revolutionize the way finance and risk teams perform their quantitative analyses. In today’s article, I would like to share the outcomes of a PoC we have recently been developing, showing how LLMs can drastically reduce the time required for typical quantitative tasks performed by financial analysis departments.
The idea
Before we dive in, an important note: what follows is an architecture built from multiple LLMs and non-LLM components working together to solve a multi-step task. This setup is not something you can reproduce using a “simple” public chatbot.
The idea was to produce a GenAI-augmented workflow that is able to respond to an undefined set of problems that are typical tasks faced by a risk or equity analyst. Let’s see the workflow step by step:
1. The user asks the system to perform a task (e.g., “calculate the daily standard deviation of Tesla over the last 12 months”).
2. An AI agent interprets the request and generates a properly formatted data query to Yahoo Finance.
3. The data is retrieved, but it is not sent back to the LLM.
4. A second AI agent analyzes the user’s request and generates Python code that answers it.
5. The workflow assembles everything together and returns the final result.
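The steps above can be sketched as a minimal orchestration loop. Everything here is illustrative: `ask_llm`, `fetch_yahoo`, and `execute` are hypothetical stand-ins for a real model call, a Yahoo Finance client, and a sandboxed Python runner.

```python
# A minimal sketch of the five steps above. ask_llm, fetch_yahoo and
# execute are hypothetical stand-ins for a real model call, a Yahoo
# Finance client, and a sandboxed code runner.

def run_workflow(user_request, ask_llm, fetch_yahoo, execute):
    # Step 2: an agent turns the free-text request into a data query
    query = ask_llm(f"Produce a Yahoo Finance query for: {user_request}")
    # Step 3: retrieve the data locally; it is never sent to the LLM
    df = fetch_yahoo(query)
    # Step 4: a second agent writes Python against the schema only
    code = ask_llm(
        f"Write Python answering '{user_request}' for a DataFrame "
        f"named df with columns {list(df.columns)}; store the answer "
        f"in a variable named result."
    )
    # Step 5: run the generated code on the full dataset locally
    return execute(code, df)
```

The important design choice is that the DataFrame never appears inside a prompt; only its column names do.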
The key point in this setup is that we never send any quantitative data to the LLM—only the column headers and the first row. Nor do we ask the LLM to perform any calculations directly; instead, we ask it to produce a Python script that does the calculations. Recall that, after all, LLMs are poor at arithmetic but great at coding.
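In practice, the preview handed to the model can be as simple as this (the function name is my own; the point is that only headers and one row leave the machine, regardless of how many rows the dataset has):

```python
# Sketch of the schema-only preview sent to the model: column headers
# plus the first row of data, never the full numeric payload.
import pandas as pd

def schema_preview(df: pd.DataFrame) -> str:
    header = " | ".join(map(str, df.columns))
    first_row = " | ".join(str(v) for v in df.iloc[0])
    return f"{header}\n{first_row}"
```

A million-row price history produces the same two-line preview as a ten-row one, which is where the token savings come from.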
This approach gives high token efficiency (more on that later) and makes it possible to work with AI without sharing confidential data externally. Just imagine using this solution not with publicly available data from Yahoo Finance, but with licensed sources like Eikon or Platts, or with your own proprietary internal datasets.
If this sounds like optimistic wishful thinking, let’s look at some real examples.
The Equity Analysis Intern
We start by asking our agent the simple question mentioned earlier:
“Calculate the daily standard deviation of Tesla stock over the last 12 months.”
In the first step, the agent decides what to search for. It correctly identifies Tesla’s ticker (TSLA), but fails to understand what “last 12 months” means, assuming the current date is December 1, 2024 (a recurring LLM issue I wrote about in another article).
This can be easily fixed by forcing the agent to retrieve the current date before making any assumptions about dates. We will skip that for now—the focus of today’s analysis is elsewhere.
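One possible shape of that fix, sketched below: expose a small date tool the agent must call before interpreting relative dates, so “last 12 months” is anchored at the real system date rather than the model’s training-data guess. The function name and month approximation are my own assumptions.

```python
# Sketch of a date tool for the agent: anchors relative date ranges
# at the actual current date instead of letting the model guess.
from datetime import date, timedelta

def resolve_last_n_months(n, today=None):
    """Return (start, end) for 'the last n months', defaulting to the
    real system date when no anchor is supplied."""
    end = today or date.today()
    # 30.44 is the average Gregorian month length in days
    start = end - timedelta(days=round(n * 30.44))
    return start, end
```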
Then, the agent retrieves the data, generates intermediate Python scripts, runs them, and returns the final output:
You can double-check the results manually using traditional tools. If you’re too lazy, trust me — they’re correct. As noted earlier, the LLM never sees the actual data, which dramatically reduces hallucination risk (aside from the date confusion) and keeps token usage low. For context: the raw Yahoo Finance dataset would have weighed around 18k tokens, easily enough to confuse a large language model and significantly increase hallucination risk.
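For readers who do want to double-check, the calculation itself is a few lines of pandas. This is a sketch, not the agent’s actual output: it assumes a ‘Close’ column of daily prices, and note that the agent also has to decide whether “daily standard deviation” refers to prices or to daily returns; this version uses simple daily returns.

```python
# Sketch of the calculation behind the answer, assuming a 'Close'
# column of daily prices and interpreting the request as the
# standard deviation of simple daily returns.
import pandas as pd

def daily_return_std(close: pd.Series) -> float:
    daily_returns = close.pct_change().dropna()
    return float(daily_returns.std())  # sample std (ddof=1)
```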
This architecture scales almost infinitely: longer datasets barely affect token cost or runtime.
If this still feels trivial, let’s go one level up.
The Senior Risk Analyst
Now let’s move to the FX market. We’ll run a simple stress-test exercise:
“Identify the largest drops in EUR/PLN over the last 3 years.”
This time we also ask for a graph in addition to the numerical/written description:
Once again, we see a date misinterpretation:
After some reasoning, the agent produces a written explanation:
And a graph showing exactly what was requested:
Allow me to leave this without any further comments.
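For completeness, here is a hedged sketch of the kind of script the code agent generates for this exercise. The column handling and function names are my assumptions; in the real workflow the series comes from the Yahoo Finance step.

```python
# Illustrative sketch of the generated stress-test script: find the
# largest one-day drops in a rate series and mark them on a chart.
import pandas as pd

def largest_drops(rate: pd.Series, n: int = 5) -> pd.Series:
    """Return the n most negative one-day percentage changes."""
    return rate.pct_change().nsmallest(n)

def plot_drops(rate: pd.Series, drops: pd.Series, path: str = "drops.png"):
    """Plot the rate with the drop days highlighted. matplotlib is
    imported lazily so the numeric part runs without it."""
    import matplotlib.pyplot as plt
    ax = rate.plot(title="EUR/PLN with largest daily drops marked")
    ax.scatter(drops.index, rate.loc[drops.index], color="red")
    plt.savefig(path)
```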
Time and cost considerations
Below is a summary of the token usage, cost, and duration of the work done by our two virtual analysts:
To put this into perspective:
If an analyst earns $5,000 per month, the same budget would let you run these analyses 78,000 to 128,000 times a month.
Yes—our two analysts should probably start learning LLM engineering.

