
by Finage at June 13, 2021 7 MIN READ


Algorithmic Trading with Reinforcement Learning 

 

We have been following Quantitative Finance offerings for a long time. We believe the field does not address the root of people's problems; nevertheless, we feel there is value in using Quantitative Finance as a challenge for designing better tools that do precisely that.

This is because working on an algorithmic trading challenge can help us better harness the creativity of machines when devising trading strategies. AlphaGo and AlphaGo Zero are truly creative, not only because they can devise new moves, but also because they can be modeled as generative models, which lets us treat strategizing as a creative endeavor. I'd like to explain the benefits of Deep Reinforcement Learning for algorithmic trading from the perspective of someone who isn't in finance.


A Quantitative Trading Theorem


Assuming that people do not withdraw money from the financial market except in unusual circumstances such as a market crash, what they do is simply move money from one asset to another in the hope of increasing its value. As a result, all trading approaches based on this idea ultimately come down to constructing a correlation matrix over all assets, because money withdrawn from one asset is reflected in the values of the others. If we have a portfolio with n assets and only one characteristic per asset, we already end up with a massive n by n matrix, which is why we may want to compute the correlation matrix with an expressive function approximator such as a neural network.
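To make the idea concrete, here is a minimal sketch (not Finage's method) of estimating such a correlation matrix from historical prices with NumPy. The random-walk prices and the choice of log returns are assumptions for illustration; a neural-network estimator would simply replace the closed-form correlation step.

```python
import numpy as np

def correlation_matrix(prices: np.ndarray) -> np.ndarray:
    """prices: (T, n) array of closing prices for n assets over T periods."""
    log_returns = np.diff(np.log(prices), axis=0)     # (T-1, n) log returns
    return np.corrcoef(log_returns, rowvar=False)     # (n, n) correlation matrix

# Illustrative usage with random-walk prices for 4 hypothetical assets.
rng = np.random.default_rng(0)
prices = 100 * np.exp(np.cumsum(rng.normal(0.0, 0.01, size=(250, 4)), axis=0))
print(correlation_matrix(prices).round(2))
```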


“At the end of the day, all trading approaches boil down to computing a correlation matrix for all assets.” 


Why Deep Reinforcement Learning Is Beneficial


Algorithmic trading requires a high level of speed. Quantum Finance introduces a new computing paradigm, but it still requires logic for making trade decisions. Ideally, we would like a framework that alleviates the decision-making bottleneck as much as possible. Self-supervised learning has a promising future in AI because it combines the superior performance of supervised methods with the "no labels, no problem" flexibility of unsupervised methods. Self-supervised learning methods all have one thing in common: they create the learning objectives used to teach them. Their learning procedure is as follows: the agent (1) makes a prediction, (2) observes the actual outcome, (3) measures the prediction error, and finally (4) learns from that error. Yes, I just described what Reinforcement Learning is all about.
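As a rough illustration of that four-step loop, here is a hedged Python sketch. `env` and `agent` are hypothetical stand-ins with an assumed `reset`/`step`/`act`/`learn` interface, not any particular library's API.

```python
def run_episode(env, agent, horizon: int = 1000) -> float:
    """One pass through the predict -> observe -> error -> learn loop."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(horizon):
        action = agent.act(state)                       # (1) make a prediction / take a decision
        next_state, reward, done = env.step(action)     # (2) observe the actual outcome
        agent.learn(state, action, reward, next_state)  # (3) measure the error, (4) learn from it
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```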

Let's look at why Reinforcement Learning works for algorithmic trading now that we know what it is. 


1. Exploit and Explore in a Timely Manner


Rather than constructing models for domain-specific problems with well-defined objective functions, we should focus on improving the representation of the environment used to train the model. Yet even with an acceptable representation, people find it difficult to design a trading strategy because there are simply too many alternatives; over time, human traders tend to rely on only a few signals as their main tools. Fortunately, unlike other frameworks, reinforcement learning deals with both prediction (for example, of market prices) and control (e.g., portfolio allocation). With offline reinforcement learning, we can train trading strategies offline and still venture into unexplored territory, which drastically reduces the execution latency required for high-frequency trading.

 

Offline Reinforcement Learning, in contrast to Online Reinforcement Learning, trains the agent on a fixed dataset with no further incoming data. This improves learning efficiency while ensuring policy "completeness." Because the current policy in online learning is based on a stochastic future, it is necessarily incomplete; although today's complete policy may not remain applicable in the future, a complete policy is more robust.
In offline reinforcement learning, the agent is trained "offline" on a preset dataset of experiences rather than on the fly (online). Offline Reinforcement Learning usually faces an existential challenge: given a dataset with fixed targets, we would not know the targets of the agent's actions if it strayed too far off the beaten path. In algorithmic trading, however, we can always compute those targets from the prices, which saves us a lot of time and effort on the control side.
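A small sketch of that last point, assuming the reward is simply the one-step portfolio return computed from recorded prices: any counterfactual allocation the offline agent proposes can be scored against the same history, so the dataset never has missing targets. Transaction costs are ignored here for brevity.

```python
import numpy as np

def counterfactual_reward(prices_t: np.ndarray,
                          prices_t1: np.ndarray,
                          weights: np.ndarray) -> float:
    """One-step return of holding `weights` (summing to 1) from t to t+1.
    Transaction costs are ignored in this sketch."""
    asset_returns = prices_t1 / prices_t - 1.0
    return float(weights @ asset_returns)

# Any allocation the offline agent proposes can be scored against recorded prices.
print(counterfactual_reward(np.array([100.0, 50.0]),
                            np.array([101.0, 49.5]),
                            np.array([0.6, 0.4])))
```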


“...even with an appropriate representation, deriving a trading strategy from it is a challenging task for humans since there are simply too many options.” 


2. Invest in the future rather than the past. 


Despite the fact that historical data is all we have, it is simply not a good predictor of the future for most non-stationary problems, especially in financial markets where events are constantly changing. Model-based reinforcement learning has a distinct philosophy: instead of attempting to forecast the future based on the past, the agent may envision all of the future possibilities and plan ahead. If you don't want to do your homework, for example, you can either (1) gamble that your teacher will not bring it up, or (2) prepare a response for each of the two possibilities, as your teacher may or may not bring it up.

You will be in control of the situation if you choose the latter option, regardless of what happens next. When there are many options, planning for the future (rather than forecasting it) can help to alleviate the stress of ambiguity. Because we focus on preparing for the most likely possibilities, we are now dealing with a problem of capacity rather than precision, which is much easier to manage. In fact, I feel that RL's use of the Markov Decision Process embodies this distinctive approach to continuous predictive control, in which we use only the current state as the latent space to shape the response to the future while discarding the past.
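Below is a toy sketch of this "plan over imagined futures" idea. The Gaussian return simulator, the candidate allocations, and the mean-minus-risk score are all illustrative assumptions, not a recommended market model.

```python
import numpy as np

def simulate_paths(mu, sigma, n_paths=1000, horizon=20, rng=None):
    """Imagined future per-step returns for each asset, shape (n_paths, horizon, n_assets)."""
    rng = rng or np.random.default_rng(0)
    mu, sigma = np.asarray(mu), np.asarray(sigma)
    return rng.normal(mu, sigma, size=(n_paths, horizon, len(mu)))

def plan(candidate_weights, mu, sigma):
    """Pick the allocation with the best risk-penalized outcome across imagined futures."""
    paths = simulate_paths(mu, sigma)                            # (P, H, n)
    cumulative = paths.sum(axis=1)                               # (P, n) total return per path
    outcomes = cumulative @ np.asarray(candidate_weights).T      # (P, n_candidates)
    score = outcomes.mean(axis=0) - 0.5 * outcomes.std(axis=0)   # reward minus risk penalty
    return candidate_weights[int(np.argmax(score))]

candidates = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
print(plan(candidates, mu=[0.0004, 0.0002], sigma=[0.02, 0.01]))
```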

 

Markovian Reinforcement Learning can be viewed as a GAN-like generative model in which a generator (our policy) acts as a Bayes model, predicting a posterior (an action) from sample states s drawn from a prior dataset. In a single-agent setting the discriminator is fully trained to recognize d(s), so the model plays a pure reward-maximization game. In a multi-agent setting, other players influence the discriminator (the market), so d(s) becomes non-stationary and the agent becomes an empirical Bayes estimator in a minimax game that converges at a Nash equilibrium.


Because the financial market is inherently non-stationary, a plug-in framework such as Model-based Reinforcement Learning is worth exploring as an empirical Bayes estimator. Regardless, we must remember that the state-action value prediction is occasionally wrong, which calls for decoupling prediction from decision.


“Despite the fact that historical data is all we have, it is simply insufficient to predict the future for most non-stationary problems.” 


3. Prediction and decision are separated.


Separating prediction and decision is a long-standing concept in inferential statistics. The logic goes: “making a correct prediction at a specific time does not guarantee that the decision based on that prediction will be correct.” There are many reasons for this; for instance, imagine two targets that follow two different normal distributions with the same expectation but very different standard deviations. The higher standard deviation implies more uncertainty, in which case we should take different actions (e.g., seek more diversification).

 

A linear model with lower variance is more robust than one with higher variance; to reduce the uncertainty in the high-variance case, we may need additional information at each step.
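A toy numerical illustration of the point: two predictors with the same expected return but different spreads should lead to different position sizes. The inverse-variance sizing rule used here is an assumption chosen for simplicity.

```python
import numpy as np

rng = np.random.default_rng(1)
pred_a = rng.normal(0.01, 0.02, size=10_000)   # same mean, low spread
pred_b = rng.normal(0.01, 0.08, size=10_000)   # same mean, high spread

for name, preds in [("A", pred_a), ("B", pred_b)]:
    mean, var = preds.mean(), preds.var()
    size = mean / var                          # shrink the position as uncertainty grows
    print(f"{name}: mean={mean:.4f}  std={preds.std():.4f}  position_size={size:.1f}")
```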


Because stochastic predictors (e.g., stock predictors) frequently need to compute the expectation of a target distribution, the argument for separating prediction and decision is especially strong in Algorithmic Trading. Reinforcement Learning currently uses Actor-Critic approaches, in which a model consists of a Critic (which predicts the state-action value) and an Actor (which makes decisions based on, but not limited by, the Critic's forecast). There is no need to worry about their alignment, because both functions are trained simultaneously and end-to-end. The value function (the Critic) should, however, be regularized to account for the non-stationary nature of the state distribution.
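For concreteness, here is a compact Actor-Critic sketch in PyTorch. The network sizes, the discrete action set, and the one-step advantage update are illustrative assumptions rather than a production trading design.

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Shared body with a decision head (Actor) and a prediction head (Critic)."""
    def __init__(self, state_dim: int, n_actions: int = 3, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.actor = nn.Linear(hidden, n_actions)   # policy logits (the decision)
        self.critic = nn.Linear(hidden, 1)          # state value (the prediction)

    def forward(self, state):
        h = self.body(state)
        return self.actor(h), self.critic(h)

def update(model, opt, state, action, reward, next_state, gamma=0.99):
    """One advantage actor-critic step: the Critic predicts, the Actor decides."""
    logits, value = model(state)
    with torch.no_grad():
        _, next_value = model(next_state)
        target = reward + gamma * next_value
    advantage = (target - value).detach()
    log_prob = torch.log_softmax(logits, dim=-1)[action]
    loss = (-log_prob * advantage + (target - value).pow(2)).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Illustrative setup; the 8-dimensional state encoding is a placeholder.
model = ActorCritic(state_dim=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
```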


“Making an accurate prediction at a specific time does not ensure that the choice based on that prediction will be accurate.” 


Finally, I'd like to make a few observations on why this is the ideal game for someone who enjoys playing games.


For the common person, trading is nothing more than a game of "Greater Fools": your objective is to find people more ignorant than you and sell them something worth less than they believe. If everyone knows an asset will fail, no one will be willing to trade it; hence, the financial market is one of the few human inventions that generates wealth from chaos, and if winning in any form is about minimizing environmental uncertainty, there is no inherent win-win scenario in trading.

 

To put it another way, trading financial assets is not the same as trading commodities; it is an adversarial game in which your only goal is to defeat a stochastic opponent (the market, or another trading bot) who plays by the same rules (ideally) but with incomplete information. Even better, this stochastic adversary is made up of many players with whom you can collaborate: working together for mutual benefit lets you ride the trend, but only for a limited time.

 

That's all there is to it. Reinforcement Learning seems to be particularly good at games with simple rules but a great deal of stochasticity. Like Go, this game has rules that can be learned in under 60 minutes, yet it has more possible configurations than there are atoms in the Universe. If Reinforcement Learning can beat the game of Go, it can beat the stock market.


