
[Reopen] Trade Action #9

Closed

stevexxs opened this issue Dec 24, 2019 · 11 comments

Comments
@stevexxs

stevexxs commented Dec 24, 2019

I have the same question as issue #6. According to your answer,

The actions are for timestep t+1 of the closed price. You observe a value at time t and then decide what action you should take next.

For example, if the period between timesteps is 2 hours, you observe the value and decide on the next action, but the action is only taken 2 hours later. Is my understanding right?

If so, I think the correct behavior should be: after making the decision for the next action at timestep t, the action should be taken immediately, even if there is slippage, with no need to wait another period until timestep t+1, i.e. 2 hours later.

If my understanding is wrong, please let me know. Thanks!

@Kostis-S-Z
Owner

Hello stevexxs,

Okay, I see what you mean, and maybe I should have expanded more on this in my answer in #6, since it can be interpreted the wrong way. What I meant by this:

You observe a value at time t and then decide what action you should take next.

is that, per the classic MDP formulation, the order is state -> action -> reward -> next state. So for the state at t, the action the agent takes is again for the position at t.

I will backtrack a bit and describe what is going on step by step.

Specifically, we can see in the code (file trail_env.py, function step) the following order:

  1. The timestep t is represented by the variable
    self.position
  2. So the rate value at t is
    c_val = self.data[self.position] # this is p_t
  3. The action at t is
    self.action = SELL / BUY / NEUTRAL
  4. The result of the action at t, which is applied immediately (as you say) and thus generates the agent's value at timestep t+1, is
    self.value +/-= self.value * self.turn # s(t+1) = s(t) +/- [s(t) * a]

So, the state consists of
state = [self.position, c_val, self.action, self.value]
where self.value is the updated agent value at t+1, paired with the position of time t.

  1. Now we move the agent to the next timestep t+1.
    self.position += 1
  2. And then we calculate the reward, so that the position, the rate value and the agent value match. In other words, the agent at position t performs the action at t, which results in a new agent value for position t+1, and the reward is based on that (see the sketch below).
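Putting those pieces together, a condensed sketch of that ordering (the names follow the snippets above, but the update rule and the return value are simplified, so treat this as an illustration rather than the exact implementation):

    def step_sketch(self, action):
        # observe the rate value at the current timestep t
        c_val = self.data[self.position]              # p(t)

        # apply the action immediately; this produces the agent value at t+1
        if action == BUY:
            self.value += self.value * self.turn      # s(t+1) = s(t) + s(t) * a
        elif action == SELL:
            self.value -= self.value * self.turn      # s(t+1) = s(t) - s(t) * a
        # NEUTRAL leaves self.value unchanged

        # the state pairs the position and rate at t with the updated value
        state = [self.position, c_val, action, self.value]

        # move to t+1 and compute the reward there
        self.position += 1
        self.reward = self.get_reward()
        return state, self.reward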

Let me know if there is something here that is not clear.

Note that I am no longer working on this project, but I still want to answer questions and fix any reported bugs. While re-reading the paper and going through the code again, this indeed seems a bit strange: the reward at t is based on the agent's value at t+1. Is this the issue you were referring to?

@stevexxs
Author

Hi @Kostis-S-Z,

Thanks so much for your detailed explanation.

this indeed seems a bit strange: the reward at t is based on the agent's value at t+1. Is this the issue you were referring to?

Yes, and this also makes me reconsider issue #3.

If the action is indeed taken at timestep t, a small bug here could be that in step(), get_reward() is called after self.position += 1, which means the reward is calculated at timestep t+1:

   step():
            # Move the agent to the next timestep
            self.position += 1

            # Calculate the reward of the agent
            self.reward = self.get_reward()
            self.epoch_reward += self.reward

But the problem is that get_reward() calls trade():

  def get_reward(self):
  ...
        c_val = self.data[self.position]  # p(t)
        self.trade(c_val)
  ...

In trade(), the trade action is stored with position t+1, not t:
self.trades.append([self.position, c_val])
which causes plot_actions() to plot the trade actions with a 1-timestep lag.
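In other words, I would expect something along these lines (a hypothetical sketch of the ordering I have in mind, not the actual repo code):

    # keep the timestep where the action was decided before incrementing
    decision_position = self.position          # t, where the action was decided
    self.position += 1                         # the agent moves to t+1
    self.reward = self.get_reward()            # reward is still based on the value at t+1
    # and inside trade(), store the saved index instead of self.position:
    self.trades.append([decision_position, c_val])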

@Kostis-S-Z
Owner

Hmm okay.

this indeed seems a bit strange: the reward at t is based on the agent's value at t+1. Is this the issue you were referring to?

Now it makes sense why this seemed strange to me: initially it wasn't like that. It was indeed "the reward at t is based on the agent's value at t", but I changed it due to issue #3, which makes me think that my initial code was correct.

As for this:

In trade(), the trade action is stored with position t+1, not t:
self.trades.append([self.position, c_val])
which causes plot_actions() to plot the trade actions with a 1-timestep lag.

This is not true, because plot_actions does not use trades to plot, but rather the short and long actions, which are stored independently and correctly.

@stevexxs
Author

stevexxs commented Dec 27, 2019

Hi,

This is not true, because plot_actions does not use trades to plot, but rather the short and long actions, which are stored independently and correctly.

plot_actions does use trades to plot, if my understanding is correct; you can see it in the code below (the diamond markers):

def plot_actions():
...
    if trades is not None:
        # Plot the time steps where a trade was made as well (long<->short)
        x_axis_tr = [x[0] for x in trades]
        y_axis_tr = [y[1] for y in trades]
        p.scatter(x_axis_tr, y_axis_tr, marker="diamond",
                  line_color="#FFA500", fill_color="#FFFF00", fill_alpha=0.5, size=16)

@Kostis-S-Z
Owner

I was about to clarify my previous comment.

plot_actions does not use trades to plot the actions of the agent, but the variables short_actions and long_actions, which are stored independently and correctly.

The trades variable is used optionally to mark when a trade was made. In the end this was never used (as you can see, there are no diamond markers in the plots of the paper) because it creates a very cluttered plot. Also, this variable does not affect the results, in the sense that the PnL is not based on it.

But indeed it is something that should be fixed, it is just not of high importance :)

@stevexxs
Author

Yes, you are right.
I found this while checking whether the trade behavior is correct or not;
the diamond markers are quite helpful.

@mg64ve

mg64ve commented Dec 27, 2019

Sorry, but I don't agree with you @Kostis-S-Z and @stevexxs.
Consider a real trading situation where you are trading an asset on a daily basis.
Every morning you have to take an action on the market. You have the data close(t), close(t-1), close(t-2), ..., and you consider yourself to be at time t.
What is your reward? The reward is what you get at the end of the day, so it is close(t+1) - close(t).
So you should increment your position before calculating the reward.
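For example, with made-up numbers just to illustrate the timing:

    close_t, close_t1 = 1.1000, 1.1050          # observed at decision time t, realized at t+1
    direction = +1                               # +1 long, -1 short, decided in the morning at t
    reward = direction * (close_t1 - close_t)    # 0.0050, only known after moving to t+1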
Do you agree?

@mg64ve

mg64ve commented Dec 29, 2019

Hi @Kostis-S-Z and @stevexxs. I would appreciate it if you could give me some feedback on my comment. Thanks.

@Kostis-S-Z
Owner

Okay, after careful consideration and a review of the code and everyone's comments on this issue, the verdict is the following: everything is as it should be.

  1. @mg64ve Yes, indeed the reward at the end of the day should be based on the final position of the agent (t+1) with regard to the market value at that time (t+1) and at the start of the day (t). So the code and the reward function are correct.

  2. The "self.trades" variable is also storing values correctly. As we said:

    1. Observe the rate value at t.
    2. Take the action as soon as possible, which is at timestep t+1. Imagine the duration between rates is not 4 hours but seconds: the moment you observe a value, it is already in the past, and you want the agent to predict which action is best for the next timestep. So the trade is executed at timestep t+1 even though the action was decided based on the observation of the rate value at t (see the small illustration below).
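To make that timing concrete, here is a tiny hypothetical illustration (the numbers and names are made up, not the environment's code):

    rates = [1.10, 1.11, 1.09]               # p(0), p(1), p(2) -- made-up values
    value, turn = 1000.0, 0.05               # agent value s(t), fraction traded per action

    observed = rates[0]                      # the agent observes p(0) at t = 0 ...
    action = +1                              # ... and decides to go long
    value = value + action * value * turn    # s(1): the trade is realized at t+1
    position = 1                             # the agent has moved to t+1
    # whatever form the reward takes, it is computed here at t+1, comparing
    # the new agent value with the market at rates[position] and rates[position - 1]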

Let me know if you have any more thoughts on this, otherwise I will close the issue in a few days.

@mg64ve

mg64ve commented Dec 31, 2019

Thanks @Kostis-S-Z for your clarification, it makes sense.
I still have two doubts/questions.

  1. In my experiments I have noticed a discrepancy between rewards and PnL. While the model was getting a positive reward, the PnL was in fact negative or decreasing. From your paper I can see that the closer s(t) is to the price, the smaller the difference between PnL and reward, with an error e(t) = abs(s(t) - p(t)). Could you give us some clarification on why we are getting this discrepancy between rewards and PnL?
  2. In your paper you do not consider walk-forward analysis. You took 12000 past values and trained the model on that data, then used 2000 samples for testing. 2000 samples at a 4H timeframe is almost 2 years without retraining. Do you think that with walk-forward retraining you would get better results?

I would really appreciate your feedback on these points.
Thanks.

@Kostis-S-Z
Owner

Closing this issue and transferring the discussion to #11.
