
[Reopen] Trade Action #9

Closed

stevexxs opened this issue Dec 24, 2019 · 11 comments

Comments
@stevexxs

stevexxs commented Dec 24, 2019

I have the same question as issue #6. According to your answer,

The actions are for timestep t+1 of the closed price. You observe a value at time t and then decide what action you should take next.

For example, if the period between timesteps is 2 hours, you observe the value and decide on the next action, but the action is only taken 2 hours later. Is my understanding right?

If so, I think the correct behavior should be: after making the decision for the next action at timestep t, the action should be taken immediately, even if there is slippage, with no need to wait another period until timestep t+1, i.e. 2 hours later.

If my understanding is wrong, please let me know. Thanks!

@Kostis-S-Z
Owner

Hello stevexxs,

Okay, I see what you mean, and maybe I should have expanded more on this in my answer in #6, since it can be interpreted the wrong way. What I meant by this:

You observe a value at time t and then decide what action you should take next.

is that, per the classic MDP formulation, the order is state -> action -> reward -> next state. So for the state at t, the action the agent takes is again for the position at t.

I will backtrack a bit and describe what is going on step by step.

Specifically, we can see in the code (file trail_env.py, function step) the following order:

  1. The timestep t is represented by the variable
    self.position
  2. So the rate value at t is
    c_val = self.data[self.position] # this is p_t
  3. The action at t is
    self.action = SELL / BUY / NEUTRAL
  4. The result of the action at t, which is applied immediately (as you say) and thus generates the agent's value at timestep t+1, is
    self.value +/-= self.value * self.turn # s(t+1) = s(t) +/- [s(t) * a]

So, the state consists of
state = [self.position, c_val, self.action, self.value]
where self.value is the updated agent value at t+1, paired with the position of time t.

  1. Now we move the agent to the next timestep t+1.
    self.position += 1
  2. And then we calculate the reward, so that the position, the rate value and the agent value match. In other words, the agent at position t performs the action at t, which results in a new agent value for position t+1, and the reward is based on that (see the sketch below).
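Putting those pieces together, a condensed sketch of that ordering (the names follow the snippets above, but the update rule and the return value are simplified, so treat this as an illustration rather than the exact implementation):

    def step_sketch(self, action):
        # observe the rate value at the current timestep t
        c_val = self.data[self.position]              # p(t)

        # apply the action immediately; this produces the agent value at t+1
        if action == BUY:
            self.value += self.value * self.turn      # s(t+1) = s(t) + s(t) * a
        elif action == SELL:
            self.value -= self.value * self.turn      # s(t+1) = s(t) - s(t) * a
        # NEUTRAL leaves self.value unchanged

        # the state pairs the position and rate at t with the updated value
        state = [self.position, c_val, action, self.value]

        # move to t+1 and compute the reward there
        self.position += 1
        self.reward = self.get_reward()
        return state, self.reward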

Let me know if there is something here that is not clear.

Note that I am no longer working on this project, but I still want to answer questions and fix any reported bugs. While re-reading the paper and going through the code again, this indeed seems a bit strange: the reward at t is based on the agent's value at t+1. Is this the issue you were referring to?

@stevexxs
Author

Hi @Kostis-S-Z,

Thanks so much for your detailed explanation.

this indeed seems a bit strange: the reward at t is based on the agent's value at t+1. Is this the issue you were referring to?

Yes, and this also makes me reconsider issue #3.

If the action is indeed taken at timestep t, a small bug here could be that in step(), get_reward() is called after self.position += 1, which means the reward is calculated at timestep t+1:

   step():
            # Move the agent to the next timestep
            self.position += 1

            # Calculate the reward of the agent
            self.reward = self.get_reward()
            self.epoch_reward += self.reward

But the problem is that get_reward() calls trade():

  def get_reward(self):
  ...
        c_val = self.data[self.position]  # p(t)
        self.trade(c_val)
  ...

In trade(), the trade action is stored with position t+1, not t:
self.trades.append([self.position, c_val])
which causes plot_actions() to plot the trade actions with a 1-timestep lag.
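In other words, I would expect something along these lines (a hypothetical sketch of the ordering I have in mind, not the actual repo code):

    # keep the timestep where the action was decided before incrementing
    decision_position = self.position          # t, where the action was decided
    self.position += 1                         # the agent moves to t+1
    self.reward = self.get_reward()            # reward is still based on the value at t+1
    # and inside trade(), store the saved index instead of self.position:
    self.trades.append([decision_position, c_val])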

@Kostis-S-Z
Owner

Hmm okay.

this indeed seems a bit strange: the reward at t is based on the agent's value at t+1. Is this the issue you were referring to?

Now it makes sense why this seemed strange to me: initially it wasn't like that. It was indeed "the reward at t is based on the agent's value at t", but I changed it due to issue #3, which makes me think that my initial code was correct.

As for this:

In trade(), the trade action is stored with position t+1, not t:
self.trades.append([self.position, c_val])
which causes plot_actions() to plot the trade actions with a 1-timestep lag.

This is not true, because plot_actions does not use trades to plot, but rather the short and long actions, which are stored independently and correctly.

@stevexxs
Author

stevexxs commented Dec 27, 2019

Hi,

This is not true, because plot_actions does not use trades to plot, but rather the short and long actions, which are stored independently and correctly.

plot_actions does use trades to plot, if my understanding is correct; you can see it in the code below (the diamond markers):

def plot_actions():
...
    if trades is not None:
        # Plot the time steps where a trade was made as well (long<->short)
        x_axis_tr = [x[0] for x in trades]
        y_axis_tr = [y[1] for y in trades]
        p.scatter(x_axis_tr, y_axis_tr, marker="diamond",
                  line_color="#FFA500", fill_color="#FFFF00", fill_alpha=0.5, size=16)

@Kostis-S-Z
Owner

I was about to clarify my previous comment.

plot_actions does not use trades to plot the actions of the agent, but the variables short_actions and long_actions, which are stored independently and correctly.

The trades variable is used optionally to mark when a trade was made. In the end this was never used (as you can see, there are no diamond markers in the plots of the paper) because it creates a very cluttered plot. Also, this variable does not affect the results, in the sense that the PnL is not based on it.

But indeed it is something that should be fixed, it is just not of high importance :)

@stevexxs
Author

Yes, you are right.
I found this while checking whether the trade behavior is correct or not;
the diamond markers are quite helpful.

@mg64ve

mg64ve commented Dec 27, 2019

Sorry, but I don't agree with you @Kostis-S-Z and @stevexxs.
Consider a real trading situation where you are trading an asset on a daily basis.
Every morning you have to take an action on the market. You have the data close(t), close(t-1), close(t-2), ..., and you consider yourself to be at time t.
What is your reward? The reward is what you get at the end of the day, so it is close(t+1) - close(t).
So you should increment your position before calculating the reward.
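For example, with made-up numbers just to illustrate the timing:

    close_t, close_t1 = 1.1000, 1.1050          # observed at decision time t, realized at t+1
    direction = +1                               # +1 long, -1 short, decided in the morning at t
    reward = direction * (close_t1 - close_t)    # 0.0050, only known after moving to t+1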
Do you agree?

@mg64ve

mg64ve commented Dec 29, 2019

Hi @Kostis-S-Z and @stevexxs. I would appreciate it if you could give me some feedback on my comment. Thanks.

@Kostis-S-Z
Owner

Okay, after careful consideration and a review of the code and everyone's comments on this issue, the verdict is the following: everything is as it should be.

  1. @mg64ve Yes, indeed the reward at the end of the day should be based on the final position of the agent (t+1) with regard to the market value at that time (t+1) and at the start of the day (t). So the code and the reward function are correct.

  2. The "self.trades" variable is also storing values correctly. As we said:

    1. Observe the rate value at t.
    2. Take the action as soon as possible, which is at timestep t+1. Imagine the duration between rates is not 4 hours but seconds: the moment you observe a value, it is already in the past, and you want the agent to predict which action is best for the next timestep. So the trade is executed at timestep t+1 even though the action was decided based on the observation of the rate value at t (see the small illustration below).
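To make that timing concrete, here is a tiny hypothetical illustration (the numbers and names are made up, not the environment's code):

    rates = [1.10, 1.11, 1.09]               # p(0), p(1), p(2) -- made-up values
    value, turn = 1000.0, 0.05               # agent value s(t), fraction traded per action

    observed = rates[0]                      # the agent observes p(0) at t = 0 ...
    action = +1                              # ... and decides to go long
    value = value + action * value * turn    # s(1): the trade is realized at t+1
    position = 1                             # the agent has moved to t+1
    # whatever form the reward takes, it is computed here at t+1, comparing
    # the new agent value with the market at rates[position] and rates[position - 1]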

Let me know if you have any more thoughts on this, otherwise I will close the issue in a few days.

@mg64ve

mg64ve commented Dec 31, 2019

Thanks @Kostis-S-Z for your clarification, it makes sense.
I still have two doubts/questions.

  1. In my experiments I have noticed a discrepancy between rewards and PnL. While the model was getting a positive reward, the PnL was in fact negative or decreasing. From your paper I can see that the closer s(t) is to the price, the smaller the difference between PnL and reward, with an error e(t) = abs(s(t) - p(t)). Could you give us some clarification on why we are getting this discrepancy between rewards and PnL?
  2. In your paper you do not consider walk-forward analysis. You took 12000 past values and trained the model on that data, then used 2000 samples for testing. 2000 samples at a 4H timeframe is almost 2 years without retraining. Do you think that with walk-forward retraining you would get better results?

I would really appreciate your feedback on these points.
Thanks.

@Kostis-S-Z
Owner

Closing this issue and transferring the discussion to #11.
