CalcRandom #33

Open
IlCingalese opened this issue Jul 25, 2017 · 14 comments
Comments

@IlCingalese

Hi,
Is it possible for the calcRandom function to accept an EventsNames parameter, like the other algorithm functions?
I think it's a bug.

Claudio

@pferrel
Collaborator

pferrel commented Jul 25, 2017

calcRandom creates a random ranking of all items. That ranking is used when there is no other basis for a recommendation, such as other events. It is therefore independent of events.

It is also seldom used. It is meant for situations where a large number of items have no events associated with them. It gives very poor results (the recommendations are random), but it exposes more items to users so that those items can start to collect events. I would not use it unless you have a good reason.

@pferrel pferrel closed this as completed Jul 25, 2017
@IlCingalese
Author

IlCingalese commented Jul 25, 2017 via email

@pferrel
Collaborator

pferrel commented Jul 25, 2017

There is no need for events if the ranking is random and no need to increase event storage.

@IlCingalese
Author

IlCingalese commented Jul 25, 2017 via email

@pferrel
Collaborator

pferrel commented Jul 25, 2017

I think you misunderstand: scoring items randomly requires no data to be sent to the EventStore. During training, each item in the model is given a random number, and that number is used to rank the items.

If you want to set a "custom" ranking this requires you to send $set events with your ranking.

Using "random" is extremely efficient, so I'm not sure why you would say it loads useless data.
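
As a rough illustration of the custom-ranking option mentioned above, a `$set` event can be built as a JSON payload like the one below. The `event`, `entityType`, `entityId`, and `properties` fields follow the PredictionIO event format; the helper name `make_rank_event`, the property name `rank`, and the item ID are assumptions made for this example. The property that the UR actually reads depends on your engine configuration.

```python
import json


def make_rank_event(item_id: str, rank: float, rank_field: str = "rank") -> dict:
    """Build a $set event that attaches a ranking value to an item.

    `rank_field` is a hypothetical property name; use whatever field
    your engine is configured to read for a custom ranking.
    """
    return {
        "event": "$set",
        "entityType": "item",
        "entityId": item_id,
        "properties": {rank_field: rank},
    }


# Example payload for a single (made-up) item.
event = make_rank_event("item-1", 7.5)
print(json.dumps(event, indent=2))
```

The payload would then be sent to the EventServer in the same way as any other event.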

@IlCingalese
Author

IlCingalese commented Jul 25, 2017 via email

@pferrel
Collaborator

pferrel commented Jul 25, 2017

It must load all events to calculate the model even without random ranking.

Sorry, I still don't understand what you are saying is wrong. All data must be loaded, random ranking or not. This is a big-data application, so it works on very large datasets. There are ways to trim the data, but this has nothing to do with random ranking.

The random ranking should be part of the normal train operation, not a separate task. If you are doing 2 trains, there is no need. Updating the model is integral to changing the random ranking.

Are you trying to change the random ranking more often than you update the model?

@IlCingalese
Author

IlCingalese commented Jul 26, 2017 via email

@pferrel
Collaborator

pferrel commented Jul 28, 2017

@IlCingalese no, you are wrong about this. Queries are not stored.

FYI I am a committer and PMC member on PIO and have worked on it for several years now. I also wrote the UR so I do know a bit about all this :-)

@IlCingalese
Author

IlCingalese commented Jul 28, 2017 via email

@dszeto
Contributor

dszeto commented Jul 28, 2017

@IlCingalese It looks like your engine(s) have feedback turned on. Unless you are conducting online evaluation, having feedback turned on has no value and will only eat up event storage. Please turn it off by dropping --feedback from your pio deploy commands. Event feedback is not turned on by default.

@pferrel
Collaborator

pferrel commented Jul 28, 2017

@IlCingalese there is an undocumented parameter to pio deploy called --feedback that will turn on storage of queries. It is off by default and not meant for casual use. https://github.com/apache/incubator-predictionio/blob/develop/core/src/main/scala/org/apache/predictionio/workflow/CreateServer.scala#L132

Could you or someone else have accidentally used this param in your pio deploy ... command?

As to calcRandom, it iterates through all the items in the Model to be written to Elasticsearch and assigns a random number to each. This happens during pio train, so calcRandom does not need to know about any events.
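
For illustration only, the idea behind the random ranking can be sketched in a few lines of Python. This is not the UR's actual Scala code; the function name `assign_random_ranks`, the `seed` parameter, and the item IDs are hypothetical. The point is that only the list of item IDs is needed, and no events.

```python
import random
from typing import Dict, Iterable, Optional


def assign_random_ranks(item_ids: Iterable[str],
                        seed: Optional[int] = None) -> Dict[str, float]:
    """Give every item a random score in [0, 1); no event data is used."""
    rng = random.Random(seed)  # a seed makes the sketch reproducible
    return {item_id: rng.random() for item_id in item_ids}


# Hypothetical item IDs standing in for the items in the model.
ranks = assign_random_ranks(["item-1", "item-2", "item-3"], seed=42)

# Items ordered by their random score, highest first.
ordering = sorted(ranks, key=ranks.get, reverse=True)
print(ordering)
```

Each call without a fixed seed produces a new ordering, which is why the ranking changes whenever the model is retrained.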

@pferrel pferrel reopened this Jul 28, 2017
@pferrel
Collaborator

pferrel commented Jul 28, 2017

BTW if you have turned on --feedback for the UR, it does nothing; the UR does not support its use.

Furthermore, the queries will never be deleted from the database. To clean up the DB:

  1. do a pio export
  2. write a program to drop the queries and keep only the events you want
  3. pio app data-delete to drop all data
  4. pio import ... to import the cleaned-up data

This will preserve your appName and access key, but it's not very safe unless you are completely sure the cleaned data is formatted correctly. To do it safely, create a new app and access key for the cleaned data, then test the import and predictions before switching to it. Once everything is switched over, drop the old appName with pio app delete...
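
Step 2 of the list above could be sketched roughly as follows. This is a minimal example, assuming the export is a text file with one JSON event per line; the function name `keep_wanted_events` and the event names in the sample are hypothetical (including "predict" as the name of a stored query), so check your own exported data before relying on them. It keeps an allow-list of event names rather than trying to recognize the stored queries.

```python
import json
from typing import Iterable, Iterator, Set


def keep_wanted_events(lines: Iterable[str], wanted: Set[str]) -> Iterator[str]:
    """Yield only the exported lines whose "event" name is in `wanted`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("event") in wanted:
            yield line


if __name__ == "__main__":
    # Made-up sample standing in for an exported file.
    sample = [
        '{"event": "purchase", "entityType": "user", "entityId": "u1"}',
        '{"event": "predict", "entityType": "pio_pr", "entityId": "q1"}',
        '{"event": "$set", "entityType": "item", "entityId": "i1"}',
    ]
    for kept in keep_wanted_events(sample, {"purchase", "$set"}):
        print(kept)
```

To run it on a real export, pass an open file object as `lines` and write the kept lines to a new file, then import that file into a test app first, as suggested above.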

@IlCingalese
Author

IlCingalese commented Jul 29, 2017 via email
