Generate reports on what has been uploaded to the interface #107

Open
jraddaoui opened this issue Sep 3, 2018 · 8 comments

@jraddaoui
Collaborator

As the Administrator, I want to generate reports on what has been uploaded to the interface (e.g., the most common file formats in the Collection or a visualization of the last modified dates by archive; this is where Kibana has potential)

@jraddaoui
Collaborator Author

jraddaoui commented Sep 3, 2018

As mentioned, Kibana could be used, but it may require adding extra data to the current Elasticsearch indexes or modifying existing ones (see #54), depending on what should be reported. Alternatively, a custom dashboard could be developed using D3 charts, like the one we have in Binder. The estimate may vary depending on which option is chosen.

@jraddaoui
Collaborator Author

Hi @stefanabreitwieser, @bunekcca,

We'd need more information about this feature to be able to provide an estimate:

  • Do you want to use Kibana or to add a reports page to the application?
  • What reports/visualizations do you want?
  • Do you want charts, tables, CSV downloads or something else?
  • Do you want to be able to change the parameters to generate those reports or just have default parameters?

@stefanabreitwieser
Collaborator

Hi @jraddaoui !

Kibana had originally been Tim's suggestion. I think it looks really interesting as a tool, but as a non-developer I don't necessarily understand how it would integrate with SCOPE. Would it be a separate stand-alone tool? Are there any pros/cons to not using Kibana? Do you have any strong preferences or opinions about either option?

We have two big categories of reporting that we're interested in: reference statistics and collection statistics. Reference statistics should show how the material is being used by researchers. Some sample questions:

  • What SIPs are being downloaded the most?
  • What collections are the most downloaded SIPs from?
  • How many downloads occurred over a given period of time?
  • How many logins occurred over a given period of time? (Note: We should NOT keep track of who logs in or what they download, only that the login/download has occurred.)
  • How many search queries were conducted over a given period of time?
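One way to read the anonymity note above is as a plain counter keyed by event type and day, with no user or item identifier stored at all. The following is only a rough sketch of that idea; the class name `AnonymousEventLog` and the event names are hypothetical and not part of SCOPE:

```python
from collections import Counter
from datetime import date


class AnonymousEventLog:
    """Counts events per (type, day) without storing who triggered them."""

    def __init__(self):
        # Keys are (event_type, day) tuples; no user or item identifiers are kept.
        self._counts = Counter()

    def record(self, event_type, day):
        # Only the fact that an event happened on a given day is recorded.
        self._counts[(event_type, day)] += 1

    def total(self, event_type, start, end):
        # Sum the counts for one event type over an inclusive date range.
        return sum(
            n
            for (kind, day), n in self._counts.items()
            if kind == event_type and start <= day <= end
        )


log = AnonymousEventLog()
log.record("login", date(2019, 1, 5))
log.record("login", date(2019, 1, 20))
log.record("download", date(2019, 1, 20))
logins_in_january = log.total("login", date(2019, 1, 1), date(2019, 1, 31))
```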

Collection statistics should reflect what the entire collection as a whole actually is (i.e. a broad look at everything that's been uploaded). Some questions:

  • How many collections are there?
  • How many total DIPs have been uploaded?
  • How many GBs/TBs of DIPs have been uploaded?
  • Visualization: Chart of which file formats make up the entire collection.
  • Logs of when things were uploaded, to show how much was processed over time.

Charts and tables that can be exported as CSVs would be ideal.

Being able to change the parameters would be ideal and it looks possible using Kibana, but this is negotiable.

This is a long list of things! I have to say that for this feature, I'm less sure of what's possible and reasonable to do given our budget and timeline, meaning that if you'd like to set up a call with just the two of us to discuss the best way of moving forward with this, I'm happy to do so.

@jraddaoui
Collaborator Author

Hi @stefanabreitwieser, that's great!

Thanks for the quick response. I'm out on vacation for a few days, but I'll follow up as soon as I get back on the 17th. Sorry for the inconvenience.

@jraddaoui
Collaborator Author

jraddaoui commented Jan 18, 2019

Hi again @stefanabreitwieser,

As you mention, Kibana would be a separate application, with its own authentication method, etc. I just sent an email with the credentials for a test instance and a Kibana server so you can check what this tool can provide. If you're planning to have these reports and visualizations only for admins, I'd definitely recommend using Kibana. However, if you intend to give all SCOPE users access to this section, it would be better to integrate it into the application.

On one hand, Kibana won't require major development, just improving the current indexes to support the statistics you want, but it will require some knowledge of Elasticsearch to create the charts/reports and to format the data. It also has a lot of features that you probably won't use, and some of them (like a proper authentication system) require purchasing an Elastic license.

On the other hand, developing a reports section with all the requirements you mention will take quite some time but it will give you more control over the content without having to know about Elasticsearch aggregations.


About the two categories of reports:

  • Reference statistics: I'd use Google Analytics for these statistics; some of them are already included in "Generate statistical reports on page visits and DIP downloads" (#106). That estimate will need to be bumped a little to include the login and query events, but adding this kind of data to the Elasticsearch indexes would be a lot harder than tracking the events and dimensions in Google Analytics.

  • Collection statistics: These are the ones we should take care of in this issue. Some of them may be harder to achieve than others, but I tried a "Total size by file format" visualization in Kibana and, after some back and forth, got good results:

https://kibana.ccarch.artefactual.com/app/kibana#/dashboard/23b53430-1b12-11e9-9314-9f7362acbb75

By clicking on the three dots in the top-right corner of the graph, you will see the chart in table format with the option to export it as a CSV file.
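For reference, a "Total size by file format" chart like this one roughly corresponds to an Elasticsearch terms aggregation with a sum sub-aggregation. The sketch below only builds the request body and flattens a response into CSV; it does not reflect the actual SCOPE index mapping. The field names `format_name` and `size_bytes` and the helper names are assumptions for illustration:

```python
import csv
import io


def size_by_format_query(format_field="format_name", size_field="size_bytes", top_n=20):
    """Build a request body: one bucket per format, with summed file size."""
    return {
        "size": 0,  # return only aggregations, no individual documents
        "aggs": {
            "by_format": {
                "terms": {
                    "field": format_field,  # hypothetical keyword field
                    "size": top_n,
                    "order": {"total_size": "desc"},  # largest formats first
                },
                "aggs": {"total_size": {"sum": {"field": size_field}}},
            }
        },
    }


def rows_from_response(response):
    """Flatten the aggregation buckets into (format, file count, bytes) rows."""
    buckets = response["aggregations"]["by_format"]["buckets"]
    return [(b["key"], b["doc_count"], b["total_size"]["value"]) for b in buckets]


def to_csv(rows):
    """Render the rows as CSV text, similar to the table export in Kibana."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["format", "files", "total_bytes"])
    writer.writerows(rows)
    return out.getvalue()
```

The body returned by `size_by_format_query()` could be sent to the index's `_search` endpoint with any Elasticsearch client; the response shape assumed by `rows_from_response` is the standard bucket layout for this kind of aggregation.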


With all that being said and considering the phase 2 budget and other issues, I personally think we should go with Kibana for now. If we have the time, we can try to find a way to proxy or iframe the Kibana reports in the SCOPE application.

Best regards.

@stefanabreitwieser
Collaborator

Thanks so much Radda! Reports will be for admins only, so Kibana should be no problem in that respect. I'll take a look at the link once we fix the timeout issue. (I sent an email with more detail.)

Before we go any further with this ticket, would you mind giving us a time estimate? We did flag Kibana reporting for this phase, but it's a lower priority compared to other things. Let's make sure we have room enough in the budget before doing additional work here. Thank you!

@jraddaoui
Collaborator Author

Hi @stefanabreitwieser,

The timeout issue should be fixed for you now; I've changed the URL in my previous update accordingly.

For an estimate, if we're only going to use Kibana, it depends on how much you want to do in there and how much guidance you will need from us. Since Kibana is an external tool, we won't be able to develop any requirements you may have for it, but we can guide you on how to create the charts and reports you need. It will also require adding/formatting some data in the Elasticsearch indexes and writing some documentation on how to set up and connect both instances. So far, I have spent around 8 hours setting up, configuring and securing the Kibana instance and creating the first chart, but the final estimate will vary depending on which reports are needed and on whether we're going to use only Kibana and only for the collection statistics.

@jraddaoui
Collaborator Author

Added a "Digital files per year" chart and fixed the existing counts in the dashboard:

https://kibana.ccarch.artefactual.com/app/kibana#/dashboard/23b53430-1b12-11e9-9314-9f7362acbb75
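As a rough illustration, a "Digital files per year" chart can be expressed as an Elasticsearch date histogram aggregation. This is a sketch under assumptions, not the dashboard's real configuration: the date field name `date_modified` is hypothetical, and the interval key depends on the Elasticsearch version (`calendar_interval` in 7.2 and later, `interval` in older releases), so it is left as a parameter:

```python
def files_per_year_query(date_field="date_modified", interval_key="calendar_interval"):
    """Build a request body that counts documents per calendar year."""
    return {
        "size": 0,  # return only aggregations, no individual documents
        "aggs": {
            "per_year": {
                "date_histogram": {
                    "field": date_field,   # hypothetical date field
                    interval_key: "year",  # use "interval" on older Elasticsearch versions
                    "format": "yyyy",      # makes key_as_string the plain year
                }
            }
        },
    }


def counts_per_year(response):
    """Map each year label to the number of files in that bucket."""
    buckets = response["aggregations"]["per_year"]["buckets"]
    return {b["key_as_string"]: b["doc_count"] for b in buckets}
```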

@sallain sallain added ready and removed backlog labels Mar 27, 2019
@sallain sallain added this to the phase2 milestone Mar 27, 2019

4 participants