Creating a visual schematic diagram for data wrangling workflow in {datawizard} #87

Open
IndrajeetPatil opened this issue Mar 6, 2022 · 29 comments

@IndrajeetPatil
Member

IMHO, the current README is quite dull and long-winded, and doesn't provide much insight into how this package can be useful for the users.

What we need is for it to feature a visual schematic like the ones in our other popular high-level packages:

image

image

Of course, paging our in-house visualization wizard @DominiqueMakowski! 🪄

Needless to say, this is low priority, and if you think it necessary, we can definitely wait for the package to mature further.

@IndrajeetPatil IndrajeetPatil added the Docs 📚 Improvements or additions to documentation label Mar 6, 2022
@DominiqueMakowski
Member

What would it contain? I can start a draft in PowerPoint.

DominiqueMakowski added a commit that referenced this issue Mar 7, 2022
DominiqueMakowski added a commit that referenced this issue Mar 7, 2022
@IndrajeetPatil
Member Author

IndrajeetPatil commented Jul 3, 2022

@DominiqueMakowski How about something like this? (cc @bwiernik, @strengejacke, @mattansb, @etiennebacher)

unnamed

Of course, there is a lot of room for improvement here. Specifically,

  • I am not sure how to visually depict messy versus clean/tidy data. The only thing I could come up with was a hamper full of dirty clothes versus clean, folded clothes. Maybe others have better ideas.
  • This includes no functions about "Data Properties". Is it important to include them?
  • The list of functions I've included in the two columns is incomplete. Not sure how comprehensive we want to be here.

@DominiqueMakowski
Member

I can give it a go next week (do ping me then if you remember :)

> The list of functions I've included in the two columns is incomplete.

It's okay not to be comprehensive; otherwise the diagram will be obsolete as soon as we add a new function. It might be better to create a wordcloud or something like that.

@IndrajeetPatil
Member Author

Yeah, I agree. That's why I had put the ... in those columns. I don't think we need to be comprehensive, but we should definitely include the most important ones (filter, select, join, etc.).

@bwiernik
Contributor

bwiernik commented Jul 3, 2022

I think a separate viz of data cleaning versus data summary functions would be good

@IndrajeetPatil
Member Author

@DominiqueMakowski It would be nice to have something like this in the JOSS paper.

@DominiqueMakowski
Member

Will do within the next couple of days

@DominiqueMakowski
Member

Would be nice to generate a wordcloud of the functions tho

@DominiqueMakowski
Member

Wordlist for wordclouds (https://www.wordclouds.com/):

  • Preparation:

data_filter()
data_select()
data_to_long()
data_to_wide()
data_rotate()
data_rename()
data_relocate()
data_join()

  • Transformation:

standardize()
normalize()
center()
degroup()
winsorize()
data_cut()
data_recode()
data_shift()

@IndrajeetPatil
Member Author

I want to wait for #57 and #197 to be resolved before we include the following functions in the wordcloud:

data_cut()
data_recode()
data_shift()

We should avoid including in a publication any function names that we are not sure will survive for long.

@DominiqueMakowski
Member

you're right, I'll come up with a diagram prototype nonetheless and then we can fine-tune the wordcloud

@DominiqueMakowski
Member

We can focus on the dirty clothes metaphor but it lacks some text at the bottom? (feel free to directly edit the powerpoint on the diagram branch!)

image

@IndrajeetPatil
Member Author

Thanks, Dom! This looks like a great start.

I think one way this can be improved is by making it visually less busy and more minimal. Additionally, we need to mention only a few (key and most useful) functions and just have ... (which will cover all the other existing or future functions).

I don't like the star shape in the "Transformations" section.

Maybe this can be an ironing table with a shirt on it?
As in, imperfections in prepared data are ironed out using statistical transformations before the data is ready to be fed into a statistical model.

Instead of "No dependencies", I'd write "Lightweight", since we do import {insight}.

@IndrajeetPatil
Member Author

I also want to hear what @etiennebacher, @strengejacke, @bwiernik, @mattansb think about the current status of the illustration and how it can be further improved.

@bwiernik
Contributor

I agree with Indra's comments and don't have much more to add there. I like the ironing metaphor (maybe the function names in a cloud of steam?). I also agree that making the function names less busy, so that they stand out more, would be good.

@mattansb
Member

Looks good. I would maybe change the background color of the washing machine to a lighter blue? And for the transformations, use the non-data_* variants of the function names.

@etiennebacher
Member

Looks good to me too, but it's a bit hard to read most function names in steps 2 and 3. Maybe you can remove the very small ones to increase the size of the others?

@IndrajeetPatil
Member Author

Thank you all for the great suggestions!

WDYT, @DominiqueMakowski? Will this be possible? Don't know how complicated it will be to design.

@IndrajeetPatil
Member Author

@DominiqueMakowski Let us know if these suggestions make sense.

@IndrajeetPatil
Member Author

bump

@strengejacke
Member

hello-mcfly

@IndrajeetPatil
Member Author

bump

@DominiqueMakowski
Member

DominiqueMakowski commented Sep 21, 2022

Is that the correct list?

Preparation:
data_filter()
data_select()
data_to_long()
data_to_wide()
data_rotate()
data_rename()
data_relocate()
data_join()

Transformation:
standardize()
normalize()
center()
degroup()
winsorize()
categorize()
change_code()
slide()

@DominiqueMakowski
Member

thanks for the bumps 🙊

@IndrajeetPatil
Member Author

IndrajeetPatil commented Sep 21, 2022

These need to change to their new names:

  • data_cut() -> categorize()
  • data_recode() -> recode_values() (not change_code())
  • data_shift() -> slide()

Btw, feel free to not include all of them. Whatever looks better with the chosen graphic design.

@bwiernik
Contributor

recode_values() not change_code()

@IndrajeetPatil
Member Author

bump

6 participants