Code Revision for Hwan-branch #1

Open
4 tasks done
edwardhuh opened this issue Apr 6, 2022 · 0 comments
edwardhuh commented Apr 6, 2022

Hi Hwan, great job. I was very impressed with your code. I do have some concerns about the outcome, though, and would like you to debug your work using the following process. As always, tell me if this helps.

  • Read through this Towards Data Science article on Medium. I think it gives you good background on the TF-IDF method.
  • Verify that the lemmatization does what you expect. (Why are "infect" and "infection" lemmatized to different words?)
    I would suggest checking that the tokenize and lemma functions return the expected results for some common variants you see in the data.
  • Apply spellcheck with spaCy prior to lemmatization.
  • Re-implement TF-IDF iteratively, testing on a smaller corpus (e.g., try your process with just 1 sentence, then 10 sentences, then 40 sentences, and so on). The TF-IDF formula can be calculated by hand just by counting the words. Can you replicate the outcomes for the small cases?
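
To make the "calculate by hand" step concrete, here is a minimal sketch of one common TF-IDF variant in plain Python. It assumes tf = count / document length and idf = ln(N / df), and that the documents are already tokenized and lemmatized; the function name `tf_idf` and the toy corpus are made up for illustration. Libraries may use a different variant (scikit-learn's `TfidfVectorizer`, for example, smooths the idf and L2-normalizes by default), so hand-computed numbers only match a library once the formulas agree.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per document.

    Assumes each doc is a list of already tokenized, lemmatized strings.
    Uses tf = count / len(doc) and idf = ln(N / df); other variants exist.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy corpus, small enough to verify by counting words.
corpus = [
    ["virus", "infect", "cell"],
    ["virus", "spread", "fast"],
]
scores = tf_idf(corpus)
for doc_scores in scores:
    print(doc_scores)
```

By hand: "virus" appears in both documents, so its idf is ln(2/2) = 0 and its score is 0. "infect" appears once in a three-word document and in one of two documents, so its score is (1/3) · ln(2). If the code and the hand count disagree on a corpus this small, the bug is easy to find before scaling up to 10 or 40 sentences.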

After this, we are going to find a way to create a word cloud. Please review the wordcloud Python package by next week so that you are ready to produce a word cloud of your TF-IDF outcomes!
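
As a starting point for that review, here is a rough sketch of how TF-IDF scores could feed the wordcloud package. The helper names (`top_terms`, `save_word_cloud`), the example scores, and the output path are all hypothetical; the only library calls used are `WordCloud(...)`, `generate_from_frequencies`, and `to_file`. It assumes the scores for one document (or the whole corpus) are already available as a `{term: score}` dict.

```python
def top_terms(scores, n=50):
    """Keep the n highest-scoring terms, dropping any with a score of zero or less."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return {term: score for term, score in ranked[:n] if score > 0}


def save_word_cloud(frequencies, path):
    """Render a {term: weight} dict to an image file (needs `pip install wordcloud`)."""
    from wordcloud import WordCloud  # third-party; imported here so top_terms works without it

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    cloud.to_file(path)


# Hypothetical scores; in practice these would come from your TF-IDF step.
example_scores = {"infect": 0.23, "cell": 0.23, "virus": 0.0, "spread": 0.11}
frequencies = top_terms(example_scores, n=3)
print(frequencies)
```

Calling `save_word_cloud(frequencies, "tfidf_wordcloud.png")` would then write the image. Terms with a score of zero are filtered out first, since they carry no weight in the cloud.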
