Interesting project #2

Open
correlator opened this issue Oct 27, 2016 · 1 comment

Comments

@correlator

Hi Joseph,
Vyki Englert sent me over here to check this project out, and it seems pretty awesome/ambitious. I had a bit of trouble understanding the current state of the project and where it's headed.

If I understand correctly, all jobs are currently posted online in various formats across the web. You aim to scrape all of that data and put it into a system that can be searched and analyzed.

I would be happy to help a bit with this but could use a little clarification on the tasks. If I had to guess, I would say the steps are:

  1. Find all sources of state jobs online.
  2. Figure out how to access that data and classify as structured, semi-structured, or unstructured.
  3. Design a target schema for how all jobs should be represented.
  4. Implement the schema in Neo4j or some other NoSQL DB.
  5. Build ETLs for all the different job data sources by classification.
  6. Migrate data over.
  7. Build an API for searching the data and exposing it to people who want to analyze it.

Is this similar to what you have in mind? Where in the process are we currently? I would be interested in helping with 3, 4, 5, and 7. Happy to work, rubber-duck, or help in any small way I can.

@josephlei
Owner

Hi, thank you for peeking in. @vykster is an awesome ally and colleague, and I look forward to working with you on this as well.

To clarify, we are working with job classification specifications, which are posted online in one place, in the same or similar formats, as HTML/ASPX documents (http://calhr.ca.gov/state-hr-professionals/Pages/job-descriptions.aspx). There is one page for each class specification.

The end goal is to use XPath/CSS selectors or another method to pull the relevant values out of these documents and store them in a data structure that can then be further analyzed, linked, and API-ified.

Examples of what we want to pull from one class, such as Associate Governmental Program Analyst (AGPA), class code 5393 (http://calhr.ca.gov/state-hr-professionals/pages/5393.aspx), might be:

  • Schematic code (KEY) with a (VALUE) of JY35
  • Definition (KEY) with a (VALUE) of the string:
    • Under direction, incumbents perform the more responsible, varied, and complex technical analytical staff services assignments such as program evaluation and planning; policy analysis and formulation; systems development; budgeting, planning, management, and personnel analysis; and continually provide consultative services to management or others. This is the full journey level analyst class. Incumbents are typically subject-matter generalists who have demonstrated possession of intellectual abilities, the management tools, and the personal qualifications to succeed in a variety of general staff services settings.
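
A minimal sketch of that KEY/VALUE extraction, using only Python's standard-library `html.parser` rather than XPath/CSS selectors. The markup in `SAMPLE_HTML` is hypothetical (headings in `<h2>`, values in the text that follows, definition text abbreviated), and `SpecParser` and `parse_spec` are illustrative names; the real CalHR pages may well be structured differently.

```python
from html.parser import HTMLParser

# Hypothetical markup: an assumption for illustration, not the real page layout.
SAMPLE_HTML = """
<div id="spec">
  <h2>Schematic Code</h2>
  <p>JY35</p>
  <h2>Definition</h2>
  <p>Under direction, incumbents perform the more responsible, varied,
  and complex technical analytical staff services assignments.</p>
</div>
"""


class SpecParser(HTMLParser):
    """Collect the text that follows each <h2> heading as KEY -> VALUE."""

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._in_heading = False
        self._key = None  # heading currently being filled, if any

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
            self._key = ""

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_heading = False
            self._key = self._key.strip()
            self.fields[self._key] = ""

    def handle_data(self, data):
        if self._in_heading:
            self._key += data            # still reading the heading text
        elif self._key is not None:
            self.fields[self._key] += data  # text belonging to the last heading


def parse_spec(html):
    """Return a dict of heading -> whitespace-normalized text."""
    parser = SpecParser()
    parser.feed(html)
    parser.close()
    return {key: " ".join(value.split()) for key, value in parser.fields.items()}


fields = parse_spec(SAMPLE_HTML)
```

With the sample above, `fields["Schematic Code"]` holds `JY35` and `fields["Definition"]` holds the definition string with line breaks collapsed.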

The tricky part is when we get to sections like "Minimum Qualifications" (MQs), because there are multiple ways to meet the MQs for a class. In this example it is:

  • EDUCATION AND
    • EXPERIENCE PATH 1 OR
    • EXPERIENCE PATH 2

If not in a DB, I imagine this lends itself to a hierarchical structure like JSON/XML, but I trust others know better than I do what would be appropriate.
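
As a rough illustration of that hierarchy, here is one way the AND/OR shape could be nested in JSON-compatible Python. The key names `all_of`, `any_of`, and `requirement`, and the helper `is_met`, are assumptions made up for this sketch, not an agreed schema.

```python
import json

# Hypothetical layout: "all_of" means every child must hold (AND),
# "any_of" means at least one child must hold (OR).
minimum_qualifications = {
    "all_of": [
        {"requirement": "EDUCATION"},
        {
            "any_of": [
                {"requirement": "EXPERIENCE PATH 1"},
                {"requirement": "EXPERIENCE PATH 2"},
            ]
        },
    ]
}


def is_met(node, satisfied):
    """Check a nested MQ tree against the set of requirements a person meets."""
    if "all_of" in node:
        return all(is_met(child, satisfied) for child in node["all_of"])
    if "any_of" in node:
        return any(is_met(child, satisfied) for child in node["any_of"])
    return node["requirement"] in satisfied


# The structure is plain dicts and lists, so it serializes directly to JSON.
as_json = json.dumps(minimum_qualifications, indent=2)
```

For example, `is_met(minimum_qualifications, {"EDUCATION", "EXPERIENCE PATH 2"})` is true, while experience alone without the education requirement is not enough.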

The great news is, I've already done items 1 and 2 (taking the master list of class codes, retrieving and caching all the data), and items 3, 4, 5, and 7 are exactly what we need assistance with!

The cached files are located here. They have .html extensions because that's what I specified when I used urllib to fetch them, but they are originally at .aspx endpoints. It doesn't really affect the content at all; just an FYI that I noticed after the fact.
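
For anyone repeating the fetch, a sketch of how the caching step might look with urllib. This is not the script that was actually used; `BASE_URL` generalizes the 5393 URL above to other class codes as an assumption, and `spec_url`, `cache_path`, `fetch_bytes`, and `cache_spec` are hypothetical names.

```python
from pathlib import Path
from urllib.request import urlopen

# Assumption: every class code follows the same URL pattern as 5393.
BASE_URL = "http://calhr.ca.gov/state-hr-professionals/pages/{code}.aspx"


def spec_url(class_code):
    """Build the page URL for one class code."""
    return BASE_URL.format(code=class_code)


def cache_path(cache_dir, class_code, ext=".aspx"):
    """Local file for a class code; ext controls the saved extension."""
    return Path(cache_dir) / f"{class_code}{ext}"


def fetch_bytes(url):
    """Download the raw page bytes."""
    with urlopen(url) as response:
        return response.read()


def cache_spec(class_code, cache_dir, fetch=fetch_bytes, ext=".aspx"):
    """Fetch a class specification once and reuse the cached copy afterwards."""
    path = cache_path(cache_dir, class_code, ext)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(fetch(spec_url(class_code)))
    return path
```

Passing `ext=".html"` reproduces the extension of the existing cache; the saved bytes are the same either way, since only the file name changes.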

I'll be out of the country, in Asia, for the next month but will check in periodically. A huge thank you for your interest. I look forward to what we can do together and to demonstrating how powerful and effective civic-minded citizens can be.


2 participants