scrape_linkedin is a Python package that scrapes all details from public LinkedIn profiles and turns the data into structured JSON. It can scrape both company and user profiles.
Warning: LinkedIn has strong anti-scraping policies and may blacklist IPs that make unauthenticated or unusual requests.
Install with pip:
pip install git+https://github.com/austinoboyle/scrape-linkedin-selenium.git
Or install from source:
git clone https://github.com/austinoboyle/scrape-linkedin-selenium.git
cd scrape-linkedin-selenium
python setup.py install
Tests are (so far) only run on static HTML files. One is a LinkedIn profile; the other is only used to test some utility functions.
Because of LinkedIn's anti-scraping measures, you must make your Selenium browser look like an actual user. To do this, add the li_at cookie to the Selenium session.
- Navigate to www.linkedin.com and log in
- Open browser developer tools (Ctrl-Shift-I or right click -> inspect element)
- Select the appropriate tab for your browser (Application on Chrome, Storage on Firefox)
- Click the Cookies dropdown on the left-hand menu, and select the www.linkedin.com option
- Find and copy the li_at value
There are two ways to set your li_at cookie:
- Set the LI_AT environment variable
$ export LI_AT=YOUR_LI_AT_VALUE
- On Windows:
C:/foo/bar> set LI_AT=YOUR_LI_AT_VALUE
- Pass the cookie as a parameter to the Scraper object.
>>> with ProfileScraper(cookie='YOUR_LI_AT_VALUE') as scraper:
A cookie value passed directly to the Scraper will override your environment variable if both are set.
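The precedence rule above can be sketched in a few lines. This is an illustration only, not the package's own code: `resolve_li_at` is a hypothetical helper, and the only names taken from this README are the `LI_AT` environment variable and the idea of an explicit cookie argument.

```python
import os


def resolve_li_at(cookie=None, environ=None):
    """Return the li_at value to use (hypothetical helper, not part of scrape_linkedin).

    An explicitly passed cookie wins; otherwise fall back to the LI_AT
    environment variable. Raises ValueError when neither is set.
    """
    env = os.environ if environ is None else environ
    value = cookie if cookie is not None else env.get("LI_AT")
    if not value:
        raise ValueError("No li_at cookie: pass cookie=... or set LI_AT")
    return value


# An explicit argument overrides the environment variable.
print(resolve_li_at("from-argument", environ={"LI_AT": "from-env"}))
# With no argument, the environment variable is used.
print(resolve_li_at(environ={"LI_AT": "from-env"}))
```

Passing `environ` explicitly keeps the sketch easy to try without touching your real environment.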
See /examples
scrape_linkedin comes with a command-line module, scrapeli, built with click.
Note: CLI only works with Personal Profiles as of now.
Options:
- --url : Full URL of the profile you want to scrape
- --user: www.linkedin.com/in/USER
- --driver: choose Browser type to use (Chrome/Firefox), default: Chrome
- -a --attribute : return only a specific attribute (default: return all attributes)
- -i --input_file : Raw path to html file of the profile you want to scrape
- -o --output_file: Raw path to output file for structured json profile (just prints results by default)
- -h --help : Show this screen.
Examples:
- Get Austin O'Boyle's profile info:
$ scrapeli --user=austinoboyle
- Get only the skills of Austin O'Boyle:
$ scrapeli --user=austinoboyle -a skills
- Parse stored html profile and save json output:
$ scrapeli -i /path/file.html -o output.json
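If you want to drive the CLI from a script, one option is to assemble the argument list in Python. The sketch below uses only flags listed above (`--user`, `--url`, `-a`, `-o`); `build_scrapeli_command` is a hypothetical helper, and treating `--user` and `--url` as mutually exclusive is an assumption made here, not documented behavior.

```python
import shlex


def build_scrapeli_command(user=None, url=None, attribute=None, output_file=None):
    """Build an argument list for the scrapeli CLI (hypothetical helper)."""
    # Assumption: exactly one of user or url identifies the profile.
    if (user is None) == (url is None):
        raise ValueError("Provide exactly one of user or url")
    cmd = ["scrapeli"]
    cmd.append(f"--user={user}" if user is not None else f"--url={url}")
    if attribute is not None:
        cmd += ["-a", attribute]
    if output_file is not None:
        cmd += ["-o", output_file]
    return cmd


cmd = build_scrapeli_command(user="austinoboyle", attribute="skills")
print(shlex.join(cmd))
# To actually run it (requires scrapeli on PATH and a valid li_at cookie):
# import subprocess; subprocess.run(cmd, check=True)
```

Building a list rather than a shell string avoids quoting problems when a URL or file path contains special characters.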
Use the ProfileScraper component to scrape profiles.
from scrape_linkedin import ProfileScraper

with ProfileScraper() as scraper:
    profile = scraper.scrape(user='austinoboyle')
print(profile.to_dict())
Profile - the class that has properties to access all information pulled from a profile. It also has a to_dict() method that returns all of the data as a dict.
from scrape_linkedin import Profile

# Parse a profile page that was saved to disk as HTML
with open('profile.html', 'r') as profile_file:
    profile = Profile(profile_file.read())
print(profile.skills)
# [{...} ,{...}, ...]
print(profile.experiences)
# {jobs: [...], volunteering: [...],...}
print(profile.to_dict())
# {personal_info: {...}, experiences: {...}, ...}
Structure of the fields scraped
- personal_info
  - name
  - company
  - school
  - headline
  - followers
  - summary
  - websites
  - phone
  - connected
  - image
- skills
- experiences
  - volunteering
  - jobs
  - education
- interests
- accomplishments
  - publications
  - certifications
  - patents
  - courses
  - projects
  - honors
  - test scores
  - languages
  - organizations
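Because the result of to_dict() is nested, reading a field that may be absent takes a little care. The following is a minimal sketch, assuming the nesting suggested by the to_dict() comment above; `get_field` is a hypothetical helper, and the sample values are made-up placeholders rather than real scraped data.

```python
def get_field(data, *path, default=None):
    """Walk nested dicts by key, returning default if any key is missing."""
    current = data
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Placeholder data shaped like the field outline; all values are invented.
sample = {
    "personal_info": {"name": "Jane Doe", "headline": "Engineer"},
    "skills": [],
    "experiences": {"jobs": [], "education": [], "volunteering": []},
    "interests": [],
    "accomplishments": {"languages": ["English"]},
}

print(get_field(sample, "personal_info", "name"))
# A missing key falls back to the default instead of raising KeyError.
print(get_field(sample, "accomplishments", "patents", default=[]))
```

With a real profile you would pass `profile.to_dict()` in place of `sample`.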
Use the CompanyScraper component to scrape companies.
from scrape_linkedin import CompanyScraper

with CompanyScraper() as scraper:
    company = scraper.scrape(company='facebook')
print(company.to_dict())
Company - the class that has properties to access all information pulled from a company profile. There will be three properties: overview, jobs, and life. Overview is the only one currently implemented.
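from scrape_linkedin import Company

# Backslashes continue the multi-item with statement across lines
with open('overview.html', 'r') as overview, \
        open('jobs.html', 'r') as jobs, \
        open('life.html', 'r') as life:
    company = Company(overview, jobs, life)
print(company.overview)
# {...}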
Structure of the fields scraped
- overview
  - name
  - company_size
  - specialties
  - headquarters
  - founded
  - website
  - description
  - industry
  - num_employees
  - type
  - image
- jobs NOT YET IMPLEMENTED
- life NOT YET IMPLEMENTED
Pass these keyword arguments into the constructor of your Scraper to override default values. You may (for example) want to decrease/increase the timeout if your internet is very fast/slow.
- cookie {str}: li_at cookie value (overrides env variable)
  - default: None
- driver {selenium.webdriver}: driver type to use
  - default: selenium.webdriver.Chrome
- driver_options {dict}: kwargs to pass to driver constructor
  - default: {}
- scroll_pause {float}: time (s) to pause during scroll increments
  - default: 0.1
- scroll_increment {int}: num pixels to scroll down each time
  - default: 300
- timeout {float}: default time to wait for async content to load
  - default: 10
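To get a feel for how scroll_pause and scroll_increment interact, here is a rough back-of-the-envelope sketch. It assumes the scraper pauses once per scroll step and scrolls the full page height, which is an assumption about the implementation; `estimate_scroll_seconds` and the 6000 px page height are hypothetical.

```python
import math


def estimate_scroll_seconds(page_height_px, scroll_increment=300, scroll_pause=0.1):
    """Rough estimate of time spent pausing while scrolling one page."""
    if scroll_increment <= 0:
        raise ValueError("scroll_increment must be positive")
    # One pause per scroll step; a partial final step still counts as a step.
    steps = math.ceil(page_height_px / scroll_increment)
    return steps * scroll_pause


# Hypothetical 6000 px page: documented defaults vs. slower, smaller steps.
print(estimate_scroll_seconds(6000))
print(estimate_scroll_seconds(6000, scroll_increment=150, scroll_pause=0.5))

# Overrides like these would be passed to the Scraper constructor as keyword
# arguments, e.g. ProfileScraper(**slow_connection_options).
slow_connection_options = {"timeout": 30, "scroll_pause": 0.5, "scroll_increment": 150}
```

Smaller increments and longer pauses give lazily loaded content more time to appear, at the cost of slower scrapes.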
New in version 0.2: built-in parallel scraping functionality. Starting a browser session has a high up-front cost, so parallel scraping only pays off when you are scraping many (> 15) profiles.
from scrape_linkedin import scrape_in_parallel, CompanyScraper

companies = ['facebook', 'google', 'amazon', 'microsoft', ...]

# Scrape all companies, output to 'companies.json' file, use 4 browser instances
scrape_in_parallel(
    scraper_type=CompanyScraper,
    items=companies,
    output_file="companies.json",
    num_instances=4
)
Parameters:
- scraper_type {scrape_linkedin.Scraper}: Scraper to use
- items {list}: List of items to be scraped
- output_file {str}: path to output file
- num_instances {int}: number of parallel instances of selenium to run
- temp_dir {str}: name of temporary directory to use to store data from intermediate steps
  - default: 'tmp_data'
- driver {selenium.webdriver}: driver to use for scraping
  - default: selenium.webdriver.Chrome
- driver_options {dict}: dict of keyword arguments to pass to the driver function.
  - default: scrape_linkedin.utils.HEADLESS_OPTIONS
- **kwargs {any}: extra keyword arguments to pass to the scraper_type constructor for each job
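To picture how a list of items might be shared among num_instances browser sessions, here is one plausible way to divide the work. This is a sketch of the general idea only; the README does not describe how scrape_in_parallel actually partitions items, and `split_into_chunks` is a hypothetical helper.

```python
def split_into_chunks(items, num_instances):
    """Split items into num_instances nearly equal, contiguous chunks."""
    if num_instances < 1:
        raise ValueError("num_instances must be at least 1")
    base, extra = divmod(len(items), num_instances)
    chunks, start = [], 0
    for i in range(num_instances):
        # The first `extra` chunks each take one additional item.
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


companies = ["facebook", "google", "amazon", "microsoft"]
for worker, chunk in enumerate(split_into_chunks(companies, 3)):
    print(worker, chunk)
```

With only a handful of items per instance, most of the run time goes to starting browsers, which is why parallel scraping suits larger batches.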
Report bugs and feature requests here.