Filter Promoted Company ads #1

Open
jfcolomer opened this issue Sep 15, 2023 · 1 comment

Comments

@jfcolomer

Hi there,
Thanks for creating this script, it's fabulous!
I was wondering what the best way would be to target only promoted ads rather than every single post, that is, the ones listed here:
https://www.linkedin.com/company/{company-name}/posts/?feedView=ads
When I update the link variable in the scrape function to something like link = f'{link}/posts/?feedView=ads', the script picks up only the very first promoted ad and does not collect the remaining ones (for example, with 50 ads it returns only 1 result). From that one result it also fails to collect likes and links (for example, for an ad with a carousel whose items have links).
For all other posts, it works like a charm.
Thanks
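
For reference, the link change described above can be sketched as a small helper. This is only an illustration: `build_feed_url` and the `example-co` company name are hypothetical, and the non-company path is the one the script's scrape function uses.

```python
def build_feed_url(link: str, profile_type: str) -> str:
    """Return the feed URL to scrape for a profile (hypothetical helper)."""
    base = link.rstrip("/")  # avoid a double slash if the link ends with "/"
    if profile_type == "Company":
        # Promoted ads view described in this issue
        return f"{base}/posts/?feedView=ads"
    # Path the script uses for non-company profiles
    return f"{base}/recent-activity/all/"


# "example-co" is a placeholder company name, not a real target
print(build_feed_url("https://www.linkedin.com/company/example-co/", "Company"))
```

Stripping the trailing slash first keeps the result well formed whether or not the profile link already ends with "/".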

@jfcolomer
Author

jfcolomer commented Oct 25, 2023

Hi,

Any help understanding how the individual post items are created before they are passed to postInfo = getPostInformation(str(post)) would be really appreciated:

`def scrape(driver, link, profileType):
    if (profileType == "Company"):
        link = f'{link}/posts/?feedView=ads'
    else:
        link = f'{link}/recent-activity/all/'

    driver.get(link)

    time.sleep(3)

    posts = {}

    old_position = 0
    new_position = None
    counter = 0
    while new_position != old_position:
        # Get old scroll position
        old_position = driver.execute_script(
                ("return (window.pageYOffset !== undefined) ?"
                 " window.pageYOffset : (document.documentElement ||"
                 " document.body.parentNode || document.body);"))
        # Try removing this sleep to see whether the program runs faster.
        # Since the program does processing between scrolls, the sleep may
        # not be needed the way it was for Instagram, where the loop only
        # scrolled with no processing in between.
        time.sleep(1)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        soup = str(soup)

        results = soup.split('occludable-update')

        # results = {}
        for result in results:

            try:
                counter += 1

                postlink = result.split('data-urn="')[counter].split('"')[0]
                postlink = f'https://www.linkedin.com/feed/update/{postlink}'
            except:
                postlink = ''

            if('linkedin' in postlink):
                posts[postlink] = result
        new_position = scroll(driver, old_position)

    print(f'\n\nFound {len(posts)} posts.')
    postsFiltered = []

    for postlink, post in posts.items():
        postInfo = getPostInformation(str(post))
        postInfo.append(postlink)
        postsFiltered.append(postInfo)`
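
To make the question concrete, here is a minimal sketch of how the loop above builds `posts`, run on a toy page instead of a live browser. Going by the code, each value stored in `posts` is a whole chunk produced by `split('occludable-update')`, and the key is taken from the `counter`-th `data-urn` in that chunk, with `counter` never reset between scroll passes. The toy page, the `collect_chunks` name, and the idea that the ads view has no `occludable-update` marker are all assumptions for illustration and have not been checked against LinkedIn's real markup.

```python
def collect_chunks(page_source, counter, posts):
    """Mimic one scroll pass of the inner loop in scrape (hypothetical helper)."""
    for result in page_source.split("occludable-update"):
        try:
            counter += 1
            urn = result.split('data-urn="')[counter].split('"')[0]
            postlink = f"https://www.linkedin.com/feed/update/{urn}"
        except IndexError:
            postlink = ""
        if "linkedin" in postlink:
            posts[postlink] = result  # stores the whole chunk, not one post
    return counter


# Hypothetical snapshot: three ads and no "occludable-update" marker
# (an assumption about the ads view, not verified).
page = "".join(
    f'<div data-urn="urn:li:activity:{i}">text of ad {i}</div>' for i in (1, 2, 3)
)

posts = {}
counter = 0
for _ in range(3):  # three scroll passes over the same snapshot
    counter = collect_chunks(page, counter, posts)

for postlink, post in posts.items():
    # Every link is distinct, but each stored value is the full page
    print(postlink, "| value is the whole page:", post == page)
```

Under these assumptions every link is found, yet each `post` holds the same full-page text, so a parser that reads the first description it finds would return the first ad's details for every row. Whether this is what actually happens on the ads view would need to be confirmed against the real page source.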

After changing the link variable to link = f'{link}/posts/?feedView=ads', I can get the script to export all of the company's promoted posts to the CSV, with links in this format:
https://www.linkedin.com/feed/update/urn:li:activity:00000000000000001
https://www.linkedin.com/feed/update/urn:li:activity:00000000000000002
https://www.linkedin.com/feed/update/urn:li:activity:00000000000000003

and so on ...

However, the description, hashtags, etc. are returned only for the first of the posts, in this case https://www.linkedin.com/feed/update/urn:li:activity:00000000000000001. I'd really appreciate it if you could explain how the post variable in this loop is generated:

for postlink, post in posts.items():

Thanks
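
One possible way to give each link its own markup, sketched under stated assumptions: split the page source on `data-urn="` so that each piece starts with one URN and carries only the text up to the next one. The `split_posts_by_urn` name and the toy page are hypothetical, it assumes posts are identified by `urn:li:activity:` values as in the URLs above, and it does not handle `data-urn` attributes nested inside a post.

```python
def split_posts_by_urn(page_source: str) -> dict:
    """Map each post URL to the markup that follows its own data-urn attribute."""
    posts = {}
    # Skip the first piece: it is everything before the first data-urn.
    for segment in page_source.split('data-urn="')[1:]:
        urn, _, markup = segment.partition('"')
        if not urn.startswith("urn:li:activity:"):
            continue  # assumption: posts use activity URNs; other URNs are skipped
        postlink = f"https://www.linkedin.com/feed/update/{urn}"
        # Keep the first snapshot seen for a post; later scroll passes repeat it.
        posts.setdefault(postlink, markup)
    return posts


# Toy page for illustration only, not real LinkedIn markup
page = "".join(
    f'<div data-urn="urn:li:activity:{i}">text of ad {i}</div>' for i in (1, 2, 3)
)

for postlink, markup in split_posts_by_urn(page).items():
    print(postlink, "->", markup)
```

Each value could then be passed to getPostInformation in place of the larger chunk, provided that function can work from a fragment of this shape, which is not clear from the code quoted here.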
