Ad Removal Layer (Ad Blocking) #30

Open
nb333 opened this issue Dec 31, 2013 · 15 comments
@nb333
Member

nb333 commented Dec 31, 2013

Our Ad Blocking will be similar to AdBlock's service, but ours will be server-side. Thus, we will strip out the ad so it's never even sent to the user. :D

@zlatanvasovic

Hmm. In JavaScript or Python? With Python we'll need a DOM library (I don't really know which one) and then search within <body> for predefined ad classes.

@nb333
Member Author

nb333 commented Jan 1, 2014

@zdroid Correct. With JavaScript being client-side, we would have to remove the ad after we've already fetched it, or prevent the call that gets the ad. If we choose Python (server-side) instead, we can remove the ad code before it ever gets to the client.

@zlatanvasovic

Yeah, but the Python one would be super complex.

@arunenigma
Member

@nb333 @zdroid Sorry, I have been really busy lately with school and work. @zdroid if you can send me the detailed requirements, I can try to help with this issue.

@zlatanvasovic

@arunenigma Don't worry, it's New Year! Relax... :)

Details: the OpenFaux server should detect the ads and remove them; the OpenFaux client then renders the page and opens it without ads.

@Sp3ctr3
Contributor

Sp3ctr3 commented Jan 1, 2014

Hmm... Privoxy has something similar; we should be able to emulate that in Python.

@zlatanvasovic

Ok.


@admwx7

admwx7 commented Jan 6, 2014

A big portion of the ads we'll be removing are served through major services (such as Google AdSense), which have a streamlined implementation we can look for. If we wanted to, it'd be as easy as running the code through a regex and stripping out anything that matches. Personally, I'd prefer to use a DOM handler so we know the objects are preserved as expected; then we can just run the attributes of the elements the DOM generates through a regex.
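A minimal sketch of the "parse first, then regex the attributes" idea, using only the standard library's `html.parser`. The patterns in `AD_PATTERNS` and the names `AdFinder` and `find_ad_elements` are hypothetical placeholders, not anything the project has agreed on; a real list would have to come from studying actual ad markup.

```python
import re
from html.parser import HTMLParser

# Hypothetical patterns for illustration only.
AD_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (r"\bad(s|sense|vert)?\b", r"doubleclick")
]


class AdFinder(HTMLParser):
    """Collects elements whose class, id, or src attribute looks ad-related."""

    def __init__(self):
        super().__init__()
        self.hits = []  # (tag, attribute name, attribute value)

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("class", "id", "src") and value:
                # The regex only ever sees a single attribute value,
                # never raw HTML, so nesting cannot confuse it.
                if any(p.search(value) for p in AD_PATTERNS):
                    self.hits.append((tag, name, value))


def find_ad_elements(html):
    finder = AdFinder()
    finder.feed(html)
    finder.close()
    return finder.hits
```

This only reports candidates; deciding what to do with them (remove the element, or only unlink its resources) is a separate step.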

@boxtown
Contributor

boxtown commented Jan 8, 2014

Just to let you guys know, regex cannot be used to parse HTML. HTML is not a regular language. You need to parse the HTML first and then possibly use regex (although that's probably not required after parsing). That shouldn't be a problem if it's done server-side, though, because Python comes with an HTMLParser class in its standard library.
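To make the point concrete, here is a small demonstration, assuming Python 3's `html.parser` module. The sample page and the `AdStripper` class are made up for illustration. A non-greedy regex stops at the first `</div>` it sees, while a parser can track nesting depth and find the matching one.

```python
import re
from html.parser import HTMLParser

PAGE = '<div class="ad"><div>buy now</div> more ad text</div><p>real content</p>'

# Naive regex: the match ends at the FIRST </div>, leaving half the ad behind.
naive = re.sub(r'<div class="ad">.*?</div>', "", PAGE)


class AdStripper(HTMLParser):
    """Drops every <div class="ad"> subtree by counting nested <div>s.

    Demo only: comments, doctypes and self-closing tags are not re-emitted
    faithfully.
    """

    def __init__(self):
        super().__init__()
        self.out = []
        self.depth = 0  # > 0 while inside an ad subtree

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1
        elif tag == "div" and "ad" in (dict(attrs).get("class") or "").split():
            self.depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            if tag == "div":
                self.depth -= 1
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)


def strip_ads(html):
    parser = AdStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

With the sample page, `naive` still contains ` more ad text</div>`, whereas `strip_ads(PAGE)` leaves only the paragraph.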

@zlatanvasovic

Lawl lawl lawl. I said load HTML and then search it. :D

Problem is that Python doesn't love HTML too much.


@admwx7

admwx7 commented Jan 8, 2014

@boxtown HTML is a regular language; to be exact, it's a markup language. Yes, its syntax is different from a programming language, but it's still a standardized language.

@zdroid if you're saying we render the HTML then search it, we don't want to do that either.

HTMLParser should do the trick. If it's anything like the built-in parser for JS, then we can just search for all elements of type x with class y and remove them. It will just require a bit of research on our part to find the common elements between ads generated by the different ad services. It may help to check out the source code for Adblock Plus (https://hg.adblockplus.org/adblockplus/) since they do this already, although their service is client-side. It's possible they use a different method we haven't thought of that might work better; the same goes for any other service of this type.

The only thing I'm worried about with removing HTML elements is that it may destroy the flow of the page. In that case, maybe we can just unlink all of the files the ad requires: if it generates the ad through some JS, remove the JS include; if there's an image associated with it, remove the image; and so on. That way it never grabs the resources, but the element is still there and (depending on how the ad service implements it) still filling the space, just with empty space now. Ideas?

@Sp3ctr3
Contributor

Sp3ctr3 commented Jan 8, 2014

Wouldn't building a blacklist of ad elements help? Check whether any of them exist in the browser contents and then remove them altogether?

@admwx7

admwx7 commented Jan 8, 2014

We don't want to accidentally break someone's layout by just blindly removing the elements, but a blacklist will be needed. Instead of blacklisting a

<div class="ad">...</div>

element, we can focus on the part that will actually impact the user's experience, such as removing the

<script src="getYoAdHere.spam/..." />

that will actually be making requests out. When the element renders, it'll keep its added styling and the div element, so it shouldn't break the flow of the page; but since it never grabs the script that fetches the image, it'll never actually render anything more than some blank space. This will also help with those pesky sites that have JS built in to overlay an ad that you have to click close on before you can see the content.
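A rough sketch of that approach, assuming Python 3's standard library. It drops `<script>` tags whose `src` points at a blacklisted host and re-emits everything else, so the surrounding `<div class="ad">` stays in place. `BLOCKED_HOSTS`, `is_blocked`, `ScriptStripper` and `remove_ad_scripts` are hypothetical names, and the single blacklist entry is just the example host from this comment.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

BLOCKED_HOSTS = {"getyoadhere.spam"}  # hypothetical blacklist entry


def is_blocked(src):
    """True if src points at a blacklisted host or one of its subdomains."""
    if not src:
        return False
    # Scheme-less values like "getYoAdHere.spam/x.js" need a "//" prefix
    # before urlparse will treat the first segment as a host.
    if "//" not in src:
        src = "//" + src
    host = urlparse(src).hostname or ""
    return any(host == b or host.endswith("." + b) for b in BLOCKED_HOSTS)


class ScriptStripper(HTMLParser):
    """Re-emits the page minus blacklisted <script> tags and their bodies.

    Sketch only: comments and doctype declarations are not re-emitted.
    """

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_blocked_script = False

    def _blocked(self, tag, attrs):
        return tag == "script" and is_blocked(dict(attrs).get("src"))

    def handle_starttag(self, tag, attrs):
        if self._blocked(tag, attrs):
            self.in_blocked_script = True
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        if not self._blocked(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.in_blocked_script and tag == "script":
            self.in_blocked_script = False
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_blocked_script:
            self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")  # keep entities such as &amp; intact

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def remove_ad_scripts(html):
    parser = ScriptStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Inline scripts and overlay code served from the page's own host would not be caught by a host blacklist, so this would likely need to be combined with other rules.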

@Sp3ctr3
Contributor

Sp3ctr3 commented Jan 8, 2014

Alright. Once we figure out what to remove, the actual removal should be fairly trivial. We just modify the buffer in the proxy accordingly: parse the HTML content using Beautiful Soup or lxml (faster?) and then remove the element.
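A sketch of what "modify the buffer in the proxy" might look like, under several assumptions: the proxy hands us the response headers as a plain dict with canonical header names and the body as bytes, and `html_filter` is whatever ad-stripping function we end up with (Beautiful Soup, lxml, or the standard library). `rewrite_response` is a hypothetical name, not an existing OpenFaux function.

```python
def rewrite_response(headers, body, html_filter):
    """Apply html_filter to HTML bodies; pass everything else through.

    headers: dict of response headers (assumed canonical capitalization)
    body: raw response bytes
    html_filter: callable taking and returning a str of HTML
    """
    content_type = headers.get("Content-Type", "")
    # Leave non-HTML and compressed responses alone in this sketch.
    if "text/html" not in content_type.lower() or headers.get("Content-Encoding"):
        return headers, body

    # Pick up the charset from the header if present, defaulting to UTF-8.
    charset = "utf-8"
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset" and value:
            charset = value.strip('"')

    try:
        text = body.decode(charset, errors="replace")
    except LookupError:  # unknown charset name
        charset = "utf-8"
        text = body.decode(charset, errors="replace")

    new_body = html_filter(text).encode(charset, errors="replace")

    # The body length changed, so Content-Length has to be updated too.
    new_headers = dict(headers)
    new_headers["Content-Length"] = str(len(new_body))
    return new_headers, new_body
```

Chunked or streamed responses would need buffering before this step, which is left out here.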

@admwx7

admwx7 commented Jan 8, 2014

Agreed, we'll just have to find the common culprits and create a blacklist for them.


6 participants