Possible workflow (and WIP) for identifying owners of domains, and privacy policies #121

Open
billfitzgerald opened this issue Feb 17, 2022 · 4 comments


@billfitzgerald
Contributor

tl;dr: I might be able to jumpstart identifying owners for domains that are currently unaffiliated, and I have a decent method of identifying likely candidates for privacy policies in an automated way.

This is a WIP; I'm sharing details here to see if you're interested in this work, and to make sure it aligns with and would be useful to the project.

Details:

Parse all records in the tracker-radar/domains/US directory. Identify all domains that do not have an owner.

For every domain that does not have an owner, get up to 5 subdomains. If a record doesn't have a subdomain, use the base domain and suffix.
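For reference, a minimal sketch of the parsing step. It assumes each Tracker Radar record is a JSON file with a `domain` string, an optional `owner` object, and a `subdomains` list; the exact field names may differ from the repo.

```python
# Sketch: collect up to 5 subdomains for every domain with no owner.
import json
from pathlib import Path

def unowned_hosts(radar_dir="tracker-radar/domains/US", max_subdomains=5):
    hosts = []
    for path in Path(radar_dir).glob("*.json"):
        record = json.loads(path.read_text())
        if record.get("owner"):  # skip domains that already have an owner
            continue
        domain = record["domain"]
        subdomains = record.get("subdomains") or []
        if subdomains:
            hosts += [f"{s}.{domain}" for s in subdomains[:max_subdomains]]
        else:
            hosts.append(domain)  # fall back to the base domain + suffix
    return hosts
```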

Send a headers-only request to every site in the list at http and https (only headers - no need to be rude and hit the full site). Record the response codes. This requires calls to 33,798 locations (16,899 hostnames, each contacted via both http and https). This step is complete; I'm happy to share this dataset if you're interested.
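The probe itself is simple; a minimal sketch, assuming a plain `requests.head()` call per scheme (a real run also needs retries, rate limiting, and so on):

```python
# Sketch: headers-only probe of every host over http and https,
# recording the status code (or "error") per scheme.
import csv
import requests

def probe(hosts, outfile="responses.csv"):
    with open(outfile, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["host", "scheme", "status"])
        for host in hosts:
            for scheme in ("http", "https"):
                try:
                    r = requests.head(f"{scheme}://{host}", timeout=10,
                                      allow_redirects=False)
                    status = r.status_code
                except requests.RequestException:
                    status = "error"
                writer.writerow([host, scheme, status])
```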

This gives us a range of useful information, including:

  1. which sites support http and https
  2. response codes at each subdomain (2xx, 3xx, 4xx, 5xx)

By cross-referencing which sites support http/https with the response codes at each location, we can infer a range of things (that's a longer and separate conversation). For this specific use case, we can use the protocol and response code results to flesh out ownership of these domains.
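One way to group the probe results into the cohorts used below (a sketch against the `responses.csv` layout from the probe sketch above; the actual grouping may differ):

```python
# Sketch: bucket hosts by response-code class over https
# (e.g. "2xx/https", "3xx/https").
import csv
from collections import defaultdict

def bucket(responses_csv="responses.csv"):
    buckets = defaultdict(set)
    with open(responses_csv, newline="") as f:
        for row in csv.DictReader(f):
            if row["scheme"] != "https" or not row["status"].isdigit():
                continue
            status_class = row["status"][0] + "xx"
            buckets[f"{status_class}/https"].add(row["host"])
    return buckets
```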

I'm starting with all domains that supported https and returned a 2xx response code on at least one subdomain (4,676 unique domains). I'm loading each domain via Selenium, passing the results to BeautifulSoup, and then focusing on the contents of all `a` tags, specifically the text displayed in the tag and the linked url. If either the text or the url contains "privacy", "terms", "legal", etc., I store the url.
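The link scan, roughly (a sketch assuming headless Chrome; the keyword list is illustrative, per the "etc." above):

```python
# Sketch: render the page with Selenium, then scan every <a> tag's
# text and href for policy-related keywords.
from bs4 import BeautifulSoup
from selenium import webdriver

KEYWORDS = ("privacy", "terms", "legal")

def policy_links(url):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "html.parser")
        title = soup.title.get_text(strip=True) if soup.title else ""
        hits = []
        for a in soup.find_all("a", href=True):
            text = a.get_text(strip=True)
            if any(k in text.lower() or k in a["href"].lower()
                   for k in KEYWORDS):
                hits.append((text, a["href"]))
        return driver.current_url, title, hits
    finally:
        driver.quit()
```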

Using this process, I'm generating an additional csv file that includes:

  • starting_url
  • current_url (if the site redirects, the eventual location)
  • page title (often contains the company name)
  • relevant urls of policies
  • relevant text associated with urls
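The output step is then one row per scanned domain (a sketch reusing the `policy_links()` function from above; the exact column names and separators here are guesses):

```python
# Sketch: write the csv described above.
import csv

FIELDS = ["starting_url", "current_url", "page_title",
          "policy_urls", "policy_link_text"]

def write_rows(start_urls, outfile="unowned_domains.csv"):
    with open(outfile, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for url in start_urls:
            current_url, title, hits = policy_links(url)  # sketch above
            writer.writerow({
                "starting_url": url,
                "current_url": current_url,
                "page_title": title,
                "policy_urls": "; ".join(href for _, href in hits),
                "policy_link_text": "; ".join(text for text, _ in hits),
            })
```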

This csv file can jumpstart the process of adding additional entity records for domains that are currently not affiliated with any entity, and identifying the privacy policies. The method of identifying privacy policies can also work for domains that are currently mapped to an owner, but that do not have a privacy policy listed.

Once I finish the 2xx/https domains, I'll probably process the 3xx/https domains - these are domains that might have been acquired, or are possibly up to no good, so I'll need to exercise additional caution about the device I use to gather information from them. Or I might jump to the 4xx/https domains, as those could potentially be more legit sites (i.e., they use https, and they don't allow easy access to randos on the internet).

CAVEAT: This issue might be premature. The steps I'm outlining here are a WIP. Initial testing looks good, but it's not done until it's done, and because a lot of these sites are, at best, dodgy, they often behave in ways that are, well, curious, which makes data collection more interesting than I'd prefer.

@billfitzgerald
Contributor Author

billfitzgerald commented Feb 17, 2022

Okay - the yield here looks pretty good. I ran just over 1K domains earlier today, and that in turn generated about 450 domains with a privacy policy and a distinct page title that are currently not tracked.

Some of the "positives" are actually parked domains - I don't know how many, but from eyeballing page titles there are some.

But yeah - this first test pass looks decent.

@billfitzgerald
Contributor Author

Okay - after some testing and work, this is definitely a viable approach (for me, anyways) for identifying owners of currently unclaimed domains.

More to come.

@billfitzgerald
Contributor Author

Okay - this is coming together.

This screencast shows some of the details: https://vimeo.com/680551075

But the short version: for domains that don't have owners, we can store ownership info in a csv.

That csv is processed, and the script:

  • checks for possible duplicates in privacy_policies.json and entity_map.json
  • for entries with no dupes, outputs json to be added to privacy_policies.json and entity_map.json
  • for entries with dupes, outputs a list of where the duplicates occurred.
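For the duplicate check, a rough sketch (the gist linked below is the real script; to avoid guessing at the schemas of privacy_policies.json and entity_map.json, this just string-searches their serialized contents):

```python
# Sketch: report which candidate domains already appear in either file.
from pathlib import Path

def find_dupes(domains, files=("privacy_policies.json", "entity_map.json")):
    dupes = {}
    for path in files:
        blob = Path(path).read_text()
        hits = sorted(d for d in domains if d in blob)
        if hits:
            dupes[path] = hits
    return dupes  # empty dict means the entries are new
```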

The script is here: https://gist.github.com/billfitzgerald/d8e5a1af729865f4b00b21eb9eeb980a

There are some additional details, but this is the short-ish version.

@billfitzgerald
Contributor Author

This issue can be closed - the process works, and leads to updates as documented in this pull request: #124
