Dedup the Perl_Companies files #1

vmbrasseur · 2013-05-28T20:52:04Z

There are a lot of dupes in these things. This needs to be cleaned up.

thaljef · 2013-05-28T21:43:03Z

I think the trick is to dedup the data whilst still being able to re-run the script to extract additional data from the mail or add future mail.

One possibility is to maintain a separate list of regexes that sanitize the data. For example, "Yahoo, Inc." => /yahoo/i would mean treating any company that matches m/yahoo/i as "Yahoo, Inc."

That list still has to be maintained by hand, which kinda sucks. But it is probably better than manually editing each mail.
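
A rough sketch of the regex-list idea, in Python purely for illustration (the extraction script itself is not shown in this thread). The names `CANONICAL_RULES`, `canonical_name`, and `dedup` are hypothetical, and only the Yahoo rule comes from the example above; everything else is an assumption about how such a list might be applied after extraction, so the script can still be re-run on new mail.

```python
import re

# Hypothetical hand-maintained rule list: (canonical name, pattern).
# Only the Yahoo rule is taken from the discussion; add others as needed.
CANONICAL_RULES = [
    ("Yahoo, Inc.", re.compile(r"yahoo", re.IGNORECASE)),
]


def canonical_name(raw):
    """Return the canonical name of the first matching rule, else the trimmed input."""
    name = raw.strip()
    for canonical, pattern in CANONICAL_RULES:
        if pattern.search(name):
            return canonical
    return name


def dedup(names):
    """Normalize each extracted name and drop repeats, keeping first-seen order."""
    seen = set()
    unique = []
    for raw in names:
        name = canonical_name(raw)
        if name not in seen:
            seen.add(name)
            unique.append(name)
    return unique
```

Since the rules run on the extracted names rather than on the mail itself, the source mail stays untouched and re-running the extraction simply passes the new names through the same list.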

vmbrasseur · 2013-05-28T22:08:41Z

Agreed, that would be a much better way to do it. But I didn't have the cycles to start that process (again: perfect was becoming the enemy of the good; plus: other things to do ATM). Hopefully once the project is announced someone will be able to start the process with a pull request.
