Extract URLs and email addresses from text using R. All kinds of URIs are supported, not just URLs, and the set of accepted URI schemes can easily be adjusted.
Leading and trailing punctuation is examined. If punctuation appears
to delimit a URI, or if a URI ends a sentence, some trailing
punctuation may be removed. Comma-separated URI lists are split, but
the heuristics used for this may fail, as the comma is a valid
character in some parts of a URI. Any technically valid URI is
protected from being cut if it is
surrounded by angle brackets (<http://www.example.org/>) or double
quotes ("http://www.example.org/"). Whitespace is allowed (and
removed) within angle brackets, as long as the URI scheme and the
following : are not interrupted by whitespace.
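The kind of punctuation handling described above can be illustrated with a few lines of base R. This is only a rough sketch, not the package's implementation: the helper name strip_trailing_punct is hypothetical, and the real heuristics are more careful (for example about commas and nested delimiters).

```r
# Hypothetical helper, not part of pickURL: a simplified illustration only.
strip_trailing_punct <- function(x) {
  # Candidates wrapped in angle brackets are kept whole; only the
  # brackets and any whitespace inside them are removed.
  bracketed <- grepl("^<.*>$", x)
  inner <- sub("^<(.*)>$", "\\1", x[bracketed])
  x[bracketed] <- gsub("[[:space:]]+", "", inner)
  # Elsewhere, sentence punctuation at the very end is dropped.
  x[!bracketed] <- sub("[.,;:!?]+$", "", x[!bracketed])
  x
}

strip_trailing_punct(c("http://www.example.org/.",
                       "<http://www.example.org/ path>"))
```

The bracketed form wins over the trailing-punctuation rule, which mirrors the idea that explicit delimiters protect a URI from being cut.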
Some (approximate) validation against the URI specification is
performed, for example in the host part of the URI. The program also
catches illegal ASCII characters and use of the % character for
purposes other than percent-encoding; anything after that, including
the illegal character itself, is not considered a part of the URI. The
program is generally not aware of possible additional rules applying
to URIs following a particular URI scheme. As an exception, the
program knows about the structure of mailto URIs.
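To make the truncation rule concrete, here is a minimal base R sketch that cuts a candidate at the first stray % or illegal character. The helper name cut_at_illegal and the exact set of rejected characters are assumptions made for illustration; the package's own validation may differ in detail.

```r
# Hypothetical helper, not part of pickURL: a simplified illustration only.
cut_at_illegal <- function(x) {
  # Matches, in order of the alternatives:
  #   a "%" that is not followed by two hexadecimal digits,
  #   anything outside printable, non-space ASCII,
  #   a few printable ASCII characters that may not appear in a URI.
  bad <- "%(?![0-9A-Fa-f]{2})|[^\\x21-\\x7E]|[\"<>\\\\^`{|}]"
  pos <- as.integer(regexpr(bad, x, perl = TRUE))
  cut <- pos > 0
  # Keep only the part before the offending character.
  x[cut] <- substr(x[cut], 1, pos[cut] - 1)
  x
}

cut_at_illegal(c("http://example.org/a%20b",
                 "http://example.org/100%done",
                 "http://example.org/a b"))
```

The first string is left unchanged because %20 is a valid percent-encoded octet; the other two are cut just before the stray % and the space, respectively.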
With devtools already installed, run the following command in the R console:
devtools::install_github("mvkorpel/pickURL")
After installing the package, see the help page of function pick_urls.
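A first call might look like the sketch below. It assumes that the first argument of pick_urls is the text to search, which is not stated here; the help page (?pick_urls) is the authority on the actual arguments and the return value. The variable sample_text is made up for the example.

```r
# Example input; any character string with URIs or email addresses will do.
sample_text <- "See <http://www.example.org/> or write to user@example.org."

if (requireNamespace("pickURL", quietly = TRUE)) {
  # Assumption: the text to search is passed as the first argument.
  print(pickURL::pick_urls(sample_text))
} else {
  message("pickURL is not installed; see the installation command above.")
}
```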