Algorithm: Recognizing URLs within plain text, and displaying them as clickable links in HTML, in Wicket
For a customer project, I have just written code which takes user-entered plain text and converts it to HTML, with URLs marked up as clickable links.
Although marking up links in user-entered text is standard functionality, Stack Overflow would have you believe that it should not be attempted, as it cannot be done perfectly. This is technically correct; however, users are accustomed to software which makes a best-effort attempt, and customers are accustomed to taking delivery of software which meets users' expectations.
The software I have written is available as open-source, either as a Java class with the method encodeLinksToHtml which takes some plain text and returns safe HTML with clickable links, or as a component in the Wicket web framework called MultilineLabelWithClickableLinks.
Finding links within text is not as easy as it seems
Several things make this harder than it looks:
- Users may enter URLs with or without a protocol (http://).
- Domains may or may not have www at the start.
- There may or may not be a trailing slash.
- There may or may not be information after the URL.
- A whitelist of acceptable domain endings such as ".com" is a bad idea, as the list is large and subject to change over time.
- Punctuation after a link should not be included (for example "see foo.com.", where the trailing dot is not part of the URL).
The software matches text of the form foo://foo.foo/foo, where:
- Protocol is optional
- Domain must contain at least one dot
- Last part is optional and can contain anything apart from spaces and trailing punctuation (which belongs to the sentence in which the link is embedded)
Quotes are not allowed within the URL, because the foo in <a href="foo"> must not contain quotes: a quote would end the attribute value early and allow markup to be injected (XSS).
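The matching rules above could be sketched as a regular expression. The following is a minimal illustration in Java, not the pattern used by the software described here; the class name LinkFinderSketch, the method findLinks and the exact character classes are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and character classes are assumptions.
class LinkFinderSketch {

    static final Pattern URL = Pattern.compile(
        // Optional protocol, e.g. "http://"
        "(?:[a-zA-Z][a-zA-Z0-9+.-]*://)?"
        // Domain: labels separated by dots, at least one dot required.
        // A dot must be followed by another label, so a sentence-ending
        // dot directly after the domain is never part of the match.
      + "[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)+"
        // Optional last part: no whitespace, quotes or angle brackets,
        // and the final character must not be sentence punctuation.
      + "(?:/(?:[^\\s\"'<>]*[^\\s\"'<>.,;:!?)])?)?");

    // Returns every substring of the text which looks like a URL.
    static List<String> findLinks(String text) {
        List<String> links = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        while (matcher.find()) {
            links.add(matcher.group());
        }
        return links;
    }
}
```

With this sketch, findLinks("see foo.com.") returns only foo.com, leaving the sentence's final dot outside the match. Being best-effort, it also matches non-links which happen to contain a dot, such as the number 3.14.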
Making links clickable is not as easy as it seems
Facts:
- Conversion from plain text to HTML requires that special characters such as & get replaced by entities such as &amp;.
- Links such as foo.com/a&b need to get replaced by <a href='foo.com/a&b'>foo.com/a&amp;b</a>. (The & in the URL needs to stay & in the href, but needs to become &amp; in the visible text part.)
Therefore,
- One cannot first replace entities and then mark up links, as the links should contain unescaped & as opposed to &amp;.
- One cannot first mark up links and then replace entities, as the angle brackets in the link's <a href... would get replaced by &lt;a href..., which the browser would not understand.
Therefore, the replacement of HTML entities, and the replacement of links, must be done in a single (complicated) pass, rather than two (simple) passes.
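A single pass of this kind might look roughly like the following Java sketch. It is an illustration under stated assumptions rather than the actual open-source implementation: the class SinglePassSketch, the methods escape and toHtml, and the simplified URL pattern are all hypothetical. It walks the input once, escaping the text between links and emitting each link with the raw URL in the href and the escaped URL as the visible text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: names and the URL pattern are assumptions.
class SinglePassSketch {

    // Simplified pattern: optional protocol, dotted domain, optional
    // last part without whitespace, quotes, angle brackets or
    // trailing punctuation.
    static final Pattern URL = Pattern.compile(
        "(?:[a-zA-Z][a-zA-Z0-9+.-]*://)?"
      + "[a-zA-Z0-9-]+(?:\\.[a-zA-Z0-9-]+)+"
      + "(?:/(?:[^\\s\"'<>]*[^\\s\"'<>.,;:!?)])?)?");

    // Replaces HTML special characters with entities.
    // The ampersand must be handled first, so that the entities
    // produced by the later replacements are not escaped again.
    static String escape(String plain) {
        return plain.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;")
                    .replace("'", "&#39;");
    }

    // Converts plain text to HTML in one walk over the input.
    static String toHtml(String plain) {
        StringBuilder html = new StringBuilder();
        Matcher matcher = URL.matcher(plain);
        int pos = 0;
        while (matcher.find()) {
            // Ordinary text before the link: escape entities only.
            html.append(escape(plain.substring(pos, matcher.start())));
            String url = matcher.group();
            // The pattern excludes quotes, so the URL cannot break
            // out of the single-quoted attribute value.
            html.append("<a href='").append(url).append("'>")
                .append(escape(url))
                .append("</a>");
            pos = matcher.end();
        }
        // Ordinary text after the last link.
        html.append(escape(plain.substring(pos)));
        return html.toString();
    }
}
```

For the input "a < b, see foo.com/a&b." this sketch produces "a &lt; b, see <a href='foo.com/a&b'>foo.com/a&amp;b</a>.": the angle bracket in the surrounding text is escaped, the link's own angle brackets are not, and the & is treated differently in the href and in the visible text.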