Regex For Links In Html Text

Question

I hope this question is not a RTFM one. I am trying to write a Python script that extracts links from a standard HTML webpage (the

Solution 1:

Regexes with HTML get messy. Just use a DOM parser like Beautiful Soup.

Solution 2:

As others have suggested, if real-time-like performance isn't necessary, BeautifulSoup is a good solution:

import urllib2
from BeautifulSoup importBeautifulSouphtml= urllib2.urlopen("http://www.google.com").read()
soup = BeautifulSoup(html)
all_links = soup.findAll("a")

As for the second question, yes, HTML links ought to be well-defined, but the HTML you actually encounter is very unlikely to be standard. The beauty of BeautifulSoup is that it uses browser-like heuristics to try to parse the non-standard, malformed HTML that you are likely to actually come across.

If you are certain to be working on standard XHTML, you can use (much) faster XML parsers like expat.

Regex, for the reasons above (the parser must maintain state, and regex can't do that) will never be a general solution.

Solution 3:

No there isn't.

You can consider using Beautiful Soup. You can call it the standard for parsing html files.

Solution 4:

Shoudln't a link be a well-defined regex?

No, [X]HTML is not in the general case parseable with regex. Consider examples like:

<linktitle='hello">world'href="x">link</link><!-- <link href="x">not a link</link> -->
<![CDATA[ ><link href="x">not a link</link> ]]>
<script>document.write('<linkhref="x">not a link</link>')</script>

and that's just a few random valid examples; if you have to cope with real-world tag-soup HTML there are a million malformed possibilities.

If you know and can rely on the exact output format of the target page you can get away with regex. Otherwise it is completely the wrong choice for scraping web pages.

Solution 5:

Shoudln't a link be a well-defined regex? This is a rather theoretical question,

I second PEZ's answer:

I don't think HTML lends itself to "well defined" regular expressions since it's not a regular language.

As far as I know, any HTML tag may contain any number of nested tags. For example:

<ahref="http://stackoverflow.com">stackoverflow</a><ahref="http://stackoverflow.com"><i>stackoverflow</i></a><ahref="http://stackoverflow.com"><b><i>stackoverflow</i></b></a>
...

Thus, in principle, to match a tag properly you must be able at least to match strings of the form:

BE
BBEE
BBBEEE
...
BBBBBBBBBBEEEEEEEEEE
...

where B means the beginning of a tag and E means the end. That is, you must be able to match strings formed by any number of B's followed by the same number of E's. To do that, your matcher must be able to "count", and regular expressions (i.e. finite state automata) simply cannot do that (in order to count, an automaton needs at least a stack). Referring to PEZ's answer, HTML is a context-free grammar, not a regular language.

Html5 Tech