
Repairing Invalid HTML With Nokogiri (Removing Invalid Tags)

I'm trying to tidy some retrieved HTML using the tidy-ext gem. However, it fails when the HTML is quite broken, so I'm trying to repair the HTML using Nokogiri first: repaired_html

Solution 1:

You can parse the HTML with Nokogiri's XML parser, which is stricter than its HTML parser. That only helps a little, though, because by default it still recovers from errors and applies fixups so that the HTML/XML is marginally correct. By adjusting the parse options you pass to the parser, you can make Nokogiri more rigid, so that it refuses to return an invalid document. Nokogiri is not a sanitizer or a whitelist for desired tags; look at Loofah and Sanitize for that functionality.

If your HTML content is in a variable called html, and you do:

doc = Nokogiri::XML.parse(html)

then check doc.errors afterwards to see if you had errors. Nokogiri will attempt to fix them, but anything that generated an error will be flagged there.

For instance:

Nokogiri::XML('<fb:like></fb:like>').errors
=> [#<Nokogiri::XML::SyntaxError: Namespace prefix fb on like is not defined>]

Nokogiri will attempt to fix up the HTML:

Nokogiri::XML('<fb:like></fb:like>').to_xml
=> "<?xml version=\"1.0\"?>\n<like/>\n"

but it only corrects it to the point of removing the unknown namespace on the tag.

If you want to strip those nodes:

doc = Nokogiri::XML('<fb:like></fb:like>')
doc.search('like').each{ |n| n.remove }
doc.to_xml
=> "<?xml version=\"1.0\"?>\n"
