Subject: [ python-Bugs-736428 ] allow HTMLParser error recovery
From: SourceForge.net (nore@sourceforge.net)
Date: Mar 16, 2004 4:53:03 am
List: org.python.python-bugs-list

Bugs item #736428, was opened at 2003-05-12 12:37
Message generated for change (Comment added) made by kingswood
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=736428&group_id=5470

Category: Python Library
Group: Feature Request
Status: Open
Resolution: None
Priority: 5
Submitted By: Steven Rosenthal (smroid)
Assigned to: Nobody/Anonymous (nobody)
Summary: allow HTMLParser error recovery

Initial Comment:
I'm using 2.3a2.

HTMLParser correctly raises a "malformed start tag" error on:

<meta NAME=DESCRIPTION Content=Lands&#039; End quality... outerwear and more.>

Because my application is imprecise by nature (web scraping), I want to be able to continue after such errors.

I can override the error() method to not raise an exception. To make this work, I also needed to alter HTMLParser.py, near line 316, to read as:

    self.updatepos(i, j)
    self.error("malformed start tag")
    return j # ADDED THIS LINE
    raise AssertionError("we should not get here!")

My enhancement request is to ensure that, at every place where self.error() is called, the "override error() to not raise an exception" continuation strategy works as well as can be hoped.

Thanks,

Steve

----------------------------------------------------------------------

Comment By: Frank Vorstenbosch (kingswood)
Date: 2004-03-16 09:53

Message:
Logged In: YES
user_id=555155

Fixed by my patch against 2.3.3. The patch adds recovery to ensure progress and tries to not miss any data in the input. The error() method is now commented as being overridable; just define def error(self, message): pass to ignore any parsing errors.
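
For illustration, the "override error() so it does not raise" strategy discussed above might look roughly like the sketch below. It is not the patch itself. It assumes Python 3's html.parser module rather than the 2.3-era HTMLParser module, and the class name LenientParser and its errors/tags attributes are hypothetical. Current Python 3 parsers are already lenient and may never call error() at all, so the override here mainly shows the shape of the hook.

```python
from html.parser import HTMLParser


class LenientParser(HTMLParser):
    """Hypothetical subclass: records parse errors instead of raising."""

    def __init__(self):
        super().__init__()
        self.errors = []  # messages passed to error()
        self.tags = []    # names of the start tags seen so far

    def error(self, message):
        # Overridden so that it does not raise; the message is only recorded
        # and the caller is expected to carry on parsing.
        self.errors.append(message)

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# The malformed tag from the initial comment.
parser = LenientParser()
parser.feed("<meta NAME=DESCRIPTION Content=Lands&#039; End quality... "
            "outerwear and more.>")
parser.close()
print(parser.tags, parser.errors)
```

Recording the messages rather than using a bare pass keeps the errors available for logging, which may be useful when scraping imprecise input.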

----------------------------------------------------------------------