Subject: RE: [Xen-devel] segfault in VM
From: James Harper (Jam@bendigoit.com.au)
Date: 07/22/2004 09:49:14 PM
List: com.xensource.lists.xen-devel

That's comforting. I was starting to think of looking for gcc bugs and the like.

Even so, it might be useful to collect the gcc versions of anyone who has either
seen the bug or tried to reproduce it and couldn't. Mine reports itself as "gcc
(GCC) 3.3.4 (Debian 1:3.3.4-2)" with "gcc --version".

James

From: Keir Fraser
Sent: Fri 23/07/2004 11:11 AM
To: James Harper
Cc: Keir Fraser; Derek Glidden; xen-@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

Yeah, it turns out I can reproduce this bug trivially by md5summing a file just slightly bigger than dom0's memory allocation, while floodpinging dom1.

I'm trying out a few things right now, so hopefully I'll be able to report progress on this evil bug r.s.n. :-)

-- Keir
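
For anyone who wants to script the md5sum side of that test, here is a minimal sketch in Python. It is not taken from the thread: the function names and the chunk size are made up for illustration. It hashes the same file repeatedly and records any pass whose digest differs from the first. The flood ping has to be started separately, and the file presumably needs to be larger than the domain's memory so that each pass goes back through the I/O path instead of being served from cache.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_mismatches(path, passes):
    """Hash `path` `passes` times in total.

    The first pass supplies the reference digest. Returns the reference and a
    list of (pass_number, digest) for every later pass that disagrees with it.
    """
    reference = md5_of_file(path)
    mismatches = []
    for attempt in range(1, passes):
        current = md5_of_file(path)
        if current != reference:
            mismatches.append((attempt, current))
    return reference, mismatches
```

Calling count_mismatches("/path/to/bigfile", 100) on a healthy system should return an empty mismatch list. Note that the reference digest comes from the first pass, so if that pass is itself corrupted every later pass will be flagged; comparing against a digest computed on a known-good machine avoids that.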

I just made a change so that the skbuf is always copied in netif_be_start_xmit
but it still crashes, which means most likely that bit is fine or at least isn't
the only code containing bugs.

As another test I also put the 'goto done;' after the 'if ( skb_shared(skb) ||
skb_cloned(skb) || ...' block (still blocking the receive, but doing it later) and
there were no crashes, so I'm comfortable that we've exhausted
netif_be_start_xmit as a source of bugs.

So I guess that leaves net_rx_action. I'm unsure about one thing though: the pages
that get passed from dom0 to domU, how and where (if at all) do they get recycled
back to dom0? Is it possible that domU could still write to a page that dom0
thought it was free to use for something else? If so, where would that be?

Keir: have you been able to reproduce these errors at all?

James

From: Keir Fraser
Sent: Fri 23/07/2004 3:48 AM
To: Derek Glidden
Cc: xen-@lists.sourceforge.net
Subject: Re: [Xen-devel] segfault in VM

It's useful to have the extra data points -- it adds to our confidence that it's the network driver that is somehow at fault here.

Quite how to proceed in narrowing down the problem is unclear. One approach is to perturb the backend driver's data path (e.g., always copying packets into a known-safe page-sized buffer, as a check that our current copy-avoidance checks are not at fault; and replacing the current high-performance but convoluted code for batching hypercalls with something slower but easier to grok). The latter is useful because if the bug goes away then we have a smaller chunk of code to look at; if the bug remains then we end up with a less complex data path that is easier to instrument and bughunt.

If anyone is interested in pursuing this bug independently, the functions most under suspicion are netif_be_start_xmit and net_rx_action, both in linux/arch/xen/drivers/netif/backend/main.c. These two form the data path for packets getting sent to guest OSes.

-- Keir

> On Jul 22, 2004, at 7:22 AM, Keir Fraser wrote:
>
> > Anyway - currently sounds like the bug resides in the most complex
> > half of the most complex driver. Who'd've thought it? ;-)
>
> At this point this data is surely redundant but...
>
> When I went to sleep last night I let my box run dom0 and four VMs
> doing md5sum checks on a couple of large files, hammering the heck out
> of the block i/o drivers and CPU but with all the ifaces/vifs on the
> machine down. When I woke up, all compares had been correct for the
> six hours or so it ran. I re-upped the ifaces and started to ping dom0
> and the VMs and within a minute of the pings starting dom0 started to
> report incorrect md5sums.
>
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> "We all enter this world in the    | Support Electronic Freedom
> same way: naked; screaming; soaked |        http://www.eff.org/
> in blood. But if you live your     |  http://www.anti-dmca.org/
> life right, that kind of thing     |---------------------------
> doesn't have to stop there."  -- Dana Gould

------------------------------------------------------- This SF.Net email is sponsored by BEA Weblogic Workshop FREE Java Enterprise J2EE developer tools! Get your free copy of BEA WebLogic Workshop 8.1 today. http://ads.osdn.com/?ad_id=4721&alloc_id=10040&op=click

_______________________________________________ Xen-devel mailing list Xen-@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/xen-devel
