Subject: Re: [regression] unable to boot: no GEOM devices found.
From: David Naylor (nayl@gmail.com)
Date: Apr 12, 2011 9:51:18 pm
List: org.freebsd.freebsd-current

On Tuesday 12 April 2011 23:39:30 Garrett Cooper wrote:

On Tue, Apr 12, 2011 at 2:08 PM, Alexander Motin <ma@freebsd.org> wrote:

YongHyeon PYUN wrote:

On Tue, Apr 12, 2011 at 11:12:55PM +0300, Alexander Motin wrote:

David Naylor wrote:

On Tuesday 12 April 2011 08:17:51 Alexander Motin wrote:

David Naylor wrote:

I am running -current and for the past few days (since at least 2011/04/11) I have been unable to boot.

The boot process stops when it tries to find a bootable device. The prompt (when pressing '?') does not display any devices, and yielding one second (or more) to the kernel (by pressing '.') does not improve the situation.

A known working date is 2011/02/20.

I am running amd64 on an nVidia MCP51 chipset.

MCP51... again...

I am willing to help any way I can.

You could start by capturing and posting a verbose dmesg, either in full or at least the parts related to disks.

I captured the dmesg output for both the old (working) kernel and the new (bad) kernel. See attached for the difference between the two. If you need the full dmesg please let me know.
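A difference between two dmesg captures, like the one attached to this message, can be produced with a plain unified diff. The following is only a minimal sketch, assuming both captures have been saved as text; the function name `dmesg_diff` and the labels `dmesg.good` and `dmesg.bad` are hypothetical and do not come from the thread.

```python
import difflib


def dmesg_diff(good_text, bad_text):
    """Return only the lines that differ between two dmesg captures.

    Lines present only in the working capture are prefixed with '-',
    lines present only in the failing capture are prefixed with '+'.
    """
    diff = list(difflib.unified_diff(
        good_text.splitlines(),
        bad_text.splitlines(),
        fromfile="dmesg.good",  # hypothetical label for the working kernel
        tofile="dmesg.bad",     # hypothetical label for the failing kernel
        lineterm="",
        n=0,                    # no context lines, changes only
    ))
    # The first two entries are the '---'/'+++' file headers; after that,
    # keep the changed lines and drop the '@@' hunk markers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]
```

To use it, read each saved capture into a string and pass the working one first. Timestamps or other values that change on every boot will also show up as differences, so the result usually needs a manual pass.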

One thing I found is that the old kernel would not boot if I simply rebooted from the bad kernel. I had to do a hard power off before the old kernel would work again. Is some device state surviving between reboots?

+ata2: reiniting channel ..
+ata2: SATA connect time=0ms status=00000113
+ata2: reset tp1 mask=01 ostat0=58 ostat1=00
+ata2: stat0=0x50 err=0x01 lsb=0x00 msb=0x00
+ata2: reset tp2 stat0=50 stat1=00 devices=0x1
+ata2: reinit done ..
+unknown: FAILURE - ATA_IDENTIFY timed out LBA=0

Since all devices are detected but do not respond to commands, I would suppose that something is wrong with ATA interrupts. There is a long chain of interrupt problems with this chipset. I have already tried to debug one case where ATA wasn't generating interrupts at all, unfortunately without success: requests were executing but not generating interrupts, and it did not look like an ATA driver problem.
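The `status=00000113` value in the dmesg excerpt supports the observation that the device is detected. The sketch below assumes that field is the SATA SStatus register printed in hexadecimal, which the thread does not state explicitly. The field layout (DET, SPD, IPM) follows the Serial ATA specification; the function name `decode_sstatus` is hypothetical.

```python
# Field meanings per the Serial ATA SStatus register definition.
DET = {
    0x0: "no device detected, no PHY communication",
    0x1: "device detected, PHY communication not established",
    0x3: "device detected, PHY communication established",
    0x4: "PHY offline",
}
SPD = {
    0x0: "no negotiated speed",
    0x1: "Gen1 (1.5 Gb/s)",
    0x2: "Gen2 (3.0 Gb/s)",
    0x3: "Gen3 (6.0 Gb/s)",
}
IPM = {
    0x0: "device not present or no communication",
    0x1: "active",
    0x2: "partial power management state",
    0x6: "slumber power management state",
}


def decode_sstatus(value):
    """Split an SStatus value into (DET, SPD, IPM) field values."""
    det = value & 0xF          # bits 3:0, device detection
    spd = (value >> 4) & 0xF   # bits 7:4, negotiated interface speed
    ipm = (value >> 8) & 0xF   # bits 11:8, interface power management
    return det, spd, ipm


if __name__ == "__main__":
    # Assumption: the dmesg field is SStatus in hex.
    det, spd, ipm = decode_sstatus(int("00000113", 16))
    print("DET:", DET.get(det, "reserved"))
    print("SPD:", SPD.get(spd, "reserved"))
    print("IPM:", IPM.get(ipm, "reserved"))
```

Under that assumption the link itself is up (device present, PHY established, interface active), which is consistent with the failure being in command completion rather than in detection.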

As for a possible candidate revision triggering your problem, I would look at this message:
+pcib0: Enabling MSI window for HyperTransport slave at pci0:0:9:0

At least it is recent (SVN revs 219737,219740 on 2011-03-18 by jhb) and it is interrupt related.

Does the driver disable MSI for MCP51?

ata(4) doesn't use MSI by default, and I doubt this controller supports it anyway. But if I am not mixing something up, there were very strange situations with MSI on that chipset, where enabling it on one device caused interrupt problems on another.

I think jhb's patch fixed an MSI issue common to all MCP chipsets.

I am not saying it is wrong. It could just have triggered something.

Could the OP try disabling MSI[X] to see whether or not the issue still occurs then?
-Garrett

I added:
hw.pci.enable_msi=0
hw.pci.enable_msix=0
to loader.conf but the problem persisted.

@mav: I will revert r219737 and r219740 and try again, but this will be in 10+ hours...

Thanks