| From | Sent On | Attachments |
|---|---|---|
| Kern Sibbald | Jun 1, 2003 10:54 am | |
| Dan Langille | Jun 1, 2003 11:32 am | |
| Justin T. Gibbs | Jun 1, 2003 1:08 pm | |
| Kern Sibbald | Jun 1, 2003 2:44 pm | |
| Justin T. Gibbs | Jun 1, 2003 3:39 pm | |
| Matthew Jacob | Jun 1, 2003 5:00 pm | |
| Matthew Jacob | Jun 1, 2003 5:13 pm | |
| Dan Langille | Jun 1, 2003 6:58 pm | |
| Matthew Jacob | Jun 1, 2003 7:03 pm | |
| Kern Sibbald | Jun 2, 2003 1:28 am | |
| Kern Sibbald | Jun 2, 2003 1:29 am | |
| Kern Sibbald | Jun 2, 2003 1:57 am | |
| Kern Sibbald | Jun 2, 2003 3:45 am | |
| Dan Langille | Jun 2, 2003 4:28 am | |
| Matthew Jacob | Jun 2, 2003 8:05 am | |
| Justin T. Gibbs | Jun 2, 2003 8:10 am | |
| Dan Langille | Jun 2, 2003 8:14 am | |
| Matthew Jacob | Jun 2, 2003 8:21 am | |
| Kern Sibbald | Jun 2, 2003 8:27 am | |
| Dan Langille | Jun 2, 2003 9:46 am | |
| Dan Langille | Jun 2, 2003 11:05 am | |
| Matthew Jacob | Jun 2, 2003 11:11 am | |
| Justin T. Gibbs | Jun 2, 2003 11:49 am | |
| Dan Langille | Jun 2, 2003 12:06 pm | |
| Justin T. Gibbs | Jun 2, 2003 12:10 pm | |
| Matthew Jacob | Jun 2, 2003 1:14 pm | |
| Dan Langille | Jun 2, 2003 2:16 pm | |
| Matthew Jacob | Jun 2, 2003 2:24 pm | |
| Kern Sibbald | Jun 2, 2003 2:46 pm | |
| Matthew Jacob | Jun 2, 2003 2:55 pm | |
| Kern Sibbald | Jun 2, 2003 3:31 pm | |
| Carl Reisinger | Jun 2, 2003 3:44 pm | |
| Matthew Jacob | Jun 2, 2003 3:44 pm | |
| Dan Langille | Jun 2, 2003 6:37 pm | |
| Kern Sibbald | Jun 3, 2003 12:28 am | |
| Kern Sibbald | Jun 3, 2003 6:07 am | |
| Carl Reisinger | Jun 3, 2003 6:19 am | |
| Kern Sibbald | Jun 3, 2003 6:37 am | |
| Carl Reisinger | Jun 3, 2003 7:01 am | |
| Matthew Jacob | Jun 3, 2003 7:34 am | |
| Justin T. Gibbs | Jun 3, 2003 7:51 am | |
| Kern Sibbald | Jun 3, 2003 8:05 am | |
| Kern Sibbald | Jun 3, 2003 8:11 am | |
| Matthew Jacob | Jun 3, 2003 9:03 am | |
| Dan Langille | Jun 3, 2003 9:10 am | |
| Justin T. Gibbs | Jun 3, 2003 9:24 am | |
| Kern Sibbald | Jun 3, 2003 9:40 am | |
| Justin T. Gibbs | Jun 3, 2003 10:03 am | |
| Kern Sibbald | Jun 3, 2003 10:19 am | |
| Kern Sibbald | Jun 3, 2003 10:34 am | |
| Matthew Jacob | Jun 3, 2003 11:00 am | |
| Matthew Jacob | Jun 3, 2003 11:16 am | |
| Matthew Jacob | Jun 3, 2003 11:39 am | |
| Justin T. Gibbs | Jun 3, 2003 12:12 pm | |
| Dan Langille | Jun 3, 2003 12:43 pm | |
| Matthew Jacob | Jun 3, 2003 12:46 pm | |
| Kern Sibbald | Jun 3, 2003 1:05 pm | |
| PostMaster General | Jun 3, 2003 2:21 pm | |
| Kern Sibbald | Jun 4, 2003 12:20 am | |
| Matthew Jacob | Jun 4, 2003 7:51 am | |
| Kern Sibbald | Jun 4, 2003 9:51 am | |
| Kern Sibbald | Jun 6, 2003 7:38 am | |
| Dan Langille | Jun 6, 2003 8:59 am | |
| Matthew Jacob | Jun 6, 2003 11:50 am | |
| Dan Langille | Jun 20, 2003 6:17 pm | |
| Dan Langille | Jul 1, 2003 5:07 pm | |
| Matthew Jacob | Jul 1, 2003 11:11 pm | |
| Michael L. Squires | Aug 25, 2003 4:16 am | |
| Dan Langille | Aug 25, 2003 9:13 am | |
| Michael L. Squires | Aug 27, 2003 5:27 am |
| Subject: | SCSI tape data loss | |
|---|---|---|
| From: | Kern Sibbald (ke...@sibbald.com) | |
| Date: | Jun 3, 2003 8:05:00 am | |
| List: | org.freebsd.freebsd-scsi | |
What is clear from the output is that the write() is returning a -1 status. errno could possibly be 0, in which case I set it to ENOSPC, if it is not 0 then it is ENOSPC judging by the error message that is printed "Write error on device ...".
You may want to see more, but here is the basic code that does the write: if ((uint32_t)(stat=write(dev->fd, block->buf, (size_t)wlen)) != wlen) { /* We should check for errno == ENOSPC, BUT many * devices simply report EIO when it is full. * with a little more thought we may be able to check * capacity and distinguish real errors and EOT * conditions. In any case, we probably want to * simulate an End of Medium. */ clrerror_dev(dev, -1);
if (dev->dev_errno == 0) { dev->dev_errno = ENOSPC; /* out of space */ }
Dmsg4(10, "=== Write error. size=%u rtn=%d errno=%d: ERR=%s\n", wlen, stat, dev->dev_errno, strerror(dev->dev_errno));
if (stat == -1) { Jmsg(jcr, M_ERROR, 0, _("Write error on device %s. ERR=%s.\n"), dev->dev_name, strerror(dev->dev_errno)); } else { Jmsg3(jcr, M_INFO, 0, _("End of medium on device %s. Write of %u bytes got %d dev->dev_name, wlen, stat); } block->write_failed = true; dev->EndBlock = dev->block_num; dev->EndFile = dev->file; weof_dev(dev, 1); /* end the tape */ weof_dev(dev, 1); /* write second eof */ dev->state |= (ST_EOF | ST_EOT | ST_WEOT);
======= clererror() does: void clrerror_dev(DEVICE *dev, int func) { char *msg = NULL;
dev->dev_errno = errno; /* save errno */ if (errno == EIO) { dev->VolCatInfo.VolCatErrors++; }
if (!(dev->state & ST_TAPE)) { return; } if (errno == ENOTTY || errno == ENOSYS) { /* Function not implemented */ switch (func) { case -1: Emsg0(M_ABORT, 0, "Got ENOTTY on read/write!\n"); break; case MTWEOF: msg = "WTWEOF"; dev->capabilities &= ~CAP_EOF; /* turn off feature */ break; #ifdef MTEOM case MTEOM: msg = "WTEOM"; dev->capabilities &= ~CAP_EOM; /* turn off feature */ break; #endif case MTFSF: msg = "MTFSF"; dev->capabilities &= ~CAP_FSF; /* turn off feature */ break; case MTBSF: msg = "MTBSF"; dev->capabilities &= ~CAP_BSF; /* turn off feature */ break; case MTFSR: msg = "MTFSR"; dev->capabilities &= ~CAP_FSR; /* turn off feature */ break; case MTBSR: msg = "MTBSR"; dev->capabilities &= ~CAP_BSR; /* turn off feature */ break; default: msg = "Unknown"; break; } if (msg != NULL) { dev->dev_errno = ENOSYS; Mmsg1(&dev->errmsg, _("This device does not support %s.\n"), msg); Emsg0(M_ERROR, 0, dev->errmsg); } } /* Found on Linux */ #ifdef MTIOCLRERR { struct mtop mt_com; mt_com.mt_op = MTIOCLRERR; mt_com.mt_count = 1; /* Clear any error condition on the tape */ ioctl(dev->fd, MTIOCTOP, (char *)&mt_com); Dmsg0(200, "Did MTIOCLRERR\n"); } #endif
/* Typically on FreeBSD */ #ifdef MTIOCERRSTAT { /* Read and clear SCSI error status */ union mterrstat mt_errstat; Pmsg2(000, "Doing MTIOCERRSTAT errno=%d ERR=%s\n", dev->dev_errno, strerror(dev->dev_errno)); ioctl(dev->fd, MTIOCERRSTAT, (char *)&mt_errstat); } #endif }
====
On Tue, 2003-06-03 at 16:34, Matthew Jacob wrote:
The fact that you're getting ENOSPC means that you're getting to PEOT- past LEOT. I guess I need to see the Bacula source to see why LEOT is being missed. If you can build a kernel with CAMDEBUG and run
camcontrol debug -I b:t:l
(bus:target:lun for the tape) and rerun the test, you'll get boatloads of output, but an audit trail of what sastart and saerror are doing around the PEOT timeframe.
There's other stuff here that I need to collect my thoughts on to mail about. This will happen later today.
On Tue, 3 Jun 2003, Kern Sibbald wrote:
Hello,
Dan has now re-run our test of writing to two tapes. In this test, he told Bacula not to attempt to re-read the last block written, so Bacula wrote until -1 with errno=ENOSPC was returned, wrote two EOF marks then put up the next volume.
The results were the same (more or less) 12 blocks of data were lost, which corresponds to the smaller size of the restored file that was split across two tapes.
These 12 blocks were also at the end of the tape.
During the restore, Bacula reported the following:
03-Jun-2003 05:01 undef-sd: RestoreFiles.2003-06-03_04.36.59 Error: Invalid block number. Expected 6060, got 6072
and in Bacula's database, Bacula indicates that blocks 0 to 6072 were written to the first tape. In fact, only blocks 0 to 6071 were written to the first tape -- I see that Bacula has included the failed block in its count, which is wrong, but this doesn't change the results at all though.
Bottom line:
Even when we eliminate the code that backs up and re-reads the last block, we still see the last 12 or 13 blocks being lost. They were written by the program but are not physically on the tape.
Next step:
Dan is now running a test where Bacula will stop writing on the first tape before the EOM is reached.
Best regards,
Kern





