| Subject: | Re[2]: radio update | |
|---|---|---|
| From: | John Daniels (jd...@syntropy.co.uk) | |
| Date: | Mar 4, 2009 7:12:34 am | |
| List: | net.java.dev.spots.dev | |
Hi Ron,
I have not tried to measure the number of NO_ACKs, but I have not seen any increase in the number of route discoveries. Of course the SPOT app I am running is not causing any GCs....
I still think you will see an increase in route discoveries if the receiver is doing GCs, although with the numbers you've chosen ({0, 20, 45}) it's marginal whether every GC will cause a FIFO overflow. I think it would be better to go with {0, 60, 60}. I doubt whether the extra 55mS will make any difference. Route discoveries are expensive; you want as few as possible.
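The 55 ms figure follows from summing each schedule. Here is a minimal Java sketch of that arithmetic, assuming the three entries are the waits in milliseconds before successive MAC-level attempts. The class and method names are illustrative and are not taken from the SPOT radio stack.

```java
public class RetryBudget {
    // Sum of the waits in a retry schedule, in milliseconds.
    // Assumption: each entry is the wait before the corresponding attempt.
    static int totalWaitMs(int[] scheduleMs) {
        int total = 0;
        for (int waitMs : scheduleMs) {
            total += waitMs;
        }
        return total;
    }

    public static void main(String[] args) {
        int[] checkedIn = {0, 20, 45}; // values now in the repository
        int[] proposed = {0, 60, 60};  // values suggested in this message
        int extra = totalWaitMs(proposed) - totalWaitMs(checkedIn);
        System.out.println("checked-in total: " + totalWaitMs(checkedIn) + " ms");
        System.out.println("proposed total:   " + totalWaitMs(proposed) + " ms");
        System.out.println("extra wait:       " + extra + " ms");
    }
}
```

The totals are 65 ms and 120 ms, so the proposed schedule waits 55 ms longer in the worst case before giving up on a receiver.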
One change I just made is to change the initial backoff exponent to 0 so that there is no backoff delay before the first attempt to send. I'll be trying that out tomorrow to see if it makes any noticeable difference. I'm also thinking of modifying the retry part of the CSMA-CA algorithm so that the backoff exponent for the first retry is immediately jumped to 3 (since delays for a value of 1 or 2 are going to be less than the time to transmit a full packet) and then incremented by 1 for any subsequent retries. So technically that won't be exactly according to the spec, but I think that change will not break compatibility with any other 802.15.4 compliant receiver.
The backoff algorithm is very clearly defined in the spec, and I can't help thinking that the backoff delays and exponents were chosen for a good reason. And if SPOTs use a different algorithm to other 802.15.4 devices there might not be equitable network usage (assuming it ever becomes important that SPOTs work in the same environment as other 802.15.4 devices).
As I mentioned the old radiostream timeout was 15 seconds, which explained why I was seeing the OTA commands fail while waiting for a CRC level ack. As soon as I lowered the radiostream timeout below 1 second those errors just went away. At one point I tried using a 400 millisecond timeout, but that was a bit too short. Right now I am using an 800 millisecond timeout and that seems to be a good compromise. Deploying an app over 4 hops had very few retransmissions.
I'm sure 800mS is fine for a 4 hop transmission (8 hops really including the hops to get the end-to-end ack back) provided there aren't any/many link-level retransmissions. But as soon as there are, perhaps because of collisions in a busy environment or because the signal strength is marginal, I predict things will go crazy, with lots of unnecessary radiostream retransmissions. Since the hop limit is 7, and since your experiment suggests you need 200mS of timeout for each hop, surely the timeout has to be at least 1400mS? If you allow for say, 1 link-level retransmission for every 7 hops (i.e. 2 link-level retransmissions for the round trip) then you'll need an end-to-end timeout of about 1600mS. I can't see any case for it being less than that. But clearly, 15 seconds was a bit silly :-).
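The timeout estimate above can be reproduced with a few lines of Java. This is a rough sketch: the 200 ms per hop comes from the 800 ms / 4 hop experiment, while the 100 ms cost per link-level retransmission is an assumption chosen to match the "about 1600mS" figure (roughly the time for one physical hop). The class, constants, and method are hypothetical names.

```java
public class EndToEndTimeout {
    // Per-hop round-trip allowance implied by the experiment: 800 ms for 4 hops.
    static final int MS_PER_HOP = 800 / 4; // 200 ms

    // Assumption: one link-level retransmission costs about one physical hop,
    // i.e. half of the per-hop round-trip allowance.
    static final int MS_PER_RETRANSMISSION = MS_PER_HOP / 2; // 100 ms

    // Estimated end-to-end timeout for a route of hopLimit hops with the
    // given number of link-level retransmissions over the whole round trip.
    static int timeoutMs(int hopLimit, int retransmissions) {
        return hopLimit * MS_PER_HOP + retransmissions * MS_PER_RETRANSMISSION;
    }

    public static void main(String[] args) {
        System.out.println("7 hops, no retransmissions: " + timeoutMs(7, 0) + " ms");
        System.out.println("7 hops, 2 retransmissions:  " + timeoutMs(7, 2) + " ms");
    }
}
```

With the hop limit of 7 this gives 1400 ms with no retransmissions and 1600 ms when two are allowed for.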
Note the retransmission now isn't scheduled until after the transmit happens, so any route finding happens outside of the end-to-end timeout.
That's a good improvement.
I presume the added delay is the reason for the decrease in FIFO overflows.
Yes I imagine so, and it certainly explains why the crc error and channel access error counts went down.
How big was the suite? The KSN work indicates that suite size is a major factor in reliability.
I used the BounceDemo, which is 25652 bytes.
If I understand the surface graph in Markus Bestehorn's Symposium presentation correctly, the real pain with the Blue stack begins once the size is over 40kb. Judging from their graph, I suggest you try a suite of at least 60kb.
Ron, I really think it's time to create a proper test rig that allows repeatable and consistent multihop testing. I discussed the idea with Pete and Bob more than a year ago, and Bob seemed ready to create a hardware set up with SPOTs connected via coax instead of aerials, so that the set of receivers can be accurately controlled. If you had, say, a matrix of 16 SPOTs with controllable pathways you'd be able to conduct some very meaningful experiments. A simpler form of that rig would be very useful for Eric, too, for the continuous test system.
--John
==== Original message
From: Ron Goldman <Ron....@sun.com>
Date: 04 March 2009
Time: 11:15:29 AM
Subject: radio update
----
John,
Some more background. Before doing anything I did some informal experiments, running a SPOT app that just displayed the CRC errors, channel access failures & rx fifo overflows in its LEDs. Then I tried doing some OTA commands over several hops and looked at the errors involved.
What I quickly discovered was a huge number of CRC errors & even more channel access errors. So with my changes I tried to understand the causes of those errors and make changes to decrease them.
Let me try to answer some of your questions.....
-- Ron --
On Mar 2, 2009, at 11:43 PM, John Daniels wrote:
Hi Ron,
Sounds promising.
So far I've decreased some timeouts (MAC-layer retry if no ack, radiostream retry if no end-to-end ack), added some delays (between lowpan packets) & improved the error handling in the OTA code.
I notice that you have checked-in some of these changes. You have changed the MAC retry timeouts from {0, 50, 200} to {0, 20, 45}. As we discussed privately, the intuitive effect of this change on heavily loaded receivers should be to increase the number of MAC-level NO_ACKs caused by GC pauses, and hence increase the number of route discoveries. Is that what you observed?
My thinking here was basically not to try to worry about why a SPOT might not respond, but to instead try to more quickly have the attempt to transmit complete, one way or another. If a SPOT didn't reply because it was doing a GC then tough---radio packets are inherently unreliable and it is up to higher layers of the radio stack, e.g. Radiostream, to recover from dropped packets.
I have not tried to measure the number of NO_ACKs, but I have not seen any increase in the number of route discoveries. Of course the SPOT app I am running is not causing any GCs....
I notice you have increased DEFAULT_MAX_CSMA_BACKOFFS from 4 to 5. It probably isn't important, but that moves the MAC layer outside the 802.15.4 spec which says the default should be 4 (although 5 is in the valid range of values).
Since I was seeing many, many channel access failures my first fix was to increase the number of attempts to send and that did seem to help. My reading of the 802.15.4 spec was that it is okay to choose any valid value, so we don't always need to use the default one.
One change I just made is to change the initial backoff exponent to 0 so that there is no backoff delay before the first attempt to send. I'll be trying that out tomorrow to see if it makes any noticeable difference. I'm also thinking of modifying the retry part of the CSMA-CA algorithm so that the backoff exponent for the first retry is immediately jumped to 3 (since delays for a value of 1 or 2 are going to be less than the time to transmit a full packet) and then incremented by 1 for any subsequent retries. So technically that won't be exactly according to the spec, but I think that change will not break compatibility with any other 802.15.4 compliant receiver.
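The proposed exponent schedule can be sketched as follows. This is an illustration under stated assumptions, not the code in the SPOT MAC layer: the class and method names are hypothetical, the cap `maxBE` is an added parameter (the spec's default maximum exponent is 5), and the timing constants are for the 2.4 GHz PHY, where a unit backoff period is 20 symbols of 16 us and a maximum-size frame occupies 133 bytes on air at 32 us per byte.

```java
public class ModifiedBackoff {
    // 2.4 GHz 802.15.4: aUnitBackoffPeriod = 20 symbols, 16 us per symbol.
    static final int UNIT_BACKOFF_US = 20 * 16; // 320 us

    // Maximum frame on air: 127-byte payload + 6 bytes of PHY header, 32 us per byte.
    static final int FULL_PACKET_US = 133 * 32; // 4256 us

    // Backoff exponent for an attempt under the proposed scheme:
    // attempt 0 (first send) uses 0, the first retry jumps to 3,
    // and each later retry adds 1, capped at maxBE (an assumed cap).
    static int backoffExponent(int attempt, int maxBE) {
        if (attempt == 0) {
            return 0;
        }
        return Math.min(3 + (attempt - 1), maxBE);
    }

    // Largest delay the random backoff can produce: (2^BE - 1) unit periods.
    static int maxDelayUs(int backoffExponent) {
        return ((1 << backoffExponent) - 1) * UNIT_BACKOFF_US;
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt <= 5; attempt++) {
            int be = backoffExponent(attempt, 5);
            System.out.println("attempt " + attempt + ": BE=" + be
                    + ", max delay " + maxDelayUs(be) + " us");
        }
        System.out.println("full packet: " + FULL_PACKET_US + " us");
    }
}
```

Under these constants the maximum delays for exponents 1 and 2 are 320 us and 960 us, both well under the roughly 4.3 ms needed to transmit a full packet, which is the motivation given for skipping them on retries.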
Could you give us some more details about the other changes since these don't seem to have been checked-in yet?
What end-to-end timeout are you using?
As I mentioned the old radiostream timeout was 15 seconds, which explained why I was seeing the OTA commands fail while waiting for a CRC level ack. As soon as I lowered the radiostream timeout below 1 second those errors just went away. At one point I tried using a 400 millisecond timeout, but that was a bit too short. Right now I am using an 800 millisecond timeout and that seems to be a good compromise. Deploying an app over 4 hops had very few retransmissions.
Note the retransmission now isn't scheduled until after the transmit happens, so any route finding happens outside of the end-to-end timeout.
How long is the delay between lowpan packets, and under what circumstances are you inserting the delay? Is this delay the reason the number of FIFO overflows has gone down (since the MAC changes probably haven't affected that)?
The major reason for the channel access failures I think can be explained by the lowpan layer trying to send the next packet before the previous one has moved out of radio range. I.e. if A is sending several packets to D with B & C acting as relays, then after sending a packet to B, A must wait long enough for B to relay it to C and also for C to send it out to D. Otherwise A's next packet will collide with the relayed packet. If A & C send simultaneously (because they cannot hear each other's transmissions), then B will not receive the packet from A (the hidden terminal problem).
After each lowpan packet I have the sending thread wait for 10 msecs, plus another 10 msecs if it was a broadcast packet, or plus (5 + 7 * number of hops) msecs for a meshed packet.
After sending an RREP or RERR routing packet I add a 10 msec delay.
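The pacing rule described above can be written out as a small Java sketch. It only restates the arithmetic from this message; the class, constant, and method names are hypothetical and do not come from the lowpan code.

```java
public class LowpanPacing {
    // Base wait after every lowpan packet, in milliseconds.
    static final int BASE_DELAY_MS = 10;

    // Single-hop unicast packet: base wait only.
    static int unicastDelayMs() {
        return BASE_DELAY_MS;
    }

    // Broadcast packet: base wait plus another 10 ms.
    static int broadcastDelayMs() {
        return BASE_DELAY_MS + 10;
    }

    // Meshed packet: base wait plus (5 + 7 * hops) ms, so the sender stays
    // quiet while the relays forward the previous packet out of range.
    static int meshedDelayMs(int hops) {
        return BASE_DELAY_MS + 5 + 7 * hops;
    }

    public static void main(String[] args) {
        System.out.println("unicast:   " + unicastDelayMs() + " ms");
        System.out.println("broadcast: " + broadcastDelayMs() + " ms");
        for (int hops = 2; hops <= 6; hops++) {
            System.out.println("meshed, " + hops + " hops: " + meshedDelayMs(hops) + " ms");
        }
    }
}
```

For the A to D example with three hops, the sender would wait 10 + 5 + 21 = 36 ms before its next packet.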
I presume the added delay is the reason for the decrease in FIFO overflows.
Are you able to say which of these changes had the most effect?
Adding the delays by far had the most effect.
Did you evaluate the changes separately or did you have a theory that required them all to be made at once?
What effect has this all had on performance?
Well the good news is that multihop transmissions now work much better. The bad news is that adding the delays does slow down the single hop tests:
Time to send 20000 integers = 23863msecs, target was 12400
RadiostreamMasterTest test 1: Expected: 12400 Got: 23863
Time to send 500 integers = 12662msecs, target was 11800
RadiogramMasterTest test 6: Expected: 11800 Got: 12662
We sent 1000 integers = 27260msecs, slave received 983 target is: 150
Time to send 1000 integers = 27260msecs, target was 6400
RadiogramMasterTest test 7: Expected: 6400 Got: 27260
in Isolate:
Time to send 20000 integers = 28349msecs, target was 19200
RadiostreamMasterTest test 1: Expected: 19200 Got: 28349
Time to send 500 integers = 17676msecs, target was 15350
RadiogramMasterTest test 6: Expected: 15350 Got: 17676
We sent 1000 integers = 31968msecs, slave received 985 target is: 150
Time to send 1000 integers = 31968msecs, target was 11700
RadiogramMasterTest test 7: Expected: 11700 Got: 31968
Right now I'm focused on adding enough delays to make OTA over multiple hops work reliably. Later we can try to minimize those delays to improve throughput.
Which test suites did you run to verify the changes?
802-15-4-systemtests & MultihopBaseTests
and deploy over 5 hops
How big was the suite? The KSN work indicates that suite size is a major factor in reliability.
I used the BounceDemo, which is 25652 bytes.
Have you tried deploying the library suite?
not yet
Sorry for all the questions but I find these exciting developments :-)
Me too. The more I examine matters, the more amazed I am that the previous radio stack worked at all...
Cheers,
--John
p.s. I forgot to mention in my previous message that before I made any changes I could not reliably do OTA or deploy an app over more than 2 hops.
==== Original message
From: Ron Goldman <Ron....@sun.com>
Date: 03 March 2009
Time: 1:07:21 AM
Subject: radio update
----
I've been looking at the radio stack and trying to make it more reliable for OTA commands. So far I've decreased some timeouts (MAC-layer retry if no ack, radiostream retry if no end-to-end ack), added some delays (between lowpan packets) & improved the error handling in the OTA code. These changes have drastically cut down the number of radio receive buffer (FIFO) overruns, channel access failures (i.e. starting to send, but detecting that another transmission is in progress), collisions, etc., and the code now handles the errors more robustly.
The result is that I can now reliably get information about a SPOT via OTA over 6 hops and deploy over 5 hops (6 hops is about 50% successful). This is using the MAC-layer filtering to create a chain with the SPOTs only seeing messages from their "neighbor" (packets from other SPOTs are filtered). So for my test setup the number of channel access failures & collisions is probably much greater than if the SPOTs were spread out and could not each hear all the other SPOTs.
Hopefully this will be added to a new developer release in the next week or so.
-- Ron --





