| Subject: | Re: Radio update, update | |
|---|---|---|
| From: | Ron Goldman (Ron....@sun.com) | |
| Date: | Mar 11, 2009 6:13:21 pm | |
| List: | net.java.dev.spots.dev | |
Markus,
Thanks for your comments.
-- Ron --
On Mar 11, 2009, at 6:24 AM, Markus Bestehorn wrote:
Just some general thoughts: I think that tuning timeouts will not fix unreliability completely in this case. Just set the timeouts "high enough" and you should get 100% reliability, if everything else is correct. Yes, it won't be fast, but this is less important than a working solution.
Plus, also set the number of retries "high enough"...
.. though I got lots of failures at 5 hops due to route timeouts.
And how was your success rate? How often did you try?
I have not done much testing, just enough to see if my changes improved reliability. For basic OTA commands or deploying small applications like the BounceDemo the success rate is very high, maybe even close to 100%.
We ran a test last week. Are your numbers close to ours?
Deploying the BounceDemo takes about 76 seconds over 6 hops. It worked 3 of 3 times.
Deploying the library (500k) over 5 hops works & takes 16-17 minutes. I can deploy it over 6 hops in about 20 minutes, but there is some java.lang.OutOfMemoryError bug at the end that I need to track down.
The real improvements in reliability were due to changes to the LowPan & Radiostream code. At the LowPan level I added delays after transmitting each packet so that the transmitting thread (A) would not send another packet until the first had been forwarded out of radio range: [...]
That is a good idea, but I think an even better idea would be letting the routing layer decide this, since the routing layer knows everything needed to calculate a good retransmitting time. In order to achieve this, the routing engine has to be a real layer (more on this topic later).
For us LowPan is part of the routing layer. LowPan takes care of forwarding packets, fragmentation, reassembly, etc. LowPan calls our AODV implementation to discover the route when sending/forwarding packets.
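The per-packet delay described above for LowPan can be illustrated with a small pacing helper. This is only a sketch under stated assumptions, not the actual LowPan code: the class name `PacedSender` and the parameters `perHopForwardMillis` and `hopsToClear` are hypothetical, and the gap is simply assumed to be the per-hop forwarding time multiplied by the number of hops a packet needs to travel before it is out of radio range.

```java
/**
 * Hypothetical sketch: space out transmissions so that the previous packet
 * has had time to be forwarded out of radio range before the next one is sent.
 */
final class PacedSender {
    private final long perHopForwardMillis; // assumed time one node needs to forward a packet
    private final int hopsToClear;          // assumed hops until the packet is out of radio range
    private long nextSendAtMillis = 0;      // earliest time the next packet may go on the air

    PacedSender(long perHopForwardMillis, int hopsToClear) {
        this.perHopForwardMillis = perHopForwardMillis;
        this.hopsToClear = hopsToClear;
    }

    /** Gap kept free after every transmission. */
    long gapMillis() {
        return perHopForwardMillis * hopsToClear;
    }

    /**
     * Returns how long the caller should wait at time nowMillis before sending,
     * and books the gap that follows that transmission.
     */
    long reserveSlot(long nowMillis) {
        long wait = Math.max(0L, nextSendAtMillis - nowMillis);
        nextSendAtMillis = nowMillis + wait + gapMillis();
        return wait;
    }
}
```

A transmitting thread would sleep for `reserveSlot(System.currentTimeMillis())` milliseconds before handing each packet to the radio. Taking the clock as a parameter keeps the helper easy to test; which layer should own the delay is the open question in this thread.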
Radiostream
-----------
[..]
Several times when trying to flash the library over 5 hops I was able to successfully transmit all of the bits, but the final end-to-end ACK got lost and the operation failed. This is a basic failing of the current Radiostream design. One fix would be to have the sender transmit an ACK when it receives the end-to-end ACK.
In my opinion the routing system has to provide the end-to-end ack capability because it guarantees reliable end-to-end communication. The Radiostream is "only" a high level protocol, which should not care about how many hops are used for transmission or how the routing protocol works; we just don't care.
Really? My understanding is that the Network layer, which provides routing, takes care of sending/forwarding packets, but does not guarantee delivery. It is the higher Transport layer's responsibility to implement protocols (TCP, Radiostream) to guarantee reliable end-to-end communication.
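The fix suggested earlier for the lost final ACK (the sender acknowledges the end-to-end ACK, so the receiver knows whether to resend it) can be simulated in a few lines. This is a sketch under assumptions, not Radiostream code: the class `FinalAckHandshake`, the `lossPattern` array and the `maxAttempts` limit are all hypothetical, and the link is reduced to a list of "this message is lost" flags.

```java
/**
 * Hypothetical simulation of a closing handshake at the transport level:
 * the receiver repeats its end-to-end ACK until the sender confirms it.
 */
final class FinalAckHandshake {
    /**
     * lossPattern[i] == true means the i-th message put on the air is lost;
     * messages beyond the end of the pattern are delivered.
     * Returns the attempt (1-based) on which the receiver got the sender's
     * confirmation, or -1 if maxAttempts were used up.
     */
    static int attemptsUntilConfirmed(boolean[] lossPattern, int maxAttempts) {
        int sent = 0; // number of messages transmitted so far
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            // Receiver (re)sends the end-to-end ACK.
            boolean ackLost = sent < lossPattern.length && lossPattern[sent];
            sent++;
            if (ackLost) {
                continue; // sender heard nothing; receiver times out and resends
            }
            // Sender received the ACK and answers with its own confirmation.
            boolean confirmLost = sent < lossPattern.length && lossPattern[sent];
            sent++;
            if (!confirmLost) {
                return attempt;
            }
            // Confirmation lost: receiver cannot tell, so it resends the ACK.
        }
        return -1;
    }
}
```

With a single lost ACK, `attemptsUntilConfirmed(new boolean[] {true}, 3)` succeeds on the second attempt instead of failing the whole transfer. The confirmation can itself be lost, so the retries have to be bounded; the handshake makes failure less likely rather than impossible.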
IMHO this is where the whole problem starts: the existing layer design causes these problems! Let me explain that: correct me if I'm wrong but in the current SDK stack implementation the routing "module" is no real layer. Both the LowPan Layer and the Routing module are responsible for the end-to-end transfer (and it gets even more complicated since the LowPan Layer is also responsible for transferring data larger than an 802.15.4 RadioPacket node-to-node). Separating the concerns will lead to a much cleaner design and to a more reliable solution because the routing layer controls the whole packet flow.
Please take a look at the network stack diagrams at https://spots-network-library.dev.java.net/stack-overview.html. The LowPan layer (which includes the AODV route discovery mechanism) is only responsible for sending packets on to the next hop. It has no concern with end-to-end transfers, nor with reliable transfers; that is the Transport layer's problem/responsibility.
Our intent is to have a radio stack that is easy to experiment with. If the current interfaces are not adequate (which is almost certainly the case), please let us know how they need to be extended. For example if someone is implementing a Route Discovery Protocol and needs changes to the IRoutingManager Interface, e.g. to report alternative routes instead of just one route, then we would want to work with them to add that capability to the IRoutingManager Interface & to have LowPan make use of the extra information.
From our point of view, the OTA Server is just an application that runs on top of the stack. When this application sends something, the call for sending data to the target node must be blocking. Timeouts at the application layer are something that should be avoided and usually they point at some design flaw. It is the task of the routing layer to find and maintain routes. If no such route can be found at some point, the blocking call must return.
The problem with the timeout is that no matter how the timeout is set at this level, at some point a scenario will occur where this timeout is inappropriate.
Yes indeed. The timeouts & retries are all taking place in the Transport layer (TCP, Radiostream).
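To make the discussion of timeouts and retries concrete, here is a minimal stop-and-wait sketch of a transport-level send that blocks until a packet is acknowledged or the retries run out. It is an illustration only: `AckingLink` and `RetryingSender` are hypothetical names, not part of the SPOT SDK, and the link is assumed to report whether an ACK arrived within the given timeout.

```java
/** Hypothetical link abstraction: returns true if an ACK arrived within the timeout. */
interface AckingLink {
    boolean sendAndAwaitAck(byte[] packet, long timeoutMillis);
}

/** Hypothetical transport-level sender with a fixed timeout and bounded retries. */
final class RetryingSender {
    private final AckingLink link;
    private final int maxRetries;     // retransmissions allowed after the first attempt
    private final long timeoutMillis; // how long to wait for each ACK

    RetryingSender(AckingLink link, int maxRetries, long timeoutMillis) {
        this.link = link;
        this.maxRetries = maxRetries;
        this.timeoutMillis = timeoutMillis;
    }

    /**
     * Blocks until the packet is acknowledged.
     * Returns the number of attempts used, or -1 once every attempt has timed out.
     */
    int send(byte[] packet) {
        for (int attempt = 1; attempt <= maxRetries + 1; attempt++) {
            if (link.sendAndAwaitAck(packet, timeoutMillis)) {
                return attempt;
            }
        }
        return -1;
    }
}
```

Setting both values "high enough" trades speed for reliability, as noted at the start of this message: the worst-case blocking time of `send` is `(maxRetries + 1) * timeoutMillis`.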
timings:
1 hop flashlibrary: 2:50
2 hop flashlibrary: 6:40
[java] CRC errors = 14
[java] Overflow errors = 45
[java] ChannelAccessFailures = 677
[java] No ACKs = 92
[java] Partial GC's = 143
[java] Full GC's = 3
[...]
What do I get from these numbers? You tried flashing the lib over 1, 2, 3 and 4 hops successfully once and tried 5 hops twice, with the first try failing and the second one succeeding?
Actually I successfully flashed the library several times for each of 1-5 hops and the values given were indicative of what I saw. The only failures I saw were at 5 hops and so I gave representative numbers for success & failure.
If you have questions, we can also set up a phone conference (maybe also with Bart).
Email seems to be working so far.





