UDP connection loss after migration from EHS6 to ELS61 | Telit Cinterion IoT Developer Community
October 13, 2021 - 7:43pm, 4409 views
Hello again,
We recently migrated our data logging Java app from the EHSx to the ELS61 platform. Not many changes were needed; we can still build the app from the same codebase, just switching the WTK.
We use UDP for data transfer from module to server and have been using it successfully for several years without issues on the EHSx platform (and even on TC65i). But now, after moving to ELS61, we are seeing connection failures every few hours. On the previous platforms this was so rare that we simply let the watchdog kick in and reset the device one hour after the last data exchange.
On ELS61, this happens several times daily, and such frequent restarts look bad on the stability statistics charts.
Most of the time, our app is waiting in DatagramConnection.receive method. Every few hours, there are no received datagrams for 5 minutes (should come at least every minute) and then the method throws an InterruptedIOException. On retry, the exception is thrown instantly.
So, is there a setting, maybe a timer, that would not get triggered by UDP-only data transfer? We are seeing this behavior on 50+ devices, so it's not just one piece gone rogue.
The next issue is trying to work around it by resetting the network on InterruptedIOException without triggering the watchdog.
We tried AT+COPS=2, then AT+COPS=0, but more often than not, the module freezes at AT+COPS=2. As we don't use ATCommandListener yet, that essentially freezes our comm thread, and again, we wait for the watchdog.
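One way to keep a hanging command from freezing the comm thread, without switching to ATCommandListener, is to run the blocking call on a helper thread and stop waiting after a deadline. The following is only a rough sketch under assumptions: `AtChannel` and `TimedAt` are hypothetical names, not Cinterion classes, and `AtChannel.send` stands in for whatever blocking call the app already uses. It sticks to plain threads and wait/notify so it should also fit a CLDC environment.

```java
/** Hypothetical wrapper around the app's existing blocking AT call (not a Cinterion API). */
interface AtChannel {
    String send(String command) throws Exception;
}

/** Runs one blocking AT command on a helper thread; use a fresh instance per command. */
final class TimedAt {
    private final Object lock = new Object();
    private String reply;
    private boolean done;

    /** Returns the reply, or null if none arrived within timeoutMs. */
    String send(final AtChannel channel, final String command, long timeoutMs) {
        Thread worker = new Thread(new Runnable() {
            public void run() {
                String r;
                try {
                    r = channel.send(command);
                } catch (Exception e) {
                    r = "ERROR: " + e; // report a failed call as an error reply
                }
                synchronized (lock) {
                    reply = r;
                    done = true;
                    lock.notifyAll();
                }
            }
        });
        worker.start();
        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (lock) {
            while (!done) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    return null; // the worker may still be stuck in the command
                }
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    return null; // treat an interrupted wait like a timeout
                }
            }
            return reply;
        }
    }
}
```

A `null` result only means the caller stopped waiting. The command itself may still be blocked inside the module, so the channel it was sent on should probably not be reused until it returns.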
We tried AT+CFUN=4 followed by AT+CFUN=1; that put the module permanently in Airplane mode, and even a restart didn't help, as if AT+CFUN=1 was never executed. We don't want to go that way. If this happens in the field, we're doomed. Watchdog is no help here...
Next is to try only with AT+CGACT=0/AT+CGACT=1, maybe that will be enough...
So, what say you. What is different in ELS61 UDP or what else can I try to make it just work all the time?
Thank you in advance for any comments or suggestions!
Jure
Hello,
Please check the firmware version with the ATI1 command. Which WTK version do you use?
Do you have the stack trace for this InterruptedIOException?
I see that you assume that this problem has something to do with the network connection. Is the module still registered to the network when this happens? Are there any other network activities on the module? Have you tried some other operator?
To monitor the bearer from the MIDlet perspective you could use com.cinterion.io.BearerControl class and setup a listener to monitor the data bearer state. It would be interesting to see if the state changes when the datagrams are no longer received.
What do you mean by "On retry, the exception is thrown instantly." - is it thrown again on receive() method call immediately or after some time or when you try to call Connector.open() again?
After leaving the airplane mode you should send AT+COPS=0 to register.
Best regards,
Bartłomiej
Hello,
thanks for the questions and suggestions. So...
Found this in WTK folder:
It says IMP-NG Wireless toolkit by Cinterion
I implemented BearerControlListener and found this situation:
But in this case, a simple Connection close and reopen helped. Quite a few times it doesn't. The orange LED blinks normally every ~4 seconds even though the module is effectively not connected. SMS transmission is blocked as well. AT+COPS=2 sometimes works, but other times it blocks indefinitely (I waited 30 min). After a watchdog-triggered restart, everything is OK instantly.
I'm testing with two sim cards from different providers (one local, one in roaming), same behaviour.
If I immediately retry the Connection.receive(datagram) after the InterruptedIOException, the exception is immediately thrown again indicating it's not a one time interruption.
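Since an immediate repeat of the exception suggests the link will not recover on its own, one option is to escalate step by step instead of retrying receive() in a loop. Below is a rough sketch of such a ladder, assuming the order "reopen the connection, toggle AT+CGACT, re-register with AT+COPS, then leave it to the watchdog". `RecoveryLadder` and its constants are hypothetical names, and the order is just one plausible choice based on the steps mentioned in this thread. AT+CFUN=4 is left out on purpose because of the airplane-mode problem described above.

```java
/** Hypothetical helper: picks the next recovery step after each receive timeout. */
final class RecoveryLadder {
    static final int REOPEN_CONNECTION = 0; // close and reopen the DatagramConnection
    static final int TOGGLE_PDP = 1;        // AT+CGACT=0, then AT+CGACT=1
    static final int REREGISTER = 2;        // AT+COPS=2, then AT+COPS=0
    static final int WATCHDOG_RESET = 3;    // stop feeding the watchdog

    private int failures;

    /** Call after each InterruptedIOException; returns the step to try next. */
    int onTimeout() {
        int step = failures < WATCHDOG_RESET ? failures : WATCHDOG_RESET;
        failures++;
        return step;
    }

    /** Call whenever a datagram actually arrives, so the ladder starts over. */
    void onDatagramReceived() {
        failures = 0;
    }
}
```

The point of the design is that a cheap step is always tried before an expensive one, and that the watchdog reset stays available as the last resort.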
Sorry if it's too much info and too many questions at once, but I've done so many tests already and still can't figure out the situation... The module stops communicating on all channels (data, SMS) after a few hours, without any clear indication of problems (all I get is my application's timeout signal), and a watchdog-triggered reboot solves it every time. If it was once every few days, I'd take it, but...
Any clues or pointers?
Regards,
Jure
Hello,
Would it be possible to check ATI1 reply instead of ATI? As for the WTK please also paste the installCD package file name.
As for the log you pasted it looks like the bearer is closed after datagram sending even though the receiver is active. This should not happen. Could you also check the connection strings you use for opening both connections? Please try to set timeout=0.
Does your application check the network registration status or signal quality (AT+COPS?, AT+CREG?, AT^SMONI etc.)?
How many ATCommand instances are used by the app?
You wrote that SMS also does not work - is the app sending the messages - what is the reply on SMS sending? Or the module only receives and the messages do not come?
BR,
Bartłomiej
Hello,
here the ATI1 response:
WTK installed from els61-e_rev02.000_arn01.000.01_install-cd.zip
About the timeout setting in AT^SJNET command, the docs say not to use a value under 10s. I've had it set to 3600. What do I base this number on? I'll try with 0...
Our app continuously monitors connection state with AT+CREG? inside the communication loop, also AT+COPS? and AT^SMONI are used to gather current network data... No issues with those commands.
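For reference, a minimal sketch of how the AT+CREG? reply could be evaluated inside such a loop. It assumes the standard reply layout `+CREG: <n>,<stat>[,...]`, where stat 1 means registered on the home network and 5 means registered while roaming. `CregParser` is a hypothetical helper name, and the code avoids String.split so it should also work on CLDC.

```java
/** Hypothetical helper for reading the <stat> field of an AT+CREG? reply. */
final class CregParser {
    /** Returns <stat> from a "+CREG: <n>,<stat>..." reply, or -1 if it cannot be parsed. */
    static int stat(String response) {
        if (response == null) {
            return -1;
        }
        int tag = response.indexOf("+CREG:");
        if (tag < 0) {
            return -1;
        }
        int comma = response.indexOf(',', tag);
        if (comma < 0) {
            return -1;
        }
        int start = comma + 1;
        while (start < response.length() && response.charAt(start) == ' ') {
            start++; // tolerate a space after the comma
        }
        int end = start;
        while (end < response.length() && Character.isDigit(response.charAt(end))) {
            end++;
        }
        if (end == start) {
            return -1; // no digits where <stat> was expected
        }
        return Integer.parseInt(response.substring(start, end));
    }

    /** True for stat 1 (home network) or 5 (roaming). */
    static boolean isRegistered(String response) {
        int s = stat(response);
        return s == 1 || s == 5;
    }
}
```

Logging the raw stat value next to the time of the last received datagram makes it easier to see whether registration is lost before or after the data stops.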
Currently, 4 ATCommand instances are used in our app (instantiated on startup, then kept alive until app termination).
In the period when data transfer is not working, SMS are neither received by the module nor sent. When a module was in this state, I tried sending an SMS from terminal and got ERROR on AT+cmgs after Ctrl-Z. Also issuing AT+COPS=2 or AT+CFUN=4 blocks the terminal (tried over telnet, so had to reboot the module afterwards to get new terminal access). But AT+CFUN=4 is executed, because after reboot, the terminal is in airplane mode.
Regards,
Jure
Hello,
From the log it looks like the bearer is closed just after the sender thread sends the datagram. I thought that maybe the timeout is very short and sender thread closes the connection which causes the bearer closing. It should not happen when the receiving thread still works.
And it seems now that it can't be the case if the timeout is 3600. And the timeout 0 basically means no timeout.
It looks like you are using the WTK released for arn01.000.01 while on the module there is a newer A-REVISION 01.000.05. But I'm not sure if it could be a problem here as in general there should be a backward compatibility. Anyway you could try to use the WTK for A-REVISION 01.000.05.
You might also test a UDP-only scenario to check whether this is strictly related to UDP or not.
As for the hanging AT command I'm only guessing but maybe it is somehow related to the fact that there are other AT commands intensively executed via other ATCommand instances.
BTW, as you wrote about AT commands over Telnet, it seems that you use a LAN terminal. This device automatically establishes a WAN connection for devices connected over LAN. Is it also used for the WAN connection? If not, you can deactivate it (via the web GUI, for instance).
BR,
Bartłomiej
Hi,
do you have the URL for the WTK install CDs handy? I thought I was using the latest one...
I don't believe other ATCommand instances are doing anything heavy. Once every 30secs...
I'll experiment with different timeouts and try the latest WTK...
WAN connection is disabled in OpenWRT...
I'll keep you posted,
Best regards,
Jure
I've sent you the link via email.
Hello again Bartłomiej,
thanks for the WTK, I've tried it, with no apparent change, but...
I've actually taken a step back (debug code startup growing too thick)... I've stripped down the app a bit and disabled the watchdog. It's been years since I last ran the app without it and I had no clue how it would perform... I've also added some network monitoring to better understand what is going on...
So, what I found is a bit scary:
I've not caught this before, because the watchdog stepped in after 1 hour of data transfer silence.
So now the obvious question: since my app never calls AT+COPS=2 nor AT+CFUN=4, what is happening?
I've only tested this for a limited time at one location, where the signal (LTE) is not very good, but it is stable, and the terminal reconnects to the network instantly after a restart, resuming normal communication. What makes it go to +COPS: 2 mode and lose the network connection?
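To catch this state without waiting for the data timeout, the monitoring loop could look at the mode field of the AT+COPS? reply. A small sketch, assuming the standard layout `+COPS: <mode>[,<format>,<oper>[,<AcT>]]` in which mode 2 means deregistered. `CopsMonitor` is a hypothetical name; what to do when it reports true (for example sending AT+COPS=0, as suggested earlier in the thread) is left to the caller.

```java
/** Hypothetical helper for spotting an unexpected "+COPS: 2" state. */
final class CopsMonitor {
    /** Returns <mode> from a "+COPS: <mode>..." reply, or -1 if it cannot be parsed. */
    static int mode(String response) {
        if (response == null) {
            return -1;
        }
        int tag = response.indexOf("+COPS:");
        if (tag < 0) {
            return -1;
        }
        int start = tag + "+COPS:".length();
        while (start < response.length() && response.charAt(start) == ' ') {
            start++; // skip the space after the colon
        }
        int end = start;
        while (end < response.length() && Character.isDigit(response.charAt(end))) {
            end++;
        }
        if (end == start) {
            return -1; // no digits where <mode> was expected
        }
        return Integer.parseInt(response.substring(start, end));
    }

    /** True when the module reports mode 2, i.e. it is deregistered from the network. */
    static boolean isDeregistered(String response) {
        return mode(response) == 2;
    }
}
```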
I really need some pointers here, please!
Regards,
Jure
Hello,
Could you paste an example SMONI output? Is the module configured to 4G only? Is 3G network available in the area?
SMONI may only show CONN when there is an active data transfer ongoing, so NOCONN does not necessarily mean that something is wrong if there is no continuous data upload/download.
But anyway, it now looks like a network-related issue if the module repeatedly loses the network connection in the same scenario. Maybe it's due to poor signal quality (the module should register to 3G or 2G if available), interference, or some changes in the network introduced by the operator, for instance related to 5G development.
Please try the following:
- test the same in 3G (EHSx is 3G and you don't experience such problems)
- test in different location
- check some other operator
Best regards,
Bartłomiej
Hello,
thanks for persisting...
This is a SMONI log on startup...
Then after a couple of hours,
at 23:03, the last message was received, and the SMONI output remained constant until a manual reset of the modem. SMS comms were down as well.
I noticed no 4G vs 2G changes in the log (except one on startup).
I'm testing with AT^SXRAT=0 now. This limits the connection to 2G. How to limit it to 3G?
I have seen this situation with SIMs from two different operators (different country) and several different locations...
Any more hints?
Best regards,
Jure