Sunday, July 14, 2013

Responses were getting delayed and taking more than 50 ms!

One of the critical issues reported after vehicle production started was a delay in the responses to diagnostics requests.

As per the requirement (and the standard), the response to any diagnostics request should arrive within 50 ms. At the end of assembly in production, tests are performed using scripts. These are called EOL (End of Line) tests, and the result was NG (No Go or Not Good). Because the EOL tests failed, the issue became critical.

When we analyzed the issue, we used the same script that was used for the EOL tests. That script sent DIAG requests continuously, one every 60 ms, and our software module was expected to respond to each request within 50 ms. Whenever the response was delayed, the script showed an error. Sometimes there was no response at all!

We were not able to reproduce the issue with the GGDS tester simulation that came with CANoe, so we had to depend on the customer's script alone. We spent 3-4 days analyzing the issue and trying to understand the pattern of its reproducibility. Finally, we decided to use an oscilloscope to find out where and when the delay started. Using debug ports, we instrumented the code so that we could see the delay in the transmission of the CAN signal (Tx) on the oscilloscope and in CANoe itself. We then enabled the periodic messages one by one and found that one of the CAN signal handlers was taking more than 50 ms. Because it belonged to a periodic message, the delay kept accumulating. Normally a CAN signal handler should not take more than 1-2 ms, because the Vector task should get the CPU every 2 ms (its schedule time). When we looked into the code of the signal handler callback to find out which line was taking the time, we found an EEPROM write operation that took more than 50 ms.
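The debug port trick can be sketched in a few lines of C. This is only an illustration that runs on a PC: the clock, the pin function, the handler names and the 55 ms EEPROM cost are all assumptions made up for this example, not our project code. On the real ECU the pin would be a GPIO and the pulse width would be read from the oscilloscope.

```c
#include <stdint.h>

/* Hypothetical stand-ins for hardware. On a real ECU, debug_pin_set() would
   write a GPIO register and the pulse would be measured on the oscilloscope. */
static uint32_t fake_time_ms;    /* simulated clock */
static uint32_t pulse_start_ms;  /* time of the last rising edge */
static uint32_t last_pulse_ms;   /* width of the last completed pulse */

void debug_pin_set(int level)
{
    if (level) {
        pulse_start_ms = fake_time_ms;                  /* rising edge */
    } else {
        last_pulse_ms = fake_time_ms - pulse_start_ms;  /* falling edge */
    }
}

/* Assumed execution costs, for illustration only. */
void fake_eeprom_write(void)   { fake_time_ms += 55; }
void signal_handler_fast(void) { fake_time_ms += 1; }
void signal_handler_slow(void) { fake_time_ms += 1; fake_eeprom_write(); }

typedef void (*handler_fn)(void);

/* Raise the pin, run the handler, drop the pin, and report the pulse width. */
uint32_t measure_handler_ms(handler_fn handler)
{
    debug_pin_set(1);
    handler();
    debug_pin_set(0);
    return last_pulse_ms;
}
```

Wrapping each suspect handler this way, one at a time, makes the slow one stand out as a long pulse instead of a 1-2 ms blip.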

To fix this issue, we later moved the EEPROM operation to the ACC OFF handler.
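The idea of the fix, roughly, is that the signal handler only remembers the latest value in RAM and the slow write happens once at ACC OFF. The sketch below shows that pattern under assumptions: every name in it (`on_can_signal`, `on_acc_off`, `eeprom_write_blocking`) is hypothetical, and the EEPROM is faked with a variable so the example runs on a PC.

```c
#include <stdbool.h>
#include <stdint.h>

/* Fake EEPROM, for illustration only. A real driver call would block for
   tens of milliseconds. */
static uint8_t  fake_eeprom_cell;
static unsigned eeprom_write_count;

void eeprom_write_blocking(uint8_t value)
{
    fake_eeprom_cell = value;
    eeprom_write_count++;
}

/* Latest value received from the periodic CAN signal, kept in RAM. */
static uint8_t pending_value;
static bool    pending_dirty;

/* Signal handler: only caches the value, so it returns almost immediately. */
void on_can_signal(uint8_t value)
{
    pending_value = value;
    pending_dirty = true;
}

/* ACC OFF handler: the only place where the slow write is performed.
   On a real ECU, access to the pending_* variables may need a critical
   section if the signal handler can preempt this function. */
void on_acc_off(void)
{
    if (pending_dirty) {
        eeprom_write_blocking(pending_value);
        pending_dirty = false;
    }
}

/* Accessors so the behaviour can be checked from outside. */
unsigned get_eeprom_write_count(void) { return eeprom_write_count; }
uint8_t  get_eeprom_cell(void)        { return fake_eeprom_cell; }
```

However many times the periodic message arrives, only one write is done, and it happens at a point where nothing is waiting on a 50 ms deadline.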

This issue was not observed when there was no CAN traffic. It appeared only because of the periodic CAN messages on the network. This led us to suspect that one of the signal handlers might be taking too much time.

Because of all the above, the DIAG task was not getting the CPU in time to respond to the requests, and the delay was accumulating.
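A very simplified model shows why the failures came and went instead of hitting every request. It assumes the handler runs to completion at the start of every message period and that a DIAG request arriving during that window has to wait for it. The 100 ms message period and 1 ms DIAG processing time used in the example are assumptions for illustration. Only the 60 ms request period and the 50 ms deadline come from the actual setup.

```c
#include <stdint.h>

/* Count diagnostic responses that miss the deadline in a simplified
   run-to-completion model. All times are in milliseconds.
   Assumes handler_ms < msg_period_ms, so busy windows never overlap. */
unsigned count_late_responses(uint32_t handler_ms, uint32_t msg_period_ms,
                              uint32_t req_period_ms, uint32_t diag_ms,
                              uint32_t deadline_ms, unsigned num_requests)
{
    unsigned late = 0;

    for (unsigned j = 0; j < num_requests; j++) {
        uint32_t arrival = j * req_period_ms;

        /* Start of the handler's busy window in the current message period. */
        uint32_t window_start = (arrival / msg_period_ms) * msg_period_ms;
        uint32_t window_end   = window_start + handler_ms;

        /* If the handler is still running, the DIAG task has to wait. */
        uint32_t start = (arrival < window_end) ? window_end : arrival;

        uint32_t latency = start + diag_ms - arrival;
        if (latency > deadline_ms) {
            late++;
        }
    }
    return late;
}
```

With a 55 ms handler, `count_late_responses(55, 100, 60, 1, 50, 10)` reports 2 late responses out of 10, namely the requests that arrive just as the handler starts. With a 1 ms handler the same call reports none.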

Ideally, CAN signal handlers should execute a minimum number of instructions so that Vector's tasks are not delayed or blocked on anything. The schedule time of the CAN task is 2-3 ms, so it should get the CPU every 2-3 ms. Any operation that takes longer than the CAN task's schedule time will cause issues.

This was one of the good analyses we did, and it can help us root-cause other issues where response delay is the main symptom.