ODK Collect crashing on Tecno SAS6 after only one or two interviews

Q1, 2, 3: Problem/ versions / tried-so-far:

I'm supporting a field test in Kenya running ODK Collect v.1.24.1 and Android 7.0 on TECNO SAS6 hardware with 8GB of storage and 1GB of memory onboard. The data is being stored on https://kobo.humanitarianresponse.info/#/forms. Most of the interviewer phones do not have SIM cards, and are meant to upload data via hotspot 2-3 times per day. We hoped that each would do ~20-40 completed interviews per day, although the pilot is meant partly to sort whether that's realistic.

We have had numerous phones crash while running ODK Collect. I haven't been present when it happens, but I think today I can probably get my hands on a phone that has recently crashed. If there is something helpful I can pull out of the phone to help narrow down the cause, I'll be grateful to learn how to do that.

The project has an extended set of IT support staff, but at the moment we don't have a cohesive approach for this problem and different staff are trying different approaches and solutions and it's not obvious to me that any are effective. Some report that even after resetting the phone to factory settings and re-installing ODK and the forms, it crashes again quite soon.

Some of the IT team believe the problem might be related to the phones needing a Google update. Others believe the problem may have to do with the list of apps that we disabled or deleted to make the phones "less fun" for the field data collectors. I have a list of apps that were disabled/deleted. I've pasted a link to files that describe that below.

I read online today and understand that the problem can be due to using up all the RAM - either by storing too many completed forms or by using logic that is too complicated and not modular. Some of the crashes happened even during training before they had many forms at all...I don't think it's a problem (yet) of too many stored forms. I've pasted a dropbox link to a folder holding the forms. Is there a straightforward or objective tool or way to analyze whether the logic is too complicated? I'm sure that if we had realized it was important, we could have made the calculations and relevance statements more succinct or modular, but at this time that is not an area where we can easily experiment as we have the forms in the field on 200+ phones. We CAN replace the forms on all phones if needed, but I would prefer not to do that more than once and I'd like to have a high degree of confidence that it will work before assembling the crew to do that.

Today is our first day in the field and I'll know in a few hours how many successful interviews we had and I'll have an informal report of the proportion of phones with a problem, but based on training yesterday and the day before it was on the order of 10% or more of the phones.

I've instituted a procedure to track which phones fail, what we try, and whether they fail again, but we lack good ideas other than running Google updates.

4. What steps can we take to reproduce the problem?

I don't know what to say here. I've been entering data on my project phone this morning and I think it's safe to say I've entered more data than anyone could have during the training...and I haven't crashed mine yet. I'll keep trying.

5. Anything else we should know or have? If you have a test form or screenshots or logs, attach below.

How do I access a log to share that?

Here's a link to a dropbox folder that holds two sub-folders:

"Apps removed" - holds two excel files...showing how the apps were dropped for phones with and without SIM cards.

"Forms" holds five .xls forms. I have permission to share the forms. The signin forms are almost trivial. The child_vaccination forms are the main point of this survey and are somewhat complex. The missed_vx_follow_up is also somewhat complex; we won't try that one in the field until late in the week, so if we should simplify the logic there, we probably have time. I believe that the failures are happening mostly in: child_vaccination_VOL_tool_v12.xls

Arrgh. I see that I inadvertently posted this in community instead of support. Apologies for that oversight. And thank you in advance for any input!

Hi @Dalerhoda

I'm sorry to hear that you are encountering crashes. We have done a lot to make the app crash-free over the last 1-2 years, and v1.24.1 in particular seems to be historically the most reliable release according to our reports.
In those reports we have info about devices, Android versions, etc., and I can filter them. Unfortunately I can't find Tecno SAS6. If the app crashed on that device and the device had an internet connection (when the app crashed or later, it doesn't matter), a report should have been sent.
You said that you have many devices. Are they all the same (Tecno SAS6)?
Finding reports from your devices would be the easiest way to help you because I wouldn't need to ask you when it happens, what you did, etc. When I google it (Tecno SAS6) I don't even get many results (this topic is the second one). Maybe it has a different name?

Thank you, @Grzesiek2010.

I made a mistake: It is the Tecno SA6S (not SAS6).

And I should have mentioned that our data collectors' phones do not have SIM cards, so they are offline until they meet up with a supervisor and upload forms via hotspot.

I am now holding a stack of failed phones and the word 'crash' is not quite right...'hang' would be a better description. I'm not sure what happened in the moment before it entered the 'hang' state, but here is what I see:

  1. The screen is completely black, with only the Android status bar at top. (Screenshot 1 in the dropbox linked below.)

  2. I press the square 'overview' button at bottom left several times and the phone comes back to life...I can not proceed in the Collect App, but I can close it and start again. If I do that, and load my simple sign-in form, I can proceed as normal. If I try to 'fill blank form' and select the more complicated form, then it says 'Loading Form'. And then the screen goes BLACK again.

Note that many phones ran the complicated form over-and-over again without difficulty. The problem appeared in about 25 phones out of the 300 we had in the field yesterday.

  3. I press the overview button several more times and get a dialog box: 'ODK Collect isn't responding: Close App or Wait'. If I wait, nothing happens. If I close App, I can go back to the loop above, but cannot proceed to fill the main 'child_vaccination_VOL_tool' form.

  4. Some of these phones show a message on the title page saying 'system wants to do a Google security patch' and a second message about a Google Play Services 'account action required'. We've been pulling these phones back to the Ops Center, doing the updates, and when they don't ask for updates, we've been un/re-installing the Collect app and the forms. I don't have good data yet concerning whether the fixes are working. The IT guys have the impression that phones that fail and are fixed have been failing again, but we didn't start to keep careful track of this until this morning.

Because we are not crashing, per se, we are probably not sending helpful log files to Google Play. Is there a way for me to pull out a log file that might indicate why the program is 'not responding'. Why it hangs while loading the form?

Thank you again for any pointers,
-Dale

OK, so it seems like it's not a crash but an ANR (Application Not Responding) error.
Fortunately we collect data about ANRs as well, and there I can find some Tecno SA6S devices.

To be sure the reports come from your devices please answer my questions:

  1. You said the problem occurred in 25/300 phones, are all of them the same Tecno SA6S?
  2. Do they send finalized forms directly from those devices or maybe you pull them using ODK Briefcase? I'm asking because I need to know if those devices are connected to the internet sometimes and they have a chance to send those reports at all.

  1. Yes...all Tecno SA6S...our training began on Nov 13 and most of the problems came on Nov 14 and 15.
  2. They send the forms directly from the phones when connected to a hot-spot once or twice a day. If they had been online, you would see probably three dozen ANR reports from our team over the past week.

Our IT team added updates and reset the problem phones, and since Saturday we have had only 2-7 phones 'freeze' per day. We were very worried when we lost almost 10% on the first day, but since then the problem has been manageable.

When a phone fails, we swap in a spare in the field and bring the ANR phone back to HQ where we exit the app and upload the forms collected before the problem. Our team installs updates and resets to factory settings and re-installs ODK & the forms and puts the phone back in the pool of spare phones.

So our problem is not urgent for this field exercise, but if you have any insight from the ANR reports, we will appreciate hearing what you learn. Thank you!

Really glad to hear that the issue is no longer critical, @Dalerhoda. Still, we should figure out what's going on. Thanks for sharing your forms and we'll let you know when we know more.

I spent some time analyzing the problem you have run into. I tried to reproduce the issue but to no avail, just as I expected given that, as you said, it has appeared in <10% of your devices.
However, I think I know what the cause is... My general conclusion is that your forms might be too complex for the devices you have been using. More details below:

  1. Tecno SA6S is a budget device with just 1GB of RAM, which is very little given that it runs Android 7 (for example, some devices I have with Android 4 or 5 have more).
  2. Your form uses pretty complex calculations in the calculation and relevant columns (maybe not complex in terms of difficulty, but they are long, which makes them complex).

So using forms with complex calculations on a not very powerful device might lead to such problems.

I can recommend:

  • please review your calculations and try to simplify them; you can split them into a few smaller calculations. Here we had a similar problem with a complex form and such a trick helped.
  • you can periodically reboot your devices, for example every morning, to free up some resources
  • you can ask your interviewers not to use those devices for other purposes, I mean not to play games or install apps that aren't required (for the same reason as above)
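
One way to start on the first recommendation is to find which expressions are the longest. The following is a minimal Python sketch, not an ODK tool. It assumes the survey sheet has already been read into a list of dicts keyed by column name; the sample rows and field names are hypothetical. It ranks the calculation and relevant expressions by length and counts the distinct ${field} references in each.

```python
import re

# Matches XLSForm field references such as ${child_age}.
FIELD_REF = re.compile(r"\$\{([^}]+)\}")


def expression_stats(rows):
    """Return (name, column, length, distinct_refs) tuples, longest expression first."""
    stats = []
    for row in rows:
        for column in ("calculation", "relevant"):
            expr = (row.get(column) or "").strip()
            if expr:
                refs = set(FIELD_REF.findall(expr))
                stats.append((row["name"], column, len(expr), len(refs)))
    return sorted(stats, key=lambda s: s[2], reverse=True)


# Hypothetical rows standing in for the survey sheet of an XLSForm.
rows = [
    {"name": "age_months",
     "calculation": "int((today() - ${dob}) div 30.4)",
     "relevant": ""},
    {"name": "eligible",
     "calculation": "if(${age_months} >= 12 and ${age_months} < 24 and ${consent} = 'yes', 1, 0)",
     "relevant": ""},
    {"name": "dose_date",
     "calculation": "",
     "relevant": "${eligible} = 1"},
]

for name, column, length, n_refs in expression_stats(rows):
    print(f"{name:12} {column:12} chars={length:3} distinct_refs={n_refs}")
```

Length and reference count are only rough proxies for complexity, but the rows at the top of the list are reasonable candidates for splitting into smaller calculations.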

Unfortunately, it's not something that we could easily fix on our side. We have been improving the performance and there is probably still a lot to do, but it's an ongoing process and it will never be perfect.

@Grzesiek2010 and @LN: Thank you very much for your attention. Thanks especially for the time you spent looking at ANR logs and trying to reproduce the problem.

I don't want to sound ungrateful...because I'm very grateful...but I do want to press into this theory that complex calculations could be the problem. If those were the culprit, I would have expected to see a fairly constant volume of failures across our seven days of data collection. Our teams visited over 8,000 homes per day for a week and the complexity of the interviews...the path thru the ODK form...should have been similarly complex across days. That is to say that the teams should have encountered many hundreds of respondents in the target audience, who took the longest possible path thru the interview and encountered the most complex calculations. So it seems odd to me that 10% of the phones would fail on day 1 and then fewer than a dozen per day...and often fewer than 5 per day would fail on the other days if this is a matter of calculation complexity. The ODK form was getting a thorough workout on about 300 phones per day, day after day, and didn't cause consistent problems. If the problem were with the calculations, wouldn't we expect to see widespread problems day after day?

Second, is there helpful guidance somewhere on recommended device specs when planning to do this kind of work? Of course the usual advice is to buy the best hardware you can afford, but is there any advice more useful than that? Overall, the Tecno SA6S devices served us well in this field effort. I would love to be able to plan and say "If we purchase XXX device with YYY specs (RAM, Android version, ODK version, etc), it should comfortably be able to collect data using interview form ZZZ from NNN respondents without needing to reboot or upload the forms." Is there a straightforward way to make such a confident statement to project planners and the procurement team?

Third...I take your point about making the calculations simple and we will strive to do that for future projects. Is there a utility that shows how many resources are used with different versions of the logic? It would be satisfying to load one form and see what gets used and then to load the simpler form and see the savings with the same survey responses. How might we do that? Ideally I would like to be able to amend the statement above and say "If we use the form in its most straightforward incarnation with the logic expressed in natural but somewhat complicated form, we can collect data from NNN respondents without rebooting and if we devote resources to simplify the form, we expect to see ___ operational benefit." (Not crashing? More interviews before reboot needed? Other?)

In this project, I seem to have gotten off lucky. The hardware recommendations and purchases were made before I got involved. My team developed a form that instantiated the questionnaire. We didn't see any problems during testing, although we also did not simulate an entire day of data collection. We won't make that mistake again. On field day 1 it looked like we had substantial problems, but then once the phones were re-re-updated, everything went fairly well and we got the data we hoped to collect.

I didn't particularly deserve to be this lucky...and I would rather not rely on good luck next time, so I'll appreciate pointers to resources to help plan and to decide how many resources to devote to simplifying the logic in the interview forms.

Thank you,
-Dale

If you were way outside the RAM needed by Collect to process your form, you're right that it would just fail systematically. And if you were comfortably within needed RAM, you'd never have a problem. But since you're mostly within the need but flirting with the edge, my guess is that you saw failures when something else was happening on the device outside of your control like another app updating or some operating system task running. Alternately it could be that some enumerators just got to more households for some reason and I'd expect them to have more problems as described below.

No but that's an interesting idea. The thing that is resource-intensive is relationships between fields. So if field B is computed using field A's value, that relationship means that field A's value changing has ripple effects. This gets magnified if you have long chains of relationships. Those relationships are represented in memory. The most effective change you can make in form design is capturing expressions that are identical in calculates and reusing them so that fewer relationships need to be represented.
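
To illustrate the reuse idea, here is a rough Python sketch. It is only a model: it counts field-to-field references in the expressions as a stand-in for relationships, and makes no claim about how Collect actually stores them. The field names and the `eligible` calculate are hypothetical.

```python
import re

# Matches XLSForm field references such as ${consent}.
FIELD_REF = re.compile(r"\$\{([^}]+)\}")


def dependency_edges(form):
    """Return the set of (referenced_field, dependent_field) pairs.

    `form` maps a field name to its expression (calculation or relevant).
    """
    edges = set()
    for name, expr in form.items():
        for ref in FIELD_REF.findall(expr):
            edges.add((ref, name))
    return edges


SHARED = "${age_months} >= 12 and ${consent} = 'yes' and ${card_seen} = 'yes'"
QUESTIONS = ("bcg", "opv", "penta", "measles")

# Hypothetical form in which four questions repeat the same relevance expression.
repeated = {q: SHARED for q in QUESTIONS}

# Same logic with the expression captured once in a calculate and reused.
reused = {"eligible": f"if({SHARED}, 1, 0)"}
reused.update({q: "${eligible} = 1" for q in QUESTIONS})

print(len(dependency_edges(repeated)))  # 12
print(len(dependency_edges(reused)))    # 7
```

In this toy model the saving grows with both the number of fields the shared expression references and the number of questions that reuse it.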

There is some strangeness around how these relationships are represented when repeats are involved and that takes up further memory. @ggalmazor is actually currently exploring this part of the implementation to at least have it better documented but also to see whether there are improvements that could be made.

In your case, I'm guessing that you ran into issues because of the number of repeats that specific enumerators added and that if they had been assigned fewer households or worked half days or something, you would not have seen any issues. Did you notice adding repeats taking progressively longer? Or saving the form taking progressively longer as more repeats were added? Was that disruptive?

I agree that this should be better documented and have filed https://github.com/opendatakit/docs/issues/1149.

I wish! I know it sounds simple but because of the broad range of things that can be done in form design, the different Android versions available and the huge amount of variation in how devices are set up, I don't think we can really provide such specific and confident guidance. We'll know more after you answer some of my questions above but I'm fairly confident in your case that it's the combination of the relationship between fields and the number of repeats added that caused problems.