External data - current state of affairs

(Tom Smyth) #1

Hello friends. Can anyone point me in the direction of the latest on external data? I remember there had been talk of somehow unifying the disparate worlds of pulldata, external itemsets, etc.

I am running into this limitation: https://github.com/opendatakit/collect/issues/179

and wondering what the most sensible route to overcoming that might be.

(Hélène Martin) #2

Which are you trying to use? If you're using search()/pulldata(), what @andrew (I think!) suggested in the follow-on post is still your best bet.

(Hélène Martin) #3

Hopefully that quick info was enough to help you address your immediate problem, @tomsmyth. I do also want to give a broader update about large datasets in ODK since I don't think that information exists in one place currently. This question ties in to conversations happening about more sophisticated data workflows. In particular, the TSC has been exploring options for adding some kind of case management and that will likely rely on a case list managed as an external document. @adam.butler has been working on a spec at https://github.com/opendatakit/roadmap/issues/23 and I know @Tino_Kreutzer and the folks at Kobo are also interested in this broad topic. @martijnr has also been involved in related conversations and generally really helped me understand the state of things.

I first want to point out that large datasets directly in the form have gotten a lot more performant over the last year thanks to work by @dcbriccetti. So I do recommend trying an internal dataset first as this is simpler to deal with and non-controversial. Those datasets can be updated without having to change the form_id and Collect can can automatically pull those updates (see general settings > Form management > form updates). Depending on the device and specific usage, this can be performant enough for 1000s of data elements.

There are currently 3 incompatible approaches to external data supported by ODK Collect.

External secondary instances

As documented in the ODK XForms specification. This approach is entirely consistent with how datasets in the form are handled. That means everything that works with internal choices also works with external choices including choice filters for cascading selects and any kind of complex XPath querying. This is the only option for external data that is part of the ODK XForms spec. It is supported in ODK Collect and in Enketo. It does have two major downsides: clumsiness of the XML format and performance.

XML is not reasonable to create by hand. For this to be a viable way to represent external data, there needs to be some way to go from tabular data to an XML doc. This could be done server-side, for example. XLSForm could also be able to generate XML external instances just like it builds CSVs for external itemsets. Related to all this is @martijnr's proposal to formally introduce a way to attach a CSV file that would be queryable in the same way as the XML (by, for example, generating the XML on the device). The proposed spec is described in xforms-spec#88.

This works well for mid-size datasets but because the whole XML doc is currently represented and queried in memory in the Collect implementation, it can become quite slow for large datasets or slow devices. @dcbriccetti made some great improvements to performance about a year ago and there are probably more to make without big changes to the code structure. There are likely larger changes possible such as not holding the whole doc in memory or maybe even using a database.

search() and pulldata()

As documented in the XLSForm spec and with more details here. These predate external secondary instances and have no grounding in the ODK XForms spec. As I understand it, the two major design criteria were fast loading of large datasets and minimal changes needed to existing tools, especially pyxform. I think that's what led to the unexpected choice to make search() an appearance. Implementation-wise, this approach takes a CSV, loads it into a database table and then makes queries against it. search() entirely bypasses the form structure and injects the values from a database into a list of options for a select. pulldata() behaves roughly like a normal XPath function and makes it possible to pull a single value out of the database and use it in some expression.

These are performant but have limitations like the one @tomsmyth pointed to because the data is not part of the form. SurveyCTO (company that builds tools on top of ODK) contributed this and I know they have implemented some form of case management on top of it.

Fast external itemsets

As documented in the XLSForm spec. This implementation also predates external secondary instances and I think was added roughly at the same time as search()/pulldata() to solve the same problem of making large choice sets performant. It also has no grounding in the ODK XForms spec and works in a similar way -- Collect puts the contents of a CSV named itemsets.csv into a database and then queries it. It's more limited than search()/pulldata() and narrowly solves the problem of speeding up cascading selects. It only works with select ones (not even select multiples). I don't think it would be appropriate for representing general datasets like case lists.

Pulldata with two lookup values
(Tom Smyth) #4

Terrific write-up, thank you.

(Kevin Cooper) #5

Update on the state of affairs:

We've encountered a couple JavaRosa bugs filed here preventing us from using external secondary instances.

Based on the docs I've read it sounds like external selects behave in effectively the same way as internal selects, thus would be expected to have similar performance, but according to benchmarks done by @dcbriccetti here it looks like external was much faster (at least prior to his performance improvements about a year ago).

I recently benchmarked internal selects to be significantly slower than his original measurements (at least up to several thousand items), though the time currently increases linearly rather than polynomially. A middling phone took 25 seconds to load only 1500 items. He didn't specify what device he used originally.

His original data (September 2017):

My current data (December 2018) based on Galaxy Nexus, Android 6.0, 1 GB RAM:

We're considering whether our client is going to need this to be improved; if so we'll pursue further. The next step would probably be to profile and figure out if there are bottlenecks or how else we can improve performance.

(Dave Briccetti) #6

Hi Kevin. Thanks for your detailed investigative work.

@ggalmazor, you were asking how I made that chart. I ran timesParsingLargeInternalSecondaryInstanceFiles and timesParsingLargeExternalSecondaryInstanceFiles in XFormParserTest. They produce output like

Children Seconds
1,000 0.126
1,258 0.01

I graphed those values with Apple’s Numbers.