Associating a longitudinal form with an "entity type"

adam.butler · September 3, 2019, 9:38am

I'd like to add a question/discussion point to the agenda that relates to longitudinal data collection (I've started work on a proper spec, which I plan to have ready in advance of the Convening so that we can discuss it there, if time allows).

In longitudinal collection, we have the concept of an "entity", which is the thing about which you want to collect data: a patient for whom you are doing a followup; a tree whose current height and status you want to report; a village whose population you are tracking.

When filling a longitudinal form, the first step is to select the entity with which you're concerned. You therefore need to associate a longitudinal form with an "entity type" (e.g. patient/tree/village). In the simple case, the entity is a filled form (in which case the form itself is the definition of the entity type). So in this simple case, the entity type identifier could be the formId.

Based on this, some questions:

Is it OK to just use the formId, or would we want to add a UUID identifier for forms?
Is there some way in which we could support the creation of entities by uploading a list in csv/xls/xml format? If so (and I think it would be good if we could), the list itself would be the entity type. How would we identify this list?
I have a gut feeling that it might be good to use URLs to identify entity types. Is this just a dumb idea?

It would be great if people could give this some thought prior to Wednesday's meeting.

Xiphware · September 3, 2019, 9:33pm

I think you probably mean instance id (or perhaps submission id?). The formid uniquely identifies the (generic) form used to capture/display the data. The entity data, in this case, constitutes a populated instance of that form, ie a 'submission'. Right?

[upon re-reading, I see you say entity type identifier (as opposed to just entity identifier). In which case yes, I think you probably could use the associated (entity) form formid to specify the type of entity]

In our implementation, we have "assets", ie the thing inspected, equivalent of your 'entities'. And we use a specific XForm form for the purpose of displaying this data - called an "asset form". Typically these asset forms are just simple flat forms that just have a bunch of 'questions' to display each of the asset's field values. The asset data is displayed by populating the asset form with the appropriate instance XML (and making it read-only) [although you could potentially use the same mechanism for capturing new assets/entities...]. That is, the asset (aka entity) is basically represented as an existing 'submission' of the asset form (!)

This works great for us, as it allows us to use the exact same framework for displaying (capturing, updating) the asset data associated with an (inspection) form; basically it's just another XForm displayed alongside the actual inspection (aka survey) form you are filling in.

It also means we can pull specific asset data out using the same XPath functions we use for regular forms (!)

adam.butler · September 4, 2019, 9:14am

Thanks @Xiphware. And yes, I did mean formId, as you surmised upon re-reading.

I'm interested in finding a universal(ish) way of uniquely identifying an entity type. That entity type might be a form (in which case the entities are submissions), but I would like to make it possible for the entity type to be, for example, an uploaded CSV or XLS file.

Essentially, entities just need to be a collection of objects along with an object definition (a form, schema, row of column names, etc.). It's not necessarily the case that entities will come from a form, and in the interest of making this spec as generic as possible, I'm looking for a solid way of referencing a [definition + collection], such that the implementation in Central could allow either a [form + submissions] or a [definition + collection] that was created by uploading a CSV/XLS file.

Xiphware · September 5, 2019, 12:43am

Instead of treating an 'entity' as specifically a form instance per se, would you be agreeable to treating it as an instance XML (document) instead? That is, the row from your DB table, or CSV line, or XLS row, is represented as XML - an instance XML to be precise. This XML may or may not end up ever getting displayed via an associated, suitably constructed XForm, but it's basically choosing to represent these 'entities' in a data format that's the same as already used for representing external instances - namely an instance XML. And in doing so the 'entity' is thereby susceptible to the same XPath operations that we currently use for looking up instance/entity properties.

Which is to say, an entity may or may not originate from a (X)form, but whatever its source we would first normalize it to an instance XML format for subsequent handling within the ODK client? In effect, instance XML would become our universal data format for storing both submission data and entity data.
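As a rough illustration of the normalization idea above, here is a minimal Python sketch that turns CSV rows into an instance-XML-like document. The root/item element names, the sample columns and the function name are assumptions made for illustration only; nothing here is taken from an existing ODK spec or client.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_instance_xml(csv_text, root_tag="root", item_tag="item"):
    """Convert CSV text into one XML document with one <item> per row."""
    root = ET.Element(root_tag)
    for row in csv.DictReader(io.StringIO(csv_text)):
        item = ET.SubElement(root, item_tag)
        for column, value in row.items():
            # Assumption: column names are already valid XML element names.
            ET.SubElement(item, column).text = value
    return root


# Made-up sample data: the header row acts as the "definition".
sample_csv = "id,last-name,height\n1,Oak,12\n2,Beech,9\n"
doc = rows_to_instance_xml(sample_csv)
print(ET.tostring(doc, encoding="unicode"))
```

A real implementation would also have to sanitize column names that are not legal element names (spaces, leading digits, etc.), which this sketch does not attempt.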

adam.butler · September 9, 2019, 3:56pm

Yes, that makes sense - I'm not 100% happy with instance XML as our "universal data format", but I think it would be very useful to settle on a universal data format of some kind, and I guess it would make the most sense to use what we already have.

OTOH, I think the spec will include one or more new XPath functions that will in effect be syntactic sugar to avoid forcing people to write actual XPath queries - see for example the new pulldata function: https://opendatakit.github.io/xforms-spec/#fn:pulldata.

I'm thinking of something along the lines of entity(<fieldname>) to retrieve a field value for the currently selected entity (e.g. entity('last-name')). This would resolve into something like instance('entities')/root/item[id=<selected-entity-id>]/last-name.
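To make the intended resolution concrete, here is a small Python sketch of what such an entity() lookup might do against an entities document shaped like root/item. The document layout, the sample values and the explicit selected_entity_id parameter are assumptions for illustration; a real client would evaluate this inside its XPath engine rather than in Python.

```python
import xml.etree.ElementTree as ET

# Made-up entities document, shaped like instance('entities')/root/item.
ENTITIES_XML = """
<root>
  <item><id>e1</id><last-name>Example-One</last-name></item>
  <item><id>e2</id><last-name>Example-Two</last-name></item>
</root>
"""


def entity(entities_root, selected_entity_id, fieldname):
    """Return the named field of the selected entity, or '' if not found."""
    for item in entities_root.findall("item"):
        if item.findtext("id") == selected_entity_id:
            return item.findtext(fieldname, default="")
    return ""


entities = ET.fromstring(ENTITIES_XML)
print(entity(entities, "e2", "last-name"))
```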

This of course leaves an open question of where <selected-entity-id> comes from: I imagine that would be the value of the first question, which would be - automatically for any longitudinal form - a <select1> containing a list of available entities.

Does that sound reasonable?

Xiphware · September 10, 2019, 6:30am

Actually, for mine I implemented a custom XPath function - called property(<fieldname>) - that you can call from any inspection (aka survey) form XPath expression (eg relevant, calculation, ...), which performs the following XPath against the form's associated asset (aka entity) form:

string(//*[local-name()='%@'])

where <fieldname> is substituted into %@. Basically, it looks for an XML element of the same name anywhere under the asset/entity data, so you don't have to worry if your field happens to be buried under a few convenient display groups in your asset/entity XML doc...
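For readers who want to see the behaviour of that query spelled out, here is a minimal Python sketch of the same idea: find the first element with a given local name anywhere in the asset document and return its string value. The function names and the sample asset XML are made up for illustration and are not the actual implementation described above.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def property_value(asset_root, fieldname):
    """Rough equivalent of string(//*[local-name()='fieldname'])."""
    for element in asset_root.iter():  # document order, root included
        if local_name(element.tag) == fieldname:
            # XPath string value: all descendant text, concatenated.
            return "".join(element.itertext())
    return ""  # string() of an empty node-set is the empty string


# Made-up asset document with the field buried inside a display group.
asset = ET.fromstring(
    "<asset><details><itemid>T-7</itemid></details><name>Block A</name></asset>"
)
print(property_value(asset, "itemid"))
```

Because the search ignores nesting, two fields with the same name in different groups would be ambiguous; only the first in document order is returned.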

In practice you can use this to turn on/off questions in your form depending on the asset/entity in question, which saves you either (1) having to create separate forms for different subtypes, or (2) making the inspector/enumerator re-enter specific fields to trigger particular show/hide logic sections. It's also quite useful for the purpose of displaying certain asset/entity fields - names, identifiers, etc - in your form.

Not suggesting that this alone is anywhere near sufficient to constitute a formal spec (!), rather just to say "Been there, done that, got the t-shirt" [and I can happily report that it works like a charm!]. It's nice to see that, barring my choosing 'property' over 'entity', I perhaps wasn't too far off the (your?) mark...

Xiphware · September 10, 2019, 6:49am

additional note...

In our case, you effectively pick the asset/entity (from a list) before initiating/starting an actual inspection/survey against said target. ie

pick your asset
pick your form

So in effect the property() function operates against the prescribed instance XML document. Taking this approach, vs having an explicit select1 question in your form, may be an issue of workflow.... But short of having your entire form relevant on a single question, it's hard to force any specific question to necessarily be 'first'. So there may be some benefit to having your entity selected - via some means - on the client before starting up your form [sic].

adam.butler · September 10, 2019, 10:27am

Absolutely! It's really good to know that I'm thinking in directions that you've already tried, and extremely helpful to be able to learn from your experiences! A good example of the latter is your improved XPath - it hadn't occurred to me that fields could be arbitrarily nested in groups (duh).

Also, the fact that (from what I understand) you're making the entity/asset accessible to the longitudinal/inspection form as a standalone object, rather than referring to it within the list of entities - I guess that's just a subtle scoping difference, but it's enlightening nonetheless.

adam.butler · September 10, 2019, 10:42am

I guess that this comes down to a question of what knows about what - does an entity type know about all the longitudinal forms that can be applied to it? Do longitudinal forms know about all the entity types?

My thoughts so far have been that an entity type should not need to know about a longitudinal form, since this seems to make things simpler. If I have a patient entity type, and I decided to start doing drug trials, it feels odd to me that I would have to make any changes to the patient type in order to do so. The drug trial will need to know that it's working with patients (and it will probably even need to know that it's working with a filtered subset of the complete collection of patients; the drug trial form will probably need to include a definition of that filter) - but not vice versa.

Within this structure, the order would have to be

pick a form
pick an entity

But I envisage that the initiating select1 would not need to be explicitly included in the form - Collect (or any other client that implements the spec) would know that it needs to be prepended for any form that is marked as longitudinal.

This thread is now getting waaaaaay off the original topic - @yanokwa or @ln, is it possible for an admin to move this conversation into a new topic? I don't think my trust level is high enough

Xiphware · September 10, 2019, 11:12pm

To be clear, the 'entity form' is merely the mechanism by which to display the entity data; typically it's a pretty simple flat form that just displays a bunch of field-value pairs, with little if any complex calculations, show/hide logic etc. It's really the instance XML (of the entity form) that is storing the actual entity data that is of most importance.

So in practice the entity form itself knows nothing about what inspection or survey forms might be applicable to entities of that type. However, conversely, inspection/survey forms do need to know enough about the entity form to know how to extract whatever fields they want. For example, knowing the fieldname that, say, the entity identifier is stored under in the instance XML, which is needed to fill the string(//*[local-name()='%@']) query. This can be facilitated somewhat with some best practices; eg all key identifiers are called itemid, general description of entities is always called name, primary location/address is stored in a field called location, any fixed GPS position is stored in gps, etc. So yes, the inspection/survey form needs to be loosely coupled to its associated asset/entity form, but typically not the reverse.

Presumably, in a longitudinal survey, it is the entity in question that is the primary focus, and the particular form used against it is somewhat secondary, and may well in fact change depending on certain aspects (or current state) of the selected target entity; eg Triage, PreOp, PostOp, Discharge, Followup, ... In our case, pretty much in all our user stories you must first select the asset (aka entity) you wish to inspect (aka conduct survey), and then select a particular form according to the specific subtype of the entity, or its previous inspection history (eg the previous inspection failed), or what specific task you wish to perform (eg inspect electrical systems vs plumbing). That is,

select asset
select form

Pretty much the only usecase where you select and immediately initiate a specific form, without first identifying the asset/entity, is when you are out in the field and either identify an entirely new asset that still needs to be surveyed, or it's an existing entity that you cannot identify yet (eg the entity list wasn't preloaded onto the device). We call these ad hoc inspections, and typically they are later resolved and assigned against a specific entity when you get back to the office (and/or after you have created a suitable new entity for it).

If you'll be at the convening I'd be happy to show you our workflows. [and I'll try not to make it a hard sales pitch, although I'm sure my CEO would be overjoyed if eHealth Africa chose GoMobile to perform all its toilet inspections!]

Xiphware · September 10, 2019, 11:20pm

Oh, and your 'entity type' roughly corresponds to our various 'asset registers' (toilets being just one of them...). So within each register you have a list of associated asset items (your individual entities), and a collection of all forms applicable to items in that register. That is:

register --> items
register --> forms

The latter of which happens to line up quite nicely with Central's 'projects' --> forms (!) So I think the missing piece from a similar ODK puzzle might be something that performs the functional equivalent of perhaps: projects --> entities ?
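Purely as an illustration of the register --> items / register --> forms shape described above, it could be modelled with a small Python data class like the one below. The class name, attribute names and sample values are all hypothetical; this is not how GoMobile or Central actually store things.

```python
from dataclasses import dataclass, field


@dataclass
class Register:
    """One asset register (roughly an 'entity type'): its items and its forms."""

    name: str
    items: dict = field(default_factory=dict)  # item id -> field values
    forms: list = field(default_factory=list)  # ids of forms applicable to the items


# Made-up example: a register of toilets with one item and one inspection form.
toilets = Register(name="toilets")
toilets.items["T-7"] = {"name": "Block A", "location": "North gate"}
toilets.forms.append("toilet-inspection")
print(toilets)
```

Under this reading, a Central project that also carried an items collection would play the role of the register.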

LN · September 18, 2019, 8:33pm

Thank you for your leadership on this, @adam.butler and @xiphware!

Is there a "table of contents" for conversations and documents related to improving longitudinal data workflows somewhere? Building up that context and history would be helpful, I think. There have also been unrelated conversations that have veered into related territory like the one about the ${entity#...} syntax for XLSForm. If this doesn’t exist, I’d be happy to build it up.

In particular, the user stories document that was generated some time ago provides helpful context for this thread. I would also appreciate a sense of who participated in which conversations and what the conversation/document's status is. For example, I was on maternity leave when the user stories reached their current state and so I don’t have a good sense of how much discussion happened around them and what conclusions were drawn (e.g. how the grayed out stories became grayed out).

At a high level, I believe the stories that remain black can be summarized as:

Linking of form definitions through specific fields
Synchronization of datasets between server and clients
Tasking enumerators

This thread is mostly about the first theme with a bit of the second and it focuses on the implementation strategy. The roadmap issue also focuses on implementation. I would find it very helpful to see more detail from an idealized user experience perspective. How project managers will set up projects, how analysts will view data, how enumerators will pick entities, how conflicts will be resolved will all impact the spec design. Has that been done somewhere? Here are some examples of the types of questions that would be answered by this:

“As an administrator I want to be able to designate a particular form (“patient form”) as a source of entities for a record form (“visit form”)”
- Does the user need to enter a server-side “mode” distinct from the current disconnected forms mode? Do they need to designate which field in each of these forms should be the key or is that done in the form design (as this conversation assumes)?
- Is “entity” a concept explicitly surfaced to the administrator? Is it a separate concept from “project”?
“As an administrator I want to be able to view records by entity”
- Are viewing by entity and by form both possible?
- How are different values collected during different encounters for the same property represented on view/export? Is only the latest provided? Are they all made available?
- What happens if two enumerators create the same entity while offline? (the de-duping story is grayed out but something has to happen)
- Is there a difference made between a correction to a mistake made about a fixed property and a new value for a changing property?
“As an enumerator, I want to be able to select an entity from a list before I begin a record form”
- What will the enumerator see in order to make that choice (e.g. just the entity ID? I believe an “identifying fields” concept has previously been mentioned)
- Is only one kind of entity available at a time or does the enumerator first have to pick an entity type?
- Is picking an entity a different client “mode” than picking a blank form to fill?

To more explicitly tie these back to the question that started this thread -- where is the “entity” concept visible to the form designer, the project manager, the data manager and the enumerator?

adam.butler · September 19, 2019, 9:29am

And thanks to you for picking it up @LN!

A table of contents would definitely be a good idea. So far I've made two presentations to the TSC:

the first was based on this set of slides (some of the ideas here have been superseded by other discussions, but hopefully it's a good, albeit basic, introduction)
the second was based on the user stories doc that you linked to

There have also been a couple of forum threads:

I presented the user stories here, and received some useful feedback
some useful info here too

I'm sure there are plenty more threads that have touched on the topic, so perhaps we could build this up together? I'm not sure what the best format would be... In a separate thread?

Meanwhile, let me see if I can answer some of the important questions that you've raised here. Please bear in mind that these answers only reflect the way that I've been arranging my thoughts on the topic, and I'm very happy to be convinced that I'm wrong! I've been working on a spec that I was hoping to distribute prior to the Convening, so that we could use it as a starting point for the discussions there; maybe it would be good if I put that WIP into a Google doc and share it here so that you and others can chime in as I work on it (or would you prefer some other way of collaborating on it? I'm open to any suggestions).

In particular, the user stories document that was generated some time ago provides helpful context for this thread. I would also appreciate a sense of who participated in which conversations and what the conversation/document's status is. For example, I was on maternity leave when the user stories reached their current state and so I don’t have a good sense of how much discussion happened around them and what conclusions were drawn (e.g. how the grayed out stories became grayed out).

As I said, I presented these user stories to the TSC, and during the discussion we added, edited and removed some of the stories. At the end of the session, there was general agreement that the stories reflected everyone's sense of what longitudinal functionality should look like in ODK. In terms of the greying out, we were trying to be pragmatic about what a reasonable MVP could consist of. There are some greyed stories that might be controversial, e.g. the ability to view previous filled forms for a given entity - this is clearly desirable, but we concluded that it would represent a not inconsiderable effort that could be postponed to a v2.0 (or a v1.1, or whatever....)

I would find it very helpful to see more detail from an idealized user experience perspective. How project managers will set up projects, how analysts will view data, how enumerators will pick entities, how conflicts will be resolved will all impact the spec design. Has that been done somewhere?

In my head, mostly

“As an administrator I want to be able to designate a particular form (“patient form”) as a source of entities for a record form (“visit form”)”

Does the user need to enter a server-side “mode” distinct from the current disconnected forms mode? Do they need to designate which field in each of these forms should be the key or is that done in the form design (as this conversation assumes)?

Is “entity” a concept explicitly surfaced to the administrator? Is it a separate concept from “project”?

I envisage that this process would be strongly tied to the concept of a "project", as is currently used in Central. The workflow I have in mind looks something like this:

The administrator creates a new project on Central (or another server that implements longitudinal data collection - from now I'm just going to talk about Central, but obviously I'm not excluding other implementations)
The administrator is prompted to specify whether this is a longitudinal project
If the administrator specifies that this is a longitudinal project, the first thing they will need to do is specify the entity type. This can be done in one of three ways:

a) upload a new form, e.g. a form for patient details

b) select an existing form that is already on the server, currently in a different project

c) provide a CSV/XLS file, where each row defines an entity (maybe this isn't a 1.0 feature, but I think it would be important that entities can come from sources outside of ODK)

In case (c), the administrator will be asked to designate which field to use as the key. In the other cases, I would suggest that we use the instanceId, or even add another metadata field to the xform spec, so that we can make this transparent to the user. But in any case, we also need to ask the administrator which field(s) should be shown in the preliminary entity choosing widget.
The administrator will then be able to upload one or more longitudinal data collection forms that reference this entity type. These forms can reference the entity about which they are collecting data using some as yet undefined syntax. They will not need to specify a select_one in their form to choose the entity; this "entity choosing widget" should be handled automatically by Collect or any other implementing client application. The form should just assume that a selection has already been made before data entry begins.

“As an administrator I want to be able to view records by entity”

Are viewing by entity and by form both possible?

Yes, I think that this would be important. As a data analyst I would like to see both (a) all data that has been collected about village X (and note that this might be data from more than one longitudinal form...), and (b) all data that was collected using a particular longitudinal form (in which case each record should link back to the entity that "owns" it)

How are different values collected during different encounters for the same property represented on view/export? Is only the latest provided? Are they all made available?

When viewing as a data analyst, I might say "show me all the seasonal growth data that has been collected for tree X"; this would then show a number of records that all come from enumerators filling in the same longitudinal form at different points in time. (So here I would be "viewing by entity"). I can then analyse the tree's progress over time.

For export, I would say that this should be configurable: maybe I want the latest reported height of all the trees I have records on; maybe I want all the reported heights of all the trees; maybe I want to see the growth of all the trees in separate graphs.
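As a hedged sketch of what "configurable" could mean here, the Python below shows two of the export modes mentioned: every reported value per entity, and only the latest one. The record layout (entity_id, date, height_m) and the sample numbers are invented for illustration.

```python
from collections import defaultdict


def values_by_entity(records, field):
    """Group one field's values per entity, ordered by collection date."""
    grouped = defaultdict(list)
    # ISO-8601 date strings sort correctly as plain strings.
    for record in sorted(records, key=lambda r: r["date"]):
        grouped[record["entity_id"]].append(record[field])
    return dict(grouped)


def latest_by_entity(records, field):
    """Keep only the most recently collected value per entity."""
    return {
        entity_id: values[-1]
        for entity_id, values in values_by_entity(records, field).items()
    }


# Made-up longitudinal records from the same growth form.
records = [
    {"entity_id": "tree-1", "date": "2019-03-01", "height_m": 4.0},
    {"entity_id": "tree-2", "date": "2019-03-02", "height_m": 2.5},
    {"entity_id": "tree-1", "date": "2019-09-01", "height_m": 4.6},
]
print(values_by_entity(records, "height_m"))
print(latest_by_entity(records, "height_m"))
```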

What happens if two enumerators create the same entity while offline? (the de-duping story is grayed out but something has to happen)

It's greyed out because we agreed that this was a piece of functionality that could be omitted in the MVP; we certainly didn't intend that it would never be handled. It's a difficult question, but in the past I've done semi-automatic server-side deduping using matching algorithms that then present possible duplicates to an administrator who can then decide whether the entities should be merged; if so any key references in longitudinal records are updated accordingly.

My thinking is that this would then cause a new entity list to be generated, which - since the entity list will be something like an External Secondary Instance - will then cause Collect to show that a form update is available, so that enumerators can then download the deduped version (I'm aware that I'm eliding a lot of details in that sentence...)
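The semi-automatic approach described above might be sketched roughly as follows: a matcher proposes candidate pairs, and only after an administrator confirms are the longitudinal records repointed. The name-similarity measure (Python's difflib), the 0.85 threshold and the record layout are all assumptions for illustration, not the matching algorithm actually used in the past.

```python
from difflib import SequenceMatcher
from itertools import combinations


def possible_duplicates(entities, threshold=0.85):
    """Return (id_a, id_b, score) for entity pairs whose names look alike."""
    candidates = []
    for a, b in combinations(entities, 2):
        score = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
        if score >= threshold:
            candidates.append((a["id"], b["id"], score))
    return candidates


def merge(records, keep_id, drop_id):
    """Repoint longitudinal records from the dropped entity to the kept one."""
    return [
        {**r, "entity_id": keep_id} if r["entity_id"] == drop_id else r
        for r in records
    ]


# Made-up entities created offline by two enumerators.
entities = [
    {"id": "e1", "name": "Acacia grove north"},
    {"id": "e2", "name": "Acacia grove nrth"},
    {"id": "e3", "name": "Baobab hill"},
]
for id_a, id_b, score in possible_duplicates(entities):
    print(f"Possible duplicate: {id_a} / {id_b} (similarity {score:.2f})")
```

The merge step is deliberately separate so that nothing is changed until a person has reviewed the candidates.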

Is there a difference made between a correction to a mistake made about a fixed property and a new value for a changing property?

If I've understood you correctly, then by "fixed property" you mean a field on an entity, while by "changing property" you mean a field in a longitudinal form. In which case, yes, there is definitely a difference. This question also uncovers several more:

Is it possible for an enumerator to update an entity? In the simplest possible case (MVP! - although this probably doesn't satisfy "viable") entities would be read only. OTOH it's easy to imagine a scenario where an enumerator selects an entity before filling a longitudinal form, and then realises that there is an error in the entity data. Maybe it should be possible to link to the entity form from within the longitudinal form so that she can correct the entity error ("Wait a minute, this isn't a beech, it's an oak!") before starting on the longitudinal report? (And this is where we intersect with the "Linked/Sub-forms" topic that has been discussed elsewhere)
What happens when the entity form is the same as the longitudinal form? This is a particular definition of longitudinal data collection that often comes up, where people just want to keep filling the same form about the same thing over time, without having a separate, originating entity. I would maintain that it should always be possible to separate out something immutable that could be used as an entity definition (e.g. the geopoint of the tree; a birth date), but I think that this is something that needs more discussion

“As an enumerator, I want to be able to select an entity from a list before I begin a record form”

What will the enumerator see in order to make that choice (e.g. just the entity ID? I believe an “identifying fields” concept has previously been mentioned)

I kinda covered this above, but there's a lot more that needs defining w/r/t the "preliminary entity choosing widget"

Is only one kind of entity available at a time or does the enumerator first have to pick an entity type?

I don't think the enumerator should be able to pick an entity type - we don't want people inadvertently using an arboreal seasonal growth report for village populations, for example.

Is picking an entity a different client “mode” than picking a blank form to fill?

If you want to fill a longitudinal form, you must begin by choosing the entity on which you are reporting, as mentioned above ("the preliminary entity choosing widget"). So - again, as I envisage it - first the enumerator chooses the form; if it's a longitudinal form, they then have to select an entity; data collection can then commence, as usual.

To more explicitly tie these back to the question that started this thread -- where is the “entity” concept visible to the form designer, the project manager, the data manager and the enumerator?

The form designer has some new syntax that allows them to reference fields from the entity within text, labels, skip logic, calculations, etc. within their form
The project manager has to identify the source of the entities (I'm calling this source the "entity type"; I think that an xform or a CSV structure is an adequate "schema" that defines the "type" of the entity) before they can create a longitudinal project.
The data manager can display and "slice" the data via the entity dimension, should they wish to.
The enumerator has to choose an entity on which they are reporting before they can start filling a longitudinal form.

Phew! I hope all of that makes sense and, as I said at the beginning, this is just my vision for how all of this could work; I'm not saying this is how it has to be (although I do think that it's coherent, covers most use cases and doesn't reinvent too many wheels). I'm looking forward to hearing your thoughts, and figuring out how we can move things forward in a meaningful way.

LN · September 20, 2019, 4:27am

I started a wiki topic to act as a table of contents at ODK ecosystem entity-based data collection table of contents. Anyone with sufficient privileges can edit it. Does that seem reasonable?

I really appreciate all the thinking you've shared.

That's a wonderful start! It's still all very fuzzy in mine because there are so many possibilities and angles. I think it would be great to pick a couple of specific scenarios and workflows (e.g. turtle nesting) as anchors and ideally make sure that the user experience development happens as close to the field as possible. That is one of your strengths, @adam.butler, being so close to the eHA use cases. Have you written up some of the existing specific studies/workflows that you're basing your design around?

I agree all of these paths would be useful. I would find it very useful to see these options ranked by priority and one picked as the highest priority path that can initially be designed for.

Yes, exactly, there's a lot here and a big risk is to be so bogged down considering everything at once that we never get started! Thanks for shaking off that paralysis!

I meant my question to be about the "entity type" concept, I'm sorry! Luckily I think the two questions lead to roughly the same context. With what you've shared, I'm not convinced that "entity type" is something that needs to be explicitly represented. That is, there could be a project which contains an updating list of entities. Implicitly the project name probably would describe the entity type. That's also what I understand when @Xiphware says

Is it possible that you're thinking about an explicit entity type identifier because you're starting to imagine using entities of one type in a project that has to do with entities of another type?

The format that a user uses and the underlying representation don't have to be the same. Perhaps you could expand on why this requirement leads to an "entity type" concept requirement?

Xiphware · September 20, 2019, 7:09am

Correct. I would imagine that in most cases the detailed contents of individual (longitudinal) survey/inspection forms will be largely specific to or dictated by the type of entity being surveyed/inspected. Hence grouping entity-specific forms - and their associated entity instances - under different entity-specific 'projects' (or 'registers', or what-have-you) seems natural. Although in the general case entity types and forms could be completely orthogonal to each other, I believe that in typical usecases forms will be ostensibly entity-type specific (and in the worst case you can always duplicate a generic form over multiple entity-type projects).

dr_michaelmarks · October 7, 2019, 11:44am

Just a few thoughts on Longitudinal data collection.
Previously @yanokwa built a nice feature for the LINKS implementation of ODK which I think they called Dynamic Cascading Selects.
This effectively generated lookup lists in real-time.
i.e.
Enter village, complete village form
Go to house, select village from dynamically generated list of villages on that device, complete house form
Repeat for individual level form

This type of approach would actually work well for an off-grid linkage of multiple forms to the same Entity/Unit. Obviously it only works at the device level although if the dynamic look up could be shared via upload/download then cross-device on-grid solutions might also be possible.
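A loose sketch of the device-level lookup idea, in Python: the choice list for a later form is built on the fly from whatever submissions of an earlier form are stored on the device. The submission layout, field names and sample values are invented for illustration, and this is not a description of how the LINKS feature was actually implemented.

```python
def choices_from_local_submissions(submissions, form_id, label_field):
    """Build select choices from submissions of one form stored on this device."""
    return [
        {"value": s["instance_id"], "label": s["fields"][label_field]}
        for s in submissions
        if s["form_id"] == form_id
    ]


# Made-up submissions already saved on the device.
submissions = [
    {"form_id": "village", "instance_id": "uuid:v1", "fields": {"village_name": "Example East"}},
    {"form_id": "house", "instance_id": "uuid:h1", "fields": {"house_no": "12"}},
    {"form_id": "village", "instance_id": "uuid:v2", "fields": {"village_name": "Example West"}},
]

# Choices offered when filling the house form.
print(choices_from_local_submissions(submissions, "village", "village_name"))
```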

Just a thought from my experience using this specific implementation.

Michael