XLSForm spec proposal: add syntax to make it easy to use a value from the last saved instance

LN · March 25, 2019, 5:48pm

This proposal provides user-friendly XLSForm access to the feature for remembering previously entered values. Implementation in pyxform depends on the decision reached in "Spec proposal: add first-load event to replace xforms-ready".

I propose adding a yes/no column to represent whether a particular question's value should default to the last saved value. For example:

type	name	label	default_to_last
`text`	state	State	yes
`text`	street	Street
`select_one animal_type`	animal	Animal	yes

If the column contains yes for a particular question, that question's value will default to the last saved value.

As usual, the hardest part is naming. I think including the word "default" is helpful. This makes it clear that it's related to the existing default column in that it means the client will show a value that can be edited by the surveyor (or not). Some other ideas:

default_to_latest
default_to_previous
fill_from_latest
remember_value

I believe @cooperka and @tomsmyth will be adding this to a visual form builder and perhaps we could coordinate on naming.

Alternatives considered
An alternate approach would be to use the existing support for external instances and a more general dynamic_default column. This would generate a setvalue triggered on form load with whatever value is in the column. The form author could then enter something like instance('__last-saved')/data/a in that column to get the desired behavior. I think it's too hard to use and explain and so the added flexibility isn't worth much.
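To make the alternative concrete, here is a minimal sketch of the kind of setvalue action such a dynamic_default column could generate. The helper name `build_setvalue` is hypothetical, and the event name is an assumption: the first-load event was still undecided at this point in the thread, so `odk-instance-first-load` is only a placeholder.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder event name; the thread had not settled on one yet.
FIRST_LOAD_EVENT = "odk-instance-first-load"


def build_setvalue(ref, expression, event=FIRST_LOAD_EVENT):
    """Return a <setvalue> action that assigns `expression` to `ref` when `event` fires."""
    element = ET.Element("setvalue", {"event": event, "ref": ref, "value": expression})
    return ET.tostring(element, encoding="unicode")


# The column content is passed through unchanged as the value attribute.
print(build_setvalue("/data/a", "instance('__last-saved')/data/a"))
```

The form author would still have to type the full instance('__last-saved')/data/a expression, which is the usability concern raised above.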

CC @Ukang_a_Dickson @yanokwa @martijnr @Xiphware @ggalmazor

LN · March 25, 2019, 10:00pm

I was in a private chat with @cooperka and he pointed out that it matters that this is the "last saved" record, as opposed to the last opened or last created one. With that reminder, my preferred name now is default_last_saved.

yanokwa · March 25, 2019, 10:03pm

I'm generally OK with this, but had some questions.

Why is this a new column and not a parameter? My guess is that parameters apply to a type of question and this applies to all questions. It'd be good to be explicit about that in the issue that's filed.
Currently in pyxform, yes is an alias for true(). That is, most (all?) columns are passthrough expressions and that would not be true (pun intended) in this case. Feels dangerous to me, but I don't really have a good alternative. It's not like we can use sounds_good or no_prob as the value...

LN · March 26, 2019, 5:03am

I swear I think about these proposals deeply before publishing them!

Writing out a description of the feature and its usage here got me disliking the yes/no column idea. I think it's too limiting not to be able to easily do transformations on the last saved value. I now prefer the dynamic_default column, with the addition of some way to hide the complexity of instance('__last-saved')/data/a.

This proposal includes two parts:

a new dynamic_default column. The contents are passed through to the value attribute of the setvalue action (see documentation) triggered on first load.
a new transformation to hide instance('__last-saved')/data/a. In the example below, I suggest a __last suffix. For example, just as ${a} expands to /data/a, ${a__last} would expand to instance('__last-saved')/data/a

I didn't explicitly mention this in the prior proposal but either way, pyxform would add the proper XML to define the __last-saved instance if one of these dynamic defaults is used.

type	name	label	dynamic_default
text	street	Street	${street__last}
date	disaster_date	Disaster date	today()
integer	patient_count	How many patients have you seen today?	if(${patient_count__last} = '', 0, ${patient_count__last} + 1)
select_one yes_no	same_street	Are you still on ${street__last}?

This column would provide value beyond just making previously-entered values available. For example, form designers have been wanting to set a date to default to today's date for some time.
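The `__last` expansion described in this post could be sketched roughly as follows. This is not pyxform code: the function names are hypothetical, the `data` root is taken from the /data/a examples, and the sketch assumes no real field name ends in `__last`.

```python
import re

LAST_SUFFIX = "__last"                # suffix proposed in this post
LAST_SAVED_INSTANCE = "__last-saved"  # instance id used in this thread
REFERENCE = re.compile(r"\$\{([^}]+)\}")


def expand_references(expression, root="data"):
    """Replace ${name} and ${name__last} with the XPath each one stands for."""

    def replace(match):
        name = match.group(1)
        if name.endswith(LAST_SUFFIX):
            field = name[: -len(LAST_SUFFIX)]
            return f"instance('{LAST_SAVED_INSTANCE}')/{root}/{field}"
        return f"/{root}/{name}"

    return REFERENCE.sub(replace, expression)


def uses_last_saved(expression):
    """True if the expanded expression needs the __last-saved instance declared."""
    return f"instance('{LAST_SAVED_INSTANCE}')" in expand_references(expression)


print(expand_references("${street__last}"))
print(uses_last_saved("today()"))
```

A check like `uses_last_saved` is one way the converter could decide whether to emit the XML that defines the __last-saved instance.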

adam.butler · April 1, 2019, 3:48pm

This definitely sounds like a better approach.

One proposed change (this may well be a naive question): would it be possible to just use the existing default column, and extend it so that you could enter values like the examples you provide (${street__last}, today(), etc)?

It feels like having columns for default and dynamic_default is a bit redundant, and I'm not sure that there's any precedent for having two separate columns for such closely related functionality.

LN · April 1, 2019, 4:04pm

Thanks so much for the feedback, @adam.butler, and very good point about default vs. dynamic_default. They do use different mechanisms "under the hood" so that's why I think of them as separate but I can't think of a use for specifying both. That is, any static default would be immediately overwritten by the dynamic default on form load anyway, I think.

So I do agree it would be ideal to combine them. From an implementation perspective, that means either all defaults need to use the setvalue mechanism or the XLSForm parser (pyxform) needs to be able to identify dynamic defaults. I don't like using setvalue for all defaults because it makes the XML harder to read and would make the form a little slower. I think regular expressions would be sufficient for pyxform to identify dynamic defaults.
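As a rough illustration of the regular-expression idea, a heuristic along these lines might be enough to separate static from dynamic defaults. It is only a sketch under assumptions: the pattern and the `is_dynamic_default` name are hypothetical, and a static default that happens to look like a function call would be misclassified.

```python
import re

DYNAMIC_DEFAULT = re.compile(
    r"\$\{[^}]+\}"          # a ${field} style reference
    r"|[A-Za-z_][\w.-]*\("  # something that looks like a function call, e.g. today(
)


def is_dynamic_default(value):
    """Heuristic: treat a default as dynamic if it contains a reference or a call."""
    return bool(DYNAMIC_DEFAULT.search(value))


for candidate in ["yes", "2019-03-25", "today()", "${street__last}"]:
    print(candidate, is_dynamic_default(candidate))
```

Static defaults could then stay in the primary instance as they do today, and only the values flagged as dynamic would need a setvalue action.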

yanokwa · April 1, 2019, 6:38pm

I'm not a fan of the ${street__last} syntax.

I think it's hard to see how many underscores you have and it all feels a bit too magical.

I'd prefer something like last-saved(${street}), but I'm not sure if this adds a lot of implementation complexity to pyxform. Also, this would be the first example of a function that isn't passthrough, so maybe we don't want to go there.

One (probably terrible) alternative might be to introduce an entirely new syntax: !{street}

What do you think, @Ukang_a_Dickson?

tomsmyth · April 3, 2019, 1:56pm

How about ${street:last}? I think this would make it still usable in the label column (which I don't think last-saved(${street}) would be). It avoids the need for a totally new syntax; it just adds a kind of pseudo-class to the existing dollar sign notation...

I'm not sure what valid characters are for the thing inside the {} though...

Super fun discussion!

adam.butler · April 3, 2019, 2:36pm

Thinking about this some more, I'm tending towards Yaw's proposals. The ${...} syntax is clearly interpolating a variable value into a string using a common idiomatic syntax, but adding a __ or a : (or even a ::) diverges from that common idiom into the realms of magic. Another argument against __ is that it's explicitly mentioned in the spec section on Markdown as a way of bolding text.

Of course, if there was another part of the XLSForm spec that already used the same syntax to apply a magical function to a value, then it would be more OK (it's like rule #1 of improvising: if you make a mistake, repeat it - that way it's not a mistake anymore).

There might be an argument for saying that ${last_saved('street')} is the most logical syntax, since it says "get the value of the variable named 'street' from the last saved form and interpolate it here". It feels more logical to me, but others might see it as an abomination...

last-saved(${street}) is pretty consistent with selected(${favorite_topping}, 'cheese'), so that could work, although I have to say that the whole string interpolation idiom falls apart with this usage.

And then !{street}... well the spec explicitly says:

Note the ${ } around the variable likes_pizza. These are required in order for the form to reference the variable from the previous question.

I could easily imagine a similar note to explain the meaning of !{ }.

Tino_Kreutzer · April 3, 2019, 6:19pm

I'd also support finding a way to keep this new functionality within the existing default column. It makes intuitive sense, which means it will be easier to explain to users.

!{street} is nice and short; we could consider other characters instead of ! (e.g., % or #) since ! often implies not.

Along the same syntax logic, a word instead of a character could work as well: data{street}

The advantage of not using last-saved as the prefix is that this implementation would work well for future extensions of this feature, i.e. towards case management. In that case data is not pulled literally from the last saved instance but could come from another source.

LN · April 3, 2019, 6:32pm

I think your broadened concept always requires specifying a record in some way, right? That is, to query an arbitrary record you need to specify both the field you want and which record you want it from. I see how a short syntax like !{} can specify a specific record (e.g. the last one) but it's not clear to me how that could be generalized. Maybe you could share an example?

Tino_Kreutzer · April 3, 2019, 8:51pm

You're right, it broadens the potential scope for this new handle based on features we haven't built yet, i.e. case management, so it's probably premature to try and consider this use case. I suppose examples would be

recordID/data{street} or data{recordID/street}

in which case recordID could be the UUID or a different unique lookup field that specifies the record from which data should be retrieved. If and when we support case management, this would be a potential way of implementing this using the default column. It would be nice if the new syntax we're introducing now can logically be extended to accommodate this new feature. But of course it could do that also with a syntax of !{street}.

LN · April 4, 2019, 12:46am

I like the idea of consistent syntax between representing a value from the last saved instance and one from any instance as @Tino_Kreutzer is describing but I don't feel confident that I know enough about what the general case will look like to design for it.

My sense is that the ID of a record that needs to be consulted would always be dynamic (so ${recordID}) which makes it hard to use as a prefix and that the goal would be to fetch records based on many different characteristics, not just ID. I'd guess that advanced users will use XPath directly (as they already do) and there will be convenience functions for those who don't need the full flexibility (like pulldata or indexed-repeat).

I have a slight preference for something that leverages the existing ${} syntax because it feels like, to use @adam.butler's language, it's the same interpolation but with a qualifier:

${street} goes to /data/street
whatever new syntax is agreed on goes to instance('__last-saved')/data/street

I agree that __ is hard to deal with in a user-facing context so let's take ${street__last} off the table. To answer @tomsmyth's question, only characters that are valid in XML element names are allowed in field names: https://www.xml.com/pub/a/2001/07/25/namingparts.html. If we do stay within ${}, then, we can use a separator that is not valid in XML to make absolutely sure it can't conflict with a user-given name. Something like ${street#last-saved} or ${last-saved#street}.
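One quick way to test candidate separators is to ask an XML parser whether a name containing them is accepted as an element name. The sketch below uses Python's standard ElementTree parser for that; the helper name is hypothetical, and it is only meant for simple candidate names without colons, since namespace prefixes are handled separately by the parser.

```python
import xml.etree.ElementTree as ET


def is_valid_element_name(name):
    """True if `name` parses as the tag of an empty XML element."""
    try:
        element = ET.fromstring(f"<{name}/>")
    except ET.ParseError:
        return False
    # Guard against input that smuggles in attributes, e.g. 'a x="1"'.
    return element.tag == name


for candidate in ["street", "street__last", "street#last-saved", "street|last"]:
    print(candidate, is_valid_element_name(candidate))
```

A name like street__last parses fine, which is why that suffix could collide with a real field name, while anything containing # or | is rejected.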

Introducing a whole new thing like !{} or #{} feels a bit heavy for something that probably won't be used in so many forms. I also think it's hard to remember which special character to use. I'm not deeply against it, though.

I'm not thrilled about something that looks like a function but isn't but I could be convinced.

tomsmyth · April 4, 2019, 1:33pm

Yes, this is my thought exactly. It's quite like the pseudoclass concept in CSS. What about ${street|last}?

LN · April 8, 2019, 4:48am

My experience is that non-developers can spend a LONG time hunting for the pipe character on their keyboards and/or use a capital i or a lowercase L and get very confused. It's especially confusing because some keyboards have it labeled as a split pipe (¦).

An additional requirement: whatever characters need to be typed should be recognizable to anyone.

adam.butler · April 8, 2019, 9:10am

That's a good point about the split pipe, @LN - I also think that # is a good option. I've been thinking about @Tino_Kreutzer's idea of thinking forward to possible case management use cases. It seems like it would be helpful to come up with a generic way of saying "the value of field x of entity y". In this particular example, x = "street" and y = "the last saved form". In a possible case management scenario, we would probably want to reference fields on the root entity (where "root entity" means e.g. "the patient I'm reporting on", or "the tree that I return to measure every month") (NB I'm not saying that "root entity" is the best name for this, just using it for the sake of these examples!).

In the case management scenario, we would probably want to reference these fields in labels (e.g. "what is [patient name]'s temperature?") or skip logic (e.g. "skip the next question if the tree is partially in shade"), rather than default values, so that's worth bearing in mind: it would be good to come up with a syntax that is also usable in both of those contexts.

It seems like we all agree on ${...} as being a reasonable way of representing "the value of", so now we just need to work out the best way to say "field x of entity y". The two options we've talked about are x#y and y(x), but I wonder whether it might also be worth considering something along the lines of y#x, as @LN suggested?

Possible values of y would be pre-defined, such as last-saved or root-entity (or more succinctly, last and entity). So in the examples that we have:

Last filled street: ${street#last} or ${last#street} (it might be worth allowing an abbreviation if the field name being accessed is the same as the current field name, so then it would be ${#last} or ${last#})
Patient name in label: what is ${full_name#entity}'s temperature? or what is ${entity#full_name}'s temperature?
Tree status in skip logic: ${partial_shade#entity}=yes or ${entity#partial_shade}=yes

I think my vote would go to ${y#x}, i.e. ${last#street} / ${entity#full_name} / ${entity#partial_shade}

LN · April 9, 2019, 3:04am

@adam.butler I like where you're going.

Here's what I'm understanding for ${entity#full_name}:

there'd be some kind of standard identifier (e.g. recordId) linking the current form instance to info previously collected about the entity this form instance is concerned with. Presumably, the previously-collected info would be in an external secondary instance representing all entities and their info.
the XForm would define a __entities instance to give access to all of the entities' info
entity in ${entity#full_name} would expand to something like instance('__entities')/recordId/data
the entity# shortcut would only allow referring to values related to the entity this form is about, not other entities. For example, if I'm defining a form that will collect information about houses, I can use it to refer to information previously collected about a specific house but I can't use it to refer to information about the neighbor's house or an occupant of the house that info is being collected about.

Did I get that right?

I think it's a useful concept even with the limitations in my last bullet above. I'm on board for introducing ${last#<unqualified field name>} for now with the goal of expanding to other prefixed keywords like ${entity#<unqualified field name>}.
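A minimal sketch of how the ${last#<field>} form could be expanded, assuming the `data` root used in the earlier examples. The names here are hypothetical and this is not pyxform code. Only the `last` keyword is mapped, because the expansion for `entity` has not been designed yet, so any other keyword is rejected.

```python
import re

# Keyword before '#' -> secondary instance id. Only 'last' is defined so far.
INSTANCE_FOR_KEYWORD = {"last": "__last-saved"}
REFERENCE = re.compile(r"\$\{(?:([A-Za-z-]+)#)?([^}#]+)\}")


def expand(expression, root="data"):
    """Expand ${field} and ${keyword#field} references into XPath."""

    def replace(match):
        keyword, field = match.group(1), match.group(2)
        if keyword is None:
            return f"/{root}/{field}"
        if keyword not in INSTANCE_FOR_KEYWORD:
            raise ValueError(f"Unknown reference keyword: {keyword}")
        return f"instance('{INSTANCE_FOR_KEYWORD[keyword]}')/{root}/{field}"

    return REFERENCE.sub(replace, expression)


print(expand("Are you still on ${last#street}?"))
```

Adding another prefixed keyword later would then mostly be a matter of adding an entry to the mapping, plus whatever record lookup that keyword turns out to need.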

adam.butler · April 9, 2019, 8:55am

Yes, yes, yes and yes - and thank you for making all of that explicit @LN !

Just as an addendum to your last bullet: if the occupants were somehow marked as being a direct attribute of the house entity, then I think that it ought to be possible to refer to them using this syntax. But that's probably a discussion for another day...

Xiphware · April 9, 2019, 9:14am

A possible issue I see with this definition of #entity is that it appears (correct me if I’m wrong...) to tie entities to a specific instantiation of a specific form id+version (?). In the general case, and certainly in mine, you can perform multiple (and completely different) ‘inspections’ (aka fill in completely different forms) about the same ‘entity’, so this vague “entity” thingy exists independent of any particular form, let alone a specific form version.

In this context, what is an ‘entity’? Or should it instead be called, say, “specific form instance”?

“Entity” to me conveys a physically unique object, whereas a form instance/submission is more of a partial snapshot in time, unique only unto itself.

adam.butler · April 9, 2019, 1:21pm

@Xiphware the way I see case management working in ODK is that you would have two different types of form: entity forms and report forms (this nomenclature is not yet written in stone). There's more information about the proposed approach here:

github.com/getodk/roadmap

Entity-based data collection

opened 09:59AM - 26 Jun 18 UTC

closed 09:03PM - 13 May 23 UTC

admbtlr

See [the forum table of contents](https://forum.getodk.org/t/odk-ecosystem-longi…tudinal-data-collection-table-of-contents/22234) - https://github.com/getodk/central/issues/298 adds Datasets of Entities generated from form Submissions and attached to follow-up forms using the existing CSV mechanism.

<details>
<summary>2018 strawman proposal from @admbtlr </summary>

## User Stories

*As a health worker, I want to be able to collect a medical record every time a patient visits my health facility, so that I can keep track of the patient's progress over time*

*As a census taker, I want to visit a village every year and record population data*

*As a vaccine delivery driver, I want to keep track of the quantities of vaccines that I deliver to cold storage facilities during my weekly deliveries*

*As a regional vaccine administrator, I want to download CSV files that show the quantities of vaccine that have been delivered to all the cold storage facilities in my region over the last six months*

## Proposed Implementation

For the sake of this explanation, I'm going to use the following terminology:

- **Entity** refers to the thing about which data is collected. The kind of thing -- the "entity type" -- will depend on the use case. So in the above user stories, the entity types would be "patient", "village", "cold storage facility", "cold storage facility"
- **Record** refers to one round of data collection. So in the above user stories, a record would be
  1. the details of a patient's visit to a health facility
  2. an annual set of population data for a village
  3. the quantities delivered to a health facility in a given week
  4. again, the quantities delivered to a health facility in a given week

The simplest solution is probably to have two separate forms, one to collect the details of an entity ("the Entity Form") and one to collect the details of each visit ("the Record Form"). A Record must have one (and only one) Entity associated with it. An Entity can have multiple Records associated with it.

### The Entity Form

Forms for creating entities must have a certain field (or fields) marked as an "identifying field". This would be for example a patient's name and DOB, or a village name and region, or a cold storage facility name and ID number. These identifying fields can then be used as labels in the CSV file that the Record Form uses to enable a data collector to choose the linked Entity.

Entity Forms can also have fields marked as "filter fields". These will be used to reduce the number of options shown in the list of Entities (see *Getting Entity lists onto devices* below).

### The Record Form

Forms for creating records must have one attribute called `entity_type_id`; this attribute can only contain the UUID of an Entity Form. They must also have one field called `entity_id`. This field should be of type `select_one_external` (see *Getting Entity lists onto devices* below).

### Getting Entity lists onto devices

The first question in a Record Form should be a selection of the associated Entity. This question should be of type `select_one_external`. The values will then be loaded into the form from an external CSV file that is downloaded from the server. The CSV file should have the following format:

```
list_name,name,label,<filter_field_1>,<filter_field_2>,...
entities,<instanceID>,<identifying field value>,<filter field 1 value>,<filter field 2 value>,...
entities,<instanceID>,<identifying field value>,<filter field 1 value>,<filter field 2 value>,...
...
```

[More](http://xlsform.org/#external) on external CSV files in X(LS)Forms.

These CSV files should be generated automatically by ODK Central, and updated every time a new Entity Form is submitted. It should then be possible to use the automatic form update functionality to keep the CSV file up to date.

_[Question: if a media file is updated - in this case the CSV - does that count as an updated form? or would ODK Central have to automatically make a new version of the form each time it updates the CSV file?]_

### Local Entities

A common use case is to create an Entity and then immediately create a Record for that Entity. In an offline scenario, this is not possible with the spec so far. It is therefore necessary to add a mechanism for adding Entities locally, within ODK Collect.

Every time the Entity form is completed, the data should be written to a local CSV file (or a local database?). There should then be a mechanism whereby the local CSV file is merged with the downloaded CSV file whenever the Record Form is opened.

It might make sense to clean up the local CSV file every time a new CSV file is downloaded from the server, but it's questionable whether this will be necessary (one reason: if an Entity is deleted on the server, it will still be in the local CSV and the merge will make it available in the form).

## Required Changes

### XForm Spec

- addition of concept of an Entity Form and a Record Form (not sure if this is totally necessary, but ODK Central will need to recognise an Entity Form so that it can do the automatic generation of CSV files)
- addition of identifying fields and filter fields

### ODK Central

- automatic generation of CSV files from Entity instances
- automatic form update after generation of CSV file (is this necessary?)
- a UI to enable display of Records by Entity

### ODK Collect

- ability to store a local Entity Instances CSV file and merge it with a downloaded CSV

## Additional Notes

### De-duplication of Entities

It would make sense to build some duplication detection and resolution into ODK Central. Ideally, it would only be possible to do data collection on entities that have come from Central, so that they will always have to go through this de-duping, but this is obviously not acceptable if I want to register a patient and then make a case report on them in a totally offline setting.

I could see a possible solution using a kind of tombstone for de-duped entities, so that a process might look like this:

- while offline, I register patient `dd6c32a4` using Form A
- `dd6c32a4` is now marked as "pending" on my device, which means I can submit case reports against it, but it's not on Central
- I then do a case report on `dd6c32a4` using Form B
- when eventually online, I submit both to Central
- it turns out that patient `dd6c32a4` is an exact duplicate of an existing patient, `19f44a40`, who already has case reports
- (more details about how exactly de-duping works here)
- my case report is switched to refer to the existing patient, `19f44a40`
- patient `dd6c32a4` is replaced in Central with a tombstone that refers to `19f44a40`
- all incoming case reports for `dd6c32a4` will be switched to refer to `19f44a40`
- once my device has updated its entity list, I will no longer be able to make a case report against `dd6c32a4`

For the specifics of the de-duping process, I would probably use a combination of approaches. First you need to find possible matches, probably using a trigram algorithm (or possibly Levenshtein distances) on identifying fields such as name, village, etc. There's a really good trigram module for Postgres. This is then combined with matches on other fields (e.g. date of birth or geopoint) to calculate a similarity score. You can then figure out values and say something like "if it's over 95%, just merge them automatically" and "if it's over 80%, flag them as probable dupes", and provide a simple interface that displays the data with yes/no buttons. I've done something like this for de-duping patient lists in DRC and it worked pretty well.

</details>
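The trigram matching and the 95% / 80% thresholds mentioned in the de-duplication notes could be sketched like this. It is only an illustration under assumptions: it scores a single identifying field, uses a plain set-overlap ratio, pads the text with spaces, and all function names are hypothetical. The proposal itself combines several fields into the score and does not specify any of these details.

```python
def trigrams(text):
    """Set of 3-character windows over the lowercased, space-padded text."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Share of trigrams the two strings have in common (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def classify(score, auto_merge=0.95, probable=0.80):
    """Apply the thresholds from the notes: auto-merge, flag, or treat as distinct."""
    if score > auto_merge:
        return "merge"
    if score > probable:
        return "flag"
    return "distinct"


print(classify(similarity("Amina Yusuf", "Amina Yusuf")))
```

Entities classified as "flag" would be the ones shown in the yes/no review interface described above.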

It's due for another round of TSC discussion in the near future.