Collect: keep history of changes to values in the form

dr_michaelmarks · October 11, 2018, 10:54am

The answer here is that, unless I am missing something, this log file doesn't seem to have what is needed.

For example I created a form and used a complex text string for one of the values.
I then changed the entry to a different string.

In the log file, if I do CTRL+F and search for either string, neither appears.
To me it seems like the log file records system stuff but not the actual data, which is what I need to be able to track changes to.

Xiphware · October 11, 2018, 9:02pm

OK, so it looks like the Collect log file used to capture user input, but may not now or in the future. Instead, the recommendation is to use the audit log for such purposes [sorry I sent you on a wild goose chase and didn't point you here to begin with...]. The audit log doesn't currently capture user input, but that is probably a very legitimate feature to consider adding, specifically for exactly your sort of use case.

I might suggest taking a look at the audit log stuff, and perhaps opening a feature request(s) against it to add whatever is still missing for you.

dr_michaelmarks · October 11, 2018, 9:13pm

Yes, as I understand it the audit log only captures timestamps etc., not the user input; but if it did, that would be perfect.
Where would you like the feature request logged: here or GitHub?

Xiphware · October 12, 2018, 2:00am

Probably start here, if only for the wider audience so we get as much feedback as possible from other potential exploiters of such an addition. Then the TSC can make a decision/prioritize accordingly if it looks like a good idea with broad appeal, and they can open a GitHub feature request to do the necessary development work. At the end of the day, the key folk that need to hear already hang out in both places, so I don't think you have to worry about this falling through the cracks.

LN · October 12, 2018, 2:52am

@dr_michaelmarks If I recall correctly, doing this kind of logging was discussed when the audit log was being designed but we ultimately decided against it being a default feature for two reasons:

1. Including potentially sensitive data in another place seems like it could be harmful.
2. There would be certain limitations to this (in line with the existing log limits). Most notably, events are on a per-screen basis so changes to values within a field list would not be tracked.

I think the first concern is mitigated by making it a configurable option on the log. Is the second acceptable for you in the short term?

dr_michaelmarks · October 12, 2018, 7:03am

@LN

Yes, I agree this would be a configurable option (perhaps set within the XLSForm as part of including the audit line, so that the form designer has control over this?)

1a) Could the log be encrypted? We do this with our clinical datasets now because of these issues and also GDPR regulations within the EU

If I understand this correctly, your point is that if, say, multiple fields are on the same screen:
Age:
Height:
Weight:

then if I enter Age, move to the Height field on the same screen, and then go back and change Age, this would not be tracked.
But if I moved on to the next screen:
Sex:
Marital Status:
And then went back to the Age/Height/Weight screen and amended the data this would be tracked?

I think that would be fine.

Below are actions I think should be tracked:
Minimum:
A)
User exits form either part way through (clicking save changes) or at the end of the form but does not mark form as finalised (save changes)
User reopens that saved form and goes back to any data entry field and amends it
This type of change should be tracked
(Note we use encryption so once they mark the form as finalised we block editing)

Ideal
B)
User is on a screen and enters the age
They move on to the next data entry screen
They then go back to the age screen (without exiting the form) and amend the age
Ideally this should also be logged

For comparison
I was just playing with REDCap, which is broadly considered to be GCP compliant, to see what kind of audit trail it maintains.
It maintains something equivalent to scenario A outlined above: if I complete a whole record in REDCap, mark it as saved, re-open it and amend a value, it can clearly show me that change.

dr_michaelmarks · November 28, 2018, 1:10pm

@LN @yanokwa
@chrissyhroberts and I would be interested in getting a sense of a ballpark figure for doing either the A) Minimum or B) Ideal implementation of this as above.
We have some potential money but I am bad at gauging the cost of this kind of thing.

dr_michaelmarks · January 15, 2019, 2:20pm

@LN @yanokwa
We have money in a grant which we would be interested in putting towards this.
Do you think we could get a cost estimate for the work?

@chrissyhroberts

Grzesiek2010 · February 1, 2019, 6:33am

So it looks as if we need a new column in the audit.csv file. Currently we have:
event, node, start, end
or
event, node, start, end, latitude, longitude, accuracy
if location tracking is enabled.
A new column could be named just answer.

We would need to fill that column only in the case of questions. Questions are interval events, which means we set start and end dates for them, so my approach would be to fill the new answer column once the end date is set (this happens when a user leaves the question: navigates to another one, opens the HierarchyView, etc.)
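The proposal above could be sketched roughly as follows. This is an illustration in Python, not Collect's code: the `answer` column is only the name suggested in this post, and `close_question_event`, `write_audit`, the node path and the timestamps are all made up for the example.

```python
import csv
import io

# Existing audit.csv columns from the post above, plus the proposed
# (hypothetical) "answer" column.
COLUMNS = ["event", "node", "start", "end", "answer"]


def close_question_event(event, end, answer):
    """Return a copy of an open question event with end and answer filled in.

    The answer is written only at this point, i.e. when the user leaves
    the question and the end of the interval becomes known.
    """
    closed = dict(event)
    closed["end"] = end
    closed["answer"] = answer
    return closed


def write_audit(events):
    """Serialize events to CSV text using the assumed column layout."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for event in events:
        writer.writerow(event)
    return buffer.getvalue()


# The event is opened when the question is shown; answer stays empty for now.
open_event = {"event": "question", "node": "/data/age",
              "start": 1549002780000, "end": "", "answer": ""}

# The user navigates away: set end and record the answer at that moment.
closed_event = close_question_event(open_event, 1549002785000, "34")
audit_csv = write_audit([closed_event])
```

Non-question events would simply leave the `answer` cell empty under this layout.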

ggalmazor · February 1, 2019, 7:24am

Hi all! Seeing that this thread is coming to life again, I just wanted to point out that the TSC is discussing this feature at https://github.com/opendatakit/roadmap/issues/30.

There are two ongoing topics, in different degrees of consideration:

Remember the answer to a question from the last saved submission, in order to pre-load the answer when loading a form for the first time.
- There seems to be a consensus about this one, although we're waiting on more opinions about it. This one will probably be the first one to be implemented.
Remember all the answers to all the questions ever answered for a form in a device, and offer some sort of autocompletion feature with them.
- This one needs more discussion and will probably be delayed until after the previous feature gets shipped, to get more user feedback.

Feel free to comment!

Grzesiek2010 · February 1, 2019, 7:37am

I think it's not the same, @ggalmazor. This is not about any auto-filling/preloading. The user who asks about the feature just needs the history of changes.

dr_michaelmarks · February 1, 2019, 7:48am

Correct - this is an audit trail of changes not auto population.

ggalmazor · February 1, 2019, 9:07am

I understand. Thanks for the clarification!

yanokwa · April 23, 2019, 8:01pm

@Grzesiek2010, @LN and I have been iterating on this feature and we've made good progress! The key design decisions we've made so far are:

The feature can be enabled/disabled through changes to form design
- Likely an attribute in the form called odk:audit-track-changes (or something similar) which can be set to true.
If enabled, a new event called value change (or something similar) will be added. That event will have a new column called value where we write the changed value. We will include event, node, timestamp, and location columns.
- This approach makes for easier-to-analyze logs and we don't have to solve the "how do you represent NULL" problem that comes with using the question event to store this data. More here.
Every value that is changed (on swipe/next) will be written to log (even if in a field-list)
- We will not log things a user can’t see (e.g., calculates) because those aren’t necessarily triggered on swipe. Calculates are constantly being re-evaluated and it would be a lot in the log.
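As a rough illustration of the last two decisions (in Python, not Collect's actual code), the values on a screen can be compared before and after a swipe, emitting one event per changed, user-visible field. The event name `value change` and the `value` column are the tentative names from this post; the helper `value_change_events`, the node paths and the timestamp are assumptions made up for the sketch.

```python
def value_change_events(before, after, visible_nodes, timestamp):
    """Build one hypothetical 'value change' row per visible field that changed.

    before/after map node paths to the values captured when the screen
    was entered and when the user swiped away. Nodes the user cannot see
    (e.g. calculates) are skipped by leaving them out of visible_nodes.
    """
    events = []
    for node in visible_nodes:
        old, new = before.get(node), after.get(node)
        if old != new:
            events.append({
                "event": "value change",
                "node": node,
                "timestamp": timestamp,
                # A cleared answer is written as an empty string.
                "value": "" if new is None else new,
            })
    return events


# A field list with three questions on one screen, plus a calculate.
before = {"/data/age": "34", "/data/height": "170",
          "/data/weight": "70", "/data/bmi": "24.2"}
after = {"/data/age": "35", "/data/height": "170",
         "/data/weight": None, "/data/bmi": ""}
visible = ["/data/age", "/data/height", "/data/weight"]

events = value_change_events(before, after, visible, 1556049660000)
```

Because a row is only written when something changed, an empty `value` here can be read as "cleared" without a separate NULL marker.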

We have an initial pull request to evaluate feasibility of the above and will update this topic as we make progress. One current unknown that I'm looking into is if/how we store extra information about the reason for the change.

Grzesiek2010 · April 23, 2019, 8:13pm

Only the event, node and new value columns should be filled in for that new event, right? I don't think it makes sense to record time since it wouldn't be the time of answering a question but the time of navigating.

LN · April 23, 2019, 9:35pm

I agree that it's redundant and not exactly when the enumerator modified the data but my sense is that it will be easier for users to analyze the data if those columns are populated. That is, they at least give a sense of when the change happened and so it's possible to get a reasonable picture of the edits made to the data by filtering the CSV to see only value change events.
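A minimal sketch of that kind of filtering, assuming a log with a `value change` event and a `value` column as tentatively named earlier in the thread. The sample rows, node paths and timestamps are invented for illustration and do not come from a real audit.csv.

```python
import csv
import io

# Made-up log text; the timestamp of each value change is assumed to be
# stored in the existing "start" column.
SAMPLE_LOG = """event,node,start,end,value
question,/data/age,1556049600000,1556049605000,
value change,/data/age,1556049605000,,34
question,/data/name,1556049605000,1556049612000,
value change,/data/name,1556049612000,,Ana
value change,/data/age,1556049700000,,35
"""


def value_changes(log_text):
    """Return only the value change rows, ordered by their timestamp."""
    rows = csv.DictReader(io.StringIO(log_text))
    changes = [row for row in rows if row["event"] == "value change"]
    return sorted(changes, key=lambda row: int(row["start"]))


# The sequence of edits, in the order they were written to the log.
edits = [(row["node"], row["value"]) for row in value_changes(SAMPLE_LOG)]
```

With the timestamp populated, the same filter also gives the order of edits, which is what the ordering question in this thread comes down to.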

dr_michaelmarks · April 23, 2019, 9:50pm

It would definitely be preferable (ideal) to get a clear change history so you can see not only the changes made but also the order in which they were made. Ideally with a date/time stamp.

yanokwa · April 23, 2019, 10:04pm

I agree that we want to have timestamp because it does make it a lot easier to order the changes. And I'd also argue for including the location values too. I've updated my post accordingly.

This will be a lot of data, yes, but this is an opt-in feature, and Central and Aggregate on Tomcat can both zip the data in transit.

LN · May 3, 2019, 4:27am

Decisions like the structure of the log are very hard to change in the future so I want to make sure we've carefully considered which is going to be most useful to users. That depends on the type of analysis that is eventually going to be done and I don't have a good feel for that.

I see three different options that have been discussed:

1. Adding old-value and new-value columns and tracking both as part of a question event.
2. Adding a value column and tracking the current value as part of a question event.
3. Adding a new value changed event and a value column that would only be populated for that new event.

To make things concrete, I have put a form and examples of the three logs in this Gdrive folder. Each log records the following form-filling session: swipe through and fill all fields, swipe back from school details to age to see age without modifying it, swipe forward to school details without modifying it, swipe to the end screen, swipe back to school details and fix teacher name and clear first class time, swipe to age and modify it, jump to last name to view it and then jump to end. @Grzesiek2010 also has two prototype implementations at collect#3042 and collect#3024.

The big question for me is whether users want to be able to identify when values have changed with simple spreadsheet analysis rather than through visual inspection or more sophisticated analysis software. To make this concrete, do we think users will want to answer questions like:

which question's value was revised the most times (this could help identify an unclear question if done across enumerators)
how many questions' values were modified by this enumerator?

If this is desirable, then I think always logging the current value when there's a question event (option 2), may not be ideal because detecting when a change occurred requires doing comparisons across rows.
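To show what "comparisons across rows" means for option 2, here is a small Python sketch. It assumes each question event carries the current value, reduced here to `(node, value)` pairs in log order; the helper `detect_changes` and the sample rows are hypothetical and not taken from the example logs.

```python
def detect_changes(rows):
    """Find changes by comparing each row with the previous row for the same node.

    rows is a list of (node, value) pairs in log order. The first visit to
    a node is treated as the baseline, not as a change.
    """
    last_seen = {}
    changes = []
    for node, value in rows:
        if node in last_seen and last_seen[node] != value:
            changes.append((node, last_seen[node], value))
        last_seen[node] = value
    return changes


# Made-up session: age is viewed twice without edits before being revised.
rows = [
    ("/data/age", "34"),
    ("/data/teacher", "Smith"),
    ("/data/age", "34"),
    ("/data/teacher", "Smyth"),
    ("/data/age", "35"),
]

changes = detect_changes(rows)
```

The state kept in `last_seen` is exactly the part that is awkward to reproduce with simple spreadsheet filtering.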

I don't have a strong sense of the tradeoffs between options 1 and 3 when it comes to analysis. There's also the possibility of a hybrid approach where old-value and new-value are only logged when there's a change. I have included that as option 4. This makes it easy to identify when changes occurred and what those changes are. Logging only when there's a change can't be done with a single value column because then there is no difference between no change and a change to a blank value.
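For option 4, the "revised the most times" style of question reduces to a row-local test, sketched below in Python. The `old-value` and `new-value` column names come from the options above; the sample rows, node paths and the helper `revisions_per_node` are invented for illustration and are not the contents of the example logs.

```python
import csv
import io
from collections import Counter

# Made-up option 4 log: old-value/new-value are filled only when a
# question event changed the value; a plain view leaves both blank.
SAMPLE_LOG = """event,node,start,end,old-value,new-value
question,/data/age,1556857600000,1556857605000,,34
question,/data/teacher,1556857605000,1556857610000,,Smith
question,/data/age,1556857620000,1556857622000,,
question,/data/teacher,1556857630000,1556857640000,Smith,Smyth
question,/data/class_time,1556857640000,1556857645000,09:00,
question,/data/age,1556857650000,1556857655000,34,35
"""


def revisions_per_node(log_text):
    """Count, per node, the rows where old-value and new-value differ."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(log_text)):
        if row["old-value"] != row["new-value"]:
            counts[row["node"]] += 1
    return counts


counts = revisions_per_node(SAMPLE_LOG)
```

The `class_time` row shows why two columns are needed: a change to blank (`09:00` to empty) still differs from a view with no change (both empty), which a single value column could not express. In this sketch the initial entry of a value also counts as a change from blank.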

CharlieKeyes · May 3, 2019, 9:18am

I would be quite interested in this feature, particularly for clinical trials whereby an audit trail is needed for trial monitoring and reporting.

Thanks for these examples, I think a visual inspection of the values that have changed is reasonable, as well as the means to undertake a more sophisticated analysis with software. I would prefer not to have the old and new value columns, presumably new columns would be added every time a value is changed?

I would prefer the new value changed event and value column approach (3audit-values-new-event). But if it is not possible to keep the values in one column then option 4 (4audit-values-old-value-new-value-only-on-change) seems valid and appropriate. It also makes it easier to read across the rows to see when and what the change was.

best,
John