Pull submissions starting from date

odk-briefcase

(Chrissy h Roberts) #1

Hi, I am waking this old thread up as it has useful information on how to start a Briefcase query on old data from a point other than the start of time.

Here's why

  1. I have a lot of data on the server, but would prefer not to purge it, as (a) it is safe there and (b) international partners may want direct download access

  2. When I use Briefcase to pull, the app spends a very long time querying old forms that have already been downloaded

  3. I'd like to start querying at a time-point at or near where the last pull left off (or to specify a date/time)

The idea in the thread above, using a push to set the time-point for the next pull run, seems nice, but it is unclear:

(i) this would require dummy data to be added to the server
(ii) how to do it using briefcase / briefcase CLI
(iii) whether briefcase would remember the correct time-point if I used briefcase for another server between pulls

Would there be scope for a future Briefcase feature, "Start query at date:"?


Pulling all data through ODK briefcase
(Michael Marks) #2

Given I work with @chrissyhroberts it is perhaps unsurprising that I agree it would be very useful to be able to say "Pull from X date" and do so from the command line interface.
Michael


(Guillermo) #3

Hi, @chrissyhroberts, @dr_michaelmarks!

I've split the post to discuss this as a new feature for Aggregate and Briefcase. Would you care to edit your post and give a little more context?


(Guillermo) #4

I've also seen how pulling data from servers with lots of submissions can be slow. It looks like there's room for optimization when users know the last submission they've already pulled to their computers.

I understand this would filter submissions using the submission date field.

I think we should explore what happens when submissions are delayed and don't get included in pulls like this.

Scenario:
Maybe I do a pull every day at 3am to get yesterday's submissions, but a submission sent yesterday arrives at Aggregate at 10am, after my script has been launched.

Could weird scenarios like this one take place when using Briefcase to pull submissions from Collect and push them to Aggregate?


(Chrissy h Roberts) #5

Hi,
Thinking about your scenario, I would always set the date to start checking a couple of days or even weeks prior to the last pull. That way it can double-check for these stray submissions, but still not have to go back to the very beginning of time.

For instance, if I am on day 100 of a study, running through days 1-90 is pretty pointless, but days 91-100 could catch new submissions whilst still saving 90% of the time.
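The lookback-window strategy above can be sketched in a few lines. This is an illustrative helper, not part of Briefcase; the function name and the default lookback are assumptions:

```python
from datetime import date, timedelta

def pull_start_date(last_pull: date, lookback_days: int = 10) -> date:
    """Start the pull window a fixed number of days before the last pull.

    Re-checking a short overlap catches stray submissions that arrived
    late, without re-scanning the entire submission history.
    """
    return last_pull - timedelta(days=lookback_days)

# Day-100 example from the post: re-check days 91-100 only,
# skipping roughly 90% of the history.
start = pull_start_date(date(2018, 11, 10), lookback_days=10)
```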


(Guillermo) #6

I've been studying the code and I have some insights regarding this feature proposal:

  • Briefcase pulls data from Aggregate in batches of a maximum of 100 submissions each.

  • A batch consists of an XML document with:

    • A list of submission instanceIDs
    • A "cursor" that can be used to get the next batch
  • A cursor is an XML with:

    • The field used to order submissions. Currently we're using LAST_UPDATE_DATE
    • The last update date (ISO8601) of the last submission from the previous batch
    • The instanceID of the last submission from the previous batch
    • A boolean indicating whether the cursor is a forward cursor (purpose of this yet to be determined)
    <cursor xmlns="http://www.opendatakit.org/cursor">
      <attributeName>_LAST_UPDATE_DATE</attributeName>
      <attributeValue>2018-11-07T14:43:24.644+0000</attributeValue>
      <uriLastReturnedValue>uuid:d9b67b6f-2058-469b-8cb1-c86b9c34b632</uriLastReturnedValue>
      <isForwardCursor>true</isForwardCursor>
    </cursor>
    
  • In order to get the first batch of submissions, Briefcase sends Aggregate an empty cursor

  • Briefcase will continue asking for more batches until an empty batch arrives, which ends the pull operation

Here's an idea:

  • Briefcase stores the last cursor used for each form
  • We add a checkbox to "resume" the last pull in the Pull tab
  • When that's enabled, Briefcase sends the stored cursor instead of an empty one, effectively resuming the pull operation.

We could even build arbitrary cursors to resume pulls starting from different submissions, but the idea above seems like the smallest possible increment that brings value and would let us test this in the field.


(Chrissy h Roberts) #7

Sounds like an excellent idea. I assume that, because the cursor is based on the UUID, any new submissions that had taken a while to arrive would be in the new batches of data even if their 'submission dates' predated the time of the last pull.


(Guillermo) #8

In fact, the UUID is a secondary criterion used to filter what's part of a batch after we get the list of submissions ordered by last update date (the primary criterion).

Since Aggregate uses the last update date (metadata that Aggregate adds to every submission) instead of the submission date, we could expect those delayed submissions to make it into the batch, since their last update date would effectively be their reception date.
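The ordering described above can be illustrated with a toy example. The data and field names here are made up; the point is only that sorting on (last update date, UUID) puts a delayed submission last, regardless of when it was filled in:

```python
# Each entry is (instanceID, last update date as set by Aggregate on receipt).
subs = [
    ("uuid:c", "2018-11-07T14:43:24Z"),
    ("uuid:a", "2018-11-07T14:43:24Z"),  # same timestamp: UUID breaks the tie
    ("uuid:b", "2018-11-08T10:00:00Z"),  # delayed arrival -> latest update date
]

# Primary criterion: last update date; secondary criterion: UUID.
# ISO 8601 timestamps in the same zone sort correctly as strings.
ordered = sorted(subs, key=lambda s: (s[1], s[0]))
```

The delayed submission (`uuid:b`) sorts last, so a resumed pull whose cursor points just before it will still pick it up.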


(Guillermo) #9

I think we have the grounds for a new feature here. I'll document this into an issue so that we can discuss the coding aspect more comfortably.


(Yaw Anokwa) #10

@ggalmazor I'm really liking this idea and I'm moving it to the Features category.

Instead of the checkbox, it seems like we should just make things resumable by default? Seems pretty safe. And if we are worried we could add a global setting to do full downloads.


(Guillermo) #11

Making this feature on by default could affect users with other workflows/workloads, but I like the idea of moving the checkbox to the Settings tab, since it would also be used when exporting with the "pull before export" configuration parameter.

Could we settle on having a new "Always try to resume pulls" checkbox on the Settings tab and having it disabled by default?


(Yaw Anokwa) #12

Maybe I'm slow, but what exact problems are you expecting?


(Guillermo) #13

Shipping it disabled by default feels safer because it has no downside at all. Enabling it by default, on the other hand, could confuse users with different workloads/workflows, e.g. they could miss submissions until they realize they need to force full pulls.


(Michael Marks) #14

I don't understand enough about the technical aspects, but sometimes we set our workflows to delete downloaded XMLs after running analysis, and in that situation we want people to redo a full pull.
Equally, for our very large datasets we don't do this, so we want to be able to pull from the last point only.

I can see arguments in both directions for the default setting.


(Michael Marks) #15

An additional point: we would want to be able to control this using the command line interface, as we use it to run all our automated scripts.


(Guillermo) #16

Thanks, @dr_michaelmarks! I'll add it to https://github.com/opendatakit/briefcase/issues/681