Want to download only new records

odk-briefcase

(Frances James) #1

Hi everyone,

My team and I have developed an SQL-Lite database that combines the two .csv files exported by ODK Briefcase (1.9.0 Production) into a format which can be analysed easily by fisheries officers.

However, the database is becoming slower and slower as it erases and uploads the whole dataset every time.

I have just noticed that under the 'Export' tab on Briefcase you can select a specific date range. I am hoping this will solve my problem? Or is there a much easier way of just exporting new data from the server?

Many thanks,

Frances


(Yaw Anokwa) #2

Hi @Frances! I wanted to tease out a couple of things here...

If you do a pull, Briefcase will always scan the entire Aggregate database to see if a submission on the server exists on the local machine. If it doesn't then it'll download the submission. On Briefcase v1.7 and greater, both the existence check and the download (if parallel downloads is enabled) happens extremely quickly.

If you do an export, Briefcase doesn't talk to the server at all. It takes the data on your local machine and exports that. And yes, as you noted, you can export a particular date range.

Some optimizations:

  • If you are working in the GUI, upgrade to Briefcase v1.10 (found at https://github.com/opendatakit/briefcase/releases), and you can pull and export in one operation.
  • If you switch to the CLI, then you can write a script that does the pull and exports a date range that you can have the script configure (e.g., a Bash script that figures out yesterday's date and makes that the start_date).

I don't know how you are generating your SQLite DB, but if you are comfortable in Python, you can write a script that uses pandas that takes the exported CSVs and does whatever merging you need and output to SQLite. If you find that your SQLite DB is being slow, then run VACUUM! https://blogs.gnome.org/jnelson/2015/01/06/sqlite-vacuum-and-auto_vacuum/ has a quick explanation.


(Frances James) #3

Hi @yanokwa

Thanks for replying and apologies for not acknowledging sooner!

I am using a GUI. So if I understand correctly, I have 2 options.

  1. Pull all submissions and export only the dates I want (not ideal as not automatic)
  2. Upgrade from v.1.9 to v.1.10? Will I be able to export the latest data this way, i.e. does the system recognise the last submission that was downloaded previously. Sorry this is the bit I don't quite understand.

I am afraid I am not versed in database language (my colleague has built our system) so I have asked him to help me with your last paragraph.

Many thanks again,

Fran


(Frances James) #4

@yanokwa just checking you saw my reply above? Many thanks :slight_smile:


(Yaw Anokwa) #5

Hi @Frances, let me try to be a little more precise!

Briefcase scans Aggregate's entire data store every time you pull and tries to download every submission that Aggregate has. Briefcase has no idea what submission is "new". It just tries to get every submission that it does not have. It's not straightforward to change this behavior.

Once submissions are in Briefcase, you can specify a date range for export. If you use the GUI, you have to manually set that date rage. If you use the CLI, you can wrap that export command in a script that updates the start date each time you run the script so it exports the data from that day.

I haven't tested this script (macOS, Linux) lately, but I'd be something like...

#!/bin/bash

JAVA_BIN="/usr/bin/java"
BRIEFCASE_JAR="/home/briefcase/briefcase.jar"

AGGREGATE_URL="https://example.com/ODKAggregate"
ODK_USERNAME="briefcase"
ODK_PASSWORD="mysupersecretpassword"

STORAGE_DIR="/home/briefcase"
EXPORT_DIR="/home/briefcase/exported"

FORM_IDS=($(curl --silent "X-OpenRosa-Version:1.0" "$AGGREGATE_URL"/formList 2>&1 | awk -F "=|\"" '{print $4}'))

YESTERDAY=$(date +%Y/%m/%d -d "1 days ago")
TODAY=$(date +%Y/%m/%d)

for FORM_ID in "${FORM_IDS[@]}"; do
	echo "Downloading $FORM..."
	"$JAVA_BIN" -jar "$BRIEFCASE_JAR" --form_id "$FORM_ID" --storage_directory "$STORAGE_DIR" --aggregate_url "$AGGREGATE_URL" --odk_username "$ODK_USERNAME" --odk_password "$ODK_PASSWORD"
	echo "Exporting $FORM..."
	"$JAVA_BIN" -jar "$BRIEFCASE_JAR" --form_id "$FORM_ID" --storage_directory "$STORAGE_DIR" --export_directory "$EXPORT_DIR" --export_filename "$FORM_ID-$YESTERDAY".csv --exclude_media_export --overwrite_csv_export --export_start_date "$YESTERDAY" --export_end_date "$TODAY";
done

So if you ran the above script every day with cron, then the $YESTERDAY and $TODAY variables will change and give you daily exports which would be the "new" submissions.