How to Automate ODK Data into Visualization Dashboards

Hi fellow ODK-ers,

I'm wondering if this is the right category for this post - hopefully! To start off - I am not a developer. However, I want to try to understand the different possibilities for workflow or information flow of automating as much as possible the information/data management cycle. In my experience with ODK - new users start off with one or two forms for one or two surveys - and it's quite manageable to "humanly" manage pulling data, analysing in excel, throwing a report together, and sharing within your organisation.

BUT - once you get excited by the possibilities that ODK opens up - then you want MORE forms - to the point that then most organisations don't have automated systems to help them manage all their digital data collection - and loads of data then sits around never being used, because humans don't physically have enough time to manually pull, clean, analyze, and visualize all the different data coming in. This is especially true for multi-sectoral teams that are using ODK for daily/regular data collection.

Putting aside the fact that a recommendation could be "don't collect more data than you're going to be able to use" - organisations have to start somewhere. So I was wondering if any of you out there have best practices to share about how you automate some or all of the different parts of ODK data management?

I'm going to share what I know from my own experience first (excuse any incorrect terminology), and would be really happy if others have ideas about this, what tools work best for different steps, or what are "better" ways of doing this:

I'm assuming data is collected with ODK Collect and stored on an ODK Aggregate server hosted on Google Appspot.

1) Automating the data downloads. Included in this: a timer that will automatically pull new data submissions on a set schedule; a script to use ODK Briefcase to pull data so that encrypted submissions can be decrypted; a way of ensuring non-English alphabets are saved in UTF-8 format properly; saving all files in an acceptable format (.csv in my case) to a local or cloud (encrypted) server.
2) Automating data cleaning and data associations. Included in this: making sure labels are substituted for 1s, 2s, 3s, etc; associating "repeat" data with their parent keys; translating data results into multiple languages (based on the ODK form); flagging duplicates.
3) Automating data analysis. Included in this: setting up standard indicators to be calculated based on the collected data (such as age categories if collecting age, or "time to complete questionnaire" based on start/end times).
4) Automating data visualisation (maybe this is combined with #3 above). Included in this: loading clean data and visualizing indicators set up through data analysis; ideally, this visualization is via a URL that can be accessed by appropriate stakeholders through #5.
5) Automating data dissemination. Included in this: having a website that people can log into to view visualized data, ideally with user management capabilities to allow access to different dashboards by different stakeholders; or having a PDF report template set up that inserts new data automatically so that reports can be created automatically.

Previously, I've seen QlikSense Enterprise be used to do pretty much all of the above steps - however, that can be quite a significant investment for smaller organisations. Does anyone have suggestions for doing the above automations with free or close-to-free tools?

A couple use cases I'm interested in:

  1. Is there a way to automate this process to show data in a PHP/mySQL website dashboard?
  2. Could this be set up with PowerBI dashboards?
  3. I know next to nothing about Google Fusion Tables, which are mentioned in the documentation. Would this be a good solution?

Some related threads:
https://docs.opendatakit.org/aggregate-data-access/#publish-data
https://forum.opendatakit.org/search?q=php
https://forum.opendatakit.org/t/dashboard-recommendations-for-odk-data-and-connecting-to-dhis2

Okay, this is now too long. Thanks for your ideas!
Janna

5 Likes

Thanks for taking the time to write this, @janna!

Tiny weeny correction: It's "Google App Engine". "appspot.com" is the domain where App Engine instances can be accessed.

I know that PostgreSQL has the COPY command that lets you bulk load CSV files directly to tables. It works great and it can be used as part of an automatic data ingestion phase in your workflow. And PHP can interact with PostgreSQL too.

MySQL has the LOAD DATA INFILE command that achieves the same, but I don't have experience with that.

2 Likes

What tools do you usually use for data cleaning? Especially once I start trying to do things like merge repeat group CSV files, I find that doing the data cleaning in Excel starts getting more complicated and less reproducible. I find that statistical software like R, Stata, and SPSS are sometimes even more helpful for data cleaning than for data analysis. If you happen to be using Stata (which is not free), I've written a Stata program named odkmeta that imports and cleans data exported from Briefcase.

I think this sort of automation is a very interesting idea. I think ODK has the potential to enable it, since you can use the form to glean metadata about each field. For example, you can use the form to automatically determine the names of the fields used for start and end times.

In between basic data cleaning and data analysis, I think there is also a lot of potential for automated data checking. For example, some projects check the quality of their incoming data on a regular basis. I know that Innovations for Poverty Action has a system for implementing high-frequency checks of incoming data, which they describe on GitHub here. However, again that's a Stata-based workflow.

Power BI seems to work well for a number of workflows. If you're interested in Power BI, you may be interested in ODK Central, a new ODK 1 server for which we recently released a beta version. ODK Central supports OData out of the box, which means that you can use it immediately with Power BI, Tableau, and other tools.

In general, automation gets easier when the server offers an API. For example, one suggestion in the thread you mentioned was to use a combination of an API and Shiny/R. Aggregate has API options, and Central has a REST API.

2 Likes

@Matthew_White, thank you so much for your great response! Agreed - Excel is my last choice for data cleaning. I prefer something that is based on some sort of script, that I can track everything back to find errors or what exactly is happening.

Previously, I'd just scripted most of this process into QlikSense, which I can still do on the free desktop version, but not with an online shareable version. Do you know if PowerBI allow you to script in this type of data-cleaning?

I don't use Stata - but I've visited your odkmeta on GitHub before, actually. I wonder if anyone has a python or [R] script for this or for automating the data analysis?

I'm not actually sure what the difference between "API" and "REST API" is - however if ODK Central essentially makes it easier to link up to other software, then awesome. With humanitarian data, I prefer to use encrypted forms - I see that's not yet supported. But do you think it will be in the future?

1 Like

I'm actually not sure myself. @issa, do you know whether Power BI supports this? Or is there a compatible tool that you'd suggest for completing this sort of data cleaning/transformation that is free or close to free?

I haven't heard of one myself. However, let us know if you find one!

A REST API is an API that conforms to a set of conventions that make it easier for humans and other software to interact with the API. With Central, one of our goals from the start has been to provide an extensive API: everything you do interactively through Central's online user interface you can also do through the Central API.

We don't support encryption yet, but we plan to do so in the future!

Power BI has some tools built-in, but they're not as comprehensive as you'd find in some other tools. On the other hand, the big advantage Power BI (and even Excel using PowerQuery) has over normal Excel is that you can set your cleaning scripts up as nondestructive data intake tasks, so that next time you have a new piece of data (ie when hitting refresh on the OData feed) all your cleaning steps are run again on the data. I believe you can also connect R to Power BI and use R to do your cleanup.

3 Likes

I was wondering about this too.
I think I'll go with R/Rstudio, and then learn some Shiny to publish. For the moments the visualizations were for myself, so I didn't really bother, but I'll have to start publishing the visualizations, so will need to look for other options.
Anyone has experience with Tableau? Have heard good things about it but hasn't tried it out yet.
Those would be my recommendations for non-developers. I do have a background in C programming. Haven't done anything in Python yet but I think it could be an option, and using one of the dashboards for visualization (dash, bokeh)

1 Like

Hi to all,

since 3 years now we use odk to collect field species data. Aggregate in installed over a PostgreSQL Database, so my experience is not really generic.
Alle the validation work is made within PostgreSQL.

What we did first was creating an odk schema into our "central" database and set some triggers to automatically show collected data into our web house-made tool so our colleagues can control and correct data.

As an evolution we externalized our aggregate database to allow other NGO to use it.

I created "read-only" Foreign Tables into our 'central database" pointing Aggregate forms tables. And my trigger became on cron task wich runs "insert into working_table SELECT new datas from ODK Foreign tables.

Since 6 months I am trying some opensource web reporting tools (Metabase, Redash and Grafana). For the moment those tools are connected to our historical database but as I mentionned a few days ago when I present thoses tries, I think we could easilly connect reporting tools to aggregate database to show and explore data and media collected threw odk.

For the moment I generate two report. The first one is an atlas-like web page showing a synthesis of all data per species, the second one as a synthesis of all collected data over the team.

As a conclusion I said that We are not so far from an out of the box (odk + redash) solution to collect and visualize data.

2 Likes

@mathieubossaert very interesting! As I've been chatting with people about this automation over the last couple of weeks, I've heard about "cron" tasks for the first time :slight_smile: (told you I wasn't a developer!). But that seems like the main suggestion people have for getting data into a mySQL database + PHP website visualisation.

Is this ODK + redash solution documented anywhere, or open sourced? Does it work for aggregate server on the App Engine? Or does it need to be PostgreSQL?

Sorry for the random questions - it seems you've thought this through a little bit more - thanks for your help!
Janna

At the begining we wanted to have a transparent process for the end user. Since 2009 we used to work with a web tool on wich we used to create data over gpx files.

My colleagues continue to use this tool a lot but odk make us think that we could spend more time on the field and less behind the computer.

So we worked on that unvisible workflow. My colleague are back at the office, they push the forms on Aggegate and they see it in our web application at the same time.

This was quite easy to do because PostgreSQL is our database engine since 2006. So we were able to use postgresql facailities to do it (no file transfert between a server and the database).

Here is a presentation of the context and how we do it : https://si.cenlr.org/OpenDataKit_au_FOSS4G-fr_2018

And here, in french, another presentation with voice (in french) : https://webcast.in2p3.fr/video/presentation_de_solution_de_types_boites_a_outils_pour_faire_un_carnet_de_terrain

Starting at 27' I push to aggeragte a form I created in the morning and I show (28'40") the data appears in our web app.

After that I explain how to do it.

This part of the work is documented as we created a "course" for collegues like me : https://framagit.org/formationodk/formulaires/wikis/gestion-et-récupération-des-données-en-base

The reporting tool we choose has to be installed and after that you "just" have to configure a connection to the database you want to explore (ie the aggregate database) and write SQL queries to show.

I am not a developper but I still had to install the tools and create scheduled (cron) tasks.

I hope this will help you a little but i'm sorry I just can talk about the PosgreSQL approach.

1 Like

We do a lot of automatic analysis and favour R/R-Studio because it is free, cross platform and relatively easy to use. The command line interface for ODK briefcase works well and can be added to a cron task (though beware if you don't really know what you are doing when setting up cron, full automation can be a dangerous game)

A really simple CLI version of a pull and export would look like this (note you need to have the ODK briefcase jar file in the same folder as the script and it probably isn't a great idea to hard code a password in to the script. Better to use secure keys or to run the script manually each day/week/whatever and to pass the password manually from command line as an argument

#!/usr/bin/env bash

## declare an array variable
declare -a arr=(

"FORM_ID"
"FORM_ID2"
"FORM_ID3"

                )

## now loop through the above array
for i in "${arr[@]}"
	do
	java -jar odkbriefcase.jar --form_id "$i" -f "$i" --storage_directory /Users/user/somedir/storagedir --aggregate_url https://urlofyourserver/servername --odk_username admin --odk_password your password -ed /Users/user/somedir/storagedir/output -pf storagedir.PRIVATE.KEY.pem -oc
   	done


Alternatively you can use software to control ODK briefcase from above.

The below gist gives the very basic version of a script that can download, decrypt and export CSV files from within R Studio. The added benefit of this is that (a) you don't have to work with command line which is nice for non-coders (b) you can add your downstream analysis to the same script file and (c) R is amazing and (d) it's a simple click once to do everything type thing.

Most simple thing to do is to have R produce some figures and tables in a folder, then to have those pushed automatically to whatever website you want (either manually or via an automated upload API). For most purposes you don't need 'real time' updates so you could pull once a day or week and update the files manually. Like I say, full automation is great when it is needed, but can be dangerous. Automatically pushing things to the internet can also be tough if you haven't got IT support to help you get it running.

3 Likes

Dears,
Is there a final/recommended solution for displaying ODK data into a web dashboard? @janna have tried any of the suggested options?
Thanks

My preferred approach is to use R and flex-dashboard. Very simple way to make a dashboard, but does need some R skills.

https://rmarkdown.rstudio.com/flexdashboard/index.html

We output data from aggragate to googledrive, we then run an automated R script to decode and compile our mulitple forms into a Postgres database, we then have Shiny dashboard that reads the compiled data from Postgres and present this to the end users. Although having Aggregate-> googlesheet add's another step we've actually found this to be useful as we can make changes (correct user errors) in the googlesheet before it is compiled which gives us a difference between what has been submitted to aggragate as our raw data and our corrected data in the googlesheet. We run our decode an compile scirpt once a day but we could set this to run every hour etc. One thing to note is if you have repeat groups add a unique ID to each of them as aggragate to googlesheets are prown to duplicating submissions.

Thank you @chrissyhroberts. I have been using ODK for the last four years using ODK aggregate but not R and flex-dashboard. So can you give me a guide to this end. I hope you have gone through this in a better way.Do I need to use PostgreSQL which you have suggested in your previous logs.
Thanks

Thank you @Stuart I see some of the tools are also suggested by others as well. As a newbie to the Dashboard task, i am trying to see how I should proceed.It will be great if there is any guideline for this staff.

Thanks everyone, @chrissyhroberts

I am currently working on ODK Rstudio integration and this discussion is very resourceful!

More questions/clarifications/comments coming soon.

Asante
Paul

1 Like