Many Fusion Table duplicates being created

odk-aggregate

(Greg Pommen) #1

We use ODK Collect and Aggregate to collect and submit data to Fusion Tables which is our main repository after collection. We are mostly a seasonal operation, with prime data collection occurring between roughly March to October of every year. We've been slowly becoming more reliant on it since 2011.

There have always been a few duplicate entries, but nothing major, and easy enough to remove. This year though, I've been seeing up to 100 to 200 duplicate entries in our bigger tables, around 1500 to 2000+ records. I've seen as many as 5 duplicates per record, although usually not more than 2 or 3.

I don't know what could have changed this year though. Most of the entries I've seen on the topic in the forum here suggest this was a problem mostly taken care of years ago.

What ODK tool and version are you using? And on what device and operating system version?

As far as I can tell, this has occurred while using Aggregate 1.6.x since March, and whichever Collect versions have been released during that time frame.

What steps can we take to reproduce the problem?

I am not sure. We just submit forms and get duplicates.

What you have you tried to fix the problem?

I'm not sure what to try. I'm hoping for suggestions. It's possible there is something external to ODK in our environment interfering with Fusion Table's publishing reply, but I have to start with ODK first.

I'm attaching a file with two GAE Application logs from Stackdriver Logging that I think recorded an instance of duplicate records being created today.

Thank you!
gae_app__logs__2018-11-02T01-00.json (62.9 KB)


(Yaw Anokwa) #2

@Greg_Pommen These sorts of problems are hard for us to track down because it's hard to reproduce. Most of the time, it's an issue on Google's end (of course, it's never our bug :smile:)

Can you give me a sense of size? How big are these forms? And how many records do you have? Does the data that ends up in Fusion Tables have IDs that you can use to dedup?


(Greg Pommen) #3

@yanokwa LOL, yes, always blame Google! :grin: In this case I can see it being Google, ODK might have something different in this year's versions, or perhaps it's our IT department's security software interfering somehow. But I have very little ability to contact the other two so I'll start with ODK!

Size of forms, does 157 to 556 lines of code help? We currently have 34 ODK forms active, with some having 0 records (no one seems to use those but then no duplicates either!), then ranging from 10 to 1320 records in 26 forms that are actually used. All of those have had duplicates. My apologies, we used to have over 2000 records in some tables but that was in previous years.

I use the meta-instance-id field to find the duplicates in the Fusion Tables. When there are many duplicates (10+), I export the CSV, eliminate the duplicates in that, then delete all the records in the affected Fusion Table, and import the cleaned file. When there are 100+ duplicates, I can't think of a better way to get rid of them.


(Guillermo) #4

Hi, @Greg_Pommen!

You mentioned that your activity is seasonal. Do you start with a clean slate each season or do you reuse the same Aggregate database and/or Fusion Table instances?


(Greg Pommen) #5

Hi @ggalmazor!

Most are clean slate, but a few we have been using for years. We're getting the same duplication problem, regardless, Aggregate and Fusion Tables.


(Guillermo) #6

Hi, @Greg_Pommen!

Unfortunately, I can't suggest anything at this moment about your problem :frowning:

These kind of errors are tricky because there are many possible error points. Any error down the chain could trigger resending data even when data could be effectively being written in Fusion's end. We can't make Aggregate check whether data has already been sent to Fusion before sending it because that would consume quota much faster, and it could increase expenses too.

In any case, we've tried to stay clear from changes in core features of Aggregate since v1.4 because we're eventually replacing it with ODK Central. We have solved some web UI issues, we have improved security, and provided better VM/Docker alternatives, etc. but we haven't changed the publishers.

Looking at the logs you've attached, I can see some errors while Aggregate tries to acquire write locks on the database, but there are some recovery mechanisms in place for that and it doesn't look like the source of the issue you're experiencing.

Is there any other error on the logs?


(Guillermo) #7

Hi, @Greg_Pommen!

We're trying to figure out what to do with this issue... Would you be so kind to write some lines explaining why you're using Fusion Tables among all the other alternatives to take data out from Aggregate? How do Fusion Tables fit in your data workflow?

Thanks!


(Greg Pommen) #8

I'll try to keep a long story, short. Our unit has a tiny budget for IT stuff. I'm neither a permanent nor an IT worker. ODK and Fusion Tables allow us to use real time data collection and mapping features without having to run our own servers that we neither have the time nor budget to implement. I can review and correct flawed data collected in the same day easily. Since the City of Edmonton adopted the Google environment, sharing data is even easier than before. The Fusion Tables API makes it fairly simple for me to make custom dashboards and website displays. The data can be accessed through Google Apps Script for backups, and can use Javascript with SQL statements to modify the data and maps from a web browser. We can collect data on a smartphone, view it centrally, and view it on the same smartphone within seconds. These are abilities that are only starting to be implemented in other sections of our organization using ESRI ArcGIS technology. I find it even funnier that the Survey123 uses Xforms like ODK!

Problems: ODK doesn't work well within our corporate environment since we need an external Google account to run the AppEngine, and therefore the Fusion Tables too. Google's support for Fusion Tables seems to be iffy, and it hasn't been updated in years. One reason I have to use the Fusion Tables API for custom websites is because Fusion Tables do not display on smartphones very well. We never had a problem with duplicates before this year. Now they are problem that I'm not looking forward to solving in the future.


(Yaw Anokwa) #9

@Greg_Pommen I've got some emails out to try to get some face time with the Fusion Tables team and I'll update this topic once I've made contact. Until that happens, I'd like to explore some alternatives...

  1. Do you think you'd be able to use Google Sheets instead of Fusion Tables? The limitations that I see are support for repeats and very large data sets, but we might be able to flatten data in Aggregate and Google Sheets now supports 5 million cells.
  2. Would something like ODK Central's OData support help? I'm not proposing you switch. I'm just curious that functionality addresses your use-case.