Data recovery from a failing Aggregate server

What is the problem? Please be detailed.
I have an ODK Aggregate server that is reporting numerous errors indicating problems in the database. I am using ODK Aggregate v1.5.0 on Google App Engine.

I have deleted all submissions that were uploaded normally. However, there are a number of incomplete submissions that contain data we desperately need to recover. The frequency of the error pop-ups prevents us from even seeing the entries in that section.

I would really appreciate some assistance with recovering the data, or with stabilizing the database so that the error pop-ups subside, the submissions can be marked as complete, and the data retrieved that way.

Below is a screenshot of a typical error message.

Hi @Mark_Schormann1. Can you provide a bit more information about how the server got into this state?

  1. Was this a new server that had a fresh install of v1.5.0, or was it upgraded from a previous version? If it was an upgrade, what version did you upgrade from?
  2. Was there something that happened (e.g., an upgrade, a big campaign) before these problems surfaced?
  3. How did you delete the submissions (e.g., in the UI, in the underlying database)?
  4. What, if anything, have you tried to fix the problem?
  1. This server was upgraded at least once. The immediate prior version was 1.4.5
  2. The server usage was heavy, and in doing the upgrade I think I did not adjust the frontend and backend instance classes back to the F4 and B4 settings that had previously been set for App Engine performance
  3. Recently, I had only deleted submissions using the X next to the records in the Submissions tab. However, in the past, I have used the Database Admin tool within the Google Cloud Console to manually delete all the files associated with the forms submitted.
  4. I have deleted all normal submissions to isolate the data on the server to just what is of interest. I have also recovered the image data from the tablets associated with the submissions. What is missing is the XML data. The missing submission data is suspected to be amongst the Incomplete submissions.

I have never seen that error message in Aggregate and my guess is that the manual changes you made in the underlying database had unintended consequences.

At this point, I think our best bet is to try to recover this data manually from the DB, then delete the entire install. What do you think, @ggalmazor?

From my point of view, that seems the most sensible approach. Now, I just hope there is a mechanism to do that and recover that data....

Just for the sake of completeness - the procedure I used for deleting the data in the underlying databases is one I had used previously approximately 20-30 times. It's a terribly tedious process, but generally, it was the only way to bring an Aggregate server back to an operational state once there were numerous error pop-ups from the database. I know there is an alternative process described (I think it was by Mitch) somewhere using the data contained in the error pop-up and the error logs, but that often proved to be a rabbit hole.

Anyway - the point being made is that the process had always worked and had never produced side effects like this. However, I understand that manual changes that are outside of what the developers intended, always carry the real risk of side effects.

Can you please invite me (yanokwa@nafundi.com) as an owner to your App Engine instance?

I will do so shortly. Thanks.

And it is done!

Hi, @Mark_Schormann1 and @yanokwa!

The error from your screenshot is telling us that a query to get the media attachments of a form is taking so long that it's timing out. This would be consistent with a very large dataset combined with a probably undersized instance.

Does the procedure you follow to delete submissions include removing entities from all the *_BN, _BLB and _REF tables?

@yanokwa, could you confirm that there is an error log from 2018-07-13 21:41:48 that points to the BinaryContentManipulator.java file, line 584-ish?

Hi @ggalmazor - yes, I delete all tables that contain any form data - but I leave the tables that only contain meta-data.

@ggalmazor I've given you read access to the instance. Can you poke around and see the least amount of effort it'd take to get the data out?

Of course! :+1:


Hi, @Mark_Schormann1!

I've written a basic post with some of the processes and tools I've used to help you with your Aggregate deployment. I'll be expanding on that topic there.

Regardless, I wanted to comment here some specifics about the problem you were having and the way I used some of the tools.

The main problem you were reporting was that you wanted to extract data from incomplete submissions. Normally, one would have used Aggregate to mark the submissions as complete and used Briefcase to pull/extract them, but you were experiencing issues with Aggregate that were preventing you from achieving that, and these issues were related to some structural flaws in the DataStore.

This is a rough list of steps we have taken to solve this situation:

  1. Make a complete backup of the DataStore data to a Google Storage bucket

  2. Use DSIO to extract all data from incomplete submissions:

    1. Get a list of all kinds in the DataStore with SELECT * FROM __kind__
    2. For all 4 forms present in Aggregate, download to a CSV file all rows from the main _CORE kind that had their _IS_COMPLETE field with a false value.
      GQL: SELECT * FROM %some_kind_prefix%_CORE WHERE _IS_COMPLETE = false
    3. Open those CSV files in LibreOffice and extract all different values of the _URI field
    4. Get all the rows of non-main kinds for all the distinct URIs that we got in the previous step.
      GQL: SELECT * FROM %some_kind% WHERE _TOP_PARENT_URI = '%some_URI%'

    At this point, we have a complete backup and CSV files of the submissions we wanted to safeguard. We can move on to riskier actions. We focus on artificially marking all those submissions as complete.

  3. Use DSIO to extract to YAML files all incomplete submissions (only main kinds)

  4. Add a _MARKED_AS_COMPLETE_DATE: datetime field to the schema in the YAML file's header block, on a new line before the _IS_COMPLETE: boolean field definition.

  5. Add a _MARKED_AS_COMPLETE_DATE: 2018-07-23T00:00:00.000Z field and value to all the entities in the file.

  6. Change the _IS_COMPLETE field's value to true of all entities in the file.

  7. Use DSIO to upsert the changed YAML file.

This updated the DataStore in a way that tricked Aggregate into thinking that those submissions were complete. It didn't stabilize Aggregate enough for normal use, but it opened up the option of using Briefcase to pull and export/push all these submissions, which was a huge win at this point.
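Steps 4-6 above amount to a mechanical per-entity transformation. As a minimal Python sketch of that transformation (plain dicts stand in for DSIO's YAML entities, the timestamp format differs from DSIO's own, and `mark_complete` is a hypothetical helper, not part of DSIO or Aggregate):

```python
from datetime import datetime, timezone

def mark_complete(entities, when=datetime(2018, 7, 23, tzinfo=timezone.utc)):
    """For each incomplete submission entity, add a _MARKED_AS_COMPLETE_DATE
    timestamp and flip _IS_COMPLETE to true, ready to be upserted back."""
    marked = []
    for entity in entities:
        entity = dict(entity)  # copy; avoid mutating the caller's data
        if not entity.get("_IS_COMPLETE", False):
            entity["_MARKED_AS_COMPLETE_DATE"] = when.isoformat()
            entity["_IS_COMPLETE"] = True
        marked.append(entity)
    return marked

rows = [{"_URI": "uuid:abc", "_IS_COMPLETE": False}]
print(mark_complete(rows)[0]["_IS_COMPLETE"])  # prints True
```

The same edit could of course be done by hand in the YAML files, as described above; the sketch just makes the invariant explicit: every entity that gets a completion date must also have its completion flag flipped.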


Thanks Guillermo @ggalmazor - that's quite a process, but I suspect it was the only way to de-convolve the issue.

As you say, generally one should never have important data in the "Incomplete submissions" section of Aggregate. However, if an Aggregate server is under high load, with many uploads per day or large submissions with lots of photos, there can actually be a LOT of incomplete submissions. Ironically, the same circumstances also cause the other issue: potential database corruption, which can make all the data inaccessible via the Aggregate interface as it starts reporting errors in the underlying database. This can also make the Incomplete submissions section inaccessible.

But therein lay the actual crux of the problem. Whilst one can download from the normal submissions section using Briefcase to pull the data (even when there are database issues and the Aggregate interface is becoming unusable), one has no similar way to access/download/pull the incomplete submissions; they become completely inaccessible. Thankfully, the process you devised above got the data out.

Thanks again
Mark


This is great feedback, @Mark_Schormann1. Can you give me a sense of how many photos per form, the size of the photos, and the number of submissions per day? And do those submissions come in at the same time or spread over the day?

Relatedly, did you know you can scale down some or all images on the device with https://docs.opendatakit.org/form-question-types/#scaling-down-images
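For reference, that scaling is set per question in an XLSForm's survey sheet via the `parameters` column; a small fragment (column layout per the XLSForm spec, the question name and pixel value are illustrative):

```
type   | name       | label             | parameters
image  | site_photo | Photo of the site | max-pixels=1024
```

With `max-pixels=1024`, Collect scales the photo so its longest edge is at most 1024 pixels before upload, which can dramatically shrink submissions with many photos.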

@ggalmazor can you do some quick research to see if we could have some flag in Aggregate or Briefcase to consider incomplete submissions as complete?

I got some answers from Mark that, with his permission, I wanted to document publicly.

Can you give me a sense of how many photos per form, the size of the photos, and the number of submissions per day? And do those submissions come in at the same time or spread over the day?

Photos per form are generally in the region of 150-300 - but there can be more sometimes. The number of submissions per day is typically in the region of 15-20. They are typically spread out over the course of the day.

Why not put fewer photos per form? Or rather, why not redesign the form so you are doing one submission per photo?

It's already quite complicated with different forms for different situations. I don't think it would be feasible to split that way. What I am doing now is running 3 Aggregate servers in parallel to reduce the load on each. Seems to be working so far.


Hi! We've released Briefcase v1.13 today, which comes with a new CLI param to include incomplete submissions when pulling forms. This will only work with Aggregate v1.7 at the moment. Check it out here: ODK Briefcase v1.13
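For CLI users, a sketch of what such a pull might look like (the flag names, jar name, URL, credentials, and form id below are all placeholders/assumptions on my part; verify the exact parameter names against `java -jar ODK-Briefcase-v1.13.0.jar --help` and the release notes):

```shell
# Pull a form from an Aggregate v1.7+ server, including incomplete
# submissions (all values below are placeholders).
java -jar ODK-Briefcase-v1.13.0.jar \
  --pull_aggregate \
  --odk_url https://your.aggregate.example \
  --odk_username user \
  --odk_password pass \
  --form_id my_form \
  --storage_directory /path/to/briefcase \
  --include_incomplete
```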


Thanks Guillermo. Much appreciated.

I have just replied to another user bringing up the subject of what we consider to be the single biggest issue we have had with Aggregate over the years - the lack of a programmatic way of cleaning up the database. I know some time ago you shot the idea down, but I would argue (quite strongly) that the utility of the feature would far outweigh the potential damage it could cause.

I would really appreciate it if we could revisit that one.

