Add a GeoJSON export to Briefcase and Aggregate

yanokwa · September 14, 2018, 4:37am

What is the general goal of the feature?
As a data manager, I would like to export data from Briefcase and Aggregate that I can more easily import into GIS or other downstream tools.

How will the feature work
For Briefcase, we will include a drop-down menu with an Include location file option. For Aggregate, we will add the same option to the CSV exports. Enabling this option will cause Briefcase and Aggregate to export a GeoJSON file in addition to the CSVs.

The GeoJSON will be a FeatureCollection. The following mapping will be used between ODK XForms and GeoJSON: geopoint -> Point, geotrace -> LineString, geoshape -> Polygon. GeoJSONs support tree structures, so we will embed repeats inside the same file.

Here is a sample file:

{ "type": "FeatureCollection",
  "features": [
    { "type": "Feature",
      "geometry": {"type": "Point", "coordinates": [102.0, 0.5] },
      "properties": {"key": "1234", "name": "site_location"} },
    { "type": "Feature",
      "geometry": {"type": "LineString", "coordinates": [[102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]] },
      "properties": {"key": "4567", "name": "site_path"} }
  ]
}

Each geometric object will have key and name properties.

key is the key used in CSV exports. In a form without repeats, this will be the instanceID (e.g., uuid:3cc27b08-7ab4-42a6-9261-48eb2f9827d6). In a form with repeats, this will be the instanceID with the name and order of the element (e.g., uuid:3cc27b08-7ab4-42a6-9261-48eb2f9827d6/child[4]/location[2]).
name is the column/field name as Briefcase/Aggregate exports them.

To keep file sizes small, there will not be any other form data in the GeoJSON.
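As a rough illustration of the mapping described above, here is a minimal Python sketch. It assumes the usual ODK serialization, where a geopoint is stored as `latitude longitude altitude accuracy` and geotraces/geoshapes are semicolon-separated lists of such points. The function names (`parse_point`, `to_feature`) and the sample values are hypothetical and are not part of Briefcase or Aggregate. Note that GeoJSON orders positions as longitude, latitude, so the first two values get swapped.

```python
import json

GEOMETRY_TYPES = {"geopoint": "Point", "geotrace": "LineString", "geoshape": "Polygon"}

def parse_point(text):
    """Turn 'lat lon [alt acc]' into a GeoJSON position [lon, lat]."""
    parts = text.split()
    lat, lon = float(parts[0]), float(parts[1])
    return [lon, lat]

def to_feature(odk_type, value, key, name):
    """Build one Feature that carries only the key and name properties."""
    points = [parse_point(p) for p in value.split(";") if p.strip()]
    if odk_type == "geopoint":
        coordinates = points[0]
    elif odk_type == "geotrace":
        coordinates = points
    else:
        # geoshape: a Polygon is a list of rings; this assumes the recorded
        # shape already repeats its first point at the end (a closed ring).
        coordinates = [points]
    return {
        "type": "Feature",
        "geometry": {"type": GEOMETRY_TYPES[odk_type], "coordinates": coordinates},
        "properties": {"key": key, "name": name},
    }

features = [
    to_feature("geopoint", "0.5 102.0 0 5", "1234", "site_location"),
    to_feature("geotrace", "0.0 102.0 0 5; 1.0 103.0 0 5; 0.0 104.0 0 5", "4567", "site_path"),
]
collection = {"type": "FeatureCollection", "features": features}
print(json.dumps(collection, indent=2))
```

Altitude and accuracy are dropped here to keep the sketch short; whether the real export should keep them is a separate decision.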

Open questions
I thought this feature might be able to support a number of formats (e.g., Shapefiles, WKT, KML), but as I dug deeper, it seems like GeoJSON is a very popular, open, and well-understood format and so we might as well start (and maybe stop) there. Anyone feel very strongly about supporting other formats?

I'd like to limit the data in the GeoJSON file. The hope is that if you want to join the geodata with the form data using your favorite tool, it should be possible to do so using the key and the name. Is this a reasonable limitation?
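To make the join concrete, here is a small Python sketch of attaching CSV rows to features by key. The `KEY` column name, the `household_size` column, and the sample data are assumptions for illustration only; the real export may name things differently.

```python
import csv
import io
import json

# Hypothetical data; the KEY column name is an assumption about the CSV export.
csv_text = "KEY,household_size\nuuid:aaa,4\nuuid:bbb,2\n"
geojson_text = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
         "properties": {"key": "uuid:aaa", "name": "site_location"}},
    ],
})

# Index the form data by key so each feature can look up its row.
rows_by_key = {row["KEY"]: row for row in csv.DictReader(io.StringIO(csv_text))}

joined = json.loads(geojson_text)
for feature in joined["features"]:
    row = rows_by_key.get(feature["properties"]["key"], {})
    # Copy the form data into the feature without overwriting key/name.
    for column, value in row.items():
        feature["properties"].setdefault(column, value)

print(json.dumps(joined, indent=2))
```

In this sample, the submission `uuid:bbb` has no feature at all; it simply stays in the CSV, which is the behaviour the sidecar approach is meant to allow.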

Xiphware · September 14, 2018, 4:52am

this is very much how I'd gone about converting the XForm's instance XML to GeoJSON, which likewise seemed the best GIS target format to me.

Enrico_Ferreguti · September 14, 2018, 6:34am

Great new feature! GeoJSON is the best interface with the FOSS GIS world and web mapping interfaces, shapefile is for the proprietary ESRI world, and KML could be interesting for non-GIS people who use Google Earth or Google services for mapping.
About the data limitation: I understand the need to limit the data output, but it could be interesting to include, as an option, all geometries with form data, to obtain a complete snapshot of current submissions that can be viewed in a GIS without a further connection to the Aggregate server.

ggalmazor · September 14, 2018, 6:38am

This is awesome

My +1 to starting with GeoJSON

A quick Google search has gotten me to some online converters for all the other formats we're considering:

MapBox also offers command line tools like this one:

https://github.com/mapbox/tokml

ggalmazor · September 14, 2018, 6:42am

@yanokwa, did you forget to explain what's going to go into the name property? I can guess that it's the field name but could you confirm it?

Would it be the field name as Briefcase exports them, like on repeats or groups, where we are prefixing the group's name?

Ivangayton · September 14, 2018, 6:46am

Hi Yaw!

This is eminently reasonable. Very happy to see this proposal.

GeoJSON vs other formats
GeoJSON is an excellent choice; you can open it in any GIS program, use it directly in a number of Web settings, and in any case easily convert it to any other format as other posters have said. The handling of tree structures and the ability to embed multiple geometry types (points, lines, polygons) in the same file/string is a bonus.

WKT is great, but if we go all the way to GeoJSON, WKT becomes unnecessary and offers only a subset of the potential functionality of GeoJSON. We've been using a conversion script to go from the raw JavaRosa linestrings to WKT, but this has several serious limitations:

Large WKT linestrings break a lot of tools. The Python limit for a single CSV field is 131072 (2^17 for some reason), and there's a similar limit in LibreOffice (and presumably Excel). It's rather easy for a field surveyor to exceed that limit when walking a line taking a point every few seconds. So embedding WKT linestrings in CSV creates a result that cannot be opened in a number of common tools.
WKT (and JavaRosa) linestrings depend on delimiters that seem to be brittle; some phones we encounter in the wild place weird delimiters in the data that foul up parsers and break the whole CSV output. No guarantee that this will not be a challenge for generating good GeoJSON, but at least the problem will be confined to the geo-feature file!
Of course, you can also make WKT a separate file rather than embedded in a field, which mitigates the problems above, but a GeoJSON file can do anything a WKT text file can without any real additional overhead (file size, complexity, or computational load), but a WKT file can't do all the things a GeoJSON can. WKT strings are probably a bit easier to embed within other kinds of data (for example as a field in a CSV), but this leads to other problems anyway (see field size issue above) and GeoJSON can, in a pinch, be embedded as strings. So WKT just doesn't offer any additional advantage over GeoJSON once you decide to grasp the nettle and create proper geo-features.
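The field size limit mentioned above is easy to reproduce with Python's standard `csv` module. This is only a sketch: the sample point and the 5000-point trace are made up, and the raw `lat lon alt acc;` string stands in for whatever linestring text ends up in the CSV field.

```python
import csv
import io

def can_parse(csv_text):
    """Return True if the csv module reads every row without a field-size error."""
    try:
        for _ in csv.reader(io.StringIO(csv_text)):
            pass
        return True
    except csv.Error:
        return False

# Calling field_size_limit() with no argument only reports the current limit.
print("current field size limit:", csv.field_size_limit())

# Hypothetical trace: one recorded point is about 33 characters,
# so 5000 points produce a single field of about 165,000 characters.
point = "-1.2920659 36.8219462 1661.0 4.5;"
trace = point * 5000

buffer = io.StringIO()
csv.writer(buffer).writerow(["uuid:example", trace])
row_text = buffer.getvalue()

print("trace length:", len(trace))
print("parses under the current limit:", can_parse(row_text))
```

Raising the limit with `csv.field_size_limit(new_limit)` works around the error in Python, but that does nothing for spreadsheet tools with their own limits.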

While shapefiles are very compact and computationally efficient, they have some awful traits that, I think, make them a poor choice for ODK.

Without some very unpleasant hacks, shapefiles cannot deal with field names longer than 10 characters. Not necessarily problematic if you're only storing the geography in the shapefile and linking to the data using keys and names, but then why use a file format with a database component anyway?
Each shapefile consists of a number of files, which makes it easy to get confused and lose things.
Shapefiles can't contain more than one type of geometry; if you have points, lines, and polygons you'll need 3 shapefiles (for a total of at least 9 actual files, probably more like 15).

More stuff in the geo-feature file?
Fully agree that there's no reason to keep any data in the GeoJSON other than the keys to join with the survey results. I'd like to think of the geo-features as an element of the survey data, rather than thinking of the survey data as elements of the geo-features. This means that a survey instance that doesn't happen to have a specific geo-feature works.

For example, if I am collecting data on families, normally with a point for their home, but encounter one family without a home, I still have a perfectly usable line for that family that simply doesn't contain a geometry. If some of the data about that family is supposed to be embedded in the geo-feature, this will be problematic (a geo-feature file without geometry is technically possible, but is going to cause hassles).

shiva_Reddy · September 14, 2018, 6:52am

Hi @yanokwa ,
Thanks for inviting me for the discussion.
I am looking for an ODK Aggregate API for accessing submissions in GeoJSON format.

How are you planning to include other non-spatial attributes in the GeoJSON file?

I can help in testing by calling the API and testing in QGIS.

Thanks

ggalmazor · September 14, 2018, 9:36am

Hi, @shiva_Reddy!

As far as I understand, the initial proposal doesn't include any non-spatial attributes, and @Enrico_Ferreguti proposes we include it optionally.

yanokwa · September 14, 2018, 4:05pm

@Xiphware If you have the precise details of this conversion written up somewhere, I'd love to see it. It might give me ideas, maybe even good ones

@Enrico_Ferreguti If the joining becomes a pain, exporting all of the properties is something that we can do as a v2. That said, consensus seems to be it's not hugely important.

@ggalmazor Yes, on the name being the field/column name that is put in the CSV. I've updated the initial post.

@Ivangayton Thank you for pointing out some of the limitations of WKT and Shapefiles. I didn't know about these. It sounds like GeoJSON is absolutely the way to go.

paul_macharia · September 15, 2018, 3:14pm

Hey ODKers,

Not a subject expert on this but I think it's a great feature to enhance data management.

Paul

Xiphware · September 17, 2018, 9:13am

First off, I'm doing these conversions/translations in my iXForms app (à la Collect), not in Aggregate, and on a per-submission basis, not as part of a collective database dump/export. Although converting a (Collect) submission from XForm XML to GeoJSON in the frontend vs backend is of no consequence to the problem at hand, I do think it is probably worth first focussing on what a single (XForm) submission should look like in GeoJSON (eg a "Feature"?), and let the larger export flow naturally from that (eg a "FeatureCollection"?)...

In my case, iXForms can export a submission to a number of different formats:

KML, to launch Google Earth to view the submission
XLS, to launch MS Excel to view it as a spreadsheet
CSV, to be able to process the data in other tools (more readily than XLS)
JSON, as a lighter-weight payload than XML
and finally GeoJSON, as an obvious GIS-targeted alternative to KML.

One of the first issues faced when dealing with KML or GeoJSON is that a Collect XForm form (and hence its resulting submissions) may well contain one, none, or multiple geo-referenced properties. Although in many cases there will be only one, eg the geopoint of a dwelling, there could easily be many (eg location of all water sources, as a repeat group, when surveying a village), or none (eg the device has no GPS, or a survey of aid recipient demographics for which spatial location is meaningless/useless).

In the case of only a single, primary, mandatory geo-referenced property (probably a geopoint) the GeoJSON representation is fairly obvious: the submission becomes a GeoJSON "Feature", you pull out the (single) geopoint from the XML (however deep it's buried...) which becomes the Feature's "geometry" point, and everything else in the submission XML is put under "properties" as key-value pairs. And because GeoJSON supports nested properties, this property/group hierarchy can pretty much be a direct 1:1 mapping of XML to JSON (sans the geopoint obviously).

In the case of the XForm submission containing no georeferenced property, a strong argument can be made that - as a consequence - this has no legitimate GeoJSON representation. Specifically, GeoJSON is defined as:

GeoJSON is a format for encoding a variety of geographic data
structures using JavaScript Object Notation (JSON) [RFC7159]. A
GeoJSON object may represent a region of space (a Geometry), a
spatially bounded entity (a Feature), or a list of Features (a
FeatureCollection). GeoJSON supports the following geometry types:
Point, LineString, Polygon, MultiPoint, MultiLineString,
MultiPolygon, and GeometryCollection. Features in GeoJSON contain a
Geometry object and additional properties, and a FeatureCollection
contains a list of Features.

[emphasis added]

My interpretation of this is that a 'structure' containing no "spatially bounded entity" is not a GeoJSON Feature per se, and therefore neither can it be an element in a FeatureCollection. Whereas XML and JSON (and CSV?) can largely be considered universal data serialization formats, GeoJSON simply isn't JSON [sic]; its meaningful scope is a georeferenced object(s), and it's not intended to represent more abstract (non-georeferenced) data structures. If we consider GeoJSON as just-another-export-format for (an) XForm submission, I would argue that XForm submissions lacking any georeferenced properties should return a NULL result (!) - because anything else would be disingenuous - and that if a particular XForm is fully intended to be georeferenced then this must be explicitly enforced by the inclusion of a mandatory geopoint (or similar) property.

When it comes to the case of the XForm actually containing multiple georeferenced properties things get interesting... First off, what is the 'principal' georeference (eg geopoint) that we want to use to display this submission on a map? Perhaps the easiest is simply to take the first control, under the assumption that the location of the thing you are conducting the survey about is probably one of the first questions that will appear in the form [this is what I did, as a quick-n-dirty solution]. But obviously we'd want a more robust/less ad hoc solution for ODK. To this end I might suggest introducing an ODK-specific attribute on any geopoint/geoshape/geotrace binding to indicate that it is the 'primary' one to use for the overall result (eg orx:geoprimary=yes). The primary georeference would therefore become the top-level "geometry" point. Then the question is what to do with the rest of these geo* properties... Unfortunately, GeoJSON doesn't allow nested GeoJSON objects as properties:

A GeoJSON text is a JSON text and consists of a single GeoJSON object.

[although we could leverage using a FeatureCollection for this purpose, I think this is a stretch - there's no implicit notion of what the 'primary' Feature is in this list - and we probably want to exploit the FeatureCollection for our multi-submission export in any case].

What GeoJSON does allow is to instead represent these sub-features (eg locations of the various wells within the village), as so-called "Foreign Members". The problem here however is that these extended properties may well contain, but will not be interpreted as, geospatial properties. The spec even gives an example:

GeoJSON semantics do not apply to foreign members and their
   descendants, regardless of their names and values.  For example, in
   the (abridged) Feature object below

   {
       "type": "Feature",
       "id": "f2",
       "geometry": {...},
       "properties": {...},
       "centerline": {
           "type": "LineString",
           "coordinates": [
               [-170, 10],
               [170, 11]
           ]
       }
   }

   the "centerline" member is not a GeoJSON Geometry object.

[it might be possible to exploit a GeoJSON "GeometryCollection" somehow, which appears to allow multiple georeferenced values, but it's not clear to me that these can be any more than the actual geopoint/geoshape/geotrace; that is, they couldn't have any additional key-value properties associated with each. But perhaps one of the GeoJSON GIS experts can chime in here?].

In conclusion, I do think XForm submissions can be well-defined in GeoJSON, with the following caveats:

an XForm submission containing no geopoint, geoshape or geotrace has no valid GeoJSON representation [which I expect will elicit some debate...],
an XForm submission containing a single geopoint, geoshape or geotrace is represented as a single GeoJSON "Feature" whose "geometry" is defined by the geopoint/geoshape/geotrace; all other submission properties are represented as key-value pairs under the GeoJSON "properties".
an XForm submission containing multiple geopoints, geoshapes or geotraces is represented as a GeoJSON "Feature" whose "geometry" is defined by the primary geopoint/geoshape/geotrace - as determined by (say) the associated binding's orx:geoprimary value. All non-primary georeferenced properties of the XForm submission may be represented via GeoJSON "Foreign Members", but their interpretation (eg as secondary georeferenced features) is entirely implementation dependent.
multiple XForm submissions, eg an Aggregate export, are represented as a GeoJSON "FeatureCollection", subject to the above.
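The four caveats above can be sketched in a few lines of Python. This is an illustration under assumptions, not what iXForms actually does: submissions are taken as already parsed into a dict of plain properties plus a dict of GeoJSON geometries keyed by field name, and the `primary` argument stands in for the suggested (hypothetical) orx:geoprimary attribute.

```python
def submission_to_feature(properties, geometries, primary=None):
    """Map one submission to a GeoJSON Feature, or None if it has no geometry."""
    if not geometries:
        return None  # caveat 1: nothing georeferenced, so no GeoJSON representation
    if primary is None:
        primary = next(iter(geometries))  # fall back to the first geo field
    feature = {
        "type": "Feature",
        "geometry": geometries[primary],
        "properties": dict(properties),
    }
    # Non-primary geometries become foreign members. GeoJSON readers are free
    # to ignore them, and a field named "type", "geometry", "properties", "id"
    # or "bbox" would collide with a standard member; that case is not handled.
    for name, geometry in geometries.items():
        if name != primary:
            feature[name] = geometry
    return feature

def export_submissions(submissions):
    """Wrap the per-submission Features in a FeatureCollection."""
    features = (submission_to_feature(*s) for s in submissions)
    return {
        "type": "FeatureCollection",
        "features": [f for f in features if f is not None],
    }

village = {"type": "Point", "coordinates": [102.0, 0.5]}
well = {"type": "Point", "coordinates": [102.1, 0.6]}
export = export_submissions([
    ({"respondent": "A"}, {}),  # dropped: no georeference
    ({"village": "X"}, {"village_location": village, "well_location": well},
     "village_location"),
])
```

With this sample input the export contains a single Feature: the first submission is dropped, and the second carries `well_location` as a foreign member next to its primary geometry.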

Alas, I can offer no guarantee my ideas are good, or even above the median. But you can rest assured they'll probably be "provocative"...

danbjoseph · September 17, 2018, 1:05pm

@Xiphware, thanks for the very detailed post, it has made me think about this in a different way.

I had envisioned this functionality as less having a geographic entity for each submission and more just a way to have a geographic entity for each geo-type question. In much the same way that an image-type question will have a unique filename and then there will be an image file with that filename. So perhaps a form would have a single FeatureCollection with a feature for each time a geo-something was collected. Properties such as name and key as mentioned in Yaw's top post would link the feature to the question/submission for which it was created.

This doesn't make for a ready-to-go API that shows each survey submission on a map (as you point out, this would require some sort of 'primary' tag in the event that there are multiple geo features associated with a single submission). Also, each GeoJSON feature would lack any attributes from the rest of the survey. If you included all of a submission's data in a feature's properties, I think file size would almost always balloon to great size. So any sort of external visualization platform would need to make at least 2 API calls and then do some sort of join to attach the data you want to the geographic features.

ggalmazor · September 17, 2018, 1:14pm

Thanks for the detailed comment, @Xiphware!

Some questions/thoughts:

I agree, and I see no problem in omitting these submissions in the output GeoJSON file.

I understand that the GeoJSON output we're proposing would be bound to the normal CSV exports i.e. we will use the same field names and IDs to link submissions to their corresponding entries in the GeoJSON output.

The consequences would be:

All GeoJSON Features will be linked to a submission from the CSV
The submissions from the CSV will be linked to 0 or more GeoJSON Features

After studying the RFC, and tinkering with GeometryCollections and Foreign Members as you've suggested, I've come up with a structure that, I think, could fit our purposes:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "geometry": {
        "type": "GeometryCollection",
        "geometries": [
          {
            "type": "Point",
            "coordinates": [102.0, 0.5],
            "meta": {
              "name": "some_geopoint"
            }
          },
          {
            "type": "LineString",
            "coordinates": [
              [102.0, 0.0],
              [103.0, 1.0],
              [104.0, 0.0],
              [105.0, 1.0]
            ],
            "meta": {
              "name": "some_geotrace"
            }
          },
          {
            "type": "Polygon",
            "coordinates": [
              [
                [-64.73, 32.31],
                [-80.19, 25.76],
                [-66.09, 18.43],
                [-64.73, 32.31]
              ]
            ],
            "meta": {
              "name": "some_geoshape"
            }
          }
        ]
      },
      "properties": {
        "key": "uuid:53a83013-99ff-434d-a096-68fe1464c249"
      }
    }
  ]
}

This is a GeoJSON FeatureCollection object that can include a number of GeometryCollection Features.
Each GeometryCollection Feature represents a single submission (the example shows just one submission)
- It includes a properties attribute with a key member with the submission's instance ID
- Each included GeometryObject is one of the non-empty-valued spatial fields in the submission
  - They all include a Foreign Member meta with their corresponding field name.
This is valid GeoJSON (checked with geojson.io and geojsonlint.com)
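For anyone who wants to experiment with this structure, here is a minimal Python sketch that builds one such Feature and looks a geometry up again by field name. The helper names (`submission_feature`, `geometry_named`) are hypothetical; only the layout (a GeometryCollection per submission, a `meta` foreign member per geometry, a `key` property) comes from the example above.

```python
import json

def submission_feature(instance_id, spatial_fields):
    """Build one Feature from (field_name, geometry) pairs.

    Empty spatial values are assumed to have been dropped by the caller.
    """
    geometries = [
        dict(geometry, meta={"name": name})  # copy, then add the foreign member
        for name, geometry in spatial_fields
    ]
    return {
        "type": "Feature",
        "geometry": {"type": "GeometryCollection", "geometries": geometries},
        "properties": {"key": instance_id},
    }

def geometry_named(feature, field_name):
    """Find the geometry that belongs to a given field, or None."""
    for geometry in feature["geometry"]["geometries"]:
        if geometry.get("meta", {}).get("name") == field_name:
            return geometry
    return None

feature = submission_feature(
    "uuid:53a83013-99ff-434d-a096-68fe1464c249",
    [
        ("some_geopoint", {"type": "Point", "coordinates": [102.0, 0.5]}),
        ("some_geotrace", {"type": "LineString",
                           "coordinates": [[102.0, 0.0], [103.0, 1.0]]}),
    ],
)
collection = {"type": "FeatureCollection", "features": [feature]}
print(json.dumps(collection, indent=2))
```

Because `meta` is a foreign member, generic GeoJSON tools may drop it on import, so the lookup by field name is only dependable in code that reads the file directly.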

Also:

"To maximize interoperability, implementations SHOULD avoid nested GeometryCollections" (RFC) (sorry for the CAPS, it's like that on the RFC)

Contrary to what I believed, it appears we won't be able to have a tree-like structure to support the nested repeat groups as Aggregate does with the JSON exports. We will be forced to flatten all the spatial values into a list.

It seems to me that adding an attribute won't help as much as we think since users will still have to filter results. Why not tell them to use the related field names for that? Users will get the same results without having to complicate things further.

Xiphware · September 17, 2018, 9:35pm

Note, taking the approach on re-presenting the entire XML submission as GeoJSON, it would probably result in a size roughly 60% of the original; on the whole (uncompressed) JSON is about 60% the size of the equivalent represented in (uncompressed) XML.

Probably the first decision that needs to be made is whether this new GeoJSON support is going to be a complete standalone representation of the data - ie converting the entire XML submission to a different format (ie how Aggregate exports CSV, JSON, and how I went about it for iXForms GeoJSON), or rather an 'embellishment' to some other format (CSV) that simply extracts any GIS features and references the original document.

yanokwa · September 18, 2018, 7:32pm

@Xiphware I like to start with the narrowest feature first, and that to me is a GeoJSON file that is a sidecar/embellishment to the CSV. That file would be a FeatureCollection and if at some point we want to put the data inside (which feels odd as an export because submissions with no GIS data will be excluded) then we can do that later as an option.

@ggalmazor Assuming that we are OK with the focus on a GeoJSON as a sidecar file, then the format I described in my initial post still seems reasonable. Further, it solves the problem of flattening. That is, I believe that FeatureCollections, unlike GeometryCollections can be nested.

And because it is a belief, I'd like to double check by asking @Ivangayton, @danbjoseph, and anyone else on this topic who is familiar with the format. Do popular GeoJSON importing tools know about nested FeatureCollections? If not, how do folks represent nested data?

Xiphware · September 18, 2018, 9:02pm

My interpretation of the spec is that Features cannot contain properties which are FeatureCollections, otherwise the whole example in the spec around "GeoJSON semantics do not apply to foreign members and their descendants..." seems a bit pointless. But it would be good to get confirmation of this from someone in the GIS community; nested FeatureCollections could solve all our problems, if they're legit.

ggalmazor · September 18, 2018, 9:44pm

I agree with @Xiphware. I did test this and it is not valid GeoJSON according to the linters I've tried (linked in my previous comment). Right now I don't see any better option than having a root FeatureCollection with flat GeometryCollection features in it.

This is in line with what the RFC says:

3.2. Feature Object

A Feature object represents a spatially bounded thing. Every Feature
object is a GeoJSON object no matter where it occurs in a GeoJSON
text.

o A Feature object has a "type" member with the value "Feature".

o A Feature object has a member with the name "geometry". The value
of the geometry member SHALL be either a Geometry object as
defined above or, in the case that the Feature is unlocated, a
JSON null value.

o A Feature object has a member with the name "properties". The
value of the properties member is an object (any JSON object or a
JSON null value).

o If a Feature has a commonly used identifier, that identifier
SHOULD be included as a member of the Feature object with the name
"id", and the value of this member is either a JSON string or
number.

3.3. FeatureCollection Object

A GeoJSON object with the type "FeatureCollection" is a
FeatureCollection object. A FeatureCollection object has a member
with the name "features". The value of "features" is a JSON array.
Each element of the array is a Feature object as defined above. It
is possible for this array to be empty.

This implies that a FeatureCollection is not a Feature, which would prevent having nested FeatureCollections.
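The rule from section 3.3 quoted above is simple enough to check mechanically. The following Python sketch is not a full GeoJSON validator; it only flags entries of "features" that are not Feature objects, and the function name is made up for illustration.

```python
def non_feature_entries(collection):
    """Return the indexes of "features" entries that are not Feature objects."""
    return [
        index
        for index, item in enumerate(collection.get("features", []))
        if not (isinstance(item, dict) and item.get("type") == "Feature")
    ]

flat = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": None, "properties": {"key": "a"}},
    ],
}
nested = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": None, "properties": {"key": "a"}},
        {"type": "FeatureCollection", "features": []},  # not allowed here
    ],
}

print(non_feature_entries(flat))
print(non_feature_entries(nested))
```

A nested FeatureCollection shows up as an offending index, which matches what the linters report.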

ggalmazor · September 18, 2018, 9:53pm

o A Feature object has a member with the name "geometry". The value
of the geometry member SHALL be either a Geometry object as
defined above or, in the case that the Feature is unlocated, a
JSON null value.

@Xiphware, would you say we could take advantage of that part about "... or, in the case the Feature is unlocated, a JSON null value" to represent submissions that have no spatial values on them?

Xiphware · September 18, 2018, 9:56pm

Correct. Good catch - I never spotted that! Awesome

Now we don't have to exclude non-georeferenced submissions, which should make @yanokwa happy

It might be an interesting exercise to see what some of the common GIS tools do when trying to import a GeoJSON feature set containing NULL geometries(!). It might seem to be a very unusual use-case, from a GIS standpoint, so I'd be curious to see if they handle it gracefully...
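Until the common tools have been tested, one defensive option is to write unlocated submissions with a null geometry and let consumers split them off before handing the rest to a GIS tool. A minimal Python sketch, with hypothetical helper names and sample keys:

```python
import json

def unlocated_feature(key):
    """A Feature for a submission without spatial values (geometry is null)."""
    return {"type": "Feature", "geometry": None, "properties": {"key": key}}

def split_by_location(collection):
    """Separate Features that have a geometry from those that do not."""
    located, unlocated = [], []
    for feature in collection["features"]:
        (unlocated if feature["geometry"] is None else located).append(feature)
    return located, unlocated

collection = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature",
         "geometry": {"type": "Point", "coordinates": [102.0, 0.5]},
         "properties": {"key": "uuid:aaa"}},
        unlocated_feature("uuid:bbb"),
    ],
}

located, unlocated = split_by_location(collection)
print(json.dumps(unlocated_feature("uuid:bbb")))
```

Python's `None` serializes to the JSON null value that the RFC asks for, so no special handling is needed on the writing side.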

Ivangayton · September 19, 2018, 7:16am

Hi all,

Sorry I'm late to the party; the nested GeoJSON question has already been answered better than I can by @Xiphware and others, so I'll keep that part of my response short:

In my experience, nested FeatureCollections are not really kosher in the GeoJSON spec (they are treated as foreign members, meaning it's up to whatever is reading the file to decide whether or not to interpret them as first-class features), but they usually work in common tools; I have certainly seen nested FeatureCollections work in QGIS. I would not go this way, simply because we can't be sure that even the tools that can manage FeatureCollections within FeatureCollections will behave consistently.

I seem to recall reading that Leaflet might ignore the inner FeatureCollections, but I can't find the citation at the moment. I've never tested that.

Why not create a GIS file with all of the data in it?

I think I share the view that @danbjoseph had : geo-features are most naturally seen as children of the larger dataset rather than the survey data being subordinate to the geo-features.

Of course it's convenient to have a geographical file that you can slap straight into a GIS program or Web map viz tool and see everything right away, but this is only actually straightforward when there's only one, and exactly one, feature per survey (or repeat). The classic case of "We mapped every house and now we want to quickly see a map with the data as a pop-up when we click on each house" is perfectly compatible with a GeoJSON containing everything.

But what about a survey that includes the house and any outbuildings belonging to that family (outdoor toilet, animal stable, etc). What data is included in the toilet feature?

A more subtle case: you're recording patient origins in a hospital, and some patients are able to tell you their District, Chiefdom, Section, and Village name. Others are only able to tell you their District and Chiefdom. So you have geo-features that may be a large polygon (Chiefdom), a small polygon (Section) or a single point (Village). As attributes of a patient, this is fine—you know everybody's Chiefdom but some people's Village is blank—a GIS analyst can deal with this. But if the patient data is an attribute of the Village, you can forget a patient because they aren't part of a Village feature.

I'm sure there are cases where the data-as-attribute-of-feature(s) is more sensible, but I feel like these are the minority of situations, other than the simplest one:

There's certainly a good justification for allowing a non-GIS user to just get a simple KML with the data to look at it in a Web viewer! It would be nice if that were still possible for people who don't know how to do a basic join of a CSV to a GeoJSON by key. This capacity could be retained by keeping a KML export that is basically the CSV turned into a points-only KML with the first GeoPoint column as the coordinates. That may answer @danbjoseph's concern for some external visualisation platforms, some of which will probably always want to do the simplest thing which is simply show points with associated attributes.
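A points-only KML of this kind is small enough to sketch with Python's standard library. Everything specific here is an assumption for illustration: the column names, the sample rows, and the idea that the geopoint arrives as one `lat lon alt acc` string (a real CSV export may split it into separate latitude and longitude columns instead).

```python
import csv
import io
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"

def csv_to_kml(csv_text, geopoint_column):
    """Turn each CSV row that has a geopoint into a KML Placemark."""
    ET.register_namespace("", KML_NS)  # serialize KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    document = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for row in csv.DictReader(io.StringIO(csv_text)):
        parts = row[geopoint_column].split()
        if len(parts) < 2:
            continue  # no location recorded, so nothing to place on the map
        lat, lon = parts[0], parts[1]
        placemark = ET.SubElement(document, f"{{{KML_NS}}}Placemark")
        # The remaining columns become ExtendedData for the pop-up.
        extended = ET.SubElement(placemark, f"{{{KML_NS}}}ExtendedData")
        for column, value in row.items():
            if column == geopoint_column:
                continue
            data = ET.SubElement(extended, f"{{{KML_NS}}}Data", name=column)
            ET.SubElement(data, f"{{{KML_NS}}}value").text = value
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML, like GeoJSON, orders coordinates as longitude,latitude.
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = f"{lon},{lat}"
    return ET.tostring(kml, encoding="unicode")

csv_text = (
    "KEY,site_location,site_name\n"
    "uuid:aaa,0.5 102.0 0 5,Well A\n"
    "uuid:bbb,,Well B\n"
)
kml_text = csv_to_kml(csv_text, "site_location")
print(kml_text)
```

Rows without a geopoint are skipped in the KML but stay in the CSV, which is the trade-off discussed above.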

But for anything more complicated than that, and certainly anything for which we'd be looking for the power of GeoJSON instead of KML, I think it makes way more sense to target a CSV export and GeoJSON sidecar (with joining keys for surveys, groups, and repeats).