Naming of repeat CSVs in Briefcase export in v1.12.1

ggalmazor · September 30, 2018, 10:19am

I'm worried I might have missed something in the last release. Could you confirm that every change you were expecting is in v1.12.1 and that there's nothing missing?

This is my current understanding of how things should work:

Non-repeat group ancestors should be omitted in repeat group keys
Non-repeat group ancestors should be included in repeat group filenames
Group names in column headers should be separated using hyphens
Group names in repeat group filenames should be separated with hyphens

Let me know if I missed something or if I didn't understand something, please.

Regarding separator characters, it could be a problem to use the underscore, because Briefcase normalizes special characters to underscores. This could be confusing for users, and that's why we typically would want to use hyphens.

Matthew_White · September 30, 2018, 9:57pm

(@Francisco_Carballo, you can probably skip this post: I'll reply to your post in a separate post below.)

@ggalmazor, it's been a while since I dug into some of this, but I just tested out a form with nested groups and repeat groups, and I think I have a better sense of the expected behavior. The form I used has the title nested and has the following structure:

nested.xlsx (9.6 KB)
nested.xml (6.8 KB)

field1
group1
- field2
- group2
  - field3
  - group3
    - field4
  - repeat1
    - field5
- repeat2
  - field6
  - group4
    - field7
  - repeat3
    - field8
repeat4
- field9
- group5
  - field10
  - group6
    - field11
  - repeat5
    - field12
- repeat6
  - field13
  - group7
    - field14
  - repeat7
    - field15

I actually don't know a lot about the specifics of PARENT_KEY and KEY. The only thing that odkmeta assumes is that the PARENT_KEY of a row of a repeat group CSV exactly equals the KEY of the corresponding row of the parent CSV.

I think this is right, and Briefcase 1.12.1 matches my expectation here.

It looks like earlier versions of Briefcase took an interesting approach here that differs slightly from Briefcase 1.12.1 when there is a repeat group within another repeat group. In earlier versions, the filename of a CSV file for a nested repeat group would not include the name of the repeat group's parent repeat group. However, it would include the names of any groups between the repeat group and its parent repeat group. On the other hand, Briefcase 1.12.1 uses the fully qualified name of the nested repeat group for the filename, so it includes the name of the repeat group's parent repeat group in addition to the names of any groups between the repeat group and its parent repeat group.

When I use Briefcase 1.12.1 for the nested form, it exports the following files:

nested.csv
nested-group1-group2-repeat1.csv
nested-group1-repeat2.csv
nested-group1-repeat2-repeat3.csv
nested-repeat4.csv
nested-repeat4-group5-repeat5.csv
nested-repeat4-repeat6.csv
nested-repeat4-repeat6-repeat7.csv

Briefcase 1.10.1 exports the following files:

nested.csv
nested-group1-group2-repeat1.csv
nested-group1-repeat2.csv
nested-repeat3.csv
nested-repeat4.csv
nested-group5-repeat5.csv
nested-repeat6.csv
nested-repeat7.csv

I think using the fully qualified repeat group name, as Briefcase 1.12.1 does, is safer for forms that contain two repeat groups with the same name. I would guess that's pretty rare, especially since XLSForm doesn't allow it. However, when I test such a form with Briefcase 1.10.1, the two repeat groups with the same name are written to the same file.

I'd be happy with either approach: the current approach seems slightly safer, but the older approach also seems reasonable, and it wouldn't require any odkmeta changes.

While we're on the subject of duplicate names, I think there's still a slight chance that Briefcase 1.12.1 could run into duplicate CSV filenames, if there are two repeat groups that have different names but have the same fully qualified name. For example, if there is a repeat group named r that is nested within a group named g, and there is a second repeat group named g-r, then even though there are no duplicate names, the fully qualified names are duplicate. However, I think the risk here is probably vanishingly small, and I think Briefcase 1.12.1 does the right thing by throwing an error. But I thought I'd mention this in case it makes sense to include it as a test case.

This sounds right to me!

This sounds like a good strategy to me, and using a hyphen as the separator means that odkmeta won't have to change.

ggalmazor · October 1, 2018, 5:07pm

Thanks for your detailed response, @Matthew_White!

It looks like everthing is kind of OK except for the names of the repeat group CSV filenames.

It sounds resonable to keep doing it like Briefcase v1.12.1, though. An improvement we could add is to replace any "-" in group names to something else to really prevent dupes. Do you agree on this?

In any case, I think the best would be to have Briefcase do exactly the same it was doing on v1.10, and then think about improving it. I'll prepare a change and we can have a Briefcase v1.12.2 in no time

Matthew_White · October 2, 2018, 1:14pm

This sounds great to me. It doesn't require the introduction of a new option, and it'd be easy for downstream tools like odkmeta to implement. The convention in XLSForm seems to be to use underscores rather than hyphens in names, so hopefully hyphens in group names are not common. What replacement character do you think would work well? Also, do you know offhand how Briefcase replaces colons in names? (Those aren't allowed in Windows filenames I think?)

Perhaps I should also try to summarize this part of the conversation for the rest of the Central team, since one goal for Central has been to export CSVs that match the Briefcase export.

I'm actually close to releasing a new version of odkmeta for the first time in a while, so I'm in a good position at the moment to implement any changes to CSV filenames.

LN · October 2, 2018, 5:04pm

I think this is generally a good process but given that @Matthew_White is working on an odkmeta release right now and that it took a couple of months for a user to run into the issue (hopefully none experienced it and gave up), I think it would be ok to stay with the behavior introduced in v1.11 -- using the fully qualified name for the filename.

I can think of two downsides:

filenames can get really long. Presumably the truncation was introduced to address this. Is it commented anywhere or related to an issue? Did a user run into trouble with the name being truncated? If so, we should reconsider.
once odkmeta is updated to expect the full filenames, it won't work with files with nested groups exported from older Briefcase versions, right? Or could the change be made in a way that works with both by reading the filename from the right, for example? That would be the ideal, I think.

How about Briefcase just fail upon seeing that it would generate duplicate filenames? This is such an odd case! If a user then reported running into it, we could think about a substitution but I don't think it's worth doing now.

ggalmazor · October 3, 2018, 5:27am

This seems reasonable, thanks, @LN!

Just a clarification about this: using FQNs was introduced in v1.12.1 as a fix for the main issue reported here. After v1.12.1 we discovered the issue with nested repeat group filenames we are discussing about fixing or doing nothing about it.

Matthew_White · October 3, 2018, 7:35am

I've been going back and forth on this, because I can see the advantages and disadvantages of each option. However, I thought I'd throw some additional thoughts out there.

Apparently there is a Windows limit such that file paths (not just file basenames) cannot exceed 260 characters. That seems fairly long, but I could see fully qualified names hitting that limit if the form contains enough nesting and uses long enough names.

Given that, maybe there is a case to be made for continuing to use partially qualified names in the way that Briefcase did before 1.11. The partially qualified names are shorter than the fully qualified ones, so they are less likely to run into the 260-character limit. If we haven't heard of users running into issues with that limit, maybe that means that partially qualified names have tended to be short enough to avoid it.

Because partially qualified names include some group names, even if there are two repeat groups with the same name (which itself should be quite unlikely), there will be an issue only if the repeat groups either don't have parent non-repeat groups or have similarly named parent non-repeat groups. Partially qualified names don't protect against all cases, but compared to unqualified names, I think they make duplicate filenames less likely.

If we can expect older versions of Briefcase to be in use for a while, I do think that means that if we change the filename pattern, then downstream tools will have to account for both the old pattern and the new pattern. That may be an advantage to continuing to use partially qualified names, because doing so wouldn't require changes to downstream tools.

In the case of odkmeta, the output of odkmeta is a Stata do-file (a Stata script) to import the CSV files. The CSV files don't have to exist when odkmeta is run, but only when the do-file is run.

That means that searching for the correct file (or otherwise accounting for the two patterns) would have to happen in the do-file. I think there's no issue if the internals of odkmeta get more complex, but I think there are downsides to increasing the complexity of the do-file, because it makes it less likely that the user will be able to understand it or debug or modify it. I think accounting for both patterns would increase the complexity of the do-file, probably not hugely, but in a few different places.

All things considered, I would probably lean toward continuing to use partially qualified names. I think it'd still be a good idea to check whether Briefcase would generate duplicate filenames and fail if it would. And if we hear about any issues, maybe we can circle back to this question then. But they help avoid the 260-character limit, provide limited protection around duplicate names, and don't require changes to downstream tools, so even though they're a bit of an unusual pattern, I think they have advantages.

LN · October 3, 2018, 6:46pm

Aha, yes, I had a fuzzy memory of such a thing. Unfortunately, I think that means using the fully qualified name isn't a good option. 260 goes really fast if a user organizes their files in subfolders.

I can't help but wonder whether odkmeta and other downstream tools could use the contents of each CSV to determine what it represents rather than parsing the filename? It seems that might be less brittle.

Either way, at this point I'm leaning towards merging https://github.com/opendatakit/briefcase/pull/659 for now to revert the FQN behavior introduced in 1.12.1 (sorry I got that wrong above, @ggalmazor) and continuing the conversation.

Matthew_White · October 3, 2018, 10:32pm

This conversation has also left me thinking that filenames are pretty brittle. For using the contents of the CSV, is it possible for two different repeat groups to have the same column headers? If I recall correctly, I think we had discussed removing the SET-OF columns at some point, which I think could be a good simplifying move. (The next version of odkmeta won't use SET-OF columns.) But especially without the SET-OF columns to distinguish them, I could imagine the CSV files of two small repeat groups having the same column headers. It does seem pretty unlikely though, especially since XLSForm doesn't allow duplicate names...

What would you think about exporting a "manifest" along with the CSV files that maps filenames to repeat groups and lists the version of Briefcase used? As we consider adding more export options to Briefcase (for example, removing group names from column headers), perhaps the manifest could also list the export options that the user selected so that downstream tools can be sure that the CSV files are in the expected format.

ggalmazor · October 4, 2018, 6:12am

I like that v1.11 uses just the repeat group's name (shortest possible meaningful piece of info). We could build some disambiguation prefix/suffix in case of dupes.

LN · October 4, 2018, 6:39pm

Fair enough. My sense is that this would be relatively low priority, though, and would require some decisions and coordination. Starting with going back to the behavior that has long existed and making sure it's well-documented lets us focus on other things in the pipeline and come back to it if desired.

Matthew_White · October 9, 2018, 3:20am

After discussing it with @LN, and given that I'm close to an odkmeta release, I'm thinking of updating odkmeta so that it recognizes repeat group CSV filenames that use any of the following:

Partially qualified names, delimited by a hyphen
Partially qualified names, delimited by an underscore
Unqualified names

This means that if a future version of Briefcase exports CSVs with filenames that use unqualified names, odkmeta will support that even as it continues to support the older patterns as well.

I think there are a lot of advantages to this strategy. The resulting filenames are short, so they shouldn't run into the Windows limit. Unlike partially qualified names, the strategy is easy to describe. It should also be easy to implement in downstream tools.

I don't have strong feelings about the disambiguation strategy. odkmeta only works with XLSForms, not XForms, so it can safely assume that repeat groups have unique names.

I also checked the current behavior of Central. Right now it names repeat group CSV files using fully qualified names, delimited by a hyphen. Given the Windows limit, I think Central may need to start using a different pattern. If Central starts using unqualified names, then Central's CSV exports should work with the upcoming odkmeta release.

Around that time, perhaps it will make sense to start using unqualified names in Briefcase as well? If so, I think it would be helpful for Briefcase and Central to use the same disambiguation strategy as well, and we should make sure that strategy is well documented for downstream tools.

@LN, @ggalmazor, how does this sound?

LN · October 9, 2018, 4:34am

That sounds like a reasonable way to ensure compatibility over a range of systems and versions.

So I think the next steps are

Release Briefcase v1.12.2 that reverts to partially qualified names for now
Coordinate between Briefcase and Central on a disambiguation strategy for duplicate unqualified names
Add the CSV filename info and disambiguation strategy to the Briefcase export format docs
Release filename change to Central and Briefcase

Does that sound right?

ggalmazor · October 9, 2018, 8:58am

All this makes sense. Let's do it

ggalmazor · October 9, 2018, 11:04am

ODK Briefcase v1.12.2 has been released

ggalmazor · November 23, 2018, 9:26am

Here goes an update regarding this: we've just released Briefcase v1.13, which now will produce shorter (non-qualified) repeat group filenames. ODK Briefcase v1.13