Allow more control over variable-naming syntax


(Anthea) #1

What is the general goal of the feature?
To allow more control over how variable names appear in the data after being submitted. Currently, the file name and the group name are automatically added before the actual name given to the variable. It would be good to be able to turn off file name prefixes and group name prefixes.

For example, in my survey named household_survey, all my variable names end up in my sheet looking like this:

household_survey-today    
household_survey-enum    
household_survey-basic-name_f    
household_survey-basic-name_l    
household_survey-basic-gender    
household_survey-basic-age

But I want them to be named like this:

today
enum
name_f
name_l
gender
age

I've described this issue in this thread also: How to remove survey name from headers after submission in ODK Collect?

What are some example use cases for this feature?
Having the file and group names makes it difficult to do preliminary analyses on my data as they come in (i.e. not yet complete or ready to move over to other software, and so names cannot be changed yet) as all the variable names look the same. The columns have a fixed width so all of them show household_ before being cut off. I then need to select the cell with the name to see the full name in the formula box. This is tedious especially when presenting on ODK to stakeholders and introducing the concept of data collection and analyses to them, showing how survey responses get turned into data. Sure, we can guess at some of the variable names by looking at the responses (date, names), but when there are a few questions in a row that contain integers, that's where it starts to get confusing and cumbersome to keep having to check the variable name.

image

The front of a variable name is what is read first and having all variable names share the same file name at the start just creates clutter and makes names unnecessarily long. This also makes me reconsider my practice of naming my files with the most recent date, as I don't want to further add to the length of my variable names. This issue is carried through to other programs such as R where calling a column name requires a lot of typing and renders keyboard arrow shortcuts useless.

I understand that it is currently set up the way it is to ensure that variable names are unique, but ODK Validate already checks for unique names.

These are the errors I get in XLSForm Offline when the variable is repeated in the same group:

ODK XLSForm Offline Errors:
There are more than one survey elements named 'name' (case-insensitive) in the section named 'test_form'.

This is when the variable name is repeated across two groups:

ODK XLSForm Offline Errors:
There has been a problem trying to replace ${name} with the XPath to the survey element named 'name'. There are multiple survey elements with this name.

Given that there are already checks in place to make sure that names are unique regardless of group, the final variable name is needlessly long with both file names and group names as prefixes.

Additionally, a good feature to have could be to allow non-unique names during form building, and upon submission, append numerals to the end of names that are repeated. This is a common solution used in all computers when saving something with a duplicate name, adding (1) or (2) etc. Software like R/RStudio also do this when joining data frames.

This will save a lot of collective time and effort for users as I'm sure many are already manually removing the unwanted prefixes as the first step of data analysis. If there is an option to turn off either or both file name prefixes and group name prefixes, it will be a welcome improvement.

Thanks.


(Yaw Anokwa) #2

@notaplatypus Thanks for the feature request. When you get a chance, please introduce yourself here. I'd also encourage you to add a real picture as your avatar because it helps build community!

@ggalmazor I could have sworn we had discussed adding this to Briefcase and Aggregate, but I can't find any evidence. Maybe it was all in my head! I do think it'd be very nice to have. Any risks to implementation that you see?


(Guillermo) #3

I vaguely remember some conversation about group name prefixes on field names with @Matthew_White, maybe?

I'm guessing that the form @notaplatypus is using has a root group called household_survey and that's why all the fields are being prefixed with that in the exported file. Could you confirm this bit? Root-level fields shouldn't get any prefixes.

I still think there's a risk of having repeated fields across different groups. Even though tools like Validate prevent it, my interpretation of the XForm specs is that there's really no limitation on this particular thing.

This means that someone could manually create a form with two groups and the same field in both groups: form (1.6 KB). I've been able to upload that to Aggregate, get it with Collect and export it with Briefcase. No problem.

Also, how would the customization work in a user-interface sense? Did you have something in mind, @notaplatypus?


(Dr. Gareth S. Bestor) #4

There is nothing in the XForms spec that requires instance node names be unique across the entire XML instance tree. Indeed doing so would probably violate all sorts of XML rules... Rather, it’s the full path down thru the XML tree to the node that uniquely identifies it (if even then, if you consider secondary instances).


(Anthea) #5

Thanks for the replies everybody. I should have mentioned this before: I'm speaking only from experience building an XLSForm, then converting to XML, entering data with Collect and then submitting to Google Sheets. I've not tried Briefcase or Aggregate before and I'm not familiar enough with XML to know the requirements, tradeoffs, or if my recommendations are even feasible!

@yanokwa I've uploaded a profile picture and will post an introduction shortly! I'm still new to ODK so I'll do my best to answer everyone's questions.

The filename of my form in XLS is household_survey.xlsx and the first few variables (today and enum) are not in any group. If my filename is household_survey_2018-09-19 then that will be the new prefix for all the variables, i.e. household_survey_2018-09-19_today, household_survey_2018-09-19_enum, etc.

I think a new parameter could be added to the settings sheet in xls. Along with other parameters such as version, instance_name, and submission_url, I'm thinking that we could have one called group_prefix and the setting can be off or on (you guys can choose which is the default if the parameter is not specified). Another filename_prefix could be added with the same setting options. Or, both could be combined to one parameter called var_prefix and setting options can be filename:on/off, group:on/off, filename:on/off; group:on/off or something like that.

If there is a need to force unique names, the other feature I was thinking of would be to automatically append numerals to duplicated names regardless of the prefix settings. This could be a default feature, and numerals added would be in the order in which the name appears on the form.

So with all prefixes on, the exported names (with some-var as the duplicated name in the form) can be:
filename-group1-some-var_1
filename-group2-different-var
filename-group3-some-var_2

With all prefixes off, the names can be:
some-var_1
different-var
some-var_2

Sorry I can't speak to the XForm/XML specs or rules side of things. Just contributing what I would like to see as a user in terms of data output using the tools I've chosen. If there is already a solution, that would be welcome too. But in my searches I haven't found anything, only a couple of posts recommending using the search and replace feature in Excel or coding the first few lines in R/Stata to clean up the names.

Thanks!


(Matthew White) #6

Welcome @notaplatypus!

Am I right that you're suggesting the following two changes?

  1. The option to remove filenames and group names from the data
  2. The ability to use the same name for two fields in two different groups

For (2), like @ggalmazor, I'd been working under the assumption that this was already allowed. It sounds like it is allowed in the XForm spec, but XLSForms do not allow it. @LN, do you think it'd be possible to remove this restriction from XLSForms? @notaplatypus, can you think of a time when you've run up against this restriction?

I have more experience with suggestion (1). When I wrote odkmeta to import Briefcase data to Stata, a common request was to remove group names from column headers as part of the import. We ended up adding that as an option: odkmeta includes group names by default, but users can choose to remove them if they prefer.

This is often useful in Stata in particular, because Stata variable names can be a maximum of 32 characters, which can be well short of the length of a long field name when it's qualified by its group names. (A Stata "variable" is similar to the column of a spreadsheet.) Without removing group names, it's not uncommon for the column headers of a Briefcase .csv file to result in duplicate Stata variable names.

I think some other servers also present this as an option, for example, KoBo users see an option to include/exclude group names when exporting data.

Like @ggalmazor said, it does seem possible that removing group names would result in duplicate column headers. In odkmeta, if there are two field names with the same name and the user chooses to remove group names, odkmeta will remove group names from the first variable's name but will not modify the second variable's name. As a result, the second variable sometimes ends up with an ugly Stata variable name like var123. I also like @notaplatypus's suggestion of removing group names, then appending numbers to disambiguate any duplicate column headers that result.

Assuming there's a viable strategy to resolve duplicate names, rather than adding this functionality through an XLSForm setting, it might end up being easier to add it as an option in Briefcase/Aggregate/Central. For it to work with Google Sheets, maybe there would need to be a Collect setting as well under the Google Sheets settings...

Like @ggalmazor, I've never encountered filename prefixes: I've only seen group names. @notaplatypus, you're creating your form using XLSForms and XLSForm Offline, right? Can you share an example XLSForm for which you're seeing filename prefixes?


(Dr. Gareth S. Bestor) #7

That may be tricky, because in XLSForm ${bar} refers to the question with id="bar", irrespective of whether "bar" is actually a question under a group called "foo". And you cant say something like ${foo/bar} in XLSForm. So, in a flat namespace of a spreadsheet, all these ids pretty much have to be globally unique, alas.

Which is (one of the reasons) why XLSForm-to-XML is a one-way street!... :slight_smile:


(Anthea) #8

@Matthew_White thanks for your reply. For change (2) that you mentioned, I don't typically encounter this use case as it's my own practice to assign unique names to all fields. It was brought up as a situation in which the group prefix would be helpful in making all variable names in the data export unique. I'm not sure how common this user/builder behaviour is so I can't comment on reactions to removing the restriction. But I like keeping it and I think it's good data hygiene to avoid duplicate names, not to mention this is already required in other statistical programs and software. However, if you might want to allow (2), then I think appending numbers is a cleaner approach in dealing with it.

For the filenames, here is my process and some files for you to take a look at:

  1. I build an XLSForm in Excel, with the three sheets survey, choices, and settings.
  2. Once I'm done I run it through XLSForm Offline to convert it to an XForm.
  3. I either upload the XForm to my Google Drive, or transfer it to the ODK folder on my Android phone through USB.
    3a. If the form was uploaded to Google Drive, I use Get Blank Form in ODK Collect to download my survey.
  4. I fill in the form with Fill Blank Form in ODK Collect on my phone.
  5. I send my responses with Send Finalized Form. (The submission URL is specified in my XLSForm in the settings sheet and points to a blank Google Sheet).
  6. I check my responses in Google Sheet. This is where I see the headers with the full filename, group name, and field name.

This is the link to the responses.

And here is my XLSForm: household_survey_2018-09-13.xlsx (15.9 KB)

And this is the converted XForm: household_survey_2018-09-13 (2).xml (18.3 KB).


(Matthew White) #9

Thank you, @notaplatypus!

When I look at the data in the Google sheet you linked to, I can see that all column headers have the prefix household_survey_2018-09-13. When I look at the XML, there are no groups with the name household_survey_2018-09-13, but the name (not the id attribute) of the child node of the primary instance is household_survey_2018-09-13. Perhaps XLSForm Offline uses the filename of the XLSForm as the name of the child node? Often the name of the child node is not used, but it seems to be prepended to column headers in Google Sheets. I tried the XML in Aggregate, and household_survey_2018-09-13 did not appear in the column headers of the Aggregate .csv export, so this does seem specific to Google Sheets.

@yanokwa, can you confirm whether the name of the child node of the primary instance of the XForm is prepended to colum headers in Google Sheets? If so, I think the following additions could be useful:

  1. Add an option in Collect under the Google Sheets settings to not add the name of the child node of the primary instance to column headers.
  2. Add an option in Collect under the Google Sheets settings to not add group names to column headers.
  3. Add an option to Briefcase to not add group names to column headers.
  4. Add an option to Aggregate to not add group names to column headers.
  5. (Down the road) Add an option to Central to not add group names to column headers.

For (2)–(5), we would need to come up with a strategy to resolve duplicate names.

@ggalmazor, what do you think of these options?


(Hélène Martin) #10

Agreed with @Xiphware that this is one of the important simplifications that makes XLSForms easier for users. I don't see any reason to remove it.

That is correct. There is a proposed change to always use data as other form converters do but it is a risky change and requires deeper review and testing: https://github.com/XLSForm/pyxform/pull/152. If someone has cycles to participate in that, it would be greatly appreciated.

Some existing discussion of that at https://github.com/opendatakit/collect/issues/298 started by @Enrico_Ferreguti; see particularly this suggestion:

How about a configuration option in the Google Sheets settings such as
[ x ] Include form and group names in column names

By default, this would be checked and the behavior would be the same as it is now. If unchecked, only the field name would be included in the column headers.

A form with duplicate field names would not be able to be sent to Google Sheets.

Options 2-5 sound good to me. I would even be ok with simply warning in documentation and perhaps in the UI that column names in the output document may not be unique.


(Matthew White) #11

That sounds good to me. I'm convinced that there are nice benefits to the restriction, and I also don't think it's hard to work around it.

This sounds great to me. I could see us perhaps wanting to split this into two options at some point (someone might want to remove a lengthy form name but keep group names). However, I wouldn't be surprised if most users who are interested in these options would want to remove both the form name and group names.

This seems like a nice solution to me. Given how many users use XLSForms, I would think that duplicate names should be relatively rare.

By the way, @notaplatypus, I'm not sure whether this would be helpful, but I believe it's possible in Google Sheets to automatically resize all columns at once: select the entire sheet (you can use the select all keyboard shortcut), then double-click the divider between the A and B columns toward the top. The columns might end up wider than ideal, but at least all column headers will be visible without much manual effort. (Or is it not possible to edit a Collect Google sheet in this way?)


ODK 1 TSC Call - 2018-10-03
ODK 1 TSC Call - 2018-10-17
ODK 1 TSC Call - 2018-10-03
(Aurelio) #12

I agree. Most users now use XLSForms or Build. Using "Validate" will not allow duplicated variable names and most of the data managers at the end need to manually (or with a specific manipulation post export) remove the group names for analysing the data. Ideally an option on Briefcase to export using directly the variable name instead of the long name would be really useful.

I started to advocate odkmeta recently and the option there to use the short name is great @Matthew_White, especially for Stata users. Thanks

@ggalmazor We also had a quick exchange about that couple of weeks ago on my post Feedback from trainings


(Hélène Martin) #13

A post was split to a new topic: Should the root node name produced by XLSForm be changed to data?


(Aurelio) #14

two issues opened:

Aggregate

Briefcase


(Guillermo) #15

Dear y'all,

I've prepared a PR that adds a new -uun CLI flag to use unqualified field names (removes the group names), and it would be awesome to have you try it and get some feedback :wink:

Get it here: briefcase_671.zip


(Guillermo) #16

Hi! We've finally released a new Briefcase version including the option to remove group names from CSV exports on the CLI. Check it out here: ODK Briefcase v1.13