Reflection: release process for Collect, Briefcase, JavaRosa

LN · September 22, 2017, 11:42am

TL;DR: there will be no Collect, Briefcase or JavaRosa release in September; your thoughts on the release process are greatly appreciated

Over the past months, several of us have been involved in codifying, streamlining and iterating on the release processes for Collect, Briefcase and JavaRosa. You can see the results in the README files for each project on GitHub. I wanted to take a moment to talk about what has worked well and what still needs improvement. I hope we can use this thread to keep doing periodic reviews and process improvements after releases. I was going to call the thread postmortem but I find that term a bit grim. Please share your thoughts on how things are going and your ideas!

We've been doing regularly scheduled monthly releases since March. That has made it possible to work through a backlog of issues and get lots of incremental improvements out in a timely way. It has also been a tiring pace that's hard to maintain. For example, the announcements for the latest releases were only posted to the forum several days after the releases were actually out. Let's take a month off (no releases in September) and try to refine the process for the next set of releases.

Successes

When there have been issues with new releases, fixes have typically been out within 24 hours of discovery
The feedback loop has tightened with many more users participating in the Collect beta program and identifying issues before releases
Manual verification is now systematic. In particular, @mmarciniak90 has been leading that effort on Collect. @Nader also provided really valuable feedback in the last set of releases
Having a predictable release cycle has helped keep translations up to date
It has been motivating for organizations and individuals who submit code to enjoy the benefits in a timely manner
Careful monitoring of Firebase Crash and Play Store crash reports for Collect has led to identifying and fixing a number of issues (Currently @yanokwa, @Grzesiek2010 and I have access)
@jamesknight, @Grzesiek2010, @shobhit_agarwal and others have been adding automated tests and improving the testing infrastructure on Collect. @dcbriccetti is keeping tests up to date on JavaRosa and adding some on Briefcase. @daveycrockett is also working on Briefcase tests. All of these are increasing confidence as changes are made and making it easier to add more testing along with new features
Download patterns are systematic with low downloads on Sunday. Releasing on Sunday means that in case of problems, we have time to react quickly before the bulk of users upgrade
JavaRosa releases to Maven Central are smooth and easy as is updating JavaRosa versions in other tools thanks to @yanokwa

Challenges

We know there are still users suffering without describing their problems -- there are some crashes we can see in logs but can’t yet explain and others we know are related to form design and easy to fix. How do we encourage feedback from users when they run into problems?
Forks of Collect log to Firebase Crash. Several have bad crashes that drown out crashes from the core. We need to figure out how to prevent those from logging to the ODK account (GitHub issue here)
The last couple of releases have introduced subtle issues that only affect certain kinds of forms or devices. How do we generalize from those problems and prevent introducing similar ones in the future?
Releases are currently very labor intensive -- the binaries need to be built and verified, release notes are written for and posted to GitHub, the Play Store and the forum. No task is big but cumulatively it takes a long time. What can be automated? I know @Shobhit_Agarwal has some ideas for release note automation
Downloadable binary hosting uses a custom solution and someone from UW needs to do the uploads (thanks, @clarice_larson!). Moving to a more standard hosting solution like Bintray would reduce the labor involved
The release process currently has a low bus factor -- only @yanokwa has the keys to do a Collect release, for example. How can that bus factor be responsibly increased? Does anyone have experience with systems like Passbolt or other best practices for sharing credentials safely?
How do larger, more disruptive changes work with monthly releases? We may need to introduce practices like feature switches
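
As one possible starting point for the release note automation mentioned above, here is a minimal Python sketch that groups merged pull requests by label and renders plain-text notes. The input structure, the label names and the section headings are all assumptions made for illustration. It does not query GitHub, and it is not necessarily what anyone on the team has in mind.

```python
from collections import defaultdict

# Hypothetical mapping from a pull request label to a release note heading.
SECTIONS = {"enhancement": "Improvements", "bug": "Bug fixes"}
HEADING_ORDER = ["Improvements", "Bug fixes", "Other changes"]


def build_release_notes(version, pull_requests):
    """Group merged pull requests by label and render plain-text notes.

    Each pull request is a dict with "number", "title", "label" and
    "author" keys (an assumed shape, not a real API response).
    """
    grouped = defaultdict(list)
    for pr in pull_requests:
        # Labels we don't recognize fall into a catch-all section.
        heading = SECTIONS.get(pr["label"], "Other changes")
        grouped[heading].append(
            f"- {pr['title']} (#{pr['number']}, @{pr['author']})"
        )

    lines = [f"Release {version}"]
    for heading in HEADING_ORDER:
        if grouped[heading]:  # skip sections with no entries
            lines.append("")
            lines.append(heading)
            lines.extend(grouped[heading])
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up example data, not real pull requests.
    example = [
        {"number": 1, "title": "Fix crash on rotate", "label": "bug", "author": "someone"},
        {"number": 2, "title": "Faster form loading", "label": "enhancement", "author": "someone_else"},
    ]
    print(build_release_notes("vX.Y.Z", example))
```

The same text could then be pasted into GitHub, the Play Store and the forum, which is where most of the repetition in the current process seems to be.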

W_Brunette · September 22, 2017, 11:32pm

Thanks for the great reflection post @LN!

Also thank you for your continual efforts to improve/streamline the release process! I think this is a great conversation to have.

To decrease any community anxiety, I wanted to clarify that we attempt to ensure there is no single point of failure or bus-factor problem. As far as I know, @yanokwa is not the only one who can do a Collect release, as several of us have the permissions to publish a release.

Furthermore, I agree with @LN that we need to enable a broader number of people to perform release tasks, and that it would be nice if we had a shared credential infrastructure to avoid any bus-factor risk.

Souirji_Abdelghani · September 23, 2017, 2:47pm

Hi @LN and @W_Brunette,

I certainly welcome this very thoughtful post by @LN. It is very useful to pause from time to time and evaluate the work done so far.

I tried some time ago to draw attention to the excessive pace of software releases: https://forum.getodk.org/t/necessity-for-long-term-release-of-odk-software/8921/8.

I think more than ever that it would be best to publish yearly (or six-monthly) long-term (LTR) releases, because constant changes are not necessarily a good thing. Both users and developers need some stability. Only bug fixes should be released as soon as they are ready. This is the case with many well-known open-source projects such as QGIS. This would allow more thorough testing of the software and more robust architecture planning.

As to who should publish new releases once their calendar is agreed upon, this should be determined beforehand and it should be restricted. Perhaps I have misunderstood @W_Brunette, but if several people can publish new software releases, this is going to be chaotic and detrimental to the community. Most probably @W_Brunette only wants to rightly point out the need to better share the tasks involved in the publication process to avoid a bottleneck (bus factor).

Better sharing of responsibilities would certainly help improve the reliability and speed of software releases. But I am still worried by the current excessive pace of these releases.

Nader · September 23, 2017, 6:52pm

Hi @LN, @W_Brunette and @Souirji_Abdelghani,

First of all, this is a really great idea, @LN. It will help us improve quality and stability.

I agree with @Souirji_Abdelghani that LTR is better. Not everything needs to be fixed as soon as possible (it depends on the issue), and in general constant changes can lead to a lot of issues.

At the same time, developers need to work without pressure, and we need to give testing some space before releasing. During that time developers don't have to jump straight into fixing issues. We should collect all issues for a while, and then the developer team can meet to discuss them and plan solutions.

Sometimes you will find many issues that are really one main issue, with the others resulting from it. Developers then spend their time fixing issues that are only indirectly related to the main one, the code grows, and that decreases the stability of the software.

Finally, all we are looking for is to make the release process less risky and more effective.

Thank you, all of you.

Best regards to all of you

W_Brunette · September 23, 2017, 6:56pm

To clarify @Souirji_Abdelghani's question/statement, I was responding only to the "bus factor" or single point of failure problem. The additional people who have permissions do not currently do releases, nor am I proposing that they do. The point is not to have multiple people doing releases simultaneously; instead, if something happens to the person doing the release that makes them unavailable, others can step in. Bus-factor risks are something many users want to avoid because those risks can affect them.

I also think spreading the release work with a coordinated effort between people could help distribute the workload more. As @LN points out we do not have a good way to share credentials safely.

LN · September 25, 2017, 8:20am

Good news about credentials, @W_Brunette! Is there a spreadsheet of credential types (not actual credentials!) and who has access to each that could be publicly shared? Knowing for sure that there is redundancy for every credential type would certainly make me sleep better at night. Having it be public would help make sure the information is broadly available in case of any problems and would help make it clear where there could be opportunities for community members to increase redundancy. I think it would be a good first step towards shared credential infrastructure.

@Souirji_Abdelghani Thanks for bringing up this important question about release pace. It's something we've discussed in the latest ODK 1 Developer Calls. There are tradeoffs associated with any release schedule and so far the sense has been that regular releases have been overall positive for the following reasons:

Crashes have gone down significantly on Collect. There aren't yet analytics on the other tools but issues that were brought up have been fixed in a timely manner.
Contributors know that their work will be available soon. If a change isn't ready, we can wait until the next release knowing that it isn't too far in the future. In particular, we never need to rush to get a change into a particular release because we know another one is coming soon.
It's easier to fix issues in code that has just been integrated rather than to address them months later.
We've developed a good culture of considering risk for each change that gets proposed. You can see this on every pull request and in conversations on the developer Slack.
Only having one active development branch to manage lowers overhead significantly. This is important because the developer community continues to be small relative to the user community.
User feedback has overall been positive as measured by positive comments on public channels, user interviews and word of mouth.

If there have been problems with the releases, it's really important that they are highlighted immediately. @Souirji_Abdelghani if you have observed some, please take a moment to write them up for the support category so that they can be addressed. The intent has been and continues to be that changes released do not require retraining. This has been verified informally with various groups that use ODK at scale but if it's something you are concerned about, please make sure to try out pre-releases and ask for changes as needed.

All that said, while I do think this process has been effective for incremental changes, as more sweeping changes are considered we will need to look to different processes including things like using feature switches and putting out longer-running betas. Any suggestions for those would be much appreciated so that we are as well-prepared as we can be to make bold changes in positive ways.
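
To make the feature switch idea a little more concrete, here is a minimal sketch in Python. It is only an illustration of the general technique: the class, the flag name and the two code paths are hypothetical, and since Collect is an Android app a real version would look different. The point is that unfinished work can ship in a monthly release while staying off by default.

```python
class FeatureSwitches:
    """Holds on/off flags for in-progress features.

    Any flag that has not been explicitly enabled is treated as off,
    so new code paths stay dormant unless someone opts in.
    """

    def __init__(self, enabled=None):
        self._enabled = set(enabled or [])

    def is_enabled(self, name):
        return name in self._enabled


def describe_select_widget(choice_count, switches):
    # "new_select_widget" is a hypothetical flag name used for illustration.
    if switches.is_enabled("new_select_widget"):
        return f"new widget with {choice_count} choices"
    return f"existing widget with {choice_count} choices"


if __name__ == "__main__":
    regular_release = FeatureSwitches()
    beta_release = FeatureSwitches(enabled=["new_select_widget"])
    print(describe_select_widget(3, regular_release))
    print(describe_select_widget(3, beta_release))
```

A longer-running beta could then simply be a build with the switch turned on, while regular releases keep the existing behavior until the change is considered safe.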

I have to admit my personal experience with LTS builds has not been positive so I'm wary of them. @Souirji_Abdelghani if you have a good process for validating and maintaining LTS builds, please consider writing up a proposal post in this thread and/or linking to other process documents you find useful. In particular, all builds are available on the website and continue to be supported here on the forum, so what processes would make the LTS builds different? What do you imagine your role could be in making this happen? Could you maintain LTS branches and get them ready for release? Help get user feedback and testing?

Souirji_Abdelghani · September 26, 2017, 8:45am

Hi @LN,

Thank you for these interesting and exhaustive clarifications. I shall address your questions and remarks in the coming 3 weeks because I am now very busy with some unrelated technical work.

Best

LN · October 23, 2017, 6:54pm

Does anyone else have any feedback and suggestions?

We are sitting on a lot of improvements to Collect. For example:

big speed improvements for large select_ones
upgrade of the GPS libraries for faster lock and better accuracy
autocomplete on select multiple
scaling down of images from the general settings
Ethiopian calendar
lots of bug fixes

My preference would be to release these soon because I think they bring a lot of value. I don't believe any of the changes will require enumerator re-training though of course there's always a risk of unintended consequences.

Going back to our earlier schedule would mean a code freeze and beta out by Wednesday and a release by Sunday. If you are committed to verifying the beta and would like more time, please share what you think a more helpful timeline would be.

Unfortunately the release automation work has not yet happened. If anyone has interest in this area, let's start a separate thread and see what we can improve!

LN · December 4, 2017, 6:00pm

Because of holidays in various locales -- Thanksgiving, Christmas, etc -- we have discussed a modified release schedule on Slack. The next release of Collect is planned for December 17th. Beta testing will be announced by December 13th but preferably before. As always, please share any comments or questions you have about the schedule and process.