Automated benchmarks in JavaRosa

odk-javarosa
#1

Hi, all!

I wanted to share with y'all some ideas that we've been discussing for some time related to adding automated benchmarks to JavaRosa.

The motivation for investing collective effort in this is to identify the most pressing performance pain points in JavaRosa's implementation, and to have a tool for assessing any proposed technical improvement that addresses them.

We're not talking about assessing JavaRosa's behavior (which is something important that we should do as well), but the performance of its current behavior. This means that sometimes we will want to focus on low-level features, like a specific XForms function, and sometimes on high-level performance, like parsing full-fledged forms.

From a technical point of view, our best option (after doing some research) would be to use the JMH Java micro-benchmarking framework, although there are some things we need to talk about:

  • Benchmarks could be wrapped in JUnit 4 tests so that they become a new step of the build workflow, which would enable us to break the build when a benchmark regresses or fails to meet some threshold after a change to the implementation.
  • We could create different groups of benchmarks depending on their abstraction level. A low-level benchmark should live close to the codebase in order to target specific methods, while high-level benchmarks could even live outside of the codebase and approach JavaRosa with an outside-in strategy (black-box testing).
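To make the first bullet concrete, here is a minimal sketch of what wrapping a JMH run in a JUnit 4 test could look like. The benchmark class name and the throughput threshold are hypothetical, and JMH and JUnit 4 would need to be on the test classpath; this is an illustration of the pattern, not JavaRosa code.

```java
import java.util.Collection;

import org.junit.Test;
import org.openjdk.jmh.results.RunResult;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import static org.junit.Assert.assertTrue;

public class FormParseBenchmarkTest {
    // Hypothetical threshold: fail the build if throughput drops below this.
    private static final double MIN_OPS_PER_SEC = 10.0;

    @Test
    public void parsingThroughputMeetsThreshold() throws RunnerException {
        // Run the (hypothetical) FormParseBenchmark class programmatically.
        Options opts = new OptionsBuilder()
                .include("FormParseBenchmark")
                .warmupIterations(3)
                .measurementIterations(5)
                .forks(1)
                .build();

        Collection<RunResult> results = new Runner(opts).run();

        // Break the build when a benchmark falls below the threshold.
        for (RunResult result : results) {
            double score = result.getPrimaryResult().getScore();
            assertTrue("Benchmark below threshold: " + score,
                    score >= MIN_OPS_PER_SEC);
        }
    }
}
```

Running JMH through its `Runner` API like this keeps the benchmark in the normal test lifecycle, at the cost of longer build times, which is one of the trade-offs to discuss.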

From a logistics point of view, we need to identify low-level and high-level targets for our benchmarks. We've already collected a couple of particularly big forms that present performance issues when used in Collect that we could explore for our first high-level benchmarks. It's not clear yet what low-level targets we should focus on first, although generic XML parsing and secondary external instance support are good first candidates.

From a strategic point of view, here are my thoughts:

I expect we will have to go through a first phase to try stuff out, keep what works, and throw away what doesn't. During this exploration phase, I wouldn't create much structure for this work. Maybe having a branch that the involved people can share, and some focused conversations here and over at Slack would be enough.

Once we make all the technical decisions, and we can provide examples for every possible benchmarking scenario, we could create specific issues targeting our benchmarking needs for anyone to contribute.

2 Likes
#2

Guillermo,

Very interesting tool! This is the first time I've heard of it 🙂

So the plan would be to write new tests (or update existing ones) and annotate the parts we want to benchmark? I agree that we need to play around with it first to get a feel for it and see what kinds of metrics we can gather.

1 Like
#3

Yes, that's the gist of it.

Maybe I could start by sharing a branch I have with some basic benchmarks so that you can see what it looks like.

1 Like
#4

I've created a WIP PR with a sample JMH benchmark inside a JUnit 4 test. I've also added a list of things we should explore. The goal would be to take note of any questions we have during the exploration process (there are some already) and to list the examples we create for each item on the list.

I think anyone can pull my PR and start tinkering away 🙂

We can continue the conversation in the PR.

#5

I will check out the branch and look around the code. Some of the questions probably need answers from @LN 🙂

1 Like
#6

You've kind of alluded to these in your original post, @ggalmazor, but to make sure we're on the same page, these are the user-facing performance issues I see as priorities:

  • Loading forms with many questions can take a long time
  • Saving values in forms with many expressions that tie fields together can take a long time
  • XPath queries can take a long time
  • Loading forms with large secondary instances can take a long time

Hopefully we can agree on a list of high-level problems like the ones above and target those by starting with exactly what you suggest:

There are more good forms to consider at Share your BIGGEST or slowest form and win a badge!. As a step even before that, how about starting with synthetic forms that have particular characteristics? For example, I would be interested to have benchmarks for first load, cache load, and saving values into a form with 1000 simple questions (e.g. many_questions_1000). These could be used as a baseline to compare real forms against.
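A synthetic form like the suggested many_questions_1000 could be generated rather than hand-written, so the question count stays a parameter of the benchmark. Here is a rough stdlib-only sketch of such a generator; the class name and the minimal XForm structure are illustrative, not JavaRosa API.

```java
// Sketch: generate a minimal synthetic XForm with N simple text questions,
// e.g. as a baseline form like many_questions_1000. Illustrative only.
public class SyntheticFormGenerator {
    public static String generate(int questionCount) {
        StringBuilder instance = new StringBuilder();
        StringBuilder body = new StringBuilder();
        for (int i = 1; i <= questionCount; i++) {
            // One instance node and one matching input per question.
            instance.append("<q").append(i).append("/>");
            body.append("<input ref=\"/data/q").append(i).append("\">")
                .append("<label>Question ").append(i).append("</label>")
                .append("</input>");
        }
        return "<h:html xmlns=\"http://www.w3.org/2002/xforms\" "
                + "xmlns:h=\"http://www.w3.org/1999/xhtml\">"
                + "<h:head><model><instance><data id=\"many_questions\">"
                + instance
                + "</data></instance></model></h:head>"
                + "<h:body>" + body + "</h:body></h:html>";
    }
}
```

The generated string could then be fed to the form parser in a benchmark parameterized over question counts (100, 1000, 10000, …) to check how load time scales.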

Hopefully we could then actually say with confidence, for example, how performance and the number of questions in a form are related. Linearly? Not at all, because only the number of relations between fields matters? Etc. That will at least help us better communicate with users about the implications of form design decisions and hopefully also start the process of identifying areas for improvement.

It's developed as part of the OpenJDK project, so that makes it clear to me that it's the way to go. 👍 Crucially, we can trust it to properly hide away the JVM optimizations that make reliable, trustworthy timing data so hard to get.
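One concrete example of the JVM optimizations mentioned above is dead-code elimination: if a benchmark computes a result nobody reads, the JIT can remove the computation entirely and the timing becomes meaningless. JMH's Blackhole exists for exactly this; the sketch below is illustrative, not JavaRosa code.

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.infra.Blackhole;

public class EliminationExample {
    @Benchmark
    public void withoutBlackhole() {
        // Result is unused: the JIT may eliminate the whole computation,
        // producing a meaninglessly fast score.
        Math.log(Math.PI);
    }

    @Benchmark
    public void withBlackhole(Blackhole bh) {
        // Consuming the result keeps the computation alive and measurable.
        bh.consume(Math.log(Math.PI));
    }
}
```

Alternatively, simply returning the value from the @Benchmark method has the same effect, since JMH consumes return values implicitly.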

Yes. Maybe even with Collect as the client? Ultimately, that's where users are feeling the pain. And there's certainly at least some overhead at that level.

This might be nice to have but I don't think it's a priority. As long as there's a consistent way to get performance information, we can make sure it's something reviewers do when there are questions about the performance implications of a change.

2 Likes
#7

As far as "real" forms go, here is a pair that I think would be good to have benchmarks for:

They both have cascading selects with many elements. They also both have a question with a simple XPath query to get a single value. My quick qualitative assessment is that the form with external data loads much more quickly than the one with internal instances and that this should be explored further. Once the form is loaded, evaluation time feels the same which is what I would expect.

@dcbriccetti has already done some work profiling these (see Collect large form performance) and having benchmarks that any further performance improvements can be verified against will be very valuable.

So we don't lose track of related work, these are past conversations that might be worth revisiting:

If this all looks rather disjoint, it is. We haven't had a specific performance mandate yet so this has been pushed on bit by bit as people have time/curiosity/insomnia.

Another quick note: someone opening up the XML for the Nigeria ward form with internal secondary instances might notice that there are unnecessary translations. This is a pyxform issue and I've filed it at https://github.com/XLSForm/pyxform/issues/285. I don't think that makes a big performance difference.

#8

A little update on the examples we're trying out:

  • This is an example of taking an existing unit test and converting it into a benchmark.

    Key things to have in mind:

    • All the preparation must be done in the @State class. In this case, loading the files and parsing the forms are not what we're testing/benchmarking, so that goes there.

    • Each action that we want to measure goes into its own @Benchmark method.

      These methods use the available assets in the @State class and perform just one action.

      In order to measure only what we want to measure, it's important to be careful and write the least amount of code that performs the action being measured. Using the @State class for any required work that isn't strictly part of the measurement is super important.
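Putting those three points together, the pattern could look roughly like this. The asset filename is hypothetical, and the XFormParser usage is a plausible entry point into JavaRosa's parsing, so treat this as a sketch rather than the actual converted test:

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.javarosa.core.model.FormDef;
import org.javarosa.xform.parse.XFormParser;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;

public class FormParseBenchmark {
    @State(Scope.Benchmark)
    public static class FormState {
        String formXml;

        @Setup
        public void setUp() throws IOException {
            // Preparation only: reading the file is NOT part of the measurement.
            formXml = new String(Files.readAllBytes(
                    Paths.get("nigeria_wards_external.xml"))); // hypothetical asset
        }
    }

    @Benchmark
    public FormDef parseForm(FormState state) {
        // The single measured action: parsing the form, nothing else.
        return new XFormParser(new StringReader(state.formXml)).parse();
    }
}
```

Returning the FormDef from the benchmark method also lets JMH consume the result, which keeps the JIT from optimizing the parse away.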