Sampling without replacement

chrissyhroberts · November 14, 2018, 1:25am

This refers to this post which asked if there was a way to sample without replacement in ODK.

The basic requirement was to select a random sample of n entities from a list p long, without replacement.

i.e. if you had these items in the list

A, B, C, D, E, F

and you wanted a random sample of 3 items without replacement, then an acceptable sample would be

A, D, E
or
E, F, B

but not

A, E, A, in which A is put back into the list after being sampled the first time.

The previous post suggested that this might become a feature of ODK, but it is unclear whether that has been finished. In the meantime, I had a look at a quasi-random solution that uses only the random() and pulldata() commands that are already built in to ODK. This is pretty sketchy and I would like to see if anyone has a more elegant way to achieve the same end.

First we need to make an external CSV file that contains an array of random sequences of numbers, each sampled from a larger set without replacement. In the example below I've sampled 10000 sequences, each with 10 numbers between 1 and 50 (inclusive), but this could also be sampling of strings or logicals. The larger the number of sequences, the more random the system becomes, but 10000 should be good enough for most real world purposes such as clinical trial randomisations.

The following code (in R) will generate a CSV file with the required name_key structure to allow these sequences to be pulled into ODK using the pulldata() command.

#make a data frame with 10,000 rows
a<-as.data.frame(1:10000)

#Change the header to name_key
names(a)<-"name_key"

#create columns to house random samples
a[,2:11]<-NA

#populate columns with randomly sampled data (here 10 columns, each with a number between 1 and 50 without replacement)
for(i in 1:nrow(a)){a[i,2:11]<-sample(size = 10,replace = F,x = 1:50)}

#save a CSV file
write.csv(a,file = "randomer.csv",row.names = F)
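
Before attaching the file to a form, it may be worth checking that every row really is a sample without replacement. The following is only a minimal sketch and not part of the original workflow. It rebuilds a small table the same way as the code above (100 rows instead of 10,000, an arbitrary choice to keep it quick), and the names tbl and samples are illustrative.

```r
# Build a small table the same way as above (100 rows is an arbitrary choice)
tbl <- as.data.frame(1:100)
names(tbl) <- "name_key"
tbl[, 2:11] <- NA
for (i in 1:nrow(tbl)) {
  tbl[i, 2:11] <- sample(x = 1:50, size = 10, replace = FALSE)
}

# Pull out the sample columns as a matrix for checking
samples <- as.matrix(tbl[, 2:11])

# anyDuplicated() returns 0 when a vector has no repeated values
no_repeats <- all(apply(samples, 1, anyDuplicated) == 0)

# Every value should lie in the population 1..50
in_range <- all(samples >= 1 & samples <= 50)

# name_key should be unique so that each lookup key points at a single row
unique_keys <- !any(duplicated(tbl$name_key))

# Stops with an error if any of the three checks fails
stopifnot(no_repeats, in_range, unique_keys)
```

If stopifnot() returns silently, the table has the structure the form expects.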

Then we need an XLSForm design that uses this table.

The rnd variable simply generates a random integer from 1 to 10000.
The pulldata commands on the subsequent lines then use the random number from rnd to access the matching line in the CSV file.
Adding more lines here (I called them randomperson1, randomperson2 and so on) would extend the length of the random sample you get (in the example you could go up to ten, but there's no limit on this if you generate more columns).

Convert this XLS to XML and upload it to Aggregate with the CSV file attached, and it should work.

type	name	label	calculation
calculate	rnd		once(int(10000*random())+1)
note	note_rnd	The random number is ${rnd}
calculate	randomperson1		pulldata('randomer', 'V2', 'name_key', ${rnd})
calculate	randomperson2		pulldata('randomer', 'V3', 'name_key', ${rnd})
calculate	randomperson3		pulldata('randomer', 'V4', 'name_key', ${rnd})
calculate	randomperson4		pulldata('randomer', 'V5', 'name_key', ${rnd})
calculate	randomperson5		pulldata('randomer', 'V6', 'name_key', ${rnd})
note	note_1	The first person is ${randomperson1}
note	note_2	The second person is ${randomperson2}
note	note_3	The third person is ${randomperson3}
note	note_4	The fourth person is ${randomperson4}
note	note_5	The fifth person is ${randomperson5}
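
To see what the form does with the table, the lookup can be imitated in R. This is only an illustration of the logic, not how ODK evaluates the form. The small in-memory table stands in for randomer.csv, runif() plays the role of random(), and the names n_rows, draws and picked are illustrative.

```r
# A small stand-in for randomer.csv (20 rows instead of 10,000)
n_rows <- 20
draws <- t(replicate(n_rows, sample(1:50, size = 10, replace = FALSE)))
randomer <- data.frame(name_key = 1:n_rows, draws)
names(randomer)[-1] <- paste0("V", 2:11)

# Equivalent of int(n_rows * random()) + 1: an integer from 1 to n_rows
rnd <- floor(n_rows * runif(1)) + 1

# Equivalent of the five pulldata() calls: columns V2..V6 of the matching row
row <- randomer[randomer$name_key == rnd, ]
picked <- unlist(row[, paste0("V", 2:6)])
print(picked)
```

Because all five values come from the same pre-generated row, they cannot repeat, which is what gives the form its sample without replacement.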

Example Here

sample_no_replacement.xml (3.0 KB)
sample_no_replacement.xlsx (10.0 KB)
randomer.csv (323.0 KB)

Ebrahim · September 27, 2019, 6:53am

Thanks for the solution.

Olimpia · February 11, 2021, 2:44pm

Hello,

Thank you very much for this solution, it is very helpful.

Your code is working great and I just wanted to ask for your help in trying to further solve my problem.

From my understanding, your code is creating sequences with numbers between 1 and 50. For my survey, I would like to sample 40 numbers without replacement between 1 and N, where N is the total number of households, which is entered in a previous question in ODK. Do you know a way to extract this number entered in ODK and put it in R to then generate the random numbers?

Any help would be greatly appreciated.

Many thanks,
Olimpia

chrissyhroberts · February 16, 2021, 3:35pm

Hi @Olimpia
I am not entirely sure I understand your question, so might need a little more guidance from you on what you want to do.

The R code above creates the random number table randomer.csv that ODK uses to get a quasi-random assignment.

This bit
sample(size = 10,replace = F,x = 1:50)
defines what that table looks like.

I'll take a guess though...
Let's say that you have n = 253 households.
You probably want to change these lines in the R code:

#create 40 columns to house random samples
a[,2:41]<-NA
#populate columns with randomly sampled data (here 40 columns, each with a number between 1 and n without replacement)
for(i in 1:nrow(a)){a[i,2:41]<-sample(size = 40,replace = F,x = 1:253)}
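
If the number of households or the sample size changes from village to village, the same steps could be wrapped in a function. This is a minimal sketch: make_sample_table() and its argument names are hypothetical and not part of any package.

```r
# Hypothetical helper: n = population size, k = sample size per row,
# n_rows = number of pre-generated sequences in the table
make_sample_table <- function(n, k, n_rows = 10000) {
  # Sampling without replacement is only possible when k <= n
  stopifnot(k >= 1, k <= n)
  # Each replicate is one draw of k distinct integers from 1..n;
  # filling by row puts one draw on each row of the matrix
  draws <- matrix(replicate(n_rows, sample.int(n, size = k)),
                  nrow = n_rows, byrow = TRUE)
  out <- data.frame(name_key = seq_len(n_rows), draws)
  # Name the sample columns V2, V3, ... to match the pulldata() calls
  names(out)[-1] <- paste0("V", 2:(k + 1))
  out
}

# 40 household numbers per row, drawn from 253 households
tbl <- make_sample_table(n = 253, k = 40)
write.csv(tbl, file = "randomer.csv", row.names = FALSE)
```

The form would then need 40 pulldata() lines, one for each of the columns V2 to V41.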

Olimpia · February 18, 2021, 5:38pm

Dear @chrissyhroberts,

Apologies for not being very clear in my previous response and thank you very much for your reply.

I will try to explain my current issue again.

In question 1, we ask for the number of households in the village. Following this question, I would like to insert a 'calculation question' that will randomly select 40 numbers without replacement between 1 and N, where N is the total number of households. In other words, compared to the example you kindly reported in your response, I unfortunately won't know that n = 253 in advance; n will be entered in ODK in a previous question. So I was wondering if there is a way to take a number n previously entered in ODK and use it to create the table that ODK then uses to get a quasi-random assignment.

I also have another, more practical, question. We will be using this survey on ODK Collect to collect data during fieldwork in Sub-Saharan Africa. Practically, how can the R code be run each time we collect data in a village? I can see how this can be done right now since I have RStudio on my laptop, but I don't understand how this can be implemented when collecting data in the field with ODK Collect.

Thank you very very much for your help and stay safe
Olimpia

chrissyhroberts · March 15, 2021, 11:38am

Hi @Olimpia

Again, sorry for the long delay in responding.

Having done this kind of stuff before, I would say that rather than try to do it all in one step, you could make things easier for yourself by either

(1) doing a rapid survey of the village a few days before. This would allow you to do a full count (±map) survey of households [I don't think asking one person to estimate this during a survey question is a very good idea]. Then when you go back to base, you can write the code for the sampling without replacement based on an empirical measure of n and finally return to the study village fully equipped to do it as described in the messages above.

(2) using Google Maps to build a map of households visible from space, then either basing the count on that before using ODK to randomise, or otherwise simply assigning numbers to each household on the map and doing your random sampling a priori.
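
For the a priori route in (2), once each household on the map has a number, the draw itself is a single call in R. This is a sketch that assumes 253 mapped households and a sample of 40 (the figures used earlier in the thread); the seed value is arbitrary and only there so the draw can be reproduced.

```r
# Arbitrary seed so that the same selection can be regenerated later
set.seed(20210315)

# Numbers assigned to the households on the map (253 is an assumed count)
household_ids <- 1:253

# Draw 40 distinct household numbers and sort them for easier field use
selected <- sort(sample(household_ids, size = 40, replace = FALSE))
print(selected)
```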

For deployment of automatic R processes, you may find this paper useful.

It's not super simple, but there are some links in the manuscript to code examples that can help you set up some automatic processes.