Sampling of the rows of data is not uniform #133

kwstat · 2021-04-01T21:41:42Z

In this function, k is a vector of row indexes identifying the sampled rows of the data. Currently:
k <- round(runif(n, 1, nrow(data)))
However, this does NOT sample rows with equal probability. For example:

table(round(runif(10000, 1, 10)))
#   1    2    3    4    5    6    7    8    9   10 
# 532 1083 1138 1087 1116 1109 1111 1133 1132  559

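The first and last rows of the data are sampled only half as often as the other rows. This is because round() maps only the interval [1, 1.5) to 1 and only [9.5, 10] to 10, while every other index receives a full unit-wide interval.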

The proposed fix samples all rows with equal probability:

table(sample(1:10, 10000, replace=TRUE))
#    1    2    3    4    5    6    7    8    9   10 
# 1032  975 1020 1021  962 1009 1064  949  962 1006
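
As a rough sketch of how the fix might look inside a function, the helper below draws row indexes uniformly. The name sample_rows and the use of mtcars are illustrative assumptions, not code from this package; sample.int(N, ...) is base R and draws from 1..N just as sample(1:N, ...) does.

```r
# Hypothetical helper (name is an assumption, not part of the package):
# draw n row indexes from `data`, each row with probability 1 / nrow(data).
sample_rows <- function(data, n) {
  sample.int(nrow(data), size = n, replace = TRUE)
}

set.seed(1)                      # only to make the illustration repeatable
k <- sample_rows(mtcars, 10000)  # mtcars is a built-in 32-row data set
range(k)                         # indexes stay within 1..nrow(mtcars)
table(k)                         # counts should be roughly equal across rows
```

Unlike the round(runif(...)) version, the first and last rows get no special treatment here, so their counts should be in line with all the others.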

