Sampling of the rows of data is not uniform #133

Open: kwstat wants to merge 1 commit into base: master

kwstat commented Apr 1, 2021

In this function, `k` is a vector of row indices identifying the sampled rows of the data. Currently:
    k <- round(runif(n, 1, nrow(data)))
However, this does NOT sample the rows with equal probability. For example:
```
table(round(runif(10000, 1, 10)))
#   1    2    3    4    5    6    7    8    9   10 
# 532 1083 1138 1087 1116 1109 1111 1133 1132  559
```
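The first and last rows of the data are sampled only about half as often as the other rows. Rounding assigns each interior row a full unit interval of the uniform draw, but only values below 1.5 round to 1 and only values above 9.5 round to 10, so each endpoint gets an interval of half the width.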
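
The imbalance can also be computed exactly rather than estimated from a simulation. The following is a minimal sketch; `n_rows` is a hypothetical stand-in for `nrow(data)`, and ties at exactly `.5` are ignored because they occur with probability zero.

```r
# Exact probability that round(runif(1, 1, n_rows)) returns each index.
# n_rows is a hypothetical stand-in for nrow(data).
n_rows <- 10
idx <- 1:n_rows

# Index j is returned when the draw lands in [j - 0.5, j + 0.5),
# clipped to the sampling range [1, n_rows].
lower <- pmax(idx - 0.5, 1)
upper <- pmin(idx + 0.5, n_rows)

# The draw is uniform on an interval of length n_rows - 1.
p_round <- (upper - lower) / (n_rows - 1)

# Endpoints get 0.5/9 (about 0.0556); interior rows get 1/9 (about 0.1111).
print(round(p_round, 4))
```

With 10000 draws this predicts roughly 556 hits for rows 1 and 10 and roughly 1111 for each of the others, which agrees with the table above.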

The proposed fix samples all rows with equal probability:
```
table(sample(1:10, 10000, replace=TRUE))
#    1    2    3    4    5    6    7    8    9   10 
# 1032  975 1020 1021  962 1009 1064  949  962 1006
```
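
Applied to the line quoted at the top, the change might look like the sketch below. The exact replacement used in the commit is not shown in this description, so this form is an assumption, and `data` and `n` here are toy stand-ins for the function's inputs.

```r
# Toy stand-ins for the function's inputs (hypothetical values).
data <- data.frame(x = 1:10)
n <- 10000

# Draw n row indices uniformly, with replacement, from 1..nrow(data).
# For this purpose sample.int(N, ...) behaves like sample(1:N, ...).
k <- sample.int(nrow(data), n, replace = TRUE)

# Each row should now appear roughly n / nrow(data) times.
print(table(k))
```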