Selecting random samples

Use the sample() method to draw a random selection of rows or columns from a Series, DataFrame, or Panel. By default the method samples rows. It accepts either a specific number of rows/columns to return or a fraction of rows.

    In [111]: s = pd.Series([0, 1, 2, 3, 4, 5])

    # When no arguments are passed, returns 1 row.
    In [112]: s.sample()
    Out[112]:
    4    4
    dtype: int64

    # One may specify either a number of rows:
    In [113]: s.sample(n=3)
    Out[113]:
    0    0
    4    4
    1    1
    dtype: int64

    # Or a fraction of the rows:
    In [114]: s.sample(frac=0.5)
    Out[114]:
    5    5
    3    3
    1    1
    dtype: int64
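The relationship between n and frac can be checked with a short script. This is a minimal sketch, not part of the original session: the variable names are illustrative, and the note that passing both n and frac raises a ValueError reflects pandas behavior as I understand it rather than anything stated above.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# With no arguments, sample() returns exactly one element.
one = s.sample()

# n and frac are alternatives: frac=0.5 of 6 rows gives 3 rows.
by_count = s.sample(n=3)
by_fraction = s.sample(frac=0.5)
print(len(one), len(by_count), len(by_fraction))  # 1 3 3

# Assumption: supplying both n and frac is rejected with a ValueError.
try:
    s.sample(n=3, frac=0.5)
    both_rejected = False
except ValueError:
    both_rejected = True
print(both_rejected)
```

The values drawn differ from run to run, so only the lengths are stable enough to check.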

By default, sample will return each row at most once, but one can also sample with replacement using the replace option:

    In [115]: s = pd.Series([0, 1, 2, 3, 4, 5])

    # Without replacement (default):
    In [116]: s.sample(n=6, replace=False)
    Out[116]:
    0    0
    1    1
    5    5
    3    3
    2    2
    4    4
    dtype: int64

    # With replacement:
    In [117]: s.sample(n=6, replace=True)
    Out[117]:
    0    0
    4    4
    3    3
    2    2
    4    4
    4    4
    dtype: int64
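One practical consequence of the replace option is how large a sample can be. The sketch below is illustrative only (the variable names are hypothetical). It assumes that requesting more rows than exist without replacement raises a ValueError, which is how pandas behaves to my knowledge.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# Without replacement, drawing all six rows is just a shuffle:
# every index label appears exactly once.
shuffled = s.sample(n=6, replace=False)
print(shuffled.index.is_unique)  # True

# With replacement, the sample may be larger than the object itself,
# so some rows necessarily repeat.
oversampled = s.sample(n=10, replace=True)
print(len(oversampled))  # 10

# Assumption: without replacement, n cannot exceed the number of rows.
try:
    s.sample(n=10, replace=False)
    too_large_rejected = False
except ValueError:
    too_large_rejected = True
print(too_large_rejected)
```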

By default, each row has an equal probability of being selected. To give rows different probabilities, pass sampling weights to sample through the weights argument. The weights can be a list, a NumPy array, or a Series, but they must be the same length as the object being sampled. Missing values are treated as a weight of zero, and inf values are not allowed. If the weights do not sum to 1, they are re-normalized by dividing each weight by the sum of all weights. For example:

    In [118]: s = pd.Series([0, 1, 2, 3, 4, 5])

    In [119]: example_weights = [0, 0, 0.2, 0.2, 0.2, 0.4]

    In [120]: s.sample(n=3, weights=example_weights)
    Out[120]:
    5    5
    4    4
    3    3
    dtype: int64

    # Weights will be re-normalized automatically
    In [121]: example_weights2 = [0.5, 0, 0, 0, 0, 0]

    In [122]: s.sample(n=1, weights=example_weights2)
    Out[122]:
    0    0
    dtype: int64
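To make the re-normalization concrete, the sketch below computes the normalized weights by hand and then draws a large weighted sample. The weights, sample size, and seed are arbitrary illustrative choices, not values from the original session.

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4, 5])

# Hypothetical weights that sum to 10 rather than 1.
raw_weights = np.array([0, 0, 2, 2, 2, 4], dtype=float)

# Re-normalization as described above: divide each weight by the total.
normalized = raw_weights / raw_weights.sum()
print(normalized.tolist())  # [0.0, 0.0, 0.2, 0.2, 0.2, 0.4]

# Draw many rows with replacement. Rows with weight zero (values 0 and 1)
# should never appear, whatever the seed.
draws = s.sample(n=1000, replace=True, weights=raw_weights, random_state=0)
drawn_values = set(draws.tolist())
print(drawn_values <= {2, 3, 4, 5})  # True
```

Because sample performs the same division internally, passing raw_weights or normalized should describe the same selection probabilities.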

When applied to a DataFrame, you can use a column of the DataFrame as sampling weights (provided you are sampling rows and not columns) by simply passing the name of the column as a string.

    In [123]: df2 = pd.DataFrame({'col1': [9, 8, 7, 6], 'weight_column': [0.5, 0.4, 0.1, 0]})

    In [124]: df2.sample(n=3, weights='weight_column')
    Out[124]:
       col1  weight_column
    1     8            0.4
    0     9            0.5
    2     7            0.1
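In this example the last row has weight zero, so a sample of three rows without replacement can only contain rows 0, 1, and 2. The sketch below checks that, and also compares the column-name form with passing the column as a Series. The seed value is an arbitrary choice for illustration.

```python
import pandas as pd

df2 = pd.DataFrame({"col1": [9, 8, 7, 6],
                    "weight_column": [0.5, 0.4, 0.1, 0]})

# Row 3 has weight 0, so it can never be drawn; only the order varies.
picked = df2.sample(n=3, weights="weight_column")
print(sorted(picked.index.tolist()))  # [0, 1, 2]

# Passing the column name is shorthand for passing the column itself.
# With the same seed, both calls should select the same rows.
by_name = df2.sample(n=3, weights="weight_column", random_state=1)
by_series = df2.sample(n=3, weights=df2["weight_column"], random_state=1)
print(by_name.equals(by_series))  # True
```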

sample also allows users to sample columns instead of rows using the axis argument.

    In [125]: df3 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

    In [126]: df3.sample(n=1, axis=1)
    Out[126]:
       col1
    0     1
    1     2
    2     3
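A quick way to confirm that columns, not rows, were sampled is to inspect the shape of the result. This is a small sketch with illustrative variable names. The axis="columns" spelling is an alias that pandas accepts alongside axis=1.

```python
import pandas as pd

df3 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})

# Sampling along axis=1 keeps every row and picks one column at random.
one_col = df3.sample(n=1, axis=1)
print(one_col.shape)  # (3, 1)

# The axis can also be given by name.
one_col_named = df3.sample(n=1, axis="columns")
print(one_col_named.shape)  # (3, 1)

# For comparison, the default (axis=0) keeps every column and picks one row.
one_row = df3.sample(n=1)
print(one_row.shape)  # (1, 2)
```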

Finally, one can also set a seed for sample’s random number generator using the random_state argument, which will accept either an integer (as a seed) or a NumPy RandomState object.

    In [127]: df4 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})

    # With a given seed, the sample will always draw the same rows.
    In [128]: df4.sample(n=2, random_state=2)
    Out[128]:
       col1  col2
    2     3     4
    1     2     3

    In [129]: df4.sample(n=2, random_state=2)
    Out[129]:
       col1  col2
    2     3     4
    1     2     3
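The session above uses an integer seed. The sketch below also exercises the RandomState form. It is an illustration, not part of the original session. The point to note is that a RandomState object carries state between calls, so reusing one object does not repeat the draw, whereas a freshly seeded object does.

```python
import numpy as np
import pandas as pd

df4 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})

# Same integer seed, same rows.
a = df4.sample(n=2, random_state=2)
b = df4.sample(n=2, random_state=2)
print(a.equals(b))  # True

# A RandomState object advances each time it is used, so a second call
# with the same object is not guaranteed to return the same rows.
rng = np.random.RandomState(2)
first = df4.sample(n=2, random_state=rng)
second = df4.sample(n=2, random_state=rng)

# A freshly constructed RandomState with the same seed reproduces the first draw.
again = df4.sample(n=2, random_state=np.random.RandomState(2))
print(first.equals(again))  # True
```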