2.2.1 ndarray的一生 - 2.2.1.4 索引体系：步幅 - 《SciPy Lecture Notes 中文版（Python科学计算生态的介绍的中文翻译）》

2.2.1.4 索引体系：步幅

2.2.1.4 索引体系：步幅

2.2.1.4.1 主要观点

问题

In [28]:

x = np.array([[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]], dtype=np.int8)
str(x.data)

Out[28]:

'\x01\x02\x03\x04\x05\x06\x07\x08\t'

item x[1,2]开始在x.data中的哪个字节？

答案（在Numpy）

步幅：寻找一下个元素跳跃的字节数
每个维度一个步幅

In [29]:

x.strides

Out[29]:

(3, 1)

In [31]:

byte_offset = 3*1 + 1*2   # 查找x[1,2]
x.data[byte_offset]

Out[31]:

'\x06'

In [32]:

x[1, 2]

Out[32]:

简单、灵活

2.2.1.4.1.1 C和Fortran顺序

In [34]:

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=np.int16, order='C')
x.strides

Out[34]:

(6, 2)

In [35]:

str(x.data)

Out[35]:

'\x01\x00\x02\x00\x03\x00\x04\x00\x05\x00\x06\x00\x07\x00\x08\x00\t\x00'

需要跳跃6个字节寻找下一行
需要跳跃2个字节寻找下一列

In [36]:

y = np.array(x, order='F')
y.strides

Out[36]:

(2, 6)

In [37]:

str(y.data)

Out[37]:

'\x01\x00\x04\x00\x07\x00\x02\x00\x05\x00\x08\x00\x03\x00\x06\x00\t\x00'

需要跳跃2个字节寻找下一行
需要跳跃6个字节寻找下一列

更高维度也类似：

- C：最后的维度变化最快（=最小的步幅）
- F：最早的维度变化最快

注意：现在我们可以理解.view()的行为：

In [38]:

y = np.array([[1, 3], [2, 4]], dtype=np.uint8).transpose()
x = y.copy()

变换顺序不影响数据的内部布局，只是步幅

In [39]:

x.strides

Out[39]:

(2, 1)

In [40]:

y.strides

Out[40]:

(1, 2)

In [41]:

str(x.data)

Out[41]:

'\x01\x02\x03\x04'

In [42]:

str(y.data)

Out[42]:

'\x01\x03\x02\x04'

当解释为int16时结果会不同
.copy()以C顺序（默认）创建新的数组

2.2.1.4.1.2 用整数切片

通过仅改变形状、步幅和可能调整数据指针可以代表任何东西！
不用制造数据的副本

In [43]:

x = np.array([1, 2, 3, 4, 5, 6], dtype=np.int32)
y = x[::-1]
y

Out[43]:

array([6, 5, 4, 3, 2, 1], dtype=int32)

In [44]:

y.strides

Out[44]:

(-4,)

In [45]:

y = x[2:]
y.__array_interface__['data'][0] - x.__array_interface__['data'][0]

Out[45]:

In [46]:

x = np.zeros((10, 10, 10), dtype=np.float)
x.strides

Out[46]:

(800, 80, 8)

In [47]:

x[::2,::3,::4].strides

Out[47]:

(1600, 240, 32)

类似的，变换顺序绝不会创建副本（只是交换的步幅）

In [48]:

x = np.zeros((10, 10, 10), dtype=np.float)
x.strides

Out[48]:

(800, 80, 8)

In [49]:

x.T.strides

Out[49]:

(8, 80, 800)

但是：并不是所有的重排操作都可以通过操纵步幅来完成。

In [3]:

a = np.arange(6, dtype=np.int8).reshape(3, 2)
b = a.T
b.strides

Out[3]:

(1, 2)

到目前为止，都很不错，但是：

In [4]:

str(a.data)

Out[4]:

'\x00\x01\x02\x03\x04\x05'

In [5]:

Out[5]:

array([[0, 2, 4],
       [1, 3, 5]], dtype=int8)

In [6]:

c = b.reshape(3*2)
c

Out[6]:

array([0, 2, 4, 1, 3, 5], dtype=int8)

这里，没办法用一个给定的步长和a的内存块来表示数组c。因此，重排操作在这里需要制作一个副本。

2.2.1.4.2 例子：用步长伪造维度

步长操作

In [2]:

from numpy.lib.stride_tricks import as_strided
help(as_strided)

Help on function as_strided in module numpy.lib.stride_tricks:
as_strided(x, shape=None, strides=None)
    Make an ndarray from the given array with the given shape and strides.

警告：as_strided并不检查你是否还待在内存块边界里..

In [9]:

x = np.array([1, 2, 3, 4], dtype=np.int16)
as_strided(x, strides=(2*2, ), shape=(2, ))

Out[9]:

array([1, 3], dtype=int16)

In [10]:

x[::2]

Out[10]:

array([1, 3], dtype=int16)

也可以看一下：stride-fakedims.py

练习

In [ ]:

array([1, 2, 3, 4], dtype=np.int8)
-> array([[1, 2, 3, 4],
          [1, 2, 3, 4],
          [1, 2, 3, 4]], dtype=np.int8)

仅使用as_strided.:

提示：byte_offset = stride[0]_index[0] + stride[1]_index[1] + …

解密：

步长可以设置为0：

In [11]:

x = np.array([1, 2, 3, 4], dtype=np.int8)
y = as_strided(x, strides=(0, 1), shape=(3, 4))
y

Out[11]:

array([[1, 2, 3, 4],
       [1, 2, 3, 4],
       [1, 2, 3, 4]], dtype=int8)

In [12]:

y.base.base is x

Out[12]:

True

2.2.1.4.3 广播

用它来做一些有用的事情：[1, 2, 3, 4]和[5, 6, 7]的外积

In [13]:

x = np.array([1, 2, 3, 4], dtype=np.int16)
x2 = as_strided(x, strides=(0, 1*2), shape=(3, 4))
x2

Out[13]:

array([[1, 2, 3, 4],
       [1, 2, 3, 4],
       [1, 2, 3, 4]], dtype=int16)

In [14]:

y = np.array([5, 6, 7], dtype=np.int16)
y2 = as_strided(y, strides=(1*2, 0), shape=(3, 4))
y2

Out[14]:

array([[5, 5, 5, 5],
       [6, 6, 6, 6],
       [7, 7, 7, 7]], dtype=int16)

In [15]:

x2 * y2

Out[15]:

array([[ 5, 10, 15, 20],
       [ 6, 12, 18, 24],
       [ 7, 14, 21, 28]], dtype=int16)

…看起来有一些熟悉…

In [16]:

x = np.array([1, 2, 3, 4], dtype=np.int16)
y = np.array([5, 6, 7], dtype=np.int16)
x[np.newaxis,:] * y[:,np.newaxis]

Out[16]:

array([[ 5, 10, 15, 20],
       [ 6, 12, 18, 24],
       [ 7, 14, 21, 28]], dtype=int16)

在内部，数组广播的确使用0步长来实现的。

2.2.1.4.4 更多技巧：对角线

也可以看一下 stride-diagonals.py

挑战

提取矩阵对角线的起点：（假定是C内存顺序）：

In [ ]:

x = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=np.int32)
x_diag = as_strided(x, shape=(3,), strides=(???,))

提取第一个超级-对角线的起点[2,6]。
那么子对角线呢？

（后两个问题的提示：切片首先移动步长起点的点。）

答案

…

提取对角线：

In [6]:

x_diag = as_strided(x, shape=(3, ), strides=((3+1)*x.itemsize, ))
x_diag

Out[6]:

array([1, 5, 9], dtype=int32)

首先切片，调整数据指针：

In [8]:

as_strided(x[0, 1:], shape=(2, ), strides=((3+1)*x.itemsize, ))

Out[8]:

array([2, 6], dtype=int32)

In [9]:

as_strided(x[1:, 0], shape=(2, ), strides=((3+1)*x.itemsize, ))

Out[9]:

array([4, 8], dtype=int32)

笔记

In [7]:

y = np.diag(x, k=1)
y

Out[7]:

array([2, 6], dtype=int32)

但是

In [8]:

y.flags.owndata

Out[8]:

False

这是一个副本？！

也可以看一下stride-diagonals.py

挑战

计算张量的迹：

In [9]:

x = np.arange(5*5*5*5).reshape(5,5,5,5)
s = 0
for i in xrange(5):
   for j in xrange(5):
      s += x[j,i,j,i]

通过跨越并且在结果上使用sum()。

In [ ]:

y = as_strided(x, shape=(5, 5), strides=(TODO, TODO))   
s2 = ...   
assert s == s2

答案

…

In [ ]:

y = as_strided(x, shape=(5, 5), strides=((5*5*5 + 5)*x.itemsize,
                                         (5*5 + 1)*x.itemsize))
s2 = y.sum()

2.2.1.4.5 CPU缓存效果

内存布局可以影响性能：

In [13]:

x = np.zeros((20000,))
y = np.zeros((20000*67,))[::67]
x.shape, y.shape

Out[13]:

((20000,), (20000,))

In [14]:

%timeit x.sum()

The slowest run took 20.69 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 15.4 µs per loop

In [15]:

%timeit y.sum()

The slowest run took 114.83 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 53 µs per loop

In [16]:

x.strides, y.strides

Out[16]:

((8,), (536,))

小步长更快？

CPU从主内存中拉取数据到缓存块 pulls data from main memory to its cache in blocks
如果需要数据项连续操作适应于一个内存块（小步长）：
- 需要更少的迁移
- 更快

也可以看一下：numexpr设计用来减轻数组计算时的缓存效果。

2.2.1.4.6 例子：原地操作（买者当心）

有时，

In [ ]:

a -= b

并不等同于

In [ ]:

a -= b.copy()

In [21]:

x = np.array([[1, 2], [3, 4]])
x -= x.transpose()
x

Out[21]:

array([[ 0, -1],
       [ 1,  0]])

In [22]:

y = np.array([[1, 2], [3, 4]])
y -= y.T.copy()
y

Out[22]:

array([[ 0, -1],
       [ 1,  0]])

x和x.transpose()共享数据
x -= x.transpose()逐个元素修改数据…
因为x和x.transpose()步长不同，修改后的数据重新出现在RHS