N Body 问题之 CUDA 计算

行星运动我们都是按万有引力定律算的，实际上按照爱因斯坦的说法，万有引力不是惯性力是时空弯曲的效果，但是按照爱因斯坦的公式算起来特别复杂，所以现在航天轨道的设计还是按万有引力定律在算。

牛顿万有引力定律

牛顿万有引力定律：任意两个质点由通过连心线方向上的力相互吸引。该吸引力的大小与它们的质量乘积成正比，与它们距离的平方成反比，G 是万有引力常数。

$$ F=-\frac{G M \cdot m}{r^2} \cdot \frac{\vec{r}}{r} $$

多体问题

多体问题，是所有吸引力的叠加，这里排除物体相撞的情况，因为宇宙非常大，两个天体之间的距离非常远，暂时不考虑这种情况，通过加入一个小常量可以在计算上忽略这个问题。

$$ F=-\frac{G M m r}{(r^2 + \epsilon^2)^{\frac{3}{2}}} $$

这样累加的时候自己也可以加进去，因为 r 是 0，那一项相当于没算。

$$ F_j= \sum_{i=1}^{n} F_i $$

对于 N body 有很多的稍微近似的算法，多是基于实际的物理场景的，比如大部分天体是属于一个星系的，每个星系之间都非常远，所以可以将一些星系作为整体计算，构建一颗树，树中的某些节点是其所有叶子的质心，这样就可以减少计算量，但是放到 GPU 的场景下一个 O(N^2) 的暴力算法利用 GPU 的并行能力可以非常快的算出来。

Python 暴力解

下面是 N-Body 暴力实现的例子，网上有个各种语言实验的，比较好用来测试准确性，来自这里，就是两层循环。

cat <<EOF >> nbody.txt
0.01 3 20
1
0 0 0
0.01 0 0
0.1
1 1 0
0 0 0.02
0.001
0 1 1
0.01 -0.01 -0.01
EOF

import math
 
class Vector:
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
 
    def __add__(self, other):
        return Vector(self.x + other.x, self.y + other.y, self.z + other.z)
 
    def __sub__(self, other):
        return Vector(self.x - other.x, self.y - other.y, self.z - other.z)
 
    def __mul__(self, other):
        return Vector(self.x * other, self.y * other, self.z * other)
 
    def __div__(self, other):
        return Vector(self.x / other, self.y / other, self.z / other)
 
    def __eq__(self, other):
        if isinstance(other, Vector):
            return self.x == other.x and self.y == other.y and self.z == other.z
        return False
 
    def __ne__(self, other):
        return not self.__eq__(other)
 
    def __str__(self):
        return '({x}, {y}, {z})'.format(x=self.x, y=self.y, z=self.z)
 
    def abs(self):
        return math.sqrt(self.x*self.x + self.y*self.y + self.z*self.z)
 
origin = Vector(0, 0, 0)
 
class NBody:
    def __init__(self, fileName):
        with open(fileName, "r") as fh:
            lines = fh.readlines()
            gbt = lines[0].split()
            self.gc = float(gbt[0])
            self.bodies = int(gbt[1])
            self.timeSteps = int(gbt[2])
            self.masses = [0.0 for i in range(self.bodies)]
            self.positions = [origin for i in range(self.bodies)]
            self.velocities = [origin for i in range(self.bodies)]
            self.accelerations = [origin for i in range(self.bodies)]
            for i in range(self.bodies):
                self.masses[i] = float(lines[i*3 + 1])
                self.positions[i] = self.__decompose(lines[i*3 + 2])
                self.velocities[i] = self.__decompose(lines[i*3 + 3])
 
            print("Contents of", fileName)
            for line in lines:
                print(line.rstrip())
            print
            print("Body   :      x          y          z    |")
            print("     vx         vy         vz")
 
    def __decompose(self, line):
        xyz = line.split()
        x = float(xyz[0])
        y = float(xyz[1])
        z = float(xyz[2])
        return Vector(x, y, z)
 
    def __computeAccelerations(self):
        for i in range(self.bodies):
            self.accelerations[i] = origin
            for j in range(self.bodies):
                if i != j:
                    temp = self.gc * self.masses[j] / math.pow((self.positions[i] - self.positions[j]).abs(), 3)
                    self.accelerations[i] += (self.positions[j] - self.positions[i]) * temp
        return None
 
    def __computePositions(self):
        for i in range(self.bodies):
            self.positions[i] += self.velocities[i] + self.accelerations[i] * 0.5
        return None
 
    def __computeVelocities(self):
        for i in range(self.bodies):
            self.velocities[i] += self.accelerations[i]
        return None
 
    def __resolveCollisions(self):
        for i in range(self.bodies):
            for j in range(self.bodies):
                if self.positions[i] == self.positions[j]:
                    (self.velocities[i], self.velocities[j]) = (self.velocities[j], self.velocities[i])
        return None
 
    def simulate(self):
        self.__computeAccelerations()
        self.__computePositions()
        self.__computeVelocities()
        self.__resolveCollisions()
        return None
 
    def printResults(self):
        fmt = "Body %d : % 8.6f  % 8.6f  % 8.6f | % 8.6f  % 8.6f  % 8.6f"
        for i in range(self.bodies):
            print(fmt % (i+1, self.positions[i].x, self.positions[i].y, self.positions[i].z, self.velocities[i].x, self.velocities[i].y, self.velocities[i].z))
        return None
 
nb = NBody("nbody.txt")
for i in range(nb.timeSteps):
    print("\nCycle %d" % (i + 1))
    nb.simulate()
    nb.printResults()

CUDA 并行计算

在 CUDA 中并行化首先要理解 CUDA 的层次结构，CUDA 有 grid 和 block 对线程做划分。首先设计一个 computation tile，相当于一个 block 的计算。

左侧表示一个 block 中的 p 个 body 执行 p 次，最后就能更新所有 body 的加速度，这里的内存占用是 O(p) 的不是 O(p^2)，浅绿色表示的是串行执行的流，官方的文档没有更新，对应 CUDA 10 的例子里是这样的，vec3 存的是 x/y/z 轴的加速度，vec4 存的是坐标加质量，就是计算 i 受到 j 的加速度，这里没有并行设计是串行的。

template <typename T>
__device__ typename vec3<T>::Type
bodyBodyInteraction(typename vec3<T>::Type ai,
                    typename vec4<T>::Type bi,
                    typename vec4<T>::Type bj)
{
    typename vec3<T>::Type r;

    // r_ij  [3 FLOPS]
    r.x = bj.x - bi.x;
    r.y = bj.y - bi.y;
    r.z = bj.z - bi.z;

    // distSqr = dot(r_ij, r_ij) + EPS^2  [6 FLOPS]
    T distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
    distSqr += getSofteningSquared<T>();

    // invDistCube =1/distSqr^(3/2)  [4 FLOPS (2 mul, 1 sqrt, 1 inv)]
    T invDist = rsqrt_T(distSqr);
    T invDistCube =  invDist * invDist * invDist;

    // s = m_j * invDistCube [1 FLOP]
    T s = bj.w * invDistCube;

    // a_i =  a_i + s * r_ij [6 FLOPS]
    ai.x += r.x * s;
    ai.y += r.y * s;
    ai.z += r.z * s;

    return ai;
}

block 的 tile computation 就是串行计算 p 次 p 个 body 的加速度。

for (unsigned int counter = 0; counter < blockDim.x; counter++)
{
    acc = bodyBodyInteraction<T>(acc, bodyPos, sharedPos[counter]);
}

每次计算完了以后要对线程做共享内存的同步

黑实线就是一次 block 共享内存同步的边界，也就是一次 tile compuation，图中是 p 个 body 按时间串行执行的流程，超出 p 的时间计算的不是 p 中的 body，看下面这张图。

也就是按照时间串行的计算每个 body 对一个 body 的加速度，然后 N/p 个 block 并行，每个 block 有 p 个线程并行，没 p 次计算每个 block 要同步一次，但是多个 block 之间不需要同步，所以每个并行线程的主体是这样的要在 tile computation 之间包一层同步的调用 cg::sync(ca)，同步一次 block 中线程的共享内存。

template <typename T>
__device__ typename vec3<T>::Type
computeBodyAccel(typename vec4<T>::Type bodyPos,
                 typename vec4<T>::Type *positions,
                 int numTiles, cg::thread_block cta)
{
    typename vec4<T>::Type *sharedPos = SharedMemory<typename vec4<T>::Type>();

    typename vec3<T>::Type acc = {0.0f, 0.0f, 0.0f};

    for (int tile = 0; tile < numTiles; tile++)
    {
        sharedPos[threadIdx.x] = positions[tile * blockDim.x + threadIdx.x];

        cg::sync(cta);
        // This is the "tile_calculation" from the GPUG3 article.
#pragma unroll 128

        for (unsigned int counter = 0; counter < blockDim.x; counter++)
        {
            acc = bodyBodyInteraction<T>(acc, bodyPos, sharedPos[counter]);
        }

        cg::sync(cta);
    }

    return acc;
}

从 GPU 的角度来看，时间复杂度是 O(N)，空间复杂度也是 O(N)，但前提是要开 N 个线程同时计算。

参考资料

n body nvidia

O(N) 算法计算天体运动

n body problem