
# Course Computing for "Aerospace High Performance Computing"

<center>
<div style='height:2mm;'></div>
<div style="font-family:华文楷体;font-size:14pt;">School of Aeronautics and Astronautics: 印子斐, 胡祎乐</div></center>
<center>
<span style="font-family:华文楷体;font-size:11pt;line-height:9mm">Teaching assistant: 张晨晨 (2021)</span>
</center>
<center>
<span style="font-family:华文楷体;font-size:9pt;line-height:9mm">Graduate course</span>
</center>

<div style="width:66px;float:left; font-family:方正公文黑体;">Intro:</div> 
<div style="overflow:hidden; font-family:华文楷体;">
This is a specialized course for full-time graduate students in SAA; senior undergraduate students may also take it. Its primary goal is to introduce the current state of modern High Performance Computing (HPC) and the basic concepts of computational simulation in fluid mechanics, solid mechanics, and navigation and control. Students systematically learn numerical methods and algorithms for solving practical problems in aerospace engineering, and build the programming skills for solving mechanics problems that their later studies rely on.<br>
The course combines in-class teaching with programming practice. It starts by teaching students how to extract a mathematical model from a practical mechanics problem and set up the corresponding linear system. It then introduces solution techniques for linear systems, together with their applicability and characteristics. Finally, it covers hardware architectures and implementation methods for high performance computing, with a focus on OpenMP, MPI, and CUDA.
</div>

<div style="width:66px;float:left; font-family:方正公文黑体;">Content:</div> 
<div style="overflow:hidden; font-family:华文楷体;">
Students are given access to the HPC center of Shanghai Jiao Tong University to complete their homework and projects. By running their own code on SJTU-HPC, students deepen their understanding of the algorithms and numerical methods used in HPC programming.
</div>

<div style="width:66px;float:left; font-family:方正公文黑体;">Hardware:</div> 
<div style="overflow:hidden; font-family:华文楷体;">ARM Kunpeng 920; GPU V100/32GB</div>

<div style="width:66px;float:left; font-family:方正公文黑体;">Software:</div> 
<div style="overflow:hidden; font-family:华文楷体;">OpenMP, MPI, and CUDA</div>

[TOC]

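## Computing account

The university-level computing platform of Shanghai Jiao Tong University supports research and teaching for all faculty and students.

The teaching account stu is for course experiments only. Each person may use up to 2 ARM nodes (256 CPU cores in total), and a job may run for at most 6 hours. To extend the run time, send the job ID to hpc@sjtu.edu.cn.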

## Logging in to the cluster

### Method 1:

Open the HPC Studio visualization platform in a browser: https://studio.hpc.sjtu.edu.cn/

In the top menu, choose Shell -> CPU Cluster Shell Access

![](https://notes.sjtu.edu.cn/uploads/upload_91cbc65fa4f92509c2f599f17ca34eeb.png)

### Method 2:

Connect over SSH (use a terminal on macOS or Linux; install an SSH client on Windows) and log in to the cluster:

```shell=
ssh username@armlogin.hpc.sjtu.edu.cn
```

For detailed login instructions, see the HPC documentation: https://docs.hpc.sjtu.edu.cn/login

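## Course computing

Run your computations in one of the two modes below. Mode 1 (interactive) suits debugging; Mode 2 (batch jobs) suits production runs of long jobs.

### Mode 1: interactive mode

Request ARM resources and log in to the allocated ARM compute node.

Note:
- `-N` is the number of nodes; choose 1 or 2.
- `-n` is the number of processes. Each ARM node has 128 cores, so with `-N 2`, `-n` can be at most 256.

The following command requests 4 cores: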
```shell=
srun -p arm128c256g -N 1 -n 4 --pty /bin/bash
```

The following command requests 128 cores:
```shell=
srun -p arm128c256g -N 1 -n 128 --pty /bin/bash
```

The following command requests 256 cores:
```shell=
srun -p arm128c256g -N 2 -n 256 --pty /bin/bash
```

Then run interactive software tests or computations on the ARM node:

```shell=
module purge
module load gcc/9.3.0-gcc-4.8.5
```

### Mode 2: submitting an sbatch script

Once debugging is done, submit a job script with sbatch to run the computation.

Example job script (assuming it is named `test.slurm`):

```shell=
#!/bin/bash

#SBATCH --job-name=test
#SBATCH --partition=arm128c256g
#SBATCH -N 1
#SBATCH --ntasks-per-node=128
#SBATCH --output=%j.out
#SBATCH --error=%j.err

module purge
module load gcc/9.3.0-gcc-4.8.5

# Add the command that runs your program below, e.g. ./Test1
```

Submit the job:

```shell=
sbatch test.slurm
```

List jobs that are queued or running:

```shell=
squeue
```

Cancel a job, for example job ID `12345`:

```shell=
scancel 12345
```

Show the accounting record of a job, for example job ID `12345`:

```shell=
sacct -j 12345
```

Show the efficiency of a finished job, for example job ID `12345`:

```shell=
seff 12345
```

Documentation for job commands: https://docs.hpc.sjtu.edu.cn/job/slurm.html

## Visualization

If you need visualization, the following options are available:
* Use HPC Studio in a browser

HPC Studio: https://studio.hpc.sjtu.edu.cn/
Documentation: https://docs.hpc.sjtu.edu.cn/studio/

* Use VS Code

Reference: https://blog.csdn.net/qq_38120851/article/details/107696066

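## Worked experiments
### Experiment 1: a parallel diffusion-equation solver based on OpenMP
#### Problem description
This experiment solves the diffusion equation.

The boundary condition is set to 𝑇0 = 0 and the heat flux coefficient to 𝑄 = 1. The computational domain is simplified to a square with side length 1, discretized directly on a Cartesian grid.

Given the boundary condition and the heat flux coefficient, compute the heat flux distribution in the field.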
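
A minimal sketch of the discretization and matrix assembly step, under stated assumptions. It assumes the steady-state form −∇²T = Q with unit diffusivity and T = T0 on the whole boundary, discretized with the standard 5-point stencil on n × n interior grid points. This form, and the names `CsrMatrix` and `assemble_poisson`, are illustrative assumptions and are not taken from the course material.

```cpp
#include <cassert>
#include <vector>

// CSR storage: row_ptr has rows+1 entries; col_idx/val hold the nonzeros row by row.
struct CsrMatrix {
    int rows = 0;
    std::vector<int> row_ptr;
    std::vector<int> col_idx;
    std::vector<double> val;
};

// Assemble the 5-point stencil for -laplace(T) = Q on the unit square with
// n x n interior grid points and T = T0 on the boundary (assumed form).
// The system is scaled by h^2, so the diagonal is 4 and neighbours are -1.
CsrMatrix assemble_poisson(int n, double Q, double T0, std::vector<double>& rhs) {
    assert(n > 0);
    const double h = 1.0 / (n + 1);
    CsrMatrix A;
    A.rows = n * n;
    A.row_ptr.assign(1, 0);
    rhs.assign(A.rows, Q * h * h);
    // Offsets in ascending column order: south, west, centre, east, north.
    const int di[5] = {0, -1, 0, 1, 0};
    const int dj[5] = {-1, 0, 0, 0, 1};
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            const int row = j * n + i;
            for (int k = 0; k < 5; ++k) {
                const int ii = i + di[k], jj = j + dj[k];
                const bool inside = ii >= 0 && ii < n && jj >= 0 && jj < n;
                if (k == 2) {
                    A.col_idx.push_back(row);
                    A.val.push_back(4.0);
                } else if (inside) {
                    A.col_idx.push_back(jj * n + ii);
                    A.val.push_back(-1.0);
                } else {
                    rhs[row] += T0;  // known boundary value moves to the right-hand side
                }
            }
            A.row_ptr.push_back(static_cast<int>(A.col_idx.size()));
        }
    }
    return A;
}
```

For example, `assemble_poisson(2, 1.0, 0.0, rhs)` yields a 4 x 4 matrix with 12 nonzeros, since each of the four interior points has two interior neighbours plus its own diagonal entry.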
    
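#### Objectives
* Learn to discretize the governing equation and assemble the matrix from the discretized form
* Learn the CSR format for sparse matrices
* Learn to solve linear systems with the Conjugate Gradient (CG) method
* Understand and master the OpenMP fork-join threading model
* Understand and master basic OpenMP constructs such as parallel and for, and clauses such as reduction, private, shared, and default
* Understand data races in thread-level parallelism and how the OpenMP atomic construct resolves them
* Learn how to set the number of OpenMP threads
* Become familiar with OpenMP thread scheduling policies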
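
Several of these objectives can be seen together in a small solver. The following is a minimal sketch, not the course's reference implementation: an unpreconditioned CG loop over a CSR matrix, with `parallel for` on the row loops and a `reduction` clause on the dot product. The function names `spmv`, `dot`, and `conjugate_gradient` are hypothetical. Without `-fopenmp` the pragmas are ignored and the code runs serially.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// y = A x for a CSR matrix; rows are independent, so they are split across threads.
void spmv(const std::vector<int>& row_ptr, const std::vector<int>& col_idx,
          const std::vector<double>& val, const std::vector<double>& x,
          std::vector<double>& y) {
    const int rows = static_cast<int>(row_ptr.size()) - 1;
    #pragma omp parallel for
    for (int i = 0; i < rows; ++i) {
        double sum = 0.0;  // declared inside the loop, hence private to each thread
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k) sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}

// Dot product; reduction(+ : sum) gives each thread a private partial sum and
// combines them at the end, which avoids a data race on sum.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    const int n = static_cast<int>(a.size());
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

// Unpreconditioned conjugate gradient for a symmetric positive definite matrix.
// x holds the initial guess on entry; returns the number of iterations used.
int conjugate_gradient(const std::vector<int>& row_ptr, const std::vector<int>& col_idx,
                       const std::vector<double>& val, const std::vector<double>& b,
                       std::vector<double>& x, double tol, int max_iter) {
    const int n = static_cast<int>(b.size());
    assert(static_cast<int>(row_ptr.size()) == n + 1 && static_cast<int>(x.size()) == n);
    std::vector<double> r(n), p(n), Ap(n);
    spmv(row_ptr, col_idx, val, x, Ap);
    for (int i = 0; i < n; ++i) r[i] = b[i] - Ap[i];
    p = r;
    double rr = dot(r, r);
    int it = 0;
    while (it < max_iter && std::sqrt(rr) > tol) {
        spmv(row_ptr, col_idx, val, p, Ap);
        const double alpha = rr / dot(p, Ap);
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        const double rr_new = dot(r, r);
        const double beta = rr_new / rr;
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rr = rr_new;
        ++it;
    }
    return it;
}
```

The thread count for such a program is commonly set through the `OMP_NUM_THREADS` environment variable before launching the executable.
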
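#### Compiling and submitting the program
* Loading the compiler module
The program is compiled with the gcc compiler, so load the gcc module first:
 `module load gcc/9.3.0-gcc-4.8.5`
* Writing the makefile
The program uses the math library and OpenMP, so compile with -fopenmp and link with -lm. The makefile is as follows: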
```shell=
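# The sources below are .cpp files, so use g++; for C sources use gcc and change *.cpp to *.c
compiler    = g++
cflags      += -fopenmp
# Libraries go after the sources on the link line
ldlibs      = -lm
# Name of the executable
program     = Test1
# Source files live in ./src
src         = $(wildcard ./src/*.cpp)
head        = $(wildcard ./src/*.h)
obj         = $(patsubst %.cpp,%.o,$(src))

$(program): $(src) $(head)
	$(compiler) $(cflags) $(src) -o $(program) $(ldlibs)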

.PHONY: clean
clean:
	rm -vf $(obj) $(program)
```
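* Compiling and submitting the program
With the makefile in place, compile with:
`make`
If the program has many source files and compiles slowly, compile in parallel on an allocated compute node (do not run parallel builds on the login nodes):
`make -j`
Once the executable has been built, submit the job (see the sbatch script submission section).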

#### Results
* Solution
<div align="left">
<img src=https://notes.sjtu.edu.cn/uploads/upload_d0befcfc62e5eca336c16f8847d4c744.png width=83%>
</div>
    
* Scalability test results
<div align="center"> 
<img src=https://notes.sjtu.edu.cn/uploads/upload_1db52630efb39ad4cc806eafdbb73ac3.png width=60% style="margin-left:48px">
</div>
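
Scalability measurements of this kind are usually summarized as speedup and parallel efficiency. The helper below is a small sketch of that arithmetic; the names `Scaling` and `strong_scaling` are hypothetical, and the timings passed to it must come from your own runs.

```cpp
#include <cassert>

// Strong-scaling metrics relative to the single-worker run time t1.
struct Scaling {
    double speedup;     // t1 / tp
    double efficiency;  // speedup / p
};

// t1: run time with one thread or process; tp: run time with p of them.
Scaling strong_scaling(double t1, double tp, int p) {
    assert(t1 > 0.0 && tp > 0.0 && p > 0);
    const double s = t1 / tp;
    return {s, s / p};
}
```

With hypothetical timings of 8 s on one thread and 2 s on eight threads, `strong_scaling(8.0, 2.0, 8)` gives a speedup of 4 and an efficiency of 0.5.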
    
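### Experiment 2: a parallel diffusion-equation solver based on MPI
#### Problem description
Same as Experiment 1
#### Objectives
* Understand the MPI message-passing mechanism
* Understand and master basic point-to-point communication APIs such as MPI_Send and MPI_Recv
* Understand and master basic collective communication APIs such as MPI_Reduce and MPI_Gather
* Understand the causes of MPI deadlock and how to resolve them
* Become familiar with MPI non-blocking communication
* Become familiar with domain decomposition in scientific computing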
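
The domain decomposition idea can be illustrated without any MPI calls. The sketch below splits the grid rows into contiguous blocks, one per rank. `RowRange` and `block_range` are hypothetical names; in an MPI program the `rank` and `n_ranks` arguments would typically come from `MPI_Comm_rank` and `MPI_Comm_size`.

```cpp
#include <cassert>

// Half-open row range [begin, end) owned by one rank.
struct RowRange {
    int begin;
    int end;
};

// Split n_rows as evenly as possible over n_ranks; the first (n_rows % n_ranks)
// ranks receive one extra row each.
RowRange block_range(int n_rows, int n_ranks, int rank) {
    assert(n_rows >= 0 && n_ranks > 0 && rank >= 0 && rank < n_ranks);
    const int base = n_rows / n_ranks;
    const int extra = n_rows % n_ranks;
    const int begin = rank * base + (rank < extra ? rank : extra);
    const int end = begin + base + (rank < extra ? 1 : 0);
    return {begin, end};
}
```

With 10 rows and 4 ranks, the ranks own rows [0, 3), [3, 6), [6, 8), and [8, 10). Each rank then only needs the rows adjacent to its block from its neighbours, which is where point-to-point communication comes in.
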
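#### Compiling and submitting the program
*  Loading the compiler modules
Because the program is parallelized with MPI, an MPI library must be loaded in addition to the gcc module. The SJTU ARM platform currently supports OpenMPI, which is loaded with:
    `module load openmpi/4.0.3-gcc-9.3.0`
To see other available versions, run `module av openmpi`.
*  Writing the makefile
The makefile for the MPI program is as follows: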
```shell=
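# The sources below are .cpp files, so use mpicxx (or mpic++); for C sources use mpicc and change *.cpp to *.c
compiler    = mpicxx
# Optional compile flags, empty by default
cflags      +=
# Libraries go after the sources on the link line
ldlibs      = -lm
# Name of the executable
program     = Test2
# Source files live in ./src
src         = $(wildcard ./src/*.cpp)
head        = $(wildcard ./src/*.h)
obj         = $(patsubst %.cpp,%.o,$(src))

$(program): $(src) $(head)
	$(compiler) $(cflags) $(src) -o $(program) $(ldlibs)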

.PHONY: clean
clean:
	rm -vf $(obj) $(program)
```
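* Compiling and submitting the program
With the makefile in place, compile with:
`make`
If the program has many source files and compiles slowly, compile in parallel on an allocated compute node (do not run parallel builds on the login nodes):
`make -j`
Once the executable has been built, submit the job (see the sbatch script submission section).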

#### Results
*  Solution
    Same as Experiment 1
* Scalability test
<div align="center"> 
<img src=https://notes.sjtu.edu.cn/uploads/upload_97958e1ed944e78b46f870a0ff0ae035.png width=65% style="margin-left:50px">
</div>

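### Experiment 3: a GPU diffusion-equation solver based on CUDA
#### Problem description
Same as Experiment 1
#### Objectives
* Become familiar with the architectural differences between GPUs and CPUs
* Understand and master GPU thread organization (grid, block)
* Understand and master the memory allocation APIs and the host-device memory copy APIs (cudaMalloc, cudaHostAlloc, cudaMemcpy)
* Understand and master programming with CUDA kernel functions launched from the host (`__global__`) and functions called on the device (`__device__`)
* Understand and master GPU atomic operations (atomicAdd)
* Understand and master GPU global memory and shared memory
* Understand and master the GPU warp concept
* Become familiar with GPU branch divergence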
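
The grid/block organization can be previewed on the host before writing any CUDA. The following is a plain C++ emulation, not CUDA code: the two loops stand in for `blockIdx.x` and `threadIdx.x`, and the names `blocks_for` and `axpy_emulated` are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Number of blocks needed so that blocks * threads_per_block >= n (ceiling division).
int blocks_for(int n, int threads_per_block) {
    assert(n >= 0 && threads_per_block > 0);
    return (n + threads_per_block - 1) / threads_per_block;
}

// Host-side emulation of a 1D kernel launch computing y[i] += a * x[i].
// On the GPU the loop body would live in a __global__ function and all
// (block, thread) pairs would run concurrently.
void axpy_emulated(double a, const std::vector<double>& x, std::vector<double>& y,
                   int threads_per_block) {
    const int n = static_cast<int>(x.size());
    assert(static_cast<int>(y.size()) == n);
    const int grid = blocks_for(n, threads_per_block);
    for (int block = 0; block < grid; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            const int i = block * threads_per_block + thread;  // global thread index
            if (i < n) y[i] += a * x[i];  // guard: the last block may be partly idle
        }
    }
}
```

For 1000 elements and 256 threads per block, `blocks_for` returns 4, so 24 threads in the last block fail the `i < n` guard and do no work.
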
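#### Compiling the program
* Loading the compiler modules
The SJTU ARM platform currently has no CUDA module, so log in to the x86 login nodes to compile the program and submit jobs. The login commands for the x86 clusters are: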
```shell=
# Log in to the pi 2.0 cluster (use any one of the three login nodes)
ssh username@login1.hpc.sjtu.edu.cn
ssh username@login2.hpc.sjtu.edu.cn
ssh username@login3.hpc.sjtu.edu.cn
# Log in to the Siyuan-1 (思源一号) cluster
ssh username@sylogin.hpc.sjtu.edu.cn
```
Load a C/C++ compiler (either of the two):
```shell=
# To use the gcc toolchain
module load gcc/11.2.0
# To use the Intel oneAPI toolchain
module load intel-oneapi-compilers/2022.1.0
```
Load the CUDA module:
```shell=
module load cuda/10.1.243-gcc-4.8.5
```
* Writing the makefile
The makefiles for compiling a CUDA C/C++ program are given below.
If the program deliberately separates .cpp files from .cu files, use the following makefile:
```shell=
compiler_gpu    = nvcc
# Use icpc instead when building with the Intel toolchain
compiler_cpu    = g++
cflags      += -std=c++11 -O2
# Name of the executable
program     = Test3
# Source files live in ./src
src_cpu         = $(wildcard ./src/*.cpp)
src_gpu         = $(wildcard ./src/*.cu)
head_cpu        = $(wildcard ./src/*.h)
head_gpu        = $(wildcard ./src/*.cuh)
# Object files are written next to their sources
obj_cpu         = $(patsubst %.cpp,%.o,$(src_cpu))
obj_gpu         = $(patsubst %.cu,%.o,$(src_gpu))

# Link the object files with nvcc so the CUDA runtime is pulled in
$(program): $(obj_cpu) $(obj_gpu)
	$(compiler_gpu) $(obj_cpu) $(obj_gpu) -o $(program)

%.o: %.cpp $(head_cpu)
	$(compiler_cpu) -c $(cflags) $< -o $@

%.o: %.cu $(head_gpu) $(head_cpu)
	$(compiler_gpu) -c $(cflags) $< -o $@

.PHONY: clean
clean:
	rm -vf $(obj_cpu) $(obj_gpu) $(program)
```
If all source files are named .cu and all header files .cuh, the following makefile can be used directly:
```shell=
compiler    = nvcc
cflags      += -std=c++11 -O2
program     = Test3
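# Source and header files are expected in the current directory
src         = $(wildcard *.cu)
src1        = $(wildcard *.cuh)
obj         = $(patsubst %.cu,%.o,$(src))

# Headers are listed as prerequisites so that editing a .cuh triggers a rebuild
$(program): $(src) $(src1)
	$(compiler) $(cflags) $(src) -o $(program)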
    
.PHONY: clean
clean:
	rm -vf $(obj) $(program)
```
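* Compiling the program
With the makefile in place, compile with:
`make`
If the program has many source files and compiles slowly, compile in parallel on an allocated compute node (do not run parallel builds on the login nodes):
`make -j`
Once the executable has been built, the job can be submitted.

* Submitting the program
Submitting a GPU job differs from submitting the CPU jobs above: the slurm file must be modified and then submitted with the sbatch command. The modified .slurm file is as follows: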
```shell=
#!/bin/bash

#SBATCH --job-name=test
# If you logged in to Siyuan-1 (思源一号), change dgx2 to a100
#SBATCH --partition=dgx2
#SBATCH -N 1
#SBATCH --ntasks-per-node=6
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1              # to use 2 GPUs, set gres=gpu:2
#SBATCH --mail-type=end
#SBATCH --mail-user=YOU@EMAIL.COM
#SBATCH --output=%j.out
#SBATCH --error=%j.err

module load gcc cuda

./Test3
```   
#### Results
* Solution
  Same as Experiment 1
* Run time comparison
<div align="center"> 
<img src=https://notes.sjtu.edu.cn/uploads/upload_c2f932e7ebcec9860a422b731a4ba17c.png width=65% style="margin-left:100px">
</div>

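## Notes

* Teaching accounts are for teaching use only; each account belongs to one person, so keep it secure
* If you run into problems, email us at hpc@sjtu.edu.cn

## References

* HPC user documentation: https://docs.hpc.sjtu.edu.cn/
* Using Matlab: https://docs.hpc.sjtu.edu.cn/studio/matlab.html
* π real-time utilization: https://account.hpc.sjtu.edu.cn/top
* HPC website: https://hpc.sjtu.edu.cn
* WeChat official account: 交我算
* WeChat video channel: 交我算
* Quick-start guide (Cheat Sheet): https://hpc.sjtu.edu.cn/Item/docs/Pi_GetStarted.pdf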