Do1e

CITE Lab Server User Guide

This article is synchronized and updated to xLog by Mix Space
For the best browsing experience, it is recommended to visit the original link
https://www.do1e.cn/posts/citelab/server-help


Connection and Login#

Connect over SSH from a terminal, or install the Remote - SSH extension for VS Code; search online for detailed setup instructions.

::: banner {error}
Starting from 2024.08.11, password login is disabled on all servers. Please provide a public key when requesting a new account.
:::

Create a key pair:

ssh-keygen -t rsa -b 8192

On Linux/Mac, the key pair is saved by default as ~/.ssh/id_rsa (private key) and ~/.ssh/id_rsa.pub (public key).
On Windows, it is saved by default in the C:\Users\[username]\.ssh folder, with the same file names.
The public key can be shared freely and should be added to ~/.ssh/authorized_keys on the server, one public key per line, with each line corresponding to the private key of a different PC.

::: banner {error}
The private key must be kept safe and not leaked. It is strongly discouraged to use the same key on all your PCs!
:::

For convenience, you can configure ~/.ssh/config on your own computer as follows, so that you can connect to the server with just ssh s1:

Host s1
  HostName s1.xxx.cn
  Port 22
  User xxx
  IdentityFile xxx/id_rsa

For detailed tutorials, see: VSCode Configuration for SSH Connection to Remote Server + Passwordless Connection Tutorial

Solution for Terminal Showing Only $#

Use the following command to change your default shell to bash (or another shell you are comfortable with). Enter your password when prompted (on Linux the password is not displayed as you type, which is normal; just press Enter after typing), then restart the terminal or reconnect.

chsh -s /usr/bin/bash

Environment Configuration#

conda#

If you have no special requirements, you can use conda directly. If you get conda: command not found, run the following command and restart the terminal:

/opt/anaconda3/bin/conda init

The ~/.condarc shown below uses the Nanjing University mirror (which is faster) and saves environments under your home directory.
Note: this has already been configured for all users, so you do not need to set up ~/.condarc yourself. You still need to run pip config set global.index-url https://mirror.nju.edu.cn/pypi/web/simple to switch the PyPI source to the Nanjing University mirror.

Since environments are saved in the ~/.conda directory, migrating to another server only requires copying that entire directory; no reconfiguration is needed. You can also edit ~/.condarc as follows, changing envs_dirs and pkgs_dirs to /nasdata/[name]/.conda/[envs/pkgs], to keep environments on the NAS so that multiple servers can share them.

show_channel_urls: true
default_channels:
  - https://mirror.nju.edu.cn/anaconda/pkgs/main
  - https://mirror.nju.edu.cn/anaconda/pkgs/r
  - https://mirror.nju.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirror.nju.edu.cn/anaconda/cloud
  msys2: https://mirror.nju.edu.cn/anaconda/cloud
  bioconda: https://mirror.nju.edu.cn/anaconda/cloud
  menpo: https://mirror.nju.edu.cn/anaconda/cloud
  pytorch: https://mirror.nju.edu.cn/anaconda/cloud
  simpleitk: https://mirror.nju.edu.cn/anaconda/cloud
auto_activate_base: false
envs_dirs:
  - ~/.conda/envs
pkgs_dirs:
  - ~/.conda/pkgs

After configuring the environment, running conda clean --all and rm -rf ~/.cache/pip clears a large amount of unneeded conda and pip cache, which helps when space is short.

docker#

If the system software does not meet your needs, you can use Docker. Search online for tutorials, but note that all Docker containers must be started as a normal user, otherwise they will be removed (lines 2-6 of the command below must be kept; the rest can be customized as needed).

docker container run --name pytorch-dpj \
  --gpus all \
  --user $(id -u ${USER}):$(id -g ${USER}) \
  -v /etc/passwd:/etc/passwd:ro \
  -v /etc/group:/etc/group:ro \
  -v /etc/shadow:/etc/shadow:ro \
  -v /data1/peijie:/data/:rw \
  -v /home/peijie:/home/peijie:rw \
  -it fenghaox/pyt1.3cu10.1:v2 /bin/bash

Alleviating Home Space Shortage Issues#

  • conda clean --all: Delete conda cache
  • rm -rf ~/.cache/pip: Delete pip cache
  • rmoldvs: Delete old version of vscode-server (must be used in the vscode terminal)

Check GPU Usage Status#

https://nvtop.nju.do1e.cn/
or the nvtop command

Starting from December 29, 2024, to protect laboratory confidentiality, https://nvtop.nju.do1e.cn/ is only accessible from IPs on the whitelist. Send your student ID to Diao Peijie, and a spreadsheet will be shared with you where you can fill in your IP. The whitelist is refreshed every 5 minutes.

Using Specified GPU#

If parallelism is not enabled, PyTorch uses GPU 0 by default. If parallelism is enabled, it uses all GPUs by default.
Before running your code, set the CUDA_VISIBLE_DEVICES environment variable to choose which GPUs to use. For example, to use only GPU 1 without parallelism:

export CUDA_VISIBLE_DEVICES=1

Or to use GPUs 0-3 in parallel:

export CUDA_VISIBLE_DEVICES=0,1,2,3
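
The same variable can also be set from inside a Python script, which is handy when you cannot control the shell that launches it. This is a minimal sketch; select_gpus is a hypothetical helper name, and the key assumption is that it runs before the first CUDA call (safest: before importing torch).

```python
import os


def select_gpus(gpu_ids):
    """Restrict this process to the given physical GPU indices.

    Must be called before the first CUDA call (safest: before importing torch).
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


select_gpus([1])  # same effect as: export CUDA_VISIBLE_DEVICES=1

# import torch  # import the framework only after the variable is set
# Inside the process the visible GPUs are renumbered from 0,
# so physical GPU 1 is addressed as "cuda:0".
```

Because of the renumbering, code that hard-codes "cuda:1" will fail when only one GPU is visible; use "cuda:0" or plain "cuda" instead.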

Consider learning a multi-GPU parallel method such as DataParallel (relatively simple to implement, but it incurs extra memory overhead on the first GPU, which lowers overall memory utilization) or DistributedDataParallel (more complex to implement and debug, but efficient; it is recommended to switch to this method once your code is debugged).

Use nvtop to check GPU usage, and coordinate with whoever is currently using the GPUs you need.

Networking Issues#

A proxy has been configured. If there are networking issues (like with github), add proxychains before the commands that require internet access, such as:

proxychains curl https://www.baidu.com

If you need to log in to p.nju.edu.cn, you can refer to this project:

Log in to the Nanjing University campus network (p.nju.edu.cn) from the command line, using unified identity authentication

Running Code in the Background#

The server has tmux installed. To run code in the background (which can continue running after exiting the terminal), you only need to use the most basic features.

Type tmux new in the terminal to create a new session, run your long-running command inside it, then press ctrl+B followed by D to detach. The code keeps running in the background.
Alternatively, use tmux new -s <name> to give the new session a name; unnamed sessions are numbered starting from 0.

You can list the sessions running in the background with tmux ls.
To return to a session and check on its progress, use tmux attach -t <name>.

Inside a tmux session, press ctrl+B, then [ to enter scroll mode, where you can scroll with the up and down keys; press q to exit scroll mode.

Data!!!#

Data Storage Location#

::: warning
The home directory has limited space; do not place data files in the home directory. Please place them under /data1.
:::

Infrequently used files can be placed under /nasdata, see the NAS explanation section below for details.

Data Backup#

::: warning
It is essential to ensure data security on public servers.
:::

Rclone is installed on the server, providing a convenient and scheduled backup method (to sync important files from the server to NJUBox):

rclone config

n → Custom configuration name (e.g., njubox) → 56 (seafile; the number may vary with the rclone version) → https://box.nju.edu.cn → Student ID → Password (enter y first, then enter the password twice) → 2fa (just press Enter) → Library name (press Enter to access all non-encrypted libraries) → Follow the prompts for the rest

Common rclone Methods#

View Remote Files#

rclone ls [configuration name]:/[directory]

Sync#

The first run copies all files from the source directory to the remote target directory.
Subsequent runs only copy files that are new or have changed.

::: warning
Special Note: After each run, the files at the target address will be exactly the same as those at the source address. If files are deleted at the source address, running sync will also delete the corresponding files at the target address (using rclone copy will not delete files at the target address).
:::

rclone sync -v [source directory] [configuration name]:/[target directory]

Scheduled Sync#

Copy the above sync command and use crontab for scheduled tasks; specific details can be found online, as there are many related tutorials.
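
If you prefer to keep the backup logic in a script rather than directly in the crontab line, a minimal sketch could look like the following. The script name backup.py, the paths, and the schedule in the comment are all hypothetical; only the rclone sync -v command itself comes from the section above.

```python
import subprocess
import sys


def build_sync_command(source, remote, target):
    """Build the same command as: rclone sync -v [source] [remote]:/[target]"""
    return ["rclone", "sync", "-v", source, f"{remote}:/{target.lstrip('/')}"]


def run_backup(source, remote, target):
    """Run the sync and return rclone's exit code (0 means success)."""
    return subprocess.run(build_sync_command(source, remote, target)).returncode


if __name__ == "__main__" and len(sys.argv) == 4:
    # Usage: python3 backup.py <source directory> <configuration name> <target directory>
    # Hypothetical crontab entry (crontab -e) that runs it every day at 03:00:
    #   0 3 * * * /usr/bin/python3 /data1/xxx/backup.py /data1/xxx/project njubox backup/project
    sys.exit(run_backup(*sys.argv[1:]))
```

Since sync deletes remote files that no longer exist at the source, double-check the source and target arguments before scheduling it.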

NAS Explanation#

Download the application from the Synology website: Enterprise Cloud | Synology Drive_Private Cloud_Access Data Anytime, Anywhere_Multi-Person Collaboration | Synology Inc.
Or access directly via the web: https://nas.njucite.cn:5001

IP/Domain: nas.njucite.cn

The Drive application login will only show the home directory, which is only visible to you.
The web login will show the share directory, which is a shared directory mounted on each server at /nasdata, used for transferring data between servers. Some servers (s4 and s5) have a 10G connection to NAS, while others have a 1G connection.

::: warning
Everyone has access to /nasdata. To guard against accidental deletion by others, it is recommended to back up important data with rclone; see the section Using rclone to Sync Local and NAS Files below, and remember to replace the URL.
:::

You can move files between the two directories via the web interface.

You can also mount using webdav, webdav address: https://nas.njucite.cn:5006

Use iperf3 to test connection speed:

iperf3 -c nas.njucite.cn

Using rclone to Sync Local and NAS Files#

rclone config
e/n/d/r/c/s/q> n # Create a new configuration
name> nas # Name the configuration nas
Storage> 52 # WebDAV, may vary with different rclone versions
url> https://nas.njucite.cn:5006 # On the servers, it is recommended to use the 10G network at 10.0.0.100:5005 instead
vendor> 7 # Other site/service or software, may vary with different rclone versions
user> abcd # NAS username
y/g/n> y # Enter password
password: ... # Enter NAS password twice
# Press enter for the rest

After creating the configuration on your local computer as described above, you can use the previously introduced rclone copy or rclone sync commands to sync files (e.g., upload local files to NAS or download NAS files to local).

::: warning
Special Note: After each run, the files at the target address will be exactly the same as those at the source address. If files are deleted at the source address, running sync will also delete the corresponding files at the target address (using rclone copy will not delete files at the target address).
:::

Advanced#

Automatically Fill Previously Entered Commands#

You can use zsh as the default terminal and configure oh-my-zsh, powerlevel10k, zsh-autosuggestions, and zsh-syntax-highlighting.

zsh+oh-my-zsh+powerlevel10k terminal configuration_powerlevel10k configuration-CSDN Blog

Alternatively, you can directly use my configuration by unzipping the following file into your home directory.
zshconfigs.tar.gz

Some commands may complain that no display is available. If you must use a GUI and have no alternative, try one of the following two methods. The first lets you run commands from your own terminal but requires extra configuration; the second requires running them inside MobaXterm but works out of the box.

Method One#

Install MobaXterm on your local computer and open the X server.

Hover over the X server button to see a list of [IP]:[x11port] entries. Choose an IP and port that are not behind router NAT (at Nanjing University, non-NAT IPs generally start with 114 or 172, while IPs behind router NAT generally start with 192.168 or 10) and enter the following in the server terminal:

export DISPLAY=[IP]:[x11port]

Then enter commands related to GUI, and click "Yes" in the pop-up window on your local computer.

Method Two#

Directly use MobaXterm for SSH connection and execute GUI-related commands.

Copy with Progress Display#

Add the following to ~/.bashrc or ~/.zshrc:

function rcp(){
    local src=$1
    local dst=$2
    if [ -f "$src" ] && [ -d "$dst" ]; then
        dst="$dst/$(basename "$src")"
    fi
    mkdir -p "$(dirname "$dst")"
    rsync -a --info=progress2 "$src" "$dst"
}

After that, use rcp in place of cp. Its logic differs slightly from cp: the second argument dst should be the target directory, and you cannot rename the file while copying as you can with cp.

Send Email Notifications After Training Ends/Fails#

Add the following Python code at the end of your training script.

sender = "[email protected]"             # Configure the sending email address
sender_name = "s1"                     # Define the sender name as the server name
passwd = "xxxxxxx"                     # Email password, if using QQ email, it is the authorization code
server = "smtphz.qiye.163.com"         # Email server for sending, for QQ email it is smtp.qq.com
port = 465                             # Port number for sending email, usually this one
receiver = "[email protected]"   # Receiving email address
receiver_name = "Peijie Diao"          # Receiving email name
subject = "train on s3"                # Email subject
message = "Training on s3 is finished" # Email content

import smtplib
from email.mime.text import MIMEText
from email.utils import formataddr
import socks

# The server cannot access the internet without logging in. Here I configured a proxy that allows local network connections.
socks.set_default_proxy(socks.SOCKS5, "xxxx", 7891)
socks.wrapmodule(smtplib)

msg = MIMEText(message, 'plain', 'utf-8')
msg['From'] = formataddr((sender_name, sender))
msg['To'] = formataddr((receiver_name, receiver))
msg['Subject'] = subject

server = smtplib.SMTP_SSL(server, port)
server.login(sender, passwd)
server.sendmail(sender, [receiver], msg.as_string())
server.quit()
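
The code above sends one fixed message when the script reaches its end. To also get notified when training fails, one option is to wrap the training entry point as sketched below. Here train and notify are hypothetical callables: train stands for your training function, and notify(subject, message) is assumed to be a small function you write around the email code above.

```python
import traceback


def run_with_notification(train, notify):
    """Run train() and report the outcome through notify(subject, message).

    If train() raises, the traceback is sent and the exception is re-raised,
    so the script still exits with an error.
    """
    try:
        train()
    except Exception:
        notify("train failed", traceback.format_exc())
        raise
    notify("train finished", "Training is finished")
```

Note that this only catches ordinary exceptions; a job killed by a signal or interrupted with ctrl+C will not trigger the failure email.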