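Background
Our team designed a LoRa gateway based on Armbian. It is required to start the main program packet_forwarder as soon as it powers on (packet_forwarder bridges LoRa <-> UDP to communicate with the server).
This should have been a simple requirement: wrap the program in a service and load it into systemd. The rime_gateway.service file looked like this: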
[Unit]
Description=Rime LoRaWAN Gateway
[Service]
WorkingDirectory=/home/rime/packet_forwarder/lora_pkt_fwd
ExecStart=/home/rime/packet_forwarder/lora_pkt_fwd/start_gateway.sh
Restart=always
[Install]
WantedBy=multi-user.target
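For an explanation of the syntax, see "Systemd 入门教程:命令篇" (Systemd Tutorial for Beginners: Commands).
An unstable service
When started manually with systemctl start rime_gateway.service, the service works well.
However, after Armbian powers on and boots by itself, systemctl status rime_gateway.service shows that the service has already failed: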
rime_gateway.service - Rime LoRaWAN Gateway
Loaded: loaded (/lib/systemd/system/rime_gateway.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Mon 2020-04-20 06:51:46 UTC; 29s ago
Process: 1112 ExecStart=/home/rime/packet_forwarder/lora_pkt_fwd/start_gateway.sh (code=exited, status=1/FAILURE)
Main PID: 1112 (code=exited, status=1/FAILURE)
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Service RestartSec=100ms expired, scheduling restart.
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Scheduled restart job, restart counter is at 5.
Apr 20 06:51:46 orangepizero systemd[1]: Stopped Rime LoRaWAN Gateway.
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Start request repeated too quickly.
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
Apr 20 06:51:46 orangepizero systemd[1]: Failed to start Rime LoRaWAN Gateway.
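The output shows that the service was restarting too quickly, so systemd gave up on it ("Start request repeated too quickly"). By default systemd allows 5 starts (StartLimitBurst) within 10 seconds (StartLimitIntervalSec) before it stops retrying.
The log from journalctl -u rime_gateway.service confirms that all 5 restart attempts failed, each scheduled after the default 100 ms RestartSec delay: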
-- Logs begin at Mon 2020-04-20 06:51:31 UTC, end at Mon 2020-04-20 06:55:01 UTC. --
Apr 20 06:51:40 orangepizero systemd[1]: Started Rime LoRaWAN Gateway.
Apr 20 06:51:40 orangepizero start_gateway.sh[572]: Reset start_gateway.sh
Apr 20 06:51:41 orangepizero start_gateway.sh[572]: Starting start_gateway.sh
Apr 20 06:51:41 orangepizero systemd[1]: rime_gateway.service: Main process exited, code=exited, status=1/FAILURE
Apr 20 06:51:41 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
Apr 20 06:51:41 orangepizero systemd[1]: rime_gateway.service: Service RestartSec=100ms expired, scheduling restart.
Apr 20 06:51:41 orangepizero systemd[1]: rime_gateway.service: Scheduled restart job, restart counter is at 1.
...
Apr 20 06:51:45 orangepizero start_gateway.sh[1112]: Reset start_gateway.sh
Apr 20 06:51:46 orangepizero start_gateway.sh[1112]: Starting start_gateway.sh
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Main process exited, code=exited, status=1/FAILURE
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Service RestartSec=100ms expired, scheduling restart.
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Scheduled restart job, restart counter is at 5.
Apr 20 06:51:46 orangepizero systemd[1]: Stopped Rime LoRaWAN Gateway.
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Start request repeated too quickly.
Apr 20 06:51:46 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
Apr 20 06:51:46 orangepizero systemd[1]: Failed to start Rime LoRaWAN Gateway.
The gateway's own log (tail -f /tmp/start_gateway.sh.log) shows that the failure is caused by the network not being up yet:
ERROR: [up] connect returned Network is unreachable
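Changing the startup order
The service clearly depends on the network being up, so the first step is to add the following line to the [Unit] section: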
After=network.target
Did this ordering take effect? To find out, we exported the boot sequence and inspected it:
systemd-analyze plot > boot.svg
Opening boot.svg in Chrome shows that network.target is started first and rime_gateway.service after it.
For more on startup order, see "Linux systemd启动守护进程,service启动顺序分析及调整service启动顺序".
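One caveat is worth sketching here. network.target only orders a unit after the network management stack has been started; it does not wait for an interface to be configured. systemd provides network-online.target for units that need a usable connection. The helper below is a minimal sketch, not the configuration this gateway shipped with: it writes a drop-in that pulls in network-online.target and orders the unit after it. The function name, the file name 10-network-online.conf, and the directory argument are assumptions made for illustration, and the target only helps when a wait-online service (for example systemd-networkd-wait-online.service or NetworkManager-wait-online.service) is enabled on the system.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that makes a unit wait for
# network-online.target. $1 is the drop-in directory, for example
# /etc/systemd/system/rime_gateway.service.d
write_network_online_dropin() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir" || return 1
    # Wants= pulls the target into the boot transaction;
    # After= orders this unit behind it.
    cat > "$dropin_dir/10-network-online.conf" <<'EOF'
[Unit]
Wants=network-online.target
After=network-online.target
EOF
}
```

On the gateway this would be called as write_network_online_dropin /etc/systemd/system/rime_gateway.service.d, followed by systemctl daemon-reload. Automatic restart is still worth keeping, since an "online" network does not guarantee that the remote server is reachable.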
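Restarting on failure
To make the service more robust, it should be restarted automatically whenever it exits with a failure. The following settings were added for that.
Setting StartLimitIntervalSec=0 disables the start rate limit, so systemd keeps trying to restart the service indefinitely: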
StartLimitIntervalSec=0
Restarting once per second is a sensible interval: it avoids putting too much load on the server while something is wrong.
RestartSec=1
For more on automatic restarts, see "使用systemd创建Linux服务".
A stable service
The final rime_gateway.service file looks like this:
[Unit]
Description=Rime LoRaWAN Gateway
After=network.target
StartLimitIntervalSec=0
[Service]
WorkingDirectory=/home/rime/packet_forwarder/lora_pkt_fwd
ExecStart=/home/rime/packet_forwarder/lora_pkt_fwd/start_gateway.sh
Restart=always
RestartSec=1
[Install]
WantedBy=multi-user.target
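Checking with systemctl status rime_gateway.service and journalctl -u rime_gateway.service shows that the service now starts normally.
To test the failure case, unplug the network cable and then reboot Armbian: systemd restarts the service at 1-second intervals until the network comes back (78 restarts in this run).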
-- Logs begin at Mon 2020-04-20 07:32:09 UTC, end at Mon 2020-04-20 07:35:12 UTC. --
Apr 20 07:32:19 orangepizero systemd[1]: Started Rime LoRaWAN Gateway.
Apr 20 07:32:20 orangepizero start_gateway.sh[839]: Reset start_gateway.sh
Apr 20 07:32:20 orangepizero start_gateway.sh[839]: Starting start_gateway.sh
Apr 20 07:32:20 orangepizero systemd[1]: rime_gateway.service: Main process exited, code=exited, status=1/FAILURE
Apr 20 07:32:20 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
Apr 20 07:32:21 orangepizero systemd[1]: rime_gateway.service: Service RestartSec=1s expired, scheduling restart.
Apr 20 07:32:21 orangepizero systemd[1]: rime_gateway.service: Scheduled restart job, restart counter is at 1.
Apr 20 07:32:21 orangepizero systemd[1]: Stopped Rime LoRaWAN Gateway.
Apr 20 07:32:21 orangepizero systemd[1]: Started Rime LoRaWAN Gateway.
Apr 20 07:32:22 orangepizero start_gateway.sh[991]: Reset start_gateway.sh
Apr 20 07:32:22 orangepizero start_gateway.sh[991]: Starting start_gateway.sh
...
Apr 20 07:34:54 orangepizero systemd[1]: rime_gateway.service: Main process exited, code=exited, status=1/FAILURE
Apr 20 07:34:54 orangepizero systemd[1]: rime_gateway.service: Failed with result 'exit-code'.
Apr 20 07:34:55 orangepizero systemd[1]: rime_gateway.service: Service RestartSec=1s expired, scheduling restart.
Apr 20 07:34:55 orangepizero systemd[1]: rime_gateway.service: Scheduled restart job, restart counter is at 78.
Apr 20 07:34:55 orangepizero systemd[1]: Stopped Rime LoRaWAN Gateway.
Apr 20 07:34:55 orangepizero systemd[1]: Started Rime LoRaWAN Gateway.
Apr 20 07:34:55 orangepizero start_gateway.sh[2644]: Reset start_gateway.sh
Apr 20 07:34:56 orangepizero start_gateway.sh[2644]: Starting start_gateway.sh
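When the journal runs to hundreds of lines, the current restart count can be extracted directly instead of being read by eye. The sketch below is an illustrative helper (the name last_restart_counter is an assumption, not part of systemd): it reads journal text on stdin and prints the number from the last "restart counter is at N" message, which is the wording systemd uses in the logs above.

```shell
#!/bin/sh
# Hypothetical helper: print the most recent restart counter found in
# journal text read from stdin; prints nothing if no such line exists.
last_restart_counter() {
    # Keep only the number from matching lines, then take the last one.
    sed -n 's/.*restart counter is at \([0-9][0-9]*\)\.*$/\1/p' | tail -n 1
}
```

Typical use would be journalctl -u rime_gateway.service | last_restart_counter. Recent systemd versions also expose the count as the NRestarts property, which systemctl show -p NRestarts rime_gateway.service prints without any parsing.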
—————————P2————————–
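# Systemd: controlling process startup order
# Problem description
After rebooting the Linux server, the backend of our WeChat official account (微信公众号) could not serve requests. Logging in to the server showed that the mysql service had started normally. The supervisor log, however, showed an error while starting the uwsgi process, reported as a database connection error. Since both mysql and supervisor are started at boot by systemd (enabled through systemctl), supervisor was most likely started before mysql, which caused the connection to fail.
# Solution
systemd controls startup order through the Before= and After= directives in the [Unit] section of a unit file.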
vim /lib/systemd/system/supervisor.service
[Unit]
After=mariadb.service
Several units can be listed on one line, separated by spaces:
After=syslog.target network.target remote-fs.target nss-lookup.target
systemctl daemon-reload
systemctl enable yourservice
systemctl restart yourservice
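To confirm that the ordering took effect after the reload, the unit's After property can be inspected with systemctl show -p After, which prints a single "After=..." line of space-separated unit names. The helper below is a small sketch for checking that line; the function name is_ordered_after is an assumption made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read "After=unit1 unit2 ..." on stdin and succeed
# (exit status 0) only if the unit named in $1 is listed.
is_ordered_after() {
    wanted="$1"
    while IFS= read -r line; do
        case "$line" in
            After=*)
                # Split the value on whitespace and compare each unit name.
                for unit in ${line#After=}; do
                    [ "$unit" = "$wanted" ] && return 0
                done
                ;;
        esac
    done
    return 1
}
```

For the case above it could be used as: systemctl show -p After supervisor.service | is_ordered_after mariadb.service && echo "ordering OK"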
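References:
- Systemd 入门教程:实战篇 (Systemd Tutorial for Beginners: In Practice)
- systemd.unit (Unit configuration)
As of this writing, the server hosting the 阮一峰的网络日志 (Ruan Yifeng's blog) article "Systemd 入门教程:实战篇" does not serve HTTPS properly. If following the reference above produces an error, enter this address manually: http://www.ruanyifeng.com/blog/2016/03/systemd-tutorial-part-two.html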