How to debug and analyse in today's disparate software world?
Below is an article we wrote in connection with our participation in the EU R&D project VISTA. It describes a conceptual design of a system and application logging facility for a distributed, low-latency ADAS (Advanced Driver Assistance System), which we are helping to design, prototype and build.
When VISTA moves from its development phase into its real-life integration cycle in the coming months, there will be a need for both in-depth, low-level troubleshooting and a high-level, global end-to-end understanding of system behaviour.
A complex, distributed and quite low-latency system like VISTA will therefore soon need a high-fidelity unified logging system that supports the heterogeneous technology involved. This proposal defines such a system, starting with its requirements.
This document is intended for technical software and IT professionals with some experience in developing and deploying complex distributed software and applications. Many details about troubleshooting, logging and DevOps are left unexplained to keep this document compact.
Deployment diagram for VISTA
The diagram below shows the rather unusual deployment diagram for VISTA so far. Each of the tiers defined (currently 5) runs its own VISTA primary and supporting software functions. As a result, each tier has its own logging demands, which need to be unified in order to conduct correlated troubleshooting effectively in practice.
Here we present the requirements for a unified, distributed and integrated, smart VISTA system and application logging & analytics service:
- Must be capable of collecting and correlating heterogeneous sources of logging data during debug time and run time of the VISTA system
- Must be capable of scaling with the number of connected subsystems, the number of physical nodes, the variety of logging sources and the number of log messages
- Must not be overly complex to use for the different WPs (work packages) producing software, both individually and collectively
- Must be able to get everyone involved on the same page while debugging
- Must be able to handle different wall clock times on different system nodes
- Must be capable of handling high-volume message streams (hundreds per second) without using more than 2% of system resources (CPU + bandwidth)
- Must stay within the near-real-time characteristics of the VISTA system, with end-to-end latencies < 300 ms
- Must be able to dynamically set filters to limit message collecting
- Must be able to store the collected logging data for post mortem analytics
- Must be able to provide live viewing of filtered log data to narrow down to an issue
- Must support the most important programming languages, development frameworks, containers and OSes used within VISTA
- Must be modular, fairly easy to start with, and able to grow with increasing requirements, including DevOps for production
- Must support multi-user analytics and tailing, so that specialists from different WPs can work simultaneously
- Log agents should coexist cleverly with the yet-to-be-defined SC&D (probably Ansible) so that the filtering and forwarding rules can be changed dynamically
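The requirement to handle different wall clock times on different nodes can be illustrated with a minimal sketch. It assumes that the collector already has an estimated clock offset per node (how that estimate is obtained is outside this sketch) and that each node delivers its records in local time order. The node names and the `LogRecord` type are hypothetical and not part of VISTA.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class LogRecord:
    node: str        # hypothetical node name, e.g. "jetson-1"
    local_ts: float  # seconds since epoch according to the node's own clock
    message: str


def to_reference_time(record: LogRecord, offsets: Dict[str, float]) -> float:
    """Map a node-local timestamp onto the collector's reference clock.

    offsets[node] is the estimated (node clock - reference clock) in seconds.
    Nodes without an entry are assumed to be in sync.
    """
    return record.local_ts - offsets.get(record.node, 0.0)


def _corrected(stream: List[LogRecord], offsets: Dict[str, float]):
    # Yield (reference time, node, message) tuples for one node's stream.
    for record in stream:
        yield (to_reference_time(record, offsets), record.node, record.message)


def merge_streams(
    streams: Iterable[List[LogRecord]], offsets: Dict[str, float]
) -> Iterator[Tuple[float, str, str]]:
    """Merge per-node streams (each ordered by local time) into one stream
    ordered by corrected reference time."""
    return heapq.merge(*(_corrected(s, offsets) for s in streams))
```

With node "b" running 5 seconds ahead, a record stamped 105.5 on "b" is placed at 100.5 on the reference clock, between records stamped 100.0 and 102.0 on an in-sync node "a".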
The topology of a unified, distributed & integrated, smart VISTA system and application logging & analytics is visualised here.
There are 4 main subsystems in this topology:
- Logging channels
- Filtering and forwarding
- A collector and message storage
- Application for live and post mortem analytics
The first two (partly) run on the computer systems themselves, which may be bare-metal servers, HMI devices, network components or containers (e.g. Docker).
The last two are part of a dedicated logging & analytics application, of which dozens are available, both open source and commercial. Most likely they run in a public or private cloud, as they play the central role for a complex application system like VISTA.
The DevOps monitoring part is currently out of scope, but a good solution can facilitate that as well once VISTA becomes a real operational system. That is why it is shown.
This diagram gives a partial impression of the different types of logging channels that will become available in VISTA and their location in the software/hardware stack on physical or virtualized computer systems.
It is, among other things, this variety that makes an adequate unified, distributed and integrated, smart VISTA system and application logging & analytics service far from straightforward.
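To make the filtering and forwarding subsystem and the dynamic filter requirement more concrete, here is a minimal sketch using only Python's standard `logging` module. It is an illustration under assumptions, not the VISTA agent: the `DynamicFilter` and `ListForwarder` classes and the logger name `vista.hmi` are hypothetical, and a real agent would forward over the network instead of keeping messages in a list.

```python
import logging
import threading


class DynamicFilter(logging.Filter):
    """Filter whose minimum level and logger-name prefixes can change at run time."""

    def __init__(self, min_level=logging.INFO, prefixes=()):
        super().__init__()
        self._lock = threading.Lock()
        self._min_level = min_level
        self._prefixes = tuple(prefixes)

    def update(self, min_level=None, prefixes=None):
        """Replace the active rules, e.g. when new configuration is pushed."""
        with self._lock:
            if min_level is not None:
                self._min_level = min_level
            if prefixes is not None:
                self._prefixes = tuple(prefixes)

    def filter(self, record):
        with self._lock:
            min_level, prefixes = self._min_level, self._prefixes
        if record.levelno < min_level:
            return False
        # An empty prefix tuple means "forward every source".
        return not prefixes or record.name.startswith(prefixes)


class ListForwarder(logging.Handler):
    """Stand-in for a real forwarder: keeps accepted messages in memory."""

    def __init__(self):
        super().__init__()
        self.sent = []

    def emit(self, record):
        self.sent.append(f"{record.name}:{record.getMessage()}")


# Usage: start strict, then widen the filter while narrowing down an issue.
forwarder = ListForwarder()
rules = DynamicFilter(min_level=logging.WARNING)
forwarder.addFilter(rules)

log = logging.getLogger("vista.hmi")  # hypothetical logger name
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(forwarder)

log.info("not forwarded, below WARNING")
rules.update(min_level=logging.DEBUG, prefixes=["vista.hmi"])
log.info("forwarded after the rule change")
```

Attaching the filter to the handler rather than to each logger keeps the rule set in one place per forwarding channel, which is the part a configuration tool would need to update.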
Basic idea for a solution
Roughly, the basic idea is a two-step approach: make the distributed logging agents stable from day 1, and keep the central parts replaceable as demands increase later on.
- ROS/ROS2 logging at its core can use rosout
- Human readable messages and files.
- That’s fine for the ROS based core part of VISTA from a global perspective
- Nevertheless, there are more subsystems than that, and they do not live in the ROS-based core
- Please see the extensive table below, which makes clear that VISTA ≠ ROS!
- ROS -> logfile monitoring for core extended VISTA end2end logging
- Fluentd & Fluent Bit
- Combined suitable for embedded, edge, servers, containers
- No Java, please, as Logstash requires!
- So the ELK stack suggested by coduct is only partly a viable idea
- nxlog as a filtering and forwarding agent for a very quick start
- we have got that working
- Loggly to rapidly start with some basic analytics including Syslog, Android, Ubuntu, NodeJS and ROS rosout logfile monitoring
- We have some experience with that
- With an increased need for filtering, searching and correlation
- Fluentd (with GELF output for Graylog) -> Elastic -> Kibana
- or Graylog for in depth large logging dataset analytics
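The GELF output mentioned in the list above uses a small JSON payload. As a rough illustration of what a forwarder would send to Graylog, the sketch below builds such a payload by hand. The field layout follows GELF version 1.1 as we understand it; the helper names, the host name and the extra field are hypothetical, and port 12201 in the comment is an assumption about the receiving input, not a VISTA setting.

```python
import json
import socket
import time

# Syslog severity numbers, which GELF uses for its "level" field.
SYSLOG_LEVELS = {"error": 3, "warning": 4, "info": 6, "debug": 7}


def build_gelf(short_message, host, level="info", timestamp=None, **extra):
    """Return one GELF 1.1 message as UTF-8 encoded JSON bytes."""
    message = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time() if timestamp is None else timestamp,
        "level": SYSLOG_LEVELS[level],
    }
    for key, value in extra.items():
        if key == "id":
            raise ValueError("GELF does not allow an additional field named _id")
        # Additional fields are prefixed with an underscore.
        message["_" + key] = value
    return json.dumps(message).encode("utf-8")


def send_gelf_udp(payload, address):
    """Send one uncompressed, unchunked datagram, e.g. to ("graylog-host", 12201).

    Assumes the payload fits in a single datagram; chunking is not handled.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, address)
```

In practice Fluentd or Fluent Bit would produce this format through an output plugin; building it by hand is only useful for understanding which fields end up searchable in the analytics application.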
The basic idea projected on all foreseen VISTA run time technologies
Across all possible logging channels, programming languages, OSes and middleware, the following table shows the practical choices for achieving unified, distributed and integrated, smart VISTA system and application logging & analytics:
|Logging channel of origin||Practical use of proposal||Remark|
|Jetson TX2 / Nano||Syslog: standard Fluentd on Ubuntu using the in_syslog input plugin, please refer to https://docs.fluentd.org/input/syslog. Fluent Bit when an RTOS or similar is used, please refer to https://fluentbit.io||So far the Jetson TX2 devices have been considered Ubuntu-based devices. For some reason they might become real embedded devices with an RTOS. In that case there is still an excellent piece of software to keep such a device in the unified logging service.|
|ROS rosout file based logging||Fluentd with Tailpath https://github.com/xthexder/fluent-plugin-tailpath||ROS/ROS2 has its own logging mechanism. That is fine, but not suitable to serve a complex application like VISTA end to end. So the ROS core functions continue to use it for ROS domain purposes, but to make this part of a unified logging service we are going to create a logging gateway function that uses the human-readable file logging.|
|Ubuntu||Syslog standard Fluentd using the in_syslog input plugin, please refer to https://docs.fluentd.org/input/syslog|
|NodeJS||standard Fluentd using the 'fluent-logger-node' library, please refer to https://docs.fluentd.org/language-bindings/nodejs|
|Python||standard Fluentd using the 'fluent-logger-python' library, please refer to https://docs.fluentd.org/language-bindings/python|
|C++||OS / Syslog library (https://github.com/gabime/spdlog). Fluent Bit integration (https://support.treasuredata.com/hc/en-us/articles/360000691168-Data-Ingestion-from-Embedded-Apps-C-C-)||For the C++ based software parts in WP3 there are a lot of choices. In this function domain, integrating the appropriate software library will bring high-performance logging, which might be needed for system integration.|
|.Net||standard Fluentd, please refer to https://github.com/fluent/NLog.Targets.Fluentd||Somewhat outdated. Log4Net with .NET logs sent directly to Loggly is a perfect alternative, please refer to https://www.loggly.com/docs/net-logs|
|Docker||standard Fluentd||Docker is a container technology that might be used on the Jetson TX2 devices and the VISTA DC Controller Unit|
|ROS2/Web bridge||= NodeJS||One of the VISTA supporting functions, which will not become part of the ROS based core.|
|Android for the HMI device||Loggly direct, please refer to library https://github.com/inrista/loggliest||Android has a Fluentd implementation, but it is rather old. This suggestion would be the best alternative.|
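The logging gateway idea from the ROS rosout row, following a human-readable log file and turning new lines into events, can be sketched in a few lines of Python. In the proposal this job is done by Fluentd with a tail-style input, so the code below only illustrates the mechanism. The `FileTail` class, the `to_event` helper and the `ros.rosout` tag are hypothetical names, and the line format of the log file is deliberately not interpreted.

```python
import os


class FileTail:
    """Return complete lines appended to a log file since the previous poll."""

    def __init__(self, path, from_start=False):
        self.path = path
        # By default only lines written after start-up are reported.
        self._offset = 0 if from_start else os.path.getsize(path)
        self._partial = b""

    def poll(self):
        size = os.path.getsize(self.path)
        if size < self._offset:
            # The file shrank: assume truncation or rotation and start over.
            self._offset = 0
            self._partial = b""
        with open(self.path, "rb") as handle:
            handle.seek(self._offset)
            data = handle.read()
            self._offset = handle.tell()
        # Keep an unfinished last line until its newline arrives.
        *lines, self._partial = (self._partial + data).split(b"\n")
        return [line.decode("utf-8", errors="replace") for line in lines]


def to_event(line, tag="ros.rosout"):
    """Wrap one raw log line in a minimal tagged event for forwarding."""
    return {"tag": tag, "message": line.rstrip("\r")}
```

A gateway loop would call `poll()` periodically and pass each resulting event to the forwarder. Holding back incomplete lines avoids forwarding half-written messages when the writer is interrupted between flushes.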