How to debug and analyse in today's disparate software world?

Below is an article we wrote in connection with our participation in the EU R&D project Vista. It describes a conceptual design of a system and application logging facility for a distributed low-latency ADAS (Advanced Driver Assistance System), which we are helping to design, prototype and build.

Vista team

Introduction

When VISTA moves from its development phase into its real-life integration cycle in the coming months, there will be a need for both in-depth low-level troubleshooting and a high-level, global end-to-end understanding of system behaviour.

A complex, distributed and fairly low-latency system like VISTA will therefore soon need a high-fidelity unified logging system that supports the heterogeneous technology involved. This proposal defines such a system, starting with its requirements.

This document is intended for technical software and IT professionals with some experience in developing and deploying complex distributed software and applications. To keep the document compact, many details about troubleshooting, logging and DevOps are not explained.

Deployment diagram for VISTA

The diagram below shows the rather unique deployment diagram for VISTA so far. Each of the tiers defined (currently 5) will run its own VISTA primary and supporting software functions. As a result, each tier has its own logging demands, which need to be unified so that correlated troubleshooting can be conducted effectively in practice.

Requirements

Here we present the requirements for unified, distributed and integrated, smart VISTA system and application logging & analytics:

Must be capable of collecting and correlating heterogeneous sources of logging data during debug time and run time of the VISTA system
Must be capable of scaling with the number of connected subsystems, the number of physical nodes, the variety of logging sources and the number of log messages
Must not be overly complex to use for the different WPs producing software, both individually and collectively
Must be able to bring everyone involved onto the same debug page during debug time
Must be able to handle different wall clock times on different system nodes
Must be capable of handling high-volume message streams (hundreds / second) without using more than 2% of system resources (CPU + bandwidth)
Must be able to stay within the near-real-time characteristics of the VISTA system for end-to-end latencies < 300 ms
Must be able to dynamically set filters to limit message collection
Must be able to store the collected logging data for post mortem analytics
Must be able to provide live viewing of filtered log data to narrow down to an issue
Must support the most important programming languages, development frameworks, containers and OSes used within VISTA
Must be modular, fairly easy to start with, and able to grow with increasing requirements, including DevOps for production
Must support multi-user analytics and tailing, so that specialists from different WPs can work simultaneously
Log agents should coexist cleverly with the yet-to-be-defined SC&D (probably Ansible) to make dynamic changes to the filtering and forwarding rules
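
To illustrate the requirement on differing wall clock times, here is a minimal sketch of how a collector might put records from several nodes on one timeline. It assumes the clock offset of each node relative to the collector is already known; the node names, offsets and record layout are hypothetical and not part of the VISTA design.

```python
import heapq
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class LogRecord:
    node: str         # hypothetical node identifier
    timestamp: float  # seconds since epoch, stamped by the node's own clock
    message: str


def normalise(records: Iterable[LogRecord], offset: float) -> Iterator[LogRecord]:
    """Shift node-local timestamps onto the collector's time base.

    `offset` is the assumed (node clock - collector clock) difference in seconds.
    """
    for r in records:
        yield LogRecord(r.node, r.timestamp - offset, r.message)


def merge_streams(streams: Dict[str, List[LogRecord]],
                  offsets: Dict[str, float]) -> List[LogRecord]:
    """Merge per-node streams (each already in time order) into one timeline."""
    corrected = (normalise(recs, offsets.get(node, 0.0))
                 for node, recs in streams.items())
    return list(heapq.merge(*corrected, key=lambda r: r.timestamp))


# Hypothetical example: the clock on "jetson-1" runs 0.40 s ahead of the collector.
streams = {
    "jetson-1": [LogRecord("jetson-1", 100.50, "object detected"),
                 LogRecord("jetson-1", 100.55, "detection published")],
    "dc-controller": [LogRecord("dc-controller", 100.20, "detection received")],
}
offsets = {"jetson-1": 0.40, "dc-controller": 0.0}

for record in merge_streams(streams, offsets):
    print(f"{record.timestamp:.2f} {record.node}: {record.message}")
```

Without the correction, the "detection received" record would be sorted before the two records that caused it, which is exactly the kind of confusion that makes correlated troubleshooting hard.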

Topology

The topology of a unified, distributed & integrated, smart VISTA system and application logging & analytics is visualised here.

There are 4 main subsystems in this topology:

Logging channels
Filtering and forwarding
A collector and message storage
Application for live and post mortem analytics

The first two run (at least partly) on the computer systems themselves, which might be bare-metal servers, HMI devices, network components or containers (e.g. Docker).

The last two are part of a dedicated logging & analytics application, of which dozens are available, both open source and commercial. Most likely they run in a public or private cloud, as they play the central role for a complex application system like VISTA.

The DevOps monitoring part is currently out of scope, but a good solution can facilitate it as well once VISTA becomes a real operational system. For that reason it is shown.
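
The filtering and forwarding subsystem, combined with the requirement to set filters dynamically, can be sketched roughly as follows. This is only an illustration under assumptions: the class name, the record shape and the rules are hypothetical, and a real agent such as Fluentd or Fluent Bit would express the same idea in its own configuration.

```python
import threading
from typing import Callable, Dict, List

Record = Dict[str, str]          # hypothetical shape: {"source", "level", "msg"}
Rule = Callable[[Record], bool]  # returns True when a record should be forwarded


class FilterForwarder:
    """Filtering-and-forwarding stage whose rule can be swapped at run time."""

    def __init__(self, sink: Callable[[Record], None], rule: Rule) -> None:
        self._sink = sink            # stands in for the central collector
        self._rule = rule
        self._lock = threading.Lock()

    def set_rule(self, rule: Rule) -> None:
        """Replace the active filter, e.g. triggered by a configuration tool."""
        with self._lock:
            self._rule = rule

    def handle(self, record: Record) -> None:
        """Forward the record to the sink only if the current rule accepts it."""
        with self._lock:
            rule = self._rule
        if rule(record):
            self._sink(record)


store: List[Record] = []  # stands in for collector and message storage
agent = FilterForwarder(store.append, lambda r: r["level"] in {"WARN", "ERROR"})

agent.handle({"source": "ros", "level": "INFO", "msg": "heartbeat"})       # dropped
agent.handle({"source": "ros", "level": "ERROR", "msg": "lidar timeout"})  # forwarded

# Widen the filter while troubleshooting: forward everything coming from ROS.
agent.set_rule(lambda r: r["source"] == "ros")
agent.handle({"source": "ros", "level": "INFO", "msg": "heartbeat"})       # forwarded
```

Keeping the rule behind a single setter is one way to let a configuration tool change filters without restarting the agent.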

Logging channels

This diagram gives a partial impression of the different types of logging channels that will become available in VISTA and their location in the software/hardware stack on physical or virtualized computer systems.

It is this variety, among other things, that makes an adequate unified, distributed and integrated, smart VISTA system and application logging & analytics service far from straightforward.

Basic idea for a solution

Roughly, the basic idea is a 2-step approach: make the distributed logging agents stable from day 1, and keep it possible to change the central parts later as demands increase.

ROS/ROS2 logging at its core can use rosout
- Human-readable messages and files.
- That is fine for the ROS-based core part of VISTA from a global perspective
- Nevertheless, there are more subsystems than that, and they do not live in the ROS-based core
- Please note the extensive table below, which makes clear that VISTA ≠ ROS!
- ROS -> log file monitoring, so that the core is included in the extended VISTA end-to-end logging
Fluentd & Fluent Bit
- Combined, suitable for embedded, edge, servers and containers
- No Java, please, as Logstash requires!
  - So the ELK stack suggested by coduct is only partly a viable idea
- nxlog as a filtering and forwarding agent for a very rapid start
  - we have got that working
Loggly to start rapidly with some basic analytics, including Syslog, Android, Ubuntu, NodeJS and ROS rosout log file monitoring
- We have some experience with that
With an increased need for filtering, searching and correlation
- Fluentd (with GELF output for Graylog) -> Elastic -> Kibana
- or Graylog for in-depth analytics of large logging datasets
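
To give an impression of what a record looks like on the GELF route mentioned above, here is a minimal sketch that builds a GELF 1.1 payload by hand. In practice a Fluentd output plugin would produce this; the helper name `to_gelf`, the level mapping and the extra field `tier` are assumptions made for illustration only.

```python
import json
import time
from typing import Any, Optional

# Mapping from textual levels to syslog severities, which GELF uses for "level".
SYSLOG_LEVELS = {"ERROR": 3, "WARN": 4, "INFO": 6, "DEBUG": 7}


def to_gelf(host: str, message: str, level: str = "INFO",
            timestamp: Optional[float] = None, **extra: Any) -> str:
    """Return a GELF 1.1 message as a JSON string.

    Additional fields are prefixed with an underscore, as GELF requires.
    """
    payload = {
        "version": "1.1",
        "host": host,
        "short_message": message,
        "timestamp": time.time() if timestamp is None else timestamp,
        "level": SYSLOG_LEVELS[level],
    }
    for key, value in extra.items():
        if key == "id":
            # GELF reserves "_id", so it may not be used as an additional field.
            raise ValueError("'id' is not allowed as an additional GELF field")
        payload["_" + key] = value
    return json.dumps(payload)


# Hypothetical record from an edge device, tagged with the tier it came from.
print(to_gelf("jetson-1", "scan delayed", level="WARN",
              timestamp=1581234567.25, tier="edge"))
```

Tagging each record with its origin in an additional field is one way to support the correlation across tiers that the requirements ask for.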

The basic idea projected on all foreseen VISTA run time technologies

Across all possible logging channels, programming languages, OSes and middleware, the following table shows the practical choices available to achieve unified, distributed and integrated, smart VISTA system and application logging & analytics:

Logging channel of origin	Practical use of proposal	Remark
Jetson TX2 / Nano	Syslog: standard Fluentd on Ubuntu using the in_syslog input plugin, please refer to https://docs.fluentd.org/input/syslog; Fluent Bit when an RTOS or similar is used, please refer to https://fluentbit.io	So far the Jetson TX2 has been considered an Ubuntu-based device. For some reason it might become a real embedded device with an RTOS. In that case there is still an excellent piece of software to keep the device in the unified logging service.
ROS rosout file based logging	Fluentd with Tailpath https://github.com/xthexder/fluent-plugin-tailpath	ROS/ROS2 has its own logging mechanism. That is fine, but not suitable to serve a complex application like VISTA end to end. So the ROS core functions continue to use it for ROS domain purposes, but to make this part of the unified logging service we are going to create a logging gateway function using the human-readable file logging.
Ubuntu	Syslog: standard Fluentd using the in_syslog input plugin, please refer to https://docs.fluentd.org/input/syslog
NodeJS	standard Fluentd using the ‘fluent-logger-node’ library, please refer to https://docs.fluentd.org/language-bindings/nodejs
Python	standard Fluentd using the ‘fluent-logger-python’ library, please refer to https://docs.fluentd.org/language-bindings/python
C++	OS / Syslog library (https://github.com/gabime/spdlog); Fluentd (https://github.com/m-mizutani/libfluent); Fluent Bit integration (https://support.treasuredata.com/hc/en-us/articles/360000691168-Data-Ingestion-from-Embedded-Apps-C-C-)	For the C++-based software parts in WP3 there are a lot of choices. In this functional domain, integrating the appropriate software library will bring high-performance logging, which might be needed for system integration.
.Net	standard Fluentd, please refer to https://github.com/fluent/NLog.Targets.Fluentd	Somewhat outdated; Log4Net and .NET logs sent directly to Loggly are a perfect alternative, please refer to https://www.loggly.com/docs/net-logs
Docker	standard Fluentd	Docker is a container technology that can be expected to be used on the Jetson TX2 devices and the VISTA DC Controller Unit
ROS2/Web bridge	= NodeJS	One of the VISTA supporting functions, which will not become part of the ROS-based core.
Android for the HMI device	Loggly direct, please refer to library https://github.com/inrista/loggliest	Android has a Fluentd implementation, but it is rather old. The best alternative would be this suggestion.
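
The logging gateway for ROS described in the table above essentially turns human-readable log file lines into structured records. A minimal sketch of that parsing step follows. The line layout used here (timestamp, level, node, code location, topics, message) is an assumption about the rosout log file and should be checked against the actual files; the function name and sample line are hypothetical.

```python
import re
from typing import Any, Dict, Optional

# Assumed layout of one rosout log file line:
#   <seconds.nanoseconds> <LEVEL> <node> [<file>:<line>(<function>)] [topics: ...] <message>
ROSOUT_LINE = re.compile(
    r"^(?P<stamp>\d+\.\d+)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<node>\S+)\s+"
    r"\[(?P<location>[^\]]*)\]\s+"
    r"\[topics:(?P<topics>[^\]]*)\]\s+"
    r"(?P<msg>.*)$"
)


def parse_rosout_line(line: str) -> Optional[Dict[str, Any]]:
    """Turn one log line into a structured record, or None if it does not match."""
    match = ROSOUT_LINE.match(line.rstrip("\n"))
    if match is None:
        return None
    topics = [t.strip() for t in match.group("topics").split(",") if t.strip()]
    return {
        "timestamp": float(match.group("stamp")),
        "level": match.group("level"),
        "node": match.group("node"),
        "location": match.group("location"),
        "topics": topics,
        "message": match.group("msg"),
    }


# Hypothetical sample line, for illustration only.
sample = ("1581234567.250000000 WARN /lidar_driver "
          "[driver.cpp:42(poll)] [topics: /rosout, /scan] scan delayed")
print(parse_rosout_line(sample))
```

Lines that do not match return None, so the gateway can decide whether to drop them or forward them unparsed.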
