RTEMS

At ESA, our missions need real-time software - that is, software that can handle inputs and respond to them with actions within bounded time frames (translation: reliably on time, and usually really, really fast).

What does "real-time" mean?

You have probably heard of Operating Systems (OSes) that run on mainstream devices:

  • like desktop PCs and laptops, where the dominant OSes are Windows and OSX
  • ...and mobile phones, where the main OSes are Android and iOS.

These form the "software heart" of your device; every application that runs on your computer, phone or tablet, whether it is Microsoft Word or WhatsApp, makes use of functionality that is offered by the underlying operating system.

These operating systems are good at what they are designed to do - which is to cover the needs of everyday users. They are, however, not so good at handling real-time requirements. To put things in perspective, think of the computer that constantly monitors and controls a very fast car - for example, a Formula One prototype. The computer gets feedback from sensors, and suddenly realizes that some sort of accident has happened, and there's an obstacle less than 100 meters ahead. It therefore has to hit the brakes, hard - and do so as fast as possible.

No delay can be tolerated here.

None of the major Operating Systems are designed to cover this sort of need. Under various conditions, Windows, OSX, Android and iOS can end up completely blocked for long stretches of time, which would mean that the computer in question would not be able to "hit the brakes" in time. In the context of what we are discussing - urgently reacting to something - even 1/10th of a second can be a VERY long delay; it can mean the difference between life and death.

For our missions in space, we may need responses that are orders of magnitude faster than 1/10th of a second. It is this kind of requirement that defines what is termed safety-critical, real-time software - and a specific kind of operating system is required to support it.
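The difference between "fast on average" and "bounded" can be sketched in a few lines. This is only an illustration in Python, not how flight software is written; the function name, the response times and the 1 millisecond deadline are all made-up assumptions.

```python
from statistics import mean


def meets_deadline(latencies_us, deadline_us):
    """Return True only if the worst observed response time is within the deadline."""
    return max(latencies_us) <= deadline_us


# Hypothetical response times, in microseconds, of two systems to the same event.
general_purpose = [40] * 999 + [120_000]  # usually very quick, but blocked once for 0.12 s
real_time = [400] * 1000                  # slower on average, but never late

DEADLINE_US = 1_000  # assumed deadline of 1 millisecond

print(mean(general_purpose) < mean(real_time))       # True: the average favours the first system
print(meets_deadline(general_purpose, DEADLINE_US))  # False: one late response is enough to fail
print(meets_deadline(real_time, DEADLINE_US))        # True
```

Note that measurements like these only show the worst case that happened to be observed; they illustrate the idea, but do not by themselves prove that a bound always holds.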

There are many OSes that endeavor to address this, each with its own advantages and disadvantages. In many missions supervised by TEC/SWE, we use RTEMS - the Real-Time Executive for Multiprocessor Systems. It is an Open Source operating system, with various features that are very desirable in our target domain (that of embedded systems requiring real-time responses, and utilizing our custom-made, radiation-hardened processors).

Why do these delays happen?

For plenty of reasons. As a simple example, think of two processes - e.g. two applications on your phone - that both try to utilize the same resource (e.g. the data coming from your phone's camera). Due to hardware constraints, the camera data can't be accessed concurrently by both. Even though your phone's central processing unit (CPU) has multiple cores and can run the two applications in parallel, the underlying Operating System has to block one application while the other one reads the hardware resources of the camera; when that one is done, the OS must then unblock the one that was waiting.
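The blocking and unblocking described above can be sketched with a lock. This is a minimal illustration using Python threads, not RTEMS code; the "camera", the application names and the counters are hypothetical stand-ins.

```python
import threading

camera_lock = threading.Lock()  # hypothetical stand-in for the OS primitive guarding the camera
active_readers = 0              # how many applications are using the camera right now
max_active_readers = 0          # the highest value ever observed
frames_read = []


def application(name, frame_count):
    global active_readers, max_active_readers
    for i in range(frame_count):
        with camera_lock:  # blocks here while the other application holds the camera
            active_readers += 1
            max_active_readers = max(max_active_readers, active_readers)
            frames_read.append((name, i))
            active_readers -= 1
        # leaving the "with" block releases the lock, unblocking a waiting application


threads = [threading.Thread(target=application, args=(name, 100))
           for name in ("app_a", "app_b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(max_active_readers)  # 1: the two applications never used the camera at the same time
print(len(frames_read))    # 200: both applications still got all their frames
```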

This kind of waiting and unblocking sounds simple, in theory - but things get very complicated if the number of applications is more than two, and if it's not just one resource but multiple ones that need to be acquired to perform a task. If the work isn't done properly, the applications may in fact end up having to wait for each other, and even end up completely deadlocked!

(You may have already experienced this phenomenon on your desktop OS, when for reasons unknown things seem to get stuck - you end up frantically hitting the keyboard, desperately hoping for something to happen - to no avail.)
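To make the deadlock risk concrete: if one task takes the camera and then waits for storage, while another takes storage and then waits for the camera, each can end up waiting for the other forever. One common way to avoid this (not necessarily what RTEMS or any particular OS does) is to always acquire resources in one agreed order. The sketch below assumes two hypothetical resources and task names.

```python
import threading

# Hypothetical resources; the fixed order below is an assumption made for this sketch.
camera = threading.Lock()
storage = threading.Lock()
LOCK_ORDER = [camera, storage]


def acquire_in_order(*locks):
    """Acquire the given locks in the global order, whatever order the caller asked for."""
    ordered = sorted(locks, key=LOCK_ORDER.index)
    for lock in ordered:
        lock.acquire()
    return ordered


def release_all(ordered):
    for lock in reversed(ordered):
        lock.release()


completed = []


def task(name, first, second, rounds):
    for _ in range(rounds):
        held = acquire_in_order(first, second)
        try:
            pass  # both resources would be used here
        finally:
            release_all(held)
    completed.append(name)


# The two tasks ask for the resources in opposite orders, which is the risky pattern.
threads = [
    threading.Thread(target=task, args=("recorder", camera, storage, 1000)),
    threading.Thread(target=task, args=("uploader", storage, camera, 1000)),
]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)

print(sorted(completed))  # ['recorder', 'uploader']: both finished, nobody got stuck
```

Because both tasks end up taking the camera first, neither can hold storage while waiting for the camera, so the circular wait cannot form.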

In TEC/SWE, we recently concluded work on RTEMS that significantly improves the response times of the primitives that block and unblock applications requiring synchronization.

And in addition, we are also qualifying RTEMS.

"Qualify"? What does that mean?

We'll have to oversimplify a bit here - but put simply, safety-critical software needs to be tested in a rather exhaustive manner. In most commercial software, that would mean hiring armies of human testers and having them try to "break" the software; that is, find some sort of use that leads it to an undesired behavior.

But in our case, qualification means something else. We write code that reproduces test scenarios - and execute these tests on the same platforms that will actually fly. To make sure that the software works properly, our test scenarios verify every aspect of the functionality we want in the mission (including the handling of undesired conditions!).

But very importantly, in so doing, the tests must also force the execution of every line of code and, depending on the criticality of the mission, every possible outcome of every decision point in the code as well.
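The difference between "every line" and "every decision" is easy to miss, so here is a small sketch. The function, its name and its values are hypothetical and have nothing to do with the actual RTEMS test suite; the decision is recorded by hand, purely for illustration, where a real project would rely on a coverage tool.

```python
decision_outcomes = set()  # which outcomes of the single decision below have been exercised


def clamp_command(value, limit):
    """Hypothetical helper: never let a command exceed its limit."""
    over_limit = value > limit
    decision_outcomes.add(over_limit)  # manual instrumentation of the decision
    if over_limit:
        value = limit
    return value


# Test 1 executes every line of clamp_command...
assert clamp_command(150, 100) == 100
print(decision_outcomes)  # {True}: ...but the decision was never seen to be False

# Test 2 adds the missing outcome, so the decision has now gone both ways.
assert clamp_command(50, 100) == 50
print(decision_outcomes == {True, False})  # True
```

With only the first test, a bug in the "nothing to clamp" path could go unnoticed even though every line had been executed.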

Imagine this kind of testing applied to something as complex as an Operating System - i.e. something that handles multiple hardware devices, executes concurrent tasks, makes sure all of them get a fair share of the available resources while respecting each other's priorities, guarantees that none of them step on each other's toes (resources), etc.

This is (a part of) what "Qualification" means. We have done this for RTEMS, and continue to do so as the Operating System evolves and accumulates more and more functionality.