MQTT automation message format


#1

Another thread, whose name I’ve now forgotten, presented an MQTT message structure very different to the one I’ve been using, and it got me thinking: what are people’s experiences with the various structures?

So I thought I’d put the question to everyone here… What MQTT messaging schemes do people use, and what rationales, pros, and cons have you experienced with them?

The scheme I saw mentioned, used cmnd/<id>/<port> to change the state of something, and stat/<id>/<port> to report status, with cmnd messages being reflected with a corresponding stat message.

Mine just uses <id>/<port>, keeping the topic tree as short as possible. I didn’t acknowledge commands, because MQTT seems to be a reliable enough protocol (especially with QoS), and doing it this way also lets me answer a query (simply a command message with no value) without having to rewrite the topic at all. I am wondering, though, whether the acknowledgement is a better idea than I’d given it credit for.
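To make the two layouts concrete, here is a minimal sketch of how the topics compare. The helper names and the "relay1" port are made up for illustration; only the topic shapes come from the posts above.

```python
# Hypothetical helpers; only the topic shapes are taken from the thread.

def split_topics(device_id: str, port: str) -> dict:
    """cmnd/stat layout: one topic to command, another to report."""
    return {
        "command": f"cmnd/{device_id}/{port}",
        "status": f"stat/{device_id}/{port}",
    }


def single_topic(device_id: str, port: str) -> str:
    """Single-topic layout: commands and status share <id>/<port>."""
    return f"{device_id}/{port}"


def is_query(payload: bytes) -> bool:
    """In the single-topic layout, a command with no value is a query."""
    return len(payload) == 0


print(split_topics("5410EC4C7221", "relay1"))
print(single_topic("5410EC4C7221", "relay1"))
```

With the single topic, the reply to a query goes out on the very topic the query arrived on, which is the "no rewriting" convenience described above.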

However, doing it this way also means that my devices receive their own status messages. That is rarely an issue: I just compare them to the current state and ignore them as appropriate (and many times it doesn’t even matter anyhow). On the up-side, with judicious use of the retain flag, the MQTT server provides natural and automatic storage of the device’s state across a restart: the device simply receives the most recent retained commands again when it re-subscribes to the MQTT server.
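A rough sketch of that behaviour, using an in-memory stand-in rather than a real broker. TinyBroker and Lamp are hypothetical names, and the stand-in models only retained messages with exact topic matches (no QoS, no wildcards, no clearing of retained values).

```python
class TinyBroker:
    """In-memory stand-in for an MQTT broker; models only retained messages."""

    def __init__(self):
        self.retained = {}      # topic -> last retained payload
        self.subscribers = {}   # topic -> list of callbacks

    def publish(self, topic, payload, retain=False):
        if retain:
            self.retained[topic] = payload
        for callback in self.subscribers.get(topic, []):
            callback(topic, payload)

    def subscribe(self, topic, callback):
        self.subscribers.setdefault(topic, []).append(callback)
        # Like a real broker, replay the retained message on subscription.
        if topic in self.retained:
            callback(topic, self.retained[topic])


class Lamp:
    """Device that shares one topic for commands and status."""

    def __init__(self, broker, topic):
        self.state = "OFF"      # default state after power-up
        self.changes = 0        # how many times the output really changed
        broker.subscribe(topic, self.on_message)

    def on_message(self, topic, payload):
        if payload == self.state:
            return              # own echo or redundant command: ignore
        self.state = payload
        self.changes += 1


broker = TinyBroker()
topic = "5410EC4C7221/relay1"
broker.publish(topic, "ON", retain=True)   # last command before the restart
lamp = Lamp(broker, topic)                 # device comes back and re-subscribes
print(lamp.state)
```

The lamp picks up the retained value without any extra "restore" logic, and a repeated "ON" is simply absorbed.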

I also have an <id>/hello message, which the device transmits on MQTT (re)connection (or when asked for it), composed of the firmware name, version, network address (e.g. ip4:1.2.3.4), and then any relevant parameters: for my RGB LED strip, that was the type and number of LEDs. And for discovery, I also have an id of zero, representing a broadcast address. That is mostly more likely to be troublesome than useful, except in the case of broadcasting a hello query to see what devices are present.
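As an illustration only, a hello payload along those lines could be built and parsed like this. The post does not say how the fields are delimited, so the space-separated layout, the key=value parameters, and the example values (firmware "ledstrip", version "1.2", LED type "WS2812") are all assumptions.

```python
# Assumed wire format: "<firmware> <version> <address> key=value ..."

def build_hello(firmware, version, address, **params):
    fields = [firmware, version, address]
    fields += [f"{key}={value}" for key, value in params.items()]
    return " ".join(fields)


def parse_hello(payload):
    firmware, version, address, *extra = payload.split(" ")
    params = dict(item.split("=", 1) for item in extra)
    return {
        "firmware": firmware,
        "version": version,
        "address": address,
        "params": params,
    }


hello = build_hello("ledstrip", "1.2", "ip4:1.2.3.4", type="WS2812", count=60)
print(hello)
print(parse_hello(hello))
```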

I did actually independently consider a structure not unlike the cmnd/stat one mentioned, but didn’t see much benefit, and saw a couple of disadvantages, like being slightly harder to reflect the status messages (admittedly, not very much so long as they’re the same length), and potentially having more topics to monitor. My devices subscribe to their device id and the broadcast id, and receive all messages below those two roots, and that’s it (all outbound responses always go out on the device’s id, even if received by broadcast).
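The "two roots" subscription can be sketched with MQTT's standard wildcards, where `#` matches everything below a level. The function names are hypothetical, and the matcher is a simplified version of the standard rules (it ignores the special handling of topics starting with `$`).

```python
BROADCAST_ID = "0"   # the broadcast id described above


def device_subscriptions(device_id: str) -> list:
    """The only two filters a device needs in this scheme."""
    return [f"{device_id}/#", f"{BROADCAST_ID}/#"]


def topic_matches(topic_filter: str, topic: str) -> bool:
    """Simplified MQTT filter matching with '+' and '#' wildcards."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for index, level in enumerate(filter_levels):
        if level == "#":
            return True     # matches the parent level and everything below
        if index >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[index]:
            return False
    return len(filter_levels) == len(topic_levels)


def reply_topic(device_id: str, port: str) -> str:
    """Replies always go out on the device's own id, even for broadcasts."""
    return f"{device_id}/{port}"


filters = device_subscriptions("5410EC4C7221")
print(any(topic_matches(f, "0/hello") for f in filters))
print(reply_topic("5410EC4C7221", "hello"))
```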

Using the cmnd/stat separation, there would be an argument for retaining the status messages instead of the command messages, since it removes the need for every client to know which messages should be retained and which shouldn’t. However, that comes with its own complication for storing current device state on the MQTT server, since I’d then need to temporarily subscribe to the stat topic at startup.

On the flipside, the comment that the response message was useful for their Node-RED automation intrigues me. I am considering replacing my home-grown automation engine with Hass.io or something of that ilk, in which case this may become an issue (also, my present means of mapping device IDs to more friendly names in the rules may not work any more).

And so I am considering re-implementing the cmnd/stat separation, with the form <id>/cmnd/<port>, since I personally find having the id first to be more useful, and practically the same anyhow, so far.

In the case of most of my devices, they also run a trivial HTTP interface, which feeds into the exact same request parser as the MQTT receiver (with the exception of some HTTP-only commands which don’t really make sense in isolation — the ledstrip supports indicators that steal a few LEDs from one end or the other, and I’ve found the HTTP interface only, makes more sense there). This would require some extra handling regarding the cmnd/stat issue, because I’ll need to emit both — one to store the setting, and one to let everything else know it’s been changed.

That, in turn, brings up a question: should a cmnd always emit its stat message, even if it was a redundant command? If that is the case, then an HTTP or local input should only emit the cmnd, and wait to receive the reflection before it emits the corresponding stat message.
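One way to picture the "always emit stat" option is the sketch below. It assumes the <id>/cmnd/<port> form mentioned earlier, a matching <id>/stat/<port> topic, and a hypothetical Relay class whose methods return the messages to publish instead of talking to a broker.

```python
class Relay:
    """Hypothetical edge device using <id>/cmnd/<port> and <id>/stat/<port>."""

    def __init__(self, device_id, port):
        self.cmnd_topic = f"{device_id}/cmnd/{port}"
        self.stat_topic = f"{device_id}/stat/{port}"
        self.state = "OFF"

    def handle_command(self, payload):
        """Apply a cmnd and return the (topic, payload) pairs to publish."""
        if payload:                  # an empty payload is a query
            self.state = payload
        # Always report, even when the command changed nothing.
        return [(self.stat_topic, self.state)]

    def local_input(self, payload):
        """HTTP request or local button: publish only the cmnd.

        The state changes, and the stat goes out, once the broker
        reflects the cmnd back into handle_command().
        """
        return [(self.cmnd_topic, payload)]


relay = Relay("5410EC4C7221", "relay1")
print(relay.handle_command("ON"))
print(relay.handle_command("ON"))    # redundant, but still acknowledged
print(relay.local_input("OFF"))
```

The design choice here is that there is exactly one code path that changes state and emits stat, whatever the source of the request.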

So, yeah, what do people use in situations like this? Does anyone bother with any kind of discovery? How many people encode the device’s location into its messages, rather than just a generic ID? Does anyone store a changeable device ID in EEPROM instead (assuming, of course, the absence of external addressing, as in the case of dumb light switches connected to a centralised controller)?


#2

An interesting question… It depends on what the data is and how you’re using it.

For example - a temperature sensor will only report data to MQTT; no response required. If you query it for current temperature, it listens on one topic and posts temperature data to a different topic.

A light switch is the other way around. It can be useful to know if a light is on or not /before/ issuing a new command; but you could be making incorrect assumptions.

Let’s say you have a rule to turn the lights on at dusk. You can go one of two routes -

  1. Just send an unconditional ON command;
  2. First check to see if the light is already ON and then do nothing. If the light is OFF, then send an ON command.

In my rules engine for this use case, I simply send the “ON” command. The end point can easily handle receiving an “ON” even if it is already “ON”, and I don’t have to worry about what the previous status is or build in all sorts of error handling myself to check the status and handle reconnects/dropouts.

I have seen both approaches work well and seen both fail. Knowing the status first and then deciding which action to take requires accurate information in the first place, and handling dropouts and reconnects well.

In some cases you HAVE to know the current status before taking action -
My Arduino garage monitor can open the door when I come home and closes it automatically after 30 minutes of no motion. The problem presents itself very prominently in this case: the opener can’t simply be told “open” or “close”; I can only pulse the input button. The door responds by moving in the opposite direction to the last time it moved. So I had to add an “open” and a “closed” prox. The Auto Close doesn’t do anything if the door is already closed, and the Auto Open will only open the door if it is currently closed.

Technically, a garage door would have 5 possible states - Closed, Open, Closing, Opening, and Stopped [in the middle]. Compare this with a more binary scenario like a regular door which is either Open or Closed.
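A minimal sketch of that guard logic, assuming two boolean prox inputs and hypothetical request names "auto_open" and "auto_close". The two sensors alone cannot tell Opening, Closing and Stopped apart, so the sketch lumps them together as "between".

```python
def door_position(closed_prox: bool, open_prox: bool) -> str:
    """Derive what can be known from the two prox sensors."""
    if closed_prox:
        return "closed"
    if open_prox:
        return "open"
    return "between"    # opening, closing or stopped: not distinguishable here


def should_pulse(request: str, closed_prox: bool, open_prox: bool) -> bool:
    """Decide whether to pulse the opener button for an automatic request."""
    position = door_position(closed_prox, open_prox)
    if request == "auto_open":
        return position == "closed"     # only open a door that is closed
    if request == "auto_close":
        return position != "closed"     # do nothing if already closed
    return False


print(should_pulse("auto_open", closed_prox=True, open_prox=False))
print(should_pulse("auto_close", closed_prox=True, open_prox=False))
```

Pulsing while the door is "between" may stop or reverse it depending on its last direction, which is exactly why the unconditional-command approach does not work for this device.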


#3

Having a definite set/reset that just gets absorbed if it’s redundant, is also good for self-healing, if parts of the system have slipped out of sync somehow. Toggles I try to support within the controller — because I fell in love with JK flip-flops as a kid — but they can be problematic at the edges; for something like that garage door, I’d definitely at least put in the effort of adding a couple limit switches and a timer (or even better, picking up on the ones it probably already has).

My focus was on the choice of MQTT topic, more than that, though, and mostly, why people chose split topics, over the single-topic mechanism I set mine up with.

My view was that the MQTT server is the authority, even more so than the edge device itself, since if MQTT is down, the whole system is pretty well broken anyhow, and storage is probably cheaper on the MQTT server as well (storing state on an edge node typically means EEPROM, with the various issues that entails). And so I didn’t bother with messages acknowledging a command, as you’d require with command and state on separate topics. The only situation I can think of in which that would be useful is failure of the device controlling the globe (not the far more likely case of the globe itself, and it’s a case already covered by Will messages), and messages from confirmation sensors within the node can still be issued just the same. The benefit of the single-topic method is that a device, when it comes up, will get either the last command it was sent or the last status it issued. The alternative, if you want an edge node to pick up where it was after a restart, is that the edge node needs to either subscribe to its own status messages as well as the command messages, or wait for the controller to reflect the last status message back to it. And even then, an edge node wanting to transmit a message can still keep watch on its connection to the MQTT server, and store the message away in EEPROM if needed, but if it’s that important, it should probably have a default state rather than persisting the last state.

In the case of the garage door specifically, I think tossing in some limit switches (or picking up the ones it probably already has) is a damned good idea, but if that’s absolutely not possible, then acknowledging the command with a second set of messages would probably be the second-best option, since at least that way you know the node closest to the garage door is trying to do the job. But I still don’t see any clear advantage between single vs. split topic. In my single-topic system, opening and closing states would be inherent in the command itself: the garage door was told to open, so therefore it is opening. When it completes the action (including if it was already in that state), it would then emit an “opened” or “closed” message, and likewise if a timer ran out without either of the limit switches triggering, it’d emit a “stopped” message (which could also be done in the controller instead). An observer, like a light which should be on when the garage door isn’t closed, would then listen for an Open command (which doubles as an Opening status) and a Closed status (but ignore the Close command, or a Stopped status). I do think every node in an automated system needs to know its current state, though, including garage doors.
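The message vocabulary described above could be interpreted roughly as follows. The lowercase payload strings ("open", "close", "opened", "closed", "stopped") and both function names are assumptions made for the sketch.

```python
# On the single topic, commands double as "in progress" statuses.
DISPLAY = {
    "open": "opening",      # command: the door was told to open
    "close": "closing",     # command: the door was told to close
    "opened": "open",       # status: limit switch reached
    "closed": "closed",     # status: limit switch reached
    "stopped": "stopped",   # status: timer ran out between the switches
}


def display_state(payload: str, previous: str) -> str:
    """What an interface would show after seeing this payload."""
    return DISPLAY.get(payload, previous)


def light_should_be_on(payload: str, currently_on: bool) -> bool:
    """Observer rule: light on whenever the garage door isn't closed."""
    if payload == "open":       # Open command doubles as Opening status
        return True
    if payload == "closed":     # Closed status
        return False
    return currently_on         # ignore Close command, Stopped, the rest


state, light = "closed", False
for message in ["open", "opened", "close", "closed"]:
    state = display_state(message, state)
    light = light_should_be_on(message, light)
    print(message, "->", state, "light on" if light else "light off")
```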

One thing that my single-topic system handles naturally is a device with its own local button; it can emit a status message in response to that button being pressed. Since that status message looks identical to the corresponding command message, as soon as the device reconnects to the MQTT server following a power failure, it gets that status message back again and assumes that state. This will work even if the MQTT server is the only thing reachable.


#4

The more I think about it, I see this as multiple different specific “use cases”… (more on that at the end)

In response specifically to your original question, as I think I now understand it: one useful reason for separate topics definitely has to do with keeping the entire system in sync. There would only be one “master” topic (the Status) which reflects the ‘truest’ known state. Then other things can keep in sync with it, e.g. a rule like “only turn the light on if the door opens and it’s nighttime”. You would base the rules on the Status topic and NOT the Command topic.

I think even the icons in OpenHab reflect the “status”, and you are only really sending a “request” to the end-node to turn the light on via the CMND topic. Only if the light responds “ON” on the OTHER topic does OpenHab allow the light bulb icon and switch icon to change. The request itself doesn’t change them, so if the light never responds, the icons never change.

So the Requests (CMNDs) are sent out “behind the scenes” to the end-node; which is really the gatekeeper of the Status topic.

If an end-node is offline or otherwise unresponsive, and you send an “ON” command using a ‘single-topic’ architecture, then this would force things out of sync. All the rest of the system will wrongly assume that the device is ON.

My ‘garage controller’ (Arduino based system that I put together) uses a single topic and it acts funny - this shows up because I send an “Open” command via MQTT and instantly the icon gets messed up. It starts correctly as a “closed” icon, but when I tap the icon to open it, the icon switches immediately & directly to “fully opened” because OpenHab sees this instantaneously. Then, the request reaches my controller and the physical door starts to open. A moment later it clears the “closed” prox and updates MQTT --> the Icon changes to “50% open”. A few seconds later when it hits the “Open” prox, the icon changes a second time to the “Fully open” one.

So the order for the icon based on a single tap under “Single Topic” is: Closed --> Fully open --> 50% open --> Fully open (again). A two-topic system would correct this to Closed --> 50% open --> Fully open.
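The two icon sequences can be reproduced with a toy model of a UI that mirrors every payload it sees on the topics it watches. The topic names, the payloads ("OPEN", "HALF") and the function name are invented for the sketch; the behaviour of the real OpenHab item is not modelled.

```python
ICONS = {"CLOSED": "Closed", "HALF": "50% open", "OPEN": "Fully open"}


def icon_history(messages, watched_topics, start="Closed"):
    """Icons shown by a UI that mirrors payloads on the watched topics."""
    icons = [start]
    for topic, payload in messages:
        if topic in watched_topics:
            icons.append(ICONS[payload])
    return icons


# Single topic: the UI sees its own command as if it were a state.
single = [
    ("garage/door", "OPEN"),    # Open command from the tap
    ("garage/door", "HALF"),    # "closed" prox clears
    ("garage/door", "OPEN"),    # "open" prox reached
]

# Two topics: the UI watches only the status topic.
split = [
    ("cmnd/garage/door", "OPEN"),
    ("stat/garage/door", "HALF"),
    ("stat/garage/door", "OPEN"),
]

print(" --> ".join(icon_history(single, {"garage/door"})))
print(" --> ".join(icon_history(split, {"stat/garage/door"})))
```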

Any ‘local’ button presses in a two-topic system only update the Status topic. No other part of the system should need to know or care about the Command topic other than the subscribed endpoints which would react exactly once to each new post.

For the other things tied to this topic - which are tightly intertwined -
I think that handling disconnects is really a specific use case - we should start a different thread for that. Too much to include here.


#5

I posted a new topic specifically about reconnecting endpoints.
Also - to your point about

since if MQTT is down, the whole system is pretty well broken anyhow

I think that endpoints should have enough smarts to perform their most basic functions stand-alone. This goes against what Jon talked about in one video and is a common topic of debate - “where do the Smarts go - centralized or distributed”. This is really also another topic - I’ll make one :)


#6

If an end-node is offline or otherwise unresponsive, and you send an “ON” command using a ‘single-topic’ architecture, then this would force things out of sync. All the rest of the system will wrongly assume that the device is ON.

Yes, for a little while — but only for low-importance devices. I’m unsure why that’s a bad thing, but pretty sure if it is, there’s some nasty complexity in your layout that’s gonna come back to bite you.

Take a light, for example. Right now my bedroom lamp is a Sonoff device, and I have had it take almost a minute to come on or go off, but it has none the less done so every single time in the couple of months it’s been connected up: it comes on in the morning, goes off when I leave the room, comes back on when I re-enter, and goes off when I sleep. The only weak point is that the Bluetooth on my phone, which I use for presence detection, sometimes causes it to switch off when I’m still present (it also used to occasionally lock my computer in the middle of whatever I was doing). If I switch it in the interface, it always takes a second or two at least, but it does come on. Whenever I’ve checked the Sonoff app while I’m out, it’s always shown as off.

And that’s with it being hideously convoluted, too: the interface and controller scripts (literally, they’re written in bash) set an MQTT topic, another controller script is monitoring that topic and issues a web request to IFTTT, which passes it on to Sonoff (or rather, eWeLink, or whatever it’s called), and from there back again to the device via “the cloud” and wifi, which then switches on the light (I haven’t gotten around to trying out Tasmota; I’m waiting until I get a second Sonoff to screw up rather than the one that’s in actual use). If that process breaks at any point after the MQTT server, I’m SOL anyhow, but even so, so what? It’s a light, no huge drama; that’s what the flashlight button on my phone is for. And the local loop is still vastly quicker than the Sonoff leg of the journey. Plus, for the most part, if I press the light switch and it indicates the light has changed but it hasn’t, I’m going to know anyhow, and if I don’t, I’m probably not going to care for at least a little while either, which is plenty of time for it to catch up. So yeah, I’ve probably got a bit of a “she’ll be right” attitude going on there, at least partly enforced by the present situation.

Also, don’t forget MQTT Will messages, I use those with every MQTT-connected device, along with a “hello” message whenever it (re)connects, and in general, it seems to have a lot of self-healing. But I do agree, the failure response case is perhaps a little too long for immediate-response edge nodes like a simple light, because you have to wait for the MQTT server to realise there’s a problem — of course, assuming the network or node fails right then, if it was already down the interface surface will already be displaying a failure state.
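For readers who haven't used Will messages, here is a toy model of the idea: the device registers a message at connect time, and the broker publishes it only if the connection is lost uncleanly. TinyBroker, the <id>/status topic, the "offline" payload and the hello payload are all made up for the sketch. The delay mentioned above comes from the keep-alive timer: MQTT 3.1.1 lets the broker wait one and a half keep-alive periods before treating a silent client as gone.

```python
class TinyBroker:
    """In-memory stand-in for a broker; models only Will handling."""

    def __init__(self):
        self.published = []   # (topic, payload) in delivery order
        self.wills = {}       # client id -> (topic, payload)

    def connect(self, client_id, will_topic, will_payload):
        self.wills[client_id] = (will_topic, will_payload)

    def publish(self, topic, payload):
        self.published.append((topic, payload))

    def disconnect(self, client_id):
        # Clean disconnect: the Will is discarded, nothing is published.
        self.wills.pop(client_id, None)

    def connection_lost(self, client_id):
        # Unclean loss: the broker publishes the Will for the client.
        will = self.wills.pop(client_id, None)
        if will is not None:
            self.publish(*will)


def worst_case_detection_seconds(keep_alive_seconds):
    """Longest silence tolerated before the broker drops the client."""
    return 1.5 * keep_alive_seconds


broker = TinyBroker()
device = "5410EC4C7221"
broker.connect(device, f"{device}/status", "offline")   # hypothetical Will
broker.publish(f"{device}/hello", "ledstrip 1.0 ip4:1.2.3.4")
broker.connection_lost(device)                          # node drops off
print(broker.published)
print(worst_case_detection_seconds(60))
```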

The question is, what exactly are you protecting against with that status response?

The only benefit I can see, is you get a bit of an impromptu network latency test thrown in for free. Unless something really weird happens within the edge node that causes it to continue to respond on the network, but still not work… and in that case, I’d say the chances are pretty good it’s also sent back the response too, and yet still hasn’t done anything? Anything after the firmware (including the output to the relay, and the relay itself) won’t be detected either way, and anything before it, still will — there’s just a little delay in my case before it shows as a problem, and that will, eventually, be picked up and sent to my phone as an alert, for example.

And in the case of a light with more than one control interface, the immediate notification I consider to be better, and less likely to cause race conditions — which is another source of desync. If two devices emit opposing commands at the same time, things could potentially get a little ugly.

My ‘garage controller’ (Arduino based system that I put together) uses a single topic and it acts funny - this shows up because I send an “Open” command via MQTT and instantly the icon gets messed up. It starts correctly as a “closed” icon, but when I tap the icon to open it, the icon switches immediately & directly to “fully opened” because OpenHab sees this instantaneously. Then, the request reaches my controller and the physical door starts to open. A moment later it clears the “closed” prox and updates MQTT --> the Icon changes to “50% open”. A few seconds later when it hits the “Open” prox, the icon changes a second time to the “Fully open” one.

That sounds like the messages are being interpreted wrong; the “Open command” in my layout, gets interpreted by the interface as an “in progress” status, and should be displayed as thus — so if it showed as “fully open” immediately, then something’s wrong, or is that a limitation in the control interface you’re using (considering I’m still using a fairly crufty web page and some Javascript)? I had a mockup using a randomish delay to represent a door opening (garage doors and lights being like the two most common cases that I can think of), for a while, and it showed just fine and reliable with my single-topic scheme as I’ve presented it: when it received the “Open command” reflected back from MQTT, it showed as “opening”, and when the thing was done it sent back “Opened status”, which shows as “open” on the page — that was as part of a couple mockups I tossed together for various devices to work through their communications needs. And I was tormenting it by yanking out the network or power connections at odd times, too.

You have made me think, though, the failure message is presently coming from a timer on the edge node, which is perhaps not the best place for it. But, I was also assuming it’d emit a message when the Closed limit switch broke, so you’d get an Opening message anyhow in addition to the Open command, some very short time after the door started to open — and that’s much better than the edge node acknowledging the command, anyhow. Again, if the network and control system is intact, is there any reason the edge node wouldn’t receive and act on the command? If the edge node itself is down, what’s the chances of a functional MQTT connection to hold back the Will message (and subsequent alert)? And if it’s just messed up, then all bets are off anyhow unless you’ve got an entirely separate device doing the monitoring (for example, the output might have become an input, and the internal pull-up resistor isn’t enough to switch the controlling relay).

To clear up any confusion, that all involves a philosophy I had in my thinking: for a light, all I really need to know is whether the command has been received by MQTT, and it’ll change when it gets the chance. I can’t think of any state in which it would just silently not turn off (a better example than not turning on, in which the globe could be dead; with split or single topics, you’re still not going to have a clue). If it’s something important, like a garage door, then I’ll close the loop either within the device (the limit switches emitting status messages), or through a second device entirely (as in the case of my desk clock with a light level sensor; I’m working on using it to indicate when the Sonoff gets around to flicking the relay). So in general, my philosophy was one of optimism, with fallback. The split topic seems to be more pessimistic, and potentially a false sense of security…

Being the OCD belt-and-suspenders type that I am, that delay between failure and the failure indication is now kinda bugging me, so I’ll probably think about going split topic like everyone else just to close that hole… eventually. Where before the single topic method also saved me just enough very precious bytes in my dinky little EtherTen’s firmware, that’s not really an issue now it’s been replanted into an EtherMega (of which only two of its huge number of ports are actually used), and with my contemplated switch to RS485 anyhow, there won’t be that great big ethernet library (which I can’t use in the project because it’s now even bigger). Power over RS485… good ol’ Po…S? Getting MQTT running over 485 is going to be kinda fun, too… But Ethernet is just such a gruesome heavy-weight, and quite unnecessary…

Which leaves, from my original post (since part of my original question(s) have been shipped off to your new post), the MQTT topic tree. Do people program the edge devices with a common name, or do we give them an ID, and do the translation elsewhere? For example, Jon’s old Arduino light switches were programmed with a common name, where his newer ones use the ID principle, as I understand them. My LED strip is named “5410EC4C7221”, which is just the built-in MAC address. That goes into a mapping file that allows me to refer to it transparently as either that number or the common name “ledstrip”, and I can change the common name at will without having to change the firmware (or vice versa: if I need to update the firmware, every individual device doesn’t need its own private instance). And, does anyone use a discovery scheme? My nodes emit a “hello” message (and re-emit it if they subsequently receive one, too), including the device’s firmware name and version, IP address, and extra info which might be handy (like the type and number of LEDs in the ledstrip).
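A mapping file like that could be handled along these lines. The file format is not described in the post, so the "one ID and one name per line, # for comments" layout and both function names are assumptions.

```python
def load_mapping(text: str) -> dict:
    """Parse lines of '<device id> <common name>' into a dict."""
    id_to_name = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()    # drop comments and blanks
        if not line:
            continue
        device_id, name = line.split()
        id_to_name[device_id.upper()] = name
    return id_to_name


def resolve(identifier: str, id_to_name: dict) -> str:
    """Accept either the raw ID or the common name; return the raw ID."""
    if identifier.upper() in id_to_name:
        return identifier.upper()
    for device_id, name in id_to_name.items():
        if name == identifier:
            return device_id
    raise KeyError(identifier)


mapping = load_mapping("""
# id            common name
5410EC4C7221    ledstrip
""")
print(resolve("ledstrip", mapping))
print(f"{resolve('5410ec4c7221', mapping)}/hello")
```

Renaming a device then only touches the mapping file; the firmware and the topics on the wire keep using the MAC-derived ID.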

I have been considering adding the ability to set the “board name”, which I believe Arduino expects to be at the end of EEPROM, and use that in place of the device ID… But I’m unsure whether that complexity is even at all useful, and it would require some extra work to handle it — not only in terms of setting/updating it, but just using it in the MQTT topic; need to manage the subscriptions, I believe it’s written backwards in EEPROM (?), which makes it harder to use, etc.