Missing Measurements #731

neilh10 · 2024-08-08T16:32:42Z

I've detected missing measurements across a number of nodes.

https://monitormywatershed.org/sites/TUCA_Sa01/
20 records at 15-minute intervals are missing between the following times:
GMT
2024-08-06 13:15
2024-08-06 18:00
PST
2024-08-06 05:15
2024-08-06 10:00
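
A gap like the one above can be found mechanically from a list of record timestamps. The following is only a minimal sketch, assuming the timestamps have already been downloaded and parsed into sorted `datetime` objects; `find_gaps` is a hypothetical helper, and the series is synthetic, built to mimic the first interval listed above.

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=15)  # logging interval used by these nodes


def find_gaps(timestamps, interval=INTERVAL):
    """Return (first_missing, last_missing, missing_count) for each gap.

    Assumes `timestamps` is sorted and aligned to the logging interval.
    """
    gaps = []
    for prev, curr in zip(timestamps, timestamps[1:]):
        # Number of whole intervals between neighbours, minus the one expected step
        missing = (curr - prev) // interval - 1
        if missing > 0:
            gaps.append((prev + interval, curr - interval, missing))
    return gaps


# Synthetic series: one record every 15 minutes from 12:00 to 21:45,
# with the records from 13:15 through 18:00 removed.
start = datetime(2024, 8, 6, 12, 0)
gap_start = datetime(2024, 8, 6, 13, 15)
gap_end = datetime(2024, 8, 6, 18, 0)
series = [start + i * INTERVAL for i in range(40)]
series = [t for t in series if not (gap_start <= t <= gap_end)]

gaps = find_gaps(series)
for first, last, count in gaps:
    print(f"{first:%Y-%m-%d %H:%M} .. {last:%Y-%m-%d %H:%M}: {count} records missing")
```

Counting both endpoints, 13:15 through 18:00 at 15-minute steps is 20 records, which matches the count reported for this site.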

For https://monitormywatershed.org/sites/nh_LCC45/
the gap is slightly different, with 21 records missing between
PST
2024-08-06 11:45
..
2024-08-06 18:00

A visual check of PO03, GV01, GV08, Mi03, Mi06, Mw01, Mw12, and Na13 suggests they also have missing data.

Since I have a reliable delivery mechanism on my nodes, it seems likely that some MMW operation caused the gaps.

Possibly the same as #685.

ptomasula · 2024-08-08T17:12:15Z

@neilh10, I can confirm that the site was down on August 6th from 13:30 UTC until 18:30 UTC, as indicated on our uptime dashboard. I can plot the log data and verify server responses during that window, but a quick glance indicates that most (I will confirm if it was all) API traffic during those hours received a 500 response.
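
The check on server responses mentioned above could be sketched roughly as follows. This is not the actual MMW tooling; it assumes the access log has already been parsed into `(timestamp, status)` pairs, the helper names are hypothetical, and the sample entries are made up purely for illustration.

```python
from collections import Counter
from datetime import datetime, timezone


def status_counts(entries, start, end):
    """Count HTTP status codes for entries with start <= timestamp < end."""
    return Counter(status for ts, status in entries if start <= ts < end)


def share_of(counts, status):
    """Fraction of requests in `counts` that received `status`."""
    total = sum(counts.values())
    return counts[status] / total if total else 0.0


def utc(hour, minute):
    # Convenience constructor for timestamps on 2024-08-06 (UTC)
    return datetime(2024, 8, 6, hour, minute, tzinfo=timezone.utc)


# Made-up sample entries, NOT real log data
entries = [
    (utc(13, 0), 201),
    (utc(13, 45), 500),
    (utc(14, 0), 500),
    (utc(15, 15), 500),
    (utc(16, 30), 201),
    (utc(18, 45), 201),
]

# Outage window reported on the uptime dashboard: 13:30 to 18:30 UTC
counts = status_counts(entries, utc(13, 30), utc(18, 30))
print(dict(counts), f"share of 500s: {share_of(counts, 500):.0%}")
```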

This would seemingly be different from #685, in which the uptime monitoring data does not show the same level of outage.

ptomasula · 2024-08-08T17:13:59Z

I should also note that the outage was due to a connection issue between the application and database. We have updated our uptime monitoring to better check for this specific type of issue and should get quicker notifications of outage issues like this.

neilh10 · 2024-08-08T18:12:43Z

@ptomasula thanks for the quick insight, and great to see the graphs show the failure :)
Good to see that you can differentiate between failure types so quickly.
For 500 Internal Server Error, Wikipedia says:
"A generic error message, given when an unexpected condition was encountered and no more specific message is suitable."

I guess I feel like I'm walking on eggshells with the server, as there is an evolving discussion about what exactly reliability means and which responses to expect. I appreciate that this is an old issue and that you are picking it up and making progress on it: #485

As it turns out, 500 was also the response to the server condition described in #628.

In an attempt to be really cautious, and to cater for unexpected overload problems on the server, I can process 500 responses differently: if the server gets upset, the node gives up on reliably delivering that data, in the interest of not overloading the server. This was my own call.

That is, on my test systems I only look for 201 (and now any 2xx) as the handshake, but in the production software I've left a turd in there: it gives up on the server for a generic 500. I'll need to offload the logs from a device in the field to see how it responded.
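
The two behaviours described above can be sketched as a small decision function. This is an illustration only, not the node firmware, which is not shown in this thread; `Action`, `classify_response`, and the `give_up_on_500` flag are hypothetical names, and retrying on every other status is an assumption.

```python
from enum import Enum


class Action(Enum):
    DELIVERED = "delivered"      # server acknowledged the record
    RETRY_LATER = "retry_later"  # keep the record queued and try again
    GIVE_UP = "give_up"          # drop the record to avoid loading the server


def classify_response(status, give_up_on_500=False):
    """Decide what to do with a queued record after an HTTP POST.

    give_up_on_500=False mimics the test-system behaviour (only 2xx counts
    as the handshake); give_up_on_500=True mimics the production behaviour
    described above, where a generic 500 abandons delivery.
    """
    if 200 <= status <= 299:
        return Action.DELIVERED
    if status == 500 and give_up_on_500:
        return Action.GIVE_UP
    # Assumption: anything else is treated as a transient failure
    return Action.RETRY_LATER


for code in (201, 500, 503):
    print(code, classify_response(code).value,
          classify_response(code, give_up_on_500=True).value)
```

With `give_up_on_500=True`, records posted during an outage that returns 500 are never re-sent, which would be consistent with gaps like the ones reported here.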

IMHO, of course, for me personally it's an absolute disaster having devices in the field that are not remotely upgradeable and require a field visit for each upgrade.

Since I think the reason for the Missing Measurements has been diagnosed, and you believe you have a better monitor for the processes, I'll close this issue. Really appreciate the ability to diagnose it fast.

neilh10 closed this as completed Aug 8, 2024
