Missing Measurements #731

Closed
neilh10 opened this issue Aug 8, 2024 · 3 comments

@neilh10

neilh10 commented Aug 8, 2024

I've detected missing measurements across a number of nodes.

https://monitormywatershed.org/sites/TUCA_Sa01/
is missing 20 records at 15-minute intervals between the following times:
PST
2024-08-06 13:15
2024-08-06 18:00
GMT
2024-08-06 5:15
2024-08-06 10:00
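
The record count above can be checked with a short script. This is only a sketch: the function names `expected_timestamps` and `missing_timestamps` are hypothetical, and it assumes the logger reports exactly on a fixed 15-minute schedule.

```python
from datetime import datetime, timedelta


def expected_timestamps(first, last, step_minutes=15):
    """Return every scheduled timestamp from first to last, inclusive."""
    step = timedelta(minutes=step_minutes)
    stamps = []
    current = first
    while current <= last:
        stamps.append(current)
        current += step
    return stamps


def missing_timestamps(first, last, received, step_minutes=15):
    """Return the scheduled timestamps that are absent from `received`."""
    received_set = set(received)
    return [
        stamp
        for stamp in expected_timestamps(first, last, step_minutes)
        if stamp not in received_set
    ]


# The TUCA_Sa01 window quoted above: 13:15 to 18:00 at 15-minute steps.
gap = expected_timestamps(
    datetime(2024, 8, 6, 13, 15), datetime(2024, 8, 6, 18, 0)
)
print(len(gap))  # 20
```

Passing the timestamps actually stored for a site as `received` would list the individual missing slots rather than only counting them.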

For https://monitormywatershed.org/sites/nh_LCC45/
the gap is slightly different: 21 records between
PST
2024-08-06 11:45
..
2024-08-06 18:00

A visual check of PO03, GV01, GV08, Mi03, Mi06, Mw01, Mw12, Na13 suggests they also have missing data.

Since I have a reliable delivery mechanism on my nodes, it seems likely that some MMW operation caused the gaps.

Possibly the same as
#685

@ptomasula
Member

ptomasula commented Aug 8, 2024

@neilh10, I can confirm that the site was down on August 6th from 13:30 UTC until 18:30 UTC, as indicated on our uptime dashboard. I can plot the log data and verify server responses during that window, but a quick glance indicates that most (I will confirm if it was all) API traffic during those hours received a 500 response.
image

This would seemingly be different from #685, in which the uptime monitoring data does not show the same level of outage.
image

@ptomasula
Member

ptomasula commented Aug 8, 2024

I should also note that the outage was due to a connection issue between the application and database. We have updated our uptime monitoring to better check for this specific type of issue and should get quicker notifications of outage issues like this.
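
For readers wondering what a check for this type of failure might look like, here is a minimal sketch. It is not the actual MMW monitoring setup: the function `health_status` is hypothetical, and an in-memory SQLite database stands in for the real application database. The idea is that the health probe runs a trivial query, so a broken application-to-database connection shows up as an unhealthy status even when the web process itself is still answering.

```python
import sqlite3


def health_status(connect):
    """Return (status_code, message) after probing the database.

    `connect` is any zero-argument callable returning a DB connection
    (hypothetical interface, assumed for this sketch).
    """
    try:
        conn = connect()
        try:
            # A trivial round trip proves the connection really works.
            conn.execute("SELECT 1").fetchone()
        finally:
            conn.close()
    except sqlite3.Error as exc:
        return 503, f"database unavailable: {exc}"
    return 200, "ok"


def broken_connect():
    """Simulate the application failing to reach its database."""
    raise sqlite3.OperationalError("connection refused")


print(health_status(lambda: sqlite3.connect(":memory:")))
print(health_status(broken_connect))
```

An uptime monitor polling an endpoint built on such a probe would see a non-200 status as soon as the database link fails.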

@neilh10
Author

neilh10 commented Aug 8, 2024

@ptomasula thanks for the quick insight, and great to see the graphs show the failure :)
Good to see that you can differentiate between failure types so quickly.
For 500 Internal Server Error, Wikipedia says:
"A generic error message, given when an unexpected condition was encountered and no more specific message is suitable."

I guess I feel like I'm walking on eggshells with the server, as there is an evolving discussion about exactly what reliability means and what the responses should be. I appreciate this is an old issue and that you are picking it up and making progress on it. #485

As it turns out, 500 was also the response to the server condition described in #628.

In an attempt to be really cautious, and to cater for unexpected overload problems on the server, I can process 500 responses differently: if the server gets upset, the node gives up on reliably delivering the reading, in the interest of not overloading the server. This was my own call.

That is, on my test systems I only look for 201 (and now any 2xx) as the handshake, but in the production software I've left a turd in there: it gives up on the server for a generic 500. I'll need to offload the logs from a device in the field to see how it responded.
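
The two behaviours described above can be sketched roughly as follows. This is an illustration in Python, not the firmware on the nodes: the names `Action` and `classify_response` and the `give_up_on_500` flag are hypothetical, and it assumes that any response which is neither 2xx nor a dropped 500 leaves the reading queued for a later retry.

```python
from enum import Enum


class Action(Enum):
    DELIVERED = "delivered"      # server acknowledged, reading can be dequeued
    RETRY_LATER = "retry_later"  # keep the reading queued and try again
    GIVE_UP = "give_up"          # drop the reading to avoid loading the server


def classify_response(status, give_up_on_500):
    """Decide what to do with a queued reading after an HTTP POST.

    give_up_on_500=False models the test systems (only 2xx counts as a
    handshake); give_up_on_500=True models the production behaviour of
    abandoning delivery on a generic 500.
    """
    if 200 <= status <= 299:
        return Action.DELIVERED
    if status == 500 and give_up_on_500:
        return Action.GIVE_UP
    # Assumption: everything else stays queued for a later attempt.
    return Action.RETRY_LATER


print(classify_response(201, give_up_on_500=True))   # Action.DELIVERED
print(classify_response(500, give_up_on_500=True))   # Action.GIVE_UP
print(classify_response(500, give_up_on_500=False))  # Action.RETRY_LATER
```

Under this sketch, a node with `give_up_on_500=True` would drop readings during an outage in which the server answers 500, while a node with `give_up_on_500=False` would keep them queued until a 2xx arrives.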

IMHO, of course, for me personally it's an absolute disaster having devices in the field that are not remotely upgradeable and require a field visit for each upgrade.

Since I think the reason for the Missing Measurements has been diagnosed, and you believe you have a better monitor for the processes, I'll close this issue. Really appreciate the ability to diagnose it fast.

@neilh10 neilh10 closed this as completed Aug 8, 2024