Missing Measurements #731
Comments
@neilh10, I can confirm that the site was down on August 6th from 13:30 UTC until 18:30 UTC, as indicated on our uptime dashboard. I can plot the log data and verify server responses during that window, but a quick glance indicates that most (I will confirm whether it was all) API traffic during those hours received a 500 response. This appears to be different from #685, for which the uptime monitoring data does not show the same level of outage.
I should also note that the outage was due to a connection issue between the application and database. We have updated our uptime monitoring to better check for this specific type of issue and should get quicker notifications of outages like this.
@ptomasula thanks for the quick insight, and great to see the graphs show the failure :) I guess I feel like I'm walking on eggshells with the server, as there is an evolving discussion about exactly what reliability and responses mean. I appreciate this is an old issue (#485) and that you are picking it up and making progress on it.

As it turns out, 500 was also the response to the server condition described in #628. In an attempt to be really cautious, and to cater for unexpected overload problems on the server if it gets upset, I can process 500 responses differently and give up on reliably getting the data to the server, in the interest of not overloading the server. This was my own call. That is, on my testing systems I only look for 201 (and now 2xx) as a handshake, but in the production software I've left a turd in there, and it gives up on the server for a generic 500. I'll need to offload the logs from a device in the field to see how it responded.

IMHO, and of course for me personally, it's an absolute disaster having devices in the field that are not remotely upgradeable and require a field visit for each upgrade.

Since I think the reason for the missing measurements has been diagnosed, and you believe you have a better monitor for the processes, I'll close this issue. Really appreciate the ability to diagnose it fast.
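For readers following along, the two response-handling policies described in this comment could be sketched roughly as below. This is a hypothetical illustration only, not the actual node firmware: the names `Action`, `classify_response`, and `give_up_on_500` are invented for this example, and keeping every other non-2xx status queued for retry is an assumption.

```python
from enum import Enum


class Action(Enum):
    DELIVERED = "delivered"    # server acknowledged; record can leave the queue
    RETRY_LATER = "retry"      # keep the record queued and try again later
    GIVE_UP = "give_up"        # drop the record without an acknowledgement


def classify_response(status: int, give_up_on_500: bool) -> Action:
    """Decide what to do with a queued record, given the HTTP status code.

    give_up_on_500=False models the testing policy (only 2xx is a handshake).
    give_up_on_500=True models the production policy (stop on a generic 500).
    """
    if 200 <= status < 300:
        return Action.DELIVERED
    if status == 500 and give_up_on_500:
        return Action.GIVE_UP
    # Assumption: any other status leaves the record queued for a later retry.
    return Action.RETRY_LATER
```

Under this sketch, a five-hour run of 500 responses loses data only under the give-up policy; with `give_up_on_500=False` the records would stay queued until the server answers with a 2xx again.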
I've detected missing measurements across a number of nodes.
For https://monitormywatershed.org/sites/TUCA_Sa01/
20 records at 15-minute intervals are missing between the following times:
GMT
2024-08-06 13:15
2024-08-06 18:00
PST
2024-08-06 05:15
2024-08-06 10:00
For https://monitormywatershed.org/sites/nh_LCC45/
the gap is slightly different: 21 records between
PST
2024-08-06 11:45
..
2024-08-06 18:00
A visual inspection of PO03, GV01, GV08, Mi03, Mi06, Mw01, Mw12, and Na13 suggests they also have missing data.
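A check like this could also be done programmatically rather than by eye. The following is a minimal sketch, assuming the timestamps of the received records for a site are already available as a list; `find_missing` is a hypothetical helper, and the sample data is made up to mirror the 13:15 to 18:00 window reported for TUCA_Sa01.

```python
from datetime import datetime, timedelta


def find_missing(timestamps, start, end, step=timedelta(minutes=15)):
    """Return the expected timestamps in [start, end] that are not in `timestamps`."""
    present = set(timestamps)
    missing = []
    t = start
    while t <= end:
        if t not in present:
            missing.append(t)
        t += step
    return missing


# Hypothetical data: the last record before the gap and the first one after it.
received = [datetime(2024, 8, 6, 13, 0), datetime(2024, 8, 6, 18, 15)]
gaps = find_missing(received, datetime(2024, 8, 6, 13, 0), datetime(2024, 8, 6, 18, 15))
print(len(gaps), gaps[0], gaps[-1])
# prints: 20 2024-08-06 13:15:00 2024-08-06 18:00:00
```

Running the same function over each site's record list would give a per-site count of missing 15-minute records that can be compared directly against the outage window.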
Since I have a reliable delivery mechanism on my nodes, it seems likely that some MMW operation caused the gaps.
Possibly the same as #685.