feat(ai): add AI orchestrator metrics #3097

Merged: 13 commits merged from ai-orchestrator-metrics into ai-video on Jul 18, 2024

Conversation


@rickstaa rickstaa commented Jul 14, 2024

What does this pull request do? Explain your changes. (required)

This pull request introduces new Orchestrator AI metrics to the ai-video branch:

  • ai_models_requested: Tracks the number of AI job requests per capability and model.
  • ai_request_latency_score: Measures latency scores per model job request.
  • ai_request_price: Records the price paid per unit for each model request.
  • ai_request_errors: Logs errors encountered while the Orchestrator processes the requested job.

To reduce code duplication, this pull request reuses the same metric definitions as the Gateway metrics pull request (see #3087); a rough sketch of such a definition is shown below.
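
For illustration only, here is a minimal Go sketch of how one of these measures could be declared and registered with OpenCensus, in the style of census.go. The measure and tag-key names mirror this pull request; the package name, unit string, bucket bounds, and the registerAIMetrics helper are assumptions rather than the actual code.

    package census // hypothetical package name for this sketch

    import (
        "go.opencensus.io/stats"
        "go.opencensus.io/stats/view"
        "go.opencensus.io/tag"
    )

    var (
        // Tag keys used to break the metric down per pipeline (capability) and model.
        kPipeline  = tag.MustNewKey("pipeline")
        kModelName = tag.MustNewKey("model_name")

        // Latency score measure; the unit string is a placeholder.
        mAIRequestLatencyScore = stats.Float64("ai_request_latency_score",
            "AI request latency score", "tot")
    )

    // registerAIMetrics shows how the corresponding view could be registered so
    // the measure is exported (e.g. to Prometheus).
    func registerAIMetrics() error {
        return view.Register(&view.View{
            Name:        "ai_request_latency_score",
            Measure:     mAIRequestLatencyScore,
            Description: "AI request latency score",
            TagKeys:     []tag.Key{kPipeline, kModelName},
            // Bucket bounds here are illustrative; the PR tunes its own buckets.
            Aggregation: view.Distribution(0, 0.5, 1, 2, 5, 10),
        })
    }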

Specific updates (required)

  • Updates census.go to include the new Orchestrator metrics.
  • Updates the ai_http.go file to record these metrics (a recording sketch follows below).
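
Continuing the sketch above (same hypothetical package, reusing its kPipeline, kModelName, and mAIRequestLatencyScore declarations), recording the measures from request-handling code such as ai_http.go could look roughly like this. The helper names are hypothetical; the real handler goes through the monitor/census helpers rather than calling OpenCensus directly.

    package census // continues the sketch above

    import (
        "context"

        "go.opencensus.io/stats"
        "go.opencensus.io/tag"
    )

    // Request counter per pipeline (capability) and model, alongside the
    // latency score measure declared in the previous sketch.
    var mAIModelsRequested = stats.Int64("ai_models_requested",
        "AI jobs requested per pipeline and model", "tot")

    // aiRequestAccepted bumps the request counter when an AI job arrives.
    func aiRequestAccepted(ctx context.Context, pipeline, model string) {
        _ = stats.RecordWithTags(ctx,
            []tag.Mutator{tag.Upsert(kPipeline, pipeline), tag.Upsert(kModelName, model)},
            mAIModelsRequested.M(1),
        )
    }

    // aiRequestFinished records the latency score once the job completes.
    func aiRequestFinished(ctx context.Context, pipeline, model string, latencyScore float64) {
        _ = stats.RecordWithTags(ctx,
            []tag.Mutator{tag.Upsert(kPipeline, pipeline), tag.Upsert(kModelName, model)},
            mAIRequestLatencyScore.M(latencyScore),
        )
    }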

How did you test each of these updates (required)

I set up both an on-chain and off-chain gateway to validate the metrics. I verified their visibility at http://localhost:7935/metrics and ensured they were correctly visualized in Grafana.

Does this pull request close any open issues?

This implements the functionality outlined in https://livepeer-ai.productlane.com/roadmap?id=d56cae33-2dbd-4187-8d3a-d1c5c35f890a

How to test

  1. Check out this pull request.
  2. Spin up an on-chain gateway with attached orchestrators.
  3. Clone the repository https://github.com/rickstaa/livepeer-monitor-test.
  4. Execute the Dockerfile in that repository to launch Prometheus and Grafana servers.
  5. Navigate to http://localhost:7935/metrics to view the new AI orchestrator metrics (or use the snippet after this list to pull them from the command line).
  6. Visit http://localhost:3000 to inspect these metrics in Grafana.
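
As a quick sanity check (a sketch, assuming the default metrics port 7935 from the steps above), the small Go program below fetches the endpoint and prints only the ai_* series:

    package main

    import (
        "bufio"
        "fmt"
        "log"
        "net/http"
        "strings"
    )

    func main() {
        // Fetch the node's Prometheus-style metrics page.
        resp, err := http.Get("http://localhost:7935/metrics")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        // Exposed names may carry a node-level prefix, so match loosely on "ai_".
        sc := bufio.NewScanner(resp.Body)
        for sc.Scan() {
            line := sc.Text()
            if strings.Contains(line, "ai_") && !strings.HasPrefix(line, "#") {
                fmt.Println(line)
            }
        }
        if err := sc.Err(); err != nil {
            log.Fatal(err)
        }
    }

If the orchestrator has served jobs, you should see samples for ai_models_requested, ai_request_latency_score, ai_request_price, and ai_request_errors with per-pipeline and per-model labels.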

eliteprox and others added 9 commits July 8, 2024 15:37
This commit adds the initial AI gateway metrics so that they can be
reviewed by others. The code still needs to be cleaned up and the buckets
adjusted.
This commit improves the AI metrics so that they are easier to work
with.
This commit ensures that an error is logged when the Gateway could not
find orchestrators for a given model and capability.
This commit ensures that the `ticket_value_sent` and `tickets_sent`
metrics are also created for an AI Gateway.
This commit ensures that the AI gateway metrics contain the orch address
label.
@rickstaa rickstaa changed the title ai orchestrator metrics feat(ai): add AI orchestrator metrics Jul 14, 2024
@rickstaa rickstaa changed the base branch from master to ai-video July 14, 2024 09:34
This commit introduces a suite of AI orchestrator metrics to the census
module, mirroring those received by the Gateway. The newly added metrics
include `ai_models_requested`, `ai_request_latency_score`,
`ai_request_price`, and `ai_request_errors`, facilitating comprehensive
tracking and analysis of AI request handling performance on the orchestrator side.
Name: "ai_request_latency_score",
Measure: census.mAIRequestLatencyScore,
Description: "AI request latency score",
TagKeys: append([]tag.Key{census.kPipeline, census.kModelName}, baseTagsWithNodeInfo...),

rickstaa (author) commented:

@eliteprox, @ad-astra-video do you think listing this per gateway label makes sense?

This commit ensures that the right tags are attached to the Orchestrator
AI metrics.
This commit ensures that no divide-by-zero errors can occur in the
latency score calculations.
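
The exact latency-score formula lives in the orchestrator code; purely as an illustration of the guard this commit describes (the function name and the notion of "work units" are assumptions), the calculation amounts to checking the denominator before dividing:

    package main

    import (
        "fmt"
        "time"
    )

    // latencyScore divides the roundtrip time by the amount of work produced
    // (for example output pixels or duration). A zero or negative workload
    // yields a score of 0 instead of a divide-by-zero / Inf result.
    func latencyScore(roundtrip time.Duration, workUnits float64) float64 {
        if workUnits <= 0 {
            return 0
        }
        return roundtrip.Seconds() / workUnits
    }

    func main() {
        fmt.Println(latencyScore(2*time.Second, 0))   // 0 (no divide-by-zero)
        fmt.Println(latencyScore(2*time.Second, 4e6)) // 5e-07
    }
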
@rickstaa rickstaa merged commit 5aadffb into ai-video Jul 18, 2024
6 of 8 checks passed
@rickstaa rickstaa deleted the ai-orchestrator-metrics branch July 18, 2024 13:40
eliteprox added a commit to eliteprox/go-livepeer that referenced this pull request Jul 26, 2024
* Add gateway metric for roundtrip ai times by model and pipeline

* Rename metrics and add unique manifest

* Fix name mismatch

* modelsRequested not working correctly

* feat: add initial POC AI gateway metrics

This commit adds the initial AI gateway metrics so that they can be
reviewed by others. The code still needs to be cleaned up and the buckets
adjusted.

* feat: improve AI metrics

This commit improves the AI metrics so that they are easier to work
with.

* feat(ai): log no capacity error to metrics

This commit ensures that an error is logged when the Gateway could not
find orchestrators for a given model and capability.

* feat(ai): add TicketValueSent and TicketsSent metrics

This commit ensures that the `ticket_value_sent` and `tickets_sent`
metrics are also created for an AI Gateway.

* fix(ai): ensure that AI metrics have orch address label

This commit ensures that the AI gateway metrics contain the orch address
label.

* feat(ai): add orchestrator AI census metrics

This commit introduces a suite of AI orchestrator metrics to the census
module, mirroring those received by the Gateway. The newly added metrics
include `ai_models_requested`, `ai_request_latency_score`,
`ai_request_price`, and `ai_request_errors`, facilitating comprehensive
tracking and analysis of AI request handling performance on the orchestrator side.

* refactor: improve orchestrator metrics tags

This commit ensures that the right tags are attached to the Orchestrator
AI metrics.

* refactor(ai): improve latency score calculations

This commit ensures that no divide-by-zero errors can occur in the
latency score calculations.

---------

Co-authored-by: Elite Encoder <[email protected]>
eliteprox added a commit to eliteprox/go-livepeer that referenced this pull request Jul 26, 2024