Try machine learning distribution on ecs-watchbot #183

Open
jakepruitt opened this issue Feb 1, 2018 · 2 comments


@jakepruitt

Context

Talking with the @mapbox/ml-club today, it sounds like running training on multiple hosts is still unexplored, and it could ease some of the difficulties involved in running single hosts for days on end.

Thoughts

I'm not sure if this belongs in ecs-watchbot or in ecs-api, but it looks like https://github.com/uber/horovod is a potential way to try out distributing a machine learning system across multiple hosts.

The connection takes place over TCP, so maybe ENIs and named DNS records/service discovery would help here.

cc/ @mapbox/ml-club @mapbox/platform

@rclark
Contributor

rclark commented Feb 2, 2018

This would be really cool to explore -- but might be worth waiting until ECS rolls out their upcoming service discovery system. From the sound of it, that system will make it far easier to manage the IP addresses, DNS entries, and healthchecking that's usually needed for this kind of cross-node communication.

Our last communication with the team put the launch of this feature in late Feb / early March.

@jakepruitt
Author

Since service discovery is now available, I think we can start experimenting here (refs https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-discovery.html). Looks like there's even CloudFormation support: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-servicediscovery-service.html.

From the API standpoint, I think it'd make sense to have service discovery be an option during template creation. Then within the watchbot listen code, we could poll the Route53 record for the service and internally keep the list of IPs or IP:port combos of all of the containers in the service. Then, we could inject this list into the worker as a comma-separated environment variable.
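
To make the polling idea concrete, here's a rough sketch of what one lookup could look like. It's in Python only for brevity (watchbot itself is Node), and everything named here is hypothetical: the `WATCHBOT_PEERS` variable, the `workers.watchbot.local` hostname, and both helper functions. It also assumes the service registers plain A records, so any port has to be supplied by the caller rather than read from DNS.

```python
import socket
from typing import Callable, Dict, List, Optional


def resolve_peers(
    hostname: str,
    port: Optional[int] = None,
    resolver: Callable = socket.getaddrinfo,
) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses behind `hostname`.

    Assumption: the service discovery name resolves to one A record per
    container. If `port` is given, each entry is formatted as "ip:port".
    """
    infos = resolver(hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); for AF_INET
    # the sockaddr is an (address, port) tuple.
    ips = sorted({info[4][0] for info in infos})
    if port is None:
        return ips
    return [f"{ip}:{port}" for ip in ips]


def worker_env(
    base_env: Dict[str, str],
    peers: List[str],
    var_name: str = "WATCHBOT_PEERS",  # hypothetical variable name
) -> Dict[str, str]:
    """Copy `base_env` and add the peer list as a comma-separated value."""
    env = dict(base_env)
    env[var_name] = ",".join(peers)
    return env


if __name__ == "__main__":
    import os

    # Hypothetical service discovery name; replace with the real one.
    peers = resolve_peers("workers.watchbot.local", port=9000)
    print(worker_env(dict(os.environ), peers)["WATCHBOT_PEERS"])
```

The listen loop would call something like `resolve_peers` on an interval and pass the result of `worker_env` to each worker it spawns. Sorting the addresses keeps the value stable between polls when the membership hasn't changed.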
