A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

overview
theory
methods
representation learning
program synthesis
meta-learning
automated machine learning
weak supervision
interesting papers

deep learning
reinforcement learning
bayesian inference and learning
probabilistic programming
causal inference

overview

"The Talking Machines" podcast audio

overview by Igor Kuralenok (first, second) video in Russian

applications

artificial intelligence
knowledge representation and reasoning
natural language processing
recommender systems
information retrieval

"Machine Learning is The New Algorithms" by Hal Daume
"When is Machine Learning Worth It?" by Ferenc Huszar

Any source code that computes an expression y = f(x), where f has some parameters and is used to make a decision, prediction or estimate, can potentially be replaced by a machine learning algorithm.
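
As a minimal sketch of this idea, assume a hypothetical spam rule that flags a message when its link count exceeds a hard-coded threshold. The same rule can instead have its threshold fitted to labeled examples. The data, function names and the default threshold below are made up for illustration.

```python
# Hypothetical example: a hand-tuned decision rule versus the same rule
# with its single parameter learned from labeled data.

def handwritten_rule(num_links, threshold=5):
    """y = f(x) with a parameter picked by the programmer."""
    return num_links > threshold

def fit_threshold(examples):
    """Pick the threshold that makes the fewest mistakes on the examples.

    examples: list of (num_links, is_spam) pairs.
    """
    candidates = sorted({x for x, _ in examples})
    # Also consider a threshold below every observed value.
    candidates = [candidates[0] - 1] + candidates

    def errors(t):
        return sum((x > t) != y for x, y in examples)

    return min(candidates, key=errors)

# Toy labeled data (made up for illustration).
data = [(0, False), (1, False), (2, False), (3, True), (4, True), (7, True)]

learned = fit_threshold(data)

def learned_rule(num_links):
    return num_links > learned

mistakes_hand = sum(handwritten_rule(x) != y for x, y in data)
mistakes_learned = sum(learned_rule(x) != y for x, y in data)
print(learned, mistakes_hand, mistakes_learned)
```

The structure of f stays the same; only the source of its parameter changes from the programmer's guess to the data.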

knowledge bases

http://metacademy.org
http://en.wikipedia.org/wiki/Machine_learning (guide)
http://machinelearning.ru in Russian

guides

"Machine Learning Basics" by Ian Goodfellow, Yoshua Bengio, Aaron Courville
"A Few Useful Things to Know about Machine Learning" by Pedro Domingos
"Expressivity, Trainability, and Generalization in Machine Learning" by Eric Jang
"Clever Methods of Overfitting" by John Langford
"Common Pitfalls in Machine Learning" by Daniel Nee
"Classification vs. Prediction" by Frank Harrell
"Causality in Machine Learning" by Muralidharan et al.
"Are ML and Statistics Complementary?" by Max Welling
"Introduction to Information Theory and Why You Should Care" by Gil Katz
"Ideas on Interpreting Machine Learning" by Hall et al.
"Mathematics for Machine Learning" by Marc Peter Deisenroth, A Aldo Faisal, Cheng Soon Ong
"Rules of Machine Learning: Best Practices for ML Engineering" by Martin Zinkevich

courses

course by Andrew Ng video
course by Nando de Freitas video
course by Nando de Freitas video
course by Pedro Domingos video
course by Alex Smola video
course by Trevor Hastie and Rob Tibshirani video
course by Jeff Miller video

course by Hal Daume

course by Yandex video in Russian
course by OpenDataScience video in Russian
course by Konstantin Vorontsov video in Russian
course by Igor Kuralenok video in Russian
course by Igor Kuralenok video in Russian
course by Igor Kuralenok video in Russian
course by Igor Kuralenok video in Russian

conferences

ICML 2018
https://facebook.com/icml.imls/videos video

https://david-abel.github.io/blog/posts/misc/icml_2018.pdf notes
ICLR 2018
https://facebook.com/iclr.cc/videos video

http://search.iclr2018.smerity.com
http://iclr2018.mmanukyan.io
http://chillee.github.io/OpenReviewExplorer
NIPS 2017
https://nips.cc/Conferences/2017/Videos video
https://facebook.com/pg/nipsfoundation/videos/ video

https://nips17.ml

https://github.com/hindupuravinash/nips2017
https://github.com/kihosuh/nips_2017
https://github.com/sbarratt/nips2017

https://cs.brown.edu/~dabel/blog/posts/misc/nips_2017.pdf notes
ICML 2017
https://icml.cc/Conferences/2017/Videos video

http://artem.sobolev.name/posts/2017-08-14-icml-2017.html
https://olgalitech.wordpress.com/tag/icml2017/
ICLR 2017
https://facebook.com/iclr.cc/videos video

https://medium.com/@karpathy/iclr-2017-vs-arxiv-sanity-d1488ac5c131
NIPS 2016
https://channel9.msdn.com/Events/Neural-Information-Processing-Systems-Conference/Neural-Information-Processing-Systems-Conference-NIPS-2016 video
https://nips.cc/Conferences/2016/SpotlightVideos video

https://youtube.com/playlist?list=PLPwzH56Rdmq4hcuEMtvBGxUrcQ4cAkoSc video
https://youtube.com/playlist?list=PLJscN9YDD1buxCitmej1pjJkR5PMhenTF video
https://youtube.com/channel/UC_LBLWLfKk5rMKDOHoO7vPQ video
https://youtube.com/playlist?list=PLzTDea_cM27LVPSTdK9RypSyqBHZWPywt video

https://github.com/hindupuravinash/nips2016
http://artem.sobolev.name/posts/2016-12-31-nips-2016-summaries.html
ICML 2016
http://techtalks.tv/icml/2016/ video
ICLR 2016
http://videolectures.net/iclr2016_san_juan/ video

http://www.computervisionblog.com/2016/06/deep-learning-trends-iclr-2016.html
NIPS 2015
https://youtube.com/playlist?list=PLD7HFcN7LXRdvgfR6qNbuvzxIwG0ecE9Q video
https://youtube.com/user/NeuralInformationPro/search?query=NIPS+2015 video

http://reddit.com/r/MachineLearning/comments/3x2ueg/nips_2015_overviews_collection/
http://cinrizasti.blogspot.ru/2015/12/a-blog-post-about-blog-posts-about-nips.html
ICML 2015
https://youtube.com/playlist?list=PLdH9u0f1XKW8cUM3vIVjnpBfk_FKzviCu video
http://dpkingma.com/?page_id=483 video
ICLR 2015
http://youtube.com/channel/UCqxFGrNL5nX10lS62bswp9w video
NIPS 2014
https://youtube.com/user/NeuralInformationPro/search?query=NIPS+2014 video

theory

machine learning has become alchemy by Ali Rahimi video (post)
statistics in machine learning by Michael I. Jordan video
theory in machine learning by Michael I. Jordan video

"Learning Theory: Purely Theoretical?" by Jonathan Huggins

problems:

What does it mean to learn?
When is a concept/function learnable?
How much data do we need to learn something?
How can we make sure what we learn will generalize to future data?

theory helps to:

design algorithms
understand behaviour of algorithms
quantify knowledge/uncertainty
identify new and refine old challenges

frameworks:

statistical learning theory
computational learning theory (PAC learning or PAC-Bayes)

theory - statistical learning theory

ingredients:

distributions
i.i.d. samples
learning algorithms
predictors
loss functions

A priori analysis: How well will a learning algorithm perform on new data?

(Vapnik's learning theory) Can we compete with the best hypothesis from a given set of hypotheses?
(statistics) Can we match the best possible loss, assuming the data-generating distribution belongs to a known family?

A posteriori analysis: How well is a learning algorithm doing on some data? Quantify the uncertainty that remains.

Fundamental theorem of statistical learning theory:
In binary classification, to match the loss of the best hypothesis in class H up to accuracy ε, one needs O(VC(H)/ε^2) observations.
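
To make the scaling concrete, the sketch below evaluates this bound for a few values of ε. The constant hidden by the O(·) notation is not given here, so `c = 1.0` is purely an assumption, as is the added confidence term ln(1/δ) that appears in the usual agnostic PAC form of the result. Only the relative growth is meaningful, not the absolute numbers.

```python
import math

def sample_complexity(vc_dim, epsilon, delta=0.05, c=1.0):
    """Order-of-magnitude estimate c * (VC(H) + ln(1/delta)) / epsilon^2.

    c is an unspecified constant (assumed to be 1 here), so the result
    shows scaling only, not an exact number of observations.
    """
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil(c * (vc_dim + math.log(1.0 / delta)) / epsilon ** 2)

# Linear classifiers with a bias term in d dimensions have VC dimension d + 1.
d = 10
for eps in (0.1, 0.05, 0.025):
    print(eps, sample_complexity(d + 1, eps))
```

Halving ε roughly quadruples the estimate, which is the 1/ε^2 dependence stated in the theorem.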

Theorem (computational complexity of learning linear classifiers):
Unless NP=RP, linear classifiers (hyperplanes) cannot be learned in polynomial time in the agnostic setting, where no hyperplane is assumed to classify the data perfectly.

"Machine Learning Theory" by Mostafa Samir
"Crash Course on Learning Theory" by Sebastien Bubeck
"Statistical Learning Theory" by Percy Liang

course by Tomaso Poggio, Lorenzo Rosasco, Georgios Evangelopoulos video
course by Yaser Abu-Mostafa video
course by Sebastien Bubeck video
course by Tomaso Poggio and Lorenzo Rosasco video

theory - computational learning theory

"Computational Learning Theory, AI and Beyond" chapter of "Mathematics and Computation" book by Avi Wigderson

"Probably Approximately Correct - A Formal Theory of Learning" by Jeremy Kun
"A Problem That is Not (Properly) PAC-learnable" by Jeremy Kun
"Occam’s Razor and PAC-learning" by Jeremy Kun

theory - applications

bayesian inference and learning

deep learning

reinforcement learning
"Theory of Reinforcement Learning" by Csaba Szepesvari video

methods

challenges

How to decide which representation is best for target knowledge?
How to tell genuine regularities from chance occurrences?
How to exploit pre-existing domain knowledge?
How to learn with limited computational resources?
How to learn with limited data?
How to make learned results understandable?
How to quantify uncertainty?
How to take into account the costs of decisions?
How to handle non-independent and non-stationary data?

"The Three Cultures of Machine Learning" by Jason Eisner
"Algorithmic Dimensions" by Justin Domke
"All Models of Learning Have Flaws" by John Langford

http://en.wikipedia.org/wiki/List_of_machine_learning_algorithms

http://eferm.com/wp-content/uploads/2011/05/cheat3.pdf
http://github.com/soulmachine/machine-learning-cheat-sheet/blob/master/machine-learning-cheat-sheet.pdf

http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

representation learning

"Representation is a formal system which makes explicit certain entities and types of information, and which can be operated on by an algorithm in order to achieve some information processing goal. Representations differ in terms of what information they make explicit and in terms of what algorithms they support. As example, Arabic and Roman numerals - the fact that operations can be applied to particular columns of Arabic numerals in meaningful ways allows for simple and efficient algorithms for addition and multiplication."

"In representation learning, our goal isn’t to predict observables, but to learn something about the underlying structure. In cognitive science and AI, a representation is a formal system which maps to some domain of interest in systematic ways. A good representation allows us to answer queries about the domain by manipulating that system. In machine learning, representations often take the form of vectors, either real- or binary-valued, and we can manipulate these representations with operations like Euclidean distance and matrix multiplication."

"In representation learning, the goal isn’t to make predictions about observables, but to learn a representation which would later help us to answer various queries. Sometimes the representations are meant for people, such as when we visualize data as a two-dimensional embedding. Sometimes they’re meant for machines, such as when the binary vector representations learned by deep Boltzmann machines are fed into a supervised classifier. In either case, what’s important is that mathematical operations map to the underlying relationships in the data in systematic ways."

"What is representation learning?" by Roger Grosse
"Predictive learning vs. representation learning" by Roger Grosse

deep learning
probabilistic programming
knowledge representation

program synthesis

programmatic representations:

well-specified
Unlike sentences in natural language, programs are unambiguous, although two distinct programs can be precisely equivalent.
compact
Programs allow us to compress data on the basis of their regularities.
combinatorial
Programs can access the results of running other programs, as well as delete, duplicate, and rearrange these results.
hierarchical
Programs have an intrinsic hierarchical organization and may be decomposed into subprograms.

challenges:

open-endedness
In contrast to other knowledge representations in machine learning, programs may vary in size and shape, and there is no obvious problem-independent upper bound on program size. This makes it difficult to represent programs as points in a fixed-dimensional space, or learn programs with algorithms that assume such a space.
over-representation
Often syntactically distinct programs will be semantically identical (i.e. represent the same underlying behavior or functional mapping). Lacking prior knowledge, many algorithms will inefficiently sample semantically identical programs repeatedly.
chaotic execution
Programs that are very similar, syntactically, may be very different, semantically. This presents difficulty for many heuristic search algorithms, which require syntactic and semantic distance to be correlated.
high resource-variance
Programs in the same space may vary greatly in the space and time they require to execute.
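
The properties and challenges above can be seen even in a toy setting. The sketch below is a brute-force enumerative synthesizer over a tiny, made-up DSL of unary integer operations. It searches for programs consistent with input-output examples and counts how many syntactically distinct programs share the same behaviour (over-representation). The DSL, the probe inputs and all names are assumptions for illustration, not a method taken from the references in this section.

```python
from itertools import product

# A tiny hypothetical DSL: a program is a sequence of unary integer operations.
OPS = {
    "inc": lambda x: x + 1,
    "dec": lambda x: x - 1,
    "double": lambda x: x * 2,
    "neg": lambda x: -x,
}

def run(program, x):
    """Execute a program (tuple of operation names) on input x."""
    for op in program:
        x = OPS[op](x)
    return x

def synthesize(examples, max_len=3):
    """Return all programs up to max_len consistent with the examples,
    enumerated shortest first."""
    found = []
    for length in range(max_len + 1):
        for program in product(OPS, repeat=length):
            if all(run(program, x) == y for x, y in examples):
                found.append(program)
    return found

def count_behaviours(max_len, probes=range(-3, 4)):
    """Count syntactically distinct programs versus distinct behaviours
    on a set of probe inputs (a proxy for semantic equivalence)."""
    programs = [p for n in range(max_len + 1) for p in product(OPS, repeat=n)]
    behaviours = {tuple(run(p, x) for x in probes) for p in programs}
    return len(programs), len(behaviours)

examples = [(1, 4), (2, 6), (5, 12)]  # target behaviour: f(x) = 2x + 2
solutions = synthesize(examples)
print(solutions)
print(count_behaviours(3))
```

The search space grows as 4^n with program length n, and programs such as `("inc", "dec")` and the empty program behave identically, so a naive search repeatedly revisits the same semantics.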

"For me there are two types of generalisation, which I will refer to as Symbolic and Connectionist generalisation. If we teach a machine to sort sequences of numbers of up to length 10 or 100, we should expect them to sort sequences of length 1000 say. Obviously symbolic approaches have no problem with this form of generalisation, but neural nets do poorly. On the other hand, neural nets are very good at generalising from data (such as images), but symbolic approaches do poorly here. One of the holy grails is to build machines that are capable of both symbolic and connectionist generalisation."

(Nando de Freitas)

inductive programming

"Program Synthesis Explained" by James Bornholt

"Inductive Programming Meets the Real World" by Gulwani et al. paper
"Program Synthesis" by Gulwani, Polozov, Singh paper
"Approaches and Applications of Inductive Programming" by Schmid, Muggleton, Singh paper

"Program Synthesis in 2017-18" by Oleksandr Polozov
"Recent Advances in Neural Program Synthesis" by Neel Kant paper

interesting recent papers
selected papers

"Deep Learning Trends: Program Induction" by Scott Reed video
"Learning to Code: Machine Learning for Program Induction" by Alex Gaunt video

"Neural Abstract Machines & Program Induction" workshop at NIPS 2016 (videos)
panel at NAMPI 2016 with Percy Liang, Juergen Schmidhuber, Joshua Tenenbaum, Martin Vechev, Daniel Tarlow, Dawn Song video

"The Future of Deep Learning" by Francois Chollet (talk video)

meta-learning

overview by Pieter Abbeel video
overview by Oriol Vinyals video
overview by Nando de Freitas video

Metalearning symposium video

Metalearning symposium panel video
RNN symposium panel video

interesting recent papers

overview by Tom Schaul and Juergen Schmidhuber

"On Learning How to Learn Learning Strategies" by Juergen Schmidhuber video
"Learning how to Learn Learning Algorithms: Recursive Self-Improvement" by Juergen Schmidhuber video

Goedel Machine

"The Future of Deep Learning" by Francois Chollet (talk video)

"Current commercial AI algorithms are still missing something fundamental. They are no self-referential general purpose learning algorithms. They improve some system’s performance in a given limited domain, but they are unable to inspect and improve their own learning algorithm. They do not learn the way they learn, and the way they learn the way they learn, and so on (limited only by the fundamental limits of computability)."

(Juergen Schmidhuber)

automated machine learning

AutoML aims to automate many different stages of the machine learning process:

model selection, hyper-parameter optimization, and model search
meta learning and transfer learning
representation learning and automatic feature extraction / construction
automatic generation of workflows / workflow reuse
automatic problem "ingestion" (from raw data and miscellaneous formats)
automatic feature transformation to match algorithm requirements
automatic detection and handling of skewed data and/or missing values
automatic acquisition of new data (active learning, experimental design)
automatic report writing (providing insight on automatic data analysis)
automatic selection of evaluation metrics / validation procedures
automatic selection of algorithms under time/space/power constraints
automatic prediction post-processing and calibration
automatic leakage detection
automatic inference and differentiation
user interfaces for AutoML

problems:

different data distributions: the intrinsic/geometrical complexity of the dataset
different tasks: regression, binary classification, multi-class classification, multi-label classification
different scoring metrics: AUC, BAC, MSE, F1, etc
class balance: Balanced or unbalanced class proportions
sparsity: Full matrices or sparse matrices
missing values: Presence or absence of missing values
categorical variables: Presence or absence of categorical variables
irrelevant variables: Presence or absence of additional irrelevant variables (distractors)
number Ptr of training examples: Small or large number of training examples
number N of variables/features: Small or large number of variables
aspect ratio Ptr/N of the training data matrix: Ptr >> N, Ptr = N or Ptr << N
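
Several of the dataset properties listed above can be measured directly before any model is chosen. The sketch below computes a few of them for a small in-memory dataset. The representation (lists of rows with `None` for missing values) and the function name `describe_dataset` are assumptions made for this example.

```python
def describe_dataset(X, y):
    """Compute a few simple properties of a dataset.

    X: list of rows, each a list of numbers or None (None marks a missing value).
    y: list of class labels, one per row.
    """
    n_examples = len(X)                  # Ptr in the list above
    n_features = len(X[0]) if X else 0   # N in the list above
    cells = [v for row in X for v in row]
    n_cells = len(cells) or 1            # avoid division by zero

    counts = {}
    for label in y:
        counts[label] = counts.get(label, 0) + 1

    return {
        "n_examples": n_examples,
        "n_features": n_features,
        "aspect_ratio": n_examples / n_features if n_features else float("inf"),
        "missing_fraction": sum(v is None for v in cells) / n_cells,
        # Fraction of all cells that hold an observed zero (a sparsity proxy).
        "zero_fraction": sum(v == 0 for v in cells if v is not None) / n_cells,
        # Rarest class count over most frequent class count (1.0 means balanced).
        "class_balance": min(counts.values()) / max(counts.values()) if counts else 1.0,
    }

# Toy data, made up for illustration.
X = [[0, 1.5, None], [0, 0, 2.0], [3.0, 0, 0], [0, None, 1.0]]
y = [0, 0, 0, 1]
info = describe_dataset(X, y)
print(info)
```

An automated system could use such measurements to decide, for example, whether to use sparse matrices, impute missing values, or reweight classes.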

"Automated Machine Learning: A Short History" by Thomas Dinsmore

"Automated Machine Learning" by Andreas Mueller video
"AutoML and How To Speed It Up" by Frank Hutter video
"Neural Architecture Search" by Quoc Le video (post)

interesting papers

auto-sklearn project
TPOT project
auto_ml project
H2O AutoML project

The Automatic Statistician project

overview by Zoubin Ghahramani video
overview by Zoubin Ghahramani video
overview by Zoubin Ghahramani video
overview by Zoubin Ghahramani slides

AutoML challenge

"Design the perfect machine learning “black box” capable of performing all model selection and hyper-parameter tuning without any human intervention"

"Benchmarking Automatic Machine Learning Frameworks" by Balaji and Allen paper

"The Future of Deep Learning" by Francois Chollet (talk video)

"Why Tool AIs Want to Be Agent AIs" by Gwern Branwen:

"Roughly, we can try to categorize the different kinds of agentiness by level of neural network they work on. There are:

actions internal to a computation
- inputs
- intermediate states
- accessing the external environment
- amount of computation
- enforcing constraints/finetuning quality of output
- changing the loss function applied to output
actions internal to training the neural network
- the gradient itself
- size & direction of gradient descent steps on each parameter
- overall gradient descent learning rate and learning rate schedule
- choice of data samples to train on
internal to the neural network design step
- hyperparameter optimization
- neural network architecture
internal to the dataset
- active learning
- optimal experiment design
internal to interaction with environment
- adaptive experiment
- multi-armed bandit
- exploration for reinforcement learning"

"The logical extension of these neural networks all the way down papers is that an actor like Google / Baidu / Facebook / MS could effectively turn neural networks into a black box: a user/developer uploads through an API a dataset of input/output pairs of a specified type and a monetary loss function, and a top-level neural network running on a large GPU cluster starts autonomously optimizing over architectures & hyperparameters for the neural network design which balances GPU cost and the monetary loss, interleaved with further optimization over the thousands of previous submitted tasks, sharing its learning across all of the datasets / loss functions / architectures / hyperparameters, and the original user simply submits future data through the API for processing by the best neural network so far."

(Gwern Branwen)

weak supervision

data programming

"CleanNet: Transfer Learning for Scalable Image Classifier Training with Label Noise" by Lee et al. paper (post, post)

weak supervision - data programming

Snorkel project
Snorkel blog

overview by Chris Re video
overview by Chris Re video
overview by Chris Re video

"Data Programming: ML with Weak Supervision" post
"Socratic Learning: Debugging ML Models" post
"SLiMFast: Assessing the Reliability of Data" post
"Data Programming + TensorFlow Tutorial" post
"Babble Labble: Learning from Natural Language Explanations" post (overview video)
"Structure Learning: Are Your Sources Only Telling You What You Want to Hear?" post
"HoloClean: Weakly Supervised Data Repairing" post
"Scaling Up Snorkel with Spark" post
"Weak Supervision: The New Programming Paradigm for Machine Learning" post
"Learning to Compose Domain-Specific Transformations for Data Augmentation" post
"Exploiting Building Blocks of Data to Efficiently Create Training Sets" post
"Programming Training Data: The New Interface Layer for ML" post

"Data Programming: Creating Large Training Sets, Quickly" paper summary (video)
"Socratic Learning: Empowering the Generative Model" paper summary (video)
"Data Programming with DDLite: Putting Humans in a Different Part of the Loop" paper
"Snorkel: A System for Lightweight Extraction" paper (talk video)
"Snorkel: Fast Training Set Generation for Information Extraction" paper (talk video)
"Learning the Structure of Generative Models without Labeled Data" paper (talk video)
"Learning to Compose Domain-Specific Transformations for Data Augmentation" paper (video)
"Inferring Generative Model Structure with Static Analysis" paper (video)
"Snorkel: Rapid Training Data Creation with Weak Supervision" paper summary (talk video)
"Training Complex Models with Multi-Task Weak Supervision" paper
"Snorkel MeTaL: Weak Supervision for Multi-Task Learning" paper

"Snorkel is a system for rapidly creating, modeling, and managing training data, currently focused on accelerating the development of structured or "dark" data extraction applications for domains in which large labeled training sets are not available or easy to obtain.

Today's state-of-the-art machine learning models require massive labeled training sets--which usually do not exist for real-world applications. Instead, Snorkel is based around the new data programming paradigm, in which the developer focuses on writing a set of labeling functions, which are just scripts that programmatically label data. The resulting labels are noisy, but Snorkel automatically models this process - learning, essentially, which labeling functions are more accurate than others - and then uses this to train an end model (for example, a deep neural network in TensorFlow).

Surprisingly, by modeling a noisy training set creation process in this way, we can take potentially low-quality labeling functions from the user, and use these to train high-quality end models. We see Snorkel as providing a general framework for many weak supervision techniques, and as defining a new programming model for weakly-supervised machine learning systems."
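
The idea of labeling functions can be sketched in plain Python without the Snorkel library. The heuristics and example texts below are hypothetical. Note the simplification: Snorkel learns the accuracy of each labeling function with a generative model, whereas this sketch swaps in an unweighted majority vote.

```python
# Labels: 1 = spam, 0 = not spam, -1 = abstain.
SPAM, HAM, ABSTAIN = 1, 0, -1

# Hypothetical labeling functions: cheap, noisy heuristics.
def lf_contains_link(text):
    return SPAM if "http" in text else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

def lf_short_reply(text):
    return HAM if len(text.split()) <= 4 else ABSTAIN

LABELING_FUNCTIONS = [lf_contains_link, lf_mentions_prize, lf_short_reply]

def label_matrix(texts):
    """One row per example, one column per labeling function."""
    return [[lf(t) for lf in LABELING_FUNCTIONS] for t in texts]

def majority_vote(votes):
    """Combine non-abstaining votes; ties and all-abstain rows stay unlabeled."""
    spam = sum(v == SPAM for v in votes)
    ham = sum(v == HAM for v in votes)
    if spam == ham:
        return ABSTAIN
    return SPAM if spam > ham else HAM

texts = [
    "Claim your prize now at http://example.com",
    "see you tomorrow",
    "The quarterly report is attached for your review",
    "prize inside",
]
L = label_matrix(texts)
labels = [majority_vote(row) for row in L]
print(L)
print(labels)
```

The last example shows why modeling the labeling functions matters: two heuristics conflict, and a plain vote cannot decide which one is more reliable.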

interesting papers

theory
automated machine learning
systems

interesting papers - deep learning
interesting papers - reinforcement learning
interesting papers - bayesian inference and learning
interesting papers - probabilistic programming

interesting recent papers

interesting papers - theory

"A Theory of the Learnable" Valiant

"Humans appear to be able to learn new concepts without needing to be programmed explicitly in any conventional sense. In this paper we regard learning as the phenomenon of knowledge acquisition in the absence of explicit programming. We give a precise methodology for studying this phenomenon from a computational viewpoint. It consists of choosing an appropriate information gathering mechanism, the learning protocol, and exploring the class of concepts that can be learned using it in a reasonable (polynomial) number of steps. Although inherent algorithmic complexity appears to set serious limits to the range of concepts that can be learned, we show that there are some important nontrivial classes of propositional concepts that can be learned in a realistic sense."

"Proof that if you have a finite number of functions, say N, then every training error will be close to every test error once you have more than log N training cases by a small constant factor. Clearly, if every training error is close to its test error, then overfitting is basically impossible (overfitting occurs when the gap between the training and the test error is large)."

"Statistical Modeling: The Two Cultures" Breiman

"There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools."

"A Few Useful Things to Know about Machine Learning" Domingos

"Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled. As a result, machine learning is widely used in computer science and other fields. However, developing successful machine learning applications requires a substantial amount of “black art” that is hard to find in textbooks. This article summarizes twelve key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions."

"Learning with Intelligent Teacher: Similarity Control and Knowledge Transfer" Vapnik, Izmailov

"This paper introduces an advanced setting of machine learning problem in which an Intelligent Teacher is involved. During training stage, Intelligent Teacher provides Student with information that contains, along with classification of each example, additional privileged information (explanation) of this example. The paper describes two mechanisms that can be used for significantly accelerating the speed of Student’s training: (1) correction of Student’s concepts of similarity between examples, and (2) direct Teacher-Student knowledge transfer."

"During last fifty years a strong machine learning theory has been developed. This theory includes: 1. The necessary and sufficient conditions for consistency of learning processes. 2. The bounds on the rate of convergence which in general cannot be improved. 3. The new inductive principle (SRM) which always achieves the smallest risk. 4. The effective algorithms, (such as SVM), that realize consistency property of SRM principle. It looked like general learning theory has been complied: it answered almost all standard questions that is asked in the statistical theory of inference. Meantime, the common observation was that human students require much less examples for training than learning machine. Why? The talk is an attempt to answer this question. The answer is that it is because the human students have an Intelligent Teacher and that Teacher-Student interactions are based not only on the brute force methods of function estimation from observations. Speed of learning also based on Teacher-Student interactions which have additional mechanisms that boost learning process. To learn from smaller number of observations learning machine has to use these mechanisms. In the talk I will introduce a model of learning that includes the so called Intelligent Teacher who during a training session supplies a Student with intelligent (privileged) information in contrast to the classical model where a student is given only outcomes y for events x. Based on additional privileged information x* for event x two mechanisms of Teacher-Student interactions (special and general) are introduced: 1. The Special Mechanism: To control Student's concept of similarity between training examples. and 2. The General Mechanism: To transfer knowledge that can be obtained in space of privileged information to the desired space of decision rules. Both mechanisms can be considered as special forms of capacity control in the universally consistent SRM inductive principle. 
Privileged information exists for almost any inference problem and can make a big difference in speed of learning processes."

video https://video.ias.edu/csdm/2015/0330-VladimirVapnik (Vapnik)
press http://learningtheory.org/learning-has-just-started-an-interview-with-prof-vladimir-vapnik/

"Compression and Machine Learning: A New Perspective on Feature Space Vectors" Sculley, Brodley

"The use of compression algorithms in machine learning tasks such as clustering and classification has appeared in a variety of fields, sometimes with the promise of reducing problems of explicit feature selection. The theoretical justification for such methods has been founded on an upper bound on Kolmogorov complexity and an idealized information space. An alternate view shows compression algorithms implicitly map strings into implicit feature space vectors, and compression-based similarity measures compute similarity within these feature spaces. Thus, compression-based methods are not a “parameter free” magic bullet for feature selection and data representation, but are instead concrete similarity measures within defined feature spaces, and are therefore akin to explicit feature vector models used in standard machine learning algorithms. To underscore this point, we find theoretical and empirical connections between traditional machine learning vector models and compression, encouraging cross-fertilization in future work."

interesting papers - automated machine learning

"Design of the 2015 ChaLearn AutoML Challenge" Guyon et al.

"ChaLearn is organizing for IJCNN 2015 an Automatic Machine Learning challenge (AutoML) to solve classification and regression problems from given feature representations, without any human intervention. This is a challenge with code submission: the code submitted can be executed automatically on the challenge servers to train and test learning machines on new datasets. However, there is no obligation to submit code. Half of the prizes can be won by just submitting prediction results. There are six rounds (Prep, Novice, Intermediate, Advanced, Expert, and Master) in which datasets of progressive difficulty are introduced (5 per round). There is no requirement to participate in previous rounds to enter a new round. The rounds alternate AutoML phases in which submitted code is “blind tested” on datasets the participants have never seen before, and Tweakathon phases giving time (~1 month) to the participants to improve their methods by tweaking their code on those datasets. This challenge will push the state-of-the-art in fully automatic machine learning on a wide range of problems taken from real world applications."

"Population Based Training of Neural Networks" Jaderberg et al.

"Neural networks dominate the modern machine learning landscape, but their training and success still suffer from sensitivity to empirical choices of hyperparameters such as model architecture, loss function, and optimisation algorithm. In this work we present Population Based Training, a simple asynchronous optimisation algorithm which effectively utilises a fixed computational budget to jointly optimise a population of models and their hyperparameters to maximise performance. Importantly, PBT discovers a schedule of hyperparameter settings rather than following the generally sub-optimal strategy of trying to find a single fixed set to use for the whole course of training. With just a small modification to a typical distributed hyperparameter training framework, our method allows robust and reliable training of models. We demonstrate the effectiveness of PBT on deep reinforcement learning problems, showing faster wall-clock convergence and higher final performance of agents by optimising over a suite of hyperparameters. In addition, we show the same method can be applied to supervised learning for machine translation, where PBT is used to maximise the BLEU score directly, and also to training of Generative Adversarial Networks to maximise the Inception score of generated images. In all cases PBT results in the automatic discovery of hyperparameter schedules and model selection which results in stable training and better final performance."

"Two common tracks for the tuning of hyperparameters exist: parallel search and sequential optimisation, which trade-off concurrently used computational resources with the time required to achieve optimal results. Parallel search performs many parallel optimisation processes (by optimisation process we refer to neural network training runs), each with different hyperparameters, with a view to finding a single best output from one of the optimisation processes – examples of this are grid search and random search. Sequential optimisation performs few optimisation processes in parallel, but does so many times sequentially, to gradually perform hyperparameter optimisation using information obtained from earlier training runs to inform later ones – examples of this are hand tuning and Bayesian optimisation. Sequential optimisation will in general provide the best solutions, but requires multiple sequential training runs, which is often unfeasible for lengthy optimisation processes."

"In this work, we present a simple method, Population Based Training which bridges and extends parallel search methods and sequential optimisation methods. Advantageously, our proposal has a wallclock run time that is no greater than that of a single optimisation process, does not require sequential runs, and is also able to use fewer computational resources than naive search methods such as random or grid search. Our approach leverages information sharing across a population of concurrently running optimisation processes, and allows for online propagation/transfer of parameters and hyperparameters between members of the population based on their performance."

"Furthermore, unlike most other adaptation schemes, our method is capable of performing online adaptation of hyperparameters – which can be particularly important in problems with highly non-stationary learning dynamics, such as reinforcement learning settings, where the learning problem itself can be highly non-stationary (e.g. dependent on which parts of an environment an agent is currently able to explore). As a consequence, it might be the case that the ideal hyperparameters for such learning problems are themselves highly non-stationary, and should vary in a way that precludes setting their schedule in advance."

post https://deepmind.com/blog/population-based-training-neural-networks/
video https://vimeo.com/250399261 (Jaderberg)
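
The exploit/explore loop described in the excerpts above can be sketched on a toy problem. This is a minimal illustration and not the authors' implementation: the quadratic loss, the truncation-style selection (bottom quarter copies from top quarter), the 0.8/1.2 perturbation factors, and all names (`pbt`, `train_step`) are assumptions made for the example, and the workers run in a plain loop rather than asynchronously.

```python
import random

def loss(theta):
    # Toy objective: a quadratic bowl with its minimum at theta = 0.
    return theta * theta

def train_step(theta, lr):
    # One gradient-descent step on the toy loss (the gradient is 2 * theta).
    return theta - lr * 2.0 * theta

def pbt(pop_size=8, rounds=20, steps_per_round=5, seed=0):
    rng = random.Random(seed)
    # Each member carries its own parameters and its own hyperparameter.
    population = [
        {"theta": 5.0, "lr": 10 ** rng.uniform(-4, -1)}
        for _ in range(pop_size)
    ]
    for _ in range(rounds):
        # Train every member for a while with its current learning rate.
        for member in population:
            for _ in range(steps_per_round):
                member["theta"] = train_step(member["theta"], member["lr"])
        # Rank members by performance (lower loss is better).
        ranked = sorted(population, key=lambda m: loss(m["theta"]))
        cutoff = max(1, pop_size // 4)
        top, bottom = ranked[:cutoff], ranked[-cutoff:]
        for member in bottom:
            # Exploit: copy parameters and hyperparameters from a top member.
            donor = rng.choice(top)
            member["theta"] = donor["theta"]
            # Explore: perturb the copied hyperparameter.
            member["lr"] = donor["lr"] * rng.choice([0.8, 1.2])
    return min(population, key=lambda m: loss(m["theta"]))

best = pbt()
print(best["lr"], loss(best["theta"]))
```

Because the learning rate of a lineage changes every time it is copied and perturbed, the surviving member has effectively followed a schedule of learning rates rather than a single fixed value.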

"Data Programming: Creating Large Training Sets, Quickly" Ratner, Sa, Wu, Selsam, Re

"Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming in which users provide a set of labeling functions, which are programs that heuristically label large subsets of data points, albeit noisily. By viewing these labeling functions as implicitly describing a generative model for this noise, we show that we can recover the parameters of this model to “denoise” the training set. Then, we show how to modify a discriminative loss function to make it noise-aware. We demonstrate our method over a range of discriminative models including logistic regression and LSTMs. We establish theoretically that we can recover the parameters of these generative models in a handful of settings. Experimentally, on the 2014 TAC-KBP relation extraction challenge, we show that data programming would have obtained a winning score, and also show that applying data programming to an LSTM model leads to a TAC-KBP score almost 6 F1 points over a supervised LSTM baseline (and into second place in the competition). Additionally, in initial user studies we observed that data programming may be an easier way to create machine learning models for non-experts."

"In the data programming approach to developing a machine learning system, the developer focuses on writing a set of labeling functions, which create a large but noisy training set. Snorkel then learns a generative model of this noise - learning, essentially, which labeling functions are more accurate than others - and uses this to train a discriminative classifier. At a high level, the idea is that developers can focus on writing labeling functions - which are just (Python) functions that provide a label for some subset of data points - and not think about algorithms or features!"

video https://youtube.com/watch?v=iSQHelJ1xxU
video https://youtube.com/watch?v=HmocI2b5YfA (Re)
post http://hazyresearch.github.io/snorkel/blog/weak_supervision.html
post http://hazyresearch.github.io/snorkel/blog/dp_with_tf_blog_post.html
audio https://soundcloud.com/nlp-highlights/28-data-programming-creating-large-training-sets-quickly (Ratner)
notes https://github.com/b12io/reading-group/blob/master/data-programming-snorkel.md
code https://github.com/HazyResearch/snorkel
Snorkel project summary
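
The labeling-function workflow described above can be sketched in plain Python. This sketch does not use the Snorkel API: the three labeling functions, the spouse-relation task, and the example sentences are hypothetical, and the votes are combined by simple majority vote as a stand-in for the generative model that data programming fits to estimate labeling-function accuracies.

```python
ABSTAIN, NEGATIVE, POSITIVE = 0, -1, 1

# Hypothetical labeling functions for a "are these two people spouses?" task.
# Each one labels only the subset of examples its heuristic covers.
def lf_contains_married(text):
    return POSITIVE if "married" in text.lower() else ABSTAIN

def lf_contains_wife_or_husband(text):
    lowered = text.lower()
    return POSITIVE if ("wife" in lowered or "husband" in lowered) else ABSTAIN

def lf_contains_colleague(text):
    return NEGATIVE if "colleague" in text.lower() else ABSTAIN

LABELING_FUNCTIONS = [
    lf_contains_married,
    lf_contains_wife_or_husband,
    lf_contains_colleague,
]

def label_matrix(texts):
    # One row per example, one column per labeling function.
    return [[lf(text) for lf in LABELING_FUNCTIONS] for text in texts]

def majority_vote(votes):
    # Simplification: every labeling function gets equal weight.
    total = sum(votes)
    if total > 0:
        return POSITIVE
    if total < 0:
        return NEGATIVE
    return ABSTAIN

texts = [
    "Alice married Bob in 2010.",
    "Carol is a colleague of Dave.",
    "Erin and her husband Frank moved to Paris.",
    "Grace met Heidi at a conference.",
]
noisy_labels = [majority_vote(row) for row in label_matrix(texts)]
print(noisy_labels)  # [1, -1, 1, 0]
```

The resulting noisy labels (with abstentions dropped) would then be used to train a discriminative classifier.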

"Socratic Learning: Empowering the Generative Model" Varma et al.

"A challenge in training discriminative models like neural networks is obtaining enough labeled training data. Recent approaches have leveraged generative models to denoise weak supervision sources that a discriminative model can learn from. These generative models directly encode the users' background knowledge. Therefore, these models may be incompletely specified and fail to model latent classes in the data. We present Socratic learning to systematically correct such generative model misspecification by utilizing feedback from the discriminative model. We prove that under mild conditions, Socratic learning can recover features from the discriminator that informs the generative model about these latent classes. Experimentally, we show that without any hand-labeled data, the corrected generative model improves discriminative performance by up to 4.47 points and reduces error for an image classification task by 80% compared to a state-of-the-art weak supervision modeling technique."

video https://youtube.com/watch?v=0gRNochbK9c
post http://hazyresearch.github.io/snorkel/blog/socratic_learning.html
code https://github.com/HazyResearch/snorkel
Snorkel project summary

"Snorkel: Rapid Training Data Creation with Weak Supervision" Ratner, Bach, Ehrenberg, Fries, Wu, Re

"Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train stateof-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8× faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8× speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets."

video https://youtube.com/watch?v=HmocI2b5YfA (Re)
notes https://blog.acolyer.org/2018/08/22/snorkel-rapid-training-data-creation-with-weak-supervision
Snorkel project summary

interesting papers - systems

"Hidden Technical Debt in Machine Learning Systems" Sculley et al.

"Machine learning offers a fantastically powerful toolkit for building useful complexprediction systems quickly. This paper argues it is dangerous to think ofthese quick wins as coming for free. Using the software engineering frameworkof technical debt, we find it is common to incur massive ongoing maintenancecosts in real-world ML systems. We explore several ML-specific risk factors toaccount for in system design. These include boundary erosion, entanglement,hidden feedback loops, undeclared consumers, data dependencies, configurationissues, changes in the external world, and a variety of system-level anti-patterns."

notes https://blog.acolyer.org/2016/02/29/machine-learning-the-high-interest-credit-card-of-technical-debt
post http://john-foreman.com/blog/the-perilous-world-of-machine-learning-for-fun-and-profit-pipeline-jungles-and-hidden-feedback-loops

"TensorFlow: A system for large-scale machine learning" Abadi et al.

"TensorFlow is a machine learning system that operates at large scale and in heterogeneous environments. TensorFlow uses dataflow graphs to represent computation, shared state, and the operations that mutate that state. It maps the nodes of a dataflow graph across many machines in a cluster, and within a machine across multiple computational devices, including multicore CPUs, general-purpose GPUs, and custom designed ASICs known as Tensor Processing Units (TPUs). This architecture gives flexibility to the application developer: whereas in previous “parameter server” designs the management of shared state is built into the system, TensorFlow enables developers to experiment with novel optimizations and training algorithms. TensorFlow supports a variety of applications, with particularly strong support for training and inference on deep neural networks. Several Google services use TensorFlow in production, we have released it as an open-source project, and it has become widely used for machine learning research. In this paper, we describe the TensorFlow dataflow model in contrast to existing systems, and demonstrate the compelling performance that TensorFlow achieves for several real-world applications."

"A Reliable Effective Terascale Linear Learning System" Agarwal, Chapelle, Dudik, Langford

Vowpal Wabbit

"We present a system and a set of techniques for learning linear predictors with convex losses on terascale data sets, with trillions of features, 1 billions of training examples and millions of parameters in an hour using a cluster of 1000 machines. Individually none of the component techniques are new, but the careful synthesis required to obtain an efficient implementation is. The result is, up to our knowledge, the most scalable and efficient linear learning system reported in the literature. We describe and thoroughly evaluate the components of the system, showing the importance of the various design choices."

"

Online by default

Hashing, raw text is fine

Most scalable public algorithm

Reduction to simple problems

Causation instead of correlation

Learn to control based on feedback
"

https://github.com/JohnLangford/vowpal_wabbit/wiki
video http://youtube.com/watch?v=wwlKkFhEhxE (Langford)
paper "Bring The Noise: Embracing Randomness Is the Key to Scaling Up Machine Learning Algorithms" by Brian Dalessandro

"Making Contextual Decisions with Low Technical Debt" Agarwal et al.

https://github.com/brylevkirill/notes/blob/master/Reinforcement%20Learning.md#making-contextual-decisions-with-low-technical-debt-agarwal-et-al

"CatBoost: Gradient Boosting with Categorical Features Support" Dorogush, Ershov, Gulin

"In this paper we present CatBoost, a new open-sourced gradient boosting library that successfully handles categorical features and outperforms existing publicly available implementations of gradient boosting in terms of quality on a set of popular publicly available datasets. The library has a GPU implementation of learning algorithm and a CPU implementation of scoring algorithm, which are significantly faster than other gradient boosting libraries on ensembles of similar sizes."

"Two critical algorithmic advances introduced in CatBoost are the implementation of ordered boosting, a permutation-driven alternative to the classic algorithm, and an innovative algorithm for processing categorical features. Both techniques were created to fight a prediction shift caused by a special kind of target leakage present in all currently existing implementations of gradient boosting algorithms."

https://catboost.yandex
video https://youtube.com/watch?v=8o0e-r0B5xQ (Dorogush)
video https://youtube.com/watch?v=db-iLhQvcH8 (Prokhorenkova)
video https://youtube.com/watch?v=UYDwhuyWYSo (Dorogush) in russian
video https://youtube.com/watch?v=9ZrfErvm97M (Dorogush) in russian
video https://youtube.com/watch?v=Q_xa4RvnDcY (Dorogush) in russian
code https://github.com/catboost/catboost
paper "CatBoost: Unbiased Boosting with Categorical Features" by Prokhorenkova, Gusev, Vorobev, Dorogush, Gulin

"Consistent Individualized Feature Attribution for Tree Ensembles" Lundberg, Erion, Lee

"Interpreting predictions from tree ensemble methods such as gradient boosting machines and random forests is important, yet feature attribution for trees is often heuristic and not individualized for each prediction. Here we show that popular feature attribution methods are inconsistent, meaning they can lower a feature's assigned importance when the true impact of that feature actually increases. This is a fundamental problem that casts doubt on any comparison between features. To address it we turn to recent applications of game theory and develop fast exact tree solutions for SHAP (SHapley Additive exPlanation) values, which are the unique consistent and locally accurate attribution values. We then extend SHAP values to interaction effects and define SHAP interaction values. We propose a rich visualization of individualized feature attributions that improves over classic attribution summaries and partial dependence plots, and a unique "supervised" clustering (clustering based on feature attributions). We demonstrate better agreement with human intuition through a user study, exponential improvements in run time, improved clustering performance, and better identification of influential features. An implementation of our algorithm has also been merged into XGBoost and LightGBM."
