En raisonnance avec la soutenance de thèse de Guillaume Papa s’est tenu le « Random Forest & Robust Algorithmics workshop » en présence de Olivier Teytaud de l'INRIA Saclay & Google Research Europe Zurich ainsi que de Joachim Buhmann de l'ETH Zurich. L’intervention d'Olivier Teytaud a porté sur les forêts aléatoires : « Exact Distributed Training: Random Forest with Bilions of Examples ». Joachim Buhmann a quant à lui abordé la question d'une algorithmique robustecomme une base pour la science.
Exact Distributed Training: Random Forest with Bilions of Examples by Olivier Teytaud
We introduce an exact distributed algorithm to train Random Forest models as well as other decision forest models without relying on approximating best split search. We introduce the proposed algorithm, and compare it, for various complexity measures (time, ram, disk, and network complexity analysis), to related approaches. We report its running performances on artificial and real-world datasets up to 17 billions examples. This figure is several orders of magnitude larger than datasets tackled in the existing literature. Finally, we show empirically that Random Forest benefits from being trained on more data, even in the case of already gigantic datasets. decision trees. Sprint is particularly suitable for the distributed setting, but we show that Sliq becomes better in the balanced case and/or when working with randomly drawn subsets of features; and we derive a rule for automatically switching between both methods. Given a dataset with 17.3B examples with 71 features, our implementation trains a tree in 22h.
Robust algorithmics: a foundation for science? by Joachim Buhmann
ALGORITHM is the idiom of modern science, as Bernard Chazelle phrazed it. I like to go a step further in this talk by claiming that algorithmics lays the foundation of modern science. The scientific method of "systematic observation, measurements, and experiments, as well as the formulation, testing, and modification of hypotheses" requires algorithms for knowledge discovery in complex experimental situations. Algorithms in data science map data spaces to hypothesis classes. Beside running time and memory consumption, such algorithms should be characterized by their sensitivity to the signal in the input and their robustness to input fluctuations. The achievable precision of an algorithm, i.e., the attainable resolution in output space, is determined by its capability to extract predictive information. I will advocate an information theoretic framework for algorithm analysis where an algorithm is characterized as a computational evolution of a posterior distribution on the output space. The method allows us to investigate complex data analysis pipelines as they occur in computational neuroscience and neurology as well as inmolecular biology. I will demonstrate this design concept for algorithm validation with a statistical analysis of diffusion tensor imaging data. A theoretical result for sparse minimum bisection yields statistical hints why random combinatorial optimization problems are hard to solve.