Marcos Vanetta (@malev)
Powered by tacos
Sponsored by Continuum Analytics
The talk
- Reproducibility
- What is a development environment?
- What can we use?
- What would be nice to have?
Reproducibility
Reproducibility is the ability of an entire experiment or
study to be duplicated, either by the same researcher or by someone else
working independently.
Development Environment
- A computer system in which a computer program or software component is deployed and executed.
- A development environment is a collection of procedures and tools for developing, testing and debugging an application or program.
- A development environment contains everything required by a team to build and deploy software-intensive systems.
Components
- Method
- Tools
- Enablement
- Organization
- Infrastructure
- Adoption
Method
- Roles, work products, tasks, and processes
- Standards, guidelines, checklists, templates, and examples
- Deployment topology
Tools
- Development tools and their integrations
- Development tool configurations and installation scripts
- Deployment topology, which considers the software and hardware required
Infrastructure
A development environment considers infrastructure in terms of both hardware and software.
- Locations, nodes, and connectivity
- Software (such as operating systems, database management systems, board-level controls, and test harnesses).
How do we work with data?
- Everything is production
- Everything is NOT production
- Multi-language
- Local | Cloud | both
- Data ~GB | Data ~TB | ...
What do we want to reproduce?
- Coding and documentation styles
- Software dependencies (libraries, databases, etc.)
- Configuration files and environmental variables
- Data (dummy data and real data)
- Keys (aws, ssh, etc)
Coding and documentation styles
Dependencies: database engines
- Installation instructions
- Schema
- Configuration
- Dummy data
- Docker or Vagrant
- Makefiles or bash scripts
- SaaS
- Migrations
Automate
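The schema and dummy-data items above are the kind of step that can be scripted. A minimal sketch, assuming SQLite as a stand-in engine; the `measurements` table, its columns, and the rows are hypothetical and only there for illustration.

```python
import sqlite3

# Hypothetical schema and dummy rows, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS measurements (
    id INTEGER PRIMARY KEY,
    species TEXT NOT NULL,
    petal_length REAL NOT NULL
);
"""

DUMMY_ROWS = [
    ("setosa", 1.4),
    ("versicolor", 4.7),
    ("virginica", 6.0),
]


def bootstrap(path=":memory:"):
    """Create the schema and load dummy data; return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    # Only seed an empty table, so re-running the script is harmless.
    (count,) = conn.execute("SELECT COUNT(*) FROM measurements").fetchone()
    if count == 0:
        conn.executemany(
            "INSERT INTO measurements (species, petal_length) VALUES (?, ?)",
            DUMMY_ROWS,
        )
        conn.commit()
    return conn


if __name__ == "__main__":
    conn = bootstrap()
    for row in conn.execute("SELECT species, petal_length FROM measurements"):
        print(row)
```

Running the same script on every machine gives each developer the same starting database; a real project would likely swap in its own engine and a migration tool.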
Dependencies: libraries
| pip | conda |
| --- | --- |
| Lots of packages | Mostly data packages |
| ~ Multi-platform | Multi-platform |
| Not so fast | Fast |
| Included in Anaconda | Included in Anaconda |
Consider tools like pipreqs or defrost.
Exporting your dependencies with pip
$ pip freeze > requirements.txt
$ cat requirements.txt
requests==2.8.1
virtualenv==13.0.1
wheel==0.26.0
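A frozen file is only reproducible if every entry is pinned. A minimal sketch of a check for that, assuming plain `name==version` lines like the ones above; `unpinned_requirements` is a hypothetical helper, not part of pip, and it does not try to handle URLs or environment markers.

```python
def unpinned_requirements(text):
    """Return requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as -r or -e.
        if not line or line.startswith("-"):
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


if __name__ == "__main__":
    frozen = "requests==2.8.1\nvirtualenv==13.0.1\nwheel==0.26.0\n"
    print(unpinned_requirements(frozen))
    print(unpinned_requirements(frozen + "flask\n"))
```

A check like this could run in CI so that a hand-edited, unpinned entry is caught before it reaches other machines.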
Reusing an environment
$ virtualenv .my-env
$ source .my-env/bin/activate
(.my-env)$ pip install -r requirements.txt
Keep it simple
Exporting your dependencies with conda
$ conda env export -n my-project -f environment.yml
$ cat environment.yml
name: my-project
dependencies:
- bokeh=0.8.0=np19py27_0
- colorama=0.3.3=py27_0
- pip:
  - flask
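Each conda entry in the exported file has the shape `name=version=build`, while pip entries may carry only a name. A minimal sketch of splitting such an entry apart, assuming only that exported format; `CondaSpec` and `parse_conda_spec` are hypothetical names, not part of conda.

```python
from collections import namedtuple

# Hypothetical container for one dependency entry.
CondaSpec = namedtuple("CondaSpec", ["name", "version", "build"])


def parse_conda_spec(spec):
    """Split 'name=version=build' into its parts; missing parts become None."""
    parts = spec.strip().split("=")
    if not parts[0] or len(parts) > 3:
        raise ValueError("unexpected spec: %r" % spec)
    # Pad so that 'flask' or 'flask=0.10' still yield three fields.
    parts += [None] * (3 - len(parts))
    return CondaSpec(*parts)


if __name__ == "__main__":
    for entry in ["bokeh=0.8.0=np19py27_0", "colorama=0.3.3=py27_0", "flask"]:
        print(parse_conda_spec(entry))
```

The build string (for example `np19py27_0`) ties the package to a specific NumPy and Python build, which is why an exported file may not install unchanged on another platform.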
Reusing with conda
$ conda env create
...
$ source activate my-project
discarding /Users/mvanetta/miniconda/bin from PATH
prepending /Users/mvanetta/miniconda/envs/my-project/bin to PATH
(my-project)$
Keep it simple
Using pill
$ conda install pill deps -c malev
$ pill init
$ source pill in
$ deps install
$ source pill out
$ rm -rf .pill
Working with notebooks
$ conda create -n iris -y bokeh pandas jupyter
$ source activate iris
$ ipython notebook iris.ipynb
$ conda env attach -n iris iris.ipynb
$ anaconda notebook upload iris.ipynb
Reusing your notebook
$ anaconda notebook download malev/iris
$ conda env create iris.ipynb
$ source activate iris
$ ipython notebook iris.ipynb
Configuration files and environmental variables
- Essential part of configuration management
- yaml, ini, json files
- Generally stored in programmers' brains
Attach them to the repo, document them, use tools like autoenv, and automate.
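One way to keep configuration out of people's heads is to commit documented defaults and let environment variables override them per machine. A minimal sketch, assuming an ini file with a single `[app]` section; the keys, the `MYAPP_` prefix, and `load_settings` are hypothetical.

```python
import configparser
import os

# Hypothetical defaults that would live in a committed ini file.
DEFAULT_INI = """
[app]
db_host = localhost
db_port = 5432
"""


def load_settings(ini_text=DEFAULT_INI, environ=None, prefix="MYAPP_"):
    """Read defaults from ini text, then apply environment overrides."""
    environ = os.environ if environ is None else environ
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    settings = dict(parser["app"])
    for key in list(settings):
        # MYAPP_DB_HOST overrides db_host, and so on.
        override = environ.get(prefix + key.upper())
        if override is not None:
            settings[key] = override
    return settings


if __name__ == "__main__":
    print(load_settings())
```

With this layout the defaults are visible in the repository, and each machine documents its differences through the variables it sets.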
Keys
- Security concerns
- Don't put them in the repo
- Talk to your IT department
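A cheap safeguard against committing keys is to scan text for strings that look like credentials. A minimal sketch, assuming the common `AKIA` prefix of AWS access key IDs; other key types use different formats and would not be caught, and `find_aws_key_ids` is a hypothetical helper.

```python
import re

# Assumption: access key IDs look like "AKIA" plus 16 upper-case letters or digits.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_aws_key_ids(text):
    """Return every substring of text that looks like an AWS access key ID."""
    return AWS_KEY_ID.findall(text)


if __name__ == "__main__":
    # The key below is the placeholder used in AWS documentation, not a real one.
    sample = "aws_access_key_id = AKIAIOSFODNN7EXAMPLE"
    print(find_aws_key_ids(sample))
```

A check like this could run from a pre-commit hook; it complements, and does not replace, keeping keys outside the repository in the first place.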
Conclusions
- We still have some work to do
- We have a lot of manual work
- It is an expensive process