DVC Version Control for Machine Learning
Workflow
Building a machine learning model normally involves three parts: code, data, and configuration. These are used for training to produce a model, and the whole process should be reproducible.

DVC works much like Git, but storage is split into two kinds. Code is kept in remote code storage on a Git server, while the model (along with the data) is kept in remote data storage such as S3, GS, Azure, or SSH, as shown in the figure below.

Download
Get Started
Download and install DVC

Download the code and create a Git repository
git init
wget https://dvc.org/s3/examples/so/code.zip
unzip code.zip && rm -f code.zip
git add code/
git commit -m "download and initialize code"
Create a virtual environment
mkvirtualenv venv
workon venv
Install the packages listed in requirements.txt
pip install -r code/requirements.txt
Create a DVC repository
dvc init
git commit -m "initialize DVC"
Download the dataset and add it to DVC
mkdir data
wget -P data https://dvc.org/s3/examples/so/Posts.xml.zip
dvc add data/Posts.xml.zip
Commit the changes to the Git repository
git add data/Posts.xml.zip.dvc data/.gitignore
git commit -m "add dataset"
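The two steps above leave Git tracking only a small .dvc file, while the dataset itself goes into the DVC cache. The Python sketch below illustrates that idea and is not DVC's actual implementation: file_md5 and pointer_record are hypothetical names, the record format is invented for the example, and the two-character cache directory layout is an assumption based on older DVC releases.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pointer_record(path):
    """Build a small record describing a large file (hypothetical format)."""
    checksum = file_md5(path)
    return {
        "path": os.path.basename(path),
        "md5": checksum,
        # Assumed cache layout: <first 2 hex chars>/<remaining 30 chars>
        "cache_path": os.path.join(checksum[:2], checksum[2:]),
    }


# Demo on a throwaway file standing in for data/Posts.xml.zip
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "Posts.xml.zip")
    with open(sample, "wb") as handle:
        handle.write(b"hello")
    record = pointer_record(sample)
    print(record)
```

Because the record depends only on the file's content, committing it to Git is enough to pin an exact version of the dataset without storing the dataset in Git.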
Use dvc run to record the command for each stage, starting with extracting the archive
dvc run -d data/Posts.xml.zip ^
-o data/Posts.xml ^
-f extract.dvc ^
unzip data/Posts.xml.zip -d data
Convert the file from XML to TSV in DVC so that feature extraction is easier
dvc run -d code/xml_to_tsv.py -d data/Posts.xml ^
-o data/Posts.tsv ^
-f prepare.dvc ^
python code/xml_to_tsv.py data/Posts.xml data/Posts.tsv
Split the dataset in DVC into training and test sets, with the test set taking a 0.2 share of the data and the random seed set to 20170426
dvc run -d code/split_train_test.py -d data/Posts.tsv ^
-o data/Posts-train.tsv -o data/Posts-test.tsv ^
-f split.dvc ^
python code/split_train_test.py data/Posts.tsv 0.2 20170426 ^
data/Posts-train.tsv data/Posts-test.tsv
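The role of the 0.2 ratio and the fixed seed can be sketched as follows. This is a minimal illustration using only the standard library; split_rows is a hypothetical helper, and the real code/split_train_test.py may work differently. The point it shows is that the same seed always yields the same split, which is what makes the stage reproducible.

```python
import random


def split_rows(rows, test_ratio, seed):
    """Shuffle rows with a fixed seed and split off a test fraction."""
    rng = random.Random(seed)  # local generator, leaves global state alone
    shuffled = list(rows)
    rng.shuffle(shuffled)
    test_size = int(round(len(shuffled) * test_ratio))
    # First test_size rows become the test set, the rest the training set
    return shuffled[test_size:], shuffled[:test_size]


rows = [f"post-{i}" for i in range(10)]
train, test = split_rows(rows, 0.2, 20170426)
train_again, test_again = split_rows(rows, 0.2, 20170426)

print(len(train), len(test))
print(train == train_again and test == test_again)
```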
Extract features and labels in DVC, which produces Pickle files
dvc run -d code/featurization.py -d data/Posts-train.tsv -d data/Posts-test.tsv ^
-o data/matrix-train.pkl -o data/matrix-test.pkl ^
-f featurize.dvc ^
python code/featurization.py data/Posts-train.tsv data/Posts-test.tsv ^
data/matrix-train.pkl data/matrix-test.pkl
Train the model on the training dataset in DVC
dvc run -d code/train_model.py -d data/matrix-train.pkl ^
-o data/model.pkl ^
-f train.dvc ^
python code/train_model.py data/matrix-train.pkl 20170426 data/model.pkl
Evaluate the model on the test dataset in DVC
dvc run -d code/evaluate.py -d data/model.pkl -d data/matrix-test.pkl ^
-M auc.metric ^
-f evaluate.dvc ^
python code/evaluate.py data/model.pkl data/matrix-test.pkl auc.metric
Check the model's evaluation metric in DVC (the metric file here holds the AUC score)
dvc metrics show
auc.metric: AUC: 0.587951
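For readers unfamiliar with the AUC value reported above, here is a small sketch of one way to compute it: as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The function auc_score and the sample data are hypothetical, and this is not necessarily how code/evaluate.py computes the metric.

```python
def auc_score(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_score(labels, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so a score such as 0.587951 leaves plenty of room for improvement.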
Commit the changes to the Git repository
git add *.dvc auc.metric
git commit -am "create pipeline"
Edit the file code/featurization.py (lines 72-73)
notepad code/featurization.py
bag_of_words = CountVectorizer(stop_words='english',
                               max_features=5000,
                               ngram_range=(1, 2))
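The ngram_range=(1, 2) setting asks the vectorizer to count two-word phrases in addition to single words. The sketch below shows the effect on a short token list. It is a plain-Python illustration with a hypothetical ngrams function, not scikit-learn's implementation, and it skips stop-word removal and the max_features cut-off.

```python
def ngrams(tokens, ngram_range=(1, 2)):
    """List every n-gram of the token sequence for n in the given range."""
    low, high = ngram_range
    features = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[start:start + n]))
    return features


tokens = ["machine", "learning", "pipeline"]
unigrams_only = ngrams(tokens, (1, 1))
with_bigrams = ngrams(tokens, (1, 2))

print(unigrams_only)
print(with_bigrams)
```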
Reproduce the pipeline up to the evaluate stage; DVC automatically re-runs the stages affected by the edited file
dvc repro evaluate.dvc
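The decision about which stages to re-run can be pictured as a checksum comparison that propagates downstream. The sketch below is a simplified model under stated assumptions, not DVC's algorithm: stale_stages is a hypothetical function, the stage list mirrors the -d and -o flags used earlier, and it conservatively assumes that a re-run stage always changes its outputs.

```python
import hashlib


def checksum(content):
    """MD5 of a string, standing in for the checksum of a file."""
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def stale_stages(stages, recorded, current):
    """Return the names of stages that need to run again.

    stages   -- list of (name, deps, outs) tuples in execution order
    recorded -- checksums saved at the last run, keyed by file path
    current  -- checksums of the same files as they are now
    """
    changed = {path for path in current if current[path] != recorded.get(path)}
    stale = []
    for name, deps, outs in stages:
        if any(dep in changed for dep in deps):
            stale.append(name)
            changed.update(outs)  # assumed: regenerated outputs differ
    return stale


stages = [
    ("split", ["code/split_train_test.py", "data/Posts.tsv"],
     ["data/Posts-train.tsv", "data/Posts-test.tsv"]),
    ("featurize", ["code/featurization.py", "data/Posts-train.tsv",
                   "data/Posts-test.tsv"],
     ["data/matrix-train.pkl", "data/matrix-test.pkl"]),
    ("train", ["code/train_model.py", "data/matrix-train.pkl"],
     ["data/model.pkl"]),
    ("evaluate", ["code/evaluate.py", "data/model.pkl",
                  "data/matrix-test.pkl"], ["auc.metric"]),
]

recorded = {
    "code/split_train_test.py": checksum("split v1"),
    "code/featurization.py": checksum("featurize v1"),
    "code/train_model.py": checksum("train v1"),
    "code/evaluate.py": checksum("evaluate v1"),
    "data/Posts.tsv": checksum("posts"),
}
current = dict(recorded)
current["code/featurization.py"] = checksum("featurize v2")  # the edit

result = stale_stages(stages, recorded, current)
print(result)
```

In this model the split stage is left alone because none of its inputs changed, which is the saving that a pipeline tool offers over re-running every script by hand.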
Check the metric in DVC across all branches
dvc metrics show -a
Commit the changes to the Git repository
git add evaluate.dvc auc.metric
git commit -m "add evaluation step to the pipeline"
Tag the commit as a checkpoint so it can be used for comparison later
git tag -a "baseline-experiment" -m "baseline"
Show the pipeline in ASCII
dvc pipeline show --ascii train.dvc
dvc pipeline show --ascii train.dvc --commands
dvc pipeline show --ascii train.dvc --outs
This displays a visualization of the pipeline in ASCII form
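What the pipeline view conveys is the chain of stages a target depends on. The sketch below rebuilds that chain for train.dvc from the dependencies declared in the earlier dvc run commands. The function upstream_order and the parents mapping are hypothetical illustrations, not part of DVC.

```python
def upstream_order(target, parents):
    """Return the target stage and all its ancestors, dependencies first."""
    ordered = []
    seen = set()

    def visit(stage):
        if stage in seen:
            return
        seen.add(stage)
        for parent in parents.get(stage, []):
            visit(parent)
        ordered.append(stage)  # appended only after all its parents

    visit(target)
    return ordered


# Stage file -> stage files that produce its dependencies
parents = {
    "data/Posts.xml.zip.dvc": [],
    "extract.dvc": ["data/Posts.xml.zip.dvc"],
    "prepare.dvc": ["extract.dvc"],
    "split.dvc": ["prepare.dvc"],
    "featurize.dvc": ["split.dvc"],
    "train.dvc": ["featurize.dvc"],
    "evaluate.dvc": ["train.dvc", "featurize.dvc"],
}

chain = upstream_order("train.dvc", parents)
print(" -> ".join(chain))
```

Since evaluate.dvc sits downstream of train.dvc, it does not appear in the chain for this target.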

Read more: https://bit.ly/2FOQM5v