DVC Version Control for Machine Learning

Version control has long been standard practice for programmers and developers. Data science work on machine learning has a version control tool of its own as well: DVC, which works much like Git.

Workflow

DVC works much like Git but splits storage in two: code is kept in remote code storage on a Git server, while models are kept in remote data storage such as S3, GS, Azure, or SSH, as shown in the figure below.

Download

Get Started

  • Download and install DVC

  • Download the code and create a Git repository

C:\dvc>
git init
C:\dvc>
wget https://dvc.org/s3/examples/so/code.zip
C:\dvc>
unzip code.zip && rm -f code.zip
C:\dvc>
git add code/
C:\dvc>
git commit -m "download and initialize code"
  • ทำการΰΈͺΰΈ£ΰΉ‰ΰΈ²ΰΈ‡ Virtual Environment

C:\dvc>
mkvirtualenv venv
C:\dvc>
workon venv
  • ทำการติดตั้ง Package ΰΈˆΰΈ²ΰΈΰΉ„ΰΈŸΰΈ₯์ requirements.txt

(venv) C:\nlp>
pip install -r code/requirements.txt
  • ทำการΰΈͺΰΈ£ΰΉ‰ΰΈ²ΰΈ‡ DVC Repository

(venv) C:\nlp>
dvc init
(venv) C:\nlp>
git commit -m "initialize DVC"
  • Download the dataset and add it to DVC

(venv) C:\nlp>
mkdir data
(venv) C:\nlp>
wget -P data https://dvc.org/s3/examples/so/Posts.xml.zip
(venv) C:\nlp>
dvc add data/Posts.xml.zip
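The idea behind adding a large file to DVC can be sketched as follows: the data itself goes into a local cache keyed by its content hash, and only a small pointer record is committed to Git. This is an illustrative sketch, not DVC's implementation; the function names, the pointer fields, and the two-character cache subdirectory are assumptions made for the example.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir, md5_hex):
    """Assumed cache layout: the first two hex characters form a subdirectory."""
    return os.path.join(cache_dir, md5_hex[:2], md5_hex[2:])


def make_pointer(path):
    """Build the small record that would be committed to Git instead of the data."""
    return {"path": os.path.basename(path), "md5": file_md5(path)}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        # A stand-in file; the real dataset is the downloaded Posts.xml.zip.
        sample = os.path.join(workdir, "Posts.xml.zip")
        with open(sample, "wb") as handle:
            handle.write(b"stand-in for the real dataset")
        pointer = make_pointer(sample)
        print(pointer)
        print(cache_path(".dvc/cache", pointer["md5"]))
```

Because the pointer changes whenever the file content changes, committing it to Git is enough to record which version of the data belongs to which commit.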
  • ทำการ Commit การเปΰΈ₯ΰΈ΅ΰΉˆΰΈ’ΰΈ™ΰΉΰΈ›ΰΈ₯ΰΈ‡ΰΉ„ΰΈ›ΰΈ’ΰΈ±ΰΈ‡ Git Repository

(venv) C:\nlp>
git add data/Posts.xml.zip.dvc data/.gitignore
(venv) C:\nlp>
git commit -m "add dataset"
  • ทำการ Run ΰΉƒΰΈ™ DVC ΰΉ€ΰΈžΰΈ·ΰΉˆΰΈ­ΰΈ£ΰΈ§ΰΈšΰΈ£ΰΈ§ΰΈ‘ΰΈ„ΰΈ³ΰΈͺΰΈ±ΰΉˆΰΈ‡ΰΉƒΰΈ™ΰΉΰΈ•ΰΉˆΰΈ₯ΰΈ° Stage

(venv) C:\nlp>
dvc run -d data/Posts.xml.zip ^
        -o data/Posts.xml ^
        -f extract.dvc ^
        unzip data/Posts.xml.zip -d data
  • ทำการ Convert ΰΉ„ΰΈŸΰΈ₯์จาก XML ΰΉ€ΰΈ›ΰΉ‡ΰΈ™ TSV ΰΉƒΰΈ™ DVC ΰΉ€ΰΈžΰΈ·ΰΉˆΰΈ­ΰΈ—ΰΈ³ Feature Extraction ΰΉ„ΰΈ”ΰΉ‰ΰΈ‡ΰΉˆΰΈ²ΰΈ’ΰΈ‚ΰΈΆΰΉ‰ΰΈ™

(venv) C:\nlp>
dvc run -d code/xml_to_tsv.py -d data/Posts.xml ^
          -o data/Posts.tsv ^
          -f prepare.dvc ^
          python code/xml_to_tsv.py data/Posts.xml data/Posts.tsv
  • ทำการ Split Dataset ΰΉƒΰΈ™ DVC ΰΉ€ΰΈžΰΈ·ΰΉˆΰΈ­ΰΉΰΈšΰΉˆΰΈ‡ΰΈ‚ΰΉ‰ΰΈ­ΰΈ‘ΰΈΉΰΈ₯ΰΈ—ΰΈ΅ΰΉˆΰΉƒΰΈŠΰΉ‰ΰΉƒΰΈ™ΰΈΰΈ²ΰΈ£ Training แΰΈ₯ΰΈ° Test โดฒกำหนดให้ Test Dataset ΰΈ‘ΰΈ΅ΰΈ­ΰΈ±ΰΈ•ΰΈ£ΰΈ²ΰΈͺΰΉˆΰΈ§ΰΈ™ΰΉ€ΰΈ›ΰΉ‡ΰΈ™ 0.2 แΰΈ₯ΰΈ°ΰΈΰΈ³ΰΈ«ΰΈ™ΰΈ”ΰΈ„ΰΉˆΰΈ² Seed ในการ Random ΰΉ€ΰΈ›ΰΉ‡ΰΈ™ 20170426

(venv) C:\nlp>
dvc run -d code/split_train_test.py -d data/Posts.tsv ^
          -o data/Posts-train.tsv -o data/Posts-test.tsv ^
          -f split.dvc ^
          python code/split_train_test.py data/Posts.tsv 0.2 20170426 ^
          data/Posts-train.tsv data/Posts-test.tsv
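A seeded split like the one above can be sketched in a few lines. This is not the contents of code/split_train_test.py, which the page does not show; the function below is a hypothetical stand-in that only illustrates how a fixed seed makes the 0.2 split repeatable.

```python
import random


def split_train_test(rows, test_ratio, seed):
    """Shuffle rows with a fixed seed and split off test_ratio of them as the test set."""
    rng = random.Random(seed)  # a private generator, so global random state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    test_size = int(round(len(shuffled) * test_ratio))
    return shuffled[test_size:], shuffled[:test_size]


if __name__ == "__main__":
    rows = [f"post-{i}" for i in range(10)]
    train, test = split_train_test(rows, 0.2, 20170426)
    print(len(train), len(test))  # 8 2
    # The same seed reproduces exactly the same split.
    print(split_train_test(rows, 0.2, 20170426) == (train, test))  # True
```

Keeping the seed on the command line means it is stored in the stage file, so a later reproduction of the stage draws the same split.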
  • ทำการ Extract Feature and Label ΰΉƒΰΈ™ DVC ΰΈ‹ΰΈΆΰΉˆΰΈ‡ΰΈˆΰΈ°ΰΉ„ΰΈ”ΰΉ‰ΰΉ„ΰΈŸΰΈ₯์ Pickle

(venv) C:\nlp>
dvc run -d code/featurization.py -d data/Posts-train.tsv -d data/Posts-test.tsv ^
        -o data/matrix-train.pkl -o data/matrix-test.pkl ^
        -f featurize.dvc ^
        python code/featurization.py data/Posts-train.tsv data/Posts-test.tsv ^
        data/matrix-train.pkl data/matrix-test.pkl
  • ทำการ Train Model กับ Training Dataset ΰΉƒΰΈ™ DVC

(venv) C:\nlp>
dvc run -d code/train_model.py -d data/matrix-train.pkl ^
        -o data/model.pkl ^
        -f train.dvc ^
        python code/train_model.py data/matrix-train.pkl 20170426 data/model.pkl
  • ทำการ Evaluate Model กับ Test Dataset ΰΉƒΰΈ™ DVC

dvc run -d code/evaluate.py -d data/model.pkl -d data/matrix-test.pkl ^
          -M auc.metric ^
          -f evaluate.dvc ^
python code/evaluate.py data/model.pkl data/matrix-test.pkl auc.metric
  • Check the model's score (here AUC) with the DVC metrics command

(venv) C:\nlp>
dvc metrics show
auc.metric: AUC: 0.587951
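To put the reported value in context: AUC is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one, so 0.5 corresponds to random ranking and 0.587951 is only slightly better. The sketch below computes AUC by direct pair counting on made-up labels and scores; it is not the code in code/evaluate.py, which the page does not show.

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Made-up example: 3 of the 4 positive/negative pairs are ordered correctly.
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.4, 0.1]
    print(auc_score(labels, scores))  # 0.75
```

Pair counting is quadratic in the number of examples, which is fine for illustration; libraries typically use a sort-based computation instead.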
  • ทำการ Commit การเปΰΈ₯ΰΈ΅ΰΉˆΰΈ’ΰΈ™ΰΉΰΈ›ΰΈ₯ΰΈ‡ΰΉ„ΰΈ›ΰΈ’ΰΈ±ΰΈ‡ Git Repository

(venv) C:\nlp>
git add *.dvc auc.metric
(venv) C:\nlp>
git commit -am "create pipeline"
  • Edit the file code/featurization.py (lines 72-73)

(venv) C:\nlp>
notepad code/featurization.py
bag_of_words = CountVectorizer(stop_words='english',
                               max_features=5000,
                               ngram_range=(1, 2))
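The edited call passes ngram_range=(1, 2), which asks the vectorizer to count two-word sequences in addition to single words. The pure-Python sketch below shows only that n-gram expansion on a made-up sentence; it is a hypothetical helper, and it leaves out the lowercasing, stop-word removal, and max_features cap that CountVectorizer also applies.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all contiguous word n-grams for every n in the inclusive range."""
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


if __name__ == "__main__":
    tokens = "install python package".split()
    print(word_ngrams(tokens, (1, 1)))
    # ['install', 'python', 'package']
    print(word_ngrams(tokens, (1, 2)))
    # ['install', 'python', 'package', 'install python', 'python package']
```

Bigrams let the model distinguish phrases whose meaning is lost when the words are counted separately, at the cost of a larger vocabulary competing for the same 5000 feature slots.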
  • ทำการ Reproduce ΰΈͺΰΈ³ΰΈ«ΰΈ£ΰΈ±ΰΈšΰΈ—ΰΈΈΰΈ Stage ΰΈ‹ΰΈΆΰΉˆΰΈ‡ΰΈˆΰΈ°ΰΈ—ΰΈ³ΰΉΰΈšΰΈš Auto ΰΈ«ΰΈ²ΰΈΰΈ‘ΰΈ΅ΰΈΰΈ²ΰΈ£ΰΉΰΈΰΉ‰ΰΉ„ΰΈ‚ΰΉ„ΰΈŸΰΈ₯์

(venv) C:\nlp>
dvc repro evaluate.dvc
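Which stages need to run again after the edit follows from the dependencies and outputs declared in the dvc run commands above. The sketch below copies those declarations into a dictionary and propagates the change downstream. It is a simplified model rather than DVC's algorithm: the function name is hypothetical, and it assumes every rerun stage produces changed outputs.

```python
# Stage -> (dependencies, outputs), copied from the dvc run commands in this walkthrough.
STAGES = {
    "extract.dvc": (["data/Posts.xml.zip"], ["data/Posts.xml"]),
    "prepare.dvc": (["code/xml_to_tsv.py", "data/Posts.xml"], ["data/Posts.tsv"]),
    "split.dvc": (
        ["code/split_train_test.py", "data/Posts.tsv"],
        ["data/Posts-train.tsv", "data/Posts-test.tsv"],
    ),
    "featurize.dvc": (
        ["code/featurization.py", "data/Posts-train.tsv", "data/Posts-test.tsv"],
        ["data/matrix-train.pkl", "data/matrix-test.pkl"],
    ),
    "train.dvc": (["code/train_model.py", "data/matrix-train.pkl"], ["data/model.pkl"]),
    "evaluate.dvc": (
        ["code/evaluate.py", "data/model.pkl", "data/matrix-test.pkl"],
        ["auc.metric"],
    ),
}


def stale_stages(stages, changed_files):
    """Return stages whose inputs changed directly or through an upstream stage."""
    dirty = set(changed_files)
    stale = []
    progress = True
    while progress:  # repeat until no new stage becomes stale
        progress = False
        for name, (deps, outs) in stages.items():
            if name not in stale and dirty.intersection(deps):
                stale.append(name)
                dirty.update(outs)  # assumed: a rerun changes all of its outputs
                progress = True
    return stale


if __name__ == "__main__":
    print(stale_stages(STAGES, {"code/featurization.py"}))
    # ['featurize.dvc', 'train.dvc', 'evaluate.dvc']
```

In this model, editing code/featurization.py leaves the extract, prepare, and split stages untouched, which is why reproducing the pipeline costs less than rerunning it from scratch.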
  • Compare the metric across all branches with the DVC metrics command

(venv) C:\nlp>
dvc metrics show -a
  • ทำการ Commit การเปΰΈ₯ΰΈ΅ΰΉˆΰΈ’ΰΈ™ΰΉΰΈ›ΰΈ₯ΰΈ‡ΰΉ„ΰΈ›ΰΈ’ΰΈ±ΰΈ‡ Git Repository

(venv) C:\nlp>
git add evaluate.dvc auc.metric
(venv) C:\nlp>
git commit -m "add evaluation step to the pipeline"
  • ทำการ Tag ΰΉƒΰΈ™ΰΈΰΈ²ΰΈ£ΰΉ€ΰΈΰΉ‡ΰΈš Checkpoint ΰΉ€ΰΈžΰΈ·ΰΉˆΰΈ­ΰΉƒΰΈŠΰΉ‰ΰΉƒΰΈ™ΰΈΰΈ²ΰΈ£ Compare

(venv) C:\nlp>
git tag -a "baseline-experiment" -m "baseline"
  • ทำการ Show Pipeline แบบ ASCII

(venv) C:\nlp>
dvc pipeline show --ascii train.dvc
(venv) C:\nlp>
dvc pipeline show --ascii train.dvc --commands
(venv) C:\nlp>
dvc pipeline show --ascii train.dvc --outs
  • จะแΰΈͺΰΈ”ΰΈ‡ Visualize Pipeline ΰΉƒΰΈ™ΰΉΰΈšΰΈš ASCII

Read more: https://bit.ly/2FOQM5v
