Setting up a data playground with docker-compose and Faker

In the past few days I have been playing around with Airflow to get familiar with its DAG syntax. Instead of scrambling around to find the right dataset to import into a MySQL instance, I decided to set up a development environment from scratch that would let me quickly change the structure and type of the data as I pleased. I've found this approach quite handy for three reasons:

  1. No time wasted finding the right dataset on the interwebs.
  2. Very easy to extend and adapt to any scenario.
  3. Completely free: running locally, there’s no need to provision and manage a small database instance.

Beyond being a very easy way to set up a data playground, the following approach can also be adapted to scenarios where sensitive data is involved and using a replica of the actual db is not an option.
You can find the repo on GitHub if you want to follow along.

Overview

The structure of the tool is fairly straightforward: each subfolder inside local-dev/ contains a Dockerfile that will spawn a container to populate the database (a MySQL instance in this case), seeding each table through dedicated Python scripts.

.
├── docker-compose.cli.yaml
├── docker-compose.yaml
├── local-dev
│   ├── mysql-seeding
│   │   ├── Dockerfile
│   │   ├── requirements.txt
│   │   ├── seed_mysql.sh
│   │   ├── seed_transactions.py
│   │   ├── seed_users.py
│   │   └── utils.py
│   └── tests
│       ├── Dockerfile
│       ├── entrypoint.sh
│       ├── requirements.txt
│       └── tests.py
└── README.md

This process relies on a very simple docker-compose file to orchestrate the setup, the only constraint being that the database has to be up before the mock entries are seeded. To handle this, it is enough to declare depends_on on the seeding service and have the seeding scripts wait for the db connection to be ready, as sketched right after the compose file below.

version: '3'

services:
  mysql:
    image: mariadb:10.5 
    container_name: mysql_db
    ports:
      - "3306:3306"
    expose:
      - "3306"
    environment:
      MYSQL_ROOT_PASSWORD: "root"
      MYSQL_DATABASE: "database"

  mysql_seeding:
    container_name: mysql_seeding
    build: ./local-dev/mysql-seeding
    depends_on:
      - mysql  
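
Note that depends_on only guarantees that the mysql container has started, not that the server inside it is ready to accept connections, which is why the seeding scripts also have to retry the connection themselves. Here is a minimal sketch of what such a retry loop can look like (the wait_for_db helper and its retry parameters are illustrative, not lifted from the repo):

import time

from sqlalchemy import create_engine
from sqlalchemy.exc import OperationalError


def wait_for_db(url, retries=30, delay=2):
    # Poll the database until it accepts connections or retries run out
    engine = create_engine(url)
    for _ in range(retries):
        try:
            # connect() raises OperationalError while MySQL is still booting
            engine.connect().close()
            return engine
        except OperationalError:
            time.sleep(delay)
    raise RuntimeError("Database not reachable after {} attempts".format(retries))


# Credentials match the environment section of the compose file above
engine = wait_for_db("mysql+mysqlconnector://root:root@mysql:3306/database")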

The repo also offers a docker-compose.cli.yaml file with an additional test container that takes care of checking that all the tables have been set up properly. We can check that everything works locally by pointing docker-compose to the alternate file with docker-compose -f docker-compose.cli.yaml up --build.

Database setup & Faker

For this example I went for a fairly simple setup: a database containing a users and a transactions table, structured as shown below.

users table:

| id | name | email | state | birthday | notes | is_adult |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Benjamin Rodriguez | ywong@yahoo.com | IT | 1985-03-26 22:37:54 | You sit clear team family trial. | 0 |
| 2 | Tracy Morales | douglas92@yahoo.com | IT | 1997-03-04 13:05:10 | News television real development. | 1 |
| 3 | Julie Thomas MD | fgarza@hotmail.com | EN | 1987-01-06 22:46:02 | Then end note. | 1 |
| 4 | Vernon Wilson | stephanie83@hotmail.com | EN | 1974-01-11 01:05:32 | Her me child prepare. | 1 |
| 5 | Adam Duran | jameschristian@yahoo.com | IT | 1977-04-28 15:26:31 | Animal hundred become right citizen. | 0 |

transactions table:

| id | order_id | user_id | ts | amount | currency_code | item | is_premium |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 4048 | 315 | 1993-09-06 01:53:12 | 8580.4900 | MGA | green apple | 1 |
| 2 | 5268 | 40 | 2019-11-27 15:10:26 | 618.3500 | XCD | yellow banana | 1 |
| 3 | 7650 | 249 | 1994-02-27 01:47:14 | 1489.7300 | HKD | smart fox | 1 |
| 4 | 7460 | 332 | 2010-03-12 11:57:00 | 1519.2600 | SEK | cool watermelon | 1 |
| 5 | 3170 | 171 | 1981-05-30 21:15:37 | 4565.9800 | AFN | strange pear | 1 |
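
For reference, here is a sketch of how these two tables could be declared with SQLAlchemy Core; the column names come from the sample rows above, but the types and lengths are inferred, and the actual definitions in the repo may differ:

from sqlalchemy import (
    MetaData, Table, Column, Integer, String, Text, DateTime, Numeric, Boolean
)

metadata = MetaData()

# Columns mirror the users table shown above
users = Table(
    "users", metadata,
    Column("id", Integer, primary_key=True),
    Column("name", String(255)),
    Column("email", String(255)),
    Column("state", String(2)),
    Column("birthday", DateTime),
    Column("notes", Text),
    Column("is_adult", Boolean),
)

# Columns mirror the transactions table shown above
transactions = Table(
    "transactions", metadata,
    Column("id", Integer, primary_key=True),
    Column("order_id", Integer),
    Column("user_id", Integer),
    Column("ts", DateTime),
    Column("amount", Numeric(10, 4)),
    Column("currency_code", String(3)),
    Column("item", String(255)),
    Column("is_premium", Boolean),
)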

As mentioned above, Faker is used to populate these tables. The library ships with a number of built-in data providers that can replicate pretty much any kind of entry (emails, addresses, zip codes, etc.). As an example, seed_transactions.py creates 2,000 transactions for the same user ids created by seed_users.py (a sketch of which follows the snippet below).

def insert_transactions(self):
    # Reuse the user ids already seeded by seed_users.py
    uids = self.get_users()
    transactions = []
    for _ in range(2000):
        transaction = {
            "order_id": faker.pyint(min_value=1, max_value=9999, step=1),
            "user_id": faker.random.choice(uids),
            "ts": faker.date_time(),
            "amount": faker.pydecimal(
                left_digits=4, right_digits=2, min_value=1
            ),
            "currency_code": faker.currency_code(),
            "item": faker.random.choice(constants['products']),
            "is_premium": faker.boolean(chance_of_getting_true=50)
        }
        transactions.append(transaction)
    # Bulk insert: executemany with the list of dicts built above
    self.connection.execute(self.Transactions.insert(), transactions)
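
For completeness, the row generation in seed_users.py might look something like the sketch below, mirroring the snippet above; names like self.Users, the batch size, and the choice of providers are assumptions, not the repo's actual code:

def insert_users(self):
    users = []
    for _ in range(500):  # arbitrary batch size for this sketch
        user = {
            "name": faker.name(),
            "email": faker.email(),
            # the sample data only shows IT/EN; the repo may use another provider
            "state": faker.random.choice(["IT", "EN"]),
            "birthday": faker.date_time(),
            "notes": faker.sentence(),
            # the repo may derive is_adult from birthday; a random flag keeps it short
            "is_adult": faker.boolean(chance_of_getting_true=50),
        }
        users.append(user)
    self.connection.execute(self.Users.insert(), users)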

Testing that everything works fine

To test that everything works flawlessly, a second docker-compose file is included in the repo. This file has the same content as the original one, except that it spawns an extra container running a Python script (based on pytest) that checks the tables have been created correctly. Pytest fixtures can be used to set up a db connection before launching the actual tests:

import pytest

@pytest.fixture
def db_conn():
    from sqlalchemy import create_engine
    from sqlalchemy.engine.url import URL
    from sqlalchemy.ext.automap import automap_base
    from sqlalchemy.exc import InterfaceError

    config = {
        "drivername": "mysql+mysqlconnector",
        "username": "root",
        "password": "root",
        "host": "mysql",
        "port": "3306",
        "database": "database",
    }

    Base = automap_base()

    try:
        # URL(**config) builds the connection string from the dict above
        engine = create_engine(URL(**config))
    except InterfaceError:
        print("Something went wrong while connecting to MySQL db")

    # Reflect the existing tables into the automap base
    Base.prepare(engine, reflect=True)
    yield engine
    print("Disposing engine")
    engine.dispose()


def test_user_table_exists(db_conn):
    assert db_conn.has_table("users")

def test_tx_table_exists(db_conn):
    assert db_conn.has_table("transactions")
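
Beyond checking that the tables exist, the same fixture can be used to assert that the expected columns are present, for instance through SQLAlchemy's inspector (this extra test is a suggestion, not part of the repo):

from sqlalchemy import inspect

def test_users_columns(db_conn):
    # Column names taken from the users table shown earlier
    columns = {col["name"] for col in inspect(db_conn).get_columns("users")}
    assert {"id", "name", "email", "is_adult"} <= columns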

Setting up GitHub Actions

To make sure that no change will break the environment, the above-mentioned docker-compose file can be integrated as a CI step via a GitHub Action. In a nutshell, the action takes care of building the images, setting up the containers and running the tests.py file. The workflow is pretty basic and is triggered both on new pull requests and every time a commit is pushed to master.

name: run-tests

on:
  push:
    branches:
      - master
  pull_request:
    branches:
      - master

jobs:
  build:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - name: Build containers
      run: docker-compose -f docker-compose.cli.yaml up -d
    - name: Start db
      run: docker start mysql_db
    - name: Seed db mock entries
      run: docker start mysql_seeding
    - name: Run tests
      run: docker start mysql_seeding_test --attach

The --attach flag sends the output of the test script to stdout, allowing it to be inspected further.

edit: For some reason, using the --attach flag when running the tests makes GitHub Actions hang until the step times out (~6 hrs).

Conclusions

A few points have been discussed in this post:

  1. How to structure a back-end development environment without provisioning extra databases.
  2. How to use docker-compose to set up different db instances and seed them with mock data.
  3. How to set up GitHub Actions to integrate the tests in the CI pipeline, checking that all the tables needed have been set up correctly.