Build a reporting dashboard

Last modified on 08-Aug-24

This article offers an example for building a data quality reporting dashboard using the Soda Cloud Reporting API. Such a dashboard enables your team to understand how healthy your data is and how the team is using Soda Cloud. The following diagram represents the system this example builds.

Overview of API to Dashboards system

Prerequisites and limitations
Set up a virtual Python environment
Set up Python ingestion
Capture data from your Soda Cloud account
Move the captured data into a SQL warehouse
Build a Dataset Health-Over-Time dashboard
Go further


Prerequisites and limitations

  • You have some knowledge of Python and are familiar with pandas and HTTP request libraries such as httpx.
  • You have installed Python 3.8, 3.9, or 3.10.
  • You have a Soda Cloud account.
  • You have installed Soda Library in your environment and connected it to your Soda Cloud account.
  • You have used Soda Library to run at least one scan against data in a dataset.
  • You are familiar with the Soda Cloud Reporting API.
  • This example does not support SSO.

Set up a virtual Python environment

  1. As a best practice, set up a Python virtual environment for each project so that you keep your setups isolated and avoid library clashes. The example below uses the built-in venv module.
    python3 -m venv ~/venvs/soda-reporting-api-ingest
    
  2. Activate the virtual environment using the following command.
    source ~/venvs/soda-reporting-api-ingest/bin/activate
    

    Example output:

    (soda-reporting-api-ingest) -> ~/workspace/soda_reporting_ingestion git(main):
    
  3. When you have completed this tutorial, you can deactivate your virtual environment using the following command.
    deactivate
    

Set up Python ingestion

  1. To connect to the API endpoints you want to access, use an HTTP request library. This tutorial uses httpx; use the following command to install it, along with the pandas and SQLAlchemy packages that later steps require.
    pip install httpx pandas sqlalchemy
    
  2. This example moves the data it captures from your Soda Cloud account into a Snowflake data source; it requires the snowflake-sqlalchemy SQLAlchemy plugin. If you use a different type of warehouse, find a corresponding plugin, or check SQLAlchemy's built-in database compatibility.
    Alternatively, you can list and save all the requirements in a requirements.txt file and install them from the command line using pip install -r requirements.txt.
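    For a Snowflake target, as in the rest of this tutorial, install the plugin with pip.
    pip install snowflake-sqlalchemy
    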
  3. Configure a few variables in a Python dictionary to contain static information such as the API URL and the endpoints to which you want to connect. Because the Soda Cloud Reporting API must identify your Soda Cloud account, create a data class to contain your Soda Cloud authentication credentials as per the following. You authenticate to the Reporting API using HTTP Basic Authentication and your Soda Cloud username and password.
     from dataclasses import dataclass
     from typing import Dict
     import httpx
     import pandas as pd

     API_MAIN_URL = "https://reporting.cloud.soda.io/v1"
     ENDPOINTS: Dict[str, str] = {
         "dataset_health": "/quality/dataset_health",
         "datasets": "/coverage/datasets",
     }
    
     # You can assign defaults in the classes, if you prefer
     @dataclass
     class ApiAuth:
         soda_username: str
         soda_password: str
    
  4. All the Soda Cloud Reporting API payloads use the following type of structure.
    {
      "resource": "string",
      "data": ...
    }
    

    In most cases, the data object is a list of dicts, but for the dataset_coverage endpoint it is a dict of dicts. Define a function called get_results that issues the HTTP request and returns a well-formed pandas DataFrame; a variant for the dict-of-dicts shape follows below.

    def get_results(url: str, api_credentials: ApiAuth) -> pd.DataFrame:
        request_result = httpx.post(url, auth=(api_credentials.soda_username, api_credentials.soda_password))

        # Check that the response is good
        if request_result.status_code == 200:
            result_json = request_result.json().get("data", {})
            return pd.DataFrame.from_records(result_json)
        else:
            raise httpx.RequestError(f"{request_result.status_code=}, {request_result.json()=}")
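
    The get_results helper assumes the data object is a list of dicts. For the dataset_coverage endpoint, whose data object is a dict of dicts, a minimal sketch of a variant follows; the name get_results_nested is illustrative, not part of the Reporting API.

    def get_results_nested(url: str, api_credentials: ApiAuth) -> pd.DataFrame:
        # Same request and error handling as get_results
        request_result = httpx.post(url, auth=(api_credentials.soda_username, api_credentials.soda_password))
        if request_result.status_code != 200:
            raise httpx.RequestError(f"{request_result.status_code=}, {request_result.json()=}")

        # orient="index" turns each top-level key of the dict of dicts into a row
        return pd.DataFrame.from_dict(request_result.json().get("data", {}), orient="index")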
    

Capture data from your Soda Cloud account

  1. Use the following code to capture data from the datasets endpoint.
     api_credentials = ApiAuth(soda_username='fizz@soda.io', soda_password='fizzIsMoreThanBuzzAtSoda')
    
     datasets = get_results(
         f"{API_MAIN_URL}{ENDPOINTS['datasets']}",
         api_credentials
     )
    
  2. Because the datasets object is a pandas.DataFrame, view its head with the following pandas command.
    datasets.head()
    

    The output appears similar to the following example.

    | dataset_id                           | dataset_name | tags | number_of_failed_tests | is_deleted | last_scan_time                   | 
    | ------------------------------------ | ------------ | ---- | ---------------------- | ---------- | -------------------------------- |
    | 0301f146-3a0f-4437-b8cf-974936cbffda | subscription | []   | 0                      | False      | 2021-09-16T12:43:59.493882+00:00 |
    | 3927e0eb-e543-4ef5-9626-08cb220cc43a | esg          | []   | 1                      | False      | 2021-07-05T09:26:48.948338+00:00 |
    | 39628650-59f5-4801-8dfe-5b063e5d611c | products     | []   | 3                      | False      | 2021-11-04T08:13:15.173895+00:00 |
    | 3b6b8f89-c450-4eb0-b910-92a29a0757a9 | booleancheck | []   | 0                      | False      | 2021-08-25T12:42:22.133490+00:00 |
    | 450f5de3-3b79-4fe5-a781-3e7441e06a70 | customers    | []   | 3                      | False      | 2021-11-04T08:12:43.519556+00:00 |
    
  3. Ensure that the content of the tags column is compatible with most SQL databases. Because tags in Soda are a list of strings, convert each list into a single comma-separated string.
    datasets["tags"] = datasets["tags"].str.join(',')
    
  4. The dataset_health endpoint tracks the number of tests that passed per dataset per scan date and calculates a percentage_of_passing_tests to use as your dataset health score. Capture the data from this endpoint as follows.
     dataset_health = get_results(
         f"{API_MAIN_URL}{ENDPOINTS['dataset_health']}",
         api_credentials
     )
    
     dataset_health.head()
    

    Example output:

    | dataset_id                           | scan_date  | critical | info | warning | number_of_tests | percentage_passing_tests |
    | ------------------------------------ | ---------- | -------- | ---- | ------- | --------------- | ------------------------ |
    | 0301f146-3a0f-4437-b8cf-974936cbffda | 2021-09-16 | 0        | 1    | 0       | 1               | 100.000000               |
    | 3927e0eb-e543-4ef5-9626-08cb220cc43a | 2021-06-24 | 0        | 6    | 0       | 6               | 100.000000               |
    | 3927e0eb-e543-4ef5-9626-08cb220cc43a | 2021-06-25 | 1        | 5    | 0       | 6               | 83.333333                |
    | 3927e0eb-e543-4ef5-9626-08cb220cc43a | 2021-06-26 | 2        | 4    | 0       | 6               | 66.666667                |
    | 3927e0eb-e543-4ef5-9626-08cb220cc43a | 2021-06-27 | 1        | 5    | 0       | 6               | 83.333333                |
    

    The results from the second query include a dataset_id, but not the name or any of the other information that the datasets query returns. When you build a dashboard, you can join the two results so that you can present the dataset_name in the reporting dashboard, rather than the dataset_id; if you prefer to join in pandas before loading the data, see the sketch below.
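
    For reference, a minimal sketch of that join in pandas; the variable name dataset_health_named is illustrative.

    # Left-join the dataset names onto the health results captured earlier
    dataset_health_named = dataset_health.merge(
        datasets[["dataset_id", "dataset_name"]],
        on="dataset_id",
        how="left",
    )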

Move the captured data into a SQL warehouse

Having captured the Soda Cloud data that your Analytics Engineers need to compose dashboards, move the data into the storage-and-compute space that your reporting tools use, such as a SQL warehouse. This example uses a Snowflake data source and leverages the pandas to_sql() method, which itself relies on SQLAlchemy, a database abstraction library for Python.

  1. Define a data class to contain the data source credentials and a function that enables you to move data into a table. If you use a non-Snowflake data source, you may need to modify the credentials class to contain the appropriate parameters; a generic alternative is sketched after this list.
     from sqlalchemy import create_engine
     from snowflake.sqlalchemy import URL
    
     @dataclass
     class SnowflakeCredentials:
         account: str = "<your_snowflake_account>"
         user: str = "<your_username>"
         password: str = "<your_password>"
         database: str = "<target_database>"
         schema: str = "<target_schema>"
         warehouse: str = "<snowflake_warehouse_to_use>"
    
     def push_to_db(
         db_credentials: SnowflakeCredentials,
         df: pd.DataFrame,
         qualified_target_table_name: str,
         if_exists: str = "replace",
     ):
         db_url_string = URL(
             account=db_credentials.account,
             user=db_credentials.user,
             password=db_credentials.password,
             database=db_credentials.database,
             schema=db_credentials.schema,
             warehouse=db_credentials.warehouse,
         )
         engine = create_engine(db_url_string)
         df.to_sql(qualified_target_table_name, con=engine, if_exists=if_exists, index=False)
    
  2. Move the two sets of data into the data source.
    push_to_db(SnowflakeCredentials(), datasets, 'datasets_report')
    push_to_db(SnowflakeCredentials(), dataset_health, 'dataset_health_report')
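
If your warehouse is not Snowflake, the same pattern works with a plain SQLAlchemy connection URL instead of the Snowflake URL helper. A minimal sketch for a hypothetical PostgreSQL target; the host, credentials, and database names are placeholders.

    from sqlalchemy import create_engine

    # Standard SQLAlchemy URL; swap in the driver and credentials for your warehouse
    engine = create_engine("postgresql://<user>:<password>@<host>:5432/<database>")

    datasets.to_sql("datasets_report", con=engine, if_exists="replace", index=False)
    dataset_health.to_sql("dataset_health_report", con=engine, if_exists="replace", index=False)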
    

Build a Dataset Health-Over-Time dashboard

To build a dashboard, this example uses Redash, an open-source, self-hosted service. You may wish to use a different solution such as Metabase, Lightdash, Looker, or Tableau. This example performs simple data transformations directly in Redash, but you may wish to use a transformation tool such as dbt, instead.

  1. Use a join to enrich the dataset_health data with the datasets data so as to extract the dataset's name. This example also adds a Common Table Expression (CTE) that derives the median and standard deviation of the number of tests per scan date, to plot as project-wide benchmarks in the visualization.
     with descriptives as (
         select
             scan_date,
             median(number_of_tests) as median_tests_in_project,
             stddev(number_of_tests) as stddev_tests_in_project
         from reporting_api.dataset_health_report
         group by 1
     )
     select
         datasets.dataset_name as "dataset_name::filter",  -- alias for the Redash dataset-level filter
         h.scan_date,
         h.percentage_passing_tests as passing_tests,
         h.number_of_tests,
         d.median_tests_in_project,
         d.stddev_tests_in_project

     from reporting_api.dataset_health_report h

     join descriptives d
         on to_date(h.scan_date) = to_date(d.scan_date)

     join reporting_api.datasets_report datasets
         on h.dataset_id = datasets.dataset_id

     where datasets.is_deleted = false
    
    
  2. In Redash, this example uses the alias dataset_name::filter to set up a query filter that filters the whole dashboard. Plot the "% passing tests" metric from dataset_health over time. In Redash, this example sets up the plot as per the following image.

    % passing tests metric in Redash
  3. Make a second plot that displays the number of checks implemented on each dataset over time, as well as a project-wide benchmark, using the median calculation derived in the SQL query above. Use two other metrics:
    • # of tests on dataset
    • median number of tests in project
      By plotting the median number of tests metric as a line, you give viewers insight into the check coverage of a dataset relative to other datasets in the project. You can also get similar information from the dataset_coverage endpoint.

      Plot of dataset test coverage over time
  4. Make a third plot to get an overview of the latest Dataset Health results for your project. For this plot, the SQL query captures only the last-known scan date for each dataset.
     select
         dataset_name,
         percentage_passing_tests,
         number_of_tests,
         scan_date
    
     from reporting_api.dataset_health_report r
    
     join reporting_api.datasets_report d
         on r.dataset_id = d.dataset_id
    
     where number_of_tests > 0
    
     qualify row_number() over (partition by dataset_name order by scan_date desc) = 1
    
     order by 2
    
  5. Create a table visualization to identify the health of your datasets, as per the following image.

    Dataset health last snapshot overview visualization
  6. After creating each visualization from a query, you can add them all to a dashboard that your colleagues can access. In Redash, the example dashboard appears as per the image below and includes a fourth query drawn from the dataset_coverage endpoint.

    Example of a finished overview dashboard in Redash

Go further

