Correlated test failure sets per test and average size of correlation sets#

This notebook outputs 2 artifacts:

  1. A parquet file that provides, for a given test, all of the other tests that are highly correlated with it (correlation coefficient greater than 0.9). Tests with no highly correlated tests are omitted, so a test that does not appear in the file has no highly correlated tests at this time. The correlation is calculated on all available data exposed by the Red Hat test grid instance at the time the notebook is run.

  2. A summary metric that can be easily tracked over time that represents the average size of correlated test sets in the above parquet.

Note: This notebook follows a very similar approach to an earlier EDA notebook where we correlated failures with a different dataset. For simplicity, much of the reasoning behind the decisions made in this notebook has been omitted here, but interested readers can find it in that earlier notebook :)

related issue #139

# Import libraries
import gzip
import json
import os
import numpy as np
import pandas as pd
import datetime

from ipynb.fs.defs.metric_template import decode_run_length
from ipynb.fs.defs.metric_template import CephCommunication
from ipynb.fs.defs.metric_template import save_to_disk, read_from_disk
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())
True
## Specify variables

METRIC_NAME = "correlation"

# Specify the path for input grid data,
INPUT_DATA_PATH = "../../../../data/raw/testgrid_183.json.gz"

# Specify the path for output metric data
OUTPUT_DATA_PATH = f"../../../../data/processed/metrics/{METRIC_NAME}"

## CEPH Bucket variables
## Create a .env file on your local machine with the correct configs.
s3_endpoint_url = os.getenv("S3_ENDPOINT")
s3_access_key = os.getenv("S3_ACCESS_KEY")
s3_secret_key = os.getenv("S3_SECRET_KEY")
s3_bucket = os.getenv("S3_BUCKET")
s3_path = os.getenv("S3_PROJECT_KEY", "ai4ci/testgrid/metrics")
s3_input_data_path = "raw_data"
AUTOMATION = os.getenv("IN_AUTOMATION")
## Import data
timestamp = datetime.datetime.today()

if AUTOMATION:
    filename = f"testgrid_{timestamp.day}{timestamp.month}.json"
    cc = CephCommunication(s3_endpoint_url, s3_access_key, s3_secret_key, s3_bucket)
    s3_object = cc.s3_resource.Object(s3_bucket, f"{s3_input_data_path}/{filename}")
    file_content = s3_object.get()["Body"].read().decode("utf-8")
    testgrid_data = json.loads(file_content)

else:
    with gzip.open(INPUT_DATA_PATH, "rb") as read_file:
        testgrid_data = json.load(read_file)

Calculation#

Here we iterate through each grid in our dataset and collect the names of all the tests that fail during the same build. We store these groups in the failure_groups list, keeping only builds in which more than one test failed.

failure_groups = []

for tab in list(testgrid_data.keys()):
    for grid in testgrid_data[tab].keys():
        current_grid = testgrid_data[tab][grid]

        tests = [
            current_grid["grid"][i]["name"] for i in range(len(current_grid["grid"]))
        ]
        # unroll the run-length encoding and set bool for failure or not (x == 12)
        decoded = [
            (
                np.array(decode_run_length(current_grid["grid"][i]["statuses"])) == 12
            ).tolist()
            for i in range(len(current_grid["grid"]))
        ]

        matrix = pd.DataFrame(zip(tests, decoded), columns=["test", "values"])
        matrix = pd.DataFrame(matrix["values"].to_list(), index=matrix["test"])

        # DataFrame.iteritems() was removed in pandas 2.0; items() is equivalent
        for c, items in matrix.items():
            if len(items[items].index) > 1:
                failure_groups.append(items[items].index)
failure_groups = pd.Series(failure_groups)
len(failure_groups)
20132
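
The grouping step above can be illustrated on a toy grid. This is only a sketch: `decode_run_length_sketch` is a hypothetical stand-in for the imported `decode_run_length` helper, and it assumes each entry in `statuses` is a dict with `count` and `value` keys. The test names and status values are made up; 12 is the status code that the cell above compares against.

```python
import numpy as np
import pandas as pd

FAIL = 12  # status code the notebook cell compares against


def decode_run_length_sketch(statuses):
    """Expand [{"count": n, "value": v}, ...] into a flat list of values.

    Hypothetical stand-in for decode_run_length; the input layout is assumed.
    """
    decoded = []
    for run in statuses:
        decoded.extend([run["value"]] * run["count"])
    return decoded


# Toy grid: three tests observed over four builds.
toy_grid = [
    {"name": "test_a", "statuses": [{"count": 1, "value": 12}, {"count": 3, "value": 1}]},
    {"name": "test_b", "statuses": [{"count": 2, "value": 12}, {"count": 2, "value": 1}]},
    {"name": "test_c", "statuses": [{"count": 4, "value": 1}]},
]

# Rows are tests, columns are builds, cells are True where the test failed.
failed = pd.DataFrame(
    [np.array(decode_run_length_sketch(t["statuses"])) == FAIL for t in toy_grid],
    index=[t["name"] for t in toy_grid],
)

# One group per build (column) in which more than one test failed.
toy_groups = [list(col[col].index) for _, col in failed.items() if col.sum() > 1]
print(toy_groups)
```

Only the first build contributes a group here, because it is the only column with more than one failing test.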

Now we want to define a vocabulary for all of the unique tests in our dataset so that we can encode our failure sets using a binary encoding scheme.

vocab = set()
count = 0
for fg in failure_groups:
    count += len(fg)
    vocab.update(fg)

vocab = list(vocab)
print(count)
len(vocab)
194477
8935

Confirm that there are no duplicates in the vocab to ensure we have a unique set

len(pd.Series(vocab).unique()) == len(vocab)
True

Now we’ll use the below function to create our binary encoded vectors for our correlation analysis

def encode_tests(job):
    # Build a 0/1 vector over the vocabulary: 1 if the test is in this failure group
    return [1 if v in job else 0 for v in vocab]
encoded = failure_groups.apply(encode_tests)
encoded.head()
0    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
dtype: object
df_encoded = pd.DataFrame(encoded.array, columns=vocab)
df_encoded.head()
openshift-tests.[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [Top Level] [sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should enforce policy based on NamespaceSelector with MatchExpressions[Feature:NetworkPolicy] [Skipped:Network/OpenShiftSDN/Multitenant] [Suite:openshift/conformance/parallel] [Suite:k8s] [5] openshift-tests.[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: dir] [Testpattern: Dynamic PV (ntfs)][sig-windows] subPath should be able to unmount after the subpath directory is deleted [Suite:openshift/conformance/parallel] [Suite:k8s] openshift-tests.[sig-cli] Kubectl client Simple pod [Top Level] [sig-cli] Kubectl client Simple pod should support exec [Suite:openshift/conformance/parallel] [Suite:k8s] openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/parallel] [12] openshift-tests.[sig-network] DNS should provide DNS for services [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s] [5] operator.Run multi-stage test e2e-gcp-ovn-upgrade - e2e-gcp-ovn-upgrade-ipi-deprovision-deprovision container test openshift-tests.[Conformance][Area:Networking][Feature:Router] The HAProxy router should respond with 503 to unrecognized hosts [Suite:openshift/conformance/parallel/minimal] openshift-tests.[sig-auth][Feature:OpenShiftAuthorization] The default cluster RBAC policy should have correct RBAC rules [Suite:openshift/conformance/parallel] [1] openshift-tests.[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Dynamic PV (ext3)] volumes [Top Level] [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: tmpfs] [Testpattern: Dynamic PV (ext3)] volumes should allow exec of files on the volume [Suite:openshift/conformance/parallel] [Suite:k8s] openshift-tests.[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and 
client [Top Level] [sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should enforce multiple ingress policies with ingress allow-all policy taking precedence [Feature:NetworkPolicy] [Skipped:Network/OpenShiftSDN/Multitenant] [Suite:openshift/conformance/parallel] [Suite:k8s] [14] ... openshift-tests.[sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: blockfs] [Testpattern: Dynamic PV (default fs)] provisioning [Top Level] [sig-storage] In-tree Volumes [Driver: local][LocalVolumeType: blockfs] [Testpattern: Dynamic PV (default fs)] provisioning should provision storage with mount options [Suite:openshift/conformance/parallel] [Suite:k8s] operator.Run multi-stage test e2e-gcp-ovn-upgrade - e2e-gcp-ovn-upgrade-gather-network container test openshift-tests.[sig-storage] In-tree Volumes [Driver: hostPath] [Testpattern: Inline-volume (ext3)] volumes [Top Level] [sig-storage] In-tree Volumes [Driver: hostPath] [Testpattern: Inline-volume (ext3)] volumes should store data [Suite:openshift/conformance/parallel] [Suite:k8s] openshift-tests.[sig-network] Services should create endpoints for unready pods [Suite:openshift/conformance/parallel] [Suite:k8s] [13] openshift-tests.[sig-devex][Feature:ImageEcosystem][Slow] openshift images should be SCL enabled using the SCL in s2i images "registry.redhat.io/rhscl/python-36-rhel7" should be SCL enabled openshift-tests.[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should allow ingress access from updated namespace [Feature:NetworkPolicy] [Skipped:Network/OpenShiftSDN/Multitenant] [Suite:openshift/conformance/parallel] [Suite:k8s] [7] openshift-tests.[sig-apps] CronJob should replace jobs when ReplaceConcurrent [Suite:openshift/conformance/parallel] [Suite:k8s] openshift-tests.[k8s.io] Variable Expansion should allow substituting values in a container's command [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s] 
openshift-tests.[Feature:ProjectAPI] TestProjectWatchWithSelectionPredicate [Top Level] [Feature:ProjectAPI] TestProjectWatchWithSelectionPredicate should succeed [Suite:openshift/conformance/parallel] openshift-tests.[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (ntfs)][sig-windows] subPath should be able to unmount after the subpath directory is deleted [LinuxOnly] [Suite:openshift/conformance/parallel] [Suite:k8s]
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 8935 columns
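
To make the encoding concrete, here is a minimal sketch on made-up data. The names `toy_groups`, `toy_vocab`, and `encode_group` are illustrative and not part of the notebook; the vocabulary is sorted only so that the column order is reproducible.

```python
import pandas as pd

# Made-up failure groups: each inner list holds tests that failed together.
toy_groups = [
    ["test_a", "test_b"],
    ["test_b", "test_c"],
    ["test_a", "test_b", "test_c"],
]

# Unique test names; sorted because a plain set has no guaranteed order.
toy_vocab = sorted(set().union(*toy_groups))


def encode_group(group, vocab):
    """Return a 0/1 vector marking which vocabulary entries are in the group."""
    members = set(group)  # set lookup avoids rescanning the group for each test
    return [1 if test in members else 0 for test in vocab]


# One row per failure group, one column per test in the vocabulary.
toy_encoded = pd.DataFrame(
    [encode_group(g, toy_vocab) for g in toy_groups], columns=toy_vocab
)
print(toy_encoded)
```

Each row is one failure group and each column one test, which is the shape that `DataFrame.corr()` expects when correlating tests with each other.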

# fraction of failure groups in which each test is present; failure rate per test
perc_present = df_encoded.sum() / len(df_encoded)
perc_present.sort_values(ascending=False).head(3)
Overall                                                                                   0.777369
openshift-tests.Monitor cluster while tests execute                                       0.163272
openshift-tests.[sig-network] pods should successfully create sandboxes by getting pod    0.159944
dtype: float64
# Total failure count present in the data; failure per test
occurrence_count = df_encoded.sum()
occurrence_count.sort_values(ascending=False).head(3)
Overall                                                                                   15650
openshift-tests.Monitor cluster while tests execute                                        3287
openshift-tests.[sig-network] pods should successfully create sandboxes by getting pod     3220
dtype: int64

We also want to make sure that high correlation values are not simply an artifact of failed test sets that appear only once in our dataset; each test should affect multiple jobs. For example, if a set of tests failed together in a single job and none of those tests failed anywhere else, every pair of tests in that set would appear to be 100% correlated, when in fact that is merely a consequence of insufficient data. To prevent this, we ignore any test that occurs in only one job: we use occurrence_count to build a list of tests that occur at most once, then drop those columns from our working DataFrame.

filter_unique = list(occurrence_count[occurrence_count.values <= 1].index)
df_encoded = df_encoded.drop(filter_unique, axis=1)
df_encoded.shape
(20132, 7330)
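
The effect described above is easy to reproduce on a small made-up example. In this sketch the column names are hypothetical: two tests that each fail exactly once, in the same job, come out perfectly correlated, and the occurrence filter removes them.

```python
import pandas as pd

# Five jobs; rare_x and rare_y each fail exactly once, in the same job.
toy = pd.DataFrame(
    {
        "rare_x": [1, 0, 0, 0, 0],
        "rare_y": [1, 0, 0, 0, 0],
        "common": [1, 1, 0, 1, 0],
    }
)

# Identical columns, so the Pearson coefficient is 1 (up to rounding),
# even though a single shared failure is very weak evidence.
spurious = toy.corr().loc["rare_x", "rare_y"]
print(spurious)

# Same kind of filter as in the notebook: drop tests that occur at most once.
counts = toy.sum()
kept = toy.drop(columns=counts[counts <= 1].index)
print(list(kept.columns))
```

A threshold of one occurrence is the most lenient choice that removes this artifact; a higher minimum count would be stricter but is not what the notebook uses.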
# This takes about 2 hours on the full dataset; a different approach may be needed.
# TODO: try with dask
corr_matrix = df_encoded.corr()
# For each feature, find the other features that are correlated by more than 0.9
top_correlation = {}

for c in corr_matrix.columns:
    top_correlation[c] = []
    series = corr_matrix.loc[c]

    for i, s in enumerate(series):
        if s > 0.90 and series.index[i] != c:
            top_correlation[c].append((series.index[i], s))

len(top_correlation)
7330
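
The selection loop above can be sketched on a toy correlation matrix. `top_correlated` is a hypothetical helper name, and the data is made up; the sketch uses boolean masks on each row instead of an inner Python loop, but applies the same rule (coefficient strictly greater than the threshold, excluding the test itself).

```python
import pandas as pd

# Made-up binary failure data: test_a and test_b always fail together.
toy = pd.DataFrame(
    {
        "test_a": [1, 1, 0, 0, 1, 0],
        "test_b": [1, 1, 0, 0, 1, 0],
        "test_c": [0, 1, 1, 0, 0, 1],
    }
)
corr = toy.corr()

THRESHOLD = 0.90


def top_correlated(corr_matrix, threshold):
    """Map each column to [(other_column, coefficient), ...] above threshold."""
    result = {}
    for name, row in corr_matrix.iterrows():
        # Keep strong correlations, excluding the column's correlation with itself.
        strong = row[(row > threshold) & (row.index != name)]
        result[name] = list(strong.items())
    return result


toy_top = top_correlated(corr, THRESHOLD)
for name, partners in toy_top.items():
    print(name, [other for other, _ in partners])
```

As in the notebook, every column gets an entry, including those with an empty list, so the number of keys equals the number of columns.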

Examine example output#

Let’s take a look at which tests are highly correlated with one of the tests in our results list (the second entry, corr_sets[1]).

# top_correlation has a number of empty entries, as not all tests are highly correlated with others;
# let's keep only the sets that have at least 1 highly correlated test

pd.set_option("display.max_colwidth", 150)
corr_sets = []
for i in top_correlation.items():
    if len(i[1]) >= 1:
        corr_sets.append(i)
print(f"{len(corr_sets)} sets of correlated tests \n")
print(f"Feature of interest: {corr_sets[1][0]}")
pd.DataFrame(corr_sets[1][1], columns=["test_name", "correlation coefficient"])
3239 sets of correlated tests 

Feature of interest: openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/parallel] [12]
test_name correlation coefficient
0 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 1.000000
1 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 1.000000
2 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 1.000000
3 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 1.000000
4 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 1.000000
5 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 1.000000
6 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 1.000000
7 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 1.000000
8 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 1.000000
9 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 1.000000
10 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 1.000000
11 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 0.975876
12 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 1.000000
if not AUTOMATION:
    test_name = "openshift-tests.[k8s.io] Security Context When creating a container with runAsUser should run the container with uid 65534 [LinuxOnly] [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]"  # noqa
    num = occurrence_count.loc[test_name]
    print(f"{num} : the number of times this test failed in our data set")
5 : the number of times this test failed in our data set
lst = []
focus = corr_sets[1][1]
for j in focus:
    lst.append((j[0], occurrence_count.loc[j[0]]))

pd.DataFrame(lst, columns=["test_name", "num_occurrences"])
test_name num_occurrences
0 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 20
1 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 20
2 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 20
3 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 20
4 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 20
5 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 20
6 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 20
7 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 20
8 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 20
9 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 20
10 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 20
11 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 21
12 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... 20

Save to Ceph or local#

save = pd.DataFrame(corr_sets, columns=["test_name", "correlated_tests"])
save["correlated_tests"] = save["correlated_tests"].apply(str)

if AUTOMATION:
    cc = CephCommunication(s3_endpoint_url, s3_access_key, s3_secret_key, s3_bucket)
    cc.upload_to_ceph(
        save,
        s3_path,
        f"{METRIC_NAME}/{METRIC_NAME}-{timestamp.year}-{timestamp.month}-{timestamp.day}.parquet",
    )
else:
    save_to_disk(
        save,
        OUTPUT_DATA_PATH,
        f"{METRIC_NAME}-{timestamp.year}-{timestamp.month}-{timestamp.day}.parquet",
    )
## Sanity check to see if the dataset is the same
if AUTOMATION:
    sanity_check = cc.read_from_ceph(
        s3_path,
        f"{METRIC_NAME}/{METRIC_NAME}-{timestamp.year}-{timestamp.month}-{timestamp.day}.parquet",
    ).head()
else:
    sanity_check = read_from_disk(
        OUTPUT_DATA_PATH,
        f"{METRIC_NAME}-{timestamp.year}-{timestamp.month}-{timestamp.day}.parquet",
    ).head()

sanity_check
test_name correlated_tests
0 openshift-tests.[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [Top Level] [sig-network] NetworkPolicy [LinuxOnly... [('openshift-tests.[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [Top Level] [sig-network] NetworkPolicy [LinuxO...
1 openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... [('openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformanc...
2 openshift-tests.[sig-auth][Feature:OpenShiftAuthorization] The default cluster RBAC policy should have correct RBAC rules [Suite:openshift/conform... [('openshift-tests.[sig-api-machinery][Feature:ClusterResourceQuota] Cluster resource quota should control resource limits across namespaces [Suit...
3 openshift-tests.[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [Top Level] [sig-network] NetworkPolicy [LinuxOnly... [('openshift-tests.[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [Top Level] [sig-network] NetworkPolicy [LinuxO...
4 openshift-tests.[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should allow ingress access from updated namespace... [('openshift-tests.[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should allow ingress access from updated namesp...
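
Because correlated_tests is converted with str before saving, anyone reading the file back gets strings rather than lists. The sketch below shows one possible way to look up a test and recover the pairs with `ast.literal_eval`. It assumes the stored strings are plain Python literals such as `[('name', 1.0)]`; if the coefficients were written with a NumPy scalar repr (for example `np.float64(1.0)`), this parsing would fail and another format would be needed. `correlated_with` and the toy frame are hypothetical.

```python
import ast

import pandas as pd

# Toy stand-in for the saved frame; real rows hold much longer test names.
toy_saved = pd.DataFrame(
    {
        "test_name": ["test_a", "test_b"],
        "correlated_tests": [
            str([("test_b", 1.0), ("test_c", 0.95)]),
            str([("test_a", 1.0)]),
        ],
    }
)


def correlated_with(frame, test_name):
    """Return the parsed list of (test, coefficient) pairs, or [] if absent."""
    match = frame.loc[frame["test_name"] == test_name, "correlated_tests"]
    if match.empty:
        # Tests without highly correlated partners are not in the file at all.
        return []
    return ast.literal_eval(match.iloc[0])


print(correlated_with(toy_saved, "test_a"))
print(correlated_with(toy_saved, "not_in_file"))
```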

Let's also capture the average size of correlated failure groups to track over time#

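# Count tests per set from corr_sets rather than from save["correlated_tests"]:
# that column was cast to str above, so len() on it counts characters, not tests
# (the 9224.08 in the recorded output below is such a character count).
average_corr = np.mean([len(tests) for _, tests in corr_sets])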
metric_to_save = pd.DataFrame(
    [[timestamp, average_corr]],
    columns=["timestamp", "average_number_of_correlated_failures"],
)


if AUTOMATION:
    cc.upload_to_ceph(
        metric_to_save,
        s3_path,
        f"avg_{METRIC_NAME}/avg_{METRIC_NAME}-{timestamp.year}-{timestamp.month}-{timestamp.day}.parquet",
    )
else:
    save_to_disk(
        metric_to_save,
        OUTPUT_DATA_PATH,
        f"avg_{METRIC_NAME}-{timestamp.year}-{timestamp.month}-{timestamp.day}.parquet",
    )
## Sanity check to see if the dataset is the same

if AUTOMATION:
    sanity_check = cc.read_from_ceph(
        s3_path,
        f"avg_{METRIC_NAME}/avg_{METRIC_NAME}-{timestamp.year}-{timestamp.month}-{timestamp.day}.parquet",
    ).head()
else:
    sanity_check = read_from_disk(
        OUTPUT_DATA_PATH,
        f"avg_{METRIC_NAME}-{timestamp.year}-{timestamp.month}-{timestamp.day}.parquet",
    ).head()

sanity_check
timestamp average_number_of_correlated_failures
0 2021-04-27 21:58:21.538747 9224.075023
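
Since the correlated_tests column of save holds strings, it matters whether set sizes are counted on the original lists or on their string form. A minimal sketch on made-up data (the names `toy_sets`, `avg_size`, and `avg_chars` are illustrative) shows the difference:

```python
import numpy as np

# Made-up correlation sets in the same (test_name, [(other, coeff), ...]) layout.
toy_sets = [
    ("test_a", [("test_b", 1.0), ("test_c", 0.95)]),
    ("test_b", [("test_a", 1.0)]),
]

# Average number of correlated tests per set: (2 + 1) / 2.
avg_size = np.mean([len(tests) for _, tests in toy_sets])

# Average length of the string form, which counts characters instead of tests.
avg_chars = np.mean([len(str(tests)) for _, tests in toy_sets])

print(avg_size, avg_chars)
```

An average set size can never exceed the number of remaining test columns minus one, so a value larger than that is a sign that characters are being counted.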

Conclusion#

This notebook collected all sets of highly correlated tests, i.e., sets of tests that most commonly fail together, and stored that data in Ceph (when run in automation) or locally. A user can now pull this data and, given a test name of interest, look up the list of all other highly correlated tests.

This notebook also computed a numerical value to summarize and quantify these correlations in aggregate: the average size of failure correlation sets. This value is stored in the same way, in Ceph or locally.