Correlated test failure sets per test and average size of correlation sets
Contents
Calculation#
Here we iterate through each grid in our dataset and collect the names of all the tests that fail during the same build. We store each such group of co-failing tests in the failure_groups list.
failure_groups = []
for tab in list(testgrid_data.keys()):
    for grid in testgrid_data[tab].keys():
        current_grid = testgrid_data[tab][grid]
        tests = [
            current_grid["grid"][i]["name"] for i in range(len(current_grid["grid"]))
        ]
        # unroll the run-length encoding and flag failures (status code 12)
        decoded = [
            (
                np.array(decode_run_length(current_grid["grid"][i]["statuses"])) == 12
            ).tolist()
            for i in range(len(current_grid["grid"]))
        ]
        matrix = pd.DataFrame(zip(tests, decoded), columns=["test", "values"])
        matrix = pd.DataFrame(matrix["values"].to_list(), index=matrix["test"])
        # each column is one build; keep builds where more than one test failed
        for c, items in matrix.items():
            if len(items[items].index) > 1:
                failure_groups.append(items[items].index)
failure_groups = pd.Series(failure_groups)
len(failure_groups)
20132
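The decode_run_length helper used above is defined elsewhere in this project. A minimal sketch of what it does, assuming TestGrid's run-length format where each entry is a dict with `count` and `value` keys (the format here is an assumption based on the TestGrid API, not the project's actual implementation):

```python
def decode_run_length(encoded):
    """Expand run-length encoded statuses, e.g.
    [{"count": 3, "value": 1}] -> [1, 1, 1]."""
    decoded = []
    for run in encoded:
        decoded.extend([run["value"]] * run["count"])
    return decoded


# status code 12 marks a failing cell in TestGrid
statuses = [{"count": 2, "value": 1}, {"count": 3, "value": 12}]
print(decode_run_length(statuses))  # [1, 1, 12, 12, 12]
```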
Now we want to define a vocabulary for all of the unique tests in our dataset so that we can encode our failure sets using a binary encoding scheme.
vocab = set()
count = 0
for fg in failure_groups:
    count += len(fg)
    vocab.update(fg)
vocab = list(vocab)
print(count)
len(vocab)
194477
8935
Confirm that there are no duplicates in the vocab to ensure we have a unique set
len(pd.Series(vocab).unique()) == len(vocab)
True
Now we’ll use the function below to create the binary encoded vectors for our correlation analysis.
def encode_tests(job):
    encoded = []
    for v in vocab:
        if v in job:
            encoded.append(1)
        else:
            encoded.append(0)
    return encoded
encoded = failure_groups.apply(encode_tests)
encoded.head()
0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
dtype: object
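The loop above scans the full vocabulary once per failure group. An equivalent encoding can be sketched more compactly with a set and a comprehension (a sketch with illustrative toy names, not the notebook's actual code):

```python
def encode_tests_fast(job, vocab):
    """Binary-encode one failure group against the test vocabulary."""
    job_set = set(job)  # constant-time membership checks
    return [1 if v in job_set else 0 for v in vocab]


# toy vocabulary and failure group
vocab = ["test_a", "test_b", "test_c"]
print(encode_tests_fast(["test_b", "test_c"], vocab))  # [0, 1, 1]
```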
df_encoded = pd.DataFrame(encoded.array, columns=vocab)
df_encoded.head()
df_encoded.head() output (truncated): a 5 × 8935 binary matrix whose column labels are the full test names (e.g. "openshift-tests.[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should enforce policy based on NamespaceSelector with MatchExpressions…") and whose cells are 0/1 failure indicators; the displayed rows are almost entirely zeros.
5 rows × 8935 columns
# percent that each test is present in the data; percent failure
perc_present = df_encoded.sum() / len(df_encoded)
perc_present.sort_values(ascending=False).head(3)
Overall 0.777369
openshift-tests.Monitor cluster while tests execute 0.163272
openshift-tests.[sig-network] pods should successfully create sandboxes by getting pod 0.159944
dtype: float64
# Total failure count present in the data; failure per test
occurrence_count = df_encoded.sum()
occurrence_count.sort_values(ascending=False).head(3)
Overall 15650
openshift-tests.Monitor cluster while tests execute 3287
openshift-tests.[sig-network] pods should successfully create sandboxes by getting pod 3220
dtype: int64
We also want to make sure that our correlation values are not simply artifacts of failure sets that are unique to our dataset; a test should fail across multiple jobs before we trust its correlations. For example, if a failure set occurred in only a single build and shared no tests with any other set, all of its tests would appear to be 100% correlated with one another, when in fact that is merely a consequence of insufficient data. To prevent this, we use occurrence_count to build a filter for any test that occurs only once, then drop those tests from our working DataFrame.
filter_unique = list(occurrence_count[occurrence_count.values <= 1].index)
df_encoded = df_encoded.drop(filter_unique, axis=1)
df_encoded.shape
(20132, 7330)
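The filter itself is just a threshold on per-test counts; a toy illustration of the same selection with plain Python (names are illustrative):

```python
# toy occurrence counts per test
occurrence_count = {"test_a": 5, "test_b": 1, "test_c": 3}

# tests that failed at most once carry no correlation signal
filter_unique = [t for t, n in occurrence_count.items() if n <= 1]
kept = [t for t in occurrence_count if t not in filter_unique]

print(filter_unique)  # ['test_b']
print(kept)           # ['test_a', 'test_c']
```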
# this takes time with the full dataset (~2 hours); may need a different approach
# todo: try with dask
corr_matrix = df_encoded.corr()
# For each feature, find the other features that are correlated by more than 0.9
top_correlation = {}
for c in corr_matrix.columns:
    top_correlation[c] = []
    series = corr_matrix.loc[c]
    for i, s in enumerate(series):
        if s > 0.90 and series.index[i] != c:
            top_correlation[c].append((series.index[i], s))
len(top_correlation)
7330
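On 0/1 columns, DataFrame.corr computes the Pearson correlation, which for binary data reduces to the phi coefficient. A standalone sketch (pure Python, toy vectors) of what each cell of corr_matrix represents:

```python
import math


def phi_coefficient(x, y):
    """Pearson correlation of two binary vectors (the phi coefficient)."""
    n11 = sum(a and b for a, b in zip(x, y))          # both fail
    n10 = sum(a and not b for a, b in zip(x, y))      # only x fails
    n01 = sum((not a) and b for a, b in zip(x, y))    # only y fails
    n00 = sum((not a) and (not b) for a, b in zip(x, y))  # neither fails
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


# two tests that always fail together are perfectly correlated
print(phi_coefficient([1, 0, 1, 0, 1], [1, 0, 1, 0, 1]))  # 1.0
```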
Examine example output#
Let’s go ahead and take a look at which tests are highly correlated with an example test from our results list.
# top_correlation has a number of empty entries, since not all tests are highly
# correlated with others; let's grab only the sets with at least 1 highly correlated test
pd.set_option("display.max_colwidth", 150)
corr_sets = []
for i in top_correlation.items():
    if len(i[1]) >= 1:
        corr_sets.append(i)
print(f"{len(corr_sets)} sets of correlated tests \n")
print(f"Feature of interest: {corr_sets[1][0]}")
pd.DataFrame(corr_sets[1][1], columns=["test_name", "correlation coefficient"])
3239 sets of correlated tests
Feature of interest: openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/parallel] [12]
test_name | correlation coefficient | |
---|---|---|
0 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 1.000000 |
1 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 1.000000 |
2 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 1.000000 |
3 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 1.000000 |
4 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 1.000000 |
5 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 1.000000 |
6 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 1.000000 |
7 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 1.000000 |
8 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 1.000000 |
9 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 1.000000 |
10 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 1.000000 |
11 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 0.975876 |
12 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 1.000000 |
if not AUTOMATION:
    test_name = "openshift-tests.[k8s.io] Security Context When creating a container with runAsUser should run the container with uid 65534 [LinuxOnly] [NodeConformance] [Conformance] [Suite:openshift/conformance/parallel/minimal] [Suite:k8s]"  # noqa
    num = occurrence_count.loc[test_name]
    print(f"{num} : the number of times this test failed in our data set")
5 : the number of times this test failed in our data set
lst = []
focus = corr_sets[1][1]
for j in focus:
    lst.append((j[0], occurrence_count.loc[j[0]]))
pd.DataFrame(lst, columns=["test_name", "num_occurrences"])
test_name | num_occurrences | |
---|---|---|
0 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 20 |
1 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 20 |
2 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 20 |
3 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 20 |
4 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 20 |
5 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 20 |
6 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 20 |
7 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 20 |
8 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 20 |
9 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 20 |
10 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 20 |
11 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 21 |
12 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | 20 |
Save to Ceph or local#
save = pd.DataFrame(corr_sets, columns=["test_name", "correlated_tests"])
save["correlated_tests"] = save["correlated_tests"].apply(str)
if AUTOMATION:
    cc = CephCommunication(s3_endpoint_url, s3_access_key, s3_secret_key, s3_bucket)
    cc.upload_to_ceph(
        save,
        s3_path,
        f"{METRIC_NAME}/{METRIC_NAME}-{timestamp.year}-{timestamp.month}-{timestamp.day}.parquet",
    )
else:
    save_to_disk(
        save,
        OUTPUT_DATA_PATH,
        f"{METRIC_NAME}-{timestamp.year}-{timestamp.month}-{timestamp.day}.parquet",
    )
## Sanity check to see if the dataset is the same
if AUTOMATION:
    sanity_check = cc.read_from_ceph(
        s3_path,
        f"{METRIC_NAME}/{METRIC_NAME}-{timestamp.year}-{timestamp.month}-{timestamp.day}.parquet",
    ).head()
else:
    sanity_check = read_from_disk(
        OUTPUT_DATA_PATH,
        f"{METRIC_NAME}-{timestamp.year}-{timestamp.month}-{timestamp.day}.parquet",
    ).head()
sanity_check
test_name | correlated_tests | |
---|---|---|
0 | openshift-tests.[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [Top Level] [sig-network] NetworkPolicy [LinuxOnly... | [('openshift-tests.[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [Top Level] [sig-network] NetworkPolicy [LinuxO... |
1 | openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformance/p... | [('openshift-tests.[sig-network][Feature:Router] The HAProxy router should override the route host with a custom value [Suite:openshift/conformanc... |
2 | openshift-tests.[sig-auth][Feature:OpenShiftAuthorization] The default cluster RBAC policy should have correct RBAC rules [Suite:openshift/conform... | [('openshift-tests.[sig-api-machinery][Feature:ClusterResourceQuota] Cluster resource quota should control resource limits across namespaces [Suit... |
3 | openshift-tests.[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [Top Level] [sig-network] NetworkPolicy [LinuxOnly... | [('openshift-tests.[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client [Top Level] [sig-network] NetworkPolicy [LinuxO... |
4 | openshift-tests.[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should allow ingress access from updated namespace... | [('openshift-tests.[sig-network] NetworkPolicy [LinuxOnly] NetworkPolicy between server and client should allow ingress access from updated namesp... |
Conclusion#
This notebook collected all sets of highly correlated tests, i.e., sets of tests that most commonly fail together, and stored that data in Ceph as well as locally. A user can now pull this data and, given a test name of interest, be provided with a list of all other highly correlated tests.
This notebook also computed a numerical value to summarize and quantify these correlations in aggregate: the average size of the failure correlation sets. This value is likewise stored both locally and in Ceph.
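The averaging step itself is not shown in the cells above; given the corr_sets list built earlier (pairs of a test name and its list of correlated tests), the metric can be computed along these lines (a sketch, with toy data illustrating the shape):

```python
# each entry of corr_sets is (test_name, [(correlated_test, coefficient), ...])
corr_sets = [
    ("test_a", [("test_b", 0.95), ("test_c", 0.92)]),
    ("test_d", [("test_e", 0.99)]),
]

# average number of correlated tests per set
avg_size = sum(len(correlated) for _, correlated in corr_sets) / len(corr_sets)
print(avg_size)  # 1.5
```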