Extracting the Telemetry Info for Clusters in Workload Dataset

Overview

As data scientists working on a workload fingerprinting project, the workload data we have from the Insights Operator archive is limited to a few fields such as product name, CPU usage, and memory usage. We performed the product name mapping in this issue.

The purpose of this notebook is to extract data from telemetry based on the cluster_id and timestamp given in the workload data. Metric information is extracted for both the image layers dataset and the containers dataset. The different metrics are:

  • workload:cpu_usage_cores:sum

  • workload:memory_usage_bytes:sum

  • openshift:cpu_usage_cores:sum

  • openshift:memory_usage_bytes:sum

  • cluster:cpu_usage_cores:sum

  • cluster:memory_usage_bytes:sum

Importing packages

import os
import datetime as dt
from tqdm import tqdm
import pandas as pd
import io
import boto3
import warnings
import urllib3
from dotenv import load_dotenv, find_dotenv

from prometheus_api_client import PrometheusConnect, MetricSnapshotDataFrame
urllib3.disable_warnings()
pd.set_option("display.max_columns", None)
%matplotlib inline
load_dotenv(find_dotenv())
warnings.filterwarnings("ignore")

Data Collection from bucket

First, we extract the image_layers and containers datasets from the DH-PLAYPEN bucket in order to get the list of cluster_ids and timestamps from the archive_path given in each dataset.

# CEPH Bucket variables
s3_endpoint_url = os.getenv("S3_ENDPOINT")
s3_access_key = os.getenv("S3_ACCESS_KEY")
s3_secret_key = os.getenv("S3_SECRET_KEY")
s3_bucket = os.getenv("S3_BUCKET")

# s3 resource to communicate with storage
s3 = boto3.resource(
    "s3",
    endpoint_url=s3_endpoint_url,
    aws_access_key_id=s3_access_key,
    aws_secret_access_key=s3_secret_key,
)

# access the parquet file as an s3 object

obj1 = s3.Object(
    "DH-PLAYPEN", "ccx/fingerprinting/image_layers/date=2021-05-12/2021-05-12.parquet"
)
obj2 = s3.Object(
    "DH-PLAYPEN", "ccx/fingerprinting/containers/date=2021-05-12/2021-05-12.parquet"
)
obj3 = s3.Object(
    "DH-PLAYPEN", "ccx/fingerprinting/image_layers/dataframe_image_id.parquet"
)
obj4 = s3.Object("DH-PLAYPEN", "ccx/fingerprinting/containers/df_cont_image_id.parquet")
# download the file into the buffer
buffer1 = io.BytesIO()
obj1.download_fileobj(buffer1)
buffer2 = io.BytesIO()
obj2.download_fileobj(buffer2)

# read the buffer and create the dataframe
image_layers_df = pd.read_parquet(buffer1)
containers_df = pd.read_parquet(buffer2)

Connect to Prometheus

Connecting to the Prometheus API client

We can refer to the Data Hub help page for instructions on interacting with the telemetry dataset; the Prometheus URL and access credentials can also be fetched from there.

PROM_URL = os.getenv("PROM_URL")
PROM_ACCESS_TOKEN = os.getenv("PROM_ACCESS_TOKEN")

# instantiate prometheus connector
pc = PrometheusConnect(
    url=PROM_URL,
    headers={"Authorization": f"bearer {PROM_ACCESS_TOKEN}"},
    disable_ssl=True,
)

Getting the data for a given timestamp and cluster_id for the image layers dataset

We first extract the list of archive paths and use the split method to get the timestamp and cluster_id from each path. We then use a Prometheus query to build the dataframe.
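The path layout assumed by this splitting (cluster_id as the fourth path segment, an HHMMSS-named archive file at the end) can be sketched on a made-up path; the example path below is hypothetical and only mirrors the split indices used in the loops:

```python
import datetime as dt

# Hypothetical archive path: segment [3] holds the cluster_id,
# and the file name encodes the time as HHMMSS.
path = (
    "archives/compressed/00/00003d61-9db1-4757-9cd1-84df271daeb9"
    "/20210512/123456.tar.gz"
)

cluster_id = path.split("/")[3]             # fourth path segment
hhmmss = path.split("/")[-1].split(".")[0]  # "123456"
ts = dt.datetime(
    year=2021,
    month=5,
    day=12,
    hour=int(hhmmss[0:2]),
    minute=int(hhmmss[2:4]),
    second=int(hhmmss[4:6]),
    tzinfo=dt.timezone.utc,
)
print(cluster_id, ts.isoformat())
```

The resulting `ts.timestamp()` is what gets passed as the `time` parameter of the Prometheus query.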

Metric name: workload:cpu_usage_cores:sum

# listing the archive path
image_archive_path = image_layers_df.archive_path.unique()

df_image_cpu = pd.DataFrame()

for path in tqdm(image_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'workload:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )

    df_image_cpu = df_image_cpu.append(df)
df_image_cpu.head()
100%|██████████| 2802/2802 [00:43<00:00, 64.18it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 workload:cpu_usage_cores:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 0.03230751556666692
0 workload:cpu_usage_cores:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 0.030943035959259964
0 workload:cpu_usage_cores:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 0.23523091000370105
0 workload:cpu_usage_cores:sum 00479ead-b7fc-49c2-ae20-3990a9b3d08c openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855643 0.029065674940740756
0 workload:cpu_usage_cores:sum 009ef697-17c5-431e-a5f5-42d93d9c6a5a openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855709 0.0035705586481481486
print(
    "Total number of cluster_id in original image layer dataset :",
    image_layers_df.cluster_id.nunique(),
)
print("Number of cluster_id for the extracted dataset :", df_image_cpu._id.nunique())
print("Number of timestamps :", df_image_cpu.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 1850
Number of timestamps : 1791

For the metric workload:cpu_usage_cores:sum, we observe that out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could only extract information for 1850 cluster_ids, i.e., ~66% of them, from the telemetry database.


Metric name: workload:memory_usage_bytes:sum

df_image_mem = pd.DataFrame()

for path in tqdm(image_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'workload:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    df_image_mem = df_image_mem.append(df)
df_image_mem.head()
100%|██████████| 2802/2802 [01:12<00:00, 38.68it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 workload:memory_usage_bytes:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 389971968
0 workload:memory_usage_bytes:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 385421312
0 workload:memory_usage_bytes:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 1190780928
0 workload:memory_usage_bytes:sum 00479ead-b7fc-49c2-ae20-3990a9b3d08c openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855643 269168640
0 workload:memory_usage_bytes:sum 009ef697-17c5-431e-a5f5-42d93d9c6a5a openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855709 83660800
print(
    "Total number of cluster_id in original image layer dataset :",
    image_layers_df.cluster_id.nunique(),
)
print("Number of cluster_id for the extracted dataset :", df_image_mem._id.nunique())
print("Number of timestamps :", df_image_mem.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 1805
Number of timestamps : 1749

For the metric workload:memory_usage_bytes:sum, we observe that out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could only extract information for 1805 cluster_ids, i.e., ~64.4% of them, from the telemetry database.


Metric name: openshift:cpu_usage_cores:sum

df_image_open_cpu = pd.DataFrame()

for path in tqdm(image_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'openshift:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    df_image_open_cpu = df_image_open_cpu.append(df)
df_image_open_cpu.head()
100%|██████████| 2802/2802 [01:08<00:00, 40.76it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 openshift:cpu_usage_cores:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 1.9291393691359602
0 openshift:cpu_usage_cores:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 1.7056283926121354
0 openshift:cpu_usage_cores:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 2.5198167090408625
0 openshift:cpu_usage_cores:sum 00479ead-b7fc-49c2-ae20-3990a9b3d08c openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855643 1.6697914679164034
0 openshift:cpu_usage_cores:sum 009ef697-17c5-431e-a5f5-42d93d9c6a5a openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855709 3.497572298494714
print(
    "Total number of cluster_id in original image layer dataset :",
    image_layers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :", df_image_open_cpu._id.nunique()
)
print("Number of timestamps :", df_image_open_cpu.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 1845
Number of timestamps : 1786

For the metric openshift:cpu_usage_cores:sum, we observe that out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could only extract information for 1845 cluster_ids, i.e., ~65.8% of them, from the telemetry database.


Metric name: openshift:memory_usage_bytes:sum

df_image_open_mem = pd.DataFrame()

for path in tqdm(image_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'openshift:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    df_image_open_mem = df_image_open_mem.append(df)
df_image_open_mem.head()
100%|██████████| 2802/2802 [01:06<00:00, 42.11it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 openshift:memory_usage_bytes:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 14560608256
0 openshift:memory_usage_bytes:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 14386823168
0 openshift:memory_usage_bytes:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 26840170496
0 openshift:memory_usage_bytes:sum 00479ead-b7fc-49c2-ae20-3990a9b3d08c openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855643 13228613632
0 openshift:memory_usage_bytes:sum 009ef697-17c5-431e-a5f5-42d93d9c6a5a openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855709 26006728704
print(
    "Total number of cluster_id in original image layer dataset :",
    image_layers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :", df_image_open_mem._id.nunique()
)
print("Number of timestamps :", df_image_open_mem.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 1805
Number of timestamps : 1749

For the metric openshift:memory_usage_bytes:sum, we observe that out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could only extract information for 1805 cluster_ids, i.e., ~64.4% of them, from the telemetry database.


Metric name: cluster:cpu_usage_cores:sum

df_image_cluster_cpu = pd.DataFrame()

for path in tqdm(image_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'cluster:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    df_image_cluster_cpu = df_image_cluster_cpu.append(df)
df_image_cluster_cpu.head()
100%|██████████| 2802/2802 [01:18<00:00, 35.50it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 cluster:cpu_usage_cores:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 1.9614468847026272
0 cluster:cpu_usage_cores:sum 002663ad-bcf4-4c7c-9530-ecb351fe4001 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620825377 4.345714285714284
0 cluster:cpu_usage_cores:sum 0030e092-3def-4e7a-96b7-17d45ca102fd openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620812523 11.39365149650794
0 cluster:cpu_usage_cores:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 1.7365714285713953
0 cluster:cpu_usage_cores:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 2.7550476190445634
print(
    "Total number of cluster_id in original image layer dataset :",
    image_layers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :",
    df_image_cluster_cpu._id.nunique(),
)
print("Number of timestamps :", df_image_cluster_cpu.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 2624
Number of timestamps : 2535

For the metric cluster:cpu_usage_cores:sum, we observe that out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could only extract information for 2624 cluster_ids, i.e., ~93.6% of them, from the telemetry database.


Metric name: cluster:memory_usage_bytes:sum

df_image_cluster_mem = pd.DataFrame()

for path in tqdm(image_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'cluster:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    df_image_cluster_mem = df_image_cluster_mem.append(df)
df_image_cluster_mem.head()
100%|██████████| 2802/2802 [01:04<00:00, 43.19it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 cluster:memory_usage_bytes:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 14950580224
0 cluster:memory_usage_bytes:sum 002663ad-bcf4-4c7c-9530-ecb351fe4001 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620825377 25248690176
0 cluster:memory_usage_bytes:sum 0030e092-3def-4e7a-96b7-17d45ca102fd openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620812523 29709115392
0 cluster:memory_usage_bytes:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 14772244480
0 cluster:memory_usage_bytes:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 28030951424
print(
    "Total number of cluster_id in original image layer dataset :",
    image_layers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :",
    df_image_cluster_mem._id.nunique(),
)
print("Number of timestamps :", df_image_cluster_mem.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 2673
Number of timestamps : 2581

For the metric cluster:memory_usage_bytes:sum, we observe that out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could only extract information for 2673 cluster_ids, i.e., ~95.4% of them, from the telemetry database.

In the next section, we perform an outer merge of the six datasets above. But first, since each dataset uses the same name for its value column, we rename that column in each to indicate its metric.
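The reason for renaming first can be seen on two toy frames: with identically named value columns, an outer merge would auto-suffix them as value_x/value_y, so we give each a metric-specific name up front. A minimal sketch with made-up ids and values:

```python
import pandas as pd

# Two toy metric snapshots, each renamed to carry its metric name.
cpu = pd.DataFrame(
    {"_id": ["a", "b"], "timestamp": [1, 2], "value": [0.5, 0.7]}
).rename(columns={"value": "value_workload:cpu_usage_cores:sum"})
mem = pd.DataFrame(
    {"_id": ["a", "c"], "timestamp": [1, 3], "value": [1024, 2048]}
).rename(columns={"value": "value_workload:memory_usage_bytes:sum"})

# An outer merge keeps every (_id, timestamp) pair seen in either
# frame; clusters missing a metric get NaN in that metric's column.
merged = pd.merge(cpu, mem, on=["_id", "timestamp"], how="outer")
print(merged)
```

This is why the final merged frame can cover more cluster_ids than any single metric does: a cluster only needs to appear in one of the six frames to get a row.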

Merging Datasets

Renaming the value column in each dataset:

df_image_cpu.rename(
    columns={"value": "value_workload:cpu_usage_cores:sum"}, inplace=True
)
df_image_mem.rename(
    columns={"value": "value_workload:memory_usage_bytes:sum"}, inplace=True
)
df_image_open_cpu.rename(
    columns={"value": "value_openshift:cpu_usage_cores:sum"}, inplace=True
)
df_image_open_mem.rename(
    columns={"value": "value_openshift:memory_usage_bytes:sum"}, inplace=True
)
df_image_cluster_cpu.rename(
    columns={"value": "value_cluster:cpu_usage_cores:sum"}, inplace=True
)
df_image_cluster_mem.rename(
    columns={"value": "value_cluster:memory_usage_bytes:sum"}, inplace=True
)

Merging the six datasets

# Arranging the dataset for merging
df_image_cpu_merge = df_image_cpu[
    ["_id", "timestamp", "value_workload:cpu_usage_cores:sum"]
]
df_image_mem_merge = df_image_mem[
    ["_id", "timestamp", "value_workload:memory_usage_bytes:sum"]
]
df_image_open_cpu_merge = df_image_open_cpu[
    ["_id", "timestamp", "value_openshift:cpu_usage_cores:sum"]
]
df_image_open_mem_merge = df_image_open_mem[
    ["_id", "timestamp", "value_openshift:memory_usage_bytes:sum"]
]
df_image_cluster_cpu_merge = df_image_cluster_cpu[
    ["_id", "timestamp", "value_cluster:cpu_usage_cores:sum"]
]
df_image_cluster_mem_merge = df_image_cluster_mem[
    ["_id", "timestamp", "value_cluster:memory_usage_bytes:sum"]
]

# Merging
merge1 = pd.merge(
    df_image_cpu_merge, df_image_mem_merge, on=["_id", "timestamp"], how="outer"
)
merge2 = pd.merge(merge1, df_image_open_cpu_merge, on=["_id", "timestamp"], how="outer")
merge3 = pd.merge(merge2, df_image_open_mem_merge, on=["_id", "timestamp"], how="outer")
merge4 = pd.merge(
    merge3, df_image_cluster_cpu_merge, on=["_id", "timestamp"], how="outer"
)
telemeter_image_data = pd.merge(
    merge4, df_image_cluster_mem_merge, on=["_id", "timestamp"], how="outer"
)
telemeter_image_data.head()
_id timestamp value_workload:cpu_usage_cores:sum value_workload:memory_usage_bytes:sum value_openshift:cpu_usage_cores:sum value_openshift:memory_usage_bytes:sum value_cluster:cpu_usage_cores:sum value_cluster:memory_usage_bytes:sum
0 00003d61-9db1-4757-9cd1-84df271daeb9 1620860062 0.03230751556666692 389971968 1.9291393691359602 14560608256 1.9614468847026272 14950580224
1 00351e6e-53ce-465e-9493-cf0cd2367049 1620852056 0.030943035959259964 385421312 1.7056283926121354 14386823168 1.7365714285713953 14772244480
2 003ba133-e754-4d5a-bc57-675b386d1f05 1620852942 0.23523091000370105 1190780928 2.5198167090408625 26840170496 2.7550476190445634 28030951424
3 00479ead-b7fc-49c2-ae20-3990a9b3d08c 1620855643 0.029065674940740756 269168640 1.6697914679164034 13228613632 1.6988571428571442 13497782272
4 009ef697-17c5-431e-a5f5-42d93d9c6a5a 1620855709 0.0035705586481481486 83660800 3.497572298494714 26006728704 3.501142857142862 26090389504

Now, we save the merged data to the DH-PLAYPEN bucket.

# Uploading the mapping dataset in the DH-PLAYPEN bucket
parquet_image_buffer = io.BytesIO()
telemeter_image_data.to_parquet(parquet_image_buffer)
s3_obj = s3.Object(
    s3_bucket, "ccx/fingerprinting/image_layers/telemeter_image_data.parquet"
)
status = s3_obj.put(Body=parquet_image_buffer.getvalue())
print(
    "Total number of cluster_id in original image layer dataset :",
    image_layers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :",
    telemeter_image_data._id.nunique(),
)
print("Number of timestamps :", telemeter_image_data.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 2673
Number of timestamps : 2581

Hence, we observe that we have at least one metric value for 2673 cluster_ids, i.e., ~95.4% of the total, from the image layers dataset.


Getting the data for a given timestamp and cluster_id for the containers dataset

We first extract the list of archive paths and use the split method to get the timestamp and cluster_id from each path. We then use a Prometheus query to build the dataframe.

Metric name: workload:cpu_usage_cores:sum

# listing the archive path
cont_archive_path = containers_df.archive_path.unique()

df_cont_cpu = pd.DataFrame()

for path in tqdm(cont_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'workload:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )

    df_cont_cpu = df_cont_cpu.append(df)
df_cont_cpu.head()
100%|██████████| 2927/2927 [00:37<00:00, 78.43it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 workload:cpu_usage_cores:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 0.03230751556666692
0 workload:cpu_usage_cores:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 0.030943035959259964
0 workload:cpu_usage_cores:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 0.23523091000370105
0 workload:cpu_usage_cores:sum 00479ead-b7fc-49c2-ae20-3990a9b3d08c openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855643 0.029065674940740756
0 workload:cpu_usage_cores:sum 009ef697-17c5-431e-a5f5-42d93d9c6a5a openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855709 0.0035705586481481486
print(
    "Total number of cluster_id in original container dataset :",
    containers_df.cluster_id.nunique(),
)
print("Number of cluster_id for the extracted dataset :", df_cont_cpu._id.nunique())
print("Number of timestamps :", df_cont_cpu.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 1857
Number of timestamps : 1798

For the metric workload:cpu_usage_cores:sum, we observe that out of the 2927 cluster_ids in the container dataset of the Insights Operator workload, we could only extract information for 1857 cluster_ids, i.e., ~63.4% of them, from the telemetry database.


Metric name: workload:memory_usage_bytes:sum

df_cont_mem = pd.DataFrame()

for path in tqdm(cont_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'workload:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )

    df_cont_mem = df_cont_mem.append(df)
df_cont_mem.head()
100%|██████████| 2927/2927 [00:35<00:00, 83.29it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 workload:memory_usage_bytes:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 389971968
0 workload:memory_usage_bytes:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 385421312
0 workload:memory_usage_bytes:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 1190780928
0 workload:memory_usage_bytes:sum 00479ead-b7fc-49c2-ae20-3990a9b3d08c openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855643 269168640
0 workload:memory_usage_bytes:sum 009ef697-17c5-431e-a5f5-42d93d9c6a5a openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855709 83660800
print(
    "Total number of cluster_id in original container dataset :",
    containers_df.cluster_id.nunique(),
)
print("Number of cluster_id for the extracted dataset :", df_cont_mem._id.nunique())
print("Number of timestamps :", df_cont_mem.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 1811
Number of timestamps : 1755

For the metric workload:memory_usage_bytes:sum, we observe that out of the 2927 cluster_ids in the container dataset of the Insights Operator workload, we could only extract information for 1811 cluster_ids, i.e., ~61.9% of them, from the telemetry database.


Metric name: openshift:cpu_usage_cores:sum

df_cont_open_cpu = pd.DataFrame()

for path in tqdm(cont_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'openshift:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    df_cont_open_cpu = df_cont_open_cpu.append(df)
df_cont_open_cpu.head()
100%|██████████| 2927/2927 [00:43<00:00, 66.68it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 openshift:cpu_usage_cores:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 1.9291393691359602
0 openshift:cpu_usage_cores:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 1.7056283926121354
0 openshift:cpu_usage_cores:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 2.5198167090408625
0 openshift:cpu_usage_cores:sum 00479ead-b7fc-49c2-ae20-3990a9b3d08c openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855643 1.6697914679164034
0 openshift:cpu_usage_cores:sum 009ef697-17c5-431e-a5f5-42d93d9c6a5a openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855709 3.497572298494714
print(
    "Total number of cluster_id in original container dataset :",
    containers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :", df_cont_open_cpu._id.nunique()
)
print("Number of timestamps :", df_cont_open_cpu.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 1852
Number of timestamps : 1793

For the metric openshift:cpu_usage_cores:sum, we observe that out of the 2927 cluster_ids in the container dataset of the Insights Operator workload, we could only extract information for 1852 cluster_ids, i.e., ~63.3% of them, from the telemetry database.


Metric name: openshift:memory_usage_bytes:sum

df_cont_open_mem = pd.DataFrame()

for path in tqdm(cont_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'openshift:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    df_cont_open_mem = df_cont_open_mem.append(df)
df_cont_open_mem.head()
100%|██████████| 2927/2927 [00:51<00:00, 56.81it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 openshift:memory_usage_bytes:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 14560608256
0 openshift:memory_usage_bytes:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 14386823168
0 openshift:memory_usage_bytes:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 26840170496
0 openshift:memory_usage_bytes:sum 00479ead-b7fc-49c2-ae20-3990a9b3d08c openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855643 13228613632
0 openshift:memory_usage_bytes:sum 009ef697-17c5-431e-a5f5-42d93d9c6a5a openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855709 26006728704
print(
    "Total number of cluster_id in original container dataset :",
    containers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :", df_cont_open_mem._id.nunique()
)
print("Number of timestamps :", df_cont_open_mem.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 1811
Number of timestamps : 1755

For the metric openshift:memory_usage_bytes:sum, we observe that out of the 2927 cluster_ids in the container dataset of the Insights Operator workload, we could only extract information for 1811 cluster_ids, i.e., ~61.9% of them, from the telemetry database.


Metric name: cluster:cpu_usage_cores:sum

df_cont_cluster_cpu = pd.DataFrame()

for path in tqdm(cont_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'cluster:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    df_cont_cluster_cpu = df_cont_cluster_cpu.append(df)
df_cont_cluster_cpu.head()
100%|██████████| 2927/2927 [01:00<00:00, 48.78it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 cluster:cpu_usage_cores:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 1.9614468847026272
0 cluster:cpu_usage_cores:sum 002663ad-bcf4-4c7c-9530-ecb351fe4001 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620825377 4.345714285714284
0 cluster:cpu_usage_cores:sum 0030e092-3def-4e7a-96b7-17d45ca102fd openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620812523 11.39365149650794
0 cluster:cpu_usage_cores:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 1.7365714285713953
0 cluster:cpu_usage_cores:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 2.7550476190445634
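The loop above recovers each query timestamp from the HHMMSS portion of the archive filename and the cluster_id from a fixed position in the path. A minimal sketch of that parsing, using a hypothetical archive path (the real path layout comes from the bucket listing earlier in the notebook; only the positions matter here):

```python
import datetime as dt

# Hypothetical path: the 4th component is the cluster_id,
# the filename encodes the collection time as HHMMSS.
path = "archives/compressed/00/00003d61-9db1-4757-9cd1-84df271daeb9/2021-05-12/123456.tar.gz"

pos = path.split("/")[-1].split(".")[0]  # "123456"
cluster_id = path.split("/")[3]

ts = dt.datetime(
    year=2021,
    month=5,
    day=12,
    hour=int(pos[0:2]),
    minute=int(pos[2:4]),
    second=int(pos[4:6]),
    tzinfo=dt.timezone.utc,
)
print(cluster_id, ts.isoformat())
```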
print(
    "Total number of cluster_id in original container dataset :",
    containers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :",
    df_cont_cluster_cpu._id.nunique(),
)
print("Number of timestamps :", df_cont_cluster_cpu.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 2649
Number of timestamps : 2560

For the metric cluster:cpu_usage_cores:sum, we observe that out of the 2927 cluster_ids in the container dataset of the insights operator workload, we could only extract information for 2649 cluster_ids, i.e. ~90.5%, from the telemetry database.


Metric name: cluster:memory_usage_bytes:sum

frames = []

for path in tqdm(cont_archive_path):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'cluster:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    frames.append(df)

# Collect per-cluster snapshots and concatenate once (DataFrame.append is deprecated)
df_cont_cluster_mem = pd.concat(frames)
df_cont_cluster_mem.head()
100%|██████████| 2927/2927 [00:50<00:00, 57.88it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 cluster:memory_usage_bytes:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 14950580224
0 cluster:memory_usage_bytes:sum 002663ad-bcf4-4c7c-9530-ecb351fe4001 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620825377 25248690176
0 cluster:memory_usage_bytes:sum 0030e092-3def-4e7a-96b7-17d45ca102fd openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620812523 29709115392
0 cluster:memory_usage_bytes:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 14772244480
0 cluster:memory_usage_bytes:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 28030951424
print(
    "Total number of cluster_id in original container dataset :",
    containers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :",
    df_cont_cluster_mem._id.nunique(),
)
print("Number of timestamps :", df_cont_cluster_mem.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 2697
Number of timestamps : 2605

For the metric cluster:memory_usage_bytes:sum, we observe that out of the 2927 cluster_ids in the container dataset of the insights operator workload, we could only extract information for 2697 cluster_ids, i.e. ~92.1%, from the telemetry database.

In the next section, we will perform an outer merge of the six datasets above. But first, since every dataset uses the same `value` column name, we rename that column to a metric-specific name in each dataset.

Merging Datasets

Renaming the value column for respective datasets:

df_cont_cpu.rename(
    columns={"value": "value_workload:cpu_usage_cores:sum"}, inplace=True
)

df_cont_mem.rename(
    columns={"value": "value_workload:memory_usage_bytes:sum"}, inplace=True
)

df_cont_open_cpu.rename(
    columns={"value": "value_openshift:cpu_usage_cores:sum"}, inplace=True
)

df_cont_open_mem.rename(
    columns={"value": "value_openshift:memory_usage_bytes:sum"}, inplace=True
)

df_cont_cluster_cpu.rename(
    columns={"value": "value_cluster:cpu_usage_cores:sum"}, inplace=True
)

df_cont_cluster_mem.rename(
    columns={"value": "value_cluster:memory_usage_bytes:sum"}, inplace=True
)
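The six rename calls above follow one pattern, so they could also be driven by a loop over metric names. A compact sketch on toy frames (the metric names are the real ones; the frames here are stand-ins for the per-metric dataframes extracted above):

```python
import pandas as pd

metrics = [
    "workload:cpu_usage_cores:sum",
    "workload:memory_usage_bytes:sum",
    "openshift:cpu_usage_cores:sum",
]
# Toy stand-ins for the per-metric dataframes.
frames = {
    m: pd.DataFrame({"_id": ["a"], "timestamp": [1], "value": [1.0]}) for m in metrics
}

# Rename each frame's shared "value" column to a metric-specific name.
for metric, df in frames.items():
    df.rename(columns={"value": f"value_{metric}"}, inplace=True)

print(frames["workload:cpu_usage_cores:sum"].columns.tolist())
```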
# Arranging the dataset for merging
df_cont_cpu_merge = df_cont_cpu[
    ["_id", "timestamp", "value_workload:cpu_usage_cores:sum"]
]
df_cont_mem_merge = df_cont_mem[
    ["_id", "timestamp", "value_workload:memory_usage_bytes:sum"]
]
df_cont_open_cpu_merge = df_cont_open_cpu[
    ["_id", "timestamp", "value_openshift:cpu_usage_cores:sum"]
]
df_cont_open_mem_merge = df_cont_open_mem[
    ["_id", "timestamp", "value_openshift:memory_usage_bytes:sum"]
]
df_cont_cluster_cpu_merge = df_cont_cluster_cpu[
    ["_id", "timestamp", "value_cluster:cpu_usage_cores:sum"]
]
df_cont_cluster_mem_merge = df_cont_cluster_mem[
    ["_id", "timestamp", "value_cluster:memory_usage_bytes:sum"]
]

# Merging
merge1 = pd.merge(
    df_cont_cpu_merge, df_cont_mem_merge, on=["_id", "timestamp"], how="outer"
)
merge2 = pd.merge(merge1, df_cont_open_cpu_merge, on=["_id", "timestamp"], how="outer")
merge3 = pd.merge(merge2, df_cont_open_mem_merge, on=["_id", "timestamp"], how="outer")
merge4 = pd.merge(
    merge3, df_cont_cluster_cpu_merge, on=["_id", "timestamp"], how="outer"
)
telemeter_cont_data = pd.merge(
    merge4, df_cont_cluster_mem_merge, on=["_id", "timestamp"], how="outer"
)
telemeter_cont_data.head()
_id timestamp value_workload:cpu_usage_cores:sum value_workload:memory_usage_bytes:sum value_openshift:cpu_usage_cores:sum value_openshift:memory_usage_bytes:sum value_cluster:cpu_usage_cores:sum value_cluster:memory_usage_bytes:sum
0 00003d61-9db1-4757-9cd1-84df271daeb9 1620860062 0.03230751556666692 389971968 1.9291393691359602 14560608256 1.9614468847026272 14950580224
1 00351e6e-53ce-465e-9493-cf0cd2367049 1620852056 0.030943035959259964 385421312 1.7056283926121354 14386823168 1.7365714285713953 14772244480
2 003ba133-e754-4d5a-bc57-675b386d1f05 1620852942 0.23523091000370105 1190780928 2.5198167090408625 26840170496 2.7550476190445634 28030951424
3 00479ead-b7fc-49c2-ae20-3990a9b3d08c 1620855643 0.029065674940740756 269168640 1.6697914679164034 13228613632 1.6988571428571442 13497782272
4 009ef697-17c5-431e-a5f5-42d93d9c6a5a 1620855709 0.0035705586481481486 83660800 3.497572298494714 26006728704 3.501142857142862 26090389504
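The five chained pairwise merges above can be written more compactly with `functools.reduce`. A sketch on toy frames (the notebook's six `*_merge` frames would slot into the list unchanged):

```python
from functools import reduce

import pandas as pd

# Toy frames sharing the merge keys "_id" and "timestamp".
cpu = pd.DataFrame({"_id": ["a", "b"], "timestamp": [1, 2], "cpu": [0.1, 0.2]})
mem = pd.DataFrame({"_id": ["a"], "timestamp": [1], "mem": [100]})
open_cpu = pd.DataFrame({"_id": ["b"], "timestamp": [2], "open_cpu": [1.5]})

# Outer-merge every frame on the shared keys in one pass.
merged = reduce(
    lambda left, right: pd.merge(left, right, on=["_id", "timestamp"], how="outer"),
    [cpu, mem, open_cpu],
)
print(merged.shape)
```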

Now we save the merged dataset to the DH-PLAYPEN bucket.

# Uploading the merged container telemetry dataset to the DH-PLAYPEN bucket
parquet_cont_buffer = io.BytesIO()
telemeter_cont_data.to_parquet(parquet_cont_buffer)
s3_obj = s3.Object(
    s3_bucket, "ccx/fingerprinting/containers/telemeter_cont_data.parquet"
)
status = s3_obj.put(Body=parquet_cont_buffer.getvalue())
print(
    "Total number of cluster_id in original container dataset :",
    containers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :",
    telemeter_cont_data._id.nunique(),
)
print("Number of timestamps :", telemeter_cont_data.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 2698
Number of timestamps : 2606

Hence, we observe that we have at least one metric value for 2698 cluster_ids, i.e. ~92.2% of the total cluster_ids in the container dataset.

Conclusion

This information is now integrated with the workload dataset. The next step is to perform an EDA and then apply clustering techniques to the final integrated dataset.