Extracting the Telemetry Info for Clusters in Workload Dataset

Overview

As data scientists working on a workload fingerprinting project, the workload data we have from the Insights Operator archive is limited to a few fields such as product name, CPU usage, and memory usage. We performed the product name mapping in this issue.

The purpose of this notebook is to extract data from telemetry based on the cluster_id and timestamp given in the workload data. Metric information is extracted for both the image layers dataset and the containers dataset. The different metrics are:

  • workload:cpu_usage_cores:sum

  • workload:memory_usage_bytes:sum

  • openshift:cpu_usage_cores:sum

  • openshift:memory_usage_bytes:sum

  • cluster:cpu_usage_cores:sum

  • cluster:memory_usage_bytes:sum

Importing packages

import os
import datetime as dt
from tqdm import tqdm
import pandas as pd
import io
import boto3
import warnings
import urllib3
from dotenv import load_dotenv, find_dotenv

from prometheus_api_client import PrometheusConnect, MetricSnapshotDataFrame
urllib3.disable_warnings()
pd.set_option("display.max_columns", None)
%matplotlib inline
load_dotenv(find_dotenv())
warnings.filterwarnings("ignore")

Data Collection from bucket

First, we extract the image_layers and containers datasets from the DH-PLAYPEN bucket in order to get the list of cluster_ids and timestamps from the archive_path given in each dataset.

# CEPH Bucket variables
s3_endpoint_url = os.getenv("S3_ENDPOINT")
s3_access_key = os.getenv("S3_ACCESS_KEY")
s3_secret_key = os.getenv("S3_SECRET_KEY")
s3_bucket = os.getenv("S3_BUCKET")

# s3 resource to communicate with storage
s3 = boto3.resource(
    "s3",
    endpoint_url=s3_endpoint_url,
    aws_access_key_id=s3_access_key,
    aws_secret_access_key=s3_secret_key,
)

# access the parquet file as an s3 object

obj1 = s3.Object(
    "DH-PLAYPEN", "ccx/fingerprinting/image_layers/date=2021-05-12/2021-05-12.parquet"
)
obj2 = s3.Object(
    "DH-PLAYPEN", "ccx/fingerprinting/containers/date=2021-05-12/2021-05-12.parquet"
)
obj3 = s3.Object(
    "DH-PLAYPEN", "ccx/fingerprinting/image_layers/dataframe_image_id.parquet"
)
obj4 = s3.Object("DH-PLAYPEN", "ccx/fingerprinting/containers/df_cont_image_id.parquet")
# download the file into the buffer
buffer1 = io.BytesIO()
obj1.download_fileobj(buffer1)
buffer2 = io.BytesIO()
obj2.download_fileobj(buffer2)

# read the buffer and create the dataframe
image_layers_df = pd.read_parquet(buffer1)
containers_df = pd.read_parquet(buffer2)

Connect to Prometheus

Connecting to the Prometheus API client

We can refer to the Data Hub help page for instructions on interacting with the telemetry dataset; the Prometheus URL and access credentials can also be fetched from there.

PROM_URL = os.getenv("PROM_URL")
PROM_ACCESS_TOKEN = os.getenv("PROM_ACCESS_TOKEN")

# instantiate prometheus connector
pc = PrometheusConnect(
    url=PROM_URL,
    headers={"Authorization": f"bearer {PROM_ACCESS_TOKEN}"},
    disable_ssl=True,
)

Getting the data for a given timestamp and cluster_id for the image layers dataset

We first extract the list of archive paths and use the split method to get the timestamp and cluster_id from each path. We then use a Prometheus query to build the dataframe.
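The path layout assumed by this splitting (cluster_id as the fourth path segment, an HHMMSS-named archive file at the end) can be sketched on a made-up path; the example path below is hypothetical and only mirrors the split indices used in the loops:

```python
import datetime as dt

# Hypothetical archive path: segment [3] holds the cluster_id,
# and the file name encodes the time as HHMMSS.
path = (
    "archives/compressed/00/00003d61-9db1-4757-9cd1-84df271daeb9"
    "/20210512/123456.tar.gz"
)

cluster_id = path.split("/")[3]             # fourth path segment
hhmmss = path.split("/")[-1].split(".")[0]  # "123456"
ts = dt.datetime(
    year=2021,
    month=5,
    day=12,
    hour=int(hhmmss[0:2]),
    minute=int(hhmmss[2:4]),
    second=int(hhmmss[4:6]),
    tzinfo=dt.timezone.utc,
)
print(cluster_id, ts.isoformat())
```

The resulting `ts.timestamp()` is what gets passed as the `time` parameter of the Prometheus query.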

Metric name: workload:cpu_usage_cores:sum

# listing the archive path
image_archive_path = image_layers_df.archive_path.unique()

df_image_cpu = pd.DataFrame()

for path in tqdm(image_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'workload:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )

    df_image_cpu = df_image_cpu.append(df)
df_image_cpu.head()
100%|██████████| 2802/2802 [00:43<00:00, 64.18it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 workload:cpu_usage_cores:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 0.03230751556666692
0 workload:cpu_usage_cores:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 0.030943035959259964
0 workload:cpu_usage_cores:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 0.23523091000370105
0 workload:cpu_usage_cores:sum 00479ead-b7fc-49c2-ae20-3990a9b3d08c openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855643 0.029065674940740756
0 workload:cpu_usage_cores:sum 009ef697-17c5-431e-a5f5-42d93d9c6a5a openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855709 0.0035705586481481486
print(
    "Total number of cluster_id in original image layer dataset :",
    image_layers_df.cluster_id.nunique(),
)
print("Number of cluster_id for the extracted dataset :", df_image_cpu._id.nunique())
print("Number of timestamps :", df_image_cpu.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 1850
Number of timestamps : 1791

For the metric workload:cpu_usage_cores:sum, we observe that out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could only extract information for 1850 cluster_ids, i.e., ~66% of them, from the telemetry database.


Metric name: workload:memory_usage_bytes:sum

df_image_mem = pd.DataFrame()

for path in tqdm(image_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'workload:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    df_image_mem = df_image_mem.append(df)
df_image_mem.head()
100%|██████████| 2802/2802 [01:12<00:00, 38.68it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 workload:memory_usage_bytes:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 389971968
0 workload:memory_usage_bytes:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 385421312
0 workload:memory_usage_bytes:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 1190780928
0 workload:memory_usage_bytes:sum 00479ead-b7fc-49c2-ae20-3990a9b3d08c openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855643 269168640
0 workload:memory_usage_bytes:sum 009ef697-17c5-431e-a5f5-42d93d9c6a5a openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855709 83660800
print(
    "Total number of cluster_id in original image layer dataset :",
    image_layers_df.cluster_id.nunique(),
)
print("Number of cluster_id for the extracted dataset :", df_image_mem._id.nunique())
print("Number of timestamps :", df_image_mem.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 1805
Number of timestamps : 1749

For the metric workload:memory_usage_bytes:sum, we observe that out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could only extract information for 1805 cluster_ids, i.e., ~64.4% of them, from the telemetry database.


Metric name: openshift:cpu_usage_cores:sum

df_image_open_cpu = pd.DataFrame()

for path in tqdm(image_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'openshift:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    df_image_open_cpu = df_image_open_cpu.append(df)
df_image_open_cpu.head()
100%|██████████| 2802/2802 [01:08<00:00, 40.76it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 openshift:cpu_usage_cores:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 1.9291393691359602
0 openshift:cpu_usage_cores:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 1.7056283926121354
0 openshift:cpu_usage_cores:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 2.5198167090408625
0 openshift:cpu_usage_cores:sum 00479ead-b7fc-49c2-ae20-3990a9b3d08c openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855643 1.6697914679164034
0 openshift:cpu_usage_cores:sum 009ef697-17c5-431e-a5f5-42d93d9c6a5a openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855709 3.497572298494714
print(
    "Total number of cluster_id in original image layer dataset :",
    image_layers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :", df_image_open_cpu._id.nunique()
)
print("Number of timestamps :", df_image_open_cpu.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 1845
Number of timestamps : 1786

For the metric openshift:cpu_usage_cores:sum, we observe that out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could only extract information for 1845 cluster_ids, i.e., ~65.8% of them, from the telemetry database.


Metric name: openshift:memory_usage_bytes:sum

df_image_open_mem = pd.DataFrame()

for path in tqdm(image_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'openshift:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    df_image_open_mem = df_image_open_mem.append(df)
df_image_open_mem.head()
100%|██████████| 2802/2802 [01:06<00:00, 42.11it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 openshift:memory_usage_bytes:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 14560608256
0 openshift:memory_usage_bytes:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 14386823168
0 openshift:memory_usage_bytes:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 26840170496
0 openshift:memory_usage_bytes:sum 00479ead-b7fc-49c2-ae20-3990a9b3d08c openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855643 13228613632
0 openshift:memory_usage_bytes:sum 009ef697-17c5-431e-a5f5-42d93d9c6a5a openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855709 26006728704
print(
    "Total number of cluster_id in original image layer dataset :",
    image_layers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :", df_image_open_mem._id.nunique()
)
print("Number of timestamps :", df_image_open_mem.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 1805
Number of timestamps : 1749

For the metric openshift:memory_usage_bytes:sum, we observe that out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could only extract information for 1805 cluster_ids, i.e., ~64.4% of them, from the telemetry database.


Metric name: cluster:cpu_usage_cores:sum

df_image_cluster_cpu = pd.DataFrame()

for path in tqdm(image_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'cluster:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    df_image_cluster_cpu = df_image_cluster_cpu.append(df)
df_image_cluster_cpu.head()
100%|██████████| 2802/2802 [01:18<00:00, 35.50it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 cluster:cpu_usage_cores:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 1.9614468847026272
0 cluster:cpu_usage_cores:sum 002663ad-bcf4-4c7c-9530-ecb351fe4001 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620825377 4.345714285714284
0 cluster:cpu_usage_cores:sum 0030e092-3def-4e7a-96b7-17d45ca102fd openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620812523 11.39365149650794
0 cluster:cpu_usage_cores:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 1.7365714285713953
0 cluster:cpu_usage_cores:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 2.7550476190445634
print(
    "Total number of cluster_id in original image layer dataset :",
    image_layers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :",
    df_image_cluster_cpu._id.nunique(),
)
print("Number of timestamps :", df_image_cluster_cpu.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 2624
Number of timestamps : 2535

For the metric cluster:cpu_usage_cores:sum, we observe that out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could only extract information for 2624 cluster_ids, i.e., ~93.6% of them, from the telemetry database.


Metric name: cluster:memory_usage_bytes:sum

df_image_cluster_mem = pd.DataFrame()

for path in tqdm(image_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'cluster:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    df_image_cluster_mem = df_image_cluster_mem.append(df)
df_image_cluster_mem.head()
100%|██████████| 2802/2802 [01:04<00:00, 43.19it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 cluster:memory_usage_bytes:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 14950580224
0 cluster:memory_usage_bytes:sum 002663ad-bcf4-4c7c-9530-ecb351fe4001 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620825377 25248690176
0 cluster:memory_usage_bytes:sum 0030e092-3def-4e7a-96b7-17d45ca102fd openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620812523 29709115392
0 cluster:memory_usage_bytes:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 14772244480
0 cluster:memory_usage_bytes:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 28030951424
print(
    "Total number of cluster_id in original image layer dataset :",
    image_layers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :",
    df_image_cluster_mem._id.nunique(),
)
print("Number of timestamps :", df_image_cluster_mem.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 2673
Number of timestamps : 2581

For the metric cluster:memory_usage_bytes:sum, we observe that out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could only extract information for 2673 cluster_ids, i.e., ~95.4% of them, from the telemetry database.

In the next section, we perform an outer merge of the six datasets above. But first, since each dataset uses the same name for its value column, we rename that column in each to indicate its metric.
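The reason for renaming first can be seen on two toy frames: with identically named value columns, an outer merge would auto-suffix them as value_x/value_y, so we give each a metric-specific name up front. A minimal sketch with made-up ids and values:

```python
import pandas as pd

# Two toy metric snapshots, each renamed to carry its metric name.
cpu = pd.DataFrame(
    {"_id": ["a", "b"], "timestamp": [1, 2], "value": [0.5, 0.7]}
).rename(columns={"value": "value_workload:cpu_usage_cores:sum"})
mem = pd.DataFrame(
    {"_id": ["a", "c"], "timestamp": [1, 3], "value": [1024, 2048]}
).rename(columns={"value": "value_workload:memory_usage_bytes:sum"})

# An outer merge keeps every (_id, timestamp) pair seen in either
# frame; clusters missing a metric get NaN in that metric's column.
merged = pd.merge(cpu, mem, on=["_id", "timestamp"], how="outer")
print(merged)
```

This is why the final merged frame can cover more cluster_ids than any single metric does: a cluster only needs to appear in one of the six frames to get a row.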

Merging Datasets

Renaming the value column in each dataset:

df_image_cpu.rename(
    columns={"value": "value_workload:cpu_usage_cores:sum"}, inplace=True
)
df_image_mem.rename(
    columns={"value": "value_workload:memory_usage_bytes:sum"}, inplace=True
)
df_image_open_cpu.rename(
    columns={"value": "value_openshift:cpu_usage_cores:sum"}, inplace=True
)
df_image_open_mem.rename(
    columns={"value": "value_openshift:memory_usage_bytes:sum"}, inplace=True
)
df_image_cluster_cpu.rename(
    columns={"value": "value_cluster:cpu_usage_cores:sum"}, inplace=True
)
df_image_cluster_mem.rename(
    columns={"value": "value_cluster:memory_usage_bytes:sum"}, inplace=True
)

Merging the six datasets

# Arranging the dataset for merging
df_image_cpu_merge = df_image_cpu[
    ["_id", "timestamp", "value_workload:cpu_usage_cores:sum"]
]
df_image_mem_merge = df_image_mem[
    ["_id", "timestamp", "value_workload:memory_usage_bytes:sum"]
]
df_image_open_cpu_merge = df_image_open_cpu[
    ["_id", "timestamp", "value_openshift:cpu_usage_cores:sum"]
]
df_image_open_mem_merge = df_image_open_mem[
    ["_id", "timestamp", "value_openshift:memory_usage_bytes:sum"]
]
df_image_cluster_cpu_merge = df_image_cluster_cpu[
    ["_id", "timestamp", "value_cluster:cpu_usage_cores:sum"]
]
df_image_cluster_mem_merge = df_image_cluster_mem[
    ["_id", "timestamp", "value_cluster:memory_usage_bytes:sum"]
]

# Merging
merge1 = pd.merge(
    df_image_cpu_merge, df_image_mem_merge, on=["_id", "timestamp"], how="outer"
)
merge2 = pd.merge(merge1, df_image_open_cpu_merge, on=["_id", "timestamp"], how="outer")
merge3 = pd.merge(merge2, df_image_open_mem_merge, on=["_id", "timestamp"], how="outer")
merge4 = pd.merge(
    merge3, df_image_cluster_cpu_merge, on=["_id", "timestamp"], how="outer"
)
telemeter_image_data = pd.merge(
    merge4, df_image_cluster_mem_merge, on=["_id", "timestamp"], how="outer"
)
telemeter_image_data.head()
_id timestamp value_workload:cpu_usage_cores:sum value_workload:memory_usage_bytes:sum value_openshift:cpu_usage_cores:sum value_openshift:memory_usage_bytes:sum value_cluster:cpu_usage_cores:sum value_cluster:memory_usage_bytes:sum
0 00003d61-9db1-4757-9cd1-84df271daeb9 1620860062 0.03230751556666692 389971968 1.9291393691359602 14560608256 1.9614468847026272 14950580224
1 00351e6e-53ce-465e-9493-cf0cd2367049 1620852056 0.030943035959259964 385421312 1.7056283926121354 14386823168 1.7365714285713953 14772244480
2 003ba133-e754-4d5a-bc57-675b386d1f05 1620852942 0.23523091000370105 1190780928 2.5198167090408625 26840170496 2.7550476190445634 28030951424
3 00479ead-b7fc-49c2-ae20-3990a9b3d08c 1620855643 0.029065674940740756 269168640 1.6697914679164034 13228613632 1.6988571428571442 13497782272
4 009ef697-17c5-431e-a5f5-42d93d9c6a5a 1620855709 0.0035705586481481486 83660800 3.497572298494714 26006728704 3.501142857142862 26090389504

Now, we save the merged data to the DH-PLAYPEN bucket.

# Uploading the mapping dataset in the DH-PLAYPEN bucket
parquet_image_buffer = io.BytesIO()
telemeter_image_data.to_parquet(parquet_image_buffer)
s3_obj = s3.Object(
    s3_bucket, "ccx/fingerprinting/image_layers/telemeter_image_data.parquet"
)
status = s3_obj.put(Body=parquet_image_buffer.getvalue())
print(
    "Total number of cluster_id in original image layer dataset :",
    image_layers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :",
    telemeter_image_data._id.nunique(),
)
print("Number of timestamps :", telemeter_image_data.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 2673
Number of timestamps : 2581

Hence, we observe that we have at least one metric value for 2673 cluster_ids, i.e., ~95.4% of the total, from the image layers dataset.


Getting the data for a given timestamp and cluster_id for the containers dataset

We first extract the list of archive paths and use the split method to get the timestamp and cluster_id from each path. We then use a Prometheus query to build the dataframe.

Metric name: workload:cpu_usage_cores:sum

# listing the archive path
cont_archive_path = containers_df.archive_path.unique()

df_cont_cpu = pd.DataFrame()

for path in tqdm(cont_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'workload:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )

    df_cont_cpu = df_cont_cpu.append(df)
df_cont_cpu.head()
100%|██████████| 2927/2927 [00:37<00:00, 78.43it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 workload:cpu_usage_cores:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 0.03230751556666692
0 workload:cpu_usage_cores:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 0.030943035959259964
0 workload:cpu_usage_cores:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 0.23523091000370105
0 workload:cpu_usage_cores:sum 00479ead-b7fc-49c2-ae20-3990a9b3d08c openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855643 0.029065674940740756
0 workload:cpu_usage_cores:sum 009ef697-17c5-431e-a5f5-42d93d9c6a5a openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855709 0.0035705586481481486
print(
    "Total number of cluster_id in original container dataset :",
    containers_df.cluster_id.nunique(),
)
print("Number of cluster_id for the extracted dataset :", df_cont_cpu._id.nunique())
print("Number of timestamps :", df_cont_cpu.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 1857
Number of timestamps : 1798

For the metric workload:cpu_usage_cores:sum, we observe that out of the 2927 cluster_ids in the container dataset of the Insights Operator workload, we could only extract information for 1857 cluster_ids, i.e., ~63.4% of them, from the telemetry database.


Metric name: workload:memory_usage_bytes:sum

df_cont_mem = pd.DataFrame()

for path in tqdm(cont_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'workload:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )

    df_cont_mem = df_cont_mem.append(df)
df_cont_mem.head()
100%|██████████| 2927/2927 [00:35<00:00, 83.29it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 workload:memory_usage_bytes:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 389971968
0 workload:memory_usage_bytes:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 385421312
0 workload:memory_usage_bytes:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 1190780928
0 workload:memory_usage_bytes:sum 00479ead-b7fc-49c2-ae20-3990a9b3d08c openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855643 269168640
0 workload:memory_usage_bytes:sum 009ef697-17c5-431e-a5f5-42d93d9c6a5a openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855709 83660800
print(
    "Total number of cluster_id in original container dataset :",
    containers_df.cluster_id.nunique(),
)
print("Number of cluster_id for the extracted dataset :", df_cont_mem._id.nunique())
print("Number of timestamps :", df_cont_mem.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 1811
Number of timestamps : 1755

For the metric workload:memory_usage_bytes:sum, we observe that out of the 2927 cluster_ids in the container dataset of the Insights Operator workload, we could only extract information for 1811 cluster_ids, i.e., ~61.9% of them, from the telemetry database.


Metric name: openshift:cpu_usage_cores:sum

df_cont_open_cpu = pd.DataFrame()

for path in tqdm(cont_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'openshift:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    df_cont_open_cpu = df_cont_open_cpu.append(df)
df_cont_open_cpu.head()
100%|██████████| 2927/2927 [00:43<00:00, 66.68it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 openshift:cpu_usage_cores:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 1.9291393691359602
0 openshift:cpu_usage_cores:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 1.7056283926121354
0 openshift:cpu_usage_cores:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 2.5198167090408625
0 openshift:cpu_usage_cores:sum 00479ead-b7fc-49c2-ae20-3990a9b3d08c openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855643 1.6697914679164034
0 openshift:cpu_usage_cores:sum 009ef697-17c5-431e-a5f5-42d93d9c6a5a openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855709 3.497572298494714
print(
    "Total number of cluster_id in original container dataset :",
    containers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :", df_cont_open_cpu._id.nunique()
)
print("Number of timestamps :", df_cont_open_cpu.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 1852
Number of timestamps : 1793

For the metric openshift:cpu_usage_cores:sum, we observe that out of the 2927 cluster_ids in the container dataset of the Insights Operator workload, we could only extract information for 1852 cluster_ids, i.e., ~63.3% of them, from the telemetry database.


Metric name: openshift:memory_usage_bytes:sum

df_cont_open_mem = pd.DataFrame()

for path in tqdm(cont_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'openshift:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    df_cont_open_mem = df_cont_open_mem.append(df)
df_cont_open_mem.head()
100%|██████████| 2927/2927 [00:51<00:00, 56.81it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 openshift:memory_usage_bytes:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 14560608256
0 openshift:memory_usage_bytes:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 14386823168
0 openshift:memory_usage_bytes:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 26840170496
0 openshift:memory_usage_bytes:sum 00479ead-b7fc-49c2-ae20-3990a9b3d08c openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855643 13228613632
0 openshift:memory_usage_bytes:sum 009ef697-17c5-431e-a5f5-42d93d9c6a5a openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620855709 26006728704
print(
    "Total number of cluster_id in original container dataset :",
    containers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :", df_cont_open_mem._id.nunique()
)
print("Number of timestamps :", df_cont_open_mem.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 1811
Number of timestamps : 1755

For the metric openshift:memory_usage_bytes:sum, we observe that out of the 2927 cluster_ids in the container dataset of the Insights Operator workload, we could only extract information for 1811 cluster_ids, i.e., ~61.9% of them, from the telemetry database.


Metric name: cluster:cpu_usage_cores:sum

df_cont_cluster_cpu = pd.DataFrame()

for path in tqdm(cont_archive_path[:]):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'cluster:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    df_cont_cluster_cpu = df_cont_cluster_cpu.append(df)
df_cont_cluster_cpu.head()
100%|██████████| 2927/2927 [01:00<00:00, 48.78it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 cluster:cpu_usage_cores:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 1.9614468847026272
0 cluster:cpu_usage_cores:sum 002663ad-bcf4-4c7c-9530-ecb351fe4001 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620825377 4.345714285714284
0 cluster:cpu_usage_cores:sum 0030e092-3def-4e7a-96b7-17d45ca102fd openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620812523 11.39365149650794
0 cluster:cpu_usage_cores:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 1.7365714285713953
0 cluster:cpu_usage_cores:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 2.7550476190445634
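The loop above recovers each query timestamp from the HHMMSS portion of the archive filename and the cluster_id from a fixed position in the path. A minimal sketch of that parsing, using a hypothetical archive path (the real path layout comes from the bucket listing earlier in the notebook; only the positions matter here):

```python
import datetime as dt

# Hypothetical path: the 4th component is the cluster_id,
# the filename encodes the collection time as HHMMSS.
path = "archives/compressed/00/00003d61-9db1-4757-9cd1-84df271daeb9/2021-05-12/123456.tar.gz"

pos = path.split("/")[-1].split(".")[0]  # "123456"
cluster_id = path.split("/")[3]

ts = dt.datetime(
    year=2021,
    month=5,
    day=12,
    hour=int(pos[0:2]),
    minute=int(pos[2:4]),
    second=int(pos[4:6]),
    tzinfo=dt.timezone.utc,
)
print(cluster_id, ts.isoformat())
```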
print(
    "Total number of cluster_id in original container dataset :",
    containers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :",
    df_cont_cluster_cpu._id.nunique(),
)
print("Number of timestamps :", df_cont_cluster_cpu.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 2649
Number of timestamps : 2560

For the metric cluster:cpu_usage_cores:sum, we observe that out of the 2927 cluster_ids in the container dataset of the insights operator workload, we could only extract information for 2649 cluster_ids, i.e. ~90.5%, from the telemetry database.


Metric name: cluster:memory_usage_bytes:sum

frames = []

for path in tqdm(cont_archive_path):
    pos = path.split("/")[-1].split(".")[0]
    hours = int(pos[0:2])
    mins = int(pos[2:4])
    sec = int(pos[4:6])
    cluster_pos = path.split("/")[3]
    ts = dt.datetime(
        year=2021,
        month=5,
        day=12,
        hour=hours,
        minute=mins,
        second=sec,
        tzinfo=dt.timezone.utc,
    )
    df = MetricSnapshotDataFrame(
        pc.custom_query(
            f'cluster:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
            params={"time": ts.timestamp()},
        )
    )
    frames.append(df)

# Collect per-cluster snapshots and concatenate once (DataFrame.append is deprecated)
df_cont_cluster_mem = pd.concat(frames)
df_cont_cluster_mem.head()
100%|██████████| 2927/2927 [00:50<00:00, 57.88it/s]
__name__ _id prometheus receive tenant_id timestamp value
0 cluster:memory_usage_bytes:sum 00003d61-9db1-4757-9cd1-84df271daeb9 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620860062 14950580224
0 cluster:memory_usage_bytes:sum 002663ad-bcf4-4c7c-9530-ecb351fe4001 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620825377 25248690176
0 cluster:memory_usage_bytes:sum 0030e092-3def-4e7a-96b7-17d45ca102fd openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620812523 29709115392
0 cluster:memory_usage_bytes:sum 00351e6e-53ce-465e-9493-cf0cd2367049 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852056 14772244480
0 cluster:memory_usage_bytes:sum 003ba133-e754-4d5a-bc57-675b386d1f05 openshift-monitoring/k8s true FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 1620852942 28030951424
print(
    "Total number of cluster_id in original container dataset :",
    containers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :",
    df_cont_cluster_mem._id.nunique(),
)
print("Number of timestamps :", df_cont_cluster_mem.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 2697
Number of timestamps : 2605

For the metric cluster:memory_usage_bytes:sum, we observe that out of the 2927 cluster_ids in the container dataset of the insights operator workload, we could only extract information for 2697 cluster_ids, i.e. ~92.1%, from the telemetry database.

In the next section, we will perform an outer merge of the six datasets above. But first, since every dataset uses the same `value` column name, we rename that column to a metric-specific name in each dataset.

Merging Datasets

Renaming the value column for respective datasets:

df_cont_cpu.rename(
    columns={"value": "value_workload:cpu_usage_cores:sum"}, inplace=True
)

df_cont_mem.rename(
    columns={"value": "value_workload:memory_usage_bytes:sum"}, inplace=True
)

df_cont_open_cpu.rename(
    columns={"value": "value_openshift:cpu_usage_cores:sum"}, inplace=True
)

df_cont_open_mem.rename(
    columns={"value": "value_openshift:memory_usage_bytes:sum"}, inplace=True
)

df_cont_cluster_cpu.rename(
    columns={"value": "value_cluster:cpu_usage_cores:sum"}, inplace=True
)

df_cont_cluster_mem.rename(
    columns={"value": "value_cluster:memory_usage_bytes:sum"}, inplace=True
)
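The six rename calls above follow one pattern, so they could also be driven by a loop over metric names. A compact sketch on toy frames (the metric names are the real ones; the frames here are stand-ins for the per-metric dataframes extracted above):

```python
import pandas as pd

metrics = [
    "workload:cpu_usage_cores:sum",
    "workload:memory_usage_bytes:sum",
    "openshift:cpu_usage_cores:sum",
]
# Toy stand-ins for the per-metric dataframes.
frames = {
    m: pd.DataFrame({"_id": ["a"], "timestamp": [1], "value": [1.0]}) for m in metrics
}

# Rename each frame's shared "value" column to a metric-specific name.
for metric, df in frames.items():
    df.rename(columns={"value": f"value_{metric}"}, inplace=True)

print(frames["workload:cpu_usage_cores:sum"].columns.tolist())
```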
# Arranging the dataset for merging
df_cont_cpu_merge = df_cont_cpu[
    ["_id", "timestamp", "value_workload:cpu_usage_cores:sum"]
]
df_cont_mem_merge = df_cont_mem[
    ["_id", "timestamp", "value_workload:memory_usage_bytes:sum"]
]
df_cont_open_cpu_merge = df_cont_open_cpu[
    ["_id", "timestamp", "value_openshift:cpu_usage_cores:sum"]
]
df_cont_open_mem_merge = df_cont_open_mem[
    ["_id", "timestamp", "value_openshift:memory_usage_bytes:sum"]
]
df_cont_cluster_cpu_merge = df_cont_cluster_cpu[
    ["_id", "timestamp", "value_cluster:cpu_usage_cores:sum"]
]
df_cont_cluster_mem_merge = df_cont_cluster_mem[
    ["_id", "timestamp", "value_cluster:memory_usage_bytes:sum"]
]

# Merging
merge1 = pd.merge(
    df_cont_cpu_merge, df_cont_mem_merge, on=["_id", "timestamp"], how="outer"
)
merge2 = pd.merge(merge1, df_cont_open_cpu_merge, on=["_id", "timestamp"], how="outer")
merge3 = pd.merge(merge2, df_cont_open_mem_merge, on=["_id", "timestamp"], how="outer")
merge4 = pd.merge(
    merge3, df_cont_cluster_cpu_merge, on=["_id", "timestamp"], how="outer"
)
telemeter_cont_data = pd.merge(
    merge4, df_cont_cluster_mem_merge, on=["_id", "timestamp"], how="outer"
)
telemeter_cont_data.head()
_id timestamp value_workload:cpu_usage_cores:sum value_workload:memory_usage_bytes:sum value_openshift:cpu_usage_cores:sum value_openshift:memory_usage_bytes:sum value_cluster:cpu_usage_cores:sum value_cluster:memory_usage_bytes:sum
0 00003d61-9db1-4757-9cd1-84df271daeb9 1620860062 0.03230751556666692 389971968 1.9291393691359602 14560608256 1.9614468847026272 14950580224
1 00351e6e-53ce-465e-9493-cf0cd2367049 1620852056 0.030943035959259964 385421312 1.7056283926121354 14386823168 1.7365714285713953 14772244480
2 003ba133-e754-4d5a-bc57-675b386d1f05 1620852942 0.23523091000370105 1190780928 2.5198167090408625 26840170496 2.7550476190445634 28030951424
3 00479ead-b7fc-49c2-ae20-3990a9b3d08c 1620855643 0.029065674940740756 269168640 1.6697914679164034 13228613632 1.6988571428571442 13497782272
4 009ef697-17c5-431e-a5f5-42d93d9c6a5a 1620855709 0.0035705586481481486 83660800 3.497572298494714 26006728704 3.501142857142862 26090389504
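The five chained pairwise merges above can be written more compactly with `functools.reduce`. A sketch on toy frames (the notebook's six `*_merge` frames would slot into the list unchanged):

```python
from functools import reduce

import pandas as pd

# Toy frames sharing the merge keys "_id" and "timestamp".
cpu = pd.DataFrame({"_id": ["a", "b"], "timestamp": [1, 2], "cpu": [0.1, 0.2]})
mem = pd.DataFrame({"_id": ["a"], "timestamp": [1], "mem": [100]})
open_cpu = pd.DataFrame({"_id": ["b"], "timestamp": [2], "open_cpu": [1.5]})

# Outer-merge every frame on the shared keys in one pass.
merged = reduce(
    lambda left, right: pd.merge(left, right, on=["_id", "timestamp"], how="outer"),
    [cpu, mem, open_cpu],
)
print(merged.shape)
```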

Now we save the merged dataset to the DH-PLAYPEN bucket.

# Uploading the merged container telemetry dataset to the DH-PLAYPEN bucket
parquet_cont_buffer = io.BytesIO()
telemeter_cont_data.to_parquet(parquet_cont_buffer)
s3_obj = s3.Object(
    s3_bucket, "ccx/fingerprinting/containers/telemeter_cont_data.parquet"
)
status = s3_obj.put(Body=parquet_cont_buffer.getvalue())
print(
    "Total number of cluster_id in original container dataset :",
    containers_df.cluster_id.nunique(),
)
print(
    "Number of cluster_id for the extracted dataset :",
    telemeter_cont_data._id.nunique(),
)
print("Number of timestamps :", telemeter_cont_data.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 2698
Number of timestamps : 2606

Hence, we observe that we have at least one metric value for 2698 cluster_ids, i.e. ~92.2% of the total cluster_ids in the container dataset.

Conclusion

This information is now integrated with the workload dataset. The next step is to perform an EDA and then apply clustering techniques to the final integrated dataset.