Extracting Telemetry Information for Clusters in the Workload Dataset¶
Overview¶
As data scientists working on the workload fingerprinting project, the workload data we have from the Insights Operator archives is limited to a few fields such as product name, CPU usage, and memory usage. We have done the product name mapping in this issue.
The purpose of this notebook is to extract data from telemetry based on the cluster_id and timestamp given in the workload data. Metric information is extracted for both the image_layers_dataset and the container_dataset. The metrics collected are:
workload:cpu_usage_cores:sum
workload:memory_usage_bytes:sum
openshift:cpu_usage_cores:sum
openshift:memory_usage_bytes:sum
cluster:cpu_usage_cores:sum
cluster:memory_usage_bytes:sum
Importing packages¶
import os
import datetime as dt
from tqdm import tqdm
import pandas as pd
import io
import boto3
import warnings
import urllib3
from dotenv import load_dotenv, find_dotenv
from prometheus_api_client import PrometheusConnect, MetricSnapshotDataFrame
urllib3.disable_warnings()
pd.set_option("display.max_columns", None)
%matplotlib inline
load_dotenv(find_dotenv())
warnings.filterwarnings("ignore")
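All credentials are read from a local .env file via load_dotenv. As a small sanity check (a sketch, not part of the original notebook; the variable names below are the ones read with os.getenv later in this notebook), we can fail early if anything is missing:
# Fail early if any credential used later in the notebook is missing.
required_vars = [
    "S3_ENDPOINT",
    "S3_ACCESS_KEY",
    "S3_SECRET_KEY",
    "S3_BUCKET",
    "PROM_URL",
    "PROM_ACCESS_TOKEN",
]
missing = [name for name in required_vars if not os.getenv(name)]
if missing:
    raise EnvironmentError(f"Missing environment variables: {missing}")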
Data Collection from the Bucket¶
First, we extract the image_layers and containers datasets from the DH-PLAYPEN bucket in order to get the list of cluster_id and timestamp values from the archive_path column in each dataset.
# CEPH Bucket variables
s3_endpoint_url = os.getenv("S3_ENDPOINT")
s3_access_key = os.getenv("S3_ACCESS_KEY")
s3_secret_key = os.getenv("S3_SECRET_KEY")
s3_bucket = os.getenv("S3_BUCKET")
# s3 resource to communicate with storage
s3 = boto3.resource(
"s3",
endpoint_url=s3_endpoint_url,
aws_access_key_id=s3_access_key,
aws_secret_access_key=s3_secret_key,
)
# access the parquet file as an s3 object
obj1 = s3.Object(
"DH-PLAYPEN", "ccx/fingerprinting/image_layers/date=2021-05-12/2021-05-12.parquet"
)
obj2 = s3.Object(
"DH-PLAYPEN", "ccx/fingerprinting/containers/date=2021-05-12/2021-05-12.parquet"
)
obj3 = s3.Object(
"DH-PLAYPEN", "ccx/fingerprinting/image_layers/dataframe_image_id.parquet"
)
obj4 = s3.Object("DH-PLAYPEN", "ccx/fingerprinting/containers/df_cont_image_id.parquet")
# download the file into the buffer
buffer1 = io.BytesIO()
obj1.download_fileobj(buffer1)
buffer2 = io.BytesIO()
obj2.download_fileobj(buffer2)
# read the buffer and create the dataframe
image_layers_df = pd.read_parquet(buffer1)
containers_df = pd.read_parquet(buffer2)
Connect to Prometheus¶
Connecting to the Prometheus API client.
We can refer to the Data Hub help page for how to interact with the telemetry data. The credentials for Prometheus can also be fetched from the Data Hub help page.
prom_url = os.getenv("PROM_URL")
prom_access_token = os.getenv("PROM_ACCESS_TOKEN")
PROM_URL = prom_url
PROM_ACCESS_TOKEN = prom_access_token
# instantiate prometheus connector
pc = PrometheusConnect(
url=PROM_URL,
headers={"Authorization": f"bearer {PROM_ACCESS_TOKEN}"},
disable_ssl=True,
)
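Before issuing thousands of queries, it can be useful to verify that the connector actually reaches the endpoint. prometheus_api_client provides check_prometheus_connection() for this; a quick sketch (assuming the installed version exposes it):
# Returns True when the URL and token are valid and the endpoint is reachable.
assert pc.check_prometheus_connection()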
Getting the data for a given timestamp and cluster_id for the image layers dataset¶
We first extract the list of archive paths and use the split method to get the timestamp and cluster_id for each archive. We then use a Prometheus query to get the dataframe.
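The six extraction loops in this section and the next differ only in the metric name queried. A small helper like the following could wrap the path parsing and the point-in-time query; this is only an illustrative sketch (fetch_metric_snapshot is a hypothetical name, not part of the original notebook), and the cells below keep the original explicit loops.
def fetch_metric_snapshot(pc, metric_name, archive_path, date=dt.date(2021, 5, 12)):
    """Query `metric_name` for the cluster and timestamp encoded in `archive_path`.

    Assumes the layout used below: the path component at index 3 is the
    cluster_id and the file name stem is an HHMMSS string.
    """
    stem = archive_path.split("/")[-1].split(".")[0]
    cluster_id = archive_path.split("/")[3]
    ts = dt.datetime(
        year=date.year,
        month=date.month,
        day=date.day,
        hour=int(stem[0:2]),
        minute=int(stem[2:4]),
        second=int(stem[4:6]),
        tzinfo=dt.timezone.utc,
    )
    return MetricSnapshotDataFrame(
        pc.custom_query(
            f'{metric_name}{{_id="{cluster_id}"}}',
            params={"time": ts.timestamp()},
        )
    )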
Metric name: workload:cpu_usage_cores:sum¶
# listing the archive path
image_archive_path = image_layers_df.archive_path.unique()
df_image_cpu = pd.DataFrame()
for path in tqdm(image_archive_path[:]):
pos = path.split("/")[-1].split(".")[0]
hours = int(pos[0:2])
mins = int(pos[2:4])
sec = int(pos[4:6])
cluster_pos = path.split("/")[3]
ts = dt.datetime(
year=2021,
month=5,
day=12,
hour=hours,
minute=mins,
second=sec,
tzinfo=dt.timezone.utc,
)
df = MetricSnapshotDataFrame(
pc.custom_query(
f'workload:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
params={"time": ts.timestamp()},
)
)
df_image_cpu = df_image_cpu.append(df)
df_image_cpu.head()
100%|██████████| 2802/2802 [00:43<00:00, 64.18it/s]
__name__ | _id | prometheus | receive | tenant_id | timestamp | value | |
---|---|---|---|---|---|---|---|
0 | workload:cpu_usage_cores:sum | 00003d61-9db1-4757-9cd1-84df271daeb9 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620860062 | 0.03230751556666692 |
0 | workload:cpu_usage_cores:sum | 00351e6e-53ce-465e-9493-cf0cd2367049 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852056 | 0.030943035959259964 |
0 | workload:cpu_usage_cores:sum | 003ba133-e754-4d5a-bc57-675b386d1f05 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852942 | 0.23523091000370105 |
0 | workload:cpu_usage_cores:sum | 00479ead-b7fc-49c2-ae20-3990a9b3d08c | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620855643 | 0.029065674940740756 |
0 | workload:cpu_usage_cores:sum | 009ef697-17c5-431e-a5f5-42d93d9c6a5a | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620855709 | 0.0035705586481481486 |
print(
"Total number of cluster_id in original image layer dataset :",
image_layers_df.cluster_id.nunique(),
)
print("Number of cluster_id for the extracted dataset :", df_image_cpu._id.nunique())
print("Number of timestamps :", df_image_cpu.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 1850
Number of timestamps : 1791
For the metric workload:cpu_usage_cores:sum, we observe that, out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could only extract information for 1850 cluster_ids (~66%) from the telemetry database.
Metric name: workload:memory_usage_bytes:sum¶
df_image_mem = pd.DataFrame()
for path in tqdm(image_archive_path[:]):
pos = path.split("/")[-1].split(".")[0]
hours = int(pos[0:2])
mins = int(pos[2:4])
sec = int(pos[4:6])
cluster_pos = path.split("/")[3]
ts = dt.datetime(
year=2021,
month=5,
day=12,
hour=hours,
minute=mins,
second=sec,
tzinfo=dt.timezone.utc,
)
df = MetricSnapshotDataFrame(
pc.custom_query(
f'workload:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
params={"time": ts.timestamp()},
)
)
df_image_mem = df_image_mem.append(df)
df_image_mem.head()
100%|██████████| 2802/2802 [01:12<00:00, 38.68it/s]
__name__ | _id | prometheus | receive | tenant_id | timestamp | value | |
---|---|---|---|---|---|---|---|
0 | workload:memory_usage_bytes:sum | 00003d61-9db1-4757-9cd1-84df271daeb9 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620860062 | 389971968 |
0 | workload:memory_usage_bytes:sum | 00351e6e-53ce-465e-9493-cf0cd2367049 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852056 | 385421312 |
0 | workload:memory_usage_bytes:sum | 003ba133-e754-4d5a-bc57-675b386d1f05 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852942 | 1190780928 |
0 | workload:memory_usage_bytes:sum | 00479ead-b7fc-49c2-ae20-3990a9b3d08c | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620855643 | 269168640 |
0 | workload:memory_usage_bytes:sum | 009ef697-17c5-431e-a5f5-42d93d9c6a5a | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620855709 | 83660800 |
print(
"Total number of cluster_id in original image layer dataset :",
image_layers_df.cluster_id.nunique(),
)
print("Number of cluster_id for the extracted dataset :", df_image_mem._id.nunique())
print("Number of timestamps :", df_image_mem.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 1805
Number of timestamps : 1749
For the metric workload:memory_usage_bytes:sum, we observe that, out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could only extract information for 1805 cluster_ids (~64.4%) from the telemetry database.
Metric name: openshift:cpu_usage_cores:sum¶
df_image_open_cpu = pd.DataFrame()
for path in tqdm(image_archive_path[:]):
pos = path.split("/")[-1].split(".")[0]
hours = int(pos[0:2])
mins = int(pos[2:4])
sec = int(pos[4:6])
cluster_pos = path.split("/")[3]
ts = dt.datetime(
year=2021,
month=5,
day=12,
hour=hours,
minute=mins,
second=sec,
tzinfo=dt.timezone.utc,
)
df = MetricSnapshotDataFrame(
pc.custom_query(
f'openshift:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
params={"time": ts.timestamp()},
)
)
df_image_open_cpu = df_image_open_cpu.append(df)
df_image_open_cpu.head()
100%|██████████| 2802/2802 [01:08<00:00, 40.76it/s]
__name__ | _id | prometheus | receive | tenant_id | timestamp | value | |
---|---|---|---|---|---|---|---|
0 | openshift:cpu_usage_cores:sum | 00003d61-9db1-4757-9cd1-84df271daeb9 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620860062 | 1.9291393691359602 |
0 | openshift:cpu_usage_cores:sum | 00351e6e-53ce-465e-9493-cf0cd2367049 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852056 | 1.7056283926121354 |
0 | openshift:cpu_usage_cores:sum | 003ba133-e754-4d5a-bc57-675b386d1f05 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852942 | 2.5198167090408625 |
0 | openshift:cpu_usage_cores:sum | 00479ead-b7fc-49c2-ae20-3990a9b3d08c | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620855643 | 1.6697914679164034 |
0 | openshift:cpu_usage_cores:sum | 009ef697-17c5-431e-a5f5-42d93d9c6a5a | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620855709 | 3.497572298494714 |
print(
"Total number of cluster_id in original image layer dataset :",
image_layers_df.cluster_id.nunique(),
)
print(
"Number of cluster_id for the extracted dataset :", df_image_open_cpu._id.nunique()
)
print("Number of timestamps :", df_image_open_cpu.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 1845
Number of timestamps : 1786
For the metric openshift:cpu_usage_cores:sum, we observe that, out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could only extract information for 1845 cluster_ids (~65.8%) from the telemetry database.
Metric name: openshift:memory_usage_bytes:sum¶
df_image_open_mem = pd.DataFrame()
for path in tqdm(image_archive_path[:]):
pos = path.split("/")[-1].split(".")[0]
hours = int(pos[0:2])
mins = int(pos[2:4])
sec = int(pos[4:6])
cluster_pos = path.split("/")[3]
ts = dt.datetime(
year=2021,
month=5,
day=12,
hour=hours,
minute=mins,
second=sec,
tzinfo=dt.timezone.utc,
)
df = MetricSnapshotDataFrame(
pc.custom_query(
f'openshift:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
params={"time": ts.timestamp()},
)
)
df_image_open_mem = df_image_open_mem.append(df)
df_image_open_mem.head()
100%|██████████| 2802/2802 [01:06<00:00, 42.11it/s]
__name__ | _id | prometheus | receive | tenant_id | timestamp | value | |
---|---|---|---|---|---|---|---|
0 | openshift:memory_usage_bytes:sum | 00003d61-9db1-4757-9cd1-84df271daeb9 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620860062 | 14560608256 |
0 | openshift:memory_usage_bytes:sum | 00351e6e-53ce-465e-9493-cf0cd2367049 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852056 | 14386823168 |
0 | openshift:memory_usage_bytes:sum | 003ba133-e754-4d5a-bc57-675b386d1f05 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852942 | 26840170496 |
0 | openshift:memory_usage_bytes:sum | 00479ead-b7fc-49c2-ae20-3990a9b3d08c | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620855643 | 13228613632 |
0 | openshift:memory_usage_bytes:sum | 009ef697-17c5-431e-a5f5-42d93d9c6a5a | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620855709 | 26006728704 |
print(
"Total number of cluster_id in original image layer dataset :",
image_layers_df.cluster_id.nunique(),
)
print(
"Number of cluster_id for the extracted dataset :", df_image_open_mem._id.nunique()
)
print("Number of timestamps :", df_image_open_mem.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 1805
Number of timestamps : 1749
For the metric openshift:memory_usage_bytes:sum, we observe that, out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could only extract information for 1805 cluster_ids (~64.4%) from the telemetry database.
Metric name: cluster:cpu_usage_cores:sum¶
df_image_cluster_cpu = pd.DataFrame()
for path in tqdm(image_archive_path[:]):
pos = path.split("/")[-1].split(".")[0]
hours = int(pos[0:2])
mins = int(pos[2:4])
sec = int(pos[4:6])
cluster_pos = path.split("/")[3]
ts = dt.datetime(
year=2021,
month=5,
day=12,
hour=hours,
minute=mins,
second=sec,
tzinfo=dt.timezone.utc,
)
df = MetricSnapshotDataFrame(
pc.custom_query(
f'cluster:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
params={"time": ts.timestamp()},
)
)
df_image_cluster_cpu = df_image_cluster_cpu.append(df)
df_image_cluster_cpu.head()
100%|██████████| 2802/2802 [01:18<00:00, 35.50it/s]
__name__ | _id | prometheus | receive | tenant_id | timestamp | value | |
---|---|---|---|---|---|---|---|
0 | cluster:cpu_usage_cores:sum | 00003d61-9db1-4757-9cd1-84df271daeb9 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620860062 | 1.9614468847026272 |
0 | cluster:cpu_usage_cores:sum | 002663ad-bcf4-4c7c-9530-ecb351fe4001 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620825377 | 4.345714285714284 |
0 | cluster:cpu_usage_cores:sum | 0030e092-3def-4e7a-96b7-17d45ca102fd | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620812523 | 11.39365149650794 |
0 | cluster:cpu_usage_cores:sum | 00351e6e-53ce-465e-9493-cf0cd2367049 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852056 | 1.7365714285713953 |
0 | cluster:cpu_usage_cores:sum | 003ba133-e754-4d5a-bc57-675b386d1f05 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852942 | 2.7550476190445634 |
print(
"Total number of cluster_id in original image layer dataset :",
image_layers_df.cluster_id.nunique(),
)
print(
"Number of cluster_id for the extracted dataset :",
df_image_cluster_cpu._id.nunique(),
)
print("Number of timestamps :", df_image_cluster_cpu.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 2624
Number of timestamps : 2535
For the metric cluster:cpu_usage_cores:sum, we observe that, out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could extract information for 2624 cluster_ids (~93.6%) from the telemetry database.
Metric name: cluster:memory_usage_bytes:sum¶
df_image_cluster_mem = pd.DataFrame()
for path in tqdm(image_archive_path[:]):
pos = path.split("/")[-1].split(".")[0]
hours = int(pos[0:2])
mins = int(pos[2:4])
sec = int(pos[4:6])
cluster_pos = path.split("/")[3]
ts = dt.datetime(
year=2021,
month=5,
day=12,
hour=hours,
minute=mins,
second=sec,
tzinfo=dt.timezone.utc,
)
df = MetricSnapshotDataFrame(
pc.custom_query(
f'cluster:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
params={"time": ts.timestamp()},
)
)
df_image_cluster_mem = df_image_cluster_mem.append(df)
df_image_cluster_mem.head()
100%|██████████| 2802/2802 [01:04<00:00, 43.19it/s]
__name__ | _id | prometheus | receive | tenant_id | timestamp | value | |
---|---|---|---|---|---|---|---|
0 | cluster:memory_usage_bytes:sum | 00003d61-9db1-4757-9cd1-84df271daeb9 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620860062 | 14950580224 |
0 | cluster:memory_usage_bytes:sum | 002663ad-bcf4-4c7c-9530-ecb351fe4001 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620825377 | 25248690176 |
0 | cluster:memory_usage_bytes:sum | 0030e092-3def-4e7a-96b7-17d45ca102fd | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620812523 | 29709115392 |
0 | cluster:memory_usage_bytes:sum | 00351e6e-53ce-465e-9493-cf0cd2367049 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852056 | 14772244480 |
0 | cluster:memory_usage_bytes:sum | 003ba133-e754-4d5a-bc57-675b386d1f05 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852942 | 28030951424 |
print(
"Total number of cluster_id in original image layer dataset :",
image_layers_df.cluster_id.nunique(),
)
print(
"Number of cluster_id for the extracted dataset :",
df_image_cluster_mem._id.nunique(),
)
print("Number of timestamps :", df_image_cluster_mem.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 2673
Number of timestamps : 2581
For the metric cluster:memory_usage_bytes:sum, we observe that, out of the 2802 cluster_ids in the image layers dataset of the Insights Operator workload, we could extract information for 2673 cluster_ids (~95.3%) from the telemetry database.
In the next section, we will do an outer merge of all six datasets above. But first, since the value column has the same name in each dataset, we rename it per metric.
Merging Datasets¶
Renaming the value column of each dataset:
df_image_cpu.rename(
columns={"value": "value_workload:cpu_usage_cores:sum"}, inplace=True
)
df_image_mem.rename(
columns={"value": "value_workload:memory_usage_bytes:sum"}, inplace=True
)
df_image_open_cpu.rename(
columns={"value": "value_openshift:cpu_usage_cores:sum"}, inplace=True
)
df_image_open_mem.rename(
columns={"value": "value_openshift:memory_usage_bytes:sum"}, inplace=True
)
df_image_cluster_cpu.rename(
columns={"value": "value_cluster:cpu_usage_cores:sum"}, inplace=True
)
df_image_cluster_mem.rename(
columns={"value": "value_cluster:memory_usage_bytes:sum"}, inplace=True
)
Merging the six datasets
# Arranging the dataset for merging
df_image_cpu_merge = df_image_cpu[
["_id", "timestamp", "value_workload:cpu_usage_cores:sum"]
]
df_image_mem_merge = df_image_mem[
["_id", "timestamp", "value_workload:memory_usage_bytes:sum"]
]
df_image_open_cpu_merge = df_image_open_cpu[
["_id", "timestamp", "value_openshift:cpu_usage_cores:sum"]
]
df_image_open_mem_merge = df_image_open_mem[
["_id", "timestamp", "value_openshift:memory_usage_bytes:sum"]
]
df_image_cluster_cpu_merge = df_image_cluster_cpu[
["_id", "timestamp", "value_cluster:cpu_usage_cores:sum"]
]
df_image_cluster_mem_merge = df_image_cluster_mem[
["_id", "timestamp", "value_cluster:memory_usage_bytes:sum"]
]
# Merging
merge1 = pd.merge(
df_image_cpu_merge, df_image_mem_merge, on=["_id", "timestamp"], how="outer"
)
merge2 = pd.merge(merge1, df_image_open_cpu_merge, on=["_id", "timestamp"], how="outer")
merge3 = pd.merge(merge2, df_image_open_mem_merge, on=["_id", "timestamp"], how="outer")
merge4 = pd.merge(
merge3, df_image_cluster_cpu_merge, on=["_id", "timestamp"], how="outer"
)
telemeter_image_data = pd.merge(
merge4, df_image_cluster_mem_merge, on=["_id", "timestamp"], how="outer"
)
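The chained pd.merge calls above could equivalently be written with functools.reduce over the six prepared frames; a sketch for comparison (telemeter_image_data_alt is a hypothetical name, not part of the original notebook):
from functools import reduce

frames = [
    df_image_cpu_merge,
    df_image_mem_merge,
    df_image_open_cpu_merge,
    df_image_open_mem_merge,
    df_image_cluster_cpu_merge,
    df_image_cluster_mem_merge,
]
# Outer-join every frame on the shared (_id, timestamp) key pair.
telemeter_image_data_alt = reduce(
    lambda left, right: pd.merge(left, right, on=["_id", "timestamp"], how="outer"),
    frames,
)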
telemeter_image_data.head()
_id | timestamp | value_workload:cpu_usage_cores:sum | value_workload:memory_usage_bytes:sum | value_openshift:cpu_usage_cores:sum | value_openshift:memory_usage_bytes:sum | value_cluster:cpu_usage_cores:sum | value_cluster:memory_usage_bytes:sum | |
---|---|---|---|---|---|---|---|---|
0 | 00003d61-9db1-4757-9cd1-84df271daeb9 | 1620860062 | 0.03230751556666692 | 389971968 | 1.9291393691359602 | 14560608256 | 1.9614468847026272 | 14950580224 |
1 | 00351e6e-53ce-465e-9493-cf0cd2367049 | 1620852056 | 0.030943035959259964 | 385421312 | 1.7056283926121354 | 14386823168 | 1.7365714285713953 | 14772244480 |
2 | 003ba133-e754-4d5a-bc57-675b386d1f05 | 1620852942 | 0.23523091000370105 | 1190780928 | 2.5198167090408625 | 26840170496 | 2.7550476190445634 | 28030951424 |
3 | 00479ead-b7fc-49c2-ae20-3990a9b3d08c | 1620855643 | 0.029065674940740756 | 269168640 | 1.6697914679164034 | 13228613632 | 1.6988571428571442 | 13497782272 |
4 | 009ef697-17c5-431e-a5f5-42d93d9c6a5a | 1620855709 | 0.0035705586481481486 | 83660800 | 3.497572298494714 | 26006728704 | 3.501142857142862 | 26090389504 |
Now, we save the merged data in the DH-PLAYPEN bucket.
# Uploading the merged telemetry dataset to the DH-PLAYPEN bucket
parquet_image_buffer = io.BytesIO()
telemeter_image_data.to_parquet(parquet_image_buffer)
s3_obj = s3.Object(
s3_bucket, "ccx/fingerprinting/image_layers/telemeter_image_data.parquet"
)
status = s3_obj.put(Body=parquet_image_buffer.getvalue())
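To confirm the upload, the parquet can be read straight back from the same key; a quick sanity-check sketch (check_buffer is a hypothetical name, not part of the original notebook):
# Read the uploaded parquet back from the bucket and compare shapes.
check_buffer = io.BytesIO()
s3.Object(
    s3_bucket, "ccx/fingerprinting/image_layers/telemeter_image_data.parquet"
).download_fileobj(check_buffer)
assert pd.read_parquet(check_buffer).shape == telemeter_image_data.shape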
print(
"Total number of cluster_id in original image layer dataset :",
image_layers_df.cluster_id.nunique(),
)
print(
"Number of cluster_id for the extracted dataset :",
telemeter_image_data._id.nunique(),
)
print("Number of timestamps :", telemeter_image_data.timestamp.nunique())
Total number of cluster_id in original image layer dataset : 2802
Number of cluster_id for the extracted dataset : 2673
Number of timestamps : 2581
Hence, we observe that we have at least one metric value for 2673 cluster_ids, i.e. ~95.3% of the total, from the image layers dataset.
Getting the data for a given timestamp and cluster_id for the container dataset¶
We first extract the list of archive paths and use the split method to get the timestamp and cluster_id for each archive. We then use a Prometheus query to get the dataframe.
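As with the image layers dataset, the per-metric loops below could be collapsed by reusing the fetch_metric_snapshot helper sketched in the image layers section; an illustrative alternative (cont_paths and container_metric_frames are hypothetical names), while the cells below keep the original explicit loops:
metric_names = [
    "workload:cpu_usage_cores:sum",
    "workload:memory_usage_bytes:sum",
    "openshift:cpu_usage_cores:sum",
    "openshift:memory_usage_bytes:sum",
    "cluster:cpu_usage_cores:sum",
    "cluster:memory_usage_bytes:sum",
]
cont_paths = containers_df.archive_path.unique()
# One snapshot dataframe per metric, keyed by metric name.
container_metric_frames = {
    metric: pd.concat(
        [fetch_metric_snapshot(pc, metric, path) for path in cont_paths],
        ignore_index=True,
    )
    for metric in metric_names
}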
Metric name: workload:cpu_usage_cores:sum¶
# listing the archive path
cont_archive_path = containers_df.archive_path.unique()
df_cont_cpu = pd.DataFrame()
for path in tqdm(cont_archive_path[:]):
pos = path.split("/")[-1].split(".")[0]
hours = int(pos[0:2])
mins = int(pos[2:4])
sec = int(pos[4:6])
cluster_pos = path.split("/")[3]
ts = dt.datetime(
year=2021,
month=5,
day=12,
hour=hours,
minute=mins,
second=sec,
tzinfo=dt.timezone.utc,
)
df = MetricSnapshotDataFrame(
pc.custom_query(
f'workload:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
params={"time": ts.timestamp()},
)
)
df_cont_cpu = df_cont_cpu.append(df)
df_cont_cpu.head()
100%|██████████| 2927/2927 [00:37<00:00, 78.43it/s]
__name__ | _id | prometheus | receive | tenant_id | timestamp | value | |
---|---|---|---|---|---|---|---|
0 | workload:cpu_usage_cores:sum | 00003d61-9db1-4757-9cd1-84df271daeb9 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620860062 | 0.03230751556666692 |
0 | workload:cpu_usage_cores:sum | 00351e6e-53ce-465e-9493-cf0cd2367049 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852056 | 0.030943035959259964 |
0 | workload:cpu_usage_cores:sum | 003ba133-e754-4d5a-bc57-675b386d1f05 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852942 | 0.23523091000370105 |
0 | workload:cpu_usage_cores:sum | 00479ead-b7fc-49c2-ae20-3990a9b3d08c | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620855643 | 0.029065674940740756 |
0 | workload:cpu_usage_cores:sum | 009ef697-17c5-431e-a5f5-42d93d9c6a5a | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620855709 | 0.0035705586481481486 |
print(
"Total number of cluster_id in original image layer dataset :",
containers_df.cluster_id.nunique(),
)
print("Number of cluster_id for the extracted dataset :", df_cont_cpu._id.nunique())
print("Number of timestamps :", df_cont_cpu.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 1857
Number of timestamps : 1798
For the metric workload:cpu_usage_cores:sum, we observe that, out of the 2927 cluster_ids in the container dataset of the Insights Operator workload, we could only extract information for 1857 cluster_ids (~63.4%) from the telemetry database.
Metric name: workload:memory_usage_bytes:sum¶
df_cont_mem = pd.DataFrame()
for path in tqdm(cont_archive_path[:]):
pos = path.split("/")[-1].split(".")[0]
hours = int(pos[0:2])
mins = int(pos[2:4])
sec = int(pos[4:6])
cluster_pos = path.split("/")[3]
ts = dt.datetime(
year=2021,
month=5,
day=12,
hour=hours,
minute=mins,
second=sec,
tzinfo=dt.timezone.utc,
)
df = MetricSnapshotDataFrame(
pc.custom_query(
f'workload:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
params={"time": ts.timestamp()},
)
)
df_cont_mem = df_cont_mem.append(df)
df_cont_mem.head()
100%|██████████| 2927/2927 [00:35<00:00, 83.29it/s]
__name__ | _id | prometheus | receive | tenant_id | timestamp | value | |
---|---|---|---|---|---|---|---|
0 | workload:memory_usage_bytes:sum | 00003d61-9db1-4757-9cd1-84df271daeb9 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620860062 | 389971968 |
0 | workload:memory_usage_bytes:sum | 00351e6e-53ce-465e-9493-cf0cd2367049 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852056 | 385421312 |
0 | workload:memory_usage_bytes:sum | 003ba133-e754-4d5a-bc57-675b386d1f05 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852942 | 1190780928 |
0 | workload:memory_usage_bytes:sum | 00479ead-b7fc-49c2-ae20-3990a9b3d08c | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620855643 | 269168640 |
0 | workload:memory_usage_bytes:sum | 009ef697-17c5-431e-a5f5-42d93d9c6a5a | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620855709 | 83660800 |
print(
"Total number of cluster_id in original image layer dataset :",
containers_df.cluster_id.nunique(),
)
print("Number of cluster_id for the extracted dataset :", df_cont_mem._id.nunique())
print("Number of timestamps :", df_cont_mem.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 1811
Number of timestamps : 1755
For the metric workload:memory_usage_bytes:sum, we observe that, out of the 2927 cluster_ids in the container dataset of the Insights Operator workload, we could only extract information for 1811 cluster_ids (~61.8%) from the telemetry database.
Metric name: openshift:cpu_usage_cores:sum¶
df_cont_open_cpu = pd.DataFrame()
for path in tqdm(cont_archive_path[:]):
pos = path.split("/")[-1].split(".")[0]
hours = int(pos[0:2])
mins = int(pos[2:4])
sec = int(pos[4:6])
cluster_pos = path.split("/")[3]
ts = dt.datetime(
year=2021,
month=5,
day=12,
hour=hours,
minute=mins,
second=sec,
tzinfo=dt.timezone.utc,
)
df = MetricSnapshotDataFrame(
pc.custom_query(
f'openshift:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
params={"time": ts.timestamp()},
)
)
df_cont_open_cpu = df_cont_open_cpu.append(df)
df_cont_open_cpu.head()
100%|██████████| 2927/2927 [00:43<00:00, 66.68it/s]
__name__ | _id | prometheus | receive | tenant_id | timestamp | value | |
---|---|---|---|---|---|---|---|
0 | openshift:cpu_usage_cores:sum | 00003d61-9db1-4757-9cd1-84df271daeb9 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620860062 | 1.9291393691359602 |
0 | openshift:cpu_usage_cores:sum | 00351e6e-53ce-465e-9493-cf0cd2367049 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852056 | 1.7056283926121354 |
0 | openshift:cpu_usage_cores:sum | 003ba133-e754-4d5a-bc57-675b386d1f05 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852942 | 2.5198167090408625 |
0 | openshift:cpu_usage_cores:sum | 00479ead-b7fc-49c2-ae20-3990a9b3d08c | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620855643 | 1.6697914679164034 |
0 | openshift:cpu_usage_cores:sum | 009ef697-17c5-431e-a5f5-42d93d9c6a5a | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620855709 | 3.497572298494714 |
print(
"Total number of cluster_id in original image layer dataset :",
containers_df.cluster_id.nunique(),
)
print(
"Number of cluster_id for the extracted dataset :", df_cont_open_cpu._id.nunique()
)
print("Number of timestamps :", df_cont_open_cpu.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 1852
Number of timestamps : 1793
For the metric openshift:cpu_usage_cores:sum, we observe that, out of the 2927 cluster_ids in the container dataset of the Insights Operator workload, we could only extract information for 1852 cluster_ids (~63.2%) from the telemetry database.
Metric name: openshift:memory_usage_bytes:sum¶
df_cont_open_mem = pd.DataFrame()
for path in tqdm(cont_archive_path[:]):
pos = path.split("/")[-1].split(".")[0]
hours = int(pos[0:2])
mins = int(pos[2:4])
sec = int(pos[4:6])
cluster_pos = path.split("/")[3]
ts = dt.datetime(
year=2021,
month=5,
day=12,
hour=hours,
minute=mins,
second=sec,
tzinfo=dt.timezone.utc,
)
df = MetricSnapshotDataFrame(
pc.custom_query(
f'openshift:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
params={"time": ts.timestamp()},
)
)
df_cont_open_mem = df_cont_open_mem.append(df)
df_cont_open_mem.head()
100%|██████████| 2927/2927 [00:51<00:00, 56.81it/s]
__name__ | _id | prometheus | receive | tenant_id | timestamp | value | |
---|---|---|---|---|---|---|---|
0 | openshift:memory_usage_bytes:sum | 00003d61-9db1-4757-9cd1-84df271daeb9 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620860062 | 14560608256 |
0 | openshift:memory_usage_bytes:sum | 00351e6e-53ce-465e-9493-cf0cd2367049 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852056 | 14386823168 |
0 | openshift:memory_usage_bytes:sum | 003ba133-e754-4d5a-bc57-675b386d1f05 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852942 | 26840170496 |
0 | openshift:memory_usage_bytes:sum | 00479ead-b7fc-49c2-ae20-3990a9b3d08c | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620855643 | 13228613632 |
0 | openshift:memory_usage_bytes:sum | 009ef697-17c5-431e-a5f5-42d93d9c6a5a | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620855709 | 26006728704 |
print(
"Total number of cluster_id in original image layer dataset :",
containers_df.cluster_id.nunique(),
)
print(
"Number of cluster_id for the extracted dataset :", df_cont_open_mem._id.nunique()
)
print("Number of timestamps :", df_cont_open_mem.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 1811
Number of timestamps : 1755
For the metric openshift:memory_usage_bytes:sum, we observe that, out of the 2927 cluster_ids in the container dataset of the Insights Operator workload, we could only extract information for 1811 cluster_ids (~61.8%) from the telemetry database.
Metric name: cluster:cpu_usage_cores:sum¶
df_cont_cluster_cpu = pd.DataFrame()
for path in tqdm(cont_archive_path[:]):
pos = path.split("/")[-1].split(".")[0]
hours = int(pos[0:2])
mins = int(pos[2:4])
sec = int(pos[4:6])
cluster_pos = path.split("/")[3]
ts = dt.datetime(
year=2021,
month=5,
day=12,
hour=hours,
minute=mins,
second=sec,
tzinfo=dt.timezone.utc,
)
df = MetricSnapshotDataFrame(
pc.custom_query(
f'cluster:cpu_usage_cores:sum{{_id="{cluster_pos}"}}',
params={"time": ts.timestamp()},
)
)
df_cont_cluster_cpu = df_cont_cluster_cpu.append(df)
df_cont_cluster_cpu.head()
100%|██████████| 2927/2927 [01:00<00:00, 48.78it/s]
__name__ | _id | prometheus | receive | tenant_id | timestamp | value | |
---|---|---|---|---|---|---|---|
0 | cluster:cpu_usage_cores:sum | 00003d61-9db1-4757-9cd1-84df271daeb9 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620860062 | 1.9614468847026272 |
0 | cluster:cpu_usage_cores:sum | 002663ad-bcf4-4c7c-9530-ecb351fe4001 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620825377 | 4.345714285714284 |
0 | cluster:cpu_usage_cores:sum | 0030e092-3def-4e7a-96b7-17d45ca102fd | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620812523 | 11.39365149650794 |
0 | cluster:cpu_usage_cores:sum | 00351e6e-53ce-465e-9493-cf0cd2367049 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852056 | 1.7365714285713953 |
0 | cluster:cpu_usage_cores:sum | 003ba133-e754-4d5a-bc57-675b386d1f05 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852942 | 2.7550476190445634 |
print(
"Total number of cluster_id in original image layer dataset :",
containers_df.cluster_id.nunique(),
)
print(
"Number of cluster_id for the extracted dataset :",
df_cont_cluster_cpu._id.nunique(),
)
print("Number of timestamps :", df_cont_cluster_cpu.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 2649
Number of timestamps : 2560
For the metric cluster:cpu_usage_cores:sum, we observe that, out of the 2927 cluster_ids in the container dataset of the Insights Operator workload, we could extract information for 2649 cluster_ids (~90.5%) from the telemetry database.
Metric name: cluster:memory_usage_bytes:sum¶
df_cont_cluster_mem = pd.DataFrame()
for path in tqdm(cont_archive_path[:]):
pos = path.split("/")[-1].split(".")[0]
hours = int(pos[0:2])
mins = int(pos[2:4])
sec = int(pos[4:6])
cluster_pos = path.split("/")[3]
ts = dt.datetime(
year=2021,
month=5,
day=12,
hour=hours,
minute=mins,
second=sec,
tzinfo=dt.timezone.utc,
)
df = MetricSnapshotDataFrame(
pc.custom_query(
f'cluster:memory_usage_bytes:sum{{_id="{cluster_pos}"}}',
params={"time": ts.timestamp()},
)
)
df_cont_cluster_mem = df_cont_cluster_mem.append(df)
df_cont_cluster_mem.head()
100%|██████████| 2927/2927 [00:50<00:00, 57.88it/s]
__name__ | _id | prometheus | receive | tenant_id | timestamp | value | |
---|---|---|---|---|---|---|---|
0 | cluster:memory_usage_bytes:sum | 00003d61-9db1-4757-9cd1-84df271daeb9 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620860062 | 14950580224 |
0 | cluster:memory_usage_bytes:sum | 002663ad-bcf4-4c7c-9530-ecb351fe4001 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620825377 | 25248690176 |
0 | cluster:memory_usage_bytes:sum | 0030e092-3def-4e7a-96b7-17d45ca102fd | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620812523 | 29709115392 |
0 | cluster:memory_usage_bytes:sum | 00351e6e-53ce-465e-9493-cf0cd2367049 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852056 | 14772244480 |
0 | cluster:memory_usage_bytes:sum | 003ba133-e754-4d5a-bc57-675b386d1f05 | openshift-monitoring/k8s | true | FB870BF3-9F3A-44FF-9BF7-D7A047A52F43 | 1620852942 | 28030951424 |
print(
"Total number of cluster_id in original image layer dataset :",
containers_df.cluster_id.nunique(),
)
print(
"Number of cluster_id for the extracted dataset :",
df_cont_cluster_mem._id.nunique(),
)
print("Number of timestamps :", df_cont_cluster_mem.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 2697
Number of timestamps : 2605
For the metric cluster:memory_usage_bytes:sum, we observe that, out of the 2927 cluster_ids in the container dataset of the Insights Operator workload, we could extract information for 2697 cluster_ids (~92.1%) from the telemetry database.
In the next section, we will do an outer merge of all six datasets above. But first, since the value column has the same name in each dataset, we rename it per metric.
Merging Datasets¶
Renaming the value column of each dataset:
df_cont_cpu.rename(
columns={"value": "value_workload:cpu_usage_cores:sum"}, inplace=True
)
df_cont_mem.rename(
columns={"value": "value_workload:memory_usage_bytes:sum"}, inplace=True
)
df_cont_open_cpu.rename(
columns={"value": "value_openshift:cpu_usage_cores:sum"}, inplace=True
)
df_cont_open_mem.rename(
columns={"value": "value_openshift:memory_usage_bytes:sum"}, inplace=True
)
df_cont_cluster_cpu.rename(
columns={"value": "value_cluster:cpu_usage_cores:sum"}, inplace=True
)
df_cont_cluster_mem.rename(
columns={"value": "value_cluster:memory_usage_bytes:sum"}, inplace=True
)
# Arranging the dataset for merging
df_cont_cpu_merge = df_cont_cpu[
["_id", "timestamp", "value_workload:cpu_usage_cores:sum"]
]
df_cont_mem_merge = df_cont_mem[
["_id", "timestamp", "value_workload:memory_usage_bytes:sum"]
]
df_cont_open_cpu_merge = df_cont_open_cpu[
["_id", "timestamp", "value_openshift:cpu_usage_cores:sum"]
]
df_cont_open_mem_merge = df_cont_open_mem[
["_id", "timestamp", "value_openshift:memory_usage_bytes:sum"]
]
df_cont_cluster_cpu_merge = df_cont_cluster_cpu[
["_id", "timestamp", "value_cluster:cpu_usage_cores:sum"]
]
df_cont_cluster_mem_merge = df_cont_cluster_mem[
["_id", "timestamp", "value_cluster:memory_usage_bytes:sum"]
]
# Merging
merge1 = pd.merge(
df_cont_cpu_merge, df_cont_mem_merge, on=["_id", "timestamp"], how="outer"
)
merge2 = pd.merge(merge1, df_cont_open_cpu_merge, on=["_id", "timestamp"], how="outer")
merge3 = pd.merge(merge2, df_cont_open_mem_merge, on=["_id", "timestamp"], how="outer")
merge4 = pd.merge(
merge3, df_cont_cluster_cpu_merge, on=["_id", "timestamp"], how="outer"
)
telemeter_cont_data = pd.merge(
merge4, df_cont_cluster_mem_merge, on=["_id", "timestamp"], how="outer"
)
telemeter_cont_data.head()
_id | timestamp | value_workload:cpu_usage_cores:sum | value_workload:memory_usage_bytes:sum | value_openshift:cpu_usage_cores:sum | value_openshift:memory_usage_bytes:sum | value_cluster:cpu_usage_cores:sum | value_cluster:memory_usage_bytes:sum | |
---|---|---|---|---|---|---|---|---|
0 | 00003d61-9db1-4757-9cd1-84df271daeb9 | 1620860062 | 0.03230751556666692 | 389971968 | 1.9291393691359602 | 14560608256 | 1.9614468847026272 | 14950580224 |
1 | 00351e6e-53ce-465e-9493-cf0cd2367049 | 1620852056 | 0.030943035959259964 | 385421312 | 1.7056283926121354 | 14386823168 | 1.7365714285713953 | 14772244480 |
2 | 003ba133-e754-4d5a-bc57-675b386d1f05 | 1620852942 | 0.23523091000370105 | 1190780928 | 2.5198167090408625 | 26840170496 | 2.7550476190445634 | 28030951424 |
3 | 00479ead-b7fc-49c2-ae20-3990a9b3d08c | 1620855643 | 0.029065674940740756 | 269168640 | 1.6697914679164034 | 13228613632 | 1.6988571428571442 | 13497782272 |
4 | 009ef697-17c5-431e-a5f5-42d93d9c6a5a | 1620855709 | 0.0035705586481481486 | 83660800 | 3.497572298494714 | 26006728704 | 3.501142857142862 | 26090389504 |
Now, we save the merged dataset in the DH-PLAYPEN bucket.
# Uploading the merged telemetry dataset to the DH-PLAYPEN bucket
parquet_cont_buffer = io.BytesIO()
telemeter_cont_data.to_parquet(parquet_cont_buffer)
s3_obj = s3.Object(
s3_bucket, "ccx/fingerprinting/containers/telemeter_cont_data.parquet"
)
status = s3_obj.put(Body=parquet_cont_buffer.getvalue())
print(
"Total number of cluster_id in original image layer dataset :",
containers_df.cluster_id.nunique(),
)
print(
"Number of cluster_id for the extracted dataset :",
telemeter_cont_data._id.nunique(),
)
print("Number of timestamps :", telemeter_cont_data.timestamp.nunique())
Total number of cluster_id in original container dataset : 2927
Number of cluster_id for the extracted dataset : 2698
Number of timestamps : 2606
Hence, we observe that we have at least one metric value for 2698 cluster_ids, i.e. ~92.1% of the total, from the container dataset.
Conclusion¶
This telemetry information is now available to be integrated with the workload dataset. The next steps are to perform an EDA and then apply clustering techniques to the final integrated dataset.