Getting started: Introducing workload data¶
Before starting any analysis, we first need to understand the dataset we will be working with. The objective of this notebook is to take a quick look at the data and see what kind of information has been captured.
The goal is to collect and analyse workload data from the clusters. The Insights Operator collects workload data from 4.8+ clusters. The data can be found in the Insights Operator archive, which contains two kinds of information: image_layers and containers.
Data Collection¶
Here we load the two datasets from the DH-PLAYPEN bucket.
import io
import boto3
import pandas as pd
import warnings
import os
from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv())
warnings.filterwarnings("ignore")
# CEPH Bucket variables
s3_endpoint_url = os.getenv("S3_ENDPOINT")
s3_access_key = os.getenv("S3_ACCESS_KEY")
s3_secret_key = os.getenv("S3_SECRET_KEY")
s3_bucket = os.getenv("S3_BUCKET")
# s3 resource to communicate with storage
s3 = boto3.resource(
"s3",
endpoint_url=s3_endpoint_url,
aws_access_key_id=s3_access_key,
aws_secret_access_key=s3_secret_key,
)
# access the parquet file as an s3 object
obj1 = s3.Object(
"DH-PLAYPEN", "ccx/fingerprinting/image_layers/date=2021-05-12/2021-05-12.parquet"
)
obj2 = s3.Object(
"DH-PLAYPEN", "ccx/fingerprinting/containers/date=2021-05-12/2021-05-12.parquet"
)
# download the file into the buffer
buffer1 = io.BytesIO()
obj1.download_fileobj(buffer1)
buffer2 = io.BytesIO()
obj2.download_fileobj(buffer2)
# read the buffer and create the dataframe
image_layers_df = pd.read_parquet(buffer1)
containers_df = pd.read_parquet(buffer2)
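The download-then-read pattern above is repeated for each file, so it can be wrapped in a small helper. The following is a minimal sketch; the name `download_to_buffer` is hypothetical and not part of the original notebook. It assumes only that the object passed in exposes a `download_fileobj` method, as a boto3 `s3.Object` does. Rewinding the buffer before returning it is a defensive choice, so that any reader starts from the first byte.

```python
import io


def download_to_buffer(s3_object):
    """Download an S3 object into an in-memory buffer and rewind it.

    `s3_object` is assumed to expose `download_fileobj(fileobj)`,
    as a boto3 `s3.Object` does. This helper is illustrative only.
    """
    buffer = io.BytesIO()
    s3_object.download_fileobj(buffer)
    # Move the cursor back to the start so readers see the full content.
    buffer.seek(0)
    return buffer
```

With this helper, each dataframe could be created in one line, for example `image_layers_df = pd.read_parquet(download_to_buffer(obj1))`.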
Image Layers Dataset¶
An overview of the image layers dataset can be seen in the dataframe below.
image_layers_df.head()
| | cluster_id | image_id | layer_image_id | layer_image_level | first_command | first_arg | archive_path |
---|---|---|---|---|---|---|---|
0 | 00003d61-9db1-4757-9cd1-84df271daeb9 | sha256:337c22cabe530213b14965f9ea69a92dbeb5104... | sha256:9ebb302e1fb002fb643091710dac46f8258781d... | 0 | icTsn2s_EIax | 2v1NneeWoS_9 | archives/compressed/00/00003d61-9db1-4757-9cd1... |
1 | 00003d61-9db1-4757-9cd1-84df271daeb9 | sha256:337c22cabe530213b14965f9ea69a92dbeb5104... | sha256:a74396a32e85c2feeedf76052ed3297859810c8... | 1 | icTsn2s_EIax | 2v1NneeWoS_9 | archives/compressed/00/00003d61-9db1-4757-9cd1... |
2 | 00003d61-9db1-4757-9cd1-84df271daeb9 | sha256:337c22cabe530213b14965f9ea69a92dbeb5104... | sha256:7db62383a7470afbacfc0fab55d5a182e3c5fa2... | 2 | icTsn2s_EIax | 2v1NneeWoS_9 | archives/compressed/00/00003d61-9db1-4757-9cd1... |
3 | 00003d61-9db1-4757-9cd1-84df271daeb9 | sha256:337c22cabe530213b14965f9ea69a92dbeb5104... | sha256:f24250419f728db23957454976d6d38b679a349... | 3 | icTsn2s_EIax | 2v1NneeWoS_9 | archives/compressed/00/00003d61-9db1-4757-9cd1... |
4 | 00003d61-9db1-4757-9cd1-84df271daeb9 | sha256:337c22cabe530213b14965f9ea69a92dbeb5104... | sha256:267f7bb0f5dcf1b83f8ce89831d05f3a44a3abe... | 4 | icTsn2s_EIax | 2v1NneeWoS_9 | archives/compressed/00/00003d61-9db1-4757-9cd1... |
Inspect the Image Layers Data¶
We inspect the image layers data to see the kind of information we have access to.
image_layers_df.iloc[1]
cluster_id 00003d61-9db1-4757-9cd1-84df271daeb9
image_id sha256:337c22cabe530213b14965f9ea69a92dbeb5104...
layer_image_id sha256:a74396a32e85c2feeedf76052ed3297859810c8...
layer_image_level 1
first_command icTsn2s_EIax
first_arg 2v1NneeWoS_9
archive_path archives/compressed/00/00003d61-9db1-4757-9cd1...
Name: 1, dtype: object
Available fields:
cluster_id: id of the cluster
image_id: the ‘sha’ of the image that the container is running.
layer_image_id: the ‘sha’ of an image layer that belongs to the image_id.
layer_image_level: position of the layer within the image.
first_command: first command in that image.
first_arg: first argument in that image. The values of first_command and first_arg are obfuscated, so we do not know which command or argument they represent, but we can still check whether two images run the same command or argument.
archive_path: path to the archive from which the images are extracted.
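To illustrate how these fields can be used together, here is a minimal sketch on a small made-up dataframe. The values in `toy_layers` are invented for illustration and are not taken from the real data. It counts the layers of each image and the number of distinct images that share a first command, which is the kind of comparison that remains possible even though the command values are obfuscated.

```python
import pandas as pd

# Made-up rows that mimic the column layout of image_layers_df.
toy_layers = pd.DataFrame(
    {
        "image_id": ["img-a", "img-a", "img-a", "img-b", "img-b"],
        "layer_image_id": ["l0", "l1", "l2", "l0", "l3"],
        "layer_image_level": [0, 1, 2, 0, 1],
        "first_command": ["cmd-x", "cmd-x", "cmd-x", "cmd-x", "cmd-x"],
    }
)

# Number of distinct layers that make up each image.
layers_per_image = toy_layers.groupby("image_id")["layer_image_id"].nunique()

# Number of distinct images that run each (obfuscated) first command.
images_per_command = toy_layers.groupby("first_command")["image_id"].nunique()

print(layers_per_image)
print(images_per_command)
```

The same two group-by calls should apply to `image_layers_df` directly, assuming its columns match the ones listed above.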
Containers Dataset¶
containers_df.head()
| | cluster_id | namespace | shape | shape_instances | image_id | first_command | first_arg | init_container | archive_path |
---|---|---|---|---|---|---|---|---|---|
0 | 00003d61-9db1-4757-9cd1-84df271daeb9 | 0LiT6ZNtbpYL | sha256:3ecf29979b2722bf4a82a5e7a954e8685820720... | 1 | sha256:f46f210d6023bec16e68340b484a8881ce46d5e... | None | 47DEQpj8HBSa | False | archives/compressed/00/00003d61-9db1-4757-9cd1... |
1 | 00003d61-9db1-4757-9cd1-84df271daeb9 | 0LiT6ZNtbpYL | sha256:3ecf29979b2722bf4a82a5e7a954e8685820720... | 1 | sha256:edb9aaacf421c6dc45b20324e8699cec02f26bf... | n9CdwzVF-cwZ | RNOaw_AuQeIY | False | archives/compressed/00/00003d61-9db1-4757-9cd1... |
2 | 00003d61-9db1-4757-9cd1-84df271daeb9 | 0LiT6ZNtbpYL | sha256:542d007d13008cc1be2dbf03601b954c4452947... | 1 | sha256:a693c315b775c693dc49c19b7f217762676bc28... | b51B0EZ1bw3c | ua-xlwwsvdYd | False | archives/compressed/00/00003d61-9db1-4757-9cd1... |
3 | 00003d61-9db1-4757-9cd1-84df271daeb9 | 0LiT6ZNtbpYL | sha256:542d007d13008cc1be2dbf03601b954c4452947... | 1 | sha256:a693c315b775c693dc49c19b7f217762676bc28... | Cl6kTzfbYztA | None | True | archives/compressed/00/00003d61-9db1-4757-9cd1... |
4 | 00003d61-9db1-4757-9cd1-84df271daeb9 | 0LiT6ZNtbpYL | sha256:542d007d13008cc1be2dbf03601b954c4452947... | 1 | sha256:d9c64d038f16e04c52142bc9e7dfa0645ce7e34... | Cl6kTzfbYztA | None | True | archives/compressed/00/00003d61-9db1-4757-9cd1... |
Inspecting the Container Dataset¶
containers_df.iloc[1]
cluster_id 00003d61-9db1-4757-9cd1-84df271daeb9
namespace 0LiT6ZNtbpYL
shape sha256:3ecf29979b2722bf4a82a5e7a954e8685820720...
shape_instances 1
image_id sha256:edb9aaacf421c6dc45b20324e8699cec02f26bf...
first_command n9CdwzVF-cwZ
first_arg RNOaw_AuQeIY
init_container False
archive_path archives/compressed/00/00003d61-9db1-4757-9cd1...
Name: 1, dtype: object
Available fields:
cluster_id: id of the cluster
namespace: namespace in the cluster
shape: the pod template, that is, the set of containers in a pod. If two pods use the same set of containers with the same commands, they fall into the same shape.
shape_instances: number of pods of that shape.
containers (image_id/first_command/first_arg/init_container): information about each container in the shape: its image_id, first_command and first_arg, and whether it is an init container (init_container).
archive_path: path to the archive for that cluster.
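Since each row describes one container of a shape, the containers that make up a shape can be recovered by grouping. The following is a minimal sketch on made-up data. The values in `toy_containers` are invented, and the assumption that `init_container` is a boolean flag is based on the True/False values visible in the output above.

```python
import pandas as pd

# Made-up rows that mimic the column layout of containers_df.
toy_containers = pd.DataFrame(
    {
        "cluster_id": ["c1", "c1", "c1", "c2"],
        "shape": ["s1", "s1", "s2", "s1"],
        "image_id": ["img-a", "img-b", "img-a", "img-a"],
        "init_container": [False, True, False, False],
    }
)

# For each shape in each cluster: how many containers it has,
# and how many of those are flagged as init containers.
summary = toy_containers.groupby(["cluster_id", "shape"]).agg(
    containers=("image_id", "size"),
    init_containers=("init_container", "sum"),
)

print(summary)
```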
containers_df.groupby(["shape"]).agg(
{"cluster_id": pd.Series.nunique}
).reset_index().sort_values(by=["cluster_id"], ascending=False)
| | shape | cluster_id |
---|---|---|
6315 | sha256:ff62cfd4da3beb77d886f8935a1b7a6aaf54bdb... | 2721 |
2969 | sha256:78fc0fdc25942f43a44b10330813a19f04ea88e... | 2551 |
2040 | sha256:524148cb8d81907984141cb8d210decf75657d7... | 1792 |
2370 | sha256:5fe22686d7266cfc828498c6674f3268fa6bb78... | 1790 |
3728 | sha256:97af185840a1f8c688608aa199bc6a8fb45f9ae... | 1361 |
... | ... | ... |
2908 | sha256:76cde139b6a84f92e5f5d273aaec928589957f4... | 1 |
2909 | sha256:76d83926eb2df6f554f519bfcc9f74904a16b75... | 1 |
2911 | sha256:76e3aa55c87e23ff1c7beef873bcf399b89ca30... | 1 |
2912 | sha256:77063f77b9a5d1513981bbe202ebceeecc5f80f... | 1 |
6328 | sha256:ffe906ed042207a1a05260ecf1c46f93218b830... | 1 |
6329 rows × 2 columns
To get a sense of how shapes are distributed, we group by shape and count the distinct clusters in which each shape appears. From the output above, a few shapes are shared by a large number of clusters (the most common shape appears in 2,721 clusters), while the shapes at the bottom of the list appear in only one cluster each.
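To judge how widespread a shape is, it can help to express the cluster count as a fraction of all clusters. Below is a minimal sketch on made-up data. The values in `toy` are invented, and `cluster_share` is a hypothetical name introduced here. It uses `nunique` on the grouped column, which gives the same counts as the `agg` call above.

```python
import pandas as pd

# Made-up (cluster, shape) pairs; duplicate rows stand for
# several containers belonging to the same shape.
toy = pd.DataFrame(
    {
        "cluster_id": ["c1", "c1", "c2", "c3", "c1"],
        "shape": ["s1", "s1", "s1", "s1", "s2"],
    }
)

# Distinct clusters in which each shape appears, most common first.
clusters_per_shape = (
    toy.groupby("shape")["cluster_id"].nunique().sort_values(ascending=False)
)

# Fraction of all clusters that contain each shape.
cluster_share = clusters_per_shape / toy["cluster_id"].nunique()

print(clusters_per_shape)
print(cluster_share)
```

Applied to `containers_df`, the same two steps would show what fraction of clusters the most common shapes cover.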
Conclusion¶
Our next goal is to perform exploratory data analysis on the dataset to gain insight into the relationships between the features. We then plan to use ML to identify and analyse the types (clusters) of workloads that customers run.