Getting started: Introducing workload data

Before starting any analysis, we need to understand the dataset we will be working with. The objective of this notebook is to take a quick look at the data and see what kind of information has been captured.

The goal is to collect and analyse workload data from the clusters. The Insights Operator collects workload data from clusters at version 4.8 and later. The data can be found in the Insights Operator archive, which holds two kinds of information: image_layers and containers.

Data Collection

Here we import the two datasets from the DH-PLAYPEN bucket.

import io
import boto3
import pandas as pd
import warnings
import os

from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())
warnings.filterwarnings("ignore")
# CEPH Bucket variables
s3_endpoint_url = os.getenv("S3_ENDPOINT")
s3_access_key = os.getenv("S3_ACCESS_KEY")
s3_secret_key = os.getenv("S3_SECRET_KEY")
s3_bucket = os.getenv("S3_BUCKET")

# s3 resource to communicate with storage
s3 = boto3.resource(
    "s3",
    endpoint_url=s3_endpoint_url,
    aws_access_key_id=s3_access_key,
    aws_secret_access_key=s3_secret_key,
)

# access the parquet file as an s3 object

obj1 = s3.Object(
    "DH-PLAYPEN", "ccx/fingerprinting/image_layers/date=2021-05-12/2021-05-12.parquet"
)
obj2 = s3.Object(
    "DH-PLAYPEN", "ccx/fingerprinting/containers/date=2021-05-12/2021-05-12.parquet"
)
# download the file into the buffer
buffer1 = io.BytesIO()
obj1.download_fileobj(buffer1)
buffer2 = io.BytesIO()
obj2.download_fileobj(buffer2)

# read the buffer and create the dataframe
image_layers_df = pd.read_parquet(buffer1)
containers_df = pd.read_parquet(buffer2)
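
The two object keys above differ only in the dataset name, so a small helper can build the key for either dataset on a given day. This is a minimal sketch: it assumes that other days follow the same date=YYYY-MM-DD/YYYY-MM-DD.parquet layout as the 2021-05-12 files, and the names PREFIX and parquet_key are illustrative, not part of the original notebook.

```python
from datetime import date

# Common prefix taken from the two object keys used above.
PREFIX = "ccx/fingerprinting"


def parquet_key(dataset: str, day: date) -> str:
    """Build the object key for one dataset on one day.

    Assumption: every day uses the same layout as the 2021-05-12 keys.
    """
    stamp = day.isoformat()  # e.g. "2021-05-12"
    return f"{PREFIX}/{dataset}/date={stamp}/{stamp}.parquet"


image_layers_key = parquet_key("image_layers", date(2021, 5, 12))
containers_key = parquet_key("containers", date(2021, 5, 12))
print(image_layers_key)
print(containers_key)
```

The returned key can be passed to s3.Object together with the bucket name, in place of the hard-coded strings.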

Image Layers Dataset

An overview of the image layers dataset can be seen in the dataframe below.

image_layers_df.head()
   cluster_id                            image_id                                           layer_image_id                                     layer_image_level  first_command  first_arg     archive_path
0  00003d61-9db1-4757-9cd1-84df271daeb9  sha256:337c22cabe530213b14965f9ea69a92dbeb5104...  sha256:9ebb302e1fb002fb643091710dac46f8258781d...  0                  icTsn2s_EIax   2v1NneeWoS_9  archives/compressed/00/00003d61-9db1-4757-9cd1...
1  00003d61-9db1-4757-9cd1-84df271daeb9  sha256:337c22cabe530213b14965f9ea69a92dbeb5104...  sha256:a74396a32e85c2feeedf76052ed3297859810c8...  1                  icTsn2s_EIax   2v1NneeWoS_9  archives/compressed/00/00003d61-9db1-4757-9cd1...
2  00003d61-9db1-4757-9cd1-84df271daeb9  sha256:337c22cabe530213b14965f9ea69a92dbeb5104...  sha256:7db62383a7470afbacfc0fab55d5a182e3c5fa2...  2                  icTsn2s_EIax   2v1NneeWoS_9  archives/compressed/00/00003d61-9db1-4757-9cd1...
3  00003d61-9db1-4757-9cd1-84df271daeb9  sha256:337c22cabe530213b14965f9ea69a92dbeb5104...  sha256:f24250419f728db23957454976d6d38b679a349...  3                  icTsn2s_EIax   2v1NneeWoS_9  archives/compressed/00/00003d61-9db1-4757-9cd1...
4  00003d61-9db1-4757-9cd1-84df271daeb9  sha256:337c22cabe530213b14965f9ea69a92dbeb5104...  sha256:267f7bb0f5dcf1b83f8ce89831d05f3a44a3abe...  4                  icTsn2s_EIax   2v1NneeWoS_9  archives/compressed/00/00003d61-9db1-4757-9cd1...

Inspect the Image Layers Data

We inspect the image layers data to see the kind of information we have access to.

image_layers_df.iloc[1]
cluster_id                        00003d61-9db1-4757-9cd1-84df271daeb9
image_id             sha256:337c22cabe530213b14965f9ea69a92dbeb5104...
layer_image_id       sha256:a74396a32e85c2feeedf76052ed3297859810c8...
layer_image_level                                                    1
first_command                                             icTsn2s_EIax
first_arg                                                 2v1NneeWoS_9
archive_path         archives/compressed/00/00003d61-9db1-4757-9cd1...
Name: 1, dtype: object

Available fields:

  • cluster_id: id of the cluster

  • image_id: SHA-256 digest of the image that the container is running.

  • layer_image_id: SHA-256 digest of an image layer that belongs to the image identified by image_id.

  • layer_image_level: position of the layer within the image.

  • first_command: first command in that image.

  • first_arg: first argument in that image. The values are obfuscated, so we cannot tell what the command or argument actually is, but we can still check whether two images run the same command or argument.

  • archive_path: path to the archive from which the images are extracted.
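
The fields listed above are enough to rebuild the ordered layer stack of each image and to compare two images. The sketch below uses a toy dataframe with made-up identifiers (img-a, layer-1, cmd-1 and so on are hypothetical); only the column names come from the dataset.

```python
import pandas as pd

# Toy rows with made-up identifiers; only the column names match the dataset.
toy_layers = pd.DataFrame(
    {
        "image_id": ["img-a", "img-a", "img-a", "img-b", "img-b"],
        "layer_image_id": ["layer-1", "layer-2", "layer-3", "layer-1", "layer-4"],
        "layer_image_level": [0, 1, 2, 0, 1],
        "first_command": ["cmd-1", "cmd-1", "cmd-1", "cmd-1", "cmd-1"],
        "first_arg": ["arg-1", "arg-1", "arg-1", "arg-2", "arg-2"],
    }
)

# Ordered list of layers for each image, lowest level first.
layer_stacks = (
    toy_layers.sort_values(["image_id", "layer_image_level"])
    .groupby("image_id")["layer_image_id"]
    .agg(list)
)

# Layers that appear in both images.
shared_layers = set(layer_stacks["img-a"]) & set(layer_stacks["img-b"])

# One (first_command, first_arg) pair per image, used to compare the two images.
entrypoints = toy_layers.drop_duplicates("image_id").set_index("image_id")[
    ["first_command", "first_arg"]
]
same_command = (
    entrypoints.loc["img-a", "first_command"]
    == entrypoints.loc["img-b", "first_command"]
)
same_arg = entrypoints.loc["img-a", "first_arg"] == entrypoints.loc["img-b", "first_arg"]

print(layer_stacks)
print(shared_layers, same_command, same_arg)
```

Because the command and argument values are opaque tokens, equality is the only comparison that makes sense for them.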

Containers Dataset

containers_df.head()
   cluster_id                            namespace     shape                                              shape_instances  image_id                                           first_command  first_arg     init_container  archive_path
0  00003d61-9db1-4757-9cd1-84df271daeb9  0LiT6ZNtbpYL  sha256:3ecf29979b2722bf4a82a5e7a954e8685820720...  1                sha256:f46f210d6023bec16e68340b484a8881ce46d5e...  None           47DEQpj8HBSa  False           archives/compressed/00/00003d61-9db1-4757-9cd1...
1  00003d61-9db1-4757-9cd1-84df271daeb9  0LiT6ZNtbpYL  sha256:3ecf29979b2722bf4a82a5e7a954e8685820720...  1                sha256:edb9aaacf421c6dc45b20324e8699cec02f26bf...  n9CdwzVF-cwZ   RNOaw_AuQeIY  False           archives/compressed/00/00003d61-9db1-4757-9cd1...
2  00003d61-9db1-4757-9cd1-84df271daeb9  0LiT6ZNtbpYL  sha256:542d007d13008cc1be2dbf03601b954c4452947...  1                sha256:a693c315b775c693dc49c19b7f217762676bc28...  b51B0EZ1bw3c   ua-xlwwsvdYd  False           archives/compressed/00/00003d61-9db1-4757-9cd1...
3  00003d61-9db1-4757-9cd1-84df271daeb9  0LiT6ZNtbpYL  sha256:542d007d13008cc1be2dbf03601b954c4452947...  1                sha256:a693c315b775c693dc49c19b7f217762676bc28...  Cl6kTzfbYztA   None          True            archives/compressed/00/00003d61-9db1-4757-9cd1...
4  00003d61-9db1-4757-9cd1-84df271daeb9  0LiT6ZNtbpYL  sha256:542d007d13008cc1be2dbf03601b954c4452947...  1                sha256:d9c64d038f16e04c52142bc9e7dfa0645ce7e34...  Cl6kTzfbYztA   None          True            archives/compressed/00/00003d61-9db1-4757-9cd1...

Inspecting the Container Dataset

containers_df.iloc[1]
cluster_id                      00003d61-9db1-4757-9cd1-84df271daeb9
namespace                                               0LiT6ZNtbpYL
shape              sha256:3ecf29979b2722bf4a82a5e7a954e8685820720...
shape_instances                                                    1
image_id           sha256:edb9aaacf421c6dc45b20324e8699cec02f26bf...
first_command                                           n9CdwzVF-cwZ
first_arg                                               RNOaw_AuQeIY
init_container                                                 False
archive_path       archives/compressed/00/00003d61-9db1-4757-9cd1...
Name: 1, dtype: object

Available fields:

  • cluster_id: id of the cluster

  • namespace: namespace in the cluster

  • shape: the Pod template, that is, the set of containers in the Pod. If two Pods use the same set of containers with the same commands, they fall into the same shape.

  • shape_instances: number of Pods of that shape.

  • containers (image_id/first_command/first_arg/init_container): describe each container in the shape, namely its image_id, first command, first argument, and whether it is an init container (init_container).

  • archive_path: path to the archive from which the cluster's data was extracted.
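
To make the relationship between these fields concrete, the sketch below counts containers and init containers per shape and then sums the Pods per cluster. It assumes, based on the sample rows above, that each row describes one container of a shape and that shape_instances is repeated on every row of that shape, so it must be counted once per shape rather than once per row. The toy values (c1, ns1, shape-a and so on) are hypothetical.

```python
import pandas as pd

# Toy rows with made-up identifiers; only the column names match the dataset.
toy_containers = pd.DataFrame(
    {
        "cluster_id": ["c1", "c1", "c1", "c2", "c2"],
        "namespace": ["ns1", "ns1", "ns1", "ns9", "ns9"],
        "shape": ["shape-a", "shape-a", "shape-b", "shape-a", "shape-a"],
        "shape_instances": [3, 3, 1, 2, 2],
        "image_id": ["img-1", "img-2", "img-3", "img-1", "img-2"],
        "init_container": [False, True, False, False, True],
    }
)

# One row per shape occurrence: Pod count, container count, init container count.
per_shape = (
    toy_containers.groupby(["cluster_id", "namespace", "shape"])
    .agg(
        shape_instances=("shape_instances", "first"),
        containers=("image_id", "size"),
        init_containers=("init_container", "sum"),
    )
    .reset_index()
)

# Total number of Pods per cluster, counting each shape only once.
pods_per_cluster = per_shape.groupby("cluster_id")["shape_instances"].sum()

print(per_shape)
print(pods_per_cluster)
```

Summing shape_instances directly on the raw rows would count a Pod once per container, which is why the per-shape step comes first.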

containers_df.groupby(["shape"]).agg(
    {"cluster_id": pd.Series.nunique}
).reset_index().sort_values(by=["cluster_id"], ascending=False)
      shape                                              cluster_id
6315  sha256:ff62cfd4da3beb77d886f8935a1b7a6aaf54bdb...  2721
2969  sha256:78fc0fdc25942f43a44b10330813a19f04ea88e...  2551
2040  sha256:524148cb8d81907984141cb8d210decf75657d7...  1792
2370  sha256:5fe22686d7266cfc828498c6674f3268fa6bb78...  1790
3728  sha256:97af185840a1f8c688608aa199bc6a8fb45f9ae...  1361
...   ...                                                ...
2908  sha256:76cde139b6a84f92e5f5d273aaec928589957f4...  1
2909  sha256:76d83926eb2df6f554f519bfcc9f74904a16b75...  1
2911  sha256:76e3aa55c87e23ff1c7beef873bcf399b89ca30...  1
2912  sha256:77063f77b9a5d1513981bbe202ebceeecc5f80f...  1
6328  sha256:ffe906ed042207a1a05260ecf1c46f93218b830...  1

6329 rows × 2 columns

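To get a feel for the shapes, we group by shape and count the distinct clusters in which each one appears. From the output above, the 6,329 shapes are spread very unevenly: the most common shape appears in 2,721 clusters and the next four in more than 1,300 each, while the shapes at the bottom of the list appear in only one cluster. A small number of shape configurations is therefore shared by a large number of clusters.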
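
The same count can be written more compactly, and dividing by the total number of clusters turns it into a share, which is easier to interpret than a raw count. The following is a sketch on a toy dataframe with hypothetical values; on the real data, toy would be replaced by containers_df.

```python
import pandas as pd

# Toy rows with made-up identifiers; only the column names match the dataset.
toy = pd.DataFrame(
    {
        "cluster_id": ["c1", "c1", "c2", "c2", "c3", "c3"],
        "shape": ["shape-a", "shape-b", "shape-a", "shape-a", "shape-a", "shape-c"],
    }
)

# Number of distinct clusters in which each shape appears, most common first.
clusters_per_shape = (
    toy.groupby("shape")["cluster_id"].nunique().sort_values(ascending=False)
)

# Fraction of all clusters that contain each shape.
total_clusters = toy["cluster_id"].nunique()
share_of_clusters = clusters_per_shape / total_clusters

# How many shapes are seen in exactly one cluster.
single_cluster_shapes = int((clusters_per_shape == 1).sum())

print(share_of_clusters)
print(single_cluster_shapes)
```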

Conclusion

Our next goal is an exploratory data analysis of the dataset to gain insight into the relationships between the features. This will be followed by the use of ML to identify and analyse the types (clusters) of workloads that customers run.