Mapping Image SHA to Image Name using Pyxis database

In order to answer most of the customer usage and business insights questions raised in the OpenShift Workload Fingerprinting project, we need to connect two disparate datasets: the insights operator archives and the Pyxis database. That is, we want to use Pyxis to determine the product name, architecture, vulnerabilities, vendor, etc. corresponding to the container image SHAs in the insights dataset. In a previous issue, we figured out how to do this for a given SHA by using curl in the terminal. In this notebook, we will do this programmatically, for all the SHAs available in our dataset. We will then store the merged dataset in an s3 bucket and use it for the rest of the analysis in the project going forward.

Prerequisites

In order to fetch the image name (and other details) for a given image_id SHA, please complete the prerequisites described below.

  1. Follow the link to set up a kerberos ticket and Red Hat IdM on your machine.

  2. Update /etc/krb5.conf on your machine by setting dns_canonicalize_hostname to false, as described in the first ‘red box’ in this guide.

  3. Obtain the kerberos ticket by running $ kinit <your_kerberos_username>@IPA.REDHAT.COM (a quick way to verify the ticket is sketched below).
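As a quick sanity check (an optional sketch, assuming the klist command is available on your machine), you can confirm that a valid ticket exists before running the rest of the notebook:

# Optional sanity check: `klist -s` exits with status 0 only if a valid
# (non-expired) kerberos ticket is present in the credential cache.
import subprocess

if subprocess.run(["klist", "-s"]).returncode == 0:
    print("Valid kerberos ticket found.")
else:
    print("No valid kerberos ticket found - please run kinit first.")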

In this notebook, we map the image_id SHAs from the image layers dataset and the containers dataset provided in the workload data of the insights operator.
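Conceptually, the curl-based lookup from the earlier issue translates into a kerberos-authenticated GET request against the Pyxis images endpoint. The snippet below is a minimal sketch of such a single-SHA lookup (assuming a valid kerberos ticket; the sha256 value is a placeholder, and the actual helper functions used for the bulk mapping are defined later in this notebook):

# Minimal sketch of a single-SHA lookup against Pyxis (python equivalent of the curl call)
import requests
from requests_kerberos import HTTPKerberosAuth, OPTIONAL

kerberos_auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
url = "https://pyxis.engineering.redhat.com/v1/images?filter=image_id==" + "sha256:<image digest>"
response = requests.get(url, auth=kerberos_auth, verify=False)
if response.status_code == 200:
    # "data" holds the matching image records (possibly empty)
    print(response.json()["data"][:1])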

Importing useful packages

import io
import boto3
import requests
import os
import json
import warnings
import pandas as pd
import multiprocessing as mp

from requests_kerberos import HTTPKerberosAuth, OPTIONAL
from dotenv import find_dotenv, load_dotenv
from tqdm import tqdm
load_dotenv(find_dotenv())
warnings.filterwarnings("ignore")
pd.set_option("display.max_columns", None)

Data Collection

In this section, we will fetch from our s3 bucket the containers dataset and the image layers dataset that have been curated from insights operator archives. To learn more about the general content of these datasets, please check out the getting_started notebook.

# CEPH Bucket variables
s3_endpoint_url = os.getenv("S3_ENDPOINT")
s3_access_key = os.getenv("S3_ACCESS_KEY")
s3_secret_key = os.getenv("S3_SECRET_KEY")
s3_bucket = os.getenv("S3_BUCKET")

# s3 resource to communicate with storage
s3 = boto3.resource(
    "s3",
    endpoint_url=s3_endpoint_url,
    aws_access_key_id=s3_access_key,
    aws_secret_access_key=s3_secret_key,
)

dates = [
    "2021-08-05",
    "2021-08-08",
    "2021-08-10",
    "2021-08-11",
    "2021-08-15",
    "2021-08-17",
    "2021-08-19",
    "2021-08-22",
    "2021-08-23",
]

image_layers_dfs = []
for date in dates:
    obj1 = s3.Object(
        s3_bucket,
        "prototype/workload/image_layers/date={date}/{date}.parquet".format(date=date),
    )
    buffer1 = io.BytesIO()
    obj1.download_fileobj(buffer1)
    image_layers_dfs.append(pd.read_parquet(buffer1))
# Combine the daily image layers files into a single dataframe
image_layers_df = pd.concat(image_layers_dfs)

containers_dfs = []
for date in dates:
    obj2 = s3.Object(
        s3_bucket,
        "prototype/workload/containers/date={date}/{date}.parquet".format(date=date),
    )
    buffer2 = io.BytesIO()
    obj2.download_fileobj(buffer2)
    containers_dfs.append(pd.read_parquet(buffer2))
# Combine the daily containers files into a single dataframe
containers_df = pd.concat(containers_dfs)
image_layers_df.head(3)
cluster_id image_id layer_image_id layer_image_level first_command first_arg archive_path
0 9e97b920-2876-4076-8bfb-36fe123bc273 sha256:3bc831c3d6614afcd5a8e1728b8bbe6709c957d... sha256:1cadda38f72dece653de82063e3c8e910265fe7... 0 U7Yi5SISAtKW <NA> archives/compressed/9e/9e97b920-2876-4076-8bfb...
1 9e97b920-2876-4076-8bfb-36fe123bc273 sha256:3bc831c3d6614afcd5a8e1728b8bbe6709c957d... sha256:a50df8fd88fecefc26fd331f832672108deb08c... 1 U7Yi5SISAtKW <NA> archives/compressed/9e/9e97b920-2876-4076-8bfb...
2 9e97b920-2876-4076-8bfb-36fe123bc273 sha256:3bc831c3d6614afcd5a8e1728b8bbe6709c957d... sha256:904d3325f999f09cad1ba9676937fc8b72ff285... 2 U7Yi5SISAtKW <NA> archives/compressed/9e/9e97b920-2876-4076-8bfb...
containers_df.head(3)
cluster_id namespace shape shape_instances image_id first_command first_arg init_container archive_path
0 98df2866-2131-41c1-97f3-aba6f8761c3d 0LiT6ZNtbpYL sha256:7ac9e625af2e30671ebec339821489da205116c... 6 sha256:6c05d74eb1fa37a77ed9215d83933265564d661... N9KxLV2avCo2 BuLIUMMJnyP_ False archives/compressed/98/98df2866-2131-41c1-97f3...
1 98df2866-2131-41c1-97f3-aba6f8761c3d 0LiT6ZNtbpYL sha256:b1adc9101829bec6f71530547b1151891a99116... 6 sha256:6c05d74eb1fa37a77ed9215d83933265564d661... N9KxLV2avCo2 EbplhSJxzSTF False archives/compressed/98/98df2866-2131-41c1-97f3...
2 98df2866-2131-41c1-97f3-aba6f8761c3d 0LiT6ZNtbpYL sha256:b1adc9101829bec6f71530547b1151891a99116... 6 sha256:8b9ecf20324c62d92b4a812a9f502b1059cfed0... Cl6kTzfbYztA <NA> True archives/compressed/98/98df2866-2131-41c1-97f3...

Function to extract mapped information for image_id

def mapped_df(image_id):
    """Query Pyxis for a given image_id SHA and return its labels as a one-row dataframe."""
    base_url = "https://pyxis.engineering.redhat.com/v1/images?filter=image_id=="
    team_url = base_url + str(image_id)
    r = requests.get(team_url, auth=kerberos_auth, verify=False)
    if r.status_code == 200:
        data = json.loads(r.content)
        if len(data["data"]) > 0 and len(data["data"][0]["parsed_data"]["labels"]) > 0:
            # Pivot the list of {name, value} label records into a single wide row
            df = pd.DataFrame(data["data"][0]["parsed_data"]["labels"])
            table = pd.pivot_table(
                df, values="value", aggfunc=lambda x: x, columns="name"
            )
            table["image_id"] = image_id
            return table.set_index("image_id")
    # No match (or no labels) in Pyxis: return an empty dataframe
    return pd.DataFrame([])

Function to extract mapped information for layer_image_id

def mapped_layer_df(image_id):
    """Query Pyxis by top_layer_id for a given layer SHA and return its labels as a one-row dataframe."""
    base_url = "https://pyxis.engineering.redhat.com/v1/images?filter=top_layer_id=="
    team_url = base_url + str(image_id)
    r = requests.get(team_url, auth=kerberos_auth, verify=False)
    if r.status_code == 200:
        data = json.loads(r.content)
        if len(data["data"]) > 0 and len(data["data"][0]["parsed_data"]["labels"]) > 0:
            # Pivot the list of {name, value} label records into a single wide row
            df = pd.DataFrame(data["data"][0]["parsed_data"]["labels"])
            table = pd.pivot_table(
                df, values="value", aggfunc=lambda x: x, columns="name"
            )
            table["image_id"] = image_id
            return table.set_index("image_id")
    # No match (or no labels) in Pyxis: return an empty dataframe
    return pd.DataFrame([])

Mapping the SHAs in the image_id column of the Image Layers Dataset

First, we form a list of unique image_id values from the image layers dataset. Using that list, we query the Pyxis API for each SHA and build a dataframe with image_id and the corresponding product name, summary, vendor, version, and other attributes.

# Creating the list of image_id
arr_imageid = image_layers_df.image_id.unique()
kerberos_auth = HTTPKerberosAuth(mutual_authentication=OPTIONAL)
# Size of the list
len(arr_imageid)
5315
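Before kicking off the full run over all 5315 SHAs, it can help to sanity-check the helper on a single image_id (a quick, illustrative check, assuming the kerberos ticket is still valid):

# Quick check on the first image_id in the list; an empty dataframe simply
# means Pyxis has no record (or no labels) for that SHA.
sample = mapped_df(arr_imageid[0])
sample.head()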
# number of max processes
n_max_processes = mp.cpu_count()
print(n_max_processes)
8
with mp.Pool(processes=n_max_processes) as pool:
    df = list(tqdm(pool.imap(mapped_df, arr_imageid), total=len(arr_imageid)))
    dataframe_image_id = pd.concat(df)
100%|██████████| 5315/5315 [10:54<00:00,  8.13it/s]
dataframe_image_id.shape
(958, 73)

We mapped 958 (~18%) of the 5315 unique image_id SHAs from the image layers dataset.
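The coverage figure above follows directly from the two sizes (a small sketch of the arithmetic):

# Each row of dataframe_image_id corresponds to one successfully mapped image_id
coverage = len(dataframe_image_id) / len(arr_imageid)
print(f"Mapped {len(dataframe_image_id)} of {len(arr_imageid)} image_ids ({coverage:.1%})")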

"""
# Uploading the mapping dataset in the bucket
parquet_buffer = io.BytesIO()
dataframe_image_id.to_parquet(parquet_buffer)
s3_obj = s3.Object(
    s3_bucket, "prototype/workload/image_layers/dataframe_image_id.parquet"
)
status = s3_obj.put(Body=parquet_buffer.getvalue())
"""

The image_id values mapped to their product names (and other attributes) are saved in the bucket as a dataframe (dataframe_image_id.parquet).
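In downstream notebooks, the saved mapping can be read back from the bucket in the same way the raw datasets were fetched above (a sketch, assuming the same bucket and key):

# Sketch: read the saved mapping back from s3 for later analysis
obj = s3.Object(s3_bucket, "prototype/workload/image_layers/dataframe_image_id.parquet")
buffer = io.BytesIO()
obj.download_fileobj(buffer)
dataframe_image_id = pd.read_parquet(buffer)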


Mapping the SHAs in the layer_image_id column of the Image Layers Dataset

In addition to the image_id column, the layer_image_id column also contains image SHAs. These SHAs correspond to the layers that make up the image in image_id. In this section, we form a list of unique layer_image_id values from the image layers dataset. Using that list, we query the Pyxis API (filtering on top_layer_id) for each SHA and build a dataframe with layer_image_id and the corresponding product name, summary, vendor, and other attributes.

arr_layer_imageid = image_layers_df.layer_image_id.unique()
# Size of the list
len(arr_layer_imageid)
17817
with mp.Pool(processes=n_max_processes) as pool:
    df_image_layerid = list(
        tqdm(
            pool.imap(mapped_layer_df, arr_layer_imageid),
            total=len(arr_layer_imageid),
        )
    )
    df_image_layerid = pd.concat(df_image_layerid)
100%|██████████| 17817/17817 [35:21<00:00,  8.40it/s]
df_image_layerid.shape
(1292, 105)

We were able to create a dataframe that maps 1292 (~7%) of the 17817 unique layer_image_id SHAs in the image layers dataset.

"""
# Uploading the mapping dataset in the bucket
parquet_buffer = io.BytesIO()
df_image_layerid.to_parquet(parquet_buffer)
s3_obj = s3.Object(
    s3_bucket, "prototype/workload/image_layers/df_image_layer_id.parquet"
)
status = s3_obj.put(Body=parquet_buffer.getvalue())
"""

Mapping the SHAs in the image_id column of the Containers Dataset

In this section, we will map the SHAs in the containers dataset to their product name, summary, vendor, and other attributes. We first form a list of unique image_id values from the containers dataset and then query the Pyxis API for each SHA to build a dataframe with image_id and the corresponding attributes.

# Listing out the SHA's of image_id
arr_cont_imageid = containers_df.image_id.unique()
# Size of the list
len(arr_cont_imageid)
33488
with mp.Pool(processes=n_max_processes) as pool:
    df_cont_image_id = list(
        tqdm(pool.imap(mapped_df, arr_cont_imageid), total=len(arr_cont_imageid))
    )
    df_cont_image_id = pd.concat(df_cont_image_id)
100%|██████████| 33488/33488 [1:07:39<00:00,  8.25it/s]
df_cont_image_id.shape
(8928, 262)

Here, we successfully mapped 8928 (~26%) of the 33488 unique image_id SHAs in the containers dataset.

"""
# Uploading the mapping dataset in the bucket
parquet_buffer = io.BytesIO()
df_cont_image_id.to_parquet(parquet_buffer)
s3_obj = s3.Object(
    s3_bucket, "prototype/workload/containers/df_cont_image_id.parquet"
)
status = s3_obj.put(Body=parquet_buffer.getvalue())
"""

The corresponding mapped dataframe is saved in the bucket (df_cont_image_id.parquet).


Conclusion

This notebook takes some time to run. In it, we were able to map product names (and other attributes) to the corresponding image_id SHAs from the image layers dataset and the containers dataset. The mapped dataframes are then saved in the bucket.

As next steps, we will extract the telemetry information (CPU usage, memory usage) corresponding to the different cluster_id values in the workload dataset.