Open Data Hub and Object Storage¶
The intent of this notebook is to provide examples of how data engineers and scientists can use Open Data Hub and object storage, specifically Ceph object storage, in much the same way they are accustomed to interacting with Amazon Simple Storage Service (S3). This is possible because Ceph’s object storage gateway offers excellent fidelity with the Amazon S3 API.
Working with Boto¶
Boto is an integrated interface to current and future infrastructural services offered by Amazon Web Services, and among the services it provides interfaces for is Amazon S3. For lightweight analysis of data using Python tools like numpy or pandas, it is handy to interact with data stored in object storage using pure Python. This is where Boto shines.
import sys
import os

import boto3
import pandas as pd

# Connection details and the bucket name are read from environment variables
# provided by the notebook environment.
s3_endpoint_url = os.environ['S3_ENDPOINT_URL']
s3_access_key = os.environ['AWS_ACCESS_KEY_ID']
s3_secret_key = os.environ['AWS_SECRET_ACCESS_KEY']
s3_bucket_name = os.environ['JUPYTERHUB_USER']

print(s3_endpoint_url)
print(s3_bucket_name)

# Create a boto3 S3 client that points at the Ceph object gateway endpoint
# rather than the default AWS endpoint.
s3 = boto3.client('s3', 'us-east-1',
                  endpoint_url=s3_endpoint_url,
                  aws_access_key_id=s3_access_key,
                  aws_secret_access_key=s3_secret_key)
https://s3.upshift.redhat.com
mcliffor
Interacting with S3¶
Creating a bucket, uploading an object (put), and listing the bucket.¶
In the cell below we will use our boto3 connection, s3, to do the following: create an S3 bucket, upload an object, and then display the contents of that bucket.
# Create the bucket and upload a small test object (left commented out here
# because the bucket already exists; uncomment on a first run).
#s3.create_bucket(Bucket=s3_bucket_name)
#s3.put_object(Bucket=s3_bucket_name, Key='object', Body='data')

# List every object currently stored in the bucket.
for key in s3.list_objects(Bucket=s3_bucket_name)['Contents']:
    print(key['Key'])
forestmnist.1.tgz
kube-metrics/operationinfo.csv/_SUCCESS
kube-metrics/operationinfo.csv/part-00000-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00001-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00002-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00003-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00004-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00005-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
new_data
new_data.csv
object
somefolder/new_data.csv
trip_report.tsv/_SUCCESS
trip_report.tsv/part-00000-3549378a-5714-4808-8ffa-a591faa64ff4-c000.csv
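Note that s3.list_objects() returns at most 1,000 keys per call, and the response has no 'Contents' key when the bucket is empty. The snippet below is a minimal sketch, not part of the original notebook, assuming the same s3 client and s3_bucket_name as above; it uses a boto3 paginator to iterate over all keys regardless of bucket size.

# A minimal sketch (assumption: same client and bucket as above). A paginator
# transparently fetches additional pages for buckets with more than 1,000
# objects and tolerates empty buckets.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=s3_bucket_name):
    for obj in page.get('Contents', []):
        print(obj['Key'])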
Exercise #1: Manage Remote Storage¶
Let’s do something slightly more complicated and upload a small file to our new bucket.
Below we use pandas to generate a small csv file for you. Run the cell below, then upload the resulting file to your S3 bucket and display the contents of your bucket like we did above.
This resource may be helpful: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-uploading-files.html
Objective¶
Upload a csv file to your S3 bucket using s3.upload_file()
List the objects currently in your bucket using s3.list_objects()
### Create a small pandas dataframe and save it locally as a .csv file
import pandas as pd
x = [1,2,3,4]
y = [4,5,6,7]
df = pd.DataFrame([x,y])
df.to_csv('new_data.csv')
# 1. Upload a csv file to your s3 bucket using s3.upload_file()
s3.upload_file(Filename='new_data.csv', Bucket=s3_bucket_name, Key='somefolder/new_data.csv')

# 2. List the objects currently in your bucket using s3.list_objects()
for key in s3.list_objects(Bucket=s3_bucket_name)['Contents']:
    print(key['Key'])
forestmnist.1.tgz
kube-metrics/operationinfo.csv/_SUCCESS
kube-metrics/operationinfo.csv/part-00000-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00001-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00002-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00003-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00004-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
kube-metrics/operationinfo.csv/part-00005-1de3723d-a2d5-4f64-9726-d5e0f640fca6-c000.csv
new_data
new_data.csv
object
somefolder/new_data.csv
trip_report.tsv/_SUCCESS
trip_report.tsv/part-00000-3549378a-5714-4808-8ffa-a591faa64ff4-c000.csv
Now let’s read our data from Ceph back into our notebook!
# Retrieve the object and read its body directly into a pandas dataframe.
obj = s3.get_object(Bucket=s3_bucket_name, Key='somefolder/new_data.csv')
df = pd.read_csv(obj['Body'])
df
|   | Unnamed: 0 | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | 3 | 4 |
| 1 | 1 | 4 | 5 | 6 | 7 |
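If you prefer working with a local copy of the file, boto3 also offers s3.download_file(). The cell below is a minimal alternative sketch, not part of the original notebook; the local file name downloaded_data.csv is an arbitrary choice, and the bucket and key are the same ones used above.

# A minimal alternative sketch: download the object to a local file first,
# then read it with pandas. 'downloaded_data.csv' is an arbitrary local name.
s3.download_file(Bucket=s3_bucket_name, Key='somefolder/new_data.csv',
                 Filename='downloaded_data.csv')
df_local = pd.read_csv('downloaded_data.csv')
df_local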
Great, now you know how to interact with and manage your data store with simple data types.
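When you are done experimenting, you may want to tidy up the bucket. The cell below is a minimal clean-up sketch, not part of the original notebook, that removes the two test objects created above using s3.delete_object(); only run it if you no longer need the data.

# A minimal clean-up sketch: delete the test objects created in this notebook.
s3.delete_object(Bucket=s3_bucket_name, Key='somefolder/new_data.csv')
s3.delete_object(Bucket=s3_bucket_name, Key='object')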