Tag: pandas

  • Using Python Pandas to turn ISO Country Codes into a string to use as values for a SQL Query

    Using Python Pandas to turn ISO Country Codes into a string to use as values for a SQL Query

    Summary, code, resources

    Problem

    While querying a table (based on this dataset) with SQL, you realize one of your columns uses 3-character ISO Country Codes. However, some of these 3-character codes aren’t countries but geographical regions or groups of countries, in addition to the actual country codes. How can you filter out rows so you are left with the countries only?

    Answer

    Use the Python Pandas library to scrape ISO country codes and convert the values to one single string. Then use that string as values for a subsequent SQL query (possibly something like this):

    SELECT * FROM [your_table]
    WHERE country_code IN ([values from the generated list-as-string separated by commas and encased by single / double quotes]);

    Code

    # Libraries used in this code
    from bs4 import BeautifulSoup
    import requests
    import csv 
    import pandas as pd 
    
    # I found these ISO country codes on the below URL. Pandas makes it easy to read HTML and manipulate it. Very cool!
    iso_codes = pd.read_html("https://www.iban.com/country-codes")
    
    # I create a data frame, starting at an index of 0.
    df = iso_codes[0]
    
    # But really, all I care about is the 3-digit country code. So I'll make that the df (dataframe) and strip out the index
    df = df['Alpha-3 code'].to_string(index=False)
    
    # From here, I'll save this little guy as a text file.
    with open("./countries.txt", "w") as f:
        f.write(df)
    
    # I'll set up a list. *** This was my approach, but if you find a better way, feel free to comment or adjust. ***
    my_list = []
    
    # Then I'll open that text file and read it in.
    file = open("./countries.txt", "r")
    countries = file.read()
    
    # I need to remove the "new line" identifiers, so I'm doing that here. 
    my_list = countries.split('\n')
    
    # Once I do that, I can create two new strings. I do this with f-Strings. Great article on using them here: https://realpython.com/python-f-strings/ 
    
    # I have two options here: one where the codes are contained by single quotes, the other with double quotes. Oracle Autonomous Database likes single quotes, but your DB may differ.
    
    countries_string_single_quotes = ','.join(f"'{x}'" for x in my_list)
    
    countries_string_double_quotes = ','.join(f'"{x}"' for x in my_list)
    
    # From here, I take those strings and save them in a text file. You don't have to do this; you can print and copy/paste the string. But this might be an excellent addition if you want to refer to these later without running all the code. 
    
    with open("./countries_as_list_single_quotes.txt", "a") as f:
        f.write(countries_string_single_quotes)
    
    with open("./countries_as_list_double_quotes.txt", "a") as f:
        f.write(countries_string_double_quotes)

    GitHub repo details

    You can find the code from this post in my GitHub repository. The repository consists of the following:

    • The Python code I created for solving this problem
    • A countries.txt file, which is produced midway through the code (temporary placeholder for later processing)
    • ‘Single quotes’ .txt file – the 3-character ISO Country Codes are formatted as a string. The values are enclosed by single quotes; commas throughout
    • “Double quotes” .txt file – the 3-character ISO Country Codes are formatted as a string. The values are enclosed by double quotes; commas throughout

    Resources


    The in-depth walkthrough

    Backstory

    A few days ago, I stumbled upon this question in the r/datasets subreddit:

    I spent most of the morning figuring out how I would go about this, and after some trial and error, I devised a plan. I decided to take the list of ISO Country Codes (which I found here) and use them as values for filtering in a SQL statement (later on in Oracle SQL Developer Web).

    After some research, I figured out the proper SQL syntax for a successful query.

    SELECT * FROM [your_table]
    WHERE country_code IN ([values from the generated list-as-string separated by commas and encased by single / double quotes]);

    From there, I knew I needed to work backward on those ISO Country Codes. Meaning I needed to take something that looked like this:

    And turn it into something more workable. It turns out that grabbing this was pretty straightforward. I’m using Pandas primarily for this exercise, but first, I need to import some libraries:

    # Libraries used in this code
    from bs4 import BeautifulSoup
    import requests
    import csv 
    import pandas as pd 

    Next, I’ll use Pandas’ read_html function (this feels like cheating, but it’s incredible) to read in the table.

    # I found these ISO country codes on the below URL. Pandas makes it easy to read HTML and manipulate it. Very cool!
    iso_codes = pd.read_html("https://www.iban.com/country-codes")
    
    # I create a data frame, starting at an index of 0.
    df = iso_codes[0]

    This is wild, but this is what the printout looks like:

    pandas to_html printout of ISO codes table, chris hoina, senior product manager, oracle rest apis, oracle autonomous database, sql developer web, ORDS
    The Pandas read_html() the function is powerful.

    If you squint, you can see an “Alpha-2 code” and an “Alpha-3 code” column in the image. From here, I need to isolate the 3-code column. So I reshaped the data frame by making it a single column; dropping the index (this is optional, you could keep the index if you needed it; perhaps you wanted to create a separate table in your database).

    # But really, all I care about is the 3-digit country code. So I'll make that the df (dataframe) and strip out the index
    df = df['Alpha-3 code'].to_string(index=False)

    I’ll save this data frame as a .txt file.

    # From here, I'll save this little guy as a text file.
    with open("./countries.txt", "w") as f:
        f.write(df)

    This is only temporary (FYI: this is the only way I could figure out how to do this). It’ll look like this:

    country codes as a temporary text file, chris hoina, senior product manager, oracle rest apis, oracle autonomous database, sql developer web, ORDS
    The temporary .txt file of 3-character ISO Country Codes.

    Next, I take that temporary text file and read it in. I’m going to add it to a list, so I’ll first create the empty list (aptly named “my_list“). I also need to remove the newline characters from the list; otherwise, if I don’t, then when I create my string of values (that comes in the final step), the string will look like this:

    countries string with newline characters, chris hoina, senior product manager, oracle rest apis, oracle autonomous database, sql developer web, ORDS
    The “countries” string with “\n” characters.

    I remove the newline characters with this piece of code:

    # I need to remove the "new line" identifiers, so I'm doing that here. 
    my_list = countries.split('\n')

    The almost string of values will look like this:

    viewing ouput of my_list, chris hoina, senior product manager, oracle rest apis, oracle autonomous database, sql developer web, ORDS
    New line characters have now been removed.

    I use F-Strings to create the following two strings; countries_strings_single_quotes and countries_strings_double_quotes, respectively. Need to learn about F-Strings (or, more formally, Literal String Interpolation)? No problemo! Check out these three resources:

    The code for the F-Strings is below. I loop through my_list and separate the x (the things I’m iterating over) with commas (that’s the join).

    # Once I do that, I can create two new strings. I do this with f-Strings. Great article on using them here: https://realpython.com/python-f-strings/ 
    
    # I have two options here: one where the codes are contained by single quotes, the other with double
    # quotes. Oracle Autonomous Database likes single quotes, but your DB may differ.
    
    countries_string_single_quotes = ','.join(f"'{x}'" for x in my_list)
    
    countries_string_double_quotes = ','.join(f'"{x}"' for x in my_list)

    And now that I have these two objects (are they called objects??). I’ll save them each as a text file. One file has the 3-character codes surrounded by single quotes, the other with double quotes. The code:

    # From here, I take those strings and save them in a text file. You don't have to do this; you can print
    # and copy/paste the string. But this might be a nice addition if you want to refer to these later 
    # without running all the code. 
    
    with open("./countries_as_list_single_quotes.txt", "a") as f:
        f.write(countries_string_single_quotes)
    
    with open("./countries_as_list_double_quotes.txt", "a") as f:
        f.write(countries_string_double_quotes)

    The text files look like this now:

    side by side comparison of newly created text files, chris hoina, senior product manager, oracle rest apis, oracle autonomous database, sql developer web, ORDS
    The country codes are now presented in one long string. Pretty cool, eh?

    SQL time

    We have arrived! Let me show you what I can do now!

    I took the CSV data from the World Bank and loaded it into my Autonomous Database. Our returning intern Layla put together a video of how to do this; you can check it out here:

    Once my table was created, I did a SELECT [columns] FROM. Here you can see my “beginning state”.

    There are 266 entries; some are countries, and others are not. And if you recall, the original question asked how somebody could filter out the non-countries. Onto that next!

    This is the best part. I can take the string I made and use that in a SQL query such as this:

    SELECT * from ADMIN.REDDIT_TABLE
    WHERE COUNTRY_CODE IN('AFG','ALA','ALB','DZA','ASM','AND','AGO','AIA','ATA',
    'ATG','ARG','ARM','ABW','AUS','AUT','AZE','BHS','BHR','BGD','BRB','BLR','BEL',
    'BLZ','BEN','BMU','BTN','BOL','BES','BIH','BWA','BVT','BRA','IOT','BRN','BGR',
    'BFA','BDI','CPV','KHM','CMR','CAN','CYM','CAF','TCD','CHL','CHN','CXR','CCK',
    'COL','COM','COD','COG','COK','CRI','CIV','HRV','CUB','CUW','CYP','CZE','DNK',
    'DJI','DMA','DOM','ECU','EGY','SLV','GNQ','ERI','EST','SWZ','ETH','FLK','FRO',
    'FJI','FIN','FRA','GUF','PYF','ATF','GAB','GMB','GEO','DEU','GHA','GIB','GRC',
    'GRL','GRD','GLP','GUM','GTM','GGY','GIN','GNB','GUY','HTI','HMD','VAT','HND',
    'HKG','HUN','ISL','IND','IDN','IRN','IRQ','IRL','IMN','ISR','ITA','JAM','JPN',
    'JEY','JOR','KAZ','KEN','KIR','PRK','KOR','KWT','KGZ','LAO','LVA','LBN','LSO',
    'LBR','LBY','LIE','LTU','LUX','MAC','MKD','MDG','MWI','MYS','MDV','MLI','MLT',
    'MHL','MTQ','MRT','MUS','MYT','MEX','FSM','MDA','MCO','MNG','MNE','MSR','MAR',
    'MOZ','MMR','NAM','NRU','NPL','NLD','NCL','NZL','NIC','NER','NGA','NIU','NFK',
    'MNP','NOR','OMN','PAK','PLW','PSE','PAN','PNG','PRY','PER','PHL','PCN','POL',
    'PRT','PRI','QAT','REU','ROU','RUS','RWA','BLM','SHN','KNA','LCA','MAF','SPM',
    'VCT','WSM','SMR','STP','SAU','SEN','SRB','SYC','SLE','SGP','SXM','SVK','SVN',
    'SLB','SOM','ZAF','SGS','SSD','ESP','LKA','SDN','SUR','SJM','SWE','CHE','SYR',
    'TWN','TJK','TZA','THA','TLS','TGO','TKL','TON','TTO','TUN','TUR','TKM','TCA',
    'TUV','UGA','UKR','ARE','GBR','UMI','USA','URY','UZB','VUT','VEN','VNM','VGB',
    'VIR','WLF','ESH','YEM','ZMB','ZWE')
    ORDER BY COUNTRY_CODE ASC;

    Once I execute that SQL statement, I’m left with the countries from that list. I opened up the results in another window so you can see a sample.

    The end

    So yeah, that’s it! I don’t know if this was the best way to go about this, but it was fun. I’m curious (if you’ve made it this far), what do you think? How would you go about it? Let me know.

    And two more things: remember to share this and…

  • More fun with Medium story stats, JSON, Python, Pandas, and Oracle SQL Developer Web

    More fun with Medium story stats, JSON, Python, Pandas, and Oracle SQL Developer Web

    That’s right; I’m back again for yet another installment of this ongoing series dedicated to working with Medium.com story stats. I first introduced this topic in a previous post. Maybe you saw it. If not, you can find it here.

    Recap

    My end goal was to gather all story stats from my Medium account and place them into my Autonomous Database. I wanted to practice my SQL and see if I could derive insights from the data. Unfortunately, gathering said data is complicated.

    Pulling the data down was a breeze once I figured out where to look for these story statistics. I had to decipher what I was looking at in the Medium REST API (I suppose that was somewhat tricky). My search was mostly an exercise in patience (there was a lot of trial and error).

    I uploaded a quick video in the previous post. But I’ll embed it here so you can see the process for how I found the specific JSON payload.

    Obtaining the raw JSON

    Once I found that URL, I saved this JSON as a .json file. The images below show remnants of a JavaScript function captured with the rest of the JSON. I’m no JavaScript expert, so I can’t tell what this function does. But before I load this into my Autonomous Database (I’m using an OCI Free Tier account, you can check it out here if you are curious), it needs to go.


    README

    I am pointing out a few things that may seem convoluted and unnecessary here. Please take the time to read this section so you can better understand my madness. 
    
    FIRST: Yes, you can manually remove the [presumably] JavaScript saved along with the primary JSON payload (see above paragraphs). I'm showing how to do this in Python as a practical exercise. But I'm also leaving open the opportunity for future automation (as it pertains to cleaning data). 
    
    SECOND: When it comes to the Pandas data frame steps, of course, you could do all this in Excel, Numbers, or Sheets! Again, the idea here is to show you how I can clean and process this in Python. Sometimes doing things like this in Excel, Numbers, and Sheets is impossible (thinking about enterprise security here). 
    
    THIRD: Admittedly, the date-time conversion is hilarious and convoluted. Of course, I could do this in a spreadsheet application. That's not the point. I was showing the function practically and setting myself up for potential future automation. 
    
    FOURTH: I'll be the first to admit that the JSON > TXT > JSON > CSV file conversion is comical. So if you have any suggestions, leave a comment here or on my GitHub repository (I'll link below), and I'll attribute you!

    The code

    Explaining the code in context, with embedded comments, will be most illuminating.

    I’ve named everything in the code as literally as possible. In production, this feels like it might be impractical; however, there is no question about what the hell the code is doing! Being more literal is ideal for debugging and code maintenance.

    Here is the entire code block (so CTRL+C/CTRL+V to your heart’s content 😘). I’ll still break this down into discrete sections and review them.

    import csv
    import json
    import pandas as pd 
    import datetime
    from pathlib import Path
    
    # You'll first need to sign in to your account, then you can access this URL without issues: 
    # https://medium.com/@chrishoina/stats/total/1548525600000/1668776608433
    # NOTES:
    # Replace the "@chrishoina" with your username 
    # The two numbers you see are Unix Epochs; you can modify those as # needed; in my case, I
    # wanted to see the following:
    # * 1548525600000 - At the time of this post, this seems to be 
    # whenever your first post was published or when 
    # you first created a Medium account. In this case, for me, this 
    # was Sat, Jan/26/2019, 6:00:00PM - GMT
    # * 1665670606216 - You shouldn't need to change this since it will # just default to the current date.
    
    # For the conversion, I an Epoch Converter tool I found online: https://www.epochconverter.com/
    
    # Step 1 - Convert this to a,(.txt) file
    p = Path("/Users/choina/Documents/socialstats/1668776608433.json")
    p.rename(p.with_suffix('.txt'))
    
    # Step 2 - "read" in that text file, and remove those pesky
    # characters/artifacts from position 0 through position 15. 
    # I'm only retaining the JSON payload from position 16 onward.
    
    with open("/Users/choina/Documents/socialstats/1668776608433.txt", "r") as f:
    stats_in_text_file_format = f.read()
    
    # This [16:] essentially means grabbing everything in this range. Since
    # there is nothing after the colon; it will just default to the end (which is
    # what I want in this case).
    
    cleansed_stats_from_txt_file = stats_in_text_file_format[16:]
    
    print(cleansed_stats_from_txt_file)
    
    # This took me a day to figure out, but this text file needs to be encoded
    # properly, so I can save it as a JSON file (which is about to happen). I
    # always need to remember this, but I know that the json.dumps = dump
    # string, which json.dump = dump object. There is a difference, I'm not
    # the expert, but the docs were helpful. 
    
    json.dumps(cleansed_stats_from_txt_file)
    
    # Step 3 - Here, I create a new file, then indicate we will "w"rite to it. I take the 
    # progress from Step 2 and apply it here. 
    
    with open('medium_stats_ready_for_pandas.json', 'w') as f:
    f.write(cleansed_stats_from_txt_file)
    
    # Step 4 - Onto Pandas! We've already imported the pandas library as "pd."
    # We first create a data frame and name the columns. I kept the names
    # very similar to avoid confusion. I feared that timestampMs might be a
    # reserved word in Oracle DB or too close, so I renamed it. 
    
    df = pd.DataFrame(columns=['USERID', 'FLAGGEDSPAM', 'STATSDATE', 'UPVOTES', 'READS', 'VIEWS', 'CLAPS', 'SUBSCRIBERS']) 
    
    with open("/Users/choina/Documents/socialstats/medium_stats_ready_for_pandas.json", "r") as f: 
    data = json.load(f)
    data = data['payload']['value']
    
    print(data)
    
    for i in range(0, len(data)):
    df.loc[i] = [data[i]['userId'], data[i]['flaggedSpam'], data[i]['timestampMs'], data[i]['upvotes'], data[i]['reads'], data[i]['views'], data[i]['claps'], data[i]['updateNotificationSubscribers']]
    
    df['STATSDATE'] = pd.to_datetime(df['STATSDATE'], unit="ms")
    
    print(df.columns)
    
    # Step 5 - use the Pandas' df.to_csv function and save the data frame as
    # a CSV file
    
    with open("medium_stats_ready_for_database_update.csv", "w") as f:
    df.to_csv(f, index=False, header=True)

    I used several Python libraries I use for this script:

    Step 1

    p =  Path("/Users/choina/Documents/socialstats/1668776608433.json")
    p.rename(p.with_suffix('.txt')

    Pathlib allows you to assign the file’s path to “p”. From there, I changed the .json file extension to a .txt extension.

    Note: Again, I'm sure there is a better way to do this, so if you're reading, leave a comment here or on my GitHub repository so I can attribute it to you 🙃. 

    The before and after of what this step looks like this:

    With that out of the way, I needed to remove that JavaScript “prefix” in the file. I do this in Step 2 (I got so fancy that I probably reached diminishing returns). My approach works, and I can repurpose this for other applications too!

    Step 2:

    # Step 2 - "read" in that text file, and remove those pesky
    # characters/artifacts from position 0 through position 15. Or in other
    # words, you'll retain everything from position 16 onward because that's
    # where the actual JSON payload is.
    
    with open("/Users/choina/Documents/socialstats/1668776608433.txt", "r") as f:
    stats_in_text_file_format = f.read()
    
    # This [16:] essentially means grabbing everything in this range. Since
    # there is nothing after the colon; it will just default to the end (which is
    # what I want in this case).
    cleansed_stats_from_txt_file = stats_in_text_file_format[16:]
    
    print(cleansed_stats_from_txt_file)
    
    # This took me a day to figure out, but this text file needs to be
    # appropriately encoded to save as a JSON file (which is about to
    # happen). I always forget the difference between "dump" and "dumps";
    # json.dumps = dump string, whereas json.dump = dump object. There is
    # a difference, I'm not the expert, but the docs were helpful (you should
    # read them). 
    
    json.dumps(cleansed_stats_from_txt_file)
    
    json with characters that need to be removed in vs code context chris hoina senior product manager oracle rest apis database tools autonomous database
    I needed to remove these remnants from the Medium JSON response

    While this initially came through as a JSON payload, those first 0-15 characters had to go.

    FULL DISCLAIMER: I couldn't figure out how to get rid of this while it was still a JSON file hence why I converted this to a text file (this was the only way I could figure it out).

    I captured position 16 to infinity (or the end of the file, whichever occurs first), then I re-encoded the file as JSON (I interpreted this as “something the target machine can read and understand as JSON“).

    OPEN SEASON: CompSci folks, please roast me in the comments if I'm wrong. 

    Step 3

    # Step 3 - I create a new file, then I'll "w"rite to it. I took the result from Step 2 and applied it here. 
    
    with open('medium_stats_ready_for_pandas.json', 'w') as f: 
    f.write(cleansed_stats_from_txt_file)

    I’m still at the data-wrangling portion of this journey, but I’m getting close to the end. I’ll create a new JSON file, take the parts of the (freshly encoded) text file I need, and then save them as that new JSON file.

    Step 4

    # Step 4 - Onto Pandas! We've already imported the pandas library as "pd"
    
    # I first create a data frame and name the columns. I kept the names 
    # similar to avoid confusion. I feared that timestampMs might be a
    # reserved word in Oracle DB or too close, so I renamed it. 
    
    df = pd.DataFrame(columns=['USERID', 'FLAGGEDSPAM', 'STATSDATE', 'UPVOTES', 'READS', 'VIEWS', 'CLAPS', 'SUBSCRIBERS']) 
    
    with open("/Users/choina/Documents/socialstats/medium_stats_ready_for_pandas.json", "r") as f: 
    data = json.load(f)
    data = data['payload']['value']
    
    print(data)
    
    for i in range(0, len(data)):
    df.loc[i] = [data[i]['userId'], data[i]['flaggedSpam'], data[i]['timestampMs'], data[i]['upvotes'], 
    data[i]['reads'], data[i]['views'], data[i]['claps'], data[i]['updateNotificationSubscribers']]
    
    df['STATSDATE'] = pd.to_datetime(df['STATSDATE'], unit="ms")
    
    print(df.columns)

    I won’t teach Pandas (and honestly, you do NOT want me to be the one to teach you Pandas), but I’ll do my best to explain my process. I first created the structure of my data frame (“df” in this case). And then, I named all the column headers (these can be anything, but I kept them very close to the ones found in the original JSON payload).

    I then opened the newly-saved JSON file and extracted what I needed.

    NOTE: I got stuck here for about a day and a half, so let me explain this part. 

    The data['payload']['value'] refers to the key and value in this particular {key: value} pair. This approach allowed me to grab all the values of “value“. This image explains what I started with (on the left) and what I ended up with (on the right).

    retrieving all the values in the payload-value key-value pair, chris hoina, ords, oracle rest apis, database tools, sql developer web
    The before and after JSON payload

    You’ll notice a {"success": true} key: value pair. With this method, I removed that pair and shed others at the end of the JSON payload.

    three additional key value pairs i remove from the json, chris hoina, senior product manager, ords, oracle rest apis, database tools, sql developer web, oracle autonomous database
    Removing a great deal of trash

    I can’t take credit for organically coming up with this next part; Kidson on YouTube is my savior. I’d watch this video to understand what is happening in this piece of code entirely:

    for i in range(0, len(data)):
        df.loc[i] = [data[i]['userId'], data[i]['flaggedSpam'], data[i]['timestampMs'], data[i]['upvotes'],                                   
        data[i]['reads'], data[i]['views'], data[i]['claps'], data[i]['updateNotificationSubscribers']]

    In short, you take the values from the columns in the JSON file (above) and then put them into the column locations named in this piece of code:

    df = pd.DataFrame(columns=['USERID', 'FLAGGEDSPAM', 'STATSDATE', 'UPVOTES', 'READS', 'VIEWS', 'CLAPS', 'SUBSCRIBERS'])  

    For instance, the "userId" values in the JSON file will all go into the 'USERID' column in the Pandas data frame. And the same thing will happen for the other values and associated (Pandas data frame) columns.

    Finally, I changed the date (which, if you recall, is still in this Epoch format) with the Datetime library to a more friendly, readable date. Using this code:

    df['STATSDATE'] = pd.to_datetime(df['STATSDATE'], unit="ms")

    Step 5

    with open("medium_stats_ready_for_database_update.csv", "w") as f:
        df.to_csv(f, index=False, header=True)

    I’m at the home stretch now. I take everything I’ve done in Pandas and save it as a CSV file. I wanted to keep the headers but ditch any indexing. The clean CSV file will look like this:

    completed csv ready for oracle autonomous database chris hoina senior product manager oracle rest apis database tools autonomous database
    Cleaned, tidy CSV ready for Data Load via SQL Developer Web

    Step 6

    Lastly, I logged into SQL Developer Web and clicked the new Data Load button (introduced in Oracle REST Data Services version 22.3) to upload the CSV file into a new table. The Autonomous Database automatically infers column names and data types. I slightly modified the "statsdate" column (honestly, I could have left it alone, but it was easy enough to change).

    Before and After

    And that’s it! Once uploaded, I can compare what I did previously to what I have achieved most recently. And both ways are correct. For instance, depending on your requirements, you can retain the JSON payload as a CLOB (as seen in the first image) or a more traditional table format (as seen in the second image).

    Wrap up

    If you’ve made it this far, congrats! You should now have two ways to store Medium stats data in a table (that lives in the Oracle Autonomous Database) either as:

    • a CLOB
    • an OG table

    And if you’d like to review the code, you can find it here.

    And until next time 🫠, happy querying.

  • Clean up a .CSV file with Regular Expressions, Pandas, and Python

    Clean up a .CSV file with Regular Expressions, Pandas, and Python

    Let us begin

    I log into Database Actions as my newly created “Python Developer” and navigate directly to the “Data Load” page (found under the Data Tools section of the Launchpad). I choose to “Load Data” from a “Local File.” I click “next,” click the pencil icon (see the arrow in the image), and navigate to the “File” tab. I scroll to the “RESTAURANTOPENDATE” column and see this:

    CSV file in data load, Database Actions, Oracle Autonomous Database, Chris Hoina, Senior Product Manager, ORDS, Database Actions

    In a previous post (if you’re new, you can read the end-to-end process), I discussed how this time (“04:00:00+00” or “05:00:00+00”) wasn’t necessary for me. At that time, I used the “Find and Replace” function in Excel (I’m sure you can do the same with Numbers, Sheets, or Calc) to replace all occurrences of time with “” (i.e., nothing).

    But in the spirit of doing things in five days, when they could have taken but five minutes, I opted to see if I could achieve a similar result in python.

    Goal

    Create a python script that will allow you to remove part of a value in a .CSV file.

    WARNING: I thought this would be a simple task. I should have known better. My approach may not be ideal for your situation, but hopefully you’ll learn something. Or at the very least maybe you can bookmark this post, along with the resources (at the end of the post) I'm including for later use.

    Regular Expressions

    I am confident there is a way to achieve this goal with the .CSV library in python. There is probably a way to do this with python out of the box. I couldn’t figure it out.

    I’m also reasonably confident that my approach is on the verge of ridiculous. Nevertheless, Regular Expressions, and the Pandas library, in python are what worked for me.

    What are Regular Expressions?

    Good question. I still don’t know, but here is what I found on Wikipedia:

    “A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually, such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.”

    That is so painful to read, but the way I understand it is that we use Regular Expressions in pattern-matching. Essentially you create a pattern and then tell your application (or script) to search for it. From there, you can include more code to perform more actions. In my case, those actions would be to find a pattern and replace that pattern with nothing.

    So what is the pattern?

    One of the benefits of having zero formal training in application development (in this case, computer science and formal language theory) is that occasionally, you might take an approach that, while unintuitive, works well enough.

    And after many, many hours of trial and error, parsing through Stack Overflow, reviewing hours of YouTube, reading pages of blogs, and occasional use of the “I’m Feeling Lucky” button on Google, it occurred to me that my pattern was:

    Pattern for Regular Expressions, Database Actions, Oracle Autonomous Database, Chris Hoina, Senior Product Manager, ORDS, Database Actions

    After reviewing more documentation and other various resources (I have an entire section at the end), I more clearly identified a pattern:

    A more formal pattern for Regular Expressions, Database Actions, Oracle Autonomous Database, Chris Hoina, Senior Product Manager, ORDS, Database Actions

    Coming up with a clearly-defined pattern helped inform me as to how I could create the following Regular Expression:

    '\s\d+:\d+:\d+[^a-c6]\d+$'

    I then did some more stuff, and that was it! Follow me for more…

    I’m kidding.

    Deep Dive

    I can’t teach you everything, because I’m probably only ahead of you by a week at this point. But I can explain, in the hopes that it will make sense. Let me continue with the above Regular Expression (again, you’ll want to spend some time in the resources section I’ve included to understand better how this all fits together).

    But the above Regular Expression can be defined like this:

    Regular Expression definition for this script, Database Actions, Oracle Autonomous Database, Chris Hoina, Senior Product Manager, ORDS, Database Actions

    I then took the above and plugged it into this nine (12 with sections, slightly more with comments) line script:

    from pickle import NONE, TRUE
    #This doesn't seem work without pickle, still not sure why
    import pandas as pd
    #You need this to do anything related to dataframes (i.e. "df"); 
    # if you're a data scientist and you use python, you may already 
    # be familiar with this
    import re 
    #You need this for Regular Expressions to work
    
    d = '/Users/choina/Documents/untitled folder/Restaurants_in_Wake_County.csv'
    #I'm assigning the file to "d"
    df = pd.read_csv(d)
    #df means dataframe. Here I'm telling panda to read the .CSV file 
    # that I just created ("d"), and consider that the "dataframe".
    
    print(df.dtypes)
    #This is optional, but when I do this I can see what the various 
    # datatypes are (according to pandas). The RESTAURANTOPENDATE column 
    # is an 'object'.
    for RESTAURANTOPENDATE in df:
        df['RESTAURANTOPENDATE'] = df['RESTAURANTOPENDATE'].str.replace('\s\d+:\d+:\d+[^a-c6]\d+$', '', regex=TRUE)
    #This is a "For Loop" that says set the RESTAURANTOPENDATE column 
    # contents equal to the following: FIRST, consider the values in the 
    # column as string and replace the contents using this process: 
    # find a pattern that matches this Regular Expression, replace
    # it with NOTHING, then take that column,along with the other 
    # columns and...
    df.to_csv('newrest.csv')
    #Save it to a new .CSV file called "newrest.csv"

    Code as an image for reference:

    Python script to clean CSV file for data load, Database Actions, Oracle Autonomous Database, Chris Hoina, Senior Product Manager, ORDS, Database Actions

    In short, (I’ve left detailed comments for reference), these nine lines of code search for a pattern and replace it with nothing when found. We then take these results (including the rest of the .CSV file) and save them to a new file called “newrest.csv.”

    Please, reference this script. Tweak it as needed. Or drop me a note if you need an outside perspective. Good luck!

    One more thing

    Pickle, Pandas, and RE are all required for this script to work.

    Note: While I understand Pickle is a library that deals with the serialization of objects in python. I've no idea what that means, and reading about it makes my brain melt. Additionally, I’m not sure if this is expected behavior, but when I ran an earlier version of this script, this "Pickles" library just appeared at the very top. If this has happened to you, please leave a comment, because I'm scared there may be an actual ghost in my machine.

    Moving on

    The rest is pretty straightforward.

    I then went back into Database Actions (it had been so long that I’d expected a new version had been released already), loaded my data, and inspected the column in question:

    Updated CSV file in data load, Database Actions, Oracle Autonomous Database, Chris Hoina, Senior Product Manager, ORDS, Database Actions

    Looks good to me!

    Magic

    Something I noticed was that my Autonomous Database automatically recognizes the “RESTAURANTOPENDATE” column as a data type of “Date.” I didn’t have to do anything special here, and I thought this was pretty cool. Take a look:

    Database Actions automatically recognizes datatypes Oracle Autonomous Database, Chris Hoina, Senior Product Manager, ORDS, Database Actions

    Addendum to the addendum

    Once imported, you can also use Regular Expressions in your database (with SQL). But if you’d like to clean your data before it reaches your ADB, the approach I’m writing about here is also a great option!

    Resources

    As promised, here are all the resources I’ve curated that might assist you should you choose to take the road less traveled…

    • Important/relevant documentation:
    • Stack Oveflow – this is a series of “bookmarked” questions about Python and Regular Expressions, that helped me. They might be of use. They aren’t intended to teach you, but rather aide you in gaining a better perspective.
    • YouTube – a playlist I created that might be worth reviewing (at a minimum watch Case Digital’s “How to Remove Characters From a Pandas Dataframe In Python” video).
    • Regular Expression Editors:
      • Pythex (make sure you click the “Regular expression cheatsheet” button to expand/see a valuable quick reference – also, apparently a more complete version can be found here)
      • Regular Expressions 101 – this includes an enormous library of previously created Regular Expressions. There is a “debugger” too!

    An even simpler solution

    I discovered an even simpler solution to this problem. As I was reviewing the Pandas documentation I noticed there was functionality built directly into Pandas. This can be done with the “datatime” function.

    And you can take it one step further with what is referred to as a “.dt accessor.” (you’ll see this in the updated code I’m including). This accessor allows you to manipulate a Series (aka a column) within a Pandas dataframe (df).

    There are quite a few time-date properties you can manipulate in Pandas, all of which can be found here. The “Time Zone Operations” section is where the “.dt accessor”, is briefly mentioned. For more context, I’d recommend reviewing this section on the .dt accessor and how they interplay with a Pandas Series.

    Don’t stop there though, the main python documentation discusses the basics of date and time operations. Without further ado, on to the new code:

    import pandas as pd 
    import numpy as np 
    
    #You don't have to do it like this. You could have a single "d" and then just comment/uncomment each line as needed. 
    
    d1 = '/Users/choina/Downloads/Food_Inspections.csv'
    d2 = '/Users/choina/Downloads/Food_Inspection_Violations copy.csv'
    d3 = '/Users/choina/Downloads/Restaurants_in_Wake_County copy.csv'
    
    #The same thing here, you could just have a single "d" and then just comment out the other two, when not needed.
    
    df_1 = pd.read_csv(d1)
    df_2 = pd.read_csv(d2)
    df_3 = pd.read_csv(d3)
    
    #Same goes for this too. This is either very slick of me, or incredibly lazy. 
    
    df_1['DATE_'] = pd.to_datetime(df_1['DATE_']).dt.date
    df_2['INSPECTDATE'] = pd.to_datetime(df_2['INSPECTDATE']).dt.date
    df_3['RESTAURANTOPENDATE'] = pd.to_datetime(df_3['RESTAURANTOPENDATE']).dt.date
    
    #Same, same here as well. You could do one at a time. But it works, I double-checked my work. 
    
    df_1.to_csv('newninspect.csv')
    df_2.to_csv('newviolate.csv')
    df_3.to_csv('newrest.csv')
    New python script to change date time in csv files, Chris Hoina, Senior Product Manager, ORDS, Database Tools, Oracle Autonomous Database

    About the update

    This code is certified fresh – meaning, it works. I triple-checked. The only thing I’ve noticed is that lines 12-14 must-have “.dt.date” at the end. From what I am reading (and am inferring from the documentation), it looks as though you need to first “treat” the Series with the “to_datatime” function. After that, the entire dataframe is in limbo (not quite written out to a new .CSV), waiting. Before the next step, we can strip the time portion out using the .dt accessor (i.e. the “.dt.date” portion located at the end of lines 12-14).

    From there it is simply a matter of writing out these updated dataframes to the three new expectant .CSV files.

    Find Me

    Please come back for more. I’m going to be working on views in my Autonomous Database, and then later, I’ll REST-enable them with the help of ORDS!