Categories
Autonomous Database, Python

Clean up a .CSV file with Regular Expressions, Pandas, and Python

Let us begin

I log into Database Actions as my newly created “Python Developer” and navigate directly to the “Data Load” page (found under the Data Tools section of the Launchpad). I choose to “Load Data” from a “Local File.” I click “next,” click the pencil icon (see the arrow in the image), and navigate to the “File” tab. I scroll to the “RESTAURANTOPENDATE” column and see this:

The CSV file in the Data Load tool of Database Actions (image).

In a previous post (if you’re new, you can read the end-to-end process), I discussed how the time portion (“04:00:00+00” or “05:00:00+00”) wasn’t necessary for me. At the time, I used the “Find and Replace” function in Excel (I’m sure you can do the same with Numbers, Sheets, or Calc) to replace all occurrences of the time with “” (i.e., nothing).

But in the spirit of spending five days on something that could have taken five minutes, I opted to see if I could achieve a similar result in python.

Goal

Create a python script that will allow you to remove part of a value in a .CSV file.

WARNING: I thought this would be a simple task. I should have known better. My approach may not be ideal for your situation, but hopefully you’ll learn something. Or at the very least, you can bookmark this post, along with the resources I’m including (at the end), for later use.

Regular Expressions

I am confident there is a way to achieve this goal with python’s built-in csv library, right out of the box. I just couldn’t figure it out.

I’m also reasonably confident that my approach borders on the ridiculous. Nevertheless, Regular Expressions and the Pandas library in python are what worked for me.

What are Regular Expressions?

Good question. I still don’t know, but here is what I found on Wikipedia:

“A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually, such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.”

That is so painful to read, but the way I understand it is that we use Regular Expressions in pattern-matching. Essentially you create a pattern and then tell your application (or script) to search for it. From there, you can include more code to perform more actions. In my case, those actions would be to find a pattern and replace that pattern with nothing.
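
For example, here’s a tiny illustration using python’s built-in re module (the string is made up):

import re

#Find every run of digits in the string and replace it with nothing
print(re.sub(r'\d+', '', 'open 04:00, close 22:00'))
#Prints: open :, close :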

So what is the pattern?

One of the benefits of having zero formal training in application development (in this case, computer science and formal language theory) is that occasionally, you might take an approach that, while unintuitive, works well enough.

And after many, many hours of trial and error, parsing through Stack Overflow, reviewing hours of YouTube, reading pages of blogs, and occasional use of the “I’m Feeling Lucky” button on Google, it occurred to me that my pattern was:

Pattern for the Regular Expression, as I first scribbled it down (image).

After reviewing more documentation and various other resources (I have an entire section at the end), I more clearly identified a pattern:

A more formal pattern for the Regular Expression (image).

Coming up with a clearly defined pattern helped me see how I could create the following Regular Expression:

'\s\d+:\d+:\d+[^a-c6]\d+$'

I then did some more stuff, and that was it! Follow me for more…

I’m kidding.

Deep Dive

I can’t teach you everything, because I’m probably only ahead of you by a week at this point. But I can explain, in the hopes that it will make sense. Let me continue with the above Regular Expression (again, you’ll want to spend some time in the resources section I’ve included to understand better how this all fits together).

The above Regular Expression breaks down like this:

The Regular Expression, defined piece by piece (image).
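
To see the whole thing in action, here’s a quick demo (the sample value is made up, but it has the same shape as the values in my column):

import re

value = '2006-10-18 04:00:00+00'
# \s       one whitespace character (the space before the time)
# \d+      one or more digits (the hours)
# :        a literal colon
# \d+      one or more digits (the minutes)
# :        a literal colon
# \d+      one or more digits (the seconds)
# [^a-c6]  any single character that is NOT a, b, c, or 6 (here, the "+")
# \d+$     one or more digits at the very end of the string (the offset)
print(re.sub(r'\s\d+:\d+:\d+[^a-c6]\d+$', '', value))
#Prints: 2006-10-18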

I then took the above and plugged it into this short script (a handful of lines of code, slightly more with comments):

import pandas as pd
#You need this to do anything related to dataframes (i.e. "df");
# if you're a data scientist and you use python, you may already
# be familiar with this. Pandas also handles the Regular Expression
# matching internally, so a separate "import re" isn't needed here.

d = '/Users/choina/Documents/untitled folder/Restaurants_in_Wake_County.csv'
#I'm assigning the file to "d"
df = pd.read_csv(d)
#df means dataframe. Here I'm telling pandas to read the .CSV file
# that I just assigned ("d"), and consider that the "dataframe".

print(df.dtypes)
#This is optional, but when I do this I can see what the various
# datatypes are (according to pandas). The RESTAURANTOPENDATE column
# is an 'object'.

df['RESTAURANTOPENDATE'] = df['RESTAURANTOPENDATE'].str.replace(r'\s\d+:\d+:\d+[^a-c6]\d+$', '', regex=True)
#Set the RESTAURANTOPENDATE column contents equal to the following:
# FIRST, consider the values in the column as strings, THEN find
# anything matching the Regular Expression and replace it with
# NOTHING, then take that column, along with the other columns, and...
df.to_csv('newrest.csv')
#Save it all to a new .CSV file called "newrest.csv"

Code as an image for reference:

The python script to clean the .CSV file before data load (image).

In short (I’ve left detailed comments for reference), these few lines of code search for a pattern and, when found, replace it with nothing. We then take those results (including the rest of the .CSV file) and save them to a new file called “newrest.csv.”

Please, reference this script. Tweak it as needed. Or drop me a note if you need an outside perspective. Good luck!

One more thing

Pandas does the heavy lifting in this script; as it turns out, neither Pickle nor RE was actually required.

Note: Pickle is a library that deals with the serialization of objects in python. I’ve no idea what that means, and reading about it makes my brain melt. When I ran an earlier version of this script, a “from pickle import NONE, TRUE” line just appeared at the very top. The most likely (non-ghost) explanation: I had typed “TRUE” instead of python’s built-in “True”, and my editor auto-imported a pickle constant that happens to share the name. If this has happened to you, check your capitalization before concluding there’s an actual ghost in your machine.

Moving on

The rest is pretty straightforward.

I then went back into Database Actions (it had been so long, I half-expected a new version to have been released already), loaded my data, and inspected the column in question:

The updated CSV file in the Data Load tool of Database Actions (image).

Looks good to me!

Magic

Something I noticed was that my Autonomous Database automatically recognized the “RESTAURANTOPENDATE” column as a “Date” datatype. I didn’t have to do anything special here, and I thought this was pretty cool. Take a look:

Database Actions automatically recognizes the “Date” datatype (image).

Addendum to the addendum

Once imported, you can also use Regular Expressions in your database (with SQL). But if you’d like to clean your data before it reaches your ADB, the approach I’m writing about here is also a great option!
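
For instance, here’s a rough sketch of what that could look like with Oracle’s REGEXP_REPLACE function, assuming the column had been loaded as a VARCHAR2 (and reusing the pattern from above):

-- Strip the time portion in the database instead of in python
UPDATE admin.restaurants
SET restaurantopendate = REGEXP_REPLACE(restaurantopendate, '\s\d+:\d+:\d+[^a-c6]\d+$', '');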

Resources

As promised, here are all the resources I’ve curated that might assist you should you choose to take the road less traveled…

  • Important/relevant documentation:
  • Stack Overflow – this is a series of “bookmarked” questions about Python and Regular Expressions that helped me. They might be of use. They aren’t intended to teach you, but rather to aid you in gaining a better perspective.
  • YouTube – a playlist I created that might be worth reviewing (at a minimum watch Case Digital’s “How to Remove Characters From a Pandas Dataframe In Python” video).
  • Regular Expression Editors:
    • Pythex (make sure you click the “Regular expression cheatsheet” button to expand/see a valuable quick reference – also, apparently a more complete version can be found here)
    • Regular Expressions 101 – this includes an enormous library of previously created Regular Expressions. There is a “debugger” too!

An even simpler solution

I discovered an even simpler solution to this problem. While reviewing the Pandas documentation, I noticed the functionality is built directly into Pandas, via the “to_datetime” function.

And you can take it one step further with what is referred to as a “.dt accessor” (you’ll see this in the updated code I’m including). This accessor allows you to manipulate a Series (aka a column) within a Pandas dataframe (df).

There are quite a few date and time properties you can manipulate in Pandas, all of which can be found here. The “Time Zone Operations” section is where the “.dt accessor” is briefly mentioned. For more context, I’d recommend reviewing this section on the .dt accessor and how it interplays with a Pandas Series.
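
Here’s a minimal sketch of the idea, using a made-up value:

import pandas as pd

s = pd.Series(['2006-10-18 04:00:00+00'])
#to_datetime converts the strings into proper (timezone-aware) timestamps
s = pd.to_datetime(s)
#The .dt accessor then lets us pull out just the date portion
print(s.dt.date[0])
#Prints: 2006-10-18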

Don’t stop there though; the main python documentation discusses the basics of date and time operations. Without further ado, on to the new code:

import pandas as pd
#Pandas is the only import needed this time around

#You don't have to do it like this. You could have a single "d" and then just comment/uncomment each line as needed. 

d1 = '/Users/choina/Downloads/Food_Inspections.csv'
d2 = '/Users/choina/Downloads/Food_Inspection_Violations copy.csv'
d3 = '/Users/choina/Downloads/Restaurants_in_Wake_County copy.csv'

#The same thing here, you could just have a single "d" and then just comment out the other two, when not needed.

df_1 = pd.read_csv(d1)
df_2 = pd.read_csv(d2)
df_3 = pd.read_csv(d3)

#Same goes for this too. This is either very slick of me, or incredibly lazy. 

df_1['DATE_'] = pd.to_datetime(df_1['DATE_']).dt.date
df_2['INSPECTDATE'] = pd.to_datetime(df_2['INSPECTDATE']).dt.date
df_3['RESTAURANTOPENDATE'] = pd.to_datetime(df_3['RESTAURANTOPENDATE']).dt.date

#Same, same here as well. You could do one at a time. But it works, I double-checked my work. 

df_1.to_csv('newninspect.csv')
df_2.to_csv('newviolate.csv')
df_3.to_csv('newrest.csv')

New python script to change the dates in the .CSV files (image).

About the update

This code is certified fresh, meaning it works; I triple-checked. The only thing I’ve noticed is that the three “to_datetime” lines must have “.dt.date” at the end. From what I’m reading (and inferring from the documentation), you first need to “treat” each Series with the “to_datetime” function. After that, the entire dataframe is in limbo (not quite written out to a new .CSV), waiting. Before the next step, we can strip the time portion out using the .dt accessor (i.e., the “.dt.date” portion at the end of those three lines).

From there it is simply a matter of writing out these updated dataframes to the three new expectant .CSV files.

Find Me

Please come back for more. I’m going to be working on views in my Autonomous Database, and then later, I’ll REST-enable them with the help of ORDS!

Categories
Autonomous Database

Updating VARCHAR2 fields in Oracle Database Actions

Update

I’ve run into more problems with this python/flask/ords/adb app idea of mine. Turns out I’m too [developer] dumb to do some of the things that I’d originally envisioned. Now that I’ve become more familiar with python, flask, the Oracle Autonomous Database, and our Database Tools (like ORDS and Database Actions), I’m realizing that I need to rethink my approach.

First off, I’ve updated the (not trying to be pretentious here) User Journey for this project.

SIDE NOTE: The User Journey is something that I really should pay more attention to (especially considering it was one of my concentrations in my MBA program, and it was a huge initiative in my previous role at Big Blue) and you should too!

Ch-ch-ch-ch-changes

I’m leaning towards (after a recent discussion with that jeff smith) more of a drop-down style of search. I’d like to present a user with all the available cities (having taken these from my “Restaurants” table) in Wake County.

I think I can populate a list of cities to choose from with the help of REST APIs (via ORDS). But I need to do some clean-up first.
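
To give you an idea of where I’m headed (this is only a sketch, not the finished product; the module and path names are placeholders), ORDS lets you define a service like this in PL/SQL:

BEGIN
    -- Expose the list of distinct cities as a GET endpoint
    ORDS.DEFINE_SERVICE(
        p_module_name => 'restaurants',
        p_base_path   => '/restaurants/',
        p_pattern     => 'cities',
        p_source      => 'SELECT DISTINCT city FROM restaurants ORDER BY 1'
    );
    COMMIT;
END;
/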

Side note

In about a minute, you’ll probably come to the realization that my approach is less than elegant. Remember, I’m not a developer. I’m kind of stumbling through this, and I’ll need to make some sacrifices along the way (we call those “trade-offs” in business-speak). Unfortunately, there are simply some techniques/skills that I do not know yet. For a visual representation of these deficits, I draw your attention to:

Me, “the developer,” versus an actual developer (image).

Back to your regularly scheduled program

On to clean-up. The cities in this table are a bit all over the place. Let’s take a look.

From Database Actions, I can navigate to a SQL Worksheet. From there, I can inspect my tables (Right-click > Edit) or jump straight into the SQL. Since I know my “CITY” column is the problem, I’ll run this:

SELECT DISTINCT CITY
FROM ADMIN.RESTAURANTS
ORDER BY 1

Some screenshots of what this looks like in Database Actions:

NOTE: “DISTINCT” drops any duplicates from the results, and “ORDER BY” returns the results in ascending order.

I suspect that the program used to input health inspections allows for some level of free text.

Tangent / Story

I’ve used an application similar to this in a previous life, but it was a Computer Aided Dispatch (CAD) application. You could enter free text or select from users’ previous entries. There was a lot of flexibility, but because of that, there were many variations on things like street names, cities, locations, parking lots, person names, etc. For the user, awesome; for the developer or database administrator, less so.

At this point, it is pretty clear how much normalization I’ll have to do to all these city names. Now I just need to figure out:

  • which case to use (Upper, Sentence, Lower, SaRCaSTic, etc.)
  • which city name variations to keep

I randomly chose a city and updated it with the following SQL command:

UPDATE Restaurants SET CITY = 'Holly Springs'
WHERE UPPER(CITY) = 'HOLLY SPRING'

In this example, I searched for all instances of “Holly Spring” and replaced them with the correct “Holly Springs”. I can then repeat the process for all other cities. Again, the abridged walk-through of this is here.
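
And if several variations map to the same correct name, you could knock them out in one pass with something like this (the extra misspellings here are hypothetical):

UPDATE Restaurants SET CITY = 'Holly Springs'
WHERE UPPER(CITY) IN ('HOLLY SPRING', 'HOLLY SPRINGS NC', 'HOLLEY SPRINGS');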

I actually know this data well, since I grew up in the area. So deciding on the correct city names shouldn’t be too challenging. But familiarizing yourself with your own data might not be such a bad idea. For instance, in this table, one of the cities displays as “Research Triangle Park”. I don’t think it’s common to refer to it like this; we usually just call it “RTP”. I think little details like that really help to elevate your application.

Next steps

I’ll keep this brief. What I think I’ll have to do next is create a REST API based on this Restaurants table. I should have a better idea in a few days, but the idea is that once a user selects a city, the application will pass that city as a parameter to a GET request (again, courtesy of ORDS) and return additional data. The user could then filter further on the returned data, eventually landing on their target restaurant.
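
As a rough sketch, the request might look something like this in python (the host is a placeholder, and I’m leaning on the “q” filter parameter that ORDS AutoREST provides):

import requests

#Hypothetical AutoREST endpoint for the REST-enabled RESTAURANTS table
url = 'https://<my-adb-host>/ords/admin/restaurants/'
#ORDS filters rows with the "q" parameter, which takes a JSON FilterObject
response = requests.get(url, params={'q': '{"city": "Holly Springs"}'})
print(response.json())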

Okay, stay tuned…as I feel like I’m on the precipice of something exciting!

Catch me if you can


Abridged walk-through

Selecting a SQL Worksheet in Database Actions (image).

From Database Actions, navigate to the “SQL” feature. You’ll see this screen appear.

NOTE: I had a previous SQL Worksheet saved, but you can enter SQL statements directly into the default worksheet. 
Before the CITY field has been updated (image).

*If you expand, you can see the statement that I used.

SELECT DISTINCT CITY
FROM ADMIN.RESTAURANTS
ORDER BY 1

This will produce a list of all the cities in their current form. Notice the four different variations of “Fuquay-Varina”.

Updating fields in the CITY column (image).

I’ll comment out lines 1-3 and enter the following SQL statement to update Fuquay-Varina:

UPDATE Admin.Restaurants SET CITY = 'Fuquay-Varina'
WHERE UPPER(CITY) = 'FUQUAY VARINA'

Run the statement, and you’ll see that 77 rows have been updated.

After Fuquay-Varina has been updated (image).

I’ll un-comment lines 1-3, and comment lines 5-6.

Based on the results, we’ve reduced the variations of “Fuquay-Varina”. I’ll continue this work until I have a smaller set of more consistent cities.

Categories
Autonomous Database

Table Prep: Data loads and time zones

Next episode

My Autonomous Database in Oracle Cloud Infrastructure

With my Autonomous Database up and running, I needed to find some data to upload. After some quiet reflection, I decided upon three datasets that should be plenty of fun.

Restaurants in Wake County, North Carolina (image).

A bit of context: I currently live in Wake County, North Carolina. And it turns out that we’ve been very progressive about providing open, accessible data to our citizens. I really wanted to do something with food, so I was surprised when I found the following three datasets (among the nearly 300 available):

  1. Restaurants in Wake County
  2. Restaurant Food Inspections/Grades
  3. Restaurant Food Inspection Violation History

Always read the docs

Upon reading the documentation, I discovered that these datasets can all be connected through their HSISID* permit ID fields, which should come in handy later, as it could make for some very interesting (if not amusing) revelations.

*I’m not sure of the acronym, but the documentation defines the HSISID as a, “State code identifying the restaurant (also the primary key to identify the restaurant).”
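
As a quick illustration of that link, a join across two of the tables might eventually look something like this (a sketch; the table and column names are my guesses at this point):

-- Pair each restaurant with its inspection history via HSISID
SELECT r.name, i.date_, i.score
FROM restaurants r
JOIN food_inspections i ON i.hsisid = r.hsisid;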

To the data loading

I’m sticking with the Oracle Database Actions* web-based interface for now.

*Oracle Database Actions was previously known as Oracle SQL Developer Web.
In the Database Actions Launchpad, there are several actions for users, including development, data loading, monitoring, and administration (image).

And from what I learned in the Autonomous Database workshop, there are two ways to load data into your database (there may be other methods; mea culpa if so):

  1. SQL worksheet (found in the Development section, of your launchpad)
  2. Data Load Tool (found in the Data Tools section)

I opted for the SQL Worksheet method, since I need to practice and learn it regardless.

Obstacles

After choosing my data loading method, I encountered some obstacles. With a complete lack of regard for best practices, I proceeded to load all three datasets into my Autonomous Database, only to discover two glaring issues.

First

Unfortunately, in my restaurant “violations” table, I encountered two rows that failed to load.

ORA-12899 error in Oracle Autonomous Database (image).

These values were loaded as a VARCHAR2 datatype, which by default allows for a maximum of 4,000 bytes. Since these two rows exceeded 4,000 bytes in that specific column, they failed to load. Fortunately, it was easy to increase the limit to 10,000 bytes.

Apparently, I could increase it to 32,767 bytes, but that seemed overboard. I also learned that the VARCHAR2 datatype’s maximum string size can be set to “STANDARD” or “EXTENDED” (i.e., 4,000 bytes vs. 32,767 bytes). I’m assuming the Autonomous Database is set to EXTENDED by default, since I was able to increase this column and re-upload with zero failed rows. You can read up on VARCHAR2 datatypes here.
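
If you hit the same ORA-12899 error, the fix is a one-liner (my table and column names here are placeholders):

-- Widen the failing column from the default 4,000 bytes to 10,000
ALTER TABLE food_inspection_violations MODIFY (comments VARCHAR2(10000));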

Second

The second obstacle took me about a day (on and off) to figure out. The dates in all three of these tables were constructed like this:

Timestamps in the Oracle Autonomous Database (image).

At first glance, I thought I was looking at a time zone offset: dates carried either a 04:00:00+00 or a 05:00:00+00 time, possibly due to daylight saving time. That seemed like a reasonable assumption, since the early-November to mid-March dates were an hour later than the mid-March to early-November dates (i.e., EST is UTC-5 in winter and EDT is UTC-4 the rest of the year, so midnight local time stored as UTC shows up as 05:00:00+00 or 04:00:00+00, respectively).

My first thought was to modify the affected columns with something like this parameter:

TIMESTAMP [(fractional_seconds_precision)] WITH TIME ZONE

or with…

TIMESTAMP '1997-01-31 09:26:56.66 +02:00'

But I’d never seen this date/timestamp type before (the “esriFieldTypeDate”), so I had to investigate before I did anything (also, I had no idea how to update a column with a parameter like this, so it would have taken additional time).

Turns out this data originates from ArcGIS Online, and its field date type maps directly to Oracle’s “TIMESTAMP” datatype, as seen here in the documentation:

In the ArcGIS documentation, the “esriFieldTypeDate” maps to Oracle’s “TIMESTAMP” datatype (image).

I later tinkered with an API Explorer on the Wake County datasets site and reviewed a sample JSON response for this data set. The response also confirmed the field type:

Wake County has an API Explorer on their datasets site, which you can use to sample-query and review the JSON that is returned (image).

It’s funny, because after all this hunting, it looks like I won’t be able to include time zones anyway. It’s somewhat inconsequential, as I’m only interested in Wake County, but for some, that could be unnerving.

I wish I could say that I came up with a cool Python script to automate the cleaning of these columns (I would have had to do this for all three of my .CSV files), but I did not. It would have taken me an afternoon or longer, since I haven’t toyed with the csv reader (in the Python standard library), Pandas, or NumPy in quite some time. In the end (and it pains me to say this; embarrassing, really), I used the “Find + Replace” feature in Numbers to remove the time/time zone portion of the dates.

I’m pathetic, I know. But the idea of creating a dedicated Python script does seem like a fun weekend project. Who knows, maybe I’ll post to GitHub. Regardless, my Autonomous Database is clean and primed for ORDS.

In closing

My advice is as such:

Always inspect your data before you do anything major, always make sure you understand the inputs and the resultant outputs of the data loading process, and always refer to the documentation FIRST so you can understand how data is being recorded, coded, displayed, etc.

Chris Hoina, 2022

Oh, and it wouldn’t hurt to think about how you might use this data in the future. I’m guessing that could help you avoid a lot of rework (i.e. technical debt).

Take me as an example. Since I’m focused on ORDS and REST enabling my database, I may want to consider things like:

  • What am I going to expose (via REST APIs)? or
  • What fields might the presentation layer of my application include?

In my case, a lot of this data will lie dormant, as I might choose to focus on only a few key fields. Of course, anything I create will be simple and basic. But that might not be the case for you.

Koala RESTing, Photo by Jordan Whitt on Unsplash

I’m off to REST…enable this database now, but in the meantime if you want to learn more about ORDS, check out:

Find me

And as always, catch me on: