Tag: Data prep

  • From one API to another: Using ORDS Auto-REST to update a table

    From one API to another: Using ORDS Auto-REST to update a table

It turns out the National Highway Traffic Safety Administration (NHTSA; pronounced ‘nit-sə’) has a Vehicle Identification Number (VIN) decoder that’s freely available to the public. But even cooler is the set of APIs they have. The one I found most interesting, and useful for my use case, was the /vehicles/DecodeVinValues/[your VIN]?format=[format]&modelyear=[year] endpoint.

Here is an example as a cURL command (using my 1999 Chevy Silverado 🤠), with the subsequent server response:

curl --request GET \
  --url 'https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVinValues/1GCEK14V0XZ138493?format=json&modelyear=1999'
    
    {
    	"Count": 1,
    	"Message": "Results returned successfully. NOTE: Any missing decoded values should be interpreted as NHTSA does not have data on the specific variable. Missing value should NOT be interpreted as an indication that a feature or technology is unavailable for a vehicle.",
    	"SearchCriteria": "VIN(s): 1GCEK14V0XZ138493",
    	"Results": [
    		{
    			"ABS": "",
    			"ActiveSafetySysNote": "",
    			"AdaptiveCruiseControl": "",
    			"AdaptiveDrivingBeam": "",
    			"AdaptiveHeadlights": "",
    			"AdditionalErrorText": "",
    			"AirBagLocCurtain": "",
    			"AirBagLocFront": "",
    			"AirBagLocKnee": "",
    			"AirBagLocSeatCushion": "",
    			"AirBagLocSide": "",
    			"AutoReverseSystem": "",
    			"AutomaticPedestrianAlertingSound": "",
    			"AxleConfiguration": "",
    			"Axles": "",
    			"BasePrice": "",
    			"BatteryA": "",
    			"BatteryA_to": "",
    			"BatteryCells": "",
    			"BatteryInfo": "",
    			"BatteryKWh": "",
    			"BatteryKWh_to": "",
    			"BatteryModules": "",
    			"BatteryPacks": "",
    			"BatteryType": "",
    			"BatteryV": "",
    			"BatteryV_to": "",
    			"BedLengthIN": "",
    			"BedType": "",
    			"BlindSpotIntervention": "",
    			"BlindSpotMon": "",
    			"BodyCabType": "",
    			"BodyClass": "Pickup",
    			"BrakeSystemDesc": "",
    			"BrakeSystemType": "Hydraulic",
    			"BusFloorConfigType": "Not Applicable",
    			"BusLength": "",
    			"BusType": "Not Applicable",
    			"CAN_AACN": "",
    			"CIB": "",
    			"CashForClunkers": "",
    			"ChargerLevel": "",
    			"ChargerPowerKW": "",
    			"CombinedBrakingSystem": "",
    			"CoolingType": "",
    			"CurbWeightLB": "",
    			"CustomMotorcycleType": "Not Applicable",
    			"DaytimeRunningLight": "",
    			"DestinationMarket": "",
    			"DisplacementCC": "4800.0",
    			"DisplacementCI": "292.91397165471",
    			"DisplacementL": "4.8",
    			"Doors": "2",
    			"DriveType": "4WD/4-Wheel Drive/4x4",
    			"DriverAssist": "",
    			"DynamicBrakeSupport": "",
    			"EDR": "",
    			"ESC": "",
    			"EVDriveUnit": "",
    			"ElectrificationLevel": "",
    			"EngineConfiguration": "V-Shaped",
    			"EngineCycles": "",
    			"EngineCylinders": "8",
    			"EngineHP": "",
    			"EngineHP_to": "",
    			"EngineKW": "",
    			"EngineManufacturer": "",
    			"EngineModel": "",
    			"EntertainmentSystem": "",
    			"ErrorCode": "0",
    			"ErrorText": "0 - VIN decoded clean. Check Digit (9th position) is correct",
    			"ForwardCollisionWarning": "",
    			"FuelInjectionType": "Multipoint Fuel Injection (MPFI)",
    			"FuelTankMaterial": "",
    			"FuelTankType": "",
    			"FuelTypePrimary": "",
    			"FuelTypeSecondary": "",
    			"GCWR": "",
    			"GCWR_to": "",
    			"GVWR": "Class 2E: 6,001 - 7,000 lb (2,722 - 3,175 kg)",
    			"GVWR_to": "",
    			"KeylessIgnition": "",
    			"LaneCenteringAssistance": "",
    			"LaneDepartureWarning": "",
    			"LaneKeepSystem": "",
    			"LowerBeamHeadlampLightSource": "",
    			"Make": "CHEVROLET",
    			"MakeID": "467",
    			"Manufacturer": "GENERAL MOTORS LLC",
    			"ManufacturerId": "984",
    			"Model": "Silverado",
    			"ModelID": "1850",
    			"ModelYear": "1999",
    			"MotorcycleChassisType": "Not Applicable",
    			"MotorcycleSuspensionType": "Not Applicable",
    			"NCSABodyType": "",
    			"NCSAMake": "",
    			"NCSAMapExcApprovedBy": "",
    			"NCSAMapExcApprovedOn": "",
    			"NCSAMappingException": "",
    			"NCSAModel": "",
    			"NCSANote": "",
    			"NonLandUse": "",
    			"Note": "",
    			"OtherBusInfo": "",
    			"OtherEngineInfo": "Iron",
    			"OtherMotorcycleInfo": "",
    			"OtherRestraintSystemInfo": "",
    			"OtherTrailerInfo": "",
    			"ParkAssist": "",
    			"PedestrianAutomaticEmergencyBraking": "",
    			"PlantCity": "ROANOKE",
    			"PlantCompanyName": "Fort Wayne Assembly",
    			"PlantCountry": "UNITED STATES (USA)",
    			"PlantState": "INDIANA",
    			"PossibleValues": "",
    			"Pretensioner": "",
    			"RearAutomaticEmergencyBraking": "",
    			"RearCrossTrafficAlert": "",
    			"RearVisibilitySystem": "",
    			"SAEAutomationLevel": "",
    			"SAEAutomationLevel_to": "",
    			"SeatBeltsAll": "",
    			"SeatRows": "",
    			"Seats": "",
    			"SemiautomaticHeadlampBeamSwitching": "",
    			"Series": "1500 (1/2 Ton)",
    			"Series2": "Conventional Cab",
    			"SteeringLocation": "",
    			"SuggestedVIN": "",
    			"TPMS": "",
    			"TopSpeedMPH": "",
    			"TrackWidth": "",
    			"TractionControl": "",
    			"TrailerBodyType": "Not Applicable",
    			"TrailerLength": "",
    			"TrailerType": "Not Applicable",
    			"TransmissionSpeeds": "",
    			"TransmissionStyle": "",
    			"Trim": "",
    			"Trim2": "",
    			"Turbo": "",
    			"VIN": "1GCEK14V0XZ138493",
    			"ValveTrainDesign": "",
    			"VehicleDescriptor": "1GCEK14V*XZ",
    			"VehicleType": "TRUCK",
    			"WheelBaseLong": "",
    			"WheelBaseShort": "",
    			"WheelBaseType": "",
    			"WheelSizeFront": "",
    			"WheelSizeRear": "",
    			"WheelieMitigation": "",
    			"Wheels": "",
    			"Windows": ""
    		}
    	]
    }

    On its own, this is pretty cool. But I wanted to do something more with this information. I wanted to store it, so I could use it later, for a Cloud World-related project.

    A GET and POST Python script

So, with the help of one of the many popular LLMs, I settled on a Python script that helps me take what you see above and get it “database ready.” Here is what that Python code looks like:

    import argparse
    import requests
    import json
    import re
    
    def sanitize_key(key: str) -> str:
        # Convert keys to lowercase, replace non-alphanumeric characters with underscores.
        key = key.strip()
        key = re.sub(r'[^a-zA-Z0-9]+', '_', key)
        return key.lower()
    
    def decode_vin(vin: str, model_year: int):
        url = f"https://vpic.nhtsa.dot.gov/api/vehicles/DecodeVinValues/{vin}?format=json&modelyear={model_year}"
        response = requests.get(url)
        response.raise_for_status()
    
        data = response.json()
    
        if "Results" not in data or not data["Results"]:
            raise ValueError("Invalid API response: No Results found.")
    
        raw_result = data["Results"][0]
    
        # Sanitize keys and skip empty values
        result = {
            sanitize_key(k): v
            for k, v in raw_result.items()
            if v not in (None, "", "Not Applicable")
        }
    
        return result
    
    def post_to_endpoint(payload: dict, endpoint: str):
        headers = {'Content-Type': 'application/json'}
        response = requests.post(endpoint, json=payload, headers=headers)
        response.raise_for_status()
        return response.json()
    
    def main():
        parser = argparse.ArgumentParser(description="A tool to decode VINs and send them to an Oracle REST endpoint")
        parser.add_argument("vin", type=str, help="Vehicle Identification Number")
        parser.add_argument("model_year", type=int, help="Model year of the vehicle")
        parser.add_argument("--pretty", action="store_true", help="Pretty-print output JSON")
        args = parser.parse_args()
    
        # Decode and sanitize
        result = decode_vin(args.vin, args.model_year)
    
        # Print to console
        if args.pretty:
            print(json.dumps(result, indent=2))
        else:
            print(result)
    
        # POST to local Oracle ORDS endpoint
        endpoint_url = "http://localhost:8080/ords/ordsdemo/vin_table/"
        try:
            post_response = post_to_endpoint(result, endpoint_url)
            print("\nPOST response from server:")
            print(json.dumps(post_response, indent=2))
        except requests.HTTPError as e:
            print(f"Error posting to endpoint: {e}")
    
    if __name__ == "__main__":
        main()


    I’ll explain briefly according to the code blocks:

• Lines 6-10 take the results, or rather the “keys” (which will later become the columns in a database table), and standardize them. The keys end up all lowercase, with no whitespace.
• Lines 12-31 are the bulk of the script. That part issues a GET request to the VIN decoder endpoint. If you can picture it in your head, what I’m effectively doing is creating a table of keys and values. Those will then be sent over to my ORDS endpoint (a table I separately created and Auto-REST enabled with ORDS).
• Lines 33-37 and 56-62 work in tandem. The first block, lines 33-37, expects the payload and endpoint you see in lines 56-62. Line 58 sends the results (after we’ve processed them), along with the ORDS endpoint URL, to the function defined in lines 33-37.
• Lines 39-65 are my entry and exit points for this script. So on the command line I’d execute the following command (I’m using the same VIN and model year): python3 vindecode.py 1GCEK14V0XZ138493 1999

    Target 23ai Database

So this whole exercise doesn’t work unless I have a target in my database. And for this project I’m working on, a simple table that has been Auto-REST enabled is fine. I just need it to store these values. And I can use the VIN as a Primary Key (VINs are unique; no two vehicles share one).

    And if I’m thinking ahead, I’ll probably never update any single row. I just need to be able to insert a new row with a POST request, and then later issue a GET request by VIN. So an Auto-REST enabled table, with the VIN as a Primary Key will work just fine. As things scale, this might change, but with minimal technical debt incurred, this approach works for now.

As an aside, when I’m working through projects like these, where I’m building out a new object, I usually have a general idea of what I want to achieve, but I’ll go through tons of iterations¹. This time I created and dropped something like four table versions before I settled on this one (four is on the low end, actually). The final DDL for this table:

    CREATE TABLE ORDSDEMO.VIN_TABLE 
        ( 
         VIN                         VARCHAR2 (17) , 
         MAKE                        VARCHAR2 (50) , 
         MAKEID                      NUMBER (5) , 
         MANUFACTURER                VARCHAR2 (100) , 
         MANUFACTURERID              NUMBER (10) , 
         MODEL                       VARCHAR2 (100) , 
         MODELID                     NUMBER (10) , 
         MODELYEAR                   NUMBER (4) , 
         SERIES                      VARCHAR2 (100) , 
         SERIES2                     VARCHAR2 (100) , 
         TRIM                        VARCHAR2 (100) , 
         TRIM2                       VARCHAR2 (100) , 
         VEHICLEDESCRIPTOR           VARCHAR2 (20) , 
         VEHICLETYPE                 VARCHAR2 (50) , 
         BODYCLASS                   VARCHAR2 (50) , 
         BODYCABTYPE                 VARCHAR2 (50) , 
         DRIVETYPE                   VARCHAR2 (50) , 
         BRAKESYSTEMTYPE             VARCHAR2 (50) , 
         BRAKESYSTEMDESC             VARCHAR2 (100) , 
         FUELINJECTIONTYPE           VARCHAR2 (100) , 
         FUELTYPEPRIMARY             VARCHAR2 (50) , 
         FUELTYPESECONDARY           VARCHAR2 (50) , 
         ENGINECONFIGURATION         VARCHAR2 (50) , 
         ENGINECYLINDERS             NUMBER (2) , 
         ENGINEHP                    NUMBER , 
         ENGINEKW                    NUMBER , 
         ENGINEMANUFACTURER          VARCHAR2 (100) , 
         ENGINEMODEL                 VARCHAR2 (100) , 
         DISPLACEMENTCC              NUMBER (6,1) , 
         DISPLACEMENTCI              NUMBER (8,5) , 
         DISPLACEMENTL               NUMBER (4,1) , 
         DOORS                       NUMBER (2) , 
         GVWR                        VARCHAR2 (100) , 
         PLANTCITY                   VARCHAR2 (50) , 
         PLANTSTATE                  VARCHAR2 (50) , 
         PLANTCOUNTRY                VARCHAR2 (50) , 
         PLANTCOMPANYNAME            VARCHAR2 (100) , 
         OTHERENGINEINFO             VARCHAR2 (255) , 
         ERRORTEXT                   VARCHAR2 (255) , 
         ERRORCODE                   NUMBER (2) , 
         AIRBAGSIDE                  VARCHAR2 (50) , 
         AIRBAGFRONT                 VARCHAR2 (50) , 
         AIRBAGCURTAIN               VARCHAR2 (50) , 
         TPMS                        VARCHAR2 (100) , 
         SEATBELTSALL                VARCHAR2 (50) , 
         ENTERTAINMENTSYSTEM         VARCHAR2 (100) , 
         CUSTOMMOTORCYCLETYPE        VARCHAR2 (100) , 
         TRAILERTYPE                 VARCHAR2 (100) , 
         TRAILERBODYTYPE             VARCHAR2 (100) , 
         TRANSMISSIONSTYLE           VARCHAR2 (100) , 
         TRANSMISSIONSPEEDS          VARCHAR2 (50) , 
         WINDOWS                     VARCHAR2 (50) , 
         WHEELS                      VARCHAR2 (50) , 
         WHEELBASESHORT              VARCHAR2 (50) , 
         WHEELBASELONG               VARCHAR2 (50) , 
         WHEELBASETYPE               VARCHAR2 (100) , 
         NOTE                        VARCHAR2 (255) , 
         ABS                         VARCHAR2 (10) , 
         ADAPTIVECRUISECONTROL       VARCHAR2 (50) , 
         ADAPTIVEHEADLIGHTS          VARCHAR2 (50) , 
         FORWARDCOLLISIONWARNING     VARCHAR2 (50) , 
         PEDESTRIANAUTOMATICEBRAKING VARCHAR2 (50) , 
         REARCROSSTRAFFICALERT       VARCHAR2 (50) , 
         REARAUTOMATICEBRAKING       VARCHAR2 (50) , 
         PARKASSIST                  VARCHAR2 (50) , 
         TRACTIONCONTROL             VARCHAR2 (50) , 
         ESC                         VARCHAR2 (50) , 
         DYNAMICBRAKESUPPORT         VARCHAR2 (50) , 
         LANEDEPARTUREWARNING        VARCHAR2 (50) , 
         LANEKEEPSYSTEM              VARCHAR2 (50) , 
         ACTIVESAFETYSYSNOTE         VARCHAR2 (255) , 
         ADDITIONALERRORTEXT         VARCHAR2 (255) 
        ) 
        TABLESPACE USERS 
        LOGGING 
    ;
    
    ALTER TABLE ORDSDEMO.VIN_TABLE 
        ADD PRIMARY KEY ( VIN ) 
        USING INDEX LOGGING ;

    Some background

Oh, please know that I used an LLM for this DDL too. I could have reduced the number of columns, since this particular vehicle has a lot of null or “n/a” fields. But other, future vehicles might actually have values for some of those columns, so they can stay for now. You might not like the stylistic choice of condensing the column names, but I think this is inconsequential; you’d probably transform them again, in your application layer, when you pull them down from the database. Also, I’m familiar with these names. I could have abbreviated them even further, but for readability, I think this is good (for future me, and others too).

    ORDS Auto-REST

The last thing I needed to do was Auto-REST enable my table. Which, if you’ve used the SQL Developer Web UI, you know how easy this is. Two clicks and you’re done. From there I have access to several HTTP methods; I only care about GET [all], GET [by id], and POST.

    Oh, and if you look at the URI that is provided, it’s the same URI that you saw in line 56 of my Python script. So that sort of brings us full circle.

If you recall, I used the VIN as a Primary Key, which means the VIN itself can now be used in the URI (instead of an arbitrary Row ID).
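For example, a GET by VIN looks something like this in Python (a quick sketch reusing the same local URL and “ordsdemo” schema alias from the script above; your host and alias will differ):

import requests

# Fetch a single row by its primary key (the VIN) from the ORDS
# Auto-REST endpoint. Host, port, and the "ordsdemo" alias are from
# my local setup; adjust for yours.
response = requests.get("http://localhost:8080/ords/ordsdemo/vin_table/1GCEK14V0XZ138493")
response.raise_for_status()

vehicle = response.json()
# ORDS returns the columns as lowercase JSON keys
print(vehicle["make"], vehicle["model"], vehicle["modelyear"])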

    Review/Checkpoint

    So, at this point, I have two target endpoints:

    1. One where I retrieve a vehicle’s details by its VIN and model year (I don’t own this one – this is the one provided by NHTSA).
2. Another (that I own) where I insert and retrieve vehicle records.

    I’ve created a VIN_TABLE table, with the VIN as the Primary Key, and I’ve Auto-REST enabled it with ORDS. I’ve then taken the endpoint for issuing a POST request and added it to the Python script (saved as vindecode.py). Now I can issue the following command to add a record to my table:

    python3 vindecode.py 1GCEK14V0XZ138493 1999

    And just so you know I’m being honest, I’ll drop everything from my target table, and run through this workflow again.

    Back in VS Code, I’ll issue the command, and receive the response from my database (via ORDS).

    And finally, I’ll review the results (I already did this in VS Code, but I’ll do it in the browser and SQL Worksheet too):

    Review

And that’s it! One of countless examples of how you might work through a pretty common pattern such as this (using LLMs too). The nice thing here is I don’t have to worry about any ORMs (never have, never will). You have to understand, my reality is such that ORMs do not exist. All I’ve ever known is ORDS: a bit of middleware that sends/receives HTTP/HTTPS requests on your behalf, without ever having to map a thing.

    And we’ve made it much easier to take ORDS for a test drive. We’ve been working behind the scenes to improve our Docker/Podman documentation; we just released the latest refresh to our docs here. Hopefully the docs are much more approachable for the container newbie.

    OH! And if you ever have any issues, open a ticket here. Our team now “owns” the ORDS content on that GitHub repository; so, we’ll be a lot more responsive too!

    That is all for now. If you haven’t downloaded the latest ORDS, check it out here. Bye 👋🏼.

    P.S. Yes, I realize this post’s image is of a GMT-400, and not a GMT-800. But it was a cool pic, and I really liked it. Credit to The Autopian.

    1. I’m including a link to the prompting I used for creating most of what you see here. I think in this new age of AI/LLMs it is important to understand prompting. Eventually, whether we like it or not, we’ll all need to become “prompt engineers.” And as you’ll see in that thread, you actually have to know what you are doing to be able to ask the right questions and to challenge an LLM’s assumptions.

      A lot of times, these things are just flat out wrong, and overly verbose too. As you’ll see, at one point the LLM added a mapping function in one of the iterations, which in my mind was completely unnecessary. Additionally, I wouldn’t have constructed this script as a pseudo-command line interface. BUT…I learned a new library (argparse), and I think I’ll explore that more in the future. ↩︎
  • Using Python Pandas to turn ISO Country Codes into a string to use as values for a SQL Query

    Using Python Pandas to turn ISO Country Codes into a string to use as values for a SQL Query

    Summary, code, resources

    Problem

While querying a table (based on this dataset) with SQL, you realize one of your columns uses 3-character ISO Country Codes. However, some of these 3-character codes aren’t countries at all; they represent geographical regions or groups of countries. How can you filter out rows so you are left with countries only?

    Answer

    Use the Python Pandas library to scrape ISO country codes and convert the values to one single string. Then use that string as values for a subsequent SQL query (possibly something like this):

    SELECT * FROM [your_table]
    WHERE country_code IN ([values from the generated list-as-string separated by commas and encased by single / double quotes]);

    Code

    # Libraries used in this code
    from bs4 import BeautifulSoup
    import requests
    import csv 
    import pandas as pd 
    
    # I found these ISO country codes on the below URL. Pandas makes it easy to read HTML and manipulate it. Very cool!
    iso_codes = pd.read_html("https://www.iban.com/country-codes")
    
# read_html returns a list of dataframes; I grab the first one (index 0).
    df = iso_codes[0]
    
# But really, all I care about is the 3-character country code. So I'll make that the df (dataframe) and strip out the index
    df = df['Alpha-3 code'].to_string(index=False)
    
    # From here, I'll save this little guy as a text file.
    with open("./countries.txt", "w") as f:
        f.write(df)
    
    # I'll set up a list. *** This was my approach, but if you find a better way, feel free to comment or adjust. ***
    my_list = []
    
    # Then I'll open that text file and read it in.
    file = open("./countries.txt", "r")
    countries = file.read()
    
    # I need to remove the "new line" identifiers, so I'm doing that here. 
    my_list = countries.split('\n')
    
    # Once I do that, I can create two new strings. I do this with f-Strings. Great article on using them here: https://realpython.com/python-f-strings/ 
    
    # I have two options here: one where the codes are contained by single quotes, the other with double quotes. Oracle Autonomous Database likes single quotes, but your DB may differ.
    
    countries_string_single_quotes = ','.join(f"'{x}'" for x in my_list)
    
    countries_string_double_quotes = ','.join(f'"{x}"' for x in my_list)
    
    # From here, I take those strings and save them in a text file. You don't have to do this; you can print and copy/paste the string. But this might be an excellent addition if you want to refer to these later without running all the code. 
    
    with open("./countries_as_list_single_quotes.txt", "a") as f:
        f.write(countries_string_single_quotes)
    
    with open("./countries_as_list_double_quotes.txt", "a") as f:
        f.write(countries_string_double_quotes)
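As an aside (and in the spirit of that “if you find a better way” comment above), I believe you can skip the temporary text file entirely by pulling the column straight into a Python list with .tolist(). A minimal sketch:

import pandas as pd

# A possible shortcut: grab the Alpha-3 column as a plain Python list,
# so there's no temporary text file (and no newline-splitting) needed.
iso_codes = pd.read_html("https://www.iban.com/country-codes")
codes = iso_codes[0]["Alpha-3 code"].tolist()

countries_string_single_quotes = ",".join(f"'{x}'" for x in codes)
print(countries_string_single_quotes)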

    GitHub repo details

    You can find the code from this post in my GitHub repository. The repository consists of the following:

    • The Python code I created for solving this problem
    • A countries.txt file, which is produced midway through the code (temporary placeholder for later processing)
    • ‘Single quotes’ .txt file – the 3-character ISO Country Codes are formatted as a string. The values are enclosed by single quotes; commas throughout
    • “Double quotes” .txt file – the 3-character ISO Country Codes are formatted as a string. The values are enclosed by double quotes; commas throughout


    The in-depth walkthrough

    Backstory

A few days ago, I stumbled upon a question in the r/datasets subreddit asking how you could filter out the non-countries in this dataset.

    I spent most of the morning figuring out how I would go about this, and after some trial and error, I devised a plan. I decided to take the list of ISO Country Codes (which I found here) and use them as values for filtering in a SQL statement (later on in Oracle SQL Developer Web).

    After some research, I figured out the proper SQL syntax for a successful query.

    SELECT * FROM [your_table]
    WHERE country_code IN ([values from the generated list-as-string separated by commas and encased by single / double quotes]);

From there, I knew I needed to work backward on those ISO Country Codes; I needed to take the table of codes on that page and turn it into something more workable. It turns out that grabbing this was pretty straightforward. I’m using Pandas primarily for this exercise, but first, I need to import some libraries:

    # Libraries used in this code
    from bs4 import BeautifulSoup
    import requests
    import csv 
    import pandas as pd 

    Next, I’ll use Pandas’ read_html function (this feels like cheating, but it’s incredible) to read in the table.

    # I found these ISO country codes on the below URL. Pandas makes it easy to read HTML and manipulate it. Very cool!
    iso_codes = pd.read_html("https://www.iban.com/country-codes")
    
# read_html returns a list of dataframes; I grab the first one (index 0).
    df = iso_codes[0]

    This is wild, but this is what the printout looks like:

The Pandas read_html() function is powerful.

If you squint, you can see an “Alpha-2 code” and an “Alpha-3 code” column in that printout. From here, I need to isolate the Alpha-3 code column. So I reshaped the data frame by making it a single column and dropping the index (this is optional; you could keep the index if you needed it, perhaps to create a separate table in your database).

# But really, all I care about is the 3-character country code. So I'll make that the df (dataframe) and strip out the index
    df = df['Alpha-3 code'].to_string(index=False)

    I’ll save this data frame as a .txt file.

    # From here, I'll save this little guy as a text file.
    with open("./countries.txt", "w") as f:
        f.write(df)

    This is only temporary (FYI: this is the only way I could figure out how to do this). It’ll look like this:

The temporary .txt file of 3-character ISO Country Codes.

Next, I take that temporary text file and read it in. I’m going to add the contents to a list, so I’ll first create an empty list (aptly named “my_list”). I also need to remove the newline characters; otherwise, when I create my string of values (that comes in the final step), the string will look like this:

The “countries” string with “\n” characters.

    I remove the newline characters with this piece of code:

    # I need to remove the "new line" identifiers, so I'm doing that here. 
    my_list = countries.split('\n')

The almost-finished string of values will look like this:

New line characters have now been removed.

I use f-strings to create the following two strings: countries_string_single_quotes and countries_string_double_quotes, respectively. Need to learn about f-strings (or, more formally, Literal String Interpolation)? No problemo! The Real Python article linked in the code comments is a great place to start.

The code for the f-strings is below. I loop through my_list and separate each x (the item I’m iterating over) with commas (that’s the join).

    # Once I do that, I can create two new strings. I do this with f-Strings. Great article on using them here: https://realpython.com/python-f-strings/ 
    
    # I have two options here: one where the codes are contained by single quotes, the other with double
    # quotes. Oracle Autonomous Database likes single quotes, but your DB may differ.
    
    countries_string_single_quotes = ','.join(f"'{x}'" for x in my_list)
    
    countries_string_double_quotes = ','.join(f'"{x}"' for x in my_list)

And now that I have these two strings (are they called objects??), I’ll save them each as a text file. One file has the 3-character codes surrounded by single quotes, the other with double quotes. The code:

    # From here, I take those strings and save them in a text file. You don't have to do this; you can print
    # and copy/paste the string. But this might be a nice addition if you want to refer to these later 
    # without running all the code. 
    
    with open("./countries_as_list_single_quotes.txt", "a") as f:
        f.write(countries_string_single_quotes)
    
    with open("./countries_as_list_double_quotes.txt", "a") as f:
        f.write(countries_string_double_quotes)

    The text files look like this now:

The country codes are now presented in one long string. Pretty cool, eh?

    SQL time

    We have arrived! Let me show you what I can do now!

I took the CSV data from the World Bank and loaded it into my Autonomous Database. Our returning intern Layla put together a video of how to do this; you can check it out here.

Once my table was created, I did a quick SELECT [columns] FROM [table] to see my “beginning state.”

    There are 266 entries; some are countries, and others are not. And if you recall, the original question asked how somebody could filter out the non-countries. Onto that next!

    This is the best part. I can take the string I made and use that in a SQL query such as this:

    SELECT * from ADMIN.REDDIT_TABLE
    WHERE COUNTRY_CODE IN('AFG','ALA','ALB','DZA','ASM','AND','AGO','AIA','ATA',
    'ATG','ARG','ARM','ABW','AUS','AUT','AZE','BHS','BHR','BGD','BRB','BLR','BEL',
    'BLZ','BEN','BMU','BTN','BOL','BES','BIH','BWA','BVT','BRA','IOT','BRN','BGR',
    'BFA','BDI','CPV','KHM','CMR','CAN','CYM','CAF','TCD','CHL','CHN','CXR','CCK',
    'COL','COM','COD','COG','COK','CRI','CIV','HRV','CUB','CUW','CYP','CZE','DNK',
    'DJI','DMA','DOM','ECU','EGY','SLV','GNQ','ERI','EST','SWZ','ETH','FLK','FRO',
    'FJI','FIN','FRA','GUF','PYF','ATF','GAB','GMB','GEO','DEU','GHA','GIB','GRC',
    'GRL','GRD','GLP','GUM','GTM','GGY','GIN','GNB','GUY','HTI','HMD','VAT','HND',
    'HKG','HUN','ISL','IND','IDN','IRN','IRQ','IRL','IMN','ISR','ITA','JAM','JPN',
    'JEY','JOR','KAZ','KEN','KIR','PRK','KOR','KWT','KGZ','LAO','LVA','LBN','LSO',
    'LBR','LBY','LIE','LTU','LUX','MAC','MKD','MDG','MWI','MYS','MDV','MLI','MLT',
    'MHL','MTQ','MRT','MUS','MYT','MEX','FSM','MDA','MCO','MNG','MNE','MSR','MAR',
    'MOZ','MMR','NAM','NRU','NPL','NLD','NCL','NZL','NIC','NER','NGA','NIU','NFK',
    'MNP','NOR','OMN','PAK','PLW','PSE','PAN','PNG','PRY','PER','PHL','PCN','POL',
    'PRT','PRI','QAT','REU','ROU','RUS','RWA','BLM','SHN','KNA','LCA','MAF','SPM',
    'VCT','WSM','SMR','STP','SAU','SEN','SRB','SYC','SLE','SGP','SXM','SVK','SVN',
    'SLB','SOM','ZAF','SGS','SSD','ESP','LKA','SDN','SUR','SJM','SWE','CHE','SYR',
    'TWN','TJK','TZA','THA','TLS','TGO','TKL','TON','TTO','TUN','TUR','TKM','TCA',
    'TUV','UGA','UKR','ARE','GBR','UMI','USA','URY','UZB','VUT','VEN','VNM','VGB',
    'VIR','WLF','ESH','YEM','ZMB','ZWE')
    ORDER BY COUNTRY_CODE ASC;

    Once I execute that SQL statement, I’m left with the countries from that list. I opened up the results in another window so you can see a sample.

    The end

    So yeah, that’s it! I don’t know if this was the best way to go about this, but it was fun. I’m curious (if you’ve made it this far), what do you think? How would you go about it? Let me know.


  • More fun with Medium story stats, JSON, Python, Pandas, and Oracle SQL Developer Web

    More fun with Medium story stats, JSON, Python, Pandas, and Oracle SQL Developer Web

    That’s right; I’m back again for yet another installment of this ongoing series dedicated to working with Medium.com story stats. I first introduced this topic in a previous post. Maybe you saw it. If not, you can find it here.

    Recap

    My end goal was to gather all story stats from my Medium account and place them into my Autonomous Database. I wanted to practice my SQL and see if I could derive insights from the data. Unfortunately, gathering said data is complicated.

    Pulling the data down was a breeze once I figured out where to look for these story statistics. I had to decipher what I was looking at in the Medium REST API (I suppose that was somewhat tricky). My search was mostly an exercise in patience (there was a lot of trial and error).

    I uploaded a quick video in the previous post. But I’ll embed it here so you can see the process for how I found the specific JSON payload.

    Obtaining the raw JSON

    Once I found that URL, I saved this JSON as a .json file. The images below show remnants of a JavaScript function captured with the rest of the JSON. I’m no JavaScript expert, so I can’t tell what this function does. But before I load this into my Autonomous Database (I’m using an OCI Free Tier account, you can check it out here if you are curious), it needs to go.


    README

    I am pointing out a few things that may seem convoluted and unnecessary here. Please take the time to read this section so you can better understand my madness. 
    
    FIRST: Yes, you can manually remove the [presumably] JavaScript saved along with the primary JSON payload (see above paragraphs). I'm showing how to do this in Python as a practical exercise. But I'm also leaving open the opportunity for future automation (as it pertains to cleaning data). 
    
    SECOND: When it comes to the Pandas data frame steps, of course, you could do all this in Excel, Numbers, or Sheets! Again, the idea here is to show you how I can clean and process this in Python. Sometimes doing things like this in Excel, Numbers, and Sheets is impossible (thinking about enterprise security here). 
    
    THIRD: Admittedly, the date-time conversion is hilarious and convoluted. Of course, I could do this in a spreadsheet application. That's not the point. I was showing the function practically and setting myself up for potential future automation. 
    
    FOURTH: I'll be the first to admit that the JSON > TXT > JSON > CSV file conversion is comical. So if you have any suggestions, leave a comment here or on my GitHub repository (I'll link below), and I'll attribute you!
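(And in the spirit of that FOURTH point, here’s a condensed sketch of the same flow that skips the JSON > TXT > JSON > CSV dance entirely. It assumes the same 16-character JavaScript prefix sits in front of the payload; the step-by-step version, which is what I actually ran, follows below.)

import json
import pandas as pd

# Read the raw download, slice off the 16-character prefix, parse the
# remainder as JSON, and let Pandas build the table directly.
with open("1668776608433.json", "r") as f:
    data = json.loads(f.read()[16:])["payload"]["value"]

df = pd.DataFrame(data)  # one row per stats entry, one column per key
df["timestampMs"] = pd.to_datetime(df["timestampMs"], unit="ms")
df.to_csv("medium_stats.csv", index=False)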

    The code

    Explaining the code in context, with embedded comments, will be most illuminating.

    I’ve named everything in the code as literally as possible. In production, this feels like it might be impractical; however, there is no question about what the hell the code is doing! Being more literal is ideal for debugging and code maintenance.

    Here is the entire code block (so CTRL+C/CTRL+V to your heart’s content 😘). I’ll still break this down into discrete sections and review them.

import csv
import json
import pandas as pd
import datetime
from pathlib import Path

# You'll first need to sign in to your account, then you can access this URL without issues:
# https://medium.com/@chrishoina/stats/total/1548525600000/1668776608433
# NOTES:
# Replace the "@chrishoina" with your username
# The two numbers you see are Unix Epochs; you can modify those as
# needed; in my case, I wanted to see the following:
# * 1548525600000 - At the time of this post, this seems to be
#   whenever your first post was published or when
#   you first created a Medium account. In this case, for me, this
#   was Sat, Jan/26/2019, 6:00:00PM - GMT
# * 1668776608433 - You shouldn't need to change this since it will
#   just default to the current date.

# For the conversion, I used an Epoch Converter tool I found online: https://www.epochconverter.com/

# Step 1 - Convert this to a (.txt) file
p = Path("/Users/choina/Documents/socialstats/1668776608433.json")
p.rename(p.with_suffix('.txt'))

# Step 2 - "read" in that text file, and remove those pesky
# characters/artifacts from position 0 through position 15.
# I'm only retaining the JSON payload from position 16 onward.

with open("/Users/choina/Documents/socialstats/1668776608433.txt", "r") as f:
    stats_in_text_file_format = f.read()

# This [16:] essentially means grabbing everything in this range. Since
# there is nothing after the colon, it will just default to the end (which is
# what I want in this case).

cleansed_stats_from_txt_file = stats_in_text_file_format[16:]

print(cleansed_stats_from_txt_file)

# This took me a day to figure out, but this text file needs to be encoded
# properly, so I can save it as a JSON file (which is about to happen). I
# always need to remember this: json.dumps = dump string, while
# json.dump = dump object. There is a difference; I'm not the expert,
# but the docs were helpful.

json.dumps(cleansed_stats_from_txt_file)

# Step 3 - Here, I create a new file, then indicate we will "w"rite to it. I take the
# progress from Step 2 and apply it here.

with open('medium_stats_ready_for_pandas.json', 'w') as f:
    f.write(cleansed_stats_from_txt_file)

# Step 4 - Onto Pandas! We've already imported the pandas library as "pd."
# We first create a data frame and name the columns. I kept the names
# very similar to avoid confusion. I feared that timestampMs might be a
# reserved word in Oracle DB or too close, so I renamed it.

df = pd.DataFrame(columns=['USERID', 'FLAGGEDSPAM', 'STATSDATE', 'UPVOTES', 'READS', 'VIEWS', 'CLAPS', 'SUBSCRIBERS'])

with open("/Users/choina/Documents/socialstats/medium_stats_ready_for_pandas.json", "r") as f:
    data = json.load(f)
    data = data['payload']['value']

print(data)

for i in range(0, len(data)):
    df.loc[i] = [data[i]['userId'], data[i]['flaggedSpam'], data[i]['timestampMs'], data[i]['upvotes'], data[i]['reads'], data[i]['views'], data[i]['claps'], data[i]['updateNotificationSubscribers']]

df['STATSDATE'] = pd.to_datetime(df['STATSDATE'], unit="ms")

print(df.columns)

# Step 5 - use Pandas' df.to_csv function and save the data frame as
# a CSV file

with open("medium_stats_ready_for_database_update.csv", "w") as f:
    df.to_csv(f, index=False, header=True)

I used several Python libraries for this script; you can see them all imported at the top of the code block.

    Step 1

p = Path("/Users/choina/Documents/socialstats/1668776608433.json")
p.rename(p.with_suffix('.txt'))

    Pathlib allows you to assign the file’s path to “p”. From there, I changed the .json file extension to a .txt extension.

    Note: Again, I'm sure there is a better way to do this, so if you're reading, leave a comment here or on my GitHub repository so I can attribute it to you 🙃. 

The before and after of this step: the same file goes from 1668776608433.json to 1668776608433.txt.

    With that out of the way, I needed to remove that JavaScript “prefix” in the file. I do this in Step 2 (I got so fancy that I probably reached diminishing returns). My approach works, and I can repurpose this for other applications too!

Step 2

    # Step 2 - "read" in that text file, and remove those pesky
    # characters/artifacts from position 0 through position 15. Or in other
    # words, you'll retain everything from position 16 onward because that's
    # where the actual JSON payload is.
    
    with open("/Users/choina/Documents/socialstats/1668776608433.txt", "r") as f:
    stats_in_text_file_format = f.read()
    
    # This [16:] essentially means grabbing everything in this range. Since
    # there is nothing after the colon; it will just default to the end (which is
    # what I want in this case).
    cleansed_stats_from_txt_file = stats_in_text_file_format[16:]
    
    print(cleansed_stats_from_txt_file)
    
    # This took me a day to figure out, but this text file needs to be
    # appropriately encoded to save as a JSON file (which is about to
    # happen). I always forget the difference between "dump" and "dumps";
    # json.dumps = dump string, whereas json.dump = dump object. There is
    # a difference, I'm not the expert, but the docs were helpful (you should
    # read them). 
    
    json.dumps(cleansed_stats_from_txt_file)
    
I needed to remove these remnants from the Medium JSON response.

    While this initially came through as a JSON payload, those first 0-15 characters had to go.

    FULL DISCLAIMER: I couldn't figure out how to get rid of this while it was still a JSON file hence why I converted this to a text file (this was the only way I could figure it out).

    I captured position 16 to infinity (or the end of the file, whichever occurs first), then I re-encoded the file as JSON (I interpreted this as “something the target machine can read and understand as JSON“).

    OPEN SEASON: CompSci folks, please roast me in the comments if I'm wrong. 

    Step 3

# Step 3 - I create a new file, then I'll "w"rite to it. I took the result from Step 2 and applied it here.

with open('medium_stats_ready_for_pandas.json', 'w') as f:
    f.write(cleansed_stats_from_txt_file)

    I’m still at the data-wrangling portion of this journey, but I’m getting close to the end. I’ll create a new JSON file, take the parts of the (freshly encoded) text file I need, and then save them as that new JSON file.

    Step 4

# Step 4 - Onto Pandas! We've already imported the pandas library as "pd"

# I first create a data frame and name the columns. I kept the names
# similar to avoid confusion. I feared that timestampMs might be a
# reserved word in Oracle DB or too close, so I renamed it.

df = pd.DataFrame(columns=['USERID', 'FLAGGEDSPAM', 'STATSDATE', 'UPVOTES', 'READS', 'VIEWS', 'CLAPS', 'SUBSCRIBERS'])

with open("/Users/choina/Documents/socialstats/medium_stats_ready_for_pandas.json", "r") as f:
    data = json.load(f)
    data = data['payload']['value']

print(data)

for i in range(0, len(data)):
    df.loc[i] = [data[i]['userId'], data[i]['flaggedSpam'], data[i]['timestampMs'], data[i]['upvotes'],
                 data[i]['reads'], data[i]['views'], data[i]['claps'], data[i]['updateNotificationSubscribers']]

df['STATSDATE'] = pd.to_datetime(df['STATSDATE'], unit="ms")

print(df.columns)

    I won’t teach Pandas (and honestly, you do NOT want me to be the one to teach you Pandas), but I’ll do my best to explain my process. I first created the structure of my data frame (“df” in this case). And then, I named all the column headers (these can be anything, but I kept them very close to the ones found in the original JSON payload).

    I then opened the newly-saved JSON file and extracted what I needed.

    NOTE: I got stuck here for about a day and a half, so let me explain this part. 

    The data['payload']['value'] refers to the key and value in this particular {key: value} pair. This approach allowed me to grab all the values of “value“. This image explains what I started with (on the left) and what I ended up with (on the right).

The before and after JSON payload.

    You’ll notice a {"success": true} key: value pair. With this method, I removed that pair and shed others at the end of the JSON payload.

Removing a great deal of trash.

I can’t take credit for organically coming up with this next part; Kidson on YouTube is my savior. I’d watch his video to understand entirely what is happening in this piece of code:

    for i in range(0, len(data)):
        df.loc[i] = [data[i]['userId'], data[i]['flaggedSpam'], data[i]['timestampMs'], data[i]['upvotes'],                                   
        data[i]['reads'], data[i]['views'], data[i]['claps'], data[i]['updateNotificationSubscribers']]

    In short, you take the values from the columns in the JSON file (above) and then put them into the column locations named in this piece of code:

    df = pd.DataFrame(columns=['USERID', 'FLAGGEDSPAM', 'STATSDATE', 'UPVOTES', 'READS', 'VIEWS', 'CLAPS', 'SUBSCRIBERS'])  

    For instance, the "userId" values in the JSON file will all go into the 'USERID' column in the Pandas data frame. And the same thing will happen for the other values and associated (Pandas data frame) columns.

Finally, I converted the date (which, if you recall, is still in Epoch-milliseconds format) to a more friendly, readable date, using this code:

    df['STATSDATE'] = pd.to_datetime(df['STATSDATE'], unit="ms")
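If you want a quick sanity check of that conversion, the first epoch from the comments at the top of the script decodes exactly as you’d expect:

import pandas as pd

# 1548525600000 is the first Unix epoch (in milliseconds) from the
# comments at the top of the script.
print(pd.to_datetime(1548525600000, unit="ms"))
# 2019-01-26 18:00:00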

    Step 5

    with open("medium_stats_ready_for_database_update.csv", "w") as f:
        df.to_csv(f, index=False, header=True)

    I’m at the home stretch now. I take everything I’ve done in Pandas and save it as a CSV file. I wanted to keep the headers but ditch any indexing. The clean CSV file will look like this:

Cleaned, tidy CSV ready for Data Load via SQL Developer Web.

    Step 6

    Lastly, I logged into SQL Developer Web and clicked the new Data Load button (introduced in Oracle REST Data Services version 22.3) to upload the CSV file into a new table. The Autonomous Database automatically infers column names and data types. I slightly modified the "statsdate" column (honestly, I could have left it alone, but it was easy enough to change).

    Before and After

And that’s it! Once uploaded, I can compare what I did previously to what I achieved most recently. And both ways are correct. Depending on your requirements, you can retain the JSON payload as a CLOB (my previous approach) or land it in a more traditional table format (this approach).

    Wrap up

    If you’ve made it this far, congrats! You should now have two ways to store Medium stats data in a table (that lives in the Oracle Autonomous Database) either as:

    • a CLOB
    • an OG table

    And if you’d like to review the code, you can find it here.

    And until next time 🫠, happy querying.

  • Clean up a .CSV file with Regular Expressions, Pandas, and Python

    Clean up a .CSV file with Regular Expressions, Pandas, and Python

    Let us begin

I log into Database Actions as my newly created “Python Developer” and navigate directly to the “Data Load” page (found under the Data Tools section of the Launchpad). I choose to “Load Data” from a “Local File.” I click “next,” click the pencil icon, and navigate to the “File” tab. I scroll to the “RESTAURANTOPENDATE” column and spot the problem.


    In a previous post (if you’re new, you can read the end-to-end process), I discussed how this time (“04:00:00+00” or “05:00:00+00”) wasn’t necessary for me. At that time, I used the “Find and Replace” function in Excel (I’m sure you can do the same with Numbers, Sheets, or Calc) to replace all occurrences of time with “” (i.e., nothing).

    But in the spirit of doing things in five days, when they could have taken but five minutes, I opted to see if I could achieve a similar result in python.

    Goal

    Create a python script that will allow you to remove part of a value in a .CSV file.

    WARNING: I thought this would be a simple task. I should have known better. My approach may not be ideal for your situation, but hopefully you’ll learn something. Or at the very least maybe you can bookmark this post, along with the resources (at the end of the post) I'm including for later use.

    Regular Expressions

I am confident there is a way to achieve this goal with the csv library in python; there is probably a way to do this with python out of the box. I couldn’t figure it out.

I’m also reasonably confident that my approach is on the verge of ridiculous. Nevertheless, Regular Expressions and the Pandas library in python are what worked for me.

    What are Regular Expressions?

    Good question. I still don’t know, but here is what I found on Wikipedia:

    “A regular expression (shortened as regex or regexp; also referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually, such patterns are used by string-searching algorithms for “find” or “find and replace” operations on strings, or for input validation. It is a technique developed in theoretical computer science and formal language theory.”

    That is so painful to read, but the way I understand it is that we use Regular Expressions in pattern-matching. Essentially you create a pattern and then tell your application (or script) to search for it. From there, you can include more code to perform more actions. In my case, those actions would be to find a pattern and replace that pattern with nothing.
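Here’s a toy example of that idea (find a pattern, replace it with nothing) before we get to my actual pattern:

import re

# Find every digit in the string and replace it with nothing.
print(re.sub(r'\d', '', 'abc123def456'))
# prints: abcdef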

    So what is the pattern?

    One of the benefits of having zero formal training in application development (in this case, computer science and formal language theory) is that occasionally, you might take an approach that, while unintuitive, works well enough.

And after many, many hours of trial and error, parsing through Stack Overflow, reviewing hours of YouTube, reading pages of blogs, and occasional use of the “I’m Feeling Lucky” button on Google, it occurred to me what my pattern was: a space, a time like “04:00:00”, a “+”, and a trailing offset, always at the end of the value. After reviewing more documentation and other various resources (I have an entire section at the end), I identified that pattern more formally.

    Coming up with a clearly-defined pattern helped inform me as to how I could create the following Regular Expression:

    '\s\d+:\d+:\d+[^a-c6]\d+$'

    I then did some more stuff, and that was it! Follow me for more…

    I’m kidding.

    Deep Dive

    I can’t teach you everything, because I’m probably only ahead of you by a week at this point. But I can explain, in the hopes that it will make sense. Let me continue with the above Regular Expression (again, you’ll want to spend some time in the resources section I’ve included to understand better how this all fits together).

But the above Regular Expression can be broken down like this: \s matches a single whitespace character; \d+ matches one or more digits; : matches a literal colon; [^a-c6] matches any single character that isn’t a, b, c, or 6 (here, the “+” before the offset); and $ anchors the match to the end of the value.
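To see it in action, here’s a tiny demo against a made-up value shaped like the ones in the RESTAURANTOPENDATE column (the date itself is hypothetical):

import re

# The same Regular Expression from above, applied to a sample value.
# The date portion is made up for illustration.
value = "2012-08-15 04:00:00+00"
print(re.sub(r'\s\d+:\d+:\d+[^a-c6]\d+$', '', value))
# prints: 2012-08-15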

I then took the above and plugged it into this short script (fewer than ten lines of actual code, slightly more with comments):

import pandas as pd
#You need this to do anything related to dataframes (i.e. "df");
# if you're a data scientist and you use python, you may already
# be familiar with this
import re
#You need this for Regular Expressions to work

d = '/Users/choina/Documents/untitled folder/Restaurants_in_Wake_County.csv'
#I'm assigning the file to "d"
df = pd.read_csv(d)
#df means dataframe. Here I'm telling pandas to read the .CSV file
# that I just assigned ("d"), and consider that the "dataframe".

print(df.dtypes)
#This is optional, but when I do this I can see what the various
# datatypes are (according to pandas). The RESTAURANTOPENDATE column
# is an 'object'.

df['RESTAURANTOPENDATE'] = df['RESTAURANTOPENDATE'].str.replace(r'\s\d+:\d+:\d+[^a-c6]\d+$', '', regex=True)
#This says: treat the values in the RESTAURANTOPENDATE column as
# strings, find the part of each value that matches this Regular
# Expression, replace it with NOTHING, then take that column, along
# with the other columns and...
df.to_csv('newrest.csv')
#Save it to a new .CSV file called "newrest.csv"


In short (I’ve left detailed comments for reference), these few lines of code search for a pattern and replace it with nothing when found. We then take the results (including the rest of the .CSV file) and save them to a new file called “newrest.csv”.

    Please, reference this script. Tweak it as needed. Or drop me a note if you need an outside perspective. Good luck!

One more thing

Pandas and re are the only libraries required for this script to work.

Note: An earlier version of this script had a mysterious "from pickle import NONE, TRUE" line at the very top. While I understand Pickle is a library that deals with the serialization of objects in python, I've no idea what that means, and reading about it makes my brain melt. My best guess (and it is a guess): my editor auto-added that import when I typed TRUE instead of python's built-in True. With regex=True, as above, pickle isn't needed at all. If a library has ever just appeared at the top of your script, please leave a comment, because I'm scared there may be an actual ghost in my machine.

    Moving on

    The rest is pretty straightforward.

I then went back into Database Actions (it had been so long that I’d expected a new version had been released already), loaded my data, and inspected the column in question. Looks good to me!

    Magic

Something I noticed: my Autonomous Database automatically recognized the “RESTAURANTOPENDATE” column as a data type of “Date.” I didn’t have to do anything special here, and I thought this was pretty cool.

    Addendum to the addendum

    Once imported, you can also use Regular Expressions in your database (with SQL). But if you’d like to clean your data before it reaches your ADB, the approach I’m writing about here is also a great option!

    Resources

    As promised, here are all the resources I’ve curated that might assist you should you choose to take the road less traveled…

• Stack Overflow – a series of “bookmarked” questions about Python and Regular Expressions that helped me. They might be of use. They aren’t intended to teach you, but rather to aid you in gaining a better perspective.
• YouTube – a playlist I created that might be worth reviewing (at a minimum, watch Case Digital’s “How to Remove Characters From a Pandas Dataframe In Python” video).
    • Regular Expression Editors:
      • Pythex (make sure you click the “Regular expression cheatsheet” button to expand/see a valuable quick reference – also, apparently a more complete version can be found here)
      • Regular Expressions 101 – this includes an enormous library of previously created Regular Expressions. There is a “debugger” too!

    An even simpler solution

I discovered an even simpler solution to this problem. As I was reviewing the Pandas documentation, I noticed this functionality is built directly into Pandas, via the “to_datetime” function.

And you can take it one step further with what is referred to as a “.dt accessor” (you’ll see this in the updated code I’m including). This accessor allows you to manipulate a Series (aka a column) within a Pandas dataframe (df).

There are quite a few time-date properties you can manipulate in Pandas, all of which can be found here. The “Time Zone Operations” section is where the “.dt accessor” is briefly mentioned. For more context, I’d recommend reviewing this section on the .dt accessor and how it interplays with a Pandas Series.

Don’t stop there, though: the main python documentation discusses the basics of date and time operations. Without further ado, on to the new code:

    import pandas as pd 
    import numpy as np 
    
    #You don't have to do it like this. You could have a single "d" and then just comment/uncomment each line as needed. 
    
    d1 = '/Users/choina/Downloads/Food_Inspections.csv'
    d2 = '/Users/choina/Downloads/Food_Inspection_Violations copy.csv'
    d3 = '/Users/choina/Downloads/Restaurants_in_Wake_County copy.csv'
    
    #The same thing here, you could just have a single "d" and then just comment out the other two, when not needed.
    
    df_1 = pd.read_csv(d1)
    df_2 = pd.read_csv(d2)
    df_3 = pd.read_csv(d3)
    
    #Same goes for this too. This is either very slick of me, or incredibly lazy. 
    
    df_1['DATE_'] = pd.to_datetime(df_1['DATE_']).dt.date
    df_2['INSPECTDATE'] = pd.to_datetime(df_2['INSPECTDATE']).dt.date
    df_3['RESTAURANTOPENDATE'] = pd.to_datetime(df_3['RESTAURANTOPENDATE']).dt.date
    
    #Same, same here as well. You could do one at a time. But it works, I double-checked my work. 
    
    df_1.to_csv('newninspect.csv')
    df_2.to_csv('newviolate.csv')
    df_3.to_csv('newrest.csv')
    The new Python script for changing the date format in the CSV files.

    About the update

    This code is certified fresh – meaning, it works. I triple-checked. The only thing I’ve noticed is that lines 12-14 must have “.dt.date” at the end. From what I am reading (and am inferring from the documentation), it looks as though you need to first “treat” the Series with the “to_datetime” function. After that, the entire dataframe is in limbo (not quite written out to a new .CSV), waiting. Before the next step, we can strip the time portion out using the .dt accessor (i.e., the “.dt.date” portion located at the end of lines 12-14).
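
    To make that concrete, here’s a tiny, hypothetical illustration (the sample values are made up, but shaped like the ones in these CSVs):

    import pandas as pd

    # Made-up sample values, shaped like the timestamps in these CSVs
    s = pd.Series(['2020-03-02T05:00:00+00:00', '2020-07-15T04:00:00+00:00'])

    converted = pd.to_datetime(s)    # full timestamps: date + time + offset
    just_dates = converted.dt.date   # the .dt accessor strips the time portion

    print(just_dates)
    # 0    2020-03-02
    # 1    2020-07-15
    # dtype: object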

    From there it is simply a matter of writing out these updated dataframes to the three new expectant .CSV files.

    Find Me

    Please come back for more. I’m going to be working on views in my Autonomous Database, and then later, I’ll REST-enable them with the help of ORDS!

  • Updating VARCHAR2 fields in Oracle Database Actions

    Updating VARCHAR2 fields in Oracle Database Actions

    Update

    I’ve run into more problems with this python/flask/ords/adb app idea of mine. Turns out I’m too [developer] dumb to do some of the things that I’d originally envisioned. Now that I’ve become more familiar with python, flask, the Oracle Autonomous Database, and our Database Tools (like ORDS and Database Actions), I’m realizing that I need to rethink my approach.

    First off, I’ve updated the (not trying to be pretentious here) User Journey for this project.

    SIDE NOTE: The User Journey is something that I really should pay more attention to (especially considering it was one of my concentrations in my MBA program, and it was a huge initiative in my previous role at Big Blue) and you should too!

    Ch-ch-ch-ch-changes

    I’m leaning towards (after a recent discussion with that jeff smith) more of a drop-down style of search. I’d like to present a user with all the available cities (having taken these from my “Restaurants” table) in Wake County.

    I think I can populate a list of cities to choose from with the help of REST APIs (via ORDS). But I need to do some clean-up first.
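
    I haven’t built this yet, but here’s a rough sketch of how that drop-down list might get populated with the requests library, assuming the RESTAURANTS table has been auto-REST enabled via ORDS. The hostname and schema alias are placeholders, I’m assuming the column comes back as lowercase "city", and the de-duplication happens client-side (since, again, I haven’t cleaned the data yet):

    import requests

    # Placeholder AutoREST endpoint for the RESTAURANTS table
    URL = 'https://<your-adb-host>/ords/admin/restaurants/'

    cities = set()
    offset = 0
    while True:
        response = requests.get(URL, params={'limit': 500, 'offset': offset})
        response.raise_for_status()
        page = response.json()
        cities.update(row['city'] for row in page['items'])
        if not page.get('hasMore'):
            break
        offset += page['limit']

    # A sorted, de-duplicated list of cities for the drop-down
    print(sorted(cities))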

    Side note

    In about a minute, you’ll probably come to the realization that my approach is less than elegant. Remember, I’m not a developer. I’m kind of stumbling through this, and I’ll need to make some sacrifices along the way (we call those “trade-offs” in business-speak). Unfortunately, there are simply some techniques/skills that I do not know yet. For a visual representation of these deficits, I draw your attention to:

    Me the “developer” versus an actual developer.

    Back to your regularly scheduled program

    On to clean-up. The cities in this table are a bit all over the place. Let’s take a look.

    From Database Actions, I can navigate to a SQL Worksheet. From there I can inspect my tables (Right-click > Edit), or jump straight into the SQL. Since I know my “CITY” column is the problem, I’ll run this:

    SELECT DISTINCT CITY
    FROM ADMIN.RESTAURANTS
    ORDER BY 1

    Some screenshots of what this looks like in Database Actions:

    NOTE: The "DISTINCT" keyword drops any duplicates, and "ORDER BY" returns results in ascending order.

    I suspect that the program used to input health inspections allows for some level of free text.

    Tangent / Story

    I’ve used an application similar to this in a previous life, but it was a Computer Aided Dispatch (CAD) application. You could enter free text, or select from users’ previous entries. There was a lot of flexibility, but because of that, there were many variations on things like street names, cities, locations, parking lots, person names, etc. For the user, awesome; for the developer or database administrator, less so.

    At this point, it is pretty clear how much normalization I’ll have to do to all these city names. Now I just need to figure out:

    • which case to use (Upper, Sentence, Lower, SaRCaSTic, etc.)
    • which city name variations to use

    I randomly chose a city and updated it with the following SQL command:

    UPDATE Restaurants SET CITY = 'Holly Springs'
    WHERE UPPER(CITY) = 'HOLLY SPRING'

    In this example, I searched for all instances of “Holly Spring” and replaced them with the correct “Holly Springs”. I can then repeat the process for all other cities. Again, the abridged walk-through of this is here.

    I actually know this data well, since I grew up in the area. So deciding on the correct city names shouldn’t be too challenging. But familiarizing yourself with your own data might not be such a bad idea. For instance, in this table, one of the cities displays as “Research Triangle Park”. I don’t think it’s common to refer to it like this; we usually just call it “RTP”. I think little details like that really help to elevate your application.

    Next steps

    I’ll keep this brief. What I think I’m going to have to do next is create a REST API based on this Restaurants table. I should have a better idea in a few days. But the idea is that once a user selects a city, the application will pass that city as a parameter to a GET request (again, courtesy of ORDS) and then return additional data. A user would then be able to further filter on that returned data, eventually ending on their target restaurant.
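
    Something like this, perhaps – a hypothetical sketch using the ORDS “q” filter parameter against the same auto-REST-enabled table (again, the hostname, schema alias, and column names are placeholders):

    import json
    import requests

    # Placeholder AutoREST endpoint for the RESTAURANTS table
    URL = 'https://<your-adb-host>/ords/admin/restaurants/'

    # The city a user picked from the drop-down
    chosen_city = 'Holly Springs'

    # ORDS accepts a FilterObject (JSON) in the "q" query parameter
    params = {'q': json.dumps({'city': chosen_city})}

    response = requests.get(URL, params=params)
    response.raise_for_status()

    for row in response.json()['items']:
        print(row.get('name'), '-', row.get('city'))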

    Okay, stay tuned…as I feel like I’m on the precipice of something exciting!

    Catch me if you can


    Abridged walk-through

    Selecting a SQL Worksheet in Database Actions.

    From Database Actions, navigate to the “SQL” feature. You’ll see this screen appear.

    NOTE: I had a previous SQL Worksheet saved, but you can enter SQL statements directly into the default worksheet. 
    The CITY column before updating.

    *If you expand, you can see the statement that I used.

    SELECT DISTINCT CITY
    FROM ADMIN.RESTAURANTS
    ORDER BY 1

    This will produce a list of all the cities in their current form. Notice the four different variations of “Fuquay-Varina”.

    Updating fields in the CITY column.

    I’ll comment out lines 1-3 and enter the following SQL statement to update Fuquay-Varina:

    UPDATE Admin.Restaurants SET CITY = 'Fuquay-Varina'
    WHERE UPPER(CITY) = 'FUQUAY VARINA'

    As a reminder, in the SELECT statement the “DISTINCT” keyword drops any duplicates in the results, and “ORDER BY” returns them in ascending order.

    You’ll see that 77 rows have been updated.

    The results after “Fuquay-Varina” has been updated.

    I’ll un-comment lines 1-3, and comment lines 5-6.

    Based on the results, we’ve reduced the variations of “Fuquay-Varina”. I’ll continue this work until I have a smaller set of more consistent cities.

  • Table Prep: Data loads and time zones

    Table Prep: Data loads and time zones

    Next episode

    My Autonomous Database in Oracle Cloud Infrastructure

    With my Autonomous Database up and running, I needed to find some data to upload. After some quiet reflection, I decided upon three datasets that should be plenty fun.

    A bit of context: I currently live in Wake County, North Carolina. And it turns out that we’ve been very progressive with providing open, accessible data to our citizens. I really wanted to do something with food, so I was surprised when I found the following three datasets (among the nearly 300 available):

    1. Restaurants in Wake County
    2. Restaurant Food Inspections/Grades
    3. Restaurant Food Inspection Violation History
    Restaurants in Wake County, North Carolina (where I live).

    Always read the docs

    Upon reading the documentation, I discovered that these datasets could all be connected through their HSISID* Permit ID fields, which should come in handy later, as it could make for some very interesting (if not amusing) revelations.

    *I’m not sure of the acronym, but the documentation defines the HSISID as a, “State code identifying the restaurant (also the primary key to identify the restaurant).”
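
    For example, joining the downloaded CSVs on HSISID might look something like this with pandas; the file names and join type here are just assumptions for the sake of a sketch:

    import pandas as pd

    # The Wake County exports, joined on their shared permit ID (HSISID)
    restaurants = pd.read_csv('Restaurants_in_Wake_County.csv')
    inspections = pd.read_csv('Food_Inspections.csv')

    combined = restaurants.merge(inspections, on='HSISID', how='inner')
    print(combined.head())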

    To the data loading

    I’m sticking with the Oracle Database Actions* web-based interface for now.

    *Oracle Database Actions was previously known as Oracle SQL Developer Web.
    The Database Actions Launchpad offers several actions for users, including development, data loading, monitoring, and administration.

    And from what I learned in the Autonomous Database workshop, there are two ways to load data into your database (there may be other methods; mea culpa if so):

    1. SQL Worksheet (found in the Development section of your Launchpad)
    2. Data Load Tool (found in the Data Tools section)

    I opted for the SQL Worksheet method, since I need to practice and learn it regardless.

    Obstacles

    After choosing my method of data loading, I encountered some obstacles. With no regard for best practices, I proceeded to load all three datasets into my Autonomous Database only to discover two glaring issues.

    First

    Unfortunately, in my restaurant “violations” table, I encountered two rows that failed to load.

    ORA-12899 error in Oracle Autonomous Database on Oracle Cloud Infrastructure.

    These values were loaded as a VARCHAR2 datatype, which, by default, allows a maximum size of 4,000 bytes. Since these rows exceeded 4,000 bytes (in that specific column), they failed. Fortunately, it was easy to increase the column to 10,000 bytes.

    While you can increase it to 32,767 bytes, that is overboard. A database’s maximum VARCHAR2 string size can be set to ‘STANDARD’ or ‘EXTENDED’ (i.e., 4,000 bytes vs. 32,767 bytes). I’m assuming the Autonomous Database is set to EXTENDED by default, since I was able to increase this column and re-upload with zero failed rows. You can read up on VARCHAR2 data types here.

    Second

    The second obstacle I ran into took me about a day (on and off) to figure out. The dates in all three of these tables were constructed like this:

    Adjusting for the TIMESTAMPS in the Oracle Autonomous Database.

    At first glance, I thought I was looking at a time zone offset. Dates were either 04:00:00+00 hrs or 05:00:00+00 hrs, possibly due to daylight saving time. This seemed like a reasonable assumption, since the early-November through mid-March dates were an hour later than the mid-March through early-November dates (i.e., UTC-5 in winter, UTC-4 the rest of the year).

    My first thought was to modify the affected columns with a data type like this:

    TIMESTAMP [(fractional_seconds_precision)] WITH TIME ZONE

    or with…

    TIMESTAMP '1997-01-31 09:26:56.66 +02:00'

    But I’d never seen this date/time format before (the ‘esriFieldTypeDate’), so I had to investigate before I did anything (also, I’ve no idea how to update a column with a data type like this, so it would have added additional time).

    Turns out this data originates from ArcGIS Online, and this field data type maps directly to Oracle’s ‘TIMESTAMP’ data type, as seen here in the documentation:

    In the ArcGIS documentation, the ‘esriFieldTypeDate’ appears to map to Oracle’s ‘TIMESTAMP’ datatype.

    I later tinkered with an API Explorer on the Wake County datasets site and reviewed a sample JSON response for this data set. The response also confirmed the field type:

    Wake County has an API Explorer on its datasets site, where you can run a sample query to review the JSON that is returned.

    It’s funny because, after all this hunting, it looks like I won’t be able to include time zones anyway. It’s somewhat inconsequential, as I’m only interested in Wake County, but for some that could be unnerving.

    I wish I could say that I came up with a cool Python script to automate the cleaning of these columns (I would have had to do this for all three of my .CSV files), but I did not. It would have taken me an afternoon or longer to work through, since I haven’t toyed with the csv reader (in the Python standard library), Pandas, or NumPy in quite some time. In the end (and it pains me to say this, embarrassingly, really), I used the ‘Find + Replace’ feature in Numbers to remove the time/time zone portion of the dates.

    I’m pathetic, I know. But the idea of creating a dedicated Python script does seem like a fun weekend project. Who knows, maybe I’ll post it to GitHub. Regardless, my Autonomous Database is clean and primed for ORDS.
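
    If I ever do write that script, a bare-bones version with just the standard library’s csv module might look something like this. The file names, column name, and timestamp pattern are all assumptions for illustration:

    import csv

    # Assumes a column named RESTAURANTOPENDATE with values like
    # "2011-10-18 04:00:00+00"
    with open('Restaurants_in_Wake_County.csv', newline='') as src, \
         open('restaurants_dates_fixed.csv', 'w', newline='') as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Keep only the date portion (everything before the first space)
            row['RESTAURANTOPENDATE'] = row['RESTAURANTOPENDATE'].split(' ')[0]
            writer.writerow(row)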

    In closing

    My advice is as such:

    Always inspect your data before you do anything major, always make sure you understand the inputs and the resultant outputs of the data loading process, and always refer to the documentation FIRST so you can understand how data is being recorded, coded, displayed, etc.

    Chris Hoina, 2022

    Oh, and it wouldn’t hurt to think about how you might use this data in the future. I’m guessing that could help you avoid a lot of rework (i.e. technical debt).

    Take me as an example. Since I’m focused on ORDS and REST enabling my database, I may want to consider things like:

    • What am I going to expose (via REST APIs)?
    • What fields might the presentation layer of my application include?

    In my case, a lot of this data will lie dormant, as I might choose to focus on only a few key fields. Of course, anything I create will be simple and basic. But that might not be the case for you.

    Koala RESTing, Photo by Jordan Whitt on Unsplash

    I’m off to REST…enable this database now, but in the meantime, if you want to learn more about ORDS, check out:

    Find me

    And as always, catch me on: