Cleaning data is your friend

In a perfect world, a geospatial project would be provided with data that requires no additional cleaning. You could start mapping or executing an analysis from the get-go with the initial data, saving a lot of time, energy, and brain power.

That scenario is rarely the case, as most GIS practitioners know. For a project to succeed, it is critical to ensure that the data contains no strange characters or spelling errors, that the correct projection is being used, and that there are no redundancies.
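
As a rough illustration of the first of those checks, the sketch below scans a list of attribute values for odd characters and for names that differ only by case or spacing. It is plain Python and not taken from the project script. The function names `has_odd_characters` and `find_suspect_values` and the sample dam names are invented for the example, and the definition of "odd" (control characters or the Unicode replacement character) is only an assumption about what a bad import tends to leave behind.

```python
from collections import defaultdict

def has_odd_characters(value):
    # Flag control characters and U+FFFD, which often signals an encoding problem
    return any(ch == "\ufffd" or not ch.isprintable() for ch in value)

def find_suspect_values(values):
    """Return (values with odd characters, groups of near-duplicate values)."""
    odd = [v for v in values if has_odd_characters(v)]

    # Group values by a normalized key: lower case with collapsed whitespace
    groups = defaultdict(list)
    for v in values:
        groups[" ".join(v.lower().split())].append(v)

    # A group with more than one entry points to a redundant or inconsistent name
    dupes = [g for g in groups.values() if len(g) > 1]
    return odd, dupes

# Invented sample names, for illustration only
names = ["Mill Lake Dam", "mill lake  dam", "Sandy Brook Dam\ufffd", "Gold River Dam"]
odd, dupes = find_suspect_values(names)
print(odd)
print(dupes)
```

Spelling errors are harder to catch automatically, so a check like this would only narrow down which records deserve a manual look.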

I experienced the importance of clean data first-hand while working on my capstone project, and I documented the cleaning process in a Python script (as referenced in my previous blog post). My project focused on identifying dam locations throughout Nova Scotia that could potentially impact the migration of fish species that move from saltwater to freshwater systems. When I embarked on this project, I anticipated a week, tops, to prepare the data for analysis and visualization. That stretched to about 3 weeks of my 5-week timeline.

The data came in a mix of older and up-to-date coordinate systems, so layers were projected into a single coordinate system, NAD83 CSRS UTM Zone 20N, where necessary. The spatial extent of this project initially covered the whole province of Nova Scotia, which is why this coordinate system was used instead of one for a smaller region. I also received a few different data sources that needed to be combined into one finalized layer. This part of the process required an append, and each data source first had to have the same field names, which is handled in the code below. Finally, as mentioned previously, it was important to identify redundancies so that each dam location would have the most up-to-date data assigned to it. This quality assurance step was completed with Select By Location, to check whether previously pinpointed dams had been updated to a new location.

import arcpy

# Field names that should not be copied to the target feature
SKIP_FIELDS = {"Watercourse", "OBJECTID", "Shape"}

# Map the types reported by arcpy.ListFields to the keywords AddField expects
FIELD_TYPES = {
    "String": "TEXT",
    "SmallInteger": "SHORT",
    "Integer": "LONG",
    "Double": "DOUBLE",
}

# Define the addFields function before append:
#   - f_add: feature with fields to be added
#   - f_get: feature that will get the fields
def addFields(f_add, f_get):

    # - For each field in the feature with fields to be added:
    for field in arcpy.ListFields(f_add):

        # - Provide the information (name, data type, and length)
        arcpy.AddMessage("{} has {} data type and length of {}.".format(
            field.name, field.type, field.length))

        # - Skip excluded field names and data types that are not handled
        if field.name in SKIP_FIELDS or field.type not in FIELD_TYPES:
            continue

        # - Add the field from f_add to f_get using the mapped data type
        arcpy.AddField_management(f_get, field.name, FIELD_TYPES[field.type],
                                  field_length=field.length)

    # - Notify the user
    arcpy.AddMessage("Fields have been added.")
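
The redundancy check mentioned earlier was done with Select By Location in ArcGIS. The idea behind it can be sketched in plain Python: flag any newly received dam that falls within a search distance of a dam already in the layer. This is only an illustration and not part of the project script. The function name `find_possible_duplicates`, the 50 m tolerance, and the coordinates are all hypothetical, and the sketch assumes both sets of points are already in the same projected coordinate system with metre units.

```python
import math

def find_possible_duplicates(existing, incoming, tolerance_m=50.0):
    """Pair each incoming point with existing points within tolerance_m.

    Points are (id, x, y) tuples in a projected coordinate system (metres).
    """
    matches = []
    for new_id, nx, ny in incoming:
        for old_id, ox, oy in existing:
            distance = math.hypot(nx - ox, ny - oy)
            if distance <= tolerance_m:
                matches.append((new_id, old_id, round(distance, 1)))
    return matches

# Invented coordinates, for illustration only
existing_dams = [("A1", 450000.0, 4950000.0), ("A2", 452500.0, 4951200.0)]
incoming_dams = [("B1", 450018.0, 4950024.0), ("B2", 470000.0, 4960000.0)]

print(find_possible_duplicates(existing_dams, incoming_dams))
```

Each match still needs a human decision about which record holds the most up-to-date location, which is why this stayed a quality assurance step rather than an automatic delete.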

Beyond finalizing the dams data, connecting the dams to a watercourse network was critical before analyzing how much habitat was available to fish species beyond the dams. This step was completed by running the Near tool with the Location parameter enabled, and then using Make XY Event Layer to create point features from the location coordinates in the Near results.

The Python code for these steps can be viewed below.

# Define the connectDams function to connect dams to watercourses:
#   - f: dams feature
#   - w: watercourses
def connectDams(f, w):
    
    # - Setup Near tool fields
    near_fields = "NEAR_FID NEAR_FID;NEAR_DIST NEAR_DIST;NEAR_X NEAR_X;NEAR_Y NEAR_Y"

    # - Apply the Near tool and include the Location argument
    arcpy.analysis.Near(f, w, None, "LOCATION", "NO_ANGLE", "PLANAR", near_fields)

    # - Notify user that near locations were found
    arcpy.AddMessage("Near locations were found for dams to watercourses.")

    # - Set the output layer 
    xy_event = "XYEvent"

    # - Make a new layer using X and Y locations from the Near locations
    arcpy.management.MakeXYEventLayer(f, "NEAR_X", "NEAR_Y", xy_event, "PROJCS['NAD_1983_CSRS_UTM_Zone_20N',GEOGCS['GCS_North_American_1983_CSRS',DATUM['D_North_American_1983_CSRS',SPHEROID['GRS_1980',6378137.0,298.257222101]],PRIMEM['Greenwich',0.0],UNIT['Degree',0.0174532925199433]],PROJECTION['Transverse_Mercator'],PARAMETER['False_Easting',500000.0],PARAMETER['False_Northing',0.0],PARAMETER['Central_Meridian',-63.0],PARAMETER['Scale_Factor',0.9996],PARAMETER['Latitude_Of_Origin',0.0],UNIT['Meter',1.0]];-5120900 -9998100 10000;-100000 10000;-100000 10000;0.001;0.001;0.001;IsHighPrecision", None)
    
    # - Set the output feature
    dams_water = f + "XYEvent"

    # - Copy the layer into a feature
    arcpy.CopyFeatures_management(xy_event, dams_water)

    # - Notify the user that dams are connected to watercourses
    arcpy.AddMessage("XY Event output: " + dams_water)

    # - Return the output feature
    return dams_water
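
To make it clearer what the Near tool's location output represents, here is a small plain-Python sketch of the underlying geometry: finding the closest location on a line to a given point, which corresponds to the NEAR_X, NEAR_Y, and NEAR_DIST values. It is not how arcpy implements the tool. The function names and coordinates are made up, and the sketch assumes planar coordinates and a watercourse stored as a simple list of vertices.

```python
import math

def closest_point_on_segment(p, a, b):
    """Return the point on segment a-b that is closest to p (planar)."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return a  # a and b are the same vertex
    # Project p onto the segment and clamp to its end points
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return (ax + t * dx, ay + t * dy)

def snap_to_line(p, vertices):
    """Return (x, y, distance) of the closest location on a polyline to p."""
    best = None
    for a, b in zip(vertices, vertices[1:]):
        x, y = closest_point_on_segment(p, a, b)
        d = math.hypot(p[0] - x, p[1] - y)
        if best is None or d < best[2]:
            best = (x, y, d)
    return best

# Invented geometry, for illustration only
watercourse = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
dam = (4.0, 3.0)
print(snap_to_line(dam, watercourse))
```

The first two values play the role of the snapped coordinates that Make XY Event Layer turns into point features in the function above.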

Although the steps discussed above are only examples of the kinds of actions that may be needed to clean data before an analysis, they are important considerations for most dataset processing. Without cleaning data, one can run into messy errors that lead to inefficiencies and wasted time. It may be a painful process to go through, but it will be your friend in the end.

To learn more about my project, you can check out my StoryMap here.