Data Cleaning

Cleaning Occurrences

Introduction

One of the first steps in creating species distribution models, let alone multi-species analyses, is acquiring and preparing specimen occurrence records. There are multiple methods for acquiring raw specimen records, such as aggregator downloads or API calls. Once you have the raw data, you need to assemble your dataset, which involves converting records to a common format, grouping them, and cleaning them. The lmpy library provides tools for these aggregation and cleaning steps, which greatly simplifies the process.

Occurrence Data Wrangler Configuration

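You can either use the Data Wrangler Factory or instantiate the occurrence data wrangler classes directly. This example uses the factory with the configuration below, which defines three wranglers: a decimal precision filter, a bounding box filter, and a unique localities filter.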

[
    # Decimal precision
    dict(
        wrangler_type='DecimalPrecisionFilter',
        decimal_places=4
    ),
    # Bounding box
    dict(
        wrangler_type='BoundingBoxFilter',
        min_x=0.0,
        min_y=-90.0,
        max_x=180.0,
        max_y=0.0
    ),
    # Unique localities
    dict(wrangler_type='UniqueLocalitiesFilter')
]
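
The console script example reads this configuration from a JSON file, but the listing above uses Python syntax (dict(...) calls and # comments), which is not valid JSON as written. One way to produce the file is to serialize the same list with the standard json module. This is a minimal sketch: the function names config_to_json and write_wrangler_config are hypothetical, and it assumes the JSON file simply mirrors the list of dictionaries shown above.

```python
import json

# Same configuration as above, written as plain dictionaries
wrangler_config = [
    {'wrangler_type': 'DecimalPrecisionFilter', 'decimal_places': 4},
    {
        'wrangler_type': 'BoundingBoxFilter',
        'min_x': 0.0,
        'min_y': -90.0,
        'max_x': 180.0,
        'max_y': 0.0,
    },
    {'wrangler_type': 'UniqueLocalitiesFilter'},
]


def config_to_json(config):
    """Serialize a wrangler configuration list to a JSON string."""
    return json.dumps(config, indent=4)


def write_wrangler_config(config, filename):
    """Write a wrangler configuration list to a JSON file (hypothetical helper)."""
    with open(filename, mode='wt') as out_file:
        out_file.write(config_to_json(config))
```

Calling write_wrangler_config(wrangler_config, './occurrence_wrangler_config.json') would create the file at the path used by the console script example.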

Example - Console Script

For this example, we will use the raw occurrence data found in the lmpy sample data directory at occurrence/Crocodylus porosus.csv. Save the example wrangler configuration as JSON at ./occurrence_wrangler_config.json. The cleaned data will be written to ./clean_data.csv. Note that the input filename contains a space, so it must be quoted on the command line.

$ wrangle_occurrences "./lmpy/sample_data/occurrence/Crocodylus porosus.csv" \
    ./clean_data.csv \
    ./occurrence_wrangler_config.json
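
After the script finishes, one quick sanity check is to compare the number of records in the raw and cleaned files. The sketch below uses only the standard csv module. The function names are hypothetical, and it assumes that both files begin with a header row.

```python
import csv


def count_records(lines):
    """Count data rows in CSV text, assuming the first row is a header."""
    return sum(1 for _ in csv.DictReader(lines))


def count_records_in_file(filename):
    """Open a CSV file and count its data rows."""
    with open(filename, newline='') as csv_file:
        return count_records(csv_file)
```

For example, count_records_in_file('./clean_data.csv') should not exceed the count for the raw file, since the three wranglers in the example configuration are all filters.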

Example - Python

For this example, we will use the same raw occurrence data, found in the lmpy sample data directory at occurrence/Crocodylus porosus.csv, and the example wrangler configuration defined directly in Python. The cleaned data will be written to ./clean_data.csv.

from lmpy.data_wrangling.factory import WranglerFactory
from lmpy.point import PointCsvReader, PointCsvWriter

raw_occurrences_filename = './lmpy/sample_data/occurrence/Crocodylus porosus.csv'
clean_occurrences_filename = './clean_data.csv'

wrangler_config = [
    # Decimal precision
    dict(
        wrangler_type='DecimalPrecisionFilter',
        decimal_places=4
    ),
    # Bounding box
    dict(
        wrangler_type='BoundingBoxFilter',
        min_x=0.0,
        min_y=-90.0,
        max_x=180.0,
        max_y=0.0
    ),
    # Unique localities
    dict(wrangler_type='UniqueLocalitiesFilter')
]

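# Build the wranglers described by the configuration
factory = WranglerFactory()
wranglers = factory.get_wranglers(wrangler_config)
# Read points using the 'species_name', 'x', and 'y' columns of the raw CSV
with PointCsvReader(
    raw_occurrences_filename,
    'species_name',
    'x',
    'y'
) as reader:
    with PointCsvWriter(
        clean_occurrences_filename, ['species_name', 'x', 'y']
    ) as writer:
        # The reader yields points in groups
        for points in reader:
            # Each wrangler receives the output of the previous one
            for wrangler in wranglers:
                points = wrangler.wrangle_points(points)
            # Skip groups that have no points left after cleaning
            if len(points) > 0:
                writer.write_points(points)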
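
The loop above passes each group of points through every wrangler in turn and writes only the groups that still contain points. The sketch below reproduces that pattern without lmpy so the control flow can be run in isolation. Everything in it is a hypothetical stand-in: the Point tuple, the two filter classes, and the coordinates are invented for illustration and are not lmpy's implementation. In particular, it assumes that the bounding box is inclusive and that unique localities are tracked per species.

```python
from typing import NamedTuple


class Point(NamedTuple):
    """Hypothetical stand-in for a point record."""
    species_name: str
    x: float
    y: float


class BoundingBoxStandIn:
    """Keep points inside an inclusive bounding box (illustrative only)."""

    def __init__(self, min_x, min_y, max_x, max_y):
        self.bounds = (min_x, min_y, max_x, max_y)

    def wrangle_points(self, points):
        min_x, min_y, max_x, max_y = self.bounds
        return [
            pt for pt in points
            if min_x <= pt.x <= max_x and min_y <= pt.y <= max_y
        ]


class UniqueLocalitiesStandIn:
    """Keep the first point seen at each species/location (illustrative only)."""

    def __init__(self):
        self.seen = set()

    def wrangle_points(self, points):
        kept = []
        for pt in points:
            if pt not in self.seen:
                self.seen.add(pt)
                kept.append(pt)
        return kept


def apply_wranglers(groups, wranglers):
    """Run every group through the wrangler chain and drop empty groups."""
    cleaned = []
    for points in groups:
        for wrangler in wranglers:
            points = wrangler.wrangle_points(points)
        if len(points) > 0:
            cleaned.append(points)
    return cleaned


# Made-up coordinates, chosen only to exercise each filter
groups = [
    [
        Point('Crocodylus porosus', 130.8, -12.4),
        Point('Crocodylus porosus', 130.8, -12.4),  # duplicate locality
        Point('Crocodylus porosus', 80.2, 13.1),    # north of the box
    ],
    [Point('Crocodylus porosus', -60.0, 10.0)],     # entirely outside the box
]
wranglers = [
    BoundingBoxStandIn(0.0, -90.0, 180.0, 0.0),
    UniqueLocalitiesStandIn(),
]
cleaned = apply_wranglers(groups, wranglers)
print(cleaned)
```

Only one point survives here: the duplicate and the out-of-bounds point are removed from the first group, and the second group is dropped because it ends up empty.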