============= Data Cleaning ============= Cleaning Occurrences ==================== Introduction ------------ One of the first steps in creating :term:`species distribution models`, let alone multi-species analyses, is acquiring and preparing specimen occurrence records. There are multiple method for acquiring these raw specimen records such as aggregator downloads or API calls but once you have the raw data, you need to assemble your entire dataset, which involves converting records to a common format, grouping, and cleaning. The lmpy library provides tools for performing these aggregation and cleaning steps to greatly simplify the process for the user. Occurrence Data Wrangler Configuration -------------------------------------- You can either use the Data `Wrangler Factory <../autoapi/lmpy/data_wrangling/factory/index.html#lmpy.data_wrangling.factory.WranglerFactory>`_ or instantiate occurrence data wrangler classes directly. We will use the factory for this example with the configuration below. .. code-block:: json [ # Decimal precision dict( wrangler_type='DecimalPrecisionFilter', decimal_places=4 ), # Bounding box dict( wrangler_type='BoundingBoxFilter', min_x=0.0, min_y=-90.0, max_x=180.0, max_y=0.0 ), # Unique localities dict(wrangler_type='UniqueLocalitiesFilter') ] Example - Console Script ------------------------ For this example, we will use the raw occurrence data found in the sample data directory in lmpy at `occurrence/Crocodylus porosus.csv` and the example wrangler configuration should be written to `./occurrence_wrangler_config.json`. The cleaned data will be written to `./clean_data.csv`. .. code-block:: bash $ wrangle_occurrences "./lmpy/sample_data/occurrence/Crocodylus porosus.csv" \ ./clean_data.csv \ ./occurrence_wrangler_config.json Example - Python ------------------------ For this example, we will use the raw occurrence data found in the sample data directory in lmpy at `occurrence/Crocodylus porosus.csv` and the example wrangler configuration. The cleaned data will be written to `./clean_data.csv`. .. code-block:: python from lmpy.data_wrangling.factory import WranglerFactory from lmpy.point import PointCsvReader, PointCsvWriter raw_occurrences_filename = './lmpy/sample_data/occurrence/Crocodylus porosus.csv' clean_occurrences_filename = './clean_data.csv' wrangler_config = [ # Decimal precision dict( wrangler_type='DecimalPrecisionFilter', decimal_places=4 ), # Bounding box dict( wrangler_type='BoundingBoxFilter', min_x=0.0, min_y=-90.0, max_x=180.0, max_y=0.0 ), # Unique localities dict(wrangler_type='UniqueLocalitiesFilter') ] factory = WranglerFactory() wranglers = factory.get_wranglers(wrangler_config) with PointCsvReader( raw_occurrences_filename, 'species_name', 'x', 'y' ) as reader: with PointCsvWriter( clean_occurrences_filename, ['species_name', 'x', 'y'] ) as writer: for points in reader: for wrangler in wranglers: points = wrangler.wrangle_points(points) if len(points) > 0: writer.write_points(points)