Data Flow

Thinking about data flow is something I try to do as infrequently as possible, but in this day and age it's unavoidable. My largest issue with it is that data flow and the "better practices" of the day depend not only on the ever-shifting nature of the technology and tools at your disposal, but also on the hat you are wearing for the day, the task you are trying to accomplish, and the resource constraints of your environment. That makes it exceptionally hard to take a firm stance on the topic. That being said, some of the firmest ground I've found thus far is to think about data flow in terms of "where that data lives".

Where data can live

The answer to this question is really a semantic distinction, because all 'real data' is stored on a hard drive (I'm excluding archived data, which may live in an archival format, and data that is not digital, e.g. hard-copy records). Still, I occasionally use terms for data access patterns that can be confusing, so I'll elaborate here:

Local data: Local data, sometimes called on-disk or disk data, is data that lives alongside the compute you are using. You can open a file explorer or otherwise navigate to it from a mounted drive, and it is typically the fastest and most efficient way to interact with a dataset.

Cloud native: The alternative is cloud native data, or data which lives in and is accessed from an S3 bucket (or similar object store). Although that data really lives on someone else's hard drive, you interact with and access it using URLs and an active internet connection, and the "better practice" is to pull small subsets of the data into your local memory to interact with at a time.
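As a sketch of that subset-at-a-time pattern: cloud-optimized formats arrange their bytes so a reader can fetch just the ranges it needs (over HTTP, via `Range` request headers) instead of downloading the whole object. The example below is pure Python and uses a local temporary file as a stand-in for the remote object; the sizes and offsets are illustrative, not from any real dataset.

```python
import os
import tempfile

# Stand-in for a large remote object. Cloud-optimized formats lay out
# their bytes so a reader can request only the ranges it needs.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as f:
    f.write(bytes(range(256)) * 1024)  # a ~256 KB "dataset"
    path = f.name

def read_range(path, start, length):
    """Read only `length` bytes starting at `start` -- the local
    analogue of an HTTP Range request against an S3 object."""
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(length)

# Pull a small subset into memory rather than the full file.
chunk = read_range(path, 1024, 64)
print(len(chunk))  # 64 bytes fetched, not ~256 KB
os.unlink(path)
```

Against a real S3 bucket, the seek-and-read would be an HTTP request with a `Range` header, but the access pattern is the same: ask for a small window, keep memory use flat.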

Location agnostic: A term I use more to describe the nature of the reader you deploy to bring that data onto your local machine. While in theory all data could be made "location agnostic" with enough syntax candy, this more so describes data stored in formats like Parquet and FlatGeobuf whose readers, as implemented in the sf package in R, can take either a path to a local file on your disk or an S3 URL. The function itself takes care of loading that data for manipulation, and you as the end user don't have to alter your syntax or thought patterns in order to take advantage of the benefits the different locations of the data provide.
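A minimal sketch of that idea in Python (the sf behavior itself lives in R; `read_bytes` here is a hypothetical helper, not a real library API): the caller hands over either a local path or a URL, the function dispatches on the scheme, and the calling code looks identical either way.

```python
import os
import tempfile
from urllib.parse import urlparse
from urllib.request import urlopen

def read_bytes(path_or_url: str) -> bytes:
    """Location-agnostic reader: accepts a local file path or a URL.

    Dispatches on the URL scheme so the caller never changes syntax
    when the data moves between local disk and object storage.
    """
    scheme = urlparse(path_or_url).scheme
    if scheme in ("http", "https"):
        with urlopen(path_or_url) as resp:
            return resp.read()
    if scheme == "s3":
        # An s3:// URL would need a client such as boto3 in practice.
        raise NotImplementedError("use an S3 client, e.g. boto3")
    # No recognized scheme: treat it as a local file path.
    with open(path_or_url, "rb") as fh:
        return fh.read()

# Same call shape for local data...
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"local bytes")
    local_path = f.name
data = read_bytes(local_path)
os.unlink(local_path)
# ...and (given a connection) for remote data:
# read_bytes("https://example.com/data.parquet")
```

In practice libraries like sf (via GDAL) or Python's fsspec do this dispatch for you; the point is that the end user's syntax stays the same regardless of where the data lives.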

Attributes and tags

Beyond the location and nature (vector, raster, tabular) of the data, there are several attributes, tags, standards, and principles that quickly inform you of the sorts of tools, techniques, and challenges you will face as you start working with that data.

Big

[[BD]] Big Data

FAIR

Defining the FAIR acronym [[20240313142448]] FAIR data

Cloud native data

Defining the ARCO acronym [[20240313183434]] A note on “ARCO” and “Cloud Native” data

"Analysis-Ready, Cloud Optimized" (ARCO) and "Cloud Native" are terms used to connote the nature of the structure, shape, and storage of the data.

Formats

Cloud Optimized GeoTIFFs (COG)
Zarr
Kerchunk
Cloud-Optimized HDF5 and NetCDF
Cloud-Optimized Point Clouds (COPC)
GeoParquet
FlatGeobuf
STAC
PMTiles