Compute
While I am a patient man and am happy to wait, many objectives of data science care about speed. If that’s your jam, check out https://duckdblabs.github.io/db-benchmark/, but keep in mind the caveat from the duckdblabs page: “We also include the syntax being timed alongside the timing. This way you can immediately see whether you are doing these tasks or not, and if the timing differences matter to you or not.” (“Database-Like Ops Benchmark” n.d.)
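Benchmarks aside, the cheapest way to decide whether speed matters is to time your own task on your own data. As a minimal sketch (assuming the bench and dplyr packages are installed; the data below are synthetic), you can compare two equivalent group-by aggregations and judge whether the gap is meaningful at your data sizes:
```{r}
# Time two equivalent group-by summaries on a synthetic data frame
set.seed(42)
df <- data.frame(group = sample(letters, 1e5, replace = TRUE),
                 value = runif(1e5))

bench::mark(
  base  = aggregate(value ~ group, data = df, FUN = sum),
  dplyr = dplyr::summarise(dplyr::group_by(df, group), total = sum(value)),
  check = FALSE  # the two calls return differently shaped objects
)
```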
The complications of scaling and compute
Many users and use cases call for high-resolution data, but that use can become a significant cost burden the finer in resolution you go. Furthermore, the cost increase does not scale linearly with the change in resolution: halving the cell size quadruples the number of cells, so a process at 10 meters will take roughly four times as long as the same process at 20 meters, not twice as long. What’s often missing in this conversation is some geographic context and some bounds around the scales of the process you’re modeling. The example below builds progressively finer grids over a small area centered on Fort Collins, Colorado, and counts how quickly the number of cells grows.
```{r}
# Centre of the example area: Fort Collins, Colorado
fort_collins_lat <- 40.5853
fort_collins_lon <- -105.0844

# Create a bounding box by buffering the point by 50 m.
# (A coarser 500 m buffer with 500 m and 100 m grids is left commented out
# below; it makes the cell counts explode even faster.)
# bbox <- sf::st_bbox(sf::st_buffer(
#   sf::st_sf(geometry = sf::st_sfc(sf::st_point(c(fort_collins_lon, fort_collins_lat))),
#             crs = sf::st_crs("EPSG:4326")),
#   units::as_units(500, 'm')))
# projected_bbox <- sf::st_transform(sf::st_as_sfc(bbox), crs = sf::st_crs("EPSG:5070"))
# cell_sizes <- c(500, 100, 10, 5, 3, 2, 1)

bbox <- sf::st_bbox(sf::st_buffer(
  sf::st_sf(geometry = sf::st_sfc(sf::st_point(c(fort_collins_lon, fort_collins_lat))),
            crs = sf::st_crs("EPSG:4326")),
  units::as_units(50, 'm')))

# Project to NAD83 / Conus Albers (EPSG:5070) so cell sizes are in metres
projected_bbox <- sf::st_transform(sf::st_as_sfc(bbox), crs = sf::st_crs("EPSG:5070"))
cell_sizes <- c(10, 5, 3, 2, 1)

# Build one grid per cell size and store them in a named list
grids <- list()
for (size in cell_sizes) {
  grid <- sf::st_make_grid(projected_bbox, cellsize = c(size, size))
  grids[[paste0("grid_", size, "m")]] <- grid
}

# Transform back to WGS84 for plotting
grids_wgs84 <- lapply(grids, sf::st_transform, crs = sf::st_crs("EPSG:4326"))

ggplot2::ggplot() +
  ggplot2::geom_sf(data = sf::st_transform(sf::st_as_sfc(bbox), crs = 4326),
                   fill = "lightblue", alpha = 0.3) +
  ggplot2::geom_sf(data = grids_wgs84$grid_10m, fill = NA, color = "blue",   linewidth = 0.6) +
  ggplot2::geom_sf(data = grids_wgs84$grid_5m,  fill = NA, color = "orange", linewidth = 0.5) +
  ggplot2::geom_sf(data = grids_wgs84$grid_3m,  fill = NA, color = "purple", linewidth = 0.4) +
  ggplot2::geom_sf(data = grids_wgs84$grid_2m,  fill = NA, color = "brown",  linewidth = 0.3) +
  ggplot2::geom_sf(data = grids_wgs84$grid_1m,  fill = NA, color = "black",  linewidth = 0.2) +
  ggplot2::labs(title = "Overlapping Grids Centered on Fort Collins",
                subtitle = "Cell sizes: 10m (blue), 5m (orange), 3m (purple), 2m (brown), 1m (black)",
                x = "Longitude", y = "Latitude") +
  ggplot2::theme_minimal()

# Count the cells in each grid and the growth relative to the coarsest grid
num_cells <- sapply(grids, length)
computation_table <- data.frame(
  Cell_Size_m     = as.numeric(gsub("grid_(\\d+)m", "\\1", names(num_cells))),
  Number_of_Cells = num_cells
)
computation_table <- computation_table[order(computation_table$Cell_Size_m, decreasing = TRUE), ]
computation_table$Relative_Increase <-
  computation_table$Number_of_Cells / computation_table$Number_of_Cells[1]
print(computation_table)
```
Local
The most accessible way to run calculations is on the computer right in front of your face. However, how you set up and run your computer (your environment) might vary wildly from how someone else does, so issues like reproducibility and replicability creep in as you advance. One of the most direct ways to overcome some of those hurdles is to create a “virtual environment” within which you develop and test your process. There are all sorts of virtual environments that provide every bell and whistle you might want, but they can also be very “heavy,” both in the disk space they take up and in how inefficiently they use your hardware. When performance and ease of scalability matter, Docker is a great way to keep a task reproducible, which makes porting and scaling your workflow almost frictionless. Per the Docker landing page: “Docker helps developers build, share, run, and verify applications anywhere — without tedious environment configuration or management.” See how I install and use Docker for tips and cheatsheets.
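As a minimal sketch of what that looks like for an R workflow (the rocker/geospatial base image tag and the make_grids.R script name are assumptions for illustration, not part of this project), a Dockerfile pins the whole environment alongside the code:
```dockerfile
# Assumption: rocker/geospatial bundles R plus sf, terra, and friends;
# pinning a tag keeps the environment reproducible.
FROM rocker/geospatial:4.4.1

# Copy the analysis script (hypothetical name) into the image
WORKDIR /home/analysis
COPY make_grids.R .

# Install any extra packages the script needs
RUN Rscript -e "install.packages('bench')"

# Run the workflow when the container starts
CMD ["Rscript", "make_grids.R"]
```
Build it once with `docker build -t grid-demo .` and run it anywhere Docker is installed with `docker run --rm grid-demo`; the environment travels with the image rather than with the machine.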
Cloud
ARCO (Analysis-Ready, Cloud-Optimized) resources
ARCO data lets you stream just the pieces of a dataset you need instead of downloading whole files. A good place to discover it is the Radiant Earth STAC Browser, https://radiantearth.github.io/stac-browser/#/, which lets you explore SpatioTemporal Asset Catalog (STAC) APIs and the collections they serve.
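As a minimal sketch of pulling ARCO data straight from a catalog (assuming the rstac and terra packages, the public Element 84 Earth Search endpoint, and a `visual` asset on the returned items; swap in whatever catalog and collection you find in the STAC Browser), you can search a STAC API and stream a cloud-optimized GeoTIFF without downloading the whole file:
```{r}
# Search a public STAC API for imagery over the Fort Collins area
catalog <- rstac::stac("https://earth-search.aws.element84.com/v1")

items <- catalog |>
  rstac::stac_search(
    collections = "sentinel-2-l2a",
    bbox        = c(-105.2, 40.5, -105.0, 40.7),
    datetime    = "2023-06-01/2023-06-30",
    limit       = 5
  ) |>
  rstac::get_request()

# See which assets the first item offers, then stream one as a COG over HTTP
names(items$features[[1]]$assets)
href <- items$features[[1]]$assets$visual$href  # asset name is an assumption
terra::rast(paste0("/vsicurl/", href))
```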