I spent a week trying to put satellite data into Azure blob storage. I needed CHIRPS rainfall and MODIS imagery for a disease forecasting platform, and the obvious plan was to pull MODIS from NASA and CHIRPS from UCSB, stage them in blob, and build a STAC catalog over my own bytes. It felt like the honest way to do things. You run a data platform, you host the data.
It was going badly. NASA’s CMR returns signed URLs that expire, so any ingest pipeline has to refresh tokens partway through a transfer, and you have to handle partial failures cleanly because a multi-gigabyte download can hit the expiry at the worst possible moment. MODIS comes as HDF4, an archive format from the mid-90s that much of modern geospatial tooling no longer parses cleanly — you end up going through GDAL with specific drivers and hoping the subdataset names match what your code expects. CHIRPS is easier, but it’s large, and I was going to be paying Azure egress and storage for data that already exists in a better-curated form elsewhere. I had written the bones of a scheduled ingest in a Terraform-managed Function App, and every time I fixed one thing another broke, and at some point I started to feel like I was reimplementing a service that ought to be someone else’s job.
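To make the HDF4 pain concrete, here is roughly what reading one MODIS band looks like through GDAL. A sketch, assuming GDAL was built with the HDF4 driver; the granule filename and the subdataset index are illustrative:

```python
from osgeo import gdal

gdal.UseExceptions()

# An HDF4 granule doesn't open as a single raster. GDAL exposes each science
# dataset as a named "subdataset" that you have to open separately.
granule = gdal.Open("MOD11A2.A2023001.h21v08.061.hdf")  # illustrative filename
for name, description in granule.GetSubDatasets():
    print(name, "->", description)

# The subdataset name embeds the driver, file path, and layer path, e.g.
# HDF4_EOS:EOS_GRID:"MOD11A2...hdf":MODIS_Grid_8Day_1km_LST:LST_Day_1km
# — and your code breaks if that name isn't what you expected.
lst_day = gdal.Open(granule.GetSubDatasets()[0][0])
array = lst_day.ReadAsArray()
```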
Then I opened the Microsoft Planetary Computer catalog one evening, for an unrelated reason, and saw MODIS was already there. So were Landsat, Sentinel-2, GPW, and WorldClim — most of the imagery collections I had on my roadmap, and a few I hadn’t thought of yet. Each one was staged in Azure blob, COG-tiled where the source format allowed it, with STAC metadata and short-lived SAS tokens you can request through the MPC API. CHIRPS wasn’t there, which I’ll come back to, but the MODIS side of the problem — which was the painful side — had already been solved by a team with more resourcing than me.
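For contrast, this is more or less the entire access path on the MPC side, using the pystac-client and planetary-computer packages. The collection id and asset key are the MODIS LST product as I recall them being catalogued, so verify them against the catalog; the bbox is an illustrative East Africa extent:

```python
import planetary_computer
from pystac_client import Client

# The modifier signs asset hrefs with short-lived SAS tokens as items come back
catalog = Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1",
    modifier=planetary_computer.sign_inplace,
)

search = catalog.search(
    collections=["modis-11A2-061"],  # 8-day LST; id as I recall it, verify
    bbox=[32.0, -5.0, 42.0, 5.0],
    datetime="2023-01-01/2023-01-31",
)
item = next(search.items())
print(item.assets["LST_Day_1km"].href)  # already a signed, expiring URL
```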
I sat with that for a while. MPC is effectively the ingest pipeline I was building, except it has been built by a team at Microsoft, it’s maintained continuously, and it ships with provenance, licensing, and signed access out of the box. I had been treating the download-and-stage step as mandatory. That assumption doesn’t survive first contact with a catalog that already has the data in Azure, tiled for efficient reads, with tokens on request.
The pivot was not trivial, though, and this is the part that took me longer to think through than I expected. If I’m using MPC’s blobs, what am I actually running?
STAC-as-an-index became the useful framing. STAC items are just JSON, and the href inside each asset can point at any URL — a blob in my storage account, a blob in MPC’s storage account, a signed URL that expires in an hour. The catalog layer and the storage layer are separate concerns in STAC by design, and I had collapsed them in my head because my first mental model of “building a data platform” implied owning the data.
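Trimmed to the essentials, an item looks like this. The storage account in the href is made up, and a real item also carries links and a bbox; the point is that nothing in the JSON says whose storage the asset lives in:

```python
# A STAC item is just JSON. The href below points at a hypothetical
# MPC-style blob URL — it could equally be my own storage account.
item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "MOD11A2.A2023001.h21v08",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[30, 0], [40, 0], [40, 10], [30, 10], [30, 0]]],
    },
    "properties": {"datetime": "2023-01-01T00:00:00Z"},
    "assets": {
        "LST_Day_1km": {
            "href": "https://example-account.blob.core.windows.net/modis/MOD11A2.tif",
            "type": "image/tiff; application=geotiff; profile=cloud-optimized",
        }
    },
}
```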
It turns out you don’t have to own the data to own the catalog. The catalog is the thing that expresses, for a given domain, which assets exist, how they relate, and how to query them. For disease forecasting, that means an opinionated subset — specific CHIRPS dekads, MODIS LST over particular regions, derivative products I produce myself, and metadata fields tuned to how the forecasting models consume input. That layer has to live somewhere I control, because it’s specific to my work. The bytes underneath, in a lot of cases, can live wherever a stable signed URL will let me point at them.
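In practice that domain layer mostly lives in item properties. A sketch with pystac; the forecast:* fields are my own invention for illustration, not a published STAC extension:

```python
from datetime import datetime, timezone

import pystac

bbox = [32.0, -5.0, 52.0, 15.0]  # illustrative Horn of Africa extent
geometry = {
    "type": "Polygon",
    "coordinates": [[[32, -5], [52, -5], [52, 15], [32, 15], [32, -5]]],
}

item = pystac.Item(
    id="chirps-2023.01.1-horn-of-africa",
    geometry=geometry,
    bbox=bbox,
    datetime=datetime(2023, 1, 10, tzinfo=timezone.utc),
    properties={
        # Hypothetical domain fields, named for how the forecasting
        # models consume input
        "forecast:dekad": "2023-01-1",
        "forecast:region": "horn-of-africa",
        "forecast:variable": "rainfall",
    },
)
```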
So the architecture shifted. pgSTAC runs in my Azure environment and holds the catalog. Items for MODIS and the other MPC-hosted collections point at MPC blobs, through a small service on my side that requests a signed URL at query time and returns a short-lived one to the client. Items for CHIRPS point at my own blob, populated by an Azure Batch job that pulls the UCSB releases on a schedule, writes them into storage, and registers the corresponding STAC items. Items for products I generate myself (forecasting model outputs, region-specific composites, a few in-house datasets from our partners) live in my blob too. The query API, the agent layer, the map frontend: none of them care where a given asset physically lives.
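The signing service is small enough to sketch in full. This assumes the planetary-computer Python package; the route and parameter names are mine:

```python
import planetary_computer
from fastapi import FastAPI

app = FastAPI()

@app.get("/sign")
def sign(href: str) -> dict:
    # planetary_computer.sign accepts a bare URL and returns it with a
    # short-lived SAS token appended; the package caches tokens until
    # they're near expiry, so this is cheap to call per asset.
    return {"href": planetary_computer.sign(href)}
```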
That removed a large amount of scope from the roadmap. The durable ingest pipeline for MODIS and the other imagery collections went away. What’s left on Azure Batch is the CHIRPS ingest and the derivative products we make ourselves: the data that only exists because we make it, which is also the interesting work.
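The CHIRPS job itself reduces to a download-and-upload loop plus item registration. A sketch, with the caveat that the UCSB directory layout is from memory and the blob naming is my own convention, so verify the source path before relying on it:

```python
import httpx
from azure.storage.blob import ContainerClient

# Path layout under data.chc.ucsb.edu is from memory; verify before use.
SRC = ("https://data.chc.ucsb.edu/products/CHIRPS-2.0/"
       "africa_dekad/tifs/chirps-v2.0.2023.01.1.tif.gz")

container = ContainerClient.from_connection_string(
    conn_str="...",  # supplied by the Batch job's environment
    container_name="chirps",
)

resp = httpx.get(SRC, timeout=None)
resp.raise_for_status()
container.upload_blob(
    "africa_dekad/chirps-v2.0.2023.01.1.tif.gz", resp.content, overwrite=True
)

# A follow-on step builds the STAC item for this dekad and upserts it into
# pgSTAC (pypgstac's loader handles the upsert).
```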
The uncomfortable design question underneath this is the dependency. If MPC changes its access model, or retires a collection, my catalog has broken items. In practice STAC items are cheap to re-point: if I had to move a collection’s hrefs to a mirror, it’s a database update rather than a re-ingest, and I could script the cutover in a day. MPC itself is stable enough, and backed well enough, that the operational risk is smaller than the risk of me running my own ingest of the same data at a fraction of the resourcing. But it is a real dependency, and I’ve been writing down the specific things I’d need to do if it went away, so the answer doesn’t have to be improvised if it does.
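For what it’s worth, the cutover script is short. This assumes pgSTAC keeps the item body in a content jsonb column on the items table, which is true of the pgstac versions I’ve used; both hostnames below are hypothetical:

```python
import psycopg

# Crude but effective: rewrite every href for one collection by swapping
# the storage hostname inside the item JSON. Both hostnames are made up.
CUTOVER = """
UPDATE pgstac.items
SET content = replace(
        content::text,
        'https://mpc-hosted-account.blob.core.windows.net',
        'https://my-mirror.blob.core.windows.net'
    )::jsonb
WHERE collection = %s;
"""

with psycopg.connect("postgresql://...") as conn:  # DSN elided
    conn.execute(CUTOVER, ("modis-11A2-061",))
```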
What’s stayed with me from this is the question of what’s worth owning when the infrastructure below you is maintained better than you could manage. Owning more feels like a safer default, until you try to keep it running at the quality somebody else is already delivering. The parts worth owning are the domain-specific ones: the catalog of things you care about, the derivative products, the query patterns your users actually use. Those parts are almost always a smaller slice than the platform you first imagine building.
TL;DR (LinkedIn)
I spent a week trying to pull NASA MODIS and UCSB CHIRPS into Azure blob so I could build a STAC catalog over my own bytes. It was going badly; NASA CMR’s signed URLs expire mid-transfer, MODIS ships as HDF4, and I was going to pay storage and egress for imagery that already exists in a better-curated form elsewhere.
Then I opened Microsoft Planetary Computer one evening and found MODIS, Landsat, Sentinel-2 — all staged in Azure blob, COG-tiled, with STAC metadata and signed SAS tokens on request. CHIRPS wasn’t there, but the painful half of the problem had been solved by somebody with more resourcing than me.
The reframing that mattered: STAC items are just JSON, and the href on an asset can point at any URL, including somebody else’s blob. The catalog and the storage are separate concerns in STAC by design. I had collapsed them in my head because my first model of “a data platform” implied owning the data.
So pgSTAC now runs in my environment. MODIS items point at MPC blobs through a small signed-URL service on my side. CHIRPS items point at my own blob, populated by an Azure Batch job that pulls the UCSB releases on a schedule. Derivative products sit alongside them.
The question underneath this is what’s worth owning when the infrastructure below you is maintained better than you could. For me the answer ended up smaller than I’d assumed: the catalog, one custom ingest, and the derivative products. The rest can live wherever a stable signed URL points.