csemx — Design Rationale (companion to the spec)

Companion to csemx-specification.md. The spec states the rules tersely; this records the why behind each non-obvious decision, so a reader who questions a rule there can find its justification here. Sections follow the spec where a rationale is useful; purely tabular field definitions are omitted.

§1 Scope and configurations

One data model for every configuration. Geometry is always explicit 3D vertices (wires, loops) or oriented point elements at arbitrary (easting, northing, elev). Surface CSEM, marine towed-dipole, borehole induction, and crosswell EM then differ only in where the vertices sit and whether an element is a wire or a coil: no per-configuration schema, no special-case objects.

Moving sources/receivers are dense station sequences. A towed source (and possibly receiver) moves continuously, so each position is its own tx_station_id/rx_station_id. A dense sequence of single-use stations is the intended idiom: a tow line is not an object the schema must model specially.

Calibrated response only. The normative payload is the geometry needed to model the response, the calibrated normalized response functions, and the producer’s use recommendation, and nothing else. Instrument internals (calibration curves, gains, serial numbers) are applied before delivery; acquisition bookkeeping (navigation uncertainty, weather, crew logs) and consumer-side modeling/inversion settings are not properties of the EM response. Environmental/medium properties (seawater conductivity, CTD profiles, borehole fluids) are modeling inputs and ship separately.

§2 Bundle structure: ZIP + CSV/Parquet, not HDF5/SQLite

ZIP of fixed-name tables. A bundle is a ZIP of one directory with fixed filenames, so a reader locates every table without configuration. The .zip already supplies the single-self-describing-file property, leaving a human-readable YAML manifest rather than an opaque container header.

Geometry split from elements. Element metadata (tx/rx) is separated from geometry (*_vertices), joined by keys, because vertex count varies per element (point = 1, wire ≥ 2, loop ≥ 3) and does not fit a fixed-width element row. The split also lets a large vertex table go to Parquet on its own.

Parquet per table, chosen over HDF5/SQLite on contractor-straightforwardness. Airborne/marine volume bloats not just data but tx/rx and their vertex tables, so binarization must be available per table, not just for data.

Parquet is the typed table — its schema is the column set the spec already defines, so whoever writes the CSV can write Parquet with a single library call in common analysis tools. Minimal structural ambiguity, little room for per-vendor divergence.
HDF5 is a flexible container, not a schema: using it would require pinning an internal group/dataset layout each contractor must build correctly — reintroducing the per-file divergence csemx exists to remove — with a fiddlier API and portability quirks; tables are not its native idiom.
SQLite is a database container rather than a plain table exchange format; using it would add database-specific implementation choices without matching the fixed-file export path most producers already have.

Per-table choice keeps small geometry tables readable as CSV while a large data table goes binary. Parquet stores real float NaN and required f64 values, avoiding CSV NaN-casing issues and preserving the precision producers write. String IDs must stay string-typed: a writer that infers 001 → 1 silently corrupts join keys.

§3.1 Coordinate system

A single EPSG code pins projection, datum, axis order, and units unambiguously and is understood by standard geospatial tooling. A geographic CRS (degrees, e.g. 4326) breaks the geometry conventions: §3.4 takes lengths and orientations as Euclidean differences of vertex coordinates, but degree spacing is non-uniform and latitude-dependent (1° longitude ≈ 111 km·cos φ), so those differences are not metric; projected meter-axis coordinates are required. Feet are rejected because they would silently corrupt lengths and areas. Absolute lat/lon is recovered by reprojecting easting/northing through epsg_horizontal and is never stored, so there is no second copy to drift out of sync.

No local grid. A producer using a private origin/rotation already holds the georeferencing needed to reproject into a registered projected CRS before delivery. Requiring a registered code keeps every bundle readable by standard tooling and free of bespoke frame parameters. Original local-grid coordinates may ride along as ignored ext_* provenance or be described in notes.md, but are never the authoritative geometry.

§3.2 Elevation and altitude

Elevation datum is declared, never defaulted. A silent default would let two producers mean different surfaces by the same number. 4979 (WGS84 ellipsoidal) is the recommended choice because GPS/RTK delivers ellipsoidal height natively: zero geoid conversion, no model dependence; 3855 (EGM2008 orthometric) is offered for “height above sea level” when that is what was recorded.

Altitude is the optional, interface-relative companion. Marine CSEM (and airborne EM) response is acutely sensitive to instrument height above the interface (seafloor or ground), which the altimeter measures directly. Absolute elevation captures it poorly: bathymetry/DEM grids disagree in absolute z by meters (datum, tides, resolution), so computing altitude = elev − surface(x,y) lands that disagreement on the most sensitive parameter. Carrying the measured altitude lets a consumer place each vertex at the right height above their own model’s interface without baking in the producer’s bathymetry/DEM. elev remains the required absolute height for mapping, exchange, and downhole geometry; altitude is an optional interface-relative placement aid, not a second vertical datum. The reference surface itself is not shipped (topo/bathy is a modeling input, out of scope); altitude.reference only names the interface (seafloor/ground).

§3.3 Azimuth and dip

Direction is polarity. A point element’s azimuth/dip give a directed axis n̂, and that direction carries the source/sensor sign. There is no separate polarity field. Folding sign into the axis removes a redundant flag that could contradict the geometry: the producer picks azimuth/dip to match the element’s positive source or sensor axis. For a coil, that is the right-hand normal of its winding current; reversed leads or sensor polarity are corrected before delivery, not represented by a sign column.

True north, dip positive down. Azimuths are true-north because magnetic declination varies in time and space and is a producer correction, not survey data (record it in notes.md if provenance needs it). Dip is positive downward, following the standard geophysical convention that inclination into the earth is positive.

Forbidden on wire/loop. Vertex order already fixes a wire’s or loop’s orientation, so azimuth/dip there would be a second encoding of the same fact, and two encodings invite contradiction.

§3.4 Vertex ordering and positive reference direction

RX wire / bent wire. At nonzero CSEM frequency E is generally non-conservative (∇×E = −∂B/∂t), so a wire receiver’s datum is the line integral along its stated path, not a path-independent endpoint potential. In the DC/static limit this reduces to the familiar V(+) − V(−) convention. csemx orders the RX wire so the first vertex is the voltmeter + terminal and the last the −, so the datum is ∫E·dl taken along the vertex path (first → last) with no sign flip.

Loop circulation and vector area. The loop’s signed circulation is derived from vertex order and is never constrained to have an “up” normal, which keeps the convention lossless for real non-planar loops on topography. For dipole approximations csemx uses the oriented area vector a = ½ Σ_{i=0}^{N-1} r_i × r_{i+1} (with r_N = r_0), where r_i is the 3D position of vertex i relative to any fixed origin; the closed-loop sum is translation-invariant. For a planar loop this reduces to A·n̂ with n̂ from the right-hand rule; for a non-planar loop a is still exact, but m = I·a is then the dipole approximation and the finite-loop geometry remains the closed vertex path.

TX vs RX electrode order (one positive sense). Both TX and RX take the vertex order, first → last, as the positive reference direction, so a reciprocal Tx/Rx pair on identical geometry needs no sign flip. The electrode labels that realize this are opposite: a TX dipole moment points − to + (first → last is − → +, the impressed source-current direction, not the earth-conduction-current direction, which runs + → − through the earth), whereas an RX wire reads V(+) − V(−), so its + terminal is the first vertex.

§3.5 Sign convention and phase reference

Both conventions are declared, not mandated. exp(+iwt) (engineering) and exp(-iwt) (physics) are both in active use; mandating one would force every producer or consumer on the other side to convert. csemx declares the convention so a multi-bundle consumer flips the imaginary sign once, at ingest. No default: a wrong-but-present default is more dangerous than a required field because it is silently plausible.

Phase referenced to the transmitter current. The response is delivered in phase relative to the transmitter-current spectral component at frequency, so it is comparable across instruments regardless of the internal reference used during acquisition (lock-in phase, GPS time, transmitter voltage, reference clock). The producer converts to this reference before delivery. No propagation-time correction is represented. That is a modeling step, not a property of the measurement.

§3.6 Units and the measured datum

Wire/loop receiver responses are kept at V/A (not divided by dipole length or loop vector area) because the length, shape, polarity, and winding sense are already in rx_vertices.csv; a modeling code computes the datum as the appropriate line integral of E: open-path ∫E·dl along the wire’s vertex path for a wire, closed-loop ∮E·dl for a loop. A point magnetic receiver reports a calibrated projected B-field, so T/A is already the natural datum.

There is no length or current unit by design: coordinates are meters (§3.1), current is normalized away (§3.7), and the one area column carries its unit in its name.

Units are fixed by the spec, not carried in the bundle. The spec admits exactly one canonical unit per datum (wire/loop → V/A, point magnetic → T/A), so a manifest units: block would only restate fixed constants: it would carry no per-bundle information and could be misread as a unit selector. The unit of every quantity is instead pinned by format.version plus §3.6, so a bundle stays self-describing on units (through its version) without a redundant, drift-prone declaration. The same logic does not apply to coordinate system or sign convention, which genuinely vary between bundles and so remain declared.

§3.7 Normalization

Current normalization uses the complex current phasor/Fourier coefficient at each reported frequency, not peak, RMS, or nominal drive, so multi-frequency and encoded sources stay comparable.

Turns fold into the normalization precisely because they are dimensionless: V/A stays V/A. Area cannot, because dividing by m² changes the units. That asymmetry is why a point transmitter must declare point_moment_area_m2 (no vertices to supply an area, and the area cannot hide in the normalization), while finite loops keep their area in their vertices. A point receiver needs no moment column because its datum is already calibrated B-field in T/A. The same dimensionless-turns logic applies on the receive side: a loop receiver’s EMF scales with its turns, so its response is divided by the receiver turn count to a single-turn loop (area kept in rx_vertices).

§3.8 Missing values

NaN is the IEEE 754 token parsed natively by every scientific computing stack. Output casing varies (nan / NaN / NAN), hence the case-insensitive read. A fixed token, rather than a configurable sentinel, means a reader never has to discover what “missing” looks like in a given bundle.

§3.9 Component naming

The x/y/z letter is opaque because the same letter means different things across the industry: a contractor may call the inline dipole Ex; the MT/EDI convention takes Ex as north; the easting axis points east: three directions behind one letter. csemx records the actual azimuth/vertices, so the data is unambiguous even though the label is not.

For point magnetic components, Bx/By/Bz naming (never a separate H* set) keeps labels tied to the delivered magnetic-flux-density datum (T/A) and avoids an A/m alternative.

§3.10 Geometry and field type

wire geometry denotes electric coupling; loop and point geometry denote magnetic coupling. Electric elements are open wire dipoles (grounded or capacitive); closed loops and compact coils are magnetic. This avoids electric-loop and magnetic-wire hybrids.

No stored *_type column. Field type is a total function of geometry (wire→electric, loop/point→magnetic), and csemx represents electric data as open-path wire voltages/line integrals over finite baselines, so there is no electric point. A stored source_type/sensor_type could therefore only restate geometry or contradict it (wire+magnetic), so field type is derived from geometry_type, not stored (the “can’t be filled in wrong” principle). Geometry, not field type, also fixes the unit: a loop and a point are both magnetic but report V/A and T/A.

A large transmitter loop is written as its vertices; a small coil (borehole / crosswell tool, negligible at survey offsets) is a point with point_moment_area_m2 rather than a polygon of millimeter-scale vertices stacked on six-figure coordinates. Anchoring on real finite geometry, with one effective-area escape hatch, lets one model cover surface, borehole, and crosswell layouts without per-channel length/moment bookkeeping.

§3.11 Field content: total or secondary

Secondary is the stored datum; ppm is derived context. Secondary is defined by the primary: the free-space (no-earth) transmitter response at the receiver, so secondary = total − primary. The primary must therefore be defined, but it does not need to be a payload column: csemx needs only one declaration (field.content) because the primary is computed from the geometry the bundle already carries. ppm is a display/comparison ratio derived outside the csemx datum, not a stored unit.

Store the absolute secondary in V/A/T/A, never literal ppm. ppm is common for inductive loop-loop systems (airborne and ground FDEM), but storing it would break the fixed-unit pillar (§3.6): a dimensionless ×10⁶ ratio is the first datum whose unit is not set by geometry_type, the error columns would become ppm, and a reader assuming T/A would mis-scale by ~12 orders. Keeping the datum in the canonical field unit and recording which transform was applied (total vs secondary) avoids a second unit system or a per-bundle units selector.

Why secondary is preferred for airborne data. Total field at the bird is dominated by the primary; the ground signal is a small ppm-level fraction of it. Delivering total and asking a consumer to recover the secondary by subtracting their own computed primary lands every geometry/calibration mismatch on a signal orders of magnitude smaller. Legacy ppm processing is primary-referenced precisely to reduce sensitivity to the dominant primary amplitude. Secondary storage preserves the scattered signal; it is the numerically sound default for these systems, not merely a convenience.

Default total, so the baseline stays minimal. Absence of the field block means total; total-field bundles need no secondary-specific metadata. Secondary delivery adds only the field.content declaration.

No primary column: the primary is a derived reference. Contractor systems may report ppm internally or in legacy deliverables, but csemx stores the absolute secondary response, not ppm and not the primary itself. For the free-space convention used here, the primary is a fixed system response recoverable from the geometry the bundle already carries. Both sides therefore compute the identical primary from existing columns; storing it would duplicate a derivable quantity and break the format’s “carry geometry, derive the rest” rule in exactly one place (§3.7). A stored real primary would also be inadequate for all receiver datums: the primary is real for a point B-field datum but quadrature for a loop EMF datum (§3.5).

Not airborne-specific. The same primary-reference machinery covers any fixed-offset loop-loop FDEM (ground and airborne), so the profile is scoped by the measurement (a removed primary), not by platform.

§4 Manifest, survey identity, and re-ships

Survey identity and re-ships. Re-ship lineage is the tuple (contractor, contractor_reference, survey.name), ordered by survey.revision. survey.name alone can collide across contractors, so the contractor and their job/contract reference qualify it.

Per-datum reconciliation over a survey.part field. A survey delivered in pieces must distinguish “complementary part” from “supersedes,” but both look like multiple bundles with one survey.name. csemx uses a single per-datum rule: highest survey.revision wins for repeated datum keys, disjoint data is unioned. Resolution keys on a higher revision, so two copies of one datum key at the same revision have no tiebreaker and are non-conformant; a corrected datum must increment the revision. That rule needs no new metadata and is derived from content plus revision. survey.part would only restate intent the content already carries, and could be set inconsistently. Lineage comes from survey identity plus survey.revision; hashes, if used, are only for byte-level deduplication.

Quoted, UTC acquisition dates. acquired_start/acquired_end are quoted strings in one of two exact forms because an unquoted YAML scalar like 2026-05-01 parses as a date object, not a string, and would validate inconsistently across YAML libraries. UTC with a literal Z (no offsets, no local time) removes timezone ambiguity from a field used only as a day or instant bound.

On ingest, a consumer combining bundles verifies or reprojects coordinate systems and matches sign conventions (flipping the imaginary part as needed); units need no reconciliation (fixed by the spec), and IDs are namespaced across bundles to avoid collision (e.g. <contractor>:<contractor_reference>:<original_id>).

§9 The data table

One datum per row. The response at a frequency is one physical quantity regardless of the transmitter drive pattern used to estimate it. tx_fundamental records useful drive provenance when a nominal repetition/fundamental frequency exists, but does not define datum identity and need not be harmonically related to frequency. Keep the preferred estimate for a datum in one bundle, and ship competing estimates (alternate stacks, trial processing) as separate bundles.

All-or-nothing complex datum. real/imag are both finite or both NaN: a complex response with only one part present is not a usable measurement, and the all-or-nothing rule lets a consumer test presence on either part. Errors follow the datum (finite for a present datum, NaN for a missing one) so a skipped measurement carries no false precision.

use flag. Some contracts mandate a complete matrix (every frequency × station × transmitter), so a contractor cannot omit poor data, and inflating err to signal badness corrupts the statistical uncertainty an inversion relies on; that inflation is irreversible, since a genuinely noisy datum then can’t be told from a producer-deprecated one. use separates the producer’s quality judgment from the error: deliver the true value and error, mark use = 0. It is binary and advisory by design: the cutoff for “bad” is subjective, so the flag is explicitly the producer’s include/exclude recommendation, not a calibrated grade, and the consumer may override. Kept strictly binary (no levels, no reason codes) to avoid reimporting the “what does ‘fair’ mean / where’s the cutoff” ambiguity; a reason goes in notes.md. Represented as 0/1, not true/false, to avoid the boolean-text casing trap (cf. exp, NaN).

Real/imag, not amplitude/phase. Real/imag is the canonical complex response and avoids phase wrapping, phase-unit choices, log-amplitude conventions, and a two-representations-per-bundle switch. Consumers that prefer amplitude/phase can derive it from the complex value and apply their own error model and floors.

§11 Versioning

Additive-only minor versions let a MAJOR.X reader accept any MAJOR.Y bundle (Y ≤ X) and ignore unknown optional additions, so the format can grow (new optional columns, manifest keys, files) without breaking existing readers. Anything that would break a reader bumps MAJOR. Because the unit of every quantity is pinned by format.version (§3.6), the version string is also what keeps a bundle self-describing on units without a manifest units block.

Out of scope for v1.0

Deliberately deferred:

Time-domain (TDEM/TEM). v1.0 is frequency-domain only and does not define gates, waveforms, turn-off ramps, or time-zero conventions.
Natural-source EM (MT/AMT). csemx is controlled-source only; natural-source deliverables remain separate.
Static/DC data. frequency = 0 is not a placeholder for DC resistivity or static-limit data; v1.0 data rows require frequency > 0.
Environmental / medium properties. Seawater conductivity, CTD profiles, borehole fluids, bathymetry, and earth models are modeling inputs, not csemx response data.
Graded quality codes. v1.0 keeps only the binary advisory use flag.
Time-lapse linkage. Relationships between repeat surveys belong in external project metadata.