src.toolbox.utils.alignment

Functions

interpolate_DEPTH(→ xarray.Dataset)

Interpolate missing DEPTH values per PROFILE_NUMBER, normalize depth sign, build regular depth bins, and (optionally) visualize the bin coverage.

aggregate_vars(→ xarray.Dataset)

Compute medians per (PROFILE_NUMBER, DEPTH_bin) and return a NEW dataset with variables named ‘median_{var}’, each shaped (PROFILE_NUMBER, DEPTH_bin).

filter_xarray_by_profile_ids(→ xarray.Dataset)

Filter an xarray.Dataset to keep only rows / entries with profile IDs in valid_ids.

find_profile_pair_metadata(→ pandas.DataFrame)

Identify profile pairs between a target and ancillary glider within time/distance thresholds.

merge_pairs_from_filtered_aggregates(→ xarray.Dataset)

Build a dataset with one row per (target_profile, ancillary_profile, depth_bin).

major_axis_r2_xr(→ float)

Compute R² (coefficient of determination) for Major Axis (Type II) regression using xarray.

compute_r2_for_merged_profiles_xr(→ xarray.Dataset)

Compute R² for each profile-pair in an xarray.Dataset, and append the results directly to the dataset.

plot_r2_heatmaps_per_pair(r2_datasets, variables[, ...])

Create ONE heatmap per ancillary pairing showing counts of unique ancillary profiles that meet R² thresholds for each variable.

collect_xy_from_r2_ds(r2_ds, var, target_name, ...[, ...])

From an R² dataset (one ancillary), collect all (x=ancillary, y=target) binned points for a given variable across selected profile pairs.

fit_linear_map(x, y)

Fit y = slope * x + intercept using sklearn LinearRegression.

plot_pair_scatter_grid(r2_datasets, variables, target_name)

Grid of scatter plots: rows = ancillary, cols = variables.

apply_linear_map_to_da(da, slope, intercept[, out_name])

Apply y = slope * da + intercept to an xarray.DataArray.

Module Contents

src.toolbox.utils.alignment.interpolate_DEPTH(ds: xarray.Dataset, bin_size: int = 5, depth_positive: str = 'down', plot: bool = False, plot_kind: str = 'hist', max_profiles_heatmap: int = 50, figsize=(9, 4), save_path: str | None = None, show: bool = True) → xarray.Dataset

Interpolate missing DEPTH values per PROFILE_NUMBER, normalize depth sign, build regular depth bins, and (optionally) visualize the bin coverage.

Parameters:
  • ds (xr.Dataset) – Must contain ‘DEPTH’, ‘TIME’, ‘PROFILE_NUMBER’ on the ‘N_MEASUREMENTS’ axis.

  • bin_size (int) – Depth bin size in the same units as DEPTH (e.g., meters).

  • depth_positive ({"down","up"}) – Normalize depth sign before binning. “down” => positive increasing with depth.

  • plot (bool) – If True, draw a quick visualization of the resulting depth bins.

  • plot_kind ({"hist","heatmap"}) –

    • “hist”: overall histogram of measurement counts per bin (barh).

    • “heatmap”: per-profile vs bin coverage (first max_profiles_heatmap profiles).

  • max_profiles_heatmap (int) – Max number of profiles to include in the heatmap (to avoid huge plots).

  • figsize (tuple) – Matplotlib figure size for the plot.

  • save_path (str or None) – If provided, save the plot to this path.

  • show (bool) – If True, plt.show() the figure; otherwise close it.

Returns:

Dataset with:
  • DEPTH_interpolated (float64, N_MEASUREMENTS)

  • DEPTH_bin (float32, N_MEASUREMENTS)

  • DEPTH_range (str, N_MEASUREMENTS) e.g. “50–55”

Return type:

xr.Dataset
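
The sign normalization and binning steps can be illustrated in plain Python. This is a minimal sketch, not the module's implementation: the helper name `bin_depths` is hypothetical, and it assumes that "down" means depths are stored as positive numbers and that each bin is identified by its lower edge. Interpolation of missing values and plotting are left out.

```python
import math

def bin_depths(depths, bin_size=5, depth_positive="down"):
    """Hypothetical helper: normalize depth sign, then assign regular bins."""
    rows = []
    for d in depths:
        # Assumption: "down" stores depth as positive values, "up" as negative.
        d = abs(d) if depth_positive == "down" else -abs(d)
        # Assumption: a bin is identified by the lower edge of its magnitude.
        lower = math.floor(abs(d) / bin_size) * bin_size
        label = f"{lower}–{lower + bin_size}"
        rows.append((d, float(lower), label))
    return rows

# Mixed-sign input: every value ends up positive-down and binned in 5 m steps.
rows = bin_depths([-52.3, 3.0, 55.0], bin_size=5)
```

Each tuple holds the normalized depth, the bin value, and a range label in the same style as DEPTH_range.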

src.toolbox.utils.alignment.aggregate_vars(ds: xarray.Dataset, vars_to_aggregate, profile_dim='PROFILE_NUMBER', bin_dim='DEPTH_bin') → xarray.Dataset

Compute medians per (PROFILE_NUMBER, DEPTH_bin) and return a NEW dataset with variables named ‘median_{var}’, each shaped (PROFILE_NUMBER, DEPTH_bin).

Notes:
  • This does NOT attach medians back to the raw dataset (which would broadcast onto N_MEASUREMENTS). Use the returned dataset for downstream steps.

  • Requires that both PROFILE_NUMBER and DEPTH_bin are coordinates aligned to the measurement dimension (typically ‘N_MEASUREMENTS’).
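
The grouping idea behind the per-(PROFILE_NUMBER, DEPTH_bin) median can be sketched without xarray. The helper `median_per_profile_bin` below is hypothetical and works on plain lists; skipping NaN values is an assumption about the desired behavior, not something stated for aggregate_vars.

```python
from collections import defaultdict
from statistics import median

def median_per_profile_bin(profile_numbers, depth_bins, values):
    """Hypothetical helper: median of `values` per (profile, depth bin)."""
    groups = defaultdict(list)
    for prof, dbin, val in zip(profile_numbers, depth_bins, values):
        # Assumption: missing measurements are skipped
        # (NaN is the only value that is not equal to itself).
        if val == val:
            groups[(prof, dbin)].append(val)
    return {key: median(vals) for key, vals in groups.items()}

medians = median_per_profile_bin(
    profile_numbers=[1, 1, 1, 2, 2],
    depth_bins=[0.0, 0.0, 5.0, 0.0, 0.0],
    values=[10.0, 12.0, 9.0, 11.0, float("nan")],
)
```

The result has one entry per (profile, bin) key, which corresponds to the (PROFILE_NUMBER, DEPTH_bin) shape of the returned ‘median_{var}’ variables rather than to the raw measurement axis.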

src.toolbox.utils.alignment.filter_xarray_by_profile_ids(ds: xarray.Dataset, profile_id_var: str, valid_ids: numpy.ndarray | list) → xarray.Dataset

Filter an xarray.Dataset to keep only rows / entries with profile IDs in valid_ids. Works for both raw (with N_MEASUREMENTS) and aggregated (PROFILE_NUMBER as a dim) datasets.

src.toolbox.utils.alignment.find_profile_pair_metadata(df_target: pandas.DataFrame, df_ancillary: pandas.DataFrame, target_name: str, ancillary_name: str, time_thresh_hr: float = 2.0, dist_thresh_km: float = 5.0) → pandas.DataFrame

Identify profile pairs between a target and ancillary glider within time/distance thresholds.

Parameters:
  • df_target (pd.DataFrame) – Summary dataframe for the target glider (from summarising_profiles()).

  • df_ancillary (pd.DataFrame) – Summary dataframe for the ancillary glider.

  • target_name (str) – Name of the target glider (used in output column names).

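  • ancillary_name (str) – Name of the ancillary glider (used in output column names).

  • time_thresh_hr (float) – Time threshold in hours for pairing profiles. Defaults to 2.0.

  • dist_thresh_km (float) – Distance threshold in kilometers for pairing profiles. Defaults to 5.0.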

Returns:

Matched profile pairs with columns:
  • [target_name]_PROFILE_NUMBER

  • [ancillary_name]_PROFILE_NUMBER

  • time_diff_hr

  • dist_km

Return type:

pd.DataFrame
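
Pairing by time and distance thresholds can be sketched as follows. This is an illustration only: `haversine_km` and `match_profiles` are hypothetical helpers, the haversine great-circle distance and the inclusive thresholds are assumptions, and the input is a list of (profile_number, time, lat, lon) tuples rather than the summary dataframes the function expects.

```python
import math
from datetime import datetime

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def match_profiles(target, ancillary, time_thresh_hr=2.0, dist_thresh_km=5.0):
    """Hypothetical matcher over lists of (profile_number, time, lat, lon)."""
    pairs = []
    for t_id, t_time, t_lat, t_lon in target:
        for a_id, a_time, a_lat, a_lon in ancillary:
            dt_hr = abs((t_time - a_time).total_seconds()) / 3600.0
            dist = haversine_km(t_lat, t_lon, a_lat, a_lon)
            # Assumption: both thresholds are inclusive upper bounds.
            if dt_hr <= time_thresh_hr and dist <= dist_thresh_km:
                pairs.append((t_id, a_id, dt_hr, dist))
    return pairs

target = [(1, datetime(2024, 1, 1, 12, 0), 60.00, -20.00)]
ancillary = [
    (7, datetime(2024, 1, 1, 13, 0), 60.01, -20.00),  # 1 h apart, roughly 1.1 km away
    (8, datetime(2024, 1, 1, 18, 0), 60.00, -20.00),  # 6 h apart, rejected on time
]
pairs = match_profiles(target, ancillary)
```

Each resulting tuple carries the same information as one output row: the two profile numbers, the time difference in hours, and the distance in kilometers.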

src.toolbox.utils.alignment.merge_pairs_from_filtered_aggregates(paired_df, agg_target: xarray.Dataset, agg_anc: xarray.Dataset, target_name: str, ancillary_name: str, variables, bin_dim: str = 'DEPTH_bin', pair_dim: str = 'PAIR_INDEX') → xarray.Dataset

Build a dataset with one row per (target_profile, ancillary_profile, depth_bin). Each row has: target/ancillary profile IDs, time/distance diffs, and median_{var} for both sides.

src.toolbox.utils.alignment.major_axis_r2_xr(x: xarray.DataArray, y: xarray.DataArray) → float

Compute R² (coefficient of determination) for Major Axis (Type II) regression using xarray.

Parameters:
  • x (xr.DataArray) – First variable (e.g., target glider).

  • y (xr.DataArray) – Second variable (e.g., ancillary glider).

Returns:

R² value, or NaN if fewer than 2 valid observations.

Return type:

float
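
For a symmetric (Type II) fit, R² is commonly reported as the squared Pearson correlation, which does not depend on which variable is treated as x. Whether major_axis_r2_xr uses exactly this definition is an assumption; the sketch below, with the hypothetical helper `squared_correlation`, only illustrates that convention together with the NaN handling described above.

```python
import math

def squared_correlation(x, y):
    """Hypothetical helper: squared Pearson r over pairs where both values are finite."""
    pairs = [(a, b) for a, b in zip(x, y) if math.isfinite(a) and math.isfinite(b)]
    if len(pairs) < 2:
        return float("nan")  # fewer than 2 valid observations
    mx = sum(a for a, _ in pairs) / len(pairs)
    my = sum(b for _, b in pairs) / len(pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")  # a constant series has no defined correlation
    return sxy * sxy / (sxx * syy)

# The pair containing NaN is dropped; the remaining points lie on y = 2x.
r2 = squared_correlation([1.0, 2.0, float("nan"), 3.0], [2.0, 4.0, 5.0, 6.0])
```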

src.toolbox.utils.alignment.compute_r2_for_merged_profiles_xr(ds: xarray.Dataset, variables: list[str], target_name: str, ancillary_name: str) → xarray.Dataset

Compute R² for each profile-pair in an xarray.Dataset, and append the results directly to the dataset.

Parameters:
  • ds (xr.Dataset) – Dataset with one row per (PAIR_INDEX, DEPTH_bin), and aligned variables for target and ancillary gliders.

  • variables (list of str) – List of variable base names (e.g., [“salinity”, “temperature”]).

  • target_name (str) – Name of the target glider (used in suffix: _TARGET_{name}).

  • ancillary_name (str) – Name of the ancillary glider (used in suffix: _{name}).

Returns:

Same dataset with new variables: r2_{var}_{ancillary_name}, one per profile pair. These will be aligned with the “PAIR_INDEX” dimension only.

Return type:

xr.Dataset

src.toolbox.utils.alignment.plot_r2_heatmaps_per_pair(r2_datasets, variables, target_name=None, r2_thresholds=[0.99, 0.95, 0.9, 0.85, 0.8, 0.75, 0.7], time_thresh_hr=None, dist_thresh_km=None, figsize=(9, 6), save_plots=False, output_path=None, show_plots=True)

Create ONE heatmap per ancillary pairing showing counts of unique ancillary profiles that meet R² thresholds for each variable. Optionally filter by time/dist thresholds.

src.toolbox.utils.alignment.collect_xy_from_r2_ds(r2_ds, var, target_name, ancillary_name, r2_min=None, time_max=None, dist_max=None)

From an R² dataset (one ancillary), collect all (x=ancillary, y=target) binned points for a given variable across selected profile pairs, flattening PAIR_INDEX×DEPTH_bin.

Returns: x, y (1D np.ndarrays)

src.toolbox.utils.alignment.fit_linear_map(x, y)

Fit y = slope * x + intercept using sklearn LinearRegression. Returns dict with slope, intercept, R² (model score), and n.
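
The fit itself is ordinary least squares, so its result can be reproduced in closed form. The sketch below deliberately swaps sklearn's LinearRegression for the textbook formulas to stay dependency-free; the helper name `fit_line` and the dictionary keys are assumptions, not the function's documented output.

```python
def fit_line(x, y):
    """Hypothetical closed-form OLS fit of y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # R² = 1 - SS_res / SS_tot, the usual coefficient of determination.
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return {"slope": slope, "intercept": intercept, "r2": 1 - ss_res / ss_tot, "n": n}

# x = ancillary values, y = target values; the points lie exactly on y = 2x + 1.
fit = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])

# Applying the fitted map to new ancillary values (slope * value + intercept).
corrected = [fit["slope"] * v + fit["intercept"] for v in [4.0, 5.0]]
```

The last line applies the same y = slope * x + intercept mapping elementwise, here on a plain list instead of an xarray.DataArray.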

src.toolbox.utils.alignment.plot_pair_scatter_grid(r2_datasets, variables, target_name, variable_r2_criteria=None, max_time_hr=None, max_dist_km=None, ancillaries_order=None, figsize=None, point_alpha=0.6, point_size=8, equal_axes=True)

Grid of scatter plots: rows = ancillary, cols = variables. Each panel: ancillary median_{var} vs target median_{var}, all depth bins from pairs that pass R²/time/distance filters. Plots 1:1 and fitted line.

src.toolbox.utils.alignment.apply_linear_map_to_da(da, slope, intercept, out_name=None)

Apply y = slope * da + intercept to an xarray.DataArray. Returns a new DataArray (optionally renamed).