src.toolbox.utils.alignment

Functions

`interpolate_DEPTH`(→ xarray.Dataset)	Interpolate missing DEPTH values per PROFILE_NUMBER, normalize depth sign, build regular depth bins, and (optionally) visualize the bin coverage.
`aggregate_vars`(→ xarray.Dataset)	Compute medians per (PROFILE_NUMBER, DEPTH_bin) and return a NEW dataset with variables named ‘median_{var}’.
`filter_xarray_by_profile_ids`(→ xarray.Dataset)	Filter an xarray.Dataset to keep only rows / entries with profile IDs in valid_ids.
`find_profile_pair_metadata`(→ pandas.DataFrame)	Identify profile pairs between a target and ancillary glider within time/distance thresholds.
`merge_pairs_from_filtered_aggregates`(→ xarray.Dataset)	Build a dataset with one row per (target_profile, ancillary_profile, depth_bin).
`major_axis_r2_xr`(→ float)	Compute R² (coefficient of determination) for Major Axis (Type II) regression using xarray.
`compute_r2_for_merged_profiles_xr`(→ xarray.Dataset)	Compute R² for each profile-pair in an xarray.Dataset, and append the results directly to the dataset.
`plot_r2_heatmaps_per_pair`(r2_datasets, variables[, ...])	Create ONE heatmap per ancillary pairing showing counts of unique ancillary profiles that meet R² thresholds for each variable.
`collect_xy_from_r2_ds`(r2_ds, var, target_name, ...[, ...])	From an R² dataset (one ancillary), collect all (x=ancillary, y=target) binned points for a given variable across selected profile pairs.
`fit_linear_map`(x, y)	Fit y = slope * x + intercept using sklearn LinearRegression.
`plot_pair_scatter_grid`(r2_datasets, variables, target_name)	Grid of scatter plots: rows = ancillary, cols = variables.
`apply_linear_map_to_da`(da, slope, intercept[, out_name])	Apply y = slope * da + intercept to an xarray.DataArray.

Module Contents

src.toolbox.utils.alignment.interpolate_DEPTH(ds: xarray.Dataset, bin_size: int = 5, depth_positive: str = 'down', plot: bool = False, plot_kind: str = 'hist', max_profiles_heatmap: int = 50, figsize=(9, 4), save_path: str | None = None, show: bool = True) → xarray.Dataset

Interpolate missing DEPTH values per PROFILE_NUMBER, normalize depth sign, build regular depth bins, and (optionally) visualize the bin coverage.

Parameters:

ds (xr.Dataset) – Must contain ‘DEPTH’, ‘TIME’, ‘PROFILE_NUMBER’ on the ‘N_MEASUREMENTS’ axis.
bin_size (int) – Depth bin size in the same units as DEPTH (e.g., meters).
depth_positive ({"down","up"}) – Normalize depth sign before binning. “down” => positive increasing with depth.
plot (bool) – If True, draw a quick visualization of the resulting depth bins.
plot_kind ({"hist","heatmap"}) –
- “hist”: overall histogram of measurement counts per bin (barh).
- “heatmap”: per-profile vs bin coverage (first max_profiles_heatmap profiles).
max_profiles_heatmap (int) – Max number of profiles to include in the heatmap (to avoid huge plots).
figsize (tuple) – Matplotlib figure size for the plot.
save_path (str or None) – If provided, save the plot to this path.
show (bool) – If True, plt.show() the figure; otherwise close it.

Returns:

Dataset with the following variables:

DEPTH_interpolated (float64, N_MEASUREMENTS)
DEPTH_bin (float32, N_MEASUREMENTS)
DEPTH_range (str, N_MEASUREMENTS) e.g. “50–55”

Return type:

xr.Dataset
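The steps described for `interpolate_DEPTH` can be sketched on plain NumPy arrays. This is a minimal illustration, not the library code: the helper name `interpolate_and_bin_depth`, interpolation against TIME, the sign handling, and labelling each bin by its lower edge are all assumptions made for the example.

```python
import numpy as np

def interpolate_and_bin_depth(depth, time, profile, bin_size=5, depth_positive="down"):
    """Hypothetical stand-in for the steps described above (not the library code)."""
    depth = np.asarray(depth, dtype=float)
    time = np.asarray(time, dtype=float)
    profile = np.asarray(profile)
    filled = depth.copy()

    # Fill gaps one profile at a time so values never leak across profiles.
    for pid in np.unique(profile):
        idx = np.flatnonzero(profile == pid)
        good = np.isfinite(depth[idx])
        if good.sum() < 2:
            continue  # not enough points to interpolate; leave the gaps in place
        t_good = time[idx][good]
        d_good = depth[idx][good]
        order = np.argsort(t_good)  # np.interp requires increasing x values
        # Note: np.interp holds the edge value outside the sampled time range.
        filled[idx] = np.interp(time[idx], t_good[order], d_good[order])

    # Assumed sign convention: magnitudes for "down", negated magnitudes for "up".
    magnitude = np.abs(filled)
    filled = magnitude if depth_positive == "down" else -magnitude

    # Assumed binning: each sample is labelled with the lower edge of its bin.
    bin_lower = np.floor(magnitude / bin_size) * bin_size
    labels = [
        "" if np.isnan(lo) else f"{int(lo)}–{int(lo + bin_size)}"
        for lo in bin_lower
    ]
    return filled, bin_lower.astype(np.float32), labels

# Two short profiles, each with one missing depth value.
depth = [0.0, np.nan, 10.0, 52.0, np.nan, 58.0]
time = [0, 1, 2, 3, 4, 5]
profile = [1, 1, 1, 2, 2, 2]

depth_interp, depth_bin, depth_range = interpolate_and_bin_depth(depth, time, profile)
print(depth_interp)
print(depth_bin)
print(depth_range)
```

The three returned arrays play the roles of DEPTH_interpolated, DEPTH_bin, and DEPTH_range; the real function attaches them to the dataset along N_MEASUREMENTS.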

src.toolbox.utils.alignment.aggregate_vars(ds: xarray.Dataset, vars_to_aggregate, profile_dim='PROFILE_NUMBER', bin_dim='DEPTH_bin') → xarray.Dataset

Compute medians per (PROFILE_NUMBER, DEPTH_bin) and return a NEW dataset with variables named ‘median_{var}’, each shaped (PROFILE_NUMBER, DEPTH_bin).

Notes:

- This does NOT attach medians back to the raw dataset (which would broadcast onto N_MEASUREMENTS). Use the returned dataset for downstream steps.
- Requires that both PROFILE_NUMBER and DEPTH_bin are coordinates aligned to the measurement dimension (typically ‘N_MEASUREMENTS’).
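To illustrate the (PROFILE_NUMBER, DEPTH_bin) output shape, the following sketch computes per-cell medians with NumPy. The helper `median_per_profile_bin` is hypothetical, and skipping NaN values and leaving empty cells as NaN are assumptions rather than documented behaviour.

```python
import numpy as np

def median_per_profile_bin(values, profile, depth_bin):
    """Hypothetical helper: median of `values` for every (profile, bin) cell."""
    values = np.asarray(values, dtype=float)
    profile = np.asarray(profile)
    depth_bin = np.asarray(depth_bin)

    profiles = np.unique(profile)
    bins = np.unique(depth_bin)
    # Cells with no data stay NaN (assumed behaviour).
    out = np.full((profiles.size, bins.size), np.nan)

    for i, p in enumerate(profiles):
        for j, b in enumerate(bins):
            cell = values[(profile == p) & (depth_bin == b)]
            cell = cell[np.isfinite(cell)]  # ignore missing measurements
            if cell.size:
                out[i, j] = np.median(cell)
    return profiles, bins, out

# Six raw measurements collapse to a 2 x 2 grid of medians.
temperature = [10.0, 12.0, 14.0, 20.0, 30.0, np.nan]
profile = [1, 1, 1, 1, 2, 2]
depth_bin = [0, 0, 0, 5, 0, 5]

profiles, bins, median_temperature = median_per_profile_bin(temperature, profile, depth_bin)
print(median_temperature)
```

The result has one row per profile and one column per depth bin, which is why it cannot be attached back to the measurement dimension without broadcasting.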

src.toolbox.utils.alignment.filter_xarray_by_profile_ids(ds: xarray.Dataset, profile_id_var: str, valid_ids: numpy.ndarray | list) → xarray.Dataset

Filter an xarray.Dataset to keep only rows / entries with profile IDs in valid_ids. Works for both raw (with N_MEASUREMENTS) and aggregated (PROFILE_NUMBER as a dim) datasets.
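The filtering idea reduces to a boolean membership mask. A minimal sketch on plain arrays, assuming the mask is built with `np.isin`; the helper `keep_valid_profiles` is hypothetical and does not reproduce how the function handles the raw and aggregated cases.

```python
import numpy as np

def keep_valid_profiles(profile_ids, values, valid_ids):
    """Hypothetical helper: keep entries whose profile ID is in valid_ids."""
    profile_ids = np.asarray(profile_ids)
    mask = np.isin(profile_ids, valid_ids)  # True where the ID is allowed
    return profile_ids[mask], np.asarray(values)[mask]

profile_number = [1, 1, 2, 2, 3]
salinity = [35.0, 35.1, 34.8, 34.9, 35.3]

kept_ids, kept_salinity = keep_valid_profiles(profile_number, salinity, valid_ids=[1, 3])
print(kept_ids)
print(kept_salinity)
```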

src.toolbox.utils.alignment.find_profile_pair_metadata(df_target: pandas.DataFrame, df_ancillary: pandas.DataFrame, target_name: str, ancillary_name: str, time_thresh_hr: float = 2.0, dist_thresh_km: float = 5.0) → pandas.DataFrame

Identify profile pairs between a target and ancillary glider within time/distance thresholds.

Parameters:

df_target (pd.DataFrame) – Summary dataframe for the target glider (from summarising_profiles()).
df_ancillary (pd.DataFrame) – Summary dataframe for the ancillary glider.
target_name (str) – Name of the target glider (used in output column names).
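ancillary_name (str) – Name of the ancillary glider (used in output column names).
time_thresh_hr (float) – Maximum time difference between paired profiles, in hours. Defaults to 2.0.
dist_thresh_km (float) – Maximum distance between paired profiles, in kilometers. Defaults to 5.0.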

Returns:

Matched profile pairs with columns:
- [target_name]_PROFILE_NUMBER
- [ancillary_name]_PROFILE_NUMBER
- time_diff_hr
- dist_km

Return type:

pd.DataFrame
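The pairing logic can be sketched as a brute-force comparison of every target profile against every ancillary profile. Several details here are assumptions: the record keys (`profile`, `time_hr`, `lat`, `lon`), the haversine great-circle distance, and inclusive thresholds. The actual function works on the summary dataframes and may compute distance differently.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def find_pairs(target, ancillary, time_thresh_hr=2.0, dist_thresh_km=5.0):
    """Hypothetical helper: return (target_id, ancillary_id, time_diff_hr, dist_km)."""
    pairs = []
    for t in target:
        for a in ancillary:
            time_diff_hr = abs(t["time_hr"] - a["time_hr"])
            dist_km = haversine_km(t["lat"], t["lon"], a["lat"], a["lon"])
            # Keep the pair only if it is close in both time and space.
            if time_diff_hr <= time_thresh_hr and dist_km <= dist_thresh_km:
                pairs.append((t["profile"], a["profile"], time_diff_hr, dist_km))
    return pairs

target = [{"profile": 1, "time_hr": 0.0, "lat": 60.00, "lon": -20.0}]
ancillary = [
    {"profile": 7, "time_hr": 1.0, "lat": 60.01, "lon": -20.0},  # close in time and space
    {"profile": 8, "time_hr": 5.0, "lat": 60.00, "lon": -20.0},  # too late
    {"profile": 9, "time_hr": 0.5, "lat": 60.50, "lon": -20.0},  # too far away
]

pairs = find_pairs(target, ancillary)
print(pairs)
```

Only the first ancillary profile survives both thresholds, so the result contains a single pair.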

src.toolbox.utils.alignment.merge_pairs_from_filtered_aggregates(paired_df, agg_target: xarray.Dataset, agg_anc: xarray.Dataset, target_name: str, ancillary_name: str, variables, bin_dim: str = 'DEPTH_bin', pair_dim: str = 'PAIR_INDEX') → xarray.Dataset

Build a dataset with one row per (target_profile, ancillary_profile, depth_bin). Each row has: target/ancillary profile IDs, time/distance diffs, and median_{var} for both sides.

src.toolbox.utils.alignment.major_axis_r2_xr(x: xarray.DataArray, y: xarray.DataArray) → float

Compute R² (coefficient of determination) for Major Axis (Type II) regression using xarray.

Parameters:

x (xr.DataArray) – First variable (e.g., target glider).
y (xr.DataArray) – Second variable (e.g., ancillary glider).

Returns:

R² value, or NaN if fewer than 2 valid observations.

Return type:

float
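The source does not state how R² is defined for the Major Axis fit. The sketch below assumes one common convention, the squared Pearson correlation, which is symmetric in x and y and therefore suits a Type II setting. It also returns the standard major-axis slope for reference. The helper `major_axis_fit` is hypothetical and works on NumPy arrays rather than xarray objects.

```python
import numpy as np

def major_axis_fit(x, y):
    """Hypothetical helper: return (r2, slope) for a Major Axis fit.

    r2 is taken as the squared Pearson correlation (an assumption).
    Returns (nan, nan) with fewer than 2 valid pairs or zero variance.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    ok = np.isfinite(x) & np.isfinite(y)  # keep pairs where both sides are valid
    x, y = x[ok], y[ok]
    if x.size < 2:
        return np.nan, np.nan

    sxx = np.var(x, ddof=1)
    syy = np.var(y, ddof=1)
    sxy = np.cov(x, y, ddof=1)[0, 1]
    if sxx == 0 or syy == 0:
        return np.nan, np.nan

    r2 = sxy ** 2 / (sxx * syy)
    # Major-axis slope: direction of the first principal axis of the (x, y) scatter.
    if sxy == 0:
        slope = np.nan
    else:
        slope = (syy - sxx + np.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    return float(r2), float(slope)

r2_perfect, slope_perfect = major_axis_fit([1, 2, 3, 4], [2, 4, 6, 8])
r2_short, slope_short = major_axis_fit([1.0, np.nan], [2.0, 3.0])
print(r2_perfect, slope_perfect)
print(r2_short, slope_short)
```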

src.toolbox.utils.alignment.compute_r2_for_merged_profiles_xr(ds: xarray.Dataset, variables: list[str], target_name: str, ancillary_name: str) → xarray.Dataset

Compute R² for each profile-pair in an xarray.Dataset, and append the results directly to the dataset.

Parameters:

ds (xr.Dataset) – Dataset with one row per (PAIR_INDEX, DEPTH_bin), and aligned variables for target and ancillary gliders.
variables (list of str) – List of variable base names (e.g., [“salinity”, “temperature”]).
target_name (str) – Name of the target glider (used in suffix: _TARGET_{name}).
ancillary_name (str) – Name of the ancillary glider (used in suffix: _{name}).

Returns:

Same dataset with new variables: r2_{var}_{ancillary_name}, one per profile pair. These will be aligned with the “PAIR_INDEX” dimension only.

Return type:

xr.Dataset

src.toolbox.utils.alignment.plot_r2_heatmaps_per_pair(r2_datasets, variables, target_name=None, r2_thresholds=[0.99, 0.95, 0.9, 0.85, 0.8, 0.75, 0.7], time_thresh_hr=None, dist_thresh_km=None, figsize=(9, 6), save_plots=False, output_path=None, show_plots=True)

Create ONE heatmap per ancillary pairing showing counts of unique ancillary profiles that meet R² thresholds for each variable. Optionally filter by time/distance thresholds.

src.toolbox.utils.alignment.collect_xy_from_r2_ds(r2_ds, var, target_name, ancillary_name, r2_min=None, time_max=None, dist_max=None)

From an R² dataset (one ancillary), collect all (x=ancillary, y=target) binned points for a given variable across selected profile pairs, flattening PAIR_INDEX×DEPTH_bin.

Returns: x, y (1D np.ndarrays)
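The selection and flattening step can be pictured with 2D arrays shaped (PAIR_INDEX, DEPTH_bin). This sketch assumes that pairs are kept when they meet every supplied filter and that points with a NaN on either side are dropped. All array names are illustrative, and only the R² and time filters are shown.

```python
import numpy as np

# Binned medians, one row per pair and one column per depth bin (illustrative data).
ancillary_median = np.array([[1.0, 2.0, np.nan],
                             [4.0, 5.0, 6.0],
                             [7.0, 8.0, 9.0]])
target_median = np.array([[1.1, 2.1, 3.1],
                          [4.1, 5.1, 6.1],
                          [7.1, 8.1, 9.1]])

# Per-pair metadata along PAIR_INDEX only.
r2 = np.array([0.95, 0.60, 0.99])
time_diff_hr = np.array([1.0, 0.5, 3.0])

r2_min, time_max = 0.9, 2.0
keep = (r2 >= r2_min) & (time_diff_hr <= time_max)  # pair-level filter

# Flatten PAIR_INDEX x DEPTH_bin for the selected pairs.
x = ancillary_median[keep].ravel()
y = target_median[keep].ravel()

# Drop points where either side is missing (assumed behaviour).
finite = np.isfinite(x) & np.isfinite(y)
x, y = x[finite], y[finite]
print(x, y)
```

Only the first pair passes both filters, and its third depth bin is dropped because the ancillary value is missing.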

src.toolbox.utils.alignment.fit_linear_map(x, y)

Fit y = slope * x + intercept using sklearn LinearRegression. Returns dict with slope, intercept, R² (model score), and n.
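An ordinary least-squares fit of this form can be reproduced without scikit-learn. The sketch below swaps in `np.polyfit` for `LinearRegression` and computes R² as 1 − SS_res / SS_tot, which is the same quantity as the model score. The dictionary keys are assumptions, since the source does not list them.

```python
import numpy as np

def fit_linear_map_sketch(x, y):
    """Hypothetical stand-in: OLS fit of y = slope * x + intercept via np.polyfit."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)  # degree-1 fit: highest power first

    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else np.nan

    # Key names are illustrative, not taken from the library.
    return {"slope": float(slope), "intercept": float(intercept),
            "r2": float(r2), "n": int(x.size)}

ancillary = [0.0, 1.0, 2.0, 3.0]
target = [1.0, 3.0, 5.0, 7.0]

fit = fit_linear_map_sketch(ancillary, target)
print(fit)

# Applying the map to ancillary values puts them on the target's scale.
mapped = fit["slope"] * np.asarray(ancillary) + fit["intercept"]
print(mapped)
```

The last two lines show the same arithmetic that `apply_linear_map_to_da` describes, applied here to a NumPy array instead of a DataArray.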

src.toolbox.utils.alignment.plot_pair_scatter_grid(r2_datasets, variables, target_name, variable_r2_criteria=None, max_time_hr=None, max_dist_km=None, ancillaries_order=None, figsize=None, point_alpha=0.6, point_size=8, equal_axes=True)

Grid of scatter plots: rows = ancillary, cols = variables. Each panel shows ancillary median_{var} vs target median_{var} for all depth bins from pairs that pass the R²/time/distance filters. Plots the 1:1 line and the fitted line.

src.toolbox.utils.alignment.apply_linear_map_to_da(da, slope, intercept, out_name=None)

Apply y = slope * da + intercept to an xarray.DataArray. Returns a new DataArray (optionally renamed).