Skip to content

refactor(assets): centralize asset upload and compare truth translation#616

Merged
LinoGiger merged 1 commit into
mainfrom
refactor(assets)/centralize-asset-upload-and-compare-truth
Jun 11, 2026
Merged

refactor(assets): centralize asset upload and compare truth translation#616
LinoGiger merged 1 commit into
mainfrom
refactor(assets)/centralize-asset-upload-and-compare-truth

Conversation

@RapidPoseidon

Copy link
Copy Markdown
Contributor

What

Asset upload + IAssetInput construction was duplicated across six call sites (datapoint uploader, validation rapid uploader, audience example handler ×3, benchmark participant), each re-implementing the same "if media: upload then wrap as ExistingAssetInput/MultiAssetInput, else wrap as TextAssetInput" branching. This PR centralizes it and fixes a real bug in the audience flow. No public API changes.

1. Single entry point for "asset → IAssetInput"

AssetUploader.build_asset_input(asset, data_type) and build_asset_input_with_names(...) (which also returns the original-path → uploaded-name map needed for compare-truth translation). All six call sites now use them; media_context call sites keep using upload_and_map_asset.

2. Shared, strict compare-truth translation

ValidationRapidUploader._translate_compare_truth moved to datapoints/_truth_translator.py as a module-level function, handling both CompareTruth and MultiCompareTruth. Behavior change: if a referenced asset is not in the name map on a media rapid, it now raises a clear ValueError instead of silently shipping the raw path/URL as winnerId to the backend.

3. Bug fix: audience compare-truth examples

AudienceExampleHandler._add_rapid_example never translated compare truths at all — media compare rapids added as audience examples sent the caller's raw local path/URL as winnerId instead of the uploaded asset name. It now runs the shared translation before the IValidationTruthModelIExampleTruth dict conversion.

4. Perf: validation sets pre-upload assets

ValidationSetManager._submit now collects every media asset (including media_context) across all rapids and runs AssetUploadOrchestrator.upload_all_assets before the per-rapid loop. Since AssetUploader caches by path/URL, the per-rapid uploads become cache hits — validation sets get the same batched-URL + parallel-file upload path the dataset flow already has. Pre-upload failures don't abort the loop; the per-rapid upload still raises so the existing failed_rapids reporting keeps working.

5. Dead code removal

RapidataOrderManager and RapidataJobManager constructed an AssetUploader they never used — field and imports removed.

Validation

  • uv run pyright src/rapidata/rapidata_client → 0 errors, 0 warnings
  • uv run python -c "from rapidata import RapidataClient" → OK

🔗 Session: session-916a26db

🤖 Generated with Claude Code

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: lino <lino@rapidata.ai>
@claude

claude Bot commented Jun 10, 2026

Copy link
Copy Markdown

Code Review

Overview

This PR centralizes a repeated "upload media asset / wrap text" pattern across six call sites into two methods on AssetUploader, extracts translate_compare_truth into a shared module, fixes a real audience compare-truth bug, adds pre-upload batching for validation sets, and removes unused dead code. The scope is well-chosen and the changes are internally consistent.


Positives

  • Bug fix is real and important. AudienceExampleHandler._add_rapid_example was silently shipping raw local paths as winnerId for media compare rapids. That would produce corrupted validation data with no error — exactly the kind of silent data bug that's hard to catch in review.
  • Validation order improved in add_compare_example. Checks (truth not in datapoint, len(datapoint) != 2) are now performed before the upload, avoiding wasted network calls on bad input.
  • Duplicate-entry handling in build_asset_input_with_names (the re-index comment) is correct and well-documented.
  • Literal["media", "text"] type annotation on the new methods is consistent with the project's typing style.
  • Dead code removal in RapidataOrderManager / RapidataJobManager is clean with no risk.

Issues / Suggestions

1. Behavior change: translate_compare_truth now raises instead of silently passing through

The old _translate_compare_truth used asset_to_uploaded.get(asset_id, asset_id) — silently using the raw path as the winner ID if the asset wasn't found. The new version raises ValueError.

This is strictly better behavior (fail fast beats silent corruption), but it is a breaking behavior change that isn't flagged in the PR description. If any existing caller constructs a CompareTruth with an asset path that isn't in the rapid's own asset list, they'll now get an unhandled exception instead of a silently wrong result. Worth a note in the PR description, and consider whether it needs a mention in the changelog.

2. Pre-upload failures are silently discarded

In validation_set_manager.py:

if assets_to_upload:
    AssetUploadOrchestrator(self._openapi_service).upload_all_assets(
        assets_to_upload
    )

upload_all_assets returns list[FailedUpload[str]] but the return value is discarded. The PR description explains that per-rapid uploads still raise so failed_rapids reporting keeps working — which is correct — but the pre-upload failure count is lost entirely. If 50 of 100 assets fail in the batch step, you'll see 100 entries in failed_rapids rather than a cleaner "50 assets failed to pre-upload" message. Acceptable as-is, but worth a # noqa comment or a brief explanation comment so the discard looks intentional to future readers.

3. AssetUploadOrchestrator is constructed on each _submit call

AssetUploadOrchestrator(self._openapi_service).upload_all_assets(assets_to_upload)

The AssetUploader inside AssetUploadOrchestrator has class-level caches, so this is functionally correct — the new instance shares the same cache state as the one inside ValidationRapidUploader. But it constructs a BatchAssetUploader on every call too. Low impact; could be a stored field on ValidationSetManager to make the intent clearer.

4. media_context pre-upload assumes list type

if rapid.media_context:
    assets_to_upload.update(rapid.media_context)

If rapid.media_context could ever be a str (not list[str]), update would iterate characters. From reading the codebase this appears to always be a list[str], but worth a quick type-annotation check on Rapid.media_context to confirm there's no legacy str path.


Verdict

Approve with minor notes. The core refactoring is clean, the bug fix is correct, and the pre-upload batching for validation sets is a genuine win. The main thing worth flagging before merge is the ValueError-on-missing-asset behavior change — it's the right call, but callers who previously relied on the silent pass-through may now see unhandled exceptions, so it deserves a mention in the changelog or PR description.

@LinoGiger LinoGiger merged commit 5b7e81f into main Jun 11, 2026
3 checks passed
@LinoGiger LinoGiger deleted the refactor(assets)/centralize-asset-upload-and-compare-truth branch June 11, 2026 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants