System Scope>Backend service for booking physical cricket practice nets with payment, authentication, and booking lifecycle management.>Primary objective: guarantee data correctness under concurrency, retries, and external callback uncertainty.>Designed and deployed as a continuously running service, not a demo-only CRUD application.
Failure Modes Addressed>Duplicate booking attempts from double taps, retries, refreshes, and callback replays.>Slot contention where two users attempt the same net and timeslot simultaneously.>Partial-failure paths such as payment success with API timeout and server restarts during active requests.>Dependency outage scenarios where Redis is unavailable but correctness must remain intact.
Correctness Model>Mandatory idempotency keys convert request retries into intent replay instead of duplicate writes.>Layered write protection combines distributed lock scopes, pessimistic DB reads, and PostgreSQL overlap exclusion constraints.>Transactions are intentionally narrow and only wrap state mutation boundaries to avoid long-held locks.>Booking lifecycle is modeled as state transitions (PENDING, CONFIRMED, EXPIRED) to survive asynchronous flows.
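The idempotency and lifecycle points above can be illustrated with a minimal sketch. It is not the service's actual code: the class names (`Booking`, `IdempotentBookings`), the in-memory map standing in for a unique-keyed table, and the transition rule (only PENDING may change state) are all assumptions made for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// States come from the text; the allowed transitions below are an assumption.
enum BookingState {
    PENDING, CONFIRMED, EXPIRED;

    boolean canMoveTo(BookingState next) {
        // Assumed rule: only PENDING may change; CONFIRMED and EXPIRED are terminal.
        return this == PENDING && next != PENDING;
    }
}

// Hypothetical immutable booking value used only for this illustration.
final class Booking {
    final long id;
    final String slot;
    final BookingState state;

    Booking(long id, String slot, BookingState state) {
        this.id = id;
        this.slot = slot;
        this.state = state;
    }

    Booking moveTo(BookingState next) {
        if (!state.canMoveTo(next)) {
            throw new IllegalStateException(state + " -> " + next + " is not allowed");
        }
        return new Booking(id, slot, next);
    }
}

// In-memory stand-in for a table with a unique constraint on the idempotency key.
final class IdempotentBookings {
    private final Map<String, Booking> byKey = new ConcurrentHashMap<>();
    private final AtomicLong ids = new AtomicLong();

    Booking create(String idempotencyKey, String slot) {
        // computeIfAbsent applies the creator at most once per key,
        // so a retry with the same key replays the stored result.
        return byKey.computeIfAbsent(idempotencyKey,
                k -> new Booking(ids.incrementAndGet(), slot, BookingState.PENDING));
    }
}

final class IdempotencyDemo {
    public static void main(String[] args) {
        IdempotentBookings bookings = new IdempotentBookings();
        Booking first = bookings.create("key-1", "net-3@18:00");
        Booking retry = bookings.create("key-1", "net-3@18:00");
        // Both calls resolve to the same booking id: no duplicate write.
        System.out.println(first.id == retry.id);
        System.out.println(first.moveTo(BookingState.CONFIRMED).state);
    }
}
```

In a real deployment the map would be replaced by a database unique constraint, so that replay also survives restarts and works across multiple instances.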
Payment and Security Boundary>Public payment callbacks are verified using HMAC signature checks, CIDR allowlist validation, timestamp freshness, and replay-dedupe keys.>Payment creation and confirmation use layered idempotency with unique-key collision recovery to return existing state safely.>Refresh session replay risk is reduced through hashed token storage, rotation on use, and session family revocation behavior.>Role-aware rate limiting applies IP-based keys for public routes and user-based keys for authenticated APIs.
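As a rough sketch of the callback checks listed above, the following combines timestamp freshness, an HMAC-SHA256 signature compared in constant time, and a replay-dedupe set. The signed payload format (`timestamp.body`), the hex encoding, the `CallbackVerifier` name, and the in-memory dedupe set are assumptions, not the payment provider's real contract. The CIDR allowlist check is left out for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.time.Duration;
import java.time.Instant;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical verifier; the field layout and payload format are illustrative.
final class CallbackVerifier {
    private final byte[] secret;
    private final Duration maxSkew;
    // Stand-in for a shared replay-dedupe keyspace (for example Redis or a DB table).
    private final Set<String> seenEventIds = ConcurrentHashMap.newKeySet();

    CallbackVerifier(byte[] secret, Duration maxSkew) {
        this.secret = secret.clone();
        this.maxSkew = maxSkew;
    }

    // Computes the hex HMAC-SHA256 of "<timestamp>.<body>" (assumed format).
    String sign(long timestampSeconds, String body) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            byte[] raw = mac.doFinal(
                    (timestampSeconds + "." + body).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(raw.length * 2);
            for (byte b : raw) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC unavailable", e);
        }
    }

    boolean verify(String eventId, long timestampSeconds, String body,
                   String signatureHex, Instant now) {
        // 1. Freshness: reject callbacks outside the allowed clock skew.
        long age = Math.abs(now.getEpochSecond() - timestampSeconds);
        if (age > maxSkew.getSeconds()) {
            return false;
        }
        // 2. Signature: MessageDigest.isEqual compares in constant time.
        byte[] expected = sign(timestampSeconds, body).getBytes(StandardCharsets.UTF_8);
        byte[] actual = signatureHex.getBytes(StandardCharsets.UTF_8);
        if (!MessageDigest.isEqual(expected, actual)) {
            return false;
        }
        // 3. Replay dedupe: add() returns false when the id was already seen.
        return seenEventIds.add(eventId);
    }
}
```

The dedupe step runs last so that unsigned or stale requests cannot fill the replay set with forged event ids.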
Resilience and Recovery>Redis is treated as a performance accelerator, with degraded-mode fallbacks and fail-closed handling for critical keyspaces.>Scoped cache version invalidation and after-commit invalidation hooks reduce stale availability windows.>Notification delivery is hardened with retry queue processing, dead-letter handling, and scheduled recovery jobs.>Audit and integrity services provide before/after traceability plus restore and validation workflows for destructive admin operations.>Recurring booking generation runs under lock-protected scheduler paths with duplicate/conflict checks.
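One way to picture the degraded-mode split described above is a read helper that takes an explicit failure mode. This is a sketch under assumptions: `DegradedModeReader`, `FailureMode`, and `CriticalDependencyUnavailableException` are hypothetical names, and the suppliers stand in for a Redis read and a database read.

```java
import java.util.function.Supplier;

// Hypothetical policy switch: which keyspaces use which mode is a design decision.
enum FailureMode { FAIL_OPEN, FAIL_CLOSED }

// Hypothetical exception signalling that a critical keyspace could not be read.
final class CriticalDependencyUnavailableException extends RuntimeException {
    CriticalDependencyUnavailableException(String message, Throwable cause) {
        super(message, cause);
    }
}

final class DegradedModeReader {
    private DegradedModeReader() {}

    static <T> T read(Supplier<T> cacheRead, Supplier<T> databaseRead, FailureMode mode) {
        T cached;
        try {
            cached = cacheRead.get();
        } catch (RuntimeException cacheOutage) {
            if (mode == FailureMode.FAIL_CLOSED) {
                // Critical keyspace: refuse to continue rather than guess.
                throw new CriticalDependencyUnavailableException(
                        "cache unavailable for critical keyspace", cacheOutage);
            }
            // Non-critical keyspace: treat the outage like a cache miss.
            cached = null;
        }
        // Only the cache call is guarded, so database errors propagate unchanged.
        return cached != null ? cached : databaseRead.get();
    }
}

final class DegradedModeDemo {
    public static void main(String[] args) {
        Supplier<String> redisDown = () -> { throw new IllegalStateException("redis down"); };
        Supplier<String> database = () -> "availability-from-db";
        // Fail-open: slower, but still answered from the source of truth.
        System.out.println(DegradedModeReader.read(redisDown, database, FailureMode.FAIL_OPEN));
    }
}
```

Under this sketch, availability lookups would plausibly use `FAIL_OPEN`, while something like replay-dedupe keys would use `FAIL_CLOSED`; the mapping of real keyspaces to modes is not stated in the text.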
Platform and Runtime Engineering>Spring Security pipeline includes JWT filter, RBAC boundaries, refresh-session replay defense, and secure cookie/session controls.>Anti-abuse controls combine global sharded limiting, DoS threshold limiting, endpoint policies, and `Retry-After`-style throttling semantics.>Resilience stack combines Redisson-backed coordination, fail-open cache handling where safe, and fail-closed behavior for critical keyspaces.>Operational observability includes Actuator health, Prometheus metrics export, and scheduler-driven maintenance/recovery workflows.>Backend remains modular with clear domain boundaries (booking, payment, notifications, user, net, timeslot, security, integrity).
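The throttling semantics mentioned above can be sketched as a fixed-window limiter that reports how many seconds a client should wait, which a controller could surface as a `Retry-After` header. The algorithm choice (fixed window), the `FixedWindowLimiter` class, and the `user:`/`ip:` key prefixes are assumptions; the text does not say which limiting algorithm the service uses.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical single-node limiter; a shared store would be needed across instances.
final class FixedWindowLimiter {
    private static final class Window {
        final long startMillis;
        int count;

        Window(long startMillis) {
            this.startMillis = startMillis;
        }
    }

    private final int limit;
    private final long windowMillis;
    private final Map<String, Window> windows = new ConcurrentHashMap<>();

    FixedWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    // Authenticated calls are keyed by user, public calls by IP (as described in the text).
    static String keyFor(String userId, String ip) {
        return userId != null ? "user:" + userId : "ip:" + ip;
    }

    /** Returns 0 when the request is allowed, otherwise the seconds to wait. */
    long tryAcquire(String key, long nowMillis) {
        long[] retryAfterSeconds = {0};
        // compute() runs atomically per key, so the counter update is race-free.
        windows.compute(key, (k, w) -> {
            if (w == null || nowMillis - w.startMillis >= windowMillis) {
                w = new Window(nowMillis); // window elapsed: start a new one
            }
            if (w.count < limit) {
                w.count++;
            } else {
                long remainingMillis = w.startMillis + windowMillis - nowMillis;
                retryAfterSeconds[0] = (remainingMillis + 999) / 1000; // round up
            }
            return w;
        });
        return retryAfterSeconds[0];
    }
}

final class RateLimitDemo {
    public static void main(String[] args) {
        FixedWindowLimiter limiter = new FixedWindowLimiter(2, 10_000);
        String key = FixedWindowLimiter.keyFor(null, "203.0.113.7");
        long now = System.currentTimeMillis();
        for (int i = 0; i < 3; i++) {
            System.out.println("retry after (s): " + limiter.tryAcquire(key, now));
        }
    }
}
```

A non-zero return value maps naturally onto an HTTP 429 response with `Retry-After` set to that number of seconds.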
Deployment and Runtime>Self-hosted deployment uses Docker Compose with Nginx reverse proxy and environment-specific runtime profiles.>CI/CD flow runs through self-hosted GitHub runner automation (ARM64) and scripted rollout steps.>Production endpoints include Swagger UI and Actuator health checks to support operational visibility.
Hardening Roadmap>Replace destructive production schema behavior (`ddl-auto=create-drop`) with a strict migration-only policy.>Move payment callback URL wiring fully to environment-managed secure configuration with startup validation.>Expand end-to-end tests for callback security edge cases and Redis outage behavior on fail-closed paths.>Formalize the secret rotation lifecycle and SLO-driven alert thresholds for p95 latency and queue drain times.
Result>Duplicate requests return the same booking result.>Concurrent slot races cannot create overlapping confirmed bookings.>Payment callbacks remain replay-safe and verification-gated.>Redis outages reduce performance but not booking correctness.>The backend remains predictable under retry-heavy mobile traffic and partial dependency failures.