Systems

Backend engineering deep dives focused on design quality and correctness under failure.


Cricket Net Booking System

Purpose

  • A production-focused booking platform for correct reservations under concurrent traffic.
  • Built to prevent double bookings, duplicate payments, and incorrect states.

Architecture

  • Spring Boot service with PostgreSQL as the source of truth and Redis for caching, locking, and rate limiting.
  • Layered backend: HTTP controllers -> security/rate-limit layer -> domain services -> persistence.
  • External boundaries include payment gateway callbacks, SMTP, and an SMS notification provider.
  • Hosted in Docker containers behind an Nginx reverse proxy.

Data Safety

  • Transaction boundaries protect booking create/update operations from partial writes.
  • Idempotency keys ensure repeated create/payment callbacks return the same result.
  • Distributed locking and deterministic DB lock ordering protect concurrent slot writes.
  • PostgreSQL overlap exclusion constraints act as the final conflict safety net under race pressure.
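
The idempotency-key guarantee above (a repeated create or payment callback returns the original result rather than a second booking) can be sketched with an in-memory Python stand-in; the class name, dict-backed store, and booking fields are illustrative, not the production Redis/PostgreSQL implementation:

```python
import threading

class IdempotentBookingService:
    """Minimal sketch of the idempotency-key pattern (illustrative names)."""

    def __init__(self):
        self._results = {}            # idempotency_key -> stored result
        self._lock = threading.Lock()
        self._next_id = 0

    def create_booking(self, idempotency_key, slot):
        with self._lock:
            # A retry with the same key gets the original result back
            # instead of creating a second booking.
            if idempotency_key in self._results:
                return self._results[idempotency_key]
            self._next_id += 1
            result = {"booking_id": self._next_id, "slot": slot}
            self._results[idempotency_key] = result
            return result
```

In production the key-to-result mapping would live in Redis or the database inside the same transaction as the booking write, so the check and the insert are atomic.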

Security and Abuse Controls

  • Public payment callbacks are verified through HMAC signature checks, CIDR allowlist validation, and timestamp freshness checks.
  • Replay defense uses Redis dedupe keys to safely ignore repeated callback payloads.
  • Rate limiting is layered: global sharded limiter, DoS threshold limiter, and endpoint-level policies with Retry-After signaling.
  • Refresh-session design uses rotation and replay-aware revocation to reduce token hijack risk.
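
The callback verification above (HMAC signature plus timestamp freshness) can be sketched with the Python standard library; the `timestamp.payload` signing convention and the 300-second freshness window are assumptions for illustration, not the gateway's actual scheme:

```python
import hmac
import hashlib
import time

ALLOWED_SKEW_SECONDS = 300  # illustrative freshness window

def verify_callback(secret: bytes, payload: bytes,
                    timestamp: str, signature: str) -> bool:
    """Reject stale callbacks, then compare an HMAC-SHA256 signature
    in constant time (hmac.compare_digest resists timing attacks)."""
    try:
        ts = int(timestamp)
    except ValueError:
        return False
    if abs(time.time() - ts) > ALLOWED_SKEW_SECONDS:
        return False  # stale or future-dated callback
    expected = hmac.new(secret, timestamp.encode() + b"." + payload,
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

Freshness checking bounds the replay window; the Redis dedupe keys mentioned above then close it completely within that window.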

Failure Behavior

  • If Redis fails, the system enters a degraded mode: performance drops, but correctness remains protected by the database.
  • Fail-open cache handling preserves availability on cache errors, while critical keyspaces remain fail-closed.
  • If a payment confirmation is retried, the system returns the existing booking instead of creating a duplicate.
  • If any validation step fails, a transaction rollback keeps state consistent.
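
The fail-open vs. fail-closed split can be sketched as a read-through helper; the function names are illustrative, not the service's actual API:

```python
def cached_read(cache_get, db_read, key, *, fail_closed=False):
    """Read-through helper: on a cache error, either fall back to the
    database (fail-open, the default) or propagate the error
    (fail-closed) for keyspaces where skipping the cache check is
    unsafe, e.g. replay-dedupe keys."""
    try:
        value = cache_get(key)
        if value is not None:
            return value
    except ConnectionError:
        if fail_closed:
            raise
        # Degraded mode: slower, but correctness still comes from the DB.
    return db_read(key)
```

A miss or a tolerated cache failure both land on the same database read, which is why a Redis outage degrades latency without degrading correctness.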

Caching and Consistency

  • Read path is cache-first, with Redis primary and fallback behavior during backend instability.
  • Write path performs after-commit invalidation, so the cache is never updated before the durable state change.
  • Scoped version-token invalidation (broad/net/date/net+date) avoids stale cross-node availability views.
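
Scoped version-token invalidation can be sketched by embedding per-scope version counters in the cache key: bumping a scope's version makes every key beneath it miss without scanning or deleting entries. The scope names and key layout below are illustrative, and the net+date scope falls out of bumping either component:

```python
class VersionedCache:
    """Sketch of scoped version-token invalidation (illustrative layout)."""

    def __init__(self):
        self._versions = {}   # scope -> version counter
        self._store = {}      # materialised key -> cached value

    def _v(self, scope):
        return self._versions.get(scope, 0)

    def _key(self, net, date):
        # The key embeds the broad, per-net, and per-date versions it
        # was built under; any bump changes the key, so old entries
        # simply stop being found.
        return (f"avail:v{self._v('broad')}"
                f":n{net}.{self._v(f'net:{net}')}"
                f":d{date}.{self._v(f'date:{date}')}")

    def get(self, net, date):
        return self._store.get(self._key(net, date))

    def put(self, net, date, value):
        self._store[self._key(net, date)] = value

    def invalidate(self, scope):
        """scope is 'broad', 'net:<id>', or 'date:<d>'."""
        self._versions[scope] = self._v(scope) + 1
```

Because the version counters are the only shared state, nodes never need to agree on which concrete keys to purge, which is what prevents stale cross-node availability views.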

Tradeoffs

  • Correctness first: additional locking and validation add write-path latency under contention.
  • Layered defenses increase operational complexity but provide deterministic behavior under retries and concurrency.
  • Current hardening priorities include a stricter production migration policy and callback configuration validation.

Advanced File Filter Bot

Purpose

  • Store, search, and retrieve Telegram files efficiently under constant usage.

Architecture

  • Async Python runtime with PyroFork/Pyrogram and a layered handler -> service -> repository flow.
  • MongoDB serves indexed metadata queries while Telegram remains the underlying media storage.
  • Redis is used for sessions, cache acceleration, and rate-limiting primitives.
  • Hosted in Docker behind Nginx with production runtime monitoring.

Data Safety

  • MongoDB indexing strategy provides fast, deterministic retrieval across large channel datasets.
  • Per-user rate limiting isolates abusive or heavy request patterns.
  • Atomic quota reservation with compensating release prevents oversubscription during bulk send operations.
  • Global merge/sort before pagination preserves correctness in multi-database search responses.
  • Supports multiple MongoDB databases via separate DB URIs for client-specific routing and failover.
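
The merge/sort-before-pagination rule can be sketched with `heapq.merge`, assuming each database already returns its results sorted by a score field (descending order and the field name are assumed conventions). Paginating each database independently and concatenating would put items on the wrong pages; merging first keeps page boundaries globally correct:

```python
import heapq

def paged_search(per_db_results, page, page_size,
                 key=lambda doc: doc["score"]):
    """Merge already-sorted per-database result lists into one globally
    sorted stream, then slice out the requested page lazily."""
    merged = heapq.merge(*per_db_results, key=key, reverse=True)
    start = page * page_size
    out = []
    for i, item in enumerate(merged):
        if i >= start + page_size:
            break          # stop once the page is full
        if i >= start:
            out.append(item)
    return out
```

`heapq.merge` streams the inputs instead of materialising the full union, so the cost scales with how deep the requested page is, not with the total result count.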

Failure Behavior

  • Per-database circuit breaker states (CLOSED/OPEN/HALF_OPEN) isolate failing pools without collapsing all writes.
  • FloodWait and transient Telegram RPC failures are handled with adaptive retry scheduling and bounded concurrency.
  • Bounded queue + overflow queue + dynamic batch sizing protect the indexing pipeline during traffic spikes.
  • Broadcast state and maintenance counters are persisted so restarts recover safely.
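
The CLOSED/OPEN/HALF_OPEN state machine can be sketched as follows; the threshold and timeout values are illustrative, and the clock is injectable purely to make the sketch testable:

```python
import time

class CircuitBreaker:
    """Minimal per-database circuit breaker (illustrative parameters)."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def allow(self):
        """Should the next request be attempted against this database?"""
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"   # let a probe request through
                return True
            return False                   # still cooling down
        return True

    def record_success(self):
        self.state, self.failures = "CLOSED", 0

    def record_failure(self):
        self.failures += 1
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state, self.opened_at = "OPEN", self.clock()
```

Keeping one breaker instance per database pool is what lets a single failing MongoDB URI trip OPEN while the others keep serving.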

Operations Surface

  • Operational endpoints: /health, /metrics, /performance.
  • Admin runtime controls include cache/database/performance visibility and broadcast lifecycle actions.
  • Maintenance jobs and structured task cleanup reduce long-running runtime drift and orphaned background work.

Tradeoffs

  • System chooses metadata indexing over external file storage, reducing storage/legal overhead but requiring robust reference reconstruction.
  • Cross-database correctness and failover resilience add implementation complexity compared to single-DB bots.
  • Built as a client project with configurable deployment behavior and operational controls.