Advanced File Filter Bot

A Telegram indexing system that turns large media channels into searchable storage using incremental synchronization, metadata indexing, and flood-wait-safe request scheduling without storing files externally.

Python · PyroFork/Pyrogram · MongoDB · Redis · Telegram Bot API · aiohttp · Docker Compose · Ruff · mypy · pytest · 2025

What I did

  • Client-facing Telegram media indexing and retrieval middleware built with Python + Pyrogram/PyroFork.
  • Converts large Telegram channels into searchable metadata without downloading or re-hosting media externally.
  • Supports per-user rate limiting, atomic quota controls, multi-database pooling/failover, and indexed MongoDB retrieval paths.
  • Structured as handlers, services, and repositories with centralized lifecycle control for stability over long runtimes.
  • Operational endpoints /health, /metrics, and /performance are served via aiohttp, alongside admin tooling for cache, database, and runtime management.
  • Live metrics endpoint: https://filefilterbot.rumalg.me/metrics

Case Study

System Scope

  • Telegram-based indexing and retrieval system that makes large channel media searchable for end users.
  • Acts as middleware on top of Telegram where external search APIs are not available.
  • Designed as an operational backend system, not only a command bot implementation.

Platform Constraints

  • Telegram channels can contain hundreds of thousands of posts, and native media search over them is inconsistent.
  • Rate limits and FloodWait behavior make naive crawling and send patterns unstable.
  • Messages may be deleted and file references can expire, so the index must self-heal.
  • Bot restarts are expected; progress and state must be restart-safe.

Core Architecture

  • Incremental indexing tracks the last processed message ID, shifting ingestion from O(n) rescans of the full history to O(delta) updates over only the messages added since the previous run.
  • MongoDB stores only the indexed metadata; Telegram remains the storage system for the files themselves.
  • Repository layer uses indexed fields and global merge/sort pagination for correctness across multiple databases.
  • Redis handles sessions, caching, and rate limiting, with versioned cache invalidation for low-cost global resets.
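
The incremental indexing idea above can be sketched as follows. This is a minimal in-memory illustration, not the project's code: the class name IncrementalIndexer, the fetch_after callable, and the dictionaries standing in for the persisted checkpoint and the MongoDB index are all assumptions made for the example.

```python
from collections.abc import Callable, Iterable
from dataclasses import dataclass, field

# A message is modelled as (message_id, file_name); this shape is an assumption.
Message = tuple[int, str]


@dataclass
class IncrementalIndexer:
    """Keeps one checkpoint (highest indexed message ID) per channel."""

    # Hypothetical in-memory stand-ins for persisted state.
    checkpoints: dict[int, int] = field(default_factory=dict)
    index: dict[tuple[int, int], str] = field(default_factory=dict)

    def sync(
        self,
        channel_id: int,
        fetch_after: Callable[[int], Iterable[Message]],
    ) -> int:
        """Index only messages newer than the checkpoint; return the count added."""
        last_id = self.checkpoints.get(channel_id, 0)
        added = 0
        # Ask the source only for the delta, never for the full history.
        for message_id, file_name in sorted(fetch_after(last_id)):
            if message_id <= last_id:
                continue  # defensive: skip anything already covered
            self.index[(channel_id, message_id)] = file_name
            last_id = message_id
            added += 1
        # Advance the checkpoint once the batch has been stored.
        self.checkpoints[channel_id] = last_id
        return added
```

Running sync twice against an unchanged channel adds nothing the second time, which is the property that makes restarts cheap: the work done is proportional to the new messages, not to the channel size.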

Correctness and Concurrency Decisions

  • Send-all uses atomic quota reservation plus compensating release to prevent concurrent oversubscription.
  • Duplicate indexing and duplicate user-creation paths are handled idempotently to avoid restart-time conflicts.
  • Daily usage counter reset state is persisted to survive restarts and avoid incorrect counter drift.
  • Poster-to-file mapping uses message ordering to reconstruct logical media groupings not exposed directly by Telegram.
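
To make the reserve-then-compensate pattern for send-all concrete, here is a small sketch. It assumes an in-process asyncio.Lock as the atomicity mechanism; the real system may instead rely on an atomic database or Redis operation. QuotaLedger, QuotaExceeded, and send_all are hypothetical names introduced for illustration.

```python
import asyncio
from collections.abc import Awaitable, Callable


class QuotaExceeded(Exception):
    """Raised when a reservation would push a user past the daily limit."""


class QuotaLedger:
    def __init__(self, daily_limit: int) -> None:
        self.daily_limit = daily_limit
        self.used: dict[int, int] = {}
        self._lock = asyncio.Lock()

    async def reserve(self, user_id: int, count: int) -> None:
        # Check and increment happen under one lock, so two concurrent
        # send-all requests cannot both pass the limit check.
        async with self._lock:
            current = self.used.get(user_id, 0)
            if current + count > self.daily_limit:
                raise QuotaExceeded(f"user {user_id} would exceed the limit")
            self.used[user_id] = current + count

    async def release(self, user_id: int, count: int) -> None:
        async with self._lock:
            self.used[user_id] = max(0, self.used.get(user_id, 0) - count)


async def send_all(
    ledger: QuotaLedger,
    user_id: int,
    files: list[str],
    send: Callable[[str], Awaitable[None]],
) -> int:
    """Reserve quota for the whole batch, then give back whatever was not sent."""
    await ledger.reserve(user_id, len(files))
    sent = 0
    try:
        for file_id in files:
            await send(file_id)
            sent += 1
    finally:
        unsent = len(files) - sent
        if unsent:
            # Compensating release: runs on failure or cancellation.
            await ledger.release(user_id, unsent)
    return sent
```

Reserving up front means the limit check never races with another request; the release in the finally block keeps a failed batch from permanently consuming quota it did not use.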

Scalability and Throughput

  • Batch duplicate checks and bulk-save indexing paths remove N+1 overhead in channel ingestion.
  • Bounded queue + overflow queue + dynamic batch sizing absorb spikes without collapsing ingestion workers.
  • Per-domain semaphores isolate Telegram API and database workloads to reduce contention.
  • Adaptive retry scheduling for FloodWait and transient RPC errors protects long-run throughput.
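
One way the retry scheduling for FloodWait and transient errors might look is sketched below. The exception classes are local stand-ins so the example runs without Pyrogram installed; in a real bot the flood-wait case would be Pyrogram's own FloodWait error, and the attribute carrying the wait time is an assumption here. The function name, delays, and attempt limit are illustrative values, not the project's settings.

```python
import asyncio
import random
from collections.abc import Awaitable, Callable
from typing import TypeVar

T = TypeVar("T")


class FloodWaitError(Exception):
    """Stand-in for a flood-wait error that carries a server-mandated delay."""

    def __init__(self, seconds: float) -> None:
        super().__init__(f"flood wait: {seconds}s")
        self.seconds = seconds


class TransientError(Exception):
    """Stand-in for a retryable RPC or network error."""


async def call_with_retry(
    operation: Callable[[], Awaitable[T]],
    *,
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except FloodWaitError as exc:
            if attempt == max_attempts:
                raise
            # Honour the server's requested wait, plus one second of padding.
            await sleep(exc.seconds + 1)
        except TransientError:
            if attempt == max_attempts:
                raise
            # Exponential backoff, capped, with jitter to avoid synchronized retries.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            await sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

The two error classes are handled differently on purpose: a flood wait tells the caller exactly how long to pause, while a transient failure gives no hint, so the delay grows with each attempt.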

Fault Tolerance and Recovery

  • Multi-database manager uses per-database circuit breaker states (CLOSED, OPEN, HALF_OPEN) with recovery probing.
  • Smart write selection can route across multiple MongoDB URIs to isolate partial outages.
  • Broadcast runtime state is persisted and recovered on startup to avoid orphan active sessions.
  • Structured handler/task cleanup prevents background task leaks during shutdown and restart cycles.
  • Secure updater workflow includes backup, validation checks, and rollback support for safer runtime upgrades.
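
The three circuit breaker states named above follow a small state machine, sketched here for a single database. The thresholds, the injectable clock, and the method names are assumptions chosen for the example; the project's actual manager may differ in detail.

```python
import time
from collections.abc import Callable
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # healthy: requests flow normally
    OPEN = "open"            # failing: requests are rejected immediately
    HALF_OPEN = "half_open"  # recovering: requests act as probes


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        recovery_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state is State.OPEN:
            if self.clock() - self.opened_at >= self.recovery_timeout:
                # Timeout elapsed: move to HALF_OPEN and allow a probe.
                self.state = State.HALF_OPEN
                return True
            return False
        return True

    def record_success(self) -> None:
        # Any success, including a successful probe, closes the breaker.
        self.failures = 0
        self.state = State.CLOSED

    def record_failure(self) -> None:
        self.failures += 1
        # A failed probe reopens at once; otherwise open at the threshold.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

A multi-database manager would hold one such breaker per MongoDB URI and skip any database whose allow_request() returns False when choosing where to read or write.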

Technology and Operational Controls

  • Core runtime: Python async bot architecture with PyroFork/Pyrogram integration and aiohttp operational endpoints.
  • Data layer: MongoDB with index-driven search and optional multi-database routing/failover.
  • Coordination layer: Redis-backed rate limiting, session state, and versioned cache invalidation.
  • Delivery safety: Telegram API wrapper with FloodWait-aware retries and semaphore-based concurrency control.
  • Operations surface: admin commands for cache, database stats, performance checks, and broadcast lifecycle control.
  • Tooling quality gate: Ruff, mypy, and pytest, with a Docker Compose deployment workflow.
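
Versioned cache invalidation, mentioned under the coordination layer, can be illustrated with a plain dictionary standing in for Redis. The class name, key layout, and namespace names below are hypothetical; with a real Redis backend the version would typically be a counter bumped with INCR, and stale keys would be left to expire through their TTL.

```python
from typing import Any


class VersionedCache:
    """Cache whose keys embed a per-namespace version number."""

    def __init__(self) -> None:
        # Hypothetical in-memory stand-ins for the Redis keyspace.
        self._store: dict[str, Any] = {}
        self._versions: dict[str, int] = {}

    def _key(self, namespace: str, key: str) -> str:
        version = self._versions.get(namespace, 0)
        return f"{namespace}:v{version}:{key}"

    def get(self, namespace: str, key: str) -> Any:
        return self._store.get(self._key(namespace, key))

    def set(self, namespace: str, key: str, value: Any) -> None:
        self._store[self._key(namespace, key)] = value

    def invalidate(self, namespace: str) -> None:
        # One counter bump makes every old key unreachable; nothing is
        # scanned or deleted, which is what keeps a global reset cheap.
        self._versions[namespace] = self._versions.get(namespace, 0) + 1
```

For example, after cache.set("search", "matrix", [1, 2]) a call to cache.invalidate("search") makes cache.get("search", "matrix") return None, while entries in other namespaces are unaffected.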

Production Readiness Signals

  • Incremental indexing supports large-channel operation without full-history rescans.
  • Queue backpressure controls and dynamic batch sizing keep ingestion stable during traffic spikes.
  • Atomic quota reservation preserves correctness for bulk-send flows under concurrency.
  • Restart-safe state recovery ensures broadcasts, counters, and indexing can resume predictably.
  • Designed and operated as a long-running production-style client system, not a demo-only bot.
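
The queue backpressure and dynamic batch sizing mentioned above could work roughly as in the sketch below. It is a synchronous, in-memory simplification: the IngestionBuffer name, the capacity, and the "half the backlog, clamped" sizing rule are assumptions for illustration, and the overflow queue is left unbounded here for brevity.

```python
from collections import deque
from typing import Any


class IngestionBuffer:
    """Bounded primary queue with an overflow queue behind it."""

    def __init__(self, capacity: int = 100, min_batch: int = 10, max_batch: int = 100) -> None:
        self.capacity = capacity
        self.min_batch = min_batch
        self.max_batch = max_batch
        self.primary: deque[Any] = deque()
        self.overflow: deque[Any] = deque()

    def put(self, item: Any) -> bool:
        """Return True if the item entered the primary queue, False if it overflowed."""
        if len(self.primary) < self.capacity:
            self.primary.append(item)
            return True
        self.overflow.append(item)
        return False

    def batch_size(self) -> int:
        # Larger backlog means larger batches, clamped to [min_batch, max_batch].
        backlog = len(self.primary) + len(self.overflow)
        return max(self.min_batch, min(self.max_batch, backlog // 2))

    def next_batch(self) -> list[Any]:
        size = self.batch_size()
        batch = []
        while self.primary and len(batch) < size:
            batch.append(self.primary.popleft())
        # Refill the primary queue from overflow, preserving arrival order.
        while self.overflow and len(self.primary) < self.capacity:
            self.primary.append(self.overflow.popleft())
        return batch
```

During a spike the primary queue stays at its fixed capacity while excess items wait in overflow, and batches grow so the backlog drains faster; once traffic subsides, batch sizes fall back toward the minimum.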

Outcome

  • Transforms Telegram channels into searchable archives without externally copying media files.
  • Maintains predictable behavior across retries, restarts, FloodWait, and partial database failures.
  • Delivers fast user search and retrieval while preserving platform-safe operational behavior.

Live System

This project runs continuously and exposes runtime metrics:

https://filefilterbot.rumalg.me/metrics