Fintech

JFIntech

Led fintech SaaS engineering team.

Role

Tech Lead

Timeline

2023

Stack

5 technologies

40%

production incidents reduced

Hours → Minutes

root cause analysis time

Month 1

full observability in place

Stack: Python, FastAPI, Kubernetes, PostgreSQL, Fintech

The Brief

JFIntech builds SaaS tooling for financial services operations — the category of software that sits between the core banking system and customer-facing products. Reconciliation, reporting, regulatory filing preparation, exception management.

The engineering challenge was not product invention but product stability and scale. The system was growing faster than its architecture could accommodate. Production incidents were frequent, root cause analysis was slow, and the team was spending more time firefighting than building.

The mandate: stabilize the platform, instrument it properly, and restore the team's ability to ship features confidently.

What Was Delivered

  • Full observability stack in place by end of month one: structured JSON logging, distributed tracing via OpenTelemetry, and a metrics suite designed for fintech workloads
  • Async refactor of the reconciliation flow using asyncio.gather, eliminating serialized database calls that were responsible for 60% of timeout-related incidents
  • Connection pool right-sized for actual production concurrency levels
  • Kubernetes infrastructure refactored: reconciliation workers separated from API layer, scaling independently based on queue depth
  • Pod disruption budgets and proper readiness/liveness probes added throughout
  • PostgreSQL query optimization via EXPLAIN ANALYZE on every slow query identified by the observability layer
  • Production incidents down 40% in the following quarter
  • Mean time to resolution for remaining incidents dropped from hours to minutes

The Approach

In fintech systems, production pain comes overwhelmingly from two sources: synchronous operations that should have been asynchronous, which cause the incidents, and insufficient observability, which makes debugging slow and root cause identification unreliable.

Fintech systems handle regulated data. This shapes the architecture: auditability, data integrity, and failure isolation matter more than raw throughput. The migration from a synchronous, monolithic processing model to an async, event-driven model had to be done carefully — financial data cannot be silently dropped or duplicated during a migration.

"In financial systems, a silent failure is worse than a loud one. Design for noise."

The Build

Month 1 — Observability. Before changing any production code, the system needed enough instrumentation to understand what was actually happening. Structured logging in JSON format with consistent field names across services. Distributed tracing with OpenTelemetry. Metrics tailored to the fintech workload: transaction processing latency, error rates by transaction type, queue depths, database connection pool saturation.
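As an illustration of structured JSON logging with consistent field names, here is a minimal sketch using only the Python standard library. The field names (service, txn_id, txn_type, trace_id) and the logger name are assumptions made for the example; the case study does not state the actual schema.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a fixed set of field names."""

    # Hypothetical context fields; the real schema is not given in the case study.
    CONTEXT_FIELDS = ("service", "txn_id", "txn_type", "trace_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload, default=str)


def build_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes one JSON line per record to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: capture one log line in memory and parse it back.
buffer = io.StringIO()
log = build_logger("reconciliation", buffer)
log.info("reconciliation started", extra={"service": "recon-worker", "txn_id": "txn-123"})
entry = json.loads(buffer.getvalue())
```

Keeping the field list in one place is what makes the names consistent across services: every service that imports the same formatter emits the same keys, so queries and dashboards do not need per-service variants.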

The dashboards that resulted immediately revealed the actual bottlenecks — not the assumed ones. Three synchronous database operations, called in sequence in the reconciliation flow, were responsible for 60% of the timeout-related incidents. They were not slow individually; they were slow because they were serialized when they could have been parallelized, and because connection pool exhaustion under load caused cascading timeouts.

Async refactor. The serialized operations were parallelized using asyncio.gather, and the connection pool was sized correctly for actual concurrency levels observed in production.

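import asyncio

# fetch_positions, fetch_trades, fetch_settlements, compute_reconciliation and
# ReconciliationResult are defined elsewhere in the service.
async def reconcile_transaction(txn_id: str) -> ReconciliationResult:
    # Previously awaited in sequence (about 3x the latency under load).
    # The three fetches are independent, so they can run concurrently.
    positions, trades, settlements = await asyncio.gather(
        fetch_positions(txn_id),
        fetch_trades(txn_id),
        fetch_settlements(txn_id),
    )
    return compute_reconciliation(positions, trades, settlements)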
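Parallelizing the fetches changes the pool arithmetic: each reconciliation can now hold up to three connections at once instead of one. The sketch below illustrates that interaction with a stand-in pool built on asyncio.Semaphore. BoundedPool, simulate, and all the numbers are hypothetical and are not the project's actual pool or settings.

```python
import asyncio


class BoundedPool:
    """Stand-in for a database connection pool: at most `size` concurrent holders."""

    def __init__(self, size: int) -> None:
        self._slots = asyncio.Semaphore(size)
        self.in_use = 0
        self.peak = 0

    async def run(self, seconds: float) -> None:
        # Wait for a free slot, then simulate a query that holds the connection.
        async with self._slots:
            self.in_use += 1
            self.peak = max(self.peak, self.in_use)
            try:
                await asyncio.sleep(seconds)
            finally:
                self.in_use -= 1


async def reconcile(pool: BoundedPool) -> None:
    # Three independent queries in flight at once: three connections per call.
    await asyncio.gather(pool.run(0.01), pool.run(0.01), pool.run(0.01))


async def simulate(pool_size: int, concurrent_reconciliations: int) -> int:
    """Run several reconciliations at once and report the peak connections in use."""
    pool = BoundedPool(pool_size)
    await asyncio.gather(*(reconcile(pool) for _ in range(concurrent_reconciliations)))
    return pool.peak


# 4 reconciliations x 3 queries want 12 connections at the same moment.
peak_small = asyncio.run(simulate(pool_size=5, concurrent_reconciliations=4))
peak_large = asyncio.run(simulate(pool_size=12, concurrent_reconciliations=4))
```

With the smaller pool, the remaining queries queue for a free slot, and that queueing time is the kind of wait that can turn into cascading timeouts under load. Sizing from observed concurrency rather than from a default avoids it.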

Kubernetes infrastructure. Reconciliation workers separated from the API layer, scaling independently based on queue depth. Pod disruption budgets and proper readiness/liveness probes added — the kind of operational hardening that prevents the cluster from taking down its own workloads during deployments.
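Scaling workers on queue depth comes down to a small piece of arithmetic. The case study does not say which autoscaling mechanism was used, so the following is only a sketch of the ratio involved; it mirrors how the Kubernetes Horizontal Pod Autoscaler treats an average-value target. The function name, the target of 50 items per worker, and the replica bounds are all hypothetical.

```python
import math


def desired_workers(queue_depth: int, target_per_worker: int,
                    min_workers: int = 1, max_workers: int = 20) -> int:
    """Workers needed so each one handles about `target_per_worker` queued items."""
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    wanted = math.ceil(queue_depth / target_per_worker)
    # Clamp to the configured bounds, as an autoscaler's min/max replica settings would.
    return max(min_workers, min(max_workers, wanted))


idle = desired_workers(queue_depth=0, target_per_worker=50)
busy = desired_workers(queue_depth=120, target_per_worker=50)
flooded = desired_workers(queue_depth=5000, target_per_worker=50)
```

Because the API layer is a separate deployment, a reconciliation backlog changes only the worker count and leaves API capacity untouched.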

PostgreSQL optimization. Financial queries tend to be complex: multiple joins, aggregations, date-range filters across large tables. EXPLAIN ANALYZE on every slow query identified by the observability layer revealed several cases of sequential scans on large tables where an index would have been straightforward to add. Index design on partitioned tables requires care: in PostgreSQL, for instance, a unique index on a partitioned table must include all of the partition key columns.
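One way to triage plans at volume is to request them as JSON with EXPLAIN (ANALYZE, FORMAT JSON) and walk the plan tree looking for large sequential scans. The sketch below assumes that approach, which the case study does not confirm. The helper names and the row threshold are hypothetical, and the sample plan is hand-written for illustration, not captured from a real database.

```python
from typing import Any, Iterator


def iter_nodes(plan: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield a plan node and all of its descendants, depth first."""
    yield plan
    for child in plan.get("Plans", []):
        yield from iter_nodes(child)


def large_seq_scans(explain_output: list[dict[str, Any]], min_rows: int = 100_000) -> list[str]:
    """Return relation names whose sequential scans read at least `min_rows` rows."""
    flagged = []
    for statement in explain_output:
        for node in iter_nodes(statement["Plan"]):
            if node.get("Node Type") != "Seq Scan":
                continue
            # Rows read = rows returned + rows discarded by the filter, per loop.
            rows_read = node.get("Actual Rows", 0) + node.get("Rows Removed by Filter", 0)
            if rows_read * node.get("Actual Loops", 1) >= min_rows:
                flagged.append(node["Relation Name"])
    return flagged


# Hand-written plan fragment for illustration only; table names are hypothetical.
sample_plan = [{
    "Plan": {
        "Node Type": "Hash Join",
        "Plans": [
            {"Node Type": "Seq Scan", "Relation Name": "settlements",
             "Actual Rows": 1200, "Rows Removed by Filter": 4_998_800, "Actual Loops": 1},
            {"Node Type": "Index Scan", "Relation Name": "trades",
             "Actual Rows": 1200, "Actual Loops": 1},
        ],
    }
}]
candidates = large_seq_scans(sample_plan)
```

A scan that reads millions of rows to return a few thousand is the usual sign that a filter column is a candidate for an index. The flagged list is a starting point for review and does not decide what to index.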

The Outcome

Production incidents dropped 40% in the quarter following the observability and async refactor work. Mean time to resolution for the incidents that did occur dropped from hours to minutes, because the observability layer replaced log archaeology with direct root cause analysis.

The observability investment also revealed that the assumed bottleneck (a third-party reconciliation API) was not the actual bottleneck. The actual bottlenecks were internal and fixable within a week. This is a pattern that repeats: engineering teams think they know where the problems are. The teams that actually know have the data.

The team's deployment confidence improved measurably. Feature delivery rate in the following quarter was higher than in any previous quarter.

Lessons

Observability before optimization. Every engineering team thinks they know where the bottlenecks are. The teams that actually know have the data. Build the instrumentation before you start fixing anything.

Async is not always the answer. The reconciliation operations were the right candidates for parallelization because they were genuinely independent. Other parts of the codebase had sequential operations that were sequential for good reasons — transaction ordering guarantees, for example. Understanding which sequential operations can safely be parallelized requires domain knowledge, not just profiling data.
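A small hypothetical example shows why ordering guarantees rule out parallelization. It is not taken from the codebase: the function and the overdraft rule are invented for illustration. When each step's validity depends on the state left by the previous one, the same inputs in a different order give a different result.

```python
def apply_in_order(balance: int, movements: list[int]) -> tuple[int, list[int]]:
    """Apply signed movements (in minor units) one at a time, rejecting overdrafts."""
    rejected = []
    for amount in movements:
        if balance + amount < 0:
            rejected.append(amount)  # would overdraw at this point in the sequence
        else:
            balance += amount
    return balance, rejected


# Same movements, different order, different outcome.
deposit_first = apply_in_order(0, [100, -80])
withdrawal_first = apply_in_order(0, [-80, 100])
```

Running such steps concurrently would leave the order to the scheduler. The three reconciliation fetches differ because they only read, and none depends on another's result.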

In financial systems, test your failure modes explicitly. The 40% incident reduction came partly from the fixes and partly from adding explicit tests for the failure scenarios that had been causing incidents — tests that prove the system degrades gracefully under load.
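An explicit failure-mode test might look like the following sketch. It checks that a dependency which stops answering produces a loud timeout and never a partial result. The fetch functions are simulated stand-ins and the timeout value is arbitrary; none of this is the project's actual test suite.

```python
import asyncio


async def fetch_ok(name: str) -> str:
    await asyncio.sleep(0)
    return name


async def fetch_hanging() -> str:
    # Simulates a dependency that never answers under load.
    await asyncio.sleep(3600)
    return "settlements"


async def reconcile(timeout: float) -> list[str]:
    # wait_for cancels the gather, and with it every fetch, if the deadline passes.
    return await asyncio.wait_for(
        asyncio.gather(fetch_ok("positions"), fetch_ok("trades"), fetch_hanging()),
        timeout=timeout,
    )


def test_hung_dependency_fails_loudly() -> bool:
    """A hung fetch must raise a timeout, never return a partial result."""
    try:
        asyncio.run(reconcile(timeout=0.05))
    except asyncio.TimeoutError:
        return True
    return False


raised_timeout = test_hung_dependency_fails_loudly()
```

The same shape works for other incident classes: reproduce the condition in a controlled way, then assert on the exact failure the caller should see.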
