
Silent Data Corruption and Persistent Backdoor via Unrestricted redis.set_repl in Aiven Valkey

Any authenticated Valkey user can suppress replication of arbitrary write commands via Lua, silently diverging master and replica state, and can register persistent trojan functions that continue corrupting data after the attacker disconnects.

Summary

Any authenticated user on Aiven's managed Valkey service can call redis.set_repl(redis.REPL_NONE) within a Lua EVAL script, suppressing replication of subsequent write commands. Writes execute on the master but are never propagated to replicas or the AOF persistence log. This silently diverges master and replica state with no error, no log entry, and no monitoring alert. Aiven explicitly disables REPLICAOF, SLAVEOF, CLUSTER, CONFIG, DEBUG, and similar replication and administrative commands, demonstrating clear intent to prevent replication manipulation. redis.set_repl() within Lua was not included in those restrictions.

Additionally, FUNCTION LOAD is unrestricted, so an attacker can register persistent server-side functions that call redis.set_repl(redis.REPL_NONE) internally. These trojan functions survive restarts via RDB/AOF and keep suppressing replication on every invocation by any user, including the legitimate application.

Impact

On Aiven Business and Premium plans (2-3 nodes), master and replica data diverge silently. Applications reading from the master see the deleted or corrupted state while replicas keep the old data; a failover to a replica brings the stale data back, producing intermittent behavior that is very hard to diagnose. Attack scenarios demonstrated during testing:

  • Silent key deletion. A key deleted with REPL_NONE is removed from the master only. Replicas retain it. After failover, the deleted key reappears.
  • Silent FLUSHDB. An entire database is wiped on the master while replicas retain all data.
  • Persistent trojan functions. A function library registered via FUNCTION LOAD with embedded redis.set_repl(redis.REPL_NONE) calls persists after the attacker disconnects. In the tested case, every call to the function by the legitimate application silently increments a shadow counter on the master that is never replicated. The function is indistinguishable from a normal application helper in the function list.
  • 100% sustained downtime, when combined with a repeatable crash primitive. An attacker who re-crashes the server each time it recovers keeps it permanently unavailable. The two issues are independent but compose.
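
The divergence behind the first two scenarios can be illustrated with a toy in-memory model. This is only a sketch of the propagation logic, not Valkey code: `ToyNode`, `ToyMaster`, and the numeric flag values are hypothetical names chosen for the example, and a real replica receives writes asynchronously over the replication stream.

```python
from dataclasses import dataclass, field

# Toy propagation flags. The names echo redis.REPL_*; the numeric
# values are arbitrary choices for this model, not Valkey's.
REPL_NONE = 0
REPL_AOF = 1
REPL_REPLICA = 2
REPL_ALL = REPL_AOF | REPL_REPLICA


@dataclass
class ToyNode:
    """In-memory stand-in for one server's keyspace."""
    data: dict = field(default_factory=dict)

    def apply(self, op, key=None, value=None):
        if op == "SET":
            self.data[key] = value
        elif op == "DEL":
            self.data.pop(key, None)
        elif op == "FLUSHDB":
            self.data.clear()
        else:
            raise ValueError(f"unsupported op: {op}")


@dataclass
class ToyMaster:
    """A master with one replica and an append-only log."""
    node: ToyNode = field(default_factory=ToyNode)
    replica: ToyNode = field(default_factory=ToyNode)
    aof: list = field(default_factory=list)

    def write(self, op, key=None, value=None, repl=REPL_ALL):
        # The write always executes locally on the master.
        self.node.apply(op, key, value)
        # Propagation depends entirely on the flags in effect.
        if repl & REPL_AOF:
            self.aof.append((op, key, value))
        if repl & REPL_REPLICA:
            self.replica.apply(op, key, value)


def demo():
    master = ToyMaster()
    master.write("SET", "order:1", "paid")           # replicated normally
    master.write("DEL", "order:1", repl=REPL_NONE)   # suppressed delete
    return master.node.data, master.replica.data, master.aof


if __name__ == "__main__":
    print(demo())
```

In this model the master ends up empty while the replica and the log still hold `order:1`, so promoting the replica would make the deleted key reappear.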

Redis's own documentation explicitly warns: "This is an advanced feature. Misuse can cause damage by violating the contract that binds the Redis master, its replicas, and AOF contents to hold the same logical content."

Root cause

Aiven's command restriction policy disables replication topology commands (REPLICAOF, SLAVEOF, CLUSTER) and administrative commands (CONFIG, DEBUG, BGSAVE, ACL). These restrictions are applied at the command level. redis.set_repl() is a Lua scripting API function rather than a top-level Valkey command, so it does not appear in the command ACL and was not included in the restriction policy. The result is a policy gap: replication topology is protected, but replication content integrity is not.

FUNCTION LOAD persists function code in the server state across restarts via RDB and AOF. Functions are stored as first-class server state, so a trojan function registered by one user continues operating after that user's connection ends and after server restarts.
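
Because loaded libraries persist as server state, one way to review them is to scan their source for `set_repl` calls. The sketch below assumes the reply of `FUNCTION LIST WITHCODE` has already been decoded into dictionaries with `library_name` and `library_code` keys; that input shape and the helper name `find_set_repl_libraries` are assumptions for illustration. It is a plain text match, so it is a heuristic that obfuscated Lua could evade.

```python
import re

# Heuristic: any occurrence of the identifier set_repl followed by "(".
# String concatenation or table indexing in Lua can defeat this match.
SET_REPL_CALL = re.compile(r"\bset_repl\s*\(")


def find_set_repl_libraries(libraries):
    """Return the names of libraries whose source contains a set_repl call.

    `libraries` is an iterable of dicts with 'library_name' and
    'library_code' keys (an assumed, already-decoded form of the
    FUNCTION LIST WITHCODE reply).
    """
    flagged = []
    for lib in libraries:
        code = lib.get("library_code") or ""
        if SET_REPL_CALL.search(code):
            flagged.append(lib["library_name"])
    return flagged
```

A flagged library is not necessarily malicious, but given the persistence described above it is worth a manual look.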

Proof of concept

The scripts below demonstrate silent key deletion and a persistent trojan function. All credentials and host identifiers have been replaced with placeholders.
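
As an illustration of the silent key deletion case, here is a minimal sketch; it is not the reported script itself. It assumes a client object that exposes an `eval(script, numkeys, *keys_and_args)` method, as redis-py clients do; `silent_delete` and the key name in the usage note are hypothetical.

```python
# Lua body: switch off propagation for the rest of the script,
# then delete the key passed as KEYS[1].
SILENT_DEL_LUA = """
redis.set_repl(redis.REPL_NONE)
return redis.call('DEL', KEYS[1])
"""


def silent_delete(client, key):
    """Run the script through EVAL with one key and return the DEL reply.

    `client` is any object with an eval(script, numkeys, *keys_and_args)
    method (assumed here to behave like a redis-py client).
    """
    return client.eval(SILENT_DEL_LUA, 1, key)
```

Against a test instance, `silent_delete(client, "poc:key")` followed by a read of `poc:key` on both the master and a replica would show whether the two have diverged.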

Disclosure and fix

Reported to Aiven through their bug bounty program. Aiven triaged this as P2 (High). Recommended fixes:

  1. Restrict redis.set_repl() in the Lua environment in one of three ways: return an error when it is called from EVAL or FCALL; add it to the ACL command list so that it requires an explicit grant; or accept only REPL_ALL and reject REPL_NONE, REPL_AOF, and REPL_REPLICA.
  2. Consider restricting FUNCTION LOAD for default users to prevent registration of persistent functions without an explicit grant.
  3. Add monitoring for replication content divergence, not just replication lag: compare key counts or checksums between master and replicas to detect the divergence that REPL_NONE writes cause.
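
The checksum comparison in the last recommendation could look roughly like the following. It is a sketch under the assumption that a snapshot of each node has already been collected as a mapping from key name to serialized value bytes (for example by iterating with SCAN and reading each key); the function names and the report layout are hypothetical.

```python
import hashlib


def digest(value):
    """SHA-256 hex digest of one serialized value."""
    return hashlib.sha256(value).hexdigest()


def diff_snapshots(master, replica):
    """Compare two {key: bytes} snapshots and report how they differ."""
    report = {"missing_on_master": [], "missing_on_replica": [], "mismatched": []}
    for key in sorted(master.keys() | replica.keys()):
        if key not in master:
            report["missing_on_master"].append(key)
        elif key not in replica:
            report["missing_on_replica"].append(key)
        elif digest(master[key]) != digest(replica[key]):
            report["mismatched"].append(key)
    return report
```

Ordinary replication lag and key expiry also produce transient differences, so a monitor built on this idea would likely need to re-check a key over several samples before alerting.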