Elrond Incident and Recovery Report
Late at night on Sunday, the 5th of June 2022, a lot of suspicious activity started being reported on the Elrond mainnet, including large transfers of EGLD and large swaps on the Maiar Exchange. The team quickly assembled to investigate this activity, in what would become a long and pivotal night. Now that the dust has settled, the vulnerabilities fixed, most of the missing funds recovered, and all systems live, it is worth looking back and tracing what has actually been happening in the past 5 days.
What actually happened?
It all begins with the deployment of a new VM function to the mainnet. The endpoint in question is called “execute on destination context by caller”, or “executeOnDestContextByCaller” as the contracts see it. It is very similar to the more common “execute on destination context”, which we use for synchronous same-shard interactions between contracts, but with one caveat: the action happens as if it were directly sent by the original caller of the transaction.
Let’s break it down step by step: say User A calls Contract B, which in turn calls Contract C using “executeOnDestContextByCaller”. Contract C receives a transaction from B that looks as if it was sent directly by User A. Why would we want this functionality? Let’s say for instance that User A is trying to mint an NFT for herself using an NFT minter contract (Contract B), but wants to be the creator of the NFT. Normally Contract B is the creator, since it is calling the NFT mint function itself, but by using “executeOnDestContextByCaller”, the NFT looks like it was minted by User A directly.
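The difference between the two call styles can be illustrated with a small Python model. This is only a sketch: the names below (CallContext, execute_on_dest_context, execute_on_dest_context_by_caller) are hypothetical stand-ins chosen for illustration, not the actual VM API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallContext:
    """What a contract sees while it executes (toy model, not the real VM)."""
    caller: str      # address the contract believes sent the call
    contract: str    # address of the contract being executed


def execute_on_dest_context(current: CallContext, dest: str) -> CallContext:
    # Regular synchronous call: the callee sees the calling contract as caller.
    return CallContext(caller=current.contract, contract=dest)


def execute_on_dest_context_by_caller(current: CallContext, dest: str) -> CallContext:
    # "By caller" variant: the callee sees the original caller instead.
    return CallContext(caller=current.caller, contract=dest)


# User A calls Contract B, which then calls Contract C in both styles.
in_b = CallContext(caller="user_a", contract="contract_b")
regular = execute_on_dest_context(in_b, "contract_c")
by_caller = execute_on_dest_context_by_caller(in_b, "contract_c")
```

In this model Contract C sees "contract_b" as its caller in the regular case and "user_a" in the by-caller case, which is exactly what makes User A appear as the NFT creator.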
All good until now. Come last week, a small group of developers from the community is trying to build a lottery contract that can withstand all manner of exploits. The trouble is that users of this lottery can in principle cancel any transaction where they don’t win by using a smart contract, effectively winning 100% of the time (this is a broader flaw in that particular lottery contract design, but this discussion is out of scope of this article). The contract is patched to not allow calls from other contracts, but somebody finds a workaround to that, by using “executeOnDestContextByCaller”. Now the lottery contract can be tricked into believing a regular wallet sent a transaction that really came from another contract.
A flurry of activity ensues on the official development channel, as the community plays with this function, until someone notices something worrying. Contract B can also transfer funds in the name of the caller (User A) by using this function. A user could be tricked into sending any transaction to a malicious contract, with no funds, only to see their wallet drained by said contract.
To make sure we don’t see a wave of scams using this exploit, the Elrond team quickly issues a patch, barring all transfers of value via “executeOnDestContextByCaller”. Upgrading the network is a complex and sensitive process, so when done properly it normally takes at least a week for a patch to reach mainnet. The reason is that it takes 2-4 days to validate the patch, plus several days to propose it and give validators ample time to adopt and upgrade. The patch is supposed to come into effect on Monday, the 6th, at epoch change. At this point the remaining time is pretty short for a regular scammer to build and deploy a malicious dApp, and the users could always be warned to stay away, so the danger seems minimal.
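The wallet-draining scenario and the effect of the first patch can be sketched in the same toy style. Everything here (ToyVM, execute_by_caller, the addresses and balances) is a hypothetical model written for this illustration. It only mirrors the rule described above, that a by-caller call may no longer carry value.

```python
class ValueTransferForbidden(Exception):
    """Raised by the patched toy VM when a by-caller call carries value."""


class ToyVM:
    def __init__(self, balances, patched):
        self.balances = dict(balances)
        self.patched = patched

    def execute_by_caller(self, original_caller, dest, value):
        # The call is attributed to original_caller, so any attached value
        # is debited from original_caller, not from the calling contract.
        if value < 0:
            raise ValueError("value must be non-negative")
        if self.patched and value > 0:
            raise ValueTransferForbidden("by-caller calls may not move value")
        if self.balances.get(original_caller, 0) < value:
            raise ValueError("insufficient funds")
        self.balances[original_caller] = self.balances.get(original_caller, 0) - value
        self.balances[dest] = self.balances.get(dest, 0) + value


def drain_attempt(patched):
    """A malicious contract, called by user_a, tries to forward her funds."""
    vm = ToyVM({"user_a": 100, "attacker": 0}, patched=patched)
    try:
        vm.execute_by_caller("user_a", "attacker", 100)
    except ValueTransferForbidden:
        pass  # the patched VM rejects the transfer, balances stay untouched
    return vm.balances
```

Calling drain_attempt(False) leaves user_a with nothing, while drain_attempt(True) leaves both balances unchanged.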
But see, if it was just that we wouldn’t be here discussing this.
Come the weekend, the community continues to play around with the functionality and accidentally stumbles across something more threatening. Sharding is arguably the biggest deal about Elrond, and to make sharding really work, contracts need to communicate properly across shards. Synchronous EVM-style calls are not enough for this, so contracts on Elrond most of the time send so-called asynchronous calls to other contracts and wait for a callback to confirm whether or not things on the other side went well. Asynchronous calls work similarly within the same shard, except that in this case the callback comes immediately after the asynchronous call ends. The community discovers that “executeOnDestContextByCaller” can be called in a callback, in which case a malicious contract can perform actions not in the name of its caller, but in the name of any contract it chooses to call.
That is right, by using “executeOnDestContextByCaller” in an asynchronous callback, an attacker is able to drain EGLD from any contract it wants, and there is nothing the victim contract can do about it.
There are some caveats: the attack only works if the malicious contract and the target are in the same shard, so all staked EGLD is out of reach, as it resides on the metachain. ESDT tokens are also safe, since to transfer those the malicious contract needs to call a builtin function, like “ESDTTransfer”, which the VM forbids in this case. But that is all the good news.
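The callback variant of the exploit can be sketched as follows. This is a toy model under a stated assumption: inside the callback of a same-shard asynchronous call, the contract that was called appears as the "caller", so a by-caller transfer spends that contract's balance. The class, function names and balances are hypothetical and do not reproduce the attacker's contract.

```python
class ToyShard:
    """Balances of the accounts living in one shard (toy model)."""

    def __init__(self, balances):
        self.balances = dict(balances)

    def by_caller_transfer(self, ctx_caller, dest, value):
        # Pre-patch behaviour in this model: the value is debited from
        # whichever address the current context names as the caller.
        if ctx_caller not in self.balances or dest not in self.balances:
            raise KeyError("both accounts must be in this shard")
        if not 0 <= value <= self.balances[ctx_caller]:
            raise ValueError("invalid amount")
        self.balances[ctx_caller] -= value
        self.balances[dest] += value

    def async_call(self, contract, target, callback):
        if target not in self.balances:
            raise KeyError("target is not in this shard")
        # Same-shard async call: the callback runs right after the call ends.
        # Assumption: inside the callback, `target` appears as the caller.
        callback(self, caller=target, contract=contract)


def malicious_callback(shard, caller, contract):
    # Spend the "caller's" whole balance, i.e. the target contract's EGLD.
    shard.by_caller_transfer(caller, "attacker_wallet", shard.balances[caller])


shard = ToyShard({"swap_contract": 400_000, "malicious_contract": 0, "attacker_wallet": 0})
shard.async_call("malicious_contract", "swap_contract", malicious_callback)
```

The model also reflects the same-shard caveat: if the target is not in the attacker's shard, async_call raises and nothing moves. ESDT balances are left out entirely, since moving them would require a builtin function that the VM forbids in this situation.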
Bafflingly, these discoveries are discussed quite nonchalantly on the public Elrond Developers channel during the weekend, without raising any security or safety concerns. Since a patch is known to come to the network in 2 days, there is little attention to responsible disclosure and immediate action.
The Attack
The as-of-yet unknown attacker chooses Sunday night as the time of the attack, less than 36 hours before the window of opportunity closes.
He (or she/they?) targets what is probably the largest EGLD honeypot: the egld-esdt-swap contract. This is the contract that creates and swaps Wrapped EGLD (WEGLD) to and from EGLD. It is a very simple and secure contract: all users who need WEGLD must deposit an equal amount of EGLD in this contract, so the amount of EGLD stored here is equal to all the WEGLD out there.
There is one of these swap contracts deployed in each of the 3 regular shards, and the attacker targets all of them. To get to the funds, he creates wallets in all shards and deploys a copy of the malicious contract in each.
One of these contracts can be found here:
https://explorer.elrond.com/accounts/erd1qqqqqqqqqqqqqpgq8urd3mza45z6th63h52e00zpgvrc5zxll9cshfqmhg
We don’t have the original sources, but in principle it must have looked very similar to this:
At ~ 1 a.m. EEST the attack commences. He sends the transactions whose callbacks drain EGLD from the swap contracts, as follows:
- On Shard 0: 400k EGLD
- On Shard 1: 800k EGLD
- On Shard 2: 450k EGLD
Now, things are a little more complicated than that. If you are not interested in the technical description of the attack, feel free to skip over the next few paragraphs.
The attacker, in fact, deploys one additional contract besides the malicious one in each shard, presumably to hide the transactions or the malicious contract. This brings the total up to 6 contracts deployed. He also uses 2 different wallets in each shard, one for initiating the call, and one for getting the EGLD out.
The small initial funds for the attack all originate on the Binance hot wallet erd16x7le8dpkjsafgwjx0e5kw94evsqw039rwp42m2j9eesd88x8zzs75tzry.
Unfortunately the intra-shard transfer that drains the funds is currently not visible in the explorer, but a new update will soon make this available. This seems to have led to some confusion in the community as to how the attacker got the funds.
Now the attacker is trying to somehow get this trove out of the Elrond chain.
His plan is to use the Maiar Exchange to swap it to something that can be bridged away to Ethereum. The first step to doing this is wrapping the stolen EGLD, which, ironically, needs to take place in the very contracts he has just hacked. This time he uses the contracts’ functions properly, as designed, sending the stolen EGLD back to the contract and receiving fresh WEGLD instead. But because he used stolen EGLD to do this, the WEGLD he receives is no longer properly backed 1:1 with EGLD in the pool, so we can think of it as “hacked WEGLD” that will now be propagating through the network.
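A small ledger model makes the loss of backing concrete. The class and the numbers are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
class WrapContract:
    """Toy egld-esdt-swap ledger: EGLD held versus WEGLD in circulation."""

    def __init__(self, egld_held, wegld_supply):
        self.egld_held = egld_held
        self.wegld_supply = wegld_supply

    def wrap(self, amount):
        # Normal use: deposit EGLD, receive the same amount of new WEGLD.
        self.egld_held += amount
        self.wegld_supply += amount

    def steal(self, amount):
        # The exploit removes EGLD without burning any WEGLD.
        self.egld_held -= amount

    def unbacked(self):
        # WEGLD in circulation that has no EGLD behind it.
        return self.wegld_supply - self.egld_held


contract = WrapContract(egld_held=1_000, wegld_supply=1_000)
contract.steal(600)   # 400 EGLD now back 1,000 WEGLD
contract.wrap(600)    # the thief wraps the loot: 1,000 EGLD back 1,600 WEGLD
```

Wrapping the stolen EGLD does not change the size of the shortfall (600 in this example), but it turns the stolen coins into freshly minted WEGLD that can be sold on the exchange.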
He then converts a large part of it to USDC in the Maiar Exchange, but the liquidity pool, while large, is no match for such a gargantuan sum. He thus single-handedly plunges the EGLD price in the liquidity pool down to around $4, selling his hacked WEGLD at a huge loss. The arbitrage bots running on the LP at this hour make some serious profit, but don’t have the volume to get the price back up before the exchange is halted by the team.
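The price collapse is what one would expect from an automated market maker. The sketch below assumes a constant-product pool (x * y = k) without fees, which is a simplification and may differ from the actual Maiar Exchange contracts. The reserve sizes and the amount sold are hypothetical.

```python
def swap_wegld_for_usdc(wegld_reserve, usdc_reserve, wegld_in):
    """Constant-product swap (x * y = k), ignoring fees.

    Returns (usdc_out, new_wegld_reserve, new_usdc_reserve).
    """
    k = wegld_reserve * usdc_reserve
    new_wegld = wegld_reserve + wegld_in
    new_usdc = k / new_wegld
    return usdc_reserve - new_usdc, new_wegld, new_usdc


# Hypothetical pool: 100,000 WEGLD and 7,000,000 USDC, i.e. $70 per WEGLD.
usdc_out, wegld_after, usdc_after = swap_wegld_for_usdc(100_000, 7_000_000, 300_000)
spot_price_after = usdc_after / wegld_after   # marginal price after the sale
average_price = usdc_out / 300_000            # what the seller actually received
```

With these made-up numbers, selling three times the pool's WEGLD reserve pushes the pool price from $70 down to about $4.38, while the seller receives only $17.50 per WEGLD on average. The larger the sale relative to the pool, the worse the price for the seller.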
He doesn’t centralize the stolen funds in a single wallet, but instead continues to use the second wallet of each shard (see in the table) to interact with the Maiar Exchange in parallel. There is a $200k transaction limit on the bridge, and for some reason he splits the amount not only when using the bridge, but also when interacting with the exchange. This results in a great many interactions with the liquidity pools, too many to enumerate here, but curious readers can have a look at them in the explorer:
Shard 0: https://explorer.elrond.com/accounts/erd16syfkds2faezhqa7pn5n8fyjkst70l5qefpmc0r960467snlgycq4ww0rt
Shard 1: https://explorer.elrond.com/accounts/erd1cura2qq8skel5fsxrpxyysjkaw6durengtkencrezkw78y6y2zhscf854j
Shard 2: https://explorer.elrond.com/accounts/erd1yrf9qeuqkcjeh5c4xn628mags7cse4r9ra2p2ggmlgfqq3l3v6pqxfu950
Seeing that he crashed the EGLD price in the EGLD-USDC pool and that he is selling at a huge loss, he also converts some of the amount to UTK, some of which he also bridges over to Ethereum.
A technical note: the ETHUSDC and ETHUTK tokens that you see in the transactions are intermediate tokens used by the bridge. USDC and UTK, respectively, need to be converted to them before they can be bridged.
The total amount that he manages to bridge to Ethereum before the bridge is temporarily deactivated is around 4 million USDC and 1 million UTK. Another 1.4 million USDC is frozen in the bridge by the team, before the attacker gets a chance to move it out.
All this activity on the exchange and on the bridge soon starts setting off alarms, and the team quickly assembles to investigate the incident.
The Attack is Countered
The first priority is to limit the damage as much as possible. The exchange, the Wrapped EGLD swap and the bridge are immediately paused. Note that the activity on the bridge can only be suspended in exceptional, life and death circumstances, and with the agreement of the trusted bridge relayers.
The public API is kept on but the send function is turned off, so no transactions can come through it. The attacker is, however, not relying on the public gateway, so is undeterred by this.
Finally, many ESDTs have a freeze function if so configured at minting or creation. Note that all stable coins, for instance, have this function enabled across all chains, for obvious security reasons. Using it, the team is able to temporarily freeze all USDC and UTK transfers. EGLD on the other hand cannot be frozen, so the attacker is free to move it around.
In the meantime, the team is also in contact with all the major exchanges, passing around a blacklist, to prevent the attacker from extracting the EGLD there. It is also having discussions with law enforcement agencies to start several investigations.
Next is the main task of patching the vulnerability as soon as possible. It is clear that waiting for the regular patch at 6 p.m. EEST is unacceptable given the severity of the situation. An emergency patch is ready at 7:07 a.m. EEST and in a few minutes messages start flying around on the validator channels.
Since the team currently only controls a minority of the validator nodes, the broad participation of the validator community, via rough consensus and off-chain governance, is necessary and important even for such an emergency patch update. This means communicating the patch, proposal, and adoption. In principle, more than 66% of the nodes are necessary to safely enact a change, so the majority of the staking agencies need to be onboard with this.
It is decided to patch things in-place, in the current epoch. To avoid any complications during the upgrade, the community must act with utmost swiftness. The team has contingency plans for any possible event during the upgrade, but it is easier if it doesn’t come to that. Once at least ~67% of the nodes have upgraded, any possible danger is averted.
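The adoption threshold can be expressed as a simple check. The sketch assumes that "safe" means strictly more than two thirds of the nodes are running the patched version, which is one reading of the figures above. The function name is hypothetical.

```python
def upgrade_is_safe(upgraded_nodes, total_nodes):
    """True once strictly more than two thirds of the nodes run the patch."""
    if total_nodes <= 0 or not 0 <= upgraded_nodes <= total_nodes:
        raise ValueError("invalid node counts")
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return 3 * upgraded_nodes > 2 * total_nodes
```

For a hypothetical network of 3,200 nodes, the check first passes at 2,134 upgraded nodes and still fails at 2,133.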
Miraculously, at 7 a.m. EEST much of the validator community is awake and poised for action. The team performs the final tests for the build, and 20 minutes later the upgrade begins. It takes about half an hour for most of the network to be running the new version. It all goes smoothly and without incident or downtime for the network, much to everyone’s relief. The validators’ response is beyond reproach. Contracts are once again safe.
Immediate Aftermath
It is early Monday morning, the immediate crisis is averted, but the funds are still missing and the team is not sure if the morning patch is a complete fix, or there might be more hidden vulnerabilities.
Part of the team uses the day to test out further scenarios and think of further potential exploits. While no further vulnerabilities seem to hide in the code, the team still feels uneasy about the continued existence of the “executeOnDestContextByCaller” function. To quell any further suspicions, it is ultimately decided to have another, second patch in the same day, in which “executeOnDestContextByCaller” is removed completely. The benefits of having this function around no longer outweigh the risks. This patch comes on Monday evening, and this time the upgrade is done according to protocol, and relatively worry-free.
Another part of the team is investigating solutions to recover the funds. In the meantime, most of the funds have been recovered from the attackers, but since this is still an open investigation, due to legal reasons, we will follow up with a more detailed overview as soon as the investigation is closed.
The Dust Settles
By Tuesday morning most of the funds are recovered and the team begins to restore the state from before the attack.
First of all, the EGLD-USDC liquidity pool is still unbalanced. In order to fully recover what was lost, it is the Elrond team that needs to buy back the EGLD that was sold by the attacker and bring it back to the fair market price.
There is no whitelisting mechanism yet in the DEX, so the team needs to implement it first. This is the only way for the team to perform the swaps before reopening the contracts to the public. No shortcuts are acceptable: the new version of the LP contract is tested thoroughly for several hours before deployment.
The same goes for the EGLD-UTK pair: the team needs to sell the stolen UTK the same way it was bought by the attacker, in order to get back the exploited WEGLD.
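The whitelisting mechanism mentioned above boils down to a gate like the one below. This is a minimal sketch with hypothetical names, not the code of the actual liquidity pool contract.

```python
class LiquidityPool:
    """Toy pool state: paused to the public, open to whitelisted addresses."""

    def __init__(self, whitelist=()):
        self.paused = True
        self.whitelist = set(whitelist)

    def can_swap(self, caller):
        # While paused, only whitelisted callers may swap.
        # Once un-paused, everyone may.
        return (not self.paused) or caller in self.whitelist


pool = LiquidityPool(whitelist={"team_recovery_wallet"})
```

While the pool is paused, only the whitelisted recovery wallet can rebalance it. Un-pausing later restores access for everyone without touching the whitelist.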
Second, the Wrapped EGLD swap contract needs to be rebalanced. Under normal circumstances there is always a 1:1 ratio between EGLD in the contract and WEGLD, but because of the attack, this balance is broken. The excess WEGLD needs to be burned.
This is trickier than expected. The attacker converted much of the stolen EGLD into WEGLD, then dumped most of it in the exchange. Some of that WEGLD has been lost to fees and to the arbitrage bots that were operating at the time, so there is still excess WEGLD spread throughout the network.
The team manages to recover some funds from the hacker, more specifically the funds that were not yet bridged to Ethereum. The recovered funds are sent to the following address:
https://explorer.elrond.com/accounts/erd1pml9k2tsqsnvtmmalglt2su0sn3cguvr8e8jq0gy69zw2ldcej2qapml9a
Thus the recovery process can commence. The team starts by undoing the damage done on the liquidity pools and the WEGLD swap contracts.
By buying back WEGLD from the pools at the price the attacker sold it, some of the loss can be recovered. The team performs this in several smaller steps, until the EGLD price matches the current EGLD price on Binance ($68.5 at that time). Note that this price is a little lower than it was during the attack.
This does not, unfortunately, return to the team all the “exploited WEGLD” that has spread throughout the network. The team can only buy 728k WEGLD with the USDC and UTK returned by the attacker.
Now comes the time for the swap contracts. The attacker has stolen 1650k EGLD from the contracts, but only converted 917k EGLD to WEGLD. Some of this double-minted WEGLD has then been sold by the arbitrage bots after making profits. This is not so important, though. What is important is that there are 1650k more WEGLD than EGLD in contracts.
There are 2 ways of eliminating this difference: burning WEGLD or adding missing EGLD to the contracts.
The team does both:
a. The 728k WEGLD bought from the exchange are burned in the following transaction:
b. The remaining 922k EGLD are fed into the contracts directly, without minting any more WEGLD. This is done via a “rebalance” endpoint written specifically for this occasion. 200k of the EGLD used in this operation come from the Elrond Foundation reserve. The rebalancing transactions are as follows:
Finally, after many more rounds of verification and testing, the Maiar Exchange is un-paused and services resume normal operation on Wednesday, the 8th of June.
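The rebalancing figures quoted above can be cross-checked with a few lines of arithmetic. The amounts are taken from this report. The variable and function names are only illustrative.

```python
# EGLD drained from the swap contract in each shard.
stolen_per_shard = {0: 400_000, 1: 800_000, 2: 450_000}

STOLEN_EGLD = 1_650_000              # total gap between WEGLD and EGLD
WEGLD_BOUGHT_AND_BURNED = 728_000    # step a
EGLD_FED_DIRECTLY = 922_000          # step b
FROM_FOUNDATION_RESERVE = 200_000    # part of step b


def remaining_gap(deficit, burned, refilled):
    # Burning WEGLD shrinks the supply, adding EGLD grows the backing.
    # Both reduce the gap by the same amount.
    return deficit - burned - refilled


gap = remaining_gap(STOLEN_EGLD, WEGLD_BOUGHT_AND_BURNED, EGLD_FED_DIRECTLY)
egld_not_from_reserve = EGLD_FED_DIRECTLY - FROM_FOUNDATION_RESERVE
```

The per-shard amounts add up to 1650k, the two steps together close the gap completely, and 722k of the EGLD fed into the contracts comes from sources other than the Foundation reserve.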
Common Misconceptions
In the aftermath of the attack, the community was abuzz with theories and suppositions of what had happened. There were a few who understood the events correctly, but many more who succumbed to confusion and misinformation.
To dispel the most common misconceptions:
- “It was a DEX attack”. It wasn’t. The DEX worked as expected the entire time. Two of the liquidity pools got out of balance during the massive sell-off, but the DEX was not the target of this attack.
- “It was a liquidity issue/something similar to UST/LUNA”. No, this exploit has nothing in common with the recent collapse of the Terra ecosystem. At no point was this an under-collateralization problem; it was pure theft.
- “It was a bridge attack”. No, the bridge worked as expected, including while some of the funds were being bridged to Ethereum. It was then deactivated on purpose, to prevent more funds from leaving the ecosystem.
- “It was a smart contract attack”. The attacker did not exploit any vulnerability in smart contract code; the problem was in the VM. There was no bug in the egld-esdt-swap contract itself.
- “The attacker managed to mint EGLD out of thin air”. No, all the EGLD was stolen from the balance of the egld-esdt-swap smart contract directly. It was already there.
- “The attacker managed to mint WEGLD out of thin air”. This is partially true. While the attacker did not exploit WEGLD directly, by wrapping stolen EGLD he caused more WEGLD to be in circulation than there should have been, about 917k WEGLD. This is equivalent to illegally minting unbacked WEGLD. The team managed to recover this illegal WEGLD and burned it, so right now each 1 WEGLD token is once again backed by 1 EGLD.
- “The Elrond blockchain was turned off”. At no time was the blockchain turned off or inactive. Blocks and even some transactions kept being added during the entire time. Only the send transaction endpoint of the public API was turned off.
- “This was an «inside job»”. One of the most egregious pieces of misinformation is the theory that somebody from the team had done it. It is absurd to believe that someone working on a project for years would put said project in such peril, as well as risking their own personal freedom for some quick profit.
- “Black hat hacking is worth it”. Even if successfully pulling off a hack, it is very difficult to get the funds out. Even the most skilled hackers leave traces everywhere, all CEXes have KYCs in place and it is nigh-impossible to evade the law. If you find a software vulnerability, the smart thing to do is report it to the team. This will land you a nice bounty while keeping you on the right side of the law, free to explore the world outside of prison.
Lessons Learned, Steps Forward
Software is hard. Blockchain is hard. Writing a VM from scratch is also very hard, but worth it if you’re really after speed, security and scalability, and a meaningful impact on the world.
Things occasionally go wrong. This is obviously not something we need to accept, it is something we need to overcome. Decentralization is what we are all striving for, but it does not get along well with bugs and vulnerabilities.
We would not be able to keep the network as decentralized as it is without such a responsive and committed validator community. They really have shone in the moment of maximum challenge.
Creators of contracts and ESDT tokens can choose to grant themselves the power to temporarily stop said contracts and ESDT tokens in moments of crisis. In such unfortunate events, the importance and necessity of auxiliary precautions can be clearly understood. As the technology matures, we will see this power being gradually and securely relinquished via different forms of governance.
As for the code itself, there are 3 ways to ensure it works correctly: testing, time and mathematics. Testing is clearly not sufficient for complex and critical infrastructure. Time eventually reveals and irons out all hidden vulnerabilities, but it is a painful process and one full of pitfalls. The best approach is to think deeply about it, model it mathematically, simulate it and audit it, and this is something in which we need to invest more in the future. We already have partial models of the VM and a lot of tools to test the correctness of our code, we just need to get more out of them and to develop them further. Events like this one are an intense evaluation of our engineering standards and methodologies.
And so, this week starts a new chapter in the story of Elrond. We have decided to rethink not only our security architecture, but our entire philosophy of building software.
This week already, we have been outlining several fronts on which the fight to root out vulnerabilities will be fought. There are already several teams on the lookout for problems in the protocol, VM, services and the frontend. We have also been looking at which tools can bring the most comprehensive results the fastest, and have already started developing some of them.
The internal process of raising potential security concerns, surfacing problems, and processing them with utmost speed and thoroughness has also been revisited and simplified. The external process of responsible security disclosure, via sending an email to security@elrond.com and receiving a hefty bounty for any valid and meaningful bug, will be constantly brought to everyone’s attention in the community, to help reinforce this constructive behavior.
To say this has been a tough few days is an understatement. In the end, nothing of significance can be built without hardship. Since hardship will invariably come, we will do our best to prepare ourselves, fortify our resolve, and strengthen our response in the face of it.
After all this, the Maiar DEX is fully live. The APIs are fully live. Exchanges are fully live.
All funds are safe, all users are safe.
What stubborn strength and joy these words carry.
With these new lessons learned, we shall open a new chapter for Elrond.
Come build with us. Let’s fortify this network together.