A few weeks ago, I received two strange bug reports about Simple Gmail Notes. Bug reports for this extension are not rare; users complain about various things from time to time.
However, these reports were unusual: the users claimed their Gmail accounts had been temporarily locked by Google because of excessive network requests, and they believed this extension was the culprit.
I was shocked at first, as it was a very serious charge. I carefully inspected my code and checked every place that made network requests, but found nothing suspicious. According to my calculations, there was no way for the extension to perform endless network requests.
So after some further investigation, I concluded that the reports were probably just false alarms.
However, I got a couple more reports of locked accounts the following week. Then I knew something must be wrong. The problem was that I had absolutely no way to reproduce the bug, and I could not find a trace of it by code inspection.
On the other hand, this bug was so bad that it would heavily impact users’ daily work. Locking an account is possibly the worst thing an extension can do; it’s much worse than crashing the entire browser. At least after a crash you can uninstall the extension and get everything back to normal. I can’t imagine how I would react if I were locked out of my Gmail account. I would probably hate the developer until I die if their work denied me access to my Gmail.
Therefore, in the worst case, I planned to pull the extension from the market if I failed to solve this problem. But before that, I decided to use the second-worst approach on earth: an alert prompt and self-disabling. I set up a gatekeeper function that would be called before triggering any network request. Then I kept a counter of the number of calls for each corresponding function. If the extension detected a large number of network requests (20 requests within 1 minute), regardless of where they originated and what they were used for, it would show an alert message. The extension would then disable itself for 60 seconds, i.e. for the next minute the whole extension would not work.
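The gatekeeper idea can be sketched roughly as follows. This is a minimal illustration, not the extension's actual code: the class name `RequestGate`, its parameters, and the `onTrip` callback are all hypothetical, and it uses a single shared counter rather than one per calling function. It also assumes that the first 20 requests in a window are allowed and that the next one trips the gate.

```typescript
// Hypothetical gatekeeper: every network call asks `allow()` first.
// Names and structure are illustrative, not the extension's real code.
class RequestGate {
  private timestamps: number[] = []; // times (ms) of recently allowed requests
  private disabledUntil = 0;         // the gate stays closed until this time (ms)
  private readonly maxRequests: number;
  private readonly windowMs: number;
  private readonly cooldownMs: number;
  private readonly onTrip: () => void;

  constructor(
    maxRequests = 20,
    windowMs = 60_000,
    cooldownMs = 60_000,
    onTrip: () => void = () => {},
  ) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.cooldownMs = cooldownMs;
    this.onTrip = onTrip;
  }

  /** Returns true if a request may be sent at time `now` (in ms). */
  allow(now: number = Date.now()): boolean {
    // While self-disabled, refuse everything.
    if (now < this.disabledUntil) return false;

    // Keep only the requests that fall inside the sliding window.
    this.timestamps = this.timestamps.filter((t) => now - t < this.windowMs);

    // Too many recent requests: alert once and disable for the cooldown.
    if (this.timestamps.length >= this.maxRequests) {
      this.disabledUntil = now + this.cooldownMs;
      this.timestamps = [];
      this.onTrip();
      return false;
    }

    this.timestamps.push(now);
    return true;
  }
}
```

In an extension, `onTrip` might show the alert message, and every request helper would return early whenever `allow()` is false. Passing the time in as a parameter keeps the logic easy to test without waiting for real minutes to pass.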
It’s a very intrusive way to alert the user (in fact, probably the most intrusive way), but it’s still better than locking users out of their Gmail accounts, or simply pulling the extension off the market.
Before I released this new version (0.4.18), I suspected that there were some special versions of Gmail on the other side of the world that presented the DOM structure in a very different way, so that some logic in my code did not apply to them and caused all the problems. I also doubted that anyone would bother to send me bug reports, so I might still have no clue at the end of the day.
It turned out that I was totally wrong.
First, I got at least 5 pieces of feedback from users the very next day (Thank you!!!). The first user even typed the whole message into the bug report page, even though I had only asked for the last line.
Second, all users provided the same set of information, which solved half of the puzzle right away. The problem was definitely unrelated to the unknown DOM structures I had suspected; it was caused solely by extra data pulls.
The extension was trying to pull data every 2 seconds, which was not supposed to happen according to my calculations. I had built comprehensive logic into the pull calculation, and such endless pull requests were never supposed to happen. Yet apparently they were happening right there.
While I was wondering what had gone wrong, another amazing thing happened: I saw the alert message myself. My own Gmail account had never been locked out, but I did see that message. That was really shocking. However, after I refreshed the browser, the message never showed up again.
It was very subtle, but it did happen. So the question became: what could possibly confuse the system into miscalculating the DOM items? And why did it happen only once?
I reread all the previous bug reports and found that at least two of them mentioned that the alert message showed up after the laptop had been left asleep. What could have happened?
Finally, (I believe) I found the answer: the update itself triggered the problem. This is essentially what happened:
- I uploaded the new version (0.4.18) on July 26.
- Chrome on the user’s laptop performed an auto-update.
- The auto-update process killed the background thread of the extension, so the front-end script on the Gmail page could no longer communicate with the backend.
- All previous assumptions failed. The front-end script continuously pulled data and asked the background script to process it, but nothing was actually done on the background side, and no reply was ever sent back to the front-end script.
- Normally the problem would go away once the user refreshed the browser, but some users had already left the computer with the Gmail page open (which is not uncommon).
- The data pulls lasted the whole night, at 1 request every 2 seconds.
- Google detected the enormous volume of data pulls and locked the account.
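To get a feel for the volume, here is a rough calculation. The post only says "whole night", so the 8-hour duration below is an assumption, and the function name `requestsDuring` is made up for illustration.

```typescript
// Number of requests sent when polling once every `intervalSeconds` for `hours`.
function requestsDuring(hours: number, intervalSeconds: number): number {
  return Math.floor((hours * 3600) / intervalSeconds);
}

// Assumed 8-hour night at the 2-second interval described above.
const overnightRequests = requestsDuring(8, 2); // 8 * 3600 / 2 = 14400
```

Under that assumption, a single open tab would send over fourteen thousand requests before morning. One request every 2 seconds is also 30 requests per minute, which is above the 20-per-minute threshold that triggers the alert.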
In other words, version 0.4.18 not only reported the bug but also triggered it. That also explains why the bug was so difficult to detect:
- Every time the extension was updated, I would refresh the browser to make the new code take effect. The refresh itself hid the bug.
- The bug would appear once and only once after each update.
- I had assumed the background script would always have a network connection; after all, who would use Gmail without a network? Apparently this assumption was not always true.
(On second thought, it’s also possible that the background script was killed in some other way, e.g. by the operating system, but the rest of the story would be the same.)
Once the cause was found, the problem was relatively easy to reproduce. I just needed to manually update the extension (same code, different version number) and then look at the network requests in the Chrome console; the problem surfaced immediately.
The solution was simple too. I added a mechanism to detect whether the background script is still alive, and to raise a warning if it is found to be dead. Also, the extension now stops pulling data if the previous pull made no apparent difference to the web page. That prevents endless data pulls, whatever the underlying cause.
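The two safeguards might look something like the sketch below. None of these names come from the extension itself: `withTimeout`, `isBackgroundAlive`, and `PullGuard` are hypothetical, and the `ping` function is injected so the check does not assume any particular messaging API. In a Chrome extension it could, for example, wrap a `chrome.runtime.sendMessage` call, but that is an assumption about how the two sides talk to each other.

```typescript
// Hypothetical helper: rejects if `p` does not settle within `ms` milliseconds.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timed out")), ms);
    p.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Resolves to true only if the background side answers the ping in time.
async function isBackgroundAlive(
  ping: () => Promise<unknown>,
  timeoutMs = 1000,
): Promise<boolean> {
  try {
    await withTimeout(ping(), timeoutMs);
    return true;
  } catch {
    // No reply in time, a rejected promise, or a synchronous throw from `ping`.
    return false;
  }
}

// Stops further pulls as soon as one pull leaves the page unchanged,
// and resumes only after an explicit reset (e.g. when the user opens a new mail).
class PullGuard {
  private lastPullChangedPage = true;

  shouldPull(): boolean {
    return this.lastPullChangedPage;
  }

  record(changedPage: boolean): void {
    this.lastPullChangedPage = changedPage;
  }

  reset(): void {
    this.lastPullChangedPage = true;
  }
}
```

A polling loop would call `isBackgroundAlive` first and show a warning when it returns false, then pull only if `shouldPull()` is true, and finally call `record(...)` with whether the page actually changed. The guard is deliberately independent of the liveness check, so the pulling stops even when the cause is something other than a dead background script.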
The issue has not been reported again since version 0.4.19.
It’s probably one of the most fatal and yet subtle bugs I have ever seen.