Logging Twitter for Years with Tweepy, POSIX/UNIX Signals, Logrotate, Systemd, and S3
I’ve been running a Twitter listener for about 2 years and it’s done a
pretty good job of staying up and running during that time. In this
post I want to explain a bit about how it works. Using utilities
outside of Python like logrotate, systemd, and aws s3 makes the
Python code look very simple, but the interaction of all these pieces is a bit
complex. Also, even though the Python script is pretty short, there
are a few tricks.
First, let’s look at the Python script. The code is in the listen-and-log.py file. The first 19 lines are imports and environment variables:
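In outline, they look something like this (the names of the token variables, the log path, and the file mode here are illustrative rather than necessarily the exact ones in the file):

import gzip
import json
import os
import signal
import sys

import tweepy
# import urllib3  # was used to catch a connection error type (commented out)

# Contextual configuration comes in through environment variables
TWITTER_USER = os.environ["TWITTER_USER"]
CONSUMER_KEY = os.environ["CONSUMER_KEY"]
CONSUMER_SECRET = os.environ["CONSUMER_SECRET"]
ACCESS_TOKEN = os.environ["ACCESS_TOKEN"]
ACCESS_TOKEN_SECRET = os.environ["ACCESS_TOKEN_SECRET"]

# Open the gzip log file without a `with` block so the signal handler
# defined below can swap the handle out later
f = gzip.open("twitter.log.gz", "at")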
Tweepy is the library that actually connects to Twitter, the json library parses the JSON response, the os library is used for retrieving environment variables, the signal library is how logrotate signals the script when it’s time to rotate the logfile, sys is used for printing to stderr, and urllib3 is used to catch a type of error (it looks like I commented that exception out).
Then I get environment variables. Environment variables are a good
way to pass contextual configuration information to the script
(search for the 12-factor app, principle 3, for more information). The first
environment variable, TWITTER_USER, is the user whose friends we will
follow. In this case, I’m getting data from an account that I set up
with username abestockmon, i.e., a dummy account to monitor
stock news (“fintwit”). The following environment variables are
tokens provided by Twitter. To get this information from Twitter, you
need to get a developer account and create a project. Since I set
this up two years ago I don’t remember exactly what each of these does,
but you can see their usage later in the script.
Finally, the last line opens the log file. This is where the JSON from Twitter will be saved. It’s slightly different from the normal file opening in Python. First, it opens a gzip file, so everything that’s written will be compressed with gzip. Second, it doesn’t use the normal
with open(file) as f:
...
idiom. This is because we want the file handle to be available to the next function:
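The handler looks roughly like this (the function name handle_hup is illustrative; signum and frame are just the standard arguments Python passes to a signal handler):

def handle_hup(signum, frame):
    """Swap in a fresh logfile handle when logrotate sends SIGHUP."""
    global f
    f_old = f
    # logrotate has already renamed twitter.log.gz to twitter.log.gz.1,
    # so opening the same path creates a brand-new file
    f = gzip.open("twitter.log.gz", "at")
    f_old.close()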
This function is called when the HUP (hangup) signal is received.
Signals are kind of cool because, unlike ordinary Python function calls,
they can be triggered from outside the Python script and they interrupt
the script in the middle of its execution. So
the script is running, writing JSON from Twitter, and it receives the
HUP signal. Wherever the script is, it will run this function. It
takes the filehandle f, saves it to the f_old variable, opens
a new file, and closes the old file. Anything that was being written
will go to the old file, but anything that runs after this function
goes to the new file. It looks like these would be the same file,
but actually logrotate has changed the file name outside of the Python
script (the old file is renamed to twitter.log.gz.1).
The next chunk defines a stream listener class that simply writes the received JSON from Twitter to the logfile:
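Sketched out, and assuming the Tweepy 3.x StreamListener API that was current when I set this up, it looks something like this (the class name and the error handling are illustrative):

class LogListener(tweepy.StreamListener):
    """Write every raw JSON message from the stream straight to the logfile."""

    def on_data(self, data):
        f.write(data)
        return True  # keep the stream connected

    def on_error(self, status_code):
        print("stream error:", status_code, file=sys.stderr)
        return True  # don't disconnect on errors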
The following function takes the stream and a list of Twitter users and filters the stream based on this follow list:
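Something like this (the function name is mine; Tweepy’s filter(follow=...) wants user IDs as strings, and the real version may wrap the call in a retry loop, which is where the urllib3 exception mentioned earlier would come in):

def filter_stream(stream, follow_ids):
    """Restrict the stream to tweets involving the given user IDs."""
    stream.filter(follow=[str(user_id) for user_id in follow_ids])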
The main function below sets up the signal handler and the Twitter
credentials, gets the users that the abestockmon user follows,
creates the stream, and then runs filter on the stream:
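Putting the pieces above together, and again assuming the Tweepy 3.x API, main looks roughly like this:

def main():
    # Run the HUP handler whenever the process receives SIGHUP
    signal.signal(signal.SIGHUP, handle_hup)

    # Authenticate with the tokens from the environment
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    api = tweepy.API(auth)

    # IDs of everyone TWITTER_USER follows
    follow_ids = api.friends_ids(screen_name=TWITTER_USER)

    # Wire the listener into a stream and start filtering
    stream = tweepy.Stream(auth=api.auth, listener=LogListener())
    filter_stream(stream, follow_ids)


if __name__ == "__main__":
    main()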
If this running script did not receive any signal from outside, it would keep writing to its logfile, which would grow and grow. We can manually send a HUP signal like this:
kill -HUP $(pgrep -f listen-and-log.py)
This is what I did when I was testing the script, but logrotate does it automatically, and the log rotation is specified declaratively in the logrotate.conf file:
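A config along these lines does the job (the file paths, size threshold, rotation count, and S3 bucket name here are placeholders):

/home/ec2-user/twitter.log.gz {
    daily
    maxsize 500M
    rotate 3
    missingok
    nocompress
    postrotate
        kill -HUP $(pgrep -f listen-and-log.py)
        aws s3 cp /home/ec2-user/twitter.log.gz.1 s3://my-twitter-logs/
    endscript
}

nocompress is there because the script already writes gzip output, so there is no point having logrotate compress the rotated file a second time.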
This config file tells the logrotate UNIX/Linux command where the
logfile is, minimally how often to rotate it, what the maximum size
should be before it is rotated, and how many old log files are kept.
The postrotate block specifies what happens after the log is
rotated, i.e., after the twitter.log.gz file has been renamed to twitter.log.gz.1.
The main thing is that logrotate sends the Python script the
HUP signal and then copies the old log file to AWS S3. This
keeps logfiles from accumulating on the server by offloading them to
AWS S3.
Finally, every now and then there can be transient issues that will cause the script to fail and die. It could happen in the middle of the night and I wouldn’t know about it so I wanted to make sure that it restarted automatically. To do this, I created a Systemd service unit file:
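The unit file looks something like this (the script path, Python interpreter, and environment file location are placeholders):

[Unit]
Description=Twitter listener (listen-and-log.py)
After=network-online.target

[Service]
EnvironmentFile=/etc/twitter-listener.env
ExecStart=/usr/bin/python3 /home/ec2-user/listen-and-log.py
Restart=on-failure
RestartSec=2

[Install]
WantedBy=multi-user.target

Once the unit file is installed, systemctl enable --now starts the service and registers it to start at boot.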
A service unit file will make the script run as a daemon in the background. It reads an environment variable file, and it will restart the process after two seconds if it fails. Waiting two seconds is good because if you try to restart too quickly you could overload the system. The final line makes the script start automatically when the AWS Linux VM starts.
This looks simple but it took some work to get it working right. I learned this pattern from my old boss Kim Scheibel, then at Adly, now at Google (thanks Kim!). Back then we used Upstart instead of Systemd (Systemd was still relatively new at the time, although now it is the default).
This approach does lose data when the script restarts. In a more critical system, we would have multiple listeners running in different availability zones so that if a problem happened to one, then there would be another that would most likely still be running. This would require a deduplication phase and so it gets a lot more complex than what I’ve shown here.