Hello all
Some time back on r/learnprogramming, I asked if you would be interested in learning Python by writing a Reddit bot. I received an immense number of responses, so I'm posting it on r/python. It covers the process, practices and tools involved in writing a Reddit bot in Python.
Please post your feedback and questions in comments, and I'll be happy to answer them.
Requirements:
- You are somewhat familiar with programming in general, and have some idea of Python or another language (e.g. what variables are)
- You are on a Unix-like system with access to command line tools
- Familiarity with git and GitHub (not strictly required, but it is better if you are familiar)
- A Reddit account
Even if you don't meet these requirements...no need to worry!
Introduction
Writing a bot for Reddit is easier than you think, because Reddit provides a structured method to access its data via the Reddit API. There are amazing tools like PRAW which can help simplify the process. PRAW is a Python wrapper for Reddit, and it takes the pain out of the process of writing a bot.
And as I said earlier, Python is a great language that is easy for beginners to grasp. What is even more amazing is its community, which has built incredible open-source tools and libraries for almost anything. We will be using some of those, and you will realise how useful they are!
Important information - Please read
- While this post will cover a variety of programming topics, I highly encourage you to explore more. Go ahead and read the documentation of the libraries used, and think about how you can improve something in your project. I will possibly add more topics to this post.
- Please DO NOT CREATE SPAM by letting your bot run on all of Reddit before you have thoroughly tested it. r/test exists for you to do all sorts of testing with your bot...please use it. Refer to bottiquette and, keeping in mind the type of bot you create, comply with its guidelines.
The code and resources
I just wrote a bot for Reddit, which posts a textual explanation of the popular webcomic xkcd whenever it encounters an xkcd link. It is named explainxkcdbot. As of now, I have set my bot to run on r/test only, so you can test it all you want.
You can find the complete source code on my Github page - https://github.com/aydwi/explainxkcdbot
Here is the code for the bot - https://github.com/aydwi/explainxkcdbot/blob/master/explainxkcdbot.py
How to run the bot
The process is given here - https://github.com/aydwi/explainxkcdbot/blob/master/README.md
If you have any problems in running the bot, please let me know in the comments.
Let us begin
We will move forward step-by-step in the code, and see what is going on.
Section 1 - Importing libraries
https://gist.github.com/aydwi/e5e4f294b66adf1cb025a70a0392f847
This is the beginning of the program. Any line that begins with a # sign is a comment in Python, so it is not interpreted as code. Then we import the essential libraries required for our program. These libraries contain the methods (or functions) which we will use throughout our program. The syntax is pretty straightforward. Next we see 3 variables with some values assigned to them. More on them later.
Section 2 - Authenticating
https://gist.github.com/aydwi/b941d04d8128e415d6630e961cc97988
This is where the important things begin. Before proceeding, you should have a basic idea of what classes and methods are in Python. One more important concept is an object - it is an instance of a class, which holds the data variables declared in the class, and the member functions operate on these objects. Please leave a comment if you do not understand, and I will try to explain.
def authenticate(): defines a function which will try to authenticate the bot to Reddit. This is where PRAW comes into play, as it provides us the methods to do this. Remember the praw.ini file we created earlier - it is a configuration file that stores authentication credentials, so you can simply refer to them whenever needed.
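For reference, a praw.ini site section typically looks like the fragment below. The values are placeholders you must replace with your own app credentials, and 'explainbot' is the section name this bot assumes:

```ini
[explainbot]
client_id=YOUR_CLIENT_ID
client_secret=YOUR_CLIENT_SECRET
username=YOUR_BOT_USERNAME
password=YOUR_BOT_PASSWORD
```

Keep this file out of version control, since it contains your password and secret.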
Next, we pass authentication parameters to praw.Reddit(). Here explainbot refers to the credentials in the configuration file, and user_agent is a descriptive string about your script/bot. This returns a Reddit instance, which we assign to the variable reddit, and use this variable whenever we wish to authenticate.
Next up, we add a print statement to see who is getting authenticated, i.e. the username. The user class in PRAW provides methods for the currently authenticated user. We call the method me() on user, which we in turn access on the variable holding the Reddit instance - in our case reddit - so the statement becomes reddit.user.me(). We just use some string formatting in Python to print it.
Finally, we return the variable reddit. Now I can use this authenticate() function in my program by calling it whenever I need it, and it will give me the variable reddit, which stores the Reddit instance.
Important note 1: Why did I create a function? Because creating functions makes it easy to manage our code. We write separate functions for separate features of the program, and call them wherever needed. This breaks the code into manageable chunks, which makes it easy to read, modify and debug.
Section 3 - Scraping
https://gist.github.com/aydwi/699123424e7b38da19d62e69339e859c
Next up is the function def fetchdata(url). Notice how this function takes a parameter, url, which we pass whenever we call it. The purpose of this function is to scrape/gather the data from the web, in order to post an explanation.
Now, scraping is not a very reliable process, but sometimes we just have to do it. We go to a website, go through its HTML source, and extract the text, links etc. that we want. Here we use the Requests library to make a web request, and Beautiful Soup to extract the data.
r = requests.get(url) gives a Response object called r. We can get all the information we need from this object. Then we access the content attribute of r and pass it to soup = BeautifulSoup(r.content, 'html.parser'), which gives us a soup object through which we can retrieve elements of the web page. Read more here.
Now, what happens next depends on what you are scraping. Every website has a different type of content, and you should go through the HTML to see how you can retrieve what you want. In my case, http://explainxkcd.com has its explanation inside the first few <p> tags, so I look for the first <p> tag with tag = soup.find('p'). Also, immediately after the explanation ends, an <h2> tag follows, so I know where to stop. Now take a look here-
while True:
    if isinstance(tag, bs4.element.Tag):
        if tag.name == 'h2':
            break
        if tag.name == 'h3':
            tag = tag.nextSibling
        else:
            data = data + '\n' + tag.text
            tag = tag.nextSibling
    else:
        tag = tag.nextSibling
I continue to look through the tags, storing textual data in a string named data and moving to the next tag. I break out of the loop when I encounter <h2>, because I know that is where the explanation ends. The code depends on the structure of the website you are scraping. Finally, I return data, so that when I call the function as, for example, fetchdata('http://explainxkcd.com/1024'), it returns the explanation of comic 1024 as a string named data.
Important note 2: Scraping is unreliable. If the website changes its structure, you have to change your code. Even now, the page can include some tags which we don't want. The line if tag.name == 'h3' in the above function handles one such unexpected situation I encountered on http://explainxkcd.com. This is why we like to have APIs.
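To see the sibling walk in isolation, here is a self-contained sketch of the same idea run on a small inline HTML snippet instead of the live site (the snippet and its text are made up; bs4 must be installed, and I use next_sibling, the modern spelling of nextSibling):

```python
import bs4
from bs4 import BeautifulSoup

# A toy page with the same shape as an explainxkcd article:
# explanation in <p> tags, an <h3> to skip, and an <h2> marking the end.
html = """
<p>First paragraph of the explanation.</p>
<h3>Skipped heading</h3>
<p>Second paragraph.</p>
<h2>Discussion</h2>
<p>Text after the explanation, which we never reach.</p>
"""

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('p')   # start at the first <p>
data = ''
while True:
    if isinstance(tag, bs4.element.Tag):
        if tag.name == 'h2':      # end of the explanation
            break
        if tag.name == 'h3':      # unwanted heading: skip it
            tag = tag.next_sibling
        else:                     # collect the text
            data = data + '\n' + tag.text
            tag = tag.next_sibling
    else:                         # whitespace/text node between tags: skip
        tag = tag.next_sibling

print(data)
```

Running this collects only the two explanation paragraphs and stops at the <h2>.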
Section 4 - Using regular expressions, parsing URLs, handling exceptions and preventing duplicate submissions
https://gist.github.com/aydwi/700db3e46770d13fe89356d0fa4f570d
This function takes a parameter which is a Reddit instance (remember the Reddit instance from Section 2). Again we use PRAW to get one or more subreddits, and go through their comments. See here and here.
You can pass multiple subreddits like for comment in reddit.subreddit('test+learnprogramming+pics').
Regular Expressions: Now I want to extract xkcd links from the comments, so I need to look for a pattern for the URL. xkcd links are of the form https://www.xkcd.com/[some number]. So we make use of regular expressions from the re library to match the pattern. More about regular expressions.
match = re.findall("https://www.xkcd.com/[0-9]+", comment.body)
if match:
    print("Link found in comment with comment ID: " + comment.id)
    xkcd_url = match[0]
Here I use the method findall to look for all strings of the form https://www.xkcd.com/[some number] in the user's comment. If there is a match, findall returns a list of the matching URLs, and we store the first one in the variable xkcd_url.
Important note 3: Try to improve the regular expression (regex) used in this program. What happens if a user posts 'https://xkcd.com/1024', or the http version, 'http://www.xkcd.com/1024'? These are still valid xkcd URLs. There can also be cases where the URL is placed between a bunch of symbols in a comment. Try to modify the regex so that it detects these cases as well.
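As a starting point for that exercise, here is one possible broadened pattern (an illustration, not the bot's actual regex) that makes the scheme's "s" and the "www." optional, and escapes the dots so they only match literal dots:

```python
import re

# 'https?' matches http or https; '(?:www\.)?' makes www. optional
pattern = r"https?://(?:www\.)?xkcd\.com/[0-9]+"

comments = [
    'Relevant: https://www.xkcd.com/1024',
    'see http://xkcd.com/353 for this',
    'no link here',
]
for body in comments:
    match = re.findall(pattern, body)
    if match:
        print(match[0])
```

This prints the first two URLs and skips the third comment. You can extend it further, e.g. to tolerate surrounding punctuation.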
URL parsing: Now I have an xkcd URL. I have 2 immediate objectives now-
Objective 1. To extract the comic number at the end of the URL. We use urllib.parse for this purpose.
url_obj = urlparse(xkcd_url)
xkcd_id = int((url_obj.path.strip("/")))
Objective 2. Form a complete explainxkcd.com URL, which I will pass as a parameter to the function fetchdata(url) (see Section 3)
myurl = 'http://www.explainxkcd.com/wiki/index.php/' + str(xkcd_id)
Both objectives are thus complete.
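Walking through the two objectives with a concrete URL makes them easy to verify (comic 1024 is just an example):

```python
from urllib.parse import urlparse

xkcd_url = 'https://www.xkcd.com/1024'
url_obj = urlparse(xkcd_url)

# Objective 1: url_obj.path is '/1024'; strip the slashes and convert to int
xkcd_id = int(url_obj.path.strip("/"))
print(xkcd_id)   # 1024

# Objective 2: build the explainxkcd URL from the comic number
myurl = 'http://www.explainxkcd.com/wiki/index.php/' + str(xkcd_id)
print(myurl)     # http://www.explainxkcd.com/wiki/index.php/1024
```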
Preventing duplicate submissions and Exception handling: The bot needs to make sure it replies to every comment only once. There are many methods you can try which include using a separate key-value database like Memcached, but we will use a simpler approach.
Just create a text file (remember commented.txt), store the comment ID of a comment in it when you visit a relevant comment, and verify if a comment ID already exists in that file before making new comments. Remember to put the location of the text file in the path variable (from Section 1), so that you can pass it to the method open().
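Here is a minimal sketch of that commented.txt approach, using a temporary file so the example is self-contained (the comment IDs and the helper function names are made up for illustration):

```python
import os
import tempfile

# In the bot, path points at commented.txt; here we use a temp file
path = os.path.join(tempfile.mkdtemp(), 'commented.txt')
open(path, 'a+').close()   # make sure the file exists

def already_replied(comment_id):
    # splitlines() gives one comment ID per entry, so we get exact matches
    with open(path, 'r') as f:
        return comment_id in f.read().splitlines()

def record_reply(comment_id):
    with open(path, 'a+') as f:
        f.write(comment_id + '\n')

if not already_replied('dlz9xyz'):
    record_reply('dlz9xyz')           # first visit: remember it

print(already_replied('dlz9xyz'))     # True - we would not reply again
print(already_replied('dm0aaaa'))     # False - a new comment
```

A flat text file is fine at this scale; a key-value store only becomes worth it once the file grows large, since this approach re-reads the whole file on every check.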
Now, what happens when a user makes a comment like 'https://www.xkcd.com/689243348723'? This xkcd does not exist, but the program will extract '689243348723', form a link like http://www.explainxkcd.com/wiki/index.php/689243348723, and pass it to the function fetchdata(url). The function will then raise an exception and your program will stop/crash. You need to take care of these situations.
This is achieved using try, except and else blocks. See here. Here is the code for these 2 steps-
file_obj_r = open(path,'r')
try:
    explanation = fetchdata(myurl)
except:
    print('Exception!!! Possibly incorrect xkcd URL...\n')
    # Typical cause for this will be a URL for an xkcd that does not exist (Example: https://www.xkcd.com/772524318/)
else:
    if comment.id not in file_obj_r.read().splitlines():
        print('Link is unique...posting explanation\n')
        comment.reply(header + explanation + footer)
        file_obj_r.close()
        file_obj_w = open(path,'a+')
        file_obj_w.write(comment.id + '\n')
        file_obj_w.close()
    else:
        print('Already visited link...No reply needed\n')
Notice how I make a reply to a relevant comment by using the variables from Section 1 to print header and footer description alongside the explanation - comment.reply(header + explanation + footer)
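The same try/except/else control flow can be seen in miniature below, with a stand-in fetchdata() that raises for a non-existent comic (the stub and the URLs are illustrative, not the real scraper):

```python
def fetchdata(url):
    # Stand-in for the real scraper: raise for a comic that does not exist
    if url.endswith('689243348723'):
        raise ValueError('no such xkcd')
    return 'some explanation text'

results = []
for myurl in ['http://www.explainxkcd.com/wiki/index.php/1024',
              'http://www.explainxkcd.com/wiki/index.php/689243348723']:
    try:
        explanation = fetchdata(myurl)
    except Exception:
        # The bad URL lands here instead of crashing the program
        results.append('exception: skipped reply')
    else:
        # Only runs when fetchdata() succeeded
        results.append('got explanation: safe to reply')

print(results)
```

The else block only runs when no exception was raised, which is exactly why the reply logic lives there in the bot.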
Next, we add some sleep statements to stop the bot from querying Reddit too fast. If the Reddit API returns an error due to too many requests, adjust val in the instances of time.sleep(val) in your program.
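If you want to go a step beyond a fixed time.sleep(val), one common pattern is to retry with a growing delay. This helper is hypothetical (it is not part of the bot), and the flaky step below only simulates a rate-limit error:

```python
import time

def query_with_retry(step, base_delay=1):
    # Retry `step`, doubling the delay after each failure
    delay = base_delay
    while True:
        try:
            return step()
        except Exception:
            print('Too many requests...sleeping {} seconds'.format(delay))
            time.sleep(delay)
            delay *= 2   # back off a little more each time

attempts = []
def flaky_step():
    # Simulated API call that fails twice, then succeeds
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError('429 Too Many Requests')
    return 'ok'

result = query_with_retry(flaky_step, base_delay=0.01)
print(result)   # ok
```

In practice you would also cap the delay and give up after a few attempts, rather than retrying forever.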
Section 5: The main function
https://gist.github.com/aydwi/99323ebd710428f4590077a844236f83
We are almost done here. We wrap our functions into a main function by calling authenticate() (remember Section 2) and passing the result to the function which runs the bot, namely run_explainbot(reddit). Since we call it inside a while loop with the expression 'True', it will run indefinitely.
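The overall shape of main looks like the sketch below. Both functions here are stand-ins for the real ones, and the exit condition exists only so the sketch terminates (the real bot loops forever):

```python
def authenticate():
    # Stand-in for the real authenticate(), which returns a PRAW Reddit instance
    return object()

passes = []
def run_explainbot(reddit):
    # Stand-in for one pass of the real bot over the comment stream
    passes.append(1)

def main():
    reddit = authenticate()   # authenticate once, up front
    while True:               # the real bot loops indefinitely here
        run_explainbot(reddit)
        if len(passes) >= 3:  # exit added only so this sketch terminates
            break

main()
print(len(passes))   # 3
```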
My bot in action
Test post to show the bot in action: https://redd.it/6tey71 (Since I'm not running it continuously as of now, it won't reply to every comment containing an xkcd link there)
My Terminal emulator while running the bot: http://imgur.com/4TzEyor
Finally, if you want to take this project further, you are welcome to contribute code to my GitHub repository. You can fork it (please note that there are still some checks to be added in order to make the bot more Reddit friendly) or open a pull request. I'll try to add more details about the file handling and scraping aspects of the program once I get some time.
PLEASE: For all people looking to make a bot, please read bottiquette! https://www.reddit.com/wiki/bottiquette