Hello all
Some time back on r/learnprogramming, I asked if you would be interested in learning Python by writing a Reddit bot. I received an immense number of responses, so I'm posting it on r/python. It covers the process, practices and tools involved in writing a Reddit bot in Python.
Please post your feedback and questions in comments, and I'll be happy to answer them.
Requirements:
- You are somewhat familiar with programming in general, and have some idea of Python or another language (e.g. what variables are)
- You are on a Unix-like system with access to command line tools
- Familiarity with git and GitHub (not strictly required, but it is better if you are familiar)
- A Reddit account
Even if you don't meet these requirements...no need to worry!
Introduction
Writing a bot for Reddit is easier than you think, because Reddit provides a structured method to access its data via the Reddit API. There are amazing tools like PRAW which can help simplify the process. PRAW is a Python wrapper for Reddit, and it takes the pain out of the process of writing a bot.
And as I said earlier, Python is a great language that is easy for beginners to grasp. What is even more amazing is its community, which has built incredible open-source tools and libraries for almost anything. We will be using some of those, and you will realise how useful they are!
Important information - Please read
- While this post will cover a variety of programming topics, I highly encourage you to explore more. Go ahead and read the documentation of the libraries used, and think about how you can improve something in your project. I will possibly add more topics to this post.
- Please DO NOT CREATE SPAM by letting your bot run on all of Reddit before you have thoroughly tested it. r/test exists for you to do all sorts of testing with your bot...please use it. Refer to bottiquette and, keeping in mind the type of bot you create, comply with its guidelines.
The code and resources
I just wrote a bot for Reddit, which posts a textual explanation of the popular webcomic xkcd whenever it encounters an xkcd link. It is named explainxkcdbot. As of now, I have set my bot to run on r/test only, so you can test it all you want.
You can find the complete source code on my Github page - https://github.com/aydwi/explainxkcdbot
Here is the code for the bot - https://github.com/aydwi/explainxkcdbot/blob/master/explainxkcdbot.py
How to run the bot
The process is given here - https://github.com/aydwi/explainxkcdbot/blob/master/README.md
If you have any problems in running the bot, please let me know in the comments.
Let us begin
We will move forward step-by-step in the code, and see what is going on.
Section 1 - Importing libraries
https://gist.github.com/aydwi/e5e4f294b66adf1cb025a70a0392f847
This is the beginning of the program. Any line that begins with a # sign is a comment in Python, so it is not interpreted as code. Then we import the essential libraries required for our program. These libraries contain the methods (or functions) which we will use throughout our program. The syntax is pretty straightforward. Next we see 3 variables with some values assigned to them. More on them later.
Section 2 - Authenticating
https://gist.github.com/aydwi/b941d04d8128e415d6630e961cc97988
This is where the important things begin. Before proceeding, you should have a basic idea of what classes and methods are in Python. One more important concept is an object - it is an instance of a class, which holds the data variables declared in the class, and the member functions operate on these objects. Please leave a comment if you do not understand, and I will try to explain.
def authenticate(): defines a function which will try to authenticate the bot to Reddit. This is where PRAW comes into play, as it provides us the methods to do this. Remember the praw.ini file we created earlier - it is a configuration file that stores authentication credentials, so you can simply refer to them whenever needed.
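For reference, a praw.ini site section typically looks like the fragment below. The values are placeholders you must replace with your own app credentials, and 'explainbot' is the section name this bot assumes:

```ini
[explainbot]
client_id=YOUR_CLIENT_ID
client_secret=YOUR_CLIENT_SECRET
username=YOUR_BOT_USERNAME
password=YOUR_BOT_PASSWORD
```

Keep this file out of version control, since it contains your password and secret.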
Next, we pass authentication parameters to praw.Reddit(). Here explainbot refers to the credentials in the configuration file, and user_agent is a descriptive string about your script/bot. This returns a Reddit instance, which we assign to the variable reddit, and use this variable whenever we wish to authenticate.
Next up, we add a print statement to see who is getting authenticated, i.e. the username. The user class in PRAW provides methods for the currently authenticated user. We call the method me() on user, which we in turn access on the variable holding the Reddit instance - in our case reddit - so the statement becomes reddit.user.me(). We just use some string formatting in Python to print it.
Finally, we return the variable reddit. Now I can use this authenticate() function in my program by calling it whenever I need it, and it will give me the variable reddit, which stores the Reddit instance.
Important note 1: Why did I create a function? Because creating functions makes it easy to manage our code. We write separate functions for separate features of the program, and call them wherever needed. This breaks the code into manageable chunks, which makes it easy to read, modify and debug.
Section 3 - Scraping
https://gist.github.com/aydwi/699123424e7b38da19d62e69339e859c
Next up is the function def fetchdata(url). Notice how this function takes a parameter, url, which we pass whenever we call it. The purpose of this function is to scrape/gather the data from the web, in order to post an explanation.
Now, scraping is not a very reliable process, but sometimes we just have to do it. We go to a website, go through its HTML source, and extract the text, links etc. that we want. Here we use the Requests library to make a web request, and Beautiful Soup to extract the data.
r = requests.get(url) gives a Response object called r. We can get all the information we need from this object. Then we access the content attribute of r and pass it to soup = BeautifulSoup(r.content, 'html.parser'), which gives us a soup object through which we can retrieve elements of the web page. Read more here.
Now, what happens next depends on what you are scraping. Every website has a different type of content, and you should go through the HTML to see how you can retrieve what you want. In my case, http://explainxkcd.com has its explanation inside the first few <p> tags, so I look for the first <p> tag with tag = soup.find('p'). Also, immediately after the explanation ends, an <h2> tag follows, so I know where to stop. Now take a look here-
while True:
    if isinstance(tag, bs4.element.Tag):
        if tag.name == 'h2':
            break
        if tag.name == 'h3':
            tag = tag.nextSibling
        else:
            data = data + '\n' + tag.text
            tag = tag.nextSibling
    else:
        tag = tag.nextSibling
I continue to look through the tags, storing textual data in a string named data and moving to the next tag. I break out of the loop when I encounter <h2>, because I know that is where the explanation ends. The code depends on the structure of the website you are scraping. Finally, I return data, so that when I call the function as, for example, fetchdata('http://explainxkcd.com/1024'), it returns the explanation of comic 1024 as a string named data.
Important note 2: Scraping is unreliable. If the website changes its structure, you have to change your code. Even now, the page can include some tags which we don't want. The line if tag.name == 'h3' in the above function handles one such unexpected situation I encountered on http://explainxkcd.com. This is why we like to have APIs.
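To see the sibling walk in isolation, here is a self-contained sketch of the same idea run on a small inline HTML snippet instead of the live site (the snippet and its text are made up; bs4 must be installed, and I use next_sibling, the modern spelling of nextSibling):

```python
import bs4
from bs4 import BeautifulSoup

# A toy page with the same shape as an explainxkcd article:
# explanation in <p> tags, an <h3> to skip, and an <h2> marking the end.
html = """
<p>First paragraph of the explanation.</p>
<h3>Skipped heading</h3>
<p>Second paragraph.</p>
<h2>Discussion</h2>
<p>Text after the explanation, which we never reach.</p>
"""

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('p')   # start at the first <p>
data = ''
while True:
    if isinstance(tag, bs4.element.Tag):
        if tag.name == 'h2':      # end of the explanation
            break
        if tag.name == 'h3':      # unwanted heading: skip it
            tag = tag.next_sibling
        else:                     # collect the text
            data = data + '\n' + tag.text
            tag = tag.next_sibling
    else:                         # whitespace/text node between tags: skip
        tag = tag.next_sibling

print(data)
```

Running this collects only the two explanation paragraphs and stops at the <h2>.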
Section 4 - Using regular expressions, parsing URLs, handling exceptions and preventing duplicate submissions
https://gist.github.com/aydwi/700db3e46770d13fe89356d0fa4f570d
This function takes a parameter which is a Reddit instance (remember the Reddit instance from Section 2). Again we use PRAW to get one or more subreddits, and go through their comments. See here and here.
You can pass multiple subreddits like for comment in reddit.subreddit('test+learnprogramming+pics').
Regular Expressions: Now I want to extract xkcd links from the comments, so I need to look for a pattern for the URL. xkcd links are of the form https://www.xkcd.com/[some number]. So we make use of regular expressions from the re library to match the pattern. More about regular expressions.
match = re.findall("https://www.xkcd.com/[0-9]+", comment.body)
if match:
    print("Link found in comment with comment ID: " + comment.id)
    xkcd_url = match[0]
Here I use the method findall to look for all strings of the form https://www.xkcd.com/[some number] in the user's comment. If there is a match, findall returns a list of the matching URLs, and we store the first one in the variable xkcd_url.
Important note 3: Try to improve the regular expression (regex) used in this program. What happens if a user posts 'https://xkcd.com/1024', or the http version, 'http://www.xkcd.com/1024'? These are still valid xkcd URLs. There can also be cases where the URL is placed between a bunch of symbols in a comment. Try to modify the regex so that it detects these cases as well.
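As a starting point for that exercise, here is one possible broadened pattern (an illustration, not the bot's actual regex) that makes the scheme's "s" and the "www." optional, and escapes the dots so they only match literal dots:

```python
import re

# 'https?' matches http or https; '(?:www\.)?' makes www. optional
pattern = r"https?://(?:www\.)?xkcd\.com/[0-9]+"

comments = [
    'Relevant: https://www.xkcd.com/1024',
    'see http://xkcd.com/353 for this',
    'no link here',
]
for body in comments:
    match = re.findall(pattern, body)
    if match:
        print(match[0])
```

This prints the first two URLs and skips the third comment. You can extend it further, e.g. to tolerate surrounding punctuation.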
URL parsing: Now I have an xkcd URL. I have 2 immediate objectives now-
Objective 1. To extract the comic number at the end of the URL. We use urllib.parse for this purpose.
url_obj = urlparse(xkcd_url)
xkcd_id = int((url_obj.path.strip("/")))
Objective 2. Form a complete explainxkcd.com URL, which I will pass as a parameter to the function fetchdata(url) (see Section 3)
myurl = 'http://www.explainxkcd.com/wiki/index.php/' + str(xkcd_id)
Both objectives are thus complete.
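Walking through the two objectives with a concrete URL makes them easy to verify (comic 1024 is just an example):

```python
from urllib.parse import urlparse

xkcd_url = 'https://www.xkcd.com/1024'
url_obj = urlparse(xkcd_url)

# Objective 1: url_obj.path is '/1024'; strip the slashes and convert to int
xkcd_id = int(url_obj.path.strip("/"))
print(xkcd_id)   # 1024

# Objective 2: build the explainxkcd URL from the comic number
myurl = 'http://www.explainxkcd.com/wiki/index.php/' + str(xkcd_id)
print(myurl)     # http://www.explainxkcd.com/wiki/index.php/1024
```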
Preventing duplicate submissions and Exception handling: The bot needs to make sure it replies to every comment only once. There are many methods you can try which include using a separate key-value database like Memcached, but we will use a simpler approach.
Just create a text file (remember commented.txt), store the comment ID of a comment in it when you visit a relevant comment, and verify if a comment ID already exists in that file before making new comments. Remember to put the location of the text file in the path variable (from Section 1), so that you can pass it to the method open().
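Here is a minimal sketch of that commented.txt approach, using a temporary file so the example is self-contained (the comment IDs and the helper function names are made up for illustration):

```python
import os
import tempfile

# In the bot, path points at commented.txt; here we use a temp file
path = os.path.join(tempfile.mkdtemp(), 'commented.txt')
open(path, 'a+').close()   # make sure the file exists

def already_replied(comment_id):
    # splitlines() gives one comment ID per entry, so we get exact matches
    with open(path, 'r') as f:
        return comment_id in f.read().splitlines()

def record_reply(comment_id):
    with open(path, 'a+') as f:
        f.write(comment_id + '\n')

if not already_replied('dlz9xyz'):
    record_reply('dlz9xyz')           # first visit: remember it

print(already_replied('dlz9xyz'))     # True - we would not reply again
print(already_replied('dm0aaaa'))     # False - a new comment
```

A flat text file is fine at this scale; a key-value store only becomes worth it once the file grows large, since this approach re-reads the whole file on every check.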
Now, what happens when a user makes a comment like 'https://www.xkcd.com/689243348723'? This xkcd does not exist, but the program will extract '689243348723', form a link like http://www.explainxkcd.com/wiki/index.php/689243348723, and pass it to the function fetchdata(url). The function will then raise an exception and your program will stop/crash. You need to take care of these situations.
This is achieved using try, except and else blocks. See here. Here is the code for these 2 steps-
file_obj_r = open(path,'r')
try:
    explanation = fetchdata(myurl)
except:
    print('Exception!!! Possibly incorrect xkcd URL...\n')
    # Typical cause for this will be a URL for an xkcd that does not exist (Example: https://www.xkcd.com/772524318/)
else:
    if comment.id not in file_obj_r.read().splitlines():
        print('Link is unique...posting explanation\n')
        comment.reply(header + explanation + footer)
        file_obj_r.close()
        file_obj_w = open(path,'a+')
        file_obj_w.write(comment.id + '\n')
        file_obj_w.close()
    else:
        print('Already visited link...No reply needed\n')
Notice how I make a reply to a relevant comment by using the variables from Section 1 to print header and footer description alongside the explanation - comment.reply(header + explanation + footer)
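The same try/except/else control flow can be seen in miniature below, with a stand-in fetchdata() that raises for a non-existent comic (the stub and the URLs are illustrative, not the real scraper):

```python
def fetchdata(url):
    # Stand-in for the real scraper: raise for a comic that does not exist
    if url.endswith('689243348723'):
        raise ValueError('no such xkcd')
    return 'some explanation text'

results = []
for myurl in ['http://www.explainxkcd.com/wiki/index.php/1024',
              'http://www.explainxkcd.com/wiki/index.php/689243348723']:
    try:
        explanation = fetchdata(myurl)
    except Exception:
        # The bad URL lands here instead of crashing the program
        results.append('exception: skipped reply')
    else:
        # Only runs when fetchdata() succeeded
        results.append('got explanation: safe to reply')

print(results)
```

The else block only runs when no exception was raised, which is exactly why the reply logic lives there in the bot.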
Next, we add some sleep statements to stop the bot from querying Reddit too fast. If the Reddit API returns an error due to too many requests, adjust val in the instances of time.sleep(val) in your program.
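If you want to go a step beyond a fixed time.sleep(val), one common pattern is to retry with a growing delay. This helper is hypothetical (it is not part of the bot), and the flaky step below only simulates a rate-limit error:

```python
import time

def query_with_retry(step, base_delay=1):
    # Retry `step`, doubling the delay after each failure
    delay = base_delay
    while True:
        try:
            return step()
        except Exception:
            print('Too many requests...sleeping {} seconds'.format(delay))
            time.sleep(delay)
            delay *= 2   # back off a little more each time

attempts = []
def flaky_step():
    # Simulated API call that fails twice, then succeeds
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError('429 Too Many Requests')
    return 'ok'

result = query_with_retry(flaky_step, base_delay=0.01)
print(result)   # ok
```

In practice you would also cap the delay and give up after a few attempts, rather than retrying forever.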
Section 5: The main function
https://gist.github.com/aydwi/99323ebd710428f4590077a844236f83
We are almost done here. We wrap our functions into a main function by calling authenticate() (remember Section 2) and passing the result to the function which runs the bot, namely run_explainbot(reddit). Since we call it inside a while loop with the expression 'True', it will run indefinitely.
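The overall shape of main looks like the sketch below. Both functions here are stand-ins for the real ones, and the exit condition exists only so the sketch terminates (the real bot loops forever):

```python
def authenticate():
    # Stand-in for the real authenticate(), which returns a PRAW Reddit instance
    return object()

passes = []
def run_explainbot(reddit):
    # Stand-in for one pass of the real bot over the comment stream
    passes.append(1)

def main():
    reddit = authenticate()   # authenticate once, up front
    while True:               # the real bot loops indefinitely here
        run_explainbot(reddit)
        if len(passes) >= 3:  # exit added only so this sketch terminates
            break

main()
print(len(passes))   # 3
```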
My bot in action
Test post to show the bot in action: https://redd.it/6tey71 (Since I'm not running it continuously as of now, it won't reply to every comment containing an xkcd link there)
My Terminal emulator while running the bot: http://imgur.com/4TzEyor
Finally, if you want to take this project further, you are welcome to contribute code to my GitHub repository. You can fork it (please note that there are still some checks to be added in order to make the bot more Reddit friendly) or open a pull request. I'll try to add more details about the file handling and scraping aspects of the program once I get some time.
PLEASE: For all people looking to make a bot, please read bottiquette! https://www.reddit.com/wiki/bottiquette