## [t3_78uvdw](https://www.reddit.com/r/Python/comments/78uvdw/i_wrote_a_reddit_bot_in_python_a_few_weeks_back/)

Hello all! Some time back on r/learnprogramming, I asked if you were interested in learning Python by writing a Reddit bot. I received an immense number of responses, so I'm posting it on r/Python. It covers the process, practices and tools involved in writing a Reddit bot in Python. Please post your feedback and questions in the comments, and I'll be happy to answer them.

#**Requirements**

1. You are somewhat familiar with programming in general, and have an idea about Python or another language (e.g. what variables are)
2. You are on a Unix-like system with access to command-line tools
3. Familiarity with [git](https://git-scm.com/) and [Github](https://github.com/) (not strictly required, but it helps)
4. A Reddit account

Even if you don't meet these requirements, there's no need to worry!

#**Introduction**

Writing a bot for Reddit is easier than you think, because Reddit provides a structured method to access its data via the [Reddit API](https://www.reddit.com/dev/api). There are amazing tools like [PRAW](https://praw.readthedocs.io/en/latest/) which help simplify the process. PRAW is a Python wrapper for Reddit, and it takes the pain out of writing a bot. And as I said earlier, Python is a great language which is easy for beginners to grasp. What's even more amazing is its community, which has made incredible open-source tools and libraries for almost anything. We will be using some of those, and you will realise how useful they are!

#**Important information - Please read**

1. While this post will cover a variety of programming topics, I highly encourage you to explore more. Go ahead and read the documentation of the libraries used, and think about how you can improve something in your project. I will possibly add more topics to this post.
2. Please **DO NOT CREATE SPAM** by letting your bot run on all of Reddit before you have thoroughly tested it. **r/test** exists for you to do all sorts of testing with your bot, so please use it. Refer to [bottiquette](https://www.reddit.com/wiki/bottiquette) and, keeping in mind the type of bot you create, comply with the guidelines.

#**The code and resources**

I just wrote a bot for Reddit, which posts a textual explanation of the popular webcomic [xkcd](https://www.xkcd.com) whenever it encounters an xkcd link. It is named **explainxkcdbot**. As of now, I have set my bot to run on r/test only, to let you people test it all you want.

You can find the complete source code on my Github page - **https://github.com/aydwi/explainxkcdbot**

Here is the code for the bot - **https://github.com/aydwi/explainxkcdbot/blob/master/explainxkcdbot.py**

#**How to run the bot**

The process is given here - https://github.com/aydwi/explainxkcdbot/blob/master/README.md

If you have any problems running the bot, please let me know in the comments.

#**Let us begin**

We will move through the code step by step, and see what is going on.

##Section 1 - Importing libraries

https://gist.github.com/aydwi/e5e4f294b66adf1cb025a70a0392f847

This is the beginning of the program. Any line which begins with a `#` sign is a comment in Python, so it is not interpreted as code. Then we import the essential libraries required for our program. These libraries contain the methods (or functions) which we will use throughout our program. The syntax is pretty straightforward. Next we see 3 variables with some values assigned to them. More on them later.

##Section 2 - Authenticating

https://gist.github.com/aydwi/b941d04d8128e415d6630e961cc97988

This is where the important things begin. Before proceeding, you should have a basic idea of what **[classes](https://docs.python.org/3/tutorial/classes.html)** and **[methods](https://stackoverflow.com/questions/3786881/what-is-a-method-in-python)** are in Python.
One more important concept is an **object** - an instance of a class, which holds the data attributes declared in the class; the **member functions operate on these objects**. Please leave a comment if you do not understand, and I will try to explain.

`def authenticate():` defines a function which will try to authenticate the bot to Reddit. This is where PRAW comes into play, as it provides us methods to do this. Remember we created a *praw.ini* file earlier. It is a configuration file that stores authentication credentials, so that you can just refer to them whenever needed. Next, we pass authentication parameters to `praw.Reddit()`. Here `explainbot` refers to the credentials in the configuration file, and `user_agent` is a descriptive string about your script/bot. This returns a [Reddit instance](https://praw.readthedocs.io/en/latest/code_overview/reddit_instance.html), which we assign to the variable `reddit`, and we use this variable whenever we wish to make authenticated calls.

Next up, we add a print statement to see who is getting authenticated, i.e. the name. The `user` class in PRAW provides methods for the currently authenticated user. We call the method `me()` on `user`, which we in turn access on the variable holding the Reddit instance, which in our case is `reddit`. So the statement becomes `reddit.user.me()`. We just use some [formatting in Python](https://pyformat.info/) to print it. Finally, we `return` the variable `reddit`. Now I can use this `authenticate()` function in my program by calling it whenever I need it, and it will give me the variable `reddit`, which stores the Reddit instance.

------------------------------

**Important note 1:** Why did I create a function? Because creating functions makes it easier to manage our code. We write separate functions for separate features of the program, and call them wherever needed. This breaks the code into manageable chunks, which makes it easy to read, modify and debug.
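For reference, a *praw.ini* section for this bot might look like the sketch below. This is an assumption based on PRAW's standard configuration keys; the section name `explainbot` matches the one passed to `praw.Reddit()`, and the values are placeholders you get from your Reddit app settings:

```ini
[explainbot]
client_id=YOUR_CLIENT_ID
client_secret=YOUR_CLIENT_SECRET
username=YOUR_BOT_USERNAME
password=YOUR_BOT_PASSWORD
```

With this in place, `praw.Reddit('explainbot', user_agent='...')` picks up the credentials from the file, so they never need to appear in your source code.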
------------------------------

##Section 3 - Scraping

https://gist.github.com/aydwi/699123424e7b38da19d62e69339e859c

Next up is the function `def fetchdata(url)`. Notice how this function takes a parameter `url`. We pass this parameter whenever we call the function. The purpose of this function is to scrape/gather the data from the web, in order to post an explanation. **Now, scraping is not a very reliable process**, but sometimes we just have to do it. We go to a website, go through its HTML source, and extract the text, links, etc. that we want. Here we use the **[Requests](http://docs.python-requests.org/en/master/user/quickstart/#make-a-request)** library to make a web request, and **[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)** to extract the data.

`r = requests.get(url)` gives a Response object called `r`. We can get all the information we need from this object. Then we access the `content` attribute of `r`, and pass it to `soup = BeautifulSoup(r.content, 'html.parser')`, which gives us a `soup` object with which we can retrieve elements of a web page. Read more [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).

Now, what happens next depends on what you are scraping. Every website has a different type of content, and you should go through the HTML to see how you can retrieve what you want. In my case, http://explainxkcd.com has its explanation inside the first few `<p>` tags, so I look for the first `<p>` tag with `tag = soup.find('p')`. Also, immediately after the explanation ends, an `<h2>` tag follows, so I know where to stop. Now take a look here:

    while True:
        if isinstance(tag, bs4.element.Tag):
            if tag.name == 'h2':
                break
            if tag.name == 'h3':
                tag = tag.nextSibling
            else:
                data = data + '\n' + tag.text
                tag = tag.nextSibling
        else:
            tag = tag.nextSibling

I continue to look through the tags, store textual data in a string named `data`, and move to the next tag. I `break` out of the loop when I encounter an `<h2>` tag, because I know that is where the explanation ends. The code depends on the structure of the website you are scraping. Finally, I return `data`, so that when I call the function like `fetchdata('http://explainxkcd.com/1024')`, for example, it returns the explanation of comic 1024 as a string named `data`.

----------------

**Important note 2:** Scraping is unreliable. If the website changes its structure, you have to change your code. In this case, the page can still include some tags which we don't want. The line `if tag.name == 'h3':` in the above function handles one such unexpected situation I encountered on http://explainxkcd.com. This is why we like to have APIs.

--------------------------

##Section 4 - Using regular expressions, parsing URLs, handling exceptions and preventing duplicate submissions

https://gist.github.com/aydwi/700db3e46770d13fe89356d0fa4f570d

This function takes a parameter which is a Reddit instance (remember the Reddit instance from Section 2). Again we use PRAW to get one or more subreddits, and go through their comments. See [here](https://praw.readthedocs.io/en/latest/code_overview/reddit/subreddits.html) and [here](http://praw.readthedocs.io/en/latest/code_overview/models/comment.html). You can pass multiple subreddits like `reddit.subreddit('test+learnprogramming+pics')`.

**Regular expressions**: Now I want to extract xkcd links from the comments, so I need to look for a pattern in the URL. xkcd links are of the form `https://www.xkcd.com/[some number]`. So we make use of regular expressions from the `re` library to match the pattern. [More about regular expressions](https://regexone.com/references/python).
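To see the matching in isolation, here is a short self-contained sketch. The sample comment text is made up, and the second, widened pattern is just one possible answer to the note below about http:// and missing-www variants:

```python
import re

# A made-up comment body containing two styles of xkcd link
comment_body = "Relevant: https://www.xkcd.com/1024 and http://xkcd.com/353"

# The pattern used in the bot: only https://www.xkcd.com/... links
strict = re.findall("https://www.xkcd.com/[0-9]+", comment_body)

# A widened pattern that also accepts http:// and a missing www.
loose = re.findall(r"https?://(?:www\.)?xkcd\.com/[0-9]+", comment_body)

print(strict)  # ['https://www.xkcd.com/1024']
print(loose)   # ['https://www.xkcd.com/1024', 'http://xkcd.com/353']
```

Note that the groups in the widened pattern are non-capturing (`(?:...)`), so `findall` still returns the full matched URLs rather than the group contents.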
    match = re.findall("https://www.xkcd.com/[0-9]+", comment.body)
    if match:
        print("Link found in comment with comment ID: " + comment.id)
        xkcd_url = match[0]

Here I look for all substrings of the form `https://www.xkcd.com/[some number]` using the method `findall` on the [comment body](http://praw.readthedocs.io/en/latest/code_overview/models/comment.html). If there is a match, `findall` returns a list of the matching URLs, and we store the first one in the variable `xkcd_url`.

----------------

**Important note 3:** Try to improve the regular expressions (regex) in this program. What happens if a user posts 'https://xkcd.com/1024', or the http version, 'http://www.xkcd.com/1024'? These are still valid xkcd URLs. There can also be cases where the URL is placed between a bunch of symbols in a comment. Try to modify the regex statement so that it detects these cases as well.

--------------------------

**URL parsing**: Now I have an xkcd URL, and two immediate objectives.

Objective 1: Extract the comic number at the end of the URL. We use [urllib.parse](https://docs.python.org/3/library/urllib.parse.html) for this purpose.

    url_obj = urlparse(xkcd_url)
    xkcd_id = int(url_obj.path.strip("/"))

Objective 2: Form a complete **explainxkcd.com** URL, which I will pass as a parameter to the function `fetchdata(url)` (see Section 3).

    myurl = 'http://www.explainxkcd.com/wiki/index.php/' + str(xkcd_id)

Both objectives are thus complete.

**Preventing duplicate submissions and exception handling**: The bot needs to make sure it replies to every comment only once. There are many methods you can try, including a separate key-value database like [Memcached](https://memcached.org/), but we will use a simpler approach. Just create a text file (remember commented.txt), store the comment ID in it when you visit a relevant comment, and check whether a comment ID already exists in that file before making new comments.
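The bookkeeping just described can be sketched as below. The file location and the comment IDs here are made up for illustration; the real bot works with `comment.id` values from PRAW:

```python
import os
import tempfile

# Hypothetical location; the real bot keeps this in the `path` variable
path = os.path.join(tempfile.mkdtemp(), 'commented.txt')
open(path, 'a').close()            # make sure the file exists

def already_replied(comment_id):
    # True if this comment ID was recorded on a previous visit
    with open(path, 'r') as f:
        return comment_id in f.read().splitlines()

def record_reply(comment_id):
    # Append the comment ID, one per line, after replying
    with open(path, 'a') as f:
        f.write(comment_id + '\n')

if not already_replied('dqe4abc'):
    # comment.reply(header + explanation + footer) would go here
    record_reply('dqe4abc')

print(already_replied('dqe4abc'))  # True
print(already_replied('dqe9xyz'))  # False
```

This is slow for huge files (it rereads the whole file on every check), but for a small bot it keeps things simple and survives restarts.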
Remember to put the location of the text file in the `path` variable (from Section 1), so that you can pass it to the method `open()`.

Now, what happens when a user makes a comment like 'https://www.xkcd.com/689243348723'? This xkcd does not exist, but the program will extract '689243348723', form the link http://www.explainxkcd.com/wiki/index.php/689243348723, and pass it to the function `fetchdata(url)`. **The function will raise an exception and your program will stop/crash.** You need to take care of these situations. This is achieved using `try`, `except` and `else` blocks. [See here](https://docs.python.org/3/tutorial/errors.html)

Here is the code for these two steps:

    file_obj_r = open(path, 'r')
    try:
        explanation = fetchdata(myurl)
    except:
        # Typical cause: a URL for an xkcd that does not exist
        # (example: https://www.xkcd.com/772524318/)
        print('Exception!!! Possibly incorrect xkcd URL...\n')
    else:
        if comment.id not in file_obj_r.read().splitlines():
            print('Link is unique...posting explanation\n')
            comment.reply(header + explanation + footer)
            file_obj_r.close()
            file_obj_w = open(path, 'a+')
            file_obj_w.write(comment.id + '\n')
            file_obj_w.close()
        else:
            print('Already visited link...No reply needed\n')

Notice how I reply to a relevant comment using the variables from Section 1 to print a header and footer description alongside the explanation - `comment.reply(header + explanation + footer)`.

Next, we add some `sleep` statements to stop the bot from querying Reddit too fast. If the Reddit API returns an error due to too many requests, adjust `val` in the instances of `time.sleep(val)` in your program.

##Section 5: The main function

https://gist.github.com/aydwi/99323ebd710428f4590077a844236f83

We are almost done here. We wrap our functions into a `main` function by calling `authenticate()` (remember Section 2) and passing the result to the function which runs the bot, namely `run_explainbot(reddit)`.
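Putting it together, a runnable sketch of this wiring is below. The two stubs stand in for the real `authenticate()` and `run_explainbot()` (which need network access), and the `max_cycles` cap exists only so this demo terminates; the real bot has no such cap:

```python
import time

def authenticate():
    # Stand-in: the real function returns a PRAW Reddit instance
    return "fake-reddit-instance"

calls = []

def run_explainbot(reddit):
    # Stand-in: the real function scans comments and posts replies
    calls.append(reddit)

def main(max_cycles=3):
    reddit = authenticate()
    cycles = 0
    while True:
        run_explainbot(reddit)
        time.sleep(0.01)          # the real bot sleeps much longer
        cycles += 1
        if cycles >= max_cycles:  # demo-only exit; the bot omits this
            break

main()
print(len(calls))  # 3
```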
Since we call it inside a `while` loop with the expression `True`, it will run indefinitely.

#**My bot in action**

Test post to show the bot in action: https://redd.it/6tey71 (since I'm not running it continuously as of now, it won't reply to every comment containing an xkcd link there)

My terminal emulator while running the bot: http://imgur.com/4TzEyor

**Finally, if you want to take this project further, you are welcome to contribute code to [my Github repository](https://github.com/aydwi/explainxkcdbot). You can fork it (please note that there are still some checks to be added in order to make the bot more Reddit-friendly) or open a pull request.**

I'll try to add more details about the file-handling and scraping aspects of the program once I get some time.

---

submitted to [r/Python](https://www.reddit.com/r/Python) by [u/kindw](https://www.reddit.com/user/kindw)