Web Scraping from API Calls Using Python


Hello friends, how are you doing? Today I am going to talk about web scraping from API calls using Python. I will cover what web scraping is and how you can do it, and I will walk through a simple web scraping script that I wrote so you can get an idea of how to approach it.

What is Web Scraping?

So let’s talk about what web scraping really is. Web scraping is a technique for automatically accessing and extracting large amounts of information from a website, which can save a huge amount of time and effort.

Put simply, it means copying, or sometimes downloading, the content you want from a website. Say there is a website that hosts notes related to your subjects, and you are feeling too lazy to go to the site and click on each download one by one. Doing that manually is going to take a lot of time and may be annoying, so you just automate it.


In simple words, that pretty much covers it. The next question you might have is why you should go with Python in particular. The answer is that it is simple, it has many libraries available for web scraping, and the resulting code is neat.

If you wanted to write the same code in another language, it might take a few hundred lines.


And what is API doing in the title? For those who are not that familiar with APIs, here is what will get you started: API stands for Application Programming Interface.

An API is a software intermediary or a Web service that allows two applications to talk to each other. In other words, an API is a messenger that delivers your request to the provider that you’re requesting it from and then delivers the response back to you.

If you take a look at modern web applications, most of them use an API to retrieve and provide different content on their site depending on the user’s actions.

So if you can get an idea of how the API works, you can simply use the API directly instead of wasting your time understanding the whole site and working out how to extract the data from all the unnecessary junk around it.

For example, if you are looking for all the links that a site sends over an API, you don’t want to dig through the whole source of the web page.

View Source

You just want to see what the API request is sending and what the response to that request is, and you can use that to get the information you need. Only if the site is not using any API functionality will you have to go through this mess.

And if you are wondering how I am getting this beautified source of the website, I am using Quick Source Viewer in Chrome.

After all this boring stuff, let’s talk about how you can get scraping. What actually happens in the background is that you start with the site you want to get the information from, look at the source code of the page, and see what tags are there and what classes those tags are using.

This is important because you will use this information to extract the data from the site. For the sake of an example, say there is a site that lists movie names and their descriptions, and we want to get just the movie names and are not interested in the descriptions.

If we assume that the site structure is something like this:

<!DOCTYPE html>
<html>
<head>
<title>Movie list</title>
</head>
<body>
<div class="all-movies">
<span class="movie-name">Name of Movie</span>
<span class="movie-desc">Details of Movie</span>
</div>
<div class="random-stuff">
</div>
</body>
</html>

From this code we can see that there are two div tags. One of them holds the list of all the movies and their details, and its class is “all-movies”; the class of the other div tag is “random-stuff”.

In this case our target is the div tag with the class “all-movies”, which contains all the information related to movies; we are not concerned with the other div tags. Once the target tag is selected, we have to see which information inside it we want to extract. In this case we want the names of all the movies listed there.

The same thing applies here: there are two span tags, and only the span tag with a class of “movie-name” is what we are interested in. It doesn’t matter if there are a hundred other span tags.

So the idea is to find a unique identifier for what we want to extract, based on how it is referenced in the code of the website. The most common identifier for web elements is the class name of a tag.

Now that you have the basics of web scraping down, we can work through the example that I have prepared for you.
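To make the idea concrete, here is a minimal sketch of how the movie names could be pulled out of the sample page with BeautifulSoup4. The extra span inside “random-stuff” is not part of the sample above; I added it only to show that narrowing down to the target div first keeps unrelated tags out of the result.

```python
from bs4 import BeautifulSoup

# Sample page from above; the span inside "random-stuff" is a made-up
# addition to show that it gets ignored.
html = """
<html>
<body>
<div class="all-movies">
<span class="movie-name">Name of Movie</span>
<span class="movie-desc">Details of Movie</span>
</div>
<div class="random-stuff">
<span class="movie-name">Not a real listing</span>
</div>
</body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Step 1: narrow down to the target div by its class name
movies_div = soup.find("div", class_="all-movies")

# Step 2: inside that div, keep only the spans with class "movie-name"
movie_names = [
    span.get_text(strip=True)
    for span in movies_div.find_all("span", class_="movie-name")
]

print(movie_names)  # ['Name of Movie']
```

Searching inside the target div instead of the whole page is what keeps the lookup from picking up a “movie-name” span that happens to live somewhere else.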

Web Scraping from API Calls Using Python

If you haven’t guessed already from the above screenshot, I am a fan of anime and am currently watching One Piece. The thing is, I started this anime pretty late, and by that time 840 episodes were already out. The good thing about Horrible Subs is that they create batch download links for all the episodes of previously released seasons as a single downloadable link, either a torrent or a direct download.

Batch Download

That made it really easy for people like me to download them in a single shot. But after I was done with the first 700 episodes, there were still 140 episodes that I had to download one by one, and that’s a pain in the ass for someone as lazy as me.

So I decided to take a look at the site. After spending some time looking at the source, I didn’t find any download links in the page.


At that point I was done and took some rest, but the inner geek wouldn’t let me rest. I thought that if the links were not being loaded from within the post, then there must be some kind of API that is retrieving the list and embedding it in the site. So I started looking and, voilà, there was the API:

API Calls

So I opened the URL in another tab to see what it was returning.

List of Episodes

So I got a list of episodes. Nice, that’s progress, but there were only 12 episodes per page. If you look at the URL parameters, you can see that all the parameters before “nextid” identify the show, so nextid must point to the other pages that hold the rest of the episodes. That was pretty much it; now I had to code the script.

To recap, the information I had so far was:
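Here is a rough sketch of how the paging could be handled. The endpoint, the “show” parameter and its value are placeholders I made up for illustration, and I am assuming that nextid is a plain page counter that starts at 0 and goes up by one; check the real request in your browser’s network tab before relying on any of that.

```python
import math
from urllib.parse import urlencode

# Hypothetical endpoint and show parameter (not the real API URL)
API_BASE = "https://example.com/api.php"
SHOW_PARAMS = {"show": "one-piece"}


def pages_needed(episodes, per_page):
    """Number of pages required to cover all the episodes."""
    return math.ceil(episodes / per_page)


def page_urls(first_page, last_page):
    """Build one URL per page by changing only the nextid parameter."""
    urls = []
    for page in range(first_page, last_page + 1):
        params = dict(SHOW_PARAMS, nextid=page)
        urls.append(f"{API_BASE}?{urlencode(params)}")
    return urls


# 140 remaining episodes at 12 per page
total_pages = pages_needed(140, 12)
print(total_pages)  # 12

for url in page_urls(0, 2):
    print(url)
```

The point is that everything identifying the show stays fixed and only nextid changes from one request to the next.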

  • The URL I will be retrieving data from.

Next I had to see what data I was going to extract, so let’s look at the source:

API Response

If you look at the source of this API response, you can see that all the download links are in a div tag with the class “rls-links-container”. I decided to download all the episodes in the lowest quality, and those links are stored in a nested div with the class “rls-link link-480p”.

Now we are nearly done. We need the last target tag, the one that contains the actual magnet link to the torrent of the episode. That is stored in a span tag with the class “dl-type hs-magnet-link”, inside that div.

Magnet Link

And finally there is the torrent link, so all that was left for me to do was write a script that would get the magnet links from this source.

This is the code that I wrote, so let me explain it. There is a Python library called BeautifulSoup4 that is used for scraping, and I also imported the Requests module, which is used to send requests to the website. To keep it simple, I saved the output to a text file so I can manage the links.

After that, I simply created a variable and stored the API link to the episodes in it, sent a GET request, and stored the response in a variable called response. Then, using the power of BeautifulSoup4, I parsed the HTML response and stored the result in another variable.

As I already knew from the source which div class I was looking for, I ran a for loop over the parsed response to find all the div tags that have that class.

After that, I wanted the span tag that actually contains the magnet link, so I looked for the span tag with that class. As you can see in the screenshot, within that span tag there is an “a” tag that references the actual magnet link. So the next line does precisely that: it looks for the “a” tag in the span tag and writes its href value to the file.

And that’s it: you have the list of all the magnet links stored in the file, ready to be imported into your favorite torrent client and downloaded.
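The steps described above can be sketched roughly like this. This is a reconstruction from the description, not necessarily the exact script; the function names, the output file name, the sample HTML fragment and the “link-720p” class in it are all assumptions made for illustration. Only the class names “rls-links-container”, “rls-link link-480p” and “dl-type hs-magnet-link” come from the API response discussed earlier.

```python
import requests
from bs4 import BeautifulSoup


def extract_magnet_links(html):
    """Return the magnet link of every 480p release found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # Each release keeps its download links in a container div
    for container in soup.find_all("div", class_="rls-links-container"):
        # Narrow down to the 480p block inside the container
        block = container.find("div", class_="link-480p")
        if block is None:
            continue
        # The span with the magnet-link class wraps the actual <a> tag
        span = block.find("span", class_="hs-magnet-link")
        if span is None or span.a is None:
            continue
        href = span.a.get("href")
        if href:
            links.append(href)
    return links


def save_links(api_url, out_path="links.txt"):
    """Fetch one API page and append its magnet links to a text file."""
    response = requests.get(api_url, timeout=30)
    response.raise_for_status()
    links = extract_magnet_links(response.text)
    with open(out_path, "a", encoding="utf-8") as f:
        for link in links:
            f.write(link + "\n")
    return links


# Made-up fragment shaped like the response described above
sample = """
<div class="rls-links-container">
  <div class="rls-link link-480p">
    <span class="dl-type hs-magnet-link"><a href="magnet:?xt=urn:btih:EXAMPLE480">Magnet</a></span>
  </div>
  <div class="rls-link link-720p">
    <span class="dl-type hs-magnet-link"><a href="magnet:?xt=urn:btih:EXAMPLE720">Magnet</a></span>
  </div>
</div>
"""

print(extract_magnet_links(sample))  # ['magnet:?xt=urn:btih:EXAMPLE480']
```

Calling save_links once per page URL would collect everything into a single text file. Keeping the parsing in its own function means you can try it on a saved copy of the response without hitting the site every time.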

Links extracted

 


And that’s how you can get into web scraping. This is just a basic overview of what web scraping is, and you can learn more from here. If you liked the article and learned something new and useful from it, do share it with your friends, and comment down below if you have any questions.