Web scraping is a technique used to extract data from websites. It allows the gathering of information from web pages for various purposes like data analysis, research, or building applications. In this article, we’ll explore a Python project called “GitHub Topics Scraper” that uses web scraping to extract information from the GitHub topics page and retrieve repository names and details for each topic.
GitHub is a popular platform for hosting and collaborating on code repositories. It has a feature called “topics” that lets users categorize repositories based on specific subjects or themes. The GitHub Topics Scraper automates the process of scraping these topics and retrieving relevant repository information.
The GitHub Topics Scraper is implemented using Python and utilizes the following libraries:
1. requests: This library is used to make HTTP requests and retrieve the HTML content of web pages.
2. BeautifulSoup: This powerful library is used to parse HTML and extract data from it.
3. pandas: This versatile library is used for data manipulation and analysis. It helps organize the scraped data into a structured format.
Let’s dive into the code and understand how each component of the project works.
**Code Snippet 1: Topic Page Authentication**
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd


def topic_page_authentication(url):
    topics_url = url
    # Download the page with a plain HTTP GET request
    response = requests.get(topics_url)
    page_content = response.text
    # Parse the HTML into a navigable BeautifulSoup document
    doc = BeautifulSoup(page_content, 'html.parser')
    return doc
```
The above code defines a function called `topic_page_authentication` that takes a URL as an argument and returns the parsed HTML of that page. Despite its name, the function performs no authentication: it sends an HTTP GET request with the `requests` library, reads the response body as text, and parses it with BeautifulSoup to create a navigable object representing the HTML structure.
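The function above parses whatever the server returns, including an error page. As a minimal sketch of a more defensive variant, the hypothetical helper `fetch_page` below (not part of the original project) adds a timeout and raises on HTTP error codes; the default of 10 seconds is an arbitrary assumption.

```python
import requests
from bs4 import BeautifulSoup


def fetch_page(url, timeout=10):
    """Download a page and return it parsed; raise if the request fails.

    Hypothetical helper for illustration; the timeout value is an assumption.
    """
    # Give up instead of waiting indefinitely on a slow server
    response = requests.get(url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return BeautifulSoup(response.text, 'html.parser')
```

Called as `fetch_page('https://github.com/topics')`, this should return the same kind of BeautifulSoup object as `topic_page_authentication`, but fail loudly when the page cannot be retrieved.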
**Code Snippet 2: Topic Scraper**
```python
def topic_scraper(doc):
    # Topic titles are <p> tags carrying this exact class string
    title_class = 'f3 lh-condensed mb-0 mt-1 Link--primary'
    topic_title_tags = doc.find_all('p', {'class': title_class})

    # Topic descriptions are <p> tags with a different class string
    description_class = 'f5 color-fg-muted mb-0 mt-1'
    topic_desc_tags = doc.find_all('p', {'class': description_class})

    # Each topic card is wrapped in an <a> tag holding a relative link
    link_class = 'no-underline flex-1 d-flex flex-column'
    topic_link_tags = doc.find_all('a', {'class': link_class})

    topic_titles = [tag.text for tag in topic_title_tags]
    topic_descriptions = [tag.text.strip() for tag in topic_desc_tags]

    # Turn the relative hrefs into absolute URLs
    base_url = "https://github.com"
    topic_urls = [base_url + tag['href'] for tag in topic_link_tags]

    topics_dict = {'Title': topic_titles, 'Description': topic_descriptions, 'URL': topic_urls}
    topics_df = pd.DataFrame(topics_dict)
    return topics_df
```
The above code defines a function called `topic_scraper` that takes a BeautifulSoup object (`doc`) as an argument. It locates the topic titles, descriptions, and links by the CSS class strings of their HTML elements, converts each relative link into a full URL, and stores the three lists as the columns of a pandas DataFrame for further analysis or processing.
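Because the lookups use the full class string, they only match elements whose `class` attribute is exactly that string. The sketch below illustrates this on a small, made-up HTML fragment (the markup and the `extra` class are assumptions for demonstration, not copied from GitHub) and compares it with matching on a single class.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one topic card whose title paragraph carries an
# additional class ("extra") on top of the classes the scraper expects.
SAMPLE_HTML = """
<a class="no-underline flex-1 d-flex flex-column" href="/topics/3d">
  <p class="f3 lh-condensed mb-0 mt-1 Link--primary extra">3D</p>
  <p class="f5 color-fg-muted mb-0 mt-1">3D modeling topic.</p>
</a>
"""

doc = BeautifulSoup(SAMPLE_HTML, 'html.parser')

# A multi-class string only matches when the attribute is exactly that string
exact = doc.find_all('p', {'class': 'f3 lh-condensed mb-0 mt-1 Link--primary'})

# A single class name matches any element that has it among its classes
loose = doc.find_all('p', class_='Link--primary')

print(len(exact), len(loose))
```

If the scraper ever returns empty columns, a changed class list on the page is one plausible cause, and matching on one distinguishing class may be a more tolerant choice.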
**Code Snippet 3: Topic URL Extractor**
```python
def topic_url_extractor(dataframe):
    # Return the 'URL' column as a plain Python list, in row order
    url_lst = dataframe['URL'].tolist()
    return url_lst
```
The above code defines a function called `topic_url_extractor` that takes a pandas DataFrame (`dataframe`) as an argument. It reads the 'URL' column of the DataFrame and returns its values, one per row and in row order, as a plain Python list.
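As a small illustration of pulling a column out as a list, the sketch below builds a hypothetical two-row DataFrame (the rows are made up for the example) and filters it before extracting the URLs. It uses `Series.tolist()`, which returns values in row order regardless of the index labels, whereas a label lookup such as `series[0]` depends on the index still starting at 0.

```python
import pandas as pd

# Hypothetical sample data with the same column names topic_scraper produces
topics_df = pd.DataFrame({
    'Title': ['3D', 'Ajax'],
    'URL': ['https://github.com/topics/3d', 'https://github.com/topics/ajax'],
})

# After filtering, the remaining row keeps its original index label (1)
filtered = topics_df[topics_df['Title'] == 'Ajax']

# tolist() ignores index labels and simply returns the column's values
urls = filtered['URL'].tolist()
print(urls)
```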
**Code Snippet 4: Parse Star Count**
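```python
def parse_star_count(stars_str):
    # Trim whitespace, then drop the first six characters that precede the number
    stars_str = stars_str.strip()[6:]
    # A trailing 'k' means thousands, e.g. '1.2k' -> 1200
    if stars_str[-1] == 'k':
        # round() avoids truncating a product that lands just below a whole number
        return round(float(stars_str[:-1]) * 1000)
    return int(stars_str)
```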
The above code defines a function called `parse_star_count` that takes a string (`stars_str`) as an argument and converts the star count it contains to an integer. It removes leading and trailing whitespace and then discards the first six characters, which presumably hold a fixed label that precedes the number in the scraped text. It then checks whether the last character is 'k', indicating that the count is given in thousands. If it is, the function removes the 'k', converts the remainder to a number, and multiplies it by 1000; otherwise it converts the string to an integer directly.
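Slicing off a fixed number of characters ties the function to one exact text layout. As an alternative sketch, the hypothetical `parse_star_count_flexible` below (not part of the original project) uses a regular expression to find the number wherever it appears; the sample inputs in the usage note are assumptions about what the scraped text might look like.

```python
import re

# A number with an optional decimal part, optionally followed by 'k' or 'K'
_STAR_RE = re.compile(r'(\d+(?:\.\d+)?)(k?)', re.IGNORECASE)


def parse_star_count_flexible(stars_str):
    """Extract a star count such as '950' or '85.2k' from surrounding text."""
    # Remove thousands separators before searching, e.g. '1,234' -> '1234'
    match = _STAR_RE.search(stars_str.replace(',', ''))
    if match is None:
        raise ValueError(f"no star count found in {stars_str!r}")
    number, suffix = match.groups()
    value = float(number)
    if suffix:
        # A 'k' suffix means the number is expressed in thousands
        value *= 1000
    # round() returns an int and avoids truncating values like 1199.999...
    return round(value)
```

With this version, inputs such as `'Star 85.2k'` and `'  950  '` would parse to `85200` and `950` respectively, and text with no digits raises a `ValueError` rather than failing on an index lookup.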
These functions are part of the GitHub Topics Scraper project, which uses web scraping to extract information from GitHub topics pages. The parsed data can be used for various purposes such as analysis, research, or building applications.