Ultimate Guide to Proxy Scraping and Checking Using Python: A Comprehensive Tutorial
In this article, we will walk you through creating a robust Python script to scrape, check, and manage proxies. The script uses a range of Python modules and techniques to fetch proxy lists from websites, check whether proxies are working, and organize them efficiently. We will also show how to handle user interaction through a menu-driven interface. Let's dive into the script and explore its components step by step.
Prerequisites
Before proceeding, make sure you have the following Python libraries installed:
```bash
pip install aiohttp requests beautifulsoup4 fake-headers tabulate cryptography colorama pillow
```
(asyncio ships with the Python standard library, so it does not need to be installed separately.)
Script Overview
The script contains several functions, each designed to handle a specific aspect of the proxy scraping and checking process. It follows an organized structure to:
- Scrape proxies from websites.
- Check the validity of the proxies.
- Organize and backup good proxies.
- Provide a user-friendly interface for selecting actions.
Importing Required Libraries
Here's the list of imported modules in the script:
```python
import re
import aiohttp
import asyncio
import requests
from cryptography.fernet import Fernet
from datetime import datetime, timedelta
from bs4 import BeautifulSoup
from fake_headers import Headers
from colorama import Fore, init
import os
import sys
from tkinter import messagebox, simpledialog
from PIL import Image, ImageTk
from concurrent.futures import ThreadPoolExecutor, as_completed
from time import sleep
```
These modules help with various functionalities like HTTP requests, parsing HTML, encrypting data, handling concurrency, and creating user interfaces.
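For example, Fernet from the cryptography package provides the encryption functionality. Here is a minimal sketch of how a good-proxy file could be encrypted with it; the file names and key handling are illustrative, not the script's exact implementation:
```python
# Illustrative Fernet usage: encrypt a proxy file with a freshly generated key.
key = Fernet.generate_key()        # in practice, store this key somewhere safe
fernet = Fernet(key)

with open('GoodProxy_temp.txt', 'rb') as f:
    encrypted = fernet.encrypt(f.read())

with open('GoodProxy_temp.txt.enc', 'wb') as f:
    f.write(encrypted)
```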
Scraping Proxies
The scrape_proxies() function gathers the proxy list pages from a website. It fetches the main page, collects the links to the free-proxy sub-pages, and returns the URLs from which the IP addresses are later extracted:
```python
async def scrape_proxies():
    url = 'https://www.my-proxy.com/free-proxy-list.html'
    async with aiohttp.ClientSession() as session:
        page = await fetch_url(session, url)
        if page:
            soup = BeautifulSoup(page, 'html.parser')
            links = soup.find_all('a', href=True)
            # Keep only the "free proxy" sub-pages, plus the main page itself
            urls = ['https://www.my-proxy.com/' + link['href'] for link in links if 'free' in link['href']]
            urls.append(url)
            return urls
    # Return an empty list if the main page could not be fetched
    return []

# Function to fetch and parse a single URL
async def fetch_url(session, url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    try:
        async with session.get(url, headers=headers, ssl=False) as response:
            if response.status == 200:
                return await response.text()
            else:
                print(f"Failed to retrieve URL: {url} with status {response.status}")
    except Exception as e:
        print(f"Error fetching URL {url}: {e}")
    return None
```
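The snippet above only collects the sub-page URLs; the IP addresses themselves are extracted once each of those pages has been fetched. A minimal sketch of that step follows; the extract_proxies name and the ip:port regular expression are illustrative, not the original script's exact code:
```python
# Illustrative helper: fetch each proxy page and pull out ip:port pairs.
async def extract_proxies(session, urls):
    proxies = set()
    ip_port = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}:\d{2,5}\b')
    for url in urls:
        page = await fetch_url(session, url)
        if page:
            proxies.update(ip_port.findall(page))
    return sorted(proxies)
```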
Checking Proxy Validity
The script can asynchronously check proxies to determine if they are functional. It connects to each proxy using aiohttp
and tests if the proxy responds correctly.
```python
async def check_proxy(session, proxy_types, proxy):
    global cancel_all, good_proxies_count
    for proxy_type in proxy_types:
        if cancel_all:
            return None
        try:
            proxy_dict = {
                "http": f"{proxy_type}://{proxy}",
                "https": f"{proxy_type}://{proxy}",
            }
            header = Headers(headers=False).generate()
            async with session.get('https://ipapi.com/',
                                   headers={'User-Agent': header['User-Agent']},
                                   proxy=proxy_dict['http'],
                                   timeout=10) as response:
                if response.status == 200:
                    # Record the working proxy and its type
                    with open('GoodProxy_temp.txt', 'a') as f:
                        f.write(f'{proxy} | {proxy_type}\n')
                    good_proxies_count += 1
                    print(Fore.GREEN + f'{proxy} is working as {proxy_type}' + Fore.RESET)
                    if stop_threshold and good_proxies_count >= stop_threshold:
                        cancel_all = True
                        print(Fore.YELLOW + "Reached desired number of good proxies." + Fore.RESET)
                    return proxy
        except Exception:
            continue
    return None
```
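The article does not show how check_proxy is driven over a whole proxy list. One straightforward way is to open a single aiohttp session and schedule all checks with asyncio.gather; the sketch below assumes the globals that check_proxy expects, and the check_all name and proxy types are illustrative rather than the original script's code:
```python
# Sketch: run check_proxy concurrently over a list of proxies.
# cancel_all, good_proxies_count and stop_threshold are the globals
# that check_proxy reads and updates.
cancel_all = False
good_proxies_count = 0
stop_threshold = 50   # stop once 50 working proxies are found, for example

async def check_all(proxies, proxy_types=('http', 'https')):
    async with aiohttp.ClientSession() as session:
        tasks = [check_proxy(session, proxy_types, p) for p in proxies]
        results = await asyncio.gather(*tasks)
    # Keep only the proxies that check_proxy reported as working.
    return [p for p in results if p]
```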
Saving Proxies
The script provides functionalities to save working proxies into files and backup existing proxies:
```python
def save_proxies(proxies):
    with open('online_proxy.txt', 'w') as f:
        for proxy in proxies:
            f.write(f"{proxy}\n")
    print(f"Total proxies extracted: {len(proxies)}")
    print("Proxies saved to online_proxy.txt")
```
Running the Script
To run the script, use the following command:
```bash
python your_script.py
```
The program will display a menu where you can select to scrape proxies or check existing ones. The program saves good proxies in a backup file and provides real-time feedback on the proxy status.
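The menu itself is a plain input loop. The sketch below ties together the helpers shown and sketched earlier; the option wording and the scrape_and_save wrapper are illustrative, not the original script's exact code:
```python
# Sketch of a menu-driven entry point.
async def scrape_and_save():
    urls = await scrape_proxies()
    if urls:
        async with aiohttp.ClientSession() as session:
            proxies = await extract_proxies(session, urls)
        save_proxies(proxies)

def main_menu():
    while True:
        print('1) Scrape proxies')
        print('2) Check proxies')
        print('3) Quit')
        choice = input('Select an option: ').strip()
        if choice == '1':
            asyncio.run(scrape_and_save())
        elif choice == '2':
            with open('online_proxy.txt') as f:
                proxies = f.read().split()
            asyncio.run(check_all(proxies))
        elif choice == '3':
            break
        else:
            print('Invalid choice, please try again.')

if __name__ == '__main__':
    main_menu()
```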
Conclusion
This Python script is a comprehensive solution for proxy scraping and checking, equipped with user-friendly features and real-time feedback. With this guide, you can easily modify and extend the script to suit your proxy management needs.
Download resources: click here