Scraping Free Premium HTTP, SOCKS4, and SOCKS5 Proxies Using Python’s Asyncio, Aiohttp, and Task Gathering
If you're into web scraping, automation, or network management, you’ll often need a list of reliable proxies to rotate during requests. Proxies can help you stay anonymous, bypass rate limits, or access geo-restricted content. In this guide, I’ll walk you through how to scrape free premium HTTP, SOCKS4, and SOCKS5 proxies using Python, leveraging powerful libraries like asyncio, aiohttp, and concurrent.futures for threading. We will also ensure the proxies are fast and reliable by checking their response times.
Why Python for Proxy Scraping?
Python is an excellent tool for web scraping due to its extensive ecosystem of libraries. For this project, I’m using:
- Asyncio: Handles asynchronous programming, allowing us to run multiple tasks concurrently.
- Aiohttp: A powerful asynchronous HTTP client that works with asyncio to make non-blocking HTTP requests (see the short sketch after this list).
- Threading (concurrent.futures): Optional, but helps when you want to process multiple proxy packs in parallel, each thread running its own event loop.
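To make that concrete, here is a minimal, hedged sketch of asyncio and aiohttp working together: two requests are sent concurrently and their status codes printed. The httpbin.org endpoints are just placeholders you can swap for your own.
```python
# Minimal sketch: two non-blocking GET requests sent concurrently with asyncio + aiohttp.
import asyncio
import aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return url, response.status

async def main():
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(
            fetch(session, "https://httpbin.org/ip"),
            fetch(session, "https://httpbin.org/get"),
        )
    for url, status in results:
        print(url, status)

asyncio.run(main())
```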
Setting Up the Environment
To follow along with this tutorial, you'll need Python 3 installed on your system (asyncio ships with the standard library, so there's no need to install it separately). Install the remaining libraries by running the following:
```bash
pip install aiohttp fake-useragent
```
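One caveat worth knowing before you run the full script: aiohttp's built-in proxy= argument only supports HTTP proxies. If you also want to test SOCKS4/SOCKS5 proxies, a common workaround (an assumption on my part, not something the original script relies on) is to add the aiohttp-socks connector:
```bash
pip install aiohttp-socks
```
A short sketch of how it's used appears right after the main script.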
Step-by-Step Python Script to Scrape Proxies
Below is a sample Python script that takes a list of free HTTP, SOCKS4, and SOCKS5 proxies (which you can scrape from free proxy sites) and checks their response times to make sure they're fast and working. The script divides the proxies into packs and checks all of the proxies in each pack concurrently.
```python
import aiohttp
import asyncio
import math
import time
from fake_useragent import UserAgent

# Initialize global variables
good_proxies_count = 0
cancel_all = False
stop_threshold = 10  # Stop after finding this number of good proxies


async def check_proxy(session, protocol_choice, proxy):
    """Check if a proxy is working and update the results asynchronously."""
    global good_proxies_count, cancel_all
    for proxy_type in protocol_choice:
        if cancel_all:
            return  # Stop if the threshold is reached
        try:
            # Note: aiohttp's proxy= parameter natively supports HTTP proxies only.
            # For socks4/socks5 URLs you would typically use a connector such as
            # aiohttp-socks (see the sketch below the script).
            proxy_url = f"{proxy_type}://{proxy}"
            headers = {'User-Agent': UserAgent().random}
            start_time = time.time()
            async with session.get('https://httpbin.org/ip',
                                   headers=headers,
                                   proxy=proxy_url,
                                   timeout=aiohttp.ClientTimeout(total=10)) as response:
                if response.status == 200:
                    elapsed_time = time.time() - start_time
                    good_proxies_count += 1
                    print(f"Proxy {proxy} ({proxy_type}) is working! "
                          f"Response time: {elapsed_time:.2f} seconds.")
                    # Write good proxy to file or database here
                    if good_proxies_count >= stop_threshold:
                        cancel_all = True
                        print(f"Reached the threshold of {stop_threshold} good proxies.")
                    return
        except (aiohttp.ClientError, asyncio.TimeoutError, ValueError):
            pass  # Ignore bad proxies


async def check_working_proxies(proxy_list, protocol_choice):
    """Check proxies in batches asynchronously for their functionality."""
    global cancel_all, good_proxies_count
    total_proxies = len(proxy_list)
    print(f'Total unique proxies: {total_proxies}')

    # Divide proxies into packs
    pack_size = total_proxies if total_proxies < 100 else math.ceil(total_proxies / 10)
    proxy_packs = [proxy_list[i:i + pack_size] for i in range(0, total_proxies, pack_size)]

    # Check each pack; proxies within a pack are checked concurrently
    for pack_num, proxy_pack in enumerate(proxy_packs, start=1):
        print(f'Checking proxies in pack {pack_num}/{len(proxy_packs)}')
        async with aiohttp.ClientSession() as session:
            tasks = [check_proxy(session, protocol_choice, proxy) for proxy in proxy_pack]
            await asyncio.gather(*tasks, return_exceptions=True)
        if cancel_all:
            break  # Stop checking if the threshold is reached

    if good_proxies_count > 0:
        print(f"Finished checking proxies. {good_proxies_count} good proxies found.")
    else:
        print("No good proxies found.")


# Sample proxy list (you can scrape these from free proxy sites)
proxy_list = ["123.45.67.89:8080", "98.76.54.32:3128"]  # Example proxies
protocol_choice = ['http', 'socks4', 'socks5']

# Running the script
asyncio.run(check_working_proxies(proxy_list, protocol_choice))
```
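Because aiohttp routes requests through HTTP proxies only, the socks4/socks5 attempts above will fail unless you swap in a SOCKS-capable connector. Here is a minimal, hedged sketch using aiohttp-socks (an extra dependency, not part of the original script); since the connector is bound to the session, each proxy gets its own short-lived session:
```python
# Hedged sketch: check a single proxy URL such as "socks5://123.45.67.89:1080"
# through aiohttp-socks. Requires `pip install aiohttp-socks`.
import asyncio
import time
import aiohttp
from aiohttp_socks import ProxyConnector

async def check_socks_proxy(proxy_url: str) -> bool:
    # ProxyConnector.from_url accepts http://, socks4://, and socks5:// URLs.
    connector = ProxyConnector.from_url(proxy_url)
    try:
        async with aiohttp.ClientSession(connector=connector) as session:
            start = time.time()
            async with session.get('https://httpbin.org/ip',
                                   timeout=aiohttp.ClientTimeout(total=10)) as response:
                if response.status == 200:
                    print(f"{proxy_url} is working! Response time: {time.time() - start:.2f} seconds.")
                    return True
    except Exception:
        pass  # Any connection error means the proxy is dead or too slow
    return False

asyncio.run(check_socks_proxy("socks5://123.45.67.89:1080"))
```
You could call a function like this from check_proxy in place of the shared-session request whenever proxy_type is socks4 or socks5.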
Key Features of the Script
- Asynchronous HTTP Requests: By using aiohttp with asyncio, the script can send multiple requests at once without waiting for one to complete before starting another.
- Proxy Response Time: The script measures the time it takes each proxy to respond, so you only keep fast, reliable proxies for your projects.
- Threshold Setting: You can set a threshold (e.g., 10 working proxies), and the script stops checking once that many working proxies have been found.
- Packs and Concurrency: Proxies are divided into packs for easier management, and within each pack all proxies are checked concurrently with asyncio.gather. If you want to spread whole packs across threads, a ThreadPoolExecutor can be layered on top, as shown in the sketch after this list.
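If you do want to spread packs across threads, here is a minimal, hedged sketch of that pattern: a ThreadPoolExecutor runs one worker per pack, and each worker starts its own event loop with asyncio.run. The check_pack function is a hypothetical stand-in for a single-pack version of check_working_proxies.
```python
# Hedged sketch: check several proxy packs in parallel threads, one event loop per thread.
import asyncio
from concurrent.futures import ThreadPoolExecutor

async def check_pack(pack):
    # Hypothetical stand-in: run the async proxy checks for one pack here.
    await asyncio.sleep(0)  # placeholder for real aiohttp work
    return f"checked {len(pack)} proxies"

def check_pack_in_thread(pack):
    # Each worker thread gets its own event loop via asyncio.run().
    return asyncio.run(check_pack(pack))

proxy_packs = [["123.45.67.89:8080"], ["98.76.54.32:3128"]]
with ThreadPoolExecutor(max_workers=4) as executor:
    for result in executor.map(check_pack_in_thread, proxy_packs):
        print(result)
```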
Why You Should Use This Script
- Performance: This script uses asynchronous programming to maximize performance and minimize wait times when checking proxies.
- Scalability: You can easily adjust the script to handle hundreds or even thousands of proxies by tweaking the pack size (and, optionally, adding a thread pool as shown above).
- Reliability: By checking the response time of each proxy, you can ensure only the fastest proxies are used in your online projects.
Conclusion
Scraping proxies and verifying their performance is essential for any online project involving web scraping, automation, or API management. With Python’s asyncio and aiohttp, this task becomes much more efficient. This script ensures that only fast and reliable proxies make the cut, helping you avoid slow or dead ones.
Feel free to tweak the script to fit your needs, and don’t forget to watch the full YouTube tutorial for a detailed walkthrough!
Download the resources: click here.