Fashion Image Scraper: Large-Scale Dataset Collection

Freelance Client - Computer Vision Dataset

A freelance project to build a robust, large-scale image dataset for computer vision training. The scraper collected product images (primarily women’s and men’s clothing) from multiple high-traffic fashion e-commerce sites, including PrettyLittleThing, Bandier, Italist, Mango, Nordstrom, and Wolf & Badger. It was designed to extract the latest and most relevant images for each product while getting past the sites' anti-bot measures.

Python, BeautifulSoup, Selenium, GeckoDriver (Firefox), requests, CSV, Proxy Rotation, Random Delays

Impact & Results

Successfully collected 50,000+ clean, categorized product images
Enabled downstream computer vision tasks (classification, detection, style analysis)
Delivered high-quality dataset to freelance client on time
Handled real-world anti-scraping challenges at scale

Architecture

Two-phase pipeline: (1) Link collection => CSV storage, (2) Image downloading from saved links
Modular structure with separate directories for links (_links) and data parsing (_data)
Selenium + GeckoDriver for dynamic JavaScript-heavy pages
Requests + BeautifulSoup for static content
IP rotation and randomized delays to avoid detection and rate-limiting
Organized output: Category-specific folders (Men/Women clothing) and subdirectories per retailer
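
The two-phase layout above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the CSV column names (retailer, category, image_url), the function names, and the "dataset" root folder are all assumptions made for the example.

```python
import csv
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical column layout for the saved link files.
FIELDS = ["retailer", "category", "image_url"]


def save_links(rows, csv_path):
    """Phase 1: persist collected image links so downloading can run later."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def load_links(csv_path):
    """Phase 2 input: read the saved links back as a list of dictionaries."""
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def target_path(row, root="dataset"):
    """Build <root>/<category>/<retailer>/<filename> for one saved link."""
    # Drop any query string and keep only the file name from the URL path.
    filename = Path(urlparse(row["image_url"]).path).name or "image.jpg"
    return Path(root) / row["category"] / row["retailer"] / filename
```

Keeping link collection and downloading apart means a failed download run can be restarted from the CSV without re-crawling the listing pages.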

Challenges

Heavy use of JavaScript and infinite scroll on target sites
Aggressive bot detection, IP bans, and CAPTCHA challenges
Maintaining consistency across different site structures
Managing large volume of downloads without overwhelming local/network resources
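
For the infinite-scroll challenge, one common pattern is to keep scrolling until the page height stops growing. The sketch below assumes a Selenium WebDriver instance is passed in as driver; the function name, the pause length, and the round limit are illustrative choices, not values taken from the project.

```python
import time


def scroll_to_bottom(driver, pause=1.5, max_rounds=50):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls that loaded new content.
    """
    loaded = 0
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_rounds):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazy-loaded products time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new appeared, assume the end of the listing
        last_height = new_height
        loaded += 1
    return loaded
```

With a real browser this would typically be called after driver.get(url) on a webdriver.Firefox() instance. Some sites load content on a button click or a scroll inside a nested container, so the height check would need adapting per retailer.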

Solutions

Combined Selenium (for JS rendering) with requests/BS4 (for speed)
Implemented proxy rotation and human-like random delays
Built retailer-specific parsers while keeping core logic reusable
Chunked processing and robust error handling with retries
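
The proxy rotation, random delays, retries, and chunking listed above might fit together along these lines. This is a hedged sketch: fetch is a hypothetical callable supplied by the caller (it should raise on failure), and the attempt count and delay range are placeholder values rather than the ones used in the project.

```python
import itertools
import random
import time


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_with_retries(url, fetch, proxy_pool, max_attempts=3,
                       delay_range=(1.0, 4.0), sleep=time.sleep):
    """Call fetch(url, proxy), taking a fresh proxy for every attempt.

    proxy_pool is an iterator such as itertools.cycle([...]), so rotation
    continues across calls. A random pause precedes each request.
    """
    last_error = None
    for _ in range(max_attempts):
        proxy = next(proxy_pool)
        sleep(random.uniform(*delay_range))  # irregular, human-like pacing
        try:
            return fetch(url, proxy)
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
    raise RuntimeError(
        f"giving up on {url} after {max_attempts} attempts"
    ) from last_error
```

In practice fetch could wrap requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10) and return the response body, and the download loop could iterate over chunked(links, 500) so that only one batch is in flight at a time.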

Key Takeaways

Deep understanding of modern anti-bot techniques and evasion strategies
Effective hybrid scraping approaches (Selenium + requests)
Importance of modular, maintainable scraper architecture
Real-world scale considerations: rate limiting, storage, and reliability
