Fashion Image Scraper: Large-Scale Dataset Collection
Freelance Client - Computer Vision Dataset
A freelance project to build a robust, large-scale image dataset for computer vision training. The scraper collected product images (primarily women’s and men’s clothing) from multiple high-traffic fashion e-commerce sites, including PrettyLittleThing, Bandier, Italist, Mango, Nordstrom, and Wolf & Badger. It was designed to extract the latest and most relevant images for each product while evading anti-bot measures.
Python, BeautifulSoup, Selenium, GeckoDriver (Firefox), requests, CSV, Proxy Rotation, Random Delays
Impact & Results
- Successfully collected 50,000+ clean, categorized product images
- Enabled downstream computer vision tasks (classification, detection, style analysis)
- Delivered high-quality dataset to freelance client on time
- Handled real-world anti-scraping challenges at scale
Architecture
- Two-phase pipeline: (1) Link collection => CSV storage, (2) Image downloading from saved links
- Modular structure with separate directories for links (_links) and data parsing (_data)
- Selenium + GeckoDriver for dynamic JavaScript-heavy pages
- Requests + BeautifulSoup for static content
- IP rotation and randomized delays to avoid detection and rate-limiting
- Organized output: Category-specific folders (Men/Women clothing) and subdirectories per retailer
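The two-phase flow above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the CSV columns (`retailer`, `category`, `image_url`), the function names, and the `fetch` callable are all assumptions made for the example. Keeping the link list on disk means phase 2 can be re-run after a failure without repeating the link collection.

```python
import csv
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical CSV schema; the real column names are not given in this write-up.
FIELDS = ["retailer", "category", "image_url"]


def save_links(rows, csv_path):
    """Phase 1: persist collected image links so downloads can run later."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def target_path(row, out_root):
    """Map a CSV row to <out_root>/<category>/<retailer>/<file name>."""
    name = Path(urlparse(row["image_url"]).path).name or "image.jpg"
    return Path(out_root) / row["category"] / row["retailer"] / name


def download_links(csv_path, out_root, fetch):
    """Phase 2: read saved links and write each image into its folder.

    `fetch` is any callable that takes a URL and returns the image bytes.
    Returns the list of paths written during this run.
    """
    saved = []
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            dest = target_path(row, out_root)
            if dest.exists():  # already downloaded in an earlier run
                continue
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(fetch(row["image_url"]))
            saved.append(dest)
    return saved
```

In practice `fetch` could wrap a `requests.get(...)` call; passing it in as a parameter keeps the file layout logic separate from the network code.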
Challenges
- Heavy use of JavaScript and infinite scroll on target sites
- Aggressive bot detection, IP bans, and CAPTCHA challenges
- Maintaining consistency across different site structures
- Managing large volume of downloads without overwhelming local/network resources
Solutions
- Combined Selenium (for JS rendering) with requests/BS4 (for speed)
- Implemented proxy rotation and human-like random delays
- Built retailer-specific parsers while keeping core logic reusable
- Chunked processing and robust error handling with retries
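As an illustration of the last three points, the sketch below combines proxy rotation, randomized delays, retries, and chunking. The helper names, the default attempt count, and the delay range are assumptions chosen for the example rather than values from the project.

```python
import itertools
import random
import time


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_with_retries(url, fetch, proxy_cycle, max_attempts=3,
                       delay_range=(1.0, 3.0), sleep=time.sleep):
    """Call `fetch(url, proxy)` until it succeeds or attempts run out.

    Each attempt takes the next proxy from `proxy_cycle` (an iterator,
    e.g. itertools.cycle over a proxy list). After a failure the function
    waits a random delay, scaled by the attempt number, before retrying.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        proxy = next(proxy_cycle)
        try:
            return fetch(url, proxy)
        except Exception as exc:  # a real scraper would catch narrower errors
            last_error = exc
            if attempt < max_attempts:
                sleep(random.uniform(*delay_range) * attempt)
    raise RuntimeError(
        f"giving up on {url} after {max_attempts} attempts"
    ) from last_error
```

With `requests`, `fetch` might be something like `lambda url, proxy: requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10).content`. Processing the URL list through `chunked(urls, 100)` keeps only a bounded batch in flight at a time; the chunk size here is arbitrary.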
Key Takeaways
- Deep understanding of modern anti-bot techniques and evasion strategies
- Effective hybrid scraping approaches (Selenium + requests)
- Importance of modular, maintainable scraper architecture
- Real-world scale considerations: rate limiting, storage, and reliability

