Elmus W.

Systems & Software Engineer

Portfolio

Fashion Image Scraper: Large-Scale Dataset Collection

Freelance Client - Computer Vision Dataset

A freelance project to build a large-scale image dataset for computer vision training. Product images (primarily women’s and men’s clothing) were scraped from multiple high-traffic fashion e-commerce sites, including PrettyLittleThing, Bandier, Italist, Mango, Nordstrom, and Wolf & Badger. The scraper was designed to extract the latest and most relevant images per product while evading anti-bot measures.

Python, BeautifulSoup, Selenium, GeckoDriver (Firefox), requests, CSV, Proxy Rotation, Random Delays

Impact & Results

  • Successfully collected 50,000+ clean, categorized product images
  • Enabled downstream computer vision tasks (classification, detection, style analysis)
  • Delivered high-quality dataset to freelance client on time
  • Handled real-world anti-scraping challenges at scale

Architecture

  • Two-phase pipeline: (1) Link collection => CSV storage, (2) Image downloading from saved links
  • Modular structure with separate directories for links (_links) and data parsing (_data)
  • Selenium + GeckoDriver for dynamic JavaScript-heavy pages
  • Requests + BeautifulSoup for static content
  • IP rotation and randomized delays to avoid detection and rate-limiting
  • Organized output: Category-specific folders (Men/Women clothing) and subdirectories per retailer
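
The two-phase split above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the CSV column names (retailer, category, image_url) and the helper names save_links, load_links, and output_path are assumptions made for the example.

```python
import csv
from pathlib import Path

# Hypothetical column layout; the real project's CSV schema is not documented.
FIELDS = ["retailer", "category", "image_url"]


def save_links(rows, csv_path):
    """Phase 1: persist collected image links so downloading can run separately."""
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


def load_links(csv_path):
    """Phase 2 input: read the saved links back as a list of dicts."""
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def output_path(root, row):
    """Build <root>/<category>/<retailer>/<file name taken from the URL>."""
    # Drop any query string, then keep the last path segment as the file name.
    filename = row["image_url"].split("?")[0].rstrip("/").rsplit("/", 1)[-1]
    return Path(root) / row["category"] / row["retailer"] / filename
```

Keeping the links in a CSV between the phases means a failed download run can be restarted from the saved file without repeating the slower link-collection step.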

Challenges

  • Heavy use of JavaScript and infinite scroll on target sites
  • Aggressive bot detection, IP bans, and CAPTCHA challenges
  • Maintaining consistency across different site structures
  • Managing a large volume of downloads without overwhelming local or network resources
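
To illustrate the infinite-scroll challenge, one common approach is to keep scrolling until the page height stops growing. The sketch below assumes a Selenium WebDriver (or any object exposing execute_script); the function name scroll_to_bottom and its parameters are hypothetical, and real sites may need extra waits or a different stop condition.

```python
import time


def scroll_to_bottom(driver, pause_s=1.0, max_rounds=50, sleep=time.sleep):
    """Scroll until the page height stops growing.

    Returns the number of scroll actions performed. `driver` is expected to
    behave like a Selenium WebDriver, i.e. to provide execute_script().
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom so the page loads its next batch.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        sleep(pause_s)  # give lazy-loaded content time to appear
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded; assume the end of the listing
        last_height = new_height
    return rounds
```

The max_rounds cap is a safety net for pages that never stop loading new items.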

Solutions

  • Combined Selenium (for JS rendering) with requests/BS4 (for speed)
  • Implemented proxy rotation and human-like random delays
  • Built retailer-specific parsers while keeping core logic reusable
  • Chunked processing and robust error handling with retries
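
A rough sketch of how proxy rotation, random delays, retries, and chunking might fit together is shown below. It is an illustration under stated assumptions, not the project's implementation: chunked and fetch_with_retries are hypothetical names, the fetch callable is supplied by the caller, and the proxy list is assumed to be non-empty.

```python
import itertools
import random
import time


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_with_retries(url, fetch, proxies, max_attempts=3,
                       delay_range=(1.0, 3.0), sleep=time.sleep):
    """Call fetch(url, proxy), switching to the next proxy on each attempt.

    Waits a random delay between failed attempts and raises RuntimeError
    once max_attempts is exhausted. `proxies` must be non-empty.
    """
    proxy_cycle = itertools.cycle(proxies)
    last_error = None
    for attempt in range(1, max_attempts + 1):
        proxy = next(proxy_cycle)
        try:
            return fetch(url, proxy)
        except Exception as exc:  # broad on purpose in this sketch
            last_error = exc
            if attempt < max_attempts:
                # Randomized pause so retries do not follow a fixed rhythm.
                sleep(random.uniform(*delay_range))
    raise RuntimeError(
        f"giving up on {url} after {max_attempts} attempts"
    ) from last_error
```

In practice, fetch could wrap a call such as requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10), and the download phase could loop over chunked(links, 100) so that only one batch is in flight at a time. The batch size of 100 is an arbitrary example value.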

Key Takeaways

  • Deep understanding of modern anti-bot techniques and evasion strategies
  • Effective hybrid scraping approaches (Selenium + requests)
  • Importance of modular, maintainable scraper architecture
  • Real-world scale considerations: rate limiting, storage, and reliability
