This project was built using Scrapy (Scraping and Web Crawling Framework).

It contains a set of Spiders that gather product data from the Etsy website.

Problem

The client needs data (product_id, url, price, rating, number_of_reviews, product_options, count_of_images, images_urls, favorited_by, store_name and description) for thousands of products on etsy.com to perform data analysis.

Task

Create an automated and fast solution to navigate the website, find the products by a search text, extract all the data, and save it in a user-friendly format (CSV and XLSX).

Solution

I used the Scrapy web crawling framework to build a Python script that searches Etsy and scrapes (extracts) the data of the products it finds.

Results

The client was able to quickly download the data of more than 100,000 products from etsy.com in CSV and Excel formats.

The data was used for data analysis and added great value to the client’s business.


Source code

The solution is available on GitHub: https://github.com/cpatrickalves/scraping-etsy

How to use

You will need Python 3.6+ to run the scripts. Python can be downloaded from python.org.

You have to install the Scrapy framework and other required packages:

  • In the command prompt/terminal: pip install -r requirements.txt

Once you have installed the Scrapy framework, just clone/download this project:

git clone https://github.com/cpatrickalves/scraping-etsy

Usage

Spider: search_products.py

This Spider accesses the Etsy website and searches for products based on a given search string.

Supported parameters:

  • search - set the search string
  • count_max - limit the number of items/products to be scraped
  • reviews_option - set the method to get the product’s reviews
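
Scrapy hands every -a name=value pair to the spider as a string keyword argument, so the values need converting before use. The sketch below shows one way such arguments could be normalized. The SpiderArgs class and the parse_spider_args helper are hypothetical names used for illustration, not code taken from this project, and the validation rules are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpiderArgs:
    search: str
    count_max: Optional[int]  # None means "no limit"
    reviews_option: int       # 1, 2 or 3
    urls_only: bool


def parse_spider_args(search, count_max=None, reviews_option="1", urls_only="false"):
    """Normalize raw -a values (always strings) into typed settings."""
    if not search or not search.strip():
        raise ValueError("the 'search' argument is required")

    # count_max arrives as text such as "10"; convert it once, up front.
    limit = int(count_max) if count_max is not None else None
    if limit is not None and limit < 1:
        raise ValueError("count_max must be a positive integer")

    # Option 1 is described as the default reviews method.
    option = int(reviews_option)
    if option not in (1, 2, 3):
        raise ValueError("reviews_option must be 1, 2 or 3")

    return SpiderArgs(
        search=search.strip(),
        count_max=limit,
        reviews_option=option,
        urls_only=str(urls_only).lower() == "true",
    )
```

In a real spider, a helper like this would typically be called from the constructor with the keyword arguments Scrapy passes in.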

For example, to search for ‘3d printed’ products, go to the project’s folder and run:

scrapy crawl search_products -a search='3d printed' 

To save the results, use the -o parameter:

scrapy crawl search_products -a search='3d printed' -o products.csv

The Spider will create both a CSV file and an Excel file.
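
Once the CSV exists, it can be inspected with the standard library alone. The following is a minimal sketch that assumes the file has store_name and rating columns (names taken from the field list in the Problem section) and that rating holds a plain number. The real column layout and value formats may differ, and summarize_products is a hypothetical helper.

```python
import csv
from collections import Counter


def summarize_products(csv_file):
    """Count scraped products per store and average the parseable ratings."""
    per_store = Counter()
    ratings = []
    for row in csv.DictReader(csv_file):
        per_store[row.get("store_name") or "unknown"] += 1
        try:
            ratings.append(float(row.get("rating") or ""))
        except ValueError:
            pass  # rating missing or not a plain number: skip it
    average = sum(ratings) / len(ratings) if ratings else None
    return per_store, average
```

To use it on the exported file, open products.csv with newline="" and encoding="utf-8" and pass the file object to summarize_products.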

To limit the number of products scraped, use the count_max parameter:

scrapy crawl search_products -a search='3d printed' -a count_max=10 -o products.csv

The product reviews data can be obtained in three ways:

  • 1 - The Spider gets only the reviews shown on the product’s page, that is, 4 reviews. This is the default and fastest option for scraping.
  • 2 - The Spider sends an Ajax request to get all the reviews on the product’s page (simulating a click on the +More button to load more reviews). With this option, the Spider usually gets 10 reviews.
  • 3 - The Spider visits the page with all the store’s reviews (clicking on the Read All Reviews button) and gets all the reviews for this specific product. As the Spider visits several pages to get the reviews, this is the slowest scraping option, and there is a chance of getting temporarily blocked by Etsy because of the high number of requests.

To choose how the reviews are scraped, use the -a reviews_option parameter:

scrapy crawl search_products -a search='3d printed' -a reviews_option=3 -o products.csv

Scraping speed

You can change the number of concurrent requests performed by Scrapy in the settings.py file.

CONCURRENT_REQUESTS = 10

Lower this value if you want to reduce the number of requests and avoid getting blocked by Etsy.
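
Besides CONCURRENT_REQUESTS, Scrapy has a few other standard settings that slow a crawl down. The fragment below is a sketch of a more conservative configuration. The values are illustrative assumptions, not the project's defaults, and should be tuned for your own runs.

```python
# settings.py (fragment) - illustrative values, not the project's defaults

# Fewer parallel requests overall and per domain.
CONCURRENT_REQUESTS = 4
CONCURRENT_REQUESTS_PER_DOMAIN = 2

# Wait between consecutive requests to the same site (seconds).
DOWNLOAD_DELAY = 1.0

# Let Scrapy's AutoThrottle extension adapt the delay to server latency.
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 30.0
```

Lower concurrency and a longer delay make a crawl slower but reduce the number of requests Etsy sees in a short time.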

If you only need the product URLs, scraping can be faster; just use the urls_only flag:

scrapy crawl search_products -a search='xbox controller elite' -o products.csv -a urls_only=true
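
If the crawl needs to be started from another Python script, the commands above can be assembled programmatically. This is a minimal sketch. The build_crawl_command and run_crawl helpers are hypothetical names, and the sketch assumes the scrapy executable is on the PATH and that project_dir points to the cloned project's folder.

```python
import subprocess


def build_crawl_command(search, output=None, count_max=None,
                        reviews_option=None, urls_only=False):
    """Build the argument list for `scrapy crawl search_products`."""
    cmd = ["scrapy", "crawl", "search_products", "-a", f"search={search}"]
    if count_max is not None:
        cmd += ["-a", f"count_max={count_max}"]
    if reviews_option is not None:
        cmd += ["-a", f"reviews_option={reviews_option}"]
    if urls_only:
        cmd += ["-a", "urls_only=true"]
    if output:
        cmd += ["-o", output]
    return cmd


def run_crawl(project_dir, **kwargs):
    """Run the crawl from the project's folder; raises if Scrapy exits with an error."""
    subprocess.run(build_crawl_command(**kwargs), cwd=project_dir, check=True)
```

Passing the arguments as a list avoids shell quoting problems with search strings that contain spaces, such as '3d printed'.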