Month: December 2018

Do not use Selenium for web scraping

Published on: 15.12.2018

Disclaimer:
This is written primarily from the point of view of the Python ecosystem.

I have noticed that Selenium has become quite popular for scraping data from web pages.

Yes, you can use Selenium for web scraping, but it is not a good idea.

Personally, I also think that articles teaching how to use Selenium for web scraping set a bad example of which tool to use for the job.

Why you should not use Selenium for web scraping

First, Selenium is not a web scraping tool.

It is “for automating web applications for testing purposes”, and that statement comes straight from Selenium's own homepage.

Second, in Python there is a better tool: Scrapy, an open-source web-crawling framework.

The intelligent reader will ask: “What is the benefit of using Scrapy over Selenium?”

You get speed, and a lot of it (not the amphetamine kind :-)): speed in development and speed at scraping time.

There are articles full of tips on how to make Selenium web scraping faster; if you use Scrapy, you simply do not have those kinds of problems and you are faster.

The mere fact that these articles exist is proof (at least to me) that people are using the wrong tool for the job, an example of “When your only tool is a hammer, everything looks like a nail”.

What you should use Selenium for

I personally only use Selenium for web page testing.

I would try to use it for automating web applications (if there were no other options), but I have never had that use case so far.

The exception: when you can use Selenium

The only exception I can see for using Selenium as a web scraping tool is when the website you are scraping uses JavaScript to fetch or display the data you need.

Scrapy does have a solution for JavaScript in the form of Splash, but I have never used it; so far I have always found some workaround.

What to use instead of Selenium for web scraping

As you can guess, my advice is to use Scrapy.

I chose Scrapy because I spend less time developing web scraping programs (web spiders) and the execution time is fast.

I have found Scrapy to be faster in development time because of the Scrapy shell and the built-in cache.

In execution it is fast because multiple requests are made simultaneously; this means data will not be delivered in the same order it was requested, so do not let that confuse you when debugging.
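
To make the development-speed point concrete, here is a rough sketch of what a small Scrapy spider looks like; the site (quotes.toscrape.com), the CSS selectors and the field names are placeholders picked for illustration, not from any real project.

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # Scrapy schedules requests concurrently, so scraped items can
            # arrive in a different order than the pages were requested.
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").extract_first(),
                    "author": quote.css("small.author::text").extract_first(),
                }
            # Follow pagination; Scrapy queues these requests and skips duplicates.
            next_page = response.css("li.next a::attr(href)").extract_first()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

You run it with something like "scrapy runspider quotes_spider.py -o quotes.json", and the Scrapy shell ("scrapy shell <url>") lets you try selectors interactively before they go into the spider.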

What about Beautiful Soup + Requests

I used this combination in the past, before I decided to invest time in learning Scrapy.

Do not make the same mistake I did: both development time and execution time are much faster with Scrapy than with any other tool I have found so far.
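
For comparison, the same kind of single-page scrape with Requests + Beautiful Soup looks roughly like this (again, the URL and selectors are only illustrative); it is fine for one page, but every request happens one at a time and the crawling, retrying and caching are left to you.

    import requests
    from bs4 import BeautifulSoup

    response = requests.get("http://quotes.toscrape.com/")
    soup = BeautifulSoup(response.text, "html.parser")

    quotes = []
    for quote in soup.select("div.quote"):
        quotes.append({
            "text": quote.select_one("span.text").get_text(),
            "author": quote.select_one("small.author").get_text(),
        })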

Last words

This is not a rant about using Selenium for web scraping; for non-production systems and learning or hobby projects it is fine.

I get it, Selenium is easy to start with and you can see what is happening in real time on your screen; that is a huge benefit for people starting to learn web scraping, and it is important to have this kind of early morale boost when you are learning something new.

But I do think that all these articles and tutorials using Selenium for web scraping should carry a disclaimer not to use Selenium in real life (if you need to scrape 100K pages a day, it is simply not possible with a single Selenium instance).

Starting with Scrapy is harder: you have to write XPath selectors, and digging through the HTML source of a page to debug them is not fun, but if you want fast web scraping that is the price.

Conclusion

After you learn Scrapy you will be faster than with Selenium (Selenium just has a gentler learning curve at the start); I personally needed a few days to get the basics.

Introduction to the Python package dataset

Published on: 01.12.2018

The Python package dataset describes itself as “databases for lazy people”, and that description is correct.

To save data with dataset, all you need is a Python dictionary; the keys of the dictionary become the columns of a table, and that is all.

Dataset will automatically create all the necessary tables and columns.

Internally the data is stored in an SQLite, PostgreSQL or MySQL database; my experience so far has been with SQLite only.
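
Here is a minimal sketch of what that looks like; the file name, table name and columns are made up for the example.

    import dataset

    db = dataset.connect("sqlite:///scraped.db")
    table = db["articles"]

    # The dictionary keys become columns; dataset creates the table
    # and any missing columns on the fly.
    table.insert({"title": "Do not use Selenium for web scraping",
                  "published": "2018-12-15"})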

My experience

In one project I use it purely as an in-memory database: after data is scraped from a website, it is stored in an in-memory SQLite database.

Then I can use the standard dataset API to retrieve the data matching certain criteria and sort it, before emailing it.
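
Roughly like this, with made-up table, column names and filter values:

    import dataset

    db = dataset.connect("sqlite:///:memory:")
    table = db["results"]
    table.insert({"title": "First post", "category": "python", "score": 10})
    table.insert({"title": "Second post", "category": "python", "score": 42})
    table.insert({"title": "Off topic", "category": "misc", "score": 7})

    # Rows matching a criterion, sorted by score (descending),
    # ready to be formatted into an email.
    for row in table.find(category="python", order_by="-score"):
        print(row["title"], row["score"])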

On another project, I use it to store data in SQLite and later to retrieve it.

I must admit that for anything beyond basic searching, filtering and sorting you have to write SQL queries.
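
For those cases dataset exposes a query() method that takes plain SQL and yields rows as dictionaries; the table and column names below are again just illustrative.

    import dataset

    db = dataset.connect("sqlite:///:memory:")
    db["results"].insert({"title": "First post", "category": "python", "score": 10})
    db["results"].insert({"title": "Off topic", "category": "misc", "score": 7})

    # Anything beyond basic searching, filtering and sorting: drop down to SQL.
    rows = db.query("SELECT category, COUNT(*) AS n FROM results GROUP BY category")
    for row in rows:
        print(row["category"], row["n"])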

One useful feature is upsert, a smart combination of insert and update.

If a row with matching keys exists, it is updated; otherwise a new row is inserted into the table.
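
A short sketch with a hypothetical table; the second argument to upsert() lists the key columns used to decide whether a matching row already exists.

    import dataset

    db = dataset.connect("sqlite:///:memory:")
    pages = db["pages"]

    # No row with this url exists yet, so it is inserted.
    pages.upsert({"url": "https://example.com/page-1", "score": 99}, ["url"])
    # A row with the same url now exists, so it is updated instead.
    pages.upsert({"url": "https://example.com/page-1", "score": 120}, ["url"])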

There is also a feature to export data to CSV or JSON.
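
As a hedged sketch: depending on the dataset version, the export helper freeze() lives either in dataset itself or in the separate datafreeze package, so treat the import below as an assumption.

    import dataset
    from datafreeze import freeze  # on older versions: from dataset import freeze

    db = dataset.connect("sqlite:///:memory:")
    db["results"].insert({"title": "First post", "score": 10})

    result = db["results"].all()
    freeze(result, format="json", filename="results.json")  # format="csv" also works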

Conclusion

If you think that using a database in your next project is overkill, but you do need to filter, search or sort data, take a look at dataset.

It is much better than building a custom solution; I know, because before I learned about dataset I stored data in pickle format and wrote custom functions for filtering, sorting and retrieving it.