Tag: scrapy

Do not use Selenium for web scraping

Published on: 15.12.2018

Disclaimer:
This is primarily written from Python programming language ecosystem point of view.

I have noticed that Selenium has become quite popular for scraping data from web pages.

Yes, you can use Selenium for web scraping, but it is not a good idea.

Also personally, I think that articles that teach how to use Selenium for web scraping are giving a bad example of what tool to use for web scraping.

Why you should not use Selenium for web scraping

First,Selenium is not a web scraping tool.

It is “for automating web applications for testing purposes” and this statement is from the homepage of Selenium.

Second, in Python, there is a better tool Scrapy open-source web-crawling framework.

The intelligent reader will ask: “What is a benefit in using Scrapy over Python?

You get speed and a lot of speed (not Amphetamine :-)), speed in development and speed in web scraping time.

There are tips on how to make Selenium web scraping faster, and if you use Scrapy then you do not have those kinds of problems and you are faster.

Just because these articles exist is proof (at least for me) that people are using the wrong tool for the job, an example of “When your only tool is a hammer, everything looks like a nail“.

For what should you use Selenium

I personally only use Selenium for web page testing.

I would try to use it for automating web applications (if there are no other options), but I never had that use case so far.

Exception on when you can use Selenium

The only exception that I could see for using Selenium as web scraping tool is if a website that you are scraping is using JavaScript to get/display data that you need to scrape.

Scrapy does have the solution for JavaScript with Splash, but I have never used it, so far I always found some workaround.

What to use instead of Selenium for web scraping

As you can guess, my advice is to use Scrapy.

I choose Scrapy because I spend less time developing web scraping programs (web spiders) and execution time is fast.

I have found Scrapy to be faster in development time because of a Scrapy shell and cache.

In execution, it is fast because multiple requests can be done simultaneously, this means that data delivery will not be in the same order as requested, just that you are not confused when debugging.

What about Beautiful Soup + Requests

I have used this combination in the past before I decided to invest time in learning Scrapy.

Do not make the same mistake as I did, development time and execution time is much faster with Scrapy, than with any other tool that I have found so far.

Last words

This is not rant about using Selenium for web scraping, for not production system and learning/hobby it is fine.

I get it, Selenium is easy to start and you can see what is happing in real time on your screen, that is a huge benefit for people starting to do/learn web scraping and it is important to have this kind of early moral bosts when you are learning something new.

But I do think that all these article and tutorial using Selenium for web scraping should have a disclaimer not to use Selenium in real life (if you need to scrape 100K pages in a day, it is not possible to do it in single Selenium instance).

To start with Scrapy it is harder, you have to write XPath selectors and look at source code of HTML page to debug is not fun, but if you want to have fast web scraping that is the price.

Conclusion

After you learn Scrapy you will be faster than with Selenium (Selenium just have a lower-angle learning curve), I personally needed a few days to get the basics.

The largest expense for the programmer

Published on: 01.05.2018

I will talk about largest expense from somebody who is developing software (primarily writing code) but from the business owner perspective (not from employe perspective).

I got this idea after starting to develop my own software products, not at the time when I was writing code for others.

The topic could be rephrased as “The largest expense for the business owner who is also the sole developer”.

But, I also think that it is correct for software development in general.

Time is not on your side

After prolong thinking about a subject, I have come to conclusion (of course I can be wrong), that largest expense is time.

By the time I mean how much time you will spend to make some software.

One can argue that this is also the only expense (with some hardware, room, and electricity).

So, next question is on what activity in software development is most of the time spent(or wasted).

Hardware is cheap

Better hardware (SSD, more RAM, faster CPU, etc) will reduce time in development and you should use it.

But hardware price is relatively cheap against other time expenses.

Let’s say that by using SSD you will get 10 more minutes of work per workday (altho I would argue that it is at least double).

Multiplying by 260 workdays per year, that is 2600 minutes or 43h.

1 TB SSD is 400$, so even if you only bill 10$ for your working hour, ROI is one year and this is no brain investment. (the number can change, but you get the point)

Anyway, moral of the story is that hardware is cheap.

The largest expense is learning how to do something new

Most of the time is spent on learning how to do something new.

New programing language, new frameworks, new libraries, new tools, new new new …

The endless supply of new things thing that needs to be learned.

Somebody could come to the conclusion learning how to learn fast is the solution.

Certainly learning fast is useful, but it is not the solution, because there are more things to learn that there is time to do it.

Temporary nature of the knowledge capital

And also, in software development, you have additional problems of “temporary nature of the knowledge capital”.

Basically, what you learn today, probably will not be useful in 5 years.

Maybe it will not even exist anymore.

So far this is nothing new, anybody, who has been programming for more than 5 years have practical experience that technologies (languages, frameworks, libraries, tools, etc) go away, new ones come and now you need to spend the time to learn new ways to do the old things.

What is new, or at least what I am trying to argue in the essay, is that from the business standpoint, we should look at it as an expense.

Especially as the largest expense in software development.

Somebody can say: “Hey douchebag, I like learning new technologies and using them in my job. You are just some old dinosaur who does not like programming and probably was never good at it.”

Well, I can agree that 35 I am certainly not young anymore.

But my age does not change the fact is that learning something is not same as doing something.

If I want to do something, first I must know how to do it, and to know it, you need to learn it.

So, learning is an expense to doing.

And if for everything that you need to do, first, you need to learn, that is a lot of learning (and big expense).

Is programming is young man game?

I would say that part of the problem is that most programmers are young and do not have much life/work experience.

Statistic from StackOverflow 2018 Developer Survey Results support that calim.

30% have only 2 years of professional coding. That is one third.

57.5% have only 5 years of professional coding.

In some work field, even after 5 years, you are still considered just a beginner.

And only 12,7% have more than 15 years of professional coding.

Hobby or business

Also 81% of professional developers code as a hobby.

I do not think that this is a bad thing.

But at the same time, if you consider something a hobby, then you will not treat it as a business.

In business, there is income and expense.

And by subtracting expense from income, you get profit.

And you should have some if you expect for your business to survive.

In the hobby, there is only fun.

That is why it is called the hobby.

And hobby is the expense, from the bookkeeping perspective, but everybody is considering his/her hobby as fun, not as the expense.

These two reasons, probably more the second one, are probably main reasons why most programmes do not see learning as the expense.

At least, until they have burnout and then they switch to something else, like project management (real job title should be: project reporting) or leave software development completely.

Looking from the previously mentioned statistic only 6.9% of developer are older than 45 years.

But it is well known that programming is mostly young man game, probably due to “temporary nature of the knowledge capital”.

How to reduce your largest expense

I think that specialization is the large part of the solution.

Find your niche and stick to it.

Then you can reduce (or even completely eliminate) the expense of time spent on learning how to program something and can acquire domain-specific knowledge.

When to learn new things

I am not saying that you should never learn new things.

Not at all.

Just do cost/benefit analysis before.

I will use my own experience as an example.

I do/did a lot of web scraping.

First I have done it with Beautiful Soup and a lot of custom code.

I wrote my own caching, ORM, etc.

In web scraping, the easiest part is to write XPath selectors, there are a lot of other things and over time I have made my own small framework for all of that.

And with each new website that I scraped I had to and new features or improve old ones.

And this was taking larger and larger percent of my time as I was scraping more and more challenging websites.

After some time I decided to try Scrapy .

To scrape the first website with Scrapy I need around one week, with 90% of the time was spent on learning Scrapy (learning how to do the old thing with the new framework).

If I have used my custom old framework I would be finished it in 3 days.

But I would have to write more code than with Scrapy.

Currently, with Scrapy in one day, I can do scraping that I needed at least 3-4 days with my old custom framework.

This is the example when learning new framework was useful.

Conclusion

In software development business for a software developer, the largest expense is time spent learning new technologies.

Reduce it by finding your niche and specializing.

Use cost/benefit analysis to determine should you use learn some new technology.