Tag: webscraping

March 1, 2019March 30, 2019

Verification vs Validation in practice

Published on: 01.03.2019

Verification is the process of checking that the software meets the specification.

It is doing what you wanted it to do.

An example could be that function need to add two numbers, then you verify (like write unit test) that it is correctly doing that.

Validation is the process of checking whether the specification captures the customer’s needs.

Using the example of the function need to add two numbers verification need to confirm that this function is really what user need, eg. maybe you need to multiply two numbers.

Practice vs theory

When I first head about validation I was able to understand it in theory, but in practice, I was thinking it is easy to know what you want why do you need to validate it.

Then I had personal experience of why and how validation is hard.

I build wrong software for myself and I had no one else to blame.

How I build wrong software for myself

My idea was to make software that will be run at 1 AM every day, will take all real-estate ads from https://www.njuskalo.hr/ for my town listed on a previous day, sort them by price for a square meter and send them to email.

Basically, I wanted all new ads per day to my email, sorted by price for a square (one day delay was fine for me).

Looks simple enough, what could go wrong?

After a few days and I had it running in production and it was working, verification was successful, I every day I got all ads from the previous day.

Why validation was wrong

After a week I found out that my software was useless.

What was the problem?

Rember that I wanted to get “I wanted all new ads per day to my email”, I wanted all “new ads per day”, but what I got was all updated and new ads per day.

Let me explain.

Every day I was getting around 200 ads per day and I noticed that a lot of them were the same ads, day after day.

What was happening is that a lot of people were just updating the same ad every day.

And they are doing this so that their ad is always on the first page, sometimes they even do it a few times per day (later I found out that a friend of a friend was contracted by one local real-estate agency to make software that will automatically update ads for them).

Altho my software was working correctly, only after I have made it I found out it is useless because of wrong assumptions.

My assumption was that every ad will be added only once, not that 60% of adds will be updated every week.

I have solved this problem by making version two that could know if an ad is new or updated and if updated what was updated.

Am I stupid

This experience was fascinating to me.

On this project, I was everything: user, project manager, architect, coder, quality assurance, investor, every hat was on my had and I manage to build the wrong thing.

It gave me a practical understanding of why it is common that the end user is not happy with the finaly product.

Even if everything is done correctly it is possible that the final product is not solving user original problem due to wrong initial assumptions.

How to improve validation

One approach is to make MVP, in this way you will spend fewer resources on version one.

If validation of MVP is correct, then add additional features, if not cancel it

Another approach is to get some domain knowledge ether internal or external.

I had built a few web-scrapers in the last few years and now know a few tricks about that domain, but I learned each on the hard way.

I also understood why some companies hire domain experts consultants (just be sure to have a good one).

Technology

For those interested in what tools did use to build my software here is the list: Scrapy, dataset, yagmail.

December 15, 2018December 14, 2018

Do not use Selenium for web scraping

Published on: 15.12.2018

Disclaimer:
This is primarily written from Python programming language ecosystem point of view.

I have noticed that Selenium has become quite popular for scraping data from web pages.

Yes, you can use Selenium for web scraping, but it is not a good idea.

Also personally, I think that articles that teach how to use Selenium for web scraping are giving a bad example of what tool to use for web scraping.

Why you should not use Selenium for web scraping

First,Selenium is not a web scraping tool.

It is “for automating web applications for testing purposes” and this statement is from the homepage of Selenium.

Second, in Python, there is a better tool Scrapy open-source web-crawling framework.

The intelligent reader will ask: “What is a benefit in using Scrapy over Python?”

You get speed and a lot of speed (not Amphetamine :-)), speed in development and speed in web scraping time.

There are tips on how to make Selenium web scraping faster, and if you use Scrapy then you do not have those kinds of problems and you are faster.

Just because these articles exist is proof (at least for me) that people are using the wrong tool for the job, an example of “When your only tool is a hammer, everything looks like a nail“.

For what should you use Selenium

I personally only use Selenium for web page testing.

I would try to use it for automating web applications (if there are no other options), but I never had that use case so far.

Exception on when you can use Selenium

The only exception that I could see for using Selenium as web scraping tool is if a website that you are scraping is using JavaScript to get/display data that you need to scrape.

Scrapy does have the solution for JavaScript with Splash, but I have never used it, so far I always found some workaround.

What to use instead of Selenium for web scraping

As you can guess, my advice is to use Scrapy.

I choose Scrapy because I spend less time developing web scraping programs (web spiders) and execution time is fast.

I have found Scrapy to be faster in development time because of a Scrapy shell and cache.

In execution, it is fast because multiple requests can be done simultaneously, this means that data delivery will not be in the same order as requested, just that you are not confused when debugging.

What about Beautiful Soup + Requests

I have used this combination in the past before I decided to invest time in learning Scrapy.

Do not make the same mistake as I did, development time and execution time is much faster with Scrapy, than with any other tool that I have found so far.

Last words

This is not rant about using Selenium for web scraping, for not production system and learning/hobby it is fine.

I get it, Selenium is easy to start and you can see what is happing in real time on your screen, that is a huge benefit for people starting to do/learn web scraping and it is important to have this kind of early moral bosts when you are learning something new.

But I do think that all these article and tutorial using Selenium for web scraping should have a disclaimer not to use Selenium in real life (if you need to scrape 100K pages in a day, it is not possible to do it in single Selenium instance).

To start with Scrapy it is harder, you have to write XPath selectors and look at source code of HTML page to debug is not fun, but if you want to have fast web scraping that is the price.

Conclusion

After you learn Scrapy you will be faster than with Selenium (Selenium just have a lower-angle learning curve), I personally needed a few days to get the basics.

May 1, 2018December 16, 2018

The largest expense for the programmer

Published on: 01.05.2018

I will talk about largest expense from somebody who is developing software (primarily writing code) but from the business owner perspective (not from employe perspective).

I got this idea after starting to develop my own software products, not at the time when I was writing code for others.

The topic could be rephrased as “The largest expense for the business owner who is also the sole developer”.

But, I also think that it is correct for software development in general.

Time is not on your side

After prolong thinking about a subject, I have come to conclusion (of course I can be wrong), that largest expense is time.

By the time I mean how much time you will spend to make some software.

One can argue that this is also the only expense (with some hardware, room, and electricity).

So, next question is on what activity in software development is most of the time spent(or wasted).

Hardware is cheap

Better hardware (SSD, more RAM, faster CPU, etc) will reduce time in development and you should use it.

But hardware price is relatively cheap against other time expenses.

Let’s say that by using SSD you will get 10 more minutes of work per workday (altho I would argue that it is at least double).

Multiplying by 260 workdays per year, that is 2600 minutes or 43h.

1 TB SSD is 400$, so even if you only bill 10$ for your working hour, ROI is one year and this is no brain investment. (the number can change, but you get the point)

Anyway, moral of the story is that hardware is cheap.

The largest expense is learning how to do something new

Most of the time is spent on learning how to do something new.

New programing language, new frameworks, new libraries, new tools, new new new …

The endless supply of new things thing that needs to be learned.

Somebody could come to the conclusion learning how to learn fast is the solution.

Certainly learning fast is useful, but it is not the solution, because there are more things to learn that there is time to do it.

Temporary nature of the knowledge capital

And also, in software development, you have additional problems of “temporary nature of the knowledge capital”.

Basically, what you learn today, probably will not be useful in 5 years.

Maybe it will not even exist anymore.

So far this is nothing new, anybody, who has been programming for more than 5 years have practical experience that technologies (languages, frameworks, libraries, tools, etc) go away, new ones come and now you need to spend the time to learn new ways to do the old things.

What is new, or at least what I am trying to argue in the essay, is that from the business standpoint, we should look at it as an expense.

Especially as the largest expense in software development.

Somebody can say: “Hey douchebag, I like learning new technologies and using them in my job. You are just some old dinosaur who does not like programming and probably was never good at it.”

Well, I can agree that 35 I am certainly not young anymore.

But my age does not change the fact is that learning something is not same as doing something.

If I want to do something, first I must know how to do it, and to know it, you need to learn it.

So, learning is an expense to doing.

And if for everything that you need to do, first, you need to learn, that is a lot of learning (and big expense).

Is programming is young man game?

I would say that part of the problem is that most programmers are young and do not have much life/work experience.

Statistic from StackOverflow 2018 Developer Survey Results support that calim.

30% have only 2 years of professional coding. That is one third.

57.5% have only 5 years of professional coding.

In some work field, even after 5 years, you are still considered just a beginner.

And only 12,7% have more than 15 years of professional coding.

Hobby or business

Also 81% of professional developers code as a hobby.

I do not think that this is a bad thing.

But at the same time, if you consider something a hobby, then you will not treat it as a business.

In business, there is income and expense.

And by subtracting expense from income, you get profit.

And you should have some if you expect for your business to survive.

In the hobby, there is only fun.

That is why it is called the hobby.

And hobby is the expense, from the bookkeeping perspective, but everybody is considering his/her hobby as fun, not as the expense.

These two reasons, probably more the second one, are probably main reasons why most programmes do not see learning as the expense.

At least, until they have burnout and then they switch to something else, like project management (real job title should be: project reporting) or leave software development completely.

Looking from the previously mentioned statistic only 6.9% of developer are older than 45 years.

But it is well known that programming is mostly young man game, probably due to “temporary nature of the knowledge capital”.

How to reduce your largest expense

I think that specialization is the large part of the solution.

Find your niche and stick to it.

Then you can reduce (or even completely eliminate) the expense of time spent on learning how to program something and can acquire domain-specific knowledge.

When to learn new things

I am not saying that you should never learn new things.

Not at all.

Just do cost/benefit analysis before.

I will use my own experience as an example.

I do/did a lot of web scraping.

First I have done it with Beautiful Soup and a lot of custom code.

I wrote my own caching, ORM, etc.

In web scraping, the easiest part is to write XPath selectors, there are a lot of other things and over time I have made my own small framework for all of that.

And with each new website that I scraped I had to and new features or improve old ones.

And this was taking larger and larger percent of my time as I was scraping more and more challenging websites.

After some time I decided to try Scrapy .

To scrape the first website with Scrapy I need around one week, with 90% of the time was spent on learning Scrapy (learning how to do the old thing with the new framework).

If I have used my custom old framework I would be finished it in 3 days.

But I would have to write more code than with Scrapy.

Currently, with Scrapy in one day, I can do scraping that I needed at least 3-4 days with my old custom framework.

This is the example when learning new framework was useful.

Conclusion

In software development business for a software developer, the largest expense is time spent learning new technologies.

Reduce it by finding your niche and specializing.

Use cost/benefit analysis to determine should you use learn some new technology.

March 15, 2018February 20, 2018

How to find all emails on the web page ?

Published on: 15.03.2018

Conclusion

Use get_emails() from webscraping Python package.

Python strength

The best thing about Python is huge numbers of 3rd party packages.

With a lot of them, you can solve your problems with just a few lines of code.

Let’s say that you want to find all emails in some HTML document, either for an offline or online web page.

This can be done with webscraping package.

First, install it with:

pip install webscraping

1	pip install webscraping

Code for finding all emails on the single page is:

from webscraping import download, alg

D = download.Download()
html = D.get('http://buklijas.info/')

emails = alg.extract_emails(html)

print emails

from webscraping import download, alg

D = download.Download()

html = D.get('http://buklijas.info/')

emails = alg.extract_emails(html)

print emails

Line 1 is importing download and alg from webscraping package that you have just installed.

Line 3 is creating download.Download() object and calling it D.

Line 4 is saving the web page from where you want to find all emails in html variable.

Line 6 is finding all emails from your html variable and saving all emails in emails Python list.

Line 8 is showing all emails that have been found on the screen.

This will work for a single web page.

How to find emails on the whole site

If you want to search the whole website for emails, not just one page, you can use following code.

from webscraping import download

D = download.Download()

emails = D.get_emails("http://buklijas.info/", max_depth=2, max_urls=None, max_emails=None)

print emails

from webscraping import download

D = download.Download()

emails = D.get_emails("http://buklijas.info/", max_depth=2, max_urls=None, max_emails=None)

print emails

With max_depth, max_urls, max_emails parameters you can define how long your searching should be.

Happy spamming.

P.S. just joking 🙂