Category: software

You cannot understand the solution until you have had the problem

Published on: 01.04.2019

I understand RESTful web services, or at least I think I do.

I agree that when you have huge teams and a huge code base, it makes sense to cut them into small independent pieces and connect them via queues and HTTP.

Collaboration on large software projects is hard, and the problems multiply with every person added to the project.

The tradeoff is that the overall speed of your software will decrease (because of HTTP network calls), but you get a software system that can be maintained and extended with new features without needing to understand, change, or impact the whole system.

But as a one-man team working on my own projects, I never found a use case for myself.

Until one morning.

Architecture

I have a lot of (around 10) independent programs that run daily (some even hourly).

Most of them are doing some variation of web scraping, storing, analysis and reporting of results via email.

This was all fine until one morning I woke up and saw that there were no emails from my software.

I knew that something was not right.

They all use yagmail for sending email, so I thought the problem was there, because it is a single point of failure.

After an investigation, I found out that the problem was with Gmail itself: it just stopped working. The next day it was fine, so they had some issue that needed one day to resolve (I am not talking about the Gmail web page, but about SMTP username/password authentication).

Why Gmail

Why do I use free Gmail for sending email and not some more reliable service like SendGrid or Amazon SES?

That is a nice lesson in technical debt: what was a good idea for the initial requirements is not such a good idea anymore once time passes and requirements or circumstances change.

When I started my first project as a proof of concept, Gmail was an excellent choice: easy to start with and working fine.

As the project moved to deployment and additional projects were made, it was easier to copy/paste the existing code than to refactor/redesign/rearchitect an existing working solution.

REST solution

Emails did not work for one day, and the next day everything was back to normal.

I started to think about what I could do to avoid this problem in the future.

One solution would be to change from Gmail to something else, but there are a few issues with that which I do not like.

First issue

What if the other solution (email provider) stops working in the future? I would again need to write new code for a third solution.

To fix this problem, my idea is to use Gmail as the primary provider for sending email; if sending fails, I will just use a secondary email provider.

With this logic, I can add the third one also and so on, but I think that two are enough for the first version.
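A minimal sketch of this fallback idea, assuming each provider is wrapped in a plain function that raises an exception when sending fails. The function name, the provider names and the signature are hypothetical, not part of yagmail or any other library:

```python
from typing import Callable, Sequence, Tuple

# Hypothetical signature: a provider is a function (to, subject, body)
# that sends one email and raises an exception when sending fails.
SendFn = Callable[[str, str, str], None]


def send_with_fallback(providers: Sequence[Tuple[str, SendFn]],
                       to: str, subject: str, body: str) -> str:
    """Try each provider in order; return the name of the one that worked."""
    errors = []
    for name, send in providers:
        try:
            send(to, subject, body)
            return name
        except Exception as exc:  # any failure means "try the next provider"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all email providers failed: " + "; ".join(errors))
```

Adding a third provider is then just one more entry in the list, and the calling apps do not need to know which provider actually delivered the email.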

Second issue

Currently, I have around 10 apps (and this number will increase with new apps that I plan to build in the future) that need to send email, and each has a separate code base repository.

If I want to change something in the email logic, even something as simple as the username/password, I need to make the same change in 10 different code bases.

One solution is to make one code base just for sending emails; this would solve the problem of making the same changes in multiple apps.

But for that to work, I need to change the folder structure of all my apps and update paths in the code bases, and I can use this only if all apps are hosted on the same machine.

If they are on separate machines it will not work.

REST to the rescue

After understanding all the difficulties, making a RESTful web service just for email sending made total sense to me.

The only reason why it made sense to me is that I have a use case where REST is useful and looks like the only solution.

The first version will just be an adapter/facade around yagmail with a REST API, but that is a story for another time.

Verification vs Validation in practice

Published on: 01.03.2019

Verification is the process of checking that the software meets the specification.

It is doing what you wanted it to do.

An example: if a function needs to add two numbers, you verify (for example, by writing a unit test) that it does that correctly.
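As a small illustration of verification, here is a sketch of such a function with a test. The names add and test_add are made up for this example; the test only checks that the code matches the specification ("add two numbers"):

```python
def add(a, b):
    """The specification: return the sum of two numbers."""
    return a + b


def test_add():
    # Verification: the function does what the specification says.
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
    assert add(0.5, 0.25) == 0.75
```

A passing test like this says nothing about whether addition is what the user really needed.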

Validation is the process of checking whether the specification captures the customer’s needs.

Using the same example of a function that needs to add two numbers, validation needs to confirm that this function is really what the user needs, e.g. maybe the user actually needs to multiply two numbers.

Practice vs theory

When I first heard about validation I was able to understand it in theory, but in practice I thought: it is easy to know what you want, so why do you need to validate it?

Then I had personal experience of why and how validation is hard.

I built the wrong software for myself and I had no one else to blame.

How I built the wrong software for myself

My idea was to make software that runs at 1 AM every day, takes all real-estate ads from https://www.njuskalo.hr/ for my town listed on the previous day, sorts them by price per square meter and sends them to my email.

Basically, I wanted all new ads per day in my email, sorted by price per square meter (a one-day delay was fine for me).

Looks simple enough, what could go wrong?

After a few days I had it running in production and it was working. Verification was successful: every day I got all ads from the previous day.

Why validation was wrong

After a week I found out that my software was useless.

What was the problem?

Remember that I wanted “all new ads per day to my email”. I wanted all “new ads per day”, but what I got was all updated and new ads per day.

Let me explain.

I was getting around 200 ads per day and I noticed that a lot of them were the same ads, day after day.

What was happening is that a lot of people were just updating the same ad every day.

They do this so that their ad is always on the first page; sometimes they even do it a few times per day (later I found out that a friend of a friend was contracted by one local real-estate agency to make software that automatically updates ads for them).

Although my software was working correctly, only after I had made it did I find out that it was useless because of wrong assumptions.

My assumption was that every ad would be added only once, not that 60% of ads would be updated every week.

I solved this problem by making version two, which could tell whether an ad is new or updated and, if updated, what was updated.
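The idea behind version two can be sketched roughly like this. This is not the actual code of my program, just an illustration that assumes every ad is a dictionary with a stable "id" key (a hypothetical field name) and that previously seen ads are kept between runs:

```python
def classify_ads(seen, todays_ads):
    """Split today's ads into new ones and updated ones.

    seen       -- dict mapping ad id -> ad fields from earlier runs (mutated)
    todays_ads -- list of dicts, each with a hypothetical "id" key
    """
    new, updated = [], []
    for ad in todays_ads:
        previous = seen.get(ad["id"])
        if previous is None:
            new.append(ad)
        else:
            # Record which fields changed, as (old value, new value) pairs.
            changes = {key: (previous.get(key), value)
                       for key, value in ad.items()
                       if previous.get(key) != value}
            if changes:
                updated.append((ad, changes))
        seen[ad["id"]] = ad  # remember the latest version for the next run
    return new, updated
```

An ad that was only re-posted without any change ends up in neither list, which is exactly the noise that made version one useless.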

Am I stupid

This experience was fascinating to me.

On this project, I was everything: user, project manager, architect, coder, quality assurance, investor. Every hat was on my head and I still managed to build the wrong thing.

It gave me a practical understanding of why it is common that the end user is not happy with the final product.

Even if everything is done correctly, it is possible that the final product does not solve the user's original problem because of wrong initial assumptions.

How to improve validation

One approach is to make an MVP; this way you will spend fewer resources on version one.

If validation of the MVP is successful, then add additional features; if not, cancel it.

Another approach is to get some domain knowledge, either internal or external.

I have built a few web scrapers in the last few years and now know a few tricks about that domain, but I learned each one the hard way.

I also understood why some companies hire domain expert consultants (just be sure to have a good one).

Technology

For those interested in what tools I used to build my software, here is the list: Scrapy, dataset, yagmail.

The most important lesson for new programmers

Published on: 15.01.2019

On my “Sending email from Python” blog post, which was cross-published on Medium, I got a comment asking how to send an email via Outlook programmatically.

My first reaction was that this was some troll or bot.

So I replied with a “Let me google that for you” answer and later got a “Thank you” response.

That got me thinking: maybe he was not an internet troll, maybe he just did not know how to google.

It never crossed his mind that he could ask Google for the answer.

Why I was thinking that somebody was trolling me

I am an experienced (15+ years) software developer. I am experienced because I know that when I do not know something, first I google it, then I search on YouTube, and the last resort is to ask on Stack Overflow.

This is what professionals do; they do not ask questions on random blogs in the hope that somebody will respond.

Learn how to google

For beginners learning to code, the best thing you can do for yourself (and others) is to learn to google what you do not know.

Today it is easier to learn coding than 20 years ago when I was starting.

In my time the only thing that you had was a book (if you were lucky).

Today there are many more opportunities to learn:

  • YouTube, which is the largest free video learning tool
  • Google, for asking questions
  • Stack Overflow communities, where you can ask questions

Be aware that you should not ask a question specific to your particular coding problem; bring it to a more abstract level first.

Tips on googling

From my experience, it is important to know which keywords to google.

But if you do not know keywords you can always start with “how to …..”.

Any action is better than no action.

Most programmers are financial morons

Published on: 01.01.2019

Let me start with one true story from the year 2011.

At that time I was working as a software programmer (90% C++) in a team of 5 people.

One morning, a friend from the team started showing off a cool new source code editor called Sublime Text.

He was very happy with it; he used it on the job, for his own pet projects, and for his freelancing side jobs for a few months.

But for him Sublime Text had one drawback: he had to pay 100$ for it (at the time of this writing a Sublime Text license is 80$; I think that at that time it was 100$, but I could be wrong).

At that time I knew that my friend was a financial moron.

I tried to explain it to him, using the same logic as in this blog post, but he just could not get it; he only understood that he had to spend money.

Why most programmers are financial morons

Let us say that he was only using Sublime Text every second day (although, knowing him, it was probably every day).

With the every-second-day assumption, that is 182 days per year.

He was happy with a new tool, it was better for him, so let us say that he got 10 extra minutes of work every day.

10 minutes times 182 days is about 30 hours of extra work per year.

To break even he would need to make 3,33$ per hour of work.

Even at that time, his freelance rate was 20$ per hour and he had around 5 billable hours per week.
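The same calculation as a short script, using only the numbers from this post (the variable names are mine). With the unrounded hours the break-even rate comes out a bit lower than the figure above, which was computed from 30 hours flat:

```python
LICENSE_COST = 100.0        # dollars, the price as I remember it
DAYS_USED_PER_YEAR = 182    # every second day
MINUTES_SAVED_PER_DAY = 10  # my estimate of the extra work per day
HOURLY_RATE = 20.0          # dollars, his freelance rate at that time

hours_saved = DAYS_USED_PER_YEAR * MINUTES_SAVED_PER_DAY / 60
break_even_rate = LICENSE_COST / hours_saved
value_of_saved_time = hours_saved * HOURLY_RATE

print(f"hours saved per year:   {hours_saved:.1f}")
print(f"break-even hourly rate: {break_even_rate:.2f}")
print(f"value at freelance rate: {value_of_saved_time:.0f}")
```

This assumes that every saved hour could be turned into paid work, which is a simplification, but even a fraction of it covers the license.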

He is a smart guy, but he thought that the tool was expensive.

Economically speaking, he does not know how to do a cost-benefit analysis.

It is strange how logically intelligent programmers (believe me, you do have to be logically intelligent to write computer programs) never invest in tools that basically have ROI in days.

Conclusion

Do a cost-benefit analysis before saying that something is expensive or cheap.

Disclaimer:
I have no interest in whether you use or buy Sublime Text or not; I am just using it as an example.

Do not use Selenium for web scraping

Published on: 15.12.2018

Disclaimer:
This is primarily written from the point of view of the Python programming language ecosystem.

I have noticed that Selenium has become quite popular for scraping data from web pages.

Yes, you can use Selenium for web scraping, but it is not a good idea.

Also, personally, I think that articles that teach how to use Selenium for web scraping set a bad example of which tool to use for web scraping.

Why you should not use Selenium for web scraping

First, Selenium is not a web scraping tool.

It is “for automating web applications for testing purposes”, and this statement is from the homepage of Selenium.

Second, in Python there is a better tool: Scrapy, an open-source web-crawling framework.

The intelligent reader will ask: “What is the benefit of using Scrapy over Selenium?”

You get speed, and a lot of speed (not Amphetamine :-)): speed in development and speed in web scraping time.

There are tips on how to make Selenium web scraping faster, but if you use Scrapy you do not have those kinds of problems and you are faster.

The very existence of these articles is proof (at least for me) that people are using the wrong tool for the job, an example of “When your only tool is a hammer, everything looks like a nail”.

For what should you use Selenium

I personally only use Selenium for web page testing.

I would try to use it for automating web applications (if there are no other options), but I never had that use case so far.

Exception on when you can use Selenium

The only exception that I can see for using Selenium as a web scraping tool is if the website that you are scraping uses JavaScript to get/display the data that you need to scrape.

Scrapy does have a solution for JavaScript with Splash, but I have never used it; so far I have always found some workaround.

What to use instead of Selenium for web scraping

As you can guess, my advice is to use Scrapy.

I choose Scrapy because I spend less time developing web scraping programs (web spiders) and execution time is fast.

I have found Scrapy to be faster in development time because of a Scrapy shell and cache.

In execution, it is fast because multiple requests can be made simultaneously. This means that data will not be delivered in the same order as it was requested; keep that in mind so that you are not confused when debugging.

What about Beautiful Soup + Requests

I have used this combination in the past before I decided to invest time in learning Scrapy.

Do not make the same mistake as I did; development time and execution time are much shorter with Scrapy than with any other tool that I have found so far.

Last words

This is not a rant about using Selenium for web scraping; for non-production systems and for learning/hobby projects it is fine.

I get it, Selenium is easy to start with and you can see what is happening in real time on your screen. That is a huge benefit for people starting to do/learn web scraping, and it is important to have this kind of early morale boost when you are learning something new.

But I do think that all these articles and tutorials using Selenium for web scraping should have a disclaimer not to use Selenium in real life (if you need to scrape 100K pages in a day, it is not possible to do it in a single Selenium instance).

Scrapy is harder to start with: you have to write XPath selectors, and looking at the source code of an HTML page to debug is not fun, but if you want fast web scraping that is the price.

Conclusion

After you learn Scrapy you will be faster than with Selenium (Selenium just has a gentler learning curve); I personally needed a few days to get the basics.

Introduction to Python package Dataset

Published on: 01.12.2018

The Python package dataset describes itself as databases for lazy people, and they are correct.

To save data with dataset all you need is a Python dictionary; the keys of the dictionary become columns in a table, and that is all.

Dataset will automatically create all necessary tables and columns.

Internally, data is stored in a SQLite, PostgreSQL or MySQL database; my experience has only been with SQLite so far.

My experience

In one project I use it just as an in-memory database: after scraping data from a website, the data is stored in in-memory SQLite.

Then I can use the standard dataset API to retrieve data matching certain criteria and sort it before emailing it.

On another project, I use it to store data in SQLite and later to retrieve it.

I must admit that for anything beyond basic searching, filtering and sorting you have to write SQL queries.

One useful feature is upsert, a smart combination of insert and update.

If rows with matching keys exist they will be updated, otherwise a new row is inserted in the table.
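To show what upsert means, here is a sketch of the same semantics written with the standard sqlite3 module. This is not how dataset implements it and not dataset's API; the upsert function, the ads table and its columns are made up for the illustration:

```python
import sqlite3


def upsert(conn, table, row, keys):
    """Update the row matching `keys`, or insert `row` if there is none.

    Table and column names are put directly into the SQL text, so this
    sketch is only safe with names that you control yourself.
    """
    where = " AND ".join(f"{k} = ?" for k in keys)
    key_values = [row[k] for k in keys]
    exists = conn.execute(
        f"SELECT 1 FROM {table} WHERE {where}", key_values).fetchone()
    if exists:
        assignments = ", ".join(f"{c} = ?" for c in row)
        conn.execute(f"UPDATE {table} SET {assignments} WHERE {where}",
                     list(row.values()) + key_values)
    else:
        columns = ", ".join(row)
        marks = ", ".join("?" for _ in row)
        conn.execute(f"INSERT INTO {table} ({columns}) VALUES ({marks})",
                     list(row.values()))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ads (id INTEGER, price INTEGER)")
upsert(conn, "ads", {"id": 1, "price": 100000}, ["id"])  # no match: insert
upsert(conn, "ads", {"id": 1, "price": 95000}, ["id"])   # match: update
rows = conn.execute("SELECT id, price FROM ads").fetchall()
```

After both calls the table still holds a single row for id 1, with the latest price. With dataset, the table is also created for you and the call is a one-liner on the table object with the row and the list of key columns.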

There is also a feature to export data to CSV or JSON.

Conclusion

If you think that using a DB on your next project is overkill, but you do need to filter, search or sort data, take a look at dataset.

It is much better than making custom solutions. I know, because before I learned about dataset I stored data in pickle format and wrote custom functions for filtering, sorting and retrieving data from the pickle.

The question of tradeoff in software, business, and life

Published on: 15.11.2018

In software development, it is common to have discussions about which technology is better or the best.

To beginners looking for a perfect solution, the holy grail, those discussions look wise.

But they are useless because there is no perfect solution; the much more important question to answer is what tradeoffs you are making and why.

Why are tradeoffs necessary?

In any system, if you want to improve one aspect of the system, that has to come at the expense of some other aspect.

Let us take the car for example.

I am taking the car as an example because I suppose it is easy to understand.

If you want to make a car accelerate faster, you have to make it lighter, and fuel consumption will go up.

So, to increase acceleration you have to decrease weight and fuel efficiency.

This is a simplified example with many imperfections, but I hope that the reader gets the point.

Basically, you have to make a tradeoff.

Back to the discussion on tradeoffs in software development

When you add the business aspect into consideration, it is even more complicated.

Things that make sense from a technical standpoint can be a disaster for the business, and vice versa.

The hard thing about a tradeoff between business and technology is that it is almost impossible to find one person who understands even one side completely, let alone both at the same time.

Today software systems are so complicated that it is common that there is no single person who understands everything.

That is why REST APIs are popular, but that is a discussion for another day.

Concrete software example

I have one personal program that I use every day. It saves me 1000$ per year on average, so it does have real monetary value for me.

A SQLite DB is the main part of it, and I do not use any indexes in it (there is no cost benefit in it).

I know that SQLite, from the point of view of speed, is not the best option for my use case.

But I took SQLite because it was fast to start with, backups are just copying one file, and I run the SQL queries once per day while I am sleeping.

Currently, all SQL queries together take around 30 seconds on average, and as the DB file gets larger the query time will also increase.

Even if it gets to 1 hour (which I do not expect even in the next 100 years), that would be fine for my use case.

My deployment platform is shared hosting with a flat monthly bill, so increased CPU time is also not a problem for me, although if I used a platform with serverless billing per CPU time it could be.

Conclusion

Know what tradeoffs you are making and, even more important, why.

Introducing PyAutoGUI

Published on: 01.11.2018

I found out about the PyAutoGUI Python package from the book Automate the Boring Stuff with Python.

With PyAutoGUI you can automate GUI interaction on your computer.

PyAutoGUI works on Windows, OS X and Linux, although I have used it only on Windows.

Most GUI automation can be done just by automating mouse movements and keyboard input and PyAutoGUI supports it.

There are also functions for showing message boxes, taking screenshots and finding an element on the screen from an image.

Personally, I have found PyAutoGUI useful for automating some of my workflows.

To be honest, I use it for 45 minutes of work every month (I know this is not much), but I have found that manually doing the same interactions for 45 minutes really kills my soul, so I decided to just automate it.

If you want to automate web page interaction use Selenium, because with Selenium you can access elements on a web page independent of their position on the screen.

With PyAutoGUI you can only move the mouse to specific coordinates, which means that if the resolution or the layout of the GUI changes, you need to update the coordinates in your code.

Keep all coordinates in your code as constants in one place, so that you do not need to change them all over your code when a change is needed.
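A minimal sketch of that layout. The constant names and coordinate values are hypothetical, and the click function is passed in as a parameter so the workflow itself does not depend on any particular library:

```python
# All screen coordinates live here; the names and values are hypothetical
# and have to be measured on your own screen and resolution.
SAVE_BUTTON = (1180, 95)
FILE_NAME_FIELD = (640, 410)
CONFIRM_BUTTON = (760, 520)

WORKFLOW = [SAVE_BUTTON, FILE_NAME_FIELD, CONFIRM_BUTTON]


def run_workflow(click, steps=WORKFLOW):
    """Click every position in order; `click` is any function taking x, y."""
    for x, y in steps:
        click(x, y)
```

With PyAutoGUI installed this could be run as run_workflow(pyautogui.click). When the layout changes, only the constants at the top need to be updated.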

In theory this could be avoided by using images as references for finding elements in your GUI, but you will have the same problem if the appearance of the elements changes; then, instead of changing coordinates in the code, you have to replace all reference images, which is more work.

My personal preference is to use hardcoded coordinates instead of images as the reference.

Using Selenium has also become popular for scraping pieces of information from web pages, but it is better to use a specialized scraping framework like Scrapy, because of additional features like caching HTTP responses (quite a time saver in development).

Making web apps with Jupyter notebook

Published on: 01.10.2018

This article will explain how to turn a Jupyter notebook into a GUI app on the web.

What is Jupyter notebook

Jupyter notebook is a browser-based REPL.

A REPL enables you to program in an interactive environment: you can write and then execute your next line of code while all previous lines are already in the executed state.

This trivial feature enables me to cut prototyping time, because to test the next line of code I do not need to run the whole program again. (A REPL is useful only in some types of situations.)

I know this explanation is useless if you do not know what a programming language is and have no experience with REPL-style prototyping, but if you are in this category I do not know how to explain it (probably it is impossible).

The point is, it makes prototyping faster because to test the next line of your code you do not need to run the previous code again and again.

Previously, Jupyter notebooks were called IPython Notebooks; at that time only Python was available as a programming language.

Now it is possible to use Jupyter notebook with many programming languages, although my experience is only with Python.

Personally, I use Jupyter notebook for exploratory data analysis: loading data into Pandas and then trying to understand the data with visualizations (Seaborn, Bokeh).

Sharing Jupyter notebook with non-technical persons

Often I would run the same code with different parameters to produce slightly different visualizations.

If you are familiar with the Jupyter notebook environment, then you know that this means running the same cell with SHIFT + ENTER, from the Cell menu or with some other shortcut.

This got me thinking: if I wanted to give my notebook to a non-technical person (somebody who knows how to use Word, and Excel without knowing how to write formulas), it would not be trivial for that person to use it.

Also, that person could change the code and get an unintended outcome (a syntax error or a wrong result).

Ipywidgets

This problem can be solved with ipywidgets.

With ipywidgets you can make a GUI inside of a Jupyter notebook. It is perfect when you want to expose some functionality of your Jupyter notebook to somebody (even yourself) through GUI elements.

For having this kind of GUI:

[Image: Coin Toss GUI]

This is the necessary code:

Appmode

This was good, but there were still 2 problems:

  • the user had to run all cells as the first step
  • the user still had access to the code

Fortunately, Appmode is a Jupyter extension that turns notebooks into web applications.

By default the user can still go back to the “code mode”, but that option can be easily removed.

Hosting your Jupyter notebook

If you are hosting it inside of your own network, then you just need to run the notebook server, like for local development, but add some security.

GitHub will give you a view-only rendering of any notebook that is hosted on their servers, and there are many more websites with the same functionality.

If you want interactive hosting of your Jupyter notebooks so that people can execute them, then there is Binder.

Currently, it is in beta and your Jupyter notebook needs to be in a public GitHub repository.

Conclusion

With the right combination of the tools described above, you can execute your Jupyter notebook as a web app for free.

The Coin Toss code can be used as an example of how to host a Jupyter notebook as a GUI app on the web.

Part of the inspiration came from the Bloomberg bqplot project.

Personally, I have found it useful for sharing interactive visualizations.

Sending email from Python

Published on: 15.09.2018

I will show how to send emails from the Python programming language using your Gmail account.

Mail server

When you want to send email from a GUI/CLI app or from some programming language (in this example Python), you need to have access to a mail server.

The mail server is a computer that is in charge of sending and receiving your email.

Usually, you access the mail server with a username and password.

With Gmail, the username is your email address and the password should only be known to you :-).

Use yagmail

For sending emails from Python using your Gmail account, IMHO the best option is the yagmail package.

You can waste time with smtplib library, but if you are using a Gmail account, just use yagmail.

Install it with:
pipenv install yagmail

Code examples

This is just a simple code example, for production code do not put your password in the source code.

Personally, I use one JSON file (this file is committed to source control with dummy credentials) per project for storing usernames and passwords for my personal projects.
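A sketch of how such a JSON file could be read. The file name and the key names "username" and "password" are assumptions for this example, not something yagmail requires:

```python
import json


def load_credentials(path):
    """Read the username and password from a JSON file.

    The file is assumed to look like:
    {"username": "dummy@gmail.com", "password": "dummy"}
    """
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    return config["username"], config["password"]
```

The returned pair can then be passed to yagmail.SMTP(username, password); only the copy of the file with real credentials must stay out of source control.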

This is a simple one line example:

I usually use it like this:

To add an image to the body of the email, you need to use yagmail.inline:

My advice and experience

Gmail has daily limits on how many emails you can send.

At the time of this writing, it is 500 emails per day; if you exceed it you will not be able to send emails for the next 24h (at least that happened to me, and I had sent only 100 emails in a row).

So, my advice is to have a separate Gmail account just for sending emails automatically from your code; you do not want to lose the ability to send email from your personal email account (I was there and it is not funny).

Also, have some delay between sending emails. When my account got locked for 24h I had only sent around 100 emails in a row, but after I added time.sleep(15) I never had that problem again.
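A sketch of sending with a pause between emails. The send_all function is made up for this example; `send` stands for whatever function sends one message (for example a small wrapper around yagmail), and 15 seconds is simply the value that worked for me, not a documented Gmail limit:

```python
import time


def send_all(messages, send, delay_seconds=15, sleep=time.sleep):
    """Send messages one by one, pausing between them.

    `send` is any function that sends a single message (hypothetical here);
    `sleep` can be replaced so the pause is skipped in tests.
    """
    for index, message in enumerate(messages):
        if index > 0:
            sleep(delay_seconds)  # pause before every message except the first
        send(message)
```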

I am not using Gmail to spam people, I am using it for sending the email report to myself, mostly from web scraping tasks (around 20 per day).

If you have error 534

Solution is

Google blocks sign-in attempts from apps which do not use modern security standards (mentioned on their support page). You can, however, turn on/off this safety feature by going to the link below:

Go to this link and select Turn On
https://www.google.com/settings/security/lesssecureapps

From:
https://stackoverflow.com/a/26852782/2006674

Conclusion

Sending emails from Python is easy with yagmail.