Machine Learning for Image Cropping


What You’re Getting Into

  • Audience: Aspiring Data Scientists or Interested Layperson
  • Subject: Using semi-supervised and reinforcement learning techniques to train neural networks for automatic image cropping
  • Time: 20min to 60min depending on level and investigation of supplemental material
  • Motivation: My first (mini) personal project after a long hiatus. An explanation of two recent papers that I read, loved, and used when creating the recipe picture layout on our La Cucina page.
  • Takeaway: Some help directing you to resources for getting this set up and running on your own computer

Goodhart’s Law & Finally Getting Off My Assymptote

When a measure becomes a target, it ceases to be a good measure.

Marilyn Strathern

There’s an extremely useful adage in economics, succinctly paraphrased by Marilyn Strathern in the quote above. A classic example can be found in the Cobra Effect, named after an incident during British colonial rule in India. The British government, unhappy with the number of cobras in Delhi, set a bounty paid per dead cobra handed over to them. Eventually people started breeding cobras specifically to kill for government money. When the British government realized they had been duped and removed the bounty, the snakes in the breeding farms were released, leading to a cobra surplus in Delhi. Indeed, the road to hell is paved with good intentions, and the British government found themselves with a hell of a snake problem after only trying to make improvements.

Recently I’ve been going through hell as well. If you want to be a data scientist, coding daily is not enough; you must also stay close to the tippy-top of what’s going on in the Data Science (DS)/Machine Learning (ML)/Artificial Intelligence (AI) and Computer Science (CS) fields. During my MSc in Theoretical Physics this was pretty easy. I would make spare time to code up personal projects and even earned a respectable 50k+ views on my old website. Ironically, once I began to pursue my passion for DS and did an MSc in CS, the number of projects I did fell abruptly: I found that while I was better equipped, I produced fewer personal projects. When I went to industry, even fewer personal projects were produced (okay, no personal projects were produced, until now) even though I was surrounded by a team of great Data Scientists! What the hell was going on?!

I had fallen for Goodhart’s Law, tunnelling my efforts to optimize the metric of “what project can teach me the most in the least amount of time?”. I had forgotten personal projects were supposed to be personal. I had forgotten that it’s okay if you don’t use a fancy machine learning algorithm and instead wanted to calculate the relative distribution of dogs dying as a function of movie genre. Luckily, triathlon helped me to break through this asymptote and start doing personal projects again. Hopefully this will be the start of many.

A Recipe for Disaster & A Pre-Cooked Solution

There are a LOT of things that go into setting up a website. Inevitably, numerous corner cases crop up, from tinkering with layouts to reviewing just how meticulous suggested SEO considerations can get. One unforeseen issue that arose while we were setting up this website was the loading time of our recipes page. Because we only wanted to bring our audience the highest-quality posts, we uploaded our highest-quality food pictures! Unfortunately, after some testing it became obvious that the hosting tier we are (as of writing this post) currently paying for is much too cheap to deliver those HQ images to you quickly. So a compromise had to be made. To decrease loading times, two main things could be done:
  1. Downscale images
  2. Crop images to make them smaller

Now, downscaling is by far the easier operation. This simply involves resizing your pictures before you upload them to your website. Cropping images is also a very good alternative, as it gets rid of all the extra stuff you don’t necessarily need to show AND, ideally, makes the picture look more pleasing. Obviously, the idea of a more pleasing photo for less memory is too tempting to pass up, so I had some work to do. But a quick look at the backlog of photos I had to crop (and, furthermore, the prospect of cropping all future photos) made the task seem boring and cumbersome.
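For the downscaling route, the Pillow library does the job in a few lines. A minimal sketch, where the image and the size cap are arbitrary examples of my own:

```python
from PIL import Image

def downscale(img, max_px=1200):
    """Return a copy whose longest side is at most `max_px`,
    preserving the aspect ratio (thumbnail never upscales)."""
    copy = img.copy()
    copy.thumbnail((max_px, max_px))
    return copy

# Demo on a synthetic 3000x2000 "photo" rather than a real file
photo = Image.new("RGB", (3000, 2000), "tomato")
small = downscale(photo)
print(small.size)  # (1200, 800)
```

In practice you would call `Image.open()` on each file and `save()` the result, optionally lowering the JPEG quality for further savings.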

Enter the Data (-Driven Solution)

Adapt what is useful, reject what is useless, and add what is specifically your own.

Bruce Lee

Training an original machine learning algorithm is fun but also laborious: the amount of time and effort that goes into preprocessing, labelling, and cleaning your data can be immense. It is also very likely that any model you could write for a personal project has already been studied in great detail within academia. The next best thing you can do, then, is to use your research skills to identify how your problem has already been solved in academia.

After a bit of googling, I found a recent paper that was exactly what I needed! An algorithm trained on online professional photography to assess the visual aesthetics of an image. The next step was to see if this work had been improved upon, so a quick search of forward citations led to another recent paper. In fact, once I tumbled down this rabbit hole, I found that a number of related papers have been published recently (e.g. by Twitter and Google). This is perhaps due in part to both sets of authors open-sourcing their code, something I immensely respect from any publisher, as it is not a requirement of their research. You can find both projects here and here. The rest of this blog post discusses the algorithms in brief detail. More experienced readers may prefer to just read the papers, but I enjoyed them so much I figured I’d write about them.

Never Half-Aesth Two Things, Whole-Aesth One Thing

[Figure: example professional photographs and random sub-crops, from the VFN paper]

Interestingly, Ron Swanson’s off-the-cuff woodsy wisdom is the key insight of these papers. Consider the two images above, taken directly from the paper. Each is a professional photograph that was uploaded to and featured by Flickr. It is undeniable that both of these pictures look 1000 times better than anything I could ever take, yet if I take a random subsection of these photos, they look much worse than the original. Furthermore, in theory, the photographer could have taken any of these random subsections of the photo, seeing as they were clearly within their line of vision! Thus the overall picture is inherently more aesthetic than any of its subsets. This key assumption unlocks thousands of images on the web as training data for a semi-supervised algorithm that can rank images by aesthetic quality.

Aesthetic Architecture

Given that the Ancient Greeks loved debating aesthetics, architecture, and math, I’m sure they’d love the neural network architecture from the paper, shown below.

[Figure: neural network architecture of the View Finding Network]

Modern-day machine learners may not get terribly excited about this architecture, though, as it is a fairly straightforward implementation of a convolutional neural network (CNN). For readers who would like an introduction or refresher on neural networks, here is a link to a Computerphile episode on them. The spatial pyramid pooling (SPP) layer allows images of arbitrary dimensions to be handled while preserving their aspect ratios. All of this goes into training an aesthetic function Φ that ranks cropped images in terms of aesthetics by learning just how bad the random crops are compared to the original.
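The training signal behind Φ is a margin-based pairwise ranking loss: the full photograph should outscore each of its random crops by some margin. A minimal plain-Python sketch, with entirely hypothetical score values:

```python
def ranking_hinge_loss(phi_full, phi_crop, margin=1.0):
    """Margin-based ranking loss: penalise the network whenever a random
    crop scores within `margin` of (or above) the full photograph."""
    return max(0.0, margin + phi_crop - phi_full)

# Hypothetical aesthetic scores for (full image, random crop) pairs
pairs = [(2.0, 0.3), (0.5, 0.4), (1.0, 1.5)]
losses = [ranking_hinge_loss(full, crop) for full, crop in pairs]
print(losses)  # only the well-separated first pair incurs zero loss
```

Minimising this loss pushes Φ(full) above Φ(crop) by at least the margin for every pair, which is all the ranking behaviour the cropping step needs.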

Once the network is trained on thousands of images, the user can assess the aesthetic quality of their photographs and assign each a score. By generating hundreds of potential random crops with the same aspect ratio as the original, the one with the highest score can be chosen as the best possible crop of the photo! Their work can be found on this GitHub page. They refer to this architecture as the View Finding Network (VFN). I’ve included some of the key examples from their paper.

[Figure: automatic cropping examples from the VFN paper]
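The candidate-generation step is simple enough to sketch. Here is a toy version of the sliding-window search, where the scales, the image size, and the stand-in scoring function are all assumptions of mine rather than the paper's exact values:

```python
import random

def candidate_crops(width, height, n=100, scales=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Generate candidate crop windows that keep the original aspect
    ratio, mimicking the VFN's sliding-window search."""
    crops = []
    for _ in range(n):
        s = random.choice(scales)
        w, h = round(width * s), round(height * s)
        x = random.randint(0, width - w)   # top-left corner stays in bounds
        y = random.randint(0, height - h)
        crops.append((x, y, w, h))
    return crops

def best_crop(crops, score):
    # `score` stands in for the trained aesthetic function Φ
    return max(crops, key=score)

# Toy score that simply prefers larger crops, purely for illustration
crops = candidate_crops(640, 480)
print(best_crop(crops, score=lambda c: c[2] * c[3]))
```

In the real pipeline, `score` would be a forward pass of the trained VFN over the cropped pixels instead of a one-line lambda.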

Steps Toward a Stronger Solution

[Figure: reinforcement learning image cropping]

While the photos above show some truly impressive results for automatic cropping, note that, as mentioned in the previous section, the crop always maintains the same aspect ratio, which means other images may not be cropped as nicely. The reason is that VFN uses a sliding-window method: it generates potential crops by calculating a bunch of smaller, same-aspect-ratio crops from the original. While this produces a large number of potential crops, removing the same-aspect-ratio condition would create exponentially more options to explore! Despite the ability of computers to brute-force some problems, the number of potential solutions is simply too large to be feasible for the VFN. Therefore, an alternative approach is used that utilizes reinforcement learning (RL). For those of you unfamiliar with RL, I’ve included a link to a small video talking about RL and how it fits into these types of machine learning.

The basic idea is to evaluate a small number of potential steps the cropping process could take and choose the best one, ranking them with the trained VFN. Once that step is taken, we repeat the process, looking at another set of potential steps and again choosing the best. In this way, one can ignore many bad crops (e.g. a tiny 4-by-4-pixel crop that doesn’t even show anything at all!) and approximate the best possible crop.
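This stepwise search can be sketched as a greedy loop. Everything below is illustrative: the action set, the window representation, and the toy score (which just prefers windows centred on a particular point) all stand in for the paper's actual action space and the VFN:

```python
def greedy_crop(window, actions, score, max_steps=20):
    """At each step, try every local adjustment in `actions` and keep
    the one the score likes best; stop when no action improves things."""
    for _ in range(max_steps):
        best = max((a(window) for a in actions), key=score)
        if score(best) <= score(window):
            break
        window = best
    return window

# Actions are small edits to an (x, y, w, h) window
shrink = lambda c: (c[0] + 5, c[1] + 5, c[2] - 10, c[3] - 10)
move_r = lambda c: (c[0] + 5, c[1], c[2], c[3])
move_d = lambda c: (c[0], c[1] + 5, c[2], c[3])

def toy_score(c):
    # Stand-in for the VFN: likes windows centred on (100, 100)
    cx, cy = c[0] + c[2] / 2, c[1] + c[3] / 2
    return -abs(cx - 100) - abs(cy - 100)

print(greedy_crop((0, 0, 120, 120), [shrink, move_r, move_d], toy_score))
```

The point is the shape of the search, not the specifics: a handful of local moves per step, ranked by the learned score, instead of enumerating every possible rectangle.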

Everyone’s a Critic (Even Computer Programs)

Now that you understand the idea of taking the best steps (as measured by the VFN), the question is: how do we take these steps efficiently? Surely it is better to reach the right answer in fewer steps, since it saves time and ends at the same result. Thus, the authors introduce a negative penalty for each step taken. The AI has to learn to balance this negative feedback with the positive feedback to reach the best state possible as quickly as possible. The process is described in the illustration below:

[Figure: actor-critic reinforcement learning architecture]

First, we have an image starting on the far left. This is passed through the convolutional layers (see above) and then into another neural network along with the original image. This neural network tries to predict the state value, which approximates the VFN aesthetic score function. It then selects the best action (a small cropping step) from the action space (all possible cropping steps) and performs the crop. The decision is assessed by comparing the predicted state value with the known VFN aesthetic value, and the process repeats.
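The effect of the step penalty is easy to see with numbers. In this sketch the reward for a step is the aesthetic gain minus a small fixed cost; the penalty value and the per-step scores are hypothetical, not taken from the paper:

```python
def step_reward(phi_new, phi_old, step_penalty=0.001):
    """Reward for one cropping action: the aesthetic gain (as scored
    by the VFN) minus a small fixed cost for taking the step at all."""
    return (phi_new - phi_old) - step_penalty

def episode_return(scores, step_penalty=0.001):
    # Total reward over a sequence of per-step aesthetic scores
    return sum(step_reward(b, a, step_penalty)
               for a, b in zip(scores, scores[1:]))

# Two hypothetical cropping episodes ending at the same 0.9-scoring crop
short = [0.2, 0.5, 0.9]             # reaches it in two steps
long_ = [0.2, 0.3, 0.5, 0.7, 0.9]   # reaches it in four steps

print(episode_return(short) > episode_return(long_))  # True
```

Both episodes gain the same total aesthetic improvement, but the longer one pays the step cost more times, so the agent is nudged toward short paths to good crops.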

This architecture is called an actor-critic architecture. Think of an actor or actress performing a scene. While they may think they are conveying the right emotion, only an objective third-party critic can truly assess how their actions serve the end goal. Thus, while the neural network agent might think it is performing the best cropping action, it needs feedback on how much better it could have done. The results are pretty amazing; check them out below:

[Figure: reinforcement learning cropping process, columns (a)-(f)]

The leftmost image (a) of each row is the input, while the second column (b) is the result of the first paper outlined in this post (the VFN). The three images that follow (c-e) are the authors’ attempts, with the last performing best. Finally, the rightmost image (f) is how a professional artist chose to crop the image. The overlap is fairly impressive!

How Can I Crop My Own Pics Using Machine Learning?

If you are an aspiring data scientist that thinks this was a fun project or another blogger like me that just doesn’t want to crop their pictures, here is my recommendation as to how to go about obtaining this program.

  1. Get yourself setup with Anaconda, a convenient package manager for Python.
  2. TF-A2RL is the easier paper to implement via their GitHub page. Go to the GitHub and follow the setup instructions, including the pickle download.
  3. The program as it comes takes one sample image located in a test directory. Write a Python script that instead reads in a folder and loops over every image in it. There are many ways to do this; I like an SO post that covers several of them. Probably the easiest uses the os module.
  4. Read the papers! Knowing what’s going on from a high-level overview always helps, and while I have spent some time outlining their methods here, it is obviously best to get it straight from the horse’s mouth if you wish to implement this yourself.
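For step 3, the folder loop might look like the sketch below. The folder name and extension list are assumptions; swap the print for however you invoke the TF-A2RL script:

```python
import os

# Hypothetical folder name; point this at your own image directory
IMAGE_DIR = "test_images"
VALID_EXT = (".jpg", ".jpeg", ".png")

def images_in(folder):
    """Yield the path of every image file in `folder`, in sorted order."""
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(VALID_EXT):
            yield os.path.join(folder, name)

if os.path.isdir(IMAGE_DIR):
    for path in images_in(IMAGE_DIR):
        # Replace this print with a call into the TF-A2RL cropping script
        print("cropping:", path)
```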


God is Dead.

Most angsty teens have heard this quote. In fact, most people have heard this quote. Unfortunately, the context in which it is usually said is decoupled from how it appears in The Gay Science. Nietzsche wanted us to realize that the Enlightenment had converted many to the scientific point of view, and religion as we knew it had died. I don’t think we will be saying “Art is Dead” at the hands of the machine learning community anytime soon. I believe it will be much more like the path chess took, raising humans to next-level gameplay and analysis. I expect in the future this technology may be applied to whatever social media sharing website crops up, letting users post pictures that better convey the beauty they saw and share it with others. I will be truly impressed to see the algorithms they develop to dial in those business use cases! For now, however, I am grateful at the very least that some algorithms exist to save me from cropping my food images. If you’ve made it all the way down here and want more reading, why not go check out the recipe page where I applied this?
