What You’re Getting Into
- Audience: Aspiring Data Scientists or Interested Layperson
- Subject: Using semi-supervised and reinforcement learning machine learning techniques to train neural networks for automatic image cropping
- Time: 20min to 60min depending on level and investigation of supplemental material
- Motivation: My first (mini) personal project after a long hiatus. An explanation of two papers I recently read, just loved, and used when creating the recipe picture layout on our La Cucina page.
- Takeaway: Some pointers to resources for getting this set up and running on your own computer
Goodhart’s Law & Finally Getting Off My Assymptote
There’s an extremely useful adage in economics, succinctly paraphrased by Marilyn Strathern, which I’ve quoted to the right. A classic example is the Cobra Effect, named after an incident during British colonial rule in India. The British government, unhappy with the number of cobras in Delhi, set a bounty proportional to the number of dead cobras handed over to them. Eventually, people started breeding cobras specifically to kill them for government money. When the British government realized they had been duped and removed the bounty, the breeding farms released their now-worthless snakes, leading to a cobra surplus in Delhi. Indeed, the road to hell is paved with good intentions, and the British government found themselves with a hell of a snake problem after only trying to make improvements.
Recently I’ve been going through hell as well. If you want to be a data scientist, coding daily is not enough — you must also stay close to the tippy-top of what’s going on in the Data Science (DS)/Machine Learning (ML)/Artificial Intelligence (AI) and Computer Science (CS) fields. During my MSc in Theoretical Physics this was pretty easy. I would make spare time to code up personal projects and even earned a respectable 50k+ views on my old website. Ironically, once I began to pursue my passion for DS and did an MSc in CS, the number of projects I did fell abruptly; I found that while I was better equipped, I produced fewer personal projects. When I went to industry, even fewer personal projects materialized (okay, none were produced until now), even though I was surrounded by a team of great Data Scientists! What the hell was going on?!
I had fallen for Goodhart’s Law, tunnelling my efforts to optimize the metric of “what project can teach me the most in the least amount of time?”. I had forgotten personal projects were supposed to be personal. I had forgotten that it’s okay if you don’t use a fancy machine learning algorithm and instead just want to calculate the relative distribution of dogs dying as a function of movie genre. Luckily, triathlon helped me break through this asymptote and start doing personal projects again. Hopefully this will be the start of many.
A Recipe for Disaster & A Pre-Cooked Solution
- Downscale images
- Crop images to make them smaller
Now, downscaling is by far the easiest operation. This simply involves resizing your pictures before you upload them to your website. Cropping images is also a very good alternative, as it gets rid of all the extra stuff you don’t necessarily need to show AND, ideally, makes the picture look more pleasing. Obviously, the idea of a more pleasing photo for less memory was too tempting to pass up, so I had some work to do. But a quick look at the backlog of photos I had to crop, plus the prospect of cropping all future photos, made the task seem boring and cumbersome.
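To give a feel for how cheap downscaling is, here’s a minimal sketch of the arithmetic: cap the long edge at some maximum size while preserving the aspect ratio. (In practice a library like Pillow does the actual resampling; the 1024-pixel cap below is just an example value.)

```python
def downscale_dims(width, height, max_side=1024):
    """New (width, height) with the long edge capped at max_side,
    preserving the original aspect ratio."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, leave it alone
    scale = max_side / longest
    return round(width * scale), round(height * scale)

print(downscale_dims(4000, 3000))  # a 12-megapixel 4:3 photo -> (1024, 768)
```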
Enter the Data (-Driven Solution)
Training an original machine learning algorithm is fun but also laborious: the amount of time and effort that goes into preprocessing, labelling, and cleaning your data can be immense. Also, any model you could write for a personal project has likely already been studied in great detail within academia. The next best thing you can do, then, is to use your research skills to identify how your problem has already been solved in academia.
After a bit of googling, I found a recent paper that was exactly what I needed! An algorithm trained on online professional photography to assess the visual aesthetics of an image. The next step was to see if this work had been improved upon, so a quick search of forward citations led to another recent paper. In fact, once I tumbled down this rabbit hole, I found that a number of papers have been published recently (e.g. Twitter, Google). This is perhaps due in part to both sets of authors open-sourcing their code, something I immensely respect from any publisher as it is not a necessary requirement of their research. You can find both projects here and here. The rest of this blog post discusses the algorithms in brief detail. More experienced readers may prefer to just read the papers, but I enjoyed them so much I figured I’d write about them.
Never Half-Aesth Two Things, Whole-Aesth One Thing
Interestingly, Ron Swanson’s off-the-cuff woodsy wisdom captures the key insight of these papers. Consider the two images to the right of this paragraph, taken directly from the paper. Each is a professional photograph that was uploaded to and featured by Flickr. It is undeniable that both of these pictures look 1000 times better than anything I could ever take — yet if I take a random subsection of either photo, it looks much worse than the original. Furthermore, in theory, the photographer could’ve taken any of these random subsections, seeing as they were clearly within their line of vision! Thus the overall picture is inherently more aesthetic than any of its subsections. This key assumption unlocks thousands of images on the web to train a semi-supervised algorithm that can rank images by aesthetic quality.
Given the Ancient Greeks loved debating aesthetics, architecture, and math, I’m sure they’d love the neural network architecture of the paper given below.
Modern-day machine learners may not get terribly excited about this architecture, though, as it is a fairly straightforward implementation of a convolutional neural network (CNN). For readers who would like an introduction or refresher on neural networks, here is a link to a Computerphile episode on them. The spatial pyramid pooling (SPP) layer allows images of arbitrary dimensions to be handled while preserving their aspect ratios. All of this goes into training an aesthetic function Φ that ranks cropped images by learning just how bad the random crops are compared to the original.
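That ranking idea can be sketched as a pairwise hinge loss: the full professional photo should out-score any random crop of itself by some margin. (The margin value and exact form below are illustrative, not the paper’s precise formulation or hyperparameters.)

```python
def hinge_ranking_loss(score_full, score_crop, margin=1.0):
    """Zero when the full photo out-scores its random crop by at least
    `margin`; otherwise, penalize proportionally to the violation."""
    return max(0.0, margin + score_crop - score_full)

print(hinge_ranking_loss(3.0, 1.5))  # 0.0 -> already ranked correctly
print(hinge_ranking_loss(2.0, 1.8))  # positive -> crop scored too close to the original
```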
Once trained on thousands of images the user can now assess the aesthetic quality of their photographs and assign a score to them. By generating hundreds of potential random crops with the same aspect ratio as the original, the one with the highest score can be chosen as the best possible crop of the photo! Their work can be found on this github page. They refer to this architecture as the View Finding Network (VFN). I’ve included some of the key examples from their paper.
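In code, the score-and-pick idea might look something like the sketch below, where `score_fn` stands in for the trained aesthetic network Φ (replaced here by a dummy function purely so the example runs):

```python
import random

def random_crop_box(width, height, scale):
    """One candidate crop with the same aspect ratio as the original."""
    w, h = round(width * scale), round(height * scale)
    x = random.randint(0, width - w)
    y = random.randint(0, height - h)
    return (x, y, x + w, y + h)

def best_crop(width, height, score_fn, n_candidates=100):
    """Generate many random same-aspect-ratio crops, score each one,
    and keep the highest-scoring candidate."""
    candidates = [random_crop_box(width, height, random.uniform(0.5, 0.9))
                  for _ in range(n_candidates)]
    return max(candidates, key=score_fn)

# Dummy stand-in for the trained network: just prefers larger crops.
dummy_score = lambda box: (box[2] - box[0]) * (box[3] - box[1])
box = best_crop(1200, 800, dummy_score)
print(box)  # (x1, y1, x2, y2) of the winning crop
```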
Steps Toward a Stronger Solution
While the photos above show some truly impressive results for automatic cropping, note that, as mentioned in the previous section, the crop always maintains the original aspect ratio, so other images may not crop as nicely. The reason is that VFN uses a sliding-window method: it generates candidate crops by sliding a bunch of smaller, same-aspect-ratio windows over the original. While this already produces a large number of candidates, removing the same-aspect-ratio condition would create exponentially more options to explore! Despite computers’ ability to brute-force some problems, the number of potential solutions is simply too large for VFN to search exhaustively. Therefore, an alternative approach is used that utilizes reinforcement learning (RL). For those of you unfamiliar with RL, I’ve included a link to a short video on RL and how it fits into these types of machine learning.
The basic idea is to evaluate a small number of potential steps the cropping process can take, rank them with the trained VFN, and choose the best one. Once that step is taken, we repeat the process on another set of potential steps and again choose the best. In this way, one can ignore the many bad crops (e.g. a tiny 4-by-4-pixel crop that doesn’t show anything at all!) and converge on an approximately best crop.
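Here’s a toy version of that stepwise idea written as a plain greedy search. (The paper actually learns an RL policy rather than greedily re-scoring every action, and the action space and scoring function below are made up for illustration.)

```python
def greedy_crop(box, score_fn, actions, max_steps=50):
    """Repeatedly try each small cropping action, keep the one that most
    improves the score, and stop once no action helps."""
    best, best_score = box, score_fn(box)
    for _ in range(max_steps):
        candidate = max((a(best) for a in actions), key=score_fn)
        if score_fn(candidate) <= best_score:
            break  # no step improves the crop any further
        best, best_score = candidate, score_fn(candidate)
    return best

# A hypothetical action space: nudge one edge of the (x1, y1, x2, y2) box.
STEP = 10
actions = [
    lambda b: (b[0] + STEP, b[1], b[2], b[3]),  # trim left
    lambda b: (b[0], b[1] + STEP, b[2], b[3]),  # trim top
    lambda b: (b[0], b[1], b[2] - STEP, b[3]),  # trim right
    lambda b: (b[0], b[1], b[2], b[3] - STEP),  # trim bottom
]

# Dummy score: prefer boxes near a "ground truth" crop, just so the
# sketch runs end-to-end; the papers use the learned VFN score instead.
target = (50, 30, 550, 430)
score = lambda b: -sum(abs(u - v) for u, v in zip(b, target))

print(greedy_crop((0, 0, 600, 450), score, actions))  # converges to target
```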
Everyone’s a Critic (Even Computer Programs)
Now that you understand the idea of taking the best steps (as measured by VFN), the question is how we take these steps efficiently. Surely it is better to reach the right answer in fewer steps than to reach the same answer in more, as it saves a lot of time for the same result. Thus, the authors introduce a small negative penalty for each step taken. The AI has to learn to balance this negative feedback with the positive feedback to reach the best state as quickly as possible. The process is described in the illustration below:
First, we have an image starting on the far left. This is passed through the convolutional layers (see above) and then into another neural network along with the original image. This neural network tries to predict the state value that approximates the VFN aesthetic score function. It then selects the best action (a small cropping step) from the action space (all possible cropping steps) and performs the crop. The decision is assessed by comparing the predicted state value with the known VFN aesthetic value, and the process is repeated.
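Condensed into code, the per-step feedback might look like this sketch: reward the change in aesthetic score, minus a small cost for taking a step at all. (The penalty constant here is a made-up placeholder, not the paper’s value.)

```python
def step_reward(score_before, score_after, step_penalty=0.001):
    """Feedback for one cropping action: the improvement in the VFN
    aesthetic score, minus a small fixed cost per step taken."""
    return (score_after - score_before) - step_penalty

# A step that improves the crop earns a positive reward...
print(step_reward(1.0, 1.5))
# ...while a pointless step is punished by the time penalty alone.
print(step_reward(1.0, 1.0))
```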
This architecture is called an actor-critic architecture. Think about an actor/actress performing a scene. While they may think they are conveying the right emotion, only an objective third-party critic can truly assess how their actions fit together and serve the end goal. Thus, while the neural network agent might think it is performing the best cropping action, it needs feedback on how much better it could’ve done. The results are pretty amazing; check them out below:
The leftmost image (a) of each row is the input, while the second column (b) is the result of the first paper outlined in this post (VFN). The three images following (c-e) are the authors’ attempts, with the last performing the best. Finally, the rightmost image (f) is how a professional artist chose to crop the image. The overlap is fairly impressive!
How Can I Crop My Own Pics Using Machine Learning?
If you are an aspiring data scientist that thinks this was a fun project or another blogger like me that just doesn’t want to crop their pictures, here is my recommendation as to how to go about obtaining this program.
- Get yourself setup with Anaconda, a convenient package manager for Python.
- TF-A2RL is the easier paper to implement using their GitHub page. Go to the GitHub repo and follow README.md, including the pickle download.
- The program as it comes takes one sample image located in a test directory. Write a Python script that reads the folder and loops over every image in it instead. There are many ways to do this. I like an SO post that has some of them. Probably the easiest is using the os module.
- Read the papers! Knowing what’s going on from a high-level overview always helps, and while I have spent some time outlining their methods here, it is obviously best to get it straight from the horse’s mouth if you wish to implement them yourself.
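To flesh out the folder-looping bullet above, here’s a minimal sketch using the os module (the folder name and extension list are just examples):

```python
import os

def iter_images(folder, extensions=(".jpg", ".jpeg", ".png")):
    """Yield the full path of every image file in `folder`, sorted by name."""
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(extensions):
            yield os.path.join(folder, name)

# Usage with a hypothetical photo folder:
# for path in iter_images("test_images"):
#     ...call the TF-A2RL cropping script on `path`...
```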