A typical JPEG file contains a lot more than just the bytes required to store image data. It also carries metadata: auxiliary information about the image. When present, this metadata occupies about 16% of the size of the JPEG file on average.
In this post, we will specifically focus on the different kinds of metadata stored in JPEG images and the impact this has on performance of a webpage.
We used the data from the HTTP Archive and some BigQuery magic to generate a list of URLs of valid JPEG images. We used the HTTP Archive crawl from 1st August, 2016, which contained 4.3 million JPEG images from the top 500,000 websites, totalling 195 GB. We then downloaded and analysed the metadata from this dataset.
We found that 38.9% of those images had some sort of metadata in them. In these images, metadata occupies 15.8% of the total image size on average. Let that sink in - if each of these top sites were visited just once, nearly 13 GB of data transfer could have been saved if these websites were handling metadata properly!
Before we look at why so much metadata is stored in images, let us take a peek at how it is usually stored. There are different formats which can be used to store metadata in JPEG images. The most common ones include:
Well, turns out a lot of people do! 1,647,831 (38.9%) of the images we analysed stored some sort of metadata. An average JPEG file with metadata looks like this:
Here is a distribution showing the different formats used to store the metadata and how much space they occupy on average:
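As a rough sketch of how such a per-format breakdown could be measured, the snippet below walks the segments at the start of a JPEG and sums the bytes used by each metadata container. It is not the tooling we used for the analysis; the names `iter_segments`, `metadata_sizes` and `make_segment` are made up for this illustration, the signatures are the conventional prefixes for each format, and the parser assumes a well-formed file with no padding bytes between segments.

```python
import struct
from collections import Counter

# (marker, payload prefix, label) for common metadata containers.
# IPTC data normally lives inside the Photoshop APP13 segment.
SIGNATURES = [
    (0xE0, b"JFIF\x00", "JFIF"),
    (0xE1, b"Exif\x00\x00", "EXIF"),
    (0xE1, b"http://ns.adobe.com/xap/1.0/\x00", "XMP"),
    (0xE2, b"ICC_PROFILE\x00", "ICC"),
    (0xED, b"Photoshop 3.0\x00", "IPTC"),
]

def iter_segments(data):
    """Yield (marker, payload) for every segment before the first scan."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker == 0xDA:  # SOS: compressed image data follows
            return
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        yield marker, data[pos + 4:pos + 2 + length]
        pos += 2 + length

def metadata_sizes(data):
    """Return bytes used per metadata format, including marker and length bytes."""
    sizes = Counter()
    for marker, payload in iter_segments(data):
        if marker == 0xFE:  # COM: free-text comment
            label = "COM"
        elif 0xE0 <= marker <= 0xEF:  # APP0..APP15
            label = next((name for m, sig, name in SIGNATURES
                          if m == marker and payload.startswith(sig)),
                         "other APPn")
        else:
            continue  # DQT, DHT, SOF, ...: needed to decode the image
        sizes[label] += len(payload) + 4
    return dict(sizes)

def make_segment(marker, payload):
    """Build one synthetic segment (for the demo below only)."""
    return bytes([0xFF, marker]) + struct.pack(">H", len(payload) + 2) + payload

# A tiny synthetic header, not a decodable image.
sample = (b"\xff\xd8"
          + make_segment(0xE1, b"Exif\x00\x00" + bytes(10))
          + make_segment(0xE2, b"ICC_PROFILE\x00" + bytes(20))
          + make_segment(0xDB, bytes(65))
          + make_segment(0xDA, bytes(6)))
print(metadata_sizes(sample))  # {'EXIF': 20, 'ICC': 36}
```

On a real file you would pass in the bytes read from disk, for example `metadata_sizes(open(path, "rb").read())`.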
To go deeper into what kind of information is stored as metadata, we looked at all the EXIF data stored in the images. This usually contains information like the time the photo was taken, camera settings and so on. These were the 50 most common attributes stored as EXIF data.
As you can see, most of this information is not required by the browser to render the image. So you might be asking, “why is sending metadata with images still a thing?” (John Oliver fans, anyone?)
Before we start stripping away all the metadata from our images, there are some use-cases you might have to consider.
Some software stores thumbnails of the image as EXIF data too, so you can have a JPEG image as a thumbnail within another JPEG image. For example, here is a JPEG image that we found in our dataset (it is actually 17.6 MB. You have been warned!! I know I could have used a smaller image, but I just wanted to prove a point about how nasty images can be in the wild, even among popular websites) and here is the thumbnail (6.6 KB) which was stored as EXIF data within the image.
Browsers do not make use of this data, so it should be stripped out before you deliver the image to your web users.
One of the primary uses of IPTC metadata that we have seen is storing the name of the photographer and other copyright information. This might be important depending on your use case, though most social sharing sites automatically strip out this information. And of course, malicious users can always remove this information and redistribute the photo.
EXIF data can also contain orientation information for the picture, and browsers take cues from this to orient the image correctly. We have seen 9.1% of images store orientation information as part of the EXIF data. This attribute comes into play only when you view the image directly - say, by visiting its URL. When you embed the image in a webpage, say via an image tag, this attribute has no effect at all (except in mobile Safari). IE <= 11 doesn't recognize this attribute in either case.
Stripping out this attribute seems to be a good idea to be consistent. That way users see the image oriented the same way regardless of whether they visit the image directly or see it as part of a webpage.
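Before stripping the attribute, it can help to know which of your images actually carry it. Here is a minimal sketch, using only the Python standard library, that reads the EXIF Orientation tag (0x0112) from the first IFD of a JPEG. The function names are hypothetical, and the parser assumes a well-formed file with the tag stored as a single SHORT in IFD0; a production pipeline would more likely rely on an image library.

```python
import struct

def exif_orientation(data):
    """Return the EXIF Orientation value (1-8) of a JPEG, or None if absent."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker in (0xDA, 0xD9):  # SOS or EOI: no more header segments
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        if marker == 0xE1 and payload.startswith(b"Exif\x00\x00"):
            return _orientation_from_tiff(payload[6:])
        pos += 2 + length
    return None

def _orientation_from_tiff(tiff):
    """Scan IFD0 of the TIFF structure embedded in the EXIF segment."""
    order = {b"II": "<", b"MM": ">"}.get(tiff[:2])  # little or big endian
    if order is None or len(tiff) < 8:
        return None
    magic, ifd0 = struct.unpack(order + "HI", tiff[2:8])
    if magic != 42 or ifd0 + 2 > len(tiff):
        return None
    (count,) = struct.unpack(order + "H", tiff[ifd0:ifd0 + 2])
    for i in range(count):
        entry = tiff[ifd0 + 2 + 12 * i:ifd0 + 14 + 12 * i]
        if len(entry) < 12:
            break
        tag, typ, n = struct.unpack(order + "HHI", entry[:8])
        if tag == 0x0112 and typ == 3 and n == 1:  # Orientation, SHORT, 1 value
            return struct.unpack(order + "H", entry[8:10])[0]
    return None

# Synthetic example: a big-endian EXIF block whose only tag is Orientation = 6.
tiff = (b"MM" + struct.pack(">HI", 42, 8)                 # TIFF header, IFD0 at offset 8
        + struct.pack(">H", 1)                            # one IFD entry
        + struct.pack(">HHIHH", 0x0112, 3, 1, 6, 0)       # the Orientation entry
        + struct.pack(">I", 0))                           # no further IFDs
payload = b"Exif\x00\x00" + tiff
sample = (b"\xff\xd8" + b"\xff\xe1"
          + struct.pack(">H", len(payload) + 2) + payload + b"\xff\xd9")
print(exif_orientation(sample))  # 6
```

A value of 1 (or `None`) means no rotation is requested; any other value means the picture will look different depending on whether the viewer honours the tag, so you may want to apply the rotation to the pixels before removing the tag.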
This seems to be the strongest use case for storing metadata in JPEGs. 10% of the images we analysed had an embedded colour profile, and the profiles were about 7 KB in size on average. Images can store these colour profiles as ICC metadata. They help images look consistent regardless of the capabilities of the device used to view them.
However, browser support for these ICC profiles has been very flaky, and there aren't many screens which can take advantage of these embedded profiles.
By default, Dexecure does not strip off ICC profiles from JPEG images since they can lead to different looking images in some cases.
Apart from the obvious disadvantage of the extra data transferred to the user, storing a lot of metadata has the following other disadvantages.
In a JPEG file, EXIF metadata appears before other information such as the frame containing the height and width of the image and the scans containing the image data itself. Browsers use the height and width information in the image to help lay out the page, and the earlier the browser finds this information, the better.
It also takes longer for the browser to render the image, since it has to download all the metadata in the JPEG file before it starts getting the scans containing the image data. Chrome prioritises the download of the first image that it sees in a webpage, and you would be wasting crucial bytes sending metadata instead of the image content itself. This especially matters during the initial stages of the page load, when the TCP congestion window is still quite small due to TCP slow start.
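To see how far into a file the browser has to read before it learns the dimensions and reaches the first scan, you can record the byte offsets of the frame header (SOF) and the start of scan (SOS). The sketch below is illustrative only: `header_offsets` and `make_segment` are hypothetical names, and it assumes a well-formed JPEG without padding between segments.

```python
import struct

# SOF0..SOF15, minus the markers in that range that are not frame headers
# (DHT = 0xC4, JPG = 0xC8, DAC = 0xCC).
SOF_MARKERS = set(range(0xC0, 0xD0)) - {0xC4, 0xC8, 0xCC}

def header_offsets(data):
    """Return offsets of the first SOF and SOS markers plus the image size."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    info = {"sof_offset": None, "sos_offset": None, "width": None, "height": None}
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker in SOF_MARKERS and info["sof_offset"] is None:
            # SOF payload: precision (1 byte), height (2), width (2), ...
            height, width = struct.unpack(">HH", data[pos + 5:pos + 9])
            info.update(sof_offset=pos, width=width, height=height)
        if marker == 0xDA:  # SOS: image data starts right after this header
            info["sos_offset"] = pos
            break
        pos += 2 + length
    return info

def make_segment(marker, payload):
    """Build one synthetic segment (for the demo below only)."""
    return bytes([0xFF, marker]) + struct.pack(">H", len(payload) + 2) + payload

# Synthetic header: 1000 bytes of EXIF padding ahead of a 640x480 frame header.
sample = (b"\xff\xd8"
          + make_segment(0xE1, b"Exif\x00\x00" + bytes(1000))
          + make_segment(0xC0, struct.pack(">BHHB", 8, 480, 640, 1) + b"\x01\x11\x00")
          + make_segment(0xDA, bytes(6)))
print(header_offsets(sample))
# {'sof_offset': 1012, 'sos_offset': 1025, 'width': 640, 'height': 480}
```

In this synthetic file the dimensions only show up after 1012 bytes, nearly all of which are metadata; without the EXIF segment the frame header would sit right after the two-byte SOI marker.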
There have been quite a few attacks, on both the server side and the client side, because of improper handling of image metadata. As with any other user-provided data, make sure you sanitise it properly before using it in a security-sensitive context.
As you have seen, metadata provides quite a lot of information about the image - sometimes too much.
The GPS information stored in one of John McAfee’s pictures almost gave his location away, and much more recently, Apple inadvertently revealed more than they had probably intended with one of the default desktop images shipped in the recent Mac OS X release.
Most of the metadata can be stripped for images that are intended to be delivered on the web. Of course, removing metadata from images is one of the techniques Dexecure uses to produce leaner images.
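To make the idea concrete, here is a minimal sketch of lossless metadata stripping: it copies the file segment by segment and drops the APPn and comment segments, leaving the compressed image data untouched. This is not Dexecure's implementation. The function name `strip_metadata` and the `keep_icc` flag are made up for this example, and keeping the JFIF (APP0) and Adobe (APP14) segments is a conservative assumption, since decoders can use them to interpret the image. The orientation caveat discussed earlier still applies: EXIF orientation is removed without rotating the pixels.

```python
import struct

def strip_metadata(data, keep_icc=True):
    """Return a copy of a JPEG without EXIF, XMP, IPTC and comment segments."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    out = bytearray(b"\xff\xd8")
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        if marker == 0xDA:  # SOS: stop filtering, image data follows
            break
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        segment = data[pos:pos + 2 + length]
        is_app = 0xE0 <= marker <= 0xEF
        is_comment = marker == 0xFE
        is_icc = marker == 0xE2 and segment[4:].startswith(b"ICC_PROFILE\x00")
        keep = (
            not (is_app or is_comment)      # DQT, DHT, SOF, DRI: needed to decode
            or marker in (0xE0, 0xEE)       # JFIF and Adobe headers (assumption: keep)
            or (is_icc and keep_icc)        # colour profile, kept by default
        )
        if keep:
            out += segment
        pos += 2 + length
    out += data[pos:]  # copy the scans (and anything unparsed) verbatim
    return bytes(out)

def make_segment(marker, payload):
    """Build one synthetic segment (for the demo below only)."""
    return bytes([0xFF, marker]) + struct.pack(">H", len(payload) + 2) + payload

# Synthetic file: EXIF, ICC and a comment ahead of one table and a fake scan.
sample = (b"\xff\xd8"
          + make_segment(0xE1, b"Exif\x00\x00" + bytes(50))
          + make_segment(0xE2, b"ICC_PROFILE\x00" + b"\x01" * 20)
          + make_segment(0xFE, b"hello")
          + make_segment(0xDB, bytes(65))
          + make_segment(0xDA, bytes(6))
          + b"\x12\x34" + b"\xff\xd9")
print(len(sample),
      len(strip_metadata(sample)),
      len(strip_metadata(sample, keep_icc=False)))  # 190 121 85
```

Keeping the ICC profile by default mirrors the behaviour described above; pass `keep_icc=False` only if you are sure your images look the same without the profile.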
What other use cases have you seen for metadata? Do you remove it from your images? Let us know in the comments below!
Stay updated, because #perfmatters!