A few thoughts about steganography

January 3, 2004 - updated May 5, 2004

This is not an exhaustive presentation of the goals, techniques, software or history of steganography. On this page I will write down a few things that come to mind while I play around with steganography software or read an interesting article. Maybe I will also try to answer questions I often receive by e-mail. I'm not an expert in this field, just an enlightened hobbyist with an amateurish background in reverse engineering DOS and Windows software. If you want real experts, there are plenty in the academic field, and sometimes they even write very readable and interesting articles, if you skip the paragraphs with obscure mathematical formulas. Check here for example. Very enlightening reading.

The texts I have written about steganography software are very simple and do not involve any statistical analysis. What is more, I only talk about the software I can break, so there is an obvious bias towards the weakest programs.

A note: here I will mainly talk about image formats, but the same applies to other formats as well. When you read BMP / JPG, you can replace it in your mind with WAV / MP3.

Why is steganography needed?

We all know the purpose of steganography: to hide (generally encrypted) data inside other data. The main problem is that, once you encrypt a file with strong crypto, it looks like a random stream of bytes. And, curiously, random bytes are extremely rare in the computer world. At every level, every layer, from the smallest TCP/IP packet that travels through routers on the Internet to the biggest DivX movie you have on your hard drive, information is formatted into fixed structures, defined file types, strict hierarchical models. Basically, there is no random data in computers. So a sudden stream of random bytes appearing somewhere is as visible as an elephant in a supermarket: it obviously does not belong there. It is very easy to detect in a flow of trillions of structured bits. And, if you are the curious type (or an FBI agent), an isolated stream of random bytes, like a PGP-encrypted message in an e-mail for example, is going to be flagged as suspicious.

The point of cryptography is to transform structured and intelligible data (like a text file) into a stream of random-looking bytes.

The point of steganography is in some sense the opposite: to mix random-looking data with decoy information so that the result looks like a structured format.
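To make the "random bytes stand out" argument a bit more concrete, here is a minimal sketch that measures the Shannon entropy of a byte string. It is only an illustration: the function name byte_entropy is mine, the sample text is made up, and os.urandom merely stands in for the output of a strong cipher (no real encryption is performed).

```python
import math
import os
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Structured data: a made-up English sentence repeated many times.
structured = b"Meet me at the usual place at noon. " * 500
# Stand-in for strong-crypto output: bytes from the OS random source.
random_like = os.urandom(len(structured))

print(f"text:        {byte_entropy(structured):.2f} bits/byte")
print(f"random-like: {byte_entropy(random_like):.2f} bits/byte")
```

The random-like stream should come out very close to 8 bits per byte, while the text stays far below that. A high entropy value alone proves nothing (compressed files look similar), but it shows why an isolated encrypted blob is easy to spot among structured data.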

Why is JPG steganography software so rare?

There is plenty of software around that can hide data in BMP images. Unfortunately, BMP pictures are not widely used or exchanged, unlike JPG. So why don't programmers write steganography programs for JPG?

I would say there are two main reasons.

The first one is technical. The JPG format is very complex, an order of magnitude more complex than a flat, uncompressed format like BMP. To go from a JPG file to actual pixels, you have to 1. Huffman-decode a stream of bytes using the trees stored in the header, 2. RLE-decode the high-frequency coefficients, 3. DPCM-decode the low-frequency coefficients, 4. reorder these coefficients, 5. dequantize them with the tables stored in the header, 6. apply an inverse discrete cosine transform on 8x8 blocks to get the values, 7. upsample the two color components, 8. mix the three components to finally get RGB pixel data. And I'm talking about the canonical JPG. There are many variations you must take into account. Although there are a few libraries that deal with this format, and although for steganography with quantized DCT coefficients you can stop between steps 3 and 4 (and then recompress everything), few programmers will have the patience to tear apart the guts of a JPG image. That's why someone who just wants to play around with steganography, not very seriously, will use the BMP format. And that's why even serious programs that claim to do JPG steganography actually fake it (Invisible Secrets hides the data in the comment field of the header; SecurEngine adds the hidden data at the end of the file; nevertheless, both are not bad at BMP steganography, and were certainly coded by people who understand cryptography and steganography).
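To give an idea of what the middle of that pipeline looks like, here is a rough sketch of steps 3, 4 and 5 on toy data. It assumes the Huffman and RLE decoding are already done, works on plain Python lists, and uses a made-up flat quantization table. All function names are mine; this is not how any particular JPG library is organized.

```python
def dpcm_decode(differences):
    """Step 3: rebuild DC values stored as differences from the previous block."""
    values, previous = [], 0
    for diff in differences:
        previous += diff
        values.append(previous)
    return values


def zigzag_cells(size=8):
    """Cell coordinates (row, col) in zigzag scan order for a size x size block."""
    cells = [(r, c) for r in range(size) for c in range(size)]
    # Walk the anti-diagonals; odd ones go down-left, even ones go up-right.
    return sorted(cells, key=lambda rc: (rc[0] + rc[1],
                                         rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))


ZIGZAG = zigzag_cells()


def unzigzag(coefficients):
    """Step 4: put 64 coefficients read in zigzag order back into an 8x8 grid."""
    block = [[0] * 8 for _ in range(8)]
    for value, (r, c) in zip(coefficients, ZIGZAG):
        block[r][c] = value
    return block


def dequantize(block, table):
    """Step 5: multiply each coefficient by the matching quantization step."""
    return [[block[r][c] * table[r][c] for c in range(8)] for r in range(8)]


# Hypothetical inputs: three DC differences and one sparse block.
print(dpcm_decode([52, -3, 4]))
flat_table = [[16] * 8 for _ in range(8)]  # made-up table, not a real JPG one
coefficients = [12, -2, 1] + [0] * 61      # already in zigzag order
for row in dequantize(unzigzag(coefficients), flat_table)[:2]:
    print(row)
```

A program that hides data in quantized DCT coefficients would work on the values as they are before dequantize is called, which is why it never needs the inverse DCT or the color conversion.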

The second reason is more conceptual. The idea of lossy compression like the one used in JPG (or MP3 for audio) is to remove most of the unimportant or redundant information. The idea of most steganography algorithms is to hide bits by replacing this very same unimportant or redundant information (like the least significant bits). So the two techniques pull in opposite directions. The more you compress, the harder it is to find room to hide data.
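For readers who have never seen it, here is a minimal sketch of what replacing least significant bits means on an uncompressed carrier. It works on a raw sequence of sample bytes (think of the pixel data of a BMP with the header already stripped, which is an assumption of this example), and the function names are mine.

```python
def embed_lsb(carrier: bytes, payload: bytes) -> bytes:
    """Overwrite the lowest bit of the first carrier bytes with the payload bits."""
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    if len(bits) > len(carrier):
        raise ValueError("payload too large for this carrier")
    out = bytearray(carrier)
    for index, bit in enumerate(bits):
        out[index] = (out[index] & 0xFE) | bit
    return bytes(out)


def extract_lsb(stego: bytes, n_bytes: int) -> bytes:
    """Read n_bytes back from the lowest bits of the first carrier bytes."""
    out = bytearray()
    for i in range(n_bytes):
        value = 0
        for sample in stego[i * 8:(i + 1) * 8]:
            value = (value << 1) | (sample & 1)
        out.append(value)
    return bytes(out)


# Hypothetical carrier: 256 sample values standing in for pixel data.
carrier = bytes(range(256))
stego = embed_lsb(carrier, b"hi")
print(extract_lsb(stego, 2))
print(max(abs(a - b) for a, b in zip(carrier, stego)))
```

Each sample changes by at most 1, which the eye cannot see in a BMP. A lossy compressor throws away exactly that kind of detail, so the same trick applied to pixels would not survive a JPG save.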

Is there a quality hierarchy for steganography software?

I would classify steganography software into several categories of increasing quality. As a scale it is a little bit artificial, not defined by golden rules, but I would say it is a pragmatic way to quickly evaluate a program. Notice that I am not taking into account any upstream encryption applied before the steganography step. That's another matter, although generally, programs that offer a solid, known and published encryption algorithm should be trusted more than any "in-house" obfuscation method. So the techniques are, following the just-invented Guillermito classification:

1. Adding data at the end of the carrier file (example: Camouflage, JpegX, SecurEngine for JPG, Safe&Quick Hide Files 2002, Steganography 1.50).

2. Inserting data in some junk or comment field in the header of the file structure (example: Invisible Secrets 2002 for JPG and PNG, Steganozorus for JPG).

3. Embedding data in the carrier byte stream, in a linear, sequential and fixed way (example: InPlainView, InThePicture, Invisible Secrets 2002 for BMP, ImageHide, JSteg).

4. Embedding data in the carrier byte stream, in a pseudo-random way depending on a password (example: CryptArkan, BMPSecrets, Steganos for BMP, TheThirdEye, JPHide).

5. Embedding data in the carrier byte stream, in a pseudo-random way depending on a password, and changing other bits of the carrier file to compensate for the modifications induced by the hidden data, to avoid modifying statistical properties of the carrier file (example: Outguess, F5).
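To illustrate the jump from category 3 to category 4, here is a small sketch where the positions of the modified bytes are derived from a password instead of being the first bytes of the carrier. Everything in it is an assumption made for the example: the names are mine, the carrier is a raw byte sequence, the receiver is assumed to know the payload length, and Python's random module is not a cryptographic generator. None of the programs listed above is claimed to work this way.

```python
import hashlib
import random


def bit_positions(password, carrier_len, n_bits):
    """Derive n_bits distinct carrier positions from the password."""
    seed = int.from_bytes(hashlib.sha256(password.encode("utf-8")).digest(), "big")
    return random.Random(seed).sample(range(carrier_len), n_bits)


def embed(carrier, payload, password):
    """Write payload bits into the lowest bit of password-selected bytes."""
    bits = [(byte >> shift) & 1 for byte in payload for shift in range(7, -1, -1)]
    out = bytearray(carrier)
    for pos, bit in zip(bit_positions(password, len(carrier), len(bits)), bits):
        out[pos] = (out[pos] & 0xFE) | bit
    return bytes(out)


def extract(stego, n_bytes, password):
    """Recompute the same positions from the password and read the bits back."""
    positions = bit_positions(password, len(stego), n_bytes * 8)
    out = bytearray()
    for i in range(n_bytes):
        value = 0
        for pos in positions[i * 8:(i + 1) * 8]:
            value = (value << 1) | (stego[pos] & 1)
        out.append(value)
    return bytes(out)


# Hypothetical carrier and password.
carrier = bytes(range(256)) * 2
stego = embed(carrier, b"booze", "secret")
print(extract(stego, 5, "secret"))
print(sorted(bit_positions("secret", len(carrier), 40))[:5])
```

Without the password, an observer no longer knows which bytes to look at, but the modified bits still disturb the statistics of the carrier. That is the problem category 5 tries to address.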

Note 1: I don't consider methods 1 and 2 to be real steganography. Of course, what "real steganography" is may depend on the definition you use. If we choose to use "hiding data from my little sister" as a (broad) definition, then yes, this is steganography.

Note 2: even the most serious programs, Outguess and F5, both from professional researchers in the academic world, both open source and open to scrutiny, are now considered broken, in part thanks to the brilliant work of Jessica Fridrich's team, Andreas Westfeld, and others.

Note 3: "breaking" a steganography program does not mean that you can recover the hidden data. Actually, the researchers in the field always use meaningless random streams as hidden data, which is what a strongly encrypted file looks like. To break a steganography algorithm means that you can decide with a statistically high level of confidence that an image contains embedded information, and estimate the size of this information.
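Here is a very simplified sketch of what such a statistical test can look like, in the spirit of the chi-square test on pairs of values published by Andreas Westfeld and Andreas Pfitzmann. Replacing least significant bits with random bits tends to equalize the counts of each pair of values (2k, 2k+1), and the statistic below measures how far a sample is from that situation. The carrier is synthetic, the names are mine, and real detectors do a lot more than this (they turn the statistic into a probability and estimate the message length).

```python
import random
from collections import Counter


def pair_chi_square(samples):
    """Chi-square distance between even-value counts and their pair averages.

    A value close to zero means the counts inside each pair (2k, 2k+1) are
    nearly equal, which is what random LSB replacement tends to produce.
    """
    hist = Counter(samples)
    statistic = 0.0
    for even in range(0, 256, 2):
        expected = (hist[even] + hist[even + 1]) / 2
        if expected > 0:
            statistic += (hist[even] - expected) ** 2 / expected
    return statistic


rng = random.Random(2004)
# Synthetic carrier: bell-shaped sample values around 128 (an assumption,
# standing in for a smooth region of an uncompressed image).
carrier = bytes(min(255, max(0, int(rng.gauss(128, 3)))) for _ in range(20000))
# Every least significant bit replaced by a random bit (a full-size message).
stego = bytes((b & 0xFE) | rng.getrandbits(1) for b in carrier)

print(f"carrier: {pair_chi_square(carrier):.1f}")
print(f"stego:   {pair_chi_square(stego):.1f}")
```

The statistic should drop sharply after embedding. Notice that the test never looks at the hidden message itself, only at the footprint it leaves in the histogram.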

Note 4: remember that low-tech psychological methods work surprisingly well to recover encrypted or hidden data. Like talking about 50 years in jail. Or about a cancelled visa. Or a gun pointed at your temple.

Do the police have a way to know that my picture contains hidden data?

Probably. But they probably don't care either. As I tried to explain, I don't think steganography is very mature and secure yet. Of course, I am talking about total security, or the absolute impossibility of knowing for sure whether a carrier is hiding data. Something at the same level as modern strong cryptography algorithms, for example. I don't feel it's there yet. It's very easy to hide data, but it's very hard to hide it well.

But then, the level of security you need can be more or less proportional to the importance of the data you want to hide. If you want to keep your parents from knowing that you are exchanging secret plans to buy booze with friends, you don't need a world-class algorithm. The police won't care about it. And nobody is going to spend more than five minutes on that.

On the other hand, if you are a terrorist and want to blow up the world (please, not Boston), you probably need to know that companies like Wetstone (financed by the US federal government, and working in collaboration with serious academic researchers) have developed or are developing products like StegoWatch that can automatically detect and dictionary-attack information hidden steganographically by various programs. From what someone showed me, the results obtained by this expensive software (2000 US$!! When I think that I'm hacking steganography software for the fun of it...) are interesting but far from perfect, and they still have to develop more plug-ins to crack more steganography programs. And, as usual, academic hackers like Niels Provos did that before any company, for free and as open source, with StegDetect.

So what software should I use?

I wrote all of these pages to help people make an educated guess on this question. So... Just do it. Or read everything once again :)

Is steganography used by terrorists?

A lot of people have talked about it. There is not a single piece of evidence for it. Not a single factual report. It's all rumors and speculation. As usual with everything linked to computer security, there is a lot of alarmist hype around concepts that are not well understood by the general public - fear always sells - fueled mainly by governmental agencies (for political reasons, you need to make people accept a certain loss of privacy), security consultants or companies (to make money), and newspapers (to make money too, but media, just like science, is self-correcting in the long run: there are people selling stories made up from unproven rumors, and in reaction there are people selling stories debunking these earlier claims). It reminds me somewhat of the 1992 Michelangelo scare: a virus that was supposed to destroy millions of computers worldwide. Almost nothing happened.

Remember what I wrote above about low-tech efficiency. The focus is always on the potential bad sides of high technology, because it seems new and entertaining. But the horror of September 11 was the result of using nothing more than box cutters.

The story about terrorists using steganography first surfaced in the daily newspaper USA Today in February 2001. The articles are still on the web. You can read "Terrorist instructions hidden online", and, the same day, "Terror groups hide behind Web encryption". Notice the name of the journalist. They were written by veteran foreign correspondent Jack Kelley. In July of the same year, the information looks even more precise: "Militants wire Web with links to jihad". A quote: "Lately, al-Qaeda operatives have been sending hundreds of encrypted messages that have been hidden in files on digital photographs on the auction site eBay.com". These articles started the whole craze, and were followed by many more in other newspapers.

But today, in 2004, Jack Kelley is at the center of one of the two or three very high-profile journalism scandals in the USA: it turned out that he faked most of his stories (this is an article in USA Today - an excellent example of the "self-correcting" media I was talking about - cheers to this newspaper for that). But it's too late. Steganography is now permanently associated with terrorism.

You can read an excellent article about it on Salon.com, one that keeps the skeptical tradition alive and was written a long time ago, way before the Jack Kelley scandal: "The case of the missing code". Self-correcting media again.

What kind of images should I use?

(later)

Guillermito.