23 Jul 2018

awk: A silver bullet you should have

An intro to the awk programming language

[ awk   logs   ]

Hello world

awk is a field-oriented pattern-processing language with C-style syntax. Let's write our first awk script, which simply prints "Hello world".

echo "" | awk '{print "Hello world"}'

A basic awk command is of the following form:

awk options 'selection_criteria {action}' input-file > output-file
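
For example, the following command (input.txt is a hypothetical file) prints the second field of every line that contains the word error:

awk '/error/ {print $2}' input.txt > output.txt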

Available options:

-f program-file : Read the AWK program source from the file
                  program-file, instead of from the
                  first command-line argument.
-F fs           : Use fs as the input field separator.
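
For instance, -F lets you split /etc/passwd on colons and print each user name (the first field):

awk -F: '{print $1}' /etc/passwd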

You can also combine awk with cat, grep, or sed through the pipe operator to build more customized operations.
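
As a small sketch (app.log is a hypothetical file), this pipeline counts how many lines mention ERROR:

grep 'ERROR' app.log | awk 'END {print NR " error lines"}'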

There are many situations where you need to extract information from raw files.

As a software engineer, you may often need to dive into log files to debug an issue.

As a data analyst, the first-hand data mined from the internet may not be in the format you want, and you need to transform it into one you can use.

Log parser

Suppose you get a log file which has thousands of lines and is not in a standard csv-like format (an excerpt is shown below), and you have to perform some analysis on it.

I0713 10:40:47.478701  9083 xxxx.cpp:284] ================================>FPS: 7

I0713 10:43:36.649893 10353 xxxx.cpp:1121] *** DEBUG 5: in removeLostFeatures: invalid/processed feature #: 2/3

For example, you want to see how the FPS trends over time so you can draw an FPS-time graph. You also need to extract the invalid and processed feature counts to a file for further processing.

You may find this a little tough and have no idea where to start. But with a basic knowledge of awk, it becomes easy.

Here is a simple bash script that implements the task above, for your reference.

#!/usr/bin/env bash

# Directory of the first input file; the output files go next to it.
base="${1%/*}"
echo "working directory: $base"

# Grab the FPS value (the last field) from every FPS line.
cat "$@" | grep '>FPS' | awk '{print $NF}' > "$base/FPS.txt"

# Grab the trailing "invalid/processed" counts and split them on "/".
cat "$@" | grep 'DEBUG 5' | awk '{print $NF}' | awk -F/ '{print $1, $2}' > "$base/lostfeatures.txt"

This produces two new files in a csv-like format, which you can plot easily with Python or MATLAB.

Done!

Format transform

You are assigned a task to transform every record in a whitespace-delimited data file into comma-separated fields, filtering out the records whose heading or flag_h is zero.

time x y heading flag_p flag_h
1531465177.6090000 800834.5063542 2499953.6377944 275.5880000 3 4
1531465177.7030001 800834.2666566 2499953.6450343 275.4430000 3 4
1531465177.7990000 800833.9966345 2499953.6953184 275.4430000 3 4
1531465177.8980000 800833.7414144 2499953.7120665 0.0000000 3 0
1531465178.0000000 800833.4716350 2499953.7638617 280.0000000 3 0
1531465178.0980000 800833.2220336 2499953.8329117 281.1000000 3 0
1531465195.9050000 800822.7892204 2500001.4433829 0.0000000 3 0
1531465196.0079999 800822.7826699 2500001.7420741 0.0000000 3 3
1531465196.1059999 800822.7694385 2500002.0436874 0.0000000 3 3
1531465196.1059999 800822.7694385 2500002.0436874 358.8210000 3 4
1531465196.2100000 800822.7527709 2500002.3459400 358.8210000 3 4
1531465196.2100000 800822.7527709 2500002.3459400 358.8210000 3 4

You could probably write a C++ or Python program to do this job, and I am sure it would work as expected. But I do not think it is the ideal solution.

Let me use awk instead.

#!/usr/bin/env bash

# Directory of the input file; the filtered output goes next to it.
base="${1%/*}"
echo "working directory: $base"

# Print the header as-is, then keep only records whose flag_h ($NF) and
# heading ($(NF-2)) are both nonzero, rejoining the fields with commas (OFS).
cat "$@" | awk -v OFS=, '(NR==1){$1=$1; print; next} ($NF != 0 && $(NF-2) != 0){$1=$1; print}' > "$base/filter.txt"

Note that the assignment $1=$1 does nothing to the data itself; it only forces awk to rebuild the record so that the output field separator (OFS) takes effect.
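
You can see the difference with a throwaway record:

echo "a b c" | awk -v OFS=, '{print}'         # a b c   (record untouched, OFS ignored)
echo "a b c" | awk -v OFS=, '{$1=$1; print}'  # a,b,c   (record rebuilt with OFS)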

Output file:

time,x,y,heading,flag_p,flag_h
1531465177.6090000,800834.5063542,2499953.6377944,275.5880000,3,4
1531465177.7030001,800834.2666566,2499953.6450343,275.4430000,3,4
1531465177.7990000,800833.9966345,2499953.6953184,275.4430000,3,4
1531465196.1059999,800822.7694385,2500002.0436874,358.8210000,3,4
1531465196.2100000,800822.7527709,2500002.3459400,358.8210000,3,4
1531465196.2100000,800822.7527709,2500002.3459400,358.8210000,3,4

How cool is that!

Conclusion

I have shown two simple uses of awk, which reveal how powerful it is for text processing.

I will keep digging into it, and I believe it is absolutely worth the time.
