Data analysis programming: general info

This page collects info and links (to this site or others) useful in data analysis programming, including resources for various languages, editors, or testing tools, and notes/tips for using them effectively.

General techniques and conventions

Notes about data analysis techniques/conventions, independent of language/interface.

  • Sensor data - Notes on working with continuous sensor timeseries (from dataloggers, SNOTEL sites, etc.)
  • Data analysis workflow - Notes on collecting, storing, and moving data through the analysis process.

Text editing and data file handling

VIM is a great text editor. Below are a few resources on using it effectively.

An excellent general overview of text/data file handling in a Unix environment is provided by Unix for Poets, by Kenneth Ward Church. PDFs of this are all over the internet.

Other useful resources (including some on this wiki)

Python

Python is a high-level, open-source programming language that, when combined with some numerical, scientific, and plotting packages, makes a very powerful tool for scientific computing and data analysis (on par with Matlab). Useful Python extensions for scientific computing are:

  • NumPy - provides n-dimensional array objects and other useful numeric extensions to Python
  • SciPy - provides a number of high-level mathematical tools for use in scientific computing (integration, optimization, Fourier transforms, etc.)
  • Matplotlib - a plotting library that provides publication quality plots and plotting routines that are similar to Matlab's.
  • IPython - an interactive shell that is designed to work well with NumPy, SciPy, and Matplotlib.
  • SciKits - add-on toolkits that complement SciPy (various statistical models, timeseries analysis, machine learning, image processing, etc.)
  • The pandas library - provides high-performance, easy-to-use data structures (like data frames) and data analysis tools that sit on top of NumPy.

Official Python resources

Python forums

Python (and its scientific extensions) has a large user/developer community supporting it. These are some forums that might be helpful:

My Python notes

Collected notes, tips, and tricks for using any of the Python tools above.

Other

MATLAB (and clones)

MATLAB is a proprietary programming language and IDE that is widely used in scientific and engineering computing.

Resources

Clones of Matlab

There are a bunch of free/open-source clones of Matlab that have various levels of syntax compatibility.

R

R is a free, open-source software environment for statistical computing and graphics.

Math and Stats tools

Many toolboxes are available, either standalone or in Python, R, and Matlab, for math and statistical applications. See the math toolbox page.

Testing data analysis functions

Code used in data analysis can perform fairly complex operations on datasets and generate output that differs significantly from the original data. The code itself can also be fairly complex, and what it actually does may be difficult to discern just by reading it or looking at the data. It is important to verify that running this code produces the expected result and that the output is accurate. Writing test functions that call the data analysis code with small, known inputs and check its output against expected values is a useful way to do this.
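As a rough illustration of the idea, here is a minimal sketch in Python using only the standard library. The functions fill_gaps and mean_ignoring_nan, and the -9999.0 missing-value flag, are hypothetical stand-ins for real analysis code and are not taken from any particular package. Each test feeds in a tiny input whose correct answer can be worked out by hand.

```python
import math


def fill_gaps(values, missing=-9999.0):
    """Replace a missing-value flag with NaN (hypothetical helper)."""
    return [math.nan if v == missing else v for v in values]


def mean_ignoring_nan(values):
    """Mean of the non-NaN entries; NaN if there are none (hypothetical helper)."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else math.nan


def test_fill_gaps_marks_missing_values():
    cleaned = fill_gaps([1.0, -9999.0, 3.0])
    # Real readings pass through unchanged; the flag becomes NaN.
    assert cleaned[0] == 1.0 and cleaned[2] == 3.0
    assert math.isnan(cleaned[1])


def test_mean_ignores_missing_values():
    cleaned = fill_gaps([1.0, -9999.0, 3.0])
    # Only 1.0 and 3.0 should contribute to the mean.
    assert mean_ignoring_nan(cleaned) == 2.0


def test_mean_of_all_missing_is_nan():
    # Edge case: no valid data at all.
    assert math.isnan(mean_ignoring_nan(fill_gaps([-9999.0])))


if __name__ == "__main__":
    test_fill_gaps_marks_missing_values()
    test_mean_ignores_missing_values()
    test_mean_of_all_missing_is_nan()
    print("all tests passed")
```

The tests can be run directly as a script, as shown, or collected by a test runner such as pytest if the file is named test_*.py. Edge cases (all data missing, empty input) are often where analysis code silently goes wrong, so they are worth testing explicitly.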