Data analysis programming: general info
This page collects info and links (to this site or others) useful in data analysis programming, including resources for various languages, editors, or testing tools, and notes/tips for using them effectively.
General techniques and conventions
Notes about data analysis techniques/conventions, independent of language/interface.
- Sensor data - Notes on working with continuous sensor time series (from dataloggers, SNOTEL sites, etc.)
- Data analysis workflow - Notes on collecting, storing, and moving data through the analysis process.
Text editing and data file handling
Vim is a great text editor. Below are a few resources on using it effectively.
- A fairly complete Vim commands cheatsheet.
- The Vim tips wiki
- Seven Habits for effective text editing.
- My Vim notes
An excellent general overview of text/data file handling in a Unix environment is provided by Unix for Poets, by Kenneth Ward Church. PDFs of this are all over the internet.
Other useful resources (including some on this wiki)
- My textfile notes - various command-line ways of manipulating text.
- My shell scripting notes, including Unix shell scripts and useful utilities.
- The Bash Hackers site is helpful.
- Shell scripting tutorial by Greg Goebel/Public Domain
- sed is a text stream editor that is great for pattern matching and replacing.
  - See this tutorial
  - This page gives great one-line examples.
- Awk is also very useful for manipulating text files.
  - My awk notes
  - The awk gateway
  - Awk one-liners explained part one
  - Awk one-liners explained part two
  - Awk tutorial by Greg Goebel/Public Domain
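For comparison, the kind of substitution and column arithmetic that sed and awk are typically used for can be sketched with the Python standard library alone. This is only a rough illustration: the sample lines, the -9999 missing-value code, and the site names are made up for the example, and the sed/awk commands in the comments are approximate equivalents rather than exact translations.

```python
import re

# Made-up sample data: date, site name, measured value (whitespace-separated).
lines = [
    "2012-01-01 site_a 3.5",
    "2012-01-01 site_b -9999",
    "2012-01-02 site_a 4.0",
]

# sed-like step: substitute a pattern on every line
# (roughly `sed 's/-9999/NaN/'`).
cleaned = [re.sub(r"-9999\b", "NaN", line) for line in lines]

# awk-like step: split each line into fields, filter rows, and sum a column
# (roughly `awk '$2 == "site_a" { total += $3 } END { print total }'`).
total = 0.0
for line in cleaned:
    fields = line.split()
    if fields[1] == "site_a":
        total += float(fields[2])

print(cleaned[1])
print(total)
```

For quick one-off jobs at the command line, the sed and awk one-liners linked above are usually shorter. A script like this may be easier to extend once the processing grows beyond a line or two.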
Python
Python is a high-level, open-source programming language that, when combined with some numerical, scientific, and plotting packages, makes a very powerful tool for scientific computing and data analysis (on par with Matlab). Useful Python extensions for scientific computing are:
- NumPy - provides n-dimensional array objects and other useful numeric extensions to Python
- SciPy - provides a number of high-level mathematical tools for use in scientific computing (integration, optimization, Fourier transforms, etc.)
- Matplotlib - a plotting library that provides publication-quality plots and plotting routines that are similar to Matlab's.
- IPython - an interactive shell that is designed to work well with NumPy, SciPy, and Matplotlib.
- SciKits - add-on toolkits that complement SciPy (various statistical models, time series analysis, machine learning, image processing, etc.)
- The pandas library - provides high-performance, easy-to-use data structures (like data frames) and data analysis tools that sit on top of NumPy.
Official Python resources
- The Python documentation page, including tutorials and HOWTOs
- Python Language Reference - describing the syntax and core semantics of the language.
- Python Standard Library - describing the standard library (modules, functions, etc) distributed with Python.
- Python code should follow the Python Style Guide.
- Official NumPy/SciPy documentation
- PyPlot documentation for the Matlab-like plotting framework in matplotlib.
- Python package index - an index of many add-on tools discussed in this wiki.
Python forums
Python (and its scientific extensions) has a large user/developer community supporting it. These are some forums that might be helpful:
My Python notes
Collected notes, tips, and tricks for using any of the Python tools above.
- General Python notes on debugging, code structure, and other aspects of development.
- IPython
- NumPy notes - Various notes on using the NumPy package.
Other
- The Python Wiki Vim page and this blog entry give some interesting tips about using Vim as a Python source editor.
MATLAB (and clones)
MATLAB is a proprietary programming language and IDE that is widely used in scientific and engineering computing.
Resources
- Official MathWorks documentation
- Function reference
- MATLAB Central - the official user/developer community, including a file exchange.
- The Kluid forums include MATLAB and Octave forums.
- My MATLAB notes
Clones of Matlab
There are a bunch of free/open-source clones of Matlab that have various levels of syntax compatibility.
- GNU Octave - generally very compatible with Matlab, though some functions are missing.
- SciLab
- FreeMat
R
R is a free, open-source software environment for statistical computing and graphics.
- R-project homepage
- R manuals
- R wiki
- knitr - a nice report generating engine for R
- My R notes
Math and Stats tools
Many toolboxes are available, either standalone or in Python, R, and Matlab, for math and statistical applications. See the math toolbox page.
Testing data analysis functions
Code used in data analysis can perform fairly complex operations on datasets and generate output that differs significantly from the original data. The code itself can also be fairly complex, and what it actually does may be difficult to discern just by reading it or looking at the data. It is important to verify that running this code produces the expected result and that the output is accurate. Writing test functions that call the data analysis code and check its output against known values is a useful way to do this.
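As a minimal sketch of this idea in Python, the example below tests a small analysis function on inputs whose correct output can be worked out by hand. The function `daily_means` and the sample values are hypothetical, made up for illustration. They stand in for whatever analysis code is actually being tested.

```python
import math


def daily_means(values, samples_per_day):
    """Hypothetical analysis function: average consecutive samples into daily means."""
    if samples_per_day <= 0:
        raise ValueError("samples_per_day must be positive")
    if len(values) % samples_per_day != 0:
        raise ValueError("series length must be a whole number of days")
    return [
        sum(values[i:i + samples_per_day]) / samples_per_day
        for i in range(0, len(values), samples_per_day)
    ]


def test_known_input_gives_known_output():
    # Two "days" of four samples each; the means were worked out by hand.
    result = daily_means([1.0, 2.0, 3.0, 4.0, 10.0, 10.0, 10.0, 10.0], 4)
    assert len(result) == 2
    assert math.isclose(result[0], 2.5)
    assert math.isclose(result[1], 10.0)


def test_incomplete_day_is_rejected():
    # A partial day should raise an error rather than be averaged silently.
    try:
        daily_means([1.0, 2.0, 3.0], 4)
    except ValueError:
        return
    raise AssertionError("expected ValueError for a partial day")


# Run the tests directly; a test runner could discover them by name instead.
test_known_input_gives_known_output()
test_incomplete_day_is_rejected()
print("all tests passed")
```

Small hand-checkable inputs like these make it clear what the correct answer should be. Tests for bad input (here, an incomplete day) document how the code is expected to fail.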