General programming best practices¶




Roland Haas, NCSA and UIUC rhaas@illinois.edu


What do I do?¶


image credit: NASA


Science¶

  • builds and organizes knowledge
  • tests explanations about the universe
  • systematically,
  • objectively,
  • transparently,
  • and reproducibly.

Otherwise it's not science.

Computers should...¶

  • improve efficiency,
  • reduce human error,
  • automate the mundane,
  • simplify the complex,
  • and accelerate research.

But we don't always use them effectively.

Getting started¶

Efrazil CC BY-SA 4.0

Data Storage¶

  • Good: spreadsheet
  • Better: csv file
  • Best: standardized file format, database management system, version control

Formats: csv, YAML, Hierarchical Data Format (HDF), Flexible Image Transport System (FITS), etc.

Management: C++/Python/Fortran APIs, HDF5, Pandas, astropy, etc.
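
A minimal sketch of the "Better" tier, using only Python's standard `csv` module. The file name and the column names (`measurements.csv`, `time_s`, `temperature_K`) are made up for illustration; a real project would pick whichever standardized format its community uses.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical measurements; the column names are illustrative only.
rows = [
    {"time_s": 0.0, "temperature_K": 293.1},
    {"time_s": 1.0, "temperature_K": 293.4},
]

# Write into a throwaway directory so the example leaves no clutter behind.
path = Path(tempfile.mkdtemp()) / "measurements.csv"

# Write a header line plus one line per record.
with path.open("w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["time_s", "temperature_K"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back; csv returns strings, so convert explicitly.
with path.open(newline="") as f:
    loaded = [{k: float(v) for k, v in row.items()} for row in csv.DictReader(f)]
```

Because the header names every column, the file stays readable without the program that wrote it.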

Backing Up Files¶

  • Good: USB stick
  • Better: nightly emails
  • Best: remote version control (GitHub, Bitbucket, GitLab), remote backup (CrashPlan, Duplicity)

Version Control Systems: svn, hg, git

Hint: tools like git-annex or git-lfs can help you manage large data files

Managing Changes¶

  • Good: systematic naming convention
  • Better: versioning storage solution
  • Best: local version control

Tracking down changes¶

  • Good: use git log
  • Better: custom log format, git blame
  • Best: gitk, git gui blame

Hint: GUIs are great to explore and look around

Working with data¶

Multiple File Cleanup¶

  • Good: manually edit every file
  • Better: search and replace in each file
  • Best: scripted batch editing with backups

Hint: try a tutorial on BASH, ZSH, Python, or Perl, e.g. the bash lesson by Software Carpentry.
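
One way "scripted batch editing with backups" might look in Python. This is a sketch, not a robust tool: the helper name `replace_in_files`, the `.bak` suffix, and the demo file names are all assumptions chosen for this example.

```python
import shutil
import tempfile
from pathlib import Path


def replace_in_files(directory, pattern, old, new):
    """Replace `old` with `new` in every file matching `pattern`.

    A copy of each modified file is kept next to it with a ".bak" suffix.
    Returns the list of files that were changed.
    """
    changed = []
    for path in sorted(Path(directory).glob(pattern)):
        text = path.read_text()
        if old not in text:
            continue  # leave untouched files alone, no backup needed
        shutil.copy2(path, path.with_name(path.name + ".bak"))
        path.write_text(text.replace(old, new))
        changed.append(path)
    return changed


# Demonstration on throwaway files in a temporary directory.
workdir = Path(tempfile.mkdtemp())
(workdir / "run1.txt").write_text("units = cm\n")
(workdir / "run2.txt").write_text("units = m\n")
changed = replace_in_files(workdir, "*.txt", "units = cm", "units = m")
```

Only `run1.txt` is rewritten here, and its original content survives in `run1.txt.bak`.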

Executing Workflows¶

  • Good: retype a series of commands from notes
  • Better: shell script
  • Best: build system

Build system tools: make, docker, cmake, autoconf, automake, etc.

Reference: The Carpentries have an associated Automation and Make lesson

Data Structures¶

  • Good: giant vector holding all numbers
  • Better: list of list of numbers
  • Best: appropriate powerful data structures

Hint: In C++, learn about structs, unordered_maps, maps, vectors, and (maybe) classes, etc. In Python, the power lies in dictionaries, NumPy arrays, and (when analyzing data) DataFrames.
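
A small illustration of the difference between the "Good" and "Best" tiers, in plain Python. The particle names and fields are invented for this example.

```python
# "Good": one flat list; the meaning of each position lives in your head.
flat = [1.0, 2.0, 0.5, 3.0, 4.0, 1.5]  # x, y, mass, x, y, mass, ...

# "Best": a dictionary keyed by a meaningful name, with named fields.
particles = {
    "electron_a": {"x": 1.0, "y": 2.0, "mass": 0.5},
    "electron_b": {"x": 3.0, "y": 4.0, "mass": 1.5},
}

# Total mass from the flat list requires knowing the stride of 3.
total_mass_flat = sum(flat[2::3])

# The same quantity from the structured data reads like the question asked.
total_mass = sum(p["mass"] for p in particles.values())
```

Both give the same number, but only the second version survives someone adding a fourth field per particle.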

Working with code¶

API Design¶

  • Good: single block of procedural code
  • Better: separate functions
  • Best: small, testable functions that handle well-defined tasks, grouped into classes

DRY: Don't Repeat Yourself. Code replication is bug proliferation.

KISS: Keep it simple, stupid.
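
A sketch of what "small, testable functions" can mean in practice. The function names are hypothetical; the point is that each piece does one thing and the conversion is written exactly once (DRY).

```python
def celsius_to_kelvin(temp_c):
    """Convert a temperature from degrees Celsius to kelvin."""
    return temp_c + 273.15


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


def mean_temperature_kelvin(temps_c):
    """Mean of Celsius readings, reported in kelvin."""
    return mean([celsius_to_kelvin(t) for t in temps_c])
```

Each function can be checked on its own, e.g. `celsius_to_kelvin(0.0)` should give `273.15`.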

Variable Naming¶

  • Good: d1, d2, d3
  • Better: x, y, z
  • Best: p.x, p.y, p.z, p = Point(x,y,z)

Hint: Prof. Jenny Bryan on Naming Things
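
The "Best" line above could be realized in Python roughly as follows. `Point` is built here with `typing.NamedTuple`; a dataclass would work just as well. The `distance` helper is an addition for illustration.

```python
import math
from typing import NamedTuple


class Point(NamedTuple):
    """A position in 3D space; fields are named, not numbered."""
    x: float
    y: float
    z: float


def distance(a, b):
    """Euclidean distance between two points."""
    return math.sqrt((a.x - b.x) ** 2 + (a.y - b.y) ** 2 + (a.z - b.z) ** 2)


origin = Point(0.0, 0.0, 0.0)
p = Point(x=1.0, y=2.0, z=2.0)
separation = distance(origin, p)
```

Compare `a.x - b.x` with `d1 - d4`: the first one tells the reader what is being subtracted.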

Style Guides¶

  • Good: Have consistent style
  • Better: Agree with your colleagues on style
  • Best: Follow a standard style guide and use a code formatter

Hint: clang-format, Black code formatter for Python

Comments on commenting¶

  • Good: have readable code
  • Better: document what each code unit provides
  • Best: document design and purpose of each code unit

Hints: Best practices for writing code comments, Python PEP-8
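
An illustrative docstring that records purpose and design, not just mechanics. The function itself (a composite trapezoidal rule) is only a stand-in chosen for this example.

```python
def trapezoid(f, a, b, n=100):
    """Approximate the integral of f over [a, b].

    Purpose: quick sanity checks of analytic results; not meant for
    production-quality integration.

    Design: composite trapezoidal rule with n equal subintervals, chosen
    because it is simple to verify by hand. The error shrinks like 1/n**2
    for smooth integrands.
    """
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))  # end points carry half weight
    for i in range(1, n):
        total += f(a + i * h)    # interior points carry full weight
    return total * h
```

The code says how; the docstring says why this method was chosen and when not to use it.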

Optimization¶

  • Good: don't
  • Better: add timing code and optimize identified hot spots
  • Best: use a profiler and optimize overall code structure

Tools: Linux perf, hotspot, gprof, igprof, coz, NVIDIA nsight, Python cProfile, line_profiler, snakeviz

Hint: A good algorithm will outperform a locally optimized bad one
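
A minimal sketch of profiling with Python's built-in `cProfile` and `pstats` modules. The function `sum_of_squares` is a deliberately naive stand-in for real work.

```python
import cProfile
import io
import pstats


def sum_of_squares(n):
    """Deliberately simple loop that gives the profiler something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Run the function under the profiler and keep its return value.
profiler = cProfile.Profile()
result = profiler.runcall(sum_of_squares, 100_000)

# Collect the report in a string, sorted by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

The report lists where time was spent per function, which is the information needed before deciding what (if anything) to optimize.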

Runtime Parameter Handling¶

  • Good: none, hardcoded variables
  • Better: plain text input file, line-by-line homemade string parsing
  • Best: argument / file parsing library

Formats: Python argparse, libconfig, yaml-cpp, spify, etc.
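
A short sketch of the "Best" tier using Python's `argparse`. The option names (`--steps`, `--dt`, `--output`) and their defaults are invented for this example.

```python
import argparse


def parse_args(argv=None):
    """Parse run-time parameters; argv=None means "use sys.argv"."""
    parser = argparse.ArgumentParser(description="Run a toy simulation.")
    parser.add_argument("--steps", type=int, default=100,
                        help="number of time steps")
    parser.add_argument("--dt", type=float, default=0.01,
                        help="time step size")
    parser.add_argument("--output", default="result.csv",
                        help="output file name")
    return parser.parse_args(argv)


# Passing an explicit list makes the parser easy to exercise in tests.
args = parse_args(["--steps", "500", "--dt", "0.005"])
```

Type conversion, defaults, and a `--help` message come for free, with no homemade string parsing.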

Getting It Right¶

Error Detection¶

  • Good: show results to experts
  • Better: integration testing
  • Best: unit test suite, continuous integration
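
What a tiny unit test suite might look like with Python's standard `unittest` module. The function under test, `normalize`, is a hypothetical example.

```python
import unittest


def normalize(values):
    """Scale values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


class TestNormalize(unittest.TestCase):
    def test_sums_to_one(self):
        self.assertAlmostEqual(sum(normalize([1.0, 2.0, 5.0])), 1.0)

    def test_preserves_ratios(self):
        result = normalize([1.0, 3.0])
        self.assertAlmostEqual(result[1] / result[0], 3.0)


# Run the suite programmatically so the example also works in a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNormalize)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

A continuous integration service would run the same suite automatically on every push.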

Error Diagnostics¶

  • Good: re-re-read the code
  • Better: print statements
  • Best: use a linter, a debugger, and a profiler

Tools: cpplint, pyflakes, gdb, lldb, pdb, idb, perf, hotspot, coz, valgrind, kernprof, kcachegrind, cProfile/snakeviz

Error Correction¶

  • Good: fix code
  • Better: fix, add an exception
  • Best: fix, add an exception, add a test
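
A sketch of the "fix, add an exception, add a test" pattern. The scenario is assumed for illustration: a hypothetical `normalize` function that used to crash with a `ZeroDivisionError` on all-zero input.

```python
def normalize(values):
    """Scale values so that they sum to 1.

    Raises ValueError if the values sum to zero, instead of failing
    with a less helpful ZeroDivisionError.
    """
    total = sum(values)
    if total == 0:
        raise ValueError("cannot normalize values that sum to zero")
    return [v / total for v in values]


def test_normalize_rejects_zero_sum():
    """Regression test for the input that exposed the bug."""
    try:
        normalize([0.0, 0.0])
    except ValueError:
        return  # expected behavior
    raise AssertionError("normalize accepted a zero-sum input")


test_normalize_rejects_zero_sum()
```

The test stays in the suite, so the bug cannot quietly come back.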

Getting It Together¶

Merging Collaborative Work¶

  • Good: single master copy, waiting
  • Better: email and patches
  • Best: remote version control

Peer Review For Code¶

  • Good: separation of concerns
  • Better: shared repository(s)
  • Best: peer-reviewed pull requests

Hint: reviewing changes is work, keep them simple, stupid.

Teamwork¶

  • Good: weekly research meetings
  • Better: daily conversations, short term goals
  • Best: pair programming, issue tracking

Software Handovers¶

  • Good: zip file, theory paper
  • Better: code repository, theory paper, comments in code, example input file
  • Best: code repository, theory paper, automated documentation, example input file, test suite

Documentation¶

  • Good: paper notes describing model in code
  • Better: electronic documentation in code repository
  • Best: auto-generated documentation describing code intent and usage

Tools: doxygen, sphinx

Getting It Out There¶

Plotting¶

  • Good: custom formatting, clickable GUI
  • Better: plot format templates (Excel, Mathematica)
  • Best: scripted plotting, matplotlib, gnuplot, etc.

Writing¶

  • Good: Microsoft Word, LibreOffice Writer
  • Better: Word, Writer with track changes
  • Best: plain text markup with version control and a makefile

Tools: LaTeX, Markdown, reStructuredText

Distribution Control¶

  • Good: "email to request access"
  • Better: license file
  • Best: license file, citation file, DOI, forkable repository

Example: SpECTRE

Community Adoption¶

  • Good: none, internal use only
  • Better: online repository, developer email online
  • Best: issue tracker, user/developer forum, communication channels, online documentation

Resources¶

Papers!¶

  • Wilson, Greg, D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, Kathryn D. Huff, et al. 2014. Best Practices for Scientific Computing. PLoS Biol 12 (1): e1001745. doi:10.1371/journal.pbio.1001745.
  • Wilson, Greg, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, and Tracy K. Teal. 2016. Good Enough Practices in Scientific Computing. arXiv:1609.00037 [Cs], August. http://arxiv.org/abs/1609.00037.
  • Scopatz, Anthony, and Kathryn D. Huff. 2015. Effective Computation in Physics. 1st edition. O'Reilly Media.
  • Blanton, Brian, and Chris Lenhardt. 2014. A Scientist's Perspective on Sustainable Scientific Software. Journal of Open Research Software, Issues in Research Software, 2 (1): e17.
  • Donoho, David L., Arian Maleki, Inam Ur Rahman, Morteza Shahram, and Victoria Stodden. 2009. Reproducible Research in Computational Harmonic Analysis. Computing in Science & Engineering 11 (1): 8-18. doi:10.1109/MCSE.2009.15.
  • Goble, Carole. 2014. Better Software, Better Research. IEEE Internet Computing 18 (5): 4-8. doi:10.1109/MIC.2014.88.
  • Hannay, J. E, C. MacLeod, J. Singer, H. P Langtangen, D. Pfahl, and G. Wilson. 2009. How Do Scientists Develop and Use Scientific Software? In Proceedings of the 2009 ICSE Workshop on Software Engineering for Computational Science and Engineering, 1-8.
  • Joppa, L. N., G. McInerny, R. Harper, L. Salido, K. Takeda, K. O'Hara, D. Gavaghan, and S. Emmott. 2013. Troubling Trends in Scientific Software Use. Science 340 (6134): 814-15. doi:10.1126/science.1231535.
  • Merali, Zeeya. 2010. Computational Science: ...Error. Nature 467 (7317): 775-77. doi:10.1038/467775a.
  • Petre, Marian, and Greg Wilson. 2014. Code Review For and By Scientists. arXiv:1407.5648 [cs], July.
  • Schossau, Jory, and Greg Wilson. 2014. Which Sustainable Software Practices Do Scientists Find Most Useful? arXiv:1407.6220 [cs], July.
  • Stodden, Victoria. 2010. The Scientific Method in Practice: Reproducibility in the Computational Sciences. SSRN Electronic Journal. doi:10.2139/ssrn.1550193.
  • Wicherts, Jelte M., Marjan Bakker, and Dylan Molenaar. 2011. Willingness to Share Research Data Is Related to the Strength of the Evidence and the Quality of Reporting of Statistical Results. PLoS ONE 6 (11): e26828. doi:10.1371/journal.pone.0026828.

Books!¶

  • Clean Code - Robert C. Martin
  • Working Effectively with Legacy Code - Michael C. Feathers
  • Effective Computation in Physics - Huff, Scopatz
  • The Elements of Programming Style - Kernighan and Plauger (1974)

Acknowledgements¶

Many of these slides were originally in presentations by Dr. Katy Huff and Dr. Madicken Munk at

  • katyhuff.github.io/2017-09-20-ncsa
  • munkm.github.io/2021-09-24-NCSA.

which are licensed under a Creative Commons Attribution 4.0 International License.

This work has been supported by NSF grants 2004879, 2103680. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

General programming best practices by Roland Haas is licensed under a Creative Commons Attribution 4.0 International License.

Based on a work at http://munkm.github.io/2021-09-24-NCSA.