The basics
Compiling and running parallel programs is more complicated than working with serial programs. There are two ways for a code to run different tasks in parallel and have communication between them: shared memory and message passing. Shared memory assumes you're on a machine with multiple CPUs that effectively runs one copy of the operating system. Every process can access all of the same memory and disks. Message passing is where you're using a group of machines, each with its own CPUs, memory, and copy of the operating system. Filesystems/disks may still be shared.
Shared memory is the easiest to use, but multiple-CPU shared memory machines are far more expensive to buy than distributed clusters. Many machines do a bit of both, with small SMP nodes combined into bigger clusters so you can run in shared memory over a few processors, and use message passing if you want more. The local clusters are almost all like this.
Having said that, there is nothing to stop you running a message-passing program on a shared memory machine, and in fact people often do this when their code was originally written to use message passing.
Compiling and running parallel codes
If you are trying to compile and run someone else's code then the best advice anyone can give is to read all of their documentation, as chances are that someone has already got the code working on a system similar to the one you're using and can tell you exactly how to compile and run it. There is no substitute for reading the manual.
It also helps to know what parallel libraries and compilers are available on your machine. We use the modules system for managing software on clusters and workstations. Type module avail to see what there is on your machine.
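For example, you might list and then load a module like this (the module name below is only illustrative; pick a real one from the list module avail prints on your system):

    module avail
    module load openmpi    # hypothetical name; use one shown by 'module avail'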
Auto-parallelizing compilers
If you have a serial code and don't wish to parallelize it yourself but still want it to go faster, then there are some compilers that can do a degree of shared-memory auto-parallelization for you. They probably aren't as good at it as a human expert and they work best on well-written, clean Fortran.
Portland compiler suite (Linux)
The Portland compilers have some autoparallelizing capability. If you use pgf95 with the -Mconcur option, the compiler will do its best to parallelize the loops. See the man pages for options to -Mconcur to tweak the exact behaviour. You must use the same flags when linking as you did when compiling, of course. To actually make the code run over more than one CPU, you must set the NCPUS environment variable to the number of CPUs to use.
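As a minimal sketch, assuming a bash shell and a source file called myprog.f90 (the file name and CPU count are just examples):

    pgf95 -Mconcur myprog.f90 -o myprog   # compile and link with the same flag
    export NCPUS=4                        # run the parallelized loops over 4 CPUs
    ./myprog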
Intel ifort compiler (Linux)
The Intel ifort Fortran compiler has the ability to do some autoparallelization and vectorization. If you use the -parallel option it will attempt to parallelize where possible. See the documentation for many more details, including options to tweak the parallelization. Set the environment variable OMP_NUM_THREADS at runtime to the number of CPUs to run over. Note that by default the number of threads is the number of CPUs on the machine the code was compiled on, so those compiling on dual-processors can often get away without the variable.
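A minimal sketch, again assuming a bash shell and a source file called myprog.f90:

    ifort -parallel myprog.f90 -o myprog   # let the compiler autoparallelize where it can
    export OMP_NUM_THREADS=4               # run over 4 CPUs
    ./myprog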
OpenMP
This is a standard for use with shared memory machines. To use it you take a serial code and add OpenMP directives to it. These are little comments in the code that an ordinary compiler will ignore, but if you compile with an OpenMP compiler you get a parallelized executable. Again you need to set an environment variable (OMP_NUM_THREADS) to the appropriate number of processors. Compilers that support it include Portland (-mp flag), Intel Fortran (-openmp flag), and the SunONE Studio suite (-openmp flag; see also -stackvar, and -xautopar to do autoparallelization too where possible).
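A minimal sketch of what those directives look like in Fortran (the file name, flags, and thread count are just examples):

    ! omp_hello.f90 -- a minimal OpenMP sketch.
    ! Compile with e.g.  pgf95 -mp omp_hello.f90 -o omp_hello
    !              or    ifort -openmp omp_hello.f90 -o omp_hello
    ! Run over 4 CPUs:   OMP_NUM_THREADS=4 ./omp_hello
    program omp_hello
       implicit none
       integer :: i
       ! The !$omp lines below are OpenMP directives: an ordinary compiler
       ! treats them as comments and produces a serial program, while an
       ! OpenMP compiler shares the loop iterations out across threads.
    !$omp parallel do
       do i = 1, 8
          print '(a,i0)', 'Hello from iteration ', i
       end do
    !$omp end parallel do
    end program omp_hello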
MPI codes
MPI is a standard for message passing. To use it, you must explicitly write your code with calls to the MPI library to farm work out to separate processes, and you must link your program against the MPI library. You then launch the program using the mpirun, mpiexec, or mprun command provided by whichever library you're using. This takes various options, usually including one (-np) giving the number of processes to run. This should normally be the same as the number of processors you want to use, though you can often set it higher than the number of available processors for testing. When using a distributed system you usually also have to provide a list of the nodes that MPI can use. If you are running an MPI job on a Chemistry compute cluster then the queueing system takes care of setting the number of CPUs and the names of the nodes.
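A minimal sketch of what such a code looks like in Fortran (the file name is ours; the exact compile and run commands depend on which MPI library you use, as described below):

    ! mpi_hello.f90 -- compile with e.g.  mpif90 mpi_hello.f90 -o mpi_hello
    ! and run with e.g.                   mpirun -np 4 ./mpi_hello
    program mpi_hello
       use mpi
       implicit none
       integer :: ierr, rank, nprocs
       call MPI_Init(ierr)
       ! Every one of the -np processes runs this same program; MPI tells
       ! each one which rank it is and how many processes there are.
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
       print '(a,i0,a,i0)', 'Hello from process ', rank, ' of ', nprocs
       call MPI_Finalize(ierr)
    end program mpi_hello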
OpenMPI
This is a popular MPI implementation that is installed on most of the local clusters. It has to be compiled separately for each underlying compiler, so load the module that matches the compiler you want.
Use the mpicc and mpif90 commands in the bin directory of the chosen MPI installation to compile your code; they automatically link your code with the correct libraries. To run, use the mpirun command within your job script. This should communicate with the cluster's queueing system and start the job on the assigned compute nodes.
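For example, a job script might contain something like the following (the module and file names are only illustrative; on a cluster where mpirun picks the CPU count and node list up from the queueing system, -np may not be needed):

    module load openmpi                   # hypothetical module name
    mpif90 mpi_hello.f90 -o mpi_hello     # compile and link against Open MPI
    mpirun -np 8 ./mpi_hello              # run on 8 processes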
MPICH
This is an MPI implementation that is installed on some of the local clusters. It has to be configured for particular compilers and communication methods so there may be several versions on any given system, controlled by modules. Pick the one you need for your preferred compiler.
Use the mpicc and mpif90 commands in the bin directory of the chosen MPI installation to compile your code; they automatically link your code with the correct libraries. To run, use the mpiexec command within your job script. This should communicate with the cluster's queueing system and start the job on the assigned compute nodes.
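The pattern is the same as the Open MPI example above, but launched with mpiexec (names again illustrative):

    mpif90 mpi_hello.f90 -o mpi_hello
    mpiexec -n 8 ./mpi_hello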
Other MPI libraries on Linux
There are other Linux MPI libraries; some are better for use with specialised interconnects.
GPUs
GPUs work alongside regular CPUs. They have many simple cores and are good for certain types of very parallelizable workload. The only clusters we have with GPUs at the moment are pat and rogue. There are two ways of making your code use GPUs: write it using the GPU manufacturer's programming environment (in our case, CUDA) or use accelerator directives in one of the compilers that supports it (Portland 10 and up).
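As a very rough sketch of the directive approach, here is an OpenACC-style !$acc directive around a simple loop; the exact directive syntax and compiler flag depend on your Portland compiler version, so check its documentation:

    ! saxpy_acc.f90 -- offload a simple loop to the GPU (illustrative only).
    ! Compile with the Portland compiler's accelerator flag (see its manual).
    subroutine saxpy(n, a, x, y)
       implicit none
       integer, intent(in) :: n
       real, intent(in)    :: a, x(n)
       real, intent(inout) :: y(n)
       integer :: i
       ! The compiler generates GPU code for the region below and copies
       ! x and y between host and GPU memory as needed.
    !$acc kernels
       do i = 1, n
          y(i) = a*x(i) + y(i)
       end do
    !$acc end kernels
    end subroutine saxpy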