Unless you're more specific, it's really hard to offer concise advice here.
For example, for C programs I use sampling-based profiling (which the Mac makes super easy with the “Time Profiler” instrument) to find the bottleneck(s), then drill down to the associated code sections. Different languages and OSs have different tools for identifying bottlenecks.
If performance is network-bound, that’s a very different set of considerations (and one I have no experience with).