Some of the following strategies are vendor- and architecture-specific, but most have a counterpart on other vendors' hardware and architectures.
Minimize memory transfers between host and device, and try to hide the latency of the remaining transfers by overlapping them with computation. Host-device transfer has much lower bandwidth than global memory access: on an NVIDIA GTX 280, for example, global memory bandwidth (about 141 GB/s) is roughly 17 times that of PCIe x16 Gen2 (about 8 GB/s). It is therefore better to store data in global memory and keep it there. Sometimes it is even better to recompute a value on the GPU than to fetch it from the host.
One large transfer is much better than many smaller transfers that add up to the same size, because each transfer carries its own fixed overhead.
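The advantage of batching transfers can be made concrete with a rough cost model. The sketch below assumes each transfer costs a fixed overhead plus size divided by bandwidth. The function names and the overhead and bandwidth figures (10 microseconds, 8 GB/s) are illustrative assumptions, not measurements from any particular device.

```python
# Hypothetical cost model: each transfer pays a fixed overhead plus size / bandwidth.
# The default overhead and bandwidth values are illustrative assumptions only.

def transfer_time(num_bytes, overhead_s=10e-6, bandwidth_bytes_per_s=8e9):
    """Estimated time in seconds for a single host-device transfer."""
    return overhead_s + num_bytes / bandwidth_bytes_per_s


def total_time(total_bytes, num_transfers):
    """Estimated time to move total_bytes split evenly across num_transfers copies."""
    chunk = total_bytes / num_transfers
    return num_transfers * transfer_time(chunk)


total_bytes = 64 * 1024 * 1024               # 64 MiB payload
one_large = total_time(total_bytes, 1)       # a single 64 MiB copy
many_small = total_time(total_bytes, 4096)   # 4096 copies of 16 KiB each

print(f"one transfer:   {one_large * 1e3:.2f} ms")
print(f"4096 transfers: {many_small * 1e3:.2f} ms")
```

Under this model the bandwidth term is identical in both cases, so the whole difference comes from paying the fixed overhead 4096 times instead of once. Real overheads vary with the driver, the bus, and whether the host memory is pinned, so the numbers should be read as a trend rather than a prediction.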
Aim for coalesced memory access wherever possible; that is, avoid out-of-sequence and misaligned transactions. The exact coalescing rules depend on the OpenCL device architecture and, on NVIDIA hardware, on its compute capability.
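To see why sequence and alignment matter, the following sketch counts how many aligned memory segments a group of work-items touches in one access. It is a deliberately simplified model: the 128-byte segment size, the group of 32 work-items, and the helper names are assumptions for illustration, and actual coalescing rules differ between devices.

```python
# Simplified coalescing model (assumed: 128-byte segments, 32 work-items,
# 4-byte elements). Fewer segments touched means fewer memory transactions.

def segments_touched(addresses, segment_bytes=128):
    """Number of distinct aligned segments covered by the given byte addresses."""
    return len({addr // segment_bytes for addr in addresses})


def work_item_addresses(base, stride_elems, n_items=32, elem_bytes=4):
    """Byte address read by each work-item i: base + i * stride_elems * elem_bytes."""
    return [base + i * stride_elems * elem_bytes for i in range(n_items)]


# Consecutive, aligned accesses fall in a single segment.
coalesced = segments_touched(work_item_addresses(base=0, stride_elems=1))
# The same pattern shifted by one element straddles two segments.
misaligned = segments_touched(work_item_addresses(base=4, stride_elems=1))
# A stride of 32 elements puts every work-item in its own segment.
strided = segments_touched(work_item_addresses(base=0, stride_elems=32))

print(coalesced, misaligned, strided)
```

In this model the aligned sequential pattern needs one segment, the misaligned one needs two, and the strided one needs 32, which is the kind of blow-up in transactions that coalescing is meant to avoid.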
Use local memory for caching (on the GTX 280 its latency is roughly 100 times lower than that of global memory), but be careful not to overuse it, to avoid the performance penalty of spilling to global...