Dataset size and composition impact the reliability of performance benchmarks for peptide-MHC binding predictions