Researchers say we need better benchmarks to build more useful AI assistants