FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark