I'm running my code in GitHub Actions (Linux environment) and trying to read data from a CSV file. However, the first element of the CSV file shows up as '???"'. Interestingly, the same CSV file works fine on Windows after handling the BOM (Byte Order Mark). Here is the code I am using to read the CSV file:
public void retrieveCSVFile(String directoryPath, String filename, List<String> csvContent, List<ExtentTest> extentLog) throws IOException {
    File directory = new File(directoryPath);
    Collection<File> files = FileUtils.listFiles(directory, new String[]{"csv"}, false);
    Optional<File> fileOptional = files.stream().filter(file -> file.getName().contains(filename)).findFirst(); //retrieve the latest file
    if (fileOptional.isPresent()) {
        File file = fileOptional.get();
        validateCsvHeaders(file, csvContent, extentLog);
    } else {
        extentLog.stream().forEach(Exlog -> Exlog.log(Status.FAIL, filename + " is not present in downloads folder"));
    }
}
private static void validateCsvHeaders(File file, List<String> csvContent, List<ExtentTest> extentLog) {
    try (CSVReader csvReader = new CSVReader(new FileReader(file))) {
        List<String> header = Arrays.asList(csvReader.readNext());
        for (int i = 0; i < header.size(); i++) {
            int finalI = i;
            if (removeBOMFromCSV(header.get(i)).contains(csvContent.get(i))) {
                extentLog.stream().forEach(Exlog -> Exlog.log(Status.PASS, header.get(finalI) + " is present in CSV file: " + csvContent.get(finalI)));
            } else {
                extentLog.stream().forEach(Exlog -> Exlog.log(Status.INFO, "Expected header is " + header.get(finalI) + " and actual header is" + csvContent.get(finalI)));
                extentLog.stream().forEach(Exlog -> Exlog.log(Status.FAIL, header.get(finalI) + " is not present in CSV file" + csvContent.get(finalI)));
            }
        }
    } catch (IOException | CsvValidationException e) {
        extentLog.stream().forEach(Exlog -> Exlog.log(Status.FAIL, "Error reading CSV file " + file.getName() + ": " + e.getMessage()));
    }
}
public static String removeBOMFromCSV(String s) {
    String[] BOMs = {"\uFEFF", // UTF-8 BOM
            "\uFFFE", // UTF-16 BOM (BE)
            "\uFEFF", // UTF-16 BOM (LE)
            "\uFFFE", // UTF-32 BOM (BE)
            "\uFEFF"}; // UTF-32 BOM (LE)
    for (String bom : BOMs) {
        if (s.startsWith(bom)) {
            s = s.substring(bom.length());
            s = s.trim();
            s = s.replaceAll("[^\\p{Print}]", "");
            if (s.startsWith("???\"")) {
                s = s.substring(4);
            }
        }
    }
    return s;
}
Here is what the CSV file looks like:
|Age|name|location|
|:--|:--:|------:|
|30 |John|India|
|25 |Jane|USA|
After running the code in GitHub Actions (env = Linux), I get this output:
???"Age is not present in the CSV file
name is present in CSV file
location is present in CSV file
Expected output:
Age is present in CSV file
name is present in CSV file
location is present in CSV file
Please show the output of `od -xc xxxx | head`. It would also help to know the file encoding of the input.
`new FileReader(file)` ← This is the source of your problem. With modern versions of Java (18 and later), that will by default read the file as UTF-8; older versions of Java will read the file using the system’s default charset. Either way, if the file’s encoding differs from the one you used to read it, the BOM has already been decoded incorrectly by the time your code sees it. You cannot reliably remove a BOM using a Reader without first knowing the file’s encoding.
Use a `FilterReader` that filters out BOMs. They exist. IIRC there could be one in Commons IO. TBH I'm surprised this functionality is not already in base Java.
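To make the point about encodings concrete, here is a minimal stdlib-only sketch. It assumes the file is UTF-8 with a BOM and that the JVM on the runner defaults to an ASCII-like charset; neither is confirmed by the question. `BomAwareReaders`, `open` and `decode` are hypothetical names, not part of any library. The idea is to inspect the raw bytes for a BOM before any decoding happens, then build the `Reader` with the charset the BOM implies. Commons IO ships a `BOMInputStream` class that should cover the same ground if you already depend on that library.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; none of these names come from the question.
class BomAwareReaders {

    /** Opens a Reader over the stream, consuming a UTF-8 or UTF-16 BOM if one is present. */
    static Reader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pin.readNBytes(head, 0, head.length); // requires Java 9+

        Charset charset = fallback; // used only when no BOM is found
        int bomLength = 0;
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        } else if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            charset = StandardCharsets.UTF_16LE; // UTF-32 BOMs are not handled in this sketch
            bomLength = 2;
        }
        // Push back whatever was read beyond the BOM so the decoder still sees it.
        if (n > bomLength) {
            pin.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(pin, charset);
    }

    /** Convenience wrapper for the demo: decodes a whole byte array via open(). */
    static String decode(byte[] data, Charset fallback) {
        try (Reader reader = open(new ByteArrayInputStream(data), fallback)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            for (int len; (len = reader.read(buf)) != -1; ) {
                sb.append(buf, 0, len);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 BOM, followed by the quoted header cell "Age".
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '"', 'A', 'g', 'e', '"'};

        // Decoded as US-ASCII, each BOM byte becomes one U+FFFD replacement character.
        String wrong = new String(sample, StandardCharsets.US_ASCII);
        System.out.println(wrong.length()); // prints 8: three U+FFFD plus "Age" in quotes

        // Detecting the BOM on the raw bytes first gives clean text.
        System.out.println(decode(sample, StandardCharsets.US_ASCII)); // prints "Age" (with quotes)
    }
}
```

If that assumption about the runner's default charset holds, the three replacement characters would account for the `???` in the log, and a string-level check for `\uFEFF` can never match them. In the question's code, the equivalent change would be to pass `open(new FileInputStream(file), StandardCharsets.UTF_8)` to `CSVReader` instead of `new FileReader(file)`.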