"CSV Data in Linux Shows '???' in First Element but works fine in windows - How to Resolve?"

Question

I'm executing my code in GitHub Actions (Linux environment) and trying to retrieve data from a CSV file. However, the first element of the CSV file comes back prefixed with the characters ???" (for example, ???"Age instead of Age). The same CSV file works fine on Windows after handling the BOM (Byte Order Mark). Here is the code I am using to retrieve data from the CSV file:

public void retrieveCSVFile(String directoryPath, String filename, List<String> csvContent, List<ExtentTest> extentLog) throws IOException {
    File directory = new File(directoryPath);
    Collection<File> files = FileUtils.listFiles(directory, new String[]{"csv"}, false);
    Optional<File> fileOptional = files.stream().filter(file -> file.getName().contains(filename)).findFirst(); // take the first file whose name contains the given filename
    if (fileOptional.isPresent()) {
        File file = fileOptional.get();
        validateCsvHeaders(file, csvContent, extentLog);
    } else {
        extentLog.stream().forEach(Exlog -> Exlog.log(Status.FAIL, filename + " is not present in downloads folder"));
    }
}

private static void validateCsvHeaders(File file, List<String> csvContent, List<ExtentTest> extentLog) {
    try (CSVReader csvReader = new CSVReader(new FileReader(file))) {
        List<String> header = Arrays.asList(csvReader.readNext());
        for (int i = 0; i < header.size(); i++) {
            int finalI = i;
            if (removeBOMFromCSV(header.get(i)).contains(csvContent.get(i))) {
                extentLog.stream().forEach(Exlog -> Exlog.log(Status.PASS, header.get(finalI) + " is present in CSV file: " + csvContent.get(finalI)));
            } else {
                extentLog.stream().forEach(Exlog -> Exlog.log(Status.INFO, "Expected header is " + header.get(finalI) + " and actual header is" + csvContent.get(finalI)));
                extentLog.stream().forEach(Exlog -> Exlog.log(Status.FAIL, header.get(finalI) + " is not present in CSV file" + csvContent.get(finalI)));
            }
        }

    } catch (IOException | CsvValidationException e) {
        extentLog.stream().forEach(Exlog -> Exlog.log(Status.FAIL, "Error reading CSV file " + file.getName() + ": " + e.getMessage()));
    }
}

public static String removeBOMFromCSV(String s) {
    String[] BOMs = {"\uFEFF",      // UTF-8 BOM
            "\uFFFE",      // UTF-16 BOM (BE)
            "\uFEFF",      // UTF-16 BOM (LE)
            "\uFFFE",      // UTF-32 BOM (BE)
            "\uFEFF"};     // UTF-32 BOM (LE)
    for (String bom : BOMs) {
        if (s.startsWith(bom)) {
            s = s.substring(bom.length());
            s=s.trim();
            s = s.replaceAll("[^\\p{Print}]", "");
            if(s.startsWith("???\"")){
                s=s.substring(4);
            }
        }
    }
    return s;
}

Here is what CSV file looks like:

|Age|name|location|
|:--|:--:|-------:|
|30 |John|India|
|25 |Jane|USA|

After executing the code in GitHub Actions (env = Linux), I get this output:

???"Age is not present in the CSV file

name is present in CSV file

location is present in CSV file

Expected output

Age is present in CSV file

name is present in CSV file

location is present in CSV file

It would be easier to follow if you print out each character value at the start of the first line received (could use Files.readString + substring + toCharArray). — DuncG, Commented Jul 8 at 8:35
@DuncG I attempted to print the words individually, but the output came out as ?,?,?,",A,g,e — Ashutosh jha, Commented Jul 8 at 10:00
All you need is to show what the file actually contains at the start. On GNU/Linux you could show with od -xc xxxx | head. It would also help to know the file encoding of the input — DuncG, Commented Jul 8 at 10:20
new FileReader(file) ← This is the source of your problem. With modern versions of Java, that will always read the file as UTF-8; older versions of Java will read the file using the system’s default charset. Either way, you have already failed the task of reading a BOM, if the file’s encoding happens to differ from the encoding you used to read it. You cannot reliably remove a BOM using a Reader without first knowing the file’s encoding. — VGR, Commented Jul 8 at 14:05
The best way to deal with this is to use a FilterReader that filters out BOMs. They exist. iirc there could be one in Commons IO. tbh I'm surprised this functionality is not already in base Java — g00se, Commented Jul 8 at 16:00
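The inspection suggested in the comments (looking at the raw bytes at the start of the file, as od -xc would show them) can also be done from Java. The following is a minimal sketch, not the asker's code: the class name HeadDump and the method hexHead are hypothetical names chosen for illustration. It reads bytes rather than characters, so the result does not depend on the JVM's default charset.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: prints the first bytes of a file as hex.
final class HeadDump {

    // Returns up to 'count' leading bytes of the file as space-separated hex.
    static String hexHead(Path path, int count) throws IOException {
        try (InputStream in = Files.newInputStream(path)) {
            byte[] head = in.readNBytes(count); // may be shorter than 'count'
            StringBuilder sb = new StringBuilder();
            for (byte b : head) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java HeadDump.java path/to/file.csv
        System.out.println(hexHead(Path.of(args[0]), 16));
    }
}
```

A file that starts with ef bb bf carries a UTF-8 BOM, fe ff indicates UTF-16BE, and ff fe indicates UTF-16LE (or UTF-32LE if followed by 00 00).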

VGR · Accepted Answer · 2024-07-08 17:24:21Z

The representation of a byte order mark in a file will depend on the charset used to encode the file. Since new FileReader(file) already assumes a charset, it is already too late to check for a BOM once you have created a Reader.
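To illustrate why the charset matters, here is a small sketch that decodes the same bytes (a UTF-8 BOM followed by "Age") with three different charsets. The sample bytes and the class name BomDecodingDemo are assumptions made for this example; the actual contents of the asker's file were never shown.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo: one byte sequence, three decodings.
final class BomDecodingDemo {

    // UTF-8 BOM (EF BB BF) followed by the ASCII text "Age".
    static final byte[] SAMPLE = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'A', 'g', 'e'};

    static String decode(Charset charset) {
        return new String(SAMPLE, charset);
    }

    // Formats each code point as U+XXXX so invisible characters can be seen.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Charset[] charsets = {
            StandardCharsets.UTF_8,      // BOM becomes the single character U+FEFF
            StandardCharsets.US_ASCII,   // each BOM byte becomes U+FFFD
            StandardCharsets.ISO_8859_1  // each BOM byte becomes its own character
        };
        for (Charset cs : charsets) {
            System.out.println(cs + " -> " + codePoints(decode(cs)));
        }
    }
}
```

Only the UTF-8 decoding yields U+FEFF, which is the one character a check like startsWith("\uFEFF") can match. Under US-ASCII the three BOM bytes turn into three replacement characters (U+FFFD), which many consoles and logs render as ?. That is one plausible source of the ??? prefix, assuming the Linux runner's default charset was not UTF-8.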

You must check for the BOM before you create a Reader. You must check bytes, not characters. unicode.org has a handy reference on how to check a file’s initial bytes. The code might look like this:

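// Imports needed by this method:
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.util.Arrays;

private static Reader openWithoutBOM(File file)
throws IOException {
    InputStream in = new BufferedInputStream(new FileInputStream(file));

    in.mark(4);
    // readNBytes returns fewer than 4 bytes for very short files, so
    // remember the real count and pad the buffer to 4 bytes; otherwise
    // the absolute gets below would throw IndexOutOfBoundsException.
    byte[] head = in.readNBytes(4);
    int count = head.length;
    ByteBuffer initialBytes = ByteBuffer.wrap(Arrays.copyOf(head, 4));

    String charset;
    if (count == 4 && initialBytes.getInt(0) == 0x0000feff) {
        // The 4 bytes already consumed were the BOM; nothing to skip.
        charset = "UTF-32BE";
    } else if (count == 4 && initialBytes.getInt(0) == 0xfffe0000) {
        // Checked before UTF-16LE because both BOMs start with FF FE.
        charset = "UTF-32LE";
    } else if (count >= 2 && initialBytes.getShort(0) == (short) 0xfeff) {
        charset = "UTF-16BE";

        in.reset();
        in.skipNBytes(2);
    } else if (count >= 2 && initialBytes.getShort(0) == (short) 0xfffe) {
        charset = "UTF-16LE";

        in.reset();
        in.skipNBytes(2);
    } else if (count >= 3 &&
               initialBytes.get(0) == (byte) 0xef &&
               initialBytes.get(1) == (byte) 0xbb &&
               initialBytes.get(2) == (byte) 0xbf) {

        charset = "UTF-8";

        in.reset();
        in.skipNBytes(3);
    } else {
        // No BOM found. You may want to make a different assumption
        // here, for example if your files come from a tool that writes
        // another encoding without a BOM.

        charset = "UTF-8";
        in.reset();
    }

    return new InputStreamReader(in, charset);
}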

Could be useful. I might steal it ;) Maybe using instead InputStream in = new BufferedInputStream(Files.newInputStream(file)); with obvious adjustment — g00se, Commented Jul 9 at 14:33
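Following up on the Files.newInputStream suggestion, here is a minimal Path-based sketch for the narrower case where the file is already known to be UTF-8 and only an optional BOM needs to be dropped. The class name Utf8BomSkipper and its open method are hypothetical, and the UTF-8 assumption is mine; for files of unknown encoding, the detection logic in the answer is still required.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: opens a UTF-8 file and skips a leading BOM if present.
final class Utf8BomSkipper {

    static BufferedReader open(Path path) throws IOException {
        // A pushback buffer of 3 bytes is enough to undo the BOM probe.
        PushbackInputStream in = new PushbackInputStream(
                new BufferedInputStream(Files.newInputStream(path)), 3);
        try {
            byte[] head = in.readNBytes(3); // shorter for files under 3 bytes
            boolean hasBom = head.length == 3
                    && head[0] == (byte) 0xEF
                    && head[1] == (byte) 0xBB
                    && head[2] == (byte) 0xBF;
            if (!hasBom) {
                in.unread(head); // not a BOM: put the bytes back
            }
            return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        } catch (IOException e) {
            in.close(); // do not leak the stream if probing fails
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java Utf8BomSkipper.java path/to/file.csv
        try (BufferedReader reader = open(Path.of(args[0]))) {
            System.out.println(reader.readLine());
        }
    }
}
```

The returned reader could be handed to a CSV parser in place of new FileReader(file), so the first header arrives without any BOM residue regardless of the platform's default charset.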
