Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rollback part of commit 7dbb3fe710466ec2f573f81e19c7130ec1c7a365 #2273

Open
jzabroski opened this issue Jul 2, 2024 · 8 comments
Open

Rollback part of commit 7dbb3fe710466ec2f573f81e19c7130ec1c7a365 #2273

jzabroski opened this issue Jul 2, 2024 · 8 comments
Labels

Comments

@jzabroski
Copy link

Describe the bug
I don't know why the code is now throwing an exception when it used to return null - the commit where this occurred just says it was done to remove nullability - but why when that is very useful. There is no clear use case for this behavior, and it causes an exception to be thrown on an empty file, which is highly non-desirable behavior.

In any event, the lack of migration guide since v30 to now is also making upgrading difficult and questionable. We're on v31 but given the other bug I filed, I'm not planning to upgrade further.

To Reproduce
7dbb3fe#diff-e1cbba681f102065fada3e91fecbfbc458195a01974189a3483f75e72137afdaL733-L736

Fails with the following:

Empty.csv in CsvHelperBug.csproj

first,last,dob

SampleMap.cs

    public class SampleMap : ClassMap<SampleLine>
    {
        [SuppressMessage("ReSharper", "VirtualMemberCallInConstructor")]
        public SampleMap()
        {
            Map(x => x.FirstName).Name("first");
            Map(x => x.LastName).Name("last");
            Map(x => x.DateOfBirth).Name("dob");
        }
    }

SampleLine.cs

    public class SampleLine
    {
        public string FirstName { get; set; }

        public string LastName { get; set; }

        public DateTime? DateOfBirth { get; set; }
    }

CsvHelperBug/Test.cs

var resourceName = "CsvHelperBug.Empty.csv"; // Modify name based on what csproj your repro is called
using (var streamReader = new StreamReader(GetType().Assembly.GetManifestResourceStream(resourceName)))
{
    using (var csvReader = new CsvReader(streamReader, new CsvConfiguration(CultureInfo.CurrentCulture) { HasHeaderRecord = true }))
    {
        csvReader.Context.RegisterClassMap<SampleMap>();
        var results= new List<SampleLine>();
        while (await csvReader.ReadAsync())
        {
            var importLine = csvReader.GetRecord<SampleLine>();
            if (importLine == null)
            {
                //No record found
                continue;
            }
            results.Add(importLine);
        }
        return results;
    }
}

Additionally, the following fails with "No header record was found" (wtf?):

var resourceName = "CsvHelperBug.Empty.csv";
using (var streamReader = new StreamReader(GetType().Assembly.GetManifestResourceStream(resourceName)))
{
    using (var csvReader = new CsvReader(streamReader, new CsvConfiguration(CultureInfo.CurrentCulture) { HasHeaderRecord = true }))
    {
        csvReader.Context.RegisterClassMap<SampleMap>();
        csvReader.ReadHeader(); // throws CsvHelper.ReaderException: No header record was found.
    }
}

Expected behavior
Roll this back! It doesn't make any sense in the perfectly valid scenario where there are no rows.

Screenshots
n/a

Additional context
n/a

@jzabroski jzabroski added the bug label Jul 2, 2024
@JoshClose
Copy link
Owner

The idea is, you should never have a null record. The issue here is ReadAsync should be returning false.

Does this sound reasonable?

As far as the ReadHeader thing, that code hasn't changed in years. You should be calling Read before ReadHeader, but there could probably be a better exception message.

@jzabroski
Copy link
Author

That would work for me. Can you push a fix today

@JoshClose
Copy link
Owner

Working on it right now.

@JoshClose
Copy link
Owner

I have a dilemma.

The way things work is row by row. That means you call Read every time you want a new row. You can then act on the given row. If you have a header record, you're supposed to call ReadHeader, which reads the current row as headers. Then you call you Read again to get to the first record. This allows the user to handle each row however they like. When using GetRecords, everything is handled for you.

The issue here is that GetRecord is calling ReadHeader for you if you haven't called it yet. That was a convenience for people, but presents a problem here, because there might not be another row. Returning null is a bad way to handle that. When enabling nullable it becomes a pain to handle null records when there is almost no chance of getting one.

I think the correct solution here is to have GetRecord stop calling ReadHeader and Read for you. This may be a little annoying, but it makes things consistent.

csv.Read();
csv.ReadHeader();
while (csv.Read())
{
    var record = csv.GetRecord<Foo>();
}

The other option is to have Read call ReadHeader and read a second time, returning false when there isn't another record. I don't like this idea because it doesn't allow the user to do any other inspection on the header record and automatically goes past it. If you're using Read instead of GetRecords, you've opted into a more manual way of reading.

What do you think?

@jzabroski
Copy link
Author

jzabroski commented Jul 2, 2024

I think if CsvConfiguration specifies there is a header record, it should do it for you.

ReadHeader should add a remark that it should only be used for highly customized or niche scenarios, and is primarily intended for internal use when CsvConfiguration.HasHeaderRecord is true.

Otherwise, what is the point of the CsvConfiguration.HasHeaderRecord property?

@JoshClose
Copy link
Owner

what is the point of the CsvConfiguration.HasHeaderRecord property?

Only used for GetRecords.

@jzabroski
Copy link
Author

jzabroski commented Jul 2, 2024

what is the point of the CsvConfiguration.HasHeaderRecord property?

Only used for GetRecords.

Then ReadHeader should add a remarks that it requires CsvConfiguration.HasHeaderRecord to be true, but allows the consumer to directly read.

And yes, your Read fix should work.

In my use case, we read and validate one row at a time with FluentValidation, so that we can parse a batch of records in really large CSV files, validate it, and save to a db only the good rows and report the bad rows. Otherwise I'm not certain we would need to read one row at a time.

A lot of financial data is zipped csv files, so we deflate, which is memory intensive, and to avoid running out of memory, as well as scaling db insert/update/delete operations, we process in batches of 1000.

@jzabroski
Copy link
Author

jzabroski commented Sep 18, 2024

@JoshClose This was my generic solution based on your last comment - do you think this is OK? - at a practical level, it doesn't make a ton of sense given how the code for GetRecord reads (it implies it should read the header for you) - but the below code passes 3 different important test scenarios:

  1. CSV with header record and no data
  2. CSV with error in third column of data - e.g., malformed Date value that cannot be read into a DateTime struct
  3. CSV with header record and 3 rows of data - should read in perfectly
using (var streamReader = new StreamReader(request.Stream))
{
    using (var csvReader = new CsvReader(streamReader, request.Configuration))
    {
        csvReader.Context.RegisterClassMap<TMap>();
        
        var stopwatch = Stopwatch.StartNew();
        if (csvReader.Configuration.HasHeaderRecord)
        {
            csvReader.Read();
            csvReader.ReadHeader();
        }
        if (csvReader.Read())
        {
            var importLine = csvReader.GetRecord<TResult>();
            if (importLine != null)
            {
                if (!IsValid(request, importLine, csvReader.Context, out var error))
                {
                    errors.Add(error);
                }
                else
                {
                    results.Add(importLine);
                }
            }
        }

        return new CsvBatchParserResponse<TResult>
        {
            Results = results,
            Errors = errors,
            ParseTime = stopwatch.Elapsed
        };
    }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants