
MultipartKit V5 #100

Open · wants to merge 24 commits into main

Conversation

@ptoffy (Member) commented Oct 7, 2024

No description provided.

@ptoffy ptoffy added the semver-major Breaking changes label Oct 7, 2024
@ptoffy ptoffy self-assigned this Oct 7, 2024
@adam-fowler (Contributor) left a comment:

Some initial comments.

Sources/MultipartKit/MultipartParser+AsyncStream.swift (resolved)
Sources/MultipartKit/MultipartParser.swift (resolved)
Sources/MultipartKit/MultipartParser+AsyncStream.swift (resolved)
Sources/MultipartKit/MultipartParser.swift (resolved)
@ptoffy ptoffy requested a review from adam-fowler October 30, 2024 12:02
@ptoffy (Member, Author) commented Oct 30, 2024

@adam-fowler Do you want to take another look and see if things make more sense now? I added some binary data (which even contains a hex CRLF 😄) to the tests, so it should be able to parse anything now.

@ptoffy ptoffy marked this pull request as ready for review November 18, 2024 15:54
@ptoffy ptoffy requested review from 0xTim and gwynne as code owners November 18, 2024 15:54
@ptoffy ptoffy requested review from czechboy0 and Joannis and removed request for adam-fowler November 18, 2024 20:06
@ptoffy (Member, Author) commented Nov 18, 2024

@Joannis @simonjbeaumont @czechboy0 pulling you into this so you can take a look if you want.

Package.swift (resolved)
Package.swift (resolved)
Package.swift (resolved)
```swift
// Before
public func decode<D: Decodable>(_ decodable: D.Type, from data: String, boundary: String) throws -> D {
    try decode(D.self, from: ByteBuffer(string: data), boundary: boundary)

// After
public func decode<D: Decodable>(_ decodable: D.Type, from string: String, boundary: String) throws -> D {
    try decode(D.self, from: Array(string.utf8), boundary: boundary)
```
Member:

This makes a copy from the string; can't we use `withContiguousStorageIfAvailable`?

@ptoffy (Member, Author):

Hmm, I don't think that would work any better, because we need a Collection there. We can't pass in the raw bytes, and if we're copying them to an array we're still making a copy at that point, right?

Member:

Since you're doing an `.append` on the parser, you're right that you're already making a copy. Except right now you're making two copies of the same data, and each time you're also allocating space for it.

@ptoffy (Member, Author) commented Nov 19, 2024:

No, I mean we can't pass the raw pointer to the decode method, because it expects a collection of bytes, so this can't be done:

```swift
try string.utf8.withContiguousStorageIfAvailable { bytes in
    try decode(D.self, from: bytes, boundary: boundary)
} ?? decode(D.self, from: Array(string.utf8), boundary: boundary)
```

And if we were to do something like

```swift
if let bytes = string.utf8.withContiguousStorageIfAvailable(Array.init) {
    try decode(D.self, from: bytes, boundary: boundary)
} else {
    try decode(D.self, from: Array(string.utf8), boundary: boundary)
}
```

we're still initialising an array with the raw bytes, so I don't think this really saves us a copy. Unless you mean a different way of using `withContiguousStorageIfAvailable`?

Member:

I would really prefer not copying unnecessarily in a library like this.
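For reference, a minimal, self-contained sketch of the API under discussion, on a toy input (not MultipartKit code): `withContiguousStorageIfAvailable` lets you read a native Swift string's UTF-8 bytes in place, with an Array copy only as the fallback.

```swift
// Minimal sketch of the copy-avoidance being discussed, on a toy input
// (not MultipartKit code): read a String's UTF-8 bytes in place when
// contiguous storage is available, falling back to an Array copy.
let input = "hello"  // illustrative input

let byteSum = input.utf8.withContiguousStorageIfAvailable { bytes -> Int in
    bytes.reduce(0) { $0 + Int($1) }  // no Array allocated on this path
} ?? Array(input.utf8).reduce(0) { $0 + Int($1) }  // fallback copies

print(byteSum)
```

Native Swift strings store UTF-8 contiguously, so the fallback branch is rarely taken; the open question in the thread is whether the downstream decode can consume the `UnsafeBufferPointer` without re-materialising an Array.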

```swift
var currentBody = Body()

// Append data to the parser and process the sections
parser.append(buffer: data)
```
Member:

Is it necessary to copy data out before parsing?

@ptoffy (Member, Author):

This way we avoid a whole bunch of `needMoreData` returns from the parser.

Member:

Could it be restructured to be parse-or-append then?

```swift
if parserBuffer.isEmpty {
    let result = parser.parse(data)
    if result == .needMoreData {
        parser.append(data)
    }
} else {
    parser.append(data)
    parser.parse()
}
```
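A hypothetical, runnable analog of the parse-or-append flow sketched above, using a stand-in parser (it simply reports success once it has seen at least 4 bytes) rather than MultipartKit's real one:

```swift
// Toy model of the suggested flow: parse fresh chunks directly when
// nothing is buffered, and only buffer when more data is needed.
enum ParseResult { case done, needMoreData }

struct ToyParser {
    private(set) var buffer: [UInt8] = []

    // Stand-in parse: "succeeds" once it has at least 4 bytes.
    private func parse(_ data: [UInt8]) -> ParseResult {
        data.count >= 4 ? .done : .needMoreData
    }

    mutating func feed(_ data: [UInt8]) -> ParseResult {
        if buffer.isEmpty {
            if parse(data) == .done { return .done }  // fast path, no copy
            buffer.append(contentsOf: data)           // buffer only on demand
            return .needMoreData
        } else {
            buffer.append(contentsOf: data)
            return parse(buffer)
        }
    }
}

var toy = ToyParser()
let first = toy.feed([1, 2])    // buffered, more data needed
let second = toy.feed([3, 4])   // 4 bytes total, parse succeeds
```

The point of the suggestion is the fast path: a chunk that parses on its own never touches the internal buffer at all.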

@ptoffy (Member, Author):

But why? We're only doing that once, at the beginning; this is the sync parse.

@adam-fowler (Contributor) left a comment:

Generic parameter changes look good

Sources/MultipartKit/FormDataEncoder/FormDataEncoder.swift (resolved)
Sources/MultipartKit/FormDataEncoder/Storage.swift (resolved)
Sources/MultipartKit/FormDataEncoder/Storage.swift (resolved)
Contributor:

As I understand it, the user can receive a body as multiple MultipartSections if the underlying AsyncSequence has broken that body up. This is great, as we don't want to pay the memory cost for large bodies if we can avoid it. But there are situations where we want to ensure we have a complete body section, e.g. a block of data we want to run a JSON decode on. Is it possible to add a helper function to the Iterator to do this? E.g. Iterator.collectBody(upTo: memoryLimit), which returns a header and a complete body section.
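A hedged sketch of the collation logic such a helper could use; `BodyTooLarge`, `collectBody`, and the chunked input are illustrative stand-ins, not MultipartKit API:

```swift
// Illustrative stand-in for a collectBody(upTo:)-style helper:
// accumulate body chunks into one buffer, failing once the caller's
// memory limit would be exceeded.
struct BodyTooLarge: Error {}

func collectBody(chunks: [[UInt8]], upTo limit: Int) throws -> [UInt8] {
    var body: [UInt8] = []
    for chunk in chunks {
        guard body.count + chunk.count <= limit else { throw BodyTooLarge() }
        body.append(contentsOf: chunk)
    }
    return body
}

let small = try collectBody(chunks: [[1, 2], [3]], upTo: 8)    // collates fine
let tooBig = try? collectBody(chunks: [[1, 2], [3]], upTo: 2)  // nil: over limit
```

On the real streaming iterator this loop would consume body sections until the next part's header, but the limit check is the piece that bounds memory.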

@ptoffy (Member, Author):

There's currently a parse method on MultipartParser which loads the input all at once. I'm guessing you mean something of a middle ground between this and stream parsing, e.g. just for one part of the message?

Contributor:

Yes, just one part of the message. E.g. I have a multipart message with a zip file in there plus some metadata. I want to save the zip file to disk using the least amount of memory possible (i.e. streaming it), but parse the metadata with Codable, so I need the whole of it in memory.


@ptoffy ptoffy requested a review from adam-fowler December 28, 2024 15:06
@ptoffy ptoffy requested a review from Joannis December 28, 2024 15:08
@0xTim (Member) left a comment:

Some comments. This is starting to look good, and I think we're getting close to being able to tag an alpha.

It would be good to get the performance tests set up so we can measure changes going forward, and code coverage to see which edge cases we don't currently test.

Sources/MultipartKit/MultipartSection.swift (resolved)
Sources/MultipartKit/FormDataEncoder/Storage.swift (resolved)
Sources/MultipartKit/MultipartParser.swift (resolved)
Sources/MultipartKit/MultipartParser.swift (resolved)
Sources/MultipartKit/MultipartParser.swift (resolved)
Sources/MultipartKit/MultipartParser.swift (resolved)
@adam-fowler (Contributor) left a comment:

Looking good. Maybe we should have two async sequences, though: one that does the collation automatically, and one that requires you to use nextCollatedPart when you want a collated part.

I would rename the current AsyncSequence to StreamingMultipartParserAsyncSequence and add the following:

```swift
public struct MultipartParserAsyncSequence<BackingSequence: AsyncSequence>: AsyncSequence
where BackingSequence.Element: MultipartPartBodyElement & RangeReplaceableCollection {
    let streamingSequence: StreamingMultipartParserAsyncSequence<BackingSequence>

    public init(boundary: String, buffer: BackingSequence) {
        self.streamingSequence = .init(boundary: boundary, buffer: buffer)
    }

    public struct AsyncIterator: AsyncIteratorProtocol {
        public mutating func next() async throws -> MultipartSection<BackingSequence.Element>? {
            try await self.streamingIterator.nextCollatedPart()
        }

        var streamingIterator: StreamingMultipartParserAsyncSequence<BackingSequence>.AsyncIterator
    }

    public func makeAsyncIterator() -> AsyncIterator {
        return .init(streamingIterator: self.streamingSequence.makeAsyncIterator())
    }
}
```

Most people won't care about streaming sections of the multipart file, so the default should provide the collated parts. But if someone wants to stream the body part of a file, they can use the StreamingMultipartParserAsyncSequence version.

```swift
self.currentCollatedBody.append(contentsOf: chunk)
if !headerFields.isEmpty {
    let returningFields = headerFields
    headerFields = .init()
```
Contributor:

Why are you re-initializing the header fields? It's a variable local to this function.

@ptoffy (Member, Author):

Well, if there are different parts in a message there might be other headers later, and this empties the fields before possibly reading the next ones.

```swift
while let part = try await next() {
    switch part {
    case .headerFields(let fields):
        headerFields.append(contentsOf: fields)
```
Contributor:

Is there no chance you can read a header part after having read body parts?

@ptoffy (Member, Author):

Yes, there is, if the message has different parts.

@ptoffy (Member, Author):

FYI, in this case fields will always contain just one field, because the original sequence returns single header fields and body chunks. I'm not sure if this is misleading and should be clarified somehow, either with docs or with a name change.

@ptoffy ptoffy requested a review from adam-fowler January 1, 2025 18:58
@ptoffy (Member, Author) commented Jan 1, 2025

I agree with the proposed changes, but I moved nextCollatedPart() to simply be the next() method of the new sequence, which makes more sense to me.

@ptoffy ptoffy requested a review from 0xTim January 1, 2025 21:17
@0xTim (Member) left a comment:

Nothing blocking on my end. Once the other comments have been resolved, we can look at integrating this into projects to see how it actually works.

```swift
buffer.writeString(": ")
buffer.writeString(val)
buffer.writeString("\r\n")
buffer.append(contentsOf: Array("--\(boundary)".utf8) + crlf)
```
Member:

Instead of copying the boundary into an array, write the leading tokens, the boundary, and the CRLF separately. That saves copies and allocations, and thus a lot of work.
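A small sketch of the piecewise write being suggested, using a plain [UInt8] buffer and an example boundary (both illustrative): appending the dashes, the boundary's UTF-8 view, and the CRLF separately produces the same bytes without building an intermediate Array.

```swift
// Illustrative: same output bytes as Array("--\(boundary)".utf8) + crlf,
// but written piecewise so no intermediate Array is materialised.
let boundary = "abc123"  // example boundary
var buffer: [UInt8] = []
buffer.append(contentsOf: "--".utf8)      // leading tokens
buffer.append(contentsOf: boundary.utf8)  // boundary, via its UTF-8 view
buffer.append(contentsOf: "\r\n".utf8)    // CRLF, never copied into an array

// The allocation-heavy alternative the review is pushing back on:
let copied = Array("--\(boundary)".utf8) + Array("\r\n".utf8)
```

Because `append(contentsOf:)` takes any Sequence, the UTF-8 views can be consumed directly; the only allocation is the destination buffer's own growth.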

```swift
buffer.writeString("\r\n")
buffer.append(contentsOf: Array("--\(boundary)".utf8) + crlf)
for field in part.headerFields {
    buffer.append(contentsOf: Array("\(field.description)".utf8) + crlf)
```
Member:

Same here, you can easily prevent a copy

```swift
buffer.writeString("--")
buffer.writeString(boundary)
buffer.writeString("--\r\n")
buffer.append(contentsOf: Array("--\(boundary)--".utf8) + crlf)
```
Member:

Unnecessary copy

```swift
    parts: [MultipartPart<some MultipartPartBodyElement>],
    into buffer: inout OutputBody
) throws where OutputBody: RangeReplaceableCollection {
    let crlf = Array("\r\n".utf8)
```
Member:

CRLF is copied into this a lot. I would keep it as `"\r\n".utf8` and append those values; don't copy it into an array if possible.

@Joannis (Member) left a comment:

My main (and only) issue is that this library makes a lot of copies out of every byte carrier (String, Array, etc.).

Labels: semver-major (Breaking changes)
4 participants