Skip to content

Fixed bug with matching the end of a string #1

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 1 addition & 15 deletions README
Original file line number Diff line number Diff line change
Expand Up @@ -73,21 +73,7 @@ FAQ
---

* Does it work with UTF-8?
As of now, the code assume single-byte characters. To use UTF-8 text, you can
always convert the encoding using mb_convert_encoding():
...
$from_text = mb_convert_encoding($from_text_utf8, 'HTML-ENTITIES', 'UTF-8');
$to_text = mb_convert_encoding($to_text_utf8, 'HTML-ENTITIES', 'UTF-8');
$diff_opcodes = FineDiff::getDiffOpcodes($from_text, $to_text);
...

If ever you want to re-generate the $to_text_utf8 from the $from_text_utf8:
...
$from_text = mb_convert_encoding($from_text_utf8, 'HTML-ENTITIES', 'UTF-8');
$to_text = FineDiff::renderToTextFromOpcodes($from_text, $diff_opcodes);
$to_text_utf8 = mb_convert_encoding($to_text, 'UTF-8', 'HTML-ENTITIES');
....

Yes!

License
-------
Expand Down
23 changes: 19 additions & 4 deletions finediff.php
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,10 @@
*
* 10-Dec-2011 (Christoph Mewes):
* - added UTF-8 support, fixed strange usage of htmlentities
*
* 15-Jul-2013 (Peter Bagnall):
* - fixed bug where getting the diff of "abc def" and "abc def ghi" would fail
* to recognise that def was a match, because whitespace was being included in fragments.
*/

mb_internal_encoding('UTF-8');
Expand Down Expand Up @@ -520,12 +524,17 @@ private static function doFragmentDiff($from_text, $to_text, $delimiters) {
$fragment_index_offset += $fragment_length;
}
if ( $fragment_index_offset > $best_copy_length ) {
$best_copy_length = $fragment_index_offset;
$best_from_start = $from_base_fragment_index;
$best_to_start = $to_base_fragment_index;
// if the matching string is just made up of delimiters then don't count it as a match. This prevents an
// excessive number of whitespaces being seen as matches and therefore breaking up a long replace segment
// to no useful purpose.
if ($fragment_index_offset > $from_base_fragment_length || self::mb_strspn($from_base_fragment, $delimiters, 0)===0) {
$best_copy_length = $fragment_index_offset;
$best_from_start = $from_base_fragment_index;
$best_to_start = $to_base_fragment_index;
}
}
}
$from_base_fragment_index += mb_strlen($from_base_fragment);
$from_base_fragment_index += $from_base_fragment_length;
// If match is larger than half segment size, no point trying to find better
// TODO: Really?
if ( $best_copy_length >= $from_segment_length / 2) {
Expand Down Expand Up @@ -655,6 +664,12 @@ private static function extractFragments($text, $delimiters) {
$start = $end = 0;
for (;;) {
$end += self::mb_strcspn($text, $delimiters, $end);
if ( $end === $start ) {
break;
}
$fragments[$start] = mb_substr($text, $start, $end - $start);
$start = $end;

$end += self::mb_strspn($text, $delimiters, $end);
if ( $end === $start ) {
break;
Expand Down