[BUG] Import of Windows-1252 encoded file loses prolog and becomes mangled UTF-8 #5430

Open
ahenket opened this issue Aug 28, 2024 · 6 comments
Labels
bug issue confirmed as bug

Comments

@ahenket

ahenket commented Aug 28, 2024

Describe the bug
When I import the attached file through oXygen's XML-RPC connection (eXide doesn't let me, which is a different issue), eXist-db loses the prolog that declares the file as Windows-1252 but does not convert the content to UTF-8. When you reopen the file, it is read with the XML default encoding, UTF-8, and all characters outside ASCII are broken.
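
The mangling can be reproduced outside eXist-db. A minimal Python sketch, assuming the stored bytes stay Windows-1252 and are later decoded as UTF-8; the sample string is just the word from this report, not data read from the attached file:

```python
# Encode the word as Windows-1252 (cp1252), the encoding of the uploaded file.
original = "patiënt"
raw = original.encode("cp1252")  # ë becomes the single byte 0xEB

# Decode the same bytes as UTF-8, which is what a reader falls back to
# when the encoding declaration is gone. 0xEB is not a valid UTF-8
# sequence on its own, so it is replaced with U+FFFD.
mangled = raw.decode("utf-8", errors="replace")
print(mangled)
```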

Expected behavior
Either keep the encoding of uploaded files, or convert them on the fly before committing to the database.
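
The second option could look roughly like the Python sketch below. `to_utf8` is a hypothetical helper, not eXist-db code. It assumes an ASCII-compatible source encoding with no byte order mark and an XML declaration at the very start of the file, so it would not handle UTF-16 input:

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_DECL = re.compile(rb'^(<\?xml[^>]*?encoding=)(["\'])([A-Za-z][A-Za-z0-9._-]*)\2')


def to_utf8(data: bytes) -> bytes:
    """Hypothetical helper: transcode an XML document to UTF-8 and update its declaration."""
    match = _DECL.match(data)
    if match is None:
        # No declared encoding: XML defaults to UTF-8, so leave the bytes alone.
        return data
    declared = match.group(3).decode("ascii")
    text = data.decode(declared)  # raises LookupError for an unknown encoding name
    # Rewrite only the first encoding="..." occurrence, which is the declaration.
    text = re.sub(r'(encoding=)(["\'])[^"\']*\2', r'\1\2UTF-8\2', text, count=1)
    return text.encode("utf-8")


source = '<?xml version="1.0" encoding="Windows-1252"?><a v="patiënt"/>'.encode("cp1252")
converted = to_utf8(source)
```

The declaration and the bytes are changed together, so the stored document never claims one encoding while containing another.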

To Reproduce
Extract the one file from the zip and upload it anywhere on your server. Now reopen it using oXygen or eXide and search for "pati". The first hit reads "pati�nt" instead of "patiënt" and is in this path: /XMI/XMI.content[1]/UML:Model[1]/UML:Namespace.ownedElement[1]/UML:Package[1]/UML:Namespace.ownedElement[1]/UML:Collaboration[1]/UML:Namespace.ownedElement[1]/UML:ClassifierRole[2]/UML:ModelElement.taggedValue[1]/UML:TaggedValue[1]/@value

nl.zorg.Zwangerschap-v4.1.xmi.zip

There are 27 occurrences of � that were regular Windows-1252-compatible characters before.
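
A count like this can be taken from a document exported from the database. A small Python sketch; `count_replacement_chars` is a hypothetical helper, and the sample bytes are made up for illustration rather than taken from the attached file:

```python
def count_replacement_chars(data: bytes) -> int:
    """Count U+FFFD characters seen when the bytes are decoded as UTF-8."""
    # errors="replace" turns every undecodable byte sequence into U+FFFD;
    # U+FFFD characters already stored as UTF-8 are counted as well.
    return data.decode("utf-8", errors="replace").count("\ufffd")


sample = "patiënt één".encode("cp1252")
broken = count_replacement_chars(sample)
```

The result is not guaranteed to equal the number of original non-ASCII characters in every file, because some runs of adjacent Windows-1252 bytes can be consumed as a single UTF-8 sequence.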

Environment

Key Value
eXist Version: 6.2.0
eXist Build: 2023-02-04T22:42:29Z
Operating System: Mac OS X 14.6.1 aarch64
Java Version: 21.0.2
Default Encoding: UTF-8
@line-o
Member

line-o commented Aug 28, 2024

@ahenket I think this issue is addressed and fixed in the develop branch. Would you be able to test it with the latest develop and confirm?

@line-o line-o added the bug issue confirmed as bug label Aug 28, 2024
@ahenket
Author

ahenket commented Aug 28, 2024

Tried but it would appear not:

% mvn -e -X -DskipTests package
....
[INFO] eXist-db Distributions ............................. FAILURE [  2.369 s]
...
[ERROR] Failed to execute goal com.github.monkeywie:copy-rename-maven-plugin:1.0:rename (rename-jetty-etc-dir-for-appassembler) on project exist-distribution: could not rename /Users/ahenket/Development/GitHub/eXist/exist/exist-distribution/target/exist-distribution-7.0.0-SNAPSHOT-dir/etc/org/exist/jetty/etc to /Users/ahenket/Development/GitHub/eXist/exist/exist-distribution/target/exist-distribution-7.0.0-SNAPSHOT-dir/etc/jetty: Failed to delete /Users/ahenket/Development/GitHub/eXist/exist/exist-distribution/target/exist-distribution-7.0.0-SNAPSHOT-dir/etc/jetty while trying to rename /Users/ahenket/Development/GitHub/eXist/exist/exist-distribution/target/exist-distribution-7.0.0-SNAPSHOT-dir/etc/org/exist/jetty/etc -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal com.github.monkeywie:copy-rename-maven-plugin:1.0:rename (rename-jetty-etc-dir-for-appassembler) on project exist-distribution: could not rename /Users/ahenket/Development/GitHub/eXist/exist/exist-distribution/target/exist-distribution-7.0.0-SNAPSHOT-dir/etc/org/exist/jetty/etc to /Users/ahenket/Development/GitHub/eXist/exist/exist-distribution/target/exist-distribution-7.0.0-SNAPSHOT-dir/etc/jetty
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
    ...
Caused by: java.io.IOException: Failed to delete /Users/ahenket/Development/GitHub/eXist/exist/exist-distribution/target/exist-distribution-7.0.0-SNAPSHOT-dir/etc/jetty while trying to rename /Users/ahenket/Development/GitHub/eXist/exist/exist-distribution/target/exist-distribution-7.0.0-SNAPSHOT-dir/etc/org/exist/jetty/etc
    at org.codehaus.plexus.util.FileUtils.rename (FileUtils.java:2092)
    ...
[ERROR] 
    ...

@line-o
Member

line-o commented Aug 28, 2024

@ahenket mvn clean should do the trick (if you have data in that working directory, make a backup first).

@dizzzz
Member

dizzzz commented Aug 29, 2024

Maybe spinning up a Docker container is easier: docker pull duncdrum/existdb:latest

@line-o
Member

line-o commented Aug 29, 2024

@dizzzz you are right, that is way easier for one-off tests.

@ahenket
Author

ahenket commented Sep 2, 2024

I built a fresh 7.0.0-SNAPSHOT db and imported the same file(s) again. This time it keeps the prolog and leaves it at Windows-1252. However, it still thinks the XML is UTF-8.

[image attachment]
