Get-DTWFileEncoding
NAME Get-DTWFileEncoding
SYNOPSIS
Returns the encoding type of the file
SYNTAX
Get-DTWFileEncoding [-Path] <String> [[-ByteCountToCheck] <Int32>] [[-PercentageMatchUnicode] <Decimal>]
[<CommonParameters>]
DESCRIPTION
Returns the encoding type of the file. It first attempts to determine the
encoding by detecting the byte order mark (BOM) using Lee Holmes' algorithm
(http://poshcode.org/2153). However, if the file does not have a BOM
it makes an attempt to determine the encoding by analyzing the file content
(does it 'appear' to be UNICODE, does it have characters outside the ASCII
range, etc.). If it can't tell based on the content analyzed, then
it assumes it's ASCII. Note: it does not correctly detect UTF32 BE or LE
if no BOM is present.
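The BOM check described above can be sketched roughly as follows. This is a minimal illustration and not the actual code of Get-DTWFileEncoding or of Lee Holmes' script: the function name Get-BomEncodingSketch and the returned labels are hypothetical. It compares the leading bytes against the well-known BOM signatures, testing the longer ones first.

```powershell
# Hypothetical helper, not part of Get-DTWFileEncoding.
# Returns a label for the BOM found at the start of $Bytes, or $null if none.
function Get-BomEncodingSketch {
    param([byte[]]$Bytes)

    # Longer signatures come first so that UTF-32 LE (FF FE 00 00)
    # is not mistaken for UTF-16 LE (FF FE).
    $signatures = @(
        @{ Bom = [byte[]](0x00, 0x00, 0xFE, 0xFF); Name = 'UTF-32 BE' },
        @{ Bom = [byte[]](0xFF, 0xFE, 0x00, 0x00); Name = 'UTF-32 LE' },
        @{ Bom = [byte[]](0xEF, 0xBB, 0xBF);       Name = 'UTF-8' },
        @{ Bom = [byte[]](0xFE, 0xFF);             Name = 'UTF-16 BE' },
        @{ Bom = [byte[]](0xFF, 0xFE);             Name = 'UTF-16 LE' }
    )

    foreach ($sig in $signatures) {
        $bom = $sig.Bom
        if ($Bytes.Length -lt $bom.Length) { continue }

        $isMatch = $true
        for ($i = 0; $i -lt $bom.Length; $i++) {
            if ($Bytes[$i] -ne $bom[$i]) { $isMatch = $false; break }
        }
        if ($isMatch) { return $sig.Name }
    }
    return $null   # no BOM: content analysis would be needed
}

# Usage: bytes of a UTF-8 file that starts with a BOM followed by 'A'
Get-BomEncodingSketch -Bytes ([byte[]](0xEF, 0xBB, 0xBF, 0x41))
```

A $null result corresponds to the "no BOM" case, where the function falls back to analyzing the content.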
If your file doesn't have a BOM and 'doesn't appear to be Unicode' (based on
my algorithm*) but contains non-ASCII characters *after* index ByteCountToCheck,
the file will be incorrectly identified as ASCII. So put a BOM in there, would ya!
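One way to "put a BOM in there" is to write the file with an encoding object that emits a preamble. The snippet below is a sketch using standard .NET calls; the temp-file name is a placeholder chosen for illustration.

```powershell
# Placeholder path used for illustration only.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'bom-sketch.txt'

# UTF8Encoding($true) emits the EF BB BF preamble when the file is written.
$utf8WithBom = New-Object System.Text.UTF8Encoding($true)
[System.IO.File]::WriteAllText($path, "caf$([char]0xE9)", $utf8WithBom)

# Confirm that the first three bytes are the UTF-8 BOM.
$first = [System.IO.File]::ReadAllBytes($path)[0..2]
$bomHex = ($first | ForEach-Object { $_.ToString('X2') }) -join ' '
$bomHex   # EF BB BF
```

With the BOM in place, detection no longer depends on where the first non-ASCII character happens to fall.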
For more information and sample encoding files see:
http://danspowershellstuff.blogspot.com ... order.html
And please give me any tips you have about improving the detection algorithm.
*For a full description of the algorithm used to analyze non-BOM files,
see "Determine if Unicode/UTF8 with no BOM algorithm description".
PARAMETERS
-Path <String>
Path to file
Required? true
Position? 1
Default value
Accept pipeline input? true (ByValue, ByPropertyName)
Accept wildcard characters? false
-ByteCountToCheck <Int32>
Number of bytes to check; by default, the first 10000 bytes are checked.
Depending on the size of your file, this might be the entire content of your file.
Required? false
Position? 2
Default value 10000
Accept pipeline input? false
Accept wildcard characters? false
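As an illustration of what limiting the check to ByteCountToCheck bytes means, the sketch below reads at most that many bytes from the start of a file. Read-FirstBytesSketch is a hypothetical helper written for this example and is not taken from the function's source.

```powershell
# Hypothetical helper: returns up to $ByteCountToCheck bytes from the start of a file.
function Read-FirstBytesSketch {
    param(
        [string]$Path,
        [int]$ByteCountToCheck = 10000
    )
    # .NET resolves relative paths against the process directory, not the
    # PowerShell location, so resolve the path first.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $stream = [System.IO.File]::OpenRead($fullPath)
    try {
        # For a small file this is simply the whole file.
        $size = [int][Math]::Min([long]$stream.Length, [long]$ByteCountToCheck)
        $buffer = New-Object byte[] $size
        $offset = 0
        while ($offset -lt $size) {
            $read = $stream.Read($buffer, $offset, $size - $offset)
            if ($read -le 0) { break }
            $offset += $read
        }
        return ,$buffer   # the comma keeps the array from being unrolled
    }
    finally {
        $stream.Dispose()
    }
}
```

Anything beyond the returned bytes is never examined, which is why non-ASCII characters that appear only later in a BOM-less file can go unnoticed.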
-PercentageMatchUnicode <Decimal>
If the percentage of null (0 value) characters found is greater than or equal to
PercentageMatchUnicode, the file is identified as Unicode. Default value is .5 (50%).
Required? false
Position? 3
Default value 0.5
Accept pipeline input? false
Accept wildcard characters? false
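The threshold test can be pictured with the sketch below. It assumes the ratio is computed as zero-valued bytes divided by the number of bytes checked; the exact counting used by Get-DTWFileEncoding may differ, and Test-NullByteRatioSketch is a hypothetical name.

```powershell
# Hypothetical helper: $true when the share of zero-valued bytes among the
# first $ByteCountToCheck bytes reaches $PercentageMatchUnicode.
function Test-NullByteRatioSketch {
    param(
        [byte[]]$Bytes,
        [int]$ByteCountToCheck = 10000,
        [decimal]$PercentageMatchUnicode = 0.5
    )
    $count = [Math]::Min($Bytes.Length, $ByteCountToCheck)
    if ($count -eq 0) { return $false }

    $nulls = 0
    for ($i = 0; $i -lt $count; $i++) {
        if ($Bytes[$i] -eq 0) { $nulls++ }
    }
    # Cast to decimal so the division is not truncated.
    return (([decimal]$nulls / $count) -ge $PercentageMatchUnicode)
}

# ASCII text stored as UTF-16 LE has a zero high byte for every character.
Test-NullByteRatioSketch -Bytes ([System.Text.Encoding]::Unicode.GetBytes('hello'))   # True
Test-NullByteRatioSketch -Bytes ([System.Text.Encoding]::ASCII.GetBytes('hello'))     # False
```

Text made up mostly of non-Latin characters has far fewer zero bytes in UTF-16, so a lower PercentageMatchUnicode may be needed for such files.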
<CommonParameters>
This cmdlet supports the common parameters: Verbose, Debug,
ErrorAction, ErrorVariable, WarningAction, WarningVariable,
OutBuffer, PipelineVariable, and OutVariable. For more information, see
about_CommonParameters (https://go.microsoft.com/fwlink/?LinkID=113216).
INPUTS
OUTPUTS
-------------------------- EXAMPLE 1 --------------------------
PS C:\>Get-DTWFileEncoding -Path .\SomeFile.ps1 1000
Attempts to determine the encoding using only the first 1000 bytes
BodyName : unicodeFFFE
EncodingName : Unicode (Big-Endian)
HeaderName : unicodeFFFE
WebName : unicodeFFFE
WindowsCodePage : 1200
IsBrowserDisplay : False
IsBrowserSave : False
IsMailNewsDisplay : False
IsMailNewsSave : False
IsSingleByte : False
EncoderFallback : System.Text.EncoderReplacementFallback
DecoderFallback : System.Text.DecoderReplacementFallback
IsReadOnly : True
CodePage : 1201
RELATED LINKS