2016-10-06
A couple of months ago I created the library purescript-unicode. garyb and paf31 accepted it into the purescript-contrib GitHub organization.
The purescript-unicode library is a direct port of Haskell's Unicode functionality. Its Data.Char.Unicode module contains all of the Unicode-related functions and datatypes provided by Haskell's Data.Char module.
Originally, I had wanted to use purescript-parsing to parse a programming language. I knew Haskell's parsec library had convenient functionality for parsing programming languages, but it wasn't implemented in purescript-parsing. I decided to port the functionality from parsec to purescript-parsing. However, I ran into a roadblock: parsec was using many Unicode-related functions from Haskell's Data.Char module.
I took some time to think about how to proceed, and I came up with a plan to port all of the Unicode functionality from Haskell's Data.Char module to a separate PureScript package. This is what became purescript-unicode. I then sent a PR to purescript-parsing that adds functionality which depends on purescript-unicode.
purescript-unicode Usage Example
Here is a short example of actually using purescript-unicode:
>>> generalCategory 'a'
Just LowercaseLetter
>>> generalCategory '0'
Just DecimalNumber
>>> generalCategory '♥'
Just OtherSymbol
>>> generalCategory '本'
Just OtherLetter
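The same function can be used from ordinary code, not just the REPL. Here is a minimal sketch, assuming the module name Data.Char.Unicode used in this post and only the constructors shown in the session above. The module name Example and the helper describe are made up for illustration and are not part of the library:

```purescript
module Example where

import Data.Char.Unicode (GeneralCategory(..), generalCategory)
import Data.Maybe (Maybe(..))

-- | Hypothetical helper: turn a character's general category into a
-- | human-readable label. Only the four categories from the REPL
-- | session are named; everything else falls through to a catch-all.
describe :: Char -> String
describe c = case generalCategory c of
  Just LowercaseLetter -> "lowercase letter"
  Just DecimalNumber -> "decimal digit"
  Just OtherSymbol -> "symbol"
  Just OtherLetter -> "letter without case"
  Just _ -> "some other category"
  Nothing -> "no category found"
```

Pattern matching on the constructors keeps the sketch from relying on any particular instance for GeneralCategory. A token parser could use a check like this to decide which characters may start an identifier.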
How is purescript-unicode Implemented?
purescript-unicode works very similarly to Haskell's Data.Char module.
There is an internal module, Data.Char.Unicode.Internal, that is generated by a shell script. The Data.Char.Unicode module uses the internal module.
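This post does not describe the shape of the generated data, so the following is only a hypothetical sketch of one common approach: a sorted table of code point ranges that is searched with binary search. The module name, the Range type, the three-entry table, and lookupCategory are all invented for illustration and are not the contents of Data.Char.Unicode.Internal:

```purescript
module RangeLookupSketch where

import Prelude

import Data.Array (index, length)
import Data.Maybe (Maybe(..))

-- | Hypothetical table entry: an inclusive range of code points
-- | that all share one category name.
type Range = { from :: Int, to :: Int, category :: String }

-- | Tiny hand-written table, sorted by `from`. A generated table
-- | would have many more entries.
table :: Array Range
table =
  [ { from: 48, to: 57, category: "DecimalNumber" }
  , { from: 65, to: 90, category: "UppercaseLetter" }
  , { from: 97, to: 122, category: "LowercaseLetter" }
  ]

-- | Binary search for the range containing the given code point.
lookupCategory :: Int -> Maybe String
lookupCategory cp = go 0 (length table - 1)
  where
  go lo hi
    | lo > hi = Nothing
    | otherwise =
        let
          mid = (lo + hi) / 2
        in
          case index table mid of
            Nothing -> Nothing
            Just r
              | cp < r.from -> go lo (mid - 1)
              | cp > r.to -> go (mid + 1) hi
              | otherwise -> Just r.category
```

With a layout like this, the public module only has to convert a Char to its code point and call the lookup, and the generator script only has to emit the table.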
Future Improvements
There are multiple possible future improvements.
- Performance: Performance has been completely ignored in the Data.Char.Unicode.Internal module. There are TODOs at the bottom of the file with particularly bad examples of this. It would be nice to fix all these TODOs. It would also be nice to have proper benchmarks for this library.
- Internal File Generation: The shell script that generates the Data.Char.Unicode.Internal module is somewhat hacky. Internally it's using awk, so it's not portable to machines without awk. Ideally, this shell script could be rewritten as a purescript-node program. That way it could be runnable by anyone as long as they have psc and node installed.
tags: purescript