purescript-unicode

2016-10-06

A couple of months ago I created the purescript-unicode library. garyb and paf31 accepted it into the purescript-contrib GitHub organization.

The purescript-unicode library is a direct port of Haskell's Unicode functionality. purescript-unicode's Data.Char.Unicode module contains all of the Unicode-related functions and datatypes provided by Haskell's Data.Char module.

Originally, I wanted to use purescript-parsing to parse a programming language. I knew Haskell's parsec library had convenient functionality for parsing programming languages, but that functionality wasn't implemented in purescript-parsing, so I decided to port it from parsec. However, I ran into a roadblock: parsec uses many Unicode-related functions from Haskell's Data.Char module, and PureScript had no equivalent.

I took some time to think about how to proceed, and I came up with a plan to port all of the Unicode functionality from Haskell's Data.Char module to a separate PureScript package. This is what became purescript-unicode. I then sent a PR to purescript-parsing that adds the ported parsec functionality, which depends on purescript-unicode.
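To give a rough idea of why a language parser needs these functions, here is a minimal sketch of an identifier check built on two of the Data.Char.Unicode predicates. The isIdentifier helper and the module name are hypothetical, written only for this illustration; they are not part of parsec, purescript-parsing, or purescript-unicode. It assumes the Char-based isAlpha and isAlphaNum described in this post, and it takes an Array Char so that it does not depend on any particular string API.

```purescript
module IdentifierExample where

import Prelude

import Data.Array (uncons)
import Data.Char.Unicode (isAlpha, isAlphaNum)
import Data.Foldable (all)
import Data.Maybe (Maybe(..))

-- | Hypothetical helper (not a library function): true when the characters
-- | form an identifier, meaning one letter followed by zero or more
-- | letters or digits. An empty array is not an identifier.
isIdentifier :: Array Char -> Boolean
isIdentifier chars = case uncons chars of
  Nothing -> false
  Just { head: c, tail: cs } -> isAlpha c && all isAlphaNum cs
```

With this definition, isIdentifier ['x', '1'] should be true and isIdentifier ['1', 'x'] should be false. Because isAlpha is Unicode-aware, isIdentifier ['日', '本'] should be true as well, which an ASCII-only check would reject.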

purescript-unicode Usage Example

Here is a short example of using purescript-unicode:

>>> generalCategory 'a'
Just LowercaseLetter
>>> generalCategory '0'
Just DecimalNumber
>>> generalCategory '♥'
Just OtherSymbol
>>> generalCategory '本'
Just OtherLetter
>>> isControl '\04'
true
>>> isControl 'a'
false
>>> isPrint '\04'
false
>>> isPrint 'a'
true
>>> isSpace ' '
true
>>> isSpace 'a'
false
>>> isUpper 'Z'
true
>>> isUpper 'a'
false
>>> isUpper '日'
false
>>> isAlpha 'a'
true
>>> isAlpha '日'
true
>>> isAlpha ' '
false

How is purescript-unicode Implemented?

purescript-unicode works very similarly to Haskell's Data.Char module.

There is an internal module Data.Char.Unicode.Internal that is generated by a shell script. The Data.Char.Unicode module uses the internal module.
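I won't reproduce the generated module here, but the general idea of looking a character up in a generated table can be sketched as follows. This is an illustration under assumptions, not the actual contents of Data.Char.Unicode.Internal: the Range type, the sampleTable data, and the lookupCategory function are all made up for this example, and the category is a plain String to keep the sketch self-contained.

```purescript
module RangeTableSketch where

import Prelude

import Data.Array (length, (!!))
import Data.Char (toCharCode)
import Data.Maybe (Maybe(..))

-- | Hypothetical row: an inclusive code point range and a category label.
type Range = { start :: Int, end :: Int, category :: String }

-- | Made-up sample table, sorted by `start`. A generated table would
-- | cover all of Unicode; this one covers only three ASCII ranges.
sampleTable :: Array Range
sampleTable =
  [ { start: 0x30, end: 0x39, category: "DecimalNumber" }
  , { start: 0x41, end: 0x5A, category: "UppercaseLetter" }
  , { start: 0x61, end: 0x7A, category: "LowercaseLetter" }
  ]

-- | Binary search for the range containing the character's code point.
-- | Returns Nothing when no range in the table covers it.
lookupCategory :: Array Range -> Char -> Maybe String
lookupCategory table c = go 0 (length table - 1)
  where
  code = toCharCode c

  go lo hi
    | lo > hi = Nothing
    | otherwise =
        let mid = (lo + hi) / 2
        in case table !! mid of
          Nothing -> Nothing
          Just r
            | code < r.start -> go lo (mid - 1)
            | code > r.end -> go (mid + 1) hi
            | otherwise -> Just r.category
```

Here lookupCategory sampleTable 'a' evaluates to Just "LowercaseLetter", and lookupCategory sampleTable ' ' evaluates to Nothing because the sample table has no range covering the space character. Storing ranges rather than one entry per character keeps a table like this small, at the cost of a search on every lookup.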

Future Improvements

There are a couple of possible future improvements:

  • Performance: Performance has been completely ignored in the Data.Char.Unicode.Internal module. There are TODOs at the bottom of the file marking particularly bad examples of this, and it would be nice to fix all of them. It would also be nice to have proper benchmarks for this library.
  • Internal File Generation: The shell script that generates the Data.Char.Unicode.Internal module is somewhat hacky. Internally it uses awk, so it's not portable to machines without awk. Ideally, the script would be rewritten as a purescript-node program, so that anyone with psc and node installed could run it.

tags: purescript